Dense vector representations of text (and the deep learning models that output them) are changing how information retrieval (IR) is done. For years now dense vectors output by deep learning models have been leading IR benchmark tasks like MS-Marco and with the rise of Chat-GPT this year it looks like they will start to play a role in commercial web search too. Realizing value from these models in your search system is closer than it appears, with one interesting approach being hybrid search, combining vectors with traditional BM25 scoring.
This post outlines how we utilize a variant of BM25 for sparse vector scoring and combine it with cosine similarity between dense vectors to achieve a hybrid ranking algorithm with better search performance than either technique alone. All of the work was done in Elasticsearch and without fine tuning or re-training any models. The methods and results described here were empirically validated by participating in Spotify’s Podcast track at TREC 2021, where one of our hybrid rankers placed first for overall relevance based on precision.
Hybrid Search has Hybrid Vigor
Hybrid vigor is a biological concept that describes how an offspring may be more vigorous, with enhanced or improved characteristics, due to its diverse parents. Similarly, hybrid search is the combination of dense and sparse vectors to rank documents, and it demonstrates better relevancy performance than either dense or sparse alone.
Combining Scores
Ranking documents relies on scoring a document relative to a query and it is this scoring that is the biggest difference between dense and sparse vector representations of documents. Traditional sparse vector scoring algorithms like BM25 used in this work rely on the counting of tokens, which means the document scores don’t have a defined upper bound. Scoring dense vectors is different, instead measuring the distance between coordinates in semantic space, which is usually done with cosine similarity. Cosine similarity does have an upper bound of 1.0, and this variation in scale poses a challenge, that we proposed to alleviate with BM25-max-scaled (BM25-max).
BM25-Max-Scaled
The scores outputted by BM25 don’t have an intrinsic value themselves and will fluctuate across queries. A document score of 25.0 doesn’t always mean that document is a relevant match across different queries. Instead, the scores are only useful when compared within a query, a document scored 25.0 is twice as good as a document scored 12.5. Using the maximum value (aka the first result score) to scale all of the BM25 scores, provides a fast way to enforce an upper limit of 1.0, while also preserving the relative spacing in the newly scaled values. This differs slightly from the more widely used min-max normalization in machine-learning, as it doesn’t require finding the minimum, but it still achieves useful score scaling and aligns better with real world IR tasks where only the top-N is retrieved and there is no need to be exhaustive.
There are multiple normalization options available and we covered some at last year’s MICES EU talk on ranking factors (slides and recording available). Which technique works best for you depends on the scale you aim for, the input data and other factors. You could even normalize different ranking factors with different normalizers. At the end of the day, experimentation based on judgments will lead the way to knowing what works best in your scenario.
Combination versus Re-ranking windows
Re-ranking windows are a battle tested strategy from Learning to Rank (LTR) models to fine tune the top-N results with a more expensive operation like model inference. Dense vectors take more time to retrieve documents across an entire corpus compared to sparse vectors. Because of this, dense vectors have given rise to Approximate Nearest Neighbor (ANN) algorithms that utilize graphs, to greatly reduce retrieval time at the expense of some lost precision. This work doesn’t use an ANN approach, instead relying on brute force, to ensure the best documents are always retrieved for a query. But in practical applications an ANN approach would have to be used if dense vectors are utilized in first phase retrieval across the entire corpus. If dense vectors are only used in a re-ranking phase, then brute force calculations are palatable in terms of retrieval latency. We test both ordering permutations (sparse-rescored-by-dense and dense-rescored-by-sparse) to investigate if one is advantageous over the other. We used a window size of 1,000 for both orderings.
Summing BM25-max and Cosine Similarity
For hybrid rankings, the final ranking score for a document was computed as the sum of its first phase score plus its second phase score. Only sparse vector scores were scaled with BM25-max, the cosine similarity scores were left as is, to preserve the original scale of the deep learning model’s projection. Because cosine similarity has a minimum of -1, it is possible for documents to be pushed further down the results when dense vectors were used in the re-ranking phase, but this was not observed in practice. The combined hybrid scores have a theoretical maximum of 2.0 and minimum of -1.0, but most of the observed values were between 1.0 and 1.9.
This combination technique aims to preserve the relative spacing between results so that the score of a strong match would maintain a relatively large numeric difference from that of weaker match. This idea is aimed at improving on purely rank based combination strategies like Reciprocal Rank Fusion, that ignore relative scores in favor of rank order.
An experiment on hybrid search
Experimental Conditions
We devised four strategies (or “runs” in TREC parlance) to detangle our questions about the utility of BM25-max and transfer learning for out of domain tasks. The runs were:
-        sparse-only (tok)
-        dense-only (vec)
-        sparse-rescored-by-dense (tok_vec)
-        dense-rescored-by-sparse (vec_tok)
We used the MS-MARCO model from the S-BERT project to produce the dense vector representations of the documents.
Experimental results
Results from the training data evaluation and the conference test data evaluation are reported in Table 1.
| Strategy | nDCG@30 | P@30 | 
| tok | 0.28 | 0.32 | 
| vec | 0.28 | 0.34 | 
| tok_vec | 0.35 | 0.41 | 
| vec_tok | 0.32 | 0.38 | 
Experimental discussion
The combination rescore methods both outperformed the single representation methods in both training and test evaluation. This supports the conclusion of the original RRF paper, that any ensemble of rankers is better than any singular ranker it has as a component (Cormack, 2009)
The tok_vec strategy was the best performing run in both the training and testing data, consistently besting the other hybrid strategy, vec_tok. Following a strategy similar to tok_vec, where sparse vector scoring is done as the first phase and dense vector scoring in a re-score window is fortuitous as it is the easiest hybrid option to implement:
PUT trec
{
  "mappings": {
    "properties": {
      "title" : {"type": "text"},
      "episode": {"type": "text"},
      "text": {"type": "text"},
      "embed": {"type": "dense_vector", "dims": 3}
    }
  }
}
POST trec/_doc
{
  "title": "Great Steaks",
  "episode": "T-bones FTW",
  "text": "When it comes down to it, in my option, the best cut of steak in the cow is the T-Bone! You can't beat the juicy tenderness",
  "embed": [3.0, 2.1, 0.2]
}
POST trec/_doc
{
  "title": "Beef Eater Anonymos",
  "episode": "Why red meat is actually good for you",
  "text": "Sure a steak tasts great, but is it also great for you? I think it is and my guess Teddy Steaks here on the show today say YES!",
  "embed": [2.0, 2.9, 0.1]
}
GET trec/_search // first we get the maxScore
{
  "size": 1,
  "query": {
    "multi_match": {
      "fields": [
        "title",
        "episode",
        "text^3"
      ],
      "query": "tasty steaks"
    }
  }
}
GET trec/_search
{
  "query": {
    "script_score": {
      "query": {
        "multi_match": {
          "fields": [
            "title",
            "episode",
            "text^3"
          ],
          "query": "tasty steaks"
        }
      },
      "script": {
        "params": {
          "maxScore": 2.064089,
          "queryVector": [3.1, 2.5, 0.1]
        },
        "source": """
        double myScore = _score / params.maxScore;
        double sbertScore = cosineSimilarity(params.queryVector, 'embed');
        return myScore + sbertScore;
        """
      }
    }
  }
}
DELETE trecBecause dense vectors are not involved in the first phase of retrieval, implementations would not require using an ANN algorithm, which may worsen performance. Dense vector scoring described here are exact and represents an upper bound on dense vector performance before any approximations are incorporated for speed improvements.
Conclusions on hybrid search
Dense vectors are a valuable new technique for document representation, but they are not a stand-alone replacement for sparse vectors. Better ranking is achieved by combining the scores of sparse and dense representations in a hybrid implementation, compared to either in isolation. Using BM25-max is an efficient and elegant method for scaling raw BM25 scores to facilitate score combination in a way that preserves the relative strength of matches across queries. Building in BM25 normalization options into Lucene would further ease this combination scoring process and useful bounding properties to the way documents are scored.
If you need help building hybrid search or implementing vector search, talk to us.
Image from Nature Vectors by Vecteezy
 
	
          	
	