With the advent of vector search engines like Weaviate, Milvus, Vespa or Qdrant, neural search frameworks like Jina or Haystack, and of course the availability of vector search capabilities in widely used search engines like Solr, Elasticsearch or OpenSearch, the adoption of vectors in search applications is increasing. In a previous blog post we gave an introduction on how to apply vector search in an e-commerce scenario, and the Chorus framework, an open source reference implementation for e-commerce, now includes vector search. But how can we compare vector search to the more traditional approach?
In this blog post we want to show how to evaluate relevance gains (or losses) when using vector search, to have actual proof of how useful this technique is. We will demonstrate this by following a practical example: we are assuming that our e-commerce company is currently using a traditional BM25-based technology as their search engine. This will be our so-called baseline system. Our challenger system will be based on a pre-trained model.
This post does not intend to compare the systems or technologies used but only the results they produce. For context we will introduce both briefly:
The challenger system will be based on a pre-trained model (sentence-transformers/all-mpnet-base-v2). For a deeper look you can check out the above-mentioned blog post covering the hackday we ran at Jina AI in Berlin in late 2022, where we set up a system that uses this model – thanks again Jina AI for hosting and providing this learning opportunity. Both systems index the same data, which makes the results they produce comparable. No model fine-tuning is applied; we went with the default model for the sake of simplicity, as the focus of this blog post is the process of evaluation, not model selection.
As both Elasticsearch and Jina AI are solely the vehicles used to produce the results, this is more a comparison of techniques (BM25-based retrieval and vector-based retrieval using a particular pre-trained model) than a comparison of technologies. Note that the approach shown also applies to evaluating any two search systems holding identical data.
Steps to Evaluate Search Relevance
In short, evaluating potential relevance improvements of one system over another means judging the results of your current system and comparing them with the results of the challenger system.
We want this process to be scientifically sound, therefore we need to follow a few steps to make this a solid approach:
- Choose a search metric to have something numeric that we can compare.
- Create a baseline relevance case: this will hold the queries that we want to use to grade our systems.
- Rate documents for your baseline relevance case.
- Compute the search metric for the search systems for comparison and evaluation.
The next sections give further detail on all the process steps.
Choose a Search Metric
Being scientifically sound means we want to rely on a reliable scoring system, not on our gut feeling. For this blog post, we’ll go with nDCG (normalized discounted cumulative gain). It measures how well our search engine ranks the documents relative to the ideal ranking of these documents, producing a score between 0 and 1. As input it uses relevance judgments: each document is rated for a given query on a scale from 0 to 3, 0 being a completely irrelevant result, 3 being a perfect match.
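To make the metric concrete, here is a minimal sketch of how nDCG is computed from graded judgments, using the standard log2 discount. Note that trec_eval (used later in this post) derives the ideal ranking from all judged documents for a query, while this sketch only reorders the grades of the retrieved list:

```python
import math

def dcg(grades):
    # Discounted cumulative gain: each grade is discounted by the
    # log2 of its (1-based) rank plus one.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg_at_k(grades, k=10):
    # grades: relevance grades (0-3) of the returned documents, in ranked order.
    # nDCG normalizes the DCG of the actual ranking by the DCG of the
    # ideal (best possible) ranking of the same grades.
    ideal_dcg = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfectly ordered result list yields 1.0, a list of only irrelevant documents yields 0.0, and anything in between reflects how far the ranking deviates from the ideal.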
Create a Baseline Relevance Case
Next, we need to choose the set of queries that we will call the baseline relevance case. The queries of this baseline relevance case ideally represent typical queries and are taken from your search analytics data. Of course, it’s not feasible to use all the queries that ever happened as we will be manually judging the relevance of the results for each of them. Sampling them proportionally to their frequency is a technique to get to a representative set that reflects relative query importance. As soon as you have around 50 queries you’re good to go. The more queries you have the more this set represents the queries your system actually deals with.
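One possible way to draw such a frequency-weighted sample from a raw query log is sketched below; the function name, seed and log structure are illustrative, not part of any particular analytics tool:

```python
import random
from collections import Counter

def sample_baseline_queries(query_log, k=50, seed=42):
    # query_log: raw query strings from analytics, one entry per search.
    # Sampling distinct queries weighted by their frequency keeps head
    # queries likely in the sample while tail queries still have a chance.
    counts = Counter(query_log)
    queries = list(counts)
    weights = [counts[q] for q in queries]
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < min(k, len(queries)):
        chosen.add(rng.choices(queries, weights=weights, k=1)[0])
    return sorted(chosen)
```

The fixed seed makes the sample reproducible, which is useful when the baseline relevance case needs to be regenerated or audited later.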
We are using the set of queries from Chorus for this blog post, containing 135 individual queries.
Rate Documents for your Baseline Relevance Case
With the set of queries at hand we need to put on our judge wigs and rate some documents. As we have two search systems to compare in this case, we need to rate the first couple of documents each of them returns for every query defined in our baseline relevance case. We are using nDCG@10. The @-notation is common to all search metrics: scoring is done down to the rank named after the @. Following best practice we judge to depth 20, as a rule of thumb is to judge twice as deep as you score.
If you’re interested in how to create a baseline for your search application we have a blog post outlining the steps and an example based on Chorus (the Solr edition).
Based on the data provided with Chorus, we use an extended set of ratings that contains the necessary grades for computing the metric.
Compute the Search Metric
Knowing the math behind metrics like nDCG is important as this helps you understand the advantages and disadvantages of these metrics and how they behave. The good news is that we have tools readily available to support us in actually computing these metrics. One of these tools is trec_eval. It can compute several search metrics, including nDCG, based on two files:
- A file containing the results
- A file containing the ratings
Let’s look at the ratings file. Each line in the ratings file holds four pieces of information separated by tabs:
- The query identifier
- An unused column filled with 0s for each line
- The document id
- The rating of the document for the query
One line in our ratings file looks like this:
66 0 1229111 0
You can read this line as follows: “The document with the id 1229111 was rated ‘0’ for the query id 66”.
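Generating such lines from collected judgments is a one-liner; a small sketch (the judgment values are taken from the example above, the `judgments` list is illustrative):

```python
def qrels_line(query_id, doc_id, grade):
    # Qrels format: query id, unused column (0), document id, rating.
    return f"{query_id}\t0\t{doc_id}\t{grade}"

# Judged (query, document, grade) triples from the rating step.
judgments = [("66", "1229111", 0), ("66", "1229112", 3)]
qrels = "\n".join(qrels_line(q, d, g) for q, d, g in judgments) + "\n"
# The qrels string can now be written to ratings.qrels for trec_eval.
```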
The file containing the results is a tab-separated file with six columns:
- The query identifier
- An unused column filled with Q0 for each line
- The document id
- The rank of the document for the given query
- The relevancy score produced by the search engine
- The name or id of the run
One line in our results file for one of the search systems looks like this:
66 Q0 79557177 1 132.76495 es
You can read the line as follows: “The document with the id 79557177 ranked at position 1 for the query id 66 and was scored 132.76495 in the run ‘es’”.
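A results file in this format can be produced from whatever hit list your search engine returns; a minimal sketch, assuming the hits arrive as (document id, score) pairs in ranked order:

```python
def run_line(query_id, doc_id, rank, score, run_id):
    # Run format: query id, literal Q0, document id, rank, score, run name.
    return f"{query_id}\tQ0\t{doc_id}\t{rank}\t{score}\t{run_id}"

# Hits as returned by the search engine for query 66, best first.
hits = [("79557177", 132.76495), ("79557178", 101.3)]
lines = [run_line("66", doc, rank + 1, score, "es")
         for rank, (doc, score) in enumerate(hits)]
```

Ranks are 1-based in the TREC run format, hence the `rank + 1`.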
By submitting these two files to trec_eval, it can compute our search metric nDCG@10. Assuming that we have the results in a file called es_results and the ratings in a file called ratings.qrels, the command looks like this:
trec_eval -m ndcg_cut.10 ratings.qrels es_results
We need to run the trec_eval command twice: once for each search system. And we get two nDCG scores back in return.
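To compare the two runs programmatically rather than by eyeballing the console, the aggregate score can be pulled out of trec_eval’s output; a small sketch assuming the usual three-column, whitespace-separated output format (metric name, query id or 'all' for the aggregate, value):

```python
def parse_trec_eval(output, metric="ndcg_cut_10"):
    # trec_eval prints one line per metric:
    #   <metric> <query id or 'all'> <value>
    # 'all' carries the aggregate over all queries.
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == metric and parts[1] == "all":
            return float(parts[2])
    return None
```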
And now to the interesting question: which search system did better? Both systems’ nDCG scores were actually pretty close: 0.6298 for the BM25-based system and 0.6123 for the vector-based system.
What does this actually mean?
BM25-based retrieval, the technique currently used in our fictitious example, appears to outperform vector-based retrieval with a slightly higher nDCG@10 of 0.6298 compared to 0.6123. However, a Student’s t-test results in a t-score of ~0.4, which indicates that the compared groups (the results of the two runs) are similar. The reported difference in nDCG@10 is thus not statistically significant, i.e. both systems perform around the same in our test and neither can realistically be considered ‘better’ than the other.
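Given per-query nDCG scores for both runs (trec_eval prints them when invoked with the -q flag), the t-score of such a paired test can be computed with the standard library alone; a minimal sketch, with the example score lists being illustrative:

```python
import math

def paired_t_score(scores_a, scores_b):
    # Paired Student's t-test over per-query metric values:
    # t = mean(differences) / standard error of the differences.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0.0:
        return 0.0  # identical runs: no measurable difference
    return mean / math.sqrt(var / n)
```

A t-score close to 0 (relative to the critical value for the given number of queries) means the per-query differences are indistinguishable from noise, which is exactly the situation in our comparison.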
Remember that both systems were totally untuned, using defaults, and that we used Elasticsearch just as an easy way to run a BM25 system and Jina AI as an easy way to run the pre-trained model. Both platforms are capable of much better performance in practice, and this was in no way a bake-off between them.
What can we make of comparisons like these? We can see this as an interesting foundation for deeper research on where these systems have specific strengths and weaknesses. We could drill down on query level, or do further explorations on query category levels (e.g. head queries vs. tail queries).
The latest developments in information retrieval show very promising results in combining sparse and dense vector scoring – a hybrid strategy. We also used a hybrid strategy in the past when participating in the Podcasts Track at TREC 2021 (run on Spotify’s podcast data) – that way we could leverage the strengths of each scoring strategy.
Combining these techniques for this example may be the next entry in this blog series: it would reveal whether the combination of strategies outperforms both individual ones.
If you want to reproduce these results for yourself you can check out the accompanying repository and let us know what you think.
Image from Compare Vectors by Vecteezy