
Vector Search: Navigating Recall and Performance

Vector similarity search (or “dense retrieval” when text embeddings are involved) has gained widespread popularity over the last few years. It has become a fundamental component in various applications, ranging from search engines to recommendation systems. As vector search is computationally intensive, it is often performed using approximate algorithms which avoid the brute-force comparison of a query vector to all corpus vectors. While approximate nearest neighbors (ANN) search dramatically improves query latency, it comes at the cost of some loss in accuracy or “recall”, where recall@N refers to the fraction of the true top N nearest neighbors that appear among the top N approximate results.

Most ANN algorithms expose parameters that can be tuned to find an acceptable balance between query performance and accuracy. Surprisingly, we see clients who are not aware of the magnitude of “recall loss” their current ANN retrieval system incurs, nor have they systematically measured where on the “latency vs. recall” spectrum their current query system resides. Public benchmarks, such as those found on ANN Benchmarks, provide some valuable intuition, but they also demonstrate that the numbers vary widely and depend not only on the ANN algorithm but also on the dataset and the embedding model (or the resulting intrinsic dimensionality of the vectors).

In this post, we provide an example of how to measure the latency-vs-recall trade-off in semantic text retrieval, which you can quickly adapt to your own dataset and retrieval system. It reflects the approach we’ve taken when evaluating similar setups for our clients in the past.

Setup

  • ANN Retrieval Backend: We use OpenSearch’s Lucene HNSW (Hierarchical Navigable Small World) backend. The evaluation would be similar for any other ANN retrieval system (“vector database”).
  • Dataset: The corpus is a dataset of 6.7m English Wikipedia articles from 2023. We want real (i.e. not synthetically generated) data and a dataset which is large enough to generate meaningful results but still manageable enough to run reasonably quickly on a single machine. As queries we use the 633 questions from WikiQA, and we use the good old all-MiniLM-L6-v2 model to create 384-dimensional text embeddings.
  • Evaluation: HNSW has three main tunable parameters: M (the number of connections between nodes), efConstruction (the “beam width” during graph construction) and efSearch (the beam width during search). We only vary the query-specific efSearch parameter (or “k” as it is named in OpenSearch) and measure the accuracy (or “recall” as it is called in ANN benchmarks) and query latency. The recall at a position N is calculated as the percentage of the true top N nearest neighbors returned within the top N approximate results.
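
To make that metric concrete, here is a minimal sketch of how recall@N can be computed from the exact and approximate result lists (the function and the list-of-document-IDs format are illustrative, not the code from our repository):

```python
def recall_at_n(exact_ids: list[str], approx_ids: list[str], n: int) -> float:
    """Fraction of the true top-N nearest neighbors that appear in the top-N approximate results."""
    true_top = set(exact_ids[:n])
    approx_top = set(approx_ids[:n])
    return len(true_top & approx_top) / n

# Example: the approximate result misses one of the true top-3 neighbors.
exact = ["doc1", "doc2", "doc3", "doc4"]
approx = ["doc1", "doc3", "doc5", "doc2"]
print(recall_at_n(exact, approx, 3))  # 0.666...
```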

Peculiarities of Lucene HNSW

When evaluating performance trade-offs, it helps to gain some understanding of the properties of the underlying algorithms. Lucene implements the widely adopted graph-based HNSW algorithm. In the case of HNSW, an “approximate” result does not refer to some inexactness in the similarity score but to a result where some vectors are completely missing from the result as they were not encountered during graph traversal.

During HNSW graph construction, a given vector (or rather a reference to it) is inserted into a multi-level (hierarchical) structure. The location and the connections of that vector in the graph will depend on different factors:

  • Its distance (as measured by some distance function) to other vectors already in the graph
  • The configured number of connections between nodes in the graph (the M parameter)
  • The “beam width” during graph construction, i.e. the traversal limit for searches for similar vectors (the efConstruction parameter)
  • The insertion level(s), as determined by a random distribution; in Lucene since 9.1 (See more info here) this is initialized with a fixed seed to guarantee a deterministic graph for the same vector insertion order and given values of M and efConstruction.
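
For intuition, the level assignment follows an exponentially decaying distribution as described in the HNSW paper; the sketch below is a simplified illustration of that distribution, not Lucene’s actual code:

```python
import math
import random

def insertion_level(m: int, rng: random.Random) -> int:
    """Draw an insertion level: most vectors land on level 0,
    roughly a fraction 1/M reaches level 1, 1/M^2 level 2, and so on."""
    m_l = 1.0 / math.log(m)  # normalization factor from the HNSW paper
    return int(-math.log(1.0 - rng.random()) * m_l)

# A fixed seed (as Lucene uses since 9.1) makes the level sequence, and hence
# the graph, reproducible for the same insertion order and parameters.
rng = random.Random(42)
levels = [insertion_level(16, rng) for _ in range(100_000)]
print(max(levels), sum(l == 0 for l in levels) / len(levels))  # max level, share on level 0 (~1 - 1/M)
```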

During search, the graph is traversed, the distance of each candidate node to the query is calculated, and the nearest vectors are returned. The efSearch parameter enforces an upper bound on the number of nodes to visit during this traversal, and any nodes not encountered during traversal will not be considered for the result. Thus, it is possible for the graph traversal to miss relevant similar vectors for a given query vector, either by running into a local maximum or (at least for some less optimized versions of the algorithm, see more info here) because the graph has become disconnected and relevant nodes are unreachable.
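
To make the role of the beam width more tangible, here is a simplified, single-layer sketch of the ef-bounded best-first traversal, loosely following the SEARCH-LAYER procedure from the HNSW paper (Lucene’s actual implementation differs in many details):

```python
import heapq

def search_layer(graph: dict, distance, query, entry, ef: int):
    """Greedy best-first search over one graph layer, keeping at most `ef` result candidates.

    graph:    node id -> list of neighbor node ids
    distance: function(node_id, query) -> distance (smaller is closer)
    entry:    entry point node id
    ef:       beam width; traversal stops once no candidate can improve the result set
    """
    d0 = distance(entry, query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap of nodes still to expand
    results = [(-d0, entry)]     # max-heap (negated distances) of the best `ef` nodes found

    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:   # closest remaining candidate is worse than our worst result
            break
        for neighbor in graph.get(node, []):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            d_n = distance(neighbor, query)
            if len(results) < ef or d_n < -results[0][0]:
                heapq.heappush(candidates, (d_n, neighbor))
                heapq.heappush(results, (-d_n, neighbor))
                if len(results) > ef:
                    heapq.heappop(results)  # evict the current worst result
    # any node never reached here is "missing" from the approximate result
    return sorted((-neg_d, n) for neg_d, n in results)
```

A larger ef keeps and expands more candidates before the stopping condition kicks in, which is exactly why recall improves at the cost of more distance computations.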

The likelihood of missing out on relevant nodes depends on the properties of the dataset (distribution of relative distances between vectors and their dimensionality), the graph construction parameters (see above), the node insertion order (See more info here) and the efSearch parameter value. In our small experiment, for simplicity’s sake we keep all factors constant except the efSearch parameter. 

Evaluation Steps

  1. Embedding the articles: To keep the size of data manageable, we only embed the first 1000 characters of each article without any chunking or post-processing. Embedding the 6.7m article snippets with a batch size of 32 takes about 11h on a three-year-old MacBook M1 Max (~180/s). We store the normalized vectors on disk for later use.
  2. Indexing to OpenSearch: We index the 6.7m embeddings along with all article titles and text (for possible later experiments) to a local OpenSearch 2.17.1. We use the Lucene KNN backend with the default settings ef_construction=100 and m=16 (a sketch of the mapping and query calls follows this list). Indexing takes about 2h on the same M1 Max MacBook (~900/s).
  3. Retrieval: We embed the 633 questions and find the exact nearest neighbors from the 6.7m corpus embeddings (another ~90s on the same machine) and then query OpenSearch for the approximate nearest neighbors for the 633 questions at different values of “k” (representing the efSearch parameter of the HNSW algorithm). We calculate recall and log the server-side search latency from the “took” value of the OpenSearch response.
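
For orientation, here is a hedged sketch of what the indexing and query calls can look like with opensearch-py and sentence-transformers. The index and field names, the example question, and the cosinesimil space type are assumptions made for illustration; the exact setup lives in the repository linked below:

```python
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Lucene HNSW field with the default construction parameters used in this experiment.
client.indices.create(
    index="wiki-embeddings",  # hypothetical index name
    body={
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "space_type": "cosinesimil",  # assumption; l2 works equally for normalized vectors
                        "parameters": {"ef_construction": 100, "m": 16},
                    },
                },
            }
        },
    },
)

# Query with a given k (efSearch) and read the server-side latency from "took".
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_vector = model.encode("who wrote the declaration of independence",  # example question
                            normalize_embeddings=True)

response = client.search(
    index="wiki-embeddings",
    body={
        "size": 10,
        "query": {"knn": {"embedding": {"vector": query_vector.tolist(), "k": 240}}},
    },
)
approx_ids = [hit["_id"] for hit in response["hits"]["hits"]]
server_latency_ms = response["took"]
```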

To run this experiment yourself, check out our GitHub repository and follow the steps in the README.md.

Results

k     mean latency (ms)   recall@1   recall@3   recall@5   recall@10
10    5.31                0.56       0.52       0.52       0.51
20    7.69                0.64       0.62       0.62       0.61
40    11.50               0.73       0.73       0.72       0.72
80    17.27               0.82       0.81       0.81       0.80
160   26.80               0.88       0.88       0.88       0.87
240   34.96               0.90       0.91       0.91       0.90
320   43.59               0.92       0.92       0.92       0.91
480   56.67               0.95       0.94       0.94       0.93
640   68.28               0.97       0.95       0.96       0.95

Or in a plot:

Vector Search Performance: Graph showing Recall vs. Query Time for Values of k.
  • As expected, a higher k correlates with a higher recall but also higher latency.
  • Results show that low values of k produce unacceptably low recall (e.g. for k=40 recall is around 0.73, meaning 27% of the true top nearest neighbors are missing!).
  • Even with k=240 (~35ms, recall@1 ~ 0.9) for about 10% of the queries, the true top nearest neighbor is missing from the result.
  • This number decreases to 3.5% for k=640, but at roughly double the latency (from 36.8ms to 71.5ms).
  • Whether these numbers are acceptable or need mitigation measures depends on your domain and use case.

Looking more closely at where documents are missing from the approximate results, the effect appears to be very query-specific. Some queries, which seem to have an “unfortunate” embedding that is not well represented by the HNSW graph, have multiple documents missing from their top results even at a high value of k.

Vector Search Performance: Graph showing missing documents.

Mitigations

There are ways to mitigate that loss of recall. For the specific setup in this experiment (OpenSearch with Lucene HNSW), here are some options:

  • Increase efSearch/k. Try even higher values of k if your overall system allows for a higher latency.
  • Sharding: Index your vectors into multiple shards. Each shard has its own (smaller) HNSW graph and k will be considered per shard, potentially increasing the recall.
  • Adjust the graph construction parameters efConstruction and M. A higher efConstruction (“beam width” during vector insertion) can increase the graph’s quality at the cost of slower index time. A higher M value will increase the number of connections between graph nodes, decreasing the likelihood of missing nodes during traversal at the cost of slower indexing time and an increase in required memory.
  • Use a more randomized insertion order (See more info here).
  • Combine multiple retrievers (e.g. hybrid keyword/vector retrieval and/or vector search on multiple fields), which increases the likelihood of the missing documents turning up in another partial result (just remember to re-score their vector similarity and not assume it is 0).
  • Use exact NN search. For smaller datasets and/or use cases allowing for higher latencies it might be entirely feasible to use exact search (i.e. comparing the query embedding with every pre-computed corpus embedding).
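
As a point of reference for that last option, exact search over normalized embeddings reduces to a dot product plus a top-N selection. A minimal NumPy sketch, assuming the corpus embeddings fit in memory:

```python
import numpy as np

def exact_top_n(corpus: np.ndarray, query: np.ndarray, n: int) -> np.ndarray:
    """Brute-force nearest neighbors: corpus and query are L2-normalized,
    so the dot product equals cosine similarity."""
    scores = corpus @ query                 # (num_docs,) similarity scores
    top = np.argpartition(-scores, n)[:n]   # unordered top-n candidates
    return top[np.argsort(-scores[top])]    # sorted by descending similarity

# For reference: 6.7m x 384 float32 vectors are roughly 10 GB in memory,
# and a single query is one large matrix-vector product.
```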

Conclusions

It is important to understand the implications of using ANN for vector search and the possible trade-offs. We demonstrated how to measure the balance between result latency and accuracy for a given dataset and retrieval system by varying a single query parameter (k or efSearch). In our OpenSearch + Wikipedia example, even with relatively high k values there was a considerable loss in accuracy that one has to account for. In the case of HNSW, that’s a loss in recall, i.e. a vector close to the query disappearing from the result for a certain percentage of queries. We listed some possible mitigations and encourage you to measure where your current retrieval system is on the accuracy vs. latency spectrum.


Need help measuring and solving your recall challenges with vector search?

Partner with Open Source Connections to transform your search capabilities and empower your team to continuously evolve them. Our proven track record spans the globe, with clients consistently achieving dramatic improvements in search quality, team capability, and business performance. Contact us today to learn more.