Building Vector Search in Chorus: A Technical Deep Dive

March 22, 2023 Atita Arora
Category: Ecommerce

In my last blog post we established how vectors function internally and how they could be a game-changer in the world of e-commerce. We hope that post set the tone for a more focussed discussion of the technical details of our implementation of vector search in Chorus, our open source reference implementation of e-commerce search.

So how does vector search in Chorus really work?

Chorus is a Dockerized reference platform for an e-commerce application. It includes:

a search engine based on Apache Solr or Elasticsearch (OpenSearch support will come soon)
Querqy+SMUI – to take care of query rewriting for search management
Quepid for query rating using human judgements, to help with relevance tuning

Going back to the 4 steps required to implement vector search:

1. Gathering your product dataset

We already have an Icecat dataset of products in the Chorus repository. It consists of data describing roughly 19,000 products, formatted as JSON.

2. Set up the process to transform your product information into embedding vectors

We added 2 new fields into our Solr 9.1 schema to store text vector and image vector data:

<field name="product_vector" type="knn_vector_384" indexed="true" stored="true"/>
<field name="product_image_vector" type="knn_vector_768" indexed="true" stored="true"/>

<fieldType name="knn_vector_768" class="solr.DenseVectorField" vectorDimension="768" similarityFunction="cosine" knnAlgorithm="hnsw"/>
<fieldType name="knn_vector_384" class="solr.DenseVectorField" vectorDimension="384" similarityFunction="cosine" knnAlgorithm="hnsw"/>

We used two different transformer models to generate the embeddings: the CLIP model from OpenAI produces a 768-dimensional encoding, while MiniLM produces a 384-dimensional encoding. Because a dense vector field type has a fixed dimension, each model needs its own field type.

The embedding values are generated by an offline script that transforms the product title and brand into vectors for the text vector field, and the product image into vectors for the image vector field. The script create_dataset.py, available in the Chorus repository, can be modified and used for this purpose.
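To make the offline step more concrete, here is a minimal sketch of how products could be turned into text vectors in batches. It is not the code in create_dataset.py: the JSON keys `title` and `brand`, the `encode` callable and the helper names are all assumptions for illustration. In practice `encode` would wrap whichever embedding model you use.

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]
# Hypothetical encoder signature: a list of strings in, one vector per string out.
Encoder = Callable[[Sequence[str]], Sequence[Vector]]


def text_for_embedding(product: dict) -> str:
    """Join title and brand (assumed key names) into the string to embed."""
    parts = [product.get("title", ""), product.get("brand", "")]
    return " ".join(p.strip() for p in parts if p and p.strip())


def batched(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def add_text_vectors(
    products: Sequence[dict], encode: Encoder, batch_size: int = 32
) -> list[dict]:
    """Return copies of the products with a `product_vector` field added."""
    enriched = []
    for batch in batched(products, batch_size):
        vectors = encode([text_for_embedding(p) for p in batch])
        for product, vector in zip(batch, vectors):
            enriched.append({**product, "product_vector": list(vector)})
    return enriched
```

Passing the encoder in as a parameter keeps the batching logic independent of the model, so the same loop could serve both the text model and the image model.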

3. Make changes to your search model algorithm to leverage vectors

To modify our search model algorithm we needed an encoding service that transforms the query text into an encoding in real time. This is a new component added to Chorus when it is used in vector search mode. The encoding service is currently implemented using FastAPI, and the transformation is supported through a Solr paramset and a Querqy rewriter.

There are two query models supported by Querqy in this implementation:

a. Vector query as main query (Pure KNN search)

b. Vector query as a boost query – Hybrid Search (KNN + keyword-based search)
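The two query models can be sketched as plain Solr request parameters. In Chorus the vector clause is produced by the Querqy rewriter and a Solr paramset, so the helpers below only illustrate the shape of the resulting queries. The `{!knn f=... topK=...}` syntax is Solr's knn query parser; the use of `bq` with edismax for the hybrid case, the `qf` field names and the function names are assumptions.

```python
import json
from typing import Sequence


def knn_clause(field: str, vector: Sequence[float], top_k: int) -> str:
    """Format a Solr knn query-parser clause for a dense vector field."""
    return f"{{!knn f={field} topK={top_k}}}{json.dumps(list(vector))}"


def pure_knn_params(vector: Sequence[float], top_k: int = 10) -> dict:
    """(a) The vector query is the main query."""
    return {"q": knn_clause("product_vector", vector, top_k), "rows": top_k}


def hybrid_params(
    user_query: str, vector: Sequence[float], top_k: int = 50, rows: int = 10
) -> dict:
    """(b) Keyword query as the main query, vector query as a boost."""
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": "title brand",  # hypothetical keyword fields
        "bq": knn_clause("product_vector", vector, top_k),
        "rows": rows,
    }


if __name__ == "__main__":
    print(pure_knn_params([0.1, 0.2, 0.3]))
    print(hybrid_params("gaming laptop", [0.1, 0.2, 0.3]))
```

In the hybrid form the keyword query still decides which documents match, while the vector clause only adds score to documents that are also among the nearest neighbours.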

4. Testing and optimisation

For the final step we deployed everything, and we continue to iterate on testing and improving the search results.

Architecture for vector search in Chorus

What are some of the caveats of leveraging vectors in e-commerce search?

As potentially lucrative and shiny as it may seem, vector search does come with some caveats which should be carefully considered.

Model suitability for the business case

When it comes to addressing business problems there isn’t one model that solves it all. Off-the-shelf models have been highly successful in a wide range of tasks, but they also have limitations that require fine-tuning to optimize performance. These models are trained with data that may be biased. They may have limited knowledge of the world, which could affect the quality of the results. Depending on the nature of the data or the problem we’re trying to fix, we may need to fine-tune these models to fit the use case better.

Index cost

We started by adding one or more vector fields to the existing search schema to store encodings of text or images. These encodings are usually high-dimensional (256, 384, 768 or 1536 dimensions), and the cost of storing and using the index depends on how the vector field is used and the number of dimensions selected.

Dimensions are not free, so it’s important to carefully evaluate the costs and benefits of adding a vector field beforehand.
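A rough back-of-the-envelope calculation shows how dimensions translate into storage. The sketch below assumes each dimension is stored as a 32-bit float (4 bytes) and counts only the raw vector values; the HNSW graph and any stored copy of the field add to this, so treat the result as a lower bound rather than a measurement.

```python
BYTES_PER_FLOAT32 = 4  # assumption: one 32-bit float per dimension


def raw_vector_bytes(num_docs: int, dimensions: int) -> int:
    """Bytes needed for the vector values alone, one vector per document."""
    return num_docs * dimensions * BYTES_PER_FLOAT32


if __name__ == "__main__":
    # Roughly the size of the Chorus product dataset.
    for dims in (384, 768):
        size = raw_vector_bytes(19_000, dims)
        print(f"{dims} dimensions: {size / 1024**2:.1f} MiB")
    # 384 dimensions: 27.8 MiB
    # 768 dimensions: 55.7 MiB
```

The cost grows linearly with both the number of documents and the number of dimensions, so doubling the dimensions doubles the raw storage for every vector field.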

Data quality

Like any other machine learning task, vector search relies on good quality data to capture the relevant semantic meaning of the product data. The business may need to invest in improving data quality to yield maximum benefits.

Infrastructural changes and compatibility

Vector search may not be compatible with existing e-commerce systems and processes. E-commerce companies may need to invest in additional software or hardware to integrate vector search into their existing systems.

For example, GPUs are known to significantly speed up vector operations by handling huge amounts of data (required to support large numbers of dimensions), accelerating computations and enabling parallel processing. They also bring requirements for power upgrades, compatibility issues with other hardware and operating systems, and a large upfront expenditure.

Finding a fast and scalable vector search service

As the magic factor in vector search is, of course, the vectors themselves, investing in and supporting a highly available and scalable encoding/vector service may become a core need.

Overall, while vector search has the potential to improve search experience and positively impact revenue for e-commerce companies, the cost, complexity, data quality, and compatibility challenges may make some hesitant to adopt it.

Challenges of implementing vector search in Chorus

“If there is no challenge, there is no learning.”
Anon

There were certainly some challenges during the implementation, and they helped us understand the subject matter. Let us take a quick look at those worth mentioning.

CLIP limited to 77 tokens

E-commerce use cases are special because product information is usually structured in distinct fields, although field values are often repeated across multiple fields. For example, a product title could be a combination of product name and brand, or even additional fields like size. Images also play a very important role in providing the first impression of the product, which is one reason why leveraging a multimodal model seems a natural and obvious choice for e-commerce.

The initial implementation was based entirely on the CLIP model, which processes at most 77 tokens. That wasn’t enough to support even one field’s data, let alone a combination of multiple field values in the text encoding field. This is when we decided to use two different models: CLIP for images only, and MiniLM for text vectors, which offered better possibilities by supporting up to 256 tokens.

This change had to be applied to the data vectorisation process and also the query vectorisation process, which certainly gave us some ideas for some future enhancements.

Data processing challenges

Although we used a relatively small dataset (19k products) for this purpose, we certainly experienced our fair share of setbacks during data processing. Some had fairly simple solutions, like processing documents in smaller batches, which cut the processing time by a factor of 10.

We also ran into data quality issues, like special characters in the data or unresolved image URIs, which had to be addressed with a safety net implementation.
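A safety net of this kind could look roughly like the sketch below. It is an illustration rather than the Chorus code: `clean_text`, `safe_image_vector` and the `fetch_and_encode` callable are hypothetical names, and the choice to skip a broken image instead of failing the whole run is an assumption.

```python
import unicodedata
from typing import Callable, Optional, Sequence


def clean_text(value: Optional[str]) -> str:
    """Normalise Unicode, replace control characters and collapse whitespace."""
    if not value:
        return ""
    normalised = unicodedata.normalize("NFKC", value)
    # Unicode categories starting with "C" are control/format/unassigned characters.
    printable = "".join(
        ch if unicodedata.category(ch)[0] != "C" else " " for ch in normalised
    )
    return " ".join(printable.split())


def safe_image_vector(
    uri: Optional[str], fetch_and_encode: Callable[[str], Sequence[float]]
) -> Optional[list[float]]:
    """Return the image vector, or None when the URI is missing or unusable."""
    if not uri:
        return None
    try:
        return list(fetch_and_encode(uri))
    except Exception as exc:  # broad on purpose: one bad image must not stop the run
        print(f"skipping image {uri!r}: {exc}")
        return None
```

Products whose image cannot be resolved would then simply be indexed without an image vector, which keeps the rest of the batch moving.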

Magic Environment Variables

When we added the component for processing the text into an encoding (vector) – called the encoding service – the challenge here was to have a reliable and fast service. We leveraged FastAPI to implement this service but it was constantly slow until we bumped into this thread and discovered a magic environment variable called

OMP_NUM_THREADS=1

which took it to the next level in terms of response time. This was the simplest of the fixes, but finding it in the first place took significant effort.
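The variable can be set wherever the service's environment is defined. As one possible option, the sketch below sets it from Python at the very top of the service's entry point; the helper name is hypothetical, and doing it in code rather than in the container configuration is an assumption.

```python
import os


def pin_openmp_threads(count: int = 1) -> None:
    """Set OMP_NUM_THREADS unless the deployment already defines it."""
    os.environ.setdefault("OMP_NUM_THREADS", str(count))


# Call this before importing any numerical or model library: such libraries
# typically read the variable once, when they are first loaded.
pin_openmp_threads(1)
```

Using `setdefault` means a value supplied by the deployment environment still takes precedence over the one in code.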

Scope for future improvements to vector search in Chorus

Having a prototype of a working application got us thinking about what else we can improve and implement. Some ideas to look forward to:

Supporting custom trained models

E-commerce companies may need custom trained models to support needs like business-specific personalized shopping experiences, dynamic pricing, custom ranking or even fraud detection in some cases. For this purpose we feel support for a custom model is essential and we’re working on it already.

Image vector search results quality

We have used a CLIP model to support this, yet the quality of search results varies from pretty good for some queries to pretty bad for others, so we figured we could experiment with other models to improve it. We already have some barebones experiments with OpenAI models to get started on this.

Add fine-tune capability for Hybrid Search (BM25 + Vectors)

We can imagine some companies might hesitate to go all-in on vector search, so hybrid search seems to be the safest bet to make this transition smoother. Our implementation already uses hybrid search, where vector search is used as a boost function alongside keyword-based search. We want to add examples of other hybrid options and offer capabilities to fine-tune this combination.
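One way such a tunable combination could look is a weighted blend of normalised keyword and vector scores. This is only a sketch of the idea, computed outside the search engine. It is not how Chorus currently combines the two (there the vector query acts as a boost), and the function names and the `alpha` parameter are assumptions.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range 0..1 so the two signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def blend(
    bm25: dict[str, float], vector: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Blend two score maps; alpha=1 is keyword only, alpha=0 is vector only."""
    k, v = min_max(bm25), min_max(vector)
    combined = {
        doc: alpha * k.get(doc, 0.0) + (1 - alpha) * v.get(doc, 0.0)
        for doc in set(k) | set(v)
    }
    # Highest score first; ties broken by document id for stable output.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

A single weight like `alpha` is easy to tune against rated queries, which is the kind of fine-tuning capability described above.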

Combining Vectors with manually manipulated search results

E-commerce search has to embrace a lot of manual manipulation of the search results for business reasons like promoted searches, legal obligations and brand conflict, which certainly adds to the complexity. We therefore plan to offer an implementation that gives room for these business nuances. Of course in Chorus these search management features are already available, as you can see from the Meet Pete blog series.

Boosting re-ranking model with Vectors

This article encompasses numerous e-commerce-specific aspects, and therefore, we could not overlook the significance of ranking models. In fact, we are actively considering the inclusion of vector search in our future implementations to further support re-ranking models.

Discussions & Questions

We’ve already had some questions about our implementation:

KNN always gives you back K results. How do you know which results are still worth showing? Some of them can be very messy.

The KNN (k-nearest neighbors) algorithm returns the K closest neighbors to the query in the feature space. However, not all of the K-nearest neighbors may be relevant or useful for a given task. As part of this discussion we came up with a couple of approaches, like using vectors only in low-recall situations or as a hybrid with the keyword-based approach. Here are some further ways to determine if the results are still worth showing:

Use a distance threshold: You can set a distance threshold, beyond which, the neighbors are considered too far from the query point and are excluded from the results. This threshold can be determined based on the application requirements and the data distribution. We can also leverage Quepid to establish this threshold.
Rank by similarity score: Instead of ranking the neighbors based on their distance, you can rank them based on a similarity score. The similarity score can be computed using metrics such as cosine similarity or Jaccard similarity, or derived from Euclidean distance. This way, the neighbors that are more similar to the query point are ranked higher.
Weight by distance: You can also weight the neighbors by their distance to the query point. For instance, you can assign a weight that is inversely proportional to the distance of the neighbor from the query point. This way, closer neighbors are given more weight, and they contribute more to the final result.
Use ensemble methods: Another approach is to use ensemble methods, where the results of multiple models or algorithms are combined to improve the accuracy of the final result. For instance, you can combine the results of KNN with other algorithms, such as Random Forest or Gradient Boosting, to improve the search results.
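The first two ideas above can be sketched in a few lines. The example computes cosine similarity directly and drops neighbours below a cut-off; the function names and threshold values are assumptions, and a suitable threshold would have to be found empirically, for example with rated queries.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_neighbours(
    query: Sequence[float],
    neighbours: dict[str, Sequence[float]],
    min_similarity: float,
) -> list[tuple[str, float]]:
    """Keep neighbours at or above the threshold, most similar first."""
    scored = [(doc, cosine_similarity(query, vec)) for doc, vec in neighbours.items()]
    kept = [(doc, score) for doc, score in scored if score >= min_similarity]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    candidates = {"close": [1.0, 0.1], "mid": [1.0, 1.0], "far": [0.0, 1.0]}
    print(filter_neighbours([1.0, 0.0], candidates, min_similarity=0.7))
```

With a threshold in place, a query can return fewer than K results, which is usually preferable to padding the list with unrelated products.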

Overall, the approach used to filter and rank the results of KNN depends on the application requirements and the characteristics of the data. By using appropriate filtering and ranking methods, you can ensure that the KNN results are still worth showing and are useful for the given task.

How does the vector retrieval combine with filters? Are filters pre- or postfilters?

Vector retrieval can be combined with filtering in several ways. One approach is to perform the filtering before applying the vector retrieval technique. This approach reduces the number of documents to be considered for retrieval, which can speed up the retrieval process. Another approach is to incorporate the filtering criteria into the retrieval model itself. For example, the BM25 model includes a term-weighting component that down-weights terms appearing frequently across the entire collection, and an appropriate query relaxation technique can also effectively filter out non-discriminative terms.

Overall, the combination of vector retrieval and filtering can lead to more efficient and effective information retrieval systems, as it allows for the retrieval of relevant documents from a smaller subset of the collection.
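The practical difference between pre- and post-filtering is easiest to see on a toy example. The sketch below uses exact brute-force search as a stand-in for an approximate index, and an `in_stock` flag as a hypothetical filter; it illustrates the two strategies in general and makes no claim about which one a particular engine version applies.

```python
from typing import Callable, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: Sequence[float], docs: list[dict], k: int) -> list[dict]:
    """Exact nearest neighbours by dot product, best first."""
    return sorted(docs, key=lambda d: dot(query, d["vector"]), reverse=True)[:k]


def post_filter(query, docs: list[dict], k: int, keep: Callable[[dict], bool]) -> list[dict]:
    """Retrieve K neighbours first, then drop those failing the filter."""
    return [d for d in top_k(query, docs, k) if keep(d)]


def pre_filter(query, docs: list[dict], k: int, keep: Callable[[dict], bool]) -> list[dict]:
    """Apply the filter first, then retrieve K neighbours from what remains."""
    return top_k(query, [d for d in docs if keep(d)], k)


if __name__ == "__main__":
    docs = [
        {"id": "A", "vector": [1.0, 0.0], "in_stock": False},
        {"id": "B", "vector": [0.9, 0.1], "in_stock": False},
        {"id": "C", "vector": [0.5, 0.5], "in_stock": True},
        {"id": "D", "vector": [0.1, 0.9], "in_stock": True},
    ]
    in_stock = lambda d: d["in_stock"]
    print([d["id"] for d in post_filter([1.0, 0.0], docs, 2, in_stock)])  # []
    print([d["id"] for d in pre_filter([1.0, 0.0], docs, 2, in_stock)])   # ['C', 'D']
```

Post-filtering can leave fewer than K results, or none at all, when the nearest neighbours fail the filter, whereas pre-filtering always draws its K results from documents that pass it.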

Conclusion

We hope that this blog has provided you with valuable insights into the technical details of building vector search in your own e-commerce search engine. Please do try out vector search in Chorus yourself – as always, we welcome your feedback and comments and look forward to hearing how you are using the framework in your own projects.

If you need help building effective e-commerce search, contact us today.