Last month I had the chance to present on E-commerce Search with Vectors at Haystack on Tour, a search event co-hosted by OpenSource Connections & edrone in Kraków, Poland. Every talk that day touched on vector search in one way or another; it is a very active and fast-moving research area in search and information retrieval. You can watch the video of my talk here.
Unlike keyword-based search, which relies on matching tokens, vector search focuses on the semantic meaning of the text being searched. In this blog post we will try to build a better understanding of vector search and explore in more detail how it works, its advantages over traditional search engines and some real-world examples using a working demo, which by the way is also open source and available to try.
Whether you are a data person, an engineer or just curious about the latest developments in search technology, this post will provide some valuable information about the world of vectors.
What is vector search?
How vector search compares to keyword search
If we take a look at the history of the vector space model we discover that it is not something introduced recently. There are related citations available from the 1960s and 70s. In fact, the inverted index (used by traditional search engines) and the vector space model (used by vector search) are two common approaches for fetching relevant documents in information retrieval systems.
We can understand an inverted index as a derivative of a sparse vector model of N dimensions, where N is the size of the vocabulary (after stemming, removing stopwords etc.). Each dimension/axis represents a token in the index, and each document is represented by a vector that holds a non-zero whole number (e.g. the term frequency) for every token present in that document and 0 for every token that is missing.
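To make the relationship concrete, here is a minimal sketch of the sparse model described above. The three product titles are made up for illustration, and the tokenizer is a plain whitespace split with no stemming or stopword removal, which is a simplifying assumption.

```python
from collections import Counter, defaultdict

# Toy product titles (hypothetical data), assumed to be already lowercased.
docs = {
    "d1": "laptop backpack",
    "d2": "baby stroller",
    "d3": "baby backpack carrier backpack",
}

# The vocabulary fixes the N dimensions: one axis per distinct token.
vocabulary = sorted({token for text in docs.values() for token in text.split()})

def sparse_vector(text):
    """Return the term-frequency vector of `text` over `vocabulary`."""
    counts = Counter(text.split())
    return [counts[token] for token in vocabulary]

vectors = {doc_id: sparse_vector(text) for doc_id, text in docs.items()}

# The inverted index stores the same information token by token:
# for each token, only the documents with a non-zero entry are kept.
inverted_index = defaultdict(dict)
for doc_id, text in docs.items():
    for token, freq in Counter(text.split()).items():
        inverted_index[token][doc_id] = freq

print(vocabulary)                        # the axes of the vector space
print(vectors["d3"])                     # mostly zeros once the vocabulary grows
print(dict(inverted_index["backpack"]))  # documents containing "backpack"
```

With a realistic vocabulary of many thousands of tokens, almost every entry of each document vector is 0, which is why the inverted index only stores the non-zero ones.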
Vector search seems a very natural extension of this model. One of the major differences is that vector search uses dense vectors with a fixed number of dimensions that does not depend on the size of the vocabulary.
The significance of semantic meaning in the vector space model
During my talk I presented the picture above to explain how capturing the “semantic meaning” would be a game changer in e-commerce search. It represents a model vector space containing different kinds of products, which form clusters for the different product segments. For example, laptops have their own group, mobile phones and tablets sit close to laptops, signifying some similarity, while baby products are grouped close to baby toys.
This representation is based on the encoding generated by a transformer. There are various embedding techniques available that vary in their capabilities. For our prototype, we wanted to use the multimodal capabilities of CLIP, a neural network trained on a variety of (image, text) pairs. This means it can map text and image data into the same vector space. However, due to token limitations we decided to use CLIP only for the image encodings and MiniLM for the text encodings.
E-commerce search is largely affected by a vocabulary mismatch problem where the words or phrases used by customers in their search queries do not match the product information provided by the online retailers. This can lead to inaccurate or incomplete search results, making it difficult for customers to find what they are looking for and potentially leading to lost sales. Often this is addressed with synonyms, stemming and other transformations at both indexing and query time, which usually have to be maintained manually.
When performing vector search, the semantic meaning of the query is represented as a vector encoding, and this vector is compared to the vectors of the documents in the index. The documents that are closest in vector space to the query vector are considered the most relevant to the query. In other words, we can quantify the semantic similarity between a document and the query by calculating the vector similarity.
If a user searches for “baby backpack”, the query encoding would land somewhere between backpacks and baby products, essentially implying something like a “backpack meant for baby things” or “backpack for babies”. Our system can then fetch the closest relevant products.
Similarly, for the query “gaming console”, the encoding will fall somewhere between games and other electronic products, so the closest results to this space will be returned. By leveraging semantic meaning in this way, vector search is able to surface relevant results even when the query and the documents in the index use different words to express product details, e.g. when there is no product explicitly called “baby backpack”.
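The ranking described in this section can be sketched in a few lines. This is only an illustration: the three-dimensional vectors are hand-made stand-ins for real embeddings (a real model would produce hundreds of dimensions), the product names are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (hypothetical values, not model output).
# The axes loosely mean: [baby-related, bag-related, electronics-related].
product_vectors = {
    "diaper bag": [0.8, 0.7, 0.0],
    "hiking rucksack": [0.1, 0.9, 0.1],
    "teething rattle": [0.9, 0.1, 0.0],
    "gaming laptop": [0.0, 0.1, 0.9],
}

def search(query_vector, k=2):
    """Return the names of the k products closest to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in product_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Stand-in for the encoding of "baby backpack": between babies and bags.
query_vector = [0.7, 0.7, 0.0]
print(search(query_vector))
```

In this toy setup the top hit is the diaper bag, even though its name shares no token with “baby backpack”. That is the vocabulary mismatch case a pure keyword match would miss.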
How e-commerce search with vectors could be a game changer
Vector search allows for more accurate search results, especially in the low-recall situations that arise from long-tail queries, multilingual search or zero-hit queries. Better results in these cases can lead to an elevated user experience and improved sales. Traditionally, e-commerce search engines revolve around keyword matching to find relevant products. However, this approach is limited by the language and terminology used by the customer, which may not take into account the nuances of a particular product.
Vector search, on the other hand, uses deep neural networks to represent products and user queries as vectors in a high-dimensional space. This allows for more accurate comparisons between products and queries, and can result in more relevant search results. E-commerce businesses can also leverage vector search to provide customers with more relevant recommendations and targeted advertising, leading to increased customer engagement and improved product development and marketing campaigns.
Four steps to vectorize your e-commerce search
There are four major steps involved in vectorising your e-commerce search:
- Gather your product dataset
- Set up a process to transform your product information text (and/or product images) into embedding vectors
- Change your search algorithm to leverage vectors
- Test and optimise
These may all seem like very simple steps, but each one is a process in itself. Hence, we decided to create a sample application using Chorus so you can try it out with real data and real queries. Chorus is an open source reference implementation for e-commerce search that comes complete with a sample shop and a powerful search engine (read our series of posts about how Pete, the fictional e-commerce search product owner, solves a series of search problems using it). After all, seeing is believing!
Do vectors really improve search results?
Using the sample ‘Chorus Electronics’ web shop, we tried some of the queries that might normally give us a hard time without being processed and manipulated using our usual techniques such as synonyms, spellchecker and query pre-processors like Querqy. We were happily surprised that none of it really mattered anymore after we sprinkled on some vector pixie dust!
Here are some examples of quick improvements over the default algorithm (which uses the standard Apache Solr edismax query parser).
The results above show that the default algorithm focused merely on the keyword “notebook” and brought in a lot of accessories which are probably related to notebooks but aren’t essentially “notebooks” – they’re accessories. Using a vector-based query we got relevant notebook computers as desired in the search results.
When we use the query “projector screen” with the default algorithm, we get two projector screens and the rest are all projector lamps and other hardware related to projectors, while with a vector query we get the desired projector screens in the search results.
The query above is an example of how vectors can handle misspellings across fields. The default algorithm got us mixed and irrelevant results, while the vector query got us the results we might have expected if we had spelled our query “lexmark toner cartridge” correctly. We are using image vectors this time.
The example above shows the hybrid usage of vectors with traditional BM25 keyword ranking. We used image vectors as a boost query for this one. With the default algorithm we still get a lot of irrelevant results like UPS, network adapters and LAN test devices, while with vectors used for boosting we got all relevant network cables in the search results.
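As a rough illustration of the hybrid idea, the sketch below adds a weighted vector similarity on top of a keyword score. It is not how Chorus implements it: the token-overlap function is a crude stand-in for BM25, the two-dimensional vectors are hand-made placeholders for image embeddings, and the product titles, `boost_weight` and the additive combination are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def keyword_score(query, title):
    """Crude stand-in for BM25: number of query tokens found in the title."""
    title_tokens = set(title.lower().split())
    return sum(1 for token in query.lower().split() if token in title_tokens)

def hybrid_rank(query, query_vector, products, boost_weight=2.0):
    """Rank products by keyword score plus a weighted vector-similarity boost."""
    scored = []
    for title, vector in products.items():
        boost = boost_weight * cosine_similarity(query_vector, vector)
        scored.append((keyword_score(query, title) + boost, title))
    scored.sort(reverse=True)  # highest combined score first
    return [title for _, title in scored]

# Hypothetical catalogue with toy vectors; the first axis means "looks like a cable".
products = {
    "network cable 5m": [0.9, 0.1],
    "network adapter usb": [0.2, 0.9],
    "patch cord cat6": [0.8, 0.2],
}
query_vector = [1.0, 0.0]  # stand-in for the encoded query

print(hybrid_rank("network cable", query_vector, products, boost_weight=0.0))
print(hybrid_rank("network cable", query_vector, products))
```

With the boost switched off, the adapter outranks the patch cord because it shares the token “network” with the query. With the boost on, the patch cord moves ahead of the adapter, since its vector is much closer to the query vector.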
We can already see that vector search is likely to become a key technology for improving search results in e-commerce. My next blog post shows in more detail how we built these new vector search features in Chorus, our reference implementation of open source e-commerce search – including how you can try this out yourself!
Interested in building e-commerce search with vectors? Let us help.