We are very happy to present a guest blog on query expansion by Johannes Peter – Freelance Principal Consultant Data & Search and co-maintainer of the Querqy library for query rewriting.
Query rewriting can be seen as an umbrella term for many things that happen in search engines before the actual matching and scoring is done. The main goal of query rewriting is to improve the search quality and to address user needs, for instance by
- handling language irregularities (e. g. stemming or ASCII folding),
- compensating human mistakes (e.g. typos, word breaks) or
- considering the semantic meaning of terms (e.g. synonyms).
For all parts of query rewriting, two major questions arise: (1) how to identify rewriting actions and (2) how to manipulate the query accordingly. For instance, for a feature like spell correction, rewriting actions can be identified by examining a query for terms that do not exist in the index, as well as by finding terms which are very similar regarding their character sequence and which finally do exist in the index. Subsequently, the query can be manipulated by replacing the terms (there definitely exist more sophisticated approaches, but nevertheless this should be the most common one).
However, other parts of query rewriting require approaches that are more difficult to implement. For features like handling synonyms or word breaks, we cannot simply replace terms, we rather need to expand the query. Synonyms always must be treated as alternatives to terms. Additionally, for many word break cases, the correct variant is not indisputable, the variants therefore must be considered as alternatives as well e.g.
The scope of this article is to discuss different types of query expansion in the context of exemplary rewriting cases. These cases are very important to consider when developing a search engine for two reasons: first, the way of query expansion heavily affects the precision of matching and scoring and second, none of the subsequently described techniques are applied properly or applied at all by the existing Lucene-based search engines (Solr, Elasticsearch and OpenSearch). This leads to unexpectedly imprecise results for many queries, which consequently do not meet the expectations of modern state-of-the-art search solutions.
The article therefore:
- will help to understand why many queries in a Lucene-based search engine do not fully work as expected,
- can be used as a guide for implementing your own rewriting solution, and finally
- explains how the Querqy query rewriting library / plugin works internally and why it helps to improve search engines.
Query expansion types
Considering a query input as a conjunctive clause
Queries and their terms basically can be structured in a quite simple manner: terms can be considered as literals in either disjunctive or conjunctive clauses, and clauses can be nested in other clauses. A query input like
apple smartphone can be considered as a conjunctive clause, which means that in this case we require all terms to match in a document in order to be part of the result set:
apple AND smartphone
Considering the terms of a user input as a conjunction is a common approach to ensure that the number of relevant documents in the result is high. Users searching for
apple smartphone will not want to see apple tablets or smartphones by other manufacturers in the results. Furthermore, a high precision ensures that features like facets or non-relevance-based sorting work as expected.
It is worth mentioning that the described query structure can be easily transformed to the query syntax of Lucene or Elasticsearch (Query DSL). Each conjunctive clause can be transformed to a boolean query with
must clauses, whereas each disjunctive clause can be transformed to a boolean query with
should clauses. Regarding Elasticsearch, the given query example could be transformed into the following (simplified example only considering one field):
Expanding a single term by a single term
Let’s now assume that we want the query not only to match to smartphones, but also (more generically) to mobiles. We therefore consider the term
mobile as a synonym or alternative to the term
smartphone. As a first step, the rewriting actions must be identified by extracting different term sequences out of the query. For this simple query, three subsequences can be extracted:
apple smartphone. For each of the sequences, we apply a lookup in our synonyms and identify “mobile” as a synonym for the subsequence “smartphone”. Therefore, we must expand this subsequence. This can be easily done by expanding the term “smartphone” to a disjunctive clause and adding the term “mobile”:
apple AND (smartphone OR mobile)
Expanding multiple terms by a single term
As an additional synonym, we want
iphone additionally to match documents for all queries containing the subsequence
apple smartphone. First, we again must extract all subsequences of the given query. In continuation with the query above, we now identify the following subsequences:
apple mobile. The logic for the subsequence extraction works in a way that each term or clause embedded in a conjunctive clause is appended to a subsequence, while each term or clause embedded in a disjunctive clause forks the subsequence. Applying a lookup mechanism for synonyms, we identify
iphone as a synonym for the subsequence
apple smartphone. As
smartphone is part of a disjunctive clause,
iphone can be considered as an alternative for the entire query. The most straightforward way therefore should be simply to embed the existing query and
iphone into a disjunctive clause:
iphone OR (apple AND (smartphone OR mobile))
Even though this way appears to be the simplest way of manipulating the query, it has several disadvantages when it comes to later steps of the query processing. The main reason for this is that we do not only expand the query, but change its basic structure. The integrity of the original query, comprising two terms or clauses at the highest level, is not preserved. For instance, applying a relaxation technique like minimum-should-match on this query will lead to unexpected results. Furthermore, the way of scoring documents would also be affected, making scores less comparable across different queries. Thus, we should always aim to keep the integrity of the original query. Expanding the query should not change its original structure. On the basis of the given query, we can achieve this by applying the distributive law of Boolean algebra:
(apple OR iphone) AND (smartphone OR mobile OR iphone)
As we can see, the integrity of the original query is now preserved, as the highest level of the query still comprises two terms or disjunctive clauses embedded in a conjunctive clause (a single term can be seen as a disjunction containing only one literal). However, the price we pay for this is that we now have a repeated term in our query. This will complicate subsequent subsequence extraction and query manipulation. Even though the term
iphone exists in two subsequent disjunctive clauses, it must be considered as an alternative for both together, as there is no subsequence like
iphone iphone in this query, but simply
Furthermore, we need to consider repeated terms or clauses when we manipulate queries. Let’s take the query input
smart phone case as an example. As a rewriting first step, we identify
smartphone as a word break alternative for the subsequence
smart phone and expand it accordingly. The rewritten query looks as follows:
(smart OR smartphone) AND (phone OR smartphone) AND case
Additionally, we have defined the synonym
smartphone case. We now need to consider that
smartphone is a repeated term, so the query expansion needs to happen for all clauses containing it. The rewritten query now looks as follows:
(smart OR smartphone OR backcover) AND (phone OR smartphone OR backcover) AND (case OR backcover)
Even though repeating clauses comes with a price, its benefits become finally clear as soon as expansions need to be done on overlapping subsequences of a query. Let’s say we want to apply two synonyms on the user input
apple smartphone case. The subsequence
apple smartphone is supposed to be expanded by the synonym
iphone whereas the subsequence
smartphone case is supposed to be expanded by the synonym
backcover. As we never actually change the main structure of the query and only expand queries by enhancing clauses, we can add both synonyms independently of each other, resulting in the query
(apple OR iphone) AND (smartphone OR iphone OR backcover) AND (case OR backcover)
Expanding a single term by multiple terms
Let’s now think the other way round. We still consider
apple smartphone and
iphone to be synonyms, but the user input now is
iphone case and we want to expand the subsequence
apple smartphone. Still we do not want to touch the integrity of the query comprising two terms or disjunctions at the highest level, so we embed
iphone in a disjunction clause and expand it by an embedded conjunction with the terms
(iphone OR (apple AND smartphone)) AND case
As we now have a new query structure (embedded conjunction), the question arises how subsequences can be extracted out of such a query. The answer is the same as above: terms in conjunctions extend subsequences, while terms in disjunctions fork them. Therefore, the following subsequences can be extracted:
smartphone case, and
apple smartphone case.
Furthermore, the subsequence
smartphone case can be expanded by the synonym
backcover in the same way as before. Both terms need to be embedded in a disjunction, together with the repeated synonym:
(iphone OR (apple AND (smartphone OR backcover))) AND (case OR backcover)
By doing so, we achieve the same matching result as above: documents match as soon as they contain at least one of the following term combinations:
apple backcover or
apple smartphone case.
Expanding multiple terms by multiple terms
Expanding multiple terms by multiple terms might seem to be an edge case. Actually, real-world rule collections of modern search engines are full of them. The reasons for configuring such synonyms are heterogeneous. In several cases, they are (not surprisingly) about semantic relations between multiple terms, e.g.
two door fridge and
side by side. In other cases, they are about relations between single terms in a specific context, e.g.
electric shaver and
rechargeable shaver. Finally, they can also be about relaxing a term without deleting it, e.g.
beats by dr dre and
beats by dre. The reason for not deleting a term despite the need of relaxation e.g. can be about generic phrase boosts.
For expanding multiple words by multiple words, we simply combine the techniques of the previous examples: we repeatedly embed conjunctions. The query for the input
two door fridge and the synonym
side by side then looks as follows:
(two OR (side AND by AND side)) AND (door OR (side AND by AND side)) AND (fridge OR (side AND by AND side))
It is worth mentioning that expanding multiple terms by multiple terms is also important for word break issues. For instance, a query like
i phonese is not only required to be concatenated, but also to be split at a different position. The resulting query looks as follows:
(i OR (iphone AND se)) AND (phonese OR (iphone AND se))
Summary of the query expansion algorithm
On the basis of the discussed cases, we can reformulate the definition of queries: At the highest level, the query simply is a conjunction, which contains one or more disjunctive clauses. The embedded disjunctive clauses contain either literals (terms) or again conjunctive clauses (which again contain one or more disjunctive clauses and so on).
For synonym or word break (or other) lookups, we extract subsequences out of queries. Clauses in conjunctions append subsequences, while disjunctions fork them. If a query contains a repeated clause (e.g. previously added synonyms), the clause needs to be considered as an alternative for all disjunctions where it is included.
As soon as subsequences match to an input (e.g. synonym), the query gets expanded, whereas the integrity of the query (size of the conjunctions) remains. The query is only expanded by adding terms or clauses to the disjunctions. The main benefit of this is that subsequences can be expanded fully independently, e.g. in the context of multiple and especially overlapping expansions.
Queries get expanded by adding terms or clauses to each disjunction included in the matching subsequence. If the subsequence is expanded by multiple terms, the terms are embedded in a conjunction.
At a first glance, the single cases might appear to be edge cases, but in sum they are not. A state-of-the-art search engine will need to consider all of these cases to avoid bad and unexpected results for many queries. Therefore, search teams will be required to implement the described rewriting methods. Alternatively, the rewriting library / plugin Querqy can be used for Solr and Elasticsearch. It implements all the mentioned methods and provides additional query-based features, such as (de)compounding or term-specific boosting.
Limitations and outlook
This article mainly focuses on query expansion in a context where a high precision of queries, and a high precision of expansions (e.g. synonyms) matter. However, even in such contexts, the highest level of the query is not always defined as a strict conjunction as this might negatively impact the recall, especially for longer queries. In such cases, relaxation strategies are applied, which are perfectly compatible to the described expansion approach, as the highest level of the query will always have the same size as the original user input.
Relaxation in general is another important aspect of query rewriting. This topic is not only about how terms are required to match, but also about whether they are needed at all. In frequent cases, terms do not really contribute to a search and rather bias its results. However, this topic is quite comprehensive and goes far beyond removing stopwords (which should be the most well-known case of removing terms from queries). Therefore, this topic needs to be discussed separately.
So far, we have only discussed the issue of matching properly, but for a full picture of query expansion, the issue of scoring needs to be discussed as well. As a default, Lucene simply sums up scores for everything that matches. This means that a document matching to
apple smartphone is scored higher than a document matching to
iphone, even if the two term sequences are defined as synonyms.
Even though the current relevance discussions are mainly about learning to rank, the importance of a proper base scoring should not be underestimated, as the predictive approaches currently are mainly about re-ranking e.g. of the top 1000 matches. However, we could, for instance, have an index containing 4000 smartphone cases with varying product type designations. 2000 cases could have the terms
smartphone case as product type, the other 2000 the term
backcover. With the default Lucene scoring, the cases with
smartphone case will always be on top, irrespective of any re-ranking algorithm, as the bias in the base scoring exceeds its scope of re-ranking the top 1000 documents. The different aspects of scoring will be discussed in part 2 of this blog series.
Image by Vecteezy
About Johannes Peter
Johannes is a passionate leader, consultant and engineer in the areas of search, data and cloud. He is experienced in driving and implementing solutions from initial ideas towards systems running in productive environments and measurably contributing to business KPIs.
Before he became a freelancer, Johannes worked for several years for the MediaMarktSaturn Retail Group. As a Product Owner and Lead Engineer, he was the founder of a new search engine product team and the main driver of a new cloud-based search engine solution for all sales lines across Europe. Later he was commissioned to drive the webshop data strategy and the development of a data distribution and analytics infrastructure. Prior to joining MediaMarktSaturn he worked as a Consultant in the areas of Enterprise Search and Big Data.
Throughout his entire career, Johannes has been passionate about open source. In his early days, he contributed several features to Apache NiFi to improve its integration with Apache Solr. Later he became a main contributor and committer for Querqy, a query rewriting solution for Apache Solr and Elasticsearch used by various big retail companies worldwide.