Until Decompounding Do Us Part - Solving compound noun challenges with Querqy & LLMs

October 30, 2023 Daniel Wrigley
Category: Natural Language

Some languages have weird rules, and as a German native speaker working in search I know this well. For instance in German, you are technically allowed to join any number of nouns together to form new words. There are rules to follow for joining these nouns, but in general this provides a lot of flexibility. These new words are called compound nouns, and this makes the German language one with a potentially infinite number of words as new ones can be created on the fly.

Compounding is not exclusive to the German language and it proves to be a challenge across many domains in the world of search. You can even encounter compounding & decompounding challenges in languages that do not allow compounding, e.g. when users do not know how to spell certain brands or terms in a highly specialized domain. If your search does not account for these variants, your users will not find what they are looking for. In other words: you are leaving potential for user experience improvements untapped.

Below I will explain why compounding is a challenge in more detail and how you can tackle it successfully, followed by a practical example and a glimpse of how useful language models are in that area.

The Challenges with Compound Nouns

We can group the challenges with compound nouns related to search into three categories:

The compound noun is indexed and the user searches for its units rather than the compound form. Example: firefighter is what we have indexed and the user searches for the separate tokens fire fighter.
The units of a compound noun are indexed and the user searches for the compound noun. Example: fire fighter is what we have indexed and the user searches for firefighter.
The compound noun is indexed and the user searches for an alternative way of putting the units together in a query. Example: hubcap is what we have indexed and the user searches for cap for hub.

In all three cases, there is a mismatch between what the index holds and what users are searching for. This is a classic cause of a bad user experience: it potentially leads to zero results, or at least to significantly lower recall because relevant documents are missed.

The Solution: Index-Based Query Rewriting

One approach to resolve this mismatch is to change the query so that it closes the gap between itself and the indexed values. What we often see in practice is an attempt to solve this by adding synonyms at index time, at query time, or both. This method does not scale very well, as you need to know upfront which compounds to consider and you have to add new rules as your content changes.

A solution that is based on the index where the indexed values serve as our dictionary of valid words makes more sense as this is automatically updated as new content goes into our index or content gets removed. This dictionary then is used at query time within a process called query rewriting: take the user query as the input and change it according to certain rules. Rules can mean a variety of things: in this context, rules mean an algorithmic change of the query resulting either in a concatenation of query terms (i.e. creating a compound) or a split of one (or multiple) query term(s) (i.e. decompounding).

These operations should happen based on what is in our index. Let’s return to the fire fighter example:

Our user query is the compound noun firefighter. The indexed terms are fire and fighter. It’s now the job of the query rewriting pipeline to do lookups in the index that retrieve either good ways to split the query terms or to join the query terms. In this example, a good way to split the query is retrieved: splitting firefighter into the two terms fire and fighter. Ideally, we still keep the original user query as part of the resulting rewritten query as there is no guarantee that this split is exactly what the user meant: it is just our best guess based on what the query rewriting process finds in the index.

So the resulting query may become something like (firefighter OR (fire AND fighter)). This query matches any document either containing firefighter or both of the compound constituents fire and fighter. That way we still match any document containing the original user query and do not sacrifice precision by only searching for one of the compound units.
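The rewriting idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Querqy's implementation: the dictionary is a plain set of indexed terms, only two-part splits are considered, and the function names `split_candidates` and `rewrite` are made up for this example.

```python
def split_candidates(term, dictionary, min_len=3):
    """Return all two-part splits of `term` whose parts both occur in the dictionary."""
    return [
        (term[:i], term[i:])
        for i in range(min_len, len(term) - min_len + 1)
        if term[:i] in dictionary and term[i:] in dictionary
    ]


def rewrite(query, dictionary):
    """Keep the original query and OR it with variants backed by the dictionary."""
    terms = query.lower().split()
    if len(terms) == 1:
        # Decompounding: look for splits of the single query term.
        original = terms[0]
        alternatives = [
            f"({left} AND {right})"
            for left, right in split_candidates(original, dictionary)
        ]
    else:
        # Compounding: join the query terms if the joined form is indexed.
        original = "(" + " AND ".join(terms) + ")"
        joined = "".join(terms)
        alternatives = [joined] if joined in dictionary else []
    if not alternatives:
        return original
    return "(" + " OR ".join([original] + alternatives) + ")"


print(rewrite("firefighter", {"fire", "fighter"}))
print(rewrite("fire fighter", {"firefighter"}))
```

The original query always stays part of the result, so a split or join that turns out to be a bad guess can only add matches and never removes the ones the user asked for.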

Compounding & Decompounding with Querqy – A Practical Example

Querqy is an open source library for query rewriting. Compounding & decompounding is currently supported in the search engine specific plugins for Solr, Elasticsearch, and OpenSearch via the Word Break Rewriter. (The ‘unplugged’ version, suitable for using independently of one of these engines, does not yet implement the Word Break Rewriter – contributions welcome!)

If you want to see this example in action without going through these steps you can watch my lightning talk at the past Haystack EU conference using similar examples.

Setup

I will show you how to apply compounding & decompounding with Querqy by using Elasticsearch as a search engine. Our es-tmdb GitHub repository that we use in our training courses comes with the Querqy plugin already integrated, so cloning this repository and firing up the containers provides you with the necessary components ready to go:

Clone the repository:

git clone https://github.com/o19s/es-tmdb.git

Change into the directory and start the containers:

cd es-tmdb && docker-compose up -d

Index some Data

We stick with the examples given in the challenges section of this blog and index two minimal documents in the index haystack:

PUT haystack/_doc/1
{
  "name" : "firefighter"
}

PUT haystack/_doc/2
{
  "name" : "fire fighter"
}

Query Data without compounding & decompounding

Now we can execute the three query patterns and see which documents are retrieved respectively:

Query with compound noun:

GET haystack/_search
{
  "query": {
    "match": {
      "name": "firefighter"
    }
  }
}

Retrieves only compound (id 1).

Query with compound units:

GET haystack/_search
{
  "query": {
    "match": {
      "name": "fire fighter"
    }
  }
}

Retrieves only compound units (id 2).

Query with alternative way of spelling (admittedly not the most likely way that users search but for the sake of sticking with English and the firefighter example I’ll use it nevertheless):

GET haystack/_search
{
  "query": {
    "match": {
      "name": "fighter of fire"
    }
  }
}

Retrieves only compound units (id 2).

Ideally we’d like to have both documents for all three query variants.
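To see why each query only finds one of the two documents, here is a toy model of the matching in Python. It assumes that the name field is split on whitespace and lowercased, and that a document matches as soon as it shares one token with the query, which approximates a match query with default settings. The helper names `tokens` and `match` are made up for this sketch.

```python
# The two documents indexed above, keyed by id.
docs = {1: "firefighter", 2: "fire fighter"}


def tokens(text):
    """Lowercase and split on whitespace (a rough stand-in for the analyzer)."""
    return set(text.lower().split())


def match(query, documents):
    """Return the ids of all documents sharing at least one token with the query."""
    query_tokens = tokens(query)
    return sorted(
        doc_id for doc_id, text in documents.items() if query_tokens & tokens(text)
    )


for query in ("firefighter", "fire fighter", "fighter of fire"):
    print(query, "->", match(query, docs))
```

No query token of fire fighter or fighter of fire equals the single token firefighter, and vice versa, so neither side of the mismatch can be bridged by term matching alone.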

Configure the Word Break Rewriter

The following command adds a word break rewriter to the Elasticsearch installation:

PUT  /_querqy/rewriter/word_break
{
    "class": "querqy.elasticsearch.rewriter.WordBreakCompoundRewriterFactory",
    "config": {
         "dictionaryField" :  "name",
         "lowerCaseInput": true,
         "decompound": {
             "maxExpansions": 5,
             "verifyCollation": true
         },
         "reverseCompoundTriggerWords": ["of"],
         "morphology": "GERMAN"
    }
}

The most important parts of the configuration are:

dictionaryField: the index field you want to base the lookups on. You usually want to copy some of your high quality input text fields (title, description, categories, classes, keywords, etc.) into this field.
reverseCompoundTriggerWords: a list of words that trigger the creation of a compound in reverse order when they appear between two query terms. With the trigger word of, a query like fighter of fire leads to the compound firefighter.
morphology: Some languages have specific compounding rules and the German morphology can be specified to apply these at query rewriting time.
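The idea behind reverseCompoundTriggerWords can be illustrated with a short Python sketch. It is not Querqy's code: the function name `reverse_compounds` and the use of a plain set as the dictionary are assumptions made for this example.

```python
def reverse_compounds(query, dictionary, trigger_words=("of",)):
    """Find compounds built from the terms around a trigger word, in reverse order.

    For "fighter of fire" the term after the trigger ("fire") is put in front of
    the term before it ("fighter"). The candidate is kept only if it is indexed.
    """
    terms = query.lower().split()
    found = []
    for i in range(1, len(terms) - 1):
        if terms[i] in trigger_words:
            candidate = terms[i + 1] + terms[i - 1]
            if candidate in dictionary:
                found.append(candidate)
    return found


print(reverse_compounds("fighter of fire", {"firefighter"}))
print(reverse_compounds("cap for hub", {"hubcap"}, trigger_words=("for",)))
```

As with decompounding, the lookup in the dictionary acts as the safety net: a reversed compound that does not exist in the index is simply not added to the query.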

The German morphology is currently the only one directly supported in Querqy. The rules implemented are the ones listed in Stefan Langer’s paper “Zur Morphologie und Semantik von Nominalkomposita”. (And by the way, Stefan Langer is the university lecturer who introduced me to search engines during my studies.) Contributions for additional language support in Querqy are welcome in case you are looking for a language specific set of rules to be applied.

Query Data with compounding & decompounding

To apply the configured query rewriter you need to change the queries slightly. Here are the three query examples now with compounding & decompounding:

GET haystack/_search
{
  "query": {
      "querqy": {
          "matching_query": {
              "query": "firefighter"
          },
          "query_fields": [ "name" ],
          "rewriters": [ "word_break" ]
      }
  }
}

GET haystack/_search
{
  "query": {
      "querqy": {
          "matching_query": {
              "query": "fire fighter"
          },
          "query_fields": [ "name" ],
          "rewriters": [ "word_break" ]
      }
  }
}

GET haystack/_search
{
  "query": {
      "querqy": {
          "matching_query": {
              "query": "fighter of fire"
          },
          "query_fields": [ "name" ],
          "rewriters": [ "word_break" ]
      }
  }
}

All three queries now retrieve both documents: challenge accepted and challenge solved!
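Since the three requests differ only in the query string, the request body can be generated. Below is a minimal Python sketch that builds the same JSON structure as shown above. The helper name `querqy_search_body` is made up for this example, and sending the body to Elasticsearch is left out on purpose.

```python
import json


def querqy_search_body(user_query, fields=("name",), rewriters=("word_break",)):
    """Build the body of a search request that runs the query through Querqy."""
    return {
        "query": {
            "querqy": {
                "matching_query": {"query": user_query},
                "query_fields": list(fields),
                "rewriters": list(rewriters),
            }
        }
    }


for user_query in ("firefighter", "fire fighter", "fighter of fire"):
    print(json.dumps(querqy_search_body(user_query), indent=2))
```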

Last but not least, an example in “pseudo-English” to demonstrate the German morphology. Some nouns can be joined together by adding an `s` in between, e.g. Sitzungssaal (conference room). Applying this to the firefighter example:

GET haystack/_search
{
  "query": {
      "querqy": {
          "matching_query": {
              "query": "firesfighter"
          },
          "query_fields": [ "name" ],
          "rewriters": [ "word_break" ]
      }
  }
}

Without decompounding, this query retrieves no document at all. With decompounding and the German morphology rules applied, it retrieves the document with the id 2.
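A rough Python sketch of what a morphology-aware split does with this query is shown below. It only knows plain concatenation and the linking `s` mentioned above, whereas the actual German morphology in Querqy covers more patterns. The names `LINKING_ELEMENTS` and `decompound` are made up for this example.

```python
# "" stands for plain concatenation, "s" for the linking s as in Sitzung+s+saal.
LINKING_ELEMENTS = ("", "s")


def decompound(term, dictionary, min_len=3):
    """Return (modifier, head) splits where both parts are in the dictionary.

    A linking element at the end of the left part is stripped before the lookup.
    """
    term = term.lower()
    splits = []
    for i in range(min_len, len(term) - min_len + 1):
        left, head = term[:i], term[i:]
        for link in LINKING_ELEMENTS:
            if link and not left.endswith(link):
                continue
            modifier = left[: len(left) - len(link)]
            if modifier in dictionary and head in dictionary:
                splits.append((modifier, head))
    return splits


print(decompound("firesfighter", {"fire", "fighter"}))
print(decompound("Sitzungssaal", {"sitzung", "saal"}))
```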

Compound Nouns and LLMs

What about LLMs? Given the vast amounts of textual data they are trained on, don’t they have some kind of built-in compounding/decompounding knowledge that we can leverage to tackle these challenges? The short answer: you still have to take care of decompounding yourself and can’t simply hand the challenge over to a language model.

Why is that the case? As Jo Kristian Bergum pointed out in his recent Haystack talk, language models are sensitive to spelling mistakes. You can view the three cases stated at the beginning as a kind of spelling mistake, or as spelling variants of the same concept. While we as humans can clearly see that firefighter and fighter of fire are synonymous, this is not clear to a language model: processing these two inputs results in different vector representations. They may be similar, but they are still different.

If you want to consider using a large language model to do the compounding/decompounding for you, instead of building a dictionary field in your index and using query rewriting tools like Querqy, I’d like to mention one recent paper: “CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models” by Benjamin Minixhofer, Jonas Pfeiffer & Ivan Vulić. Their research shows that, unfortunately, available language models are no “off-the-shelf” solution: for some languages they require additional training steps to produce results that can be relied on. For instance, the proposed two-step training process reaches an accuracy of 96.6% for German.

When comparing the generalist approach to the dictionary-based approach of Querqy it is worth noting that you automatically build your own in-domain dictionary with Querqy. It uses your index after all! This may be especially useful for preserving or even increasing precision. Language models may identify compound units that generally exist but may not be present in your index, which can hurt retrieval quality.
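If you do experiment with a language model for decompounding, one way to keep the in-domain advantage is to check its suggestions against your index. The Python sketch below assumes the model output is already available as a list of candidate splits. The candidates, the index terms and the function name `keep_indexed_splits` are invented purely for illustration.

```python
def keep_indexed_splits(candidate_splits, index_terms):
    """Keep only those splits whose parts all occur in the index."""
    return [
        split
        for split in candidate_splits
        if all(part in index_terms for part in split)
    ]


# Hypothetical model output for the query "firefighter".
candidates = [("fire", "fighter"), ("fire", "fight", "er")]
index_terms = {"fire", "fighter", "firefighter"}

print(keep_indexed_splits(candidates, index_terms))
```

Splits that are linguistically plausible but have no counterpart in the index are dropped before they reach the query.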

Summary

While it is tempting to use language models for any linguistic challenge nowadays, they are not a silver bullet. Decompounding is no exception, as recent research shows.

Dictionary-based approaches such as the one Querqy provides still prove to be useful in increasing recall while preserving precision, which makes Querqy our “go-to tool” for query compounding & decompounding.

Give it a try and let us know how that works for you!

If you need help to improve your search result quality with decompounding or additional processes, get in touch with us today.

Image from Free Stock photos by Vecteezy
