Our Solution to Solr Multiterm Synonyms: The Match Query Parser

January 23, 2017 OpenSource Connections
Category: Relevancy

You have probably heard us talk about Solr multiterm synonyms a lot! It’s a big problem that prevents a lot of organizations from getting reasonable search relevance out of Solr. The problem has been described as the “sea biscuit” problem because, if you have a synonyms.txt file like:

sea biscuit => seabiscuit

… you unfortunately won’t get what you expect at query time. This is because most Solr query parsers break up query strings on spaces before running query-time analysis. If you search for “sea biscuit,” Solr sees this first as [sea] OR [biscuit]. The required analysis step then happens on each individual clause: first on just “sea,” then on just “biscuit.” Because the analyzer never sees “sea” immediately followed by “biscuit,” query-time analysis doesn’t recognize the synonym listed above. Bummer.
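To make the failure concrete, here is a minimal Python sketch. It is a toy model of synonym matching, not Solr code, and the function names are our own invention for illustration. It applies the same rule two ways: once to the whole token stream, and once clause by clause after splitting on whitespace.

```python
# Toy model of query-time synonym matching; not Solr code.
# Mirrors the rule: sea biscuit => seabiscuit
SYNONYMS = {("sea", "biscuit"): ["seabiscuit"]}


def apply_synonyms(tokens):
    """Replace any token run that matches a synonym rule, longest match first."""
    out, i = [], 0
    max_len = max(len(key) for key in SYNONYMS)
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            key = tuple(tokens[i:i + n])
            if len(key) == n and key in SYNONYMS:
                out.extend(SYNONYMS[key])
                i += n
                break
        else:
            # No rule starts at this token, so pass it through unchanged.
            out.append(tokens[i])
            i += 1
    return out


def analyze_whole_query(query):
    """Analysis sees the full token stream, so 'sea biscuit' can match."""
    return apply_synonyms(query.lower().split())


def analyze_per_clause(query):
    """Mimics splitting on whitespace first, then analyzing each clause alone."""
    out = []
    for clause in query.split():
        out.extend(apply_synonyms([clause.lower()]))
    return out


print(analyze_whole_query("sea biscuit"))
print(analyze_per_clause("sea biscuit"))
```

The per-clause version never hands the rule two adjacent tokens, so the two-word synonym can never fire.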

As a team that focuses on Solr relevancy, we often have to solve this problem. One solution we’ve used has been to inject the analysis into the query parsing stage. Indeed, that’s what the hon-lucene-synonyms plugin does. It is a tremendous contribution from Nolan Lawson to the Solr community that tries to “do the right thing” for you.

It creates a query parser that:

Runs an analyzer specified in your solrconfig.xml (Nolan gives a recommended analyzer)
Turns the result into a Solr edismax query

This works for many people. It’s sort of the “don’t make me think” option. However, in practice we’ve found it tends to generate rather complex queries. For example, in the README.md, a simple query for “dog” with synonyms “canis familiaris”, “hound”, “man’s best friend” and “pooch” is expanded to this complex beast:

(DisjunctionMaxQuery((text:dog))^1.0 ((+(DisjunctionMaxQuery((text:canis)) 
DisjunctionMaxQuery((text:familiaris))))/no_coord^1.0) ((+DisjunctionMaxQuery((text:hound)))/no_coord^1.0) 
((+(DisjunctionMaxQuery((text:man's)) DisjunctionMaxQuery((text:best)) 
DisjunctionMaxQuery((text:friend))))/no_coord^1.0) ((+DisjunctionMaxQuery((text:pooch)))/no_coord^1.0))

It gets nastier with the many options that the query parser offers, such as phrase queries, other edismax features, and control over how synonyms are grouped. It gets particularly complex when you search over multiple fields and want per-field boosts. As a team, we’ve spent an inordinate amount of time debugging weird relevance and performance problems related to these complex queries. As a result, OSC relevance engineers have done quite a bit of the maintenance on this plugin: adding features, documenting, and fixing bugs.

Hunting for a better ‘match’

Although we value the hon-lucene-synonyms work, we’ve often found a more surgical approach to query construction more useful: something where we can build a query and understand how it’s going to map to underlying Lucene primitives. This is why the OSC relevance team is excited to announce the match query parser. Increasingly, this query parser is how we solve multiterm synonym problems instead of hon-lucene-synonyms. By itself it’s not the solution; it’s the Lego block for building your own solution.

The match query parser gives the following options:

A fieldType you’ve created in your schema to use for query-time analysis
A single field to search
How to search that field with the tokens resulting from the analysis

It’s inspired by Elasticsearch’s match query. Instead of a single query parser that attempts to solve every use case, you’re expected to compose multiple configurable match queries together to solve your particular multiterm synonym problem. The match query parser also allows you to create query-time-only analyzers (via fieldTypes). It removes the limitation of having only one query-time method for analyzing your users’ queries. As a consequence, you can index text with a standard analyzer, then tell Solr at query time which analyzer to apply to the query string. This has a lot of benefits for indices where you don’t want to, for example, create multiple versions of your text just for different query-time synonym/analysis configurations.

Removing this limitation gives you a TON of power. It’s our new favorite way of solving multiterm synonyms in a way that lets you manage complexity, relevance, and performance. The downside is that you are responsible for making more of your own decisions. Let’s see it in action:

Match Query Parser in Action

Let’s take the match query parser out for a spin. First let’s tackle the sea biscuit problem head on to see what kind of query it generates.

Update, please note: when we implemented the Match Query Parser, we included it in a repository of TMDB (The Movie Database) movies so we could play with it. Because the Match Query Parser repository has since been archived, the Match Query Parser is no longer part of that repository. You can still use the following configuration to set up a Solr installation that makes use of this query parser.

In the schema, you’ll note the fields I’ve created using the stock, index-time “text_general” analyzer that Solr ships with:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="overview" type="text_general" indexed="true" stored="true" />
<field name="title" type="text_general" indexed="true" stored="true" />
<field name="tagline" type="text_general" indexed="true" stored="true" />
<field name="cast" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="directors" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="genres" type="text_general" indexed="true" stored="true" multiValued="true" />

However, I’ve also defined a few field types that only have “query” analyzers defined. For example:

<fieldType name="text_title_phrases" class="solr.TextField" positionIncrementGap="100" multiValued="true">
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="title_phrases.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="(_)" replacement=" " replace="all"/>
   </analyzer>
</fieldType>

First, you’ll notice that this fieldType exists solely to be used at query time, much like Elasticsearch, where analyzers are often created just to be bound to some query.

There’s nothing that surprising in the analyzer above. The first two steps do a simple tokenization, followed by lowercasing. The synonym step is what’s interesting. Here we create a synonym file intended to transform common key phrases from user searches into a synonymous form. For example, sea biscuit or perhaps huckleberry finn:

sea biscuit => seabiscuit,sea_biscuit
seabiscuit => seabiscuit,sea_biscuit
huckle berry => huckleberry,huckle_berry
huckleberry => huckleberry,huckle_berry

This particular synonyms file is targeted at the key phrases you would expect to find in movie titles. It’s a synonym file only intended to be applied to the title field. If you recall from our earlier articles about semantic search with keyphrases and taxonomies, actual query logs are a better source for synonyms than a generic thesaurus. Here perhaps we’ve seen (or our code has seen) how users tend to search for seabiscuit, and we’ve generated the synonyms file above.

The final step in the analyzer converts underscores in the generated tokens into spaces. This is just a bit of tidying after the synonym step, so that we end up with individual tokens that contain multiple words. Farther down, you’ll see how a configuration option in the match query parser lets you treat these as phrases.
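The four analysis steps can be sketched in Python to show what the token stream looks like when it comes out the other end. This is only an approximation under our own assumptions: the regex is a crude stand-in for Solr’s standard tokenizer, the function name is hypothetical, and the rules are copied from the synonym file above.

```python
import re

# Rules mirror the synonym file above; keys are lowercased token runs.
TITLE_PHRASES = {
    ("sea", "biscuit"): ["seabiscuit", "sea_biscuit"],
    ("seabiscuit",): ["seabiscuit", "sea_biscuit"],
    ("huckle", "berry"): ["huckleberry", "huckle_berry"],
    ("huckleberry",): ["huckleberry", "huckle_berry"],
}


def analyze_title_phrases(query):
    """Return one list of alternative tokens per position in the stream."""
    # Steps 1 and 2: crude stand-in for tokenizing and lowercasing.
    tokens = re.findall(r"\w+", query.lower())
    positions, i = [], 0
    max_len = max(len(key) for key in TITLE_PHRASES)
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            key = tuple(tokens[i:i + n])
            if len(key) == n and key in TITLE_PHRASES:
                # Step 3: every synonym for the matched run shares one position.
                alternatives = TITLE_PHRASES[key]
                i += n
                break
        else:
            alternatives = [tokens[i]]
            i += 1
        # Step 4: turn underscores into spaces, yielding multiword tokens.
        positions.append([alt.replace("_", " ") for alt in alternatives])
    return positions


print(analyze_title_phrases("party with sea biscuit"))
```

Each inner list is one position in the token stream. A position with more than one entry holds alternatives that the user’s text expanded into.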

So the ingredients are in place. We’ve got fields to search. We’ve got an analyzer to use on the user’s query. Let’s get to work.

Let’s say our user searches for “party with sea biscuit.” What does it look like to search the title field after applying the analyzer above? Well, using the match query parser means simply specifying it in localparams, like so:

q={!match qf=title analyze_as=text_title_phrases search_with=phrase mm=1}party with sea biscuit

Pretty straightforward (if you know Solr localparams). This means:

Use the match query parser
Search the title field (qf)
Analyze the user’s query as fieldType text_title_phrases (analyze_as)
Treat the resulting multiword tokens as phrase searches, so sea biscuit becomes "sea biscuit" (search_with)
Expect at least one of the user’s search terms to match (mm=1)
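If you assemble this request from application code, a small helper keeps the localparams readable. The sketch below is an assumption-laden illustration: the helper names and the base URL (a core called tmdb on localhost) are hypothetical, while the qf, analyze_as, search_with, and mm parameters are the ones listed above. It uses only Python’s standard library to URL-encode the query.

```python
from urllib.parse import urlencode


def match_local_params(field, analyze_as, search_with="phrase", mm=1):
    """Render the {!match ...} localparams prefix for a single field."""
    return (
        f"{{!match qf={field} analyze_as={analyze_as} "
        f"search_with={search_with} mm={mm}}}"
    )


def build_select_url(base_url, user_query, field, analyze_as):
    """Build a /select URL; debugQuery=true asks Solr to show the parsed query."""
    q = match_local_params(field, analyze_as) + user_query
    return f"{base_url}/select?" + urlencode({"q": q, "debugQuery": "true"})


# Hypothetical host and core name, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr/tmdb",
    "party with sea biscuit",
    field="title",
    analyze_as="text_title_phrases",
)
print(url)
```

Turning on debugQuery is handy while experimenting, because the parsed query in the response shows exactly which Lucene query the parser produced.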

And what does this generate? A fairly sane query:

(DisjunctionMaxQuery((title:"party")) DisjunctionMaxQuery((title:"with")) 
DisjunctionMaxQuery((title:"seabiscuit" | title:"sea biscuit")))~1

The query is analyzed. The token in the first position, all by its lonesome, is just “party.” Next is “with.” Finally, we have a dismax (pick the best match) over either seabiscuit or “sea biscuit.” Intuitively, these two go together in the spot where the user typed “sea biscuit,” so they’re treated as a single unit with possible alternatives.

In other words, for each position in the resulting token stream, score using the highest matching token (DisjunctionMaxQuery). Expect at least one “position” in the original token stream to match (mm).
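That position-by-position rule can be sketched in a few lines of Python. This is our own illustration of the shape of the output, not the parser’s actual implementation. The function name is hypothetical, and it simply renders a debug-style string from a list of positions, where each position is a list of alternative tokens.

```python
def build_match_query(field, positions, mm=1):
    """Render a debug-style string: one dismax per position, mm over positions."""
    clauses = []
    for alternatives in positions:
        # Each alternative at a position becomes a quoted (phrase) clause.
        terms = " | ".join(f'{field}:"{alt}"' for alt in alternatives)
        clauses.append(f"DisjunctionMaxQuery(({terms}))")
    return f"({' '.join(clauses)})~{mm}"


# Positions as they might come out of query-time analysis.
positions = [["party"], ["with"], ["seabiscuit", "sea biscuit"]]
print(build_match_query("title", positions))
```

Alternatives inside a position compete with each other, and only positions count toward mm.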

Ok, so this seems fairly straightforward. But what about searching another field? Seabiscuit is also a cast member, right? Well that’s easy peasy. We just need to use multiple queries. This can be accomplished several ways (perhaps as a bq on a base edismax query). However here, we’ll use the magic of nested queries to just issue two local params queries:

q=_query_:"{!match qf=title analyze_as=text_title_phrases search_with=phrase mm=1 v=$userQuery}" 
   OR _query_:"{!match qf=cast analyze_as=text_cast_phrases search_with=phrase mm=1 v=$userQuery}"
&userQuery=party with sea biscuit

Notice above how two different analyzers are used at search time for two different fields. With this query parser, we could search a single field with as many different analyzers as we wanted. Or we could search multiple fields with the same analyzer if we liked.
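When the list of fields grows, it can be easier to generate the nested query than to type it. The following Python sketch assumes a hypothetical helper name and a mapping from field to query-time fieldType. The text_cast_phrases name comes from the query above. The user’s text travels in its own userQuery parameter and is dereferenced with v=$userQuery, so it never has to be escaped inside the quoted nested queries.

```python
from urllib.parse import urlencode

# Which query-time fieldType to use for each field.
FIELD_ANALYZERS = {"title": "text_title_phrases", "cast": "text_cast_phrases"}


def nested_match_params(user_query, field_analyzers, mm=1):
    """Build request params that OR together one match query per field."""
    clauses = [
        f'_query_:"{{!match qf={field} analyze_as={analyzer} '
        f'search_with=phrase mm={mm} v=$userQuery}}"'
        for field, analyzer in field_analyzers.items()
    ]
    # The raw user text goes in its own parameter, referenced by v=$userQuery.
    return {"q": " OR ".join(clauses), "userQuery": user_query}


params = nested_match_params("party with sea biscuit", FIELD_ANALYZERS)
print(params["q"])
print(urlencode(params))
```

Adding a field, or searching the same field with a second analyzer, is then a matter of adding another clause.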

And how does the query above look? Well the query above starts to look a bit fugly. But we prefer it fugly, as we want careful, low-level control over how we’re searching. Here’s what the debug query looks like:

(DisjunctionMaxQuery((title:"party")) 
 DisjunctionMaxQuery((title:"with")) 
 DisjunctionMaxQuery((title:"seabiscuit" | title:"sea biscuit")))~1 
(DisjunctionMaxQuery((cast:"party")) 
 DisjunctionMaxQuery((cast:"with")) 
 DisjunctionMaxQuery((cast:"seabiscuit" | cast:"sea biscuit")))~1

What’s cool is the extremely pluggable nature of how the query can be sliced and diced using custom analyzers. Want to search the cast field only with actual actors’ names? Add a keep-words list to the end of the analysis chain to filter out anything not in your list of names. Want to expand to broader concepts to increase recall, expanding “sea biscuit” to “horse” to show users horses as well? Go for it: the world is your oyster with a bit more synonym expansion.
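As a rough illustration of the keep-words idea, here is a Python sketch of what such a filter does to the token stream. It is a simulation with made-up data, not Solr configuration. The names list and the function name are hypothetical, and positions are modeled as lists of alternative tokens.

```python
# Hypothetical list of known cast names, lowercased to match the analysis chain.
CAST_NAMES = {"seabiscuit", "sea biscuit", "tobey maguire"}


def keep_known_names(positions, names):
    """Drop every token that is not a known name; drop positions left empty."""
    kept = []
    for alternatives in positions:
        survivors = [alt for alt in alternatives if alt in names]
        if survivors:
            kept.append(survivors)
    return kept


positions = [["party"], ["with"], ["seabiscuit", "sea biscuit"]]
print(keep_known_names(positions, CAST_NAMES))
```

With “party” and “with” filtered out, only the name would be left to search against the cast field.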

Indeed, you should be able to apply the lessons from our series of articles on building semantic search using keyphrases and taxonomies using Solr, and not that other search engine, to great effect!

With Great Power Comes Great Responsibility

One of the underlying reasons multiterm synonyms aren’t just “solved” is that everyone’s search problem is actually pretty different. Relevant Search gives a specific reason why this can be difficult and problem-specific (check out the section on field-centric vs term-centric search).

What’s more important is to give you the tools to solve your specific problem. That’s what we’ve tried to do with the Match Query Parser. It’s the Swiss Army knife, composable approach that we tend to find more practical than the “black box” approach that hon-lucene-synonyms gives you.

I hope you find it useful.

If you would like to talk about how our team, which builds exciting things like the Match Query Parser, can help you solve your specific, gnarly search relevance, recommendations, or other matching problem, please do get in touch.