Solr Multiterm Synonyms: avoiding sow=false surprises

Many Solr teams have been stumped by the sow=false behavior introduced in Solr 7. To review, sow=false (split-on-whitespace=false) changes how the query is parsed. With sow=false the query is parsed using each field’s analyzer/settings, NOT assuming whitespace corresponds to an OR query. This (mostly) resolves the long-standing query-time multi-term synonym issues that Solr has been plagued with. But it’s hard to create a universal solution to this problem, and the plethora of Solr field settings can create interesting behaviors.

The main surprise we see is that the following Solr query can produce different query structures depending on a number of factors:

qf=title description&defType=edismax&q=blue shoes&sow=false

Much of the time, we see this query produce:

(title:blue | description:blue) (title:shoes | description:shoes)

This query, as mentioned in Relevant Search, ch. 6, is term-centric. It picks the highest relevance score per search term (that’s the | operator), then adds them together. The advantage of term-centric search is that it biases results towards documents that have more of the user’s search terms.

However, as many have noticed, the query can unexpectedly flip to field-centric behavior (the same as Elasticsearch best fields). That is, the structure flips to:

(title:blue title:shoes) | (description:blue description:shoes)

Here the query chooses the best field that matches the user’s query, not a document that matches both search terms.
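
To make the difference concrete, here is a minimal sketch that scores one imaginary document both ways. The per-field scores are made-up numbers, and the sketch assumes a dismax tie breaker of 0, so | means a plain max. It is a toy model of the two query shapes, not Solr’s scoring code.

```python
# Made-up per-field, per-term scores for a single document.
scores = {
    "title": {"blue": 2.0, "shoes": 0.0},
    "description": {"blue": 0.5, "shoes": 1.5},
}
terms = ["blue", "shoes"]

def term_centric(scores, terms):
    # (title:blue | description:blue) (title:shoes | description:shoes)
    # Best field per term, then sum across terms.
    return sum(max(field[t] for field in scores.values()) for t in terms)

def field_centric(scores, terms):
    # (title:blue title:shoes) | (description:blue description:shoes)
    # Sum the terms within each field, then keep the best field.
    return max(sum(field[t] for t in terms) for field in scores.values())

print(term_centric(scores, terms))   # 3.5
print(field_centric(scores, terms))  # 2.0
```

In this toy case the term-centric form rewards the document for matching both terms across fields, while the field-centric form only credits what a single field contributes.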

Why does this happen? And in what cases?

After some debugging of Solr, we found that sow=false follows this algorithm:

  1. Build field centric queries for each field passed into qf (title, description above) based on each field’s analysis & settings
  2. Attempt to weave together each field-centric query into a single term-centric query

The “weaving together” is the tough part. Solr evaluates whether to give up on creating a term-centric query only after each field generates a query. This means the specific Solr field configuration (analyzers, settings, etc.) isn’t examined up front, as Elasticsearch does in similar contexts, to decide whether to allow or disallow term-centric behavior. The downside to Solr’s approach is that two fields, configured differently, will sometimes create consistent query structures and sometimes inconsistent ones. So the flip between term-centric and field-centric can occur in surprising ways and depend on query-time factors.

If you’re curious, the logic used to decide whether edismax gives up on the weaving is in this function. You’ll notice this function is rife with escape clauses, where an inconsistent structure across the two queries is detected, and edismax gives up on the term-centric query.
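
As a rough mental model of the two steps above, the sketch below weaves per-field clause lists into a term-centric string and falls back to a field-centric one when the fields disagree. The weave function and its single “same number of clauses” check are our own simplification for illustration. The real edismax logic inspects more than clause counts.

```python
def weave(field_clauses):
    """field_clauses maps a field name to the list of clauses its analysis produced."""
    counts = {len(clauses) for clauses in field_clauses.values()}
    if len(counts) != 1:
        # Structures disagree: give up and keep one query per field (field-centric).
        per_field = ["(" + " ".join(clauses) + ")" for clauses in field_clauses.values()]
        return " | ".join(per_field)
    # Structures agree: group the i-th clause of every field into one dismax clause.
    columns = zip(*field_clauses.values())
    return " ".join("(" + " | ".join(column) + ")" for column in columns)

print(weave({
    "title": ["title:blue", "title:shoes"],
    "description": ["description:blue", "description:shoes"],
}))
# (title:blue | description:blue) (title:shoes | description:shoes)
```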

Stopwords, and other settings creating surprising structure flips

As we mentioned, the weaving happens based on each field’s generated query, so you can get different results depending on what the user searches for. For example, we’ve noted that the inclusion of a stopword by the user can shift the query to being field-centric. This can happen because the term is a stopword in one field, but not in another. So if “the” is a stopword in title, but not in description, a user adding “the” to their search can suddenly produce a surprising query structure.
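
One way to reason about this before touching Solr is to compare how many tokens each field’s analysis would emit for a given query. The sketch below uses a stand-in analyzer (lowercase, whitespace split, stopword removal) and hypothetical stopword lists. A real analysis chain does more, so treat this as an approximation.

```python
# Hypothetical per-field stopword lists: "the" is removed in title only.
STOPWORDS = {
    "title": {"the"},
    "description": set(),
}

def analyze(field, query):
    # Stand-in for a real analysis chain: lowercase, split on whitespace, drop stopwords.
    return [t for t in query.lower().split() if t not in STOPWORDS[field]]

def same_structure(query, fields):
    # True when every field emits the same number of tokens for this query.
    return len({len(analyze(f, query)) for f in fields}) == 1

print(same_structure("blue shoes", ["title", "description"]))      # True
print(same_structure("the blue shoes", ["title", "description"]))  # False
```

When the check returns False, the fields no longer line up clause for clause, which is the kind of mismatch that can push edismax into a field-centric query.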

How do these query structures play with multi-term synonyms?

The query setting sow=false is only one half of the recent solution to Solr multi-term synonyms. The other half is the per-field setting autoGeneratePhraseQueries, which instructs the query parser to turn multi-term synonyms (scifi turning into science fiction) into a phrase query (like “science fiction”). The setting is placed in the schema on the fieldType:

<fieldType name="text_general_multiterm_index_syn" class="solr.TextField" 
   autoGeneratePhraseQueries="true" enableGraphQueries="true"
   positionIncrementGap="100" multiValued="true">

As noted in chapter 6 of Relevant Search, term-centric search can respect multi-term synonyms. As long as the query is analyzed consistently across fields, you can get a term-centric search with multi-term synonyms. For example, in our Search Relevance Training, we highlight this structure:

Query:

q=best sci fi movie&qf=title overview&defType=edismax

with a query-time “sci fi” synonym entry:

science fiction,sci fi,scifi,sci-fi

produces

(title:best | overview:best)((overview:"science fiction" overview:sci-fi overview:scifi overview:"sci fi") | (title:"science fiction" title:sci-fi title:scifi title:"sci fi")) (overview:movie | title:movie)

Notice this is still term-centric. Each dismax clause still picks the best field match per search “term”. Here a “search term” is the outermost synonym phrase seen in the query. The structure between title and overview searches still matches.

If one of the fields has a different query-time analysis, autoGeneratePhraseQueries setting, synonyms, stopwords, or any other structure-changing difference, the result could become field-centric.

Multi-term synonyms, there’s still not a silver bullet!

It’s not hard to imagine cases where a field has much more complex and overlapping synonyms than what’s been presented here, or where toggling different settings creates new, weird, and interesting query structures. (Indeed, there are more settings than what’s been described here!) While sow=false along with the auto-phrase query synonym settings generally creates a saner query, there’s no better way than considerable trial and error to see how your syntax, field settings, analysis chain, query fields, and user queries transform into Lucene syntax using edismax.
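
For that trial and error, it helps to script the loop of “send a query, read the parsed query back”. The sketch below builds a debug request URL and pulls the parsed query string out of a JSON response. The base URL and core name are placeholders, and the helper names are ours. The sketch assumes the response carries the parsed query under debug, in parsedquery_toString.

```python
from urllib.parse import urlencode

def debug_url(base_url, q, qf):
    # Build an edismax request that asks Solr to include query debug output.
    params = {
        "q": q,
        "qf": qf,
        "defType": "edismax",
        "sow": "false",
        "debug": "query",
        "rows": 0,       # we only care about the parsed query, not the hits
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)

def parsed_query(response):
    # response is the decoded JSON body of the Solr response.
    return response["debug"]["parsedquery_toString"]

# Placeholder host and core name: adjust to your own install.
print(debug_url("http://localhost:8983/solr/mycore/select", "blue shoes", "title description"))
```

Running a list of real user queries through this and diffing the parsed output before and after a schema change is a cheap way to catch term-centric to field-centric flips.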

What we recommend

My motto with this stuff is “Keep It Simple Stupid”. I like the term-centric syntax that comes with autoGeneratePhraseQueries and sow=false. In our experience it provides pretty good default search results. But you must be certain that all the fields being searched are consistently analyzed, with the same stopwords, synonyms, and query settings to avoid sudden field-centric surprises.

But sometimes you want a specialized per-field query behavior. For example, you know a certain tag field should have its own unique synonyms. In these scenarios, some specialized query/index analysis has occurred to target a specific user intent. For these cases, we recommend using boost queries (bq) to apply additional per-field boosts as needed. To other Solr relevance engineers, a bq says semantically, in the query DSL, that this query is “attached” to and distinct from the main query. We also recommend getting comfortable with Solr local params syntax to really fine-tune the matching logic for those boosts. One of hossman’s many talks on Solr query syntax can be very helpful!
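
As one possible shape for this, the sketch below keeps the main query on consistently analyzed fields and attaches a boost query against a hypothetical tags field using local params. The field names and the user_q parameter name are our own placeholders. The bq uses the field query parser so the text runs through the tags field’s own analysis, which is just one of several reasonable choices.

```python
from urllib.parse import urlencode

def boosted_params(user_query):
    return {
        "defType": "edismax",
        "sow": "false",
        # Main query: only fields that share the same analysis and settings.
        "q": user_query,
        "qf": "title description",
        # user_q is an arbitrary parameter name holding the raw user text.
        "user_q": user_query,
        # Boost query: local params pick the parser and field, v=$user_q pulls in the text.
        "bq": "{!field f=tags v=$user_q}",
    }

query_string = urlencode(boosted_params("blue shoes"))
print(query_string)
```

Keeping the tag logic in bq means the specially analyzed field never takes part in the weaving of the main query, so it cannot flip the main query to field-centric.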

Keep in mind the use cases, data volume, data complexity, languages, and synonym (taxonomy?) issues can be quite different per application & user. So while this is a good pattern, there’s still no “one size fits all.” Many of us keep touting so-called cognitive search, but in reality before that’s even possible we’ve got to get the dang synonyms to work!

Shameless plug for search relevance training!

Did I mention we do Solr/Elasticsearch relevance training? We cover topics ranging from organizational models, semantic search, synonyms, and learning to rank 🙂 Get in touch if your team could use training from the team that wrote the book on search relevance. Or if you just want to chat about other Solr/Relevance topics!