Solr Synonyms and Taxonomies: Mea Culpa

November 21, 2017 Doug Turnbull
Category: Solr

I screwed up! Mea Culpa. Let he who is without synonyms throw the first rock!

In my talk Taxonomical Semantical Magical Search, I presented an index-time hypernym/hyponym expansion solution. It’s one I’ve blogged about for Elasticsearch.

Basically, in this solution, we use synonyms to perform a semantic expansion at index time to broader concepts. For example, we perform the following:

tabby => tabby, cat, feline, animal

This has the effect of inflating the document frequency of animal, given its low specificity, while keeping the document frequency of tabby lower. This plays to the strength of TF*IDF-based similarity: specific, rare terms score higher when they match than common ones.
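To make the document frequency effect concrete, here is a minimal Python sketch of index-time expansion. The rules for siamese and beagle and the four tiny documents are made up for illustration; only the tabby rule comes from the example above.

```python
from collections import Counter

# Hypothetical index-time rules: each term expands to itself plus its broader concepts.
EXPANSIONS = {
    "tabby": ["tabby", "cat", "feline", "animal"],
    "siamese": ["siamese", "cat", "feline", "animal"],
    "beagle": ["beagle", "dog", "canine", "animal"],
}

def expand(tokens):
    """Apply the expansion rules to one document's tokens, roughly as an index-time synonym filter would."""
    out = []
    for tok in tokens:
        out.extend(EXPANSIONS.get(tok, [tok]))
    return out

def document_frequencies(docs):
    """Count how many documents contain each indexed term."""
    df = Counter()
    for tokens in docs:
        df.update(set(expand(tokens)))  # set(): count each term once per document
    return df

docs = [["tabby"], ["siamese"], ["beagle"], ["tabby", "kitten"]]
df = document_frequencies(docs)
print(df["tabby"], df["cat"], df["animal"])  # 2 3 4
```

The broader the concept, the more documents it lands in, which is exactly what lowers its weight at scoring time.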

At query time, if a user searches for tabby, historically this solution has worked well because Solr/Lucene would translate synonym expansions into OR queries. In other words, the query, expanded with the synonym rule above, is turned into

tabby OR cat OR feline OR animal

Because tabby has a low document frequency, and therefore high term specificity, matches on it score higher. Matches on animal score lower. The result is that results are sorted by semantic proximity to the user’s search.
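A rough sketch of why the OR query sorts by semantic proximity, in Python. The IDF formula is a classic-style one chosen for illustration, term frequency is fixed at 1, and the collection size and document frequencies are assumptions, so treat the numbers as a toy rather than what Lucene computes.

```python
import math

def idf(doc_freq, num_docs):
    """A classic-style inverse document frequency: rarer terms get larger values."""
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))

def or_query_score(query_terms, doc_terms, doc_freqs, num_docs):
    """Sum the IDF of every query term the document contains (tf fixed at 1)."""
    return sum(idf(doc_freqs[t], num_docs) for t in query_terms if t in doc_terms)

NUM_DOCS = 10_000  # assumed collection size
DOC_FREQS = {"tabby": 20, "cat": 200, "feline": 400, "animal": 2000}  # illustrative
QUERY = ["tabby", "cat", "feline", "animal"]  # tabby OR cat OR feline OR animal

# Indexed terms of three hypothetical documents after index-time expansion.
DOCS = {
    "tabby_doc": {"tabby", "cat", "feline", "animal"},
    "cat_doc": {"cat", "feline", "animal"},
    "dog_doc": {"dog", "canine", "animal"},
}

scores = {name: or_query_score(QUERY, terms, DOC_FREQS, NUM_DOCS) for name, terms in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['tabby_doc', 'cat_doc', 'dog_doc']
```

The tabby document wins because it alone picks up the rare, high-IDF term; the dog document only matches on animal and lands at the bottom.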

But as of Solr 6, synonym expansions are no longer turned into an OR query, which means the behavior advertised in my talk is not accurate.

Instead of an OR query, they’re turned into a SynonymQuery.

For most people using true synonyms, SynonymQuery does something very useful. It takes all the synonyms at query time and treats them as if they’re the same term with the same document frequency. If car appears in 20 documents and automobile in 40, you’re OK treating them as the same term and searching with a document frequency of 40.

However, when using the synonym filter for hypernym/hyponym expansion, if animal has a document frequency of 2000 and tabby a document frequency of 20, we use 2000 as the document frequency for both. Our term specificity is washed away.
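The difference can be sketched in a few lines of Python. This is a simplification under stated assumptions: it models only the document frequency blending described above (one shared value, the largest), uses an illustrative classic-style IDF formula, and assumes a collection of 10,000 documents. It is not Lucene's actual scoring code.

```python
import math

def idf(doc_freq, num_docs):
    """A classic-style inverse document frequency: rarer terms get larger values."""
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))

def per_term_idfs(doc_freqs, num_docs):
    """OR-query behavior: every term keeps its own IDF."""
    return {term: idf(df, num_docs) for term, df in doc_freqs.items()}

def blended_idf(doc_freqs, num_docs):
    """SynonymQuery-style behavior: one shared IDF, taken from the largest document frequency."""
    return idf(max(doc_freqs.values()), num_docs)

NUM_DOCS = 10_000  # assumed collection size
DOC_FREQS = {"tabby": 20, "animal": 2000}  # the figures used in the text

separate = per_term_idfs(DOC_FREQS, NUM_DOCS)
shared = blended_idf(DOC_FREQS, NUM_DOCS)

print(separate)  # tabby gets a much larger IDF than animal
print(shared)    # one value for both terms, equal to animal's IDF
```

With separate IDFs, a match on tabby is worth far more than a match on animal. With the blended value, both are worth only what animal is worth.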

I have created this patch to allow one to choose how to handle overlapping query terms (with SynonymQuery staying as the default). Your feedback is very welcome. I’d love to get this in so that the generated query is configurable, because in my experience a classic mistake with synonyms is thinking two things are synonyms when in reality they have a hypernym/hyponym relationship.

The Workarounds

There are some workarounds. In all of them, we leave index time semantic synonym expansion alone. The term tabby still expands to tabby, cat, feline, and animal.

The first is to use the match query parser when boosting on synonyms. This is one reason I didn’t notice this issue, as I tend to use match qp (or other custom qparsers) when using synonyms with taxonomies. Only when colleagues who were unable to use plugins began using this technique with edismax did I facepalm.

Indeed, not everyone can deal with plugins. The other solution is not pretty: it involves creating a bq for the parent synonyms, another bq for the grandparent synonyms, and so on. This effectively recreates the original OR query, albeit with lots of cruft and potentially duplicate field types.

For example, these two boost queries:

bq={!edismax qf=text_parent v=$q}&bq={!edismax qf=text_grandparent v=$q}

Where text_parent would have a synonym file that looked like:

tabby => cat
cat => feline
feline => animal

And text_grandparent would skip a generation:

tabby => feline
cat => animal

Here the parent query turns into a SynonymQuery one level up from the search term, broadening the least and bringing in only parents and direct siblings. The grandparent query broadens more extensively, bringing in a larger set of related concepts, but those matches score lower than parent matches due to the much higher document frequency.
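If you assemble the request in code, the pattern might look like the following Python sketch. The field names text_parent and text_grandparent come from the example above; the helper name and the main qf field "text" are hypothetical placeholders.

```python
from urllib.parse import urlencode

def build_params(user_query, level_fields=("text_parent", "text_grandparent")):
    """Build edismax request parameters with one boost query per taxonomy level."""
    params = [
        ("q", user_query),
        ("defType", "edismax"),
        ("qf", "text"),  # hypothetical main search field
    ]
    for field in level_fields:
        # v=$q re-runs the user's query against the field for this level.
        params.append(("bq", f"{{!edismax qf={field} v=$q}}"))
    return params

params = build_params("tabby")
query_string = urlencode(params)  # percent-encodes the braces, "!" and "$"
print(query_string)
```

Dropping a field from level_fields is then how you would stop broadening at parents and hold grandparents back.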

This actually has advantages over the original solution, as you can fine-tune scoring to decide how much to broaden the query. Perhaps you’d only like to go as far as parents, but want to prompt the user before expanding to grandparents?

The huge downside is maintenance and duplication of fields. A Match Query Parser or Elasticsearch solution would avoid the duplication by allowing a different query analyzer to be used for a field. And as these synonym files are often generated programmatically from a taxonomy, the maintenance may not matter as much.
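As a sketch of what that generation could look like, here is a small Python example that derives both synonym files from a single taxonomy. The child-to-parent dictionary and the function names are assumptions for illustration; a real taxonomy would likely come from a file or database.

```python
# Hypothetical taxonomy expressed as child -> parent links.
PARENT = {"tabby": "cat", "cat": "feline", "feline": "animal"}

def ancestor(term, levels, parent=PARENT):
    """Walk `levels` steps up the taxonomy; return None if we run off the top."""
    for _ in range(levels):
        term = parent.get(term)
        if term is None:
            return None
    return term

def synonym_rules(levels, parent=PARENT):
    """Emit 'term => ancestor' lines for every term with an ancestor `levels` up."""
    rules = []
    for term in parent:
        target = ancestor(term, levels, parent)
        if target is not None:
            rules.append(f"{term} => {target}")
    return rules

parent_rules = synonym_rules(1)       # contents for the text_parent synonym file
grandparent_rules = synonym_rules(2)  # contents for the text_grandparent synonym file

print("\n".join(parent_rules))
print("\n".join(grandparent_rules))
```

Both files fall out of the same source of truth, so adding a level means adding one call rather than hand-editing another file.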

Synonyms are hard because you probably really have a taxonomy!

Synonyms are one of the hardest things to deal with in search. In a recent internal knowledge sharing call, our one-hour call about synonyms turned into a two-hour call. Then we tacked on another hour to expand on the original two hours. I’m probably going to schedule yet another hour just to talk about this blog post and my screw-up!

One big problem is that people assume they have synonyms, when often they really have a taxonomy! Jeans aren’t really exactly the same as pants, they’re a type of pants. If you decide that jeans truly are the same as pants (i.e. jeans,pants) then you’ll have to explain to customers why blue jeans come to the top when they search for “khakis”. The reason: somewhere farther down in the synonym file there’s a khakis,pants line!
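The leak is easy to reproduce in a toy model. This Python sketch assumes comma-separated lines are treated as equivalent terms, and that a term appearing on several lines picks up the terms from all of them; it is a simplification, not the synonym filter itself.

```python
from collections import defaultdict

SYNONYM_LINES = ["jeans,pants", "khakis,pants"]

def build_expansions(lines):
    """Map each term to every term it shares a line with, merging terms repeated across lines."""
    expansions = defaultdict(set)
    for line in lines:
        group = [term.strip() for term in line.split(",")]
        for term in group:
            expansions[term].update(group)
    return expansions

def expand(tokens, expansions):
    """Replace each token with itself plus all of its 'synonyms'."""
    out = set()
    for tok in tokens:
        out |= expansions.get(tok, {tok})
    return out

expansions = build_expansions(SYNONYM_LINES)
indexed = expand(["blue", "jeans"], expansions)  # terms indexed for a "blue jeans" product
query = expand(["khakis"], expansions)           # terms searched for "khakis"
matched_on = indexed & query

print(sorted(indexed))     # ['blue', 'jeans', 'pants']
print(sorted(query))       # ['khakis', 'pants']
print(sorted(matched_on))  # ['pants']
```

Neither product mentions the other's word; they meet on pants, the shared parent that was filed as a synonym.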

In reality khakis and jeans are hyponyms of pants; pants is their hypernym. This is a different relationship, requiring different assumptions. Sussing out these relationships is often the interesting and challenging work in search, especially when searchers and content creators rarely use the same words for items.

Well that’s it for now! Perhaps you have some ideas, questions, or feedback you’d like to share with us? We’d love to hear from you – we love empowering other search teams to be their best!