Patterns for Elasticsearch Synonyms: Taxonomies and Managed Vocabularies

December 23, 2016 Doug Turnbull
Category: Solr

Last time on the Young and the Synonomous we discussed how users often think of key phrases like “heart attack” or “cardiac arrest” as single terms. We demonstrated how to implement keyphrases using Elasticsearch synonyms. In this article, I want to extend this discussion to show you how to build semantic search using curated taxonomies and managed vocabularies.

I want your takeaway to be to stop managing synonyms, and start curating taxonomies and managed vocabularies! You’ll see that for narrow, relatively static domains, categorizing user queries and key phrases into a taxonomy can lead to search that anticipates users’ intent in ways most assume can only be done with expensive machine learning techniques.

First let’s define what we mean by these terms:

What’s a Taxonomy?

Taxonomies are hierarchical classifications of things. A well-known taxonomy is the Kingdom, Phylum, Class, Order, … taxonomic rank that we use to organize living things. So an Elephant is in Kingdom Animalia, Phylum Chordata, Class Mammalia and so on. Or, as I like to think of taxonomies a bit more informally, like a directory structure. File Elephant under:

Animalia\Chordata\Mammalia\Afrotheria\Proboscidea\Elephantidae

Then when you get to the “Elephant” level you suddenly become conscious of the fact that there are many kinds of elephants. African Elephants are actually different from Asian Elephants. So if we list the taxonomy for an African Bush Elephant, we get:

Animalia\Chordata\Mammalia\Afrotheria\Proboscidea\Elephantidae\Loxodonta\Loxodonta africana

You’ll notice that depending on who you are, you come at the idea of “elephant” with your own level of specificity – and that level is mapped directly onto an entity in the taxonomy.

What’s a managed vocabulary?

I see a managed vocabulary (sometimes people say controlled vocabulary) as an extension of a taxonomy that also accounts for synonyms. One prominent example of a managed vocabulary is MeSH. MeSH categorizes diseases, drugs, and other entities in medicine. The entry for headache gives you both taxonomic classification(s) and alternative names, in this case phrases like “cranial pain”, “head pain”, and “cephalalgia.”

How do these synonyms fit into taxonomies? To revisit our elephant example, we might note alternate names for African bush elephant alongside or under the scientific name, perhaps incorporating them as their own taxonomical entries:

Animalia\Chordata\Mammalia\Afrotheria\Proboscidea\Elephantidae\Loxodonta\Loxodonta africana\African Bush Elephant
Animalia\Chordata\Mammalia\Afrotheria\Proboscidea\Elephantidae\Loxodonta\Loxodonta africana\Elephant, Bush

What do taxonomies and managed vocabularies have to do with search?

A common refrain from my search relevance clients is “I want to get away from ‘strict matching’ to build semantic search.” If a user types in “birkenstocks” I want to make sure that search can understand they mean “sandals” – and by “sandals” I of course mean “shoe”. Conversely, if the user types in “sandals” we want to show them “birkenstocks” even if sandal is never mentioned.

On the surface of it, this sounds like a problem of direct synonymy. This exact thing equates to this other exact thing. Pure synonyms, however, only increase recall. They make “sandals” and “birkenstocks” equivalent or “shoes” the same as “sandals.” This leads to a problematic situation where a search for “birkenstocks” brings back “tennis shoes” because, well, “shoes” are the exact same as “sandals,” which are the exact same as “birkenstocks.”

In other words, direct synonyms often increase recall but decrease precision. Users expect results that match their level of specificity (birkenstocks when searching for birkenstocks; sandals when searching for sandals). Managed vocabularies help get the value of synonyms without sacrificing precision. This is because taxonomies model hypernym/hyponym relationships. Sandal is a hypernym (a word with broader meaning) of birkenstock. Birkenstocks are hyponyms (more specific) of sandals. The search engine should get this and rank results accordingly, giving exact matches before hypernyms, and hypernyms before hyponyms.
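To make the hypernym/hyponym idea concrete, here is a minimal sketch that classifies one taxonomy entry relative to another by comparing their paths. The tiny shoe taxonomy and the function name `relation` are my own illustrative assumptions, not something from a real vocabulary.

```python
def relation(query_path, doc_path, sep="\\"):
    """Classify the document's term relative to the query's term.

    Both arguments are taxonomy paths such as "shoe\\sandal\\birkenstock".
    """
    q = query_path.split(sep)
    d = doc_path.split(sep)
    if q == d:
        return "exact"
    if len(d) < len(q) and q[:len(d)] == d:
        return "hypernym"  # the document's term is broader than the query's
    if len(q) < len(d) and d[:len(q)] == q:
        return "hyponym"   # the document's term is narrower than the query's
    return "other"         # siblings, cousins, or unrelated entries


# Hypothetical three-entry taxonomy, for illustration only.
SANDAL = "shoe\\sandal"
BIRKENSTOCK = "shoe\\sandal\\birkenstock"
TENNIS_SHOE = "shoe\\tennis_shoe"

if __name__ == "__main__":
    print(relation(BIRKENSTOCK, SANDAL))       # sandal is broader than birkenstock
    print(relation(SANDAL, BIRKENSTOCK))       # birkenstock is narrower than sandal
    print(relation(BIRKENSTOCK, TENNIS_SHOE))  # neither broader nor narrower
```

A pure synonym list would lump all three entries together; keeping the path around is what lets us tell “broader” from “narrower” from “merely related.”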

Perhaps broader than the precision/recall tradeoff is a general pattern in how users search. In my experience, users often start broad (“laptop bag”) to explore a data set. Most likely, nothing matches exactly at such a broad level of specificity, but there ARE many items that are hyponyms of “laptop bag,” including laptop backpacks, messenger bags, kid laptop bags, and so forth. This gives users an option to get the 30,000 ft view of the data, or to refine with broader/narrower terms to get more or less specific and precise with their terminology. It’s often said that a good search user experience puts users on the “scent of information,” and using managed vocabularies is a great way to do that.

Building a managed vocabulary from search behavior

Ok, so how do you implement this magic for your search?

Well step one, it’s not magic–there’s skill involved (“taxonomist” is a job title after all!). You need to generate and/or recycle a managed vocabulary. Something that captures descriptive terms that map between how content creators describe items and how your searchers describe items. For example, a common problem in legal search is mapping between lay-speak and legalese. My Dad is likely to use the phrase “dog catcher” while the law is likely to write about “animal control officers.”

In general, surveying dads is not a scalable way to create a taxonomy. Instead, an alternative is to analyze your common & problematic search terms and hunt for key phrases, as discussed in the prior article. If we notice with legal search that users search for “dog catcher,” “police officer,” “DA,” “real estate taxes,” “property taxes,” “income taxes,” we can begin to see how our users structure laws in their own words. We might decide that our users fall into two general categories (1) those that search for topics in criminal law (2) those that search with taxation questions. Within those, there appear to be finer breakdowns – all the way down to individual search terms, such that we end up with an initial managed vocabulary as follows (I’ll start using underscores like I did in the key phrases article):

criminal_law\legal\district_attorney
criminal_law\animal_enforcement\animal_police\dog_catcher
criminal_law\human_enforcement\police_officer
taxation\real_estate\property_tax
taxation\real_estate\property_tax\real_estate_tax
taxation\income\income_tax

There’s an art here. We’re deciding ourselves what the right hypernym/hyponym relationships are. Sometimes we create our own artificial categories of items that seem to go together for our users. As we’ll see in the next section, the ranking function we’re going to engineer into Elasticsearch will rank based on how close an item is in a taxonomical hierarchy, preferring hyponyms over hypernyms/siblings when there’s no exact match. So getting this right means also fine tuning precision and ranking: if a user searches for “property tax” – should the other real estate taxes come before the income tax?

We’re not quite done yet. Next, we reflect on our content to map the phrases used in the content for these entities. For example, we notice that the law talks about “district attorney” not “DA” and “animal control officer” not “dog catcher.”

criminal_law\legal\district_attorney
criminal_law\legal\district_attorney\da
criminal_law\animal_enforcement\animal_control_officer\dog_catcher
criminal_law\human_enforcement\police_officer
taxation\real_estate\property_tax
taxation\real_estate\property_tax\real_estate_tax
taxation\income\income_tax

This is very much an iterative process: get basic keyword search operational. Watch how people search. See if you can structure the key phrases in their search into taxonomies. Rinse repeat. Next up, we’ll see how to transform this into an asset that can be used to improve search relevance.

Using Managed Vocabularies in Elasticsearch

So ok, nothing you’ve seen thus far looks like anything that has anything to do with Elasticsearch. I know, sorry for all the setup. But long story short, we can take a line from our managed vocabulary:

criminal_law\animal_enforcement\animal_control_officer\dog_catcher

and turn it into this line in an Elasticsearch synonym filter:

dog_catcher => dog_catcher, animal_control_officer, animal_enforcement, criminal_law

And done….

Who the what? Ok, let’s break this down. From the previous article on keyphrases you’ll recall we talked about how to turn noun phrases into their own tokens. These key phrases represent ideas that are really best thought of as tokens in their own right: words that go together naturally. We discussed autophrasing with a synonym filter like so:

"autophrase_syn": {
    "type": "synonym",
    "synonyms": ["heart attack => heart_attack",
                 "myocardial infarction => myocardial_infarction",
                 "cardiac arrest => cardiac_arrest",
                 "acute heart attack => acute_heart_attack"]
},

The autophrasing here also applies to all the little phrases in our managed vocabulary, to generate (programmatically):

"autophrase_syn": {
    "type": "synonym",
    "synonyms": ["dog catcher => dog_catcher",
                 "animal control officer => animal_control_officer",
                   ...
                 "income tax => income_tax"
]},
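As a rough sketch of that “programmatically” step, the snippet below derives autophrase rules from vocabulary paths. The function name `autophrase_rules` and the shortened `VOCAB` list are assumptions for illustration; I’m also assuming every multi-word entry, including category names, gets a rule.

```python
# A shortened copy of the managed vocabulary from this article.
VOCAB = [
    r"criminal_law\legal\district_attorney\da",
    r"criminal_law\animal_enforcement\animal_control_officer\dog_catcher",
    r"taxation\real_estate\property_tax\real_estate_tax",
    r"taxation\income\income_tax",
]


def autophrase_rules(paths, sep="\\"):
    """Build 'some phrase => some_phrase' rules for every multi-word entry."""
    terms = set()
    for path in paths:
        terms.update(path.split(sep))
    rules = []
    for term in sorted(terms):
        if "_" in term:  # single-word entries need no autophrasing
            rules.append(f"{term.replace('_', ' ')} => {term}")
    return rules


if __name__ == "__main__":
    for rule in autophrase_rules(VOCAB):
        print(rule)
```

The resulting list can be dropped into the "synonyms" array of the autophrase_syn filter.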

Now here comes the magic. After autophrasing, the next step is to expand each leaf entry of the managed vocabulary into its hypernyms, in a synonym token filter such as:

"vocab_syn": {
    "type": "synonym",
    "synonyms": ["dog_catcher => dog_catcher, animal_control_officer, animal_enforcement, criminal_law",
                   ...
                 "da => da, district_attorney, legal, criminal_law"
]},
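These expansion rules can also be generated from the vocabulary paths. Here’s a minimal sketch; the function name `vocab_rules` is hypothetical, and I’m assuming that every non-root entry (not only the leaves) gets a rule, so that content mentioning “animal control officer” also picks up its broader categories.

```python
# A shortened copy of the managed vocabulary from this article.
VOCAB = [
    r"criminal_law\legal\district_attorney\da",
    r"criminal_law\animal_enforcement\animal_control_officer\dog_catcher",
    r"taxation\income\income_tax",
]


def vocab_rules(paths, sep="\\"):
    """Map each non-root entry to itself plus all of its hypernyms."""
    rules = {}
    for path in paths:
        parts = path.split(sep)
        for depth in range(1, len(parts)):  # depth 0 is the root; nothing to add
            term = parts[depth]
            chain = parts[:depth + 1][::-1]  # the term first, then parents up to the root
            # Assumes each term lives at one place in the hierarchy;
            # a later duplicate would overwrite an earlier rule.
            rules[term] = f"{term} => {', '.join(chain)}"
    return list(rules.values())


if __name__ == "__main__":
    for rule in vocab_rules(VOCAB):
        print(rule)
```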

For taxonomies, as we’ll see, we may also want to keep only those key phrases that we care about, using a keepwords filter. To start, this analyzer combines the two synonym filters above:

"settings": {
    "analysis": {
        "analyzer": {
            "taxonomy_text": {
                "tokenizer": "standard",
                "filter": ["autophrase_syn", "vocab_syn"]
            }
        }
    }
}
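The fragment above leaves out the filter definitions. As a sketch of how the pieces might fit into one settings body, the snippet below assembles both synonym filters and the analyzer into a single dictionary and prints it as JSON. The function name `build_settings` is hypothetical, and the rule lists are cut down to one entry each.

```python
import json


def build_settings(autophrase, vocab):
    """Assemble index settings with both synonym filters and the analyzer."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "autophrase_syn": {"type": "synonym", "synonyms": autophrase},
                    "vocab_syn": {"type": "synonym", "synonyms": vocab},
                },
                "analyzer": {
                    "taxonomy_text": {
                        "tokenizer": "standard",
                        # Order matters: phrases must become single tokens
                        # before the vocabulary expansion can match them.
                        "filter": ["autophrase_syn", "vocab_syn"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    body = build_settings(
        ["dog catcher => dog_catcher"],
        ["dog_catcher => dog_catcher, animal_control_officer, animal_enforcement, criminal_law"],
    )
    print(json.dumps(body, indent=2))
```

The printed JSON is what you would send as the body when creating the index.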

Really these two steps are all you need to get a taxonomy effect. It also so happens that this structures indexes/queries so that vanilla Elasticsearch ranking can be made to rank based on taxonomic similarity. How? Simply put, what this synonym mapping does is make the top level terms in the taxonomy very common (criminal_law, taxation). In other words, their document frequency increases tremendously. This makes them lower value terms when scoring. The more specific terms, like “dog_catcher”, will occur rarely, and matches with them will have a higher score.
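A toy calculation shows the document frequency effect. The five “documents” below are invented, each reduced to the one key phrase it mentions, and the IDF is the textbook log(N / df) rather than Elasticsearch’s exact formula, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Expansion rules in the style of vocab_syn (term => term plus hypernyms).
EXPANSIONS = {
    "dog_catcher": ["dog_catcher", "animal_control_officer", "animal_enforcement", "criminal_law"],
    "animal_control_officer": ["animal_control_officer", "animal_enforcement", "criminal_law"],
    "district_attorney": ["district_attorney", "legal", "criminal_law"],
    "police_officer": ["police_officer", "human_enforcement", "criminal_law"],
}

# Hypothetical corpus: each document is the list of key phrases it contains.
DOCS = [
    ["dog_catcher"],
    ["animal_control_officer"],
    ["animal_control_officer"],
    ["district_attorney"],
    ["police_officer"],
]


def index_tokens(doc):
    """Tokens a document ends up with after the vocabulary expansion."""
    tokens = set()
    for phrase in doc:
        tokens.update(EXPANSIONS.get(phrase, [phrase]))
    return tokens


def idf_table(docs):
    """Textbook IDF, log(N / df), for every indexed token."""
    df = Counter()
    for doc in docs:
        df.update(index_tokens(doc))
    return {term: math.log(len(docs) / count) for term, count in df.items()}


if __name__ == "__main__":
    for term, idf in sorted(idf_table(DOCS).items(), key=lambda kv: kv[1]):
        print(f"{term:24s} {idf:.3f}")
```

In this corpus criminal_law lands in every document, so its IDF bottoms out, while dog_catcher appears once and carries the most weight.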

Moreover, if you search for “dog catcher”, hypernyms of dog catcher get second billing, as users would expect. This query is expanded into a search for all four search terms in the synonym filter above: dog_catcher OR animal_control_officer OR …. If no exact matches for “dog catcher” occur (which, as we discussed, is likely with legal text), search falls back to documents matching 3 out of 4 terms, that is, animal control officers. If there are no animal control officers, then search falls back to just animal enforcement, and so on.
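The fallback behavior can be imitated with a few lines of Python. This sketch ranks documents purely by how many of the query’s expanded terms they share; real Elasticsearch scoring also weighs each term by its rarity. The document names are hypothetical.

```python
# Expansion rules in the style of vocab_syn (term => term plus hypernyms).
RULES = {
    "dog_catcher": ["dog_catcher", "animal_control_officer", "animal_enforcement", "criminal_law"],
    "animal_control_officer": ["animal_control_officer", "animal_enforcement", "criminal_law"],
    "animal_enforcement": ["animal_enforcement", "criminal_law"],
    "district_attorney": ["district_attorney", "legal", "criminal_law"],
}


def expand(term):
    """Apply the same expansion at index time and at query time."""
    return set(RULES.get(term, [term]))


def rank(query_term, docs):
    """Order documents by the number of expanded terms shared with the query."""
    query_tokens = expand(query_term)
    scored = [(len(query_tokens & expand(term)), name) for name, term in docs.items()]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps ties in input order
    return [name for score, name in scored if score > 0]


# Hypothetical documents, each reduced to the one key phrase it mentions.
DOCS = {
    "prosecutor_duties": "district_attorney",
    "kennel_rules": "animal_enforcement",
    "leash_statute": "animal_control_officer",
}

if __name__ == "__main__":
    # No document says "dog catcher", so the closest hypernym match leads.
    print(rank("dog_catcher", DOCS))
```

Here the statute about animal control officers shares three of the four query terms, the animal enforcement document shares two, and the district attorney document shares only criminal_law.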

Voilà, you’ve built a semantic search system that can seemingly do smart things like fall back to broader categories.

Other techie details

This isn’t open and shut, and there’s still plenty of room for more decisions. You might also want closer hyponyms to rank above deeper hyponyms. For example, a search for “shoe” ought to yield generic “sandals” before “birkenstocks.” This doesn’t come up as often, but if you did care you could try to use the field length normalization in TF*IDF to get this effect. Field normalization tends to bias results towards shorter fields. “Shorter” could mean that “Shoe” expanded to 4 taxonomic terms is shorter than “birkenstock” with its 6 terms. There’s work here that involves, very roughly:

Using keepwords to keep only keyphrases in a separate taxonomical field, thus only considering taxonomical terms for length
Customizing the underlying similarity to count the number of terms – not the number of unique positions – when doing field norms

That’s a Google search term salad that should get you started if you’re interested.
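As one possible starting point for those two items, here’s a hedged sketch of the settings involved. The "keep" token filter with "keep_words" is a standard Elasticsearch filter. Using the similarity option "discount_overlaps" to make same-position synonym tokens count toward field length is my assumption about which knob applies; I use BM25 here rather than classic TF*IDF, so check both against the documentation for your version. The names keep_taxonomy, taxonomy_only and count_all_terms are made up, and autophrase_syn and vocab_syn are assumed to be defined as earlier in the article.

```python
import json

# Hypothetical list of taxonomy tokens to keep; in practice this would be
# generated from the managed vocabulary.
TAXONOMY_TERMS = ["dog_catcher", "animal_control_officer", "animal_enforcement", "criminal_law"]


def taxonomy_field_settings(keep_terms):
    """Settings for a taxonomy-only analyzer plus a custom similarity."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    # Assumption: with overlaps not discounted, tokens stacked at
                    # one position all count toward the field's length norm.
                    "count_all_terms": {"type": "BM25", "discount_overlaps": False}
                }
            },
            "analysis": {
                "filter": {
                    # Drops every token that is not a taxonomy term.
                    "keep_taxonomy": {"type": "keep", "keep_words": keep_terms}
                },
                "analyzer": {
                    "taxonomy_only": {
                        "tokenizer": "standard",
                        "filter": ["autophrase_syn", "vocab_syn", "keep_taxonomy"],
                    }
                },
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(taxonomy_field_settings(TAXONOMY_TERMS), indent=2))
```

The taxonomy field’s mapping would then reference both the taxonomy_only analyzer and the count_all_terms similarity.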

I often also create multiple taxonomies because there’s more than one kind of classification that’s going on. A taxonomy for types of people/roles in the law. Another one perhaps oriented at legal problems people might describe. This lets me call these out at query time as different factors to boost. If the user mentioned a specific role, like district attorney, I may want to assert that if this sort of entity occurs in the query, boost it more than other factors. Using keepwords with a copy field or subfield (instead of text you could have text.extracted_roles) lets you query text.extracted_roles with a high boost. The boost will only matter in the off chance that a user actually mentions this type of entity in their search string.
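Here’s a rough sketch of what that could look like: a mapping with a text.extracted_roles subfield, and a multi_match query that boosts it. The analyzer name roles_taxonomy, the boost of 5 and both function names are assumptions for illustration; the analyzer is assumed to be defined in the index settings along the lines shown earlier (synonym filters plus keepwords for role terms).

```python
import json


def roles_mapping():
    """Mapping with a subfield that holds only the extracted role terms."""
    return {
        "properties": {
            "text": {
                "type": "text",
                "fields": {
                    # "roles_taxonomy" is a hypothetical analyzer name.
                    "extracted_roles": {"type": "text", "analyzer": "roles_taxonomy"}
                },
            }
        }
    }


def boosted_query(user_query, role_boost=5):
    """Search the main text, and weigh role matches more heavily."""
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                "type": "most_fields",  # add the subfield's score to the main field's
                "fields": ["text", f"text.extracted_roles^{role_boost}"],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(roles_mapping(), indent=2))
    print(json.dumps(boosted_query("district attorney conflict of interest"), indent=2))
```

If the query mentions no role at all, nothing survives analysis on the subfield, so the boost simply has no effect.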

Another topic I noodle with a lot is using a hierarchical classifier keyed into the clickstream to automate this process with machine learning. It’s not as much of a “silver bullet” as you might think. You’d perhaps never see associations between phrases that share no terms (such as dog catcher / animal control officer). Trey Grainger and others have done work to use user behavior to augment synonyms. I wonder if it can be used to augment managed vocabularies. I’m sure someone has worked on it.

This seems like a lot of work…

If you’re thinking this approach seems like a lot of work, you’re right. This is an approach I wouldn’t use for everything. But it’s not as much work as you might think. Consider:

There might already be publicly available taxonomies for your domain that can be a starting point (MeSH, etc)
You might not need a complex taxonomy if you focus more on the most common or problematic searches
You can fall back to general search, which means you’re using this to fine tune specific use cases of search
You might have already been trying to patch over search with synonyms, this isn’t that much more work
This gives non-techie search managers a great way to manage relevance: a way to control recall (synonyms) AND precision (hypernym/hyponyms)

Finally, if you’re in a field that relies on careful search management via a taxonomy, seriously consider hiring a professional taxonomist. It’s a field unto itself. I’m not even touching on the many many technologies and approaches to managing taxonomies. You should go check them all out!

In a future article, I may talk about the pitfalls and opportunities of tagging items using taxonomies or keyphrases. This is a topic that requires skills beyond a taxonomist’s. In fact, you may need to hire a librarian! After all, who else understands systematically organizing information by topic?

Would you like to know more?

Get in touch! If this all interests you, and you would like to share your own ideas – don’t hesitate to reach out directly. And as always, let us know if we can help make a smarter search or recommendation system for you – it’s what I do with my crack team–day in day out!

You may also be interested to know that we cover many of the above topics in our Think Like a Relevance Engineer training, available for Elasticsearch and Solr.