Patterns for Synonyms in Elasticsearch: Keyphrases

Doug Turnbull, December 2, 2016

As I almost exclusively help folks with Solr/Elasticsearch search relevance, I often encounter the “giant list of synonyms” that a client has. This list often creates odd side effects with matching. I want to begin to discuss patterns that I’ve found useful when managing Solr/Elasticsearch synonyms. Here I just use ES for examples, but quite a lot here also applies to Solr, with the caveat of the dreaded multiterm synonyms.

In this article, I want to talk about thinking about synonyms in terms of keyphrases, not as the individual tokens they represent. Most users think of phrases like “heart attack” as a unit, not as the word “heart” before the word “attack”.

First let’s examine the problem more closely. Let’s take a health care example (all code in this gist). It’s not uncommon for folks to start with a synonym list like below. Here we equate all the different names for “heart attack” as equivalent:

"settings": {
    "analysis": {
        "analyzer": {
            "syn_text": {
                "tokenizer": "standard",
                "filter": ["health_synonym"]
            }
        },
        "filter": {
            "health_synonym": {
                "type": "synonym",
                "synonyms": ["heart attack, myocardial infarction, mi, cardiac arrest, heartattack"]
            }
        }
    }
}

This creates a syn_text analyzer that splits text into words with the standard tokenizer, then applies the provided synonyms to the text. This looks very much like where 90% of my clients start with manually maintained synonyms. The problem is this doesn’t work as most expect. Using elyzer you can see what happens when this analyzer is applied to the text “heart attack”:

$ elyzer --es localhost:9200 --index syntest --analyzer syn_text --text "heart attack"
TOKENIZER: standard
{0:heart}	{1:attack}	
TOKEN_FILTER: health_synonym
{0:heart,myocardial,mi,cardiac,heartattack}	{1:attack,infarction,arrest}	

If that’s not perfectly clear, it means that your text, “heart attack”, gets turned into a whole bunch of tokens (heart, myocardial, mi, etc.) in the 0th position, and a whole bunch of tokens in the 1st position (attack, infarction, …). That means the text contains, as you’d expect, the two separate tokens [myocardial] [infarction]. Perhaps unexpectedly, the phrases [cardiac] [attack], [myocardial] [arrest], and even [heartattack] [attack] now appear next to each other too.

This text will now match phrase queries such as “heartattack attack” and “myocardial arrest,” or even single-term queries for just “heart” or “cardiac.” Most developers think of “heart attack” as a unit unto itself that should only match when “heart attack” itself (or a true synonym) is searched for, so they find this behavior unexpected.
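
To make the scale of the problem concrete, here is a minimal Python sketch that enumerates every two-word phrase the expanded token stream can match. It does not call Elasticsearch; the `positions` list is simply a hand-copied stand-in for the elyzer output above, and the variable names are my own.

```python
from itertools import product

# Hand-copied from the elyzer output: every token that ends up at each position
# after naive, token-level synonym expansion of "heart attack".
positions = [
    ["heart", "myocardial", "mi", "cardiac", "heartattack"],  # position 0
    ["attack", "infarction", "arrest"],                        # position 1
]

# Any position-0 token sits directly before any position-1 token,
# so every pairing can satisfy a two-word phrase query.
matchable_phrases = {" ".join(pair) for pair in product(*positions)}

# The phrases we actually meant to treat as equivalent.
intended = {"heart attack", "myocardial infarction", "cardiac arrest"}
spurious = sorted(matchable_phrases - intended)

print(len(matchable_phrases), "matchable phrases,", len(spurious), "of them unintended")
```

Only three of the pairings are phrases anyone meant to index; the rest are side effects of expanding token by token.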

Extracting keyphrases with autophrasing

What you probably expect is that phrases like “myocardial infarction” will be treated as a single unit, its own keyphrase, not broken down into constituent parts. Indeed, that’s the strategy behind Lucidworks’ auto phrase token filter for Solr. It catches phrases like “myocardial infarction” and ensures they’re treated as a single unit, perhaps as the single token [myocardial_infarction]. This makes sure that before the synonym step you don’t inject multiple tokens, only the full phrase as a single token, which avoids weird spurious matches on half of the keyphrase (such as just “attack”).

Entity Extraction with Shingle Autophrasing/Keepwords

You can get Solr/Elasticsearch to view multi-word chunks as a single token with shingles. A shingle filter lets us group every possible run of 1 to N adjacent words as a single token in analysis. So the phrase “myocardial infarction is no fun” would get emitted as [myocardial], [myocardial_infarction], [myocardial_infarction_is].... Then following this step you can optionally use a keepwords filter to keep only the phrases you want, perhaps narrowing this text back down to just the token [myocardial_infarction]. As we write about in Relevant Search, what we’re really doing is careful feature extraction. This can serve as a poor person’s entity extraction system, or the keepwords list could come from the output of an external machine learning system crafted to look for key medical phrases based on statistically interesting phrases.
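
As a rough illustration of the shingle-then-keep idea, here is a small pure-Python sketch. It only imitates what the two filters do; the function names `shingles` and `keep` are hypothetical helpers, not Elasticsearch APIs. It joins words with a space by default, rather than the underscore used in the illustration above, so the separator is exposed as a parameter.

```python
def shingles(tokens, min_size=2, max_size=4, output_unigrams=True, sep=" "):
    """Emit every run of min_size..max_size adjacent tokens, plus unigrams if requested."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(sep.join(tokens[start:start + size]))
    return out


def keep(tokens, keep_words):
    """Drop every token that is not in the keep list."""
    return [tok for tok in tokens if tok in keep_words]


tokens = "myocardial infarction is no fun".split()
candidates = shingles(tokens)
entities = keep(candidates, {"myocardial infarction", "heart attack"})

print(candidates)
print(entities)
```

The shingle step deliberately over-generates candidates; the keep step is what turns that noise into a short list of entities.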

What does this look like? You’ll want to take your synonyms list and generate two filters, one for synonyms and the other with those synonyms listed as keepwords, as follows:

"filter": {
   "health_synonym": {
      "type": "synonym",
      "tokenizer": "keyword",
      "synonyms": ["heart attack, myocardial infarction, mi, cardiac arrest, heartattack, acute heart attack"]
   },
   "keep_health_entities": {
      "type": "keep",
      "keep_words": ["heart attack", "myocardial infarction", "mi", "cardiac arrest", "heartattack", "acute heart attack"]
   }
}

Notice we’re setting tokenizer to “keyword” for the synonym filter. This keeps each phrase in the synonym list, such as “heart attack”, as a single token, so the filter only expands synonyms when it sees the phrase as one full token. In other words, it expands the full token [cardiac arrest] into [heart attack] and the other synonyms.
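
Since both filters list the same phrases, one way to keep them in sync is to generate them from a single source. The following is a minimal sketch under that assumption; `build_filters` and the `groups` structure are names I made up for illustration, and the output mirrors the filter definitions shown above.

```python
import json


def build_filters(synonym_groups):
    """Build a synonym filter and a matching keep filter from groups of equivalent phrases."""
    # One comma-separated rule per group of equivalent phrases.
    rules = [", ".join(group) for group in synonym_groups]
    # Every phrase in every group is an entity worth keeping.
    keep_words = [phrase for group in synonym_groups for phrase in group]
    return {
        "health_synonym": {
            "type": "synonym",
            "tokenizer": "keyword",
            "synonyms": rules,
        },
        "keep_health_entities": {
            "type": "keep",
            "keep_words": keep_words,
        },
    }


groups = [
    ["heart attack", "myocardial infarction", "mi",
     "cardiac arrest", "heartattack", "acute heart attack"],
]
filters = build_filters(groups)
print(json.dumps(filters, indent=2))
```

Adding a new group of synonyms then means editing one list rather than two hand-maintained filters.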

You’ll also want to add the shingle filter to generate shingles ranging from length 1-4:

"4_shingle": {
        "type": "shingle",
        "max_shingle_size": 4,
        "min_shingle_size": 2,
        "output_unigrams": true
    },

Now, we’re going to string these three together so that first we generate shingles, followed by health synonyms, followed by keepwords. In other words: generate candidate keyphrases by shingling, expand them with synonyms, then cull out any non-synonyms with keepwords. As in the following analyzer:

"analyzer": {
    "syn_text": {
        "tokenizer": "standard",
        "filter": ["4_shingle", "health_synonym", "keep_health_entities"]
    }
},      
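
To see what this chain leaves behind, here is a simplified pure-Python imitation of the three steps (shingle, whole-token synonym expansion, keep). It is a sketch under stated assumptions, not the real analysis chain: it splits on whitespace, ignores token positions, and the helper names are hypothetical.

```python
GROUP = ["heart attack", "myocardial infarction", "mi",
         "cardiac arrest", "heartattack", "acute heart attack"]


def shingles(tokens, max_size=4):
    # Unigrams plus every run of 2..max_size adjacent tokens, joined by a space.
    return [" ".join(tokens[i:i + n])
            for i in range(len(tokens))
            for n in range(1, max_size + 1)
            if i + n <= len(tokens)]


def expand(token, group):
    # Whole-token synonym expansion: only a full match on a listed phrase expands.
    return list(group) if token in group else [token]


def analyze(text, group=GROUP):
    candidates = shingles(text.split())
    expanded = [syn for tok in candidates for syn in expand(tok, group)]
    keep_words = set(group)
    # Cull everything that is not a known entity or one of its synonyms.
    return sorted({tok for tok in expanded if tok in keep_words})


print(analyze("bad heart attack in paris"))
print(analyze("bad cardiac arrest"))
print(analyze("bad day in paris"))
```

Text mentioning any one of the phrases ends up with the whole synonym group, while text with no known entity ends up empty. That is why this analyzer belongs on a copy of the field rather than on the only copy of the text.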

The keepwords step, in addition to serving as entity extraction, cleans up lots of spurious tokens generated by the initial shingles step. Because we’re removing most of the text, this approach is best used on a copy of the text (Solr copyField). In Elasticsearch, we can just use subfields to analyze the text differently:

 "mappings": {
    "article": {
        "properties": {
            "text": {
                "type": "string",
                "analyzer": "english",
                "fields": {
                    "entities": {
                        "type": "string",
                        "analyzer": "syn_text"
                    }}}}}},

So we’ll have text (the full English, exact text) and text.entities which will contain our keyphrases expanded into synonyms.

I’m stepping over a lot of artful thought you might need to consider for your case. First, I’m doing barebones analysis. I’m not stemming, for example. When you manage synonyms, you should consider whether you’d like to add at least a minimal amount of possessive/plural stemming, so you don’t have to consider many alternate forms in your synonyms. Keep in mind though that the Solr/Elasticsearch stemmers are heuristics. Odd English plural forms like “shoes” don’t always get caught, so you need to test and account for alternate forms to some extent.

Anyway, now that there are two fields, how would they be queried? Well, the query strategy here is to query against the plain text and boost by medical entity matches. Something like the following query is a good starting point:

POST syntest/_search
{
    "query": {
        "bool": {
            "should": [
            {"match": {
                "text.entities": "heart attack"
            }},
            {"match": {
                "text": "heart attack"
            }}]}}}

So the nice thing we’ve done here is pulled out entities that likely matter to our users, and can use them explicitly in a query. This lends a signal to the final ranking function asking: are medical entities the user might talk about, or their synonymous phrases, mentioned in the text? If so – boost!

We could expand this to get more specific about what kind of entities are being talked about, in case you wanted to modify how ranking worked based on those entities. Perhaps diseases ought to be separated from treatments, for example.

One problem with what I’ve presented is that the synonyms are dumped in a different field. So there’s one relevance score in the above query that matches synonyms, and another that doesn’t. The weights of these two queries are hard to balance, and the approach comes with other problems. For example, a user searching for the phrase “bad cardiac arrest” won’t match the text “bad heart attack.” The synonym isn’t expanded in the main text, and the non-medical term “bad” is stripped out of the text.entities field.

I’ve found that a better pattern, which I write about in Relevant Search, is to have one high-recall “base” query, something that is indeed full of all the problematic noisy synonyms. Then, within that working set, use the higher-precision, secondary keyphrase queries to reshuffle the best matches to the top of the search results.

I’m not going to extensively recreate this, but imagine you have a text field with naive synonym expansion like the first example in this article. In other words, text such as “he had a blue heart attack” or “bad heart attack in Paris” is analyzed as:

$ elyzer --es localhost:9200 --index syntest --analyzer syn_text --text "he had a blue heart attack"
TOKENIZER: standard
{0:he}	{1:had}	{2:a}	{3:blue}	{4:heart}	{5:attack}	
TOKEN_FILTER: synonym
{0:he}	{1:had}	{2:a}	{3:blue}	{4:heart,myocardial,mi,cardiac,heartattack}	{5:attack,infarction,arrest}	
$ elyzer --es localhost:9200 --index syntest --analyzer syn_text --text "bad heart attack in paris"
TOKENIZER: standard
{0:bad}	{1:heart}	{2:attack}	{3:in}	{4:paris}	
TOKEN_FILTER: synonym
{0:bad}	{1:heart,myocardial,mi,cardiac,heartattack}	{2:attack,infarction,arrest} {3:in}	{4:paris}

There’s our noisy, weird synonyms. Yes, this will match many a spurious search term, including just “heart” by itself, confusing those heartburn sufferers. But that’s ok, because this is just our starting point for a general working set of reasonably relevant documents.

We then use our entities field to bring the good stuff to the top. Alongside the noisy text field, you keep the text.entities subfield as we’ve just introduced it. Then you repeat the query against it, perhaps with a significant boost on text.entities to pull up key medical phrases and their exact synonyms:

POST syntest/_search
{
    "query": {
        "bool": {
            "should": [
            {"match": {
                "text.entities": {
                    "query": "bad heart attack",
                    "boost": 10
                }
            }},
            {"match_phrase": {
                "text": {
                    "query": "bad heart attack",
                    "boost": 10
                }
            }},
            {"match": {
                "text": "bad heart attack"
            }}]}}}

Here there are three signals to balance at ranking. First is the base, high-recall match. Again, this is low value and shouldn’t be boosted much. Second, there are the two queries pointing at higher-precision matches: the full phrase match against text and the entity match. We boost those higher (the exact boost is a matter of trial and error with something like Splainer or Quepid).
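
Because the boosts need tuning, it can help to build the request body from a function so the weights are easy to vary between experiments. This is a minimal sketch that only constructs the JSON body; `build_query` and its parameter names are my own, the default boosts are placeholders rather than recommendations, and the field names assume the text / text.entities mapping shown earlier.

```python
import json


def build_query(user_text, entity_boost=10, phrase_boost=10):
    """Combine a high-recall base match with two boosted, higher-precision clauses."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Higher precision: keyphrase/entity match on the subfield.
                    {"match": {"text.entities": {"query": user_text, "boost": entity_boost}}},
                    # Higher precision: the user's words as an exact phrase.
                    {"match_phrase": {"text": {"query": user_text, "boost": phrase_boost}}},
                    # High recall base match, left unboosted.
                    {"match": {"text": user_text}},
                ]
            }
        }
    }


body = build_query("bad heart attack", entity_boost=10, phrase_boost=5)
print(json.dumps(body, indent=2))
```

The printed body can be sent to the `_search` endpoint as is, and different boost pairs can be compared side by side in a relevance tool.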

Autophrase with synonym step

Perhaps even better is to just autophrase the key phrases with an additional synonym step. For example turn heart attack into heart_attack and then expand heart_attack into various synonym forms. This simplifies the shingling, but comes with an extra step where you need to list your synonyms. You’ll probably want to generate these filters programmatically from your synonyms list to avoid errors.

Let’s start with the initial synonym step for autophrasing:

"autophrase_syn": {
    "type": "synonym",
    "synonyms": ["heart attack => heart_attack",
                 "myocardial infarction => myocardial_infarction",
                 "cardiac arrest => cardiac_arrest",
                 "acute heart attack => acute_heart_attack"]
},

Then we follow with the appropriate synonyms:

"health_synonym": {
    "type": "synonym",
    "tokenizer": "keyword",
    "synonyms": ["heart_attack, myocardial_infarction, mi, cardiac_arrest, heartattack, acute_heart_attack"]
}

And our analyzer becomes:

"analyzer": {
    "syn_text": {
        "tokenizer": "standard",
        "filter": ["autophrase_syn", "health_synonym"]
    }
},    
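
Because the autophrase rules and the synonym rule must agree exactly on the underscored forms, generating both from one list avoids typos. Here is a minimal sketch of that idea; `build_autophrase_filters` is a hypothetical helper of my own, and it assumes a single group of synonyms with phrases separated by single spaces.

```python
import json


def build_autophrase_filters(group):
    """Derive the autophrase rules and the underscored synonym rule from one phrase list."""
    def joined(phrase):
        return phrase.replace(" ", "_")

    # Only multi-word phrases need an explicit "a b => a_b" autophrase rule.
    autophrase_rules = [f"{p} => {joined(p)}" for p in group if " " in p]
    # The synonym rule lists every phrase in its single-token form.
    synonym_rule = ", ".join(joined(p) for p in group)
    return {
        "autophrase_syn": {"type": "synonym", "synonyms": autophrase_rules},
        "health_synonym": {
            "type": "synonym",
            "tokenizer": "keyword",
            "synonyms": [synonym_rule],
        },
    }


group = ["heart attack", "myocardial infarction", "mi",
         "cardiac arrest", "heartattack", "acute heart attack"]
filters = build_autophrase_filters(group)
print(json.dumps(filters, indent=2))
```

For the list used in this article, the generated rules come out the same as the hand-written filters above.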

This nicely tokenizes “bad heart attack in paris” accordingly:

$ elyzer --es localhost:9200 --index syntest --analyzer syn_text --text "bad heart attack in paris"
TOKENIZER: standard
{0:bad}	{1:heart}	{2:attack}	{3:in}	{4:paris}	
TOKEN_FILTER: autophrase_syn
{0:bad}	{1:heart_attack}	{2:in}	{3:paris}	
TOKEN_FILTER: health_synonym
{0:bad}	{1:heart_attack,myocardial_infarction,mi,cardiac_arrest,heartattack,acute_heart_attack}	{2:in}	{3:paris}	

When we used shingles, we pulled entities into a different field. There’s no reason you couldn’t use that approach here as well, but here I’m just going to apply this analyzer directly to the text. This demonstrates that we can search for the phrase “bad cardiac arrest” and we’ll still match this text:

POST syntest/_search
{
    "query": {
        "match_phrase": {
            "text": "bad cardiac arrest"
        }}}

As users expect, this matches “bad heart attack in Paris.” Optionally, this analyzer could be applied to a subfield/copyField to differentiate between exact text and text with synonym keyphrases expanded.
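
To show why the phrase query now behaves, here is a simplified pure-Python model of the autophrase and synonym steps followed by a position-by-position phrase check. It is a sketch under stated assumptions and not how Lucene actually scores phrases: it lowercases and splits on whitespace, and `autophrase`, `analyze` and `phrase_matches` are names I invented for it.

```python
PHRASES = {
    ("heart", "attack"): "heart_attack",
    ("myocardial", "infarction"): "myocardial_infarction",
    ("cardiac", "arrest"): "cardiac_arrest",
    ("acute", "heart", "attack"): "acute_heart_attack",
}
GROUP = frozenset({"heart_attack", "myocardial_infarction", "mi",
                   "cardiac_arrest", "heartattack", "acute_heart_attack"})


def autophrase(tokens):
    """Collapse known multi-word phrases into one token, preferring the longest match."""
    out, i = [], 0
    max_len = max(len(key) for key in PHRASES)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in PHRASES:
                out.append(PHRASES[key])
                i += n
                break
        else:  # no phrase starts here, keep the token unchanged
            out.append(tokens[i])
            i += 1
    return out


def analyze(text):
    """Return one set of tokens per position, with keyphrases expanded to the whole group."""
    return [GROUP if tok in GROUP else frozenset({tok})
            for tok in autophrase(text.lower().split())]


def phrase_matches(query, doc):
    """True if every query position overlaps the doc at consecutive positions."""
    q, d = analyze(query), analyze(doc)
    return any(all(q[j] & d[i + j] for j in range(len(q)))
               for i in range(len(d) - len(q) + 1))


doc = "bad heart attack in Paris"
print(phrase_matches("bad cardiac arrest", doc))
print(phrase_matches("heartattack attack", doc))
print(phrase_matches("heart", doc))
```

In this model, the synonym phrase matches while the half-phrase queries that plagued the naive approach do not, because “heart” and “attack” no longer exist as separate tokens in the analyzed text.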

Intent & Entity Focussed – What kinds of things do users search for

One of the biggest lessons when dealing with synonyms is not to defer to an industry or language standard for synonyms. Instead, you need to map between how users talk about content and how content creators have written it. This varies tremendously between domains. For example, I’m inclined to search a set of laws for laws about a “dog catcher,” yet legislators likely write about “animal control officers.”

So when you work on synonyms, don’t worry about a globally pure expression of synonymy. Focus on how your users search and how your content is organized/written, and figure out how to map between the two. This isn’t easy, and is often the hard app/domain-specific work you need to do. And as an aside: there’s not usually a generic, machine learning silver bullet to this that doesn’t also require work, domain expertise, monitoring & curation.

In a future article, I’ll expand on these ideas. You’ll see that taxonomies and controlled vocabularies can be incorporated to create semantic search that can order by concept specificity. For example, a search for “heart attack” that can first show you heart attack results. If none of those occur, then conditions having to do with the heart, followed by general circulatory diseases, followed by all diseases, and so on.

I’ll also discuss in a future article one concern with these approaches: choosing between query and index synonym expansion. In the above examples, I’m doing both. The strategies here look quite different at high scale. In those cases, leaving the indexed text largely alone and expanding queries to synonyms at search time is much better, though with other consequences.

That’s not all

These patterns are just the tip of the iceberg on building a good search application. Regardless of your search technology, you need to understand how synonymy works. You need to map between how users seek content and how content describes important entities. And there’s much, much room to extend what’s been described above that I’ll be writing about in the future!

Anyway, please do get in touch if you need help with Solr and Elasticsearch relevance. Our services focus on improving technical and business outcomes through better search and recommendations.



