Recently while experimenting with word2vec-based features with Learning to Rank, I was exploring using collocations to improve the accuracy of my embeddings. If you read the original word2vec paper Mikolov et. al. make the observation:

Word representations are limited by their inability to represent idiomatic phrases that are not compositions of the individual words. For example, “Boston Globe” is a newspaper, and so it is not a natural combination of the meanings of “Boston” and “Globe”.

Indeed this is something we note in search engines. And one of our relevance best practices is to identify idiomatic phrases and other entities and treat them uniquely in a search solution. Treating them as a single phrase at query time or a single token at index time. Something like this entry in a synonyms file:

`boston globe => boston_globe`

Identifying idiomatic phrases can be a simpler way of seeming like your search system understands entities, without building a full-blown entity-extraction system. As always in search relevance, solve the simpler problem first, then move on to the next level of sophistication only if it’s warranted as needed in your evaluation data.

## Where do you find idiomatic phrases?

The challenge becomes, where do you find these idiomatic phrases? More than one client of ours has been frustrated by maintaining a list of this manually. Well luckily machine learning, err, let’s just say “statistics” to the rescue. In Natural Language Processing (NLP) there is an area that can help us called *collocations*. With collocations, you can use relatively straightforward statistics to determine whether a given phrase is statistically significant, and occurring more often than random in our corpus.

### Chi-Squared Latte

A straightforward, and rather effective way to score collocations is with a chi squared statistic. Chi squared is a fairly robust statistic, that helps overcome the tendency for language to be not normally distributed. You may know about Zipf’s law. That is, the most common words occur much, much more frequently than rare words. This occurs too with two-word phrases. if you zoom into phrases that begin with “New” the subsequent terms will also be distributed occuring to Zipfs law.

Given some labels of a day’s weather like `"rainy", "not rainy", "cloudy", "not cloudy"`

, we might want to know if there’s some relationship between cloudiness and rain.

If we know that out of 365 days, there were 100 rainy days, 265 not rainy days, 125 cloudy days, and 240 not cloudy days. In other words, the probability of rain is 100 days / 365 days or ~0.27. Similarly for cloudy days there’s a probability of ~0.34. And so on…

Let’s suppose we care about understanding the relationship between cloudiness and raininess. We start by assuming they’re completely unrelated. Like the probability of your Uncle Phil showing up and Spongebob being on TV. Completely random, unrelated, independent events occur at a probability P(Event1) * P(Event2). Or P(Uncle Phil) * P(Spongebob).

Applying that here, we can presuppose the events rainy/not rainy and cloudy/not cloudy are unrelated calculating their *expected probability*. Or simply P(rainy) * P(cloudy) or 0.27*0.34 ~ 0.09. We can repeat this to come up with expected probabilities for the intersection of every event:

rainy | not rainy | |
---|---|---|

cloudy | 0.09 | 0.25 |

not cloudy | 0.18 | 0.48 |

We want to test if the association is NOT random chance. That is there’s some kind of relationship here. That’s where the chi-squared statistic comes in. Because we actually happen to have observed how many days were actually BOTH rainy & cloudy at our weather station, and we can list those in a similar table:

rainy | not rainy | |
---|---|---|

cloudy | 90 | 50 |

not cloudy | 5 | 220 |

Using this, we can compute the actual probabilities, and compare them to random chance:

rainy | not rainy | |
---|---|---|

cloudy | 90/365 | 50/365 |

not cloudy | 5/365 | 220/365 |

Calculating actual probabilities:

rainy | not rainy | |
---|---|---|

cloudy | 0.25 | 0.14 |

not cloudy | 0.01 | 0.60 |

And intuitively, if we want to know if these things are related, we’d like to know how far these probabilities are from expected probabilities. That’s what chi-squared is, it computes a statistic from the difference of the actual (P) from the expected (E) probabilities of each cell in this table, sums them up. In the equation below, i,j are simply rows/columns of each cell in the table.

This would give us a chi-squared of

`365* ( (0.25 - 0.09)^2 / 0.09 + (0.14 - 0.25)^2 / 0.25 + … ) = 172.65`

You can consult a chi squared distribituon which graphs the likelihood of observations occurring independently as a function of the chi-squared statistic. You’ll note the odds of this data happening by chance is infinitesimally small, so we can presume there’s a significant relationship between raininess and cloudiness.

Now, it just so happens to turn out that for these 2×2 tables (called confusion matrices) you can simplify all of that math to a simpler equation. Although you lose much of the intuition of what’s above, it’s a more convenient calculation.

rainy | not rainy | row sum | |
---|---|---|---|

cloudy | 90 | 50 | 140 |

not cloudy | 5 | 220 | 225 |

col sum | 95 | 270 |

We can use this equation to compute chi-squared using Counts(C) at each cell, divided by row and column sums:

Which would be:

`365*(90*220 - 50*5)^2 / (140*225*95*270)`

Or 172.65

### Get to the NLP already!

With this context set, we can start to examine our collocations in the same way. We can examine, out of all 2-word phrases (bigrams) we know some will begin with “New” and some will end with “York”. So we can repeat the table from above, but over all bigrams in a corpus (N=25156)

w1=New | w1!=New | |
---|---|---|

w2=York | 90 | 51 |

w2!=York | 205 | 25000 |

Words beginning with New = 295Words ending with York = 141

In the same way, we want to know if there’s a significant relationship between a phrase starting with one word “New ?” and ending with another word “? York”.

By random chance, we expect a word ending with New and ending with York to be about `(295 / 25346) * (141 / 25346)`

or 0.00006. As you might guess, two words randomly occuring together as a phrase, in a large corpus with a vocabulary of 100s of thousands, is vanishingly small. But we see in our data it occurs about 90 / 25346 or 0.003, which is much higher than expected! Completing the rest of the math, the final chi-squared value is 4840, very significant!

It’s fun to think through some of the statistics here with some interesting conclusions about text

**Most of the corpus is NOT ‘New York’**As you’d expect, most of the corpus is dominated by bigrams with neither w1 nor w2 (the bottom right is always very large). This is different than weather data, for example, where this might not be true!**Expected number of bigrams is very small**– The expected number of times a term might occur randomly with another term is probably less than 1 bigram. So any occurence of our bigram, “New York”, is non trivial! (Although it’s recommended to have a minimum value for each cell in chi-squared tests).**It’s hard to detect negatively correlated bigrams**. Because bigram occurrences are expected to be rare, we can know “New York” is a collocation. It’s hard to know if “Banana Aardvark” never occurs in language in a statistically significant way until you get to a lot of text.**w1 without w2 is most of phrases starting with w1**, the expected number of times a term would occur with w1 but not w2, is basically all of the occurrences of w2 minus a fractional number. So if 141 bigrams end with “York” then we would expect about 140+ bigrams to start with New and end with York- The last point is a bit more complex, because some words (like New) occur more often. So the expected value of words ending in York, but NOT starting with New might be closer to say 138 139 or so, but still not much less than the total occurrences of words ending in York!

The statistics here are not ground-breaking, but the simple results of knowing whether a query term should be treated as a phrase or not, can be a significant step for a search application. If you’re interested in learning about other approaches to Collocations, I’d recommend the book Foundations of Statistical Natural Language Processing by Manning and Schütz. It just so happens the chapter on collocations is free. Other scoring methods such as Mutual Information are equally as intuitive and produce reasonable results.

## Hands on with Movie Collocations

I used Elasticsearch to index bigrams from the TMDB movie dataset. Here the `overview`

field has a subfield (`overview.bigrams`

) stemmed and tokenized into bigrams. We then run a direct terms aggregation to dump the most frequent bigrams. We score using chi-square function, dump each scored bigram into a heap, and output the top N bigrams. It’s worth noting, I do this manually, but in real-life you should explore the significant terms aggregation which includes a chi-squared scoring method, if you wanted significant bigrams out of Elasticsearch.

There’s one subtle difference between using an aggregation for this and what was presented in the last section. Here we are using document frequency, which is not a direct count of the total bigrams in the corpus. In my testing, it doesn’t seem to matter. Indeed I might argue it’s a stronger signal, as it means the bigram is mentioned in very different contexts.

Without further adieu, here were the top bigrams out of the TMDB dataset:

`2345200.0 cerebr palsi => cerebr_palsi2345200.0 charlton heston => charlton_heston2345200.0 dolph lundgren => dolph_lundgren2345200.0 dziga vertov => dziga_vertov2345200.0 mardi gra => mardi_gra2345200.0 sanjai dutt => sanjai_dutt2345200.0 tel aviv => tel_aviv2345197.999988487 wim wender => wim_wender2345193.999768454 sci fi => sci_fi2345187.9998823116 são paulo => são_paulo2345185.999949258 barack obama => barack_obama2345185.999949258 notr dame => notr_dame2345181.9998963834 ingmar bergman => ingmar_bergman2345179.9999232474 lionel barrymor => lionel_barrymor2345169.9999680202 jimini cricket => jimini_cricket2345165.999978254 scantili clad => scantili_clad2345147.99924611 sherlock holm => sherlock_holm2345136.000081872 tourett syndrom => tourett_syndrom2345132.0001449804 prakash raj => prakash_raj2345106.0003807857 qing dynasti => qing_dynasti2345094.000745787 daffi duck => daffi_duck2345088.0008596377 rosalind russel => rosalind_russel2345078.0009624045 eiffel tower => eiffel_tower`

Now this isn’t magic, it’s just statistics! These are all fantastic candidates for good collocations like actor and place names. The next step is to scan through these and remove any candidates that are poor candidates for being treated as a single phrase in the search engine – and as you’d expect, you’ll find more as you move down the list.

Above you see them collapsed to a single term, which seems to imply that these phrases are dealt with at index time. This expansion could also occur purely at query time, by using two synonym filters. One that goes down to the single term, as above. And a second phrase that expands back to the phrase. Such as:

`eiffel_tower => eiffel tower`

This with query auto-phrasing would turn the search `paris eiffel tower`

to `paris "eiffel tower"`

making it seem a bit like the search engine understood that `eiffel tower`

was some kind of entity.

## Why not ‘Just Search Bigrams’ and skip all this math?

Something you might ask about – that was a lot of work. Why not directly search bigrams (or use pf2/pf3 in Solr). Certainly the phrase’s TF*IDF (or BM25…) score gets at something like “statistical significance” in search! What’s the point in jumping through all these statistical hoops when we can search with sub phrases directly to do the same?

It turns out that TF*IDF (specifically IDF) scores exactly the opposite of what we want, and leads to prioritizing irrelevant phrases over statistically significant ones. In other words `gluten free snacks`

might key in on “free snacks” as the most relevant phrase with TF*IDF, not the statistically significant collocation “gluten free”.

Let’s look at why:

Remember from the table above, most of the action will be between the bottom right (phrases with neither of our words) and the upper left (phrases with both of our words):

High DF, Low TF*IDF:

w1=Gluten | w1!=Gluten | |
---|---|---|

w2=Free | 90 | 51 |

w2!=Free | 205 | 25000 |

Low DF, High TF*IDF:

w1=Free | w1!=Free | |
---|---|---|

w2=Snacks | 20 | 51 |

w2!=Snacks | 205 | 25070 |

The upper left is the document frequency of our phrase. For a statistically significant collocation (“gluten free”), we want more count in the upper left. Unfortunately, this is the opposite of what TF*IDF would want. TF*IDF scorer rarer phrases (free snacks) as more important. The more common a phrase is, the lower the relevance score that would come out of the search engine. On the flipside, higher document frequency is a *positive* indicator for collocations! So statistically interesting phrases like “william shatner” are going to occur more than “funny william”. A search for the “funny william shatner” would prioritize the bigram “funny william” (low doc freq) above “william shatner” (high doc freq).

Having a rare, but not really significant phrase, come to the top is quite the opposite of what a user would expect. And brings up an interesting philosophical points about TF*IDF based statistics and the nature of collocations. Common phrases are more likely to be significant collocations – more likely to be idiomatic. They should be treated as a single unit/concepts, with its own measure “term frequency” and “document frequency”. Rare phrases, are less likely to be significant collocations. And should NOT be treated and scored using a term frequency / document frequency.

A big takeaway for me is to avoid combining pf2/pf3/bigrams without disabling TF*IDF or somehow only expanding subphrases that themselves are collocations.

This reiterates the definition of a good collocation from the beginning of the article, a phrase that stands alone as its own concept, that’s not a simple combination of it’s constituent words.

Word representations are limited by their inability to represent idiomatic phrases that are

not compositions of the individual words. For example, “Boston Globe” is a newspaper, and so **it is not a natural combination of the meanings of “Boston” and “Globe”. **

A collocation is a word unto itself in the linguistic sense. It needs to be treated as its own concept. Not a simple combination of the underlying terms. I guess this is finally where German has things easier with constantly compounding words into words like `bostonglobe`

🙂

### Collocate to Charlottesville for some relevance training with me?

I hope you found this useful, if you enjoy learning about relevance, check out our training occurring August if you’d like to hang out! We’ll be doing a reprise of our “Think Like a Relevance Engineer” training course as well as our “Hello LTR” Learning to Rank training that we did at Haystack. Hope to see you there!