Deep Dive into Elasticsearch Cross Field Search


Crossing streams — problem. Crossing fields? No problem!

Elasticsearch’s cross-fields queries are a great feature. They let you blend multiple fields’ scores together on a search-term by search-term basis to give you the best score for that term. I covered the motivation for cross-field queries in a previous blog post. In this blog post I want to dive a layer deeper. How exactly do cross field multi_match queries work? How can you tune their scoring behavior?

Why Cross Fields?

Let’s recap why cross fields are so powerful. Their power is heavily related to another common pattern in both Solr and Elasticsearch – custom all fields. Both of the solutions solve similar problems, so an appreciation for both helps you get to the bottom of working with all fields.

What is this “all fields” pattern? Let’s briefly examine the problem it solves. Consider a case where academic articles are broken up into fields that reflect the sections of the document: abstract, introduction, methodology, and conclusion. Your starting point might be a multi_match query over those fields:

GET academia/article/_search
{
  "query": {
    "multi_match": {
      "query": "click tracking",
      "fields": ["abstract", "introduction", "methodology", "conclusion"]
    }
  }
}

This ends up doing a field-by-field search of the whole query string and taking the highest scoring field. This is a multi_match “best fields” query. In Lucene query syntax, it expands to these field-by-field (or field-centric) searches:

 abstract:(click tracking) | introduction:(click tracking) ...

These field-centric searches fail for a number of reasons, many of which are written about in this amazing book :-p. However, the most important reason for our purposes is that these individual field TF*IDF scores don’t truly reflect criteria users care about. They don’t map to signals users think about when ranking. It’s often not the data model users expect to be searching against. Users have their own mental model of the text that has little to do with how a parser or your database organizes data.

However, if you can build fields that reflect the user’s mental model of the content’s structure, TF*IDF scores against these fields can correspond to user notions of relevance. One way to do this is to combine fields together into a single scorable unit (an all field). In other words, your users don’t care about a relevance score that measures the signal “this methodology section is about click tracking”. More likely they want something more general – a signal measuring “this article is truly about ‘click tracking’”.

Measuring Term Rareness

Behind the scenes there are mechanical reasons field-centric scores don’t measure the more general signal. One reason is an inverse document frequency (IDF – roughly 1 / document frequency) that doesn’t map to users’ expectations. Remember, IDF is intended to map to users’ notions of term rareness: a higher IDF (lower document frequency) means the term is rare. Unfortunately, computed field-by-field, IDF very often fails to measure rareness as users expect. For example, if the term ‘click’ is fairly common throughout the corpus but for some reason rare in the abstract field, you could see surprising results. The user could see articles shoot straight to the top simply because click matched in the abstract – not because the article is truly about “click tracking”. To the field-by-field search, click is seen as rare in the abstract field, thus the abstract:click match is far too much of a special snowflake to be ignored! Unfortunately, users don’t quite see this as important.
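To make the rareness problem concrete, here is a small sketch of how a low per-field document frequency inflates IDF. It assumes a classic Lucene-style formula, idf = 1 + ln(numDocs / (docFreq + 1)); other similarities use different formulas. The corpus size and document frequencies are made-up numbers for illustration, not measurements from a real index.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style IDF (an assumption; other similarities differ)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical corpus of 1,000 articles. "click" shows up in few abstracts
# but in many articles overall.
NUM_DOCS = 1000
df_click_abstract = 7      # rare in the abstract field
df_click_anywhere = 200    # common across the article as a whole

idf_abstract = classic_idf(df_click_abstract, NUM_DOCS)
idf_anywhere = classic_idf(df_click_anywhere, NUM_DOCS)

print(f"idf(abstract:click)   = {idf_abstract:.2f}")
print(f"idf(click, any field) = {idf_anywhere:.2f}")
```

The abstract-only IDF comes out noticeably higher, which is exactly the “special snowflake” effect: the match looks rare to the field, even though users would not consider the term rare.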

This is the nuts-and-bolts motivation for combining fields together into a custom all field. A field that more closely maps to our users’ notion of the structure of the corpus means the IDF for “click” will more closely reflect our users’ expectations of how rare “click” is. In an all field, the IDF for “click” is computed across a combination of the article’s parts, not for individual fields. For that reason, it’s not rare to create an all field like all_text below, built up by copying other fields (using Elasticsearch’s copy_to feature):

{
  "mappings": {
    "article": {
      "properties": {
        "all_text": {
          "type": "string",
          "analyzer": "english"
        },
        "abstract": {
          "type": "string",
          "analyzer": "english",
          "copy_to": "all_text"
        },
        "introduction": {
          "type": "string",
          "analyzer": "english",
          "copy_to": "all_text"
        }
        ...

Because it captures all the text, searching all_text more reliably produces meaningful TF*IDF scores that correspond to most of our users’ simpler view of these documents – a blob of all the article’s text. The TF*IDF score is therefore more likely to be attuned to criteria users care about. The score corresponds to the general signal “this document is about click tracking”, not some combination of per-field TF*IDF scores possibly skewed toward fields where one term happens to be particularly rare.
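A toy illustration of why the all field changes the statistics: the sketch below mimics copy_to in plain Python and counts document frequency per field and for the combined field. The three-article corpus and the helper names (doc_freq, with_all_text) are invented for this example, and tokenization is a bare whitespace split rather than the english analyzer.

```python
# Hypothetical three-article corpus; text is split on whitespace only,
# with no stemming or other analysis.
articles = [
    {"abstract": "click tracking study", "introduction": "we track every click"},
    {"abstract": "search relevance", "introduction": "click models for search"},
    {"abstract": "ranking signals", "introduction": "users click on results"},
]


def doc_freq(docs, field, term):
    """Number of documents whose `field` contains `term`."""
    return sum(1 for doc in docs if term in doc[field].split())


def with_all_text(doc):
    """Mimic copy_to: combine every field's text into one all_text field."""
    combined = " ".join(doc.values())
    return {**doc, "all_text": combined}


indexed = [with_all_text(doc) for doc in articles]

df_abstract = doc_freq(indexed, "abstract", "click")
df_intro = doc_freq(indexed, "introduction", "click")
df_all = doc_freq(indexed, "all_text", "click")
print(df_abstract, df_intro, df_all)
```

Here “click” looks rare in abstract but appears in every article once the fields are combined, so the all_text document frequency is the one that matches the user’s sense of how common the term is.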

Cross Fields

The downside to this approach is that it involves duplicating every text field in the index. Moreover, it’s static. You must build it at index time. You can’t decide, ad hoc, to create an “all field” out of only two text fields after the index is built. Cross fields compensates for many of the failings of all-field search. It allows you to search on a term-by-term basis with a blended query that attempts to fix the document frequency math that mangles field-centric searches.

Let’s remember how we issue cross field queries. Recall, they’re a particular type of multi_match query:

GET academia/article/_search
{
  "query": {
    "multi_match": {
      "query": "click tracking",
      "type": "cross_fields",
      "fields": ["abstract", "introduction", "methodology", "conclusion"]
    }
  }
}

This is also explained to us as the following “parsed” pseudo-Lucene query:

 blended(abstract:click introduction:click ...)   blended(abstract:tracking introduction:tracking ...)

Last time we simply mentioned how cross fields work at a high level. The query calculates a new document frequency (remember, IDF is roughly 1 / document frequency) to use when querying. If the document frequency for abstract:click is 7 and the document frequency for introduction:click is 20, then the larger of the two is taken. This way, we sidestep the issue of low document frequencies for particular fields (like abstract:click above) driving up documents and creating unintuitive search results. Instead, we work with document frequencies closer to the user’s notion of those terms’ rareness in the entire corpus – the combination of all fields.

There’s a bit more to the story. If you really want to leverage cross fields at the next level down, you’ve got to understand how they work under the hood. Knowing the mechanics of cross fields, along with the knobs and dials available to you, helps you push this great Elasticsearch feature further and get exactly the search results you want.

Cross Fields Under The Hood

With all of that context, let’s get into how cross fields work. This is a great feature, and you’ll need some context to know how to tune cross field queries to your requirements!

Elasticsearch is open source, and you can see exactly how cross fields work by examining the source code. Cross fields are implemented as a custom Lucene query known as BlendedTermQuery. So if you’d like to follow along at home, poke around in that file. It is, after all, a lot of fun to unwind how these features work! Hopefully I’m not the only crazy person that reads Lucene code on the beach :).

You might remember from a previous blog post that custom Lucene queries work by creating custom weights and scorers. These delegate classes in turn allow you to modify the matching and scoring behavior any way you’d like. BlendedTermQuery is different. It’s a facade. The main action for BlendedTermQuery happens in its rewrite method – a method whereby Lucene asks a query to return an equivalent, possibly more efficient version of itself. BlendedTermQuery takes this opportunity to rewrite itself into a dismax query. If you follow BlendedTermQuery’s rewrite and blend methods, you get the following general steps:

  1. Gather term statistics from every search term
  2. Blend together document frequencies between the fields (other global term stats like total term freq are also blended together)
  3. Monkey-patch a TermQuery for each term search (i.e. abstract:click) with the blended doc frequency
  4. Return a Dismax query of each term’s field search

That sounds like a lot of Lucene-ese. The bottom line is that instead of acting like an honest query, BlendedTermQuery patches itself, turning this query:

GET academia/article/_search
{
  "query": {
    "multi_match": {
      "query": "click tracking",
      "type": "cross_fields",
      "fields": ["abstract", "introduction", "methodology", "conclusion"]
    }
  }
}

into something more like the following pseudo-Query DSL, with a patched document frequency:

"dis_max": [
  {
    "term": {
      "abstract": "click",
      "patched_df": 21
    }
  },
  {
    "term": {
      "introduction": "click",
      "patched_df": 20
    }
  }
  ...

In Lucene query syntax, that dismax looks like the following (| reflects the dismax behavior):

abstract:click(patchedDF=21) | introduction:click(patchedDF=20)

This means a couple of different things:

First, there’s no real blended term query in the sense that there’s something that iterates the inverted index and scores docs. BlendedTermQuery just transforms itself into a Dismax query over a bunch of term queries. In other words, there’s actually a set of real honest-to-goodness term queries here, with the only difference being that each has a patched document frequency. Other factors for each field like term frequency and norms (bias based on length) continue to play a role in the scoring of each individual term search. Only IDF has been adjusted to map to users’ notions of how rare a term is over all fields, not just one field.

Second, the monkey patching of document frequency (steps 2 and 3 above) doesn’t actually give every term query the same document frequency. If you examine BlendedTermQuery, instead of assigning each term query the same document frequency, a slight bias is given towards the field with the most mentions of the term. In other words, introduction gets a document frequency of 20 while abstract gets a 21, as shown in this table:

Field          Original DF    Patched DF
abstract       7              21
introduction   20             20

If you notice, the document frequencies have effectively been very, very slightly inverted! Now the lower DF for introduction will bias the resulting score towards that field. Why do this? Well, if this term occurs in more documents in one field, one could argue it’s more likely that field is generally more relevant to the search term. If you searched for “Doug Turnbull” and the name occurred very commonly in a name field, but rarely in a text body field, you may want to bias search results toward where the search term is more common – NOT where it’s most rare, as raw document frequency would. “Doug Turnbull” clearly belongs in the name bucket and not the other!
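The blending described above can be sketched in a few lines of Python. This is a simplified reading of the behavior this post describes, not the actual Java in BlendedTermQuery: the function name blend_doc_freqs is hypothetical, and details of the real implementation (such as capping by the index’s document count and blending total term frequency) are left out.

```python
def blend_doc_freqs(field_dfs):
    """Return patched document frequencies for one term across fields.

    field_dfs maps field name -> original document frequency.
    """
    max_df = max(field_dfs.values(), default=0)
    if max_df == 0:
        return dict(field_dfs)  # term matches nowhere; nothing to blend

    patched = {}
    actual_df = max_df
    prev = None
    # Visit fields from most to least frequent.
    for field, df in sorted(field_dfs.items(), key=lambda kv: kv[1], reverse=True):
        if df == 0:
            patched[field] = 0  # leave non-matching fields alone
            continue
        # Each time the original DF drops, nudge the patched DF up by one,
        # which slightly lowers IDF for the less frequent field.
        if prev is not None and df < prev:
            actual_df += 1
        patched[field] = actual_df
        prev = df
    return patched


patched = blend_doc_freqs({"abstract": 7, "introduction": 20})
print(patched)
```

With the document frequencies used in this post (7 for abstract, 20 for introduction), the sketch gives introduction a patched DF of 20 and abstract a patched DF of 21, so the field where the term is more common keeps the slightly higher IDF.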

Cross Fields Is Just Dismax!

Finally, at the end of the day you need not be intimidated by Cross Fields. Understand that

blended(abstract:click introduction:click ...) 

really is just pseudo-code for “monkey-patch the IDF, then turn it into a big dismax query”:

abstract:click(patchedDF=21) | introduction:click(patchedDF=20)

If you know dismax, you can figure this out! Dismax (the | operator) picks the maximum field score of the underlying queries. So now that document frequency has been monkey patched, TF and field norms are the main determiners of the field score. The result of the dismax operation is likely going to be the shortest field with the most mentions of the term. If it happens to be abstract:click, then that score is taken for the term “click”.
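To see why TF and field norms end up deciding the winner, here is a rough sketch using a simplified classic TF*IDF score (sqrt(tf) * idf^2 * 1/sqrt(field length)). That formula is an assumption standing in for whatever similarity your index uses, and it ignores query normalization and coordination factors. The corpus size, term frequencies, and field lengths are invented for illustration.

```python
import math

NUM_DOCS = 1000  # hypothetical corpus size


def field_score(term_freq, field_length, patched_df):
    """Simplified classic TF*IDF for one term in one field."""
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(NUM_DOCS / (patched_df + 1))
    norm = 1.0 / math.sqrt(field_length)  # shorter fields score higher
    return tf * idf * idf * norm


# Same number of mentions of "click", but the abstract is much shorter.
scores = {
    "abstract": field_score(term_freq=3, field_length=50, patched_df=21),
    "introduction": field_score(term_freq=3, field_length=400, patched_df=20),
}
winner = max(scores, key=scores.get)
print(winner, scores)
```

Because the patched document frequencies are nearly identical, the IDF terms barely differ, and the shorter field with the same number of mentions wins the dismax.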

Just like in other forms of dismax, you can tune the query! You can use tie_breaker to add in scores from the other fields that didn’t win the dismax competition (winner + tie_breaker * (sum of other fields)), such as:

abstract:click + tie_breaker * ( introduction:click ... )
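As a quick sanity check on that formula, the dismax combination can be written as a tiny Python function. The function name dis_max and the per-field scores are placeholders for illustration; real scores come from the term queries themselves.

```python
def dis_max(field_scores, tie_breaker=0.0):
    """Best field score plus tie_breaker times the sum of the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others


click_scores = [2.0, 1.0, 0.5]  # hypothetical per-field scores for "click"

print(dis_max(click_scores))                   # winner only
print(dis_max(click_scores, tie_breaker=0.5))  # winner plus half of the rest
print(dis_max(click_scores, tie_breaker=1.0))  # plain sum of all fields
```

A tie_breaker of 0 is pure winner-takes-all, while a tie_breaker of 1 adds every field’s score in full.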

Setting tie_breaker to 1 effectively turns the query into a boolean query (each clause is an “OR”), because every field’s score is added in full. Additionally, you can use field boosts to weight fields differently. For example, we could give abstract a higher boost (5 here), and set the tie_breaker to 0.5:

GET academia/article/_search
{
  "query": {
    "multi_match": {
      "query": "click tracking",
      "type": "cross_fields",
      "fields": ["abstract^5", "introduction", "methodology", "conclusion"],
      "tie_breaker": 0.5
    }
  }
}
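If you build queries in application code, a small helper keeps boosts and tie_breaker easy to experiment with. The sketch below only constructs the request body as a Python dict; the helper name cross_fields_query is hypothetical, and sending the body to Elasticsearch is left to whichever client you use.

```python
import json


def cross_fields_query(text, boosts, tie_breaker=0.0):
    """Build a cross_fields multi_match body; boosts maps field -> boost."""
    fields = [
        field if boost == 1 else f"{field}^{boost}"
        for field, boost in boosts.items()
    ]
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "cross_fields",
                "fields": fields,
                "tie_breaker": tie_breaker,
            }
        }
    }


body = cross_fields_query(
    "click tracking",
    {"abstract": 5, "introduction": 1, "methodology": 1, "conclusion": 1},
    tie_breaker=0.5,
)
print(json.dumps(body, indent=2))
```

Keeping boosts in a dict makes it simple to try different weightings and compare the resulting rankings side by side.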

The bottom line is, once you understand that this query is just dismax with a more intuitive document frequency, tuning becomes an exercise in using the tools you already know and love from dismax. You can boost, use a tie_breaker, and reach for all the other familiar features.

The result is search with quite a few nice properties. You get the ideal score for each search term, with consistent per-term document frequency that reflects what users expect, and scoring you can control with tie_breaker and boosts.

For some context, you could compare this to Solr’s (e)dismax query parser. Avid readers of this blog know that Solr’s dismax suffers from winner-takes-all problems. Solr’s dismax behaves similarly to the above, except that document frequency is NOT monkey patched. That means document frequency has an unduly strong influence on the final score – usually creating odd field “winners” in the dismax calculation that don’t jibe with what users expect. For example, that match of “doug turnbull” in the description field could shoot straight to the top due to low document frequency – despite a name field with many “dougs” perhaps being a more appropriate match.

In other words – I wish Solr would examine Cross Field search and incorporate it as a feature! Looking forward to the cross-fields query parser 🙂

Well that’s about it for today’s deep dive! As always, get in touch if there’s any way we can help you with your Solr or Elasticsearch problems. We love turning dull search into semantic search!

Cake image by Flickr User poppet with a camera