Elasticsearch Cross Field Search Is A Lie

At OpenSource Connections, We Do What We Must, Because We Can

In Elasticsearch, searching across multiple fields can be confusing to beginners. This is a tough first step in creating a relevant search solution, so it’s important to get this right. In particular, it can be hard to wrap your head around multi_match’s cross field search and where exactly it fits into a querying strategy.

The idea behind cross field search is that it attempts to search more sensibly across fairly disjoint fields that we might consider parts of a whole. The example often given is first and last name fields. We tend to consider these two fields a decomposition of a name. When we search for “Thomas Tucker”, “Thomas, William”, or “Frank Oz”, we typically just want to search for these terms ([thomas], [tucker], [william], etc.) in either of the name fields.

With normal dismax type queries (“best fields” or “most fields”), we sort of do this. However, we take each term to each field in isolation. The field with the best score dominates the resulting score. This is often termed “field-centric” search, and is the default mode of operation for most query strategies.

Let’s walk through an example to understand field-centric search, and why it might be a problem for our use case.

A query such as:

{
    'query': {
        'multi_match': {
            'query': 'tucker thomas',
            'fields': ['first_name', 'last_name']
        }
    }
}

Will be interpreted by Elasticsearch as the following Lucene query:

(first_name:thomas first_name:tucker) | (last_name:thomas last_name:tucker)

We know that by default, multi_match does a best fields search. This is precisely what we see here: each term is scored against each field independently, and the best field score is taken. When we take apart the scoring, we can see why the best fields strategy might be a problem here. “Thomas” is a common first name, but “Tucker” is a particularly rare first name. That means “Tucker” will have a very low document frequency in first_name, and thus will be scored very highly by TF*IDF. Now let’s look at the last_name field. Thomas and Tucker are fairly common last names, so they don’t receive a particularly high score. In best fields search, the best field’s score becomes the final score. Thus the score for “Tucker” in first_name will win (it’s the best field), and the resulting score is the score for a first_name that matches “Tucker”. We’ll now have search results that look like:

1. Tucker Fredrickson
2. Tucker Turnbull
3. Tucker Smith

This is the destructive winner-takes-all pattern in dismax-type search we’ve blogged about before.

Switching to “most fields” helps some. Most fields sums the two fields’ scores rather than taking the max. We’ll add the minuscule score from the last name queries for these terms, which acts as a tie breaker for the first name search:

1. Tucker Thomas
2. Tucker Turnbull
3. Tucker Smith

Close, but no cigar. The problem is, in most use cases, we probably want “last name” matches to be more important than first name matches. In our fairly unintelligent name search, we’d prefer to see results that look like:

1. Tucker Thomas
2. Bob Thomas
3. Tim Thomas
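To make the two field-centric modes concrete, here is a toy sketch in Python. It is not Elasticsearch's actual scoring: the per-field weights in `IDF` are invented numbers standing in for TF*IDF, and `best_fields` and `most_fields` are hypothetical helpers that only mimic "take the max" and "take the sum".

```python
# Hypothetical per-field weights standing in for TF*IDF; the numbers are invented.
IDF = {
    "first_name": {"thomas": 1.0, "tucker": 6.0},  # "tucker" is a rare first name
    "last_name": {"thomas": 2.0, "tucker": 2.0},   # both are common last names
}

DOCS = [
    {"first_name": "tucker", "last_name": "fredrickson"},
    {"first_name": "tucker", "last_name": "thomas"},
    {"first_name": "bob", "last_name": "thomas"},
]


def field_scores(doc, terms):
    """Score each field in isolation: add up the weights of the query terms it matches."""
    return {
        field: sum((IDF[field].get(t, 0.0) for t in terms if t == value), 0.0)
        for field, value in doc.items()
    }


def best_fields(doc, terms):
    """Winner takes all: only the highest-scoring field counts."""
    return max(field_scores(doc, terms).values())


def most_fields(doc, terms):
    """Every field contributes: the per-field scores are summed."""
    return sum(field_scores(doc, terms).values())


terms = ["tucker", "thomas"]
for doc in DOCS:
    name = f"{doc['first_name']} {doc['last_name']}"
    print(name, best_fields(doc, terms), most_fields(doc, terms))
```

With these made-up weights, best fields cannot tell "tucker fredrickson" from "tucker thomas", because the rare first name swamps everything else. Most fields breaks that tie, but "bob thomas" still trails any first-name "tucker", since each field is still scored with its own statistics.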

How can we solve this problem?

At the root of the problem is that our search is focused on the individual fields with their own particular odd statistics. Tucker is a fairly rare first name, but common last name. Bob is a very common first name but rare last name. We’d like to combine what we know about both fields. However, because of the stark differences in term distribution, document frequency, and field length, field scores between our name fields are not portable. Indeed this is true for all fields. Field scores live in their own little scoring universes. Scoring the fields together is not simple.

So the problem can be restated as creating a sane way to score the fields together. We’d like to think in terms of the document frequency and other statistics of the two fields together, as if they were merged together, not how they behave in isolation. One common technique that people use to do this is to create what’s known as an all field. In Elasticsearch, this would involve using copy_to to append fields together into a larger field (perhaps full_name) via the mapping API.

Indeed, this is a strategy that’s actually codified in Solr with the text field often baked into the default configuration. The advantage of the all field approach is it does the best job of accurately combining the statistics of the two fields together. It does this by literally combining two fields into one. Obviously, this causes the index statistics to truly reflect the combination of fields. Unfortunately, there’s a fairly big downside. These all fields create a great deal of duplication and bloat. The relevancy use case better be very valuable indeed to create so much overhead.
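For reference, an all field built with copy_to might be mapped roughly as follows. This is a sketch, not a drop-in configuration: the `full_name` field name comes from the example above, the `text` field type is an assumption, and the outer envelope varies by Elasticsearch version (older versions nest `properties` under a type name).

```python
import json

# Sketch of an index mapping where both name fields are copied into one combined field.
# The "text" type and the overall envelope are assumptions; adjust for your version.
full_name_mapping = {
    "mappings": {
        "properties": {
            "first_name": {"type": "text", "copy_to": "full_name"},
            "last_name": {"type": "text", "copy_to": "full_name"},
            "full_name": {"type": "text"},
        }
    }
}

# A plain match query against the combined field then uses the combined statistics.
all_field_query = {"query": {"match": {"full_name": "tucker thomas"}}}

print(json.dumps(full_name_mapping, indent=2))
print(json.dumps(all_field_query, indent=2))
```

The cost described above shows up here directly: every first and last name is indexed a second time in `full_name`.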

Elasticsearch, however, has an alternative solution that attempts to blend fields together at query time. This is where cross_fields comes in. Cross field search attempts to gather term statistics on the two fields together at query time, to score the two fields as if they were one. For statistics like term frequency, Elasticsearch can sum the term frequency in each field as it’s scored. However, other statistics are not as simple. Document frequency is global to the index, and therefore a bit harder to compute when considering individual terms at query time. We can’t just sum the term’s document frequencies at query time, as that would double count documents where both fields contain the term. The best cross field search can reasonably do is to take the maximum of the two fields’ document frequencies and hope this comes close. For example, if the names in our search engine look like:

         first_name   last_name
doc 1    douglas      turnbull
doc 2    thomas       turnbull
doc 3    douglas      thomas

Here, the document frequency of “thomas” in blended, cross-field search will be “1” (the max document frequency of first_name and last_name). In reality, if the fields are appended together in an all field, the document frequency will be “2”. The upside of cross-field search is that we’ve eliminated bloat and overhead from an all field in our index. The downside is that by taking a maximum of document frequency, we may undercount how common a term truly is between the two fields (and therefore overcount it as rare in scoring).
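The arithmetic above can be checked with a few lines of Python over the same three documents. This is only an illustration of the max-versus-union counting described in this post, not Elasticsearch's implementation; `field_df`, `blended_df`, and `all_field_df` are hypothetical helper names.

```python
# The three example documents from the table above.
DOCS = [
    {"first_name": "douglas", "last_name": "turnbull"},
    {"first_name": "thomas", "last_name": "turnbull"},
    {"first_name": "douglas", "last_name": "thomas"},
]
FIELDS = ["first_name", "last_name"]


def field_df(term, field):
    """Document frequency of a term within a single field."""
    return sum(1 for doc in DOCS if doc[field] == term)


def blended_df(term):
    """Cross-field style approximation: the max of the per-field document frequencies."""
    return max(field_df(term, field) for field in FIELDS)


def all_field_df(term):
    """What an all field would report: documents containing the term in any field."""
    return sum(1 for doc in DOCS if any(doc[field] == term for field in FIELDS))


for term in ("thomas", "douglas", "turnbull"):
    print(term, "blended:", blended_df(term), "all field:", all_field_df(term))
```

For “thomas” the blended value is 1 while the all field value is 2, which is the undercount described above. For “douglas” and “turnbull” the two agree, because each term only ever appears in one of the fields.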

So certainly, “cross field” search is a bit of a lie. But it comes with a number of advantages that give it a great deal of value. In particular, we mentioned in our use case that perhaps last name should be considered more valuable than first name. With cross field search, we can continue to leverage the field weighting available to us in multi_match. So a reasonable solution to our name search problem might take the shape of:

{
    'query': {
        'multi_match': {
            'query': 'tucker thomas',
            'fields': ['first_name', 'last_name^10'],
            'type': 'cross_fields'
        }
    }
}

Unfortunately, this search strategy is not always a complete replacement for an all field. If you want to truly blend multiple fields into a whole scorable unit, then an all field is your best bet. The all field will give you deep control over how the field is turned into tokens (and therefore scored) in a way that requires far less head scratching.

However, if you want a really good approximation, then cross_fields does a reasonably good job. I’m also hopeful (hey Elastic!) that work can be done to more accurately compute document frequency at query time. Could one, perhaps, create some way of efficiently diffing the postings between two fields for a term at query time? In fact, perhaps simply running an OR query for the term against both fields might be one (possibly pretty slow!) technique of computing the true document frequency. (Oh and we haven’t even talked about field norms!) Hopefully there’s work being done in this area!
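As a rough illustration of the OR-query idea, the request body below could be sent to the `_count` API to get the number of documents containing a term in either field. This is a sketch of the (possibly slow) workaround mused about above, not something cross_fields does internally. `true_df_query` is a hypothetical helper, and the `term` queries assume the term has already been analyzed the same way as the indexed tokens (for example, lowercased).

```python
import json


def true_df_query(term, fields):
    """Build a body that matches documents containing `term` in any of `fields`."""
    return {
        "query": {
            "bool": {
                # One optional clause per field; a document matching several is counted once.
                "should": [{"term": {field: term}} for field in fields],
                "minimum_should_match": 1,
            }
        }
    }


body = true_df_query("thomas", ["first_name", "last_name"])
print(json.dumps(body, indent=2))
```

Because a document that matches both clauses is still a single hit, the returned count avoids the double counting that summing per-field document frequencies would cause.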

If you’re stuck on a problematic Elasticsearch or Solr relevance problem, be sure to contact us! Don’t hesitate to get in touch to let us know if you have any thoughts, feedback or criticisms of this article!

photo from deviant art user Pandaxninja