
Better term-centric scoring in Elasticsearch with BM25F and the combined_fields query

Introduction

Elasticsearch version 7.13 introduced a new query, combined_fields, that brings better term-centric scoring to relevance engineers. Under the hood it uses the new Lucene query CombinedFieldQuery (formerly known as BM25FQuery), which implements BM25F, a widely accepted extension of BM25 for weighted multi-field search. Before 7.13, the multi_match query with "type": "cross_fields" (referred to as cross_fields for the remainder of this post) was the best option in Elasticsearch. This post discusses term-centric versus field-centric scoring and runs a bake-off between the scoring of the old (cross_fields) and the new (combined_fields).

Term-centric vs field-centric matters for scoring

Term-centric and field-centric are two alternative strategies for token-based scoring in ranking. In term-centric scoring the entire document is treated as one large field. This puts less importance on the sections within the document; the goal is better matching when tokens are spread out or repeated across multiple sections.

In field-centric scoring the original sections are scored independently, each section indexed as its own field with its own term statistics. The goal here is to reflect the varying importance of different sections, but this can create unevenness because IDF can vary widely between fields.

The behaviour of the commonly used minimum_should_match setting illustrates the difference between the two approaches. With "minimum_should_match": "100%", a field-centric query will require all tokens to match within a single field, whereas a term-centric query is more relaxed, requiring only that all tokens appear in the document – and these tokens could be in different fields.
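As a minimal sketch (using the tmdb index from later in this post; both queries accept a minimum_should_match parameter, though exact matching behaviour depends on your Elasticsearch version), compare the field-centric best_fields query with the term-centric combined_fields query:

GET tmdb/_search
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "fields": [
        "title",
        "overview",
        "tagline"
      ],
      "type": "best_fields",
      "minimum_should_match": "100%"
    }
  }
}

GET tmdb/_search
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": [
        "title",
        "overview",
        "tagline"
      ],
      "minimum_should_match": "100%"
    }
  }
}

The first query only matches movies where at least one single field contains all three tokens; the second also matches movies where the tokens are spread across title, overview and tagline.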

Old vs New in Elasticsearch (and Lucene)

In the old days (before v7.13) there was only one way to do term-centric scoring with field weighting: querying with multi_match and "type": "cross_fields", a.k.a. cross_fields. In Lucene the scoring for cross_fields was done by the BlendedTermQuery, which would mix the scores from individual fields based on user-supplied field weights.

As Elasticsearch expert Mark Harwood writes:

“Searching for Mark Harwood across firstname and lastname fields should certainly favour any firstname:Mark over a lastname:Mark. Cross-fields was originally created because in these sorts of scenarios IDF would (annoyingly) ensure exactly the wrong field for a term was ranked highest.”

The cross_fields query would largely negate IDF, in order to make scoring comparable across fields. Because this was originally conceived in the context of multi_match, there was also a desire to reward the “correct” field. To achieve this, the scoring function would add 1 to the document frequency of the most frequent field. While this worked in practice, the scoring was confusing and not grounded in theory.
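Schematically (a sketch, not Lucene’s exact code path), and as the _explain output later in this post shows with its "sum of:" over a per-term "max of:" across fields, a cross_fields score looks like:

$$score(d, q) = \sum_{t \in q} \max_{f \in fields} \mathrm{BM25}_f(t, d)$$

with document frequencies blended so that $idf$ is nearly equal across the fields. Let’s consider some example queries: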

cross_fields query
GET tmdb/_search
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ],
      "type": "cross_fields"
    }
  }
}
combined_fields query
GET tmdb/_search
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ]
    }
  }
}

The syntax for combined_fields is similar, but the scoring is different and is done by the new Lucene CombinedFieldQuery, which implements BM25F. This is a variant of BM25 that adds the ability to weight individual fields. The field weights act by multiplying the raw term frequency of a field before the individual field statistics are combined into document-level statistics. This does two big things: it captures relative field importance, and it establishes a more generalizable ranking formula than the one used by the cross_fields query.
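In simplified form, BM25F scores each term once against a pseudo-field whose term frequency and length are weighted sums over the individual fields. This is a sketch of the model; Lucene’s implementation approximates the combined field length, as the “approximate” notes in the _explain output below show:

$$\widetilde{tf}(t, d) = \sum_{f \in F} w_f \cdot tf_f(t, d) \qquad \widetilde{dl}(d) = \sum_{f \in F} w_f \cdot dl_f(d)$$

$$score(d, q) = \sum_{t \in q} idf(t) \cdot \frac{\widetilde{tf}(t, d)}{\widetilde{tf}(t, d) + k_1 \left(1 - b + b \cdot \widetilde{dl}(d) / \widetilde{avgdl}\right)}$$

where $w_f$ is the per-field weight (the ^3 and ^2 boosts above) and $idf(t)$ is computed from document frequencies on the combined pseudo-field.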

An example query

Using a version of The Movie Database (TMDB) dataset that we have in this Elasticsearch sandbox on GitHub, I want to show the difference between combined_fields and cross_fields.

_explain API

First let’s look at what the _explain API tells us about how each query scores the movie “Captain Marvel” (document 299537):

cross_fields
Request
GET tmdb/_explain/299537
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "fields": [
        "title",
        "overview",
        "tagline"
      ],
      "type": "cross_fields"
    }
  }
}
Response
{
  "_index" : "tmdb",
  "_type" : "_doc",
  "_id" : "299537",
  "matched" : true,
  "explanation" : {
    "value" : 14.744863,
    "description" : "sum of:",
    "details" : [
      {
        "value" : 10.636592,
        "description" : "max of:",
        "details" : [
          {
            "value" : 10.636592,
            "description" : "weight(overview:marvel in 1190) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 10.636592,
                "description" : "score(freq=2.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 7.7968216,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 3,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 8514,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.62010074,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 2.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 36.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 35.016327,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value" : 7.7574196,
            "description" : "weight(title:marvel in 1190) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 7.7574196,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 7.5453897,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 4,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 8513,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.46731842,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.1431928,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 4.1082706,
        "description" : "max of:",
        "details" : [
          {
            "value" : 4.1082706,
            "description" : "weight(overview:hero in 1190) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 4.1082706,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.1554832,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 133,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 8514,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.4493811,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 36.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 35.016327,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
combined_fields
Request
GET tmdb/_explain/299537
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": [
        "title",
        "overview",
        "tagline"
      ]
    }
  }
}
Response
{
  "_index" : "tmdb",
  "_type" : "_doc",
  "_id" : "299537",
  "matched" : true,
  "explanation" : {
    "value" : 16.732761,
    "description" : "sum of:",
    "details" : [
      {
        "value" : 12.370674,
        "description" : "weight(CombinedFieldQuery((overview tagline title)(marvel)) in 1190), result of:",
        "details" : [
          {
            "value" : 12.370674,
            "description" : "score(freq=3.0), computed as boost * idf * tf from:",
            "details" : [
              {
                "value" : 2.2,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 7.7968216,
                "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details" : [
                  {
                    "value" : 3,
                    "description" : "n, number of documents containing term",
                    "details" : [ ]
                  },
                  {
                    "value" : 8514,
                    "description" : "N, total number of documents with field",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.7211957,
                "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details" : [
                  {
                    "value" : 3.0,
                    "description" : "termFreq=3.0",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "k1, term saturation parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "b, length normalization parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 40.0,
                    "description" : "dl, length of field (approximate)",
                    "details" : [ ]
                  },
                  {
                    "value" : 41.87221,
                    "description" : "avgdl, average length of field",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 4.232909,
        "description" : "weight(CombinedFieldQuery((overview tagline title)(hero)) in 1190), result of:",
        "details" : [
          {
            "value" : 4.232909,
            "description" : "score(freq=1.0), computed as boost * idf * tf from:",
            "details" : [
              {
                "value" : 2.2,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 4.1554832,
                "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details" : [
                  {
                    "value" : 133,
                    "description" : "n, number of documents containing term",
                    "details" : [ ]
                  },
                  {
                    "value" : 8514,
                    "description" : "N, total number of documents with field",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.46301466,
                "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "termFreq=1.0",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "k1, term saturation parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "b, length normalization parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 40.0,
                    "description" : "dl, length of field (approximate)",
                    "details" : [ ]
                  },
                  {
                    "value" : 41.87221,
                    "description" : "avgdl, average length of field",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

In the _explain response for cross_fields, we can see that scoring is still done per field for each term before being rolled up (with a max) per term. In combined_fields this doesn’t happen, because each term is scored just once against the synthetic field representing the combination of “title”, “tagline” and “overview”. This single per-term scoring against the synthetic field homogenizes the term statistics that may have varied drastically between fields with cross_fields. (Note that only “marvel” and “hero” contribute clauses in either explanation; “green” does not match this document.)
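The combined statistics are easy to trace back to the per-field numbers in the cross_fields explanation. With all field weights equal to 1:

$$\widetilde{tf}(\text{marvel}) = \underbrace{1}_{title} + \underbrace{2}_{overview} + \underbrace{0}_{tagline} = 3$$

and the combined length dl = 40 is (approximately, because Lucene stores field lengths in a lossy encoding) the sum of the three per-field lengths.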

First page of results

Next, I compare the first page of results (size: 30) from each query as tables. I added the Jaccard set similarity to show how much overlap there is between the two result sets. A Jaccard similarity of 1.0 is perfect overlap, the same 30 items in both result sets. A Jaccard similarity of 0.0 is no overlap, 60 different items between the two queries. Remember that Jaccard similarity is set-based and does not factor in position.
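For two result sets $A$ and $B$, the Jaccard similarity is:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Here each set contains 30 titles and, counting from the tables below, 22 appear in both, so $J = 22 / 38 \approx 0.579$.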

Jaccard similarity: 0.579
combined_fields

rank  score      title
1     20.483114  Captain Marvel
2     16.441910  Green Lantern: First Flight
3     15.406511  Jimmy Vestvood: Amerikan Hero
4     13.150019  Hulk
5     12.759342  The Man Who Killed Don Quixote
6     12.038399  Justice League: War
7     10.916338  Maverick
8     10.763498  The Extra Man
9     10.158279  Green Lantern: Emerald Knights
10    10.123980  Rambo
11    9.909670   The Odd Life of Timothy Green
12    9.797913   The Green Inferno
13    9.777215   Green Lantern
14    9.688647   The Green Berets
15    9.402530   Revenge of the Green Dragons
16    9.362038   The Punisher
17    9.341401   Green Book
18    9.081026   Green Street Hooligans 2
19    8.764002   Blinky Bill the Movie
20    8.744556   Chain Reaction
21    8.648538   Green Room
22    8.553925   How Green Was My Valley
23    8.370777   Fried Green Tomatoes
24    8.282112   Green Mansions
25    8.211758   Big Trouble in Little China
26    8.195307   The Green Mile
27    8.099191   Hardball
28    7.975277   Taxi
29    7.816612   Last Action Hero
30    7.787214   Green Zone
cross_fields

rank  score     title
1     31.48880  Captain Marvel
2     21.73690  Jimmy Vestvood: Amerikan Hero
3     18.18846  Green Lantern: First Flight
4     17.29990  Hulk
5     16.13631  Green Mansions
6     16.13631  The Green Berets
7     16.13631  Green Zone
8     16.13631  The Green Hornet
9     16.13631  Green Room
10    16.13631  Green Lantern
11    16.13631  The Green Mile
12    16.13631  Green Book
13    16.13631  The Green Inferno
14    15.88856  Heroes
15    15.88856  Hero
16    15.71288  Green Lantern: Emerald Knights
17    15.45730  Justice League: War
18    15.06839  Maverick
19    14.37306  Chain Reaction
20    13.89266  Blinky Bill the Movie
21    13.58447  The Extra Man
22    13.53983  The Punisher
23    13.48916  Revenge of the Green Dragons
24    13.48916  Fried Green Tomatoes
25    13.11556  The Odd Life of Timothy Green
26    12.77054  Hero Wanted
27    12.77054  Heroes for Sale
28    12.77054  Almost Heroes
29    12.77054  Everyone’s Hero
30    12.77054  Kelly’s Heroes

The Jaccard similarity of 0.579 highlights that a lot of different documents are surfaced by the combined_fields query compared to cross_fields. In this example 22 results are shared between the queries, and each query returns 8 results that the other does not. This doesn’t mean the differences are bad (or good), but it does mean there is major churn in the rankings between the two queries.

Another view of the same data, as a scatter plot, shows the changes in position and score for individual movies more clearly. The x-axis is the score from the cross_fields query and the y-axis is the score from the combined_fields query. Each dot is a document, and the dot color represents the positional shift when switching from cross_fields to combined_fields. Some documents were not included in both result sets, so they are represented as a tick mark along the axis of the query where they were retrieved.

The top several results are consistent, and the golden result “Hulk” is returned at position #4 by both queries. Note the score plateau in cross_fields at a score of 16.13: all of those documents got identical scores, so their relative position in the final ranked list is decided by the order in which they were indexed. This arbitrary tie-breaking doesn’t happen in combined_fields, because scoring against a single large field doesn’t produce the same plateau effect.

Visualizing search data like this is a great way to glean insights you might miss in big tables. Tables are great for inspecting individual records or comparing a handful of items, but graphics communicate better when many data points are involved. Search is a “medium” data problem, with lots of queries and lots of results, so getting a good graphical grip on how it is performing will always help.

To the future with term-centric scoring

If you were using cross_fields, switching to combined_fields will shake up your results. But the benefits of BM25F (wide acceptance and more interpretable scoring) might make it worth it.

Besides the differences in scoring, introducing combined_fields clarifies the split between term-centric and field-centric in the Elasticsearch API. Now we have multi_match for field-centric search and combined_fields for term-centric search. Having a clear API is a big reason why I think Elasticsearch has been so successful, so I’m really happy to see this trend continue.

I’m also pleased to see the effort Elastic is committing to keeping Elasticsearch (and Lucene) current with the best methods from academic publications. HNSW approximate nearest neighbor search (vector search) is right around the corner for Lucene, and Elastic is active in that effort too.

Do join us in Relevance Slack and let me know your comments or feedback – and if we can help you with these tricky scoring issues on your Elasticsearch cluster, get in touch.
