Introduction
Elasticsearch version 7.13 introduced a new query, combined_fields, that brings better term-centric scoring to relevance engineers. Under the hood it uses the new Lucene query CombinedFieldQuery (formerly known as BM25FQuery), which implements BM25F, a widely accepted extension of BM25 for weighted multi-field search. Before 7.13, the multi_match query with "type": "cross_fields" (referred to as cross_fields for the remainder of this post) was the best option in Elasticsearch. This post discusses term-centric versus field-centric scoring and does a bake-off between the scoring of the old (cross_fields) and the new (combined_fields).
Term vs Field-centric is important for scoring
Term-centric and field-centric scoring are two alternative strategies for token-based ranking. In term-centric scoring the entire document is treated as one large field. This puts less importance on the sections within the document, the goal being better matching when tokens are spread out or repeated across multiple sections.
In field-centric scoring the original sections are scored independently, each section in its own indexed field with its own term statistics. The goal here is to reflect the varying importance of different sections, but this can create unevenness because IDF can vary widely between fields.
The behaviour of the commonly used minimum_should_match setting illustrates the difference between the two approaches. With "minimum_should_match": "100%", a field-centric query will require all tokens to match within a single field, whereas a term-centric query is more relaxed, requiring only that all tokens appear somewhere in the document, possibly spread across different fields.
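To make that concrete, here is a minimal sketch against the TMDB index used later in this post (the specific fields and query string are just for illustration). The first request is field-centric: with minimum_should_match at 100%, a document only matches if a single field contains all three tokens. The second uses the term-centric combined_fields, which only requires that the three tokens appear somewhere across the listed fields.
GET tmdb/_search
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "type": "best_fields",
      "fields": [ "title", "overview", "tagline" ],
      "minimum_should_match": "100%"
    }
  }
}

GET tmdb/_search
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": [ "title", "overview", "tagline" ],
      "minimum_should_match": "100%"
    }
  }
}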
Old vs New in Elasticsearch (and Lucene)
In the old days (before v7.13) there was only one way to do term-centric scoring with field weighting: querying with multi_match and "type": "cross_fields", a.k.a. cross_fields. In Lucene, the scoring for cross_fields was done by the BlendedTermQuery, which would mix the scores from the individual fields based on user-supplied field weights.
As Elasticsearch expert Mark Harwood writes:
“Searching for Mark Harwood across firstname and lastname fields should certainly favour any firstname:Mark over a lastname:Mark. Cross-fields was originally created because in these sorts of scenarios IDF would (annoyingly) ensure exactly the wrong field for a term was ranked highest.”
The cross_fields query would negate IDF for the most part, in order to ensure that scoring was similar across fields. Because this was originally conceived in the context of multi_match, there was also a desire to reward the “correct” field. To achieve this, the scoring function would add 1 to the document frequency of the most frequent field. While this worked in practice, the scoring was confusing and not grounded in theory. Let’s consider some example queries:
cross_fields query
GET tmdb/_search
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ],
      "type": "cross_fields"
    }
  }
}
combined_fields query
GET tmdb/_search
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ]
    }
  }
}
The syntax for combined_fields is similar, but the scoring is different: it is done by the new Lucene CombinedFieldQuery, which implements BM25F. BM25F is a variant of BM25 that adds the ability to weight individual fields. The field weights act by multiplying the raw term frequency of each field before the individual field statistics are combined into document-level statistics. This does two big things: it captures relative field importance, and it establishes a more generalizable formula for ranking than the one used by the cross_fields query.
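Sketching that in formula form (this is my simplified reading of the BM25F variant behind combined_fields, not the exact Lucene implementation): each field f contributes its term frequency tf_f, scaled by its weight w_f, to a single pseudo-field, and BM25 is then applied once to that pseudo-field:

\[
\widetilde{tf}(t,d) = \sum_{f} w_f \cdot tf_f(t,d), \qquad
\mathrm{score}(t,d) = \mathrm{idf}(t) \cdot \frac{\widetilde{tf}(t,d)}{\widetilde{tf}(t,d) + k_1 \left(1 - b + b \cdot \frac{dl(d)}{avgdl}\right)}
\]

Here dl and avgdl are the length and average length of the combined pseudo-field (reported as approximate in the _explain output later in this post).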
An example query
Using a version of The Movie Database (TMDB) that we have in this Elasticsearch sandbox on Github, I want to show the difference between combined_fields and cross_fields.
_explain API
First let’s look at what the _explain API tells us for the “Captain Marvel” document (id 299537) under each query:
cross_fields
Request
GET tmdb/_explain/299537
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "fields": [ "title", "overview", "tagline" ],
      "type": "cross_fields"
    }
  }
}
Response
{ "_index" : "tmdb", "_type" : "_doc", "_id" : "299537", "matched" : true, "explanation" : { "value" : 14.744863, "description" : "sum of:", "details" : [ { "value" : 10.636592, "description" : "max of:", "details" : [ { "value" : 10.636592, "description" : "weight(overview:marvel in 1190) [PerFieldSimilarity], result of:", "details" : [ { "value" : 10.636592, "description" : "score(freq=2.0), computed as boost * idf * tf from:", "details" : [ { "value" : 2.2, "description" : "boost", "details" : [ ] }, { "value" : 7.7968216, "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:", "details" : [ { "value" : 3, "description" : "n, number of documents containing term", "details" : [ ] }, { "value" : 8514, "description" : "N, total number of documents with field", "details" : [ ] } ] }, { "value" : 0.62010074, "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:", "details" : [ { "value" : 2.0, "description" : "freq, occurrences of term within document", "details" : [ ] }, { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] }, { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] }, { "value" : 36.0, "description" : "dl, length of field", "details" : [ ] }, { "value" : 35.016327, "description" : "avgdl, average length of field", "details" : [ ] } ] } ] } ] }, { "value" : 7.7574196, "description" : "weight(title:marvel in 1190) [PerFieldSimilarity], result of:", "details" : [ { "value" : 7.7574196, "description" : "score(freq=1.0), computed as boost * idf * tf from:", "details" : [ { "value" : 2.2, "description" : "boost", "details" : [ ] }, { "value" : 7.5453897, "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:", "details" : [ { "value" : 4, "description" : "n, number of documents containing term", "details" : [ ] }, { "value" : 8513, "description" : "N, total number of documents with field", "details" : [ ] } ] }, { "value" : 0.46731842, "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:", "details" : [ { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] }, { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] }, { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] }, { "value" : 2.0, "description" : "dl, length of field", "details" : [ ] }, { "value" : 2.1431928, "description" : "avgdl, average length of field", "details" : [ ] } ] } ] } ] } ] }, { "value" : 4.1082706, "description" : "max of:", "details" : [ { "value" : 4.1082706, "description" : "weight(overview:hero in 1190) [PerFieldSimilarity], result of:", "details" : [ { "value" : 4.1082706, "description" : "score(freq=1.0), computed as boost * idf * tf from:", "details" : [ { "value" : 2.2, "description" : "boost", "details" : [ ] }, { "value" : 4.1554832, "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:", "details" : [ { "value" : 133, "description" : "n, number of documents containing term", "details" : [ ] }, { "value" : 8514, "description" : "N, total number of documents with field", "details" : [ ] } ] }, { "value" : 0.4493811, "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:", "details" : [ { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] }, { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] }, { "value" : 0.75, 
"description" : "b, length normalization parameter", "details" : [ ] }, { "value" : 36.0, "description" : "dl, length of field", "details" : [ ] }, { "value" : 35.016327, "description" : "avgdl, average length of field", "details" : [ ] } ] } ] } ] } ] } ] } }
combined_fields
Request
GET tmdb/_explain/299537
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": [ "title", "overview", "tagline" ]
    }
  }
}
Response
{ "_index" : "tmdb", "_type" : "_doc", "_id" : "299537", "matched" : true, "explanation" : { "value" : 16.732761, "description" : "sum of:", "details" : [ { "value" : 12.370674, "description" : "weight(CombinedFieldQuery((overview tagline title)(marvel)) in 1190), result of:", "details" : [ { "value" : 12.370674, "description" : "score(freq=3.0), computed as boost * idf * tf from:", "details" : [ { "value" : 2.2, "description" : "boost", "details" : [ ] }, { "value" : 7.7968216, "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:", "details" : [ { "value" : 3, "description" : "n, number of documents containing term", "details" : [ ] }, { "value" : 8514, "description" : "N, total number of documents with field", "details" : [ ] } ] }, { "value" : 0.7211957, "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:", "details" : [ { "value" : 3.0, "description" : "termFreq=3.0", "details" : [ ] }, { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] }, { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] }, { "value" : 40.0, "description" : "dl, length of field (approximate)", "details" : [ ] }, { "value" : 41.87221, "description" : "avgdl, average length of field", "details" : [ ] } ] } ] } ] }, { "value" : 4.232909, "description" : "weight(CombinedFieldQuery((overview tagline title)(hero)) in 1190), result of:", "details" : [ { "value" : 4.232909, "description" : "score(freq=1.0), computed as boost * idf * tf from:", "details" : [ { "value" : 2.2, "description" : "boost", "details" : [ ] }, { "value" : 4.1554832, "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:", "details" : [ { "value" : 133, "description" : "n, number of documents containing term", "details" : [ ] }, { "value" : 8514, "description" : "N, total number of documents with field", "details" : [ ] } ] }, { "value" : 0.46301466, "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:", "details" : [ { "value" : 1.0, "description" : "termFreq=1.0", "details" : [ ] }, { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] }, { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] }, { "value" : 40.0, "description" : "dl, length of field (approximate)", "details" : [ ] }, { "value" : 41.87221, "description" : "avgdl, average length of field", "details" : [ ] } ] } ] } ] } ] } }
In the _explain response for cross_fields, we can see that scoring is still done per field for each term before the per-field scores are rolled up (the “max of” over each term’s field scores). In combined_fields this doesn’t happen, because each term is scored just once against a synthetic field representing the combination of “title”, “tagline” and “overview”. Scoring each term once against the synthetic field homogenizes term statistics that may have varied drastically between fields with cross_fields.
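You can see the pseudo-field mechanics in the numbers above. Under cross_fields the term marvel is scored separately with freq=2.0 in overview and freq=1.0 in title; under combined_fields, with no per-field boosts in the _explain request (so every field weight is effectively 1), the single CombinedFieldQuery clause reports termFreq=3.0, which lines up with summing the per-field frequencies into one pseudo-field frequency:

\[
\widetilde{tf}(\text{marvel}) = 1 \cdot tf_{title} + 1 \cdot tf_{overview} + 1 \cdot tf_{tagline} = 1 + 2 + 0 = 3
\]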
First page of results
Next, I compare the first page of results (size: 30) as tables. I added the Jaccard set similarity to show how much overlap there is between the two result sets. A Jaccard similarity of 1.0 is perfect overlap, the same 30 items in both result sets. A Jaccard similarity of 0.0 is no overlap, so 60 different items between the two queries. Remember that Jaccard similarity is set-based and does not factor in position.
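For reference, the Jaccard similarity of the two result sets A (cross_fields) and B (combined_fields) is simply the size of their intersection over the size of their union:

\[
J(A,B) = \frac{|A \cap B|}{|A \cup B|}
\]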
[Table: first page of results for the cross_fields and combined_fields queries side by side. Jaccard similarity: 0.579]
The Jaccard similarity of 0.579 highlights that a lot of different documents are being surfaced by the combined_fields query compared to cross_fields. In this example 34 results are shared between the queries, but 26 are unique to one or the other. This doesn’t mean the differences are bad (or good), but it does mean there is some major churn in rankings between the two queries.
Another view of that same data, a scatter plot, better shows the changes in position and score for individual movies. The x-axis is the score from the cross_fields query and the y-axis is the score from the combined_fields query. Each dot is a document, and the dot color represents the positional shift when switching from cross_fields to combined_fields. Some documents were not included in the results of both queries, so they are represented as a tick mark along the axis of the query where they were retrieved.
The top several results are consistent, and the golden result “Hulk” is returned in position #4 for both queries. Note the score plateau in cross_fields at a score of 16.13: all of those documents got identical scores, so their relative position in the final ranked list is decided by the order in which they were indexed. This arbitrary tie-breaking doesn’t happen in combined_fields because there isn’t the same plateau effect with a single large field.
Visualizing search data like this is a great way to glean insights you might miss in bigger tables. Tables are great for inspecting individual records or comparing a handful of items, but graphics are a better form of communication when many data points are involved. Search is a “medium” data problem, with lots of queries and lots of results, so getting a good graphical grip on how it is performing will always help.
To the future with term-centric scoring
If you were using cross_fields, switching to combined_fields will shake up your results. But the benefits of BM25F (general acceptance and scoring interpretability) might make it worth it.
Besides differences in scoring, introducing combined_fields clarifies the split between term-centric and field-centric in the Elasticsearch API. Now we have multi_match for field-centric and combined_fields for term-centric. Having a clear API is a big reason why I think Elasticsearch has been so successful, so I’m really happy to see this trend continue.
I’m also pleased to see the effort Elastic is committing to keeping Elasticsearch (and Lucene) current with the best methods from academic publications. HNSW approximate nearest neighbor search (vector search) is right around the corner for Lucene, and Elastic is active in that effort too.
Do join us in Relevance Slack and let me know your comments or feedback – and if we can help you with these tricky scoring issues on your Elasticsearch cluster, get in touch.