Blog

BM25F in Lucene with BlendedTermQuery

As part of the London hack days Diego Ceccarelli started a BM25F implementation. I began to continue it at Lucene Revolution’s Lucene hackathon. I realized though that when you break down the problem, BM25F can be implemented using existing Lucene bits, including the existing BM25Similarity and the BlendedTermQuery.

BM25 and BM25F

I’ve written about BM25 before. BM25 is an iteration on the classic TF*IDF ranking formula, but with several innovations:

  • Term Frequency Saturation: We know more times a search term occurs in a document, the more we should consider this document relevant for that search term. But with BM25, we realize at some point you reach diminishing returns. BM25 reaches this point relatively quickly, compared to classic TF*IDF.
  • Smarter document length weighting: A search term occurring once in a short doc is more relevant than a single term occurring in a longer doc (a book). BM25 penalizes/rewards document length relative to a document’s average document length, as opposed to just having a constant multiple based on document length
  • Basically the same IDF calculation: Rare search terms (low doc freq; high IDF) get weighted more heavily than common search terms. The BM25 calculation isn’t that much different than TF*IDF’s calculation

BM25F performs per-field BM25 calculation, but uses the shared document frequency across multiple fields. This document frequency blending is important. For example, sometimes you have scenarios where a common term, say “cat” is actually rare in one particular field (say the “title” field). But when you broaden out to other fields, you then realize that cat is particularly common, and shouldn’t be scored so highly.

In other words, BM25F basically does

CombinedIDF * ( BM25_title + BM25_description + …)

(note there are all kinds of subtle variants)

Can we implement that using existing Lucene bits? I think so… with a few caveats that Diego has pointed out. You can follow along by looking at this github repo with the sample code from this post.

Per-field BM25Similarity

BM25F lets us configure BM25 parameters per field. Luckily, per-field similarity is pretty easy to configure in Lucene using a PerFieldSimilarityWrapper. We simply need to setup our index accordingly. Notice in the similarity below, k1 and b differ for title and description:

static Similarity perFieldSimilarities =  new PerFieldSimilarityWrapper() {
      @Override
      public Similarity get(String name) {
          if (name.equals("title")) {
              return new BM25FSimilarity(/*k1*/1.2f, /*b*/0.8f);
          } else if (name.equals("description")) {
              return new BM25FSimilarity(/*k1*/1.4f, /*b*/0.9f);
          }
          return new BM25FSimilarity();
      }
};

Then when we setup the IndexReader

IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setSimilarity(perFieldSimilarities);

IndexWriter w = new IndexWriter(index, config);

BlendedTermQuery to implement BM25F

Lucene’s BlendedTermQuery forms the guts behind Elasticsearch’s cross_field search. What BlendedTermQuery does is blend global term stats as best as possible across different fields. If the document frequency for cat in title is 3, but the document frequency for cat in the description field is 50, it takes the maximum, 50. This approximates the actual document frequency across both fields (we can’t figure out overlaps very efficiently at query time).

This value is then used for searching both title and description, accounting for the true rareness of the term. To do this, it builds a set of Lucene TermQueries and tells the TermQuery for each field to use the blended document frequency, instead of what’s in the index for that field. So you can imagine that now you have two queries description:cat (*with doc freq 50) and title:cat (*with doc freq 50)

BlendedTermQuery then gives you two options for combining these queries. You can either as a dismax query (take the max of the underlying field scores) or a boolean query (sum the underlying field scores). In other words, you can pick between one of two ranking functions:

  • Dismax: The best scoring field wins, with an optional tie-breaker parameter to incorporate other field scores. In other words: max( description:cat (with doc freq 50), title:cat (with doc freq 50), …)
  • Boolean: A summation of the field scores, in other words description:cat (with doc freq 50) + tilte:cat (with doc freq 50) + …

Examining the math above, we really care about the latter form for BM25F as it involves a summation. To implement that, we simply use the following code.

BlendedTermQuery query = new BlendedTermQuery.Builder()
				 .add(new Term("title", "cat"), /*boost*/1.0f)
				 .add(new Term("description", "cat"), /*boost*/1.0f)
				 .setRewriteMethod(BlendedTermQuery.BOOLEAN_REWRITE)
				 .build();

Checking the ranking function

So with the BlendedTermQuery above, what we have now is something like the following ranking function

description:cat (with BM25, docfreq=50) + title:cat (with BM25, docfreq=50)

Which is the same as

IDF( docfreq=50) * (description:cat with BM25) +  IDF(docfreq=50) * (title:cat with BM25) + 

Here description:cat with BM25 means the non-IDF portion of the BM25 calculation (the term frequency and length part, specific to this field, boosts for this field, etc). The IDF(docfreq=X) is the BM25 IDF formula for a term with a given document frequency.

Now it turns out, this is (pretty much) BM25F! All we need to do to make this BM25F is to factor the IDF (docfreq=50) out to see the formula we have above

IDF( docfreq=50) * ( (description:cat with BM25) +  (title:cat with BM25) )

A few subtle differences

There’s a few reasons this isn’t quite BM25F. First of all Lucene’s boolean query uses a coordinating factor (coord) to reward/punish documents that match all the clauses. So:

IDF( docfreq=50) * ( (description:cat with BM25) +  (title:cat with BM25) )

Is actually

coord * IDF( docfreq=50) * ( (description:cat with BM25) +  (title:cat with BM25) )

If you recall coord is the number of matched clauses / number of total clauses. So if only 1 out of 2 clauses match, coord is ½. This further rewards documents that match more than one clause. Latest versions of Lucene have removed coord

Another thing to point out is that it’s difficult to piece together the true cross-field document frequency at query time. So the maximum is taken as an approximation. The actual blended document frequency isn’t entirely accurate, but we know it ranges somewhere between the max of the two field doc frequencies (when there’s 100% overlap) and the sum (when the fields mention the term in different documents).

And that’s it! Get in touch.

That’s it! Get in touch – I’d love to hear feedback if I missed anything. And if you want to pick my brain for free check out the lunch and learns. I’m about to announce some new topics that might interest your team! And as always, don’t hesitate to reach out about our relevancy services – we just got this great testimonial from Careerbuilder about our services:

OSC’s Solr/Lucene knowledge expanded and greatly improved the abilities of our public job search. They delivered technical excellence at every turn: demonstrating expertise in Lucene internals, relevance models, and data science backed by a solid methodology for improving search relevance. On a deliverables front: they learned our legacy search stack quickly and made high-quality code contributions. OSC marries technical excellence with strategic insight: we highly recommend their experts to any search team.