Digging into Term Statistics with Elasticsearch LTR

October 28, 2020 Dan Worley
Category: Elasticsearch

We built the Elasticsearch Learning to Rank (LTR) plugin with the Wikimedia Foundation several years ago, and we’ve been pleased to see it adopted by many companies wanting to use the power of machine learning to improve search. When the plugin was created, there was an initial need to surface low-level Lucene statistics for feature engineering, and this was accomplished with the ExplorerQuery. The ExplorerQuery could produce some hard-coded aggregations of term statistics and positions, but it did not allow much customization without submitting a GitHub pull request to add custom formulas. As pull requests began to come in, we realized it would be beneficial to feature engineers if they could experiment with new ideas without needing to modify Java code.

Introducing the TermStatQuery

After much brainstorming, the TermStatQuery was devised. The idea was to create a query much like the ExplorerQuery that, at its heart, uses a Lucene expression to gather statistics. Because of this, feature engineers are no longer limited to what is baked into the plugin; they can craft any formula compatible with Lucene expressions to generate their final output. The term statistic gathering code was also refactored to improve performance and make the specification of terms much more explicit.
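To make this concrete, here is a minimal sketch of what a TermStatQuery request body might look like, built as a Python dictionary. It assumes the query is exposed as term_stat and accepts expr, aggr, terms and fields parameters, which is our reading of the plugin documentation; the helper name build_term_stat_query is hypothetical, and the parameter names should be verified against the plugin version you run.

```python
import json


def build_term_stat_query(expr, aggr, terms, fields):
    """Build a search body containing a single term_stat query.

    Assumption: the plugin exposes the query as `term_stat` with
    `expr`, `aggr`, `terms` and `fields` parameters.
    """
    return {
        "query": {
            "term_stat": {
                "expr": expr,          # Lucene expression evaluated for each term
                "aggr": aggr,          # how the per-term values collapse into one score
                "terms": list(terms),  # the terms to gather statistics for
                "fields": list(fields),
            }
        }
    }


# Illustrative values only: the largest document frequency among two terms
body = build_term_stat_query("df", "max", ["rambo", "rocky"], ["title"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to a search endpoint with whichever client you prefer; nothing in the sketch talks to a cluster.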

Usages

If you’re a feature engineer working with Learning to Rank, you may find the following new functionality of interest:

Custom Lucene Expression Support

If you’ve ever wanted to make use of two stat types in the same feature, then TermStatQuery is for you. It injects the following term statistic types into an expression context for your usage:

df — The document frequency of a term
idf — The inverse document frequency of a term using ‘classic’ similarity
tf — The term frequency of a term in a given document
ttf — The total term frequency of a term across the entire index
tp — The position of the term in the document
unique — The count of unique terms passed in
matches — The number of terms that matched a given document

For example, to compute a classic relevance ranking score, you can specify the expression tf * idf, which is evaluated for each of the terms you pass in rather than for every term in the index.
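The arithmetic behind a tf * idf expression can be sketched offline. The example below assumes that ‘classic’ refers to Lucene's ClassicSimilarity, whose idf is 1 + ln((docCount + 1) / (docFreq + 1)), and that the per-term results are then collapsed by an aggregation. The function names, the aggregation table and the numbers are all made up for illustration; this is not the plugin's own code.

```python
import math


def classic_idf(doc_freq, doc_count):
    """idf as defined by Lucene's ClassicSimilarity: 1 + ln((N + 1) / (df + 1))."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


def tf_idf_per_term(term_freqs, doc_freqs, doc_count):
    """Evaluate tf * idf once per term, as an expression would be."""
    return [tf * classic_idf(df, doc_count) for tf, df in zip(term_freqs, doc_freqs)]


# Hypothetical aggregation table; the names are illustrative
AGGREGATIONS = {
    "min": min,
    "max": max,
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
}

# Made-up statistics for two terms in one document of a 99-document index
term_freqs = [2, 0]   # tf: occurrences of each term in the document
doc_freqs = [9, 99]   # df: number of documents containing each term
scores = tf_idf_per_term(term_freqs, doc_freqs, doc_count=99)

for name, aggregate in AGGREGATIONS.items():
    print(name, aggregate(scores))
```

A term that appears in every document gets an idf of exactly 1 under this formula, so its contribution depends only on tf.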

ScriptFeature Injection

Because Lucene expressions have limitations with array support, they always have to work with a statistical aggregation of the term statistics. However, if you need access to all of the raw values, then ScriptFeature comes to the rescue! If you pass in a term_stat object to a ScriptFeature, it will run the same term statistic collection code that the TermStatQuery runs under the hood, and it will then inject the data into a script context for your usage.

To access the raw term positions and carry out custom logic depending on different conditions, you could do the following:

{
    "params": {
        "term_stat": {
            "terms": ["rambo", "rocky", "bullwinkle"],
            "fields": ["title"]
        }
    }
}

The above parameters would activate the Term Statistic gathering. Then an example script would look something like:

 // Example logic showing term count and custom logic based on position match

 // If there were at least 3 terms
 if (params.terms['tp'].length >= 3) {
    // If the second term matched after the first
    if (params.terms['tp'][1] > params.terms['tp'][0]) {
        return 50.0;
    }
 }
 // Every other path falls through to a default value
 return 0.0;
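Before deploying a script like this, it can help to prototype the same logic offline against hand-written inputs. The sketch below mirrors the intent of the script in Python, returning 0.0 whenever the positional condition is not met. It assumes the injected data behaves like a mapping from statistic name to a list with one value per term; that shape is inferred from the script above rather than taken from the plugin source, and position_feature is a hypothetical name.

```python
def position_feature(term_stats):
    """Return 50.0 when at least three positions are present and the
    second term's position follows the first; otherwise return 0.0.

    Assumption: term_stats maps a statistic name such as "tp" to a
    list holding one value per term.
    """
    positions = term_stats["tp"]
    if len(positions) >= 3 and positions[1] > positions[0]:
        return 50.0
    return 0.0


# Made-up positions for three terms
print(position_feature({"tp": [1, 4, 7]}))
print(position_feature({"tp": [4, 1, 7]}))
```

Keeping the logic in a small pure function like this makes it easy to try variations before translating the winning one back into a script.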

Limitations & The Future

Currently, the TermStatQuery only works with single terms. In the future we may look at adding phrase and span support. Using the TermStatQuery by itself will provide the best performance, but the ScriptFeature injection functionality may be useful for experimenting with new features. If a promising feature built with the ScriptFeature degrades performance, it may make sense to write a custom plugin in Java to reduce response times.

If your interest has been piqued, be on the lookout for a follow-up article that will take a deep dive into how the query works while covering some more advanced use cases. If you run into issues or have questions, please don’t hesitate to join us on Relevance Slack. Our Learning to Rank training is a great place to start your team on a project of this type, and do get in touch if you need our help directly.