Using CustomScoreQuery For Custom Solr/Lucene Scoring

March 12, 2014 Doug Turnbull

This is a preview of a talk I’ll be giving, entitled “Hacking Lucene for Custom Search Results”, at ApacheCon. Come join me April 7-9 in Denver!

Lucene is the Swiss Army knife of fuzzy sorting!

Previously, I guided you through implementing a custom Lucene query and scorer. Before I introduced you to that ultimate level of control, I listed the things you should try first to improve your relevancy before getting that low-level. As a reminder, here’s Doug Turnbull’s official list of stuff to try to improve your relevancy®, ordered from easy to hard:

1. You’ve utilized Solr’s extensive set of query parsers & features including function queries, joins, etc. None of this solved your problem.
2. You’ve exhausted the ecosystem of plugins that extend the capabilities in (1). That didn’t work.
3. You’ve implemented your own query parser plugin that takes user input and generates existing Lucene queries to do this work. This still didn’t solve your problem.
4. You’ve thought carefully about your analyzers – massaging your data so that at index time and query time, text lines up exactly as it should to optimize the behavior of existing search scoring. This still didn’t get what you wanted.
5. You’ve implemented your own custom Similarity that modifies how Lucene calculates the traditional relevancy statistics – query norms, term frequency, etc.
6. You’ve tried to use Lucene’s CustomScoreQuery to wrap an existing Query and alter each document’s score via a callback. This still wasn’t low-level enough for you; you needed even more control.

One item stands out on that list as a little low-level but not quite as bad as building a custom Lucene query: CustomScoreQuery. When you implement your own Lucene query, you’re taking control of two things:

Matching – what documents should be included in the search results
Scoring – what score should be assigned to a document (and therefore what order should they appear in)

Frequently you’ll find that existing Lucene queries will do fine with matching but you’d like to take control of just the scoring/ordering. That’s what CustomScoreQuery gives you – the ability to wrap another Lucene Query and rescore it.

For example, let’s say you’re searching our favorite dataset – SciFi Stackexchange, a Q&A site dedicated to nerdy SciFi and Fantasy questions. The posts on the site are tagged by topic: “star-trek”, “star-wars”, etc. Let’s say that, for whatever reason, we want to search for a tag and order the results by the number of tags, so that the questions with the most tags are sorted to the top.
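To make the goal concrete before touching Lucene, here is a plain-Java sketch of the behavior we’re after: match on one tag, then order the matches by how many tags each question has. Nothing here is Lucene API. The `TagCountRanking` and `Question` classes, the question titles, and the tag lists are all hypothetical stand-ins, not real Stack Exchange data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of the goal (no Lucene involved): match on a tag,
// then rank by the number of tags. All names and data are hypothetical.
final class TagCountRanking {

    static final class Question {
        final String title;
        final List<String> tags;

        Question(String title, String... tags) {
            this.title = title;
            this.tags = Arrays.asList(tags);
        }
    }

    static List<Question> search(List<Question> all, String tag) {
        // Matching: keep only the questions that carry the tag.
        List<Question> hits = new ArrayList<>();
        for (Question q : all) {
            if (q.tags.contains(tag)) {
                hits.add(q);
            }
        }
        // Scoring: the "score" is the number of tags; highest first.
        hits.sort(Comparator.comparingInt((Question q) -> q.tags.size()).reversed());
        return hits;
    }

    static List<Question> sampleQuestions() {
        List<Question> all = new ArrayList<>();
        all.add(new Question("Why do Klingons look different?", "star-trek"));
        all.add(new Question("Warp drive vs. hyperdrive", "star-trek", "star-wars", "technology"));
        all.add(new Question("Who shot first?", "star-wars"));
        all.add(new Question("How do transporters work?", "star-trek", "physics"));
        return all;
    }

    public static void main(String[] args) {
        for (Question q : search(sampleQuestions(), "star-trek")) {
            System.out.println(q.tags.size() + " tag(s): " + q.title);
        }
    }
}
```

Running `main` lists the three star-trek questions with the three-tag question first and the one-tag question last; the star-wars-only question never matches. The matching step and the scoring step are deliberately separate here, which is the same split the rest of this post relies on.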

In this example, a simple TermQuery could be sufficient for matching. To identify the questions tagged Star Trek with Lucene, you’d simply run the following query:

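    // Match every question whose "tag" field contains the term "star-trek".
    Term termToSearch = new Term("tag", "star-trek");
    TermQuery starTrekQ = new TermQuery(termToSearch);
    // IndexSearcher.search needs a hit limit; 10 is an arbitrary choice here.
    TopDocs results = searcher.search(starTrekQ, 10);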

If we examined the order of the results of this search, they’d come back in default TF-IDF order.

With CustomScoreQuery, we can intercept the matching query and assign each matching document a new score, thus altering the order.

Step 1 Override CustomScoreQuery to create our own custom scored query class:

(Note: this code can be found in this GitHub repo.)

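    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.queries.CustomScoreProvider;
    import org.apache.lucene.queries.CustomScoreQuery;
    import org.apache.lucene.search.Query;

    // Wraps any Query and hands scoring over to CountingQueryScoreProvider.
    public class CountingQuery extends CustomScoreQuery {

        public CountingQuery(Query subQuery) {
            super(subQuery);
        }

        @Override
        protected CustomScoreProvider getCustomScoreProvider(
                AtomicReaderContext context) throws IOException {
            return new CountingQueryScoreProvider("tag", context);
        }
    }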

Notice getCustomScoreProvider: this is where we’ll return an object that provides the magic we need. It takes an AtomicReaderContext, which wraps the IndexReader for a single segment of the index. If you recall, this hooks us into all the data structures available for scoring a document: Lucene’s inverted index, term vectors, etc.
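One detail that is easy to trip over: as far as I understand Lucene 4.x, getCustomScoreProvider is called once per index segment, and the document IDs later handed to the provider are relative to that segment rather than to the whole index. The sketch below is a plain-Java model of that ID arithmetic, not Lucene code; `SegmentDocIds`, `docBases`, and `toGlobal` are hypothetical names, and the segment sizes are made up.

```java
// Plain-Java model (no Lucene classes) of segment-relative document IDs.
// Class and method names are hypothetical, chosen for illustration.
final class SegmentDocIds {

    // Given each segment's document count, compute each segment's docBase:
    // the index-wide ID of that segment's first document.
    static int[] docBases(int[] segmentSizes) {
        int[] bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;
            next += segmentSizes[i];
        }
        return bases;
    }

    // Convert a segment-relative ID into an index-wide ID.
    static int toGlobal(int docBase, int segmentDoc) {
        return docBase + segmentDoc;
    }

    public static void main(String[] args) {
        int[] bases = docBases(new int[] {100, 40, 7});
        // Segment-relative doc 5 of the third segment, as an index-wide ID.
        System.out.println(toGlobal(bases[2], 5)); // prints 145
    }
}
```

The practical upshot is that a segment-relative ID should be used with that segment’s own reader, which is exactly what the context passed to getCustomScoreProvider gives us.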

Step 2 Create CustomScoreProvider

The real magic happens in CustomScoreProvider. This is where we’ll rescore the document. I’ll show you a boilerplate implementation before we dig in:

    public class CountingQueryScoreProvider extends CustomScoreProvider {

        private final String _field;

        public CountingQueryScoreProvider(String field, AtomicReaderContext context) {
            super(context);
            _field = field;
        }

        // Called for each matching document; the return value becomes its score.
        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
            return 1.0f;
        }
    }

This CustomScoreProvider rescores all documents by returning a 1.0 score for them, thus negating their default relevancy sort order.
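To see what a constant score does to the ordering, here is a small plain-Java model of the rescoring callback. It is not Lucene code: `ConstantScoreModel`, `Hit`, `Rescorer`, and the sample scores are hypothetical, and the tie-break by ascending document ID is my assumption about how Lucene orders equally scored hits by default.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of rescoring (no Lucene classes). Names and data are hypothetical.
final class ConstantScoreModel {

    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Loosely mirrors the shape of customScore(doc, subQueryScore, ...).
    interface Rescorer {
        float customScore(int doc, float subQueryScore);
    }

    static List<Hit> rescore(List<Hit> hits, Rescorer rescorer) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            out.add(new Hit(h.doc, rescorer.customScore(h.doc, h.score)));
        }
        out.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score); // higher score first
            // Assumed tie-break: lower document ID first.
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });
        return out;
    }

    static List<Hit> sampleHits() {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(7, 2.5f));
        hits.add(new Hit(2, 0.4f));
        hits.add(new Hit(5, 1.1f));
        return hits;
    }

    public static void main(String[] args) {
        // Every document gets 1.0, so only the tie-break decides the order.
        for (Hit h : rescore(sampleHits(), (doc, subQueryScore) -> 1.0f)) {
            System.out.println("doc " + h.doc + " score " + h.score);
        }
    }
}
```

With the original scores, doc 7 would rank first. Once every score is 1.0, the model returns docs 2, 5, 7: the wrapped query still decides which documents match, but it no longer has any say in their order.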

Step 3 Implement Rescoring

With term vectors enabled for our field, we can simply loop through the document’s term vector and count the terms in the field:

    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
        IndexReader r = context.reader();
        Terms tv = r.getTermVector(doc, _field);
        if (tv == null) {
            // No term vector for this document and field; score it lowest.
            return 0.0f;
        }
        TermsEnum termsEnum = tv.iterator(null);
        int numTerms = 0;
        // next() returns null once the terms are exhausted.
        while (termsEnum.next() != null) {
            numTerms++;
        }
        return (float) numTerms;
    }
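One subtlety in that loop: a term vector lists each distinct term once (along with its frequency), so the count is the number of unique terms, not the number of tokens. For tags, which rarely repeat within a question, the two are the same. The plain-Java model below illustrates this. It is not Lucene code: a sorted set stands in for the term vector, whitespace splitting stands in for whatever analyzer the field really uses, and `DistinctTermCount` and its methods are hypothetical names. It also returns 0 when there is no term vector at all, which is a choice made for this sketch.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.SortedSet;
import java.util.TreeSet;

// Plain-Java model of counting terms in a term vector (no Lucene classes).
final class DistinctTermCount {

    // Stand-in for a term vector: the field's distinct terms in sorted order,
    // or null when the field is missing or empty.
    static SortedSet<String> termVector(String fieldValue) {
        if (fieldValue == null || fieldValue.trim().isEmpty()) {
            return null;
        }
        return new TreeSet<>(Arrays.asList(fieldValue.trim().split("\\s+")));
    }

    // Mirrors the counting loop: walk the distinct terms and count them.
    static float customScore(String fieldValue) {
        SortedSet<String> tv = termVector(fieldValue);
        if (tv == null) {
            return 0.0f; // nothing to count
        }
        int numTerms = 0;
        Iterator<String> terms = tv.iterator();
        while (terms.hasNext()) {
            terms.next();
            numTerms++;
        }
        return (float) numTerms;
    }

    public static void main(String[] args) {
        // "star-trek" appears twice but is counted once.
        System.out.println(customScore("star-trek klingons star-trek")); // prints 2.0
    }
}
```

If you actually wanted the token count, duplicates included, you would need to sum the per-term frequencies instead of counting the terms.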

And there you have it: we’ve overridden the score of another query! If you’d like to see a full example, see my “lucene-query-example” repository, which has this as well as my custom Lucene query examples.

CustomScoreQuery vs a Full-Blown Custom Query

Creating a CustomScoreQuery is much easier than implementing a complete query. There are A LOT of ins and outs to implementing a full-blown Lucene query. So when custom matching behavior isn’t important and you’re only rescoring another Lucene query, CustomScoreQuery is a clear winner. Considering how frequently Lucene-based technologies are used for “fuzzy” analytics, I can see using CustomScoreQuery a lot when the regular tricks don’t pan out.

I hope you found that helpful! We focus a lot on improving search relevancy & quality, so if you feel like you need this level of work or any other Solr or Elasticsearch relevancy help, please contact us!