Sub-phrase Highlighting in Solr

We wanted Solr to use its ability to do subphrase matching (pf2, pf3) to procedurally generate richer, contiguous highlighting snippets. Unfortunately, we reached a dead end in our efforts to do this without resorting to more complex tactics.

For example, imagine you searched for q=‘mary had a little sheep’, but your documents included:

  • Mary had a little lamb
  • Little sheeps had a mary

Now of course the first result is the most relevant to the query. It’s nearly (but not quite) a full phrase match for the query. But default Solr highlighting doesn’t do a great job of showing you why. Solr’s default highlighting marks each matching term on its own:

  • Mary had a little lamb
  • Little sheeps had a Mary

If our queries are long like this, we might want to show the various phrase/subphrase matches in our highlights, perhaps even omitting single-term matches as too spurious. For example, we might prefer something like:

  • Mary had a little lamb
  • Little sheeps had a mary

One way to do this is to create bigram/trigram tokens using a shingle filter. This works by placing two- or three-word phrase tokens in the index. However, for users with very large numbers of documents, the added index bloat is untenable.
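To make the index-bloat concern concrete, here is a minimal sketch of what shingling does to a token stream. It is plain Java written for illustration, not Lucene's actual shingle filter; the class and method names (ShingleSketch, shingles) are hypothetical. Every contiguous run of two or three words becomes one extra token to store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics the idea of word n-gram ("shingle") tokens.
// This is NOT Lucene's implementation; names here are hypothetical.
final class ShingleSketch {

    /** Returns every contiguous run of `size` tokens, joined by single spaces. */
    static List<String> shingles(List<String> tokens, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("mary", "had", "a", "little", "lamb");
        System.out.println(shingles(tokens, 2)); // [mary had, had a, a little, little lamb]
        System.out.println(shingles(tokens, 3)); // [mary had a, had a little, a little lamb]
    }
}
```

A five-word title already yields four bigrams and three trigrams on top of its five terms, which is where the extra index size comes from.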

Tricking Solr phrase highlighting?

Solr’s fast vector and unified highlighters have a parameter called hl.usePhraseHighlighter. One thought is that this parameter should let us search using pf2 or pf3, then highlight those subphrase matches. We would hope the following works:

q=mary had a little lamb&defType=edismax&pf2=title&hl.usePhraseHighlighter=true&hl.useFastVectorHighlighter=true&hl.method=fastVector

If you set debug=true, you can see that this query is parsed into:

(+(DisjunctionMaxQuery((title:mary)) DisjunctionMaxQuery((title:had)) DisjunctionMaxQuery((title:a)) DisjunctionMaxQuery((title:little)) DisjunctionMaxQuery((title:lamb))) (DisjunctionMaxQuery((title:"mary had")) DisjunctionMaxQuery((title:"had a")) DisjunctionMaxQuery((title:"a little")) DisjunctionMaxQuery((title:"little lamb"))))/no_coord

You would think each of those searched-for phrases would be highlighted. Puzzlingly, this isn’t what happens. Instead Solr persists in doing single term highlights.

What’s odd is that you DO see phrase highlighting when you explicitly search with phrase queries. You can override the query used for highlighting with ‘hl.q’. Indeed, something like the following works:

hl.q="mary had" OR "had a" OR "a little" OR "little lamb" &hl.usePhraseHighlighter=true&hl.useFastVectorHighlighter=true&hl.method=fastVector

Of course this requires the client to procedurally generate the query, which is less than ideal.

User vs ‘boost’ queries – an important edismax gotcha

If you do some sleuthing in the highlighting code, you’ll learn that the query used for highlighting is controlled by the query parser. Specifically, each Solr query parser can override getHighlightQuery, which decides what query should be used for highlighting (‘hl.q’, when supplied, is parsed in place of ‘q’). For most query parsers, this is the same query generated for search:

public Query getHighlightQuery() throws SyntaxError {
  Query query = getQuery();
  return query instanceof WrappedQuery ? ((WrappedQuery)query).getWrappedQuery() : query;
}

However, edismax (and dismax) distinguish between what they call the “main user query” (what’s passed to “q”) and various add-ons like pf* and boosts. This “main user query”, the ‘unaltered’ user query, is what (e)dismax chooses to return for highlighting. If you’re in the code, the relevant variable in ExtendedDismaxQParser is parsedUserQuery:

@Override
public Query getHighlightQuery() throws SyntaxError {
  if (!parsed)
    parse();
  return parsedUserQuery == null ? altUserQuery : parsedUserQuery;
}

Indeed, various features in Edismax (qs for example) differentiate the query ‘from the user’ from all the background boost queries, functions, etc. that a developer would layer on. It’s important to keep this in mind when working with Edismax. What this means for highlighting is that, unless the user explicitly types quotes, edismax/dismax will hand the highlighter a term-only query, and the pf2/pf3 phrases never reach it.

Some alternatives

We’ve already talked about one alternative: generating a boolean query yourself and passing it to hl.q. Tokenizing the user’s query on whitespace and building the phrases from adjacent words should work most of the time. Of course, whitespace isn’t the only thing a Solr query analyzer would tokenize on, so it’s far from perfect.
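A rough client-side sketch of that first alternative might look like the following. It assumes plain whitespace tokenization, with all the caveats just mentioned, and the class and method names (SubphraseHlQuery, build) are made up for illustration. It also assumes that escaping backslashes and double quotes is enough inside a quoted phrase; check that against your query parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds an hl.q value of OR'ed phrase queries from
// adjacent words in the user's query. Whitespace splitting is an assumption;
// a real Solr analyzer may tokenize differently.
final class SubphraseHlQuery {

    /** Returns e.g. "mary had" OR "had a" for size 2, or "" if the query is too short. */
    static String build(String userQuery, int size) {
        if (size < 2) {
            throw new IllegalArgumentException("size must be at least 2");
        }
        String trimmed = userQuery.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] words = trimmed.split("\\s+");
        List<String> phrases = new ArrayList<>();
        for (int i = 0; i + size <= words.length; i++) {
            StringBuilder phrase = new StringBuilder("\"");
            for (int j = i; j < i + size; j++) {
                if (j > i) {
                    phrase.append(' ');
                }
                phrase.append(escape(words[j]));
            }
            phrases.add(phrase.append('"').toString());
        }
        return String.join(" OR ", phrases);
    }

    // Assumption: only backslash and double quote need escaping inside a phrase.
    private static String escape(String word) {
        return word.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Prints: "mary had" OR "had a" OR "a little" OR "little lamb"
        System.out.println(build("mary had a little lamb", 2));
    }
}
```

The returned string still needs to be URL-encoded before it is sent as the hl.q parameter, and calling build with size 3 gives the pf3-style trigram phrases instead.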

Another option is to use a query parser plugin with hl.q/hl.qparser that gives you more flexibility with slicing and dicing the query string. For example, our match query parser lets you specify a query-time analyzer, then turn the result into a phrase query. So you can bigram a query via a shingle analyzer, then turn those bigrams into phrase queries. Maybe this is a good time for me to set aside time to finally incorporate this into Solr 🙂

Note that this is subtly different from using normal Solr query-time-only shingles and setting autoGeneratePhraseQueries="true". A shingle filter emits each shingle as a single token, whereas autoGeneratePhraseQueries works with the token graph to try to construct good phrase queries from multiple, connected tokens. Though perhaps there’s a graph-aware, shingle-like token filter that generates graphs of adjacent tokens, not just tokens.

What am I missing?

That’s the end of this round of spelunking 🙂 It’s likely I’m missing something obvious in my approach, or missed an obvious solution. Get in touch!