
Improve search relevancy by telling Solr exactly what you want

[Image: Please find my piece of hay!]

To be successful, (e)dismax relies on avoiding a tricky problem in its scoring strategy. As we’ve discussed, dismax scores documents by taking the maximum score across all the fields that match a query. This is problematic because one field’s scores can’t easily be compared to another’s. A good “text” match might have a score of 2, while a bad “title” match might score 10. Dismax has no notion that 10 is bad for title; it only knows 10 > 2, so title matches dominate the final search results.
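
To make that concrete, here’s a rough sketch of how dismax combines per-field scores, using the hypothetical numbers above (tie is the dismax tie-breaker parameter, which defaults to 0):

    score = max(score_title, score_text) + tie * (sum of the non-maximum scores)
          = max(10, 2) + 0.0 * 2
          = 10

With tie=0, the “bad” title match at 10 completely swamps the “good” text match at 2.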

The best case for dismax is that there’s only one field that matches a query, so the resulting scoring reflects the consistency within that field. In short, dismax thrives with needle-in-a-haystack problems and does poorly with hay-in-a-haystack problems.

We need a different strategy for documents whose fields have a large amount of overlap. We’re trying to tell the difference between very similar pieces of hay. The task is similar to finding a good candidate for a job. If we queried a search index of job candidates for “Solr Java Developer”, we’d clearly match many different sections of our candidates’ resumes. Because of the problems with dismax, we may end up with search results heavily sorted on the “objective” field. Our top-scoring result might have something like:

Goal: Work with Solr some day!

Clearly not what we want! We need the hardcore experienced folks!

I’ve switched to a different strategy for search relevancy in these kinds of cases. Start with rudimentary but predictable base scoring that avoids the wild swings of dismax. Once that’s in place, give Solr a list of additive queries (via bq/bf) that describe the ideal document. Then tune the multiplier on each qualification through testing and experimentation.

Simple Base Scoring

Instead of relying on qf/pf to search and take the best of multiple fields, I’ll create a grab-bag field. I’ll use Solr’s copyField directives to copy all text I want to match on into this field in the schema:

<copyField source="resume_goal" dest="text_all"/>
<copyField source="resume_experience" dest="text_all"/>
<copyField source="resume_skills" dest="text_all"/>

The field “text_all” becomes what Solr initially searches. The assumption here is that it’s appropriate to tokenize everything that goes into text_all the same way. In this kind of setup, you might also want to consider omitTermFreqAndPositions for text_all; otherwise your scoring will be heavily biased toward the field that contributes the most tokens to text_all.
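
For reference, here’s a minimal sketch of what the text_all destination field might look like in the schema (the text_general field type is just an assumed example; use whatever analysis chain fits your data):

    <!-- grab-bag destination for the copyField directives above: multiValued because
         several source fields are copied in, stored=false since it's only used for matching -->
    <field name="text_all" type="text_general" indexed="true" stored="false"
           multiValued="true" omitTermFreqAndPositions="true"/>

One trade-off to keep in mind: omitting positions means phrase queries (e.g. via pf) can’t run against text_all.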

Now we can set

qf=text_all

and start searching!
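
With the pieces above in place, the baseline request is about as simple as it gets (the query text is just illustrative):

    /select?defType=edismax&qf=text_all&q=solr+java+developer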

Describe job qualifications to Solr

Once there’s baseline, predictable scoring in place, let’s describe our ideal candidate by passing Solr multiple boost queries that help bubble up the best documents for the problem we’re trying to solve:

  1. The candidate has at least 75% of the required skills

    bq={!edismax qf=resume_skills mm=75% v=$q bq=}

  2. The candidate wants to work with the technology

    bq={!edismax qf=resume_goals v=$q bq=}

  3. The candidate has a high StackOverflow reputation

    bf=log(resume_stackoverflow_reputation)

Each of these queries lets Solr layer an extra factor into the sorting. Notice how in the bq we set v=$q. We’re using Solr’s local param syntax to reprocess the original query against a new set of criteria. We’re also assuming in the first bq that resume_skills uses an analysis chain that filters out tokens that aren’t job skills, through a combination of synonyms and filtering. It’s also important to note that this wouldn’t be the finished product. Each boost needs to be carefully tuned through testing, tweaking its impact with the ^(multiplier) syntax.
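
Assembled into one request, the whole strategy might look roughly like this (shown unencoded for readability; the field names and query text follow the examples above, and each bq/bf would still need its ^(multiplier) tuned as described):

    q=solr java developer
    defType=edismax
    qf=text_all
    bq={!edismax qf=resume_skills mm=75% v=$q bq=}
    bq={!edismax qf=resume_goals v=$q bq=}
    bf=log(resume_stackoverflow_reputation)

In a real URL, each of these parameter values would of course need to be URL-encoded.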

[Image: Which one of you is the perfect document for this query?]

One nice thing about this strategy is that we’re directly telling Solr exactly what we want in an awesome candidate. It’s a bit like using Solr as a fuzzy sorter: we explicitly feed it pieces of criteria we think are “good”, tune those criteria, then use it to find the answers that match as many of those criteria as possible. It’s also easy to decide later that we want to layer on additional criteria (does the candidate have code on GitHub that uses skills in the query? how much code? how recent is it?). We could even apply further queries based on other criteria like salary requirements. It’s a pretty exciting strategy. John Berryman and I have even been wondering whether this might help get at his multiple objective scoring ideas. In any case, I hope to be using it more!

Let us know what you think of this strategy! If you’ve got a tough relevancy problem, get in touch; we’ve got this and plenty of other relevancy tricks up our sleeves and we’d love to talk with you!