Querying More Fields != More Results – Stop wording and Dismax’s minimum should match argument

Let’s recall from Anatomy of a Dismax Query some key components of the dismax query parser:

  • qf – the fields we will search over (we’ll take the highest score out of all the fields that match)
  • mm – the minimum number of query clauses (roughly, one per search term) that MUST match

OK, now we’ve had plenty of time to study John’s post (and hey, you should even be able to debug Solr). Let’s take our new knowledge for a test drive with this puzzler: why would adding a field to qf cause our result set to actually shrink? Consider these two Solr queries:

(A) http://localhost:8983/solr/select?q=captain+of+enterprise&qf=body&mm=3&defType=dismax
(B) http://localhost:8983/solr/select?q=captain+of+enterprise&qf=title+body&mm=3&defType=dismax

The only difference between A and B is qf. Query B adds “title” to qf.

Why would query A return more results than query B? In query B we added a field, so shouldn’t there be more fields to match on and therefore more documents? Not necessarily, as it turns out. Here’s a fact that helps solve the puzzle: body is stop worded at query time; title is not. Let’s dig a little deeper and set debugQuery=true to take a gander at what’s happening under the hood with query parsing and analysis. After parsing, A and B turn into the following two dismax queries:

(A) +((body:captain body:enterprise)~2)
(B) +(((title:captain | body:captain) (title:of) (title:enterprise | body:enterprise))~3)

Notice how in both cases body’s stop wording has removed our search for “of” in the body field. In query A this reduces mm to 2, as dismax nicely figures out that after stop wording, we only have 2 clauses in our query – “body:captain” and “body:enterprise”.

What has the addition of an extra field done in B? Well, notice it’s introduced a 3rd clause between “captain” and “enterprise”. Query-time stop wording has removed “body:of”. However title is not stop worded. Therefore, Solr can still potentially match on “title:of” so this component of the middle clause stays in place.

The result of query parsing is that we now have a mandatory clause requiring title to contain “of” (with mm=3 and three clauses, every clause must match). Therefore, the result set for query B is limited to documents whose title contains “of”. If no titles contain “of”, we’ll get no results.
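
To make the mechanics concrete, here’s a toy Python model of the clause building described above. It is a sketch, not Solr’s implementation: the stop word table, the two documents, and the helper names (build_clauses, matches, search) are all invented for illustration, and “analysis” is nothing more than lowercasing and splitting on whitespace.

```python
# Toy model of dismax clause building; NOT Solr's actual implementation.
# Assumption: body drops "of" at query time, title has no stop words.
QUERY_STOPWORDS = {"body": {"of"}, "title": set()}

def build_clauses(query, qf):
    """Return one clause per query term: the field:term pairs that survive analysis."""
    clauses = []
    for term in query.lower().split():
        alternatives = [(field, term) for field in qf
                        if term not in QUERY_STOPWORDS[field]]
        if alternatives:  # a term removed from every field yields no clause
            clauses.append(alternatives)
    return clauses

def matches(doc, clauses, mm):
    """A doc matches if at least min(mm, number of clauses) clauses hit."""
    required = min(mm, len(clauses))
    hits = sum(
        any(term in doc[field].lower().split() for field, term in clause)
        for clause in clauses
    )
    return hits >= required

# Hypothetical documents, invented for illustration.
docs = [
    {"title": "Captain of the Enterprise", "body": "Kirk is captain of the Enterprise"},
    {"title": "Starfleet roster", "body": "The captain commands the Enterprise"},
]

def search(qf, mm=3, query="captain of enterprise"):
    clauses = build_clauses(query, qf)
    return [doc["title"] for doc in docs if matches(doc, clauses, mm)]

results_a = search(["body"])           # like query A
results_b = search(["title", "body"])  # like query B
print(len(results_a), results_a)
print(len(results_b), results_b)
```

In this model the body-only search builds two clauses and finds both documents, while adding title builds a third clause that only title can satisfy, so the second document drops out.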

This sounds like an unlikely scenario, but consider if instead of “title” you have another field. Something with a very tightly controlled vocabulary. Something like, titles of laws. Then you could hit this problem very easily.

Solutions?

It’s a bit hard to figure out what’s expected of Solr in this case. Should the “title:of” query be mandatory? Should it be coupled with a “body:?” clause that will match on any term in body (effectively letting body off the hook)?

As a user, it doesn’t seem to make sense to give up stop wording entirely just to avoid this behavior. It’s a useful tool. More importantly, I feel that we probably want dismax to continue to be able to search over heterogeneous fields with their own analysis chains. Why should the behavior of dismax constrain how we decide to slice up individual fields?

Nevertheless, one takeaway is clear: don’t get aggressive with mm. Think carefully about mm in terms of the proportion of stopwords you’re likely to encounter, realizing that a stop word dropped from one field but kept in another can make the surviving clause effectively mandatory. For long queries like q=Where in the world is Carmen Sandiego? this could be quite a few stopwords. For short queries like q=Carmen Sandiego, you’re likely to encounter few. Luckily Solr lets us control mm as a function of the number of clauses in the query.
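
To get a feel for how a clause-dependent mm behaves, here’s a rough Python sketch of how I understand Solr’s documented mm syntax to resolve (plain integers, percentages that round down, negative values meaning “this many may be missing”, and conditional specs like 2<75%). The function name effective_mm is made up, it does not handle spaces around “<”, and it is an approximation for experimenting, not Solr’s code.

```python
def effective_mm(clause_count, spec):
    """Approximate how many optional clauses must match for a given mm spec."""
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional form, e.g. "2<75%" or "2<-25% 9<-3".
        for part in spec.split():
            bound, _, inner = part.partition("<")
            if clause_count <= int(bound):
                return result
            result = effective_mm(clause_count, inner)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        share = clause_count * abs(percent) // 100  # rounded down
        # Negative percent: that share may be missing; positive: that share must match.
        result = clause_count - share if percent < 0 else share
    else:
        number = int(spec)
        result = clause_count + number if number < 0 else number
    # Never require more clauses than exist, and never fewer than zero.
    return max(0, min(clause_count, result))

# "Carmen Sandiego" has 2 terms; "Where in the world is Carmen Sandiego" has 7.
for count in (2, 7):
    print(count, "clauses ->", effective_mm(count, "2<75%"), "must match")
```

Under these assumptions, a spec like 2<75% still requires both terms of a two-word query but lets a seven-word query miss two clauses, which leaves some slack for clauses that stop wording has quietly made mandatory.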

I’d love to get your thoughts though. Have you encountered this issue before in the wild? How have you solved it?