Parameterizing and Organizing Solr Boosts

November 22, 2013 John Berryman

One of my clients has some pretty heavy-duty requirements for boosting functions. It's actually right on the boundary of what I think is appropriate for Solr. BUT, as long as I choose to stay within the bounds of Solr, I might as well expect the boosting functions to be as readable and well-organized as possible. So let's take a look at my strategy.

The Setup

Let's say that we're Amazon, and we're allowing our users to search over books. But rather than just returning the books based upon straight TF-IDF search, we need to control the boosting behavior to guide users towards newer books and books with a higher margin. The text of the book is stored in the text field, and the margin and release date of each book are stored in the corresponding fields margin and release_date.

The Problem

The problem is that the syntax that we must assemble to create such a query is utterly unwieldy. Allow me to demonstrate:

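    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">text</str>
      <str name="pf">text</str>
      <str name="boost">sum(product(margin,0.34),product(div(1,ms(NOW,release_date)),1100))</str>
    </lst>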

Check out that boost parameter. Can you tell what it's doing? Well, can you? (I'm pausing to let you try and figure it out.) Yeah… so the answer's no. And as a matter of fact, I can't tell what it does either – and I just wrote it. What's more, if your eyeballs are a little better than mine at reading this stuff, you'll notice that there are some hardwired constants in this equation: 0.34 and 1100. What do these do? Beats me! But they must be important, so let's never ever touch them ever again.
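For what it's worth, the formula can be untangled by hand. The sketch below mirrors it in Python purely as an illustration: `boost` is a hypothetical helper (nothing Solr provides), the sample margins and release dates are made up, and it assumes that margin is a plain number and that ms(NOW,release_date) is the book's age in milliseconds.

```python
from datetime import datetime, timedelta, timezone

MARGIN_WEIGHT = 0.34    # first hardwired constant in the boost
RECENCY_WEIGHT = 1100   # second hardwired constant in the boost


def boost(margin, release_date, now):
    """Mirror sum(product(margin,0.34),product(div(1,ms(NOW,release_date)),1100))."""
    # ms(NOW,release_date): age of the book in milliseconds (assumed > 0)
    age_ms = (now - release_date) / timedelta(milliseconds=1)
    margin_part = margin * MARGIN_WEIGHT
    recency_part = (1 / age_ms) * RECENCY_WEIGHT
    return margin_part + recency_part


# Made-up sample data: (title, margin, age in days)
now = datetime(2013, 11, 22, tzinfo=timezone.utc)
samples = [
    ("new, low margin", 0.5, 1),
    ("old, high margin", 2.0, 365),
]

for title, margin, age_days in samples:
    released = now - timedelta(days=age_days)
    print(title, boost(margin, released, now))
```

Because the age is measured in milliseconds, the recency term shrinks quickly: under these assumptions a book released a day ago gets roughly 0.00001 from recency, against 0.68 from a margin of 2.0. That is exactly the kind of thing the raw one-liner hides.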

I think I've made a good case for the problem. This type of function munging leads to brittle, inscrutable, and unchangeable configuration. Let's take another swing at it!

The Solution

Here's my second attempt. Take a moment to read over it and see what you think.

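    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">text</str>
      <str name="pf">text</str>
      <str name="boost">$totalBoost</str>
      <str name="totalBoost">sum($marginBoost,$recencyBoost)</str>
      <str name="marginBoost">product(margin,$valMarginBoost)</str>
      <str name="recencyBoost">product($inverseRecency,$valRecencyBoost)</str>
      <str name="inverseRecency">div(1,ms(NOW,release_date))</str>
      <str name="valMarginBoost">0.34</str>
      <str name="valRecencyBoost">1100</str>
    </lst>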

So the first thing that you might notice is that it's a little more verbose than the previous request handler, but I maintain that this verbosity is actually incredibly helpful. Because now you can almost read this configuration as if it's explaining to you exactly what it's doing.

YOU: How is the total boost formed?

MR. REQUEST HANDLER: Oh, well it's the sum of the margin boost and the recency boost. Duh!

YOU: Yeah, well what's the margin boost?

MR. REQUEST HANDLER: Simple! We just multiply the value stored in the margin field by the constant called valMarginBoost.

YOU: Oh… so I can just modify the valMarginBoost and change how important the margin is in the results?

MR. REQUEST HANDLER: Bingo!

Personally, I don't like Handler's tone, but he's right: this is a lot easier to read, and therefore to maintain and modify. The labeling of the functional pieces makes it easier to keep track of everything and understand how each piece builds up to the total boost. The ordering of the named pieces is also important. I made sure that the definition of each piece is located just below the place where it is first mentioned. The only exception is the section at the bottom, where I've placed the constants that the content curator or merchandising expert can fiddle with – thus there are no longer any magic constants in our configuration.
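To convince yourself that the named pieces still add up to the original function, here is a toy sketch of the `$name` substitution. It is not Solr's implementation, just a few lines of Python that mimic it: `expand` and the `params` dictionary are hypothetical stand-ins, with keys following the `$` references in the handler and `boost` being the parameter that edismax reads.

```python
import re

# Parameter names and values mirroring the request handler defaults.
params = {
    "boost": "$totalBoost",
    "totalBoost": "sum($marginBoost,$recencyBoost)",
    "marginBoost": "product(margin,$valMarginBoost)",
    "recencyBoost": "product($inverseRecency,$valRecencyBoost)",
    "inverseRecency": "div(1,ms(NOW,release_date))",
    "valMarginBoost": "0.34",
    "valRecencyBoost": "1100",
}

REFERENCE = re.compile(r"\$(\w+)")


def expand(expression, params):
    """Replace every $name in expression with the fully expanded params[name].

    Toy resolver: an unknown name raises KeyError, and a circular
    reference would recurse forever.
    """
    return REFERENCE.sub(
        lambda match: expand(params[match.group(1)], params), expression
    )


# The parameterized boost expands back to the original one-liner.
print(expand(params["boost"], params))

# Tuning a single named constant changes only that piece.
tuned = {**params, "valMarginBoost": "0.5"}
print(expand(tuned["boost"], tuned))
```

The first line printed is the same function as the unwieldy version; the second differs only in the margin weight, which is the whole point of pulling the constants out and naming them.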

Shameless Plug for Quepid

Content curators, merchandising experts – now that the search team has built up the Solr request handler and exposed the tunable parameters, it's your job to find the perfect values for these parameters. This is hard! Why? Because you might find that the perfect parameter values for your top product are actually the worst possible configuration for all other products. And it's hard to know this without looking at all those queries at once.

Quepid solves this. Look at tens or even hundreds of queries at once and watch how they change as you modify configuration parameters. Look here for details. And also read further here.
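If you just want to poke at the tunable parameters by hand first, request parameters take precedence over a handler's defaults, so a candidate value can be tried without editing the configuration. A minimal sketch, assuming a handler that exposes valMarginBoost and valRecencyBoost as above; the host, core name, handler path, and sample queries are all placeholders, and `build_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Placeholder endpoint: adjust host, core, and handler path to your setup.
BASE_URL = "http://localhost:8983/solr/books/search"


def build_url(query, **overrides):
    """Build a request URL that overrides the handler's default constants."""
    request_params = {"q": query, **overrides}
    return f"{BASE_URL}?{urlencode(request_params)}"


# Made-up queries; in practice, use the ones your users actually run.
queries = ["solr in action", "lucene", "search relevance"]

for query in queries:
    print(build_url(query, valMarginBoost=0.5, valRecencyBoost=1100))
```

Running the same candidate values against a whole batch of queries, rather than only your favorite one, is what keeps a "perfect" setting for one product from quietly wrecking the rest.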
