Multiple Objective Search Scoring

John Berryman — July 4, 2013 | 5 Comments | Filed in: solr

Over the past few years as I’ve worked with search, and Solr in particular, I have witnessed a commonly recurring problem associated with multiple objective search and how documents are scored. Perhaps the best way to understand this problem is to look at an example:

Let’s say that you are Zappos and you sell apparel online. When your users search for “green shoes”, you want to display green shoes, but you also want to promote the newest product first. So you set up your Solr request handler like so: shoes

In this case part of the score is based upon how well “green shoes” matches in the product description, and (thought it’s a bit cryptic) part of the score is based upon how new the product is. These two score components are summed together to form the final score. Or in equation format, for a given product,

(total score for product) = (how well product matches the search) + (newness of product)

Now that search has been appropriately configured, you issue the search for “green shoes” and to your chagrin, you find that instead of green shoes, you have a mixture of a bunch of really new products that happen to be either green or shoes, but rarely both at the same time. This stinks, because you know there are plenty of green shoes in the index, but there’s just not any recent products!

Well no problem, you can just change the weight the text match so that it’s more important than the newness of the product: shoes

Or again in equation form:

(total score for product) = **10***|(how well product matches the search) + (newness of product)

And upon issuing the search again, you see a lineup of the newest green shoes you have to offer.

Problem solved! Right?

Not so fast. Do a search for “dress shoes”. To your surprise you’ll find, not dress shoes, but a collection of recently introduced sun dressed and tennis shoes! What on earth happened?!

Here is what happened: Dresses and dress shoes are much more common in the index than “green shoes”, and because they are so much more common, they get a lower score! (See TF-IDF scoring for more information.) Because “dress shoes” are weighted so much lower, the relative importance of newness is increased and you get this mess we’re talking about here.

From this point it’s not difficult to see that there is no best setting. For instance, if we again increase the weight on text match, then in all likelihood we would match too heavily on the text “green shoes”, ignore the newness, the results would now contain only the oldest, most out-of-style green shoes in our inventory!

A visual interpretation

Perhaps you’re like me. I always feel a lot better when I’m looking at a visual interpretation of new concept. So here is a good visual way of thinking about the problem above.

Considering the “green shoes” search, every product in the inventory has a certain score for the text match and a certain score for the newness as seen here.

Screen Shot 2013-07-05 at 1.02.55 AM

But based on these two scoring dimension, there’s no real obvious order for displaying these products? So we do the best thing that we can think of, and we sum the scores together and turn them into one score. This can be represented as a 45˚ line on the diagram. And the further along this line the product is, the higher it’s total score.

Screen Shot 2013-07-04 at 5.16.28 PM

But as illustrated here, there are far too many high-scoring items that are not “green shoes” – they’re just newer. So we boost the text score – effectively stretching their placement along the vertical axis. The more we do this, the more influential the text match becomes on the result ordering.

Screen Shot 2013-07-04 at 5.28.50 PM

And, as we saw in the “dress shoes” example, our choice for the best text match weight for “green shoes” might not be appropriate at all for “dress shoes”.

Screen Shot 2013-07-04 at 5.39.42 PM

Exploring Better Options

The most obvious way of dealing with multiple objective search is to simply sum the individual scores together. However we’ve proven above that the best thing you can do for this approach is to optimize for a single search and then hope that the rest of searches aren’t too bad. There’s got to be something better. In the following sections I’ll discuss two ideas that we’ve recently been exploring.

Auto-Scaled Weights

Perhaps you know that one of the objectives is more important and that the other objectives should be secondary. For instance, in both of the examples above, the text matching is primary. If a customer searches for “green shoes” then by golly they should show them green shoes, and only after that should we care about the newness of the green shoes. Ditto for “dress shoes”.

One way of achieving this is to have two pass scoring as follows. In the first pass, the text score and the newness score are handled individually and we collect statistics for both scores. For instance, it will probably be useful to track the maximum, average, and lowest score along both dimensions, along with the variance of those stores. Then, on the second scoring pass, we add the two scores together, but we give the newness component a weight based upon the statistics that we’ve captured for the other components of score:

(total score) = (text score) + (newness score weight)*||(newness score)

There are certainly several ways to automatically determine the values for the secondary scoring weight. One example for now is to make the newness score weight equal the to the average text score divided by the average newness score. This way average contribution of the newness to the total score will be equal to the average contribution of the text score; the two components will be on equal footing.

The principle drawback here is that all matching documents have to be scored twice, however it might be possible mitigate this problem by smart use of caches.

Non-Dominated Results “Ordering”

A completely different approach would be to return results sorted so that so-called “non-dominated” results are shown first. What’s a non-dominated result? Consider a new example: Let’s now imagine that we are CareerBuilder and our users use our search engine to find new jobs. A user is going to be interested in simultaneously optimizing several criteria. For now we consider two of these criteria: distance and salary. In a particular search, a user sees the following results scored according to both criteria:

Screen Shot 2013-07-04 at 6.27.50 PM

A “non-dominated” result is any result X for which there is no other result that is in every way better that X. The collection of all such non-dominated results is called the non-dominated frontier (or sometimes the Pareto frontier).

Now since we’ve defined what a non-dominated result is, it should be easier to understand why you would want to present your user with these results first. For example, why on earth would our user want to see result H before seeing results B or C which in both cases are both closer and provide a higher salary?

One of greatest reasons that a non-dominated ordering of results is beneficial is because we can’t ask our users how strongly they weight salary vs. distance. For some users, they simply don’t want to move and therefore salary is secondary to distance. For others, the oposite will be true. The great thing about a non-dominated ordering is that you don’t have to ask, the user will only be presented with results that are not dominated in either dimension.

While this non-dominated ordering provides some great benefits, there are drawbacks as well. For one, it’s not a strict ordering. There are several items that are equally non-dominated (those on the non-dominated frontier). What order do we present these results to the user? Also, once we’ve exhausted the completely non-dominated results, how do we present the remaining results? Finally, calculating the non-dominated frontier is computationally much more complex that simply sorting the documents by some scalar score. This complexity, however, might be mitigated by making some approximations in our ordering of the results.

Implementing this in Solr

And all of this beg’s the question, “How do you actually build this in Solr?” The answer, for now at least, is that we don’t quite know. We know the component that does the scoring; it’s called the Similarity. And we know the component that searches the index; it’s called the IndexSearcher. And we even know the thing that collects all of the documents from the searcher; it’s called TopDocs. But we don’t know if all of the necessary requirements above are met. (For instance, do document scores have to be scalars?) — But what I do know is that our clients need this type of search. And the sooner the better!

If you are such a client, then tell us! The more people that are mutually benefitted by this, the better and the faster we’ll get this implemented! Are you a developer? We’re fun to hack with. If you’re interested, help us build this!

Check out my LinkedIn Follow me on Twitter

5 comments on “Multiple Objective Search Scoring

  1. Two questions from this text:
    1. Why is sum more applicable than multiplication?
    2. why you don’t simply order for “newness” within the search results?

    Best regards,

  2. Hey @Artem! I was waiting for someone to point this out! You are correct that multiplicative boosting would be better in the case. (I feel like addition is better for things that are semantic matches and multiplication is better for things that are modifiers of the semantic match. This probably deserves a blog post of its own!)

    However the main point remains: if your search has multiple terms or multiple fields then Solr is going to be summing scores in order to determine the total score. Effectively, even with boring, normal searches, you’re already looking at multiple objective search! And this argument applies.

  3. Well-written article!

    To implement this in Lucene/Solr, you need a custom Collector ideally. Solr doesn’t really allow custom collectors, unfortunately. There’s an issue in JIRA, SOLR-4465, I’m following that makes it configurable.

  4. Hi John, interesting article, but I’m not sure I agree with your premise or conclusions.

    To get around getting results for ‘sun dressed’ and ‘tennis shoes’ when searching for ‘dressed shoes’, all you need to do is to add an optional highly-boosted phrase query. That way results with dressed shoes will float all the way to the top. You’ll need to tailor the magnitude of the newness boost to still have an effect on these.

    I’ve personally not found that ‘best thing you can do for this approach is to optimize for a single search and then hope that the rest of searches aren’t too bad’. In general, I’ve found that by choosing the relative weights of the scoring factors carefully, we can arrive at something close to optimal in 90-95% of situations.

    I don’t quite follow the point of auto-scaled weights, or why adding or multiplying the secondary score would make a difference. You can achieve the desired effect simply by making the primary factor large enough to dominate, and the secondary factor close enough to the primary to still make a difference.

    Something that helps scoring is not using length norms, as that tends to complicate scoring, and is usually useless for product-based search.

Comments are closed.

Developed in Charlottesville, VA | ©2013 – OpenSource Connections, LLC