Reflective Search Relevance Testing

Many organizations find relevance testing frustrating. Often we start consulting engagements and the first problem we have to solve is “how do we test search relevance?” In this article I want to talk about an alternate method you can use to bootstrap search relevance testing and get to the solution more quickly.

Relevance Testing Methods

Search relevance testing usually means one of two things:

  1. There’s maybe a clickstream somewhere that we can use to see what users click on, interact with, or purchase after a search. From this we can see which docs they like or don’t like. Users search for “Rambo” and they click “First Blood” but not “Sense and Sensibility”.
  2. There’s enough patience and expertise to use a tool like Quepid to gather judgments on which documents are good or bad for a given query. A movie critic tells us that obviously a search for “Rambo” should have “First Blood” in the first position, followed by its sequels, etc.

Certainly these methods are valuable. They’re both more-or-less grounded in traditional information retrieval research on how search should be evaluated, so there’s a lot of know-how in using the data these processes generate. We are big fans of both approaches, and of using them together. But each comes with its own unique pain points, with no clear silver bullet.

For (1), many organizations do not have the sophistication or infrastructure to maintain reliable quality data derived from a clickstream. There are inherent biases in clickstream data that you must overcome. Overcoming this bias is doable, but non-trivial: the data can only be massaged so far (users only click, purchase, etc. what they see in search results). So getting good test data from a clickstream means applying a bag of tricks to overcome this bias without corrupting the original authenticity of the data, a tough balance to get right. On top of that, you often need more than clicks. Clicks can be noisy, uncertain indicators; user actions with commitment (purchases, etc.) are often more reliable.

It goes without saying that manually creating judgments (2) is painful. Few have the patience to sit in front of a tool like Quepid or a spreadsheet and record which results were good, bad, OK, etc. for a long list of test queries. For this reason, (2) takes skill in prioritizing enough representative queries for experts to grade. You’re always asking yourself questions like: “Did we get enough queries that cover searching by movie title?” A second concern, as Elizabeth Haubert points out in her excellent Haystack talk, is that judges can be unreliable. For some problems you may need to carefully analyze the judges themselves to detect which are more reliable and consistent.

Reflective Testing

Both click-based and expert-created judgments have their pros and cons, so they’re often used together. Click-based data comes in large volumes, so it carries a lot of statistical significance. Expert-created judgments can be more reliable per judgment (if done right) BUT take a lot of human labor to get right.

Lately, we’ve been using a third method for relevance testing: Reflective Testing. I’m becoming a big fan of it for certain use cases. After all, our whole mission is to empower great search teams(™), and what could be more empowering than a relevance testing method that frees you from some of the morass of click- or expert-based relevance judgments so you can get started on relevance work sooner?

Reflective testing uses prominent phrases, features, or other aspects of a document (or related documents) as test queries. An obvious example is taking a document’s title, issuing it to the search engine as a query, and checking whether that document comes back as the first result.

This seems like a hilariously obvious and naive testing method. No duh, right? But it can be an effective and liberating mindset for organizations to get into, and it goes far beyond just copy-pasting the title. Instead of asking ourselves “what documents should come back for this query?” let’s turn the question upside down: “what queries are good candidates for this document?”

Instead of asking this question document-by-document, we can abstract some obvious rules for how documents (really, the entities and ideas related to a document) might go into a user’s head, down their arm, and get (mis)typed out as search queries.

Consider an e-commerce search, with the following metadata:

[
  {"name": "pencil dress", "color": "red", "brand": "ralph lauren"},
  {"name": "blue jeans", "color": "blue", "brand": "levis"}
]

How might we ‘reflect’ some of these documents into reasonable queries that are good matches? By knowing just a tad about our users and how they search, we can code up some business rules that take a document as input, understand something about its structure, and output a query with a relative weight. This is where the ‘reflective’ comes in: we’re reflecting a document back into queries.

Some obviously good rules might be:

  • The full name as a moderately strong query for the document
  • The color plus the full name as a strong indicator (“red pencil dress”, “blue blue jeans”)
  • Just the brand or color as a relatively weak query for each document: “levis” or “ralph lauren”
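The rules above can be sketched as code. Here’s a minimal Python version, where the field names mirror the example documents and the specific grade values (4, 3, 2) are illustrative assumptions, not a fixed standard:

```python
def reflect_queries(doc):
    """Yield (query, grade) pairs 'reflected' from a product document.

    Grades follow a 0-4 judgment scale: exact/strong queries high,
    brand- or color-only queries lower. The weights here are assumed.
    """
    yield (doc["name"], 4)                       # full name: strong
    yield (f"{doc['color']} {doc['name']}", 4)   # color + name: strong
    yield (doc["brand"], 3)                      # brand alone: weaker
    yield (doc["color"], 2)                      # color alone: weak

docs = [
    {"name": "pencil dress", "color": "red", "brand": "ralph lauren"},
    {"name": "blue jeans", "color": "blue", "brand": "levis"},
]

for doc in docs:
    for query, grade in reflect_queries(doc):
        print(f"{doc['name']!r} generates query {query!r} at grade {grade}")
```

Each rule is just a line of code, so adding a new query pattern for a document type is cheap.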

From the weak/strong queries we generate from each document, we end up building a judgment list using all the documents that generate each query. Documents that generate a query weakly are graded lower for that query than documents that generate it strongly. So we might end up with a judgment list that looks like:

Doc          | Query        | Grade (0-4) | Use Case
blue jeans   | blue jeans   | 4           | Exact Match
blue jeans   | levis        | 3           | Brand
blue jeans   | pencil dress | 0           | (query not generated)
pencil dress | pencil dress | 4           | Exact Match
pencil dress | ralph lauren | 3           | Brand
pencil dress | levis        | 0           | (query not generated)

Note this is the same judgment list we’d use if we asked experts to grade queries, just with our document as the primary object of evaluation. If we sort this table by “query”, we’ll get a classic judgment list we can use for evaluation.
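That pivot, from per-document generated queries to a query-keyed judgment list, is a few lines of code. A minimal sketch, starting from (doc, query, grade) triples like the table above, where any document that never generated a query defaults to grade 0:

```python
# (doc, query, grade) triples as produced by a reflection step
triples = [
    ("blue jeans", "blue jeans", 4),
    ("blue jeans", "levis", 3),
    ("pencil dress", "pencil dress", 4),
    ("pencil dress", "ralph lauren", 3),
]

docs = {d for d, _, _ in triples}
queries = {q for _, q, _ in triples}
graded = {(q, d): g for d, q, g in triples}

# Every (query, doc) pair gets a grade; ungenerated pairs default to 0.
# Sorting by query yields the classic evaluation-ready judgment list.
judgments = sorted(
    (q, d, graded.get((q, d), 0))
    for q in queries
    for d in docs
)

for query, doc, grade in judgments:
    print(f"{query} | {doc} | {grade}")
```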

With more sophistication, more query patterns could be extracted to flesh out this judgment list:

  • Use NLP to figure out the primary noun in the item (“jeans” or “dress”)
  • Mutate the query with common typos
  • Weight common terms relative to the popularity of the item
  • Apply a reasonable taxonomy to the colors/items to generate queries for semantically proximate terms (“pants”, “turquoise”, “turquoise trousers”), obviously weighted lower

The goal with this method of testing is not to be exhaustive. Indeed, that’s a downside to it. You’re in a sense identifying and regression testing use cases that your solution should get right, with gradually increasing sophistication. You can put a number on each use case, and use it in conjunction with other forms of relevance testing to make decisions. It complements, doesn’t replace, other forms of relevance testing.

Another issue is that at some point you just start solving the problem itself. But to me, solving and testing simultaneously is OK. I may be disabusing myself of the preconception that relevance testing always requires an untainted, ‘pure’ judgment list that has had no engineering applied to it. After all, regular unit tests can require test code at the level of sophistication of the solution being tested. The complexity of test and solution go hand-in-hand.

Moreover, it’s a kind of ‘solving’ that I think can engage product/business experts in a way that’s closer to how search actually works. Imagine this conversation:

  • Product person: the user searched for ‘turquoise jeans’ and got no results; they should have matched this document
  • Relevance engineer: Sorry, “turquoise jeans” won’t match here; notice there are no terms for turquoise to match on in this document
  • Product person: well how do we inject that term, and prioritize this query (and others that fit this pattern) relative to the other queries that might fit this document?
  • Relevance engineer: Let’s collaborate on a business rule. Within this taxonomy, close but non-exact color matches get this weight relative to the other possible generated queries for this document
  • (collaboration ensues)

After this conversation the ‘reflective test’ turns into something of a regression test for this business rule, and hopefully can be generalized to other documents.
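The agreed business rule might be captured like this. A minimal pytest-style sketch, where `color_synonyms`, the crude primary-noun guess, and the grade of 2 are all hypothetical stand-ins for whatever taxonomy the team agrees on:

```python
# Assumed taxonomy: near-miss color synonyms (hypothetical data)
color_synonyms = {"blue": ["turquoise", "navy"]}

def reflect_queries(doc):
    """Reflect a document into (query, grade) pairs, including the
    agreed close-but-non-exact color rule at a lower weight."""
    queries = [(doc["name"], 4)]
    base_noun = doc["name"].split()[-1]  # crude primary-noun guess
    for synonym in color_synonyms.get(doc["color"], []):
        # the business rule: near-miss colors generate weaker queries
        queries.append((f"{synonym} {base_noun}", 2))
    return queries

def test_turquoise_jeans_generated():
    doc = {"name": "blue jeans", "color": "blue", "brand": "levis"}
    assert ("turquoise jeans", 2) in reflect_queries(doc)

test_turquoise_jeans_generated()
```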

Can we synthesize learning to rank training data?

One big experiment I am working on is whether you can take this approach and generate a judgment list suitable for training a machine learning model. If you recall from an earlier blog post, learning to rank requires labeled judgments. Lots of them! So a big hurdle is setting up the same infrastructure needed for the relevance testing covered in the first part of this article.

Can you capture the important queries for each document, turn them into a judgment list, and then use them to train a ranking model? In a sense, this reflective work IS feature engineering based on someone’s assumptions about what a ‘good query’ is. It seems crazy, but some of the methods you’d use to generate ‘good queries’ you wouldn’t want to run at query time. So in a sense you’re capturing computationally expensive patterns (complex NLP stuff) in another medium (a judgment list) so that simpler features (Solr/ES queries) can try to approximate them.
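To give a feel for the plumbing, here’s a sketch of serializing a reflective judgment list into the RankLib/SVMrank-style training format that learning-to-rank tools consume. The single placeholder feature stands in for real logged Solr/ES query scores, and the data is the toy example from earlier:

```python
# (query, doc, grade) judgments from the reflective step (toy data)
judgments = [
    ("blue jeans", "blue jeans", 4),
    ("blue jeans", "levis", 3),
    ("pencil dress", "pencil dress", 4),
]

# Assign each distinct query a stable qid
qids = {q: i + 1 for i, q in enumerate(sorted({q for q, _, _ in judgments}))}

# "grade qid:N 1:feature # doc" lines; feature 1 is a placeholder
# until real query scores are logged from the search engine
lines = [
    f"{grade} qid:{qids[query]} 1:1.0 # {doc}"
    for query, doc, grade in sorted(judgments)
]

print("\n".join(lines))
```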

It’s a crazy idea, and there’s a good chance it won’t work, but if you’re curious, check out this GitHub repo for very early experiments. And don’t hesitate to get in touch with questions!


I want to call out the great work of Peter Fries and Dan Worley from our team in operationalizing many of these ideas for various OSC clients!