Pete establishes a Baseline Relevance Metric

Pete has just been hired as Product Manager for Search at electronics retailer Chorus Electronics. His boss, a Vice President, has asked him to build a ‘best-in-class’ e-commerce search engine, to ‘increase customer satisfaction and drive improved revenue’. 

On his quest to build superior e-commerce search, Pete has so far improved search results one query at a time. While tuning search on a query-by-query basis is a valid approach that yields “better search”, he may be experiencing something known as the butterfly effect:

The butterfly effect is the idea that small, seemingly trivial events may ultimately result in something with much larger consequences – in other words, they have non-linear impacts on very complex systems. For instance, when a butterfly flaps its wings in India, that tiny change in air pressure could eventually cause a tornado in Iowa.

Pete is aware of the butterfly effect in his relevance tuning work: he improves the results for one query, and that change has unanticipated impacts on many other queries. Sometimes those impacts are a pleasant surprise, improving other queries as well. More commonly, they degrade results for other queries.

Baseline Relevance

To deal with this, and to use another analogy, Pete says to himself: “I need to look at the forest, not the trees.” He does this by creating a Baseline Relevance Case in Quepid, a tool that is part of Chorus. If you want to follow along, head over to the Chorus GitHub repository and check out the tenth Kata.

The Baseline Relevance Case is a set of queries that represents the typical queries Chorus receives from its overall customer base. It is best practice to source them from your query logs so they represent real user queries. The set should include queries from both the head and the long tail of the query distribution. 100 queries is typically enough to get started, though more is always better. There is an art to picking the right sample set; you may be interested in how to succeed with explicit relevance evaluation using Probability-Proportional-to-Size sampling.

The sampled queries now make up your relevance case: the representative collection of queries you use to measure the relevance of your e-commerce search.
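To make the sampling idea concrete, here is a minimal sketch of Probability-Proportional-to-Size sampling over a query log, using the Efraimidis-Spirakis weighted-sampling trick. This is not part of Chorus or Quepid; the `query_counts` structure and the toy log are illustrative assumptions, and a real log would feed in many thousands of distinct queries.

```python
import random

def pps_sample(query_counts, k, seed=42):
    """Sample k distinct queries with probability proportional to their
    frequency in the log (Efraimidis-Spirakis weighted sampling without
    replacement): each query gets the key random()**(1/weight), and the
    k largest keys win. Frequent (head) queries are more likely to be
    picked, but tail queries still have a chance."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    keyed = [(rng.random() ** (1.0 / count), query)
             for query, count in query_counts.items()]
    keyed.sort(reverse=True)
    return [query for _, query in keyed[:k]]

# Toy query log: head queries appear far more often than tail queries.
log = {
    "laptop": 5000,
    "tv": 3000,
    "usb c cable": 800,
    "notebook cooling pad": 12,
    "laptop sleeve 13 inch": 9,
}
sample = pps_sample(log, k=3)
```

In practice you would run this over aggregated query-log counts and hand-check the sample for junk queries before turning it into a relevance case.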

Rating the Query Results

Having the queries is only the first step, as this does not yet give us anything measurable. Next, the results for these queries need to be rated. In previous posts, Pete already learned how to rate search results and how to scale the rating process when doing it in house is not an option.

In general, rating means applying a numeric grade to a document for a given query. You can find out more about best practices for judgement rating, as well as many other tips and guides, in the Quepid Wiki.
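Once each returned document has a numeric grade, a ranking metric turns those grades into a single per-query score. As a hedged illustration (Quepid supports several scorers; this is just one common choice, not necessarily the one Pete uses), here is nDCG@k computed from graded judgements on a 0–3 scale:

```python
import math

def ndcg_at_k(grades, k=10):
    """nDCG@k for one query. `grades` are the judged relevance grades
    (e.g. 0-3) of the returned documents, in ranked order. The score is
    the discounted cumulative gain of the actual ranking divided by the
    gain of the ideal (best-first) ordering, so 1.0 means a perfect
    ranking and 0.0 means nothing relevant was returned."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(i + 2)
                   for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Hypothetical grades for the top 5 results of one query, in ranked order.
score = ndcg_at_k([2, 3, 0, 1, 2], k=5)
```

The per-query scores computed this way are exactly the numbers the baseline relevance case aggregates in the next section.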

Using the Baseline Relevance Case

With the baseline relevance case established, you now have a number associated with each query in the case, and an overall number for the whole case. This means you can compare every candidate algorithm against this baseline to see how the candidate would change overall behavior according to your baseline metric.
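The comparison itself can be sketched in a few lines. This is a simplified stand-in for what Quepid does for you, with made-up per-query scores; the averaging-as-case-score and the `tolerance` threshold are illustrative assumptions, not Quepid's actual implementation:

```python
def case_score(per_query_scores):
    """Overall case score: the mean of the per-query metric
    (e.g. nDCG@10) across all queries in the relevance case."""
    return sum(per_query_scores.values()) / len(per_query_scores)

def regressions(baseline, candidate, tolerance=0.01):
    """Queries where the candidate scores meaningfully worse than the
    baseline -- the butterfly-effect casualties worth inspecting."""
    return {q: (baseline[q], candidate[q])
            for q in baseline
            if candidate[q] < baseline[q] - tolerance}

# Hypothetical per-query scores for the baseline and a candidate algorithm.
baseline = {"laptop": 0.82, "tv": 0.74, "usb c cable": 0.55}
candidate = {"laptop": 0.90, "tv": 0.70, "usb c cable": 0.56}
```

Even when the overall case score goes up, the per-query diff matters: here the candidate improves the case on average while still regressing "tv", which is exactly the kind of trade-off the baseline case surfaces.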

These tests are run in a lab environment without exposing real users to the algorithm changes, a process often referred to as offline testing. Offline testing is a quick and cheap way of verifying ideas to improve e-commerce search (or any other search use case) while reducing the risk of introducing a change.

After verifying your idea, you can take it to the next level and run an online test (typically an A/B test) to prove its actual value based on business KPIs, not search metrics. 

Each time you have a candidate algorithm ready to go, bring up the baseline relevance case, update the algorithm settings to your new candidate, rerun the queries, and compare the results against the baseline you just established.

Establishing a Baseline Relevance Metric with Chorus

You can watch Eric walk through the steps of setting up a baseline relevance case in Chorus using Quepid.

Contact us if you need our help with measuring and tuning your e-commerce search.

Chorus is a joint initiative by Eric Pugh, Johannes Peter, Paul M. Bartusch and René Kriegler.

Read the complete Meet Pete series about e-commerce search:
1. Meet Pete, the e-commerce search product manager.
2. How does Pete, the e-commerce search product manager, build a web shop?
3. Pete solves the e-commerce search accessories problem with boosting & synonyms
4. Pete learns about selling search keywords with Chorus
5. Pete finds out how to rate search results to create a judgement list
6. Pete learns how to scale up search result rating
7. Pete learns how to curate search results for a single query
8. Pete establishes a Baseline Relevance Metric
9. Pete improves a new class of queries with redirects