It is probably fair to say that we at OpenSource Connections are known for two things: one is empowering search teams – we teach our clients ‘how to fish’ and support the search community (with Relevance Slack and our community events and trainings) – and the other is search relevance. We feel that the Relevant Search book that Doug Turnbull co-authored during his time at our company and the empowerment that we have provided to the community in this field really created a new focus – and sometimes it even felt like a movement! – within the community of search practitioners.
We’ve made progress on Search Relevance
We still meet a lot of search teams who struggle to improve search relevance, but nowadays they often start from a higher skill level. For example, search metrics such as precision and recall were new to the teams we met five years ago, and those teams were unlikely to have anyone who could build up the related knowledge and put it into practice. Today we find more and more teams who tell us that they already measure the likes of NDCG, and it has become much more common to find relevance engineers and data scientists in search teams. It’s fantastic to see such growth in search expertise!
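For readers who want to see the metric in action, here is a minimal sketch – in Python, not taken from this post – of how NDCG@k can be computed for a single query from graded relevance judgments (using linear gains and, as a simplification, deriving the ideal ranking from the retrieved list itself):

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: each graded relevance gain is discounted
    by the log of its (1-based) rank, so results further down count less."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """NDCG@k: DCG of the actual ranking divided by the DCG of the ideal
    (best possible) ordering of the same judgments."""
    ideal_dcg = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded judgments (0 = irrelevant ... 3 = highly relevant) for the top
# results of one query, in the order the search engine returned them.
print(round(ndcg_at_k([3, 2, 0, 3, 1], k=5), 3))
```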
But often, teams feel that something is still not right: they keep improving their NDCG or other search metrics, but when they put their algorithm into an A/B test they often don’t see any improvement, or their business KPIs even decrease. In addition, interpreting online search user behaviour seems extremely difficult, not only because of the many biases that influence that behaviour and the statistical complexity involved, but also because we struggle to understand the overall quality of search results. Doug Rosenoff’s talk at Haystack US 2022 gave an interesting example of search relevance ratings provided by subject matter experts failing to correlate with online user behaviour.
The next step – Search Result Quality
We think that we can further improve search results if we accept that search users evaluate search results based on criteria that reach beyond search relevance. We are therefore now considering the broader notion of Search Result Quality. Unlike relevance, these additional quality criteria are usually not reflected in the search query, and users are often not even consciously aware of them. Yet they can often be observed in search user behaviour.
We will illustrate different search result quality criteria using the following search use case: let’s assume I’m looking for a birthday present for my little niece. I might go to an online shop and search for toys. I might get back a list of toys – a toy trumpet, a toy car, a teddy bear, a doll. I finally decide on the teddy bear.
But why did I decide on the teddy bear? We cannot explain this just by looking at search relevance: all results were relevant to the query toys. One aspect in my decision making would be imagining how the teddy bear would make my niece, and maybe her parents, happy.
This could be a matter of personal preference, but we also share opinions about how much we would like to see certain toys in our kids’ and our own lives. Probably only the least considerate parents would buy their kids a toy trumpet.
Thus the trumpet should not show up as the first search result – though it is a perfectly relevant result.
Finally, I might be a bit late to buy the birthday gift and I probably don’t want to overwhelm my niece’s parents and buy a super-expensive toy.
This implies that aspects like price and delivery time are part of yet another group of search result quality criteria that is different from search relevance.
My judgment of search result quality will probably be influenced by further factors, such as the imagery and layout used to present the search results.
When we try to understand and measure the quality of a search application, we depend on interpreting the feedback we receive from users – be it explicit feedback that we ask them for, or implicit feedback from observable online user behaviour. The two types of feedback relate to different aspects of search result quality: explicit feedback mainly covers the traditional notion of search relevance, while implicit feedback also reflects criteria such as how the new product would fit into or change our life and the effort it would take to purchase it.
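As a rough illustration of the implicit side, the following sketch (with a hypothetical log format and hand-picked propensity weights, not taken from this post) derives a simple click-based quality signal per query/result pair while partially correcting for position bias using inverse propensity weighting:

```python
from collections import defaultdict

# Hypothetical click-log records: (query, result_id, position, clicked)
log = [
    ("toys", "teddy_bear",  1, True),
    ("toys", "toy_trumpet", 2, False),
    ("toys", "teddy_bear",  1, True),
    ("toys", "toy_car",     3, True),
]

# Illustrative examination propensities: results lower down the page are
# looked at less often, so a click there is weighted more heavily.
# In practice these would be estimated from data, e.g. with a click model.
propensity = {1: 1.0, 2: 0.7, 3: 0.5}

impressions = defaultdict(int)
weighted_clicks = defaultdict(float)
for query, result, position, clicked in log:
    impressions[(query, result)] += 1
    if clicked:
        weighted_clicks[(query, result)] += 1.0 / propensity.get(position, 0.3)

# Bias-adjusted click rate per (query, result) pair.
for key, shown in impressions.items():
    print(key, round(weighted_clicks[key] / shown, 2))
```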
Such beyond-relevance criteria are usually domain-specific: we need a mental model of ‘how things work’ in a given domain and of how search helps users get things done there – and thus of what defines a good search.
Using mental models for understanding search
The above example has been taken from e-commerce search. For this domain, we can usually map search to the consumer buying decision process, which has been an important topic in marketing research for decades.
This model helps us understand the mindset of users at each stage of the buying process and the queries they issue at those stages, and it also helps us come up with a model of the criteria they will use to assess search result quality.
For example, abstracting from the birthday-present use case above, we can categorise the criteria of search result quality in e-commerce into the following groups (which we label ‘Buying, Having, Being’ after Michael R. Solomon’s book ‘Consumer Behaviour: Buying, Having, and Being’):
Having – the type of product that the user is looking for. What type of thing will they have/own?
- Usually explicitly mentioned in the search query; this corresponds to search relevance
- Criteria for this aspect of good search result quality can be described reasonably well
- Result quality can be assessed manually for these criteria

Buying – factors related to the actual purchase of the product, for example price and delivery time, but also trust-related factors such as reviews and seller reputation.
- Usually not explicitly mentioned in search queries, but sometimes applied in filters and sorting
- We can easily accept these factors as quality criteria, but struggle to estimate how much a given price or product rating influences the acceptance of a product; there is usually a strong interaction with the product category
- Result quality is best assessed implicitly based on online user behaviour

Being – factors that relate to our (social) self. How will that product fit into my life? What would my friends say? Does it fit my values?
- Usually not explicitly mentioned in search queries, but sometimes encoded in searches for certain brands or product attributes such as ‘sustainably sourced’
- Users are often not aware of this type of criteria and we struggle to reason about them
- Result quality is best assessed implicitly based on online user behaviour
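To show how such criteria could feed into ranking, here is a purely illustrative sketch – the field names and weights are assumptions, not OSC’s method – that blends a ‘Having’ relevance score with ‘Buying’-related signals such as rating and price:

```python
from dataclasses import dataclass

@dataclass
class Result:
    name: str
    relevance: float  # 'Having': text relevance to the query, normalised to 0..1
    rating: float     # 'Buying': average review rating on a 0..5 scale
    price: float      # 'Buying': price; cheaper is assumed preferable here

def quality_score(r: Result, price_cap: float = 100.0) -> float:
    """Blend relevance with purchase-related signals.
    The weights are illustrative only; in practice they would be tuned against
    judgments and online behaviour, often separately per product category."""
    price_score = max(0.0, 1.0 - r.price / price_cap)
    rating_score = r.rating / 5.0
    return 0.6 * r.relevance + 0.25 * rating_score + 0.15 * price_score

results = [
    Result("toy trumpet", relevance=0.95, rating=3.1, price=15.0),
    Result("teddy bear",  relevance=0.90, rating=4.8, price=25.0),
]
for r in sorted(results, key=quality_score, reverse=True):
    print(r.name, round(quality_score(r), 3))
```

With these illustrative weights the teddy bear outranks the slightly more ‘relevant’ trumpet because of its much better rating – mirroring the decision in the example above.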
Using mental models in other domains
While the above model relates to e-commerce search, we think that defining such a mental model will be key for understanding and measuring search result quality in other domains as well.
For example, in a legal document search application, we might only identify search result quality criteria such as the authoritativeness of the source, the publication date of the document, and the type of document (a commentary on a law vs the original text of the law) once we have understood what users want to get done with the documents they are looking for. Most of these criteria will not be mentioned explicitly in the user’s query and reach beyond the traditional notion of search relevance – yet they are still very important search result quality criteria.
Developing the concept of Search Result Quality
This new model is already informing how we work with clients. We’ll also be giving presentations on this subject over the next year, and we’re looking forward to discussing it with all of you at events such as Haystack and MICES, and in Relevance Slack.
Our new training course: Beyond Search Relevance – Understanding and Measuring Search Result Quality
At OpenSource Connections, we are convinced that this idea of Search Result Quality being much more than search relevance is so important that we have created a new training course around it. It will not only help practitioners push the boundaries of how they can improve their search application, but also provide a deeper understanding of search relevance itself and how to measure it, as a better foundation for search optimisation work.
In this training course participants will
- Learn about mental models for search and the derivation of search result quality criteria
- Deep dive into feedback mechanisms for search. This includes explicit and implicit feedback mechanisms: when to use which, how to create and scale rating tasks, how to process implicit feedback (including dealing with several types of biases), how to verify search result quality judgments
- Learn about the intuition of search quality metrics, when to use which and how to calculate them
- Learn about designing online tests and how to interpret their results
Join us for a first in-person training right after Haystack US, 25th-28th April (conference and training) – tickets are available here.
We’re much better search consultants than we are artists – so do let us know how we can help with your search!