Search Quality is about effective Collaboration*

*and Effective Collaboration is about Testing

A sales expert is not happy about the quality of her search. She’d like a better way to collaborate than yelling at the search devs.

When we start on a search project, we never have a spec that tells us what good search is. The best we have is the expert opinion of marketing, content curators and the (sometimes forceful) opinion of users and suppliers. With such a diverse group of stakeholders involved, there’s plenty of opportunity for negative communication around search quality. Stakeholders fight over what users really intend with their queries–then fight again with developers over how to mold search in that direction.

It makes sense. People do and should take the ordering of search results very seriously. It can mean real money for an eCommerce company. It can mean a doctor finding the right research to help her patients. When a non-technical sales expert or content curator can’t quite get the message across to search developers, disastrous search is bound to happen. Unfortunately, these groups simply use different languages around search quality. Sales talks in conversions, sales, and profit. Developers talk in boosts, scoring, query parsing, and measuring a user’s intent.

Given that there’s this distance between the techies and non-techies, how can one possibly keep all the stakeholders on the same page productively moving forward without more meetings and fights?

Talk is cheap; Tests Matter

The best solution we’ve found is test collaboration – having stakeholders work together creating automated tests for search quality. In other words: Talk is cheap, tests matter.

Using automated tests to collaborate across disciplines is not new to software developers. Yet with search quality this practice is shockingly rare. I suppose this is because search is fuzzy, heuristic, and pseudo-magical in a way that software development often is not. So at first, automated tests would seem silly.

In fact, collaborating on tests is even more fundamental to a successful search implementation than to a normal software project. First, developers don’t do a great job of measuring good vs bad search. Search correctness – what results should be returned for a given query – is best measured by marketing and content experts. In an eCommerce store, this means knowing your customers. It means measuring what is profitable, measuring conversions. It means knowing all the things that developers don’t think about. Without non-technical domain experts helping developers measure search quality, no progress can truly be made on search quality. Without the testing being verified and blessed by these experts, search developers have little to develop against.

From a technical point of view, collaborating on search testing is absolutely essential. Search quality for all searches is governed by a tiny set of heuristic rules crafted by developers. The changes your developers make to this algorithm are very global. Instead of thousands of lines of code, search developers work with a handful of config files and search parameters. Modifications to search parameters almost never “just impact this one case” like a change to software might. Therefore fixing a broken search is almost certainly going to impact working queries. Thus getting everyone on the same page on the trade-offs of “if you fix this query, this other query will change by X%” is even more important. Without this testing, it’s pretty easy to inadvertently have a “dress shoe moment” where search tweaking accidentally breaks your highest-grossing search, destroying your revenue.
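To make the “this other query will change by X%” trade-off concrete, here is a minimal sketch in Python. It assumes each important query already has a numeric quality score before and after a relevancy tweak; the `compare_runs` helper, the 0-10 scale, and the sample queries are all hypothetical and only illustrate the idea.

```python
def compare_runs(before, after, tolerance=0.0):
    """Return the per-query percentage change and the queries that got worse."""
    changes = {}
    regressions = []
    for query, old in before.items():
        new = after.get(query, 0.0)
        # Percentage change relative to the old score (0.0 if there was no old score).
        pct = (new - old) / old * 100 if old else 0.0
        changes[query] = round(pct, 1)
        if new < old - tolerance:
            regressions.append(query)
    return changes, regressions


# Hypothetical per-query quality scores (0-10) before and after a relevancy tweak.
before = {"dress shoes": 9.0, "running shoes": 4.0}
after = {"dress shoes": 6.0, "running shoes": 8.0}

changes, regressions = compare_runs(before, after)
print(changes)      # percentage change per query
print(regressions)  # queries whose score dropped
```

In this made-up run, the tweak doubles the score for “running shoes” but costs “dress shoes” a third of its score, which is exactly the kind of trade-off stakeholders should see before a change ships.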

An example of this can be seen in our work at Silverchair Information Systems. While implementing search for medical research journals, we were tasked with weighting matches across three main variables:

  1. Expert-curated medical tags attached to the document, e.g. [breast cancer] or [chemotherapy]
  2. The text of the document
  3. The recency of the research

Giving too much weight to one of these variables could mean very poor results for crucial queries. Weight the recency of the research too heavily, for example, and suddenly recent research trumps relevant research. If a user searches for “breast cancer”, the highest-ranking result might be yesterday’s research article on toe fungus.
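A toy sketch shows how this can happen. It assumes relevance is a simple weighted sum of the three signals, which is a simplification for illustration and not how a real search engine or our production configuration scores documents. The `Doc` fields, the weights, and the sample documents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    tag_match: float   # 1.0 if a curated tag matches the query, else 0.0
    text_match: float  # 0..1, how well the body text matches the query
    recency: float     # 0..1, where 1.0 means published yesterday


def score(doc, w_tags, w_text, w_recency):
    """Weighted sum of the three signals; a toy stand-in for a real ranking function."""
    return w_tags * doc.tag_match + w_text * doc.text_match + w_recency * doc.recency


def rank(docs, **weights):
    """Return document titles ordered from highest to lowest score."""
    ordered = sorted(docs, key=lambda d: score(d, **weights), reverse=True)
    return [d.title for d in ordered]


# Hypothetical documents scored against the query "breast cancer".
docs = [
    Doc("Breast cancer screening outcomes", tag_match=1.0, text_match=0.9, recency=0.2),
    Doc("Toe fungus treatment update", tag_match=0.0, text_match=0.0, recency=1.0),
]

balanced = rank(docs, w_tags=2.0, w_text=1.0, w_recency=0.5)
recency_heavy = rank(docs, w_tags=0.2, w_text=0.2, w_recency=5.0)
print(balanced[0])       # the breast cancer article ranks first
print(recency_heavy[0])  # the toe fungus article ranks first
```

Nothing about the documents changed between the two runs; only the weights did, and that is enough to put an irrelevant article on top.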

Without tests around our important queries, our team would have no idea how catastrophic heavily boosting recency could be on search relevancy. In our example, they might not speak the right medical jargon to know this article on toe fungus has nothing to do with breast cancer. With a good test suite, the developers can incorporate the expertise of domain experts to instantly measure the impact of our weighting changes to both the queries that already work and the ones that are broken.

Aside from these kinds of catastrophic failures, weighting search a smidgen too far in one direction or another can nudge working queries out of whack. For example, a research article might never directly mention chemotherapy; perhaps it’s a study on a specific chemotherapy drug. There’s little mention of the term “chemotherapy” directly in the document. Nevertheless, our curators have expertly tagged this article with the [chemotherapy] tag so that user searches for chemotherapy pick up the article.

Collaborative testing is crucial here. Without medical experts reporting in on what are good, mediocre, or abysmal results for a doctor’s search for “chemotherapy”, search developers have no way to test the impact of their changes. Without collaboration around a single testing tool, developers don’t get the information they need from non-technical experts, and experts don’t have a way to get developers to really listen.

Search Collaboration Needs Better Processes and Tools

It’s shocking to us that until now, nothing really existed for building collaborative tests around search relevancy. That’s why we built Quepid around the idea of “Test Driven Relevancy”. We built it to help marketing and content experts meaningfully provide their expert opinion and report broken searches. We built it to allow everyone to understand how search quality changed when developers altered relevancy parameters.

Marketing expert rates the quality of this juice for the “apple juice” query, giving developers instant feedback on search quality.

In the same way that sitting together to talk about automated software tests lets everyone communicate about correct behavior, Test Driven Relevancy advocates getting everyone on the same page with search quality. Each “test” is a user query representing an important or representative search to be tested. Non-technical folks (e.g. marketing) use a tool to manually or automatically rate the quality of individual results for each query. Developers use this data to improve the overall site search quality. Everyone can see in one glance how well the search is doing.
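The idea of a rated query as a “test” can be sketched in a few lines of Python. This is not Quepid’s API or its scoring formula; the 1-10 rating scale, the `query_score` helper, the document ids, and the choice to average the ratings of the top results are all assumptions made for illustration.

```python
# Expert ratings on a hypothetical 1-10 scale, keyed by query and then document id.
ratings = {
    "apple juice": {"juice-1": 10, "juice-2": 8, "cider-7": 4, "laptop-3": 1},
}


def query_score(query, results, ratings, k=3, unrated=1):
    """Average expert rating of the top-k results returned for one query."""
    judged = ratings.get(query, {})
    top = results[:k]
    if not top:
        return 0.0
    # Results nobody has rated yet count as the lowest rating.
    return sum(judged.get(doc_id, unrated) for doc_id in top) / len(top)


# Two hypothetical result orderings for the same query.
good_results = ["juice-1", "juice-2", "cider-7"]
bad_results = ["laptop-3", "cider-7", "juice-1"]

good = query_score("apple juice", good_results, ratings)
bad = query_score("apple juice", bad_results, ratings)
print(good)  # higher: well-rated juices fill the top slots
print(bad)   # lower: a laptop leads the results
```

The experts only supply ratings; the developers only change the search configuration. The score is the shared number both groups can look at.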

Test Driven Relevancy using Quepid has helped us turn a number of difficult search stakeholder relationships into productive ones. It incorporates the feedback of all the stakeholders in a single glance. We’ve become accustomed to pairing with non-technical content experts and letting them watch us tweak their search in real time. Simultaneously, the content experts can throw more curveballs – adding new queries that represent edge cases or long-tail searches. Optimizing for the data they’ve put into our system, we can prove to the content experts in the moment that we are making progress on broken searches while holding ground on important working searches. It’s opened a whole new world for our relevancy work, letting us try whatever it takes to get search relevant to users.

In short, I won’t be leaving home without Quepid on my next search relevancy engagement! Are you interested in trying out Quepid for your organization’s search problems? Do you have tough search relevancy problems and would like help? Contact us! We’d love to talk to you about how we can help.