Quepid has provided an implementation of nDCG for quite some time now and has added an implementation of ERR in version 8.0. As both of these metrics use multi-valued judgment scales, it is easy to consider them as competitors. In fact, they model two different, but related, information-seeking goals.
The Players
To talk about what the metrics are modeling, we need to introduce two of my favorite search personas. They personify two very common search goals. Joelle Robinson of Moody’s introduced these two in her Haystack US 2024 talk, Revisiting the Basics: How Round Robin Improved Search Relevancy.
First up is Hulk Smash! Hulk knows what he wants. And he wants it right now! Smashing through the glass, grabbing the goods, and Hulk is out the door. This is the epitome of known-item search. The Highlander principle, “there can be only one”, applies. When our searchers are Hulks, we can model how satisfied they are with rank-based metrics that focus on top results, such as Expected Reciprocal Rank (ERR).
Next up is Dora the Explorer. Dora has goals. To realize her goals, Dora needs to know things. Not just one thing, all the things. Dora gathers information to help her achieve her goals. When Dora comes to search, we can model how satisfied she is using a gain-based metric, such as Normalized Discounted Cumulative Gain (nDCG).
The Metrics
Given our two personas, we can now move on to the math underneath the two metrics. Feel free to skim over the formulae and funny symbols. The important thing to look for is what each metric is trying to measure.
nDCG
The underlying notion with gain-based metrics is that there is a pool of useful information: each document we might return has some amount of that useful information. We can add up how much useful information we obtained from each of the documents that we examine. As with most things, mo better makes it mo better. We want to get all of the possible information, no matter how many documents it takes.
Cumulative gain is the simplest of the gain-based metrics. It is the sum of the relevance values of each of the documents in the result list. This is a set-based metric, and is sometimes called graded precision.
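Written out (the notation here is my own), cumulative gain over a result list of length n is simply:

```latex
\mathrm{CG} = \sum_{i=1}^{n} rel_i
```

where rel_i is the judgment value of the document at rank i.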
DCG adds a discount to cumulative gain. At each rank we discount the gain contributed by dividing it by the log of the rank plus one. We add the one because log(1) is 0, which would otherwise mean dividing by zero at the first rank. Just like CG, DCG lets us see how well a query is doing directly from its magnitude. And, like CG, the value of the metric is not comparable across queries, as the amount of available gain is a function of the set of judgments for the query. If query A has a DCG of 47.5 and query B has a DCG of 22.6, we have no way to say that one of the two is performing better than the other.
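For reference, the discounted sum looks like this (again, notation mine; the base-2 logarithm shown is the conventional choice, the description above just says “log”):

```latex
\mathrm{DCG} = \sum_{i=1}^{n} \frac{rel_i}{\log_2(i + 1)}
```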
This incomparability can arise for a number of reasons. First, A might have many more relevant documents than B, so its gain is higher. Second, A might have more highly rated relevant documents, e.g. judgments of 3, where B has lower rated documents, e.g. judgments of 1. And finally, because of the discounting formula, how the judgment values are distributed within the ranked list matters. The overall effect is that comparing DCG values between queries is comparing apples to oranges: they are not on the same scale.
In order to have an aggregate measure, we can normalize our DCG values by dividing them by the ideal DCG value that results from the perfect ordering of the judged documents. We sort our judged documents by their judgment values: all the 3s, then all the 2s, then all the 1s. We can ignore the irrelevant documents (the 0 scores). Computing DCG on that sorted list produces the very best score that can be obtained by any retrieval system for those judgments.
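Calling that best-possible score the ideal DCG (IDCG), the normalized metric is simply the ratio of the two:

```latex
\mathrm{nDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}
```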
A common implementation error is to compute the ideal DCG by sorting the returned result list by relevance, rather than sorting the full set of judged relevant documents. Another implementation error, which was present in Quepid, is to truncate the list of relevant documents to K when computing the value at K.
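To make the distinction concrete, here is a minimal Python sketch of nDCG@K along the lines described in this post: the run’s DCG is truncated at K, but the ideal DCG is computed from the full, ideally ordered set of judged documents (see “The nDCG Gotchas” below). The function and variable names are mine, not Quepid’s code.

```python
from math import log2

def dcg(grades, k=None):
    """DCG over a list of judgment values in ranked order (DCG@k if k is given)."""
    if k is not None:
        grades = grades[:k]
    return sum(g / log2(rank + 1) for rank, g in enumerate(grades, start=1))

def ndcg_at_k(result_grades, judged_grades, k):
    """nDCG@k.

    result_grades: judgment values of the returned documents, in ranked order
                   (unjudged documents count as 0).
    judged_grades: judgment values of every judged document for the query,
                   not just the ones that were returned.
    """
    # The ideal DCG uses the full set of judgments, sorted best-first, with no cutoff.
    ideal = dcg(sorted(judged_grades, reverse=True))
    return dcg(result_grades, k) / ideal if ideal > 0 else 0.0
```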
nDCG is an effective metric for evaluating information gathering activities, where the user wants to collect multiple sources of information, preferring the highest quality sources first. It is often used in web search.
ERR
Expected reciprocal rank allows us to compute reciprocal rank (RR) with graded relevance. It puts a heavy emphasis on the very first ranks and thus focuses on known-item or single-answer tasks. It is also often used to evaluate question answering systems. It uses a cascade model, one that assumes the user scans the result set from first to last, in order, and stops when they are satisfied. The expectation is that the first satisficing result is the only one the user wants.
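The equation in question, reconstructed here in its standard form (notation mine; n is the length of the result list):

```latex
\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r}\, R_r \prod_{i=1}^{r-1} (1 - R_i)
```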
This becomes obvious when you look at the equation: in the expression (1 - R_i), R_i stands for the probability that the user was satisfied with the result at the i-th rank. Let’s for a moment assume that we work with binary judgments, so that R = 1 if the user is satisfied and R = 0 if they are not.
If the user is satisfied by the result at rank i, the term (1 - R_i) = 0, and so the probability that the user proceeds beyond rank i to any later rank r is 0. In the equation, that continuation probability is multiplied by the satisfaction probability at the r-th rank (R_r). If the continuation probability is 0, R_r contributes nothing: the user stopped at an earlier rank, depending on how strongly they were satisfied at that position.
When we use graded judgments, we have to map our judgments to a range between 0 and 1 when we calculate R in order to make it a probability. More precisely, we map them to values strictly less than 1, so that there is always some probability left that the user proceeds to the next rank. For example, R = Grade / (Grade_max + 1).
With binary judgments mapped so that R is 1 for a satisfying result and 0 otherwise, ERR reduces to the binary version, reciprocal rank (RR).
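Here is a minimal Python sketch of that cascade computation, using the example grade-to-probability mapping above (the names and the 0-3 judgment scale in the usage example are mine):

```python
def err(grades, grade_max):
    """Expected Reciprocal Rank for a ranked list of judgment values."""
    score = 0.0
    p_reach = 1.0  # probability the user gets as far as the current rank
    for rank, grade in enumerate(grades, start=1):
        r = grade / (grade_max + 1)   # probability the user is satisfied here
        score += p_reach * r / rank   # reciprocal-rank credit, weighted by reaching and stopping
        p_reach *= 1.0 - r            # probability the user keeps scanning
    return score

# A top-graded document at rank 1 dominates the score; the same document at
# rank 3 earns far less (hypothetical 0-3 judgment scale):
print(err([3, 0, 0], grade_max=3))  # 0.75
print(err([0, 0, 3], grade_max=3))  # 0.25
```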
The Point of it All
We set up our cases in Quepid with collections of queries aimed at specific user cohorts like Dora or Hulk. We gather relevance judgments, a measure of whether a document could satisfy a search. We choose our evaluation metric to model whether or not the search results did satisfy our user’s information need. For Hulk, that need is one and done. For Dora, it is so much more.
We appeal to ERR to model how well we are satisfying Hulk. The metric is highest when the highest quality result is in the top position on the ranked list. While there might be numerous equally good alternative results at lower ranks, they do not provide any additional value to Hulk.
We appeal to nDCG to model how well we are satisfying Dora. Each new document adds some value. If all the best documents top the list, Dora will be very happy. If there is a mix of more and less relevant documents, the score is lower, and Dora is not as happy. Here we want to provide all of the potentially useful items.
The nDCG Gotchas
nDCG does have some undesirable properties and some unexpected consequences. Beginning with Quepid 8.1, a new implementation of nDCG will be provided as a community scorer, with an eye to ensuring the implementation aligns with those used in the IR evaluation community, specifically with trec_eval.
The change to Quepid’s implementation is to use the full set of relevant documents, rather than the top K relevant documents, when computing the ideal DCG component of the formula. This is done to model that when you return k < |Rel| documents, you are leaving money on the table. If there are 25 documents judged relevant, and you only return 10, you should not be able to achieve an nDCG score of 1.0, even if all 10 of your returned documents have the highest possible judgment value. Because of this, as you add relevant judgments to a query, its nDCG value will tend to decrease. This can be disheartening. Note that the same issue occurs with the Average Precision metric.
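To put hypothetical numbers on that 25-versus-10 scenario: the run’s DCG@10 covers only 10 terms, while the ideal DCG covers all 25 judged documents, so even a “perfect” first page lands well below 1.0.

```python
from math import log2

returned = [3] * 10  # a perfect top 10, every result at the highest grade
judged = [3] * 25    # but 25 documents were judged relevant for the query

dcg_10 = sum(g / log2(r + 1) for r, g in enumerate(returned, start=1))
ideal = sum(g / log2(r + 1) for r, g in enumerate(sorted(judged, reverse=True), start=1))

print(round(dcg_10 / ideal, 3))  # about 0.56, well below 1.0, despite a perfect top 10
```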
Because nDCG is normalized by the ideal DCG value, it can also be a misleading metric. If I have a query that has three documents, all judged as 1, and a second query that has three documents, all judged as 3, and my system returns the three judged documents first for both queries, both queries will have an nDCG of 1.0. This can be dissatisfying.
The bottom line is that you can use gain-based metrics to model explorer-like search cohorts, but you do need to be mindful of the gotchas.
See Also
See also Max Irwin’s excellent blog post on the two metrics.
Disclaimers
Hulk image, Fair use, https://en.wikipedia.org/w/index.php?curid=62782054
Dora Image, By Nickelodeon – Sourced from Dora the Explorer wiki., Fair use, https://en.wikipedia.org/w/index.php?curid=71769562
Need help understanding how to use evaluation metrics such as nDCG and ERR to improve your user search results experience through our proven hypothesis-driven development process?
Partner with Open Source Connections to transform your search capabilities and empower your team to continuously evolve them. Our proven track record spans the globe, with clients consistently achieving dramatic improvements in search quality, team capability, and business performance. Contact us today to learn more.