Blog

# Demystifying nDCG and ERR

Welcome back, dear reader! In this post, we unwrap the mystery behind two popular search relevance metrics through visualization, and discuss their pros and cons. Our subjects for this exercise are Normalized Discounted Cumulative Gain, and Expected Reciprocal Rank, commonly acronymified as nDCG and ERR. We’ll start with some refresher background, visualize what these metrics actually look like, and paint a picture of how each can be either helpful or misleading, depending on the situation. Afterwards, you’ll have a better understanding of their behavior and which ones to use when (and why).

## Assumptions

Note that, while some basic things are explained, this is not an introduction to these metrics – so I’ll assume you’ve at least heard of nDCG and what it’s used for! So if you’re new to relevance measurement, you probably want to start with something like the book “Relevant Search”, or at the very least the Wikipedia article on search metrics.

As a formality, we’ll stick with the relevance grading scale of poor=1, fair=2, good=3, and perfect=4. We’re also only going to look at the grades of the top 4 result positions, and assume each of those results has a grade.

We’re going to also assume that your results are listed one at a time on each row, and not on a grid. There are varying opinions on how best to measure grid results, but that’s beyond the scope of this post.

OK! No more assumptions, let’s get to it…

## Background

nDCG has been the go-to metric for measuring relevance in a typical search results list since its first introduction in SIGIR 2002. In the original ERR paper, improvements over its predecessor were mightily touted. However, though being around for about a decade, ERR is surprisingly underused.

The similarities shared between the two metrics are, as noted in the title, the discount and cumulative functions of relevance. These respectively mean that a result is ‘worth’ less the lower on the result list appears, and the grades of all the documents in the list contribute to the score for a query’s relevance. When I spoke at Haystack in April, I showed a simple breakdown of DCG while diagraming the components of the formula:

To get a DCG score, just follow these steps!

1. For all the results
2. Award a result by its relevance
3. Punish a result by its rank
4. Add all the result scores

Step 3 in the above diagram is the ‘discount’, and step 4 is the ‘cumulative’. Together they provide the motivation to get relevant documents to the top (and the irrelevant documents to the bottom) to acheive a higher overall score. nDCG has an additional trick up its sleeve – it uses the ideal list of documents to normalize the score between 0 and 1:

While this is helpful (or not, depending on your math chops), it’s hard to tease out the full picture. For one thing, it doesn’t use any sort of maximum grade. Grades can be anything, and this can be troublesome when performing apples-to-apples comparissons between results. We’ll see why later, but first let’s dive into Expected Reciprocal Rank, and its purported improvements over nDCG.

The first thing ERR incorporated as a fix, was that you must outright declare what the maximum possible grade is. This helps by not needing to calculate an ‘ideal’ score, but it also can be misleading because it assumes there are always relevant documents available. Another addition is the cascading model of relevance. When a user sees a search result they like, they are satisfied. Starting with the understanding that a user will lose trust in your search engine the longer it takes them to be satisfied is the cascade. It works by keeping a running tab of this ‘trust’ a user has, and punishes a query’s score when it’s not satisfying a user quickly.

## Showdown

Whew! With that out of the way, time for the fun part. Let’s look at how these two actually differ in the real practical world. To do that, we’re going to map out every possible combination of grades for the top 4 positions, get their scores, and plot them. The Y axis is the resulting score based on its respective X axis grades of the top 4 results. For example, given the first four results with grades 2, 4, 3, 2, that gives us a standard nDCG of 0.7697. The Y axis is 0.7697, and the X axis is “2,4,3,2”. To make things more interesting, we’ll also look at different discount models. We can change the way lower scores are punished, and it is useful to see how this impacts the metric. The standard for nDCG is 1/log2(r+1), and the standard for ERR is 1/r (where ‘r’ is the position rank).

Noting specifics, for the standard discount model of nDCG, right away we start off at about 0.45. This means that for the top four results, you’ll never have a score lower than that. You can also see that lots of the possible combinations for nDCG yields a perfect score of 1.0. Why is this? Well, If your top four results are 1,1,1,1 nDCG will say that’s a perfect score because the ideal sort is the same. We’re going to actually list out the full table below, and you’ll see more of that harsh truth there.

But first, let’s show the visuals for ERR and compare:

Fascinating. You can clearly see the juxtaposition of nDCG favoring higher scores, and ERR with a more balanced growth. You can also clearly see the dropoff cliff in the ERR scores as soon as the top result becomes a 4 (perfectly relevant). For many of the discounts, ERR heavily favors queries that give a perfect first result. This is not surprising – because the user will be satisfied immediately, making the other results inconsequential.

You may have also noticed something interesting when examining the discount functions. Did you happen to catch 1/(r^0.18)? I arrived at that function through human learning (trial and error), looking for a good discount that provided a more gradual dropoff when the first result was not perfect. While this makes for a more balanced metric however, it can be seen as going against the cascading model’s purpose. With the far more drastic cliff of 1/r (the green line), there will be a much clearer signal for an irrelevant top result.

## Pros and Cons

The visualizations above (and the data tables below), gave us an interesting glimpse into the behavior of these two formidable metrics.

### nDCG

nDCG is a great metric when you’ve done your best job at grading, and don’t mind a high score when you have nothing better to offer. Remember, nDCG will return a perfect 1.0 for any result set that has the grades in order of highest to lowest in the resultset. When using nDCG, I always recommend using the global ideal rather than the local ideal. This means that when you know a better document exists is out of your measurement scope (like the 10th document in an nDCG@4 measurement), use that as part of your ideal and avoid just sorting the top 4. Also, for learning-to-rank, consider just using DCG without the normalization! If the goal is a higher number, nobody cares that it’s between 0 and 1. nDCG also has no way to indicate what the maximum score is. To get around this, sometimes it might suit well for the ideal to be all perfect scores for the positions as a best theoretically possible relevant set, as the ideal denominator.

### ERR

The default ERR is a great metric for providing a good signal whether the top result is relevant. Some practitioners will argue this is all it’s good for, but If you are serving content that needs more flexibility, you can also tune the discount function for when this is not the case and you want more forgiveness. One interesting thing about ERR is that it never returns a perfect 1.0 score, and it will always assume that the score can be better, which is a main contrast with nDCG and one I happen to like.

## Conclusions

In this authors opinion, I prefer using ERR and modifying it to your needs for most cases. It is more advanced than nDCG and may be more complex to explain, but it’s more closely aligned with how people actually behave and react – people do get frustrated with search engines that don’t show relevant documents at the top, so it’s a good idea to use a metric that models that frustration. There are those that argue for information needs with multiple good results (such as exploratory search), that ERR doesn’t accurately reflect this, but there are ways to customize ERR to build the desired measurement – the paper itself has a section on diversity for such occasions, which usually goes overlooked.

Thanks for reading, and see you next time!

## Code

The code used to create the above plots and the data tables below can be found at https://github.com/o19s/metric-plots

## Data Tables

Here are the colorized data tables for nDCG and ERR, as visualized above.