Blog

How to Implement a Normalized Discounted Cumulative Gain (NDCG) Ranking Quality Scorer in Quepid

Background

As a search relevancy engineer at OpenSource Connections (OSC), when I work on a client’s search application, I use Quepid every day! Quepid is a “Test-Driven Relevancy Dashboard” tool developed by search engineers at OSC for search practitioners everywhere. You can run Quepid as a free hosted service or download the open source code and run it in your own environment.

I like to think of Quepid as both a unit and system tests environment for search relevancy development. A unit tests environment because the end-user can go into a deep dive of the search engine Solr or ElasticSearch-produced Lucene query structure as well as the math behind the Lucene scoring of matched terms and phrases; And a system test environment too as Quepid allows the end-user to monitor progress, or regress, across any number of queries, organized thematically as cases. Which brings us to the topic at hand in this blog: The metric in Quepid to measure progress or regress is a query score, and the queries’ scores combine to produce a top-level case score as shown in Figure 1 below. In this blog, we will use OSC’s favorite Movies index.

Figure 1: Quepid Query Score and Case Score

Figure 1 shows a Quepid case called Movies, which contains two queries (or searches): Star Trek and Star Wars, with scores 67 and 63 respectively. Quepid scores are always linearly scaled to the 1-100 range (This is not currently configurable). Also shown in Figure 1 is the Quepid case’s top-leve score of 65, which is simply the arithmetic mean (a.k.a., average) of the case’s queries’ scores.

So, what is behind these query scores? In this blog, we will explore and explain the Quepid scorers, which compute the queries’ scores. In particular, we will implement a Quepid scorer that computes a popular ranking quality metric in the Information Retrieval field called the Normalized Discounted Cumulative Gain (NDCG). We will be using the Wikipedia article on Discounted Cumulative Gain as our reference CG, DCG, and NDCG implementations; keep it on an opened tab next to you!

Recommended prior knowledge:

  • If you are reading this blog, I am assuming that you have some affinity with, and interest in, the field of search!
  • Some familiarity with Quepid. If you are a current Quepid end-user, it’s even better!
  • Some familiarity with metrics that measure the quality of search results. See this Wikipedia article{:target=”_blank”} for some suggested background readings.

Quepid Scorers

Before we dig into the implementation of an NDCG scorer, let’s briefly review what a Quepid scorer is, as well as the default scorer. You can also check out the great Quepid on-line documentation on scorers for a refresher if necessary. Figure 2 shows the anatomy of the default Quepid score.

Figure 2: Quepid’s Default Scorer

Note: There are two ways to produce Figure 2 on your own. Open any one of your Quepid cases, and:

  • Either, double-click on any query’s score, and click on Test with an ad-hoc scorer: The ad-hoc scorer is initialized to the default scorer’s settings.
  • Or, click on Custom Scorer on the top-level menu, and click on Add New: The new scorer is also initialized to the default scorer’s settings.

In either case, click the Cancel button to close the window.

As shown on Figure 2, a Quepid scorer is composed of the following parts:

  • A unique name, preferably user’s friendly, but you can choose any name of your liking. For example, “NDCG@10” might be a sensible name for an NDCG scorer at rank position 10;
  • A script written in JavaScript (JS) which computes the score for a given query and its results;
  • A rating scale (1 to 10 by default, from worst-result to best-result);
  • And, optionally, the rating labels (Not shown on Figure 2).

Let’s explore the default scorer’s computation, reproduced in Script 1 below:

// Gets the average score over a scale of 100
var score = avgRating100();
if (score !== null) {
  // Adds a distance penalty to the score
  score -= editDistanceFromBest();
}

// (Added by the blog's author) Return the score to Quepid
setScore(score);

Quepid’s default scorer is basically a 100-scaled Cumulative Gain (CG) calculation in avgRating100(), which is then penalized by the factor editDistanceFromBest(). The JS functions avgRating100() and editDistanceFromBest() are part of the Quepid API that any scorer has access to (See the section What can my code do? in the Quepid scorers documentation for a complete API reference).

More specifically, at the time of writing avgRating100() executes the following computation:

score = \frac{\sum_{i=1}^{p}{rating(i)}}{(Number\_of\_rated\_results)}*{\frac{100}{max\_rating}}

where:

  • p is the rank position as defined in the aforementioned Wikipedia article on Cumulative Gain, or depth within the results set, at which the measurement of the results quality is performed. Note that in the Quepid default scorer, p is equal to the number of results displayed as configured by the parameter “Number of Results to Show” in the Quepid settings (See Figure 3).
  • number_of_rated_results is the number of results on the Quepid results page that have been rated by the end-user.
  • rating(i) is the end-user’s rating of the result at position i.
  • max_rating is the maximum value in the scorer’s rating scale, e.g., 10 for 1-to-10 rating scale.

As mentioned earlier, the Quepid query scores are always scaled to the 1-to-100 range.

Below is an example of the value produced by avgRating100:

p (Number of results shown in Quepid) = 10
Scale: 1-10
List of ratings by position in the results set (0 = Position not rated): [10, 8, 9, 0, 5, 1, 4, 0, 0, 0]
Sum of ratings = 10 + 8 + 9 + 5 + 1 + 4 = 37
Number of rated results: 6
max_rating = 10

score = (37/6) * (100/10) = 61 (Rounded down)

The penalty factor editDistanceFromBest() quantifies the distance between the list of ratings and what Quepid calls the best ratings (or best docs), which is the list of non-zero ratings sorted by descending order. The distance is calculated using the Levenshtein distance{:target=”_blank”} (a.k.a., edit distance) between two character strings:

“edit distance … is the minimum number of edit operations required to transform s1 into s2. Most commonly, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string; and (iii) replace a character of a string by another character.”

Continuing with the above example, the edit distance is shown below:

best docs = [10, 9, 8, 5, 4, 1]
edit distance between 
    [10, 8, 9, 0, 5, 1, 4, 0, 0, 0] 
and [10, 9, 8, 5, 4, 1, 0, 0, 0, 0] 
at p=10 = 4   

score = 61 - 4 = 57 

To explore further the above computations, I recommend you create a new scorer, and use the script below to dump various messages and objects in your browser’s developer’s tools JS console (See the console.log(...) statements):

// Explore 'docs'
console.log(docs)

// Explore 'bestDocs'
console.log(bestDocs)

// Gets the average score over a scale of 100
var score = avgRating100();
console.log('avgRating100=' + score)

if (score !== null) {
  // Adds a distance penalty to the score
  var editDistance = editDistanceFromBest()
  console.log('editDistanceFromBest=' + editDistance)
  score -= editDistance;
}
console.log('score=' + score)
setScore(score);

Now armed with a solid understanding of the Quepid default scorer, let’s implement our own scorer, using the popular NDCG as an example.

NDCG@10 Scorer

As mentioned earlier, we will be using the NDCG definition specified in the Wikipedia Discounted Cumulative Gain article. And we will build an NDCG scorer at rank position 10 (a.k.a., NDCG@10).

CG, DCG, IDCG, and NDCG

For reading convenience, the math of interest is reproduced below. First, the primitive Cumulative Gain (CG), which simply adds the ratings up to a specified rank position:

Cumulative\ Gain\ at\ p= CG_p = \sum_{i=1}^{p}{rating(i)}

Then, the Discounted Cumulative Gain (DCG), which penalizes, or discounts (logarithmically), each rating based on its position in the results:

Discounted\ CG_p = DCG_p = \sum_{i=1}^{p}{\frac{rating(i)}{log_2(i + 1)}}

And finally, the Normalized Discounted Cumulative Gain (NDCG), which normalizes the gain to a number between 0.0 and 1.0, hence making NDCG comparable across queries and search engines. The normalization is accomplished by dividing the query’s DCG with the so-called Ideal DCG (IDCG), which is the DCG of the best possible results based on the given ratings (Same definition as in QCG above):

Ideal\ DCG_p = IDCG_p = \sum_{i=1}^{|REL|}{\frac{rating(i)}{log_2(i + 1)}}

where |REL| is the number of best ratings up to position p (Note: |REL| <= p)

Normalized\ DCG_p=\frac{DCG_p}{IDCG_p}

NDCG Scorer Implementation in Quepid

Experimentation in JSFiddle

We recommend that you first experiment with, and debug, your new Quepid scorer in a debugging-friendly JavaScript sandbox like JSFiddle or codepen with mocked (or simulated) Quepid objects and API. Our NDCG scorer was first written in this JSFiddle.

Mocked Quepid Objects and API

First, the following objects and API are mocked in our JSFiddle:

Mocked Quepid Objects:

  • docs: The search results.
  • bestDocs: The best (ideal) rated docs.

Mocked Quepid API:

  • function docRating(i): Returns the rating of result at position i.
  • function hasDocRating(posn): Returns true if search position i is withing the rank position, and false otherwise.

Using the Wikipedia article data example, the complete Quepid mockup is reproduced in Script 3 below:

// Wrap the mocked Quepid objects and API in a namespace
let mockedQuepidApi = {};

(function(context) {

  // List of docs/ratings on page #1 (e.g., 6 with the Wikipedia example)
  const _docs = [3, 2, 3, 0, 1, 2];

  // List of best docs: Descending order of ratings
  const _bestDocs = [{
    rating: 3
  }, {
    rating: 3
  }, {
    rating: 3
  }, {
    rating: 2
  }, {
    rating: 2
  }, {
    rating: 2
  }, {
    rating: 1
  }];

  context.getDocs = function() {
    return _docs;
  }
  
  context.getBestDocs = function() {
    return _bestDocs;
  }

  // Document's rating accessor
  context.docRating = function(position) {
    // from the
    return _docs[position];
  }

  // Has a ranking position been rated?
  context.hasDocRating = function(position) {
    return position >= 0 && position < _docs.length;
  }

})(mockedQuepidApi);

The Quepid API is wrapped in a namespace so that we are able to pass to the NDCG scorer either the mocked or actual Quepid API.

NDCG@10 Scorer Implementation

Our NDCG@10 implementation is shown in its entirety in Script 4 below, and is annotated subsequently.

// Scorer's name space to avoid name collision with the Quepid-provided objects.
var NDCG_scorer = {};

(function(context) {

  let _quepidApi = null;

  // --------------------------------------------------------------------
  // JS console logging.
  // Argument:
  // obj: Object to log to the JS console.
  // --------------------------------------------------------------------
  function log(obj) {
    // Log messages on the developer's console.
    var debug = true
    if (debug === true) console.log(obj)
  }

  // --------------------------------------------------------------------
  // Log base 2.
  // Argument:
  // num: Number to caculate the log base 2 of.
  // --------------------------------------------------------------------
  function log2(num) {
    return Math.log(num) / Math.log(2);
  }

  // --------------------------------------------------------------------
  // Round with precision. Used for logging numbers.
  // Arguments:
  // number: Number to round.
  // precision: Rouding precision.
  // --------------------------------------------------------------------
  function myRound(number, precision = 0) {
    const factor = Math.pow(10, precision);
    const tempNumber = number * factor;
    const roundedTempNumber = Math.round(tempNumber);
    return roundedTempNumber / factor;
  }
  
  // --------------------------------------------------------------------
  // Retrieve a document's rating.
  // Arguments:
  // posn: Document's position (0-based) in the results.
  // defVal: Default rating value to use if the document has not been rated.
  // --------------------------------------------------------------------
  function ratingOrDefault(posn, defVal) {
    if (!_quepidApi.hasDocRating(posn)) {
      return defVal;
    }
    return _quepidApi.docRating(posn);
  }  

  // --------------------------------------------------------------------
  // Build the actual list of ratings.
  // Arguments:
  // p: Max rank position (See 'p' in the Wikipedia article)
  // --------------------------------------------------------------------
  function actualRatings(rankPosition) {
    const ratings = [];
    for (let i = 0; i < rankPosition; i++) {
      const rating = ratingOrDefault(i, 0.0)
      log("Doc " + i + ": " + rating)
      ratings.push(rating);
    }
    log("Actual Ratings: " + ratings);
    return ratings;
  }

  // --------------------------------------------------------------------
  // Build the ideal list of ratings.
  // Arguments:
  // p: Max rank position (See 'p' in the Wikipedia article)
  // --------------------------------------------------------------------
  function idealRatings(rankPosition) {
  	const idealRatings = [];
    const bestDocsCount = _quepidApi.getBestDocs().length;
    for (let i = 0; i < bestDocsCount && i < rankPosition; i++) {
    	idealRatings.push(_quepidApi.getBestDocs()[i].rating);
    }
    
    // Padd with 0 (Not rated) if necessary.
    for (let i = idealRatings.length; i < rankPosition; i++) {
      log("Ideal ratings list: Padding with 0.0 at position (0-based) " + i);
      idealRatings.push(0.0);
    }
    log("Ideal ratings: " + idealRatings);
    return idealRatings;
  }

  // --------------------------------------------------------------------
  // DCG calculator
  // Arguments:
  // docList: List of rated documents.
  // p: Maximum rank position.
  // --------------------------------------------------------------------
  function dcg(ratings, p) {
    let dcgScore = 0
    for (var i = 1; i <= ratings.length && i <= p; i++) {
      const logPosition = log2(i + 1);
      const dcgAdder = ratings[i - 1] / logPosition
      dcgScore += dcgAdder
      log('i=' + i + " ratings[i-1]=" + ratings[i - 1] + "/log(i+1)=" + logPosition + " => DCG incr. of " + myRound(dcgAdder, 3) + " (DCG=" + myRound(dcgScore, 3) + ")")
    }
    log('DCG([' + ratings + ']) = ' + dcgScore)
    return dcgScore
  }

  // Public API: ndcg
  // Arguments:
  // 1. Rank position
  // For example, for an NDCG@10, use p = 10
  //
  // ATTENTION: Due to current limitation in Quepid ("bestDocs" is
  // 10-deep at most), p must be <= max(Number of results shown, 10)
  // For example, for p = 10, set the "Number of Results to Show"
  // in the Settings to 10.  
  // 
  // 2. Rating scale max value: MUST MATCH THE MAX RATING SPECIFIED IN THE 
  // "Scale for query ratings" scorer configuration.
  context.ndcg = function(quepidApi, rankPosition, ratingScaleMax) {

    log(" RUNNING NDCG!!! ");
    log(" *************** ");
    
    _quepidApi = quepidApi;

    log('Docs count    : ' + _quepidApi.getDocs().length)
    log('bestDocs count: ' + _quepidApi.getBestDocs().length)

    // Log the objects for inspection in the developer's console
    log(_quepidApi.getDocs())
    log(_quepidApi.getBestDocs())

    log('Rank position (p)=' + String(rankPosition))
    log('Using scale 1-' + String(ratingScaleMax))

    var myDcg = dcg(actualRatings(rankPosition), rankPosition);
    var iDcg = dcg(idealRatings(rankPosition), rankPosition);
    var nDcg = myDcg / iDcg;

    log(" DCG: " + myDcg);
    log("iDCG: " + iDcg);
    log("NDCG: " + nDcg);

    return nDcg;
  }

})(NDCG_scorer);
Scorer’s Name Space

The NDCG@10 implementation is wrapped in a name space (See Script 5 below) in order to avoid object name collision within the Quepid runtime context (See this Dynamic Namespace technique.)

var NDCG_scorer = {};

(function(context) {
    // NDCG implementation
    // ...
    
})(NDCG_scorer);

Helper Functions

The following helper/convenience functions are helping with various tasks in the NDCG calculation:

FunctionDescription
log(obj)JS console logging in your browser’s developer tools.
ratingOrDefault(posn, defVal)Returns the rating of document/result at position posn, using the Quepid api hasDocRating(posn) and docRating(posn).
log2(num)Calculats the log base 2 of a number (As used in the DCG).
myRound(num)Round a floating point at a given precision (Used for logging only).

Actual Ratings, Ideal Ratings, and DCG and NDCG Functions

let’s now get into the meat of the implementation, starting with the NDCG scorer’s public API ndcg function, reproduced in an abbreviated form (without the logging) in Script 6 below.

context.ndcg = function(quepidApi, rankPosition, ratingScaleMax) {
      
    _quepidApi = quepidApi;

    var myDcg = dcg(actualRatings(rankPosition), rankPosition);
    var iDcg = dcg(idealRatings(rankPosition), rankPosition);
    var nDcg = myDcg / iDcg;

    return nDcg;
  }

The ndcg function takes the following arguments:

  • quepidApi: A Quepid API wrapper, which wraps either a mocked Quepid API (See Script 3 earlier), or the production Quepid API. Such a thin wrapper layer allows us to run the same NDCG scorer implementation in either the JSFiddle sandbox environment for debugging, or in Quepid.
  • rankPosition: The rank position to calculate the NDCG for, i.e., the “p” in the Wikipedia article.
  • ratingScaleMax: The upper bound of the rating scale, i.e., 10 for a 1-10 rating scale.

The ndcg function performs the following steps:

  1. Saves the passed-in Quepid API wrapper in the private scorer name space variable _quepidApi
  2. Calculates the actual ratings’ DCG: dcg(idealRatings(rankPosition), rankPosition) using the scorer private name space function dcg
  3. Calculates the ideal ratings’ DCG: dcg(idealRatings(rankPosition), rankPosition);, also the same dcg function as above;
  4. And finally calculates and returns the NDCG by dividing the actual ratings’ DCG with the ideal ratings’ DCG.

The scorer’s private name space actualRatings function creates the list of the document’s ratings using our helper function ratingOrDefault, in the order they were returned in the seach results. The scorer’s private name space idealRating function creates the best list of ratings, using the Quepid API getBestDocs, that is sorted by descending rating value, and padded with zeros if necessary to reach a list length of rankPosition.

To see all that in action, go ahead and run the code in JSFiddle (Make sure you open your browser’s developer tools’ JS console to see the logging).

Taking our NDCG@10 to Quepid!

After debugging our NDCG@10 scorer in JSFiddle, we’re now ready to take it to Quepid. To do so, we perform the following steps in Quepid (See Figures 4, 5, 6):

  1. Click on Custom Scorers;
  2. Click on Add New;
  3. Type a scorer’s name, e.g., NDCG@10;
  4. Paste the code that is provided in Script 7 below (The code is annotated following the script);
  5. Click on Create; At any time later on, you may edit the scorer by clicking on the pencil icon (See Figure 5);
  6. Open your Quepid case;
  7. Click on Select Scorer, pick the scorer in the list, and click on Select Scorer (See Figure 6);
  8. Re-run the case to update the scores.
/ Wrap the Quepid objects and API in a namespace.
// Simple pass-through required by our NDCG scorer.
let quepidApi = {};

(function(context) {

  context.getDocs = function() {
    return docs;
  }
  
  context.getBestDocs = function() {
    return bestDocs;
  }

  // Document's rating accessor
  context.docRating = function(position) {
    // console.log('Position ' + position + ' rating: ' + docRating(position));
    return docRating(position);
  }

  // Has a ranking position been rated?
  context.hasDocRating = function(position) {
    // console.log('Position ' + position + ' rating? ' + hasDocRating(position));
    return hasDocRating(position);
  }

})(quepidApi);

// Scorer's name space to avoid name collision with the Quepid-provided objects.
var NDCG_scorer = {};

(function(context) {

  // Paste here the NDCG scorer implementation from Script 4.
  // ...

})(NDCG_scorer);

setScore(NDCG_scorer.ndcg(quepidApi, 10, 10) * 100);

Let’s point out the following in Script 7:

  • As required by an NDCG scorer implementation, a wrapper of the Quepid objects and API is created (See the quepidApi name space in Script 7). The wrapper is essentially a pass-through to the Quepid docs and bestDocs objects, as well as the hasDocRating and docRating functions.
  • The score calculation is performed by calling the NDCG scorer’s public API ndcg with the arguments quepidApi (Quepid API wrapper), 10 for the rankPosition, and 10 for the max rating scale value.
  • The calculated score is returned to Quepid by calling the Quepid API setScore(). Voilà!

As a final illustration, Figure 7 shows the execution of our Movies case with the NDCG@10 scorer. Note the logging messages in the (Chrome) JS logging console, that show the calculation of the score for the search “Star Trek”.

Closing Remarks

Well, it’s been a long blog. Thanks for staying with me. Take any part of the code above or in my JSFiddle, and experiment with it in your own NDCG scorer. Or, based on the scorer development methodology I presented in this blog, go ahead and create a Quepid scorer of your own invention. The sky is the limit! And let us know what you think, and don’t hesitate to contact OSC about all search things.