Learning to Rank turns search into a machine learning problem. In this article, I want to explore what makes search distinct from other machine learning problems. How does one approach search ranking as a machine learning problem vs say classification and regression? We’ll go through a tour of a couple approaches, that will give you an intuition on how to evaluate a learning to rank method.
Measuring Success! (And failure) of search
The fundamental difference comes down to the goals of search versus classic machine learning problems. Or more specifically, how each quantitatively measures success. The accuracy of a stock price prediction system, for example, depends on how far, on average, our predictions are from real company stock prices. If we predicted Amazon to be at $123.57, but in reality it’s $125, we’d say that’s pretty close. If on average, our prediction stayed under $1$2 error from the actual stock price, we’d probably conclude our system worked well for most needs.
This “error” in this case is really what we’d call a residual: the difference between the actual value and a predicted one: actual  predicted.
(In reality residual^2 is what’s actually minimized, but we’ll keep it simple here.)
During training time, a regression system uses how we quantify success (or failure) find an optimal solution. We can try different quantitative features of companies, such as number of employees, revenue, cash on hand, or whatever else we think will help to minimize the average error in our stock prices. Naively, it might turn out leastsquares regression “learns” that the linear formula like stockPrice = 0.01 * numEmployees + 0.9 * revenue + 0.001 * cashOnHand
does the best job minimizing the error.
Whether search is good or bad (or somewhere in between) looks quite different. Whereas stock prices are completely about minimizing actual  predicted
, search systems try to approximate an ideal ordering for a search query. In other words, the goal is to minimize how “far” a list of items is from an ideal, prioritized ordering that reflects how a human searcher perceives good/bad search.
For example, if a user of an ecommerce catalog searches for “dress shoes”, we might define a rough sense of the following ideal ordering:
 Bestselling dress shoes
 Lowperforming dress shoes
 Other kinds of shoes
 Ladies dresses (if at all)
With this ideal in mind we can imagine some scenarios and guesstimate how far they are from the ideal. Showing our “dress shoes” searcher some kind of shoes before dresses is at least something — but not at all close to perfect. It’d be like predicting Amazon’s price at $150 instead of $125:
Closeish?
 A pair of tennis shoes
 Meh Dress Shoes
 A ladies dress
On the other hand, showing the meh dress shoes one position before the bestselling dress shoes before other shoes, etc might be very close, but not exactly the same. This might be comparable to predicting Amazon’s stock price at $123:
 Meh pair of dress shoes
 Bestselling pair of dress shoes
 Tennis Shoes
As you can see with search, instead of actual  predicted
it’s all about getting as close as possible to the best ordering for each user query. NDCG is one such metric that measures how far an ordered set of search results is from an ideal ranking. Other metrics exist that try to measure the “goodness” of search results with their own pros & cons. These metrics almost always go from 0 (worst possible search results)  1 (perfect search results).
Further, these metrics can’t just be a pure diff or edit distance type calculation. As users look more at the top search results, getting these first few right should have priority. For this reason search metrics often include a position bias that means when the first several results deviate from the ideal, search is considered much worse than if a bottom result deviates. NDCG has a position bias built in.
While there’s a bunch of metrics available (such as ERR, MAP, etc), I’ll just say “NDCG” in this article as a shorthand for really any relevance metric.
Producing a ranking function with Machine Learning
A classic regression problem constructs a function, f, that makes direct predictions using a set of features. We try to minimize actualStockPrice  f(companyFeatures)
. The goal of learning to rank is to construct a ranking function that doesn’t make direct predictions. Instead it’s used for sorting – sorting a set of ideal orderings we’ve been provided as training data. The ranking function we learn takes two inputs, the query and the document, and assigns a score that correctly orders documents for a query.
Restating the above a bit more mathy:

Stock Market: For company x, Produce
f(x)
that minimizesy  f(x)

Search: For document d, query q, Produce
f(d,q)
that maximizes NDCG of all documents when sorted byf(d,q)
descending
Let’s walk through a rough example to see what we mean. As data scientists / search engineers we might suspect the following features of queries/documents could be helpful in our ecommerce catalog:
 The TF*IDF score of a keyword in a product’s title —
titleScore(d,q)
 The TF*IDF score of a keyword in a product’s description —
descScore(d,q)
 The number of items sold
numSold(d)
A machine learning process might arrive at a document scoring formula like:
f(d,q) = 10 * titleScore(d,q) + 2 * descScore(d,q) + 5 * numSold(d)
That maximizes NDCG across a set of test queries (that is it comes close as possible to user’s ideal ordering as we can get with this kind of model)
Most of the complication in learning to rank usually comes from reformulating the problem so that other machine learning methods can be applied. These tend to come in three categories: pointwise, pairwise, and listwise problems which we’ll cover next. We’ll take a brief tour of a couple of methods, and stop and reflect on the pros and cons of each.
Point wise learning to rank
In search, the “training sets” look sortof like normal machine learning training sets. Pointwise learning to rank uses this fact, so let’s review briefly what a search training set looks like. Search training sets come in the form of graded documents for a query known as judgment lists. Here’s an example of what a judgment list might look like.
Grade,query,document
4,dress shoes,best_selling_dress_shoes
3,dress shoes,meh_dress_shoes
…
0,dress shoes,ladies_dress
0,dress shoes,toga_item
As you can see, documents exactly relevant, such as best_selling_dress_shoes
are given a grade of 4, while completely irrelevant documents (dresses and random togas) are graded with 0.
Pointwise learning to rank just gives up caring about optimizing perquery ranking directly. Instead, we just try to predict the relevance grade. We use some kind of regression to create a ranking function f(d,q)
for document d, query q. Just like the stock price example, we attempt to minimize the residual. We hope we get a 0 for f(toga_item, “dress shoes”)
and a 4 for f(best_selling_dress_shoes, “dress shoes”)
.
During training and evaluation our simple error residual y  f(d,q)
is minimized (y here being the grade for d,q
). In this case suppose our ranking function, f
, gives our grade 0 document a score 0.12 and our grade 4 a 3.65. That appears to be doing very well just looking at residuals. If on average our document scores don’t stray on average more than 0.5 away from their grade, then we can hope that perquery NDCG is also being maximized. That is, we hope if we can create a ranking function that just spits out the grades, we should get close to the ideal ordering our users expect.
But appearances can be deceiving. One problem with pointwise learning to rank is that getting the ‘top items’ ordered correctly is usually more important than obscure items on the tail of the judgment list. Basically all the knowledge & position bias inherent in maximizing a metric like NDCG is being ignored.
In practical terms, a model that regularly swaps exactly relevant and meh items, but can accurately predict lower value relevance grades that end up on page 50 isn’t that great. Buyers see marginally relevant items in the first few results and aren’t impressed by the quality of the selections. So they move on.
Even more catastrophic, another problem arises when you consider relationships that only happen within a specific queries. Pointwise approaches wash away the query grouping, ignoring these withinquery nuances. For example, one query correlates strongly with the relevance score on a title field while another more with a description field score. Or perhaps a “good” title match for one query is a 5, but a “good” title match for another query is say a 15. These scenarios aren’t hypothetical: inconsistent document frequency of different matches can create these scenarios. You can learn more in this blog post on linear learning to rank.
So in general, pointwise approaches don’t perform terribly well, and we’ll move on to looking at approaches that don’t wash away the query grouping and instead attempt to directly optimize each query’s ordering with a ranking function.
Listwise, pairwise, lions, tigers, bears, oh my
Pointwise learning to rank works to minimize the difference between an ideal and actual relevance grade. Other methods define a different sense of error that comes closer to directly optimizing each query’s ideal ordering. Let’s look at two examples of a listwise and pairwise learning to rank solution that come closer to incorporating position bias and an ability to capture perquery nuances.
Optimizing the list directly w/ ListNet
Listwise learning feels like the purest form of learning to rank. It defines the error very directly: How far is the current ranking function’s list is from the ideal? One such method, for example, is to look at the permutation probability of a given ordering.
The underlying idea is to define a function that computes the probability that a given permutation of relevance scores is actually what a searching user is looking for. If we think of “scores” as grades from the judgment list, the ordering where the 1st results grade is higher than the 2nd, and so on would receive the highest probability. However, it’s possible that even our relevance grades from our judgment list aren’t correct for this user in this place, time, and context. So a single lower scoring item shuffling above a higher scoring item is treated as slightly less likely than perfect relevance sort order — Maybe that’s actually what the user here/now wants! A completely reordered list with the lowest relevance score first seems extremely unlikely, with a permutation probability approaching zero.
Instead of having to compute an error that must look at every possible list ordering, it’s more computationally feasible to approximate permutation priority by only looking at the probability that the first item in the permutation is “best” for a search. This is called the “Top One” probability. It looks at a single relevance score alongside every other relevance score for a query to compute the probability that item would be first. As you’d expect, higher graded/scoring relevance items would receive a higher top one probability and lower graded items less likely to be the right “top one” for this user, context, place, and time.
One List Wise Method ListNet proposes to minimize the crossentropy between the relevance grades from the training set and weights in a neural network. Sounds like a lot of mumbo jumbo, but what’s really important is to work through the error function that’s being minimized to develop an intuition for how it works, as shown in psuedopython below:
def error():
error = 0
For each query
For each doc:
error = TopOneP(doc.grade) * log( TopOneP(f(doc, query)))
Here f
here is a ranking function we’re optimizing. TopOneP
is the “top one” probability given the score or garde.
First, let’s look at the first term: TopOneP(doc.grade)
. If you think about what happens here, if the “top one” probability based on the judgment list grade is high, then the second term log(P…
will be weighted more heavily. In other words, the first term can be thought of as a priority of the item based on the grade in the judgment list. Higher graded items better have a more accurate top one probability: exactly the position bias we’re looking for!
Now look at the second, log( TopOneP...
) term. Recall, log(TopOneP=1) is 0, and log(TopOneP<1) is increasingly negative as TopOneP approaches 0. So if TopOneP gets close to 0, and thus the log increasingly negative, then the entire term gets increasingly negative. More negative is bad, because error=(big negative number)
makes error get more positive.
Taken together, more error is added when (1) the item is important in the judgment list (TopOneP(doc.grade)
is high) and when the TopOneP
of our ranking function f
is really low. Clearly this is bad: something that really needs to be towards the top based on the judgments, really should have f
sorting it to the top.
The goal in ListNet is to minimize error by iteratively updating weights in f
. I won’t get into that here; what’s more important is the intuition above. You can read more about it and how it uses this definition of error to calculate a gradient (how to change the weights of features) to minimize the error in the paper.
Pairwise optimization with RankSVM
Pair wise learning to rank works by minimizing the number of out of order results in the search results. A specific metric, Kendall’s Tau measures the proportion of inorder pairs in a search solution. One form of pairwise learning to rank classifies what makes an item “in order” vs “out of order” for a query. You might learn, for example, that a bit more title score, but a bit lower total number of sold matters when ordering a particular set of queries.
One popular pairwise learning to rank approach is RankSVM as introduced in this paper by Thorsten Joachims and also demonstrated in Python in Fabian Pedregrosa’s great blog post who kindly lent me his images for this post:).
Imagine a case where we have two queries. Perhaps our “dress shoes” query, but also one for “tshirt”. The graph below has the yaxis as, say, the title score feature while the xaxis might be the number of products sold. We’ve plotted each judgment for “dress shoes” as a triangle, and each judgment for “tshirt” in circles below. Increasingly “dark” shapes are considered more relevant.
Our goal is to discover a function similar to the w vector in the image above that comes closest to correctly ordering the search results.
It turns out that this task is the same as classifying what makes one item better than another item — a binary classification task appropriate for a Support Vector Machine. If dress_shoes
is better than meh_dress_shoes
— we can label that ordering as a “better than”, a “1”. Similarly, the ordering meh_dress_shoes
before dress_shoes
is a “worst than” relation or “1”. We can then create a classifier to predict when one item might be better than another.
This classification task requires a bit more fenegling of the underlying feature values. Think of what “better than” means. It means something is different between dress_shoes
and meh_dress_shoes
that classifies it as better. So RankSVM performs a classification on this delta: itemA’s features minus itemB’s features.
As an example, just focused on the single feature “numSold” we might see dress_shoes
sells 5000 while meh_dress_shoes
sells a mere 1000. So the resulting “difference” of 4000 would be one feature used during training. We’d take this difference as an SVM training sample: (1,4000)
. Here 1 is the “better than” label we’re predicting and 4000 is the numSold delta we’re using as a feature in the SVM.
Plotting each pairwise difference in the feature space creates two classes, as shown below. An SVM can be used to find the appropriate decision boundary between the two classes:
Of course, we don’t need a decision boundary. We need a directional vector which says “more relevant” in this direction. So finally, the vector orthogonal to this decision boundary provides linear weights in proportion to the ranking function:
This sounds like it turns into simple linear regression, after all we’re just arriving at feature weights and a linear ranking function. But if you recall from the pointwise discussion, within a single query you sometimes have withinquery dependencies/nuances. Cases where a good title score in one feature is say “5” for “dress shoes” compared to “15” for “tshirts.” What RankSVM does, by focussing classifying deltas within a single query is to point out a rough sense of “more title score, corresponds to more relevance within a query”.
Graphically, you can see that here, with the same data points above being run using linear regression:
RankSVM vs ListWise Approaches
You can see though that RankSVM appears to still create a direct, linear sense of relevance. We know of course reality is often nonlinear. With an SVM, you can use nonlinear kernels. Though a linear kernel tends to be the most popular. Another drawback with RankSVM is that it looks at pairwise difference with no consideration of position bias. When items get out of order in RankSVM, there’s no way to prioritize keeping the top items correct while ignoring that the bottom items are inaccurate.
While listwise approaches tend to be more accurate, and incorporate a position bias, training and evaluating them tends to be computationally expensive. And while RankSVM tends to be less accurate, the model is simple to train and use.
Due to its simplicity, RankSVM can build models for specific users or subsets of queries/users rather easily. One can imagine classifying queries into different use cases. Perhaps for ecommerce there are queries we can definitively say are typos. While others we know are queries for broad category searches (like “shoes”). If we classify queries accordingly, we can build models for each type of use case. We can even combine different SVM models. For example, since the model is a set of linear weights, we could take a model tied to a specific user, Tom, and sum it with the model tied to the kind of query Tom is executing, say “dress shoes” to show me the dress shoes we feel Tom will likely enjoy.
The takeaway!
The main takeaway is whatever model you choose, you have to understand what the model optimizes for. What error is it trying to minimize?
You see how Pointwise approaches optimize for judgment residuals, and how that can be less than ideal. RankSVM performs a simpler optimization to eliminate outoforder pairs, but this too has problems as there’s no consideration for position bias. ListNet’s permutation probability and top one probability are interesting by giving leeway to equally valid good answers. I wonder personally if this approach can be used to diversify search results to show many valid hypothesis for this user here and now.
Of course, at the end of the day, the type of model may not matter so much if we don’t select the right features to train our model on! Feature selection, creation, generation, and engineering, not model selection, is often the hard part!
Hope you enjoyed the tour
Hopefully this blog post gives you an intuition for different methods for machine learning. And maybe it turned into a tour of different approaches :). As always, I’m open to feedback in case I got something wrong. Don’t hesitate to get in touch and I’d love to chat!