Learning to Rank turns search into a machine learning problem. In this article, I want to explore what makes search distinct from other machine learning problems. How does one approach search ranking as a machine learning problem versus, say, classification or regression? We'll take a tour of a couple of approaches that will give you an intuition for how to evaluate a learning to rank method.
Measuring Success! (And failure) of search
The fundamental difference comes down to the goals of search versus classic machine learning problems, or more specifically, how each quantitatively measures success. The accuracy of a stock price prediction system, for example, depends on how far, on average, our predictions are from real company stock prices. If we predicted Amazon to be at $123.57, but in reality it's $125, we'd say that's pretty close. If, on average, our predictions stayed within $1-$2 of the actual stock price, we'd probably conclude our system worked well for most needs.
This "error" in this case is really what we'd call a residual: the difference between the actual value and the predicted one, actual - predicted.
(In reality residual^2 is what's actually minimized, but we'll keep it simple here.)
During training, a regression system uses how we quantify success (or failure) to find an optimal solution. We can try different quantitative features of companies, such as number of employees, revenue, cash on hand, or whatever else we think will help minimize the average error in our stock price predictions. Naively, it might turn out that least-squares regression "learns" that a linear formula like
stockPrice = 0.01 * numEmployees + 0.9 * revenue + 0.001 * cashOnHand
does the best job minimizing the error.
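To make that concrete, here's a minimal sketch of a least-squares fit over those company features. The numbers (and the use of numpy's lstsq) are purely illustrative, not a real stock model:

```python
import numpy as np

# Hypothetical training data: one row per company
# columns: numEmployees, revenue ($M), cashOnHand ($M)
X = np.array([
    [1500,   320.0,  45.0],
    [80000, 9100.0, 610.0],
    [250,     12.5,   3.2],
])
y = np.array([48.0, 1740.0, 6.5])  # observed stock prices

# Least-squares: find the weights that minimize the squared residuals
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

def predicted_stock_price(company_features):
    return company_features @ weights

print(predicted_stock_price(np.array([1500, 320.0, 45.0])))
```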
Whether search is good or bad (or somewhere in between) looks quite different. Whereas stock prices are entirely about minimizing actual - predicted, search systems try to approximate an ideal ordering for a search query. In other words, the goal is to minimize how "far" a list of items is from an ideal, prioritized ordering that reflects how a human searcher perceives good/bad search.
For example, if a user of an ecommerce catalog searches for "dress shoes", we might define a rough sense of the following ideal ordering:
1. Best-selling dress shoes
2. Low-performing dress shoes
3. Other kinds of shoes
4. Ladies dresses (if at all)
With this ideal in mind, we can imagine some scenarios and guesstimate how far they are from the ideal. Showing our "dress shoes" searcher some kind of shoes before dresses is at least something, but not at all close to perfect. It'd be like predicting Amazon's price at $150 instead of $125:
Close-ish?
1. A pair of tennis shoes
2. Meh dress shoes
3. A ladies dress
On the other hand, showing the meh dress shoes one position before the best-selling dress shoes, before other shoes, and so on might be very close, but not exactly the same. This might be comparable to predicting Amazon's stock price at $123:
1. Meh pair of dress shoes
2. Best-selling pair of dress shoes
3. Tennis shoes
As you can see, with search, instead of minimizing actual - predicted, it's all about getting as close as possible to the best ordering for each user query. NDCG is one such metric: it measures how far an ordered set of search results is from an ideal ranking. Other metrics exist that try to measure the "goodness" of search results, each with their own pros & cons. These metrics almost always range from 0 (worst possible search results) to 1 (perfect search results).
Further, these metrics can't just be a pure diff or edit-distance style calculation. Because users look mostly at the top search results, getting those first few right should have priority. For this reason, search metrics often include a position bias, meaning that when the first several results deviate from the ideal, search is considered much worse than when a bottom result deviates. NDCG has a position bias built in.
While there's a bunch of metrics available (such as ERR, MAP, etc.), I'll just say "NDCG" in this article as a shorthand for really any relevance metric.
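To make the position bias concrete, here's a minimal sketch of DCG/NDCG over a list of relevance grades. The log-based discount is the standard formulation; the example grades are invented:

```python
import math

def dcg(grades):
    # Gain at each position is discounted by log2(rank + 1), so a mistake
    # in the first few results costs far more than one near the bottom.
    return sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades):
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Grades of results in the order our engine returned them for "dress shoes"
print(ndcg([3, 4, 1, 0]))  # near-ideal ordering, close to 1.0
print(ndcg([0, 1, 3, 4]))  # ideal ordering reversed, much lower
```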
Producing a ranking function with Machine Learning
A classic regression problem constructs a function, f, that makes direct predictions using a set of features; we try to minimize actualStockPrice - f(companyFeatures). The goal of learning to rank is to construct a ranking function that doesn't make direct predictions. Instead it's used for sorting: reproducing, as closely as possible, the ideal orderings we've been provided as training data. The ranking function we learn takes two inputs, the query and the document, and assigns a score that correctly orders documents for that query.
Restating the above a bit more mathy:

Stock market: for company x, produce f(x) that minimizes y - f(x)

Search: for document d and query q, produce f(d,q) that maximizes NDCG of all documents when sorted by f(d,q) descending
Let's walk through a rough example to see what we mean. As data scientists / search engineers, we might suspect the following features of queries/documents could be helpful in our ecommerce catalog:
- The TF*IDF score of a keyword in a product's title: titleScore(d,q)
- The TF*IDF score of a keyword in a product's description: descScore(d,q)
- The number of items sold: numSold(d)
A machine learning process might arrive at a document scoring formula like:
f(d,q) = 10 * titleScore(d,q) + 2 * descScore(d,q) + 5 * numSold(d)
that maximizes NDCG across a set of test queries (that is, it comes as close to users' ideal orderings as we can get with this kind of model).
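At query time such a formula is only used for sorting. Here's a minimal sketch with invented feature values and the hypothetical weights above; nothing here comes from a real training run:

```python
# Precomputed features for the query "dress shoes"; values are invented
docs = [
    {"name": "best_selling_dress_shoes", "titleScore": 2.1, "descScore": 1.5, "numSold": 0.9},
    {"name": "meh_dress_shoes",          "titleScore": 2.0, "descScore": 1.1, "numSold": 0.2},
    {"name": "ladies_dress",             "titleScore": 0.2, "descScore": 0.4, "numSold": 0.6},
]

def f(doc):
    # The hypothetical learned weights from the formula above
    return 10 * doc["titleScore"] + 2 * doc["descScore"] + 5 * doc["numSold"]

# The ranking function doesn't predict anything directly; it just sorts
for doc in sorted(docs, key=f, reverse=True):
    print(doc["name"], round(f(doc), 2))
```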
Most of the complication in learning to rank usually comes from reformulating the problem so that other machine learning methods can be applied. These reformulations tend to come in three categories: pointwise, pairwise, and listwise, which we'll cover next. We'll take a brief tour of a couple of methods, stopping to reflect on the pros and cons of each.
Pointwise learning to rank
In search, the "training sets" look sort of like normal machine learning training sets. Pointwise learning to rank uses this fact, so let's briefly review what a search training set looks like. Search training sets come in the form of graded documents for a query, known as judgment lists. Here's an example of what a judgment list might look like:
Grade,query,document
4,dress shoes,best_selling_dress_shoes
3,dress shoes,meh_dress_shoes
...
0,dress shoes,ladies_dress
0,dress shoes,toga_item
As you can see, documents exactly relevant to the query, such as best_selling_dress_shoes, are given a grade of 4, while completely irrelevant documents (dresses and random togas) are graded 0.
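If you want to work with a judgment list like this programmatically, a minimal sketch of loading it and grouping it per query might look like the following (the file name and CSV layout are assumptions based on the snippet above):

```python
import csv
from collections import defaultdict

# Group (grade, document) pairs by query so each query's ideal ordering
# can be reasoned about separately.
judgments = defaultdict(list)
with open("judgments.csv") as fh:
    for row in csv.DictReader(fh):  # columns: Grade,query,document
        judgments[row["query"]].append((int(row["Grade"]), row["document"]))

# A query's ideal ordering is its documents sorted by grade, descending
for query, docs in judgments.items():
    print(query, sorted(docs, reverse=True))
```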
Pointwise learning to rank just gives up on optimizing the per-query ranking directly. Instead, we just try to predict the relevance grade. We use some kind of regression to create a ranking function f(d,q) for document d and query q. Just like the stock price example, we attempt to minimize the residual: we hope to get a 0 for f(toga_item, "dress shoes") and a 4 for f(best_selling_dress_shoes, "dress shoes").
During training and evaluation, our simple error residual y - f(d,q) is minimized (y here being the grade for d,q). Suppose our ranking function f gives our grade-0 document a score of 0.12 and our grade-4 document a 3.65. Just looking at residuals, that appears to be doing very well. If our document scores don't stray, on average, more than 0.5 from their grade, we can hope that per-query NDCG is also being maximized. That is, we hope that if we can create a ranking function that just spits out the grades, we'll get close to the ideal ordering our users expect.
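Here's a minimal sketch of that pointwise setup: every (query, document) judgment becomes an independent regression row, and the grade itself is the target. The features and numbers are invented:

```python
import numpy as np

# One row per (query, document) judgment: [titleScore, descScore, numSold]
X = np.array([
    [2.1, 1.5, 0.9],  # best_selling_dress_shoes, grade 4
    [2.0, 1.1, 0.2],  # meh_dress_shoes,          grade 3
    [0.2, 0.4, 0.6],  # ladies_dress,             grade 0
    [0.0, 0.0, 0.1],  # toga_item,                grade 0
])
y = np.array([4.0, 3.0, 0.0, 0.0])  # the grades are the regression target

# Plain least-squares on the grades; the per-query grouping is gone
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X @ weights)  # hopefully close to [4, 3, 0, 0]
```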
But appearances can be deceiving. One problem with pointwise learning to rank is that getting the "top items" ordered correctly is usually far more important than getting the obscure items on the tail of the judgment list right. Basically all the knowledge and position bias inherent in maximizing a metric like NDCG is being ignored.
In practical terms, a model that regularly swaps exactly-relevant and meh items, but can accurately predict low relevance grades for items that end up on page 50, isn't that great. Buyers see marginally relevant items in the first few results, aren't impressed with the quality of the selection, and move on.
Even more catastrophic, another problem arises when you consider relationships that only hold within a specific query. Pointwise approaches wash away the query grouping, ignoring these within-query nuances. For example, one query may correlate strongly with the relevance score on a title field while another correlates more with a description field score. Or perhaps a "good" title match for one query is a 5, while a "good" title match for another query is, say, a 15. These scenarios aren't hypothetical: inconsistent document frequencies of different matches can create them. You can learn more in this blog post on linear learning to rank.
So in general, pointwise approaches don't perform terribly well, and we'll move on to approaches that don't wash away the query grouping and instead attempt to directly optimize each query's ordering with a ranking function.
Listwise, pairwise, lions, tigers, bears, oh my
Pointwise learning to rank works to minimize the difference between an ideal and an actual relevance grade. Other methods define a different sense of error, one that comes closer to directly optimizing each query's ideal ordering. Let's look at two examples, one listwise and one pairwise learning to rank solution, that come closer to incorporating position bias and capturing per-query nuances.
Optimizing the list directly w/ ListNet
Listwise learning feels like the purest form of learning to rank. It defines the error very directly: how far is the current ranking function's list from the ideal? One such method, for example, is to look at the permutation probability of a given ordering.
The underlying idea is to define a function that computes the probability that a given permutation of relevance scores is actually what a searching user is looking for. If we think of "scores" as grades from the judgment list, the ordering where the 1st result's grade is higher than the 2nd's, and so on, receives the highest probability. However, it's possible that even the relevance grades from our judgment list aren't correct for this user in this place, time, and context. So a single lower-scoring item shuffled above a higher-scoring item is treated as only slightly less likely than the perfect relevance sort order; maybe that's actually what the user here and now wants! A completely reordered list with the lowest relevance score first seems extremely unlikely, with a permutation probability approaching zero.
Instead of computing an error that must look at every possible list ordering, it's more computationally feasible to approximate the permutation probability by looking only at the probability that the first item in the permutation is "best" for a search. This is called the "top one" probability. It looks at a single relevance score alongside every other relevance score for a query to compute the probability that that item would be first. As you'd expect, higher-graded/scoring items receive a higher top-one probability, and lower-graded items are less likely to be the right "top one" for this user, context, place, and time.
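A common way to compute the top-one probability (and the one used in the ListNet paper) is a softmax over all scores for the query. Here's a minimal sketch using the grades from our judgment list:

```python
import math

def top_one_probabilities(scores):
    # Softmax: each item's probability of being the single best result,
    # given every other item's score for the same query.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Grades for "dress shoes": 4, 3, 0, 0
print(top_one_probabilities([4, 3, 0, 0]))
# The grade-4 document takes most of the probability mass; the grade-0 docs almost none.
```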
One listwise method, ListNet, proposes to minimize the cross-entropy between the top-one probabilities derived from the training set's relevance grades and the top-one probabilities of the scores produced by a neural network. Sounds like a lot of mumbo jumbo, but what's really important is to work through the error function that's being minimized and develop an intuition for how it works, as shown in the pseudo-Python below:
def error(queries, f):
    error = 0.0
    for query in queries:
        for doc in query.docs:
            # accumulate the cross-entropy; note the minus sign
            error -= TopOneP(doc.grade) * log(TopOneP(f(doc, query)))
    return error
Here f is the ranking function we're optimizing, and TopOneP is the "top one" probability given a score or grade.
First, let's look at the first term: TopOneP(doc.grade). If the "top one" probability based on the judgment list grade is high, then the second term, log(TopOneP(...)), will be weighted more heavily. In other words, the first term can be thought of as a priority for the item based on its grade in the judgment list. Higher-graded items had better have an accurate top-one probability: exactly the position bias we're looking for!
Now look at the second term, log(TopOneP(...)). Recall that log(TopOneP=1) is 0, and log(TopOneP<1) grows increasingly negative as TopOneP approaches 0. So if TopOneP gets close to 0, and thus the log increasingly negative, the entire term gets increasingly negative. More negative is bad, because error -= (big negative number) makes the error get more positive.
Taken together, more error is added when (1) the item is important in the judgment list (TopOneP(doc.grade) is high) and (2) the TopOneP of our ranking function f is really low. Clearly this is bad: something that really needs to be towards the top based on the judgments really should have f sorting it to the top.
The goal in ListNet is to minimize this error by iteratively updating the weights in f. I won't get into that here; what's more important is the intuition above. You can read more about how ListNet uses this definition of error to calculate a gradient (how to change the weights of the features) to minimize the error in the paper.
Pairwise optimization with RankSVM
Pairwise learning to rank works by minimizing the number of out-of-order results in the search results. A related metric, Kendall's Tau, measures the proportion of in-order pairs in a search solution. One form of pairwise learning to rank classifies what makes an item "in order" vs "out of order" for a query. You might learn, for example, that a bit more title score, but a bit lower total number sold, matters when ordering a particular set of queries.
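To get a feel for what "in-order pairs" means, here's a tiny sketch that counts the fraction of concordant pairs in a result list's grades. It's a simplified illustration of the idea, not the full Kendall's Tau formula with tie handling:

```python
from itertools import combinations

def in_order_fraction(grades):
    # grades appear in the order our engine returned the documents;
    # a pair is "in order" when the earlier result has the higher (or equal) grade
    pairs = list(combinations(grades, 2))
    in_order = sum(1 for earlier, later in pairs if earlier >= later)
    return in_order / len(pairs)

print(in_order_fraction([4, 3, 1, 0]))  # perfect ordering -> 1.0
print(in_order_fraction([3, 4, 0, 1]))  # two swapped pairs -> about 0.67
```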
One popular pairwise learning to rank approach is RankSVM, introduced in this paper by Thorsten Joachims and also demonstrated in Python in Fabian Pedregosa's great blog post (who kindly lent me his images for this post :).
Imagine a case where we have two queries: perhaps our "dress shoes" query, but also one for "tshirt". The graph below has the y-axis as, say, the title score feature, while the x-axis might be the number of products sold. We've plotted each judgment for "dress shoes" as a triangle and each judgment for "tshirt" as a circle. Increasingly "dark" shapes are considered more relevant.
Our goal is to discover a function similar to the w vector in the image above that comes closest to correctly ordering the search results.
It turns out that this task is the same as classifying what makes one item better than another: a binary classification task well suited to a Support Vector Machine. If dress_shoes is better than meh_dress_shoes, we can label that ordering as a "better than," or a "1". Similarly, the ordering meh_dress_shoes before dress_shoes is a "worse than" relation, or a "-1". We can then create a classifier to predict when one item might be better than another.
This classification task requires a bit more finagling of the underlying feature values. Think about what "better than" means: something is different between dress_shoes and meh_dress_shoes that classifies one as better. So RankSVM performs a classification on this delta: item A's features minus item B's features.
As an example, focusing on just the single feature numSold, we might see that dress_shoes sells 5000 while meh_dress_shoes sells a mere 1000. So the resulting "difference" of 4000 would be one feature value used during training. We'd take this difference as an SVM training sample: (1, 4000). Here 1 is the "better than" label we're predicting, and 4000 is the numSold delta we're using as a feature in the SVM.
Plotting each pairwise difference in the feature space creates two classes, as shown below. An SVM can be used to find the appropriate decision boundary between the two classes.
Of course, we don't really need a decision boundary. We need a directional vector that says "more relevant" lies in this direction. So finally, the vector orthogonal to the decision boundary provides linear weights in proportion to the ranking function.
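Here's a minimal sketch of that whole pipeline using scikit-learn's LinearSVC (an assumption on my part; any linear SVM would do). The pairwise deltas and feature values are invented:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Per-document features [titleScore, numSold] for one query (invented values)
dress_shoes     = np.array([2.1, 0.9])
meh_dress_shoes = np.array([2.0, 0.2])
ladies_dress    = np.array([0.2, 0.6])

# Pairwise deltas: label 1 means the first document is better than the second
X = np.array([
    dress_shoes - meh_dress_shoes,    # better than -> 1
    meh_dress_shoes - dress_shoes,    # worse than  -> -1
    dress_shoes - ladies_dress,       # 1
    ladies_dress - dress_shoes,       # -1
    meh_dress_shoes - ladies_dress,   # 1
    ladies_dress - meh_dress_shoes,   # -1
])
y = np.array([1, -1, 1, -1, 1, -1])

svm = LinearSVC(C=1.0).fit(X, y)

# The vector orthogonal to the decision boundary doubles as ranking weights
w = svm.coef_[0]

def f(doc):
    return doc @ w

for name, doc in [("dress_shoes", dress_shoes),
                  ("meh_dress_shoes", meh_dress_shoes),
                  ("ladies_dress", ladies_dress)]:
    print(name, round(float(f(doc)), 3))
```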
This sounds like it turns into simple linear regression; after all, we're just arriving at feature weights and a linear ranking function. But recall from the pointwise discussion that within a single query you sometimes have within-query dependencies/nuances: cases where a "good" title score is, say, "5" for "dress shoes" but "15" for "tshirts." What RankSVM does, by classifying deltas within a single query, is capture a rough sense that more title score corresponds to more relevance within a query.
Graphically, you can see that here, with the same data points as above run through linear regression.
RankSVM vs Listwise Approaches
You can see, though, that RankSVM still arrives at a direct, linear sense of relevance. We know, of course, that reality is often nonlinear. With an SVM you can use nonlinear kernels, though a linear kernel tends to be the most popular. Another drawback of RankSVM is that it looks at pairwise differences with no consideration of position bias. When items get out of order in RankSVM, there's no way to prioritize keeping the top items correct while tolerating inaccuracy among the bottom items.
While listwise approaches tend to be more accurate, and incorporate a position bias, training and evaluating them tends to be computationally expensive. And while RankSVM tends to be less accurate, the model is simple to train and use.
Due to its simplicity, RankSVM can build models for specific users or subsets of queries/users rather easily. One can imagine classifying queries into different use cases. Perhaps for ecommerce there are queries we can definitively say are typos, while others we know are broad category searches (like "shoes"). If we classify queries accordingly, we can build models for each type of use case. We can even combine different SVM models. For example, since a model is just a set of linear weights, we could take a model tied to a specific user, Tom, and sum it with the model tied to the kind of query Tom is executing, say "dress shoes", to show the dress shoes we feel Tom will most likely enjoy.
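Since each model is just a vector of linear weights, combining them can be as simple as summing (or averaging) the vectors. A tiny sketch with invented weights:

```python
import numpy as np

# Hypothetical linear weights over [titleScore, descScore, numSold]
toms_model        = np.array([ 4.0, 1.0, 9.0])   # Tom tends to buy popular items
dress_shoes_model = np.array([10.0, 2.0, 5.0])   # weights for "dress shoes"-style queries

combined = toms_model + dress_shoes_model

def f(doc_features):
    return doc_features @ combined

print(f(np.array([2.1, 1.5, 0.9])))  # score a candidate product for Tom's query
```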
The takeaway!
The main takeaway is that whatever model you choose, you have to understand what it optimizes for. What error is it trying to minimize?
You've seen how pointwise approaches optimize for judgment residuals, and how that can be less than ideal. RankSVM performs a simpler optimization that eliminates out-of-order pairs, but this too has problems, as there's no consideration of position bias. ListNet's permutation probability and top-one probability are interesting in how they give leeway to equally valid good answers. I personally wonder whether this approach could be used to diversify search results, showing many valid hypotheses for this user, here and now.
Of course, at the end of the day, the type of model may not matter much if we don't select the right features to train it on! Feature selection, creation, generation, and engineering, not model selection, is often the hard part!
Hope you enjoyed the tour
Hopefully this blog post gives you an intuition for the different machine learning methods behind ranking. And maybe it turned into a tour of different approaches :). As always, I'm open to feedback in case I got something wrong. Don't hesitate to get in touch; I'd love to chat!