Learning to Rank turns search into a machine learning problem. In this article, I want to explore what makes search distinct from other machine learning problems. How does one approach search ranking as a machine learning problem vs say classification and regression? We’ll go through a tour of a couple approaches, that will give you an intuition on how to evaluate a learning to rank method.
Measuring Success! (And failure) of search
The fundamental difference comes down to the goals of search versus classic machine learning problems. Or more specifically, how each quantitatively measures success. The accuracy of a stock price prediction system, for example, depends on how far, on average, our predictions are from real company stock prices. If we predicted Amazon to be at $123.57, but in reality it’s $125, we’d say that’s pretty close. If on average, our prediction stayed under $1-$2 error from the actual stock price, we’d probably conclude our system worked well for most needs.
This “error” in this case is really what we’d call a residual: the difference between the actual value and a predicted one:
actual - predicted. (In reality residual^2 is what’s actually minimized, but we’ll keep it simple here.)
During training time, a regression system uses how we quantify success (or failure) find an optimal solution. We can try different quantitative features of companies, such as number of employees, revenue, cash on hand, or whatever else we think will help to minimize the average error in our stock prices. Naively, it might turn out least-squares regression “learns” that the linear formula like
stockPrice = 0.01 * numEmployees + 0.9 * revenue + 0.001 * cashOnHand does the best job minimizing the error.
Whether search is good or bad (or somewhere in between) looks quite different. Whereas stock prices are completely about minimizing
actual - predicted, search systems try to approximate an ideal ordering for a search query. In other words, the goal is to minimize how “far” a list of items is from an ideal, prioritized ordering that reflects how a human searcher perceives good/bad search.
For example, if a user of an e-commerce catalog searches for “dress shoes”, we might define a rough sense of the following ideal ordering:
- Best-selling dress shoes
- Low-performing dress shoes
- Other kinds of shoes
- Ladies dresses (if at all)
With this ideal in mind we can imagine some scenarios and guesstimate how far they are from the ideal. Showing our “dress shoes” searcher some kind of shoes before dresses is at least something — but not at all close to perfect. It’d be like predicting Amazon’s price at $150 instead of $125:
- A pair of tennis shoes
- Meh Dress Shoes
- A ladies dress
On the other hand, showing the meh dress shoes one position before the best-selling dress shoes before other shoes, etc might be very close, but not exactly the same. This might be comparable to predicting Amazon’s stock price at $123:
- Meh pair of dress shoes
- Best-selling pair of dress shoes
- Tennis Shoes
As you can see with search, instead of
actual - predicted it’s all about getting as close as possible to the best ordering for each user query. NDCG is one such metric that measures how far an ordered set of search results is from an ideal ranking. Other metrics exist that try to measure the “goodness” of search results with their own pros & cons. These metrics almost always go from 0 (worst possible search results) - 1 (perfect search results).
Further, these metrics can’t just be a pure diff or edit distance type calculation. As users look more at the top search results, getting these first few right should have priority. For this reason search metrics often include a position bias that means when the first several results deviate from the ideal, search is considered much worse than if a bottom result deviates. NDCG has a position bias built in.
Producing a ranking function with Machine Learning
A classic regression problem constructs a function, f, that makes direct predictions using a set of features. We try to minimize
actualStockPrice - f(companyFeatures). The goal of learning to rank is to construct a ranking function that doesn’t make direct predictions. Instead it’s used for sorting – sorting a set of ideal orderings we’ve been provided as training data. The ranking function we learn takes two inputs, the query and the document, and assigns a score that correctly orders documents for a query.
Restating the above a bit more mathy:
Stock Market: For company x, Produce
y - f(x)
Search: For document d, query q, Produce
f(d,q)that maximizes NDCG of all documents when sorted by
Let’s walk through a rough example to see what we mean. As data scientists / search engineers we might suspect the following features of queries/documents could be helpful in our e-commerce catalog:
- The TF*IDF score of a keyword in a product’s title —
- The TF*IDF score of a keyword in a product’s description —
- The number of items sold
A machine learning process might arrive at a document scoring formula like:
f(d,q) = 10 * titleScore(d,q) + 2 * descScore(d,q) + 5 * numSold(d)
That maximizes NDCG across a set of test queries (that is it comes close as possible to user’s ideal ordering as we can get with this kind of model)
Most of the complication in learning to rank usually comes from reformulating the problem so that other machine learning methods can be applied. These tend to come in three categories: point-wise, pair-wise, and list-wise problems which we’ll cover next. We’ll take a brief tour of a couple of methods, and stop and reflect on the pros and cons of each.
Point wise learning to rank
In search, the “training sets” look sort-of like normal machine learning training sets. Point-wise learning to rank uses this fact, so let’s review briefly what a search training set looks like. Search training sets come in the form of graded documents for a query known as judgment lists. Here’s an example of what a judgment list might look like.
Grade,query,document 4,dress shoes,best_selling_dress_shoes 3,dress shoes,meh_dress_shoes … 0,dress shoes,ladies_dress 0,dress shoes,toga_item
As you can see, documents exactly relevant, such as
best_selling_dress_shoes are given a grade of 4, while completely irrelevant documents (dresses and random togas) are graded with 0.
Point-wise learning to rank just gives up caring about optimizing per-query ranking directly. Instead, we just try to predict the relevance grade. We use some kind of regression to create a ranking function
f(d,q) for document d, query q. Just like the stock price example, we attempt to minimize the residual. We hope we get a 0 for
f(toga_item, “dress shoes”) and a 4 for
f(best_selling_dress_shoes, “dress shoes”).
During training and evaluation our simple error residual
y - f(d,q) is minimized (y here being the grade for
d,q). In this case suppose our ranking function,
f, gives our grade 0 document a score 0.12 and our grade 4 a 3.65. That appears to be doing very well just looking at residuals. If on average our document scores don’t stray on average more than 0.5 away from their grade, then we can hope that per-query NDCG is also being maximized. That is, we hope if we can create a ranking function that just spits out the grades, we should get close to the ideal ordering our users expect.
But appearances can be deceiving. One problem with point-wise learning to rank is that getting the ‘top items’ ordered correctly is usually more important than obscure items on the tail of the judgment list. Basically all the knowledge & position bias inherent in maximizing a metric like NDCG is being ignored.
In practical terms, a model that regularly swaps exactly relevant and meh items, but can accurately predict lower value relevance grades that end up on page 50 isn’t that great. Buyers see marginally relevant items in the first few results and aren’t impressed by the quality of the selections. So they move on.
Even more catastrophic, another problem arises when you consider relationships that only happen within a specific queries. Point-wise approaches wash away the query grouping, ignoring these within-query nuances. For example, one query correlates strongly with the relevance score on a title field while another more with a description field score. Or perhaps a “good” title match for one query is a 5, but a “good” title match for another query is say a 15. These scenarios aren’t hypothetical: inconsistent document frequency of different matches can create these scenarios. You can learn more in this blog post on linear learning to rank.
So in general, point-wise approaches don’t perform terribly well, and we’ll move on to looking at approaches that don’t wash away the query grouping and instead attempt to directly optimize each query’s ordering with a ranking function.
List-wise, pair-wise, lions, tigers, bears, oh my
Point-wise learning to rank works to minimize the difference between an ideal and actual relevance grade. Other methods define a different sense of error that comes closer to directly optimizing each query’s ideal ordering. Let’s look at two examples of a list-wise and pair-wise learning to rank solution that come closer to incorporating position bias and an ability to capture per-query nuances.
Optimizing the list directly w/ ListNet
List-wise learning feels like the purest form of learning to rank. It defines the error very directly: How far is the current ranking function’s list is from the ideal? One such method, for example, is to look at the permutation probability of a given ordering.
The underlying idea is to define a function that computes the probability that a given permutation of relevance scores is actually what a searching user is looking for. If we think of “scores” as grades from the judgment list, the ordering where the 1st results grade is higher than the 2nd, and so on would receive the highest probability. However, it’s possible that even our relevance grades from our judgment list aren’t correct for this user in this place, time, and context. So a single lower scoring item shuffling above a higher scoring item is treated as slightly less likely than perfect relevance sort order — Maybe that’s actually what the user here/now wants! A completely reordered list with the lowest relevance score first seems extremely unlikely, with a permutation probability approaching zero.
Instead of having to compute an error that must look at every possible list ordering, it’s more computationally feasible to approximate permutation priority by only looking at the probability that the first item in the permutation is “best” for a search. This is called the “Top One” probability. It looks at a single relevance score alongside every other relevance score for a query to compute the probability that item would be first. As you’d expect, higher graded/scoring relevance items would receive a higher top one probability and lower graded items less likely to be the right “top one” for this user, context, place, and time.
One List Wise Method ListNet proposes to minimize the cross-entropy between the relevance grades from the training set and weights in a neural network. Sounds like a lot of mumbo jumbo, but what’s really important is to work through the error function that’s being minimized to develop an intuition for how it works, as shown in psuedo-python below:
def error(): error = 0 For each query For each doc: error -= TopOneP(doc.grade) * log( TopOneP(f(doc, query)))
f here is a ranking function we’re optimizing.
TopOneP is the “top one” probability given the score or garde.
First, let’s look at the first term:
TopOneP(doc.grade). If you think about what happens here, if the “top one” probability based on the judgment list grade is high, then the second term
log(P… will be weighted more heavily. In other words, the first term can be thought of as a priority of the item based on the grade in the judgment list. Higher graded items better have a more accurate top one probability: exactly the position bias we’re looking for!
Now look at the second,
log( TopOneP...) term. Recall, log(TopOneP=1) is 0, and log(TopOneP<1) is increasingly negative as TopOneP approaches 0. So if TopOneP gets close to 0, and thus the log increasingly negative, then the entire term gets increasingly negative. More negative is bad, because
error-=(big negative number) makes error get more positive.
Taken together, more error is added when (1) the item is important in the judgment list (
TopOneP(doc.grade) is high) and when the
TopOneP of our ranking function
f is really low. Clearly this is bad: something that really needs to be towards the top based on the judgments, really should have
f sorting it to the top.
The goal in ListNet is to minimize error by iteratively updating weights in
f. I won’t get into that here; what’s more important is the intuition above. You can read more about it and how it uses this definition of error to calculate a gradient (how to change the weights of features) to minimize the error in the paper.
Pair-wise optimization with RankSVM
Pair wise learning to rank works by minimizing the number of out of order results in the search results. A specific metric, Kendall’s Tau measures the proportion of in-order pairs in a search solution. One form of pair-wise learning to rank classifies what makes an item “in order” vs “out of order” for a query. You might learn, for example, that a bit more title score, but a bit lower total number of sold matters when ordering a particular set of queries.
One popular pair-wise learning to rank approach is RankSVM as introduced in this paper by Thorsten Joachims and also demonstrated in Python in Fabian Pedregrosa’s great blog post who kindly lent me his images for this post:).
Imagine a case where we have two queries. Perhaps our “dress shoes” query, but also one for “t-shirt”. The graph below has the y-axis as, say, the title score feature while the x-axis might be the number of products sold. We’ve plotted each judgment for “dress shoes” as a triangle, and each judgment for “t-shirt” in circles below. Increasingly “dark” shapes are considered more relevant.
Our goal is to discover a function similar to the w vector in the image above that comes closest to correctly ordering the search results.
It turns out that this task is the same as classifying what makes one item better than another item — a binary classification task appropriate for a Support Vector Machine. If
dress_shoes is better than
meh_dress_shoes — we can label that ordering as a “better than”, a “1”. Similarly, the ordering
dress_shoes is a “worst than” relation or “-1”. We can then create a classifier to predict when one item might be better than another.
This classification task requires a bit more fenegling of the underlying feature values. Think of what “better than” means. It means something is different between
meh_dress_shoes that classifies it as better. So RankSVM performs a classification on this delta: itemA’s features minus itemB’s features.
As an example, just focused on the single feature “numSold” we might see
dress_shoes sells 5000 while
meh_dress_shoes sells a mere 1000. So the resulting “difference” of 4000 would be one feature used during training. We’d take this difference as an SVM training sample:
(1,4000). Here 1 is the “better than” label we’re predicting and 4000 is the numSold delta we’re using as a feature in the SVM.
Plotting each pairwise difference in the feature space creates two classes, as shown below. An SVM can be used to find the appropriate decision boundary between the two classes:
Of course, we don’t need a decision boundary. We need a directional vector which says “more relevant” in this direction. So finally, the vector orthogonal to this decision boundary provides linear weights in proportion to the ranking function:
This sounds like it turns into simple linear regression, after all we’re just arriving at feature weights and a linear ranking function. But if you recall from the point-wise discussion, within a single query you sometimes have within-query dependencies/nuances. Cases where a good title score in one feature is say “5” for “dress shoes” compared to “15” for “t-shirts.” What RankSVM does, by focussing classifying deltas within a single query is to point out a rough sense of “more title score, corresponds to more relevance within a query”.
Graphically, you can see that here, with the same data points above being run using linear regression:
RankSVM vs List-Wise Approaches
You can see though that RankSVM appears to still create a direct, linear sense of relevance. We know of course reality is often non-linear. With an SVM, you can use non-linear kernels. Though a linear kernel tends to be the most popular. Another drawback with RankSVM is that it looks at pair-wise difference with no consideration of position bias. When items get out of order in RankSVM, there’s no way to prioritize keeping the top items correct while ignoring that the bottom items are inaccurate.
While list-wise approaches tend to be more accurate, and incorporate a position bias, training and evaluating them tends to be computationally expensive. And while RankSVM tends to be less accurate, the model is simple to train and use.
Due to its simplicity, RankSVM can build models for specific users or subsets of queries/users rather easily. One can imagine classifying queries into different use cases. Perhaps for e-commerce there are queries we can definitively say are typos. While others we know are queries for broad category searches (like “shoes”). If we classify queries accordingly, we can build models for each type of use case. We can even combine different SVM models. For example, since the model is a set of linear weights, we could take a model tied to a specific user, Tom, and sum it with the model tied to the kind of query Tom is executing, say “dress shoes” to show me the dress shoes we feel Tom will likely enjoy.
The main takeaway is whatever model you choose, you have to understand what the model optimizes for. What error is it trying to minimize?
You see how Point-wise approaches optimize for judgment residuals, and how that can be less than ideal. RankSVM performs a simpler optimization to eliminate out-of-order pairs, but this too has problems as there’s no consideration for position bias. ListNet’s permutation probability and top one probability are interesting by giving leeway to equally valid good answers. I wonder personally if this approach can be used to diversify search results to show many valid hypothesis for this user here and now.
Of course, at the end of the day, the type of model may not matter so much if we don’t select the right features to train our model on! Feature selection, creation, generation, and engineering, not model selection, is often the hard part!
Hope you enjoyed the tour
Hopefully this blog post gives you an intuition for different methods for machine learning. And maybe it turned into a tour of different approaches :). As always, I’m open to feedback in case I got something wrong. Don’t hesitate to get in touch and I’d love to chat!