Learning What’s Relevant Without The Search

What’s “relevant” can transcend any particular search query or user. Shopkeepers have for eons highlighted today’s relevant items in the store window. A book store owner knows cheesy romance novels spike in beach season. They might realize a biography of Alexander Hamilton has suddenly become relevant due to a popular musical based on the book. In my experience, the secret weapon of search relevance is sometimes as simple as being savvy about what’s hot for your audience and biasing search in that direction. And it doesn’t stop at search: recommendations, or any other place you display your content, can leverage this information.

The newspaper is another classic example. Traditionally, an editor chooses the most important headlines for today’s printed edition. The most relevant headlines for the newspaper’s audience go at the top of the page in large letters.


Newspaper editors know how their audience evaluates their stories. They know the audience’s expectations of the brand (the Wall St. Journal and the NY Post have very different headlines, for very different audiences). They know which stories out of their pool of content will appear most relevant to today’s reader.

Being a good shopkeeper or newspaper editor is a carefully honed skill. But even those of us savvy with trends can be surprised by what suddenly becomes hot to today’s audience. Trailing metrics like sales might fail to keep up with the latest trends. Other metrics, like raw page views, can be driven by a number of unfortunate biases aside from trends.

So increasingly, I’m curious about measuring general relevance, especially as it changes quickly and unexpectedly with trends: How do you keep up with trends? How can you validate what you sense is hot and relevant to your audience? For example, if I were picking the hot toys for this Christmas season, how would I know what to choose? I’d be a bit like this fellow:


Data Science To Save Lame Parents

One answer is the math behind Rating and Ranking. I’ve been reading Who’s #1 by Amy Langville and Carl D. Meyer (thanks Rene Kriegler!). In it, the authors explain how to rank competitors as the result of several scored contests. The most obvious examples are sports teams. How do you arrive at a fair ranking of the sports teams in a league after each contest using stats like the score differential, or other offensive/defensive stats?

For example, if you know the Orioles beat the Yankees 7-5 and the Yankees beat the Red Sox 3-2 but the Red Sox beat the Orioles 5-4, how would you rank the teams?

Who’s #1 explores dozens of techniques for ranking competitors after numerous contests. Some sports, like US college football, involve hundreds of teams that each play a total of 11 games apiece – presenting one kind of challenge. Others, like baseball, involve a small number of teams engaged in a large number of contests, presenting quite another kind of ranking challenge.

Ok, sports are cool – but how does this save us in our job as an out-of-touch shopkeeper? Well, we can use the science of ranking things to learn which items are hot. We can engineer fair contests between two products at a time. We can use the results of these contests, along with the science of ranking, to adjust and reorder our naive, initial attempt at a hotness ranking.

For example, if we decide clicks are the important metric, we can let two items duke it out as our daily deal. Shown side-by-side and given equal footing, we can evaluate which of the two items receives the most clicks. We can adjust the ranking of our items in the same way that the techniques in “Who’s #1” adjust the ranking of sports teams. Instead of contests between the Orioles and Yankees, we have a click-fight between Tickle-Me-Elmo, Hello Barbie, and other products.

We can learn which product is more relevant to today’s shoppers. Then we can bias search results, recommendations, or other systems towards more topical or exciting items. We can also use this data to dynamically control our landing pages with implicit search – highlighting items that are seasonal and relevant without needing to know anything about users.

Ok so how do we accomplish this?

Elo’s System of Predictive Ranking

There are numerous methods that attempt to arrive at an accurate rank. Elo’s System has a lot of appeal to me. It’s simple to understand and doesn’t involve keeping track of large matrices. It learns a rank that can be used to predict contests with other competitors. It can be calibrated to learn faster (biasing towards recent contests) or learn more slowly (biasing towards historical contests).

What is Elo’s System? Elo’s System originates from the chess world, and is used to rank chess players. It has two basic steps given a contest between any two competitors:

  1. Use the difference in rank between the competitors to predict the outcome, in terms of the proportion of points each competitor should receive
  2. Learn a new rank based on how much each competitor outperformed or underperformed the expectation

As an intuitive example, let’s consider a game between the Yankees, which we’ll give an initial Elo rank of 123.5, and the Red Sox, which we’ll give 42.1. As you can see, with Elo’s system, ranks are arbitrary floating point values; the higher the rank, the better. So here the Yankees clearly should beat the Red Sox. However, let’s suppose the game ends with the Red Sox defeating the Yankees 12-2.

Let’s walk through the math! As stated in step 1 above, Elo’s System examines the rank of each player to determine how such a game should end. To do this, it uses as a building block the logistic function, which can be expressed in Python as:

def logistic10(x):
    return 1.0 / (1.0 + 10.0**x)    # 1 / (1 + 10^x)

The logistic function has a nice property of being bounded between 0 and 1. As x becomes very negative, it tends towards 1. As x becomes very positive, it tends towards 0. Elo’s System uses this property to determine the proportion of points that the Yankees and Red Sox should score based on their ranks.

How is this done? Elo’s System does this by passing in the difference in rank between the two teams. Fortunately, logistic10(rankA - rankB) + logistic10(rankB - rankA) always adds up to 1. Given that property, you can get a sense for how the proportion of the 14 total points in that game should be divvied up.
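This symmetry is easy to check numerically (repeating the logistic10 definition from above so the snippet stands alone):

```python
def logistic10(x):
    return 1.0 / (1.0 + 10.0**x)

# The two teams' predicted shares always sum to 1
rank_diff = 123.5 - 42.1                  # Yankees minus Red Sox
red_sox_share = logistic10(rank_diff)     # underdog's (tiny) share
yankees_share = logistic10(-rank_diff)    # favorite's share, near 1
print(red_sox_share + yankees_share)      # 1.0
```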

Let’s do that! So the expected proportion of points that should be awarded to the Red Sox is logistic10(rankYankees - rankRedsox), or:

logistic10(123.5 - 42.1)    # logistic10(81.4)

which turns out to be ridiculously small:

1.0 / (1.0 + 10.0**81.4)    # ≈ 3.98e-82
Our ranks would eventually self-correct to very small numbers as Elo’s System plays out over many contests. However, you’d prefer to have more ergonomic rankings. So Elo’s System simplifies things by dividing the rank difference by a parameter. This divisor, known as the logistic parameter, is often set to 400. Restating the code from above:

logistic10((123.5-42.1) / 400.0)

This equates to a prediction that the Red Sox receive 0.385 of the points. Or, in a 14-point game, this would equate to ~5.4 points. With this prediction made, we’ve accomplished step 1 above – predicting the outcome using the relative ranks of the two teams.
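Step 1 can be wrapped into a small helper. Here predict_share is my own illustrative name, not from the book; it assumes the logistic parameter of 400 discussed above:

```python
def logistic10(x):
    return 1.0 / (1.0 + 10.0**x)

LOGISTIC_PARAM = 400.0  # the divisor that keeps ranks ergonomic

def predict_share(rank_a, rank_b):
    """Predicted proportion of total points team A takes from team B."""
    return logistic10((rank_b - rank_a) / LOGISTIC_PARAM)

share = predict_share(42.1, 123.5)   # Red Sox vs Yankees
print(round(share, 3))               # 0.385
print(round(share * 14, 1))          # 5.4 of the 14 points
```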

Finally, we update the ranks based on the outcome of the contest (step 2 from above). It seems that the Red Sox might be better than we thought! Our expected score for the Red Sox was 0.385 of the available points, but they got 0.857 of the points! Elo’s System updates the ranks in proportion to this difference:

rankRedsox = rankRedsox + (0.857 - 0.385)

Looking at that rank, the increase is minuscule – which doesn’t seem very fair! The Red Sox rank only gets nudged a little, but they tromped the Yankees! So Elo’s System adds a K parameter. I think of K as a learning rate, for those familiar with machine learning parlance. It controls how much of the point difference bleeds into the updated rank. K is often set to 40 for a regular-season baseball game:

rankRedsox = rankRedsox + 40 * (0.857 - 0.385) # Now ≈60.99, up from 42.1
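Putting steps 1 and 2 together, here’s a sketch of one full Elo update (elo_update is an illustrative name; K=40 and the logistic parameter of 400 are the values assumed above, and full-precision fractions are used rather than the rounded 0.857 and 0.385):

```python
def logistic10(x):
    return 1.0 / (1.0 + 10.0**x)

K = 40.0                 # learning rate
LOGISTIC_PARAM = 400.0   # keeps ranks at ergonomic magnitudes

def elo_update(rank_a, rank_b, score_a, score_b):
    """Return updated (rank_a, rank_b) after a contest."""
    expected_a = logistic10((rank_b - rank_a) / LOGISTIC_PARAM)
    actual_a = score_a / (score_a + score_b)
    delta = K * (actual_a - expected_a)
    return rank_a + delta, rank_b - delta   # zero-sum: B loses what A gains

red_sox, yankees = elo_update(42.1, 123.5, 12.0, 2.0)
print(round(red_sox, 2), round(yankees, 2))   # 60.99 104.61
```

Note that the update is symmetric: the Yankees lose exactly the rank the Red Sox gain, so the total rank in the league stays constant.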

Voila! You’ll hopefully notice two beautifully symmetric properties of Elo’s System:

  1. The rank difference predicts the score – it predicts that a contest between a rank 50 and a rank 47.5 might be a tight game. However, a rank 50 vs a rank -100 should mean the first team demolishes the second
  2. The performance of competitors predicts the rank. As teams over or underperform, their rank updates accordingly – thus improving the rank’s predictive power.

The rank carries very useful information about exactly how well a competitor should do. It’s not a bland, uninformative ranking of 1, 2, 3, … Instead, due to this useful property, you’ll get a clumpy distribution of competitors. Some top competitors might clump together near the top of the pack, some will be meh, some abysmal. All in proportion to how good they are at destroying opponents.

Enough with the Sports References – which of my products are cool?

But you didn’t come here to chat about baseball. You came because you wanted to evaluate which of your products, news articles, or whatever are most relevant, hot, and topical.

There’s nothing particular about sports in Elo’s System as I’ve introduced it. It can be used to have any entities compete – products, articles, whatever.

To show this, I’ll wrap the Elo’s System code above up into some handy reusable Python code, with an interface like this class (you can see the fleshed out code here in github):

class EloModel:
    def __init__(self, teams, K=400.0, logisticParam=400.0):
        pass

    def predict(self, teamA, teamB, totalPoints=1.0):
        """ Given the rank difference between the teams, predict
            the outcome of a contest where totalPoints will be
            divided between them """
        pass

    def report(self, teamA, scoreA, teamB, scoreB):
        """ Report a contest between teamA with scoreA and teamB
            with scoreB """
        pass
Here Elo’s System is initialized with a set of competitors (a Python dictionary mapping name -> initial rank). As contests move forward, we record their outcomes using the report method – reporting which teams played, by name, and the scores that resulted. This method performs steps 1 and 2 from above, updating the ranks based on how far the actual contest was from the predicted one.
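The fleshed out github code is the authoritative version; as a sketch of how the class might look, here is a minimal implementation following the steps above (signatures kept from the skeleton; the body is my own assumption):

```python
class EloModel:
    def __init__(self, teams, K=400.0, logisticParam=400.0):
        self.ranks = dict(teams)        # name -> current rank
        self.K = K                      # learning rate
        self.logisticParam = logisticParam

    def _logistic10(self, x):
        return 1.0 / (1.0 + 10.0**x)

    def predict(self, teamA, teamB, totalPoints=1.0):
        """Predict teamA's share of totalPoints from the rank difference."""
        diff = self.ranks[teamB] - self.ranks[teamA]
        return totalPoints * self._logistic10(diff / self.logisticParam)

    def report(self, teamA, scoreA, teamB, scoreB):
        """Report a contest and nudge both ranks toward the observed outcome."""
        expectedA = self.predict(teamA, teamB)    # expected share for A
        actualA = scoreA / (scoreA + scoreB)      # observed share for A
        delta = self.K * (actualA - expectedA)
        self.ranks[teamA] += delta                # zero-sum update
        self.ranks[teamB] -= delta
```

With K=40, this sketch reproduces the baseball walkthrough: reporting the 12-2 Red Sox upset moves their rank from 42.1 up to roughly 61.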

Like I said, Elo’s system applies to a great deal more than sports. We can use it to evaluate products we’ve promoted side-by-side. For example, let’s start with my lame dad assumptions about what products are popular:

products = {
    "tickle-me-elmo": 5.0,
    "fruit-loops": 4.0,
    "legos": 3.0,
    "army-men": 2.0,
    "lame-shirt": 1.0}

On my e-commerce site’s main page, let’s say I’ve managed to engineer a fair contest between each item. Perhaps there’s a simple box on the landing page with these items to choose from. Perhaps two are shown to users as part of the site’s “daily deals” – or in a marketing email. (Engineering fair contests, and/or handicapping them, could be a blog post of its own!)

Let’s presume you’re recording clicks on each item. And let’s say you start to record some interesting outcomes about how users react when items are side by side:

# Record the contests; lame-shirt does unexpectedly well
# Counts are numbers of clicks
model = EloModel(teams=products)
model.report("tickle-me-elmo", 6000.0, "lame-shirt", 10.0)
model.report("tickle-me-elmo", 6000.0, "legos", 8000.0)
model.report("army-men", 1000.0, "lame-shirt", 10000.0)

Interesting – my initial rankings suspected “tickle-me-elmo” would result in more clicks. However, the “lame-shirt” is apparently so lame it’s become cool again! Huh, just like trucker hats come back into fashion, so do silly t-shirts. Must be one of those spirit wolf shirts nobody told me was cool again…

After several other contests, with lame-shirt triumphing unexpectedly, I arrive at a final ranking:

model.report("fruit-loops", 1000.0, "lame-shirt", 10000.0)
model.report("army-men", 4000.0, "tickle-me-elmo", 5000.0)
model.report("legos", 1000.0, "lame-shirt", 1000.0)

# Output
{'tickle-me-elmo': -56.955071725981966,
 'army-men': -137.78126294101122,
 'fruit-loops': -121.72026109587489,
 'legos': 170.00556640578898,
 'lame-shirt': 161.45102935707902}

What’s hot right now is lame shirts and NOT tickle me elmo!

This is a rather crucial piece of intelligence I’ve gathered about my products. I’ve tried stocking the “shelves” in a couple of different arrangements, and I’ve arrived at some interesting conclusions about what’s hot and, surprisingly, what’s not! Lame spirit wolf shirts and legos have been shown to drive great interest from customers. Nobody seems to love elmo anymore :(. Army men and fruit loops cereal are even worse off!

With enough of these trials, over enough products, we can make timely adjustments to what’s hot and relevant to our current customers. We can adjust our understanding of trends in the marketplace very quickly over a large number of products. We can see which products/items are attracting interest or purchases. Perhaps you can even get a sense for how effective promotions are, and stock up inventory beforehand across hundreds of possible “hot” items. The possibilities are endless.

If clicks measure search success for you, then these ranks carry the raw proto-stuff of a relevant search result. And if clicks don’t measure relevance for you, then Elo’s System can easily be instrumented to use whatever method you’d like to score competitors for “relevance” (purchases, click-with-dwell, etc.). Whatever you use, this system gives your content a direct, raw feature for basic relevance – a feature that predicts, independent of other factors, how likely a user is to find an item interesting.

In some ways this is “learning to rank” lite. In my experience with search relevance, biasing search using a good, general hotness measure can give you more ROI than a heavily instrumented machine learning solution trying to optimize each search. Finally, Elo’s System is particularly attractive given the small amount of data you store – putting less stress on your infrastructure than giant matrices per item. In future blog posts, I hope to discuss using an Elo rank with a search engine like Solr or Elasticsearch.

The part of the article where we say “yeah but it depends”

I think this is an interesting method to explore. But before you go rewire your website to run contests between products, there are some considerations to keep in mind:

  • What should the value of K be? Remember from above, K is the learning rate. A higher value of K means each contest has a significant impact on rank. A lower value means a more gradual impact.
  • Should K change per contest? In sports prediction, K often increases with the stakes. Playoff games have higher impact on the updated rank. Perhaps Black Friday for an e-commerce site should adjust K?
  • You: clicks! Me: clicks? Really? Clicks are more controversial indicators of relevance than you might expect. Indeed, this paper from Microsoft Research shows only a 45% correlation between clicks and relevance. In my experience, other domain-specific factors that happen after the click can be more predictive
  • Sure Elo’s System is attractive, but would other methods, such as contextual multi-armed bandits, be better? What about the other ranking methods in Who’s #1?
  • Does your domain have a sense of “hotness”, or is relevance more about other factors? The patent and trademark office doesn’t have hot patents that all the examiners are excited to read! Or maybe they do – but not in the same way that there’s a hot romance novel hitting the shelves.
  • How many contests can you run at once? How do you make them fair or otherwise handicap them?
  • How many items is Elo’s System realistic for? Hundreds? Thousands? Millions? And with many competitors, how much do the ranks lose their predictive ability?

Regardless, this may be an interesting area to explore for you.

Tough search relevance problems you need solved? Be sure to contact us and talk to our specialised e-commerce search team.