High-Quality Recommendation Systems with Elasticsearch

September 9, 2016 Doug Turnbull
Category: Relevancy

Did you know technical books can have spoilers? Well ours does 🙂 and so spoiler alert: we close our book, Relevant Search, making an argument that search engine technology is a great platform for delivering recommendations. We argue recommendations and search are two sides of the same coin. Both rank content for a user based on “relevance” the only difference is whether a keyword query is provided.

We argue that search engines provide the ideal framework for implementing easy-to-tune recommendations. In fact, I’ll be talking about this idea at Lucene Revolution in not too long (see also my coauthor’s talk about this subject at Elastic{On})

In Relevant Search, we implement recommendations using Elasticsearch aggregations. We just introduce these ideas. We tease :). In this blog article, I want to get beyond teasing, and begin exploring how to deliver good recommendations with Elasticsearch.

(Note: While it’s not required reading, definitely check out my previous article on basket analysis for more information.)

Aggregrations: Find what people who like this movie also like

Ok, let’s dig in! You’re running a movie website. You’ve got a set of users and you want to know what to recommend to them. Well, one idea you might have is to index each user as a document, like so movies_liked here is setup as a keyword analyzed field, which I’m omitting):

PUT recs/user/1{
 "movies_liked": ["Forrest Gump", "Terminator", "Rambo", "Rocky", "Good Will Hunting"]}
PUT recs/user/2{
 "movies_liked": ["Forrest Gump", "Terminator", "Rocky IV", "Rocky", "Rocky II", "Predator"]}
PUT recs/user/3{
 "movies_liked": ["Forrest Gump", "The Thin Red Line", "Good Will Hunting", "Rocky II", "Predator", "Batman"]}
PUT recs/user/4{
 "movies_liked": ["Forrest Gump", "Something about Mary", "Sixteen Candles"]}

We want to perform recommendations for users that like Terminator. In other words, we want to know what users who like Terminator also like. This is very basic “basket analysis” – it’s the fundamental building block of recommendations. The name comes from the idea of looking at user’s shopping baskets and finding statistically interesting relationships. Famously, grocery chains learned that people who bought diapers also frequently bought beer. Insights like this can help extremely valuable to users and businesses. Here we’ll focus in on Terminator to find what’s statistically interesting to Terminator viewers.

With documents that store user history, how do we figure out what viewers of Terminator would like? Well the first idea that might come to you is to run a terms aggregation. A terms aggregation is a traditional facet. It gives a raw count of the terms for a field in the current search results. In a typical application terms would probably be movie ids, but we’ll use titles here. We’ll search for “Terminator” and aggregate on the “movies_liked” field to get a raw count of the movies that occur frequently alongside Terminator. Basically like so:

POST recs/user/_search{
    "query": {
        "match": {
            "movies_liked": "Terminator"
        }
    },
    "aggregations": {
        "movies_like_terminator": {
            "terms": {
                "field": "movies_liked",
                "min_doc_count": 1
            }
        }
    }}

The end result, just focussing on the aggregations is the following breakdown

"movies_like_terminator": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
        {
            "key": "Forrest Gump",
            "doc_count": 2
        },
        {
            "key": "Rocky",
            "doc_count": 2
        },
        {
            "key": "Terminator",
            "doc_count": 2
        },
        {
            "key": "Good Will Hunting",
            "doc_count": 1
        }}

If you scan the user records above, you’ll see users that like Terminator also coincides with 2 users that also like Forrest Gump, 2 users that like Rocky, and so on. As an aside, if we repeat this for all of your favorite movies – searching also for “Terminator” OR “Predator” OR “Say Anything” and all your favorite movies, then you’ll get collaborative filtering: issue a search to find the users that like the same movies you do. Then you’ll examine the raw counts to see movies you haven’t watched that co-occur frequently wtih movies you like.

But is this raw count approach a good approach? We discussed both in Relevant Search and in the basket analysis blog article that raw co-occurrence counts are really poor ways of doing recommendations. Raw counts prioritize what’s globally popular over what’s interesting here. In this example, for instance, everyone likes Forrest Gump. So if you used this as a method for recommendations, every user will be recommended Forrest Gump. We refer to this problem as the “Oprah Book Club” problem – a bias towards cross cutting popularity that isn’t particularly useful.

What’s more interesting here is the relative count of movies in the Terminator results. For example, look at Rocky. Every user that likes Rocky also likes Terminator – they have exactly 100% overlap! Another way of saying this is Rocky occurs 100% here in the foreground, and only 50% globally: in the background. This seems like a great recommendation for users that like Terminator!

Measuring meaningful user-item connections with Significant Terms

To transcend the problems with raw counts, Significant Terms Aggregration comes in handy. This aggregration measures the kind of statistically significant relationships we need to deliver meaningful recommendations. It’s job is to calculate not raw counts, but specifically the statistical significance of terms in the current results when compared to the background corups. In my basket analysis article we discussed the pros/cons of different ways of scoring significance, let’s explore what significant terms does.

But let’s take it out for a spin before thinking about it too critically:

POST recs/user/_search{
    "query": {
        "match": {
            "movies_liked": "Terminator"
        }
    }
,
    "aggregations": {
        "movies_like_terminator": {
            "significant_terms": {
                "field": "movies_liked",
                "min_doc_count": 1
            }
        }
    }}

The only thing that’s different above is we use a significant_terms calculation. We set the min_doc_count to 1 to work well in our really puny data set.

And indeed, this solves the immediate problem, we see that our recommendations appear more appropriate, with Forrest Gump nowhere to be seen:

"buckets": [
            {
               "key": "Rocky",
               "doc_count": 2,
               "score": 1,
               "bg_count": 2
            },
            {
               "key": "Terminator",
               "doc_count": 2,
               "score": 1,
               "bg_count": 2
            },
            {
               "key": "Rambo",
               "doc_count": 1,
               "score": 0.5,
               "bg_count": 1
            },
            {
               "key": "Rocky IV",
               "doc_count": 1,
               "score": 0.5,
               "bg_count": 1
            }
         ]

Thinking Critically about JLH Significance Scoring

That’s it, right? No need to dive further? Not quiet: you’ll need to understand how this scoring works in the context of your data. To build really good recommendation systems you need to be able to think more deeply about the scoring used. How did significant terms arrive at this ranking? As we saw in the basket analysis article, different forms of significance scoring have their own pros and cons. Choosing the wrong method can have disastrous consequences for the quality of recommendations.

I’m not going to take apart every form of significant terms scoring in this article (there’s many options), but let’s dive into the default method to teach ourselves how to think through these problems. The default is called JLH. It multiplies two values

(foregroundPercentage / backgroundPercentage)  *
    (foregroundPercentage - backgroundPercentage)

“Foreground” here means the percentage this item occurs in the current search results (the users retrieved that like our item). For example, above Rambo’s foreground percentage in Terminator movies was 100%. The background percentage is the global percentage in the whole collection. For Rambo above it’s 50%.

JLH Scoring on Extremely Common Items

How would you go about evaluating the appropriateness of this scoring to your recommendations use case? Let’s think of a couple of hypothetical scenarios. These let us take apart the scoring mechanics around some imaginary scenarios. Then you can think critically about how realistically they might play out in your real-world data. While we continue to use movies, I’m only playing with hypothetical scenarios here: don’t mistake this for an analysis of JLH scoring’s appropriatteness against a real data set like Netflix or Movielens.

The first scenario, which we’ve already discussed is the movie everyone likes. In our data set, this is Forrest Gump. 99.999% of users like Forrest Gump.

Indeed such an extremely popular movie doesn’t fare particularly well with JLH scoring. With the (foreground/Background) term, the best Forrest Gump could achieve would be 100 / 99.999, barely > 1. Similarly (foreground – background) or (100 – 99.999) would get just below one. As you’ll see below, this is a pretty low number for JLH.

But is this a fair scenario? I would argue it’s rare in most domains that an item would be liked by close to 100% of the users. It’s more likely that a much smaller number has a measurable preference for a film. For example, sure perhaps everyone likes Forrest Gump, but most methods of detecting a “like”, such as an explicit rating or an implicit watch/click, will not show 100% of the users interacting with Forrest Gump. Sure, I like Forrest Gump, but last time I watched it was probably 10 years ago. It’s a bit of a passive preference. I’ve seen it, I’m unlikely to be excited when I see it show up on Netflix.

A more likely scenario is the most popular movie or show is something that, say, 20% of the users like. How does JLH scoring fare here? If this movie, let’s say Good Will Hunting, is liked 100% by this set of users, then the highest score would be (100 / 20) * (100 – 20) or 5 * 80 or 400. Much higher than the perhaps unrealistic Forrest Gump scenario.

Items of Average Popularity

Most movies show single digit percentage of users that like the movie. Let’s move on to an average movie: Rocky IV, liked by 4% of users.

How well does Rocky IV fare with JLH? Well repeating the JLH movie, if 100% of the foreground users like Rocky, we get (100 / 4) * (100 – 4) or 25 * 96 = 2400. Wow that’s significantly more than the likelihood of being recommended Good Will Hunting!

More “normal” deviations

With a 400 best possible score for popular Good Will Hunting and a 2400 for average popularity Rocky IV, it seems like JLH is just terrible for recommendations, right? Not quite. Well let’s think about the likelihood of these “best case” scenarios. How reasonable is it for 100% of users that like one item to like another item? How likely is it that the foreground probability of liking an item deviates so dramatically from the background preferences?

Well that depends heavily on the relative distribution of item preferences in your data. One could argue, for example, that if you like Rocky III there’s a very high probability you’ll also like Rocky IV. Perhaps close to 100% overlap.

However, if your data tends to not change as much between the background and foreground data sets, perhaps this isn’t as big of a deal. For example, let’s consider cases where the expected value of the foreground is based on a normal distribution, centered around the background percentage. Let’s consider a hypthothetical constant change between background and foreground is a change of about 75% (Rocky IV could be 7+/-3%; Good Will Hunting could be 20 +/- 15%). In these perhaps more realistic scenarios, we would expect scores:

Rocky IV best score: (7/4) * (7-4) = 5.25
Good Will Hunting best score: (35 / 20) * (35 – 20) = 26.25

In other words, an equal proportionate shift rewards the popular movie much more than the one of middling popularity. The first term (the division) will be equivalent between the two movies. The other term (the subtraction) however has the potential for providing a much higher multiple for the popular movie given the shift.

Why might this be? As the creator of JLH, Mark Harwood, points out this is precisely what JLH is designed to do. JLH reflects patterns seen in many recommendation data sets: popular items don’t change dramatically in popularity. Even when scoped to other seemingly related items they deviate far less from the background popularity. We mentioned that everyone purchases eggs in the basket analysis article. If almost everyone buys eggs, scoping egg purchases to other omellete ingredients probably shows only a small increase in total proportion of egg purchases. But we still want reasonable recommendations for these omelette chefs: so even mild increases in egg popularity should seen as significant.

Given this, perhaps it’s more typical that Good Will Hunting only deviates +/- 25% from the background. It just so happens, this makes Good Will Hunting’s highest score to be the same as less popular Rocky IV:

(25 / 20) * (25 - 20) = 6.25

In other words, if we examine, say, Sense and Sensibility, Good Will Hunting increasing 25% in popularity scores as highly as our hypothetical Rocky IV increasing 75%.

Final thoughts on JLH Significance Scoring for recommendations

Given what we’ve just seen, JLH assumes less popular items are more likely to increase their foreground probability than background percentage.

Typical foreground percentages tend to stay fairly close to background percentage, as in you won’t suddenly see a 100% overlap between two items
The foreground percentage for popular items deviates much less than the foreground percentage of less popular items.

The most important takeaway is this criteria is dictated by the statistical patterns in YOUR data. You need to evaluate typical patterns in your data. When you focus on typical preferences what’s the relationship between foreground and background percentages? This is the hard, domain specific work you need to pursue to build a great recommendation system.

But there’s one nefarious case we haven’t yet considered. What about extremely rare movies that happen to overlap? We saw in the basket analysis article that this was the achilles heel for the Jaccard similarity. So let’s consider one such scenario, involving a two movies that just so happened to be liked very rarely and by exactly one user:

Two films Dumb Movie and Weird Art Film each occur twice. And it just so happens that Dumb Movie and Weird Art Film were both liked by one user that happens to like obscure and/or bad movies:

PUT recs/user/2124{
    "movies_liked": ["Weird Art Film", "Forrest Gump"]}
PUT recs/user/2125{
    "movies_liked": ["Weird Art Film", "Dumb Movie"]}
PUT recs/user/2126{
    "movies_liked": ["Dumb Movie", "Rambo"]}

The background percentage of these movies is very low. Perhaps 0.001%. When recommending movies to people that like Weird Art Film, the foreground percentage of “Dumb Movie” becomes 50%. Suddenly very common! How does this score?

(50 / 0.001) * (50 * 0.001) = 2499950.0

Eeegads! That’s a high score.

Suffice it to say, just as discussed in Ted Dunning’s paper on Surprise and Coincidence this result is unreasonably high. We only have a very small handful of users that like these movies. Probably not enough to make statistically robust conclusions about significance. You can see this if you consider how far the foreground and background percentages are from each other. If the data shows +/- 25%, we’d eexpect this rare movie’s foreground percentage to be some value under 0.001 to just above 0.001. Definitely not 50. This value probably defies our beliefs about the typical patterns we see in our data.

So in addition to the recommendations above, there’s another important criteria to evaluate when applying any form of significance scoring:

Find a reasonable lower bound of number of likes that begins to reflect typical statistical significance between foreground and background relationships. If its more typical for a fg percentage to be +/- 3% centered around the background, but one rare movie shows wild deviations, consider it an outlier and set a reasonable minimum.

(It’s worth noting a high lower bound in the minimum document frequency is a good idea anyway for performance, lest you run all these expensive calculations on every spurious term that occurs).

The big takeaway, is it’s your data that matters with any method of measuring significance. What’s a typical relationship? And what on its face appears to wildly defy expectations?

Towards a Bayesian Approach

I want to keep exploring ways of scoring co-occurences in search engines in future articles. But I want to leave you by exploring one area I find fascinating. In all this talk about foreground and background percentages, I can’t help but think about Bayes formula:

P(A | B) = P(B | A) * P(A) / P(B)

Another way of thinking about what we’re trying to do is we’re trying to understand the probability of liking one item given the probability of liking another item. That’s exactly what P(A | B) expresses. Scoped to all the users that like B (perhaps “Terminator”) what is the probability of liking some A (perhaps “Rambo”).

P(Rambo | Terminator) =
     P(Terminator | Rambo) * P(Rambo) / P(Terminator)

Well P(Rambo | Terminator) corresponds to the foreground percentage. While P(Rambo) corresponds to the background percentage. In other words:

foregroundRambo =
     ( P(Terminator | Rambo) / P(Terminator) ) * backgroundRambo

I’ve alluded to performing your own data analysis work. Don’t just try to find this relationship between foreground and background probabilities. But try to figure out a typical relationship between foreground and background percentages. You can then use the typical general relationship to evaluate scoring methods, and find a good lower bounds where you feel confident in the statistical patterns in the scoring. If we think about Bayes formula a bit more generally, we see there’s a common transformation happening. Getting away from one example, we can think of functions that return probabilities (ie probability distributions) to get at these relationships:

ForegroundDistribution(item A, item B) = 
    transformation(item A, item B) * background( item B)

I haven’t finished mapping this out in my head yet, but this idea is at the root of Bayes Formula. The “background” distribution seems to correspond to a Bayesian Prior. It says, with no evidence, what sort of typical background probability distribution can we expect. In the basket analysis article, we argued for a power-law or zipfian distribution for buying patterns. The “transformation” in classic Bayesian analysis corresponds to the likelihood function. Here, it’s the relatively likelihood of A, given some other B. You can imagine this as outputting a constant that’s going to shrink or grow the background probability based on the relative co-occurence of A and B. Perhaps it spits out a 1.5 for some A and B pair – growing the foreground percentage. Or spits out a 0.5, shrinking it.

The “transformation” function you can imagine in three dimensions. The x and y axis corresponding to items a and b, with the z axis corresponding to how much you’re going to multiply background to get foreground. You can sample P(B | A) (Terminator scoped to Rambo) by looking up Rambo users and counting the raw count of Terminator users. You can similarly lookup P(Rambo) by snagging the total count of Rambo users. From these values you can compute a point-wise value for transformation.

You probably won’t be able to derive this transformation into a formula. But you can use your data to draw a graph. At any given item pairing, for example Terminator and Rambo, you can compute this value. You need not compute this for every pairing. Remember your goal is to figure out what’s typical to guide your evaluation, not find every possible value.

If you sample this transformation, and values tend to stay between 0.9 and 1.1, you know that your foreground and background percentages don’t deviate a great deal. If popular things grow/shrink less than unpopular things, then JLH might work well for you. If there’s wild swings that go against this pattern, you might have a sense that something like JLH might not be appropriate for your data.

I’m actually really excited to pursue these ideas further (and if you play with them, let me know what happens). For example, this isn’t particularly classical Bayesian analysis, where we’re trying to find our confidence in some model parameters (such as the bias of a coin, or weight of die, or distribution of topics in a corpus). Instead it’s more like a regression where we’re trying to determine the relationship between two values (items A and B) and a transformation. Perhaps something like a loess regression that’s not parameterized (ie doesn’t assume the underlying data fits exactly one formula). Perhaps what I really ought to be doing is trying to do Bayesian analysis on the transformation process to find latent parameters in the relationship between arbitrary A and B items. I’d love to hear your ideas! Get in touch if you’re curious about exploring these ideas more deeply or if you want to point me at wisdom that already explores these ideas.

Practical Elasticsearch and Data Modeling Considerations

With an agregrations approach, we’re left with a couple of practical considerations for building great recommendations. It’s worth noting them here as areas for further investigation

We can’t use a significance score from the aggregations alongside typical search scoring, so we can’t directly bias by factors like release date with just one query. We can’t simply do “recommendations” alongside “search” to do personalized search. Unfortunately aggregations and search live in two different universes.

One way to get around this is by also using the sampler aggregation. The sampler aggregation performs aggregations over the top N most relevant documents. So we could, for example, prioritize recent movies in our search results, then treat the top 100 search results as the “foreground” set. This provides an indirect relevance influence over the significance calculation.

Also, perhaps there’s a way to integrate significance scoring with the regular query process. I’ve been playing with using More Like This or locality-sensitive hashing via the new Simhash capabilities which can occur in the normal query process. But I have significant reservations about these methods, which I won’t get into here (I apologize, this article is probably the root for 10 other articles :-).)

Graph-based Recommendations with Elastic Graph

Elastic has released an interesting graph product that does the sampling/significance aggregation alongside an ability to navigate to secondary and tertiary significance. For example, in addition to the exercise above, what if we could generate recommendations, notice which genres or taste profiles are significant to me, then find what’s significant across those taste profiles? This is a kind of graph navigation that can provide a lot of value to users and dramatically simplify the creation of recommendation systems.

We’re actually building some interesting demos using Elastic’s graph product for recommendations that go beyond what’s in this article. If you’re interested in seeing a proof of concept of what we’re doing with Elastic’s graph product, feel free to get in touch. I’d be happy to run you through the ideas and proof-of-concept, even as we’re actively developing it. We’d love to hear your feedback!

Search Engines are the future of recommendations

Open source search engines like Solr and Elasticsearch made search extremely simple to implement. Recommendation systems still require integrating multiple distributed systems, learning R, and hiring a huge team of data scientists. It sounds extremely hard. You can standup reasonable search in a few days, but recommendation systems require a great deal more effort.

We’re pushing to change that. In the same way that Solr and Elasticsearch enable pretty-good search out of the box, we want to help build recommendation systems with far total lower cost of ownership than the current regime. Search engines are the ideal place to implement a recommendation system explicitly because they’re easy-to-maintain, simpler to tune, and straightforward to deploy than juggling dozens of machine learning libraries and distributed systems.

If you’re interested in exploring how Solr or Elasticsearch can simplify your development of a recommendation system, get in touch and we’d love to explore this with you.

Special thanks to Mark Harwood at Elastic for his help reviewing this article.