Search-Aware Product Recommendation in Solr

John Berryman — October 5, 2013 | 8 Comments | Filed in: solr

Building upon earlier work with semantic search, OpenSource Connections is excited to unveil Solr-based product recommendation. With this technology, it is now possible to serve user-specific, search-aware product recommendations directly from Solr.

In this post, we will review a simple search-aware recommendation, using an online grocery service as an example of e-commerce product recommendation. For this example we have built a basic keyword search over the product catalog and added two fields to Solr: purchasedByTheseUsers and recommendToTheseUsers. Both fields contain lists of userIds. Recall that each document in the index corresponds to a product; thus the purchasedByTheseUsers field literally lists all of the users who have purchased that product. The second field, recommendToTheseUsers, is the special sauce: it lists all users who might want to purchase the corresponding product. We extract this field using a process called collaborative filtering, which is described in my previous post, Semantic Search With Solr And Python Numpy. With collaborative filtering, we make product recommendations by mathematically identifying similar users (based on the products they have purchased) and then recommending the items that those similar users have bought.
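To make the collaborative filtering step concrete, here is a minimal NumPy sketch of how a recommendToTheseUsers list might be produced. The purchase matrix, the rank k, and the score threshold are all hypothetical stand-ins; the real pipeline is described in the previous post.

```python
import numpy as np

# Hypothetical purchase matrix: rows = users, cols = products, 1 = purchased.
# In the real pipeline this would come from the grocer's order history.
purchases = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
], dtype=float)

# Rank-reduce with SVD to smooth over sparsity, then reconstruct: the
# reconstructed matrix scores how likely each user is to want each product.
U, s, Vt = np.linalg.svd(purchases, full_matrices=False)
k = 2  # number of latent "taste" dimensions (a tuning knob)
scores = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# For each product, recommendToTheseUsers = users whose score clears a threshold.
threshold = 0.5
recommend_to = {
    product: [user for user in range(purchases.shape[0])
              if scores[user, product] > threshold]
    for product in range(purchases.shape[1])
}
```

Each list in recommend_to would then be indexed into the product's recommendToTheseUsers field (with real userIds rather than row numbers).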

Now that the background has been established, let’s look at the results. We search for three different products on behalf of two randomly selected users, whom we will refer to as Wendy and Dave. For each product, we first perform a raw search to establish a baseline for how the search performs against user queries. We then search for the intersection of these search results and the products recommended to Wendy, and finally for the intersection of the same results and the products recommended to Dave.
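This three-query pattern can be sketched in a few lines of Python. The search_url helper below is hypothetical, but the parameters it builds match the Solr URLs shown throughout this post: the per-user queries are just the raw query plus an fq filter on recommendToTheseUsers.

```python
from urllib.parse import urlencode

# Request-handler URL used in the examples in this post.
SOLR_BASE = "http://localhost:8983/recommendation"

def search_url(query, user_id=None):
    """Raw keyword search, optionally intersected with one user's recommendations."""
    params = [("q", query)]
    if user_id is not None:
        # fq filters the keyword results down to products recommended to this user
        params.append(("fq", "recommendToTheseUsers:%s" % user_id))
    return SOLR_BASE + "?" + urlencode(params)

# Raw search, then the same search scoped to Wendy and to Dave:
wendy = "46f7a9a9-c661-4497-81a9-64ed537b65b0"
dave = "973990af-1482-4b63-a83c-bd2d8fc1ff32"
print(search_url("potatoes"))
print(search_url("potatoes", wendy))
print(search_url("potatoes", dave))
```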

Here are the results:


SOLR_URL: http://localhost:8983/recommendation?q=potatoes


Canned yams, potato bread, and other potato derivative products. These are unusual results, because when one searches for potatoes… they just want potatoes.

Wendy’s Potatoes

SOLR_URL: http://localhost:8983/recommendation?q=potatoes&fq=recommendToTheseUsers:46f7a9a9-c661-4497-81a9-64ed537b65b0


Much better; with the exception of one item, everything pictured here is a potato.

Dave’s Potatoes

SOLR_URL: http://localhost:8983/recommendation?q=potatoes&fq=recommendToTheseUsers:973990af-1482-4b63-a83c-bd2d8fc1ff32


Dave’s apparently a convenience shopper – besides actual potatoes, we have soups, frozen french fries, chips – perfectly reasonable!


SOLR_URL: http://localhost:8983/recommendation?q=steak


Plenty of steak here – though there are a few things that aren’t steak. What’s the deal with that first result? Taco mix?

Wendy’s Steak

SOLR_URL: http://localhost:8983/recommendation?q=steak&fq=recommendToTheseUsers:46f7a9a9-c661-4497-81a9-64ed537b65b0


Uh-oh, only one steak recommendation for Wendy. Does this mean that the recommendation system has failed her? Not really. She probably doesn’t grill much, and since she hasn’t bought related products in the past, the recommender lacks sufficient information to make more than one recommendation. But when Wendy finally does search for steak, we can at least be assured that this steak will be at the top of her search results rather than the taco seasoning from the raw search.

Dave’s Steak

SOLR_URL: http://localhost:8983/recommendation?q=steak&fq=recommendToTheseUsers:973990af-1482-4b63-a83c-bd2d8fc1ff32


Dave, on the other hand, apparently loves to grill. The richness of these recommendations indicates that his spending habits are probably akin to those of Tim “the Tool Man” Taylor.


SOLR_URL: http://localhost:8983/recommendation?q=banana


Those mushroom-looking things are muffins that contain bananas. No wait… they’re actually muffins that don’t contain bananas; rather, they’re produced by a company with “banana” in its name. As a matter of fact, the only bananas on this page are decorations surrounding the actual products. Hm… :-/

Wendy’s Bananas

SOLR_URL: http://localhost:8983/recommendation?q=banana&fq=recommendToTheseUsers:46f7a9a9-c661-4497-81a9-64ed537b65b0


Yumm… bananas.

Dave’s Bananas

SOLR_URL: http://localhost:8983/recommendation?q=banana&fq=recommendToTheseUsers:973990af-1482-4b63-a83c-bd2d8fc1ff32


Here we again see several different options for bananas. The other non-banana results aren’t necessarily bad results; remember that above, Dave appeared to be somewhat of a convenience shopper, and banana chips and banana bread are certainly consistent with that.

Incorporating Solr Search-Aware Product Recommendations into Your Search

For even a moderately large inventory and user base, building the recommendation data is a relatively straightforward process that can be performed on a single machine and completed in minutes. Incorporating these recommendations into search results is as simple as adding a parameter to the Solr search URL. In every case outlined above, the recommended subset is clearly better than the full set of search results. What’s more, these recommendations are actually personalized for every user on your site! Incorporating this functionality into search will allow customers to find the items they are looking for much more quickly than they would be able to otherwise.

Want to Improve Your Conversion Rates?

We are currently seeking alpha testers for Solr Search-Aware Product Recommendation. If you are interested in trying out this technology, please get in touch. We will review your current search application, get you set up with Search-Aware Product Recommendation, and help you collect metrics so that you can measure the corresponding increase in conversion rate.

8 comments on “Search-Aware Product Recommendation in Solr”

  1. @Bob B: Great observation! And for many clients, the answer to this question will be “yes, it is feasible”. For example, the search put together above takes into account a full inventory and a purchase history across all users from a real online grocer. From scratch, the process required to get Solr running takes roughly 10 minutes on my wimpy laptop, and the resulting search is quite snappy! I’m sure, though, that at some point the number of clients might become so large that this method becomes infeasible. However, we’ve identified two alternative architectures that, while a bit more complicated, will likely scale to truly extreme eCommerce offerings.

  2. @John Berryman: Interesting! I have a similar situation, and I never really considered storing the user list in the index. I just kind of dismissed it for obvious reasons, but I am going to give it some more thought.

    I’d be curious to know how you evaluate when this approach no longer scales. I’d be even more curious to know what the alternative architectures are! But maybe that is the super-secret “special sauce”.

  3. I’ve explored enrichment a bit over recent months, including running a Temis (semantic enrichment hopefuls) pilot for a large publisher in the UK. They seem to go the heavy handed route of entity tagging and term disambiguation, before then connecting it all back up again for us in a (annoyingly external) recommendation web service. It seemed pretty rudimentary and for all their efforts the result was bizarrely not any better than using the internal SOLR (tf*idf) related component to human eyes. Furthermore they expect users to come up with industry taxonomies at a huge cost. Looks like you guys have got a lower entry barrier and an improvement by measuring implied vocabulary in the tf-idf. That said it also still feels like more of the same, and would mostly be suited to shorter descriptions on products (just as you’re doing). How would you approach large datasets, with comprehensive inter-disciplinary taxonomies where inter-relating terms are plentiful in each document but their meaning changes? Feels like actual AI has heaps of scope to progress us here.

  4. @mark Beeby: don’t be fooled, tf*idf has never been sufficient BEYOND vocabularies; every research item I’ve come across uses taxonomies and ontologies to exceed ‘commodity’ metrics.

    As to your questions about large datasets….you’re joking? Competitive advantage dude. Let you know in a year or two.

  5. @Bob B I think evaluation of the effectiveness would probably best involve simple A/B testing. As for the other two methods, it all revolves around the math that I talk about in this presentation. In the index, you can represent either A) all users who would purchase said item, B) the general “genres” that a product resides in, or C) just the product itself (e.g. the id). And then, with the rank-reduced product-user matrix handy, you can calculate whether someone might buy the product (that is, information A) from either information B or information C. But there’s a trade-off. Method A takes up a lot of space, while B and C take up less. As a corollary, A is the least computationally intense at query time, while B and C require you to do some matrix math. Basically, the method I use here, and in my previous, related post, is method A. We’ll see how things pan out with methods B and C… but I’m quite hopeful they’ll work just fine.
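     The query-time matrix math for methods B and C boils down to a dot product of latent factors. A minimal NumPy sketch, where the factor matrices are hypothetical toy data standing in for the real rank-reduced factorization:

```python
import numpy as np

# Hypothetical rank-reduced factors from the SVD of the user-product matrix.
# Method A stores full user lists in the index; methods B/C store less and
# instead do a small dot product like this at query time.
rng = np.random.default_rng(0)
k = 2                                       # latent dimensions
user_factors = rng.normal(size=(4, k))      # one row per user
product_factors = rng.normal(size=(5, k))   # one row per product

def might_buy(user_id, product_id, threshold=0.5):
    # Score is the dot product of the user's and product's latent vectors.
    return float(user_factors[user_id] @ product_factors[product_id]) > threshold
```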

  6. @Mark Beeby If you haven’t already, look at our related post, which applies similar methods to semantic search. We’ve tried (and are currently trying) the tagged content approach. And while Billy EM is correct that TF*IDF will, by itself, never exceed the exacting nature of a human-generated ontology… statistical methods sure are a heck of a lot easier on the humans! In the end, it’s a balancing act. Nothing is perfect (including human tagging!), but humans will always be better at understanding language than machines, and machines will always accomplish mind-numbing work more efficiently than humans. But we’re pushing the limits here every day. In regard to your specific question about terms that change meaning from document to document (this is called polysemy, as opposed to synonymy), the approach above is not well cut out for that. Effectively what it’s doing is quantifying the rate of co-occurrence between all terms in the index. But this doesn’t reveal when the same term has different meanings.

Comments are closed.

Developed in Charlottesville, VA | ©2013 – OpenSource Connections, LLC