
Semantic Search with Solr and Python Numpy

Built upon Lucene, Solr provides fast, highly scalable, and easily maintainable full-text search capabilities. However, under the hood, Solr is really just a sophisticated token-matching engine. What's missing? Semantic Search!

Consider three somewhat silly documents:

1. Yellow banana peels.
2. A banana is a long yellow fruit.
3. This mystery fruit is long and yellow and has a peel.

Now what happens if you search for the term “banana”? Under normal circumstances you only get back the first and second documents. But why shouldn't you also get back the third document? It's obviously talking about bananas!

Semantic Search via Collaborative Filtering

My colleague Doug Turnbull and I recently set out to right this wrong with help from a machine learning technique called collaborative filtering. Collaborative filtering is most often used as a basis for recommendation algorithms. For example, collaborative filtering algorithms were the central focus of the now-famous Netflix Prize, which awarded $1 million to the team that could build the best movie recommendation engine. When dealing with recommendations, collaborative filtering works by mathematically identifying commonalities in groups of users based upon the movies that they enjoyed. Then, if you appear to fall in one of those groups, the recommendation engine will point you towards a movie that a) you haven't watched and b) you are likely to enjoy.

So what does this have to do with Semantic Search? Everything! In just the same way that certain users gravitate towards certain movies, certain words commonly co-occur in the same documents. When working with Semantic Search, rather than recommending movies that users would likely enjoy, we are going to identify words that are likely to belong in a given document, whether or not they actually occurred there. The math is exactly the same!

Here's how the process works:

• First we identify a text field of interest in our documents and extract the associated term-document matrix for external processing. Each element of this term-document matrix indicates the strength of a particular term within a particular document (where strength can be anything, but will likely be either term frequency or TF*IDF).
• Next, collaborative filtering is applied to the term-document matrix which effectively generates a pseudo-term-document matrix. This pseudo-term-document matrix is the same size and shape as the original term-document matrix and references the same terms and documents, but the numbers are slightly different. These new values indicate the strength that a particular term should have in a particular document once noisy data is removed.
• Finally, the high-scoring values in the pseudo-term-document matrix are mapped back to the associated terms. These terms are then injected back into Solr in a new field which can be used for Semantic Search.
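To make these steps a little more concrete, here is a minimal NumPy sketch on the three banana documents from earlier. It assumes a truncated SVD as the collaborative filtering step; that is our stand-in for illustration, not a description of what the repo's code does. The toy matrix, the term list, and the `blur` helper are all hypothetical.

```python
import numpy as np

# Hypothetical toy data: rows are (stemmed) terms, columns are the three
# banana documents, values are raw term counts. Stop words are left out.
terms = ["yellow", "banana", "peel", "long", "fruit"]
tdm = np.array([
    [1, 1, 1],  # yellow
    [1, 1, 0],  # banana: absent from document 3
    [1, 0, 1],  # peel
    [0, 1, 1],  # long
    [0, 1, 1],  # fruit
], dtype=float)


def blur(matrix, num_topics):
    """Rank-`num_topics` approximation of `matrix` via truncated SVD.

    This stands in for the collaborative filtering step; it is an assumed
    technique, not necessarily what the SemanticAnalyzer code does.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :num_topics] * s[:num_topics]) @ vt[:num_topics, :]


pseudo = blur(tdm, num_topics=1)

# Same shape as the original, but the zero for "banana" in document 3
# has been replaced by a positive score.
banana = terms.index("banana")
print("original:", tdm[banana])
print("pseudo:  ", np.round(pseudo[banana], 2))
```

Keeping only the strongest "topic" forces the approximation to explain every document with the same pattern of co-occurring words, which is why the mystery fruit document picks up a nonzero score for banana.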

Demo Time!

So let's consider an example case. As in plenty of our previous posts, we will be using the Science Fiction Stack Exchange. Why? Because we're all nerds, and with such a familiar topic we can quickly intuit whether or not a search is returning relevant results. In this data set, the field of interest is the Body field because it contains the contents of all questions and answers.

So now that we've decided upon our demo dataset, we're ready to run the analysis. If you'd like to follow along, then please take a look at our git repo. This repo contains the example SciFi data set, the Semantic Search code, and a README to get you going. However, I'm going to execute everything from within Python:

>>> from SemanticAnalyzer import *
>>> stvc = SolrTermVectorCollector(field='Body',feature='tf',batchSize=1000)
>>> tdc = TermDocCollection(source=stvc,numTopics=150)

That last line takes a few minutes. If it's in the AM where you are, grab a coffee. If it's in the PM, grab a beer. Once that line completes, we will have successfully extracted the term-document matrix from Solr. Now let's play with it for a bit. One of the cool side effects of this analysis is the ability to quickly find words that commonly occur together. Let's give it an easy test; here are the 30 words most highly correlated with the word vader (as in Darth Vader).

>>> tdc.getRelatedTerms('vader',30)

Did you notice that pause when you called the function? That was the collaborative filtering taking place. The results of that process have now been saved, so additional calls will return quite quickly.

vader luke emperor darth palpatin anakin sith skywalk sidiou apprentic empir luca side star son forc turn kill death rule suit father question jedi command obi tarkin dark wan plan

Hey, not bad! Everything here seems very reasonably connected with Mr. Vader. You may notice some odd spellings; that's because these are the indexed terms, which are stemmed. Let's try again with a different term, this time everyone's favorite wizard:

>>> tdc.getRelatedTerms('potter',30)

harri potter voldemort wizard snape death magic jame love spell time rowl lili eater travel seri hous hand hogwart three find wormtail kill slytherin hallow secret deathli muggl order lord

Again, pretty good! One last try, and we'll make it a little more challenging with a vague adjective:

>>> tdc.getRelatedTerms('dark',30)

dark side jedi sith eater lord death mark snape magic curs evil forc luke mercuri cave yoda jame palpatin dagobah anakin black call wizard slytherin live light siriu matter voldemort

Indeed, most of these terms are like a hall of fame of dark things from Star Wars and Harry Potter.
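How might a lookup like this work? One plausible approach, sketched below, is to place each term in the low-rank "topic" space and rank the other terms by cosine similarity. This is an assumption on our part for illustration: the `related_terms` helper and the toy counts are hypothetical, and the actual getRelatedTerms may compute its correlations differently.

```python
import numpy as np


def related_terms(matrix, terms, query, num_topics, top_n):
    """Rank terms by cosine similarity to `query` in a truncated-SVD space."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    # One row per term, one column per retained topic.
    term_vecs = u[:, :num_topics] * s[:num_topics]
    norms = np.linalg.norm(term_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty rows
    unit = term_vecs / norms
    sims = unit @ unit[terms.index(query)]
    order = np.argsort(-sims)
    return [terms[i] for i in order[:top_n]]


# Hypothetical toy counts: two Star Wars documents, two Harry Potter documents.
terms = ["vader", "luke", "sith", "potter", "wizard", "snape"]
tdm = np.array([
    [2, 1, 0, 0],  # vader
    [1, 2, 0, 0],  # luke
    [1, 1, 0, 0],  # sith
    [0, 0, 3, 1],  # potter
    [0, 0, 1, 3],  # wizard
    [0, 0, 1, 1],  # snape
], dtype=float)

print(related_terms(tdm, terms, "vader", num_topics=2, top_n=3))
```

As in the real output above, the query term itself shows up in its own list, since nothing is more similar to vader than vader.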

Now that the word correlation has proven itself, it's time to generate the pseudo terms and post them back to Solr.

>>> SolrBlurredTermUpdater(tdc,blurredField="BodyBlurred").pushToSolr(0.1)

This line will probably see you to the end of your coffee or beer (it takes about 10 minutes on my machine). But once it's done, you can start issuing searches to Solr.
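For the curious, here is a rough sketch of what this step could involve: keep the terms whose pseudo score clears a cutoff, then wrap them in Solr atomic-update documents for the new field. We are assuming that the 0.1 argument acts as a score cutoff and that the unique key field is called `id`; neither is confirmed by the code above. The helper names and toy scores are hypothetical, and nothing here actually talks to Solr.

```python
import json

import numpy as np


def pseudo_terms(pseudo, terms, doc_index, threshold):
    """Terms whose score in one document exceeds `threshold`, best first."""
    scores = pseudo[:, doc_index]
    keep = [i for i in np.argsort(-scores) if scores[i] > threshold]
    return [terms[i] for i in keep]


def build_updates(pseudo, terms, doc_ids, field, threshold):
    """One atomic-update document per Solr document (assumes key field `id`)."""
    return [
        {"id": doc_id,
         field: {"set": " ".join(pseudo_terms(pseudo, terms, j, threshold))}}
        for j, doc_id in enumerate(doc_ids)
    ]


# Hypothetical pseudo-term-document matrix: 3 terms x 2 documents.
terms = ["dark", "side", "light"]
pseudo = np.array([
    [0.90, 0.05],
    [0.20, 0.60],
    [0.00, 0.30],
])

updates = build_updates(pseudo, terms, ["q1", "a1"], "BodyBlurred", 0.1)
print(json.dumps(updates, indent=2))
```

The resulting JSON list is the kind of payload you would send to Solr's update handler; using `set` leaves the original Body field untouched.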

Solr Results

Here's an example of Semantic Search using Solr:

http://localhost:8983/solr/select/?q=-Body:dark +BodyBlurred:dark
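One practical note if you build this URL in code rather than pasting it into a browser: the space and the `+` need to be percent-encoded, otherwise the `+` is decoded as a space before Solr ever sees it. A small sketch using only the standard library follows; the `semantic_only_query` helper is hypothetical, and `wt=json` is simply our choice of response format.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def semantic_only_query(term, original_field="Body", blurred_field="BodyBlurred"):
    """Exclude literal matches, require a match on the pseudo terms."""
    return f"-{original_field}:{term} +{blurred_field}:{term}"


params = {"q": semantic_only_query("dark"), "wt": "json"}
url = "http://localhost:8983/solr/select/?" + urlencode(params)
print(url)

# Round trip: decoding the query string gives back the original query.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(decoded)
```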

The Body field contains the original text, while the BodyBlurred field contains the pseudo-terms. So this query finds all documents that do not include the term dark but presumably contain dark content. Take a look at the documents that come back:

{
  Body: "In the John Carter movie (2012), he shows off some of his powers, like jumping abnormally high, but I have difficulty evaluating his strength. On the one side, he shows great strength, as when he kills a thark warrior with one hand, but he is also quite mistreated by them. He also seems helpless when he is strangled by Tars Tarkas. Why does the strength he shows seem so inconsistent? ",
  BodyBlurred: "tv great movi control kill consid hand dark side power long mutant fight machin light abil sauron wormtail hulk"
},
{
  Body: "In the movies, the Nazgul ride black horses with armour. I was wondering if that is all they are, or do they have some sort of magic? Are they evil?",
  BodyBlurred: "movi black magic dark demon engin hous aveng slytherin"
},
{
  Body: "The remaining Black Brother from the prologue of A Game of Thrones is apparently the deserter who is beheaded in the beginning of the book. But how did he manage to get to Winterfell from the other side of The Wall? Or did the show throw me off track and in the book there weren't any survivors, so the deserter is someone else? ",
  BodyBlurred: "book watch black hole dark side plai long game demon engin light turn district"
},
{
  Body: " Was this ever discussed in any episode, or as a side-plot somewhere? ",
  BodyBlurred: "episod dark side light"
}

Not bad: most of those topics are rather… dark. Check out that last result, though. So… maybe there are still some improvements we can make! But remember that we're dealing with word correlation here, and I can only guess that somewhere else in the corpus dark side-plots and dark episodes were discussed.

Speaking of word correlations, check out this gem:

{
  Body: "You're correct, Enterprise is the only Star Trek that fits into both the original and the new 2009 movie timelines. From the perspective of the Enterprise characters, both are possible futures, given the over-arcing conceit of the show was a Temporal Cold War, so its future is in flux and could line up with either of the timelines we're familiar with, or with an entirely different future. ",
  BodyBlurred: "answer charact place klingon star trek design travel crew watch work movi happen enterpris featur futur exist origin 2009 chang altern timelin war to version event captain gener pictur tng creat iii galaxi theori return alter voyag entir fry turn kirk paradox biff doc marti feder 1955 starship 2015 class hero centuri tempor uss phoenix mirror river 800 ncc 1701 simon conner skynet alisha"
}

The original document involves Star Trek and time travel. And appropriately, the pseudo terms include Star Trek things and time-travel terms… but do you see anything funny? That's right: Biff, Doc, and Marti made their way into the pseudo terms, likely because of their roles in the popular time-travel film “Back to the Future.”

Speaking of the future …

Future Work

Semantic Search with Solr is hot right now. For the upcoming Dublin LuceneRevolution, I know of at least 3 related talks that have been submitted (one of them my own); I have heard that MapR is working on a Solr Semantic Search/Recommendation engine built atop their Hadoop offering; and I suspect that with Cloudera's recent foray into Solr with Mark Miller, they will also be working on the same thing.

What's next for our work? Recommendations! (Remember, that's how we started this conversation.) E-commerce recommendations are a simple extension of the work presented above. Given an inventory catalogue (e.g. product title, description, etc.) and a history of user purchases, we can build a search-aware recommendation engine. That is, when a customer searches for a particular item, they will receive results as usual, except that the results will be boosted with items that they are more likely to purchase. How? Because we know what type of customer they are and what products that type of customer is more likely to buy!