Yesterday, John and I gave a talk to the DC Hadoop Users Group about using Mahout with Solr to perform Latent Semantic Indexing, that is, calculating and exploiting the semantic relationships between keywords. While we were there, I realized that many people could benefit from a bigger-picture, less in-depth point of view beyond our specific story. Where do Mahout and Solr fit together in general? What does that relationship look like, and how does one exploit Mahout to make search even more awesome? So I thought I'd blog about how you too can start putting these pieces together to simultaneously exploit Solr's search and Mahout's machine learning capabilities.

The root of how this all works is a slightly obscure feature of Lucene-based search: term vectors. Lucene-based search applications give you the ability to generate term vectors from the documents in the search index. The feature is often turned on to support specific search features, but beyond that it can seem weird and opaque to beginners. What is a term vector, you might ask? And why would you want to get one?

## What is a term vector?

To answer the first question: a document's term vector is simply a listing, for each term in the entire corpus, of how frequently that term occurs in this document. So if our corpus is these three documents:

```
doc 1: brown dog
doc 2: brown unicorn
doc 3: unicorn eats unicorn
```

We could choose to represent a document by assigning a column to each term in the corpus and a row to each document. The number at each row/column would correspond to the term's frequency in that document. We refer to each row as a term vector for a document and the entire set of term vectors as a term-document matrix. Here's the term-document matrix for the above corpus, containing one term vector per document:

doc | brown | dog | eats | unicorn |
---|---|---|---|---|
doc1 | 1 | 1 | 0 | 0 |
doc2 | 1 | 0 | 0 | 1 |
doc3 | 0 | 0 | 1 | 2 |

As stated above, each document's term vector has a spot for every term that occurs anywhere in the corpus. As you can imagine, with real documents in real corpora with lots of terms, this produces term vectors that are mostly 0. The mostly-0 nature of this matrix is why we call these matrices (and their component vectors) sparse.
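
The construction above is easy to sketch in a few lines of Python (a toy illustration only, not how Mahout or Lucene actually store vectors; real implementations use sparse data structures precisely because of all those zeros):

```python
# Toy corpus from the article; each document is a bag of terms.
docs = {
    "doc1": "brown dog",
    "doc2": "brown unicorn",
    "doc3": "unicorn eats unicorn",
}

# Vocabulary: every term in the corpus, in a fixed (sorted) column order.
vocab = sorted({term for text in docs.values() for term in text.split()})

# One term vector per document: the term's frequency in each vocabulary slot.
term_vectors = {
    name: [text.split().count(term) for term in vocab]
    for name, text in docs.items()
}

print(vocab)                 # ['brown', 'dog', 'eats', 'unicorn']
print(term_vectors["doc3"])  # [0, 0, 1, 2]
```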

## What can you do with this information?

Using the mathematical concept of a vector has pretty specific implications. If you recall your high school geometry, you probably remember a vector as a kind of arrow coming from the origin of a graph (0,0) with a pointy thingy at the vector's coordinates ((1,2), for example). If you learned anything past high school math, you also know there's such a thing as a 3D vector: it has three numbers because in 3D space we have coordinates for x, y, and z.

Ok, now stretch your mind and think about the term vectors above. These vectors have frequencies for all four terms in the corpus; they are four-dimensional vectors. And this is a very trivial example. In our testing with the SciFi StackExchange data set, there are roughly 36,000 terms, meaning vectors 36,000 wide, with each individual vector mostly 0s. Luckily, the same math that applies in two and three dimensions can be applied to vectors of any dimensionality.

Now that we've brought it back to our high school math, let's remember what kinds of things we learned about vectors back then:

- We can take the dot product of two vectors: `x1*x2 + y1*y2 + z1*z2 + ...` for more dimensions
- We can calculate a vector's magnitude: `sqrt(x1^2 + y1^2 + ...)`
- We can add two vectors: `[x1+x2, y1+y2, ...]`
- Etc., etc.
- Hazy memories of some trig stuff
- And a guy named Oscar who always wanted to copy my homework
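
Those first three operations take only a few lines each; here's a minimal Python sketch, with plain lists standing in for vectors of any dimensionality:

```python
import math

def dot(u, v):
    # x1*x2 + y1*y2 + ... for any number of dimensions
    return sum(a * b for a, b in zip(u, v))

def magnitude(u):
    # sqrt(x1^2 + y1^2 + ...)
    return math.sqrt(sum(a * a for a in u))

def add(u, v):
    # [x1+x2, y1+y2, ...]
    return [a + b for a, b in zip(u, v)]

print(dot([1, 2], [3, 4]))   # 11
print(magnitude([3, 4]))     # 5.0
print(add([1, 2], [3, 4]))   # [4, 6]
```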

When you realize that the x, y, z, ... above are term frequencies in a term vector, vector math turns out to have very practical implications. Let's think through an example that might awaken your dormant math neurons! When term vectors point in the same direction in highly dimensional space, it's because a lot of terms overlap between those documents. One way to measure that is the dot product. For example, consider the dot product of document 1 ("brown dog") and document 3 ("unicorn eats unicorn"). We can detect that there's no overlap by taking the dot product of the two four-dimensional vectors:

doc | brown | dog | eats | unicorn | |
---|---|---|---|---|---|
dotprod(doc1,doc3) | 1*0+ | 1*0+ | 0*1+ | 0*2 | ==0 |

Compare this to the dot product of two documents that do overlap, document 2 ("brown unicorn") and document 3 ("unicorn eats unicorn"): here the "unicorn" parts of the dot product amplify each other, letting us know there's some similarity:

doc | brown | dog | eats | unicorn | |
---|---|---|---|---|---|
dotprod(doc2,doc3) | 1*0+ | 1*0+ | 0*1+ | 1*2 | ==2 |

Pretty cool, huh! And this is a pretty minor example. If you like finding practical ways to exploit cool math, here's an area just for you. It gets even cooler when you work with the whole set of term vectors as a matrix; there's really neat stuff you can do with that too!
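
To make the two dot products above concrete, here's a quick Python check, with the vector columns in [brown, dog, eats, unicorn] order as in the tables:

```python
doc1 = [1, 1, 0, 0]  # "brown dog"
doc2 = [1, 0, 0, 1]  # "brown unicorn"
doc3 = [0, 0, 1, 2]  # "unicorn eats unicorn"

def dot(u, v):
    # Sum of the element-wise products of the two vectors.
    return sum(a * b for a, b in zip(u, v))

print(dot(doc1, doc3))  # 0 -> no shared terms
print(dot(doc2, doc3))  # 2 -> overlap on "unicorn"
```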

So where do I go to get gobs of good math to apply to giant vectors/matrices?

## This is where Mahout steps in

Mahout is a machine learning library built on top of Hadoop. Machine learning sounds hard, but it's just marketing speak for math you don't know yet. Hadoop is a MapReduce framework for managing jobs, plus a storage system, for working on gigantic data sets. So a more realistic term-document matrix, say one with a billion documents and a million terms, might actually be usable.

Mahout has all the tidbits, beginner and advanced, that can help us do very interesting machine learning (read: math) processing on our giant term-document matrices. For example, we can cluster the rows, or use similarity metrics to calculate distances between our term vectors. We can fill in some of the empty spots in our sparse vectors by detecting relationships between terms ("banana" and "peel" commonly co-occur, so let's score this document for "banana" too!), and much, much more.
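
As a toy illustration of that co-occurrence idea (not Mahout's actual algorithms), we can compare the *columns* of our little term-document matrix: terms whose columns point in similar directions tend to occur in the same documents.

```python
import math

# Term-document matrix from earlier: rows are docs, columns are terms
# in [brown, dog, eats, unicorn] order.
matrix = [
    [1, 1, 0, 0],  # doc1: "brown dog"
    [1, 0, 0, 1],  # doc2: "brown unicorn"
    [0, 0, 1, 2],  # doc3: "unicorn eats unicorn"
]
terms = ["brown", "dog", "eats", "unicorn"]

def column(matrix, j):
    # A term's column: its frequency in each document.
    return [row[j] for row in matrix]

def cosine(u, v):
    # Cosine similarity: dot product scaled by both magnitudes.
    d = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return d / norm if norm else 0.0

eats = column(matrix, terms.index("eats"))
dog = column(matrix, terms.index("dog"))
unicorn = column(matrix, terms.index("unicorn"))

print(round(cosine(eats, unicorn), 3))  # 0.894 -- co-occur in doc3
print(cosine(dog, unicorn))             # 0.0   -- never co-occur
```

"eats" and "unicorn" look related because they share a document, while "dog" and "unicorn" never meet; that's the germ of the trick that lets "peel" boost a document about bananas.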

Each of these processes has its own mathematics behind it. For many of them, like singular value decomposition and K-means clustering, you can at least get a basic appreciation for what they're doing, and with a little work you can get your head around the math by brushing off some old textbooks. We've also got a great presentation on Singular Value Decomposition, which might be a great place to start equipping your search application with some machine learning.

## Get Started!

So great, where do you start playing around using Mahout with your term vectors? You'll need to get the term vectors into a format that Hadoop/Mahout can understand. Mahout gives us a nice utility called lucene.vector (documented at https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text) for dumping term vectors from a Lucene index and saving the information you need to feed data back into the search index. From this point on, the sky is the limit. Learn some cool math and machine learning and find interesting, fun ways you can play with your term vectors! Have fun!
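
As a rough sketch, invoking that utility from the command line looks something like this (flag names follow the Mahout wiki page; the paths and field names are examples, so check everything against your own index and Mahout version):

```shell
# Dump term vectors for the "text" field of a Lucene index into
# Mahout's vector format, writing the term dictionary alongside
# so results can later be mapped back to terms and documents.
$MAHOUT_HOME/bin/mahout lucene.vector \
  --dir /path/to/lucene/index \
  --field text \
  --idField id \
  --output /path/to/output/vectors \
  --dictOut /path/to/output/dictionary.txt \
  --norm 2
```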
