
Semantic Search with Latent Semantic Analysis

A few years ago John Berryman and I experimented with integrating Latent Semantic Analysis (LSA) with Solr to build a semantically aware search engine. Recently I’ve dusted that work off, integrated it with Elasticsearch, and sunk my teeth a few levels deeper. I wanted to get a sense of whether this technique could be made genuinely useful for building semantically aware search applications, and how exactly to do that. It turns out getting LSA to work requires a lot of data cleaning, feature modeling, and careful tuning.

What is LSA? How might it improve search relevance?

So we all know how a search engine works. Users enter search terms and the search engine uses an index to match documents that contain those terms. Search engines work well, but require very strict term matching. Search engines don’t know, without help, that “cat” and “kitty” correspond to similar ideas.
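
To make that concrete, here’s a little sketch of strict term matching (a toy inverted index in Python with made-up documents, nothing like what Solr or Elasticsearch actually do under the hood):

```python
# Toy documents; "doc2" talks about a kitty but never uses the word "cat".
docs = {
    "doc1": "the dog chased the pooch",
    "doc2": "what a sweet little kitty",
}

# Build a tiny inverted index: term -> set of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

print(index.get("kitty", set()))  # {'doc2'}
print(index.get("cat", set()))    # set() -- the engine has no idea "kitty" implies "cat"
```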

Looking at the search documents another way, you can discern relationships between terms and documents that get at broader ideas. One way to view these documents is as a big term-to-document matrix that maps each term to each document. Each entry records how often that term occurs in that document, something like this:

TermDocMatrix:

         doc1  doc2  doc3  doc4
cat         0     1     0     2
kitty       0     2     0     0
the         6     5     4     5
dog         2     0     1     0
pooch       3     0     3     0

If you look at this matrix, patterns jump out at the savvy reader. Some documents are clearly about cats (doc 2 and probably doc 4). Other documents are about dogs (docs 1 and 3). One might even look at this matrix and realize that the word “kitty” is pretty closely related to “cat.” Armed with this insight, we could take a user’s search for “kitty” and return document 4, as it seems to have a very strong affinity to the idea of “catness” without ever mentioning the term “kitty.”
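
If you want to follow along at home, here’s one way to build a term-document matrix like this one. The tiny corpus below is made up, and scikit-learn’s CountVectorizer is just a convenient stand-in for whatever term counting your search stack does:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A made-up toy corpus standing in for the documents above.
corpus = [
    "the dog chased the pooch around the yard",    # doc1: doggy
    "my cat is such a sweet kitty kitty",          # doc2: catty
    "the pooch and the dog slept in the sun",      # doc3: doggy
    "the cat sat on the mat with the other cat",   # doc4: catty
]

# CountVectorizer counts terms per document; its rows are documents,
# so we transpose to get the term-to-document matrix described above.
vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(corpus)   # shape: (num_docs, num_terms)
term_doc = doc_term.T.toarray()               # shape: (num_terms, num_docs)

for term, counts in zip(vectorizer.get_feature_names_out(), term_doc):
    print(f"{term:>8} {counts}")
```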

Latent Semantic Analysis is one technique that attempts to recognize these patterns. LSA runs a matrix operation called Singular Value Decomposition (SVD) on the term-document matrix. An SVD can be interpreted in many ways. Fundamentally, it factors the matrix into a simpler form. Mathematically, it’s a way of saying that our matrix can be decomposed (some might say compressed) around the corpus’s main ideas:

TermDocMatrix = catness + dogness

Most importantly, when you look into “catness” and “dogness” you’ll see the components give you insight into that topic’s overall strength, each term’s affinity for that topic, and each document’s affinity for that topic, as in:

catness = eachTermsCatness * corpusCatnessStrength * eachDocsCatness

Here:

  • eachTermsCatness: a vector that corresponds to every term’s affinity for the “cat” idea
  • eachDocsCatness: a vector that corresponds to every document’s affinity for the “cat” idea
  • corpusCatnessStrength: how strongly catness occurs in the corpus

It’s beautiful! If we look at “eachTermsCatness” we might see something like the following, listing every term in our corpus and its affinity to the idea of “cat”:

<cat: 2.0, kitty: 1.8, the: 0.3, dog: 0.0, pooch: 0.0>
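
Here’s a small numeric sketch of the whole decomposition, using numpy’s SVD on the toy matrix from above (the affinity numbers it prints are illustrative, just whatever SVD produces for this made-up data):

```python
import numpy as np

terms = ["cat", "kitty", "the", "dog", "pooch"]
term_doc = np.array([
    [0, 1, 0, 2],   # cat
    [0, 2, 0, 0],   # kitty
    [6, 5, 4, 5],   # the
    [2, 0, 1, 0],   # dog
    [3, 0, 3, 0],   # pooch
], dtype=float)

# SVD factors the matrix into U (each term's affinity for each concept),
# S (each concept's overall strength in the corpus), and
# Vt (each document's affinity for each concept).
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)

# Each rank-1 piece U[:, i] * S[i] * Vt[i, :] is one "idea" -- the
# catness / dogness components described above, strongest ideas first.
for i, strength in enumerate(S):
    print(f"concept {i}: corpus strength {strength:.2f}")
    for term, affinity in zip(terms, U[:, i]):
        print(f"  {term:>6}: {affinity:+.2f}")
```

Run it and you’ll notice the strongest concept in this raw matrix mostly captures the word “the” rather than catness or dogness, a small preview of why the data cleaning and feature modeling mentioned at the top matter so much in practice.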
