I’m sure if you run in search or NLP circles, you’ve heard of BERT. It’s what Google famously used to improve 1 out of 10 searches, in what they claim is one of the most significant improvements in the company’s history. Even my favorite neural search skeptic had to write a thoughtful mea culpa.
As Max Irwin wrote about in his overview The T in BERT stands for Transformer, which a freakishly accurate deep-learning based language model, able to predict with alarming accuracy what the ‘next word’ should be. We want to start getting into the nitty gritty. We’ll show you what you need to know to be a successful search relevance engineer in this brave new BERT future.
First part of this deep dive, before we jump to the cool Transformer architecture, it’s important to first understand the background & motivation for BERT. What does BERT solve that traditional search engines, like Lucene/Solr/Elasticsearch, don’t solve? What does BERT do that word2vec wouldn’t do for a search application?
Let’s start by first understanding the basics of where search engine technology is now, it’s roots and recent evolution, te see where BERT fits into the landscape.
Sparse Vectors, The Inverted Index and You!
Search engines like Lucene come with a baked in TF*IDF scoring method (such as BM25). This is how we score a ‘similarity’ between a query and a document. The idea is pretty intuitive, with every document, the relevance score of a search term match in that document in proportional to:
- Term Frequency – how often does that search term occur in this doc? More occurrences – more relevant.
- Inverse Document Frequency – how rare is the term? Rare terms are more specific to the user’s intent.
- Inverse Document Length – how many terms does this document have? If “Skywalker” occurs once in a tweet it’s more about skywalker than once in a book
TF*IDF based try to measure the ‘aboutness’ of the text. Higher TF*IDF, means the text in more ‘about’ the term. Terms are usually words, but as I write in Relevant Search in practice they might be any discrete attribute, including words with parts of speech, concepts in a knowledge graph, pitches in a song, attributes of an image, or even the pixels in an image themselves.
This is famously known as the classic ‘bag of words’ vector space model. Each term in the corpus is a dimension in some N-dimensional space. If we take a piece of text, we could compute a TF*IDF score for every term in our corpus. We could compute a sparse vector where each dimension corresponds to a single term. In the table below, several terms from movie scripts are given BM25 scores. If we were to fully flesh out this sparse vector representation, we would arrive at possible hundreds of thousands of terms, with most terms getting a score of 0 for a document (as it doesn’t occur in that doc – that’s what makes it ‘sparse’!).
|Empire Strikes Back||…||0.0||3.2||13.4||4.1||0.0||0.0||0.0||…|
|Cool Hand Luke||0.0||1.4||0.0||0.0||9.4||0.0||0.5|
A query also is a bit of text we might represent in the sparse vector space:
If you multiplied this query vector, element wise with each doc, (aka a dot product), and came out with a number, you’d get something like this set of scores:
|Empire Strikes Back||16.6|
|Cool Hand Luke||1.4|
Viola, you’d get some kind of similarity or relevance score, where the Star Wars movies ranked higher for Star Wars character Luke Skywalker than spurious mentions of Luke.
This is great. Of course, search is not some robotic, dictionary lookup system. It involves dealing with the messiness of natural language. This includes a lot of inference and context which human beings bring to the problem. Relevance engineers and search teams know this well, and work to mitigate the problem by augmenting the indexed terms by turning those terms into elements in taxonomies, entity recognition, manually adding synonyms, and other ‘classic’ workarounds.
As we’ll see in a future article, BERT attempts to help us with many of these problems. As perhaps (spoiler) a slightly better starting point for relevance than TF*IDF for ‘aboutness’. But before we get there, we need to understand how search has been augmented with dense-vector approaches from systems like word2vec.
Dense Vector Search
Dense vector representations attempt to turn a sparse vector into something less precise. For search, this helps us ‘open up the window’ for terms close in meaning to be grouped together.
Instead of thousands of terms in our language, if we could represent the documents above with major topics or themes, we might create a denser vector for the movie space, down to perhaps a few hundred themes, or in the table below, 3 dimensions 🙂 :
|Empire Strikes Back||0.9||0.10||0.8|
|Cool Hand Luke||0.01||0.9||0.0|
And similarly for the terms themselves, here we add a term that doesn’t occur literally in the movie dialog, ‘spaceship’
We can perform two dot products per movie (luke_vector * movie) + (skywalker_vector * movie) to get a score.
First the term ‘luke’
|Luke Dot Product|
|Star Wars||0.9*0.5 + 0.01*0.55 + 0.8*0.5 = 0.8555|
|Empire Strikes Back||0.9*0.5 + 0.10*0.55 + 0.8*0.5 = 0.905|
|Cool Hand Luke||0.01*0.5 + 0.9*0.55 + 0.0*0.5 = 0.5|
Repeating for Skywalker:
|Skywalker Dot Product|
|Star Wars||0.9*0.9 + 0.01*0.1 + 0.8*0.8 = 1.45|
|Empire Strikes Back||0.9*0.9 + 0.10*0.1 + 0.8*0.8 = 1.46|
|Cool Hand Luke||0.01*0.9 + 0.9*0.1 + 0.0*0.8 = 0.099|
Summing each term together, we can get a pretty good ‘dense’ relevance score:
|Empire Strikes Back||1.451|
|Cool Hand Luke||0.599|
Even cooler, we can repeat the process, and even get a good score when someone searches for ‘spaceship’, even though Star Wars movie dialog doesn’t actually use the term ‘spaceship’:
|Empire Strikes Back||2.365|
|Cool Hand Luke||0.018|
(Note I’m cheating a little, as usually we’d normalize these vectors or divide by the magnitude of each vector to to perform true cosine similarity, but the intuition is the same)
This is rather neat, as we get a kind of similarity of a search term, with a document regardless of whether that term is directly mentioned or not.
This is pretty much what word2vec does. However, we don’t tell word2vec what each dimension of the dense-vector encoding means. Word2Vec starts by giving every term a random vector:
|dimension 0||dimension 1||dimension 2|
Word2vec then looks at little windows of text, a window of perhaps 6 words:
Luke skywalker flew a spaceship
Skywalker blew up the spaceship
Word2Vec notice the tuples (
spaceship) and (
spaceship). With this ‘positive’ example of luke and spaceship sharing context, word2vec shoves the two terms a little closer together:
|dimension 0||dimension 1||dimension 2|
This process is repeated over and over using the whole corpus. Each time a term is viewed sharing a context with another term (for example ‘confirm’ and ‘accept’ seem to share ‘reservation’ a lot – as in ‘confirm my reservation’/’reservation accepted’) we tweak it closer together. The process gradually pushes and pulls, bubbling words with shared context close and others farther apart.
Ambiguous terms like Luke, those with high document frequency, will get pulled in a lot of different directions, and will end up towards the ‘middle’ of many different contexts. Less ambiguous terms (ie ‘skywalker’) will be pulled into their own little universe of Star Wars stuff. Some terms that occur only once or twice, will still stay pretty close to their random values. In fact, it’s not uncommon to omit terms that are too rare.
Indexing Dense Vector Representations
One challenge with dense-vector encodings is search engines, like Lucene, don’t yet have a consistent method of storing, matching, and scoring queries and documents using dense vectors. The inverted index data structure is optimized for sparse vector matching & scoring. This is why recently you may have seen an explosion of methods for performing approximate nearest neighbor or ANN algorithms. These algorgithms index vectors, support queries to find close neighbors, and score them in a way that approximates the kind of similarity we performed here. All without having to manually perform millions of multiplications for every document in the index. Be sure to support the various ANN algorithms being developed by search engines:
The Conundrum: Sparse vector (precision) vs word2vec Dense vector (recall)
Selecting a dense vector strategy is usually a tradeoff. By loosening up our ‘match’ definition, and opening up recall, we might score highly for terms actually not appropriate for this context. These little contextual and use-case specific exceptions are where search teams get into trouble with word2vec. For example, a ‘dense vector’ representation for “Spock” might be:
Which is nearly identical to ‘spaceship’ in the examples above. Thus ‘Star Wars’ would be equally relevant for a spock and a spaceship query. Pulling back ‘Star Wars’ for ‘spaceship’ intuitively makes sense. However, users might object to ‘Star Wars’ being considered a match for a character in a completely different science fiction franchise.
If we were to only perform direct, ‘sparse vector’ term matches, we would strongly err towards precision. We would only pull back direct term matches. Of course, this feels too restrictive, as we’ve seen in the ‘spaceship’ use case, where we’d like to open up the recall a bit. Even here, without direct term statistics to score on, the balance gets tricky. We might score a topically related document (Empire Strikes Back) just as highly for one that actually uses the term spaceship, like First Spaceship on Venus.
A lot of engineering goes on to balance recall in precision. Either leveraging dense vector strategically to perhaps improve recall. Or to muck with the inverted index (synonyms, taxonomies, knowledge graphs, etc) to make it strategically less precise, in ways that we feel users will tolerate for this domain.
BERT gets at what both sparse and classic dense representations miss
So the state of the world of search has been:
- Not very precise dense vector representations
- Lots of work to augment sparse vector systems to balance precision/recall
- Relevance based on lots of (semi-)manual feature engineering, balanced manually, or with learning to rank
If you step back and think about it, neither sparse nor dense methods capture what it’s like to read and understand a document. As you read a document, you have an evolving mental state that captures not just a small window, like word2vec, but the full subject matter and context of the document. If you were to stop half way in reading a news article, and predict what comes next, you probably could. You probably are trained enough in the language of how news is written, you have understood the context of what is being written about, that you could write prose that more-or-less looked like a news article on that topic.
As we’ll see Transformers model this task. BERT uses transformers to give us a dense vector prediction of what kind-of-word is appropriate to come next at a given point. Not just to complete simple sentences like word2vec could, but to model and flesh out the likelihood, given some context, of what next words are most likely.
Recall earlier, we mentioned how TF*IDF was trying to approximate a notion of ‘aboutness’ of a term mentioned in a document. What if by predicting what the article might be written about directly, we could simulate a dense vector that modeled ‘aboutness’ directly? Could it help overcome these precision/recall tradeoffs? Are there aspects of the current heavy feature engineering regime that we could overcome?
I’m still figuring out these questions myself, as we all are. More on my thoughts time. I’m sure to make some ‘bold’ predictions on the future of all this stuff for us ‘Average Joe’ non-Google search teams!