In Chapter 4 of Relevant Search, we talk a LOT about Elasticsearch analyzers. Without analyzers, your search engine would be a rather unintelligent string comparison system instead of a smart, powerful search engine. Analyzers are the text-processing pipeline that feeds the search engine’s core data structures, controlling whether two tokens (basically words) match during a search.
In this article I want to motivate you to build your OWN analyzers. If you can master analyzers, you can take direct control of the seeming intelligence inside Elasticsearch. This isn’t a coding tutorial; rather, I want you to see exactly why you would reach for a custom analyzer. It’s something we do in almost every search project, yet we find many developers feel intimidated by them. Let’s explore more philosophically why analyzers give you tremendous, exacting control over the search engine’s ranking behavior.
What are Analyzers? Why Do We Need Them?
At the very core of the search engine is a highly tuned but pretty dumb data structure for matching strings. It’s so exacting and dumb that it wouldn’t be able to tell that CAT and cat are the same. To the data structure inside the search engine, these two strings are distinct arrays of UTF-8 bytes, not variants of the same word. CAT (\x43\x41\x54) and cat (\x63\x61\x74) don’t match, so end of story: a search for CAT fails to find documents with cat.
Not so fast, say your analyzers :) Analyzers turn strings into normalized tokens before searching/updating that dumb data structure. One step in analysis, for example, could be forcing text to be lowercased. This converts all instances of CAT to cat, causing the two variants to match, thus overriding the unintelligence inherent in the actual data structure.
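If you’d like to see that in miniature, here’s a rough Python sketch. It’s a toy: a plain dictionary stands in for the search engine’s real data structure, and the function names (identity_analyzer, lowercase_analyzer, build_index, search) are made up for illustration, not anything from Elasticsearch. It indexes the same documents twice, once with no analysis and once with a lowercasing step, and then searches for CAT.

```python
from collections import defaultdict


def identity_analyzer(text):
    """No analysis: split on whitespace and keep each token exactly as written."""
    return text.split()


def lowercase_analyzer(text):
    """One analysis step: split on whitespace, then lowercase every token."""
    return [token.lower() for token in text.split()]


def build_index(docs, analyzer):
    """Toy inverted index: maps each exact token string to the ids of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index


def search(index, query, analyzer):
    """Analyze the query the same way as the documents, then do exact lookups."""
    results = set()
    for token in analyzer(query):
        results |= index.get(token, set())
    return results


docs = {1: "my cat sleeps all day", 2: "dogs bark"}

raw_index = build_index(docs, identity_analyzer)
print(search(raw_index, "CAT", identity_analyzer))     # set(): the bytes differ, so no match

lower_index = build_index(docs, lowercase_analyzer)
print(search(lower_index, "CAT", lowercase_analyzer))  # {1}: both sides became "cat"
```

Notice the lookup itself never got any smarter. The only thing that changed is what was fed into it, on both the indexing side and the query side.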
Other forms of analysis are more language specific. For example, stemming is used to convert words to a root form. For English, the search engine’s dumb data structure can’t tell that running and run are the same. Yet with appropriate English stemming, running might be converted to run, allowing them to match. Other languages, from German to Chinese to Icelandic, of course have their own unique rules for performing these sorts of operations (and even for determining what constitutes a word!).
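Here’s a tiny sketch of the stemming idea in Python. To be clear about the assumptions: toy_stem is a deliberately naive suffix stripper I made up for illustration, nowhere near what a real English stemmer does, and analyze is just a hypothetical name for "lowercase, then stem".

```python
def toy_stem(token):
    """Deliberately naive English stemmer, for illustration only.

    Real stemmers apply far more rules and handle far more exceptions.
    """
    for suffix in ("ing", "ed", "s"):
        # Only strip a suffix if at least three characters remain
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return token


def analyze(text):
    """Split on whitespace, lowercase, then stem each token."""
    return [toy_stem(token.lower()) for token in text.split()]


print(analyze("She is RUNNING"))             # ['she', 'is', 'run']
print(analyze("running") == analyze("run"))  # True: both end up as "run"
```

Because both the document text and the query text pass through the same steps, running and run land on the same token and can finally match.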
Analyzers Are Ultimate Power!
Ok big whoop, so just pick the analyzer for your language and move on, right? Not so fast. Analyzers actually are a vital control point in your search solution. By creating your own custom analyzers, YOU decide when two terms are equivalent and should match in that dumb data structure. You can convert any snippet of text into SOMETHING that’ll match. Or go the other way: ensure two pieces of text are analyzed so that they definitely won’t match. Both could be valid choices, depending on the application.
As an example, let’s say you’re building search for SciFi fans, and you’d like to show them web pages that discuss characters, ideas, etc. from their favorite franchise (Star Trek, Star Wars, etc.). Users search for Star Trek and Star Wars, expecting documents relevant to their topic to come back. It’d kind of work. But super fans don’t really need to mention the franchise at all in their discussions. Whole documents being searched read like:
Captain Picard is way cooler than Han Solo. Han Solo shot first, sure, but Picard stood down The Borg, Admiral Tomalak, and a whole slew of others.
You search for “Star Trek AND Star Wars” and this won’t be found anywhere!
Ah, but analysis gives you an opportunity to make terms equivalent. One way you could use an analyzer is to curate a set of synonyms, indicating the relationship between prominent characters and their franchise. Synonyms are another step you can add to the analysis process. For example, here’s our sample synonym file:
Captain Picard => Star Trek
Han Solo => Star Wars
Ned Stark => Game of Thrones
Ok so next you’d create a custom analyzer with just a synonym step. Now if we take our sentence and run it through analysis:
Captain Picard is way cooler than Han Solo.
this would be transformed into these tokens
[Star Trek] [is] [way] [cooler] [than] [Star Wars]
and of course, if we’re doing stemming, lowercasing, and several other useful steps for English, we could arrive at
[star trek] [is] [way] [cool] [than] [star war]
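To make the synonym step less abstract, here’s a small Python sketch of it. Assumptions up front: synonym_analyze and the SYNONYMS dictionary are hypothetical names of mine, the sketch lowercases before matching, it does no stemming, and it keeps each franchise as one token to mirror the bracket notation above. A real search engine may well represent multi-word synonyms differently.

```python
import re

# The sample synonym file from above, lowercased: character name -> franchise
SYNONYMS = {
    "captain picard": "star trek",
    "han solo": "star wars",
    "ned stark": "game of thrones",
}


def synonym_analyze(text):
    """Lowercase, split into words, then replace known phrases with one franchise token."""
    words = re.findall(r"\w+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first (up to three words in this sketch)
        for n in (3, 2, 1):
            chunk = words[i:i + n]
            phrase = " ".join(chunk)
            if len(chunk) == n and phrase in SYNONYMS:
                tokens.append(SYNONYMS[phrase])
                i += n
                break
        else:
            # No synonym starts at this word: emit it unchanged
            tokens.append(words[i])
            i += 1
    return tokens


print(synonym_analyze("Captain Picard is way cooler than Han Solo."))
# ['star trek', 'is', 'way', 'cooler', 'than', 'star wars']
```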
NOW your search for star trek ought to match something! You can cast a much wider net now. Moreover, as more names, ideas, etc. in the document correspond to stuff from the Star Trek franchise, the relevance score for those documents will increase as they’re counted as additional matches on star trek! By making terms equivalent, you’re controlling not just how they match, but how they’re ranked and scored!
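A quick way to get a feel for the "additional matches" point is to count tokens after analysis. The sketch below is only that, a sketch: a raw term count is a crude stand-in for a real relevance score (search engines use more refined formulas), term_counts is a hypothetical helper, and the two extra synonym entries for picard and the borg are my own additions, not part of the sample synonym file.

```python
import re
from collections import Counter

# The sample synonyms plus two hypothetical extra entries
SYNONYMS = {
    "captain picard": "star trek",
    "picard": "star trek",      # hypothetical addition
    "the borg": "star trek",    # hypothetical addition
    "han solo": "star wars",
}

DOC = (
    "Captain Picard is way cooler than Han Solo. Han Solo shot first, sure, "
    "but Picard stood down The Borg, Admiral Tomalak, and a whole slew of others."
)


def term_counts(text, synonyms):
    """Lowercase, rewrite known phrases to one franchise token, and count every token."""
    lowered = text.lower()
    # Longest phrases first, so "captain picard" is handled before plain "picard"
    for phrase in sorted(synonyms, key=len, reverse=True):
        token = synonyms[phrase].replace(" ", "_")  # keep the franchise as a single token
        lowered = re.sub(r"\b" + re.escape(phrase) + r"\b", token, lowered)
    return Counter(re.findall(r"\w+", lowered))


counts = term_counts(DOC, SYNONYMS)
print(counts["star_trek"], counts["star_wars"])  # 3 2
```

Every extra name you map to the franchise is another occurrence of the franchise token in the document, which is exactly the kind of signal a scoring formula can pick up on.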
That first stab was only the tip of the iceberg, too: the number of options for controlling how tokens are emitted is endless. You can inject an entire library of filters and other steps into the process. You can reorder them. Do anything you need to solve your problem. Search isn’t hard – it’s just programming :).
For example, you might decide to preserve the original text by changing how the synonyms are applied (here | means two tokens share space):
[star trek|captain picard] [is] [way] [cool] [than] [star war|han solo]
or filter out all the terms that DON’T match a list you supply (known as keepwords):
[star trek|captain picard] … [star war|han solo]
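Both variations are easy to mock up. In this Python sketch (again a toy with hypothetical names: analyze_keep_original and keep_words are mine, and there is no stemming), each position in the token stream is a list of terms, so a synonym and the original phrase can share the same space, and the keepwords step simply drops every position that contains nothing from the keep list.

```python
import re

# Two-word character names mapped to their franchise (from the sample synonyms)
SYNONYMS = {"captain picard": "star trek", "han solo": "star wars"}


def analyze_keep_original(text):
    """Return one list of terms per position; a synonym shares its position with the original."""
    words = re.findall(r"\w+", text.lower())
    positions = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in SYNONYMS:
            # Emit the franchise AND the original phrase at the same position
            positions.append([SYNONYMS[pair], pair])
            i += 2
        else:
            positions.append([words[i]])
            i += 1
    return positions


def keep_words(positions, keep):
    """Drop every position that holds no term from the keep list."""
    return [terms for terms in positions if any(term in keep for term in terms)]


stacked = analyze_keep_original("Captain Picard is way cooler than Han Solo.")
print(stacked)
# [['star trek', 'captain picard'], ['is'], ['way'], ['cooler'], ['than'], ['star wars', 'han solo']]

print(keep_words(stacked, {"star trek", "star wars"}))
# [['star trek', 'captain picard'], ['star wars', 'han solo']]
```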
The options are endless! The real takeaway is that with analyzers, you needn’t just pick an analyzer for a language. Instead, you can string together different pieces like Lego blocks for manipulating text. Analyzers can be constructed of all sorts of different components, including:
- character filters that control the processing of the string before it’s converted to tokens
- a single tokenizer that controls how the string is converted into tokens
- token filters that allow you to manipulate each token (synonyms and lowercasing are token filters).
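To show how those three kinds of pieces snap together, here’s a sketch of what index settings for a custom analyzer might look like, written as a Python dictionary and printed as JSON. The names scifi_text, franchise_synonyms, and english_stemmer are made up for this example, and the particular choice of components (the html_strip character filter, the standard tokenizer, and the lowercase, synonym, and stemmer token filters) is just one plausible arrangement. Check the Elasticsearch documentation for your version before relying on the exact options.

```python
import json

# Hypothetical index settings: one custom analyzer assembled from
# a character filter, a single tokenizer, and a chain of token filters.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Token filter built from the sample synonym file
                "franchise_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "captain picard => star trek",
                        "han solo => star wars",
                        "ned stark => game of thrones",
                    ],
                },
                # Token filter that reduces English words to a root form
                "english_stemmer": {"type": "stemmer", "language": "english"},
            },
            "analyzer": {
                "scifi_text": {
                    "type": "custom",
                    "char_filter": ["html_strip"],   # runs on the raw string first
                    "tokenizer": "standard",         # exactly one tokenizer
                    "filter": [                      # token filters run in this order
                        "lowercase",
                        "franchise_synonyms",
                        "english_stemmer",
                    ],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The order of the token filters is a design choice: lowercasing comes first so the lowercase synonym rules can match, and stemming comes last so it also applies to the injected franchise terms. Reordering the list changes which tokens come out, which is precisely the kind of control this article is about.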
The point is there’s simple power here that can improve your search results. You can manipulate analyzers and get tremendous gains in better, more relevant search. You can directly understand how matching works, control it, manipulate it precisely to your needs. A simple tweak here can solve a search problem efficiently, leaving you in absolute control.
Better Text! More than Text!
If you want to get a deeper, hands-on appreciation for this analyzer stuff, I really recommend Chapter 4 of Relevant Search. It’s probably one of the better chapters in the book. I didn’t even personally write it, but I have been recommending it to anyone getting started in search and search relevance! And spoiler alert: one of the coolest things we show is you can tokenize FAR MORE than text. Heck look here, I tokenized an image into “tokens” that are really RGB values :) Search engines are awesome!