Elyzer: Step-by-Step Elasticsearch Analyzer Debugging

September 22, 2015 Doug Turnbull
Category: Relevancy

I love stringing together custom analyzers to solve my search problems. Analyzers control how search and document text are transformed, step by step, into individual terms for matching. This in turn gives you tremendous low-level control of your relevance.

Yet one thing has always bugged me about Elasticsearch: you can’t easily inspect the step-by-step behavior of an analyzer. You have the _analyze API, which goes a long way by showing you the final output of the lengthy analysis process. But you can’t pry into each step to see what’s happening.

For example, from our book, we have an analyzer that turns text into two-word terms for a very specific kind of matching:

GET http://localhost:9200/tmdb/_analyze?analyzer=english_bigrams
captain picard was cool

{
  "tokens": [
    {
      "position": 1,
      "type": "shingle",
      "end_offset": 14,
      "start_offset": 0,
      "token": "captain picard"
    },
    {
      "position": 2,
      "type": "shingle",
      "end_offset": 18,
      "start_offset": 8,
      "token": "picard wa"
    },
    {
      "position": 3,
      "type": "shingle",
      "end_offset": 23,
      "start_offset": 15,
      "token": "wa cool"
    }
  ]
}

This gives you SOME information. But there are plenty of questions left. Why was “was” turned into “wa”? What happened in between? How was the text tokenized? What happened at each step?

Introducing elyzer

Rather than just working with the end result, you’d like to see every step. This is exactly what elyzer does. Instead of the JSON output above, it gives you friendlier step-by-step debugging output like so:

doug$ elyzer --es http://localhost:9200 --index tmdb --analyzer english_bigrams --text "captain picard was cool"
TOKENIZER: standard
{1:captain} {2:picard}  {3:was} {4:cool}
TOKEN_FILTER: standard
{1:captain} {2:picard}  {3:was} {4:cool}
TOKEN_FILTER: lowercase
{1:captain} {2:picard}  {3:was} {4:cool}
TOKEN_FILTER: porter_stem
{1:captain} {2:picard}  {3:wa}  {4:cool}
TOKEN_FILTER: bigram_filter
{1:captain picard}  {2:picard wa}   {3:wa cool}

Each {} just shows the token position, a colon, then the token string.
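To make that format concrete, here is a minimal Python sketch that renders an _analyze-style response in the same {position:token} notation. The function name format_tokens is hypothetical and is not part of elyzer; the token data is copied from the _analyze response shown earlier, trimmed to the two fields the formatter needs.

```python
import json


def format_tokens(analyze_response):
    """Render _analyze-style tokens as '{position:token}' strings on one line."""
    # Sort by position so the output reads left to right like the token stream.
    tokens = sorted(analyze_response["tokens"], key=lambda t: t["position"])
    return " ".join("{%d:%s}" % (t["position"], t["token"]) for t in tokens)


# Token data copied from the _analyze example earlier in this post,
# keeping only the "position" and "token" fields.
response = json.loads("""
{"tokens": [
  {"position": 1, "token": "captain picard"},
  {"position": 2, "token": "picard wa"},
  {"position": 3, "token": "wa cool"}
]}
""")

print(format_tokens(response))
# prints: {1:captain picard} {2:picard wa} {3:wa cool}
```

Unlike the elyzer output, this sketch separates tokens with a single space and does not try to align columns between steps.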

Here I can see, for example, that “was” turned into “wa” because of the porter_stem filter, which runs before bigram_filter in my analyzer.
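For readers who want to picture the chain, here is a rough Python sketch of what the english_bigrams settings might look like, reconstructed only from the stage names in the elyzer output above. The shingle parameters for bigram_filter are assumptions, not the actual definition. The helper analysis_stages is also hypothetical: it lists the tokenizer alone, then the tokenizer with each filter added in turn, which is one way step-by-step inspection could be framed. It is not necessarily how elyzer works internally.

```python
# Hypothetical reconstruction of the analyzer, inferred from the elyzer output.
# The bigram_filter parameters below are assumptions.
ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            "bigram_filter": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": False,
            }
        },
        "analyzer": {
            "english_bigrams": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["standard", "lowercase", "porter_stem", "bigram_filter"],
            }
        },
    }
}


def analysis_stages(analyzer):
    """Yield (label, tokenizer, filters_so_far) for each step of the chain."""
    tokenizer = analyzer["tokenizer"]
    # First stage: the tokenizer with no token filters applied.
    yield ("TOKENIZER: " + tokenizer, tokenizer, [])
    filters = analyzer.get("filter", [])
    # Each later stage adds one more filter, keeping the configured order.
    for i, name in enumerate(filters, start=1):
        yield ("TOKEN_FILTER: " + name, tokenizer, filters[:i])


english_bigrams = ANALYSIS_SETTINGS["analysis"]["analyzer"]["english_bigrams"]
for label, tokenizer, filters in analysis_stages(english_bigrams):
    print(label, "->", tokenizer, filters)
```

Because each stage is a prefix of the filter list, the order matters: bigram_filter only ever sees tokens that porter_stem has already stemmed, which is why the bigrams contain “wa” rather than “was”.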

Being able to debug analyzers makes them much easier for me to work with! I can control exactly how my text is expressed to the search engine, which gives me tight control over how it matches and ranks.

I hope you check out Elyzer and give us feedback! Instructions for installation are up at the GitHub repo. Also, check out our other search projects, Quepid and Splainer. Don’t hesitate to get in touch to chat about how we can help you with your search problems!