Heatmaps in Elasticsearch

Heatmap facets are a powerful demonstration of Solr’s geospatial capabilities. Given a corpus of indexed shapes, a heatmap facet will show you where the results of your query fit on a map. This capability has been around since David Smiley’s initial patch was added to Solr 5.1.

Elasticsearch has its own set of geospatial features, but the way it accomplishes heatmap-ish features is by using geohash grids. On the surface these two approaches seem similar, but they are very different at a low level. Geohash queries don’t support fields other than geo_points, so you’re not going to be able to distinguish between hits in irregular shaped areas (like geopolitical borders). In fact, the only ES query that supports the geo_shape type is a geo_shape query, and Elasticsearch doesn’t come with any aggregations over geo_shapes.
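For comparison, this is roughly what a stock geohash grid request body looks like. It's a sketch in Python (the index and the field name "location" are assumptions), and it only works on geo_point fields:

```python
# Sketch of a request body for Elasticsearch's built-in geohash_grid
# aggregation. It buckets geo_point values by geohash cell -- it cannot
# count indexed geo_shapes. The field name "location" is an assumption.

def geohash_grid_body(field="location", precision=4):
    """Build a request body for the built-in geohash_grid aggregation."""
    return {
        "query": {"match_all": {}},
        "aggs": {
            "viewport": {
                "geohash_grid": {
                    "field": field,          # must be a geo_point field
                    "precision": precision,  # geohash length, 1..12
                }
            }
        },
    }
```

The buckets this returns are keyed by geohash string rather than laid out as a grid of counts, which is part of why the two outputs aren't interchangeable.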

Unlike geohash grids, Solr’s heatmap facets use Lucene’s “recursive prefix tree” (RPT) that allows quick categorization of whether a given heatmap cell overlaps an indexed shape. It does this recursively to burrow down from parent cells that have hits until it achieves the desired granularity of the heatmap. Cells at any level along the way that don’t have hits are ignored for the rest of the recursion. The end result is a grid of counts that corresponds to how many indexed shapes fit the query area. Solr can then return that as an array of arrays of integers, or graphically as a PNG.
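The recursion itself is easy to sketch. Here's a toy version in Python that works on plain bounding boxes instead of Lucene's prefix-tree cells — purely illustrative of the prune-and-descend idea, not the actual RPT implementation:

```python
# Toy illustration of heatmap-style recursive cell counting.
# Shapes and cells are axis-aligned boxes: (min_x, min_y, max_x, max_y).
# This is NOT Lucene's RPT, just the same prune-and-descend idea.

def overlaps(a, b):
    """True if two boxes overlap (touching edges don't count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def heatmap(cell, shapes, depth):
    """Return {leaf_cell: hit_count} for leaf cells at `depth` with hits."""
    hits = [s for s in shapes if overlaps(cell, s)]
    if not hits:
        return {}                  # pruned: nothing touches this cell
    if depth == 0:
        return {cell: len(hits)}   # desired granularity reached
    x0, y0, x1, y1 = cell
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    counts = {}
    for child in [(x0, y0, mx, my), (mx, y0, x1, my),
                  (x0, my, mx, y1), (mx, my, x1, y1)]:
        counts.update(heatmap(child, hits, depth - 1))  # recurse on hits only
    return counts
```

Note that each level only recurses into the shapes that already matched the parent cell, which is what keeps the descent cheap when hits are sparse.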

So in order to generate heatmaps of indexed shapes in ES I created an aggregation plugin called elasticsearch-heatmap. Under the covers it uses the same Lucene API to crawl the RPT and collect hits in a particular shard, but it uses ES’s aggregation API instead of Solr’s facets. In fact it takes the same parameters as the facet query except for “format” (no PNGs for now). Here’s what that looks like as a full ES query:

{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "viewport": {
      "heatmap": {
        "field": "location",
        "grid_level": 4,
        "max_cells": 100,
        "geom": {
          "geo_shape": {
            "location": {
              "shape": {
                "type": "envelope",
                "coordinates": [[6.0, 53.0], [14.0, 49.0]]
              }
            }
          }
        }
      }
    }
  }
}
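If you'd rather run that from a script, something like this works. It's a sketch using only the standard library; the host, the index name ("places"), and the "location" mapping are assumptions, and the plugin must be installed on the cluster:

```python
import json
import urllib.request

# Sketch: build the heatmap aggregation body and POST it to _search.
# Host, index name ("places"), and field name are assumptions.

def heatmap_query(envelope, grid_level=4, max_cells=100, field="location"):
    """Build the heatmap aggregation body for a bounding envelope."""
    return {
        "query": {"match_all": {}},
        "aggs": {
            "viewport": {
                "heatmap": {
                    "field": field,
                    "grid_level": grid_level,
                    "max_cells": max_cells,
                    "geom": {
                        "geo_shape": {
                            field: {
                                "shape": {
                                    "type": "envelope",
                                    # [[west, north], [east, south]]
                                    "coordinates": envelope,
                                }
                            }
                        }
                    },
                }
            }
        },
    }

def run(body, host="http://localhost:9200", index="places"):
    """POST the body to the index's _search endpoint and decode the reply."""
    req = urllib.request.Request(
        f"{host}/{index}/_search",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(run(heatmap_query([[6.0, 53.0], [14.0, 49.0]])))
```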

What that says is, make a grid of at most 100 squares (max_cells), covering the given coordinates, and count how many indexed shapes are under those squares. The output looks something like this:

...
{
  "grid_level": 4,
  "rows": 7,
  "columns": 6,
  "min_x": 6.0,
  "min_y": 49.0,
  "max_x": 14.0,
  "max_y": 53.0,
  "counts": [
    [0,0,2,1,0,0],
    [0,0,1,1,0,0],
    [0,1,1,1,0,0],
    [0,0,1,1,0,0],
    [0,0,1,1,0,0],
    [],
    []
  ]
}

The output contains the requested grid_level, and instead of echoing max_cells it shows the actual grid Lucene settled on. In that calculation it's allowed to shrink the grid to eliminate sections that don't have any hits, so here the number of cells is only 42 (7 rows by 6 columns). After that it lists the coordinates of the bounding box that encompasses the heatmap, and finally the counts. You'll notice that the last two rows are empty and the last column contains no hits. That's because those cells were included at a more granular level, so they didn't get eliminated when the bounds were calculated.
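Turning a response like that back into plottable points is straightforward. Here's a sketch, assuming (as in Solr's heatmap facet) that the first row of counts is the northernmost and that an empty row stands for a row of zeros:

```python
# Decode a heatmap aggregation result into (lon, lat, count) cell centers.
# Assumes the first counts row is the northernmost (as in Solr's heatmap
# facet) and that an empty row means all zeros.

def decode(result):
    rows, cols = result["rows"], result["columns"]
    w = (result["max_x"] - result["min_x"]) / cols  # cell width, degrees lon
    h = (result["max_y"] - result["min_y"]) / rows  # cell height, degrees lat
    cells = []
    for r in range(rows):
        counts = result["counts"][r] or [0] * cols  # [] -> all-zero row
        for c in range(cols):
            lon = result["min_x"] + (c + 0.5) * w
            lat = result["max_y"] - (r + 0.5) * h
            cells.append((lon, lat, counts[c]))
    return cells
```

Fed the example response above, this yields all 42 cell centers with their counts, ready to hand to whatever plotting layer you use.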

That’s a relatively small heatmap over a really small index. In order to test how well it performs over larger indexes I got some help from Jorge Martinez at Piensa Labs. He built an easy-to-follow performance test in a Python notebook that shows how well the heatmap aggregation performs at various index sizes (10K, 20K, 30K, 50K and 100K documents). The results were encouraging: requests usually returned in less than 250 ms. This is comparable to the results I saw with a Rally test suite I built for a 30K-document index.

What’s next?

After I offered the code to Elasticsearch as a pull request, some issues were brought up in the discussion that make it unlikely this aggregation will become part of the main project (at least in this form). The first issue is the output format: it’s different from that of geohash grids, so it won’t be compatible with Kibana. Relatedly, you can’t chain another stock aggregation on top of the heatmap. My design goal for the ES heatmap aggregation was to stay as close to Solr’s facet as possible so that it could be a drop-in replacement, so I don’t think I’ll be modifying the output.

The other, deeper issue is that Lucene is changing its geospatial API to use what’s known as the “Points API”. This has been going on for a while, slowly replacing bits and pieces here and there to maintain backward compatibility. But even though a lot of work has been done at the field level, Solr’s heatmap facet still uses the older API. Elasticsearch is making some interesting plans to reshape its geospatial features as well (some were mentioned in the PR discussion), so the consensus was that it’d be better to hold off. When I spoke to David Smiley about heatmap facets last year at Lucene Revolution he too had some interesting ideas, and I hope we can talk about them in a future Search Disco podcast.

In the meantime, try out the heatmap plugin for all your geo_shape heatmap needs, and definitely reach out to us for this or any other search projects!