Algolia presupposes that we’re all going to want instant search (aka search-as-you-type). So they’ve built extremely good, hosted instant search. Everything you’d want to do with instant search is there. Just read this amazing post on query understanding:
- Typo tolerance: “Doug Trunbull”
- Prefix search: “Doug Turn”
- Prefix with typo tolerance: “Doug Trun”
- Decompounding (sometimes this can be typo tolerance): “DougTrun”
- A query-time lemmatizer, and more…
One thing though that I’ve learned about Lucene-based search is you can, with a good team, build just about any searchy thing with it. Yet it does take a team. Lucene-based search isn’t really meant to work well “out of the box” for your solution. Regardless of how easy Elasticsearch has made it, Lucene-based search is a framework, not a solution. It’s a set of really amazing search-focussed data structures that you can cobble together relatively easily to do Your Thing™. Even if Your Thing™ means altering details as low-level as how the index gets stored to disk!
Another way to say that: you could build Algolia in Elasticsearch (or get close enough), but you can’t build Elasticsearch in Algolia. You gain from Algolia’s focus on a specific problem; you sacrifice the deep customizability and extensibility of open source. Yet another way to say it is to compare search solutions to web apps. In many ways Algolia is like building a site with a site builder like Wix. Lucene is more like building your own web app with developers behind it, with all the associated low-level considerations and annoyances, but also power.
Case in point: Algolia’s performance comparison to Elasticsearch. In Algolia’s tests, Algolia claims up to a 200x performance improvement. On average it’s more like a 10-20x improvement (still impressive). However, Algolia chose the lowest common denominator for instant search in Elasticsearch: fuzzy queries and prefix queries. As Elasticsearch: The Definitive Guide points out, another common approach that improves speed tremendously is to use ngrams. Basically, avoid the query-time fuzzy work and instead build a giant data structure at index time that can handle it.
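To make the ngram idea concrete, here’s a toy Python sketch (my own illustration, not how Lucene actually stores its index): edge ngrams computed at index time turn a prefix query into an exact dictionary lookup, so no fuzzy expansion happens at search time.

```python
# Toy edge-ngram index: precompute all prefixes of each term so that
# a prefix query at search time is just an exact hash lookup.
from collections import defaultdict

def edge_ngrams(term, min_gram=2, max_gram=10):
    """All prefixes of `term` between min_gram and max_gram characters."""
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]

def build_index(docs):
    """Map every edge ngram to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            for gram in edge_ngrams(term):
                index[gram].add(doc_id)
    return index

docs = {1: "Doug Turnbull", 2: "Douglas Adams", 3: "Tuna Salad"}
index = build_index(docs)

# Query time is now a single exact lookup, no fuzzy work needed.
print(index.get("doug"))  # docs 1 and 2 both contain a term starting "doug"
print(index.get("turn"))  # only doc 1 contains "turnbull"
```

This is the tradeoff in miniature: the work moves from query time (scanning and scoring fuzzy candidates) to index time (storing many more entries).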
Now ngrams have their own problems. They grow your index. However, in the case of 2 million documents with lots of short text, they might not bloat the index that much, and I suspect they would yield order-of-magnitude performance improvements. If a bloated index became a problem, we could produce fewer ngrams of larger size. There’s also caching to consider: I bet both solutions cache results for each keystroke query, so I wonder how that colors the Algolia vs ES comparison.
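A quick back-of-the-envelope illustration (toy vocabulary and numbers, my own sketch) of why producing fewer, larger ngrams shrinks the index: each bump to the minimum gram size drops one stored prefix per term.

```python
# Toy illustration of the index-size tradeoff: raising min_gram produces
# fewer ngrams per term, shrinking the index at the cost of losing very
# short prefix matches.
def edge_ngrams(term, min_gram, max_gram=20):
    """Prefixes of `term` from min_gram up to max_gram characters."""
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]

terms = ["banana", "elasticsearch", "algolia", "lucene"]

# Total ngrams stored for this tiny vocabulary at each min_gram setting.
counts = {m: sum(len(edge_ngrams(t, m)) for t in terms) for m in (1, 2, 3, 4)}
print(counts)  # fewer stored grams as min_gram grows
```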
We might even reverse terms placed in the index to get suffix queries to catch earlier typos. Or do exotic things with fuzzy queries and ngrams simultaneously. We might even write a Lucene query that focusses on typos. Look at all the power here!
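The reversed-term trick can be sketched in a few lines (toy code standing in for what Lucene’s ReverseStringFilter enables): index each term reversed, and a suffix query becomes a cheap prefix scan over the reversed terms.

```python
# Toy reversed-term index: store every term backwards so that matching
# a suffix of the original term is a prefix scan on the reversed form.
def index_reversed(terms):
    """Build a sorted list of reversed terms."""
    return sorted(t[::-1] for t in terms)

def suffix_matches(reversed_index, suffix):
    """Find original terms ending with `suffix` via a prefix scan."""
    rev = suffix[::-1]
    return [t[::-1] for t in reversed_index if t.startswith(rev)]

terms = ["turnbull", "trunbull", "bullfrog", "redbull"]
idx = index_reversed(terms)
print(suffix_matches(idx, "bull"))  # every term ending in "bull"
```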
The point, though, is that I’ve made you start to think about how you’d solve the problem. Algolia has already built a solution! Why not just run with theirs? Well, there are a couple of reservations I’d have about going whole-hog into the Algolia camp:
- Turn-key can often turn into lock-in. There are examples of hosted search solutions (and databases) being acquired and the new owner (in FoundationDB’s case, Apple) not being interested in supporting the existing business.
- You care about “things not strings.” Algolia’s solution strongly focusses on specific string matching. Lucene’s approach to relevance focusses more abstractly on terms as features of content, using TF*IDF as a feature similarity system (our book largely discusses relevance in these terms).
- You’re doing anything close to non-traditional. You have a specific query language to implement. You need to explicitly map vernaculars between experts and lay-people. You want to do learning-to-rank. You want to use controlled vocabularies or build semantic search. You have specific geo concerns. All of these are features you can build into Solr/ES; with Algolia you’re locked into what it gives you.
- You want to deeply manipulate the search engine’s behavior. This is a huge part of Lucene’s sweet spot.
But there are a couple of reasons I would strongly consider Algolia:
- You have a small team, but good search is important. Algolia works pretty well out of the box. It has a good level of ranking configurability that can include both text and numeric values like popularity, with some geo support.
- You primarily need to support single-item lookups. Algolia’s approach is ideal for cases like needing to look up “banana” and match on bananas. It may make less sense for users who type “minion fruit” and expect bananas.
- You need to support typos. Lucene solutions to this are awkward. I hope that they’ll get better and faster, but fuzzy searching is not Lucene’s sweet spot.
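Lucene’s fuzzy queries are built on Levenshtein edit distance (modern Lucene compiles a Levenshtein automaton and intersects it with the term dictionary, which is far faster than scanning). A brute-force sketch of the underlying idea, as toy code rather than Lucene’s actual implementation:

```python
# Toy fuzzy matching: keep every candidate term within a small edit
# distance of the query. Real Lucene avoids this brute-force scan by
# compiling the query into a Levenshtein automaton.
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(query, terms, max_edits=2):
    """Terms within max_edits of the query (2 is Lucene's usual cap)."""
    return [t for t in terms if edit_distance(query, t) <= max_edits]

terms = ["turnbull", "trunbull", "tremble", "bull"]
# The "trunbull" transposition costs 2 plain-Levenshtein edits, so it
# still falls inside the max_edits=2 budget.
print(fuzzy_match("turnbull", terms))
```

Note that a transposition counts as two edits under plain Levenshtein; Lucene can optionally treat it as one (Damerau-Levenshtein), which matters a lot for real typos.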
Here are a couple of pieces of Algolia marketing I would disagree with:
- Algolia likes to point out that hosting Elasticsearch is going to be hard. I think with options like Bonsai and Elastic Cloud, this is hardly the case. With a good ES host, you basically have a good “API in the cloud” that’s just as easy to work with as any other service.
- Algolia wants you to believe Elasticsearch is not a good search engine, only a tool for big data analytics and “big data search” (not sure what that means). In the spirit of finding “things not strings,” I would disagree. It just takes work and understanding what’s special about your search solution.
- Algolia hopes everything is going to become instant search. Yet in my experience, the preponderance of search experiences (even many driven by Algolia) are autocomplete first to select keywords, followed by keyword search. That is still squarely in Lucene’s sweet spot.
- I believe the benchmarks Algolia provides, but note that they don’t try faster instant-search strategies on Elasticsearch, and we can’t recreate their benchmarks ourselves. Algolia seems trustworthy, but I wish I could test this independently. I’d also like to see the benchmarks rerun against newer Elasticsearch versions.
But Algolia points out important weaknesses in Lucene’s relevance model:
- Typos and fuzzy matching: To the extent that the world wants instant search with typo tolerance, Lucene-based search is hard to get working. I’ll also believe it’s slower than Algolia’s focussed solution (though I can’t recreate the benchmarks).
- Elasticsearch/Solr defaults for relevance are hard to tune. As Algolia rightly points out, the dismax ranking function yields fairly confusing results. We write about that phenomenon here. You can and should get away from these defaults, but I wish search made more sense out of the box.
- Elasticsearch and Solr primitives for relevance feel low-level. Algolia’s feel higher level and more focussed on creating a common understanding between the business and developers. This is one big reason we built Quepid. Even still, with all the options in Solr/ES you’ll still get blank stares from the business when you start speaking in terms of boolean queries, function queries, and what-not.
My big takeaway is I’m actually pretty enthusiastic about Algolia for the right use cases. But you need to be certain it’ll satisfy your needs. I hope this has made you a more informed shopper.