How to use Luwak to run preset queries against incoming documents

Overview

Quite a while ago Flax released Luwak as a document monitoring and alerting library. It was designed to solve the problem of running a large set of predetermined queries against an incoming document and seeing which ones match. In that sense it’s a lot like Elasticsearch’s percolator. As a library, some integration work needs to be done in order to reap its benefits, so the bulk of this article will describe different ways you can do that. But first let me give you a brief overview of its internals:

Major Classes

MonitorQuery

Each query Luwak keeps track of is stored in a structure called a MonitorQuery:

    private final String id;
    private final String query;
    private final Map<String, String> metadata;

Each MonitorQuery is then stored in a small Lucene index. You can add whatever you like as metadata; for example, you might add a name and email address if you’re sending out alerts based on the query. The query itself undergoes some Lucene-style analysis by the Presearcher so that it can be efficiently indexed and matched against incoming documents.
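
For example, a query that drives email alerts might be registered like this. This is a minimal sketch: the metadata keys are made up, and the three-argument constructor should be checked against the Luwak version you’re using.

    import java.util.HashMap;
    import java.util.Map;

    import uk.co.flax.luwak.MonitorQuery;

    public class QueryRegistration {
        public static MonitorQuery alertQuery() {
            // Hypothetical metadata: who to notify when this query matches
            Map<String, String> metadata = new HashMap<>();
            metadata.put("name", "storm alerts");
            metadata.put("email", "alerts@example.com");

            // The query string itself is parsed later by the Monitor's query parser
            return new MonitorQuery("storm-1", "text:(storm AND warning)", metadata);
        }
    }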

Presearchers

If you have a query containing just the terms “red” and “blue”, you know that a document containing neither of those terms can never match. So you can turn an incoming document into a much more efficient search by retaining only the terms that actually appear in your query index. The job of the Presearcher is to implement that strategy. Luwak has a number of these strategies that strike some balance between accuracy, speed, and memory efficiency. For example, the MultipassTermFilteredPresearcher can be made more accurate by running additional passes over each query’s terms, but this accuracy comes at the expense of more memory and longer runtimes. The documentation on the Presearchers will give you some idea of how they will perform, but you’ll likely want to experiment to find the right one for your application.
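
The Presearcher is chosen when you construct the Monitor, so swapping strategies is a one-line change. A rough sketch, assuming a field named "text"; the MultipassTermFilteredPresearcher constructor arguments vary between Luwak versions:

    import java.io.IOException;

    import uk.co.flax.luwak.Monitor;
    import uk.co.flax.luwak.presearcher.MultipassTermFilteredPresearcher;
    import uk.co.flax.luwak.presearcher.TermFilteredPresearcher;
    import uk.co.flax.luwak.queryparsers.LuceneQueryParser;

    public class PresearcherChoice {

        // Filters candidate queries on their terms: quick and memory-light
        public static Monitor fastMonitor() throws IOException {
            return new Monitor(new LuceneQueryParser("text"), new TermFilteredPresearcher());
        }

        // Extra passes over each query's terms improve filtering accuracy,
        // at the cost of more memory and longer runtimes
        public static Monitor accurateMonitor() throws IOException {
            return new Monitor(new LuceneQueryParser("text"), new MultipassTermFilteredPresearcher(4));
        }
    }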

CandidateMatchers

Once the Presearcher has selected candidate queries that could possibly match the incoming document, a CandidateMatcher makes the final determination on which queries actually do match. This stage is much closer to the searches you’re used to: it’s where hits are collected, scored, and highlighted. The different implementations let you skip the work you don’t need. For example, if you just want to use the MonitorQuery as a way to add a category field to the document, you don’t care about the scores of the matching queries and you don’t need to highlight the sections that matched; in that case a SimpleMatcher may be all you need. On the other hand, if your queries are complex and a match on a specific clause is important, you might want the HighlightingMatcher so that you can show which parts of the incoming document triggered the match.
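
Here is a sketch of the cheap end of that spectrum, assuming a Monitor already populated with queries; the getMatches() accessor differs slightly between Luwak versions:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    import uk.co.flax.luwak.InputDocument;
    import uk.co.flax.luwak.Matches;
    import uk.co.flax.luwak.Monitor;
    import uk.co.flax.luwak.QueryMatch;
    import uk.co.flax.luwak.matchers.SimpleMatcher;

    public class SimpleMatching {
        public static void printMatches(Monitor monitor, String text) throws Exception {
            InputDocument doc = InputDocument.builder("doc-1")
                    .addField("text", text, new StandardAnalyzer())
                    .build();

            // SimpleMatcher reports only which queries matched: no scores, no highlights
            Matches<QueryMatch> matches = monitor.match(doc, SimpleMatcher.FACTORY);
            for (QueryMatch match : matches.getMatches("doc-1")) {
                System.out.println("matched query: " + match.getQueryId());
            }
        }
    }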

Monitor

The Monitor is what orchestrates the indexing of queries and matching them against documents. You can have it keep its index entirely in memory, persist it to some specific place on disk, or delegate the actual index writing to another class if you’re using Luwak in a Solr plugin. In any case, the Monitor is what you’ll use for basic Create/Read/Update/Delete (CRUD) operations on the queries. And most importantly, the Monitor is what takes documents and matches them against queries. Each time you call match() you can pass a different CandidateMatcher to get different details about the match. For example, the first run might use just a SimpleMatcher, and depending on the metadata of the matching MonitorQuery you might run match() again with a HighlightingMatcher.
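
Putting those pieces together, the basic lifecycle looks something like this minimal sketch; update(), deleteById(), and close() are Monitor methods, but check their exact signatures against your Luwak version:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    import uk.co.flax.luwak.InputDocument;
    import uk.co.flax.luwak.Matches;
    import uk.co.flax.luwak.Monitor;
    import uk.co.flax.luwak.MonitorQuery;
    import uk.co.flax.luwak.QueryMatch;
    import uk.co.flax.luwak.matchers.SimpleMatcher;
    import uk.co.flax.luwak.presearcher.TermFilteredPresearcher;
    import uk.co.flax.luwak.queryparsers.LuceneQueryParser;

    public class MonitorLifecycle {
        public static void main(String[] args) throws Exception {
            // By default the queries live in an in-memory Lucene index
            Monitor monitor = new Monitor(new LuceneQueryParser("text"), new TermFilteredPresearcher());

            // Create/update: parse and index a query
            monitor.update(new MonitorQuery("q1", "text:storm"));

            // Match: a cheap first pass with SimpleMatcher
            InputDocument doc = InputDocument.builder("doc-1")
                    .addField("text", "storm warning in effect", new StandardAnalyzer())
                    .build();
            Matches<QueryMatch> first = monitor.match(doc, SimpleMatcher.FACTORY);
            // Inspect 'first' here and decide whether a second,
            // highlighting pass is worth the extra work

            // Delete: remove the query from the index
            monitor.deleteById("q1");
            monitor.close();
        }
    }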

Integration Strategies

Make a Stand-alone Executable

If you have a fairly static set of queries to run documents against, it might make sense to just build an executable jar and run it. This has the advantage that it scales independently of Solr and does not impact Solr’s resources. It would also fit in well with an index pipeline composed mainly of shell scripts. The downside is that you’ll have to design a separate process to maintain the queries themselves, and you’d have another program to build, maintain, monitor, and distribute.
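
The entry point of such a jar might look something like this sketch; the tab-separated query file format and the single "text" field are my own inventions:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    import uk.co.flax.luwak.InputDocument;
    import uk.co.flax.luwak.Monitor;
    import uk.co.flax.luwak.MonitorQuery;
    import uk.co.flax.luwak.QueryMatch;
    import uk.co.flax.luwak.matchers.SimpleMatcher;
    import uk.co.flax.luwak.presearcher.TermFilteredPresearcher;
    import uk.co.flax.luwak.queryparsers.LuceneQueryParser;

    public class LuwakRunner {
        public static void main(String[] args) throws Exception {
            Monitor monitor = new Monitor(new LuceneQueryParser("text"), new TermFilteredPresearcher());

            // args[0]: hypothetical query file, one "id<TAB>query" pair per line
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                String[] parts = line.split("\t", 2);
                monitor.update(new MonitorQuery(parts[0], parts[1]));
            }

            // Remaining arguments are document files to run past the queries
            for (int i = 1; i < args.length; i++) {
                String text = new String(Files.readAllBytes(Paths.get(args[i])));
                InputDocument doc = InputDocument.builder(args[i])
                        .addField("text", text, new StandardAnalyzer())
                        .build();
                for (QueryMatch m : monitor.match(doc, SimpleMatcher.FACTORY).getMatches(args[i])) {
                    System.out.println(args[i] + " matched " + m.getQueryId());
                }
            }
            monitor.close();
        }
    }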

Add it to Your Ingester Utility

If you already have a Java process that uses SolrJ to send documents to Solr, adding Luwak to it would be a simple way to implement query-based classification or alerting independent of Solr. You’ll still have the issue of query maintenance to address, but again, your ingester scales independently of Solr and the overall impact to your existing architecture will be minimal.
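
A classification step in such an ingester might look like this sketch, where the field names and the "category" convention are assumptions of mine:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import uk.co.flax.luwak.InputDocument;
    import uk.co.flax.luwak.Monitor;
    import uk.co.flax.luwak.QueryMatch;
    import uk.co.flax.luwak.matchers.SimpleMatcher;

    public class ClassifyingIngester {
        public static void ingest(SolrClient solr, Monitor monitor, String id, String text) throws Exception {
            SolrInputDocument solrDoc = new SolrInputDocument();
            solrDoc.addField("id", id);
            solrDoc.addField("text", text);

            // Run the document past the stored queries before it goes to Solr
            InputDocument luwakDoc = InputDocument.builder(id)
                    .addField("text", text, new StandardAnalyzer())
                    .build();
            for (QueryMatch m : monitor.match(luwakDoc, SimpleMatcher.FACTORY).getMatches(id)) {
                // Query-based classification: tag the document with the matching query id
                solrDoc.addField("category", m.getQueryId());
            }

            solr.add(solrDoc);
        }
    }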

Build a Luwak Web Service

Like the stand-alone approach, a web service would be a flexible way to use Luwak to match queries to documents posted to it. It also has the advantage that CRUD operations on the queries can be implemented in a RESTful way. This also opens up Luwak capabilities to other services on your network (as in a microservice or service bus). Scaling would be easy, but the main drawbacks to this approach would be latency and the added infrastructure complexity.
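
To make that concrete, here is a minimal sketch built on the HTTP server that ships with the JDK (Java 9 or later for readAllBytes()); the /match endpoint, the single "text" field, and the plain-text response format are all invented for illustration:

    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    import com.sun.net.httpserver.HttpServer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    import uk.co.flax.luwak.InputDocument;
    import uk.co.flax.luwak.Monitor;
    import uk.co.flax.luwak.QueryMatch;
    import uk.co.flax.luwak.matchers.SimpleMatcher;
    import uk.co.flax.luwak.presearcher.TermFilteredPresearcher;
    import uk.co.flax.luwak.queryparsers.LuceneQueryParser;

    public class LuwakService {
        public static void main(String[] args) throws Exception {
            // Query CRUD endpoints are omitted; register queries with monitor.update(...)
            Monitor monitor = new Monitor(new LuceneQueryParser("text"), new TermFilteredPresearcher());

            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/match", exchange -> {
                // Treat the POST body as the document text
                String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
                InputDocument doc = InputDocument.builder("posted-doc")
                        .addField("text", body, new StandardAnalyzer())
                        .build();

                // Respond with one matching query id per line
                StringBuilder out = new StringBuilder();
                for (QueryMatch m : monitor.match(doc, SimpleMatcher.FACTORY).getMatches("posted-doc")) {
                    out.append(m.getQueryId()).append('\n');
                }
                byte[] bytes = out.toString().getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, bytes.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(bytes);
                }
            });
            server.start();
        }
    }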

Create a Solr RequestHandler

Solr itself isn’t too shabby at building and maintaining indexes, and it has a handy web interface for adding documents and querying. So if your documents are already going to Solr, it might make sense to implement Luwak as a RequestHandler. You’ll still have a choice to make, though: should Solr or Luwak write the query index? For smaller query collections that easily fit in memory, Luwak might be the better choice (just as Solr keeps synonyms and stopwords in memory). For huge collections of queries, though, Solr’s sharding capability might be necessary. A distributed RequestHandler is a little more complex than an everything-is-local RequestHandler, so your choice will depend on your intended query volume. Creating a Solr RequestHandler will get you a lot closer to a complete solution, but you’ll still need to add calls to that handler to finish things.
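
A skeleton of the Luwak-writes-the-index variant might look like the sketch below. The "doc" parameter name is hypothetical, and depending on your Solr version RequestHandlerBase may require a couple more overrides (getSource(), for instance):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;

    import uk.co.flax.luwak.InputDocument;
    import uk.co.flax.luwak.Monitor;
    import uk.co.flax.luwak.QueryMatch;
    import uk.co.flax.luwak.matchers.SimpleMatcher;
    import uk.co.flax.luwak.presearcher.TermFilteredPresearcher;
    import uk.co.flax.luwak.queryparsers.LuceneQueryParser;

    public class LuwakMatchHandler extends RequestHandlerBase {

        private Monitor monitor;    // in this sketch Luwak owns the query index

        private synchronized Monitor monitor() throws IOException {
            if (monitor == null) {
                monitor = new Monitor(new LuceneQueryParser("text"), new TermFilteredPresearcher());
            }
            return monitor;
        }

        @Override
        public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
            // "doc" is a hypothetical request parameter carrying the document text
            InputDocument doc = InputDocument.builder("request-doc")
                    .addField("text", req.getParams().get("doc"), new StandardAnalyzer())
                    .build();

            List<String> ids = new ArrayList<>();
            for (QueryMatch m : monitor().match(doc, SimpleMatcher.FACTORY).getMatches("request-doc")) {
                ids.add(m.getQueryId());
            }
            rsp.add("matches", ids);
        }

        @Override
        public String getDescription() {
            return "Matches a posted document against stored Luwak queries";
        }
    }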

Implement an UpdateRequestProcessor

Lastly, if you know you want to run Luwak matchers against every document coming into Solr, you could add Luwak as an UpdateRequestProcessor (URP). URPs are configured as a chain in your solrconfig.xml and can add, change, or delete fields on incoming documents, or even reject a document altogether. As with the RequestHandler approach, you’ll still need to decide whether Solr or Luwak maintains the query index. In a URP chain, Luwak can add fields to your documents for query-based classification or generate new notification documents. Since it will be an intimate part of Solr, you’ll have to monitor its impact on indexing speed and latency, and shard your document collection if these degrade to an unacceptable degree.
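
A sketch of the processor half of that chain is below; the factory class that instantiates it from solrconfig.xml is omitted, and the "id", "text", and "category" field names are assumptions:

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    import uk.co.flax.luwak.InputDocument;
    import uk.co.flax.luwak.Monitor;
    import uk.co.flax.luwak.QueryMatch;
    import uk.co.flax.luwak.matchers.SimpleMatcher;

    public class LuwakClassifierProcessor extends UpdateRequestProcessor {

        private final Monitor monitor;

        public LuwakClassifierProcessor(Monitor monitor, UpdateRequestProcessor next) {
            super(next);
            this.monitor = monitor;
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument solrDoc = cmd.getSolrInputDocument();
            String id = (String) solrDoc.getFieldValue("id");

            InputDocument doc = InputDocument.builder(id)
                    .addField("text", (String) solrDoc.getFieldValue("text"), new StandardAnalyzer())
                    .build();

            // Tag the document with the id of every query it matches
            for (QueryMatch m : monitor.match(doc, SimpleMatcher.FACTORY).getMatches(id)) {
                solrDoc.addField("category", m.getQueryId());
            }

            super.processAdd(cmd);    // hand the (possibly modified) document down the chain
        }
    }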

Parting thoughts

So far I’ve only worked Luwak into a web service, so I can’t say for certain whether there are pitfalls in tighter integrations with Solr. If you go the URP route I think you’ll still want to make a RequestHandler to do CRUD on the queries. For a web service talking to non-Java clients, take a look at Luwak’s InputDocument.Builder class. You can use it to build arbitrary InputDocuments without having to hardcode a POJO resource beforehand.
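
For instance, an InputDocument can be assembled field by field from whatever the client posted; a small sketch, with the field names and analyzer choice left up to you:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    import uk.co.flax.luwak.InputDocument;

    // ...
    InputDocument doc = InputDocument.builder("client-doc-42")
            .addField("title", "Severe weather update", new StandardAnalyzer())
            .addField("text", "A storm warning has been issued for the coast", new StandardAnalyzer())
            .build();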

The Luwak code itself is a pleasure to work with in that it’s flexible, modular, documented, and tested. There’s even a benchmark project included that gives some idea of what a minimal application would look like. If you’ve used Luwak or are planning to, send me an email or tweet – I’d love to hear about it!