Playing with Thoth

At LuceneRevolution last week, one of the sessions that got me really excited was about Thoth, presented by Damiano Braga and Praneet Mhatre. It was very nicely done, especially considering the 30-minute time slot! Thoth is a new Solr monitoring solution open sourced by Trulia.

Ho hum, I can hear you saying, yet another logging solution. Well, the part that got me excited was this line from the program guide:

Damiano and Praneet will also summarize their application of machine learning algorithms and its results to the process of query analysis and pattern recognition.

Thoth not only collects data about your Solr queries, including both the query and how long it took, but can then actually use that data, via magic machine learning, to make smarter decisions. Specifically, Trulia wanted to split incoming queries into fast and slow queries, and send each kind to a specific cluster of Solr servers. Now, we can debate whether it's a search architecture smell to need separate clusters for slow versus fast queries rather than one large pool. However, if it works for Trulia, then the machine learning used to build the classifier is cool.

Data collection happens very simply. As requests come in to what they call the intercepted instance, the query and duration are sent to another Solr server:

[Figure: Collecting Query Data]
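To give a feel for how small each record is, here is a minimal Python sketch that pulls the query and its duration out of a Solr JSON response. It relies on the standard `responseHeader` fields (`QTime`, plus `params` when Solr is echoing request parameters). The function name and the output field names (`hostname_s`, `query_s`, `qtime_i`, `timestamp_dt`) are my own assumptions for illustration, not Thoth's actual schema.

```python
from datetime import datetime, timezone


def to_monitoring_doc(solr_response, hostname):
    """Build a small monitoring record from a Solr JSON response.

    The output field names are hypothetical, not Thoth's schema.
    """
    header = solr_response["responseHeader"]
    params = header.get("params", {})  # only present when Solr echoes params
    return {
        "hostname_s": hostname,
        "query_s": params.get("q", ""),
        "qtime_i": header["QTime"],  # Solr reports QTime in milliseconds
        "timestamp_dt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


# A hand-written response header, standing in for a real Solr reply.
sample = {"responseHeader": {"status": 0, "QTime": 42, "params": {"q": "city:Austin"}}}
doc = to_monitoring_doc(sample, "search01")
```

A record like `doc` is only a few hundred bytes, which is why shipping it to a second Solr server is cheap.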

The data is used to extract a training dataset of slow and fast queries, which in turn is used by a simple machine learning tool (whose name I don't recall from the presentation) to classify incoming queries and send them to the correct cluster. The Thoth project has published its various modules on GitHub. According to Damiano, the Thoth ML module will be available in the next little while, and then you'll be able to actually see this use of machine learning in anger. When that is published, hopefully you'll be able to duplicate what Trulia did with this setup (slightly annotated!):

[Figure: Complete Cycle Architecture]
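Until the Thoth ML module is published, here is a toy Python sketch of the general idea only. It is not Trulia's method: it labels logged queries as slow or fast using a duration threshold, turns each query string into a few crude numeric features, and routes a new query to whichever class centroid is nearer. The threshold, the features, and the cluster names are all assumptions invented for illustration.

```python
SLOW_MS = 500  # hypothetical cutoff between "fast" and "slow"


def features(query):
    """Crude, made-up features of a query string."""
    return [
        len(query.split()),                   # whitespace-separated tokens
        query.count("*") + query.count("?"),  # wildcard characters
        query.count(" OR "),                  # boolean OR operators
    ]


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(log):
    """log: iterable of (query, qtime_ms). Returns one centroid per label."""
    groups = {"fast": [], "slow": []}
    for query, qtime_ms in log:
        label = "slow" if qtime_ms >= SLOW_MS else "fast"
        groups[label].append(features(query))
    return {label: centroid(rows) for label, rows in groups.items() if rows}


def classify(model, query):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    x = features(query)
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, model[label])),
    )


# Invented training log of (query, duration in ms).
log = [
    ("city:Austin", 12),
    ("beds:3", 9),
    ("desc:*pool* OR desc:*garage* OR desc:*view*", 900),
    ("street:*oak* OR street:*elm*", 750),
]
model = train(log)

# Hypothetical routing table from label to Solr cluster name.
routes = {"fast": "fast-cluster", "slow": "slow-cluster"}
cluster = routes[classify(model, "name:*lake* OR name:*river*")]
```

A real classifier would need far better features and far more data, but the shape of the loop (log, label, train, route) is the part that matters here.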

You can try out Thoth very easily via the Thoth Demo project.

I do have one nitpick: the use of ActiveMQ as the mechanism to collect the query data. If you are already using ActiveMQ, then fine, but otherwise it's a fairly significant thing to add to your stack. Since the source of the data is Solr, and the target location for the data is Solr, I would have just added a component that sends that data directly into the target Thoth queue. The data is pretty tiny, and I don't think it would add any more latency than using ActiveMQ. It's not as robust, but it's one big moving piece less to deal with. We've done this for powering an analytics interface: instead of writing out to a logging tool like LogStash, we just insert the data directly from our API into the target Solr, and it works great. I'd love to see some clear documentation on how to put in your own connector.
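For the curious, the direct-insert approach can be sketched in a few lines of Python using only the standard library. This is a rough sketch under stated assumptions, not a Thoth connector: it posts documents to Solr's JSON update handler with a `commitWithin` parameter. The base URL, the collection name `thoth`, the field names, and the helper names are all hypothetical.

```python
import json
import urllib.request


def build_update_request(solr_base_url, collection, docs, commit_within_ms=10000):
    """Prepare a POST to Solr's JSON update handler; nothing is sent here."""
    url = f"{solr_base_url}/{collection}/update?commitWithin={commit_within_ms}"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout_s=2.0):
    """Send the request; the short timeout keeps monitoring from stalling the caller."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return resp.status


# Hypothetical target: a local Solr with a collection named "thoth".
req = build_update_request(
    "http://localhost:8983/solr",
    "thoth",
    [{"query_s": "city:Austin", "qtime_i": 42}],
)
# send(req)  # uncomment once a Solr instance is actually listening
```

The trade-off is the one described above: if the target Solr is down, these records are simply lost unless you add your own retry or buffering, which is exactly the robustness a message queue would otherwise give you.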

Overall, I'm really excited to see this tool, especially the special sauce around the machine learning. If you aren't already invested in a monitoring tool for Solr, and like the Solr-specific analytics, then take a look at Thoth.

The source of the diagrams, along with a good summary presentation by the developers, is available here.