
Log Every Document Added To Solr

I often want to intercept the complete Solr updates sent to Solr in a format I can use offline. Clients have complex ingestion systems. I shouldn’t need to have the full ingestion apparatus to do some Solr work. With documents offline, I can script something simple and stupid that throws documents at Solr to test my search relevancy work without having the full system at hand to populate Solr.

Add Update Processor Chain to Solr Config

Solr lets you hook in code that runs before documents are indexed. The list of hooks is known as the UpdateRequestProcessorChain. This code can of course be Java. For simpler tasks, however, you can use one of the Java scripting engines available to Solr. In the update chain, you can do all sorts of crazy things to Solr documents. Here we’ll simply set up some logging with a bit of JavaScript to capture the full contents of every document added. To do that, let’s first walk through configuring an update processor chain.

First, set up the update handler to use a specific update chain by adding the update.chain property like so:

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">script</str>
  </lst>
</requestHandler>

Next, you need to actually define the “script” chain. Update chains can have a bunch of steps, each in its own processor block. Here we’ll add just one step at the very beginning of the chain, pointing it at a JavaScript file, log-solr-docs.js.

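<updateRequestProcessorChain name="script">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">log-solr-docs.js</str>
    <str name="engine">js</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>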

As an aside, you may have seen Solr’s LogUpdateProcessor in this context. It’s important to note that LogUpdateProcessor simply logs document ids. Here we’re after the full doc!

Add JavaScript Logging

So now we simply need to define log-solr-docs.js to log some Solr docs! The script is expected to define several callbacks (all apparently mandatory, hence the empty functions). Below is the JavaScript code you’ll need. Place it in log-solr-docs.js in the same directory as your Solr config.

Examining the code, you can see it simply iterates through the fields of the Solr document, building a corresponding JavaScript object (jDoc) that maps each field name to a list of its values as strings. Finally, it serializes jDoc to JSON and writes it to the log.

// Log the full contents of a Solr document as one JSON line.
function logDoc(doc) {
  // Skip the conversion entirely unless TRACE logging is on.
  if (logger.isTraceEnabled()) {
    var fieldNames = doc.getFieldNames();
    var jDoc = {};
    for (var namesIter = fieldNames.iterator(); namesIter.hasNext();) {
      var fieldName = namesIter.next();
      jDoc[fieldName] = [];
      var fieldValues = doc.getFieldValues(fieldName);
      for (var valIter = fieldValues.iterator(); valIter.hasNext();) {
        var fieldValue = valIter.next();
        // Convert the Java value to a plain JavaScript string.
        jDoc[fieldName].push(String(fieldValue.toString()));
      }
    }
    logger.trace("NEWDOC: " + JSON.stringify(jDoc));
  }
}

function processAdd(cmd) {
  var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument
  logDoc(doc);
}

function processDelete(cmd) {
  // no-op
}

function processMergeIndexes(cmd) {
  // no-op
}

function processCommit(cmd) {
  // no-op
}

function processRollback(cmd) {
  // no-op
}

function finish() {
  // no-op
}
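If you want to poke at the conversion logic without a running Solr, here is a minimal sketch that runs the same traversal against a stub. The names javaIterator, stubSolrDoc and toPlainObject are made up for this illustration; the stub only imitates the two SolrInputDocument methods the script calls and is not a real Solr class.

```javascript
// Stand-in for a java.util.Iterator: hasNext()/next() over a JS array.
function javaIterator(items) {
  var i = 0;
  return {
    hasNext: function () { return i < items.length; },
    next: function () { return items[i++]; }
  };
}

// Hypothetical stub exposing only getFieldNames() and getFieldValues().
function stubSolrDoc(fields) {
  return {
    getFieldNames: function () {
      return { iterator: function () { return javaIterator(Object.keys(fields)); } };
    },
    getFieldValues: function (name) {
      return { iterator: function () { return javaIterator(fields[name]); } };
    }
  };
}

// Same traversal as logDoc, but returning the object instead of logging it.
function toPlainObject(doc) {
  var jDoc = {};
  for (var namesIter = doc.getFieldNames().iterator(); namesIter.hasNext();) {
    var fieldName = namesIter.next();
    jDoc[fieldName] = [];
    for (var valIter = doc.getFieldValues(fieldName).iterator(); valIter.hasNext();) {
      // Every value ends up as a string, including numbers.
      jDoc[fieldName].push(String(valIter.next()));
    }
  }
  return jDoc;
}

var sample = stubSolrDoc({ id: ["doc-1"], tags: ["solr", "logging"], price: [9.5] });
console.log("NEWDOC: " + JSON.stringify(toPlainObject(sample)));
// prints: NEWDOC: {"id":["doc-1"],"tags":["solr","logging"],"price":["9.5"]}
```

Note that every value comes out as a string inside a list, even for single-valued or numeric fields. That is worth keeping in mind when you replay the logged documents later.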

Turn on Logging

You’ll notice the code above outputs the JSON string to a logger at the TRACE logging level. The simplest way to turn this on is to set the level to TRACE in the Solr admin UI for the script handler. These settings are temporary; they go away when you restart Solr:

[Screenshot: setting the TRACE level in the Solr admin UI]
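If you would rather script this than click through the UI, the sketch below builds a request against Solr's logging admin endpoint. Treat the details as assumptions to check against your install: the /admin/info/logging path and set=logger:LEVEL parameter vary by Solr version, the logger name is my guess at the class logger the script processor uses, and the base URL is a placeholder. The helper names logLevelUrl and setLogLevel are made up for this example.

```javascript
// Assumed logger name: the class logger of the script update processor factory.
var SCRIPT_LOGGER = "org.apache.solr.update.processor.StatelessScriptUpdateProcessorFactory";

// Build the admin logging URL. baseUrl is a placeholder for your Solr root.
function logLevelUrl(baseUrl, loggerName, level) {
  return baseUrl.replace(/\/+$/, "") +
    "/admin/info/logging?wt=json&set=" +
    encodeURIComponent(loggerName + ":" + level);
}

// Ask Solr to change the level. Needs a runtime with fetch (e.g. Node 18+).
function setLogLevel(baseUrl, loggerName, level) {
  return fetch(logLevelUrl(baseUrl, loggerName, level)).then(function (res) {
    if (!res.ok) { throw new Error("Solr returned HTTP " + res.status); }
    return res.json();
  });
}

// Only prints the URL; call setLogLevel(...) yourself against a live Solr.
console.log(logLevelUrl("http://localhost:8983/solr", SCRIPT_LOGGER, "TRACE"));
```

Like the UI setting, a level changed this way should not survive a restart.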

And now you should start seeing every document logged by Solr! With this in place, you can quickly identify every Solr document being added. Of course, for large indices this becomes untenable. But as an ad-hoc way to grab all the Solr docs, it’s quite handy!
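To get the documents back out of the log for offline use, something like the following sketch would do. It only assumes that each document sits on a single line after the "NEWDOC: " marker the script writes; the sample lines are invented for illustration and are not real Solr log output, and extractDocs is a hypothetical helper name.

```javascript
var MARKER = "NEWDOC: ";

// Pull every JSON document that follows the marker out of a chunk of log text.
function extractDocs(logText) {
  var docs = [];
  logText.split(/\r?\n/).forEach(function (line) {
    var at = line.indexOf(MARKER);
    if (at === -1) { return; }  // not one of our lines
    try {
      docs.push(JSON.parse(line.slice(at + MARKER.length)));
    } catch (e) {
      // Skip lines whose payload is not valid JSON (e.g. truncated).
    }
  });
  return docs;
}

// Made-up log lines; whatever prefix your log layout adds is ignored.
var sampleLog = [
  "<some log prefix> NEWDOC: {\"id\":[\"1\"],\"title\":[\"hello\"]}",
  "<some log prefix> an unrelated log line",
  "<some log prefix> NEWDOC: {\"id\":[\"2\"],\"title\":[\"world\"]}"
].join("\n");

console.log(JSON.stringify(extractDocs(sampleLog)));
// prints: [{"id":["1"],"title":["hello"]},{"id":["2"],"title":["world"]}]
```

From there a simple script can throw the resulting array at a test Solr. Since every field arrives as a list of strings, you may need to unwrap single-valued fields first, depending on your schema.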

And of course, if you have a tricky Solr problem, be sure to Contact Us!