Tika Tuesdays: Parsing Tika & Tesseract formatted HOCR output inside Solr ingestion pipeline

December 17, 2019 Eric Pugh

In our first couple of posts, we ended up doing a lot of the processing work outside of Solr. It gave me a chance to polish my PowerShell skills, which was cool, and gave me a nice appreciation for having a scripting language that works on both Windows and Unix systems!

However, sometimes you want the search engine to be a black box. I take documents and put them into the black box, and now they are searchable. We added some additional complexity by moving to a parent/child relationship for the PDFs, because each page ended up being its own document in Solr. This meant another parsing script that dumped more intermediate-format documents.

What if we could do everything inside of Solr? What if we could take the output from Tika with the Tesseract-generated OCR content, and then convert that to a set of parent/child documents that are indexed into Solr?

Time for one of my favorite Get Out of Jail Free cards from Solr: the awkwardly named StatelessScriptUpdateProcessorFactory, which lets us put all that parsing logic into a script that runs inside of Solr. I’ve used it a couple of times in the past, but would it work with the extraction code?

We started by setting up a custom extract endpoint, but this time included an update.chain parameter:

    parseContext.xml
    attr_
    20480
    process-speech-from-extracted-text

The update.chain is what lets us override the normal execution flow and inject the call to our scripting step:

    process-speech.js

This then lets us call a custom JavaScript script (though we can use other languages, such as Ruby) to deal with the text.
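To give a feel for the shape of such a script, here is a minimal skeleton. It is a sketch, not the actual process-speech.js: the hook names follow the example update script that ships with Solr, and as I understand it the factory expects each hook to be defined even if it does nothing. The logger variable is provided by Solr at runtime. The field names content and content_length_i are assumptions made up for illustration.

```javascript
// Sketch of an update-script skeleton, written in plain ES5 syntax.
// `logger` is injected by Solr when the script runs inside the update chain.
// The field names "content" and "content_length_i" are hypothetical.

function processAdd(cmd) {
  var doc = cmd.solrDoc; // an org.apache.solr.common.SolrInputDocument
  var id = doc.getFieldValue("id");
  var content = doc.getFieldValue("content"); // assumed name of the extracted-text field

  logger.info("process-speech.js: processing id=" + id);

  if (content !== null && content !== undefined) {
    // Record how much text the extract handler gave us.
    doc.setField("content_length_i", String(content).length);
  }
}

// The remaining hooks are no-ops in this sketch.
function processDelete(cmd) {}
function processMergeIndexes(cmd) {}
function processCommit(cmd) {}
function processRollback(cmd) {}
function finish() {}
```

Everything interesting happens in processAdd, which sees each document after the extract handler has filled it in and before it is indexed.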

I’ll let you go through the process-speech.js script yourself. The big thing is that the Extract handler, for some odd reason, gives us not well-formed XML, but the XML content with all of the wrapping < and > characters stripped from the tags! So we can’t use the XML parsing logic that we used previously; instead we do lots of string splitting!
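As a rough illustration of the string-splitting approach, here is a sketch that cuts the extracted text into pages. It is not the logic from process-speech.js. It assumes that each page boundary still shows up in the text as the literal hOCR marker class="ocr_page", which may not match what your extract handler actually returns, so check a real document first. The function name splitIntoPages is made up for this example.

```javascript
// Sketch: split tag-stripped hOCR text into one string per page.
// Assumption: the literal text class="ocr_page" survives at each page start.
function splitIntoPages(text) {
  var marker = 'class="ocr_page"';
  // String(...) makes sure we have a JavaScript string to work with.
  var chunks = String(text).split(marker);
  var pages = [];
  // chunks[0] is whatever came before the first page marker, so skip it.
  for (var i = 1; i < chunks.length; i++) {
    // Collapse runs of whitespace and trim the ends.
    pages.push(chunks[i].replace(/\s+/g, " ").trim());
  }
  return pages;
}
```

Each returned string still contains leftover attribute text such as ids and bounding boxes, so a real script needs further splitting to get down to the words.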

Other things to note:

We are able to invoke any Java methods we want by prepending Packages. to the fully qualified class name, like this example of Base64 encoding:

    logger.info("Here comes some base 64: " + Packages.org.apache.solr.common.util.Base64.byteArrayToBase64(id.getBytes()));

Shockingly, we can create a brand new Solr input document:

    var childDoc = new Packages.org.apache.solr.common.SolrInputDocument();

and then add it to our parent document by calling the Java method on the object:

    doc.addChildDocument(childDoc);

The process-speech.js script is parsed at startup, so if you have syntax errors or incompatible JavaScript, you will get an error. The JavaScript engine bundled with Java 8 through 14 is Nashorn (Rhino was its predecessor), and by default it supports ECMAScript 5.1 rather than ES6, so stay with the simplest JavaScript syntax you can.

It’s nice that you can deploy this script via your regular SolrCloud-friendly deploy scripts, and it would be interesting to see what other use cases there might be for this.
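Putting the child-document notes together, a helper along these lines could attach one child per page. This is a sketch under assumptions, not the code from process-speech.js: the function name addPageChildren, the id scheme, and the field names page_number_i and text are all hypothetical. The document constructor is passed in as a factory so the same function can be exercised outside of Solr with stand-in objects.

```javascript
// Sketch: attach one child document per page to a parent document.
// `newDoc` is a factory that returns an empty document. Inside Solr it would be:
//   function () { return new Packages.org.apache.solr.common.SolrInputDocument(); }
// The id scheme and field names below are hypothetical.
function addPageChildren(parentDoc, pages, newDoc) {
  var parentId = parentDoc.getFieldValue("id");
  for (var i = 0; i < pages.length; i++) {
    var child = newDoc();
    child.setField("id", parentId + "_page_" + (i + 1));
    child.setField("page_number_i", i + 1);
    child.setField("text", pages[i]);
    parentDoc.addChildDocument(child);
  }
  return pages.length; // number of children attached
}
```

Inside processAdd you would call it with cmd.solrDoc, an array of page texts, and the Packages-based factory shown in the comment.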

Read other posts in the Tika Tuesday series here.