Tika Tuesdays: Parsing Tika & Tesseract formatted HOCR output inside Solr ingestion pipeline

Eric Pugh, December 17, 2019

In our first couple of posts, we ended up doing a lot of the processing work outside of Solr. It gave me a chance to polish my PowerShell skills, which was cool, and gave me a nice appreciation for having a scripting language that works on both Windows and Unix systems!

However, sometimes you want the search engine to be a black box: I put documents into the black box, and now they are searchable. We added some additional complexity by moving to a parent/child relationship for the PDFs, because each page ended up being its own document in Solr. This meant another parsing script that dumped more intermediate-format documents.

What if we could do everything inside of Solr? What if we could take the output from Tika, with the Tesseract-generated OCR content, and convert it into a set of parent/child documents that are indexed into Solr?

Time for one of my favorite Get Out of Jail Free cards from Solr: the awkwardly named StatelessScriptUpdateProcessorFactory, which would let us put all that parsing logic into a script that runs inside of Solr. I’ve used this a couple of times in the past, but would it work with the extraction code?

We started by setting up a custom extract endpoint, but this time included an update.chain parameter:

  <requestHandler name="/update/speeches"
                  class="solr.extraction.ExtractingRequestHandler">
    <str name="parseContext.config">parseContext.xml</str>
    <lst name="defaults">
      <str name="uprefix">attr_</str>
      <str name="multipartUploadLimitInKB">20480</str> <!-- Limit to 20 MB PDF -->
      <str name="update.chain">process-speech-from-extracted-text</str>
    </lst>
  </requestHandler>

The update.chain parameter is what lets us override the normal execution flow and inject the call to our scripting step:

  <updateRequestProcessorChain name="process-speech-from-extracted-text">
    <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <arr name="script">
        <str name="script">process-speech.js</str>
      </arr>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

This then lets us call a custom JavaScript script (though we can use other languages, like Ruby) to deal with the text.

I’ll let you go through the process-speech.js script yourself. The big thing is that the Extract handler, for some odd reason, gives us not XML content, but the XML content with none of the wrapping < or > tags! So we can’t use the XML parsing logic that we’ve used previously; instead, we do lots of string splitting!
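To give a flavor of what that string splitting can look like, here is a minimal sketch, not the actual process-speech.js. It assumes each page in the extracted text starts with some recognizable marker string. The function name splitPages and the marker you pass in are hypothetical, so check what your own extract output looks like before relying on it. It sticks to old-style JavaScript syntax so it should be friendly to the scripting engine inside Solr.

```javascript
// Split tag-stripped extract output into one chunk of text per page.
// ASSUMPTION: `marker` is whatever string begins each page in YOUR content;
// the real output format is not shown in this post, so inspect it first.
function splitPages(content, marker) {
  var chunks = String(content).split(marker);
  var pages = [];
  // chunks[0] is whatever came before the first marker, so skip it.
  for (var i = 1; i < chunks.length; i++) {
    // Collapse runs of whitespace left behind by the OCR layout.
    var text = chunks[i].replace(/\s+/g, " ").trim();
    if (text.length > 0) {
      // `page` keeps the position in the original, even if a blank page was dropped.
      pages.push({ page: i, text: text });
    }
  }
  return pages;
}
```

Calling splitPages(content, someMarker) returns an array of { page, text } objects, which is a convenient shape for turning each page into its own child document later.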

Other things to note:

  • We are able to invoke any Java methods we want by prepending Packages. to the fully qualified class name, like this example of base64 encoding: logger.info("Here comes some base 64: " + Packages.org.apache.solr.common.util.Base64.byteArrayToBase64(id.getBytes()));

  • Shockingly, we can create a brand new Solr input document: var childDoc = new Packages.org.apache.solr.common.SolrInputDocument(); and then add it to our parent document by calling the Java method on the object: doc.addChildDocument(childDoc);

  • The process-speech.js script is parsed at startup, so if you have syntax errors, or incompatible JavaScript, then you will get an error. While not positive, I believe the engine in question is Nashorn (the successor to Rhino, bundled with Java 8 through 14), which targets roughly ECMAScript 5.1 rather than modern JavaScript, so stay with the simplest JavaScript syntax you can.

  • It’s nice that you can deploy this script via your regular SolrCloud friendly deploy scripts, and it would be interesting to see what other use cases there might be for this.
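Pulling those notes together, a script for this kind of chain might be shaped roughly like the sketch below. It is an illustration under assumptions, not the real process-speech.js: the field names content and text, the PAGE_SEPARATOR value, and the attachPages helper are all hypothetical. The cmd.solrDoc and logger names, and the set of process* functions, follow the sample update script that ships with Solr.

```javascript
// HYPOTHETICAL separator between pages; replace with whatever your content uses.
var PAGE_SEPARATOR = "\f";

// Build one child document per page and attach it to the parent.
// `newDoc` is a factory so this logic can also be exercised outside of Solr.
function attachPages(parentDoc, parentId, pageTexts, newDoc) {
  for (var i = 0; i < pageTexts.length; i++) {
    var child = newDoc();
    child.setField("id", parentId + "_page_" + (i + 1));
    child.setField("text", pageTexts[i]); // ASSUMED field name
    parentDoc.addChildDocument(child);
  }
  return pageTexts.length;
}

// Solr calls this for every document added through the update chain.
function processAdd(cmd) {
  var doc = cmd.solrDoc; // an org.apache.solr.common.SolrInputDocument
  var raw = doc.getFieldValue("content"); // ASSUMED field name
  if (raw == null) {
    return; // nothing extracted, let the document through untouched
  }
  var id = String(doc.getFieldValue("id"));
  var pageTexts = String(raw).split(PAGE_SEPARATOR);
  var count = attachPages(doc, id, pageTexts, function () {
    return new Packages.org.apache.solr.common.SolrInputDocument();
  });
  logger.info("Attached " + count + " page documents to " + id);
}

// The remaining hooks are defined as no-ops so Solr can invoke them.
function processDelete(cmd) {}
function processMergeIndexes(cmd) {}
function processCommit(cmd) {}
function processRollback(cmd) {}
function finish() {}
```

Passing the document factory in as a function is just a convenience: it keeps the Packages. reference in one place and lets you try the page-attaching logic with plain stand-in objects before deploying the script to Solr.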

Read other posts in the Tika Tuesday series here.

