In our first couple of posts, we ended up doing a lot of the processing work outside of Solr. It gave me a chance to polish my PowerShell skills, which was cool, and gave me a nice appreciation for having a scripting language that works on both Windows and Unix systems!
However, sometimes you want the search engine to be a black box. I take documents and put them into the black box, and now they are searchable. We added some additional complexity by moving to a parent/child relationship for the PDFs, because each page ended up being its own document in Solr. This meant another parsing script that dumped more intermediate format documents.
What if we could do everything inside of Solr? What if we could take the output from Tika with the Tesseract generated OCR content, and then convert that to a set of parent/child documents that are indexed into Solr?
Time for one of my favorite Get out of Jail Free cards from Solr: the awkwardly named StatelessScriptUpdateProcessorFactory, which would let us put all that parsing logic into a script run inside of Solr. I’ve used this a couple of times in the past, but would it work with the extraction code?
We started by setting up a custom extract endpoint, but this time included an update.chain parameter:
<requestHandler name="/update/speeches" class="solr.extraction.ExtractingRequestHandler" >
  <str name="parseContext.config">parseContext.xml</str>
  <lst name="defaults">
    <str name="uprefix">attr_</str>
    <str name="multipartUploadLimitInKB">20480</str> <!--Limit to 20 MB PDF-->
    <str name="update.chain">process-speech-from-extracted-text</str>
  </lst>
</requestHandler>
The update.chain parameter is what lets us override the normal execution flow and inject the call to our scripting step:
<updateRequestProcessorChain name="process-speech-from-extracted-text">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <arr name="script">
      <str name="script">process-speech.js</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
I’ll let you go through the process-speech.js script yourself. The big thing is that the Extract handler, for some odd reason, gives us not the XML content, but the XML content with all of the wrapping < and > characters stripped out! So we can’t use the XML parsing logic that we’ve used previously; instead we do lots of string splitting!
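To give a feel for the string-splitting approach, here is a minimal sketch. It is not the actual process-speech.js. It assumes each page in the extracted text begins with the literal marker `div class="page"`, which is roughly what a per-page wrapper element would look like once the angle brackets are gone. The marker value and the helper name `splitIntoPages` are both assumptions, so check them against what your own extracted content contains.

```javascript
// Hypothetical helper: split tag-stripped extractor output into page texts.
// ASSUMPTION: every page starts with the literal marker below. Adjust it to
// match whatever your extracted content actually contains.
var PAGE_MARKER = 'div class="page"';

function splitIntoPages(content) {
  var pages = [];
  // String(...) also copes with a Java string handed over by the script engine.
  var chunks = String(content).split(PAGE_MARKER);
  // chunks[0] is whatever came before the first marker, so skip it.
  for (var i = 1; i < chunks.length; i++) {
    // Collapse runs of whitespace and drop pages that end up empty.
    var text = chunks[i].replace(/\s+/g, ' ').trim();
    if (text.length > 0) {
      pages.push(text);
    }
  }
  return pages;
}
```

Keeping the splitting in a plain function like this means it can be tried outside of Solr before the script is deployed.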
Other things to note:
We are able to invoke any Java methods we want by prepending Packages. to the class name, like this example of base64 encoding:
logger.info("Here comes some base 64: " + Packages.org.apache.solr.common.util.Base64.byteArrayToBase64(id.getBytes()));
Shockingly, we can create a brand new Solr input document:
var childDoc = new Packages.org.apache.solr.common.SolrInputDocument();
and then add it to our parent document by calling the Java method on the object:
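A minimal sketch of that step, assuming the method in question is `addChildDocument` on SolrInputDocument and that fields are set with `setField`. Both are real SolrInputDocument methods, but whether the script uses exactly these calls is an assumption. The helper name `attachPages`, the `makeDoc` factory argument, and the field names are hypothetical, chosen for illustration only.

```javascript
// Sketch only: attach one child document per page to a parent document.
// `makeDoc` is a hypothetical factory so the helper is not tied to Solr.
// Inside a Solr update script you would pass something like:
//   function () { return new Packages.org.apache.solr.common.SolrInputDocument(); }
function attachPages(parentDoc, parentId, pages, makeDoc) {
  for (var i = 0; i < pages.length; i++) {
    var childDoc = makeDoc();
    // Field names (id, page_number, content) are assumptions for illustration.
    childDoc.setField('id', parentId + '_' + (i + 1));
    childDoc.setField('page_number', i + 1);
    childDoc.setField('content', pages[i]);
    // Nest the page under the parent document.
    parentDoc.addChildDocument(childDoc);
  }
  return parentDoc;
}
```

In an update script this would typically be called from `processAdd(cmd)`, with `cmd.solrDoc` as the parent document.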
Rhino engine in Java (or maybe called
It’s nice that you can deploy this script via your regular SolrCloud-friendly deploy scripts, and it would be interesting to see what other use cases there might be for this.
Read other posts in the Tika Tuesday series here.