Tika Tuesdays: Using Tika and Tesseract as an API exposed by Solr

December 3, 2019 Eric Pugh

Don’t want to deploy a separate Tika server, but need Tika server-like capabilities and already run Solr? Then this is the solution for you! What I am going to show has been available out of the box in Solr for quite a while, but configuring it has been very opaque, so hopefully this post demystifies the process.

First we figured out the magic incantation to configure Tika from inside of Solr: the parseContext.config parameter, which points at an XML file in a specific format:

<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig" impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <property name="extractInlineImages" value="true"/>
    <property name="ocrStrategy" value="OCR_AND_TEXT_EXTRACTION"/>
  </entry>
  <entry class="org.apache.tika.parser.ocr.TesseractOCRConfig" impl="org.apache.tika.parser.ocr.TesseractOCRConfig">
    <property name="outputType" value="HOCR"/>
    <property name="language" value="eng"/>
    <property name="pageSegMode" value="1"/>
  </entry>
</entries>

You might be tempted to think that this is the same file format as a tika-config.xml that you may have seen before, and you’d be wrong ;-). While visually very similar, this file is loaded by ParseContextConfig, which is part of the Solr extraction contrib module. Yes, there are many different ways to specify configuration settings for PDF extraction and Tesseract OCR!
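
For comparison, here is roughly how a tika-config.xml expresses one of the same PDF settings, using parsers and params instead of entries and properties. This is only a sketch, assuming the Tika 1.x configuration format; it is not a file we used in this setup, so check it against the documentation for your Tika version:

```xml
<properties>
  <parsers>
    <!-- Keep the default parsers, but leave PDF handling to the customized parser below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Same extractInlineImages setting as in parseContext.xml, in the tika-config.xml style -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```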

We then tweaked the default /update/extract request handler to refer to the parseContext.xml file. We want any field that isn’t already defined in schema.xml to be prefixed with attr_, so that it matches a dynamic field rule. So if the field from Tika is Creator, it becomes a text field in Solr called attr_creator (the lowernames setting takes care of the lowercasing).
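
The attr_ prefix only helps if the schema has a matching dynamic field rule. A minimal sketch of such a rule, assuming a text_general field type like the one in Solr's sample configsets (the type name and the attribute choices are assumptions, not copied from our schema):

```xml
<!-- Assumed rule: catches every unknown Tika metadata field that uprefix renamed to attr_* -->
<!-- multiValued because Tika can report more than one value for a metadata key -->
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
```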

Here is our request handler configuration in solrconfig.xml:

<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler" >
  <str name="parseContext.config">parseContext.xml</str>
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="multipartUploadLimitInKB">20480</str> <!-- Limit to 20 MB PDF -->
  </lst>
</requestHandler>

Because PDFs can be really big, we also needed to raise the upload size limits on the requestDispatcher:

<requestDispatcher handleSelect="true" >
  <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20480" formdataUploadLimitInKB="20480" />
</requestDispatcher>

You can now hit Solr via:

curl 'http://localhost:8983/solr/documents/update/extract?literal.id=doc2&commit=true&extractOnly=true' -F "myfile=@files/alvarez20140715a.pdf"

and get back the Tika-processed content from Solr in a structure that is relatively easy to process and very similar to what Tika server returns! Because of extractOnly=true, Solr returns the extracted content without indexing the document.

Read other posts in the Tika Tuesday series here.