Extracting content from file formats using Tika as a standalone service is the traditional architectural approach, and what my most recent project is built around. You can try out a demo online at http://pdf-discovery-demo.dev.o19s.com:8080/.
This article walks you through the steps we followed to run Tika, both as a server process and as a command line tool.
To follow along locally, use the Quickstart in the README to run the demo. For an overview of all the steps, see the Text Extraction instructions. The key bit is an extraction script that calls Tika to pull the information we need out of a PDF.
A couple of things that are interesting:
It’s super simple to swap between Tika the Command Line App and Tika the Server Process. The nice thing about using the tika-app.jar is that all your dependencies for parsing are packaged up into one 78 MB file. Very easy to include that in your project. However, if you are going for scale, then you might want to run a cluster of Tika server processes with a load balancer in front, and then you would want to swap to making a curl command against a deployed Tika server.
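To make the swap concrete, here is a minimal Python sketch of the two call styles side by side. It is not the demo's actual extraction script. The function names, the default jar path, and the `http://localhost:9998` base URL are assumptions for illustration; the `--text` and `--encoding` options of tika-app and the `/tika` endpoint of Tika server are the standard ones.

```python
import subprocess
import urllib.request


def cli_command(pdf_path, jar_path="tika-app.jar"):
    """Build the command line for Tika the Command Line App (jar path is an assumption)."""
    return ["java", "-jar", jar_path, "--encoding=UTF-8", "--text", pdf_path]


def server_request(pdf_bytes, base_url="http://localhost:9998"):
    """Build the equivalent HTTP request for Tika the Server Process."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask the server for plain text
    )


def extract_text(pdf_path, base_url=None):
    """Use the local jar when no server URL is given, otherwise call the server."""
    if base_url is None:
        result = subprocess.run(cli_command(pdf_path), capture_output=True, check=True)
        return result.stdout.decode("utf-8")
    with open(pdf_path, "rb") as f:
        request = server_request(f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With this shape, moving from the jar to a load-balanced cluster only means passing the balancer's URL as `base_url`; the calling code stays the same.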
Deploying Tika server in a Dockerized world is super simple: https://github.com/o19s/pdf-discovery-demo/blob/master/docker-compose.yml#L47. However, I do wish the Apache Tika project had an official image that was released every time Tika was released. ;-) I also wish that, for non-Docker setups, there was a nice set of service scripts provided to manage starting/restarting Tika.
I’m very happy to report that in Tika-1.23, you can now configure the PDF and OCR Parsers via a single tika-config.xml file. In 1.22 and earlier, you needed to have on the filesystem a tika-properties directory that was included in the classpath. You can see an old commit where this was done: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser. It was awkward! Now you can use your tika-config.xml to set everything.
- I learned about the magic header parameters that you can send to the Tika server to configure your parser. This is an alternative to either the properties file configuration or the tika-config.xml configuration. It’s cool, but also more magic… For example, the parameters don’t follow any overall naming pattern, so you need to be very careful not to misspell them:
curl -T ./path/mypdf.pdf http://pdf-discovery-demo.dev.o19s.com:9998/rmeta --header "X-Tika-OCRLanguage: eng" --header "X-Tika-PDFOcrStrategy: ocr_and_text_extraction" --header "X-Tika-OCRoutputType: hocr"
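One way to guard against misspelling is to define the header names once and reuse them. The sketch below is a rough Python equivalent of the curl call, not part of the demo. The function names and defaults are assumptions; the header names and values are copied verbatim from the command above. It uses `http.client` because that module sends header names exactly as written, whereas `urllib.request` re-capitalizes them, which may or may not matter to Tika.

```python
import http.client

# Header names copied verbatim from the curl command, so a typo can only
# happen in one place.
OCR_LANGUAGE = "X-Tika-OCRLanguage"
PDF_OCR_STRATEGY = "X-Tika-PDFOcrStrategy"
OCR_OUTPUT_TYPE = "X-Tika-OCRoutputType"


def ocr_headers(language="eng", strategy="ocr_and_text_extraction", output_type="hocr"):
    """Return the parser-configuration headers for one request."""
    return {
        OCR_LANGUAGE: language,
        PDF_OCR_STRATEGY: strategy,
        OCR_OUTPUT_TYPE: output_type,
    }


def put_rmeta(host, port, pdf_path, headers):
    """PUT a PDF to /rmeta (what `curl -T` does) and return (status, body)."""
    with open(pdf_path, "rb") as f:
        body = f.read()
    conn = http.client.HTTPConnection(host, port, timeout=120)
    try:
        conn.request("PUT", "/rmeta", body=body, headers=headers)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

A call would look like `put_rmeta("localhost", 9998, "./path/mypdf.pdf", ocr_headers())`, with the host and port adjusted to wherever your Tika server runs.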
Parsing out the HOCR output looks daunting at first, till you realize you just care about <span class="ocrx_word"> tagged content. Check out https://github.com/o19s/pdf-discovery-demo/blob/master/ocr/extract.ps1#L63 to see both the HOCR and the raw text being pulled out of the XML.
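The demo does this step in PowerShell; as a rough illustration of the same idea, here is a Python sketch that keeps only the `ocrx_word` spans and their bounding boxes. It assumes the usual hOCR convention that each word span carries a `title` attribute of the form `bbox x0 y0 x1 y1; ...`, and that word spans are not nested inside each other. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Pull the four bbox integers out of an hOCR title attribute, or None."""
    for part in title.split(";"):
        fields = part.split()
        if fields and fields[0] == "bbox":
            return tuple(int(v) for v in fields[1:5])
    return None


class OcrWordParser(HTMLParser):
    """Collect (word, bbox) pairs from ocrx_word spans and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    """Return the list of (word, bbox) pairs found in an hOCR string."""
    parser = OcrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Lines, paragraphs, and page wrappers in the hOCR are skipped, which is what makes the output much less daunting than the raw markup.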
You can store lots of different data in your payloads! We have the bounding box from HOCR, and we also store the page number in the payload. Base64 encode it all to store it in Solr.
- We did a crazy thing to allow us to do traditional highlighting of snippets of text in our SERP, but then link each snippet to the payload-based highlights in the PDF document, even though there was no explicit connection between the two. We track the offset of our highlights and pass that along to the front end, giving it additional data to narrow down the payload highlighting. We did this via a custom formatter which injects into the HTML <em> tag additional data that lets the front end figure out how to make a connection between a snippet highlight and a payload highlight:
<em data-num-tokens="1" data-score="1.0" data-end-offset="2110" data-start-offset="2079">HELOCs|NiAzNTEgNTAyIDQwNSA1Mjc=</em>
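The text inside that tag is the term, a `|` delimiter, and the Base64 payload. Decoding the sample payload gives `6 351 502 405 527`, which looks like a page number followed by the four bounding box coordinates. That field order is inferred from this one sample, so treat it as an assumption. Here is a small Python sketch of encoding and decoding under that assumption; the function names are illustrative.

```python
import base64


def encode_payload(page, bbox):
    """Join the page number and bbox with spaces, then Base64 encode the result."""
    raw = " ".join(str(n) for n in (page, *bbox))
    return base64.b64encode(raw.encode("ascii")).decode("ascii")


def decode_token(token, delimiter="|"):
    """Split 'term|payload' and decode the payload (assumed layout: page, then bbox)."""
    term, _, payload = token.rpartition(delimiter)
    numbers = [int(n) for n in base64.b64decode(payload).decode("ascii").split()]
    return term, numbers[0], tuple(numbers[1:5])


print(decode_token("HELOCs|NiAzNTEgNTAyIDQwNSA1Mjc="))
# → ('HELOCs', 6, (351, 502, 405, 527))
```

The front end can run the same decoding step to find which page to open and which rectangle to draw for a given highlighted term.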
Read other posts in the Tika Tuesday series here.