Tika was originally built as a pure library that you would embed in other applications, but soon people wanted to access it over HTTP as a standalone service. In this post I’m going to summarize (as of February 2020) the various options that are out there for you!
The venerable Tika Application
You probably thought that the tika-app.jar artifact was just for use on the command line or for firing up the simple built-in GUI for Tika? Nope, it was also the very first option for accessing Tika via HTTP, by passing in the --server parameter. You probably shouldn’t be using this approach ;-). The Tika App is nicely documented on the Getting Started page.
Now we are getting somewhere! Tika Server introduced access to Tika via a REST interface, and it also provides a lot of features that keep misbehaving files from crashing your Tika Server, by spawning sacrificial child Java processes for extraction ;-) Shockingly, the details about Tika Server are NOT documented on the main Getting Started page. There is some good documentation available on the wiki at https://cwiki.apache.org/confluence/display/tika/TikaJAXRS, though the wiki page could do with a better name! You can also learn about the individual REST API endpoints through some generated documentation. At the time of writing the 1.23 version didn’t seem to have been published, but the 1.22 version should cover you.
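To make the REST interface a little more concrete, here is a minimal Python sketch of sending a document to Tika Server and getting plain text back. It assumes a Tika Server is already running locally on port 9998 (the default) and uses the PUT /tika endpoint described on the wiki page. The function names and the file name in the usage note are mine, purely for illustration.

```python
import urllib.request

# Assumption: Tika Server is listening locally on its default port, 9998.
TIKA_URL = "http://localhost:9998"


def build_extract_request(document: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request to /tika that asks for plain-text output."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=document,
        method="PUT",
        headers={
            # Ask for extracted text rather than the default XHTML.
            "Accept": "text/plain",
            # Generic type, so Tika falls back to its own detection.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Send the file at `path` to Tika Server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_extract_request(handle.read(), base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

With a server running, something like print(extract_text("example.pdf")) should print the text of that (hypothetical) file. Building the request is kept separate from sending it so you can inspect or tweak the headers without a server on hand.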
And yes, I know, patches welcome to fix this!
Dropwizard based Server
If you are a fan of Dropwizard, then you might like this approach which wraps the Tika library in a Dropwizard based REST service. It was written by my colleague Matt Pearce who has done a lot of different work with Dropwizard. I find that Dropwizard is a much easier framework to reason about than the Apache CXF framework that Tika Server uses, so if I’m going to customize my Tika, I’d look at this as a base to start from.
LogicalSpark Docker image
Since January 2015 David Meikle has maintained a great Docker image for running Tika Server available from https://hub.docker.com/r/logicalspark/docker-tikaserver. The project source is at https://github.com/LogicalSpark/docker-tikaserver. He’s done a yeoman’s job of updating fairly quickly after every Tika release for many years. I’ve used this image quite a few times, thanks David!
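One thing to watch with any containerized Tika Server is that the JVM needs a moment to start before it answers requests. Below is a small Python sketch that polls the server's GET /version endpoint until it responds. It assumes the container was started with port 9998 published to localhost (for example with docker run -d -p 9998:9998 logicalspark/docker-tikaserver). The function names, retry count, and delay are illustrative choices of mine, not part of the image.

```python
import time
import urllib.request
from typing import Callable


def fetch_version(base_url: str) -> str:
    """GET /version from Tika Server and return the response body."""
    url = base_url.rstrip("/") + "/version"
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8").strip()


def wait_for_tika(
    base_url: str = "http://localhost:9998",  # assumption: default port is published
    attempts: int = 10,
    delay_seconds: float = 1.0,
    fetch: Callable[[str], str] = fetch_version,
) -> str:
    """Poll the server until it answers, then return its version string."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(base_url)
        except OSError as error:  # URLError and refused connections are OSErrors
            last_error = error
            time.sleep(delay_seconds)
    raise TimeoutError(f"Tika Server at {base_url} did not respond") from last_error
```

Calling wait_for_tika() right after starting the container gives you a simple readiness check before you start sending documents. The fetch parameter is there only so the retry loop can be exercised without a real server.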
Upcoming Tika Project “Blessed” Docker image
There has been some good discussion about whether the Tika project itself should be releasing Docker images as part of the full release process. This is actually a topic of discussion across the Apache Software Foundation, as many groups are wrestling with what this means.
Right now that means you should keep an eye on a new repo at https://github.com/apache/tika-docker, which, pending guidance on ASF projects publishing Docker images, may be the future.
There is also a Dockerfile in the Tika Server sub-project, https://github.com/apache/tika/tree/master/tika-server, that is very much inspired by the LogicalSpark version. This Dockerfile is provided for you to build your own image; it is NOT published to Docker Hub the way the LogicalSpark version is. If you need to install your own languages for Tesseract, or want to otherwise customize your image, this is where you should start!
Deep learning and Machine Learning Flavours of Tika!
No blog post is complete without mentioning DL and ML, and Tika is no exception! You might look at the Tika API and say “It’s for content extraction”, and you’d be right. But if you squint slightly, you might say “It’s for content enrichment”, and that is where the https://github.com/USCDataScience/tika-dockers repository comes in.
Nothing has happened in that repository in about two years, so it is a bit of an orphan project, but it is interesting to look at.
Cloud native Deployment!
I may be repeating myself, but no blog post is complete without mentioning Cloud Native and Kubernetes, and yes, Tika is no exception here either!
I can’t speak in detail, but the gist of it is that Quarkus and GraalVM give you super fast startup times for Java-based applications, which makes them suitable for running in serverless environments. Quarkus is billed as:
A Kubernetes Native Java stack tailored for OpenJDK HotSpot and GraalVM, crafted from the best of breed Java libraries and standards.
In terms of Tika, it means that instead of lazy loading all of the dependent parsers, you specify up front which parsers you need, and you build an optimized “native” executable that meets those needs. Kind of like running a
The specific steps for doing this are available here: https://quarkus.io/guides/tika, and I found this talk from Sergey Beryozkin at ApacheCon 2019 super helpful in understanding the why of this: https://aceu19.apachecon.com/session/apache-tika-goes-native-graalvm-and-quarkus.
Install as a Service on Linux
This is my contribution to the pool of options, and it is available as of Tika 1.24! Check out the wiki to learn more. This is perfect for folks who want to install Tika as a standard service that starts up when the box starts and is managed via standard commands: service tika start and service tika stop. All files are written to standard Linux locations, so the log files are where you would expect them ;-) This also supports running Tika Server on multiple ports if you want to do that, and it uses the -spawnChild option by default.
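If you do run Tika Server on several ports, the client side needs some way to spread requests across them. Here is one very small Python sketch of a round-robin rotation over the /tika endpoint of each instance. The ports in the usage note (9998 and 9999) and the function name are assumptions for illustration only; use whatever ports your service is actually configured with.

```python
import itertools
from typing import Iterable, Iterator


def tika_endpoints(ports: Iterable[int], host: str = "localhost") -> Iterator[str]:
    """Return an endless iterator that rotates through the /tika URL of each port."""
    urls = [f"http://{host}:{port}/tika" for port in ports]
    if not urls:
        raise ValueError("at least one port is required")
    return itertools.cycle(urls)
```

For example, endpoints = tika_endpoints([9998, 9999]) followed by repeated next(endpoints) calls alternates between the two instances. This is plain round robin with no health checking, so a real deployment would probably put a proper load balancer in front instead.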
This enhancement shamelessly copies the Solr service scripts, so if you like what you see at https://lucene.apache.org/solr/guide/8_4/taking-solr-to-production.html#taking-solr-to-production, then expect similar functionality for Tika.
This is still very much a work in progress, but I hope to have it committed for Tika 1.24, and then move on to cleaning up the rest of the Getting Started documentation.
I will endeavor to keep this post updated over time as deployment approaches evolve, but don’t hesitate to ping me about updates!
Read other posts in the Tika Tuesday series here.