It’s okay to run Tika (and Tesseract!) inside of Solr ;-) If and Only If….

Recently I saw this post on the solr-user mailing list asking about running Tika for text extraction in Solr. If you follow the thread, it led to a chorus of people saying:

In any case, you may want to run Tika externally to avoid the conversion/extraction process be a burden to Solr itself.

Indeed, this advice is captured in the Solr Reference Guide.

So I want to tell you when you are safe to ignore this advice! And yes, this is the minority opinion.

The common wisdom around Tika comes out of the experience of folks who are processing massive volumes of varied document types, where they have no influence over how these various documents were generated. Indeed, it is really an adversarial experience… “Am I going to receive a zip bomb that will blow up my environment?”, “Does this .scrpt file represent a virus?”, or even “Who the hell created a 1 GB PDF document?”.

In those situations all the standard advice makes sense, and I would point you to Tika's batch mode parameters, which exist specifically to make Tika more robust:

>> java -jar tika-app-1.22.jar --help
Batch Options:
    -i  or --inputDir          Input directory
    -o  or --outputDir         Output directory
    -numConsumers              Number of processing threads
    -bc                        Batch config file
    -maxRestarts               Maximum number of times the
                               watchdog process will restart the child process.
    -timeoutThresholdMillis    Number of milliseconds allowed to a parse
                               before the process is killed and restarted
    -fileList                  List of files to process, with
                               paths relative to the input directory
    -includeFilePat            Regular expression to determine which
                               files to process, e.g. "(?i).pdf"
    -excludeFilePat            Regular expression to determine which
                               files to avoid processing, e.g. "(?i).pdf"
    -maxFileSizeBytes          Skip files longer than this value
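
If you do go the external route, here is a rough sketch of assembling a batch-mode command line from the flags listed in that help output. The function name, the jar path, the directory names, and every default value are my own placeholders, not recommendations from the Tika project, so tune them for your documents.

```python
from typing import List


def tika_batch_command(
    jar: str,
    input_dir: str,
    output_dir: str,
    num_consumers: int = 2,                        # assumed default, not a Tika recommendation
    max_restarts: int = 3,                         # assumed default
    timeout_ms: int = 60_000,                      # assumed default: one minute per parse
    max_file_size_bytes: int = 100 * 1024 * 1024,  # assumed default: skip files over 100 MB
) -> List[str]:
    """Build an argument list for Tika batch mode using the flags from the help text."""
    return [
        "java", "-jar", jar,
        "-i", input_dir,
        "-o", output_dir,
        "-numConsumers", str(num_consumers),
        "-maxRestarts", str(max_restarts),
        "-timeoutThresholdMillis", str(timeout_ms),
        "-maxFileSizeBytes", str(max_file_size_bytes),
    ]
```

You would hand the resulting list to something like subprocess.run(cmd, check=True). Building the list separately keeps the limits visible in one place and easy to review.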

However, what about the rest of us? The folks whose use case looks like:

  • I have a smallish number of documents, say thousands of PDFs, not more than 100,000.
  • My organization created the documents, so there won’t/shouldn’t be any adversarial Zip Bombs.
  • I only index a couple of documents a day, so load really doesn’t matter.
  • I don’t want to stand up yet another service… Solr does everything I need.
  • If Solr craps out for some reason, well, I can restart it, and everything will be okay.
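
For that kind of small, trusted workload, indexing a PDF through Solr can be about as simple as the sketch below. It assumes the ExtractingRequestHandler is registered at /update/extract in your solrconfig.xml (the conventional path, but check your own config), and the base URL, collection name, and document id are made-up placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base: str, collection: str, doc_id: str, commit: bool = True) -> str:
    """Build the URL for posting one document to Solr's extracting handler."""
    params = {
        "literal.id": doc_id,                    # sets the id field on the indexed document
        "commit": "true" if commit else "false",
    }
    return f"{solr_base.rstrip('/')}/{collection}/update/extract?{urlencode(params)}"


def index_pdf(path: str, solr_base: str, collection: str, doc_id: str) -> int:
    """POST the raw PDF bytes to Solr and return the HTTP status code."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(solr_base, collection, doc_id),
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urlopen(request, timeout=60) as response:
        return response.status
```

A call might look like index_pdf("report.pdf", "http://localhost:8983/solr", "docs", "report-1"). Committing on every request is fine at a couple of documents a day, though it would be a poor choice for bulk loads.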

I’d also like to point out that as we extend the types of workloads we use Solr for, such as all of the exciting Streaming Expressions type jobs, we need to think more about how to make Solr robust against misbehaving processes of all kinds, not just Tika. So let’s improve the ExtractingRequestHandler 😉

Oh, and I’d love some more eyes on my effort to put together a demo of using Tesseract and Tika inside of Solr: