Recently I saw this post on the solr-user mailing list asking about running Tika for text extraction in Solr. If you follow the thread, it led to a chorus of people saying:
In any case, you may want to run Tika externally to avoid the conversion/extraction process be a burden to Solr itself.
Indeed, this advice is captured in the Solr Reference Guide.
So I want to tell you when you are safe to ignore this advice! And yes, this is the minority opinion.
The common wisdom around Tika comes out of the experience of folks who are processing massive volumes of varied document types, where they have no influence over how these various documents were generated. Indeed, it is really an adversarial experience… “Am I going to receive a zip bomb that will blow up my environment?”, “Does this .scrpt file represent a virus?”, or even “Who the hell created a 1 GB PDF document?”.
In those situations all the standard advice makes sense, and I would point you to Tika's batch mode parameters, which are specifically aimed at making Tika more robust:
>> java -jar tika-app-1.22.jar --help

    Batch Options:
        -i or --inputDir          Input directory
        -o or --outputDir         Output directory
        -numConsumers             Number of processing threads
        -bc                       Batch config file
        -maxRestarts              Maximum number of times the watchdog process will restart the child process.
        -timeoutThresholdMillis   Number of milliseconds allowed to a parse before the process is killed and restarted
        -fileList                 List of files to process, with paths relative to the input directory
        -includeFilePat           Regular expression to determine which files to process, e.g. "(?i)\.pdf"
        -excludeFilePat           Regular expression to determine which files to avoid processing, e.g. "(?i)\.pdf"
        -maxFileSizeBytes         Skip files longer than this value
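To make those batch options a bit more concrete, here is a minimal Python sketch that assembles a batch invocation from the flags in the help output above. The flags themselves come straight from that output; the function name, the `incoming` and `extracted` directory names, and all of the numeric values are placeholders I made up for illustration, so tune them for your own environment.

```python
import shlex


def build_tika_batch_command(
    input_dir,
    output_dir,
    jar="tika-app-1.22.jar",
    num_consumers=2,                        # assumed value: processing threads
    max_restarts=3,                         # assumed value: watchdog restarts of the child process
    timeout_millis=60_000,                  # assumed value: time allowed per parse before a kill
    max_file_size_bytes=100 * 1024 * 1024,  # assumed value: skip files over ~100 MB
):
    """Assemble a tika-app batch invocation as an argument list.

    Only flags shown in the --help output are used; the defaults are illustrative.
    """
    return [
        "java", "-jar", jar,
        "-i", str(input_dir),
        "-o", str(output_dir),
        "-numConsumers", str(num_consumers),
        "-maxRestarts", str(max_restarts),
        "-timeoutThresholdMillis", str(timeout_millis),
        "-maxFileSizeBytes", str(max_file_size_bytes),
    ]


# Hypothetical directory names, relative to wherever you run the command.
command = build_tika_batch_command("incoming", "extracted")
print(shlex.join(command))
```

Building the command as a list rather than one string means you can hand it directly to `subprocess.run(command, check=True)` without worrying about shell quoting, assuming Java and the Tika app jar are available on the machine.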
However, what about the rest of us? The folks whose use case looks like:
- I have a smallish number of documents: thousands of PDFs, not more than 100,000.
- My organization created the documents, so there won’t/shouldn’t be any adversarial Zip Bombs.
- I only index a couple of documents a day, so load really doesn’t matter.
- I don’t want to stand up yet another service… Solr does everything I need.
- If Solr craps out for some reason, well, I can restart it, and everything will be okay. <– Very common Disaster Recovery plan ;-)
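For that kind of small, trusted workload, sending a document to Tika inside Solr is just one HTTP request. The sketch below builds that request in Python. It assumes the ExtractingRequestHandler is registered at the conventional `/update/extract` path in your solrconfig.xml; the `documents` collection name, the document id, and the placeholder bytes are all hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_extract_request(solr_base, collection, doc_id, pdf_bytes, commit=True):
    """Build a POST that sends raw PDF bytes to Solr's extracting handler.

    Assumes the handler lives at /update/extract, which depends on your solrconfig.xml.
    """
    params = urlencode({
        "literal.id": doc_id,                    # sets the unique key of the indexed document
        "commit": "true" if commit else "false",
    })
    url = f"{solr_base.rstrip('/')}/{quote(collection)}/update/extract?{params}"
    return Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


# Hypothetical collection, id, and payload; in practice, read the bytes from a real PDF.
request = build_extract_request(
    "http://localhost:8983/solr",
    "documents",
    "report-2019-q3",
    b"%PDF-1.4 placeholder bytes",
)
print(request.get_method(), request.full_url)
```

With a running Solr, passing `request` to `urllib.request.urlopen` would actually submit the document. Committing on every request is only reasonable because the load here is a couple of documents a day.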
I’d also like to point out that as we extend the types of workloads we use Solr for, such as all of the exciting Streaming Expressions jobs, we need to think more about how to make Solr robust against misbehaving processes of all kinds, not just Tika. So let’s improve the ExtractingRequestHandler ;-)
Oh, and I’d love some more eyes on my effort to put together a demo of using Tesseract and Tika inside of Solr: https://github.com/o19s/pdf-discovery-demo/tree/crazy_tika_tesseract_inside_of_solr