Tesseract 3 and Tika

Eric Pugh, December 10, 2019

Tesseract 4 is a major upgrade to this venerable OCR library, incorporating neural networks and lots of other great improvements. But not everyone has upgraded to it (including one of our customers), so we investigated using Tesseract 3 with Tika instead. It turns out that the hOCR support in Tesseract 3 is identical to that in Tesseract 4, which means that Tika doesn’t mind that it’s an older version.
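For readers who haven't looked at hOCR before: it is plain HTML in which each recognized word is a `span` carrying its bounding box in the `title` attribute. Below is a minimal sketch of reading that structure with Python's standard library. The `SAMPLE` markup is a hand-written illustration of the general hOCR shape, not captured Tesseract output, and the names `HocrWordParser`, `parse_bbox`, and `extract_words` are made up for this example. It is not how Tika itself consumes hOCR.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Pull 'bbox x0 y0 x1 y1' out of an hOCR title attribute, if present."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:5])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from spans with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        # Only keep text that sits inside a word span.
        if self._in_word and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written illustrative hOCR fragment (an assumption, not real OCR output).
SAMPLE = (
    "<span class='ocr_line' title='bbox 36 92 580 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 90'>Hello</span> "
    "<span class='ocrx_word' title='bbox 109 92 190 116; x_wconf 88'>Tika</span>"
    "</span>"
)

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

Because a consumer only depends on this markup, an OCR engine that emits the same structure should be interchangeable from the consumer's point of view.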

Later on I found out that the first integration of Tesseract with Tika was with Tesseract 3, so it shouldn’t have been a surprise that it worked so well!

Want to try out Tesseract 3 inside of Tika? Check out the https://github.com/o19s/pdf-discovery-demo/tree/master/tika-server-tesseract Docker image, which is based on the https://logicalspark.github.io/docker-tikaserver/ project. I got lucky and built the Docker image on an older Linux distro, so it pulled in Tesseract 3 instead of 4!
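Once a Tika server container is running, a quick way to exercise OCR is to send it a document over HTTP. The sketch below uses only Python's standard library and assumes the server is reachable at `http://localhost:9998/tika` (tika-server's default port; your Docker port mapping may differ). The helper names `build_ocr_request` and `extract_text`, and the file name `scanned.pdf`, are hypothetical. Whether a given PDF actually goes through Tesseract depends on how Tika's PDF parser is configured in the image.

```python
import urllib.request

# Assumed address: tika-server listens on port 9998 by default.
TIKA_URL = "http://localhost:9998/tika"


def build_ocr_request(document_bytes, tika_url=TIKA_URL, language="eng"):
    """Build a PUT request asking Tika for plain text from a PDF."""
    return urllib.request.Request(
        tika_url,
        data=document_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "text/plain",
            # Tells Tika which Tesseract language pack to use.
            "X-Tika-OCRLanguage": language,
        },
    )


def extract_text(path, tika_url=TIKA_URL):
    """Send the file at `path` to Tika and return the extracted text."""
    with open(path, "rb") as f:
        request = build_ocr_request(f.read(), tika_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical file name; requires a running Tika server.
    print(extract_text("scanned.pdf"))
```

The request is identical whether the container ships Tesseract 3 or 4, so the same script can be pointed at either image to compare results.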

Read other posts in the Tika Tuesday series here.

