Tesseract 4 is a major upgrade to this venerable OCR library, incorporating neural networks and lots of other great improvements. Not everyone has upgraded to it, though (including one of our customers), so we investigated using Tesseract 3 instead. It turns out that the hOCR support in Tesseract 3 is identical to that in Tesseract 4, which means that Tika doesn’t mind being handed the older version.
Later on I found out that the first integration of Tesseract with Tika was with Tesseract 3, so it shouldn’t have been a surprise that it worked so well!
Want to try out Tesseract 3 inside of Tika? Check out the Docker image at https://github.com/o19s/pdf-discovery-demo/tree/master/tika-server-tesseract, which is based on the https://logicalspark.github.io/docker-tikaserver/ project. I got lucky: I built the Docker image on an older Linux distro, so it pulled in Tesseract 3 rather than 4!
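Once a container built from that image is running, a quick way to exercise the OCR path is to PUT a scanned PDF at the server. Here is a minimal Python sketch, not the demo project's own code. It assumes the container is published on localhost at Tika server's default port 9998, and the helper names `build_ocr_headers` and `extract_text` are made up for illustration.

```python
import http.client

# Assumptions: the container is reachable on localhost at Tika server's
# default port. Adjust both to match how you started the container.
TIKA_HOST = "localhost"
TIKA_PORT = 9998


def build_ocr_headers(language="eng"):
    """Return request headers asking Tika to OCR a PDF and reply with plain text."""
    return {
        "Content-Type": "application/pdf",
        "Accept": "text/plain",
        # Per-request OCR settings understood by Tika server.
        "X-Tika-OCRLanguage": language,
        "X-Tika-PDFOcrStrategy": "ocr_only",
    }


def extract_text(pdf_bytes, host=TIKA_HOST, port=TIKA_PORT):
    """PUT the PDF bytes to the /tika endpoint and return the extracted text."""
    # OCR can be slow, so allow a generous timeout (an arbitrary choice here).
    conn = http.client.HTTPConnection(host, port, timeout=300)
    try:
        conn.request("PUT", "/tika", body=pdf_bytes, headers=build_ocr_headers())
        response = conn.getresponse()
        body = response.read()
        if response.status != 200:
            raise RuntimeError(f"Tika returned HTTP {response.status}")
        return body.decode("utf-8")
    finally:
        conn.close()
```

Reading a file of your own into bytes and passing it to `extract_text` should return whatever text Tesseract recognized. The request is the same whether the image ended up with Tesseract 3 or 4, which is the point of the story above.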
Read other posts in the Tika Tuesday series here.