Tesseract 3 and Tika
Tesseract 4 is a major upgrade to this venerable OCR library, incorporating neural networks and lots of other great improvements, but not everyone has upgraded to it (including one…
Welcome back, dear reader! In this post, we unwrap the mystery behind two popular search relevance metrics through visualization, and discuss their pros and cons. Our subjects for this…
Intro I joined OSC a month ago as their first data scientist, so I’ve been drinking from a firehose trying to get up to speed on Solr. After breezing through…
In a search engine, the “document” is the basic unit of indexing and retrieval. It’s the “result” on the search results screen when a user enters a query. Many…
Don’t want to deploy a separate Tika server, but need Tika server-like capabilities and already have Solr? Then this is the solution for you! What I am going…
Extracting content from file formats using Tika as a standalone service is the traditional architectural approach, and what my most recent project is built around. You can try out…
What is Tika Tuesdays? Over the past few months I’ve finally accomplished the long-time personal goal of being able to easily search PDF documents with in-context hit…
Greetings! I was very fortunate to attend the OpenSource Connections (OSC) Haystack Europe 2019 conference in Berlin, Germany, on October 28th, and below are my notes from the conference. Thank…
There is a growing topic in search these days. The hype of BERT is all around us, and while it is an amazing breakthrough in contextual representation of unstructured…
Recently I saw this post on the solr-user mailing list asking about running Tika for text extraction in Solr, which, if you follow the thread, led to a chorus of people…
For the last few years I’ve run free Lucene Hackdays around the same time of year as the largest conference in our open source search sector, Activate (previously known…
In August’s Summer of Relevance we took 19 attendees through intense, small-class search relevance training workshops – including “Think Like a Relevance Engineer” and “Hello LTR – Learning to…