Tika Tuesdays: Parsing Tika & Tesseract formatted HOCR output inside Solr ingestion pipeline
In our first couple of posts, we ended up doing a lot of the processing work outside of Solr. It gave me a chance to polish my PowerShell skills,…
Summary: Search relevance expert Doug Turnbull revisits the concept of search relevance, exploring why people disagree on what makes results relevant. Using “Rocky” vs “Rocky Horror Picture Show” as…
Tesseract 4 is a major upgrade to this venerable OCR library, incorporating neural networks and lots of other great improvements, but not everyone has upgraded to it (including one…
Welcome back, dear reader! In this post, we unwrap the mystery behind two popular search relevance metrics through visualization, and discuss their pros and cons. Our subjects for this…
Intro: I joined OSC a month ago as their first data scientist, so I’ve been drinking from a firehose trying to get up to speed on Solr. After breezing through…
In a search engine, the “document” is the basic unit of indexing and retrieval. It’s the “result” on the search results screen when a user enters a query. Many…
Don’t want to deploy a separate Tika server, but need Tika server-like capabilities and already have Solr? Then this is the solution for you! What I am going…
Extracting content from file formats using Tika as a standalone service is the traditional architectural approach, and what my most recent project is built around. You can try out…
What is Tika Tuesdays? Over the past few months I’ve finally accomplished the longtime personal goal of being able to easily search PDF documents with in-context hit…
Greetings! I was very fortunate to attend the OpenSource Connections (OSC) Haystack Europe 2019 conference in Berlin, Germany, on October 28th, and below are my notes from the conference. Thank…
There is a growing topic in search these days. The hype of BERT is all around us, and while it is an amazing breakthrough in contextual representation of unstructured…
Recently I saw this post on the solr-user mailing list asking about running Tika for text extraction in Solr, which, if you follow the thread, led to a chorus of people…