Search Relevance Blog - OpenSource Connections

December 18, 2019 Doug Turnbull

What problem does BERT hope to solve for search?

I’m sure if you run in search or NLP circles, you’ve heard of BERT. It’s what Google famously used to improve 1 out of 10 searches, in what they…

December 17, 2019 Eric Pugh

Tika Tuesdays: Parsing Tika & Tesseract formatted HOCR output inside Solr ingestion pipeline

In our first couple of posts, we ended up doing a lot of the processing work outside of Solr. It gave me a chance to polish my PowerShell skills,…

December 11, 2019 Doug Turnbull

What is a ‘Relevant’ Search Result?

Summary: Search relevance expert Doug Turnbull revisits the concept of search relevance, exploring why people disagree on what makes results relevant. Using “Rocky” vs “Rocky Horror Picture Show” as…

December 10, 2019 Eric Pugh

Tesseract 3 and Tika

Tesseract 4 is a major upgrade to this venerable OCR library, incorporating neural networks and lots of other great improvements, but not everyone has upgraded to it (including one…

December 9, 2019 Max Irwin

Demystifying nDCG and ERR

Welcome back, dear reader! In this post, we unwrap the mystery behind two popular search relevance metrics through visualization, and discuss their pros and cons. Our subjects for this…

December 6, 2019 Nate Day

A noob’s guide to indexing data with Solr’s classic schema

Intro: I joined OSC a month ago as their first data scientist, so I’ve been drinking from a firehose trying to get up to speed on Solr. After breezing through…

December 4, 2019 Doug Turnbull

What Should Your Search Document Be?

In a search engine, the “document” is the basic unit of indexing and retrieval. It’s the “result” on the search results screen when a user enters a query. Many…

December 3, 2019 Eric Pugh

Tika Tuesdays: Using Tika and Tesseract as an API exposed by Solr

Don’t want to deploy a separate Tika server, but need Tika server-like capabilities and already have Solr? Then this is the solution for you! What I am going…

November 26, 2019 Eric Pugh

Tika Tuesday: Using Tika and Tesseract outside of Solr

Extracting content from file formats using Tika as a standalone service is the traditional architectural approach, and what my most recent project is built around. You can try out…

November 22, 2019 Eric Pugh

It’s time for Tika Tuesdays!

What is Tika Tuesdays? Over the past few months I’ve finally accomplished the longtime personal goal of being able to easily search PDF documents with in-context hit…

November 17, 2019 Bertrand Rigaldies

Haystack Europe 2019, Berlin, Germany, Conference Notes

Greetings! I was very fortunate to attend the OpenSource Connections (OSC) Haystack Europe 2019 conference in Berlin, Germany, on October 28th, and below are my notes from the conference. Thank…

November 5, 2019 Max Irwin

Understanding BERT and Search Relevance

There is a growing topic in search these days. The hype of BERT is all around us, and while it is an amazing breakthrough in contextual representation of unstructured…
