Tesseract 3 and Tika

Eric Pugh, December 10, 2019

In which we deal with learning that sometimes you don't get to use the latest version of Tesseract...

Demystifying nDCG and ERR

Max Irwin, December 9, 2019

In this post, we unwrap the mystery behind two popular search relevance metrics and discuss their pros and cons. Our subjects for this exercise are Normalized Discounted Cumulative Gain and Expected Reciprocal Rank, commonly abbreviated as nDCG and ERR. We'll start with some refresher background, visualize what these metrics actually look like, and paint a picture of how each can be either helpful or misleading, depending on the situation. Afterwards, you'll have a better understanding of their behavior and of which one to use when (and why).
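As a rough taste of what the two metrics compute, here is a minimal sketch using the commonly cited textbook formulations. It is not taken from the post itself: the function names, the exponential gain `2**grade - 1`, the log2 rank discount, and the choice to build the ideal ranking from the same result list are all assumptions made for illustration.

```python
import math


def dcg(grades):
    """Discounted cumulative gain for graded judgments listed in rank order."""
    # Assumed formulation: exponential gain, log2 discount, ranks start at 1.
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades, start=1))


def ndcg(grades):
    """DCG divided by the DCG of the same grades sorted best-first."""
    # Assumption: the ideal ranking is built only from the grades supplied.
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


def err(grades, max_grade):
    """Expected reciprocal rank under a simple cascade model of the user."""
    p_continue = 1.0  # probability the user reaches the current rank
    score = 0.0
    for rank, g in enumerate(grades, start=1):
        # Probability that the result at this rank satisfies the user.
        p_satisfied = (2 ** g - 1) / 2 ** max_grade
        score += p_continue * p_satisfied / rank
        p_continue *= 1 - p_satisfied
    return score


# Hypothetical judgments on a 0-3 scale, listed in the order results were shown.
good_order = [3, 2, 1, 0]
bad_order = [0, 1, 2, 3]
print(ndcg(good_order), ndcg(bad_order))
print(err(good_order, max_grade=3), err(bad_order, max_grade=3))
```

Both metrics reward placing highly graded results near the top, but ERR also discounts a result when the results above it were likely to have satisfied the user already.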

It's time for Tika Tuesdays!

Eric Pugh, November 22, 2019

It's time for Tika Tuesdays! Three years ago I started messing around with OCRing documents with Tika, and today that process is relatively straightforward. This weekly series will share what I've learned.

Understanding BERT and Search Relevance

Max Irwin, November 5, 2019

There is a growing topic in search these days. The hype of BERT is all around us, and while it is an amazing breakthrough in contextual representation of unstructured text, newcomers to NLP are left scratching their heads wondering how and why it is changing the field. Many of the examples are tailored for tasks such as text classification, language understanding, multiple choice, and question answering. So what about just plain-old findin' stuff? This article gives an overview of the opportunities and challenges of applying advanced transformer models such as BERT to search.

event

TALMIRI - Talent meets IR Industry

Charlie Hull, September 18, 2019

TALMIRI is a symposium that brings together talented early-stage researchers, academics, and the information retrieval industry with invited speakers in an iconic location in Bedfordshire, UK. Practitioners and researchers, in particular but not exclusively PhD/MSc students and early-stage researchers, are invited to present (as a poster or demo) their exciting search and information retrieval work to IR experts and industry. Charlie Hull of OSC will be presenting on Working in Open Source Search, giving attendees an insight into the sort of projects and technology he has dealt with over a 20-year career in search.

training

E-commerce Search for Product Managers (MICES, Berlin)

Charlie Hull, June 18, 2019

This full-day class helps you understand how working on onsite search for e-commerce requires different thinking than other engineering problems. We teach you to measure search quality, take a hypothesis-driven approach to search projects, and safely 'fail fast' towards ever-improving business KPIs.

Falsehoods Programmers Believe About Search

Max Irwin, May 29, 2019

Search is a deceptively complex field, where competence is hard-won through training, practice, and experience. In that vein, dear friends, I've exhumed the venerable "Falsehoods Programmers Believe" party from 4 years ago to bring you one about, no less, Search.