As Large Language Models (LLMs) continue to transform how we build and evaluate search systems, it’s crucial to understand how to use them effectively as “judges.” This hands-on training will guide you through the principles, practical techniques, and real-world considerations for implementing “LLM as a Judge” to evaluate search result quality.
We’ll start with some fundamentals of search evaluation: How can search result quality be measured? How do LLM-based judgments differ from human ratings and behavioral signals? You’ll learn how to design clear evaluation frameworks, craft effective prompts, and define output structures that make LLM-based judgments robust, interpretable, and aligned with your rules and goals. We’ll then cover advanced techniques such as incorporating user personas and using critique models.
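To make this concrete, here is a minimal sketch of what a pointwise LLM judge with a structured output can look like. The `call_llm` helper, the prompt wording, and the 0–3 grading scale are illustrative assumptions, not the exact setup used in the course.

```python
import json

# Hypothetical helper: wrap whichever LLM client you use (a hosted API or a
# local model) and return the model's raw text response for a single prompt.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

# A simple pointwise judge prompt with a fixed 0-3 grading scale and a JSON
# output structure, so the judgment is machine-readable and interpretable.
JUDGE_PROMPT = """You are a search quality rater.
Query: "{query}"
Result title: "{title}"
Result description: "{description}"

Rate how well this result matches the query on a scale from 0 (irrelevant)
to 3 (perfectly relevant). Answer with JSON only, in the form:
{{"rating": <0-3>, "reason": "<one short sentence>"}}"""

def judge_result(query: str, title: str, description: str) -> dict:
    """Pointwise judgment: one query/result pair in, one structured rating out."""
    prompt = JUDGE_PROMPT.format(query=query, title=title, description=description)
    return json.loads(call_llm(prompt))

# Example usage (once call_llm is implemented):
# judge_result("waterproof hiking boots", "Men's Leather Hiking Boots",
#              "Waterproof, ankle-high boots with a Vibram sole")
```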
The class will cover these areas:
- Introduction to search evaluation in the age of LLMs: how can we leverage LLMs for evaluation, and how do LLM-based judgments differ from human judgments and from behavioral data?
- Designing robust LLM evaluation frameworks
- Pointwise vs. comparative (pairwise) judgments (a minimal sketch follows this list)
- Evaluation of LLM judges
- Adding reasoning to LLM-based evaluation
- Context engineering: improving evaluation quality by adding contextual information and working with user personas
- Using critique models to improve evaluation quality
- Testing your LLM evaluation framework for robustness
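Building on the pointwise sketch above, the following sketch shows what a comparative (pairwise) judge can look like. The prompt wording, the `call_llm` placeholder, and the `A`/`B`/`tie` output labels are assumptions for illustration, not the course's reference implementation.

```python
import json

# Same hypothetical helper as in the pointwise sketch: plug in your LLM client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

# A comparative (pairwise) judge prompt: instead of rating a single result,
# the model is asked which of two results serves the query better.
PAIRWISE_PROMPT = """You are a search quality rater.
Query: "{query}"

Result A: "{result_a}"
Result B: "{result_b}"

Which result serves the query better? Answer with JSON only, in the form:
{{"winner": "A" or "B" or "tie", "reason": "<one short sentence>"}}"""

def judge_pair(query: str, result_a: str, result_b: str) -> dict:
    """Pairwise judgment: the model picks the better of two results (or a tie)."""
    prompt = PAIRWISE_PROMPT.format(query=query, result_a=result_a, result_b=result_b)
    return json.loads(call_llm(prompt))

# A common robustness check: judge each pair twice with A and B swapped and
# only keep the verdict if both calls agree, which reduces position bias.
```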
Who should attend this training?
This training is suitable for anyone with beginner to intermediate expertise in search. It will give you a kickstart in using LLMs as judges for search result quality in real-world applications.
The class will use Python Jupyter Notebooks for the hands-on labs. The code will be explained step by step; nevertheless, some basic knowledge of Python will be beneficial.
René Kriegler
Daniel Wrigley