As Large Language Models (LLMs) continue to transform how we build and evaluate search systems, it’s crucial to understand how to use them effectively as “judges.” This hands-on training will guide you through the principles, practical techniques, and real-world considerations for implementing “LLM as a Judge” to evaluate search result quality.
We’ll start with some fundamentals of search evaluation: How can search result quality be measured? How do LLM-based judgments differ from human ratings and behavioral signals? You’ll learn how to design clear evaluation frameworks, craft effective prompts, and define output structures that make LLM-based judgments robust, interpretable, and aligned with your rules and goals. We’ll then cover advanced techniques such as incorporating user personas and using critique models.
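To make this concrete, here is a minimal sketch of what a pointwise LLM judge with a structured output can look like. The `call_llm` helper, the prompt wording, and the 0–3 grading scale are illustrative assumptions, not the exact setup used in the course.

```python
import json

# Hypothetical helper: wrap whichever LLM client you use (a hosted API or a
# local model) and return the model's raw text response for a single prompt.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

# A simple pointwise judge prompt with a fixed 0-3 grading scale and a JSON
# output structure, so the judgment is machine-readable and interpretable.
JUDGE_PROMPT = """You are a search quality rater.
Query: "{query}"
Result title: "{title}"
Result description: "{description}"

Rate how well this result matches the query on a scale from 0 (irrelevant)
to 3 (perfectly relevant). Answer with JSON only, in the form:
{{"rating": <0-3>, "reason": "<one short sentence>"}}"""

def judge_result(query: str, title: str, description: str) -> dict:
    """Pointwise judgment: one query/result pair in, one structured rating out."""
    prompt = JUDGE_PROMPT.format(query=query, title=title, description=description)
    return json.loads(call_llm(prompt))

# Example usage (once call_llm is implemented):
# judge_result("waterproof hiking boots", "Men's Leather Hiking Boots",
#              "Waterproof, ankle-high boots with a Vibram sole")
```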
The class will cover these areas:
- Introduction to search evaluation in the age of LLMs: how can we leverage LLMs for evaluation, and how do LLM-based judgments differ from human judgments and from behavioral data?
- Designing robust LLM evaluation frameworks
- Pointwise vs. comparative (pairwise) judgments (a minimal sketch follows this list)
- Evaluation of LLM judges
- Adding reasoning to LLM-based evaluation
- Context engineering: improving evaluation quality by adding contextual information and working with user personas
- Using critique models to improve evaluation quality
- Testing your LLM evaluation framework for robustness
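Building on the pointwise sketch above, the following sketch shows what a comparative (pairwise) judge can look like. The prompt wording, the `call_llm` placeholder, and the `A`/`B`/`tie` output labels are assumptions for illustration, not the course's reference implementation.

```python
import json

# Same hypothetical helper as in the pointwise sketch: plug in your LLM client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

# A comparative (pairwise) judge prompt: instead of rating a single result,
# the model is asked which of two results serves the query better.
PAIRWISE_PROMPT = """You are a search quality rater.
Query: "{query}"

Result A: "{result_a}"
Result B: "{result_b}"

Which result serves the query better? Answer with JSON only, in the form:
{{"winner": "A" or "B" or "tie", "reason": "<one short sentence>"}}"""

def judge_pair(query: str, result_a: str, result_b: str) -> dict:
    """Pairwise judgment: the model picks the better of two results (or a tie)."""
    prompt = PAIRWISE_PROMPT.format(query=query, result_a=result_a, result_b=result_b)
    return json.loads(call_llm(prompt))

# A common robustness check: judge each pair twice with A and B swapped and
# only keep the verdict if both calls agree, which reduces position bias.
```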
Who should attend this training?
This training is suitable for anyone with beginner to intermediate expertise in search. It will give you a kickstart in using LLMs as judges for search result quality in real-world applications.
The class will use Python Jupyter Notebooks for the hands-on labs. The code will be explained step by step; nevertheless, some basic knowledge of Python will be beneficial.
René Kriegler
Daniel Wrigley