Using GPT for Relevancy Judgements

Human ratings are essential to relevance tuning but resource-intensive to generate. Once we’ve collected them, though, we can quickly iterate over changes to our retrieval algorithm and estimate how effective those changes will be. As large language models (LLMs) approach human-level capabilities, we might use them as part of our ratings collection program to gain more confidence in our offline relevance testing and increase the pace of improvements. Our initial tests with general knowledge search result evaluation show that LLMs like GPT are promising at this task, but they’re not a drop-in replacement for human raters.

About the test

Our approach was to add an LLM rater to an existing set of ratings and evaluate how close the model’s ratings were to the average human rating. We used the training set Home Depot published as part of their Kaggle competition in 2016, since this is a public dataset with over 74k query-result pairs rated by a team of three human raters.

To evaluate how closely the model’s ratings tracked the average human rating we used Spearman’s rank correlation. Over a sample of 100 we saw a correlation coefficient of 0.344 with a p-value of 0.00045. This points to a statistically significant, moderately positive correlation between the two rating methods.
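As a rough illustration of the metric, here is a minimal standard-library sketch of Spearman's correlation: rank both lists (ties share the mean of their positions) and take the Pearson correlation of the ranks. The function names and the sample values are made up for illustration and are not the data or code from this test. In practice `scipy.stats.spearmanr` returns both the coefficient and a p-value.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every item tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Made-up illustrative values, not data from the experiment.
human_avg = [3.0, 2.33, 1.67, 2.67, 1.0, 2.0]
model = [3, 2, 2, 3, 1, 1]
rho = spearman(human_avg, model)
```

Because the model can only answer 1, 2 or 3, ties are unavoidable, which is why the ranking step averages tied positions rather than using the shortcut formula that assumes distinct ranks.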


Home Depot included great instructions for how the ratings should be applied to each query-result pair, so we paraphrased them and provided the result as context to each rating call, ending with:

Decide how relevant each product is to the query using these ratings:

- Reply with "1" if the product is Irrelevant

- Reply with "2" if the product is Partially or somewhat relevant

- Reply with "3" if the product is a Perfect Match

We chose to run the full test against OpenAI’s gpt-3.5-turbo model. Since this is a chat model (as opposed to a text completion model) we structured our input as a sequence of messages:

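messages = [
     # The system message carries the paraphrased rating instructions.
     {"role": "system", "content": instructions},
     # The user message carries the query-result pair to rate.
     {"role": "user", "content": f"Rate the following query-result pair: {prompt}\n"},
]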

In the full dataset Home Depot also included attributes for each product and the rating instructions asked the rater to examine the product image. We omitted both for the sake of cost and time, at the expense of accuracy.
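A minimal sketch of how the message construction above might be wrapped for one query-result pair. The helper names, the field layout inside the prompt, and the injected `complete` callable are all assumptions made for illustration. The original prompt template is not shown in this post, and `complete` stands in for whichever chat client call is used.

```python
def build_messages(instructions, query, title, description):
    """Assemble the chat messages for one query-result pair."""
    # Hypothetical layout for the pair; attributes and images are left out,
    # matching the omission described above.
    prompt = (
        f"Query: {query}\n"
        f"Product title: {title}\n"
        f"Product description: {description}"
    )
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": f"Rate the following query-result pair: {prompt}\n"},
    ]


def rate_pair(complete, instructions, query, title, description):
    """Rate one pair via an injected `complete(messages) -> str` callable."""
    return complete(build_messages(instructions, query, title, description))


# Usage with a stub in place of a real model call.
def fake_complete(messages):
    return "Relevance: 3\nExplanation: stubbed reply"


reply = rate_pair(
    fake_complete,
    "Decide how relevant each product is to the query.",
    "bridge cabinet",
    "Hampton Wall Bridge Cabinet",
    "A wall cabinet that bridges the space above an appliance.",
)
```

Passing the model call in as a function keeps the prompt-building logic testable without network access or API keys.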

The reply from the model included an explanation of why it chose a particular rating. For example:

Relevance: 3
Explanation: Perfect Match. The product title and description both contain the exact phrase "Hampton Wall Bridge Cabinet" which matches the query "bridge cabinet". The description also provides relevant information about the product's construction, design, and features, which further confirms its relevance to the query.
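To compare replies like this with the human ratings, the numeric rating has to be pulled out of the text. Below is a small sketch of one way to do that, assuming every reply follows the "Relevance: N" / "Explanation: ..." layout of the example above. That layout is inferred from a single sample, so the parser returns `None` rather than guessing when it does not match.

```python
import re

# Only 1, 2 and 3 are valid ratings under the instructions given to the model.
RATING_PATTERN = re.compile(r"Relevance:\s*([123])\b")
EXPLANATION_PATTERN = re.compile(r"Explanation:\s*(.+)", re.DOTALL)


def parse_reply(reply):
    """Return (rating, explanation); rating is None if no valid rating is found."""
    rating_match = RATING_PATTERN.search(reply)
    explanation_match = EXPLANATION_PATTERN.search(reply)
    rating = int(rating_match.group(1)) if rating_match else None
    explanation = explanation_match.group(1).strip() if explanation_match else ""
    return rating, explanation


rating, explanation = parse_reply(
    "Relevance: 3\nExplanation: Perfect Match. The title matches the query."
)
```

Replies that fail to parse are worth counting separately, since silently dropping them would shrink the sample used for the correlation.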

Next Steps

For a general knowledge use case like this, our results suggest that an off-the-shelf LLM can do a reasonable job of judging the relevance of a search result to a query. Other areas to explore include changing the prompt to mimic specific user personas, digging into query-result pairs where the humans and the model disagreed most, and expanding the output so that the model can explain its ratings.
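For the disagreement analysis, one simple starting point is to sort the rated pairs by the gap between the average human rating and the model's rating. The sketch below assumes each pair is a dict with hypothetical `human_avg` and `model_rating` keys, and the rows are made-up examples rather than results from this test.

```python
def largest_disagreements(rows, top_n=10):
    """Sort rated pairs by |human average - model rating|, largest gap first."""
    scored = [
        {**row, "gap": abs(row["human_avg"] - row["model_rating"])}
        for row in rows
    ]
    return sorted(scored, key=lambda r: r["gap"], reverse=True)[:top_n]


# Made-up illustrative rows, not data from the experiment.
rows = [
    {"query": "bridge cabinet", "human_avg": 3.0, "model_rating": 3},
    {"query": "angle bracket", "human_avg": 2.67, "model_rating": 1},
    {"query": "deck over", "human_avg": 1.33, "model_rating": 2},
]
worst = largest_disagreements(rows, top_n=2)
```

Reading the model's explanation for the pairs at the top of this list is a quick way to see whether the gap comes from the prompt, the missing product attributes, or genuine ambiguity in the query.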

Recently, Chain-of-Thought and Tree-of-Thought approaches have been shown to outperform the basic prompting approach we used here. They would be a good way to tie each judgement to the specific rating instructions the model followed. This level of explainability can then lead to further improvements.

Lastly, fine-tuning with fully explained judgements might help the model anticipate which rating factors matter most by observing how a human reasons. There aren’t many, if any, datasets like this, so one would likely have to be built as part of the experiment.
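If such a dataset were built, each explained judgement could be stored as one chat-style training example per line. The sketch below is only an assumption about how that might look: the helper name and the example values are hypothetical, the reply layout mirrors the model output shown earlier, and the exact record schema would depend on the fine-tuning service used.

```python
import json


def to_training_record(instructions, pair_text, rating, explanation):
    """Serialise one explained human judgement as a single JSON line."""
    assistant_reply = f"Relevance: {rating}\nExplanation: {explanation}"
    record = {
        "messages": [
            {"role": "system", "content": instructions},
            {"role": "user", "content": f"Rate the following query-result pair: {pair_text}\n"},
            # The human rater's rating and reasoning become the target reply.
            {"role": "assistant", "content": assistant_reply},
        ]
    }
    return json.dumps(record)


# Made-up example judgement for illustration.
line = to_training_record(
    "Decide how relevant each product is to the query.",
    "Query: bridge cabinet\nProduct title: Hampton Wall Bridge Cabinet",
    3,
    "The product is exactly the type of cabinet named in the query.",
)
```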


If your team would like to extend its judgement list coverage through LLMs – or explore other applications of LLMs in search, we’d love to hear from you!

Image from Cyborg Vectors by Vecteezy