Blog

SIGIR LiveRAG Challenge Report

This year’s SIGIR conference featured the “LiveRAG Challenge”. Participants received 500 questions synthesized from a given corpus (FineWeb10BT) and had two hours to generate answers using a fixed LLM (Falcon-10B). The answers were LLM-judged (using a different model) on correctness and faithfulness and later evaluated by human judges. We were intrigued by the challenge’s constraints and took part to explore how the different elements of a retrieval pipeline contribute to overall answer quality, and whether we could build a system that runs with decent performance on our local laptops.

Our contribution was a straightforward, linear retrieval and generation pipeline:

  • BM25 and dense retrieval with Snowflake/snowflake-arctic-embed-l embeddings (using “truncated”, i.e. non-chunked, input), querying with both the original question and a hypothetical answer paragraph generated with the given Falcon model (an approach usually called HyDE, for “Hypothetical Document Embeddings”)
  • Reciprocal Rank Fusion of all four retrievers (sketched in code below)
  • Reranking using the relatively small unicamp-dl/InRanker-base cross-encoder
  • (An LLM-based filter to remove remaining irrelevant documents)
  • Multi-shot answer generation, trying different ranking thresholds and context lengths
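
Reciprocal Rank Fusion itself fits in a few lines, so here is a minimal sketch of the scoring we mean. The helper name, the document ids, and the k=60 constant are illustrative and not taken from our actual code:

```python
# Sketch of Reciprocal Rank Fusion (RRF) over four ranked lists of doc ids
# (BM25/dense x original question/hypothetical answer). Illustrative only.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids into a single ranking.

    Each document receives sum(1 / (k + rank)) over all lists in which
    it appears; documents with higher fused scores rank first.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25 and dense results for the question and the HyDE paragraph
fused = reciprocal_rank_fusion([
    ["d3", "d1", "d7"],   # BM25, original question
    ["d1", "d3", "d9"],   # dense, original question
    ["d7", "d3", "d2"],   # BM25, hypothetical answer
    ["d3", "d2", "d1"],   # dense, hypothetical answer
])
print(fused)  # "d3" comes first: it is ranked highly by all four retrievers
```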

We had some more ideas (query-specific adjustments of the pipeline, gap analysis of queries with low recall, and possibly query rewriting) but ran out of time to implement them. We still ended up in a decent 6th place (out of 25 teams) and were ranked 8th after the human re-evaluation. We provide some documentation of our results and the code. Our main learnings were:

  • For this task and mix of questions, BM25 and dense retrieval complemented each other nicely, and hybrid search led to better overall result quality.
  • When properly instructed, LLMs can generate hypothetical answer paragraphs that further improve recall of the relevant documents.
  • Point-wise binary LLM precision filtering is prone to misjudgment and false positives/negatives.
  • For answer generation, making multiple attempts can yield a better overall result if negative outcomes can be detected reliably.
  • Even for a dataset of this size, no distributed system is strictly necessary and quick local experimentation is possible. bm25s and USearch are libraries that proved to work well (see the sketch after this list).
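
To illustrate the last point, here is a minimal sketch of setting up both halves of a hybrid index locally with bm25s and USearch. The corpus and query are toy placeholders, and loading the embedding model via sentence-transformers is an assumption on our part (the arctic-embed model also recommends a query prefix, which we omit for brevity):

```python
# Hedged sketch: local hybrid indexing with bm25s (lexical) and USearch (dense).
import bm25s
import numpy as np
from usearch.index import Index
from sentence_transformers import SentenceTransformer

corpus = ["first passage ...", "second passage ...", "third passage ..."]

# Lexical side: an in-memory BM25 index over the tokenized corpus
bm25 = bm25s.BM25()
bm25.index(bm25s.tokenize(corpus, stopwords="en"))

# Dense side: embed passages and store them in a USearch vector index
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")
embeddings = model.encode(corpus, normalize_embeddings=True)
dense = Index(ndim=embeddings.shape[1], metric="cos")
dense.add(np.arange(len(corpus)), embeddings)

query = "what does the second passage say?"
bm25_docs, bm25_scores = bm25.retrieve(bm25s.tokenize(query), k=2)
# Note: arctic-embed recommends a query prompt prefix; omitted here for brevity
dense_matches = dense.search(model.encode(query, normalize_embeddings=True), 2)
print(bm25_docs, dense_matches.keys)
```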

A key part of the LiveRAG Challenge was a workshop day in Padua, held as part of SIGIR 2025, which gave participants the opportunity to discuss their approaches and present their results. We had a brief presentation and poster slot and enjoyed the conversations with the other teams.

While the organizers have released a report, here’s a summary of what we learned from the other teams:

  • The team from RMIT-ADMS+S (paper, code) built a pipeline very similar to ours. They also successfully applied querying with hypothetical answers. Their LLM-based reranker uses the logits of the “yes” token as the ranking signal (sketched after this list), and they populate a fixed context of 15k words from the retrieved documents.
  • The Magikarp team (paper) used the quite heavy jina-reranker-m0 reranker (based on Qwen2-VL-2B-Instruct) on a large result set. They added an LLM-based reranking step that matches expected “knowledge elements” derived from the query against the knowledge elements found in the retrieved documents.
  • The RAGtifier contribution (paper, code) combined a single dense retriever with reranking of the top 200 results using the bge-reranker-v2-m3 model. They concentrated on evaluating answer prompt techniques and settled on an InstructRAG prompt, which provided the best results. They populated the prompt with the top 5 documents after reranking, in reverse document order.
  • The CIIR team’s mRAG (paper, code) used a complex multi-agent setup that they train via self-training, with LLM-judged faithfulness and correctness as the rewards. Their search used a 1B-parameter sparse neural retriever.
  • The HLTCOE team (paper) also built their own index, using a compressed ColBERT retriever. They used the original input as well as two LLM-generated search queries for retrieval. The retrieved passages were post-filtered using a long-context retrieval model with a fixed similarity threshold.
  • The UDInfo team submitted PreQRAQ (paper). They developed a rule-based query classifier that detected multi-document queries with high accuracy, and used class-specific query rewriters to generate two queries per original question. A hybrid BM25 + dense retriever (using e5-base) provided the input for a bge-reranker-v2 reranker.
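
The “yes”-token trick mentioned above is worth spelling out, because it turns a generative LLM into a cheap point-wise reranker without any fine-tuning. The sketch below is our reading of the idea, not the RMIT team’s actual code; the model name and prompt wording are placeholders:

```python
# Hedged sketch: rank passages by the logit the LLM assigns to "yes" when
# asked whether a passage answers the query. Model and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any instruction-tuned causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
yes_id = tokenizer.encode(" yes", add_special_tokens=False)[0]

def relevance_score(query: str, passage: str) -> float:
    prompt = (f"Query: {query}\nPassage: {passage}\n"
              "Does the passage answer the query? Answer yes or no. Answer:")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # The logit of "yes" at the next-token position serves as the ranking signal
    return logits[0, -1, yes_id].item()

query = "who wrote the novel Dune?"
passages = ["Dune is a 1965 novel by Frank Herbert.",
            "Sand dunes form through wind-driven sediment transport."]
ranked = sorted(passages, key=lambda p: relevance_score(query, p), reverse=True)
print(ranked[0])  # the Frank Herbert passage should score higher
```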

In conclusion, we think this challenge provided a great opportunity to learn from a diverse set of approaches to the same problem. We want to thank the sponsors and the organizers for all the work they put into making this challenge possible and for considering our contribution. We would be glad to participate in a possible next iteration.