Background
Silverchair provides flexible technology solutions to the academic publishing industry, with clients including Oxford University Press, McGraw-Hill Medical, and MIT Press. Silverchair takes a collaborative approach to platform development, and with new opportunities in the field of Generative AI, they have been committed to creating solutions in partnership with clients. A key question that has repeatedly come up during the discovery and exploration process is how one can measure the quality of AI-generated results, as the question of quality directly impacts the level of investment in AI for many scholarly publishers.
Retrieval Augmented Generation (RAG) is a method for reducing hallucination, a common problem with Generative AI systems where the language model can invent plausible but incorrect answers. RAG uses a Retrieval step (a search engine) to supply a Generative system (powered by a large language model or LLM) with a series of search results as context, and carefully crafts the instructions (or prompt) sent to the LLM to rely on this context when giving answers. The prompt may also instruct the LLM to provide references drawn from the search results.
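For illustration, a minimal sketch of this pattern is shown below. The `search` and `ask_llm` functions are hypothetical placeholders for whatever search engine and LLM client are actually in use; this is not Silverchair's implementation, just the general shape of a RAG call.

```python
# Minimal RAG sketch: retrieve context, then prompt the LLM to answer from it.
# `search` and `ask_llm` are hypothetical placeholders for a real search
# engine client and a real LLM API call.

def answer_with_rag(question: str, top_k: int = 3) -> str:
    # Retrieval step: fetch the most relevant documents for the question.
    results = search(question, size=top_k)

    # Build a numbered context block so the LLM can cite its sources.
    context = "\n".join(f"[{i + 1}] {doc['text']}" for i, doc in enumerate(results))

    # Generation step: instruct the LLM to rely only on the supplied context
    # and to reference the numbered search results it used.
    prompt = (
        "Answer the question using ONLY the context below. "
        "Cite the numbered sources you used, and say so if the context "
        "does not contain the answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```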
Silverchair worked with the search & AI specialists at OpenSource Connections (OSC) to build a system for evaluating different RAG architectures and models. This system will empower Silverchair’s team to test different models and strategies with their customers and to derive best-of-breed approaches to RAG, solving customer requirements while maintaining high levels of quality.
“When ramping up in a new technology pattern such as RAG, we generally seek to get a boost from experts in the field. We had previously engaged with OSC for their search expertise and they were a clear fit for this engagement.”
Stuart Leitch, CTO at Silverchair
Creating a RAG evaluation framework
The OSC team started by reviewing earlier AI work carried out by the Silverchair team, then delivered essential background on AI systems and RAG architectures. The first step was to describe a framework for evaluating RAG systems:
This framework shows how an evaluation dataset can be prepared by using an LLM to generate questions for a given context, with the context drawn from a set of stored documents. Effectively this gives us a set of questions that can be answered by the content of those documents. The result is a set of question-context pairs from which we can generate ground-truth answers, combined into question-context-truth triplets. Human review ensures that the questions and ground truths make sense.
We then ask the RAG system under test to generate answers from the question and context, and use an LLM to grade those answers against the ground truth on a number of metrics.
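As an illustration of the dataset-generation step, a minimal sketch might look like the following; `ask_llm` is again a hypothetical placeholder for a real LLM call, and the prompts are deliberately simplified.

```python
# Sketch of building question-context-ground-truth triplets for evaluation.
# `ask_llm` is a hypothetical placeholder for a real LLM API call.

def build_eval_dataset(documents: list[str]) -> list[dict]:
    triplets = []
    for context in documents:
        # Ask the LLM to write a question that this context can answer.
        question = ask_llm(
            f"Write one question that can be answered using this text:\n{context}"
        )
        # Ask the LLM to answer that question from the same context,
        # giving us a candidate ground-truth answer.
        ground_truth = ask_llm(
            f"Using only this text:\n{context}\n\nAnswer the question: {question}"
        )
        triplets.append(
            {"question": question, "context": context, "ground_truth": ground_truth}
        )
    # Triplets should then be reviewed by a human before being used for grading.
    return triplets
```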
Building the framework and calculating metrics
The Silverchair and OSC teams worked together to implement this architecture in Python with LlamaIndex, using the RAGAS framework for measuring RAG performance. We used a LangSmith dashboard for tracking the evaluation metrics. RAGAS provides the following metrics:
- Faithfulness: Measures the factual consistency of the generated answer against the retrieved context.
- Answer Relevance: Assesses how pertinent the generated answer is to the given question.
- Context Precision: Evaluates whether the ground-truth relevant items present in the contexts are ranked above irrelevant ones; this metric is computed using the question and the contexts.
- Context Recall: Measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth.
- Answer Correctness: Gauges the accuracy of the generated answer when compared to the ground truth.
These relate to the question, context, answer and ground truth as shown below:
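As a rough sketch of how these metrics are calculated, an evaluation run over a prepared dataset can look like the code below. Column names and metric imports follow the RAGAS 0.1-style API and may differ between versions; the example row is illustrative only.

```python
# Sketch of scoring a RAG system's answers with RAGAS.
# Column names and imports follow the RAGAS 0.1-style API and may vary by version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)

# One row per evaluation question: the question, the RAG system's answer,
# the retrieved contexts and the human-reviewed ground truth.
eval_data = Dataset.from_dict({
    "question": ["What does the study conclude about X?"],
    "answer": ["The study concludes that ..."],
    "contexts": [["...retrieved passage 1...", "...retrieved passage 2..."]],
    "ground_truth": ["The study concludes that ..."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision,
             context_recall, answer_correctness],
)
print(scores)  # per-metric scores, which can also be logged to a LangSmith dashboard
```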
Running the RAG evaluation pipeline
Using a small set of sample data, OSC demonstrated how the evaluation pipeline can be run end to end, showing results on the LangSmith dashboard. A number of RAG strategies were tested and metrics calculated. The OSC team also helped build a configuration system so that multiple different RAG strategies could be tested automatically.
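A configuration-driven sweep of this kind can be sketched roughly as below; the parameter names and the `run_rag_evaluation` helper are illustrative stand-ins rather than the actual Silverchair implementation.

```python
# Illustrative sketch of running the evaluation pipeline over several RAG
# configurations. `run_rag_evaluation` stands in for the pipeline described
# above (index documents, answer the evaluation questions, score with RAGAS,
# log the results to a LangSmith dashboard).

RAG_STRATEGIES = [
    {"name": "baseline",     "chunk_size": 512, "top_k": 3, "llm": "model-a"},
    {"name": "small-chunks", "chunk_size": 256, "top_k": 5, "llm": "model-a"},
    {"name": "bigger-model", "chunk_size": 512, "top_k": 3, "llm": "model-b"},
]

def run_all(strategies: list[dict]) -> dict[str, dict]:
    results = {}
    for config in strategies:
        # Each run evaluates one strategy and returns its metric scores.
        results[config["name"]] = run_rag_evaluation(config)
    return results

if __name__ == "__main__":
    for name, scores in run_all(RAG_STRATEGIES).items():
        print(name, scores)
```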
Next steps with RAG
The Silverchair team were now able to take over the project and begin to use it to test a number of RAG strategies for their client use cases. Further work may include storing the results from each evaluation in a database, creating more in-depth evaluation datasets and ensuring their quality using human expert review.
“A naive RAG solution is simple to create. From there it rapidly gets complicated with lots of tradeoffs, particularly around performance, cost and latency. Empirically testing different approaches allowed us to refine our strategy intuitions and perform hyperparameter optimization. The evaluation pipeline has also allowed us to quickly evaluate new models, often swapping in a new model within days of their release.”
“We extended this work to provide our publishers a dynamic playground environment where they could experiment themselves with various strategies, models and parameters while seeing direct per-chat inference costs.”
Stuart Leitch, CTO at Silverchair
The Silverchair Lab environment is shown below:
“Quality measurement is the key to developing effective generative AI, helping us to choose which models and technologies will work to solve particular problems. We were very happy to help Silverchair develop strategies and tools to serve their customers in this fast-moving field.”
Charlie Hull, OSC lead consultant
For help with creating & evaluating RAG and other Generative AI solutions, contact OSC today.