How to Evaluate Response of a RAG Empowered LLM

Years of relentless advancements in artificial intelligence have led us to a pivotal breakthrough: significantly enhancing the accuracy of Large Language Models (LLMs) and effectively reducing their tendency to generate ‘hallucinated’ responses.

At the heart of this evolution is Retrieval Augmented Generation (RAG) – a technology widely recognized by experts as a significant step forward for LLMs.

RAG ingeniously blends sophisticated information retrieval with advanced text generation, notably refining the capabilities of celebrated models like GPT-4 and BERT, especially in handling complex queries.

While enterprises are keen to harness the potential of RAG, a critical aspect lies in effectively evaluating the quality of responses generated. The question of how to accurately measure and validate the efficacy of RAG-enhanced models is paramount. 

In this article, we delve into the nuances of response evaluation, introducing you to key metrics that are instrumental in gauging the precision and reliability of answers provided by RAG-powered LLMs.

‘Retrieval’ and ‘Generation’ are two important parts of the process, each requiring a different set of metrics to be evaluated. Let’s take a look.

Retrieval Role in RAG

In the Retrieval-Augmented Generation (RAG) system, the retrieval process is crucial. It searches a wide range of sources, like documents and web pages, to find information that matches a specific query or task. 

The effectiveness of the RAG system largely depends on how accurately this retrieval step gathers relevant information. Therefore, when evaluating the RAG system, special attention is given to the retrieval aspect. 

To assess its performance, the following metrics are used to measure the precision and relevance of the information it retrieves:

  1. Context Relevance
  2. MMR (Maximal Marginal Relevance)
  3. Context Recall

Let’s take a look and understand each one by one.

1. Context Relevance

This assesses how relevant and appropriate the provided context is for a given query or question. Let’s break it down to understand better:

  • Context: Refers to the information or background data provided along with a question or query. In RAG-empowered LLMs, context can include preceding conversations, relevant documents, or any added external information that might help understand and answer a query accurately.
  • Relevance: Measures how well this context aligns with the query. The more relevant the context, the more likely it is to yield accurate and pertinent responses from an AI model.

How is Context Relevance Measured?

To measure context relevance, cosine similarity is employed. This involves first transforming the context and the query into numerical forms, known as embeddings. These are high-dimensional vectors capturing the text’s semantic meaning.

In cosine similarity, the values range from -1 to 1. A score of 1 indicates identical vectors, suggesting high relevance, while 0 shows no similarity, and -1 implies complete oppositeness. For relevance, we aim for values close to 1.

When dealing with long contexts, they are broken down into smaller chunks. Each chunk’s relevance to the query is measured individually, helping to identify the most pertinent parts of the context.

This method not only compares context and query but also evaluates the quality of the embeddings and the effectiveness of the chunking process. Importantly, calculating context relevance doesn’t require a pre-defined correct answer; it’s about analyzing the relationship between context and query independently.
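As a rough illustration, here is a minimal sketch of chunk-level cosine-similarity scoring. It assumes the sentence-transformers library; the model name "all-MiniLM-L6-v2" and the character-based chunking are illustrative choices, not a prescribed setup.

```python
# Minimal sketch of context-relevance scoring via cosine similarity.
# Assumes the sentence-transformers library; "all-MiniLM-L6-v2" is an
# arbitrary example model, and character-based chunking is a simplification.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def context_relevance(query: str, context: str, chunk_size: int = 200) -> float:
    # Break long contexts into chunks and score each chunk against the query.
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    query_vec = model.encode(query)
    chunk_vecs = model.encode(chunks)
    scores = [cosine(query_vec, c) for c in chunk_vecs]
    # Report the best-matching chunk; averaging across chunks is another option.
    return max(scores)
```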

2. MMR (Maximal Marginal Relevance)

MMR is an algorithm designed to optimize the selection of a set of documents or results based on two criteria: relevance and diversity.

  • Relevance: How well a document or piece of content matches the user’s query or input.
  • Diversity: How different each selected document is from the others, ensuring a variety of perspectives or information types.
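To make the algorithm concrete, here is a minimal sketch of MMR selection. The helper names and the λ value (lam) are illustrative assumptions, and documents and the query are assumed to be pre-computed embedding vectors.

```python
# Minimal MMR selection sketch. `lam` trades off relevance against diversity;
# documents and the query are assumed to be pre-computed numpy embeddings.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query_vec, doc_vecs, k=3, lam=0.7):
    """Return indices of k documents balancing relevance (lam) and diversity (1 - lam)."""
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy = similarity to the most similar already-selected document.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```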

How to Measure MMR?

For relevance assessment, the objective is to determine how closely the documents selected by MMR align with the input query.

The metrics that can be used include:

  • Precision: Measures the proportion of selected documents relevant to the query. High precision indicates that most of the retrieved documents are relevant.
  • Recall: Assesses the proportion of all relevant documents that were successfully retrieved. High recall indicates that the algorithm effectively captures most of the relevant documents.
  • F1 Score: A harmonic mean of precision and recall, balancing the two. It’s particularly useful when you need a single metric to reflect both aspects.

In contrast, for diversity measurement, the objective is to evaluate the range and variance in the content of the selected documents, ensuring they cover different aspects or perspectives related to the query.

The metrics that can be used include:

  • Inter-Document Dissimilarity: This involves calculating the dissimilarity between each pair of documents in the retrieved set. Methods like cosine similarity on document embeddings can be used, where lower similarity scores indicate greater diversity.
  • Topic Coverage: Analyzing the range of topics or themes covered by the documents. This can be done through topic modeling techniques or manual analysis, ensuring that the selected documents span a broad spectrum of relevant topics.
  • Content Redundancy: A measure of how much repetitive information is present in the retrieved documents. Lower redundancy indicates a higher effectiveness of the MMR in diversifying the content.

3. Context Recall

This metric quantifies the system’s ability to accurately retrieve and incorporate relevant external knowledge, which is essential for generating contextually relevant responses. It assesses how well RAG identifies all relevant instances within a specific context.

Context Recall needs a labeled Ground Truth answer, which acts as a benchmark to compare the context retrieved by the system. It enables us to judge how well the retrieval system has performed in retrieving relevant context to answer a given query. 

The presence of a clear, labeled Ground Truth helps in determining the accuracy of the True Positives and False Negatives identified by the system.

How to Measure Context Recall?

Begin by creating a confusion matrix that summarizes the performance of your model.

The confusion matrix typically includes four categories: True Positives (TP), False Negatives (FN), False Positives (FP), and True Negatives (TN).

Use the Formula for Context Recall: The formula for context recall is:

 Context Recall = True Positives / (True Positives + False Negatives)

  • True Positives (TP): These are instances correctly identified as relevant within the context.
  • False Negatives (FN): These are instances that are relevant but missed by the model.

Simply plug the values of TP and FN into the formula to calculate context recall. The result will be a value between 0.0 (no recall) and 1.0 (full or perfect recall).
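Here is a minimal sketch of that calculation. The substring check used to decide whether a ground-truth statement is covered is a deliberately naive stand-in for an LLM- or embedding-based support judgment.

```python
# Minimal context-recall sketch. A ground-truth statement counts as a true
# positive if the retrieved context supports it; the substring check below is
# a naive stand-in for an LLM- or embedding-based support judgment.
def context_recall(ground_truth_statements, retrieved_context):
    tp = sum(1 for s in ground_truth_statements if s.lower() in retrieved_context.lower())
    fn = len(ground_truth_statements) - tp
    return tp / (tp + fn) if (tp + fn) else 0.0

context = "Paris is the capital of France. The Eiffel Tower is in Paris."
truths = [
    "paris is the capital of france",
    "the eiffel tower is in paris",
    "the louvre is in paris",
]
print(context_recall(truths, context))   # 2 of 3 statements supported -> ~0.67
```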

Generation Role in RAG

While the retrieval stage in the RAG (Retrieval-Augmented Generation) pipeline is crucial, it’s the generation stage that ultimately shapes the final output. This stage involves a Large Language Model (LLM), which synthesizes and generates responses based on the information retrieved in the first stage. 

Ensuring that what the LLM generates is true rather than hallucinated is crucial. For that, the model can be evaluated using the metrics listed below:

  1. Faithfulness Evaluator/ Groundedness
  2. Correctness Evaluator
  3. Confidence Score/ Certainty Score
  4. Answer Relevancy 

Let’s take a look and understand each one by one.

1. Faithfulness Evaluator/ Groundedness

This is a metric designed to assess whether the response generated by an AI is factually accurate or whether it is ‘hallucinated’ (i.e., fabricated or not based on facts). This metric is essential for determining the degree of fabrication in a response.

It establishes the truthfulness of the AI’s response relative to the provided context, ensuring that the model does not present plausible but factually incorrect information.

How is Faithfulness Measured?

The Faithfulness Evaluator/Groundedness metric can be complex, so let’s break it down further into a more detailed step-by-step process to understand how it actually works.

Step 1: Generating the AI Response

  • Starting Point: The process begins when a user poses a question or a prompt to an AI system.
  • AI Response: Using a language model, the AI generates a response to this question or prompt.

Step 2: Breaking Down the AI Response into Statements

  • Statement Identification: The generated response is then dissected into individual statements or claims. This is done by the AI itself, which identifies distinct assertions or facts that the response presents.

Step 3: Verifying Each Statement

  • Verification Process: Each identified statement is then subjected to a verification process. This is where the ‘additional API calls’ come into play. The AI uses another tool or system (which can be part of the same language model or an external system) to check the factual accuracy of each statement.
  • Binary Outcome: The verification tool returns a binary outcome for each statement: ‘1’ if the statement is factually true or supported by the given context, and ‘0’ if it is false or hallucinated.

Step 4: Calculating the Faithfulness Score

  • Score Calculation: The faithfulness score is calculated by dividing the number of statements that are verified as true (those receiving a ‘1’) by the total number of statements made in the response.
  • Result: This score represents the proportion of the response that is factually accurate and grounded in the retrieved context.
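A hedged sketch of Steps 2–4 is shown below. `split_into_statements` and `verify_statement` are hypothetical stand-ins for the LLM calls described above, with each verification returning 1 (supported by the context) or 0 (not supported).

```python
# Hedged sketch of Steps 2-4. `split_into_statements` and `verify_statement`
# are hypothetical stand-ins for LLM calls: the first breaks the answer into
# claims, the second returns 1 if a claim is supported by the context, else 0.
def faithfulness_score(answer, context, split_into_statements, verify_statement):
    statements = split_into_statements(answer)                      # Step 2
    verdicts = [verify_statement(s, context) for s in statements]   # Step 3
    return sum(verdicts) / len(verdicts) if verdicts else 0.0       # Step 4
```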

In a RAG-empowered LLM system, the faithfulness evaluation is not just about the generated text but also about how well this text aligns with the information retrieved by the RAG component.

This measure aims to ensure that the AI’s responses are factually accurate and based on reliable information, thereby enhancing the trustworthiness and reliability of the system in providing informed answers.

2. Correctness Evaluator

In a RAG-empowered LLM, the Correctness Evaluator plays a critical role in validating that the AI-generated responses are not only contextually relevant but also factually accurate and aligned with a verified source of truth (golden answer).

This process is crucial in ensuring the trustworthiness and reliability of AI responses, particularly in scenarios where accurate information is paramount.

How is Correctness Measured?

The Correctness Evaluator verifies that AI responses align with the query’s intent and are factually accurate. It primarily uses two methods: Direct Comparison and the BERT Score Method. 

Together, these approaches robustly assess AI outputs, ensuring responses are relevant and meet high factual accuracy standards.

Direct Comparison

The AI’s response is compared with a “golden answer” — a pre-validated, accurate response to the same query. This comparison, typically involving additional API calls, results in a binary score: ‘1’ indicates a correct and relevant response; ‘0’ signifies inaccuracy or irrelevance.

BERT Score Method

Alternatively, the BERT score evaluates the similarity between the AI’s response and the golden answer, focusing on precision and information depth. The outcome is an F1 score, balancing precision and recall.

In essence, the Correctness Evaluator, whether through direct comparison or the BERT score, ensures that the AI’s responses are not only relevant but also factually accurate, upholding the reliability and trustworthiness of AI communications.

3. Confidence Score / Certainty Score

The Confidence or Certainty Score in AI evaluates the reliability of AI-generated responses. It assesses how accurately and coherently the response is formulated and checks for any factual inaccuracies, referred to as ‘hallucinations.’ This metric is essential for gauging the AI model’s assurance in its own outputs.

How is the Confidence Score Measured?

The score is computed using techniques like the sequence score, which analyzes the overall coherence and the probability of tokens, assessing the likelihood of each word in context. Another method is to compute the transition score, evaluating the logical flow of the response.

These calculations are part of the response generation, providing insights into both the overall response and individual word confidence levels.
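As one illustration (not the only way to obtain such scores), the Hugging Face transformers library exposes per-step generation scores that can be converted into token- and sequence-level confidence values; the model choice below is purely illustrative.

```python
# Illustrative token-level confidence with Hugging Face transformers; "gpt2"
# is just an example model. generate() returns per-step scores, which
# compute_transition_scores() converts into log-probabilities per generated token.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(
    **inputs, max_new_tokens=5, do_sample=False,
    output_scores=True, return_dict_in_generate=True,
)

transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
token_confidences = transition_scores[0].exp()         # probability of each generated token
sequence_confidence = token_confidences.mean().item()  # one simple aggregate confidence
print(token_confidences, sequence_confidence)
```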

The Confidence Score is crucial for understanding how confident the AI is in its response, offering a detailed view of the model’s proficiency in generating accurate and relevant answers.

The token-level assessment also aids in pinpointing less reliable sections of the response.

This metric does not rely on a comparison with a pre-determined correct answer, making it an independent measure of the response’s quality.

4. Answer Relevancy 

Answer Relevancy is a metric in AI that gauges how precisely an AI-generated response aligns with the intent and context of a given question. It’s essential for verifying that the AI not only responds coherently but also directly addresses the specific query without veering off-topic or providing irrelevant information.

How is Answer Relevancy Measured?

To determine answer relevancy, we methodically generate related questions and then evaluate their alignment with the standard query through a series of precise steps:

Creating Related Questions

The process starts by generating a set of questions related to the AI’s response through a specialized prompt. This step aims to explore a variety of potential questions that the response could address.

Assessing Similarity

The crux of the calculation lies in comparing these generated questions with the actual query. This assessment focuses on how well the AI’s response matches the original question in terms of relevance and specificity.

Unlike some metrics, Answer Relevancy doesn’t require a ‘golden answer’ for comparison. Instead, it purely evaluates the response’s relevance to the original question, highlighting the AI’s capacity for targeted and contextually accurate responses.
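A minimal sketch of this procedure is shown below. `generate_questions` is a hypothetical stand-in for the question-generation prompt, and sentence-transformers is an illustrative embedding choice.

```python
# Sketch of answer-relevancy scoring. `generate_questions` is a hypothetical
# stand-in for the question-generation prompt; sentence-transformers is an
# illustrative embedding choice.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy(original_question, answer, generate_questions, n=3):
    generated = generate_questions(answer, n)   # questions the answer could be addressing
    q_vec = model.encode(original_question)
    gen_vecs = model.encode(generated)
    # Average similarity between the original question and each generated question.
    return float(np.mean([cosine(q_vec, g) for g in gen_vecs]))
```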

The Metrics Use Depends on the Use Case of LLMs

Now that we have a broad idea of the evaluation metrics we can use to evaluate the retrieval and generation pipelines of LLMs, we must be aware that the metrics used depend on the use case. The metrics used for a QA system are different from those used to evaluate a summarisation system.

Precise Metrics For QnA Systems

In the QA pipeline, there are generally two configurations: the first involves passing a specific context (closed QA), and the second is open-ended QA. In our scenario, we will be utilizing a closed QA system, which means the system operates within a given context.

This approach significantly reduces the likelihood of the system generating hallucinations or inaccurate responses compared to an open-ended setup. Furthermore, the QA methodologies are divided into two main categories: abstractive and extractive.

Extractive Question Answering

In extractive QA, the LLM operates by identifying and extracting the most relevant segments of text from a given context. This process typically involves the LLM selecting the ‘k’ most significant sentences or phrases that directly answer the query. This approach is more straightforward as it involves lifting portions of the text verbatim. 

However, it also presents challenges, especially when it comes to evaluating the effectiveness of these responses.

Metrics for Extractive QA

  1. BLEU Score 
  2. Context Coverage 
  3. Precision
  4. F1-Score
  5. ROUGE

BLEU Score

The BLEU Score is a metric for evaluating the quality of the language model’s response. It measures how closely the output aligns with high-quality human translations by comparing sequences of words (n-grams) in both responses.

#Calculation To calculate the BLEU Score in QA, the system quantifies the overlap of n-grams between the model’s answer and the reference answers. It counts the number of n-gram matches and calculates precision scores for these matches. The final BLEU Score, ranging from 0 (no overlap) to 1 (complete overlap), is derived from the average of these precision scores, emphasizing the matching word sequences.

#Usefulness The BLEU Score is crucial for evaluating how closely a model’s answers mirror the phrasing and content of standard human responses. A high BLEU Score indicates that the model can accurately extract and replicate key information from texts, which is vital for the effectiveness and reliability of QA systems.
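For illustration, here is how a sentence-level BLEU score can be computed with NLTK (one of several libraries that implement it); whitespace tokenization and smoothing are simplifications.

```python
# Illustrative sentence-level BLEU with NLTK; whitespace tokenization and
# smoothing (to avoid zero scores on short answers) are simplifications.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the eiffel tower is located in paris".split()
candidate = "the eiffel tower is in paris".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")   # closer to 1.0 means closer overlap with the reference
```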

Context Coverage 

This metric is used to evaluate how comprehensively an answer captures information from the provided context. It assesses the extent to which the answer includes relevant content from that context.

#Calculation Context Coverage is based on recall, which involves comparing the n-grams (sequences of n consecutive words) in the answer with those in the context. This process quantifies the proportion of n-grams in the answer that are also found in the context. A higher proportion indicates better context coverage.

#Usefulness This is vital for understanding the effectiveness of a QA system in utilizing the given context to formulate answers. It helps in determining whether the model is effectively incorporating key information from the context into its responses, ensuring that the answers are not only correct but also contextually relevant and comprehensive.

Precision

This metric specifically measures how much of the information provided in the model’s response is relevant and accurate, without including misinformation or irrelevant details.

#Calculation It is calculated by determining the ratio of the number of overlapping n-grams (sequences of n consecutive words) found both in the model’s answer and the reference answer to the total number of n-grams in the model’s answer. 

#Usefulness Precision is crucial for evaluating the reliability of a QA system. It ensures that the answers provided by the model are factually correct and relevant to the question, minimizing the inclusion of extraneous or incorrect information. High precision indicates a model that can deliver concise and accurate responses, which is essential for user trust and effective information retrieval.

F1-Score

The F1-Score in Extractive Question Answering (QA) is a combined metric that evaluates the balance between the comprehensiveness of an answer (Context Coverage) and its accuracy (Precision). It serves as a single measure to assess how effectively the model’s response covers relevant context while maintaining accuracy.

#Calculation F1-Score is calculated as the harmonic mean of Context Coverage and Precision. Context Coverage measures the proportion of relevant information from the context included in the answer, while precision assesses the accuracy of the answer by comparing the ratio of correct information to the total information provided. By combining these two metrics, the F1-Score provides a balanced view of the model’s performance in terms of both relevance and accuracy.

#Usefulness The F1-Score is crucial for evaluating QA systems as it ensures a balanced consideration of both relevance and accuracy. It helps in identifying models that not only provide accurate answers but also include a comprehensive range of information from the given context. A high F1-Score indicates a model that effectively balances these aspects, making it more reliable and useful in practical applications.
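To make the relationship between Context Coverage, Precision, and the F1-Score concrete, here is a minimal unigram-overlap sketch in plain Python; real evaluations typically use larger n-grams and proper tokenization.

```python
# Minimal unigram-overlap sketch: precision, context coverage (recall), and F1.
from collections import Counter

def ngrams(text, n=1):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def precision_coverage_f1(answer, reference, n=1):
    ans, ref = ngrams(answer, n), ngrams(reference, n)
    overlap = sum((ans & ref).values())                 # n-grams shared by both texts
    precision = overlap / max(sum(ans.values()), 1)     # share of answer n-grams that are correct
    coverage = overlap / max(sum(ref.values()), 1)      # share of reference n-grams captured
    f1 = 2 * precision * coverage / (precision + coverage) if (precision + coverage) else 0.0
    return precision, coverage, f1

print(precision_coverage_f1("the eiffel tower is in paris",
                            "the eiffel tower is located in paris france"))
# -> (1.0, 0.75, ~0.857)
```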

ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a metric similar to BLEU for evaluating text summarization and machine translation. It assesses output quality by calculating the overlap in n-grams (sequences of n consecutive words) between the generated text and a reference text.

#Calculation It calculates the overlap by counting the number of n-grams in the generated text (candidate n-grams) that match the n-grams in the reference text. The focus can be on different lengths of n-grams, such as 1-gram (single word), 2-gram (two consecutive words), and so on. The score is a measure of how many n-grams are common between the candidate text and the reference, reflecting the extent of similarity in content and structure.

#Usefulness By measuring how much of the reference content is captured in the generated text, ROUGE helps in assessing the effectiveness of models in producing coherent and contextually relevant summaries or translations. It is a crucial metric for developers and researchers in the field of natural language processing to refine and validate their models.
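As an illustration, the rouge-score package computes these overlaps directly; the variants chosen below (ROUGE-1 and ROUGE-L) are examples.

```python
# Illustrative ROUGE computation with the rouge-score package; the chosen
# variants (ROUGE-1 and ROUGE-L) are examples.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the eiffel tower is located in paris",   # reference text
    "the eiffel tower is in paris",           # generated text
)
for name, s in scores.items():
    print(name, f"P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```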

Abstractive Question Answering

Abstractive QA is more complex. Here, the LLM is not just finding and extracting text but paraphrasing and reinterpreting the context to form new, coherent answers. 

This requires a deeper understanding and processing capability from the LLM, making the evaluation metrics slightly different from extractive QA.

Metrics for Abstractive QA

  1. Confidence Score
  2. Embedding Similarity
  3. BERTScore
  4. Human-Centric Evaluation Score
  5. METEOR

Confidence Score

This metric quantifies the likelihood of a language model’s response being correct and relevant to a given query. It represents the model’s self-assessment of its prediction accuracy in generating a concise and coherent answer abstracted from the source material.

#Calculation: For each response generated by the model, a Confidence Score is calculated. This score, typically ranging between 0 and 1, is determined based on the model’s internal algorithms and its training on various data sets. A higher score indicates greater confidence in the response’s accuracy and relevance to the question, while a lower score suggests uncertainty.

#Usefulness Confidence Scores are essential for evaluating the reliability of the model’s answers. They help distinguish between high- and low-certainty responses, guiding further improvement of the model. In practical applications, these scores can be used to trigger additional verification steps or to refine the model’s training, ensuring that the answers generated are not only contextually accurate but also closely aligned with the user’s information needs.

Embedding Similarity

This metric measures the degree of semantic similarity between the vector representations (embeddings) of the model-generated answer and the reference answer or source text. It evaluates how closely the concepts and meaning in the model’s response align with those in the reference material.

#Calculation For Embedding Similarity calculation, both the generated answer and the reference text are first converted into vector representations using embedding techniques (like word2vec, GloVe, or BERT embeddings). The similarity between these vectors is then computed, often using cosine similarity, which measures the cosine of the angle between the two vectors. The resulting value, ranging from -1 (completely dissimilar) to 1 (identical), quantifies the semantic closeness of the generated answer to the reference text.

#Usefulness Embedding Similarity is key for ensuring that the model’s answers in QnA are not only factually accurate but also contextually and semantically relevant. It helps in confirming that the model captures the essence of the question or source material, providing answers that are meaningfully aligned with the user’s query.

BERTScore

This metric uses BERT’s pre-trained contextual embeddings to compare the generated text (candidate sentences) with reference sentences. By measuring the cosine similarity between embeddings, BERTScore evaluates how well the candidate text aligns semantically with the reference text.

#Calculation BERTScore matches words in candidate and reference sentences by cosine similarity of their embeddings. For each pair of sentences, it computes precision, recall, and F1 measure. Precision reflects the proportion of the candidate sentence’s embeddings that are similar to those in the reference, indicating the quality of the information in the response. Recall measures the extent to which reference sentence embeddings are captured in the candidate, showing the response’s coverage. The F1 score combines these two to provide an overall assessment.

#Usefulness BERTScore has shown a strong correlation with human judgment, making it useful for various language generation tasks, including QnA. Its ability to understand context and semantics through BERT embeddings provides a nuanced evaluation of text quality. The metric helps ensure that generated text is not only contextually aligned with the reference but also maintains semantic integrity. This is crucial in applications where accuracy in both information and context is a must.
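For illustration, the bert-score package computes these precision, recall, and F1 values directly; the underlying model it downloads is selected by the `lang` argument.

```python
# Illustrative BERTScore computation with the bert-score package.
from bert_score import score

candidates = ["The Eiffel Tower is in Paris, France."]
references = ["The Eiffel Tower is located in Paris."]

P, R, F1 = score(candidates, references, lang="en")
print(f"Precision={P.item():.3f}  Recall={R.item():.3f}  F1={F1.item():.3f}")
```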

Human-Centric Evaluation Score (HCES) 

Human-Centric Evaluation Score (HCES) is a method used to assess the performance of Large Language Models (LLMs) by employing human judges. Unlike automated metrics, HCES relies on human evaluators who review and rate the model’s outputs based on various criteria, capturing the subtleties and nuances of language that automated systems might miss.

#Calculation In HCES, human judges rate the quality of the model’s outputs on a specified scale, such as 1 to 5. These ratings are subjective and depend on the criteria set by the evaluators, which might include factors like relevance, coherence, factual accuracy, and fluency. The scores from multiple judges are then averaged to provide a general sense of the model’s accuracy or overall quality. This approach has no standardized accuracy value and varies based on the evaluators’ judgment.

#Usefulness The primary benefit of HCES is its ability to capture the subtle aspects of language and the overall effectiveness of the generated text, which automated metrics may overlook. It offers a more comprehensive understanding of a model’s performance, especially in terms of its ability to produce contextually appropriate and human-like responses. However, HCES can be time-consuming and costly to conduct on a large scale and is subject to human bias and variability, which can affect the consistency and reliability of the evaluations.

METEOR 

This is another metric used to evaluate the quality of machine translations. Unlike BLEU, METEOR pays attention to synonyms and sentence structure, offering a more nuanced understanding of language fluency and coherence. It assesses how well the translation captures the actual meaning of the original text, even if different words or sentence structures are used.

#Calculation METEOR Scores range from 0 to 1, with higher scores indicating better translations. The score reflects not just word-for-word accuracy but also the use of synonyms and the overall structure of the sentences. A high METEOR score suggests a translation that is accurate in meaning and fluency, even with changes in wording. A low score might indicate a translation that, while broadly correct, strays from the precise wording or sentence structure of the reference.

#Usefulness METEOR is beneficial because it evaluates translations more like a human would, considering synonyms and rephrasing. This makes it particularly useful for assessing translations that must be fluent and coherent, not just correct. However, it is more complex to compute than BLEU or ROUGE and can be influenced by the choice of reference translations.

An Overall Score: A Creative Approach to LLM Evaluation

The evaluation of a RAG-empowered LLM isn’t just about individual metric scores; it’s about how these scores converge to reflect the overall effectiveness and reliability of the system. This brings us to the concept of the ‘Overall Score’ – a composite measure that encapsulates the multifaceted nature of AI response evaluation.

The Weighted Score Approach: Customization for Specific Domains

Our research led us to the weighted score approach as the ideal method for comprehensive evaluation. In this approach, different weights are assigned to various metrics depending on the specific use case.

For instance, in critical domains like finance and medicine, greater weight might be given to ‘Faithfulness’ and ‘Correctness’ due to the high stakes involved in accuracy. Conversely, in closed QA systems for creative tasks, ‘Answer Relevancy’ and ‘Context Relevance’ might take precedence.

Adapting the Formula for Diverse Use Cases

Our proposed formula for calculating the Overall Score is: 

S = Wcr * context_relevancy + War * answer_relevancy + Wcs * confidence_score + Wf * faithfulness + Wc * correctness

However, considering the limitations in certain scenarios, such as the absence of a ‘Correctness Score’ due to a lack of domain-expert validated answers, we adapt our formula accordingly. Also, for LLM systems like OpenAI, where the ‘Confidence Score’ isn’t available, we exclude this from our calculation.

Here is the new modified formula:

S = Wcr * context_relevancy + War * answer_relevancy + Wf * faithfulness

Now, to understand the impact of this scoring, let’s consider two hypothetical responses, using weights Wf = 0.4 and Wcr = War = 0.3. The first scorecard:

Context Relevancy: 0.86

Answer Relevancy: 0.78

Faithfulness: 1

Calculated Score: S = 0.4 * 1 + 0.3 * 0.86 + 0.3 * 0.78 = 0.892

This score indicates a response that is not only grounded in factual accuracy (high faithfulness) but also highly relevant to both the context and the query.

Now consider a second scorecard:

Context Relevancy: 0.86

Answer Relevancy: 0.78

Faithfulness: 0

Calculated Score: S = 0.4 * 0 + 0.3 * 0.86 + 0.3 * 0.78 = 0.492

Despite the relevancy in context and answer, the zero score in faithfulness drastically lowers the overall score, indicating a response that lacks factual grounding and reliability.
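The two scorecards above can be reproduced with a short weighted-sum function; the weight values are illustrative and should be tuned per use case.

```python
# Sketch of the weighted Overall Score; weights are illustrative
# (here Wf = 0.4, Wcr = War = 0.3, matching the scorecards above).
def overall_score(context_relevancy, answer_relevancy, faithfulness,
                  w_cr=0.3, w_ar=0.3, w_f=0.4):
    return w_cr * context_relevancy + w_ar * answer_relevancy + w_f * faithfulness

print(f"{overall_score(0.86, 0.78, 1):.3f}")   # 0.892
print(f"{overall_score(0.86, 0.78, 0):.3f}")   # 0.492
```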

RAG Empowered LLM Evaluation with Dataworkz

For enterprises adopting Retrieval-Augmented Generation in LLMs, especially for building question-and-answer systems, it’s essential to ensure that the Q&A systems retrieve accurate information and generate coherent responses as expected.

This is where Dataworkz’s expertise in implementing reliable RAG systems gives enterprises the confidence they need.

Dataworkz’s process for evaluating Q&A system responses is thorough yet user-friendly:

  • User Engagement: Users interact with the QnA system by asking questions, with the system tracking all interactions for insights.
  • Response Analysis: Dataworkz offers two methods for evaluating responses:
    • Verification Against Source Documentation.
    • Insight Analytics for LLM Responses, providing objective analytics on correctness, answer relevancy, and context relevancy.

You can check out the video given below to see the LLM response evaluation in action:

Here are some of the compelling reasons for you to partner with Dataworkz for your RAG deployment:

  • Proven Expertise: Dataworkz excels in deploying advanced large language models tailored for business use.
  • Customized Solutions: They specialize in creating RAG systems that align precisely with specific business requirements.
  • End-to-End Support: From setup to maintenance, Dataworkz offers comprehensive and ongoing support.
  • Secure and Compliant: Their solutions adhere to the highest standards of data security and industry compliance.
  • Ease of Use: User-friendly interfaces make their systems accessible to all skill levels.
  • Constant Evaluation: They provide detailed analytics for ongoing system performance evaluation.
  • Empowering Training: Dataworkz offers extensive training to help businesses maximize the use of their RAG systems.

RAG deployment with an LLM can drastically reduce the time it takes to harness enterprise data for actionable insights, from months to weeks.

Here is a teaser that shows the creation of a simple QnA application in just five minutes:

 

If you found that exciting, imagine Dataworkz’s impact on your business operations. Click to book your free demo today!
