Maintaining the naturalness and accuracy of the responses generated by LLMs is crucial for modern conversational AI. Retrieval-Augmented Generation (RAG) is a hybrid approach that allows language models to access knowledge from external sources. An important step in building such systems is evaluating the responses the model generates, and that requires a dedicated RAG evaluation dataset.
Building that evaluation dataset takes a step-by-step approach. First, you need to define the objective of the dataset. After identifying and curating the document sources, you need to develop a set of evaluation queries. Ideal responses must also be prepared before pairing queries with document passages. Only then can you format and use the dataset for RAG evaluation.
This detailed guide walks you through building a RAG evaluation dataset from scratch and shows how to handle common challenges along the way.
Basics of a RAG Evaluation Dataset
A RAG evaluation dataset helps measure how well a system retrieves relevant passages and generates responses grounded in documents. Every evaluation dataset includes three key fields.
Input Queries
These are the questions or prompts that users may ask the language model. Creating realistic input queries ensures the evaluation covers the kinds of questions the system will actually face, and tests whether it can find relevant passages in the sources used to build the dataset.
Document Sources
Traditional evaluation datasets rarely depend on document sources, but RAG evaluation datasets are built around them. Each input query is paired with passages from document sources, which is where the model draws the relevant information. The better the document quality, the higher the chance of the model generating accurate responses.
Expected or Ideal Answers
The responses from the model must be compared with ideal answers to evaluate the model accurately. The set of ideal answers is created while building the evaluation dataset.
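To make these three fields concrete, here is a minimal sketch of how one record might be represented in Python. The field names are placeholders for illustration, not a required schema.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """One entry in a RAG evaluation dataset (illustrative field names)."""
    query: str                     # the input query a user might ask
    reference_passages: list[str]  # passages from the document sources
    ideal_answer: str              # the expected, human-written answer

record = EvalRecord(
    query="Who is the Secretary-General of the United Nations?",
    reference_passages=["António Guterres has served as UN Secretary-General since 2017."],
    ideal_answer="António Guterres is the current Secretary-General of the United Nations.",
)
```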
A high-quality evaluation dataset brings several benefits. It helps verify that the system generates grounded, factually accurate responses, and it highlights areas for improvement. With granular analysis, both retrieval and generation can be tuned for noticeably better performance. Let’s now move on to the actual building process.
Step-By-Step Guide to Building a RAG Evaluation Dataset
Building a RAG evaluation dataset takes sound planning and careful execution. The steps below cover the details necessary for a reliable evaluation, so make sure to follow each of them.
Step 1: Defining the Objective of the Dataset
The very first step in building a RAG evaluation dataset is to define its objective. With a clear objective, the evaluation dataset is expected to give better results.
Use Case Identification
The input queries, the document sources, and the ideal answers vary greatly depending on the use case. For example, the input queries of a dataset used in the healthcare industry will be vastly different from those used in the software development industry. Figure out where the model will be used and what type of document sources are required for that.
Setting Evaluation Goals
Setting goals refers to determining factors that you want to measure with the dataset. The content and structure of the dataset will depend on the factors you want to test during the evaluation. Here are a few examples of evaluation goals.
- Factual Accuracy: This tests whether the response generated by the model is backed by accurate information. The higher the accuracy, the more grounded the model is.
- Retrieval Relevance: This indicates how closely related the retrieved passages are to the query. If a model can retrieve highly relevant passages from the document source, it is considered to have a high retrieval relevance.
- Answer Comprehensiveness: This measures how completely the response generated by the model answers the input query. A detailed and accurate response shows the model is able to generate comprehensive answers.
Selecting Appropriate Metrics
Another important task in this step is selecting the evaluation metrics. Retrieval metrics such as precision, recall, and F1 score, along with generation metrics such as BLEU or ROUGE, can help determine how well the system is performing.
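To make the retrieval side concrete, here is a minimal, dependency-free sketch of precision, recall, and F1 computed over retrieved passage IDs. Generation metrics such as BLEU or ROUGE would typically come from an existing library instead.

```python
def retrieval_metrics(retrieved_ids: set[str], relevant_ids: set[str]) -> dict[str, float]:
    """Precision/recall/F1 for a single query, based on passage IDs."""
    true_positives = len(retrieved_ids & relevant_ids)
    precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0
    recall = true_positives / len(relevant_ids) if relevant_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the retriever returned p1 and p3, but only p1 and p2 were annotated as relevant.
print(retrieval_metrics({"p1", "p3"}, {"p1", "p2"}))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```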
Step 2: Identifying and Curating Document Sources
Now that the objectives and evaluation metrics are clear, the next step is to gather reliable document sources to build the dataset. As these sources build the foundation for the responses the model generates, you should follow these steps closely.
Using Diverse Sources
The more diverse the document sources, the more accurate the model’s responses will be. Use the following types of document sources to improve coverage.
- Public Sources: In any field, public databases like Wikipedia or government databases can be used as reliable knowledge bases. These databases usually have a wide range of documents for general knowledge queries. When the model needs to deliver a generalized response, these public sources work great.
- Industry-Specific Databases: Language models specific to an industry need document sources related to that industry. For example, you can use biotech, pharmaceutical, or similar research papers as document sources if you want the model to generate in-depth responses in the healthcare industry.
- Internal Document Sources: If the language model is designed for more specific tasks, it will need more critical documents. For example, a company needs to use policy documents, SOPs, or other materials for a model that will be used to provide customer support.
Choosing High-Quality and Reliable Document Sources
Not every document will be up to the mark when building the evaluation dataset. The ideal document source must be credible and authentic so that the model can generate accurate responses. This prevents the propagation of misinformation through the model. Government databases, high-authority websites, peer-reviewed journals, etc., tend to be the most reliable document sources.
Organizing and Indexing Documents
Unorganized document sources make retrieval much more difficult and increase the chance of faulty retrieval. To get the best results, organize and index the documents according to their relevance.
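As one way to approach this, the sketch below chunks documents into passages and builds a simple in-memory index keyed by passage ID. The chunking rule and metadata fields are assumptions you would adapt to your own sources.

```python
def chunk_document(doc_id: str, text: str, source: str, max_words: int = 120) -> list[dict]:
    """Split a document into word-bounded passages and attach basic metadata."""
    words = text.split()
    passages = []
    for i in range(0, len(words), max_words):
        passages.append({
            "passage_id": f"{doc_id}-{i // max_words}",
            "doc_id": doc_id,
            "source": source,
            "text": " ".join(words[i:i + max_words]),
        })
    return passages

# Build a simple index: passage_id -> passage record.
index = {}
for passage in chunk_document("policy-001", "Refunds are issued within 14 days ...", "internal/refund_policy.pdf"):
    index[passage["passage_id"]] = passage
```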
Step 3: Developing Evaluation Queries
This is a critical step: you need to define evaluation queries that reflect the real-world questions coming the model’s way. Keep the following factors in mind when building the set of queries.
Creating Different Types of Queries
Multiple types of queries must be used to evaluate the model. These queries should have different user intents, just like the real-world use case of the model. Take a closer look at the common query types below.
- Factual Queries: These queries require a clear and straightforward answer in short sentences or paragraphs. Focus on queries that demand a direct, factual answer from the model. A good example of such a query is: Who is the Secretary-General of the United Nations?
- Comparative Queries: Language models need to compare two or more things to answer such queries, so include comparisons of varied kinds. For example: What is the difference between fruits and fruit juices?
- Exploratory Queries: These queries demand comprehensive responses. For example, someone might ask the language model about how to install WordPress and the model needs to generate a detailed process in response.
Different Levels of Query Complexity
Simple queries are a good starting point, but you cannot skip complex ones, since the model will face them in real-world use. Create queries of varying complexity to see how the model performs when a user query is far from the expected pattern.
Including User Intents in Queries
When the query has an informational intent, the model should generate precise responses. For queries with an educational or commercial intent, the model might need to generate procedures, steps, or detailed responses. Your evaluation dataset should have queries with different intents to test the capability of the model.
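One way to keep type, complexity, and intent balanced is to label each query explicitly and check the distribution. The categories below are illustrative, not a fixed taxonomy.

```python
from collections import Counter

evaluation_queries = [
    {"query": "Who is the Secretary-General of the United Nations?",
     "type": "factual", "complexity": "simple", "intent": "informational"},
    {"query": "What is the difference between fruits and fruit juices?",
     "type": "comparative", "complexity": "moderate", "intent": "informational"},
    {"query": "How do I install WordPress on a shared hosting plan?",
     "type": "exploratory", "complexity": "complex", "intent": "educational"},
]

# A quick check that the query set is not skewed toward one category.
print(Counter(q["type"] for q in evaluation_queries))
```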
Step 4: Preparing Ideal Responses
An ideal response is the benchmark for evaluating the output. So, you need to create a high-quality response for each query. Here is how you do that.
Drafting Accurate and Relevant Responses
Draft answers that are accurate and directly address the query. Depending on the query type, the answers should have varying lengths and formats. All the essential information should be present in the answer without it being unnecessarily long.
Validating Answers by Industry Experts
In specialized fields, the importance of factually and contextually correct answers is much higher. You need to consult with industry experts to ensure the answers you craft are valid and accurate. Doing this may seem like an extra step that could be avoided, but we assure you the evaluation will be far more reliable this way.
Keeping Track of Answer Sources
Record which document was used for crafting each answer so it can be tracked easily in the future. This improves the transparency of the dataset and provides an effective reference when reviewing the system’s outputs. Documents added to the knowledge base later can also be categorized more easily if the answer sources are tracked.
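A lightweight way to do this is to store the sources alongside each ideal answer. The field names here are just one possible layout, shown as an assumption rather than a standard.

```python
ideal_answer = {
    "query_id": "q-042",
    "answer": "Refund requests are processed within 14 business days of approval.",
    "source_doc_ids": ["policy-001"],        # documents the answer was drafted from
    "source_passage_ids": ["policy-001-0"],  # specific passages, if available
    "reviewed_by": "domain_expert_01",       # expert validation from the previous step
}
```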
Step 5: Pairing Queries with Document Passages
Getting high-quality responses from a model requires effective pairing of the queries and the document passages. This is done in multiple steps to ensure the model can retrieve the most relevant information in a useful way.
Mapping Sources to Queries
Creating the correct query-passage pair is important for getting an accurate and relevant response. Let’s say you have a query like “What is the most popular food in Turkey?” You should pair it with passages from your document sources that contain information about popular Turkish foods. This will help the model retrieve the information more efficiently.
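A simple sketch of this mapping, plus a crude keyword-overlap check to catch obviously mismatched pairs, might look like the following. A real pipeline would likely use an embedding-based similarity check instead; the IDs and threshold here are illustrative.

```python
import string

query_passage_pairs = {
    "q-001": {
        "query": "What is the most popular food in Turkey?",
        "passage_ids": ["cuisine-007-2", "cuisine-007-3"],
    },
}

def rough_overlap(query: str, passage_text: str) -> float:
    """Fraction of query words that also appear in the passage (very rough sanity check)."""
    strip = str.maketrans("", "", string.punctuation)
    query_words = set(query.lower().translate(strip).split())
    passage_words = set(passage_text.lower().translate(strip).split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0

# Flag pairs where the passage shares almost no vocabulary with the query.
if rough_overlap("What is the most popular food in Turkey?",
                 "Kebab and baklava are widely eaten in Turkey.") < 0.2:
    print("Possible mismatch: review this query-passage pair manually.")
```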
Highlighting Key Sections
Some passages or sections in your document sources may be relevant to many different queries. Annotating these key sections is an effective technique: the annotations act as ground-truth guidelines for which passages the retriever is expected to surface when generating proper responses.
Evaluating Passage Quality
Not all passages in your document sources might be informative or useful to the model. It is a good idea to select clear and informative passages. If there is any vague or outdated passage, you should avoid it to prevent the model from generating inaccurate responses. Highly technical passages should also be avoided unless the model needs to generate technical answers.
Step 6: Formatting the Dataset for RAG Evaluation
Incorrect dataset formatting can lead to inaccurate evaluation of the model. You should remember the following things while formatting the RAG evaluation dataset.
Using a Standard Format
Always use common formats like JSON or CSV so that all stakeholders can easily use the dataset for evaluation. These formats are compatible with most machine learning frameworks. Make sure your dataset has the following fields (an example record follows the list).
- Queries: The input queries created in previous steps
- Responses: Ideal answers created in previous steps
- Passages: Text passages that contain the relevant information
- Document Sources: Links to the passages or documents used for the response
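Here is one possible record in that layout, written out as JSON Lines from Python. The exact field names, file name, and the optional metadata block (discussed further below) are assumptions rather than a required schema.

```python
import json

record = {
    "query": "What is the difference between fruits and fruit juices?",
    "response": "Whole fruits retain their fiber, while most fruit juices remove it ...",
    "passages": ["Juicing removes most of the insoluble fiber found in whole fruit ..."],
    "document_sources": ["https://example.org/nutrition-basics"],
    "metadata": {"query_type": "comparative", "domain": "nutrition", "difficulty": "moderate"},
}

# Append one record per line to a JSON Lines file.
with open("rag_eval_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```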
Keeping the Structure Consistent
If the dataset is inconsistent, evaluation scripts may parse records incorrectly, which leads to faulty results and wasted effort. Always keep the data types, field names, and formats consistent to reduce the preprocessing time of text data.
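A small validation pass over the finished file can catch inconsistencies early. This sketch only checks that the expected fields exist with the expected types, under the field names assumed in the example above.

```python
import json

REQUIRED_FIELDS = {"query": str, "response": str, "passages": list, "document_sources": list}

def validate_dataset(path: str) -> list[str]:
    """Return a list of human-readable problems found in a JSON Lines dataset."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            for field_name, expected_type in REQUIRED_FIELDS.items():
                if field_name not in record:
                    problems.append(f"line {line_no}: missing field '{field_name}'")
                elif not isinstance(record[field_name], expected_type):
                    problems.append(f"line {line_no}: '{field_name}' should be {expected_type.__name__}")
    return problems

print(validate_dataset("rag_eval_dataset.jsonl"))
```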
Including Metadata
Adding metadata such as the source, publication date, topic, and query difficulty makes it easier to filter subsets of the dataset and analyze results by category. Complex queries can also be handled and diagnosed better if the dataset has proper metadata.
Final Words
Once the evaluation dataset is ready, you can run the model on sample queries to determine its performance. Use the metrics you selected earlier to assess the accuracy and relevance of the responses against the dataset you built. You can always refine the dataset based on the test results.
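Putting it all together, an evaluation run might loop over the dataset, call your RAG pipeline, and score each response. In this sketch, `run_rag_pipeline` and `score_response` are hypothetical stand-ins for whatever retrieval stack and metric you actually use.

```python
import json

def run_rag_pipeline(query: str) -> dict:
    """Hypothetical placeholder: should return the generated answer and retrieved passage IDs."""
    raise NotImplementedError("plug in your own retrieval + generation stack here")

def score_response(generated: str, ideal: str) -> float:
    """Hypothetical placeholder: plug in ROUGE, an LLM judge, or another metric."""
    raise NotImplementedError("plug in your chosen generation metric here")

def evaluate(dataset_path: str) -> float:
    """Average score across the evaluation dataset (JSON Lines format assumed above)."""
    scores = []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            output = run_rag_pipeline(record["query"])
            scores.append(score_response(output["answer"], record["response"]))
    return sum(scores) / len(scores) if scores else 0.0
```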
Before building real-world applications using RAG models, you must create evaluation datasets that possess the qualities mentioned above. This will lay the groundwork for a reliable and responsive model. We hope this guide helps you build an ideal evaluation dataset for your RAG models.