Generative AI
Krittanon Kaewtawee
4 min read
August 5, 2024

Thai Retrieval Augmented Generation (TRAG) Benchmark

In the rapidly evolving world of AI, the Thai Retrieval Augmented Generation (TRAG) Benchmark stands out as a pioneering tool designed to evaluate the performance of large language models (LLMs) in understanding and generating human-like responses in the Thai language. This benchmark is crucial for advancing AI technology tailored to Thai, ensuring that models can handle the unique linguistic and contextual challenges presented by the language.

TRAG Benchmark Objective

The primary objective of the TRAG Benchmark is to provide a standardized framework for assessing the capabilities of LLMs in processing and generating Thai language content. By focusing on enterprise Q&A data from various industries, the benchmark aims to simulate real-world scenarios where accurate and contextually appropriate responses are essential. This helps identify the strengths and weaknesses of different models, guiding further improvements and innovations in Thai language AI.

TRAG Test Methodology

The TRAG Benchmark employs a rigorous testing methodology using enterprise Q&A data synthesized from diverse industries such as airlines, automotive, banking, healthcare, and more. The test data comprises two main components:

  • Questions: These are queries that users might pose in a real-world setting.
  • Document Context: This contains the necessary information required to answer the questions accurately.

To generate answers, these two data points are combined with standard prompts. Because different model families respond differently to prompt formats, each model is evaluated with two standard prompts, and the run with the higher total score is taken as that model's official score. The generated answers are then graded against predefined criteria.
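To make the flow concrete, here is a rough Python sketch of how a test item could be combined with the two standard prompts and scored. The template wording, the field names, and the `generate`/`grade` callables are illustrative assumptions, not the benchmark's actual prompts or code.

```python
# Hypothetical sketch of the TRAG evaluation loop (not the benchmark's actual code):
# each question is combined with its document context under two standard prompt
# templates, both runs are scored, and the higher total becomes the official score.

# A single synthesized test item: a question plus the context needed to answer it.
test_items = [
    {
        "question": "How to care for diabetic patients in dental treatment?",
        "context": "<Guidelines for handling people with diabetes in a dental clinic>",
    },
]

# Two illustrative prompt templates; the wording here is made up for this sketch.
PROMPT_TEMPLATES = [
    "Answer the question using only the context below. If the context does not "
    "contain the answer, reply \"Sorry, I don't know\".\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:",
    "You are a helpful Thai enterprise assistant. Use the document to answer the "
    "user.\n\nDocument:\n{context}\n\nUser question: {question}",
]

def evaluate_model(model, items, generate, grade):
    """Score the model with both prompt templates and keep the higher-sum run.

    `generate(model, prompt)` calls the model under test; `grade(item, answer)`
    is the LLM-based grading step returning 1.0 or 0.0. Both are supplied by
    the caller in this sketch.
    """
    totals = []
    for template in PROMPT_TEMPLATES:
        total = 0.0
        for item in items:
            prompt = template.format(**item)
            total += grade(item, generate(model, prompt))
        totals.append(total)
    return max(totals)  # the prompt with the higher sum gives the official score
```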

TRAG Benchmark Criteria

The scoring criteria of the TRAG Benchmark are designed to ensure a comprehensive evaluation of the models. The criteria include:

Accuracy of Answerable

This metric measures the percentage of answerable questions that a model answers correctly. It reflects the model's ability to understand the question and retrieve the correct information from the provided context. Answers are graded by a separate set of impartial LLMs using a different predefined prompt that judges each answer as either right or wrong.

Accuracy of Unanswerable

This metric assesses the model's ability to respond with "Sorry, I don't know" to questions whose answer is intentionally absent from the accompanying context. It is crucial for evaluating whether the model can recognize when it lacks sufficient information to give a valid response. As with answerable questions, these responses are graded by a separate set of impartial LLMs using a different predefined prompt.
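Both accuracy metrics ultimately come down to a binary verdict from a judge model. The sketch below illustrates one way this grading step could look; the judge prompt wording and the `call_judge_llm` helper are assumptions for illustration, not the benchmark's actual grading prompts.

```python
# Hypothetical sketch of the binary grading step: a separate, impartial judge LLM
# receives the question, the reference context, and the candidate answer, and
# returns RIGHT or WRONG. The prompt wording and `call_judge_llm` are illustrative.

ANSWERABLE_JUDGE_PROMPT = (
    "You are an impartial grader. Given the question, the reference document, and "
    "the model's answer, reply with exactly one word: RIGHT if the answer is "
    "supported by the document, or WRONG otherwise.\n\n"
    "Question: {question}\nDocument: {context}\nAnswer: {answer}"
)

UNANSWERABLE_JUDGE_PROMPT = (
    "You are an impartial grader. The document does NOT contain the answer to the "
    "question. Reply RIGHT if the model correctly declined to answer (e.g. said "
    "\"Sorry, I don't know\"), or WRONG if it invented an answer.\n\n"
    "Question: {question}\nAnswer: {answer}"
)

def grade_binary(item, answer, answerable, call_judge_llm):
    """Return 1.0 if the judge LLM marks the answer RIGHT, else 0.0."""
    if answerable:
        prompt = ANSWERABLE_JUDGE_PROMPT.format(
            question=item["question"], context=item["context"], answer=answer)
    else:
        prompt = UNANSWERABLE_JUDGE_PROMPT.format(
            question=item["question"], answer=answer)
    verdict = call_judge_llm(prompt).strip().upper()
    return 1.0 if verdict.startswith("RIGHT") else 0.0
```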

LLM Judge

Inspired by the paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," this criterion uses GPT-4 to act as an impartial judge. The LLM Judge evaluates the responses based on the following sub-criteria:

  • Correctness & Helpfulness: This assesses whether the answer is factually correct and useful to the user.
  • Translation & Grammar: This evaluates the grammatical accuracy of the response in Thai, ensuring that the language used is correct and appropriate.
  • Additional Constraint: This tests the controllability of the prompt by injecting specific requirements, such as requiring the model to answer with "ค่ะ" (a polite particle in Thai) and to refer to itself by a specified name.

The LLM Judge then assigns an overall rating based on these criteria, which is used as the final judge score for the model. This comprehensive evaluation ensures that the models are not only accurate but also contextually and linguistically appropriate.
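The sketch below shows what such an MT-Bench-style judging step could look like in code. The rubric wording, the 1-to-10 scale with the `[[X]]` rating format, and the normalization to a 0-1 score are assumptions made for this example rather than the benchmark's exact setup.

```python
import re

# Hypothetical sketch of the LLM Judge step: GPT-4 (or another judge model) is asked
# to rate a response on a numeric scale, considering correctness & helpfulness,
# Thai grammar, and whether injected constraints (e.g. answering with "ค่ะ", using
# a given assistant name) were followed. The rubric and scale are illustrative.

JUDGE_RUBRIC = (
    "Act as an impartial judge. Rate the assistant's answer from 1 to 10, "
    "considering: (1) correctness and helpfulness with respect to the reference "
    "document, (2) Thai grammar and fluency, and (3) whether the answer follows "
    "the extra instructions given in the prompt (polite particle, assistant name). "
    "Reply in the form: Rating: [[X]].\n\n"
    "Question: {question}\nReference document: {context}\n"
    "Extra instructions: {constraints}\nAssistant answer: {answer}"
)

def judge_rating(item, answer, call_judge_llm):
    """Ask the judge model for a 1-10 rating and normalise it to a 0-1 score."""
    prompt = JUDGE_RUBRIC.format(
        question=item["question"], context=item["context"],
        constraints=item.get("constraints", "none"), answer=answer)
    reply = call_judge_llm(prompt)
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", reply)   # e.g. "Rating: [[8]]"
    return float(match.group(1)) / 10.0 if match else 0.0
```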

Total Response Time

Another dimension measured is the total response time, which is the time the model takes to complete the response (not just the time-to-first-token). This metric is crucial for evaluating the efficiency and practicality of the model in real-world applications.
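Measuring total response time is straightforward with a streaming generation API, as in the rough sketch below; `stream_tokens` is a hypothetical stand-in for whatever generation call is actually used.

```python
import time

# Hypothetical sketch: the clock runs until the full response has been generated,
# not just until the first streamed token arrives.

def measure_total_response_time(stream_tokens, prompt):
    """Return (time_to_first_token, total_response_time, full_text) in seconds."""
    start = time.perf_counter()
    first_token_at = None
    pieces = []
    for token in stream_tokens(prompt):        # iterate over the streamed output
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        pieces.append(token)
    total = time.perf_counter() - start        # stop only after the last token
    return first_token_at, total, "".join(pieces)
```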

Benchmark Results

Accuracy of Answerable

Figure: Accuracy of Answerable responses for various AI models (scores range from 0.667 to 1.000).

Good Answer Example:

Q: How to care for diabetic patients in dental treatment?
Document Context: <Guidelines for handling people with diabetes in a dental clinic>
A: For caring for diabetic patients in dental treatment, here are some precautions and guidelines:

  • Check the patient's blood sugar level before treatment. If the level is too high, the treatment might need to be postponed until better control is achieved.
  • Schedule the appointment in the morning after breakfast, as blood sugar levels are usually the most stable at this time.
  • Advise the patient to eat and take their medication as usual before visiting the dentist to prevent hypoglycemia.
  • Avoid using anesthetics with a high epinephrine content, as it can raise blood sugar levels.
  • Be especially cautious when treating wounds in the mouth, as diabetic patients tend to heal more slowly.
  • Recommend good oral hygiene practices because diabetic patients are at higher risk for periodontal disease and tooth decay.
  • If antibiotics are needed, choose ones that do not affect blood sugar levels.
  • Watch for symptoms of hypoglycemia during treatment, such as sweating, palpitations, and dizziness. If these occur, stop treatment and have the patient eat or drink something sugary immediately.
  • For complex or lengthy treatments, consult with the patient's diabetes care doctor to plan the treatment together.

I hope this information is helpful for the care of diabetic patients in dental treatment. If you have any additional questions, feel free to ask.

Bad Answer Example (the answer contradicts the information given in the context document):

Q: How will passengers be compensated if the flight is delayed?
Document Context: <Information about delay compensation>
A: According to the available information, if a flight is delayed, the airline will compensate passengers. However, it depends on the length and cause of the delay. Generally, if the delay is less than 2 hours, there will be no compensation. If the delay is between 2 to 4 hours, passengers might receive water. For delays over 4 hours, the airline might provide meal vouchers worth up to 100 baht. However, it is recommended to contact the airline directly for more details, as policies may vary.

Passengers should contact the airline staff immediately upon learning about the delay and keep all receipts, such as for food or accommodation, if they had to pay upfront. Additionally, each airline may have different compensation policies, so it is best to check the specific conditions of the airline you are flying with.

Accuracy of Unanswerable

This criterion highlights the model's ability to recognize when it lacks sufficient information to provide a valid response. Models that perform well in this area can effectively avoid hallucinations and provide appropriate responses such as "Sorry, I don't know." This is particularly important for maintaining the reliability and trustworthiness of the model.

Figure: Accuracy of Unanswerable responses for various AI models (scores range from 0.125 to 1.000).

Good Answer Example:

Q: Does this car model have ABS brakes?
Document Context: <Not provided>
A: I apologize, but the available information does not include details about the ABS brakes for this car model. Therefore, I cannot confirm whether this car model has ABS brakes. If you need specific information about the safety features of this car model, I recommend asking the dealer or checking the official website of the car manufacturer directly. If you have any other questions about cars, I am happy to help with the information available.

Bad Answer Example:

Q: Which hospitals can health insurance be claimed at?
Document Context: <Not provided>
A: Thank you for your question. I can provide a list of private hospitals designated by the company in Bangkok and its vicinity as follows: <List of hospitals>

LLM Judge Ratings

The LLM Judge provides a nuanced evaluation of the model's responses, considering correctness, helpfulness, translation quality, and adherence to specific constraints. High scores in this area indicate that the model not only provides accurate information but also communicates it effectively and appropriately in Thai.

Figure: LLM Judge Ratings for various AI models (scores range from 0.471 to 0.7464).

Good Answer Example:

Q: What documents are needed to authorize someone to close a current account?
A: You can authorize someone else to close a current account. The documents required are:

  • Power of attorney document
  • Bank account passbook
  • The principal’s ID card (account holder) with a live signature certifying the copy
  • The agent’s ID card
  • Withdrawal form (signed by the agent on behalf of the account holder)

Preparing the power of attorney document for account closure:

  • Name and surname of the principal (account holder)
  • Name and surname of the agent
  • Details of the account closure request
  • Affix a 10 baht stamp duty

In cases of authorizing someone to close an account, the amount received from the account closure will be issued as a bank check (Cashier's Cheque) payable in the name of the account holder and stamped 'A/C Payee Only' or transferred to an account in the same name as the account holder. No cash will be paid to the agent.

Bad Answer Example:

Q: What are the different models of the Apple Watch?
A: Sorry, which specific model are you interested in? Currently, I have information about Apple Watch Ultra 2, Apple Watch Series 9, and Apple Watch SE2.

Total Response Time

Efficient models that can generate responses quickly without compromising accuracy are highly valued. This metric ensures that the models are practical for real-world applications where timely responses are crucial.

Figure: Total Response Time for various AI models (values range from 1.4 to 18.35).

Limitations

One of the significant challenges in the TRAG Benchmark is the variation in prompt structures and best practices across different LLM families. It is difficult to give every model its optimal prompt format without biasing the benchmark by over-tuning the prompt for one model relative to another.

For example, if we add the instruction "try to provide a helpful response," some models, such as the Claude 3 family, show a large drop in the accuracy of unanswerable questions because their hallucination rate increases. Further investigation reveals that these models follow the instruction to be helpful and therefore avoid giving a true negative answer such as "Sorry, I don't know."

Consult with our experts at Amity Solutions for additional information here