TRAG Overview

The Thai Retrieval Augmented Generation (TRAG) Benchmark is designed to comprehensively evaluate the performance of large language models (LLMs) in understanding and generating human-like responses in the Thai language. This page provides a detailed explanation of the benchmark's structure, evaluation criteria, and scoring methodology.

Benchmark Structure

The TRAG Benchmark consists of 56 test cases, 8 categories, and 7 distinct scenarios.
Each test case comprises a user query and its corresponding document context. The distribution of the test cases is as follows:

Scenarios

The TRAG Benchmark features 7 distinct scenarios based on three key factors:

1. Types of Questions: Single-turn Q&A: Simple questions answered in one go. Follow-up Questions: In-depth queries with chat history provided.
2. Language of Document Context: English Thai Intentionally Empty
3. Availability of Information: Information Available: The necessary data to generate an answer is provided. No Information Available: The required data is missing, but the answer should still align with the prompt.

Evaluation Criteria

The TRAG Benchmark assesses the performance of LLMs across several key dimensions:

Factual Consistency

The generated response must be factually consistent with the information provided in the document context. If the answer is not found within the given context, the model is expected to respond with "ขออภัยค่ะ ไม่พบข้อมูล".

Language Quality

Responses should be in Thai and free of grammatical, spelling, or wording errors.
The tone should be polite and formal, with sentences ending in "ค่ะ".
Proper sentence structure and well-formed responses are expected.

Format

Responses should be in plain text format without any markdown or special formatting.

Response Speed

Models are expected to generate responses within a 15-second time limit.

The TRAG Benchmark employs a two-step scoring process:

1. LLM Overall Grading: gpt-4o-2024-05-13 an advanced LLM, serves as an impartial judge to evaluate the quality of the responses generated by the tested models. It assesses each response based on the evaluation criteria mentioned above.
2. LLM Answerable classify: gpt-4-0613 uses LLM to classify hallucination answer from response to know that llm use only provided document to answer user or not.

Benchmark Results

The TRAG Benchmark leaderboard showcases the performance of various LLMs on the benchmark.
The leaderboard is regularly updated as new models are evaluated on the benchmark.

LLM Judge Overall – The overall accuracy of each model in generating answers, considering factors such as correctness, grammar, translation, and additional constraints, as judged by GPT-4-2024-05-13.
Accuracy of Unanswerable Questions – The model's ability to accurately respond with "Not found" when the necessary information is not present in the document, as judged by GPT-4-0613.
Accuracy of Answerable Questions – The model's ability to provide correct answers when the necessary information is present in the document, as judged by GPT-4-0613.
Response Time (sec) – The average time each model takes to generate a response.

Get Started with TRAG Benchmark

Explore the forefront of Thai language AI with the TRAG Benchmark, a pioneering evaluation platform meticulously designed to assess and enhance the capabilities of Language Learning Models (LLMs). By rigorously evaluating models across diverse dimensions and test case categories, TRAG empowers researchers and developers to benchmark, refine, and innovate their AI solutions. Join us in driving the future of Thai language AI technology. We encourage researchers and developers to utilize the TRAG Benchmark to evaluate and compare the performance of their models, contributing to the advancement of Thai language AI technology.

Contact Us