Chatbot · Krittanon Kaewtawee · 3 min read · July 30, 2024

Announcing the TRAG Benchmark: A New Standard for Evaluating LLMs in the Thai Language

We are thrilled to introduce the Thai Retrieval Augmented Generation (TRAG) Benchmark, a groundbreaking evaluation platform designed to assess the performance of large language models (LLMs) in understanding and generating high-quality responses in the Thai language. This benchmark represents a significant step forward in the field of AI, providing a robust framework for evaluating LLMs across various dimensions and test case categories.

Overview of the TRAG Benchmark

The TRAG Benchmark has been meticulously developed to evaluate LLMs on their ability to comprehend document context and generate accurate, contextually appropriate answers in Thai. The benchmark comprises 56 test cases, categorized into 8 distinct areas, and features 7 unique scenarios. Each test case includes a user query and its corresponding document context, ensuring a comprehensive assessment of the model's capabilities.
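
To make the test-case structure concrete, here is a minimal sketch of what a single entry might look like in code. The field names (category, scenario, query, context) are illustrative assumptions, not the benchmark's published schema.

```python
# A minimal sketch of one TRAG test case. Field names are
# illustrative assumptions, not the benchmark's published schema.
from dataclasses import dataclass

@dataclass
class TragTestCase:
    category: str  # one of the 8 categories, e.g. "Airline"
    scenario: str  # one of the 7 scenarios
    query: str     # the user's question (in Thai in the real benchmark)
    context: str   # the paired document context; may be intentionally empty

example = TragTestCase(
    category="Airline",
    scenario="single-turn",
    query="Can I reserve a seat?",  # English here for readability
    context="<airline seat-reservation policy document>",
)
```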

Categories and Scenarios

The test cases are distributed across the following categories:

Airline: Policies and procedures related to seat reservations, ticket pricing, and promotion deadlines. Sample questions include:

  • How much does it cost to change the travel date?
  • Can I reserve a seat?

Automotive: Knowledge about vehicle accessories, load capacity, and current promotions. Sample questions include:

  • Is there a door lock?
  • Where is the SOS button?

Bank: Information on required documents for account closure, authorization procedures, and opening new accounts. Sample questions include:

  • What documents are needed to authorize the closure of a current account?

CRM: Membership details, including sign-up processes, point checks, and reward calculations. Sample questions include:

  • How do I check my points?

Health Care: Medical knowledge such as dental care, tuberculosis treatment, and medications for hypertension. Sample questions include:

  • How are tuberculosis patients treated?
  • Which medications should hypertension patients be cautious about?

Human Resources: Employee healthcare benefits, insurance coverage, and reimbursement policies. Sample questions include:

  • Can unused social security benefits be carried over to the next year?
  • Where can I find the list of hospitals or dental clinics designated by the company?

IT Gadget: User questions about mobile phones and smartphone comparisons. Sample questions include:

  • Can you recommend a good camera phone with a budget of 8,000 THB?
  • Which tablets are available under 10,000 THB?

Tech Support: LAN configuration, password retrieval, and WiFi router setup. Sample questions include:

  • Where can I find the ID and password?
  • I forgot the WiFi password.

[Figure: Donut chart of the category distribution across the TRAG Benchmark's eight areas: Tech Support, HR, Automotive, Airline, IT Gadget, Health Care, Bank, CRM.]

The scenarios are based on 3 key factors:

1. Types of Questions: Single-turn Q&A and follow-up questions.

2. Language of Document Context: English, Thai, and intentionally empty contexts.

3. Availability of Information: Scenarios where the required information is either present in the document context or deliberately missing, testing whether the model keeps its response grounded in what is actually provided.
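
The sketch below enumerates the full space these three factors span. Note that the factors allow 2 × 3 × 2 = 12 combinations while the benchmark defines 7 scenarios, so which subset is used is not something this post specifies; the code only illustrates the factor space.

```python
# Sketch of the scenario factor space. The three factor lists come from
# the description above; which 7 combinations TRAG actually uses is not
# enumerated in this post, so this only illustrates the space.
from itertools import product

question_types = ["single-turn", "follow-up"]
context_languages = ["English", "Thai", "empty"]
info_availability = ["available", "missing"]

for qt, lang, avail in product(question_types, context_languages, info_availability):
    print(f"question={qt:11s} context={lang:7s} info={avail}")
```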

[Figure: The three key scenario factors: Types of Questions, Language of Document Context, and Availability of Information.]

Evaluation Criteria

The TRAG Benchmark assesses LLM performance based on:

  • Factual Consistency: Ensuring responses are consistent with the provided document context.
  • Language Quality: Responses should be in Thai, free of grammatical errors, and maintain a polite and formal tone.
  • Format: Responses should be in plain text without special formatting.
  • Response Speed: Models must generate responses within a 15-second time limit.
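
The first three criteria are graded by an LLM (see the next section), while the response-speed criterion is a simple wall-clock check. Here is a minimal sketch of how that check might look; generate_response is a hypothetical stand-in for the model client under evaluation.

```python
# Minimal sketch of enforcing the 15-second response-time limit.
# `generate_response` is a hypothetical stand-in for the model under test.
import time

TIME_LIMIT_SECONDS = 15.0

def timed_generate(generate_response, query, context):
    start = time.monotonic()
    answer = generate_response(query, context)
    elapsed = time.monotonic() - start
    return answer, elapsed, elapsed <= TIME_LIMIT_SECONDS
```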

Scoring and Results

The benchmark employs a two-step scoring process:

1. LLM Overall Grading: An advanced LLM, GPT-4o-2024-05-13, evaluates the quality of responses against the criteria described above.

2. LLM Answerable Classification: GPT-4-0613 classifies responses to determine if the model used only the provided document to answer.
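
As a rough sketch of how this two-step flow could be wired up with the OpenAI API: the two model IDs and the two-step design come from the description above, but the grading and classification prompts are illustrative assumptions, not the benchmark's actual prompts.

```python
# Sketch of the two-step scoring flow. Model IDs and the two-step design
# follow the benchmark description; the prompts are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def grade_overall(query: str, context: str, answer: str) -> str:
    # Step 1: an advanced LLM grades overall response quality.
    resp = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[{
            "role": "user",
            "content": (
                "Grade this Thai answer for factual consistency with the "
                "context, language quality, and plain-text format.\n"
                f"Context: {context}\nQuestion: {query}\nAnswer: {answer}"
            ),
        }],
    )
    return resp.choices[0].message.content

def classify_answerable(context: str, answer: str) -> str:
    # Step 2: classify whether the answer used only the provided document.
    resp = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{
            "role": "user",
            "content": (
                "Did this answer use only information from the given "
                f"context? Reply yes or no.\nContext: {context}\n"
                f"Answer: {answer}"
            ),
        }],
    )
    return resp.choices[0].message.content
```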

[Figure: Radar chart comparing the performance of various AI models across TRAG Benchmark categories.]
[Figure: Bar graph comparing accuracy and response time of AI models on the TRAG Benchmark.]

Join Us in Advancing Thai Language AI

We invite researchers and developers to utilize the TRAG Benchmark to evaluate and compare the performance of their models. By participating, you contribute to the advancement of Thai language AI technology, helping to refine and innovate AI solutions.

Explore the forefront of Thai language AI with the TRAG Benchmark and join us in driving the future of AI technology.

For more information and to get started with the TRAG Benchmark, visit our website.