LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama
Date
2025-03
Publisher
University of Tartu Library
Abstract
This paper evaluates the language understanding capabilities of large language models (LLMs) through an analysis of 112 translated and human-edited questions from the Massive Multitask Language Understanding (MMLU) dataset, focusing on two underrepresented languages: Latvian and Giriama. The study compares six state-of-the-art (SOTA) models, with OpenAI's o1-preview demonstrating superior performance across all languages: it significantly outperformed non-proprietary models in Latvian and all other models in Giriama. Human editing of automated translations from English to Latvian yielded only a small, statistically insignificant improvement in performance estimates, suggesting that machine-translated benchmarks may suffice for comparing model performance in languages with established digital resources such as Latvian. Automated translation into Giriama, however, proved infeasible, and model performance in Giriama remained poor, highlighting the persistent challenges LLMs face with low-resource languages. These findings underscore the need for more comprehensive datasets and improved machine translation for underrepresented languages, and emphasize the importance of localized benchmarks and human evaluation in addressing cultural and contextual limitations of AI models.