WikiQA-IS: Assisted Benchmark Generation and Automated Evaluation of Icelandic Cultural Knowledge in LLMs

Arnardóttir, Þórunn; Einarsson, Elías Bjartur; Ingvarsson Juto, Garðar; Helgason, Þorvaldur Páll; Einarsson, Hafsteinn

Files

2025_resourceful_1_13.pdf (725.64 KB)

Date

2025-03

Authors

Arnardóttir, Þórunn

Einarsson, Elías Bjartur

Ingvarsson Juto, Garðar

Helgason, Þorvaldur Páll

Einarsson, Hafsteinn

Publisher

University of Tartu Library

Abstract

This paper presents WikiQA-IS, a novel question-answering dataset focusing on Icelandic culture and history, along with an automated pipeline for dataset generation and evaluation. Leveraging GPT-4 to create questions and answers based on Icelandic Wikipedia articles and news sources, we produced a high-quality corpus of 2,000 question-answer pairs. We introduce an automatic evaluation method using GPT-4o as a judge, which shows strong agreement with human evaluations. Our benchmark reveals varying performance across different language models, with closed-source models generally outperforming open-weights alternatives. This work contributes a resource for evaluating language models' knowledge of Icelandic culture and offers a replicable framework for creating similar datasets in other cultural contexts.

URI

https://aclanthology.org/2025.resourceful-1.0/
https://hdl.handle.net/10062/107117

Collections

Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
