A Collection of Question Answering Datasets for Norwegian

Mikhailov, Vladislav; Mæhlum, Petter; Langø, Victoria Ovedie Chruickshank; Velldal, Erik; Øvrelid, Lilja

dc.contributor.author	Mikhailov, Vladislav
dc.contributor.author	Mæhlum, Petter
dc.contributor.author	Langø, Victoria Ovedie Chruickshank
dc.contributor.author	Velldal, Erik
dc.contributor.author	Øvrelid, Lilja
dc.contributor.editor	Johansson, Richard
dc.contributor.editor	Stymne, Sara
dc.coverage.spatial	Tallinn, Estonia
dc.date.accessioned	2025-02-18T13:48:23Z
dc.date.available	2025-02-18T13:48:23Z
dc.date.issued	2025-03
dc.description.abstract	This paper introduces a new suite of question answering datasets for Norwegian: NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both written standards of Norwegian, Bokmål and Nynorsk, our datasets comprise over 10k question-answer pairs created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than in Nynorsk, struggle most with commonsense reasoning, and are often untruthful when generating answers to questions. All our datasets and annotation materials are publicly available.
dc.identifier.uri	https://hdl.handle.net/10062/107235
dc.language.iso	en
dc.publisher	University of Tartu Library
dc.relation.ispartofseries	NEALT Proceedings Series, No. 57
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title	A Collection of Question Answering Datasets for Norwegian
dc.type	Article

Files

Original bundle

Name: 2025_nodalida_1_43.pdf
Size: 191.48 KB
Format: Adobe Portable Document Format

Collections

Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)