Proceedings of the 1st Workshop on Nordic-Baltic Responsible Evaluation and Alignment of Language Models (NB-REAL 2025)

Permanent URI for this collection: https://hdl.handle.net/10062/107156

  • The 1st Workshop on Nordic-Baltic Responsible Evaluation and Alignment of Language Models. Proceedings of the Workshop
    (University of Tartu Library, 2025-03) Einarsson, Hafsteinn; Simonsen, Annika; Nielsen, Dan Saattrup
  • The Danish Idiom Dataset: A collection of 1000 Danish idioms and fixed expressions
    (University of Tartu Library, 2025-03) Sørensen, Nathalie Hau; Nimb, Sanni; Mikkelsen, Agnes Aggergaard; Jensen, Jonas; Einarsson, Hafsteinn; Simonsen, Annika; Nielsen, Dan Saattrup
    Interpreting idiomatic expressions is a challenging task for language learners and large language models (LLMs) alike, as their meanings cannot be deduced directly from their individual components and often reflect nuances specific to the language in question. This makes idiom interpretation an ideal task for assessing the linguistic proficiency of LLMs. To test how LLMs handle this task, we introduce a new dataset comprising 1000 Danish idiomatic expressions sourced from the Danish Dictionary DDO (ordnet.dk/ddo). The dataset has been made publicly available at sprogteknologi.dk. For each expression, the dataset includes a correct dictionary definition, a literal false definition, a figurative false definition, and a random false definition. In the paper, we also present three experiments that demonstrate diverse applications of the dataset and evaluate how well LLMs identify the correct meanings of idiomatic expressions.
  • Image-Text Relation Prediction for Multilingual Tweets
    (University of Tartu Library, 2025-03) Rikters, Matīss; Marrese-Taylor, Edison; Einarsson, Hafsteinn; Simonsen, Annika; Nielsen, Dan Saattrup
    Various social networks have allowed media uploads for over a decade now. Still, it has not always been clear how the uploaded media relate to the posted text, or whether there is any relation at all. In this work, we explore how multilingual vision-language models tackle the task of image-text relation prediction in different languages, and we construct a dedicated balanced benchmark dataset from Twitter posts in Latvian along with their manual translations into English. We compare our results to previous work and show that more recently released vision-language model checkpoints are becoming increasingly capable at this task, although there is still much room for improvement.
  • What's Wrong With This Translation? Simplifying Error Annotation For Crowd Evaluation
    (University of Tartu Library, 2025-03) Debess, Iben Nyholm; Karakanta, Alina; Scalvini, Barbara; Einarsson, Hafsteinn; Simonsen, Annika; Nielsen, Dan Saattrup
    Machine translation (MT) for Faroese faces challenges due to limited expert annotators and a lack of robust evaluation metrics. This study addresses these challenges by developing an MQM-inspired expert annotation framework to identify key error types and a simplified crowd evaluation scheme to enable broader participation. Our findings, based on an analysis of 200 sentences translated by three models, demonstrate that simplified crowd evaluations align with expert assessments, paving the way for improved accessibility and democratization of MT evaluation.
  • Evaluating LLM Judgment on Latvian and Lithuanian Short Answer Matching
    (University of Tartu Library, 2025-03) Kostiuk, Yevhen; Vitman, Oxana; Kiulian, Artur; Gagała, Łukasz; Einarsson, Hafsteinn; Simonsen, Annika; Nielsen, Dan Saattrup
    In this work, we address the challenge of evaluating large language models (LLMs) on the short-answer matching task for Latvian and Lithuanian. We introduce novel datasets consisting of 502 Latvian and 690 Lithuanian question-answer pairs. For each pair, we generated matched and non-matched answers using a set of alteration rules specifically designed to introduce small but meaningful changes in the text. These generated answers serve as test cases to assess the ability of LLMs to detect subtle mismatches with the original answers. A subset of the datasets was manually verified for quality and accuracy. Our results show that while larger LLMs, such as QWEN2.5 72b and LLaMa3.1 70b, demonstrate near-perfect performance in distinguishing matched and non-matched answers, smaller models show more variance. For instance, LLaMa3.1 8b and EuroLLM 9b benefited from few-shot examples, while Mistral Nemo 12b underperformed on detecting subtle text alterations, particularly in Lithuanian, even with additional examples. QWEN2.5 7b and Mistral 7b achieved strong performance, comparable to the larger 70b models, in zero- and few-shot experiments, although Mistral 7b was weaker in the few-shot setting. The code and the dataset are available on our GitHub.
  • Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History
    (University of Tartu Library, 2025-03) Kostiuk, Yevhen; Vitman, Oxana; Gagała, Łukasz; Kiulian, Artur; Einarsson, Hafsteinn; Simonsen, Annika; Nielsen, Dan Saattrup
    In this work, we evaluated the Lithuanian and general history knowledge of multilingual large language models (LLMs) on a multiple-choice question-answering task. The models were tested on a dataset of Lithuanian national and general history questions translated into Baltic, Nordic, and other languages (English, Ukrainian, Arabic) to assess knowledge sharing across culturally and historically connected groups. We evaluated GPT-4o, LLaMa3.1 8b and 70b, QWEN2.5 7b and 72b, Mistral Nemo 12b, LLaMa3 8b, Mistral 7b, LLaMa3.2 3b, and Nordic fine-tuned models (GPT-SW3 and LLaMa3 8b). Our results show that GPT-4o consistently outperformed all other models across language groups, with slightly better results for Baltic and Nordic languages. Larger open-source models like QWEN2.5 72b and LLaMa3.1 70b performed well but showed weaker alignment with Baltic languages. Smaller models (Mistral Nemo 12b, LLaMa3.2 3b, QWEN2.5 7b, LLaMa3.1 8b, and LLaMa3 8b) showed gaps on questions related to Lithuanian national history (LT-related) and weaker alignment with Baltic languages, while performing better on Nordic and other languages. The Nordic fine-tuned models did not surpass the multilingual models, indicating that shared cultural or historical context alone does not guarantee better performance.
  • The 1st Workshop on Nordic-Baltic Responsible Evaluation and Alignment of Language Models
    (University of Tartu Library, 2025-03-02) Einarsson, Hafsteinn; Simonsen, Annika; Nielsen, Dan Saattrup