Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Permanent URI for this collection: https://hdl.handle.net/10062/107190
A Collection of Question Answering Datasets for Norwegian (University of Tartu Library, 2025-03)
Mikhailov, Vladislav; Mæhlum, Petter; Langø, Victoria Ovedie Chruickshank; Velldal, Erik; Øvrelid, Lilja; Johansson, Richard; Stymne, Sara
This paper introduces a new suite of question answering datasets for Norwegian: NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both of the written standards of Norwegian – Bokmål and Nynorsk – our datasets comprise over 10k question-answer pairs created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions. All our datasets and annotation materials are publicly available.

A Comparative Study of PEFT Methods for Python Code Generation (University of Tartu Library, 2025-03)
Männistö, Johanna; Attieh, Joseph; Tiedemann, Jörg; Johansson, Richard; Stymne, Sara
Fine-tuning language models incurs high costs in training, inference and storage. Parameter-efficient fine-tuning (PEFT) methods have emerged as a more cost-effective alternative to full fine-tuning. However, limited work has compared different PEFT approaches for tasks like code generation. In this study, we examine the effect of various PEFT training methods on model performance in the task of Python code generation. We fine-tune four model families, ranging from 124M to 7B parameters, using three PEFT approaches alongside standard full fine-tuning.
Our findings reveal that the effectiveness of each PEFT method varies with the model size and the corpus used.

Adding Metadata to Existing Parliamentary Speech Corpus (University of Tartu Library, 2025-03)
Parsons, Phoebe; Solberg, Per Erik; Kvale, Knut; Svendsen, Torbjørn; Salvi, Giampiero; Johansson, Richard; Stymne, Sara
Parliamentary proceedings are convenient data sources for creating corpora for speech technology. Given their public nature, there is an abundance of extra information about the speakers that can be legally and ethically harvested to enrich such corpora. This paper describes the methods we have used to add speaker metadata to the Stortinget Speech Corpus (SSC), which contains over 5,000 hours of Norwegian speech with non-verbatim transcripts but without speaker metadata. The additional metadata for each speech segment includes speaker ID, gender, date of birth, municipality of birth, and counties represented. We also infer each speaker's dialect from their municipality of birth using a manually designed mapping between municipalities and Norwegian dialects. We provide observations on the SSC data and give suggestions for how it may be used for tasks other than speech recognition. Finally, we demonstrate the utility of this new metadata through a dialect identification task. The described methods can be adapted to add metadata to parliamentary corpora in other languages.

Aligning Language Models for Icelandic Legal Text Summarization (University of Tartu Library, 2025-03)
Harðarson, Þórir Hrafn; Loftsson, Hrafn; Ólafsson, Stefán; Johansson, Richard; Stymne, Sara
The integration of language models in the legal domain holds considerable promise for streamlining processes and improving efficiency in managing extensive workloads. However, the specialized terminology, nuanced language, and formal style of legal texts can present substantial challenges.
This study examines whether preference-based training techniques, specifically Reinforcement Learning from Human Feedback and Direct Preference Optimization, can enhance models' performance in generating Icelandic legal summaries that align with domain-specific language standards and user preferences. We compare models fine-tuned with preference training to those using conventional supervised learning. Results indicate that preference training improves the legal accuracy of generated summaries over standard fine-tuning but does not significantly enhance the overall quality of Icelandic language usage. Discrepancies between automated metrics and human evaluations further underscore the importance of qualitative assessment in developing language models for the legal domain.

An Icelandic Linguistic Benchmark for Large Language Models (University of Tartu Library, 2025-03)
Ármannsson, Bjarki; Ingimundarson, Finnur Ágúst; Sigurðsson, Einar Freyr; Johansson, Richard; Stymne, Sara
This paper introduces a linguistic benchmark for Icelandic-language LLMs, the first of its kind manually constructed by native speakers. We report on the scores obtained by current state-of-the-art models, which indicate room for improvement, and discuss the theoretical problems involved in creating such a benchmark and scoring a model's performance.

Analyzing the Effect of Linguistic Instructions on Paraphrase Generation (University of Tartu Library, 2025-03)
Vahtola, Teemu; Hu, Songbo; Creutz, Mathias; Korhonen, Anna; Vulić, Ivan; Tiedemann, Jörg; Johansson, Richard; Stymne, Sara
Recent work has demonstrated that large language models can often generate fluent and linguistically correct text, adhering to given instructions. However, to what extent can they execute complex instructions requiring knowledge of fundamental linguistic concepts and elaborate semantic reasoning?
Our study connects an established linguistic theory of paraphrasing with LLM-based practice to analyze which specific types of paraphrases LLMs can accurately produce and where they still struggle. To this end, we investigate a method of analyzing paraphrases generated by LLMs prompted with a comprehensive set of systematic linguistic instructions. We conduct a case study using GPT-4, which has shown strong performance across various language generation tasks, and we believe that other LLMs may face similar challenges in comparable scenarios. We examine GPT-4 from a linguistic perspective to explore its potential contributions to linguistic research regarding paraphrasing, systematically assessing how accurately the model generates paraphrases that adhere to specified transformation rules. Our results suggest that GPT-4 frequently prioritizes simple lexical or syntactic alternations, often disregarding the transformation guidelines if they overly complicate the primary task.

Annotating and Classifying Direct Speech in Historical Danish and Norwegian Literary Texts (University of Tartu Library, 2025-03)
Al-Laith, Ali; Conroy, Alexander; Degn, Kirstine Nielsen; Bjerring-Hansen, Jens; Hershcovich, Daniel; Johansson, Richard; Stymne, Sara
Analyzing direct speech in historical literary texts provides insights into character dynamics, narrative style, and discourse patterns. In late 19th-century Danish and Norwegian fiction, direct speech reflects characters' social and geographical backgrounds. However, inconsistent typographic conventions in Scandinavian literature complicate computational methods for distinguishing direct speech from other narrative elements. To address this, we introduce an annotated dataset from the MeMo corpus, capturing speech markers and tags in Danish and Norwegian novels.
We evaluate pre-trained language models for classifying direct speech, with results showing that a Danish Foundation Model (DFM), trained on extensive Danish data, achieves the highest performance. Finally, we conduct a classifier-assisted quantitative corpus analysis and find a downward trend in the prevalence of speech over time.

Applying and Optimising a Multi-Scale Probit Model for Cross-Source Text Complexity Classification and Ranking in Swedish (University of Tartu Library, 2025-03)
Andersson, Elsa; Falkenjack, Johan; Jönsson, Arne; Johansson, Richard; Stymne, Sara
We present results from using Probit models to classify and rank texts of varying complexity from multiple sources. We use multiple linguistic sources, including Swedish easy-to-read books, and investigate data augmentation and feature regularisation as optimisation methods for text complexity assessment. Multi-Scale and Single-Scale Probit models are implemented using different ratios of training data and then compared. Overall, the findings suggest that the Multi-Scale Probit model is an effective method for classifying text complexity and ranking new texts, and that it could be used to improve performance on small datasets as well as to normalise datasets labelled using different scales.

Assessed and Annotated Vowel Lengths in Spoken Icelandic Sentences for L1 and L2 Speakers: A Resource for Pronunciation Training (University of Tartu Library, 2025-03)
Richter, Caitlin Laura; Friðriksdóttir, Kolbrún; Bergsson, Kormákur Logi; Maher, Erik Anders; Benediktsdóttir, Ragnheiður María; Gudnason, Jon; Johansson, Richard; Stymne, Sara
We introduce a dataset of time-aligned phonetic transcriptions focusing on vowel length (quantity) in Icelandic. Ultimately, this aims to support computer-assisted pronunciation training (CAPT) software that automatically assesses length and possible errors in Icelandic learners' pronunciations.
The dataset contains a range of long and short vowel targets, including the first acoustic description of quantity in non-native Icelandic. Evaluations assess how manual annotations and automatic forced alignment characterise quantity contrasts. Initial analyses also suggest partial acquisition of phonologically conditioned quantity alternations by non-native speakers.

Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles (University of Tartu Library, 2025-03)
Touileb, Samia; Mikhailov, Vladislav; Kroka, Marie Ingeborg; Velldal, Erik; Øvrelid, Lilja; Johansson, Richard; Stymne, Sara
We introduce a dataset of high-quality human-authored summaries of news articles in Norwegian, intended for benchmarking the abstractive summarisation capabilities of generative language models. Each document in the dataset is provided with three different candidate gold-standard summaries written by native Norwegian speakers, and all summaries are provided in both of the written variants of Norwegian – Bokmål and Nynorsk. The paper describes the data creation effort as well as an evaluation of existing open LLMs for Norwegian on the dataset. We also provide insights from a manual human evaluation comparing human-authored to model-generated summaries. Our results indicate that the dataset provides a challenging LLM benchmark for Norwegian summarisation capabilities.

Better Benchmarking LLMs for Zero-Shot Dependency Parsing (University of Tartu Library, 2025-03)
Ezquerro, Ana; Gómez-Rodríguez, Carlos; Vilares, David; Johansson, Richard; Stymne, Sara
While LLMs excel in zero-shot tasks, their performance in linguistic challenges like syntactic parsing has been less scrutinized.
This paper studies state-of-the-art open-weight LLMs on the task, comparing them to baselines that do not have access to the input sentence, including baselines that have not been used in this context before, such as random projective trees or optimal linear arrangements. The results show that most of the tested LLMs cannot outperform the best uninformed baselines; only the newest and largest versions of LLaMA do so for most languages, and even they achieve rather low performance. Thus, accurate zero-shot syntactic parsing is not forthcoming with open LLMs.

BiaSWE: An Expert Annotated Dataset for Misogyny Detection in Swedish (University of Tartu Library, 2025-03)
Kukk, Kätriin; Petrelli, Danila; Casademont, Judit; Orlowski, Eric J. W.; Dzielinski, Michal; Jacobson, Maria; Johansson, Richard; Stymne, Sara
In this study, we introduce the process for creating BiaSWE, an expert-annotated dataset tailored for misogyny detection in the Swedish language. To address the cultural and linguistic specificity of misogyny in Swedish, we collaborated with experts from the social sciences and humanities. Our interdisciplinary team developed a rigorous annotation process, incorporating both domain knowledge and language expertise, to capture the nuances of misogyny in a Swedish context. This methodology ensures that the dataset is not only culturally relevant but also aligned with broader efforts in bias detection for low-resource languages. The dataset, along with the annotation guidelines, is publicly available for further research.

Braxen 1.0 (University of Tartu Library, 2025-03)
Tånnander, Christina; Edlund, Jens; Johansson, Richard; Stymne, Sara
With this paper, we release a Swedish pronunciation lexicon resource, Braxen 1.0, the result of almost 20 years of development carried out at the Swedish Agency for Accessible Media (MTM).
The lexicon originated as a basic word list but has been continuously expanded with new entries, mainly acquired from university textbooks and news text. Braxen consists of around 850,000 entries, of which around 150,000 are proper names. The lexicon is released under the CC BY 4.0 license and is accessible for public use.

Can summarization approximate simplification? A gold standard comparison (University of Tartu Library, 2025-03)
Magnifico, Giacomo; Barbu, Eduard; Johansson, Richard; Stymne, Sara
This study explores the overlap between text summarization and simplification outputs. While summarization evaluation methods are streamlined, simplification lacks cohesion, prompting the question: how closely can abstractive summarization resemble gold-standard simplification? We address this by applying two BART-based BRIO summarization methods to the Newsela corpus, comparing outputs with manually annotated simplifications and achieving a top ROUGE-L score of 0.654. This provides insight into where summarization and simplification outputs converge and differ.

Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway (University of Tartu Library, 2025-03)
Enstad, Tita; Trosterud, Trond; Røsok, Marie Iversdatter; Beyer, Yngvil; Roald, Marie; Johansson, Richard; Stymne, Sara
Optical Character Recognition (OCR) is crucial to the National Library of Norway’s (NLN) digitisation process, as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection.
Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.

Comparative Concepts or Descriptive Categories: a UD Case study (University of Tartu Library, 2025-03)
Boyer, Matthieu Pierre; Dehouck, Mathieu; Johansson, Richard; Stymne, Sara
In this paper, we present a series of methods for quantifying the soundness of using the same names to annotate cases in different languages. Following Martin Haspelmath's idea that descriptive categories and comparative concepts are different objects, we examine the simplification necessarily made by the Universal Dependencies project. We first compare cases in closely related languages as belonging to commensurable descriptive categories, and then look at the corresponding underlying comparative concepts. Finally, we consider the possibility of assigning cases to adpositions.

Comparing Human and Machine Translations of Generative Language Model Evaluation Datasets (University of Tartu Library, 2025-03)
de Vroe, Sander Bijl; Stampoulidis, George; Hakala, Kai; Rouhe, Aku; van Heeswijk, Mark; Karlgren, Jussi; Johansson, Richard; Stymne, Sara
The evaluation of Large Language Models (LLMs) is one of the crucial current challenges in Natural Language Processing (NLP) and becomes even more challenging in the multilingual setting. Since the majority of the community's benchmarks exist only in English, test sets are now being machine translated at scale into dozens of languages. This work explores the feasibility of that approach, comparing a Finnish machine translation (MT) of ARC-Challenge with a new human-translated version.
Our findings suggest that, since absolute scores are fairly close and model size rankings are preserved, machine translation is adequate in this case. Surprisingly, however, the datasets reverse the order of base models relative to their chat-finetuned counterparts.

Constructions and Strategies in Universal Dependencies (University of Tartu Library, 2025-03)
Nivre, Joakim; Johansson, Richard; Stymne, Sara
Is the framework of Universal Dependencies (UD) compatible with findings from linguistic typology? One way to find out is to investigate whether UD can adequately represent constructions of the world's languages, as described in William Croft's recent book Morphosyntax. This paper discusses how such an investigation could be carried out and why it would be useful.

Danoliteracy of Generative Large Language Models (University of Tartu Library, 2025-03)
Vejlgaard Holm, Søren; Hansen, Lars Kai; Nielsen, Martin Carsten; Johansson, Richard; Stymne, Sara
The language technology moonshot moment of Generative Large Language Models (GLLMs) was not limited to English: these models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations, due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate Danoliteracy, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at ρ ≈ 0.8, with GPT-4 and Claude Opus models achieving the highest rankings.
Analyzing these model results across scenarios, we find one strong underlying factor explaining 95% of scenario-performance variance for GLLMs in Danish, suggesting a g factor of model consistency in language adaptation.

Database of Latvian Morphemes and Derivational Models: ideas and expected results (University of Tartu Library, 2025-03)
Kalnača, Andra; Pakalne, Tatjana; Leväne-Petrova, Kristīne; Johansson, Richard; Stymne, Sara
In this paper, we describe “The Database of Latvian Morphemes and Derivational Models”, a large-scale, corpus-based and manually validated database of Latvian derivational morphology currently in development at the University of Latvia. The database contains morpheme-level data (morphemes, incl. morpheme variants (allomorphs), morpheme types, morpheme homonymy/homography resolution, hierarchical relations between root morphemes, and links to word families) and lemma-level data (incl. base form, morphemic segmentation, POS, grammatical features, derivational motivation (incl. compounding), and word-family membership). The focus of the database is on providing linguistically accurate, comprehensive data as a reliable basis for future work in different fields.
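The final entry above describes two layers of records, morpheme-level and lemma-level, each with its own fields. As a purely illustrative sketch of how such a two-level record structure might be organised (all field names and the Latvian example are hypothetical, not taken from the database itself):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Morpheme:
    """Hypothetical morpheme-level record (fields mirror the entry's description)."""
    form: str                                        # canonical form of the morpheme
    morpheme_type: str                               # e.g. "root", "prefix", "suffix"
    allomorphs: list = field(default_factory=list)   # morpheme variants
    word_family: Optional[str] = None                # link to a word family

@dataclass
class Lemma:
    """Hypothetical lemma-level record: base form plus morphemic segmentation."""
    base_form: str
    segmentation: list                               # ordered morpheme forms
    pos: str                                         # part of speech
    derived_from: Optional[str] = None               # derivational motivation, if any

# Illustrative example: Latvian "rakstnieks" ('writer'), derived from "rakstīt"
# ('to write'); the segmentation shown is a sketch, not the database's analysis.
lemma = Lemma(base_form="rakstnieks",
              segmentation=["rakst", "niek", "s"],
              pos="NOUN",
              derived_from="rakstīt")
root = Morpheme(form="rakst", morpheme_type="root", word_family="rakstīt")
```

The point of the sketch is only that the entry's "morpheme-level" and "lemma-level" data naturally map to two linked record types, with word-family links tying them together.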