Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Permanent URI for this collection: https://hdl.handle.net/10062/107190
Now showing 1 - 20 of 83
Item: Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles (University of Tartu Library, 2025-03)
Touileb, Samia; Mikhailov, Vladislav; Kroka, Marie Ingeborg; Velldal, Erik; Øvrelid, Lilja; Johansson, Richard; Stymne, Sara
We introduce a dataset of high-quality human-authored summaries of news articles in Norwegian. The dataset is intended for benchmarking the abstractive summarisation capabilities of generative language models. Each document in the dataset is provided with three different candidate gold-standard summaries written by native Norwegian speakers, and all summaries are provided in both written variants of Norwegian, Bokmål and Nynorsk. The paper describes the data creation effort as well as an evaluation of existing open LLMs for Norwegian on the dataset. We also provide insights from a manual human evaluation comparing human-authored to model-generated summaries. Our results indicate that the dataset provides a challenging LLM benchmark for Norwegian summarisation capabilities.

Item: Analyzing the Effect of Linguistic Instructions on Paraphrase Generation (University of Tartu Library, 2025-03)
Vahtola, Teemu; Hu, Songbo; Creutz, Mathias; Korhonen, Anna; Vulić, Ivan; Tiedemann, Jörg; Johansson, Richard; Stymne, Sara
Recent work has demonstrated that large language models can often generate fluent and linguistically correct text, adhering to given instructions. However, to what extent can they execute complex instructions requiring knowledge of fundamental linguistic concepts and elaborate semantic reasoning? Our study connects an established linguistic theory of paraphrasing with LLM-based practice to analyze which specific types of paraphrases LLMs can accurately produce and where they still struggle. To this end, we investigate a method of analyzing paraphrases generated by LLMs prompted with a comprehensive set of systematic linguistic instructions.
We conduct a case study using GPT-4, which has shown strong performance across various language generation tasks, and we believe that other LLMs may face similar challenges in comparable scenarios. We examine GPT-4 from a linguistic perspective to explore its potential contributions to linguistic research regarding paraphrasing, systematically assessing how accurately the model generates paraphrases that adhere to specified transformation rules. Our results suggest that GPT-4 frequently prioritizes simple lexical or syntactic alternations, often disregarding the transformation guidelines if they overly complicate the primary task.

Item: Efficient Elicitation of Fictitious Nursing Notes from Volunteer Healthcare Professionals (University of Tartu Library, 2025-03)
Vaaben Bornerup, Jesper; Hardmeier, Christian; Johansson, Richard; Stymne, Sara
Reliable automatic solutions to extract structured information from free-text nursing notes could bring important efficiency gains in healthcare, but their development is hampered by the sensitivity and limited availability of example data. We describe a method for eliciting fictitious nursing documentation and associated structured documentation from volunteers, and a resulting dataset of 397 Danish notes collected and annotated through a custom web application from 98 participating nurses. After some manual refinement, we obtained a high-quality dataset containing nurse notes with relevant entities identified.
We describe the implementation and limitations of our approach as well as initial experiments in a named entity tagging setup.

Item: Entailment Progressions: A Robust Approach to Evaluating Reasoning Within Larger Discourse (University of Tartu Library, 2025-03)
Shastry, Rishabh; Chiril, Patricia; Charney, Joshua; Uminsky, David; Johansson, Richard; Stymne, Sara
Textual entailment, or the ability to deduce whether a proposed hypothesis is logically supported by a given premise, has historically been applied to the evaluation of language modelling efficiency in tasks like question answering and text summarization. However, we hypothesize that these zero-shot entailment evaluations can be extended to the task of evaluating discourse within larger textual narratives. In this paper, we propose a simple but effective method that sequentially evaluates changes in textual entailment between sentences within a larger text, in an approach we denote as "Entailment Progressions". These entailment progressions aim to capture the inference relations between sentences as an underlying component capable of distinguishing texts generated from various models and procedures. Our results suggest that entailment progressions can be used to effectively distinguish between machine-generated and human-authored texts across multiple established benchmark corpora and our own EP4MGT dataset.
Additionally, our method displays robustness in performance when evaluated on paraphrased texts, a technique that has historically affected the performance of well-established metrics when distinguishing between machine-generated and human-authored texts.

Item: Surface-Level Morphological Segmentation of Low-resource Inuktitut Using Pre-trained Large Language Models (University of Tartu Library, 2025-03)
Stenlund, Mathias; Myneni, Hemanadhan; Riedel, Morris; Johansson, Richard; Stymne, Sara
Segmenting languages based on morpheme boundaries instead of relying on language-independent segmenting algorithms like Byte-Pair Encoding (BPE) has been shown to benefit downstream Natural Language Processing (NLP) task performance. This can however be tricky for polysynthetic languages like Inuktitut due to a high morpheme-to-word ratio and the lack of appropriately sized annotated datasets. Through our work, we display the potential of using pre-trained Large Language Models (LLMs) for surface-level morphological segmentation of Inuktitut by treating it as a binary classification task. We fine-tune on tasks derived from automatically annotated Inuktitut words written in Inuktitut syllabics. Our approach shows good potential when compared to previous neural approaches. We share our best model to encourage further studies on downstream NLP tasks for Inuktitut written in syllabics.

Item: Generative AI for Technical Writing: Comparing Human and LLM Assessments of Generated Content (University of Tartu Library, 2025-03)
Souza, Karen de; Nikolaev, Alexandre; Koponen, Maarit; Johansson, Richard; Stymne, Sara
Large language models (LLMs) have recently gained significant attention for their capabilities in natural language processing (NLP), particularly generative artificial intelligence (AI). LLMs can also be useful tools for software documentation technical writers.
We present an assessment of technical documentation content generated by three different LLMs using retrieval-augmented generation (RAG) with product documentation as a knowledge base. The LLM-generated responses were analyzed in three ways: 1) manual error analysis by a technical writer, 2) automatic assessment using deterministic metrics (BLEU, ROUGE, token overlap), and 3) evaluation of correctness by an LLM as a judge. The results of these assessments were compared using network analysis and linear regression models to investigate statistical relationships, model preferences, and the distribution of human and LLM scores. The analyses concluded that human quality evaluation is more closely related to the LLM correctness judgment than to deterministic metrics, even when using different analysis frameworks.

Item: Mixed Feelings: Cross-Domain Sentiment Classification of Patient Feedback (University of Tartu Library, 2025-03)
Rønningstad, Egil; Storset, Lilja Charlotte; Mæhlum, Petter; Øvrelid, Lilja; Velldal, Erik; Johansson, Richard; Stymne, Sara
Sentiment analysis of patient feedback from the public health domain can aid decision makers in evaluating the provided services. The current paper focuses on free-text comments in patient surveys about general practitioners and psychiatric healthcare, annotated with four sentence-level polarity classes (positive, negative, mixed, and neutral), while also attempting to alleviate data scarcity by leveraging general-domain sources in the form of reviews. For several different architectures, we compare in-domain and out-of-domain effects, as well as the effects of training joint multi-domain models.

Item: Interactive maps for corpus-based dialectology (University of Tartu Library, 2025-03)
Scherrer, Yves; Kuparinen, Olli; Johansson, Richard; Stymne, Sara
Traditional data collection methods in dialectology rely on structured surveys, whose results can be easily presented on printed or digital maps.
But in recent years, corpora of transcribed dialect speech have become a precious alternative data source for data-driven linguistic analysis. For example, topic models can be advantageously used to discover both general dialectal variation patterns and specific linguistic features that are most characteristic of certain dialects. Multilingual (or rather, multilectal) language modeling tasks can also be used to learn speaker-specific embeddings. In connection with this paper, we introduce a website that presents the results of two recent studies in the form of interactive maps, allowing visitors to explore the effects of various parameter settings. The website covers two tasks (topic models and speaker embeddings) and three language areas (Finland, Norway, and German-speaking Switzerland). It is available at https://www.corcodial.net/.

Item: Small Languages, Big Models: A Study of Continual Training on Languages of Norway (University of Tartu Library, 2025-03)
Samuel, David; Mikhailov, Vladislav; Velldal, Erik; Øvrelid, Lilja; Charpentier, Lucas Georges Gabriel; Kutuzov, Andrey; Oepen, Stephan; Johansson, Richard; Stymne, Sara
Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian, and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages.
Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.

Item: MC-19: A Corpus of 19th Century Icelandic Texts (University of Tartu Library, 2025-03)
Steingrímsson, Steinþór; Sigurðsson, Einar Freyr; Jasonarson, Atli; Johansson, Richard; Stymne, Sara
We present MC-19, a new Icelandic historical corpus containing texts from the period 1800-1920. We describe approaches for enhancing a corpus of historical texts by preparing the texts so that they can be processed using state-of-the-art NLP tools. We train encoder-decoder models to reduce the number of OCR errors while leaving other orthographic variation intact. We generate a separate modern spelling layer by normalizing the spelling to comply with modern spelling rules, using a statistical modernization ruleset as well as a dictionary of the most common words. This allows the texts to be PoS-tagged and lemmatized using available tools, facilitating usage of the corpus for researchers and language technologists. The published version of the corpus contains over 270 million tokens.

Item: Braxen 1.0 (University of Tartu Library, 2025-03)
Tånnander, Christina; Edlund, Jens; Johansson, Richard; Stymne, Sara
With this paper, we release a Swedish pronunciation lexicon resource, Braxen 1.0, which is the result of almost 20 years of development carried out at the Swedish Agency for Accessible Media (MTM). The lexicon originated with a basic word list but has continuously been expanded with new entries, mainly acquired from university textbooks and news text. Braxen consists of around 850,000 entries, of which around 150,000 are proper names.
The lexicon is released under the CC BY 4.0 license and is accessible for public use.

Item: Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025): Proceedings of the Conference: March 3-4, 2025 (University of Tartu Library, 2025-03)
Johansson, Richard; Stymne, Sara

Item: Applying and Optimising a Multi-Scale Probit Model for Cross-Source Text Complexity Classification and Ranking in Swedish (University of Tartu Library, 2025-03)
Andersson, Elsa; Falkenjack, Johan; Jönsson, Arne; Johansson, Richard; Stymne, Sara
We present results from using Probit models to classify and rank texts of varying complexity from multiple sources. We use multiple linguistic sources, including Swedish easy-to-read books, and investigate data augmentation and feature regularisation as optimisation methods for text complexity assessment. Multi-Scale and Single-Scale Probit models are implemented using different ratios of training data and then compared. Overall, the findings suggest that the Multi-Scale Probit model is an effective method for classifying text complexity and ranking new texts, and that it could be used to improve performance on small datasets as well as to normalize datasets labelled using different scales.

Item: Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025) (University of Tartu Library, 2025-03)
Johansson, Richard; Stymne, Sara

Item: Annotating and Classifying Direct Speech in Historical Danish and Norwegian Literary Texts (University of Tartu Library, 2025-03)
Al-Laith, Ali; Conroy, Alexander; Degn, Kirstine Nielsen; Bjerring-Hansen, Jens; Hershcovich, Daniel; Johansson, Richard; Stymne, Sara
Analyzing direct speech in historical literary texts provides insights into character dynamics, narrative style, and discourse patterns.
In late 19th-century Danish and Norwegian fiction, direct speech reflects characters' social and geographical backgrounds. However, inconsistent typographic conventions in Scandinavian literature complicate computational methods for distinguishing direct speech from other narrative elements. To address this, we introduce an annotated dataset from the MeMo corpus, capturing speech markers and tags in Danish and Norwegian novels. We evaluate pre-trained language models for classifying direct speech, with results showing that a Danish Foundation Model (DFM), trained on extensive Danish data, has the highest performance. Finally, we conduct a classifier-assisted quantitative corpus analysis and find a downward trend in the prevalence of speech over time.

Item: Diachronic Analysis of Phrasal Verbs in English Scientific Writing (University of Tartu Library, 2025-03)
Alves, Diego; Johansson, Richard; Stymne, Sara
Phrasal verbs (PVs) are a specific type of multi-word expression and a specific feature of the English language. However, their usage in scientific prose is limited. Our study focuses on the analysis of phrasal verbs in the scientific domain, using information-theoretic methods to describe diachronic phenomena such as conventionalization and diversification in the usage of PVs. We analysed their developmental trajectory from the mid-17th century to the end of the 20th century by measuring relative entropy (Kullback-Leibler divergence), the predictability of phrasal verb particles in context (surprisal), and paradigmatic variability using word embedding spaces.
We were able to identify interesting phenomena such as the process of conventionalization over the 20th century and peaks of diversification throughout the centuries.

Item: Better Benchmarking LLMs for Zero-Shot Dependency Parsing (University of Tartu Library, 2025-03)
Ezquerro, Ana; Gómez-Rodríguez, Carlos; Vilares, David; Johansson, Richard; Stymne, Sara
While LLMs excel in zero-shot tasks, their performance in linguistic challenges like syntactic parsing has been less scrutinized. This paper studies state-of-the-art open-weight LLMs on the task by comparing them to baselines that do not have access to the input sentence, including baselines that have not been used in this context, such as random projective trees or optimal linear arrangements. The results show that most of the tested LLMs cannot outperform the best uninformed baselines, with only the newest and largest versions of LLaMA doing so for most languages, and still achieving rather low performance. Thus, accurate zero-shot syntactic parsing is not forthcoming with open LLMs.

Item: GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian (University of Tartu Library, 2025-03)
Dorkin, Aleksei; Sirts, Kairit; Johansson, Richard; Stymne, Sara
We present GliLem, a novel hybrid lemmatization system for Estonian that enhances the highly accurate rule-based morphological analyzer Vabamorf with an external disambiguation module based on GliNER, an open-vocabulary NER model that is able to match text spans with text labels in natural language. We leverage the flexibility of a pre-trained GliNER model to improve the lemmatization accuracy of Vabamorf by 10% compared to its original disambiguation module, and achieve an improvement over the token classification-based baseline.
To measure the impact of improvements in lemmatization accuracy on the downstream task of information retrieval, we first created an information retrieval dataset for Estonian by automatically translating the DBpedia-Entity dataset from English. We benchmark several token normalization approaches, including lemmatization, on the created dataset using the BM25 algorithm. We observe a substantial improvement in IR metrics when using lemmatization over simplistic stemming. The benefits of improved lemma disambiguation accuracy manifest in a small but consistent improvement in the IR recall measure, especially in the setting of high k.

Item: LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama (University of Tartu Library, 2025-03)
Etori, Naome A.; Kanepajs, Arturs; Lu, Kevin; Karisa, Randu; Johansson, Richard; Stymne, Sara
This paper evaluates the language understanding capabilities of various large language models (LLMs) through an analysis of 112 translated and human-edited questions from the Massive Multitask Language Understanding (MMLU) dataset, focusing specifically on two underrepresented languages: Latvian and Giriama. The study compares the performance of six state-of-the-art (SOTA) models, with OpenAI's o1-preview model demonstrating superior performance across all languages, significantly outperforming non-proprietary models in Latvian and all other models in Giriama. Human editing of automated translations from English to Latvian yielded only a small, statistically insignificant improvement in performance estimates, suggesting that machine-translated benchmarks may be sufficient for comparing model performance in languages with established digital resources like Latvian. However, automated translation to Giriama proved infeasible, and model performance in Giriama remained poor, highlighting the persistent challenges LLMs face with low-resource languages.
These findings underscore the need for more comprehensive datasets and improved machine translation capabilities for underrepresented languages, while emphasizing the importance of localized benchmarks and human evaluation in addressing cultural and contextual limitations in AI models.

Item: Investigating the effectiveness of Data Augmentation and Contrastive Learning for Named Entity Recognition (University of Tartu Library, 2025-03)
Chia, Noel; Rehbein, Ines; Ponzetto, Simone Paolo; Johansson, Richard; Stymne, Sara
Data Augmentation (DA) and Contrastive Learning (CL) are widely used in NLP, but their potential for NER has not yet been investigated in detail. Existing work is mostly limited to zero- and few-shot scenarios where improvements over the baseline are easy to obtain. In this paper, we address this research gap by presenting a systematic evaluation of DA for NER on small, medium-sized, and large datasets with coarse- and fine-grained labels. We report results for a) DA only, b) DA in combination with supervised contrastive learning, and c) DA with transfer learning. Our results show that DA on its own fails to improve results over the baseline and that supervised CL works better on larger datasets, while transfer learning is beneficial if the target dataset is very small. Finally, we investigate how contrastive learning affects the learned representations, based on dimensionality reduction and visualisation techniques, and show that CL mostly helps to separate named entities from non-entities.
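The supervised contrastive learning objective mentioned in the last entry can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the function name, the toy embeddings, and the temperature value are illustrative assumptions, following the common supervised-contrastive formulation (a per-anchor softmax over all other samples, averaged over positive pairs).

```python
import math

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over L2-normalized embeddings:
    for each anchor, samples sharing its label are pulled together
    and all other samples are pushed apart."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(v):
        n = math.sqrt(dot(v, v))
        return [x / n for x in v]

    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a positive are skipped
        # denominator: similarity mass over all other samples
        denom = sum(math.exp(dot(z[i], z[j]) / temperature)
                    for j in range(n) if j != i)
        for p in positives:
            total += -math.log(math.exp(dot(z[i], z[p]) / temperature) / denom)
            count += 1
    return total / count  # assumes at least one positive pair exists
```

On toy data, label assignments consistent with embedding geometry yield a lower loss than shuffled labels, which is the separation effect the entry describes for distinguishing named entities from non-entities.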