Browse by Author "Trosterud, Trond"
Now showing 1 - 20 of 26
Item: A constraint grammar for Faroese (2009-11-14T21:43:41Z)
Trosterud, Trond

Item: A Grammar-Based Method for Instilling Empirical Dependency Structure in LLMs (University of Tartu Library, 2025-03)
Torstensson, Olle; Holmström, Oskar; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
We investigate whether synthetic pretraining data generated from a formal grammar modeling syntactic dependencies can improve English language models. Building upon the structured pretraining data approach of Papadimitriou and Jurafsky (2023), we develop a grammar that more closely mirrors empirical dependency structures. Our results are negative – this type of pretraining significantly degrades model performance, with both our and their pretraining approach performing worse than no pretraining at all. We analyze potential explanations for these findings and discuss implications for future work on structured-data pretraining.

Item: A grammatical analyser for Tokelau (University of Tartu Library, 2025-03)
Trosterud, Trond; Vonen, Arnfinn Muruvik; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
This article presents a grammatical analyser, disambiguator and dependency analyser for Tokelau. The grammatical analyser is written as a finite-state transducer (FST), whereas the disambiguator and dependency analyser are written in Constraint Grammar (CG), both within the GiellaLT infrastructure. Contrary to most languages analyzed within this framework, Tokelau, being a Polynesian language, is predominantly isolating, with reduplication and affixation as the main morphological processes. The article discusses how FST and CG deal with Polynesian languages.

Item: A Mansi FST and spellchecker (University of Tartu Library, 2025-03)
Rueter, Jack; Horváth, Csilla; Trosterud, Trond; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
The article presents a finite-state transducer and spellchecker for Mansi, an Ob-Ugric Uralic language spoken in northwestern Siberia. Mansi has a rich but mostly agglutinative morphology, with a morphophonology dominated by sandhi phenomena. With a small set of morphophonological rules (32 twolc rules) and a lexicon consisting of 12,000 Mansi entries and a larger set of proper nouns, we were able to build a transducer covering 98.9 % of a large (700k) newspaper corpus. Being a part of the GiellaLT infrastructure, the transducer was turned into a spellchecker. The most common spelling error in Mansi is the omission of length marks on vowels, and for the 1000 most common words containing long vowels, the spellchecker was able to give a correct suggestion among the top five in 98.3 % of the cases, and as first suggestion in 91.3 % of the cases.

Item: An Annotated Error Corpus for Esperanto (University of Tartu Library, 2025-03)
Bick, Eckhard; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
This paper presents and evaluates a new multi-genre error corpus for (written) Esperanto, EspEraro, building on learner, news and internet data and covering both ordinary spelling errors and real-word errors such as grammatical and word choice errors. Because the corpus has been annotated not only for errors, error types and corrections, but also with Constraint Grammar (CG) tags for part-of-speech, inflection, affixation, syntactic function, dependency and semantic class, it allows users to linguistically contextualize errors and to craft and test CG rules aiming at the recognition and/or correction of the various error types covered in the corpus. The resource was originally created for regression-testing a newly developed spell- and grammar checker, and contains about 75,000 tokens (~ 4,000 sentences), with 3,330 tokens annotated for one or more errors and a combined correction suggestion. We discuss the different error types and evaluate their weight in the corpus. Where relevant, we explain the role of Constraint Grammar (CG) in the identification and correction of the individual error types.

Item: Building an Open-Source Development Infrastructure for Language Technology Projects (Oslo, Norway, Linköping University Electronic Press, Sweden, pp. 343-352, 2013)
Moshagen, Sjur N.; Pirinen, Tommi; Trosterud, Trond; Oepen, Stephan; Hagen, Kristin; Johannessen, Janne Bondi

Item: Case error corrections for noun phrases containing deverbal attributive nouns in Greenlandic (University of Tartu Library, 2025-03)
Denbæk, Judithe; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
This paper presents preliminary findings on using Constraint Grammar (CG) for semantic annotation of a specific type of noun phrase in Greenlandic, in which the attributive noun is a nominalized predicative verbal stem. The annotation is used in a grammar checker pipeline for the purpose of making case error correction suggestions.

Item: Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway (University of Tartu Library, 2025-03)
Enstad, Tita; Trosterud, Trond; Røsok, Marie Iversdatter; Beyer, Yngvil; Roald, Marie; Johansson, Richard; Stymne, Sara
Optical Character Recognition (OCR) is crucial to the National Library of Norway’s (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.

Item: Constraint Grammar in Dialogue Systems (2009-11-14T21:48:19Z)
Antonsen, Lene; Huhmarniemi, Saara; Trosterud, Trond

Item: Divvunspell - Finite-State Spell-Checking and Correction on Modern Platforms (University of Tartu Library, 2025-03)
Pirinen, Flammie A.; Moshagen, Sjur Nørstebø; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
Spell-checking and correction is one of the key applications of natural language support. Historically, for the biggest, less morphologically complex languages, spell-checking and correction could be implemented by relatively simple means; however, for morphologically complex and low-resource languages, the solutions were often suboptimal. Finite-state methods are the state of the art in rule-based natural language processing, and they have also been used effectively for spell-checking and correction. In this article, we show some recent developments of a finite-state spell-checker implementation that works with modern operating systems and platforms.

Item: Drawing Blue Lines - What can Constraint Grammar do for GEC? (University of Tartu Library, 2025-03)
Wiechetek, Linda; Unhammer, Kevin Brubeck; Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie
This paper presents the application of rule-based methods for Grammatical Error Correction (GEC) across multiple low-resource languages. We describe new functionality using the Constraint Grammar (CG) formalism, designed for detecting and correcting different types of complex grammatical errors in a range of morphologically complex languages. These errors require transformations such as reordering, word additions/deletions, and alternative choices for multiword suggestions. New perspectives are gained from end-to-end testing – this work aims to clarify the relationship between the command-line interface used by developers and the user interfaces of our grammar checker plug-in for common word processors. We discuss challenges and solutions in correcting complex errors, with examples from languages like Lule Sámi, Irish, and Greenlandic, enabling linguists to adapt these methods in order to provide accurate and context-aware proofing tools for their own languages in mainstream word processors like Microsoft Word, Google Docs or LibreOffice.

Item: Interactive pedagogical programs based on constraint grammar (Odense, Denmark, Northern European Association for Language Technology (NEALT), pp. 10-17, 2009)
Antonsen, Lene; Huhmarniemi, Saara; Trosterud, Trond; Jokinen, Kristiina; Bick, Eckhard

Item: Interactive pedagogical programs based on constraint grammar (2009-05-11T08:55:58Z)
Antonsen, Lene; Huhmarniemi, Saara; Trosterud, Trond

Item: Machine translation with North Saami as a pivot language (Gothenburg, Sweden, Association for Computational Linguistics, pp. 123-131, 2017)
Antonsen, Lene; Gerstenberger, Ciprian; Kappfjell, Maja; Nystø Rahka, Sandra; Olthuis, Marja-Liisa; Trosterud, Trond; Tyers, Francis M.; Tiedemann, Jörg; Tahmasebi, Nina

Item: Next to nothing – a cheap South Saami disambiguator (2011-11-17)
Antonsen, Lene; Trosterud, Trond

Item: North-Sámi to Finnish rule-based machine translation system (Gothenburg, Sweden, Association for Computational Linguistics, pp. 115-122, 2017)
Pirinen, Tommi; Tyers, Francis M.; Trosterud, Trond; Johnson, Ryan; Unhammer, Kevin; Puolakainen, Tiina; Tiedemann, Jörg; Tahmasebi, Nina

Item: Proceedings (2009-11-14T22:05:13Z)
Bick, Eckhard; Hagen, Kristin; Müürisep, Kaili; Trosterud, Trond

Item: Proceedings (2011-11-17)
Bick, Eckhard; Hagen, Kristin; Müürisep, Kaili; Trosterud, Trond

Item: Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP (University of Tartu Library, 2025-03)
Trosterud, Trond; Wiechetek, Linda; Pirinen, Flammie