Surface-Level Morphological Segmentation of Low-resource Inuktitut Using Pre-trained Large Language Models

Stenlund, Mathias; Myneni, Hemanadhan; Riedel, Morris

dc.contributor.author	Stenlund, Mathias
dc.contributor.author	Myneni, Hemanadhan
dc.contributor.author	Riedel, Morris
dc.contributor.editor	Johansson, Richard
dc.contributor.editor	Stymne, Sara
dc.coverage.spatial	Tallinn, Estonia
dc.date.accessioned	2025-02-19T08:33:16Z
dc.date.available	2025-02-19T08:33:16Z
dc.date.issued	2025-03
dc.description.abstract	Segmenting languages based on morpheme boundaries, instead of relying on language-independent segmentation algorithms like Byte-Pair Encoding (BPE), has been shown to benefit downstream Natural Language Processing (NLP) task performance. This can, however, be difficult for polysynthetic languages like Inuktitut due to a high morpheme-to-word ratio and the lack of appropriately sized annotated datasets. We demonstrate the potential of using pre-trained Large Language Models (LLMs) for surface-level morphological segmentation of Inuktitut by treating it as a binary classification task. We fine-tune on tasks derived from automatically annotated Inuktitut words written in Inuktitut syllabics. Our approach shows promise when compared to previous neural approaches. We share our best model to encourage further studies on downstream NLP tasks for Inuktitut written in syllabics.
dc.identifier.uri	https://hdl.handle.net/10062/107262
dc.language.iso	en
dc.publisher	University of Tartu Library
dc.relation.ispartofseries	NEALT Proceedings Series, No. 57
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title	Surface-Level Morphological Segmentation of Low-resource Inuktitut Using Pre-trained Large Language Models
dc.type	Article

Files

Original bundle

Name: 2025_nodalida_1_69.pdf
Size: 1.17 MB
Format: Adobe Portable Document Format

Collections

Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)