Surface-Level Morphological Segmentation of Low-resource Inuktitut Using Pre-trained Large Language Models

dc.contributor.author: Stenlund, Mathias
dc.contributor.author: Myneni, Hemanadhan
dc.contributor.author: Riedel, Morris
dc.contributor.editor: Johansson, Richard
dc.contributor.editor: Stymne, Sara
dc.coverage.spatial: Tallinn, Estonia
dc.date.accessioned: 2025-02-19T08:33:16Z
dc.date.available: 2025-02-19T08:33:16Z
dc.date.issued: 2025-03
dc.description.abstract: Segmenting languages at morpheme boundaries, instead of relying on language-independent segmentation algorithms like Byte-Pair Encoding (BPE), has been shown to benefit downstream Natural Language Processing (NLP) task performance. This can, however, be difficult for polysynthetic languages like Inuktitut due to a high morpheme-to-word ratio and the lack of appropriately sized annotated datasets. In this work, we demonstrate the potential of using pre-trained Large Language Models (LLMs) for surface-level morphological segmentation of Inuktitut by treating it as a binary classification task. We fine-tune on tasks derived from automatically annotated Inuktitut words written in Inuktitut syllabics. Our approach shows good potential when compared to previous neural approaches. We share our best model to encourage further studies on downstream NLP tasks for Inuktitut written in syllabics.
dc.identifier.uri: https://hdl.handle.net/10062/107262
dc.language.iso: en
dc.publisher: University of Tartu Library
dc.relation.ispartofseries: NEALT Proceedings Series, No. 57
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title: Surface-Level Morphological Segmentation of Low-resource Inuktitut Using Pre-trained Large Language Models
dc.type: Article

Files

Original bundle

Name: 2025_nodalida_1_69.pdf
Size: 1.17 MB
Format: Adobe Portable Document Format