A Mansi FST and spellchecker

dc.contributor.author: Rueter, Jack
dc.contributor.author: Horváth, Csilla
dc.contributor.author: Trosterud, Trond
dc.contributor.editor: Trosterud, Trond
dc.contributor.editor: Wiechetek, Linda
dc.contributor.editor: Pirinen, Flammie
dc.coverage.spatial: Tallinn, Estonia
dc.date.accessioned: 2025-02-17T09:07:51Z
dc.date.available: 2025-02-17T09:07:51Z
dc.date.issued: 2025-03
dc.description.abstract: The article presents a finite-state transducer and spellchecker for Mansi, an Ob-Ugric Uralic language spoken in northwestern Siberia. Mansi has a rich but mostly agglutinative morphology, with a morphophonology dominated by sandhi phenomena. With a small set of morphophonological rules (32 twolc rules) and a lexicon consisting of 12,000 Mansi entries and a larger set of proper nouns, we were able to build a transducer covering 98.9% of a large (700k) newspaper corpus. Since it is part of the GiellaLT infrastructure, the transducer was turned into a spellchecker. The most common spelling error in Mansi is the omission of length marks on vowels, and for the 1000 most common words containing long vowels, the spellchecker gave a correct suggestion among the top five in 98.3% of the cases, and as the first suggestion in 91.3% of the cases.
dc.identifier.uri: https://hdl.handle.net/10062/107150
dc.language.iso: en
dc.publisher: University of Tartu Library
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title: A Mansi FST and spellchecker
dc.type: Article

Files

Original bundle

Name: 2025_cgfsnlp_1_5.pdf
Size: 299.26 KB
Format: Adobe Portable Document Format