Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway

dc.contributor.author: Enstad, Tita
dc.contributor.author: Trosterud, Trond
dc.contributor.author: Røsok, Marie Iversdatter
dc.contributor.author: Beyer, Yngvil
dc.contributor.author: Roald, Marie
dc.contributor.editor: Johansson, Richard
dc.contributor.editor: Stymne, Sara
dc.coverage.spatial: Tallinn, Estonia
dc.date.accessioned: 2025-02-17T13:59:42Z
dc.date.available: 2025-02-17T13:59:42Z
dc.date.issued: 2025-03
dc.description.abstract: Optical Character Recognition (OCR) is crucial to the National Library of Norway’s (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
dc.identifier.uri: https://hdl.handle.net/10062/107202
dc.language.iso: en
dc.publisher: University of Tartu Library
dc.relation.ispartofseries: NEALT Proceedings Series, No. 57
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title: Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway
dc.type: Article

Files

Original bundle

Name: 2025_nodalida_1_11.pdf
Size: 223.03 KB
Format: Adobe Portable Document Format