Small Languages, Big Models: A Study of Continual Training on Languages of Norway

dc.contributor.author: Samuel, David
dc.contributor.author: Mikhailov, Vladislav
dc.contributor.author: Velldal, Erik
dc.contributor.author: Øvrelid, Lilja
dc.contributor.author: Charpentier, Lucas Georges Gabriel
dc.contributor.author: Kutuzov, Andrey
dc.contributor.author: Oepen, Stephan
dc.contributor.editor: Johansson, Richard
dc.contributor.editor: Stymne, Sara
dc.coverage.spatial: Tallinn, Estonia
dc.date.accessioned: 2025-02-18T14:45:10Z
dc.date.available: 2025-02-18T14:45:10Z
dc.date.issued: 2025-03
dc.description.abstract: Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
dc.identifier.uri: https://hdl.handle.net/10062/107253
dc.language.iso: en
dc.publisher: University of Tartu Library
dc.relation.ispartofseries: NEALT Proceedings Series, No. 57
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title: Small Languages, Big Models: A Study of Continual Training on Languages of Norway
dc.type: Article

Files

Original bundle

Name: 2025_nodalida_1_61.pdf
Size: 428.81 KB
Format: Adobe Portable Document Format