Small Languages, Big Models: A Study of Continual Training on Languages of Norway
dc.contributor.author | Samuel, David
dc.contributor.author | Mikhailov, Vladislav
dc.contributor.author | Velldal, Erik
dc.contributor.author | Øvrelid, Lilja
dc.contributor.author | Charpentier, Lucas Georges Gabriel
dc.contributor.author | Kutuzov, Andrey
dc.contributor.author | Oepen, Stephan
dc.contributor.editor | Johansson, Richard
dc.contributor.editor | Stymne, Sara
dc.coverage.spatial | Tallinn, Estonia
dc.date.accessioned | 2025-02-18T14:45:10Z
dc.date.available | 2025-02-18T14:45:10Z
dc.date.issued | 2025-03
dc.description.abstract | Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages such as Norwegian, and even more so for truly low-resource languages such as Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves both downstream performance and inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new 11.4-billion-parameter generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi: NorMistral-11B.
dc.identifier.uri | https://hdl.handle.net/10062/107253
dc.language.iso | en | |
dc.publisher | University of Tartu Library | |
dc.relation.ispartofseries | NEALT Proceedings Series, No. 57
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title | Small Languages, Big Models: A Study of Continual Training on Languages of Norway
dc.type | Article |