Small Languages, Big Models: A Study of Continual Training on Languages of Norway
dc.contributor.author | Samuel, David
dc.contributor.author | Mikhailov, Vladislav
dc.contributor.author | Velldal, Erik
dc.contributor.author | Øvrelid, Lilja
dc.contributor.author | Charpentier, Lucas Georges Gabriel
dc.contributor.author | Kutuzov, Andrey
dc.contributor.author | Oepen, Stephan
dc.contributor.editor | Johansson, Richard
dc.contributor.editor | Stymne, Sara
dc.coverage.spatial | Tallinn, Estonia
dc.date.accessioned | 2025-02-18T14:45:10Z
dc.date.available | 2025-02-18T14:45:10Z
dc.date.issued | 2025-03
dc.description.abstract | Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages such as Norwegian, and even more so for truly low-resource languages such as Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves both downstream performance and inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new 11.4-billion-parameter generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi: NorMistral-11B.
dc.identifier.uri | https://hdl.handle.net/10062/107253
dc.language.iso | en | |
dc.publisher | University of Tartu Library | |
dc.relation.ispartofseries | NEALT Proceedings Series, No. 57
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title | Small Languages, Big Models: A Study of Continual Training on Languages of Norway
dc.type | Article |