Investigating Linguistic Abilities of LLMs for Native Language Identification

dc.contributor.author: Uluslu, Ahmet Yavuz
dc.contributor.author: Schneider, Gerold
dc.contributor.editor: Muñoz Sánchez, Ricardo
dc.contributor.editor: Alfter, David
dc.contributor.editor: Volodina, Elena
dc.contributor.editor: Kallas, Jelena
dc.coverage.spatial: Tallinn, Estonia
dc.date.accessioned: 2025-02-17T10:46:46Z
dc.date.available: 2025-02-17T10:46:46Z
dc.date.issued: 2025-03
dc.description.abstract: Large language models (LLMs) have achieved state-of-the-art results in native language identification (NLI). However, these models often depend on superficial features, such as cultural references and self-disclosed information in the document, rather than capturing the underlying linguistic structures. In this work, we assess the linguistic abilities of open-source LLMs by evaluating their performance in NLI through content-independent features, such as POS n-grams, function words, and punctuation marks, and compare their performance against traditional machine learning approaches. Our experiments reveal that while LLMs' initial performance on structural features (55.2% accuracy) falls significantly below their performance on full text (96.5%), fine-tuning substantially improves their capabilities, enabling state-of-the-art results with strong cross-domain generalization.
dc.identifier.uri: https://hdl.handle.net/10062/107172
dc.language.iso: en
dc.publisher: University of Tartu Library
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: Investigating Linguistic Abilities of LLMs for Native Language Identification
dc.type: Article

Files

Original bundle

Name: 2025_nlp4call_1_7.pdf
Size: 344.92 KB
Format: Adobe Portable Document Format