Investigating Linguistic Abilities of LLMs for Native Language Identification
Date
2025-03
Authors
Journal title
Journal ISSN
Volume title
Publisher
University of Tartu Library
Abstract
Large language models (LLMs) have achieved state-of-the-art results in native language identification (NLI). However, these models often depend on superficial features, such as cultural references and self-disclosed information in the document, rather than capturing the underlying linguistic structures. In this work, we assess the linguistic abilities of open-source LLMs by evaluating their NLI performance on content-independent features, such as POS n-grams, function words, and punctuation marks, and compare them against traditional machine learning approaches. Our experiments reveal that while the LLMs' initial performance on structural features (55.2% accuracy) falls significantly below their performance on full text (96.5%), fine-tuning substantially improves their capabilities, enabling state-of-the-art results with strong cross-domain generalization.
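The content-independent features named in the abstract (POS n-grams, function words, punctuation) can be illustrated with a minimal sketch. The paper does not publish its extraction code, so everything below is an assumption: the tiny function-word list, the punctuation set, and the pre-supplied POS tag sequence are hypothetical placeholders standing in for whatever resources and tagger the authors actually used.

```python
from collections import Counter

# Hypothetical, tiny function-word list for illustration only;
# a real NLI pipeline would use a much larger curated list.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "that", "is"}
PUNCTUATION = set(".,;:!?'\"-()")

def content_independent_features(text, pos_tags=None, n=2):
    """Count function words, punctuation marks, and POS n-grams,
    ignoring all content words (assumed feature scheme, not the paper's)."""
    feats = Counter()
    for tok in text.lower().split():
        # Strip surrounding punctuation before the function-word check.
        word = tok.strip("".join(PUNCTUATION))
        if word in FUNCTION_WORDS:
            feats[f"fw={word}"] += 1
        # Count every punctuation character in the raw token.
        for ch in tok:
            if ch in PUNCTUATION:
                feats[f"punct={ch}"] += 1
    # POS n-grams require an externally tagged sequence (e.g. from a
    # POS tagger run over the same tokens).
    if pos_tags:
        for i in range(len(pos_tags) - n + 1):
            feats["pos=" + "_".join(pos_tags[i : i + n])] += 1
    return feats

# Example with a hand-tagged sentence:
feats = content_independent_features(
    "The cat sat on the mat.",
    pos_tags=["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"],
)
```

Such count vectors can then be fed either to a traditional classifier (e.g. logistic regression over TF-IDF-weighted counts) or serialized into a prompt for an LLM, which is the comparison the paper draws.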