Error rate of automated part-of-speech tagging of Estonian academic learner English
Date
2021
Publisher
Tartu Ülikool
Abstract
Corpora are a valuable tool for linguistic research and for improving learner language. The Tartu Corpus of Estonian Learner English (TCELE) already exists, but it is small and lacks academic learner English. Building a corpus of Estonian academic learner English (EALE) could fill this gap and provide useful information for students, teachers and researchers alike. Modern corpora include various types of annotation, the most common of which is tagging words for their part of speech (POS), but manual tagging is a long and difficult task. Automated taggers can make this process relatively fast and easy. However, while automated tagger performance has been evaluated on both native and learner writing, there is a lack of research on tagger performance on academic learner writing. This paper aims to study the accuracy of automated POS tagging of EALE. To achieve this, a corpus of EALE was built and tagged using the Natural Language Toolkit (NLTK) POS tagger, and the results were compared against a sample of manually added tags.
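The comparison of automated tags against manually added tags can be illustrated with a minimal sketch. It is not the procedure used in the thesis, which the abstract does not detail. The function name `tagging_error_rate` and the sample sentence are hypothetical. The sketch assumes both tag sequences are already aligned token by token and use the same tagset (for example, the Penn Treebank tags returned by `nltk.pos_tag`).

```python
from collections import Counter


def tagging_error_rate(auto_tagged, gold_tagged):
    """Compare two aligned lists of (token, tag) pairs.

    Returns the share of tokens whose automated tag differs from the
    manual (gold) tag, plus a count of each (gold, automated) confusion.
    """
    if len(auto_tagged) != len(gold_tagged):
        raise ValueError("Tag sequences must have the same length.")
    if not gold_tagged:
        raise ValueError("Cannot compute an error rate on empty input.")

    confusions = Counter()
    for (auto_tok, auto_tag), (gold_tok, gold_tag) in zip(auto_tagged, gold_tagged):
        if auto_tok != gold_tok:
            raise ValueError(f"Token mismatch: {auto_tok!r} vs {gold_tok!r}")
        if auto_tag != gold_tag:
            confusions[(gold_tag, auto_tag)] += 1

    errors = sum(confusions.values())
    return errors / len(gold_tagged), confusions


# Illustrative data only, not taken from the EALE corpus.
# In practice the automated tags could come from nltk.pos_tag(tokens).
gold = [("The", "DT"), ("results", "NNS"), ("show", "VBP"),
        ("a", "DT"), ("clear", "JJ"), ("trend", "NN")]
auto = [("The", "DT"), ("results", "NNS"), ("show", "NN"),
        ("a", "DT"), ("clear", "JJ"), ("trend", "NN")]

error_rate, confusions = tagging_error_rate(auto, gold)
print(f"Error rate: {error_rate:.3f}")
for (gold_tag, auto_tag), count in confusions.most_common():
    print(f"gold {gold_tag} tagged as {auto_tag}: {count}")
```

Keeping the confusion counts alongside the overall rate shows which tag pairs the tagger mixes up, which is often more informative for learner writing than a single accuracy figure.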
Keywords
academic learner language, tagging