Lexicon-Enhanced Neural Lemmatization for Estonian
Date
2020
Publisher
Tartu Ülikool
Abstract
Lemmatization, i.e. recovering the normal (dictionary) form of a word from text, is a
crucial part of many natural language processing (NLP) applications. It is an important
part of text preprocessing, the step in which data are cleaned and prepared for use in
NLP models and algorithms. Done correctly, this step can greatly improve the performance
of a model; neglected, it can drastically reduce the quality of the output.
Neural networks now dominate NLP in general and lemmatization in particular. Most recent
papers report 95-96% accuracy, but there is still plenty of room for improvement. As with
most neural network architectures, a lack of training data can be a major obstacle when
building a model, and many smaller languages cannot afford large annotated datasets.
Estonian, which sits somewhere in the middle in terms of available data, can benefit
from additional data.
In this thesis, we propose a novel approach to lemmatization. In addition to the
regular input, the lemmatization model takes a second input: either the predictions of
another, weaker rule-based lemmatizer or information from a lexicon built from the
training data. Using a combination of several attention layers, the model chooses the
best of the two inputs and produces more accurate lemmas.
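
As a rough illustration of the lexicon-based second input, the sketch below builds a lexicon from (word form, lemma) training pairs and pairs each token with a candidate lemma. The abstract does not say how the lexicon is constructed or how it is fed to the model, so everything here is an assumption made for illustration: the names `build_lexicon` and `make_inputs` are hypothetical, as are the most-frequent-lemma rule, the lowercased lookup, and the empty candidate for unseen forms. The neural encoder and attention layers are not shown.

```python
from collections import Counter, defaultdict


def build_lexicon(pairs):
    """Map each lowercased word form to its most frequent lemma.

    Assumption: the thesis may build its lexicon differently; this is
    only one simple way to derive one from annotated training pairs.
    """
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[form.lower()][lemma] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def make_inputs(form, lexicon):
    """Return the two character sequences a dual-input model could read.

    The first sequence is the regular input (the word form itself); the
    second is the lexicon candidate, or an empty sequence when the form
    was never seen in training (an illustrative choice, not the source's).
    """
    candidate = lexicon.get(form.lower(), "")
    return list(form), list(candidate)


# Toy Estonian training pairs (word form, lemma), invented for this example.
train_pairs = [
    ("majades", "maja"),
    ("majad", "maja"),
    ("läks", "minema"),
    ("läks", "minema"),
]
lexicon = build_lexicon(train_pairs)

print(make_inputs("majades", lexicon))  # seen form: candidate lemma available
print(make_inputs("koerad", lexicon))   # unseen form: empty candidate
```

In a setup like this, the model would fall back on the word form alone whenever the candidate is empty, which is one reason an attention mechanism over both inputs is a natural fit.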
Keywords
Natural language processing, lemmatization, deep learning