Lexicon-Enhanced Neural Lemmatization for Estonian

Milintsevich, Kirill

Files

Milintsevich_Thesis.pdf (659.31 KB)

Date

2020

Authors

Milintsevich, Kirill

Publisher

Tartu Ülikool

Abstract

Lemmatization, i.e. recovering the normal (dictionary) form of a word from running text, is a crucial part of many natural language processing (NLP) applications. It is an important part of text preprocessing, the step in which data is cleaned and prepared for use in NLP models and algorithms. Done correctly, this step can greatly improve the performance of a model; neglected, it can drastically reduce the quality of the output. Neural networks now dominate NLP in general and lemmatization in particular. Most recent papers report accuracies of 95-96%, but there is still plenty of room for improvement. As with most neural network architectures, a lack of training data can be a major drawback when building a model, and many smaller languages cannot afford large annotated datasets. Estonian, which sits somewhere in the middle in terms of available data, can benefit from additional data. In this thesis, we propose a novel approach to lemmatization. In addition to the regular input, the lemmatization model takes a second input: either the predictions of another, weaker rule-based lemmatizer or entries from a lexicon built from the training data. By combining several attention layers, the model learns to choose the better of the two inputs and produces more accurate lemmas.
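The lexicon variant of the second input can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the `|` separator, and the toy training pairs are all assumptions made for illustration. It only shows how a form-to-lemma lexicon might be collected from annotated training data and turned into a second character sequence for a model to attend over.

```python
from collections import Counter, defaultdict

# Toy (word form, lemma) pairs standing in for an annotated training set.
# These examples are illustrative assumptions, not data from the thesis.
TRAIN_PAIRS = [
    ("majas", "maja"),
    ("Majad", "maja"),
    ("teed", "tee"),
    ("teed", "tee"),
    ("teed", "tegema"),
]


def build_lexicon(pairs):
    """Map each lowercased word form to a Counter of the lemmas seen with it."""
    lexicon = defaultdict(Counter)
    for form, lemma in pairs:
        lexicon[form.lower()][lemma] += 1
    return lexicon


def lexicon_candidates(lexicon, form, k=2):
    """Return up to k most frequent lemmas for the form, or [] if it is unseen."""
    counts = lexicon.get(form.lower())
    if not counts:
        return []
    return [lemma for lemma, _ in counts.most_common(k)]


def make_model_inputs(lexicon, form, sep="|"):
    """Build two character sequences: the word form and its joined lexicon candidates."""
    candidates = lexicon_candidates(lexicon, form)
    return list(form), list(sep.join(candidates))


LEXICON = build_lexicon(TRAIN_PAIRS)

if __name__ == "__main__":
    print(lexicon_candidates(LEXICON, "teed"))
    print(make_model_inputs(LEXICON, "majas"))
```

An ambiguous form such as "teed" yields more than one candidate, and an unseen form yields an empty second input. In a setup like the one the abstract describes, resolving those cases would be left to the neural model and its attention layers.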

Keywords

Natural language processing, lemmatization, deep learning

URI

https://hdl.handle.net/10062/94079

Collections

MTAT magistritööd – Master's theses
