Weakly-Supervised Text Classification for Estonian Sentiment Analysis
Date
2022
Publisher
Tartu Ülikool
Abstract
Text Classification is one of the most fundamental tasks in Natural Language
Processing. Hand-labelling texts is costly and might need specialised domain
knowledge – this is where unsupervised and weakly-supervised approaches could be
useful. In this Master’s Thesis, the weakly-supervised text classification paradigm
is used to classify the sentiment of Estonian texts. In this paradigm, the weak
labels are created using labelling functions (Ratner et al., 2016). The aim of
this thesis is to assess whether weakly-supervised models trained on a dataset
around 40× larger are a viable alternative to hand-labelling a smaller number of
texts to train a fully-supervised classifier. The compared models are fully- and
weakly-supervised BERT (Devlin et al., 2019), and weakly-supervised COSINE (Yu et
al., 2021) and WeaSEL (Cachay et al., 2021). Human evaluation is performed on texts where
the models disagreed the most. We find that the fully-supervised
models perform best. The best-performing weakly-supervised model
trained on the larger dataset had an average classification accuracy 7.29% lower
(and a weighted F1-score 7.05% lower) than the fully-supervised BERT model. The lower
performance of weakly-supervised models might be caused by the low quality of
labelling functions – developing them further might lead to better results.
Keywords
Text classification, weakly-supervised text classification, weak supervision, labelling functions, unsupervised text classification, natural language processing, sentiment analysis, Estonian dataset