Weakly-Supervised Text Classification for Estonian Sentiment Analysis

dc.contributor.advisor: Sirts, Kairit, supervisor
dc.contributor.author: Pung, Andreas
dc.contributor.other: Tartu Ülikool. Loodus- ja täppisteaduste valdkond [et]
dc.contributor.other: Tartu Ülikool. Arvutiteaduse instituut [et]
dc.date.accessioned: 2023-08-25T07:04:24Z
dc.date.available: 2023-08-25T07:04:24Z
dc.date.issued: 2022
dc.description.abstract: Text classification is one of the most fundamental tasks in Natural Language Processing. Hand-labelling texts is costly and may require specialised domain knowledge, which is where unsupervised and weakly-supervised approaches can be useful. In this Master's thesis, the weakly-supervised text classification paradigm is used to classify the sentiment of Estonian texts. In this paradigm, weak labels are created using labelling functions (Ratner et al., 2016). The aim of the thesis is to assess whether weakly-supervised models trained on a dataset around 40× larger are a viable alternative to hand-labelling a smaller number of texts to train a fully-supervised classifier. The compared models are fully- and weakly-supervised BERT (Devlin et al., 2019), and weakly-supervised COSINE (Yu et al., 2021) and WeaSEL (Cachay et al., 2021). Human evaluation is performed on the texts where the models disagreed the most. We find that the fully-supervised models have the best performance. The best-performing weakly-supervised model trained on the larger dataset had an average classification accuracy 7.29% lower (and a weighted F1-score 7.05% lower) than the fully-supervised BERT model. The lower performance of the weakly-supervised models might be caused by the low quality of the labelling functions; developing them further might lead to better results. [et]
dc.identifier.uri: https://hdl.handle.net/10062/91752
dc.language.iso: eng [et]
dc.publisher: Tartu Ülikool [et]
dc.rights: openAccess [et]
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [*]
dc.subject: Text classification [et]
dc.subject: weakly-supervised text classification [et]
dc.subject: weak supervision [et]
dc.subject: labelling functions [et]
dc.subject: unsupervised text classification [et]
dc.subject: natural language processing [et]
dc.subject: sentiment analysis [et]
dc.subject: Estonian dataset [et]
dc.subject.other: magistritööd [et]
dc.subject.other: informaatika [et]
dc.subject.other: infotehnoloogia [et]
dc.subject.other: informatics [et]
dc.subject.other: infotechnology [et]
dc.title: Weakly-Supervised Text Classification for Estonian Sentiment Analysis [et]
dc.type: Thesis [et]

Files

Original bundle

Name: Pung_ComputerScience_2022.pdf
Size: 1.67 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission