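Eesti-inglise statistilise masintõlke mudeli ümberpööramine inglise-eesti suunale (Inverting an Estonian-English statistical machine translation model to the English-Estonian direction)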
Date
2012
Publisher
Tartu Ülikool
Abstract
This thesis covers statistical machine translation both theoretically and practically. Statistical machine translation is a field that tries to make a computer translate without knowing anything about the formal grammar of the languages involved; its only input is a parallel corpus, that is, millions of sentence pairs in which one member is a translation of the other.
In the practical part, the existing Moses statistical machine translation framework was used to create a new translation model for the English-Estonian direction. In addition, an existing Estonian-English translation model, which had been assembled as a weighted combination of models obtained from different corpora, was inverted. Over the course of the work, 1 language model, 2 phrase models and 2 reordering models were created.
The theoretical part is a literature review and covers exactly those phrase, language and reordering model algorithms that were used in the practical part. More specifically, it covers the bidirectional phrase model with lexical weights, the trigram language model smoothed by recursive interpolation with the Witten-Bell method, and the bidirectional msd reordering model (monotone, swap, discontinuous: the phrase stays in place, swaps, or is placed discontinuously).
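As an illustration of the language-model component named above, here is a minimal sketch of a trigram model with recursively interpolated Witten-Bell smoothing. It is not the implementation used in the thesis or in any toolkit; the function names `train_counts` and `witten_bell_prob`, the toy sentences, and the choice of a uniform distribution as the base of the recursion are assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_counts(sentences, order=3):
    """Collect the counts Witten-Bell smoothing needs for every context length."""
    ngram_counts = Counter()      # c(h, w): times word w followed context h
    context_counts = Counter()    # c(h): times context h was followed by any word
    followers = defaultdict(set)  # distinct word types seen after context h
    for sent in sentences:
        tokens = ["<s>"] * (order - 1) + list(sent) + ["</s>"]
        for i in range(order - 1, len(tokens)):
            word = tokens[i]
            for k in range(order):  # context lengths 0 .. order-1
                context = tuple(tokens[i - k:i])
                ngram_counts[(context, word)] += 1
                context_counts[context] += 1
                followers[context].add(word)
    return ngram_counts, context_counts, followers

def witten_bell_prob(word, context, model, vocab_size):
    """P(word | context), interpolated recursively with shorter contexts.

    With T(h) = number of distinct word types seen after h:
        P(w | h) = (c(h, w) + T(h) * P(w | shorter h)) / (c(h) + T(h))
    The recursion ends in a uniform distribution over the vocabulary
    (an assumption of this sketch).
    """
    ngram_counts, context_counts, followers = model
    if len(context) == 0:
        lower = 1.0 / vocab_size
    else:
        lower = witten_bell_prob(word, context[1:], model, vocab_size)
    c_h = context_counts[context]
    if c_h == 0:
        # Context never observed: rely entirely on the lower-order estimate.
        return lower
    t_h = len(followers.get(context, ()))
    return (ngram_counts[(context, word)] + t_h * lower) / (c_h + t_h)

# Toy usage on two hypothetical sentences.
toy = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_counts(toy)
vocab = model[2][()]  # every word type that was ever predicted
p_seen = witten_bell_prob("sat", ("the", "cat"), model, len(vocab))
p_unseen = witten_bell_prob("dog", ("the", "cat"), model, len(vocab))
```

The interpolation weight given to the lower-order model grows with the number of distinct continuations of a context, so contexts that are followed by many different words lean more heavily on the shorter history.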
At the end of the work, a test corpus of more than a thousand sentences was translated and the result was scored with the automatic evaluation metric BLEU. The result was also examined more closely by hand. Although many easy sentences were translated almost perfectly, more complex sentences caused at least partial difficulties. The biggest problem was failure to understand context, followed by inflection and sentence structure.
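To make the evaluation step concrete, the following is a minimal sketch of corpus-level BLEU with a single reference per sentence, clipped n-gram precision up to 4-grams, and a brevity penalty. It is an illustration only, not the scoring script used in the thesis; the function names `ngrams` and `corpus_bleu` and the example sentences are assumptions, and tokenization is assumed to have been done already.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU for tokenized hypotheses, one reference each."""
    matches = [0] * max_n  # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n   # hypothesis n-grams, summed over the corpus
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)

# Toy usage with one hypothetical sentence pair.
hyp = [["the", "cat", "sat", "on", "the", "mat"]]
ref = [["the", "cat", "sat", "on", "a", "mat"]]
score = corpus_bleu(hyp, ref)
```

Because the score is computed from surface n-gram overlap only, a translation with the correct meaning but a different inflection or word order is penalized, which is one reason to complement it with manual inspection.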
The output of the work is a statistical machine translation model for the English-Estonian direction and the finding that the field is promising. The work is also intended as a starting point for doing statistical machine translation in the English-Estonian direction.
This thesis covers statistical machine translation from both a theoretical and a practical perspective. Statistical machine translation aims to make a machine translate without giving it any knowledge of the grammar of the languages involved. It receives only a parallel corpus of millions of sentence pairs, where the only certain knowledge is that the two sentences in each pair are translations of each other.
In the practical part of the work, we used the statistical machine translation framework Moses to create a new translation model for the English-Estonian direction. In addition, an existing model for the opposite direction, which had been built as a weighted combination of models from different corpora, was inverted. In total, 1 language model, 2 phrase models and 2 reordering models were created.
The theoretical part describes the algorithms used inside the framework and its components in the practical part. More precisely, we discuss the bidirectional phrase model with lexical weights, the trigram language model with recursive interpolation and Witten-Bell smoothing, and the bidirectional msd (monotone, swap, discontinuous) reordering model.
In the final stage of the work, a test corpus of more than 1000 sentences was translated with the created models. The result was measured with the automatic evaluation metric BLEU and also examined by hand. Many simpler sentences were translated well or almost perfectly, but errors appeared in more complex sentences. The most common error was misunderstanding the context, followed by wrong inflection and poor sentence structure.
The result of the work is an English-Estonian machine translation model and the conclusion that statistical machine translation is a promising approach for English-Estonian translation. The work is also meant as an educational starting point for anyone entering statistical machine translation in the English-Estonian direction.