Joint Embeddings for Voices and Their Textual Descriptions
Date
2023
Publisher
Tartu Ülikool
Abstract
Embeddings are vector representations, a highly effective technique in machine learning
for representing data in a compact and meaningful way. In this study, we aim to learn
vector representations of speakers’ voices and their corresponding textual descriptions
that maximize the cosine similarity between matching pairs. In other words, we want to
build a system that places a voice and the description of that voice as close together
as possible in a shared multidimensional space. Our data collection process combines
public datasets with manually annotated data. To conduct our research, we employed
different training modes: standalone training, where each encoder is trained
individually, and joint training, where the encoders are trained together and learn to
adapt their outputs to one another. We then evaluated the models on a control sample
extracted from the manually collected dataset and assessed the quality of our
annotations. We also investigated how the cosine similarity between the voice and
description representations changes as annotation quality declines.
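The objective described in the abstract, pulling a voice embedding and its description embedding together, is scored with cosine similarity. The following is a minimal sketch in Python; the random stand-in vectors are illustrative, and in the actual system they would come from a Wav2Vec2-based voice encoder and a sentence encoder projected into a shared embedding space:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for encoder outputs: both encoders would map
# their input into vectors of the same dimensionality.
rng = np.random.default_rng(0)
voice_embedding = rng.normal(size=256)
# A well-matched description embedding lies close to the voice embedding.
text_embedding = voice_embedding + 0.1 * rng.normal(size=256)
# An unrelated description embedding is an independent random vector.
mismatched_text = rng.normal(size=256)

print(cosine_similarity(voice_embedding, text_embedding))   # close to 1
print(cosine_similarity(voice_embedding, mismatched_text))  # near 0
```

Joint training would push the first score toward 1 for matching voice–description pairs while keeping the score for mismatched pairs low.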
Keywords
text-to-speech, embeddings, cosine similarity, Wav2Vec2, sentence encoders