Visually grounded speech models for low-resource languages and cognitive modelling

Date
2024-12
Publisher
Stellenbosch : Stellenbosch University
Abstract
Visually grounded speech (VGS) models learn from unlabelled speech paired with images. Such models can be valuable for developing speech applications for low-resource languages that lack transcribed data, and for understanding how humans acquire language, since children learn speech from multimodal cues. This dissertation makes contributions to both of these areas.

In the first part of this dissertation, we consider two research questions about using VGS models in low-resource language applications. The first research question asks: can a VGS model detect and localise a keyword depicted by an image within speech? For this, we propose a new task called visually prompted keyword localisation (VPKL): given an image depicting a keyword query, the model must detect whether the keyword occurs in a spoken utterance and, if it does, localise it within the utterance. To do VPKL, we modify a common VGS modelling approach: an acoustic network and a vision network connected by a multimodal attention mechanism. On English, treated as an artificial low-resource language, we find that using an ideal visual tagger to obtain training pairs outperforms a previous visual bag-of-words (BoW) model that locates written keywords in spoken utterances. With an actual visual tagger, however, scores drop below the written-keyword baseline. To make VPKL feasible for a real low-resource language, we first consider few-shot learning before returning to this problem.

In the second research question, we ask whether a VGS model can learn words from only a few word-image pairs. We use an architecture similar to the VPKL model's and combine it with a few-shot learning approach that can learn new classes from a handful of natural word-image pairs. Using the few given word-image example pairs, new unsupervised word-image training pairs are mined from large unlabelled speech and image sets. Our approach outperforms an existing VGS few-shot model when the number of examples per class is small. We therefore apply this approach to an actual low-resource language, Yorùbá. The Yorùbá few-shot model outperforms its English variant.

Building on this few-shot progress, we return to VPKL and propose a simpler model similar to our previous VPKL model. Here we assume access to a dataset of spoken utterances paired with descriptive images. To mine speech-image training pairs for a keyword, we take a few spoken examples of the keyword and compare them to the utterances in the dataset's speech-image pairs. We find that this approach outperforms both our previous approach on the English VPKL task and the visual BoW model that detects textual keywords in speech. We then apply this approach to Yorùbá. Since the speech system in the pair-mining scheme uses a model trained on English, the precision of the few-shot Yorùbá localisation model is low. However, the ground-truth Yorùbá model outperforms the textual keyword localisation model applied to Yorùbá by large margins.
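To make the VPKL-style modelling approach described above more concrete, the following is a minimal PyTorch sketch, not the dissertation's actual implementation: the class name VGSLocaliser, the layer choices and all dimensions are illustrative assumptions. It only shows the general shape of such a model, with an acoustic network and a vision network joined by a multimodal attention mechanism whose per-frame scores can be read off for both detection and localisation of a visually prompted keyword.

```python
import torch
import torch.nn as nn


class VGSLocaliser(nn.Module):
    """Illustrative VGS sketch: an image query attends over speech frames."""

    def __init__(self, n_mels=40, hidden_dim=256, image_dim=512, embed_dim=256):
        super().__init__()
        # Acoustic network: frame-level encoder over a mel-spectrogram.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(n_mels, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, embed_dim, kernel_size=5, padding=2),
        )
        # Vision network: here only a projection of precomputed image features.
        self.image_proj = nn.Linear(image_dim, embed_dim)

    def forward(self, mels, image_feats):
        # mels: (batch, n_mels, n_frames); image_feats: (batch, image_dim)
        audio = self.audio_enc(mels)             # (batch, embed_dim, n_frames)
        query = self.image_proj(image_feats)     # (batch, embed_dim)
        # Multimodal attention: the image query scores every speech frame.
        frame_scores = torch.einsum("bd,bdt->bt", query, audio)
        attn = frame_scores.softmax(dim=-1)
        detection = (attn * frame_scores).sum(dim=-1)   # utterance-level score
        return detection, frame_scores                  # frame scores localise the keyword


model = VGSLocaliser()
mels = torch.randn(2, 40, 100)      # two utterances of 100 frames each
images = torch.randn(2, 512)        # two image-query feature vectors
detection, frame_scores = model(mels, images)
print(detection.shape, frame_scores.shape)   # torch.Size([2]) torch.Size([2, 100])
```

In practice the two encoders would typically be much larger, and possibly initialised from self-supervised audio and vision models; the sketch only fixes the interface of "image query in, per-frame keyword scores out".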
In the second part of this dissertation, we ask two more research questions regarding the use of VGS models in computational cognitive studies. Our third research question considers whether a VGS model exhibits the mutual exclusivity (ME) bias, a word-learning constraint used by children: a novel word is assumed to refer to an unfamiliar object rather than a familiar one. To investigate this, we use our few-shot object and word learning model and generate a speech-image dataset containing spoken English word and image examples for a set of familiar and novel classes. The model is trained on the word-image pairs for the familiar classes and then prompted with novel English spoken words, asking whether each word belongs to an unfamiliar or a familiar object. All variants of the model exhibit the ME bias, and a model that uses both self-supervised audio and vision initialisations has the strongest ME bias. This makes sense from a cognitive perspective, since children are exposed to spoken language and visual stimuli in their surroundings when they begin using the ME bias.

Various cognitive ME studies have considered the effect of factors such as multilingualism on the ME bias. Since this effect has not yet been studied computationally, our fourth research question asks how multilingualism affects the ME bias exhibited by our VGS model. We extend the English ME dataset's training set to contain spoken Dutch and French words for the familiar classes. We train a trilingual English-Dutch-French model and two bilingual models: an English-Dutch model and an English-French model. These multilingual models are compared to the monolingual English model of the previous research question. We find that the monolingual model has a weaker ME bias than the multilingual models. This trend is the opposite of that seen in children: monolingual children have a stronger ME bias than multilingual children. This study is preliminary and requires further investigation.

In summary, we find that VGS models can be used to develop low-resource applications using only a small set of ground-truth examples. We also find that VGS models can be used to computationally study the ME bias observed in children. Further investigation is required into the effect of multilingualism on the bias in VGS models and how it compares to the effect in children. We believe this dissertation gives sufficient evidence of how valuable VGS models can be, and that it will encourage research in this field to build inclusive speech technology and to contribute to understanding human language learning.
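The ME evaluation described above can also be sketched in a few lines. The function and variable names below are illustrative assumptions, not the dissertation's code: a trained word-image scorer is given a novel spoken word together with one familiar and one novel object, and the ME bias is measured as the fraction of trials on which the novel object receives the higher score.

```python
import torch


def me_bias(score, trials):
    """score(word, image) -> scalar similarity; trials: (word, familiar_img, novel_img)."""
    chose_novel = 0
    for word, familiar_img, novel_img in trials:
        # The spoken word is novel, so an ME-biased learner should map it
        # to the novel object rather than the familiar one.
        chose_novel += int(score(word, novel_img) > score(word, familiar_img))
    return chose_novel / len(trials)


# Stand-in scorer and random embeddings purely to show the interface; a real
# evaluation would use the trained VGS model's word-image similarity instead.
def dummy_score(word_emb, image_emb):
    return torch.dot(word_emb, image_emb).item()


trials = [(torch.randn(64), torch.randn(64), torch.randn(64)) for _ in range(100)]
print(f"ME bias: {me_bias(dummy_score, trials):.2f}")   # ~0.5 for a random scorer
```

A score of 0.5 corresponds to chance; values well above 0.5, as reported for the models in this dissertation, indicate an ME bias.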
Description
Thesis (PhD)--Stellenbosch University, 2024.