Department of Applied Mathematics
Permanent URI for this community
Browse
Browsing Department of Applied Mathematics by Subject "Architectural models"
Now showing 1 - 1 of 1
Results Per Page
Sort Options
- ItemLow-resource image captioning(Stellenbosch : Stellenbosch University, 2022-12) Du Plessis, Mikkel; Brink, Willie; Stellenbosch University. Faculty of Science. Dept. of Applied Mathematics.ENGLISH ABSTRACT: Image captioning combines computer vision and natural language processing, and aims to automatically generate a short natural language phrase that describes relationships between objects and context within a given image. As the field of deep learning evolves, several approaches have produced impressive models and generally follow an encoder-decoder architecture. An encoder is utilised for visual cues and a textual decoder to produce a final caption. This can create a challenging gap between visual and textual representations, and makes the training of image captioning models resource intensive. Consequently, recent image captioning models have relied on a steady increase of training set size, computing requirements and training times. This thesis explores the viability of two model architectures for the task of image captioning in a low-resource scenario. We focus specifically on models that can be trained on a single consumer-level GPU in under 5 hours, using only a few thousand images. Our first model is a conventional image captioning model with a pre-trained convolutional neural network as the encoder, followed by an attention mechanism, and an LSTM as the decoder. Our second model utilises a Transformer in the encoder and the decoder. Additionally, we propose three auxiliary techniques that aim to extract more information from images and training captions with only marginal computational overhead. Firstly, we address the typical sparseness in object and scene representation by taking advantage of top-down and bottom-up features, in order to present the decoder with richer visual information and context. Secondly, we suppress semantically unlikely caption candidates during the decoder’s beam search procedure through the inclusion of a language model. Thirdly, we enhance the expressiveness of the model by augmenting training captions with a paraphrase generator. We find that the Transformer-based architecture is superior under low-data circumstances. Through a combination of all proposed methods applied, we achieve state-of-the-art performance on the Flickr8k test set and surpass existing recurrent-based methods. To further validate the generalisability of our models, we train on small, randomly sampled subsets of the MS COCO dataset and achieve competitive test scores compared to existing models trained on the full dataset.