A Bottom-Up and Top-Down Approach for Image Captioning using Transformer
Parameswaran S.N.
Published by Association for Computing Machinery
2018
Abstract
Automatically describing the contents of an image has been a fundamental problem in the fields of artificial intelligence and computer vision. Existing approaches are either top-down, starting from a holistic representation of an image and converting it into a textual description; bottom-up, generating attributes describing numerous aspects of an image and composing them into a caption; or a combination of both. Recurrent neural networks (RNNs) enhanced with Long Short-Term Memory (LSTM) units have become a dominant component of several frameworks designed for the image captioning task. Despite their ability to mitigate the vanishing gradient problem and capture dependencies, they are inherently sequential across time. In this work, we propose two novel approaches, a top-down and a bottom-up approach independently, which dispense with recurrence entirely by incorporating a Transformer, a network architecture that generates sequences relying entirely on the mechanism of attention. Adaptive positional encodings for the spatial locations in an image and a new regularization cost during training are introduced. The ability of our model to focus automatically on salient regions in the image is demonstrated visually. Experimental evaluation of the proposed architecture on the MS-COCO dataset is performed to exhibit the superiority of our method. © 2018 ACM.
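The abstract mentions positional encodings for the spatial locations in an image but does not spell out their exact form. As a rough illustration only (not the paper's "adaptive" scheme), the sketch below extends the standard 1D sinusoidal Transformer encoding to a 2D grid of image regions by encoding row and column indices separately and concatenating them; the function name and the row/column split are assumptions for illustration.

```python
import numpy as np

def spatial_positional_encoding(height, width, d_model):
    """Illustrative fixed 2D sinusoidal positional encoding for an
    H x W grid of image regions: half the channels encode the row
    index, the other half the column index, each using the sinusoid
    scheme of the original Transformer. This is a generic sketch,
    not the adaptive encoding proposed in the paper."""
    assert d_model % 4 == 0, "d_model must be divisible by 4"
    d_half = d_model // 2

    def encode_1d(positions, dim):
        # Standard 1D sinusoidal encoding: sin on even channels, cos on odd.
        pe = np.zeros((len(positions), dim))
        div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))
        pe[:, 0::2] = np.sin(positions[:, None] * div)
        pe[:, 1::2] = np.cos(positions[:, None] * div)
        return pe

    row_pe = encode_1d(np.arange(height, dtype=np.float64), d_half)  # (H, d/2)
    col_pe = encode_1d(np.arange(width, dtype=np.float64), d_half)   # (W, d/2)

    # Concatenate row and column codes for every (row, col) cell,
    # flattened in row-major order to match a flattened feature map.
    pe = np.concatenate(
        [np.repeat(row_pe, width, axis=0),
         np.tile(col_pe, (height, 1))],
        axis=1,
    )
    return pe  # shape (H*W, d_model)
```

Such an encoding would be added to the flattened CNN feature map (e.g. a 7x7 grid of region vectors) before feeding it to the Transformer encoder, giving the otherwise permutation-invariant attention mechanism access to each region's location.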
About the journal
Journal: ACM International Conference Proceeding Series
Publisher: Association for Computing Machinery
Open Access: No