Automatically describing the contents of an image is a fundamental problem in artificial intelligence and computer vision. Existing approaches are either top-down, starting from a holistic representation of an image and converting it into a textual description; bottom-up, generating attributes that describe numerous aspects of an image and combining them into a caption; or a hybrid of both. Recurrent neural networks (RNNs) enhanced with Long Short-Term Memory (LSTM) units have become a dominant component of several frameworks designed for the image captioning task. Although they mitigate the vanishing gradient problem and capture long-range dependencies, they are inherently sequential across time. In this work, we propose two novel approaches, one top-down and one bottom-up, that dispense with recurrence entirely by employing a Transformer, a network architecture that generates sequences relying solely on the attention mechanism. We introduce adaptive positional encodings for spatial locations in an image and a new regularization cost during training. We demonstrate visually that our model automatically focuses on salient regions of the image, and experimental evaluation on the MS-COCO dataset exhibits the superiority of our method. © 2018 ACM.
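The abstract does not detail the adaptive positional encodings; as a point of reference, a minimal sketch of the standard fixed sinusoidal encoding (from the original Transformer) extended to 2-D spatial locations on a CNN feature grid might look as follows. All function names and the 14×14 grid size are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Standard 1-D sinusoidal encoding for an array of integer positions."""
    # angle rates 1 / 10000^(2i/dim), one per sin/cos channel pair
    i = np.arange(dim // 2)
    rates = 1.0 / (10000.0 ** (2 * i / dim))
    angles = positions[:, None] * rates[None, :]            # (n, dim/2)
    enc = np.zeros((len(positions), dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def spatial_encoding(height, width, dim):
    """2-D encoding for an H x W grid: concatenate row and column encodings,
    each using dim/2 channels, giving one dim-vector per spatial location."""
    row = sinusoidal_encoding(np.arange(height), dim // 2)  # (H, dim/2)
    col = sinusoidal_encoding(np.arange(width), dim // 2)   # (W, dim/2)
    # broadcast both encodings over the full grid, then join along channels
    grid = np.concatenate(
        [np.repeat(row[:, None, :], width, axis=1),
         np.repeat(col[None, :, :], height, axis=0)],
        axis=-1)                                            # (H, W, dim)
    return grid.reshape(height * width, dim)

# e.g. a 14x14 CNN feature map with 512-dimensional Transformer inputs
pe = spatial_encoding(14, 14, 512)
print(pe.shape)  # (196, 512)
```

Each of the 196 spatial locations receives a distinct, fixed vector that is added to the corresponding image-region feature before attention; a learned or "adaptive" scheme would replace these fixed sinusoids with trainable parameters.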