Kelvin Xu - Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)

Created: July 6, 2017 / Updated: February 6, 2021 / Status: finished / 1 min read (~185 words)

  • The paper introduces two attention-based image caption generators under a common framework (both are sketched just after this list):
    • A "soft" deterministic attention mechanism trainable by standard back-propagation
    • A "hard" stochastic attention mechanism trainable by maximizing an approximate variational lower bound, or equivalently by REINFORCE

  • We use a convolutional neural network to extract a set of feature vectors, which we refer to as annotation vectors
  • The extractor produces $L$ vectors, each of which is a $D$-dimensional representation corresponding to a part of the image
  • To obtain a correspondence between the feature vectors and portions of the 2D image, we extract features from a lower convolutional layer rather than a final fully connected layer
  • We use a long short-term memory (LSTM) network that produces a caption by generating one word at each time step, conditioned on a context vector, the previous hidden state, and the previously generated words (see the sketch after this list)
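
Putting the pieces above together, here is a hedged sketch of the annotation extractor and one decoding step. The paper takes its annotations from a 14×14×512 VGG feature map, so $L = 196$ and $D = 512$; the exact torchvision layer slice, the `DecoderStep` module, and the zero-initialized state below are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# --- Annotation extractor ---
# Features from a lower conv layer keep a spatial grid, so each of the
# L vectors corresponds to a region of the image. Slicing VGG-19 through
# conv5_4 + ReLU gives a 14x14x512 map for a 224x224 input (the layer
# index is an assumption, not taken from the paper's code).
cnn = models.vgg19(weights=None).features[:36]
image = torch.randn(1, 3, 224, 224)              # dummy image batch
fmap = cnn(image)                                # (1, 512, 14, 14)
annotations = fmap.flatten(2).transpose(1, 2)    # (1, L=196, D=512)


# --- One decoding step ---
class DecoderStep(nn.Module):
    """LSTM cell emitting a word distribution from the previous word,
    the current context vector, and the previous state."""

    def __init__(self, vocab_size, embed_dim=512, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, context, state):
        # condition on the previously generated word and the context vector
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h, c = self.cell(x, state)
        return self.out(h), (h, c)               # logits over the next word


# One generation step, using the Attention module from the sketch above.
# Zero-initialized state is a simplification; the paper initializes the
# LSTM state from the mean of the annotation vectors via small MLPs.
attn = Attention(feat_dim=512, hidden_dim=512, attn_dim=256)
decoder = DecoderStep(vocab_size=10_000)
h = c = torch.zeros(1, 512)
context, alpha = attn(annotations, h)            # soft attention over 196 regions
logits, (h, c) = decoder(torch.tensor([0]), context, (h, c))
```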

  • Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning. 2015.