Kelvin Xu - Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)

Created: July 6, 2017 / Updated: February 6, 2021 / Status: finished / 1 min read (~185 words)

  • The paper introduces two attention-based image caption generators under a common framework (both are sketched just after this list):
    • A "soft" deterministic attention mechanism trainable by standard back-propagation
    • A "hard" stochastic attention mechanism trainable by maximizing an approximate variational lower bound, or equivalently by REINFORCE

  • We use a convolutional neural network to extract a set of feature vectors, which we refer to as annotation vectors
  • The extractor produces $L$ vectors, each of which is a $D$-dimensional representation corresponding to a part of the image
  • To obtain a correspondence between the feature vectors and portions of the 2D image, we extract features from a lower convolutional layer rather than a final fully connected layer
  • We use a long short-term memory (LSTM) network that produces a caption by generating one word at each time step, conditioned on a context vector, the previous hidden state, and the previously generated words (see the sketch after this list)
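
Putting the pieces above together, here is a hedged sketch of the annotation extractor and one decoding step. The paper takes its annotations from a 14×14×512 VGG feature map, so $L = 196$ and $D = 512$; the exact torchvision layer slice, the `DecoderStep` module, and the zero-initialized state below are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# --- Annotation extractor ---
# Features from a lower conv layer keep a spatial grid, so each of the
# L vectors corresponds to a region of the image. Slicing VGG-19 through
# conv5_4 + ReLU gives a 14x14x512 map for a 224x224 input (the layer
# index is an assumption, not taken from the paper's code).
cnn = models.vgg19(weights=None).features[:36]
image = torch.randn(1, 3, 224, 224)              # dummy image batch
fmap = cnn(image)                                # (1, 512, 14, 14)
annotations = fmap.flatten(2).transpose(1, 2)    # (1, L=196, D=512)


# --- One decoding step ---
class DecoderStep(nn.Module):
    """LSTM cell emitting a word distribution from the previous word,
    the current context vector, and the previous state."""

    def __init__(self, vocab_size, embed_dim=512, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, context, state):
        # condition on the previously generated word and the context vector
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h, c = self.cell(x, state)
        return self.out(h), (h, c)               # logits over the next word


# One generation step, using the Attention module from the sketch above.
# Zero-initialized state is a simplification; the paper initializes the
# LSTM state from the mean of the annotation vectors via small MLPs.
attn = Attention(feat_dim=512, hidden_dim=512, attn_dim=256)
decoder = DecoderStep(vocab_size=10_000)
h = c = torch.zeros(1, 512)
context, alpha = attn(annotations, h)            # soft attention over 196 regions
logits, (h, c) = decoder(torch.tensor([0]), context, (h, c))
```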

  • Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning. 2015.