Abstract: Neural attention has become central to many state-of-the-art models in natural language processing and related domains. Attention networks are an easy-to-train and effective method for softly simulating alignment; however, the approach does not marginalize over latent alignments in a probabilistic sense. This property makes it difficult to compare attention to other alignment approaches, to compose it with probabilistic models, and to perform posterior inference conditioned on observed data. A related latent approach, hard attention, fixes these issues, but is generally harder to train and less accurate. This work considers variational attention networks, alternatives to soft and hard attention for learning latent variable alignment models, with tighter approximation bounds based on amortized variational inference. We further propose methods for reducing the variance of gradients to make these approaches computationally feasible. Experiments show that for machine translation and visual question answering, inefficient exact latent variable models outperform standard neural attention, but these gains go away when using hard attention based training. On the other hand, variational attention retains most of the performance gain but with training speed comparable to neural attention.
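For concreteness, a minimal sketch of the amortized variational bound alluded to above, written in generic notation (the symbols $x$ for the input, $y$ for the output, $z$ for the latent alignment, and the variational distribution $q$ are illustrative, not the paper's exact formulation):
\[
\log p(y \mid x) \;=\; \log \mathbb{E}_{p(z \mid x)}\big[\, p(y \mid z, x) \,\big]
\;\ge\; \mathbb{E}_{q(z)}\big[\, \log p(y \mid z, x) \,\big] \;-\; \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big),
\]
where $p(z \mid x)$ plays the role of the attention distribution and $q(z)$ is produced by an amortized inference network. Soft attention instead predicts from the expected alignment, roughly $p(y \mid \mathbb{E}_{p(z \mid x)}[z],\, x)$, pushing the expectation inside the prediction function rather than marginalizing over $z$.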