J. L. Ba, J. R. Kiros, and G. E. Hinton, Layer normalization. arXiv preprint, 2016.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014.

O. Caglayan, W. Aransa, Y. Wang, M. Masana, M. García-Martínez et al., Does Multimodality Help Human and Machine for Translation and Image Captioning?, Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp.627-633, 2016.
DOI : 10.18653/v1/W16-2358

URL : https://hal.archives-ouvertes.fr/hal-01433183

O. Caglayan, L. Barrault, and F. Bougares, Multimodal attention for neural machine translation, 2016.

O. Caglayan, M. García-Martínez, A. Bardet, W. Aransa, F. Bougares et al., Nmtpy: A flexible toolkit for advanced neural machine translation systems. arXiv preprint, 2017.

I. Calixto, Q. Liu, and N. Campbell, Doubly-attentive decoder for multimodal neural machine translation. arXiv preprint, 2017.

I. Calixto, Q. Liu, and N. Campbell, Incorporating global visual features into attention-based neural machine translation. arXiv preprint, 2017.

J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014.

J. H. Clark, C. Dyer, A. Lavie, and N. A. Smith, Better hypothesis testing for statistical machine translation: Controlling for optimizer instability, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, pp.176-181, 2011.

D. Elliott, S. Frank, L. Barrault, F. Bougares, and L. Specia, Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description, Proceedings of the Second Conference on Machine Translation, 2017.
DOI : 10.18653/v1/W17-4718

D. Elliott, S. Frank, and L. Specia, Multi30K: Multilingual English-German Image Descriptions, Proceedings of the 5th Workshop on Vision and Language, pp.70-74, 2016.
DOI : 10.18653/v1/W16-3210

D. Elliott and Á. Kádár, Imagination improves multimodal translation, 2017.

O. Firat and K. Cho, Conditional gated recurrent unit with attention mechanism, 2016.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. PMLR, volume 9 of Proceedings of Machine Learning Research, pp.249-256, 2010.

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770-778, 2016.
DOI : 10.1109/CVPR.2016.90

P. Huang, F. Liu, S. Shiang, J. Oh, and C. Dyer, Attention-based multimodal neural machine translation, Proceedings of the First Conference on Machine Translation. Association for Computational Linguistics, pp.639-645, 2016.
DOI : 10.18653/v1/W16-2360

D. Kingma and J. Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412, 2014.

A. Lavie and A. Agarwal, Meteor, Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pp.228-231, 2007.
DOI : 10.3115/1626355.1626389

J. Libovický and J. Helcl, Attention strategies for multi-source sequence-to-sequence learning, 2017.

M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, Multi-task sequence to sequence learning, 2015.

K. Papineni, S. Roukos, T. Ward, and W. Zhu, BLEU, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pp.311-318, 2002.
DOI : 10.3115/1073083.1073135

R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, Proceedings of the 30th International Conference on Machine Learning - Volume 28. JMLR.org, ICML'13, pp.1310-1318, 2013.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Proceedings of the 28th International Conference on Neural Information Processing Systems, pp.91-99, 2015.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision, vol.115, issue.3, pp.211-252, 2015.
DOI : 10.1007/s11263-015-0816-y

R. Sennrich, B. Haddow, and A. Birch, Neural Machine Translation of Rare Words with Subword Units, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.1715-1725, 2016.
DOI : 10.18653/v1/P16-1162

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014.

L. Specia, S. Frank, and D. Elliott, A Shared Task on Multimodal Machine Translation and Crosslingual Image Description, Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp.543-553, 2016.
DOI : 10.18653/v1/W16-2346

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, 2014.

I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, Proceedings of the 27th International Conference on Neural Information Processing Systems, pp.3104-3112, 2014.

A. Toral and V. M. Sánchez-Cartagena, A multifaceted evaluation of neural versus phrase-based machine translation for 9 language directions, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp.1063-1073, 2017.

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, Proceedings of the 32nd International Conference on Machine Learning (ICML-15). JMLR Workshop and Conference Proceedings, pp.2048-2057, 2015.