Gonçalo M. Correia, Vlad Niculae and André F.T. Martins. In Proceedings of EMNLP, 2019

Comments: Presented as a Talk in Machine Learning Session III.

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word relationships. However, with standard softmax attention, all attention heads are dense, assigning a non-zero weight to all context words. In this work, we introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with $\alpha$-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the $\alpha$ parameter -- which controls the shape and sparsity of $\alpha$-entmax -- allowing attention heads to choose between focused or spread-out behavior. Our adaptively sparse Transformer improves interpretability and head diversity when compared to softmax Transformers on machine translation datasets. Findings of the quantitative and qualitative analysis of our approach include that heads in different layers learn different sparsity preferences and tend to be more diverse in their attention distributions than softmax Transformers. Furthermore, at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations.
@inproceedings{Correia2019b, author = {Correia, Gonçalo M. and Niculae, Vlad and Martins, André F. T.}, booktitle = {Proceedings of EMNLP}, title = {Adaptively Sparse Transformers}, year = {2019}}
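
The difference between dense softmax attention and sparse $\alpha$-entmax described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `softmax` and `entmax_bisect`, the bisection solver, and the iteration count are assumptions made for illustration. The sketch assumes the standard form $p_i = [(\alpha - 1) z_i - \tau]_+^{1/(\alpha - 1)}$ for $\alpha > 1$, with the threshold $\tau$ chosen so that the weights sum to one.

```python
import math


def softmax(scores):
    """Dense baseline: every score receives a strictly positive weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entmax_bisect(scores, alpha=1.5, n_iter=50):
    """Illustrative alpha-entmax (alpha > 1) solved by bisection on tau.

    Hypothetical helper for this sketch, not an API from the paper.
    """
    assert alpha > 1, "this sketch only covers alpha > 1"
    d = len(scores)
    scaled = [(alpha - 1) * s for s in scores]
    m = max(scaled)

    def probs(tau):
        # Scores at or below the threshold get exactly zero weight.
        return [max(x - tau, 0.0) ** (1.0 / (alpha - 1)) for x in scaled]

    # The weights sum to >= 1 at `lo` and to <= 1 at `hi`.
    lo = m - 1.0
    hi = m - d ** (1 - alpha)
    for _ in range(n_iter):
        tau = (lo + hi) / 2
        if sum(probs(tau)) >= 1.0:
            lo = tau  # threshold too low: total mass still too large
        else:
            hi = tau
    p = probs(lo)
    total = sum(p)
    return [x / total for x in p]  # remove the residual bisection error


scores = [1.0, 0.5, -1.0]
dense = softmax(scores)                     # all entries non-zero
sparse = entmax_bisect(scores, alpha=1.5)   # lowest score gets zero weight
sparsemax = entmax_bisect(scores, alpha=2)  # alpha = 2 case
```

With `alpha=2` the threshold for these scores is 0.25, so the weights are approximately `[0.75, 0.25, 0.0]`, whereas softmax keeps all three entries positive. In the paper $\alpha$ is learned per attention head; here it is simply passed as an argument.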
## Unbabel's Submission to the WMT2019 APE Shared Task: BERT-based Encoder-Decoder for Automatic Post-Editing

António V. Lopes, M. Amin Farajian, Gonçalo M. Correia, Jonay Trenous, André F.T. Martins. In Proceedings of WMT, 2019

This paper describes Unbabel's submission to the WMT2019 APE Shared Task for the English-German language pair. Following the recent rise of large, powerful, pre-trained models, we adapt the BERT pretrained model to perform Automatic Post-Editing in an encoder-decoder framework. Analogously to dual-encoder architectures, we develop a BERT-based encoder-decoder (BED) model in which a single pretrained BERT encoder receives both the source ($src$) and machine translation ($tgt$) strings. Furthermore, we explore a conservativeness factor to constrain the APE system to perform fewer edits. As the official results show, when trained on a weighted combination of in-domain and artificial training data, our BED system with the conservativeness penalty significantly improves the translations of a strong Neural Machine Translation system in terms of both TER and BLEU.
@inproceedings{Lopes2019, author = {Lopes, António V. and Farajian, M. Amin and Correia, Gonçalo M. and Trenous, Jonay and Martins, André F. T.}, booktitle = {Proceedings of WMT19}, title = {Unbabel's Submission to the WMT2019 APE Shared Task: BERT-based Encoder-Decoder for Automatic Post-Editing}, year = {2019}}
## A Simple and Effective Approach to Automatic Post-Editing with Transfer Learning

Gonçalo M. Correia and André F.T. Martins. In Proceedings of ACL, 2019