Assignment 2: Multilingual Translation

Due Feb 25, 11:59pm ET.

This is a group assignment, with teams of 2-3 people.

Slides for the recitation.

Overview

Machine Translation has seen great progress in the last few years with the advent of deep learning models, achieving state-of-the-art results in many high-resource language pairs. However, progress in MT for low-resource languages has lagged behind due to the limited availability of parallel data. The goal of this homework is to make you familiar with machine translation frameworks and explore how multilingual training techniques can help improve performance in low-resource languages.

Neural Machine Translation

To recap, given a sentence in the source language \(\mathbf{x} = (x_1, \ldots, x_n)\), the goal of a machine translation model is to predict a sentence in the target language \(\mathbf{y} = (y_1, \ldots, y_m)\). To do so, we model the conditional probability of the target given the source, training to maximize the (log-)likelihood of the data:

\[\theta^* = \underset{\theta}{\mathrm{argmax}} \, \mathbb{E}_{\mathbf{x},\mathbf{y}}\left [ \log p_\theta(\mathbf{y} | \mathbf{x}) \right ] \]
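
In practice, this conditional probability is factorized autoregressively over target tokens, so training a standard encoder-decoder NMT model (such as those in fairseq) amounts to minimizing the per-token negative log-likelihood:

\[ p_\theta(\mathbf{y} | \mathbf{x}) = \prod_{t=1}^{m} p_\theta(y_t | y_{<t}, \mathbf{x}), \qquad \mathcal{L}(\theta) = - \sum_{t=1}^{m} \log p_\theta(y_t | y_{<t}, \mathbf{x}) \]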

At inference time, we try to obtain the most likely translation for a given source sentence, by approximately solving the inference problem:

\[ \hat{\mathbf{y}} = \underset{\mathbf{y}}{\mathrm{argmax}} \, p_\theta(\mathbf{y} | \mathbf{x}) \]

Search algorithms such as beam search are typically used, but it is also possible to obtain translations by sampling from the model.
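
For intuition, a stripped-down beam search looks like the sketch below; step_log_probs() is a hypothetical stand-in for the model's next-token log-probabilities, and real implementations (such as fairseq's) add length normalization, batching, and pruning:

# Sketch of beam search over a hypothetical next-token scorer.
# step_log_probs(prefix) is assumed to return (token, log_prob) pairs for
# candidate continuations of the given prefix; it is NOT a fairseq API.
def beam_search(step_log_probs, bos, eos, beam_size=5, max_len=50):
    beams = [([bos], 0.0)]          # each hypothesis: (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in step_log_probs(tokens):
                candidates.append((tokens + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == eos:
                finished.append((tokens, score))   # completed hypothesis
            else:
                beams.append((tokens, score))      # keep expanding
            if len(beams) == beam_size:
                break
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])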

To evaluate the model, automatic metrics that rely on lexical overlap, such as BLEU, are typically used. However, there are also newer metrics based on cross-lingual language models, such as BLEURT and COMET, that correlate better with human judgments.
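
As a concrete illustration, both metrics can be computed from Python. The file names below are placeholders for a system output, its references, and the corresponding sources; the COMET model name and predict() call follow the unbabel-comet >= 1.0 API and may need adjusting to the version pinned in requirements.txt:

# Sketch: corpus-level BLEU with SacreBLEU, plus a hedged COMET call.
import sacrebleu

with open("hyp.eng", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("ref.eng", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")

# COMET additionally needs the source side of the test set.
from comet import download_model, load_from_checkpoint

with open("src.aze", encoding="utf-8") as f:
    srcs = [line.strip() for line in f]

model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print(model.predict(data, batch_size=8, gpus=1))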

Experiments

Start by downloading the code from here.

Like the previous assignment, this assignment uses PyTorch as the backbone deep learning framework. Again, we recommend using an Anaconda environment. You can reuse the environment from the previous assignment OR create a new one by running:

conda create -n 11737hw2 python=3.8
conda activate 11737hw2
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

This assignment will also use fairseq as the machine translation framework on top of PyTorch. To download the codebase and install it in your environment, run:

git clone git@github.com:pytorch/fairseq.git
cd fairseq
pip install .
pip install --upgrade numpy
export FAIRSEQ_DIR=`pwd`
cd ..

Finally, we also need some extra libraries, such as SacreBLEU and COMET, to run evaluation:

pip install -r requirements.txt

Data

We also need data to train our models. In this assignment, we will use the TED Talks corpus [1], which contains parallel data from 58 languages to English. The raw data can be downloaded by running:

python download_data.py
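
If you want a quick sanity check of how much parallel data each pair has, a few lines of Python suffice. The paths below are assumptions about the layout produced by download_data.py; adjust them to the files you actually find:

# Sketch: count parallel training sentences for a few language pairs.
# The paths are hypothetical; point them at the real training files.
from pathlib import Path

train_files = {
    "aze-eng": "data/ted_raw/aze_eng/ted-train.orig.aze",
    "bel-eng": "data/ted_raw/bel_eng/ted-train.orig.bel",
    "tur-eng": "data/ted_raw/tur_eng/ted-train.orig.tur",
}

for pair, path in train_files.items():
    p = Path(path)
    if p.exists():
        print(f"{pair}: {sum(1 for _ in p.open(encoding='utf-8'))} training sentences")
    else:
        print(f"{pair}: {path} not found -- adjust the path")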

Bilingual Baselines

As a first step, we will train NMT systems using only parallel data from the languages of interest. In this assignment, we will consider two low-resource languages: Azerbaijani (aze) and Belarusian (bel), translating to and from English (eng).

We provide scripts for the complete pipeline: data processing (including simple cleaning and subword tokenization), training, and evaluation. You should read the scripts to understand each of these steps.

To perform preprocessing for the aze-eng parallel corpora, run:

bash preprocess-ted-bilingual.sh

Then, you can train and evaluate models on the preprocessed data in both directions by running:

bash traineval_aze_eng.sh
bash traineval_eng_aze.sh

With slight modifications to the scripts, you should be able to run the baselines for both aze and bel. You should obtain results similar to the ones in the table below.

Language pair   BLEU   COMET
aze-eng         2.10   -1.162
eng-aze         1.33   -1.352
bel-eng         1.52   -1.3552
eng-bel         0.80   -1.4560

Multilingual Training

Note that since the languages we consider have a very limited amount of parallel training data, the NMT models perform quite poorly, with BLEU scores below 10 and (very) negative COMET scores. This is a known issue with neural models in general: they are much more data-hungry than statistical methods. Luckily, we can use multilingual training to boost the performance on these low-resource languages. In this section, you will learn to use data from related high-resource languages to improve NMT performance.

Crucially, for cross-lingual transfer to work, the high-resource language must be similar to the low-resource language we are transferring to. For example, a closely related language to Azerbaijani is Turkish (tur), which has much more parallel data (about 200k sentences in the TED corpus). To train a model with multilingual training, we simply concatenate the aze and tur data and train the model on the combined corpus. The idea is that the knowledge learned about tur can also help the model translate aze.
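
As a minimal sketch of this "concatenate and train" idea (the file names are placeholders; the provided preprocess-ted-multilingual.sh performs the equivalent steps for you):

# Sketch: build a combined azetur training set by concatenating the aze and
# tur parallel data, keeping source and target files line-aligned.
def concatenate(inputs, output):
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    out.write(line)

# Source side: aze then tur sentences; target side: their English
# translations in the same order, so line i of both files stays a pair.
concatenate(["train.aze-eng.aze", "train.tur-eng.tur"], "train.azetur-eng.azetur")
concatenate(["train.aze-eng.eng", "train.tur-eng.eng"], "train.azetur-eng.eng")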

We provide scripts for training and evaluating a multilingual model for aze that is also trained on tur.

To start, run the following script to preprocess the data:

bash preprocess-ted-multilingual.sh

Then, run training and evaluation in both directions, to and from English:

bash traineval_azetur_eng.sh
bash traineval_eng_azetur.sh

For bel, you need to choose a high-resource transfer language and modify the scripts accordingly. For example, you could use data from Russian (rus). You should be able to match the following results:

Language pair   BLEU    COMET
aze-eng         11.37   -0.2290
eng-aze          5.96   -0.0857
bel-eng         17.24   -0.3396
eng-bel          9.84   -0.4102

Finetuning Pretrained Multilingual Models

Another option to improve the performance is to leverage massive multilingual pretrained models. These models were trained on gigantic corpora with over 100 languages and have been shown to improve performance on low-resource languages by extensively leveraging cross-lingual transfer across the languages considered.

In this assignment, we will consider finetuning the small FLORES-101 models on our low-resource languages.

To start, download the fairseq checkpoints for the model by running:

mkdir -p checkpoints && cd checkpoints
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz
tar -xvf flores101_mm100_175M.tar.gz
cd ..

We provide scripts for preprocessing the data and finetuning the model for aze.

Note that we need to run preprocessing again, since we need to use the original (subword) tokenization that the FLORES model was trained with. To preprocess the data, run:

bash preprocess-ted-flores-bilingual.sh
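
For reference, this preprocessing applies the SentencePiece model shipped with the checkpoint archive. From Python that looks roughly like the sketch below; the model file name inside the archive is an assumption, and the provided script already handles this step:

# Sketch: tokenize raw text with the FLORES checkpoint's SentencePiece model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(
    model_file="checkpoints/flores101_mm100_175M/sentencepiece.bpe.model")

line = "Salam, dünya!"  # an Azerbaijani example sentence ("Hello, world!")
print(" ".join(sp.encode(line, out_type=str)))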

You can then finetune the model and evaluate it in both directions, to and from English:

bash traineval_flores_aze_eng.sh
bash traineval_flores_eng_aze.sh

Again, with slight modifications to the scripts, you should be able to finetune and evaluate for bel as well. You should obtain results similar to the ones in the table below.

Language pair   BLEU    COMET
aze-eng         12.96   -0.1035
eng-aze          7.74   -0.0109
bel-eng         20.04   -0.0157
eng-bel         14.15    0.0481

Improving Multilingual Transfer

Data Augmentation

Extra monolingual data is often helpful for improving NMT performance. Specifically, back-translation [2,3] has been widely used to boost the performance of low-resource NMT. A closely related method, self-training, has recently been shown to be effective as well [4]. Several methods have also been proposed to combine these uses of monolingual data with multilingual training: Xia et al. [5] explore several strategies for modifying the related-language data to improve multilingual training, and [6] adds a masked language model objective over monolingual data while training a multilingual model.
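
To make the back-translation recipe concrete, here is a hedged sketch of the data flow for the aze-eng direction; translate() is a hypothetical stand-in for decoding with an already-trained eng-aze model:

# Sketch of back-translation for aze -> eng (hypothetical helper names).
def back_translate(mono_eng_path, translate, out_src, out_tgt):
    """Pair real English sentences with synthetic Azerbaijani sources."""
    with open(mono_eng_path, encoding="utf-8") as f, \
         open(out_src, "w", encoding="utf-8") as src_out, \
         open(out_tgt, "w", encoding="utf-8") as tgt_out:
        for eng in f:
            eng = eng.strip()
            synthetic_aze = translate(eng)       # eng -> aze model (hypothetical)
            src_out.write(synthetic_aze + "\n")  # synthetic source
            tgt_out.write(eng + "\n")            # real target

# The resulting (synthetic aze, real eng) pairs are concatenated with the
# genuine parallel data before retraining the aze -> eng model.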

Choosing Transfer Languages

For the provided multilingual training method, we simply use a closely related high-resource language as the transfer language. However, it is likely that data from other languages would be helpful as well. Lin et al. [7] conducted a systematic study of choosing transfer languages for NLP tasks, and there are also methods designed to choose multilingual data for specific NLP tasks [8].

Better Word Representation or Segmentation

Vocabulary differences between the low-resource language and its related high-resource language are an important bottleneck for multilingual transfer. Wang et al. [9] propose a character-aware embedding method for multilingual training. For morphologically rich languages, such as Turkish and Azerbaijani, it can also be useful to incorporate morphological information into word representations [10].

Recently, several approaches have been proposed to improve word segmentation for standard NMT models [11, 12]. It is possible that these improvements would help multilingual NMT as well.
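
For example, SentencePiece exposes stochastic segmentation (the BPE-dropout-style subword regularization of [11]) through sampling arguments to encode; a minimal sketch, assuming you already have a trained BPE segmentation model:

# Sketch: stochastic subword segmentation with SentencePiece. With a BPE
# model, enable_sampling corresponds to BPE-dropout [11]; the model path
# is a placeholder for whatever segmentation model your pipeline trained.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.bpe.model")
sentence = "bu bir misal cümlesidir"
for _ in range(3):
    # alpha controls the dropout strength; each call may segment differently.
    print(sp.encode(sentence, out_type=str, enable_sampling=True, alpha=0.1))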

Better Modeling

You can also improve the NMT architecture or optimization to better model data from multiple languages. Wang et al. [13] propose three strategies to improve one-to-many translation, including better initialization and language-specific embeddings. Zhang et al. [14] propose adding language-aware modules to the NMT model.

Efficient Finetuning

While pretrained multilingual models are generally more data-efficient than bilingual models trained from scratch, they are still relatively data-hungry and can perform poorly in extremely low-resource settings. Recently, several approaches have been proposed to improve both the data and parameter efficiency of finetuning pretrained machine translation models.

Adapter finetuning [15] is a general method that introduces a small set of new parameters to the model during finetuning, leaving the original parameters fixed. Adapter finetuning has been shown to improve the performance of multilingual models when finetuning on new languages [16].
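
A minimal PyTorch sketch of the adapter idea, a small bottleneck layer with a residual connection inserted into an otherwise frozen model (how to wire it into fairseq's Transformer layers is left to you):

# Sketch: a bottleneck adapter block in the spirit of [15]. A common recipe
# is to insert one after each (frozen) Transformer sublayer and train only
# the adapter parameters [16].
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the pretrained representation intact.
        return x + self.up(self.act(self.down(x)))

# Usage sketch: freeze the pretrained model, train only adapter parameters.
# for p in pretrained_model.parameters():
#     p.requires_grad = False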

Prefix-tuning [17] is an alternative finetuning paradigm that adds a parametrized prefix to the model's inputs (embeddings) and trains only these prefixes, leaving the original model parameters fixed. Prefix-tuning can be even more parameter- and data-efficient than adapter finetuning [18], but little research has applied prefix-tuning to machine translation.
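
For intuition, here is a hedged sketch of the simplest prefix variant, learned vectors prepended to the (frozen) model's input embeddings; prefix-tuning as in [17] additionally injects prefixes into the keys and values of every attention layer:

# Sketch: trainable prefix vectors prepended to input embeddings while the
# base model's parameters stay frozen.
import torch
import torch.nn as nn

class PrefixEmbeddings(nn.Module):
    def __init__(self, prefix_len: int, d_model: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        batch = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)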

Other Approaches

There are many potential directions for improving multilingual training and finetuning. We encourage you to do more literature research, and even come up with your own method!

Grading

Write and submit a report (max 3 pages) describing and analyzing the results. The report should be written in the ACL style.

  1. Basic Requirement: Reproduce the results of the bilingual baselines and of either your trained-from-scratch multilingual model OR the finetuned pretrained multilingual model. To account for variance in experiments, a drop of 0.5 BLEU or 0.05 COMET is acceptable. This will earn you a passing B grade.
  2. Analyze multilingual (pre-)training: Try to understand how multilingual pre-training helps performance on the low-resource languages. For example, does multilinguality help more when translating from English or to English? Does performance drop for any particular type of source sentence? Any interesting phenomenon is accepted, so be creative (one possible starting point is sketched after this list)! This will earn you a B+ grade.
  3. Implement at least one pre-existing method to try to improve multilingual transfer: Compare the performance of the implemented method with the baselines, clearly documenting results and analyzing why it does or does not work. This will earn you an A- grade.
  4. Implement several methods to improve multilingual transfer: For example, you can implement multiple pre-existing methods, or one pre-existing method and one novel method. Compare the performance with the baselines, clearly documenting results and analyzing why it does or does not work. This will earn you an A, or an A+ for particularly extensive or interesting improvements and analysis.
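
As one concrete starting point for the analysis in requirement 2, sentence-level BLEU can be bucketed by source length with SacreBLEU (the file names are placeholders for your test sources, decoded outputs, and references):

# Sketch: sentence-level BLEU bucketed by source-sentence length, one simple
# way to check whether performance drops for particular kinds of inputs.
from collections import defaultdict
import sacrebleu

def read(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

srcs, hyps, refs = read("test.aze"), read("test.hyp.eng"), read("test.ref.eng")

buckets = defaultdict(list)
for src, hyp, ref in zip(srcs, hyps, refs):
    bucket = min(len(src.split()) // 10, 4)   # 0-9, 10-19, ..., 40+ tokens
    buckets[bucket].append(sacrebleu.sentence_bleu(hyp, [ref]).score)

for bucket in sorted(buckets):
    label = f"{bucket * 10}-{bucket * 10 + 9}" if bucket < 4 else "40+"
    scores = buckets[bucket]
    print(f"source length {label}: BLEU {sum(scores) / len(scores):.2f} ({len(scores)} sentences)")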

If using existing code, please cite your sources.

Submission

Your submission consists of three parts: code, model outputs, and writeup. Put all your code in a folder named code, with instructions on how to run it if you have implemented additional code. Include the output of your models in an outputs directory, with a description of which model each file is associated with. Name the writeup writeup.pdf and compress everything as assign2.tar.gz. This file must be submitted to Canvas (link on the course website).

References

[1]: Qi et al. When and Why Are Pre-trained Word Embeddings Useful for Neural Machine Translation?

[2]: Edunov et al. Understanding Back-Translation at Scale

[3]: Sennrich et al. Improving Neural Machine Translation Models with Monolingual Data

[4]: He et al. Revisiting Self-Training for Neural Sequence Generation

[5]: Xia et al. Generalized Data Augmentation for Low-Resource Translation

[6]: Siddhant et al. Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation

[7]: Lin et al. Choosing Transfer Languages for Cross-Lingual Learning

[8]: Wang et al. Target Conditioned Sampling: Optimizing Data Selection for Multilingual Neural Machine Translation

[9]: Wang et al. Multilingual Neural Machine Translation with Soft Decoupled Encoding

[10]: Chaudhary et al. Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations

[11]: Provilkov et al. BPE-Dropout: Simple and Effective Subword Regularization

[12]: He et al. Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation

[13]: Wang et al. Three Strategies to Improve One-to-Many Multilingual Translation

[14]: Zhang et al. Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation

[15]: Houlsby et al. Parameter-Efficient Transfer Learning for NLP

[16]: Philip et al. Monolingual Adapters for Zero-Shot Neural Machine Translation

[17]: Li et al. Prefix-Tuning: Optimizing Continuous Prompts for Generation

[18]: He et al. Towards a Unified View of Parameter-Efficient Transfer Learning