Assignment 1: Multilingual POS Tagging

Due Feb 7, 11:59pm ET.

This is an individual assignment.

Slides for the recitation.

Overview

A number of tasks in natural language processing can be framed as sequence tagging, i.e., predicting a sequence of labels, one for each token in a sentence. In this homework, you will look at part-of-speech (POS) tagging for multiple languages and investigate the challenges that arise when labeled data is scarce. The goal of this homework is to familiarize you with deep learning frameworks (PyTorch), computing resources (AWS), and multilingual datasets.

Models for Sequence Tagging

For a given sequence of tokens, \(\mathbf{x} = (x_1, \ldots, x_n)\), sequence tagging predicts a sequence of labels of the same length, \(\mathbf{y} = (y_1, \ldots, y_n)\), where each \(y_i \in \{1, \ldots, L\}\) ranges over the label set of interest. In discriminative models, we score any candidate tag sequence for an input sequence with a scoring function \(s(\mathbf{y}, \mathbf{x})\), so that the model's best prediction corresponds to the following inference problem:

\[\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} s(\mathbf{y}, \mathbf{x}) = \arg\max_{\mathbf{y}} \sum_{i=1}^n \psi(y_i, i, \mathbf{x})\]

When predicting the label \(y_i\), this classifier can use any features of the input sequence \(\mathbf{x}\) and the position \(i\). As a baseline for this homework, we provide a bidirectional LSTM model that produces a feature vector \(h_i\) for each input position, on top of which a feed-forward layer predicts the POS tag at every step \(i\). (A more sophisticated model could also exploit the sequential nature of the labels, for example with a conditional random field (CRF), although recent work such as BERT has achieved strong results without explicitly modeling label dependencies.)
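For concreteness, here is a minimal sketch of this architecture in PyTorch. The class name and dimensions are illustrative, not those of the provided code; assign1/main.py remains the reference implementation.

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Sketch of the baseline: embed tokens, encode with a BiLSTM to get
    features h_i, then score each position's tags with a linear layer."""

    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional: each h_i concatenates left and right context.
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))  # h: (batch, seq_len, hidden_dim)
        return self.out(h)                    # psi(y_i, i, x) scores per position

Because the score factorizes over positions, decoding is simply a per-token argmax over the output scores, e.g. model(tokens).argmax(dim=-1).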

Data

You will evaluate your models on a subset of the Universal Dependencies (UD) v1.2 treebanks. Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. In this task, you will focus only on part-of-speech tags. You will evaluate your models on 8 languages, a mix of high-resource and low-resource. For this task, we define languages with more than 60K tokens in their dataset as high-resource and the others as low-resource (a token-counting sketch follows the list below). We will consider languages from the following language families.

  • Germanic: English (en), Afrikaans (af)
  • Slavic: Czech (cs)
  • Romance: Spanish (es)
  • Semitic: Arabic (ar)
  • Baltic: Lithuanian (lt)
  • Armenian: Armenian (hy)
  • Dravidian: Tamil (ta)
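To check which side of the 60K-token threshold a language falls on, you can count the tokens in its training file. A minimal sketch, assuming a CoNLL-style layout with one token per line and blank lines between sentences (check the actual files for the exact format):

def count_tokens(path):
    """Count non-blank, non-comment lines, i.e. one per token in
    CoNLL-style files (blank lines separate sentences)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f
                   if line.strip() and not line.startswith("#"))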

Using AWS

The assignment is completely doable without a GPU, but using one is strongly recommended. If you do not have a GPU, the course will provide you with AWS credits, which can be requested by following the instructions in this Piazza post: https://piazza.com/class/kwxtpsn7fsc6jp. All students should have and use an AWS account under their @andrew.cmu.edu email, and should join AWS Educate on that account: https://aws.amazon.com/education/awseducate/.

Setting up an AWS instance for this assignment should be fairly straightforward. However, if you do run into difficulties, the instructors of the 11-785 Introduction to Deep Learning class (Prof. Bhiksha Raj, Ameya Sunil Mahabaleswarkar, and Zhe Chen) have a very helpful recitation on AWS fundamentals. The playlist also covers useful AWS tips, such as how to connect to AWS instances from VSCode. If you have any further questions, please don't hesitate to post on Piazza or contact the TAs for this assignment (cs11-737-sp2022-tas@cs.cmu.edu).

Results

We provide a baseline implementation of a BiLSTM model, written in PyTorch, along with the required data. The provided link also includes a model trained on the English UD data. Please refer to the comments in assign1/main.py to understand what each part of the code does.

To run this code, you will need Python >= 3.8, PyTorch >= 1.10.1, and torchtext >= 0.11.1. We recommend running your experiments in a conda environment, which you can reuse for future homeworks. The required libraries can be installed by creating and activating a new conda environment and running the following commands:

conda create -n 11737hw python=3.8
conda activate 11737hw
conda install numpy
conda install pytorch=1.10.1 torchtext=0.11.1 cudatoolkit -c pytorch
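After installing, you can sanity-check the environment with a short Python snippet (the expected versions are those pinned above; CUDA availability depends on your machine):

import torch, torchtext

print(torch.__version__)          # expect 1.10.1
print(torchtext.__version__)      # expect 0.11.1
print(torch.cuda.is_available())  # True if PyTorch can see your GPU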

To run the code and reproduce the results:

  1. Download the code and data from here.

  2. Evaluate the model on the English UD test data.
    python main.py --mode eval --lang en
    

    This will load the provided model file trained on the English data and return the per-label accuracy on the test set.

  3. Train a new model for each language in the provided list above.
    python main.py --mode train --lang lang-code
    

    Valid language codes are {en, es, cs, ar, af, lt, hy, ta}.

  4. Evaluate the newly trained models on their respective test sets. (The loop below covers steps 3 and 4.)
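To run steps 3 and 4 for every language in one go, a simple shell loop over the language codes works (this assumes main.py saves and then loads one model per language, as it does for the provided English model):

for lang in en es cs ar af lt hy ta; do
    python main.py --mode train --lang $lang
    python main.py --mode eval --lang $lang
done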

Models are evaluated using per-label accuracy on the test set.
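Here, per-label accuracy is token-level tagging accuracy: the fraction of tokens whose predicted tag matches the gold tag. The sketch below computes this, along with a per-tag breakdown that is useful for the error analysis suggested under Requirements; the function and argument names are ours, and the eval mode of the provided code is the reference implementation.

from collections import Counter

def tagging_accuracy(gold_sents, pred_sents):
    """Overall fraction of tokens tagged correctly.
    gold_sents, pred_sents: parallel lists of per-sentence tag sequences."""
    correct = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total

def per_tag_accuracy(gold_sents, pred_sents):
    """Accuracy broken down by gold tag, e.g. to see whether the model
    struggles more on some tags (open-class NOUN/VERB vs. closed-class)."""
    correct, total = Counter(), Counter()
    for gold, pred in zip(gold_sents, pred_sents):
        for g, p in zip(gold, pred):
            total[g] += 1
            correct[g] += int(g == p)
    return {tag: correct[tag] / total[tag] for tag in total}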

Requirements and Grading

You are expected to do the following:

  1. Basic Requirement: Run the provided model on the English UD data and reproduce the result on its test set (91.44% accuracy). This will earn you a passing B grade.

  2. Multilingual Results: Reproduce the results on the other provided languages. This will earn you a B+. Expected per-label accuracies (a drop of up to 0.5% is acceptable):

    Language   en      cs      es      ar      af      lt      hy      ta
    Accuracy   91.44   93.97   93.36   94.44   88.85   75.59   79.91   40.09
  3. Report: Write and submit a report (max 3 pages) describing and analyzing your results. The report should be written in the ACL style. This will earn you an A-. A few examples of analyses:
    • How the performance changes across language families, typology, dataset size, etc.
    • Error analysis: does the model perform better or worse on certain kinds of tags? What are the implications of this? (The per-tag accuracy sketch under Results may be a useful starting point.)
    • What happens when you change training set sizes?
    • What happens if you change the hyperparameters of the model?
    • What happens if the labels are noisy?

    Note that these are just examples, not an exhaustive list. You may run any of these or conduct a different analysis of your own that you find interesting. To get an A-, you need to conduct at least two analyses.

  4. Make changes to the existing preprocessing/model/algorithm and show an improvement in the reported results. This will earn you an A, or an A+ for particularly extensive or interesting improvements. Some suggestions for improvement:
    • Add a CRF layer on top of the baseline model.
    • Add a CNN input layer to capture character-level features (a sketch follows this list).
    • Pre-train the model parameters with a language modeling objective.
    • Add multilingual pre-trained embeddings (Polyglot, mBERT) to your model. If you do this, make sure you also report at least one set of results without such external resources.
    • Add auxiliary losses.

    For this part, you may use existing code with proper citation, as long as it is incorporated into the given codebase without drastic changes.
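As one example of the character-level CNN suggestion above, here is a hedged sketch of a module that builds a fixed-size character representation per word, which could be concatenated with the word embedding before the BiLSTM. All names and dimensions here are our assumptions, not part of the provided codebase.

import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Sketch: embed the characters of each word, convolve over them,
    and max-pool to a fixed-size vector per word."""

    def __init__(self, num_chars, char_dim=30, num_filters=50, kernel_size=3):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size, padding=1)

    def forward(self, chars):            # chars: (num_words, max_word_len)
        e = self.char_embed(chars)       # (num_words, max_word_len, char_dim)
        e = e.transpose(1, 2)            # Conv1d expects (N, channels, length)
        h = torch.relu(self.conv(e))     # (num_words, num_filters, max_word_len)
        return h.max(dim=2).values       # max-pool over character positions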

Submission

Update (2/4/2022): Please include model outputs instead of the trained models.

In your submission, please include the following files:

  • code/: all of your code.
  • code/README: instructions on how to run if you have implemented additional code.
  • model_outputs/: your predictions on the test sets. This directory should contain af.conll, ar.conll, cs.conll, en.conll, es.conll, hy.conll, lt.conll, ta.conll. The format is the same as that of the training data (a writer sketch follows this list).
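If you write the prediction files yourself, a minimal writer might look like the following. It assumes one token and its tag per line, with blank lines between sentences; verify the exact column layout against the training files before relying on it.

def write_conll(path, sentences, predictions):
    """Write predicted tags in a CoNLL-style layout: one token per line,
    blank line between sentences. Check against the training data format."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens, tags in zip(sentences, predictions):
            for token, tag in zip(tokens, tags):
                f.write(f"{token}\t{tag}\n")
            f.write("\n")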

Please compress the above files as assign1.zip and submit the zip file as well as the report to Gradescope.