Assignment 2 (Due 11/5)

Assignment 2 consists of a challenge task on translating into low-resource languages. See the README inside the package for more details.

Assignment 2 Package

We also provide a Baseline Repository that you can base your implementation on if you so choose. Use of this repository is optional.

The grading rubric for this checkpoint is as follows:

  • A+: Exceptional or surprising. Goes far beyond most other submissions.
  • A: A working implementation that implements some method specifically tailored to the challenge task and achieves good results, with corresponding analysis.
  • A- or B+: Similar to A, but less well motivated, less novel, with weaker results, or with less comprehensive analysis.
  • B or B-: Good effort, but the implementation or report are incomplete.
  • C+ or below: Clear lack of effort or incompleteness.

Submission Instructions

Please submit a zip with the following files.

  • System outputs: dev-enaf.af, test-enaf.af, dev-ennso.nso, test-ennso.nso, dev-ents.ts, test-ents.ts
  • report.pdf, your report describing what you did and analyzing the results.
  • github-url.txt, a file containing a single line with your GitHub repository URL (e.g. https://github.com/neubig/mtandseq2seq-code). You must grant access to this repository to all instructors and TAs (GitHub IDs: neubig, antonisa, cindyxinyiwang, pmichel31415, shrutirij, xiamengzhou).
  • code/, a directory containing the contents of your GitHub repository.

We have released a validation script that you can use to check that your submission is in compliance, as well as an example submission that passes the validation script.
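
If you want a quick local sanity check before uploading (in addition to the released validation script), a minimal sketch along the following lines can confirm that the zip contains the required files. This is an illustrative Python script, not the official validator; the file names are taken from the submission list above.

    import sys
    import zipfile

    # Files required by the submission instructions (system outputs,
    # report, and GitHub URL); code/ is checked separately as a directory.
    REQUIRED = [
        "dev-enaf.af", "test-enaf.af",
        "dev-ennso.nso", "test-ennso.nso",
        "dev-ents.ts", "test-ents.ts",
        "report.pdf", "github-url.txt",
    ]

    def check_submission(zip_path):
        with zipfile.ZipFile(zip_path) as zf:
            names = zf.namelist()
            ok = True
            for fname in REQUIRED:
                if not any(n == fname or n.endswith("/" + fname) for n in names):
                    print("missing:", fname)
                    ok = False
            if not any("code/" in n for n in names):
                print("missing: code/ directory")
                ok = False
            return ok

    if __name__ == "__main__":
        sys.exit(0 if check_submission(sys.argv[1]) else 1)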

Allowed and Disallowed Use of External Resources

Allowed resources:

  • Anything provided in the assignment package, all code provided by the Baseline Repository, and any libraries that the baseline repository depends on.
  • Pre-processing code such as sentencepiece (a minimal example is sketched after this list). Tools for language analysis (e.g. tokenization, parsing) such as spaCy or StanfordNLP that operate over one language only, including the use of pre-trained models that are provided with these packages.
  • Other textual data online. Some potential sources for data include Wikipedia dumps or the OPUS repository. However, you are not allowed to use data downloaded from the original African Language Translation repository where the class dataset was created, as some of the data may be included in the test set.
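
As a concrete example of allowed pre-processing, a sentencepiece subword model could be trained and applied roughly as follows. This is a hedged sketch: the training file path and vocabulary size are placeholders, not values prescribed by the assignment.

    import sentencepiece as spm

    # Train a BPE subword model on one side of your parallel data
    # (train.en-af.en is a placeholder path for data you assemble yourself).
    spm.SentencePieceTrainer.train(
        input="train.en-af.en",
        model_prefix="bpe_en",   # writes bpe_en.model and bpe_en.vocab
        vocab_size=8000,
        model_type="bpe",
    )

    # Apply the trained model to segment text into subword pieces.
    sp = spm.SentencePieceProcessor(model_file="bpe_en.model")
    print(sp.encode("This is an example sentence.", out_type=str))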


Disallowed resources (of course, you are allowed to implement the following yourself, but are not allowed to use pre-existing implementations):

  • Any pre-trained neural models or word embeddings that you did not implement yourself. You also may not use existing tools to pre-train word embeddings.
  • Any externally implemented tools for machine translation or for processing of parallel text (e.g. tools for sentence filtering).