Assignment 3: Multilingual Speech Recognition

Due March 21, 11:59pm ET.

This is a group assignment, with teams of 2-3 people.

A notebook for the recitation is also available.

Overview

End-to-End (E2E) models have become the dominant approach for speech recognition and other speech processing tasks. The performance of E2E automatic speech recognition (E2E-ASR) has improved greatly over the last several years thanks to the success of deep learning. However, most studies on E2E-ASR have been carried out on languages with large amounts of data, such as English, Chinese, or Japanese. There is also significant demand for E2E-ASR models for low-resource languages. The goal of Assignment 3 is to guide you through the basic steps of building an E2E-ASR system, and to teach you how to build an E2E-ASR model, or improve the performance of an existing one, for a specific speech dataset.

Experiments

The base code for this experiment is provided through a Google Colab notebook that can be accessed here.

The first task is to run the whole notebook from start to finish.

If you plan to do further experimentation outside of Google Colab, we recommend saving a copy, for example on your GitHub, with File -> Save a copy in GitHub.

Running Another Non-English Recipe in ESPnet

The previous recipe was run on CommonVoice. One possible extension is to run an existing ESPnet recipe on another dataset.

A list of available recipes can be found here. Pick a recipe for a language other than English. Your goal is to modify the recipe and try to improve performance over the values reported in its README.md.
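A typical way to experiment is to change the training configuration and rerun the later stages. Below is a minimal sketch, assuming a standard egs2 recipe layout; the recipe path and config file name are placeholders, not part of any specific recipe.

```bash
# Hypothetical example: rerun the training/decoding stages of an existing
# egs2 recipe with a modified configuration (paths are placeholders).
cd espnet/egs2/commonvoice/asr1

# Put your modified hyperparameters in a new file under conf/tuning/,
# then point the recipe driver at it; run.sh forwards options to asr.sh.
./run.sh --stage 10 --stop_stage 13 \
    --asr_config conf/tuning/train_asr_my_changes.yaml
```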

NOTE: Please let us know which dataset you are tackling by March 14th on this sheet.

Train a Speech Recognition System on a New Dataset

Another possible extension is to train a model on a new dataset that isn't covered by ESPnet yet.

We have prepared some candidate corpora in this sheet. Please feel free to propose other speech datasets that you prefer to work on.

Some suggested steps include:

  • Add your group information to the sheet to claim the specific dataset you want to work on.
  • Following stage 1, implement a bash script (local/data.sh) that:
    • Downloads the dataset
    • Splits the dataset into train / dev / test sets
    • Performs text normalization, e.g. removing punctuation, unifying letter case, etc.
    • Prepares the data in Kaldi style (see the first sketch after this list)
  • Following stages 2-4, apply speed perturbation if necessary and dump the audio files.
  • Prepare the tokenization model as in stage 5.
  • Train the language model (LM) following stages 6-9.
  • Train the end-to-end speech recognition model (E2E-ASR) following stages 10-11.
  • Decode and score the dev and test sets with your system (LM + E2E-ASR) following stages 12-13; the second sketch after this list shows how these stages are typically invoked.
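As a first sketch, here is roughly what local/data.sh might look like. This is a hedged illustration: the download URL, corpus layout, and split logic are placeholders to adapt to your dataset; only the Kaldi-style outputs (text, wav.scp, utt2spk, spk2utt) are the fixed requirement.

```bash
#!/usr/bin/env bash
# Sketch of local/data.sh; the URL and corpus layout are placeholders.
set -euo pipefail

data_url=https://example.com/my_corpus.tar.gz  # hypothetical URL
mkdir -p downloads
wget -O downloads/corpus.tar.gz "${data_url}"
tar -xzf downloads/corpus.tar.gz -C downloads

for split in train dev test; do
    mkdir -p data/${split}
    # Populate the Kaldi-style files from the corpus metadata, applying
    # text normalization (lowercasing, punctuation removal) on the way:
    #   data/${split}/text    : <utt-id> <normalized transcript>
    #   data/${split}/wav.scp : <utt-id> <path to audio>
    #   data/${split}/utt2spk : <utt-id> <speaker-id>
    # ... dataset-specific parsing goes here ...
    utils/utt2spk_to_spk2utt.pl data/${split}/utt2spk > data/${split}/spk2utt
    utils/validate_data_dir.sh --no-feats data/${split}
done
```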

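The second sketch shows how stages 2-13 are typically driven through asr.sh (usually from the recipe's run.sh). The set names and config paths below are placeholders; consult asr.sh itself for the full option list.

```bash
# Hypothetical invocation driving stages 2-13 of the ESPnet2 pipeline:
# stage 2 uses --speed_perturb_factors, stage 5 the tokenization options,
# stages 6-9 the LM config, 10-11 the ASR config, 12-13 the decoding config.
./asr.sh \
    --stage 2 \
    --stop_stage 13 \
    --train_set train \
    --valid_set dev \
    --test_sets "dev test" \
    --speed_perturb_factors "0.9 1.0 1.1" \
    --token_type bpe \
    --nbpe 500 \
    --use_lm true \
    --lm_config conf/train_lm.yaml \
    --asr_config conf/train_asr.yaml \
    --inference_config conf/decode_asr.yaml
```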
NOTE: Please let us know which dataset you are tackling by March 14th on this sheet.

Self-Supervised Learning

The original recipes train the model on spectral features. One possible way to improve performance further is to use self-supervised learning representations (SSLR) as speech features.

We recommend using HuBERT or wav2vec 2.0 and their variants, e.g. XLSR or WavLM. A tutorial on how to use these features can be found here; a rough sketch of the required config change is given below.
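In ESPnet2, SSL features are typically enabled through the S3PRL frontend in the ASR training config. The following is a minimal sketch, assuming the s3prl frontend is available in your installation; the upstream name, normalization choice, and other values are illustrative, not prescribed.

```bash
# Hypothetical config fragment: switch the frontend to an S3PRL upstream.
cat > conf/tuning/train_asr_ssl.yaml <<'EOF'
# ... keep your encoder/decoder/optimization settings, and add:
freeze_param: ["frontend.upstream"]
frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: wavlm_large   # or hubert_large_ll60k, wav2vec2_xlsr, ...
    download_dir: ./hub
    multilayer_feature: true
EOF

# Retrain and re-evaluate with the SSL config:
./asr.sh --stage 10 --stop_stage 13 \
    --feats_normalize utterance_mvn \
    --asr_config conf/tuning/train_asr_ssl.yaml \
    --train_set train --valid_set dev --test_sets "dev test"
```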

Submit a Pull Request to ESPnet

Finally, if you are able to train a system that improves on the performance of an existing one, try to submit a pull request (PR) to ESPnet! This helps other researchers use your system in their research.

First, you must follow this tutorial on making a recipe to clean up and fix your scripts. Then you should follow the tutorial for submitting a PR. You are required to follow the principles of ESPnet recipes and to pass all the CI tests. Some things to consider (a recipe-setup sketch follows this list):

  • If you choose to work on a new dataset, you'll have to write a data preparation script for the dataset, including downloading the data and generating the train / dev / test sets in Kaldi style. You can refer to local/data.sh in other recipes for more details.
  • Prepare the configuration file in conf/tuning.
  • Add your results and misc. information in the README.md in the recipe directory.
  • Upload your model to HuggingFace and link it in the README.md.
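If you are adding a brand-new recipe, ESPnet provides a TEMPLATE directory for bootstrapping the standard layout; to the best of our knowledge the setup script is invoked roughly as below (the dataset name is a placeholder).

```bash
# Create the skeleton for a new ESPnet2 recipe from the template.
cd espnet
egs2/TEMPLATE/asr1/setup.sh egs2/my_dataset/asr1
# Then add local/data.sh, conf/, and run.sh inside egs2/my_dataset/asr1.
```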

Grading

Write and submit a report (max 3 pages) describing and analyzing your results. The report should be written in the ACL style. Grades are assigned based on the following milestones:

  1. Run the notebook.
  2. Experiment with a different dataset: you can choose either of the following options.
    • Reproducing an existing recipe on another non-English dataset: modify it and try to improve the performance in terms of Word Error Rate (WER), documenting the changes you made. If you improve performance by more than 5% WERR, this will earn you a B+ grade; otherwise, a B grade.
    • Training an ASR system on a new speech dataset: as an alternative to improving an existing recipe, you can train a new ASR system on a new dataset. This will also earn you a B+ grade.
    • Note: we usually use word error rate reduction (WERR) to measure improvement: WERR = (old_WER - new_WER) / old_WER. For example, reducing WER from 20% to 18.5% gives WERR = (20 - 18.5) / 20 = 7.5%. Generally speaking, 5% is the threshold; however, if the old_WER is very high, the expected WERR is higher as well.
  3. Add self-supervised learning: try to improve the WER of the supervised system you trained (either on the existing dataset or on a new one) using representations learned through self-supervised learning. This will earn you an A if you obtain clear improvements (5+% WERR); otherwise, an A-.
  4. Submit a pull request to ESPnet: this will earn you an A+ grade.

If using existing code, please cite your sources.

Submission

Your submission consists of three parts: code, system performance in RESULTS.md, and a writeup. Put all your code in a folder named code, with instructions on how to run it if you have implemented additional code. Include the output of your models in an outputs directory, with a description of which model each file is associated with. Rename the writeup writeup.pdf and compress all of them as assign3.tar.gz (see the example below). This file must be submitted to Canvas (link on the course website).
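For example, assuming the files are laid out as described above, the archive can be created with:

```bash
# Package the submission: code, model outputs, results, and writeup.
tar -czvf assign3.tar.gz code/ outputs/ RESULTS.md writeup.pdf
```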