Sequence-to-sequence Speech Recognition (3/15/2022)

Lecture (by Shinji Watanabe):

  • Introduction to end-to-end speech recognition
  • HMM-based pipeline system
  • Connectionist temporal classification (CTC); a short loss-computation sketch follows this list
  • Attention-based encoder-decoder
  • Joint CTC/attention (Joint C/A)
  • RNN transducer (RNN-T)
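
Below is a minimal sketch of the CTC objective from the topic list above, using PyTorch's built-in torch.nn.CTCLoss. The tensor shapes, vocabulary size, and random stand-in "encoder output" are illustrative assumptions, not the lecture's actual model or configuration:

    import torch
    import torch.nn as nn

    T, N, C = 50, 4, 30  # input frames, batch size, vocab size (index 0 = blank)
    S = 12               # maximum target length

    # Stand-in for encoder output: per-frame log-probabilities over the vocab.
    logits = torch.randn(T, N, C, requires_grad=True)
    log_probs = logits.log_softmax(dim=-1)

    # Random integer targets in [1, C-1]; index 0 is reserved for the CTC blank.
    targets = torch.randint(1, C, (N, S), dtype=torch.long)
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

    # CTCLoss marginalizes over all valid frame-to-label alignments.
    ctc_loss = nn.CTCLoss(blank=0)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()  # gradients flow back to the stand-in encoder output

In the hybrid CTC/attention framework (Watanabe et al., 2017, in the references below), this CTC loss is interpolated with the attention decoder's cross-entropy loss, L = λ L_CTC + (1 − λ) L_att, so that CTC's monotonic alignment constrains the attention model during training.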

Language in 10: None

Slides: E2E ASR

Discussion: Please discuss the current status of assignment 3, picking one or two of the following items.

  • Which language did you choose, and why?
    • Please share details such as: How many hours of training data do you have? What script (writing system) is used? What text/audio pre-processing are you performing? etc.
  • What is your computing environment?
    • Are you using AWS, or your lab's computing resources?
    • OS, GPU type, cuDNN version, Python version, PyTorch version, etc.
  • Which stage did you finish?
    • What difficulties did you encounter, and what would be worth sharing with the others?
    • What issues are you currently facing?
  • What is your role on your team, if your teammates are also in this discussion group?
  • Any other issues, status updates, or tips that you want to report

References:

  • Graves, Alex, et al. "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd International Conference on Machine Learning. 2006.
  • Chorowski, Jan K., et al. "Attention-based models for speech recognition." Advances in Neural Information Processing Systems 28 (2015).
  • Watanabe, Shinji, et al. "Hybrid CTC/attention architecture for end-to-end speech recognition." IEEE Journal of Selected Topics in Signal Processing 11.8 (2017): 1240-1253.
  • Graves, Alex. "Sequence transduction with recurrent neural networks." ICML Representation Learning Workshop. 2012.
