Sequence-to-sequence Speech Recognition (3/15/2022)
Lecture: (by Shinji Watanabe)
- Introduction to end-to-end speech recognition
- HMM-based pipeline system
- Connectionist temporal classification (CTC)
- Attention-based encoder-decoder
- Joint CTC/attention (Joint C/A)
- RNN transducer (RNN-T)
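Since CTC (Graves et al., 2006) is one of the lecture topics, a minimal sketch of its forward (alpha) recursion in plain Python may be useful as a preview. This is an illustrative toy implementation, not part of the course materials: it computes the total probability of a target sequence by summing over all frame-level alignments, using the standard extended label sequence with blanks interleaved.

```python
def ctc_forward(probs, target, blank=0):
    """Probability of `target` under CTC, summed over all alignments.

    probs:  list of per-frame probability distributions over symbols
            (each a list indexed by symbol id, including the blank).
    target: list of symbol ids, without blanks.
    """
    # Extended label sequence: blank between and around every label,
    # e.g. target [a, b] -> [_, a, _, b, _]
    ext = [blank]
    for s in target:
        ext += [s, blank]
    S, T = len(ext), len(probs)

    # alpha[t][s]: probability of emitting ext[:s+1] using frames 0..t
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                 # stay on the same state
            if s > 0:
                a += alpha[t - 1][s - 1]        # advance by one state
            # Skip over a blank, unless the labels before and after
            # the blank are identical (repeats need a blank between them)
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]

    # Valid alignments end on the final label or the final blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
```

For example, with two frames, a uniform distribution over {blank, a}, and target "a", the valid alignments are "aa", "a_", and "_a", each with probability 0.25, so the CTC probability is 0.75. Real systems work in log space for numerical stability (e.g., PyTorch's `torch.nn.CTCLoss`), but the recursion is the same.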
Language in 10: None
Slides: E2E ASR
Discussion: Please discuss the current status of Assignment 3. Pick one or two items from the list below.
- Which language did you choose, and why?
- How many hours of training data do you have? What scripts are used? What text/audio pre-processing are you performing? etc.
- What is your computing environment?
- Using AWS? Your lab's computing resources?
- OS, GPU types, cuDNN version, Python version, PyTorch version, etc.
- Which stage did you finish?
- What difficulties did you encounter, and what would be useful to share with the others?
- What issues are you currently facing?
- What is your role in your team, if your teammates are also in this discussion group?
- Any other issues, status updates, or tips you want to report
References:
- Graves, Alex, et al. "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd International Conference on Machine Learning, 2006.
- Chorowski, Jan K., et al. "Attention-based models for speech recognition." Advances in Neural Information Processing Systems 28, 2015.
- Watanabe, Shinji, et al. "Hybrid CTC/attention architecture for end-to-end speech recognition." IEEE Journal of Selected Topics in Signal Processing 11.8 (2017): 1240-1253.
- Graves, Alex. "Sequence transduction with recurrent neural networks." ICML Representation Learning Workshop, 2012.