Sequence-to-sequence Speech Recognition (3/15/2022)

Lecture (by Shinji Watanabe):

  • Introduction to end-to-end speech recognition
  • HMM-based pipeline system
  • Connectionist temporal classification (CTC); a short loss-computation sketch follows this list
  • Attention-based encoder-decoder
  • Joint CTC/attention (Joint C/A)
  • RNN transducer (RNN-T)
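
Below is a minimal sketch of the CTC objective from the topic list above, using PyTorch's built-in torch.nn.CTCLoss. The tensor shapes, vocabulary size, and random stand-in "encoder output" are illustrative assumptions, not the lecture's actual model or configuration:

    import torch
    import torch.nn as nn

    T, N, C = 50, 4, 30  # input frames, batch size, vocab size (index 0 = blank)
    S = 12               # maximum target length

    # Stand-in for encoder output: per-frame log-probabilities over the vocab.
    logits = torch.randn(T, N, C, requires_grad=True)
    log_probs = logits.log_softmax(dim=-1)

    # Random integer targets in [1, C-1]; index 0 is reserved for the CTC blank.
    targets = torch.randint(1, C, (N, S), dtype=torch.long)
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

    # CTCLoss marginalizes over all valid frame-to-label alignments.
    ctc_loss = nn.CTCLoss(blank=0)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()  # gradients flow back to the stand-in encoder output

In the hybrid CTC/attention framework (Watanabe et al., 2017, in the references below), this CTC loss is interpolated with the attention decoder's cross-entropy loss, L = λ L_CTC + (1 − λ) L_att, so that CTC's monotonic alignment constrains the attention model during training.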

Language in 10: None

Slides: E2E ASR

Discussion: Please discuss the current status of assignment 3, picking one or two of the following items.

  • Which language did you choose, and why?
    • Please share details such as: How many hours of training data do you have? What script (writing system) is used? What text/audio pre-processing are you performing? etc.
  • What is your computing environment?
    • Are you using AWS, or your lab's computing resources?
    • OS, GPU type, cuDNN version, Python version, PyTorch version, etc.
  • Which stage did you finish?
    • What difficulties did you encounter, and what would be worth sharing with the others?
    • What issues are you currently facing?
  • What is your role on your team, if your teammates are also in this discussion group?
  • Any other issues, status updates, or tips that you want to report

References:

  • Graves, Alex, et al. "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd International Conference on Machine Learning. 2006.
  • Chorowski, Jan K., et al. "Attention-based models for speech recognition." Advances in Neural Information Processing Systems 28 (2015).
  • Watanabe, Shinji, et al. "Hybrid CTC/attention architecture for end-to-end speech recognition." IEEE Journal of Selected Topics in Signal Processing 11.8 (2017): 1240-1253.
  • Graves, Alex. "Sequence transduction with recurrent neural networks." ICML Representation Learning Workshop. 2012.
