Special Issue on Multi-Speaker, Multi-Microphone, and Multi-Modal Distant Speech Recognition
Submission Date: 2024-12-02
Automatic speech recognition (ASR) has made significant progress in the single-speaker scenario, owing to extensive training data, sophisticated deep learning architectures, and abundant computing resources. Building on this success, the research community is now tackling real-world multi-speaker speech recognition, where the number and nature of the sound sources are unknown and change over time. In this scenario, refining core multi-speaker speech processing technologies such as speech separation, speaker diarization, and robust speech recognition is essential, and the effective integration of these advances becomes increasingly important. In addition, emerging approaches, such as end-to-end neural networks, speech foundation models, and advanced training methods (e.g., semi-supervised, self-supervised, and unsupervised training) that incorporate multi-microphone and multi-modal information (such as video and accelerometer data), offer promising avenues for addressing these challenges. This special issue gathers recent advances in multi-speaker, multi-microphone, and multi-modal speech processing to advance real-world conversational speech recognition.
Guest editors:
Assoc. Prof. Shinji Watanabe (Executive Guest Editor)
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
Email: shinjiw@ieee.org
Areas of Expertise: Speech recognition, speech enhancement, and speaker diarization
Dr. Michael Mandel
Reality Labs, Meta, Menlo Park, California, United States of America
Email: mmandel@meta.com
Areas of Expertise: Source separation, noise robust ASR, electromyography
Dr. Marc Delcroix
NTT Corporation, Chiyoda-ku, Tokyo, Japan
Email: marc.delcroix@ieee.org; marc.delcroix@ntt.com
Areas of Expertise: Robust speech recognition, speech enhancement, source separation and extraction
Dr. Leibny Paola Garcia Perera
Johns Hopkins University, Baltimore, Maryland, United States of America
Email: lgarci27@jhu.edu
Areas of Expertise: Speech recognition, speech enhancement, speaker diarization, and multimodal speech processing
Dr. Katerina Zmolikova
Meta, Menlo Park, California, United States of America
Email: kzmolikova@meta.com
Areas of Expertise: Speech separation and extraction, speech enhancement, robust speech recognition
Dr. Samuele Cornell
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
Email: scornell@andrew.cmu.edu
Areas of Expertise: Robust speech recognition, speech separation and enhancement
Special issue information:
Relevant research topics include (but are not limited to):
Speaker identification and diarization
Speaker localization and beamforming
Single- or multi-microphone enhancement and source separation
Robust features and feature transforms
Robust acoustic and language modeling for distant or multi-talker ASR
Traditional or end-to-end robust speech recognition
Training schemes: data simulation and augmentation, semi-supervised, self-supervised, and unsupervised training for distant or multi-talker speech processing
Pre-training and fine-tuning of speech and audio foundation models and their application to distant and multi-talker speech processing
Robust speaker and language recognition
Robust paralinguistics
Cross-environment or cross-dataset performance analysis
Environmental background noise modeling
Multimodal speech processing
Systems, resources, and tools for distant speech recognition
In addition to traditional research papers, the special issue also welcomes descriptions of successful conversational speech recognition systems where the contribution lies more in the implementation than in the techniques themselves, as well as successful applications of conversational speech recognition systems. For example, the recently concluded seventh and eighth CHiME challenges serve as a focus for discussion in this special issue. These challenges considered the problem of conversational speech separation, speech recognition, and speaker diarization in everyday home environments from multi-microphone and multi-modal input. The seventh and eighth CHiME challenges comprise multiple tasks: 1) distant automatic speech recognition with multiple devices in diverse scenarios, 2) unsupervised domain adaptation for conversational speech enhancement, 3) distant diarization and ASR in natural conferencing environments, and 4) ASR for multimodal conversations in smart glasses. Papers reporting evaluation results on the CHiME-7/8 datasets or on other datasets dealing with real-world conversational speech recognition are equally welcome.
Manuscript submission information:
Tentative Dates:
Submission Open Date: August 19, 2024
Manuscript Submission Deadline: December 2, 2024
Editorial Acceptance Deadline: September 1, 2025
Contributed full papers must be submitted via the Computer Speech & Language online submission system (Editorial Manager®): https://www.editorialmanager.com/ycsla/default2.aspx. Please select the article type “VSI: Multi-DSR” when submitting your manuscript online.
Please refer to the Guide for Authors to prepare your manuscript: https://www.elsevier.com/journals/computer-speech-and-language/0885-2308/guide-for-authors
For further information, authors may contact the Guest Editors.
Keywords:
Speech recognition, speech enhancement/separation, speaker diarization, multi-speaker, multi-microphone, multi-modal, distant speech recognition, CHiME challenge