Hyderabad, India  /  April 6, 2025  -  April 11, 2025

ICASSP 2025

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing

The 50th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025) will take place from April 6 to 11, 2025, in Hyderabad, India. Fraunhofer IDMT will present current research results in the areas of automatic music analysis, federated learning, acoustics, and hearing aids/hearables.

No Data Required: Zero-Shot Domain Adaptation for Automatic Music Transcription

Andrew McLeod

Automatic music transcription (AMT) takes a music recording and outputs a transcription of the underlying music. Deep learning models trained for AMT rely on large amounts of annotated training data, which are available only for some domains such as Western classical piano music. Using pre-trained models on out-of-domain inputs can lead to significantly lower performance. Fine-tuning or retraining on new target domains is expensive and relies on the presence of labeled data. In this work, we propose a method for taking a pre-trained transcription model and improving its performance on out-of-domain data without the need for any training data, requiring no fine-tuning or retraining of the original model. Our method uses the model to transcribe pitch-shifted versions of an input, aggregating the output across these versions where the original model is unsure. We take a model originally trained for piano transcription and present experiments under two domain shift scenarios: recording condition mismatch (piano with different recording setups) and instrument mismatch (guitar and choral data). We show that our method consistently improves note- and frame-based performance.
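
The following minimal sketch illustrates the idea behind the method, assuming a pre-trained model exposed as a function that returns a frame-by-pitch activation matrix; the shift range, aggregation rule, and confidence threshold are illustrative and not the paper's exact procedure.

```python
# Hypothetical sketch: aggregate transcriptions of pitch-shifted inputs.
# `transcribe` stands in for any pre-trained AMT model returning a
# (frames x 128) activation matrix; names and thresholds are illustrative.
import numpy as np
import librosa


def zero_shot_aggregate(audio, sr, transcribe, shifts=range(-2, 3), conf=0.8):
    """Average pitch activations over pitch-shifted copies of the input."""
    rolls = []
    for n in shifts:
        shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=n)
        act = transcribe(shifted)               # (frames, 128) activations
        # Undo the shift on the pitch axis so all versions are aligned
        # (wrap-around at the pitch-range edges is ignored in this sketch).
        rolls.append(np.roll(act, -n, axis=1))
    mean_act = np.mean(rolls, axis=0)

    base = transcribe(audio)
    # Keep the original prediction where the model is confident,
    # fall back to the aggregate where it is unsure.
    unsure = np.abs(base - 0.5) < (conf - 0.5)
    return np.where(unsure, mean_act, base)
```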

The paper will be presented by Andrew McLeod.

Federated Semi-supervised Learning for Industrial Sound Analysis and Keyword Spotting

Sascha Grollmisch, Thomas Köllmer, Artem Yaroshuck, Hanna Lukashevich

Obtaining and annotating representative training data for deep learning-based classifiers can be both expensive and impractical in domains such as Industrial Sound Analysis (ISA) and Keyword Spotting (KWS). Furthermore, conventional techniques often rely on centralized servers to store training datasets, raising concerns about data security. We introduce a method that combines Semi-supervised and Federated Learning (FSSL) for classifying audio using Federated Averaging and FixMatch. Our findings indicate that the model's accuracy decreases by 30 to 50 percentage points when the labeled data per client is reduced to just 1% of its original volume using standard supervised federated learning. However, our proposed FSSL method improves the accuracy by more than 25 percentage points and reaches nearly perfect accuracy for the ISA dataset, thus making efficient use of unlabeled data. Furthermore, this FSSL approach proves effective even when data distribution is uneven and clients only label subsets of all target classes.
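
As a rough illustration of the two building blocks named above, the sketch below combines Federated Averaging of client weights with a FixMatch-style loss on unlabeled clips; the model, augmentations, and confidence threshold are placeholder assumptions, not the configuration used in the paper.

```python
# Illustrative sketch of Federated Averaging plus a FixMatch-style loss
# for unlabeled audio clips. Model and augmentation details are placeholders.
import torch
import torch.nn.functional as F


def federated_average(client_states, client_sizes):
    """Weight-average client state dicts by the number of local samples."""
    total = sum(client_sizes)
    avg = {k: torch.zeros_like(v, dtype=torch.float32)
           for k, v in client_states[0].items()}
    for state, n in zip(client_states, client_sizes):
        for k, v in state.items():
            avg[k] += v.float() * (n / total)
    return avg


def fixmatch_loss(model, weak_batch, strong_batch, threshold=0.95):
    """Pseudo-label weakly augmented clips, train on strongly augmented ones."""
    with torch.no_grad():
        probs = F.softmax(model(weak_batch), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()     # keep only confident pseudo-labels
    loss = F.cross_entropy(model(strong_batch), pseudo, reduction="none")
    return (loss * mask).mean()
```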

The paper will be presented by Thomas Köllmer at the FLute: Workshop on Federated Learning for Audio on April 7, 2025.

How Machines Perceive Rooms - Regions of Relevance in Room Impulse Responses

Prachi Sarma (TU Ilmenau), Christian Kehling (TU Ilmenau/Fraunhofer IDMT)

The recent surge in Extended Reality (XR) applications has sparked interest in acoustic research, particularly for enhancing immersive experiences. Room Impulse Responses (RIRs), which capture the acoustic characteristics of spaces, play a crucial role in creating plausible XR environments. Neural networks leverage RIRs for tasks such as room identification and acoustic parameter estimation. However, the decision-making process of neural networks using RIRs remains opaque. To address this, this publication applies eXplainable Artificial Intelligence (XAI) techniques, such as Layer-Wise Relevance Propagation (LRP), Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA), to identify the most relevant regions within RIRs. The techniques are applied to three independent task-dataset compositions in the area of room acoustic estimation. The findings reveal that the initial period of an RIR up to 122 ms, containing direct sound (DS) and early reflections (ER), most significantly influences network decisions.
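
As a simplified illustration of one of the named techniques, the sketch below applies PCA to a set of equally long RIRs and measures how much of the loading energy falls within the first 122 ms; the dataset, sample rate, number of components, and variance weighting are assumptions rather than the paper's analysis pipeline.

```python
# Minimal sketch: use PCA loadings as a per-sample relevance curve over
# a set of room impulse responses, then check the share of relevance
# inside the first `cutoff_ms` milliseconds.
import numpy as np
from sklearn.decomposition import PCA


def pca_time_relevance(rirs, sr, n_components=10, cutoff_ms=122):
    """rirs: array of shape (n_rirs, n_samples), all trimmed to equal length."""
    pca = PCA(n_components=n_components).fit(rirs)
    # Aggregate squared loadings over components, weighted by explained variance.
    relevance = (pca.explained_variance_ratio_[:, None] * pca.components_ ** 2).sum(axis=0)
    cutoff = int(sr * cutoff_ms / 1000)
    early_share = relevance[:cutoff].sum() / relevance.sum()
    return relevance, early_share
```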

The paper will be presented by Prachi Sarma (TU Ilmenau).

Low-Complexity Own Voice Reconstruction for Hearables with an In-Ear Microphone

Mattes Ohlenbusch, Christian Rollwage, Simon Doclo

Hearable devices, equipped with one or more microphones, are commonly used for speech communication. Here, we consider the scenario where a hearable is used to capture the user's own voice in a noisy environment. In this scenario, own voice reconstruction (OVR) is essential for enhancing the quality and intelligibility of the recorded noisy own voice signals. In previous work, we developed a deep learning-based OVR system, aiming to reduce the amount of device-specific recordings for training by using data augmentation with phoneme-dependent models of own voice transfer characteristics. Given the limited computational resources available on hearables, in this paper we propose low-complexity variants of an OVR system based on the FT-JNF architecture and investigate the required amount of device-specific recordings for effective data augmentation and fine-tuning. Simulation results show that the proposed OVR system considerably improves speech quality, even under constraints of low complexity and a limited amount of device-specific recordings.
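
The sketch below outlines an FT-JNF-style mask estimator with deliberately small hidden sizes, to give a rough sense of what a low-complexity variant can look like; the layer sizes, single-channel input, and complex-mask head are assumptions and not the configuration evaluated in the paper.

```python
# Rough, illustrative sketch of an FT-JNF-style mask estimator with small
# hidden sizes to suggest a low-complexity variant; sizes are assumptions.
import torch
import torch.nn as nn


class TinyFTJNF(nn.Module):
    """LSTM across frequency, then across time, predicting a complex mask."""

    def __init__(self, hidden=32):
        super().__init__()
        self.f_lstm = nn.LSTM(2, hidden, batch_first=True, bidirectional=True)
        self.t_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)   # real and imaginary mask parts

    def forward(self, spec):              # spec: (batch, time, freq, 2)
        b, t, f, _ = spec.shape
        x = spec.reshape(b * t, f, 2)     # run the first LSTM along frequency
        x, _ = self.f_lstm(x)
        x = x.reshape(b, t, f, -1).permute(0, 2, 1, 3).reshape(b * f, t, -1)
        x, _ = self.t_lstm(x)             # then along time (causal direction)
        mask = self.out(x).reshape(b, f, t, 2).permute(0, 2, 1, 3)
        return mask                       # same shape as the input spectrogram
```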

The paper will be presented by Mattes Ohlenbusch on April 9, 2025 at 12 pm in room MR1.03.