Core Concepts
Accurate localization of sound sources within a virtual indoor environment by leveraging physically grounded sound propagation simulations and machine learning methods.
Abstract
The study aims to localize sound sources to specific locations within a virtual environment by combining physically grounded sound propagation simulations with machine learning methods. This approach addresses the scarcity of training data for localizing sound sources to their location of occurrence, particularly in post-event localization scenarios.
The key highlights and insights are:
The study utilizes the SoundSpaces 2.0 framework and the Habitat-Sim simulation engine to construct a pipeline for simulating the received sound in a virtual 3D room, generating enough data to train a machine learning model effectively.
Audio spectrograms, rather than the more commonly used direction-of-arrival (DoA) features, serve as the training data for the machine learning models.
Two machine learning approaches are explored: a convolutional neural network (CNN) model and an audio spectrogram transformer (AST) model pre-trained on AudioSet.
The AST model outperforms the CNN model, achieving an F1-score of 0.786 ± 0.014 in localizing the sound source to the specific room within the virtual environment.
The study discusses the limitations of the current work and future directions, including adapting to dynamic scenarios, handling sound mixtures, dereverberating real-world recordings, and building 3D virtual environments from real building blueprints.
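The simulation-to-spectrogram pipeline summarized above can be sketched as follows. This is a minimal NumPy illustration, not the study's actual SoundSpaces 2.0 / Habitat-Sim code: the synthetic exponentially decaying impulse response stands in for the room impulse response a physically grounded simulator would produce, and the dry signal stands in for the emitted sound event.

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=256):
    """Frame the signal, window each frame, and take the log-magnitude FFT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + 1e-8)  # log compression; small floor avoids log(0)

rng = np.random.default_rng(0)

# Dry source signal (stand-in for the emitted sound event).
dry = rng.standard_normal(16000)

# Synthetic decaying impulse response (stand-in for the simulated
# room impulse response from the sound-propagation engine).
t = np.arange(4000)
rir = rng.standard_normal(4000) * np.exp(-t / 800.0)

# Reverb-convolved audio: dry signal convolved with the room response.
wet = np.convolve(dry, rir)

spec = log_spectrogram(wet)
print(spec.shape)  # (time_frames, freq_bins) -> input image for the model
```

Each such spectrogram, labeled with the source's room (or coordinates), becomes one training example for the CNN or AST classifier.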
Stats
The indoor environment consists of 10 different rooms; the microphone was placed in only one of them, with minimal line of sight to the sound sources.
640 spectrograms of reverb-convolved audio were generated for room-based localization, and 512 spectrograms were generated for coordinate-based localization.
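For context on the reported metric, the F1-score for room-based localization is typically macro-averaged over the room labels. The sketch below shows that computation with made-up toy labels (the study's own predictions are not reproduced here, and it uses 10 rooms rather than 3):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return float(np.mean(scores))

# Toy example with 3 room labels (purely illustrative).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(macro_f1(y_true, y_pred, 3))
```

Macro averaging gives each room equal weight, which matters when the microphone's room is over-represented among easy examples.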
Quotes
"Accurate localization of sound sources within a virtual indoor environment by leveraging physically grounded sound propagation simulations and machine learning methods."
"The AST model outperforms the CNN model, achieving an F1-score of 0.786 ± 0.014 in localizing the sound source to the specific room within the virtual environment."