Core Concepts
This work provides a concise comparative analysis of Fourier and Wavelet transforms, the most commonly used signal decomposition methods for audio processing tasks. It also discusses metrics for evaluating speech intelligibility, with the goal of guiding machine learning engineers in choosing and fine-tuning a decomposition method for a specific model.
Abstract
The work opens by noting the widespread use of automated voice assistants and the resulting demand for applications that process audio signals, particularly the human voice. It argues that, although end-to-end machine learning models exist, properly pre-processing the signal can greatly reduce the complexity of the task and allow it to be solved with a simpler model and fewer computational resources.
The main part of the work describes two commonly used signal decomposition methods: the Short-Time Fourier Transform (STFT) and the Wavelet Transform (WT). For the STFT, the author discusses window functions, their properties, and their applications. For the WT, the author covers the Continuous Wavelet Transform (CWT), the Discrete Wavelet Transform (DWT), wavelet families, filters and filter banks, and the relationship between the DWT and filter banks.
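To make the link between the DWT and filter banks concrete, here is a minimal sketch of one decomposition level using the Haar wavelet. The Haar wavelet, the function names haar_dwt and haar_idwt, and the sample values are illustrative assumptions and are not taken from the work itself. Each branch is a two-tap filter followed by downsampling by 2, and the synthesis step recovers the input exactly.

```python
import math

# Assumption: Haar wavelet, chosen only because its filters have two taps.
SCALE = 1.0 / math.sqrt(2.0)

def haar_dwt(signal):
    """One DWT level viewed as a two-channel analysis filter bank."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    # Low-pass branch: scaled sum of neighbours, kept once per pair
    # (filtering followed by downsampling by 2).
    approx = [SCALE * (a + b) for a, b in pairs]
    # High-pass branch: scaled difference of neighbours, kept once per pair.
    detail = [SCALE * (a - b) for a, b in pairs]
    return approx, detail

def haar_idwt(approx, detail):
    """Synthesis filter bank: interleave the recombined branches."""
    out = []
    for a, d in zip(approx, detail):
        out.append(SCALE * (a + d))
        out.append(SCALE * (a - d))
    return out

# Hypothetical toy signal, for illustration only.
x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = haar_dwt(x)
x_rec = haar_idwt(approx, detail)

energy_in = sum(v * v for v in x)
energy_out = sum(v * v for v in approx) + sum(v * v for v in detail)
print("approximation:", approx)
print("detail:", detail)
print("energy preserved:", math.isclose(energy_in, energy_out))
```

Applying haar_dwt again to the approximation coefficients would give the next, coarser level. Repeating that on the low-pass branch only is the usual multi-level DWT structure.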
The author also discusses metrics for evaluating speech intelligibility, such as Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI).
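Of the three metrics, SI-SDR is simple enough to compute directly. The sketch below follows the commonly cited definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). Whether the author uses exactly this formulation is an assumption, as are the function name si_sdr and the synthetic test signals. PESQ and STOI are considerably more involved and are normally taken from existing implementations rather than written by hand.

```python
import math

def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB; both inputs are assumed to be zero-mean."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference, which makes the metric scale-invariant.
    alpha = dot / ref_energy
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10(target_energy / residual_energy)

# Hypothetical signals: a sinusoid plus an orthogonal sinusoidal distortion.
N = 160
ref = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
dist = [math.cos(2 * math.pi * 13 * n / N) for n in range(N)]
est_mild = [r + 0.1 * d for r, d in zip(ref, dist)]
est_heavy = [r + 0.5 * d for r, d in zip(ref, dist)]

score_mild = si_sdr(ref, est_mild)
score_heavy = si_sdr(ref, est_heavy)
score_rescaled = si_sdr(ref, [3.0 * e for e in est_mild])
print(f"mild distortion:  {score_mild:.2f} dB")
print(f"heavy distortion: {score_heavy:.2f} dB")
print(f"mild, rescaled:   {score_rescaled:.2f} dB")
```

Rescaling the estimate leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR.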
The work concludes with a case study on the speaker isolation problem, where the author demonstrates the practical application of the previously discussed theoretical concepts.
Stats
The sampling frequency of the voice recording is 16,000 Hz.
The voice recording contains 959,669 samples.
The spectrogram was generated using a Hann window of length 32 ms (512 samples) with a 16 ms (256-sample) overlap.
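The figures above determine the shape of the spectrogram. The following sketch works through the arithmetic. The constants come from the stats listed here, while the no-padding framing rule, the periodic form of the Hann window, and all variable names are assumptions made for illustration.

```python
import math

FS = 16_000          # sampling frequency in Hz (from the stats)
N_SAMPLES = 959_669  # length of the recording in samples (from the stats)
WIN_MS = 32          # window length in ms
OVERLAP_MS = 16      # overlap in ms

win_len = FS * WIN_MS // 1000       # 512 samples
overlap = FS * OVERLAP_MS // 1000   # 256 samples
hop = win_len - overlap             # step between consecutive frames

duration_s = N_SAMPLES / FS
# Assumption: no padding, so only frames that fit entirely are counted.
n_frames = 1 + (N_SAMPLES - win_len) // hop
n_bins = win_len // 2 + 1           # one-sided spectrum of a real signal
freq_res_hz = FS / win_len          # spacing between frequency bins

def hann(length):
    """Periodic Hann window (assumed variant)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / length) for k in range(length)]

window = hann(win_len)
# With a 50% overlap, shifted periodic Hann windows add up to a constant.
overlap_sum = [window[k] + window[k + hop] for k in range(hop)]

print(f"duration: {duration_s:.2f} s")
print(f"window: {win_len} samples, hop: {hop} samples")
print(f"spectrogram shape: {n_bins} bins x {n_frames} frames")
print(f"frequency resolution: {freq_res_hz} Hz")
```

A library STFT that pads the signal at the edges would report a slightly different frame count, so the number of frames here should be read as a lower bound under the stated framing rule.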
Quotes
"Asking Siri to turn on the lamps or to tell you which song is currently playing on the radio is no longer the matter of science fiction but a daily routine."
"Even though end-to-end models exist, properly pre-processing the signal can greatly reduce the complexity of the task and allow it to be solved with a simpler ML model and fewer computational resources."
"Wavelet Transform (WT) remains seldomly used in ML models and Wavelet Packet Transform (WPT) even more so. The reason for this might be the fact that understanding these transformations requires a solid mathematical background."