The article investigates the instabilities in convolutional neural networks (convnets) when used for processing raw audio signals. The key insights are:
The variance of the energy response ∥Φx∥² of a randomly initialized convnet filterbank Φ depends on the autocorrelation of the input signal x. Highly autocorrelated signals such as speech and music are more prone to large deviations in the energy response than less autocorrelated signals.
The authors derive explicit formulas for the expected value and variance of the energy response (∥Φx∥²) and provide upper bounds on the probability of large deviations using Cantelli's and Chernoff's inequalities.
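The dependence on autocorrelation can be probed with a small Monte Carlo experiment. The sketch below is not the authors' code: it assumes circular convolution with stride 1, i.i.d. Gaussian filter taps with variance 1/(JT) for J filters of length T (chosen so that the expected energy response equals ∥x∥²), and it uses white noise and a pure sinusoid as stand-ins for weakly and strongly autocorrelated inputs. The names `energy_response` and `cantelli_bound` are hypothetical helpers. Cantelli's inequality is applied in its standard one-sided form, P(X − μ ≥ λ) ≤ σ²/(σ² + λ²), with the sample mean and variance plugged in.

```python
import numpy as np

def energy_response(x, num_filters, filter_length, rng):
    """Return ||Phi x||^2 for one random draw of a Gaussian FIR filterbank Phi."""
    n = len(x)
    # Assumed normalization: i.i.d. taps with variance 1/(J*T), so that
    # the expected energy response equals ||x||^2.
    sigma = 1.0 / np.sqrt(num_filters * filter_length)
    w = rng.normal(0.0, sigma, size=(num_filters, filter_length))
    # Circular convolution of x with every filter, computed via the FFT
    # (filters are zero-padded to the signal length).
    W = np.fft.fft(w, n=n, axis=1)
    y = np.fft.ifft(W * np.fft.fft(x), axis=1)
    return float(np.sum(np.abs(y) ** 2))

def cantelli_bound(variance, lam):
    """One-sided Cantelli bound on P(X - mean >= lam) for lam > 0."""
    return variance / (variance + lam ** 2)

rng = np.random.default_rng(0)
N, J, T, trials, lam = 1024, 8, 32, 500, 0.2

# Two unit-norm test signals: white noise (weak autocorrelation) and a
# sinusoid with 5 cycles over the signal length (strong autocorrelation).
noise = rng.normal(size=N)
noise /= np.linalg.norm(noise)
sine = np.cos(2 * np.pi * 5 * np.arange(N) / N)
sine /= np.linalg.norm(sine)

results = {}
for name, x in [("noise", noise), ("sine", sine)]:
    energies = np.array([energy_response(x, J, T, rng) for _ in range(trials)])
    mean, var = energies.mean(), energies.var()
    # Empirical frequency of a deviation of at least lam above the sample mean.
    tail = float(np.mean(energies - mean >= lam))
    results[name] = {"mean": mean, "var": var, "tail": tail,
                     "bound": cantelli_bound(var, lam)}
    print(f"{name}: mean={mean:.3f} var={var:.4f} "
          f"tail={tail:.3f} cantelli={results[name]['bound']:.3f}")
```

Changing `T`, `J`, or the sinusoid frequency gives a quick feel for how the spread of ∥Φx∥² reacts to filter length and to the input's autocorrelation.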
The authors analyze the extreme value statistics of the optimal frame bounds (A and B) of the random filterbank Φ. They show that the expected values of A and B are related to the order statistics of chi-squared random variables, and provide bounds on their variances.
An asymptotic analysis reveals that the condition number κ = B/A of Φ follows a logarithmic scaling law between the number and length of the filters, reminiscent of discrete wavelet bases. This suggests that convnets are most stable when using many short filters, rather than few long filters.
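The frame bounds and condition number can be estimated numerically along the following lines. This is a minimal sketch under assumptions that the summary does not spell out: circular convolution with stride 1, in which case the frame operator Φ*Φ is diagonalized by the discrete Fourier transform and the optimal bounds A and B are the minimum and maximum over frequency bins of the summed squared magnitude responses Σⱼ |ŵⱼ[k]|². The helper names `frame_bounds` and `mean_condition_number`, the tap variance 1/(JT), and the chosen sizes are illustrative only.

```python
import numpy as np

def frame_bounds(w, signal_length):
    """Optimal frame bounds (A, B) of a stride-1 circular filterbank.

    w has shape (num_filters, filter_length). Each frequency bin of the
    summed squared magnitude response is a sum of squared Gaussians when
    the taps are Gaussian.
    """
    W = np.fft.fft(w, n=signal_length, axis=1)
    response = np.sum(np.abs(W) ** 2, axis=0)
    return float(response.min()), float(response.max())

def mean_condition_number(num_filters, filter_length, signal_length, trials, rng):
    """Average kappa = B / A over independent Gaussian initializations."""
    sigma = 1.0 / np.sqrt(num_filters * filter_length)  # assumed normalization
    kappas = []
    for _ in range(trials):
        w = rng.normal(0.0, sigma, size=(num_filters, filter_length))
        A, B = frame_bounds(w, signal_length)
        kappas.append(B / A)
    return float(np.mean(kappas))

rng = np.random.default_rng(0)
# Sweep a few (J, T) pairs to compare many short filters with few long ones.
for J, T in [(32, 8), (16, 16), (8, 32), (4, 64)]:
    kappa = mean_condition_number(J, T, signal_length=512, trials=50, rng=rng)
    print(f"J={J:3d} T={T:3d} mean kappa={kappa:.2f}")
```

Sweeping J and T over a wider grid is one way to compare simulated condition numbers against the scaling law described above. With a stride larger than 1, the frame operator is no longer diagonal in the Fourier basis and this shortcut would not apply.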
The theoretical findings are supported by extensive numerical simulations, which demonstrate the instabilities of convnets with random initialization, especially for highly autocorrelated audio signals.
The insights from this work can guide the design of more stable and robust convolutional architectures for raw audio processing tasks.
Source: arxiv.org