Core Concepts
Phoneme-based models generalize across languages, supporting open-vocabulary keyword spotting and zero-shot forced alignment even in languages unseen during training.
Abstract
The authors present IPAPACK, a multilingual speech dataset containing over 1000 hours of speech across 115 languages with phonemic transcriptions. Building on it, they propose CLAP-IPA, a multilingual phoneme-speech contrastive embedding model, and IPA-ALIGNER, a neural forced aligner, both of which generalize to unseen languages.
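As a rough illustration of how a phoneme-speech contrastive model of this kind is trained, below is a minimal PyTorch sketch of a CLIP-style symmetric contrastive objective over paired phoneme and speech embeddings. The function name, dimensions, and temperature value are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch of a CLIP-style symmetric contrastive loss, as used by
# phoneme-speech embedding models like CLAP-IPA; details are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(phoneme_emb: torch.Tensor,
                     speech_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (phoneme sequence, utterance)
    embeddings; both inputs have shape (batch, dim)."""
    # Normalize so dot products are cosine similarities.
    p = F.normalize(phoneme_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = p @ s.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: phoneme->speech and speech->phoneme.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Training this objective pulls matched phoneme and speech embeddings together in a shared space, which is what later enables similarity-based keyword spotting and alignment.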
Key highlights:
CLAP-IPA can perform zero-shot open-vocabulary keyword spotting in any language, outperforming text-based models on unseen languages (see the similarity-scoring sketch after this list).
Alignments between phonemes and speech signals emerge from CLAP-IPA's contrastive training, enabling zero-shot forced alignment in unseen languages (see the alignment sketch after this list).
IPA-ALIGNER, a fine-tuned version of CLAP-IPA, achieves word and phone boundary detection competitive with HMM-based forced aligners, even on unseen languages.
Phoneme-based modeling enables better knowledge transfer across languages compared to text-based modeling, especially for low-resource languages.
The authors discuss the challenges and limitations in scaling up high-quality phonemic transcriptions for the world's languages.
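As referenced in the keyword-spotting highlight above, here is a minimal sketch of how zero-shot spotting can work with a dual-encoder model of this kind: embed IPA keyword strings and a speech clip with the two encoders and rank by cosine similarity. The `encode_phonemes` and `encode_speech` callables are hypothetical stand-ins for the model's two towers, not the released API.

```python
# Hypothetical keyword-spotting sketch: score a spoken clip against a set of
# IPA keyword strings by cosine similarity of their embeddings.
import torch
import torch.nn.functional as F

def spot_keyword(speech_clip: torch.Tensor,
                 keywords_ipa: list[str],
                 encode_speech, encode_phonemes) -> tuple[str, float]:
    """Return the best-matching IPA keyword and its cosine similarity.
    encode_speech / encode_phonemes are assumed to return (dim,) embeddings."""
    s = F.normalize(encode_speech(speech_clip), dim=-1)                # (dim,)
    k = F.normalize(torch.stack([encode_phonemes(w)
                                 for w in keywords_ipa]), dim=-1)      # (n, dim)
    sims = k @ s                                                       # (n,)
    best = int(sims.argmax())
    return keywords_ipa[best], float(sims[best])
```

Because the keywords are IPA strings rather than language-specific orthography, the same scoring loop applies unchanged to keywords from any language, including ones the model never saw in training.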
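For the forced-alignment highlight, the key idea is that a frame-by-phoneme similarity matrix falls out of the two encoders, and an alignment can be read off it with a monotonic dynamic program. The DTW-style decoder below is one standard way to do this, shown under the assumption of per-frame and per-phoneme embeddings; it is not necessarily the paper's exact decoding procedure.

```python
# Hypothetical forced-alignment sketch: find the monotonic frame->phoneme
# assignment that maximizes total cosine similarity (DTW-style DP).
import torch
import torch.nn.functional as F

def align(frame_emb: torch.Tensor, phone_emb: torch.Tensor) -> list[int]:
    """frame_emb: (T, dim), phone_emb: (P, dim).
    Returns a phoneme index for each frame; boundaries fall where the
    index advances."""
    sim = (F.normalize(frame_emb, dim=-1)
           @ F.normalize(phone_emb, dim=-1).t())         # (T, P)
    T, P = sim.shape
    neg_inf = float("-inf")
    dp = torch.full((T, P), neg_inf)
    back = torch.zeros((T, P), dtype=torch.long)         # 0 = stay, 1 = advance
    dp[0, 0] = sim[0, 0]
    for t in range(1, T):
        for p in range(P):
            stay = dp[t - 1, p]
            advance = dp[t - 1, p - 1] if p > 0 else neg_inf
            if advance > stay:
                dp[t, p] = advance + sim[t, p]
                back[t, p] = 1
            else:
                dp[t, p] = stay + sim[t, p]
    # Backtrace from the final frame, which must end on the last phoneme.
    path, p = [], P - 1
    for t in range(T - 1, -1, -1):
        path.append(p)
        p -= int(back[t, p])
    return path[::-1]
```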
Stats
The IPAPACK dataset contains over 1000 hours of speech data across 115 languages.
The VoxCommunis corpus contains 803.84 hours of speech data across 38 languages.
The FLEURS-IPA subset contains 779.54 hours of speech data across 77 languages.
The MSWC-IPA subset contains 613.44 hours of speech data across 36 languages.
The DORECO-IPA subset contains 18.99 hours of speech data across 44 languages.
Quotes
"Despite the seeming diversity, sounds of human speech are highly constrained by the anatomical structure of the human vocal tract, which is universally shared by all humans."
"Typological research has also shown that most, if not all, human speech can be represented by around 150 phonemes and diacritics."