A MultiLingual MultiTask (MLMT) model that integrates multilingual speech generation and recognition into a single large language model, together with a data construction strategy that synthesizes code-switched training data without relying on existing high-quality code-switched corpora.
Phoneme-based models can achieve strong cross-linguistic generalization to unseen languages for open-vocabulary keyword spotting and zero-shot forced alignment.