Core Concepts
ZAEBUC-Spoken is a multilingual, multidialectal Arabic-English speech corpus collected through Zoom meetings; it offers a challenging data set for automatic speech recognition.
Abstract
ZAEBUC-Spoken is a multilingual, multidialectal Arabic-English speech corpus comprising twelve hours of Zoom meetings in which multiple speakers role-play a work situation.
The meetings use different language setups, covering several Arabic variants and English spoken with various accents.
The corpus is annotated for dialectness level and enriched with automatic morphological annotations.
Related work on Arabic speech corpora and code-switching is discussed.
The data collection process, transcription guidelines, and corpus statistics are detailed.
An analysis of Arabic-English and MSA-dialectal code-switching is provided, where MSA is Modern Standard Arabic.
Morphological annotation details and future work plans are outlined.
Stats
The corpus contains twelve hours of Zoom meetings.
For comparison, the MGB-2 Arabic challenge dataset comprises 1,200 hours of speech.
The corpus includes manual transcriptions and dialectness level annotations.
Quotes
"We make the corpus publicly available." - ZAEBUC-Spoken Project