Core Concepts
This paper introduces ConversaSynth, a framework that uses large language models (LLMs) to script multi-speaker conversations and text-to-speech (TTS) systems to render them as audio, producing realistic and diverse synthetic conversation datasets for training and evaluating AI models on audio processing tasks.
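A minimal sketch of the two-stage pipeline implied by the stats below (LLM-scripted text, then a format check, then TTS rendering). The helper names generate_conversation, parse_conversation, and synthesize are illustrative assumptions with placeholder bodies, not the paper's actual models or API.

```python
# Sketch of a ConversaSynth-style pipeline: LLM scripts a conversation,
# the script is validated, and each turn is rendered with a TTS voice.
import re
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    text: str


def generate_conversation(prompt: str) -> str:
    """Placeholder for an LLM call that returns a scripted conversation
    (in practice, a chat model prompted with speaker count and topic)."""
    return "Speaker 1: Hi, how are you?\nSpeaker 2: Doing well, thanks!"


def parse_conversation(raw: str) -> list[Turn] | None:
    """Parse 'Speaker N: text' lines; return None if the format is invalid
    (the paper reports 189 of 200 conversations passing such a check)."""
    turns = []
    for line in raw.strip().splitlines():
        m = re.match(r"Speaker (\d+):\s*(.+)", line)
        if not m:
            return None
        turns.append(Turn(speaker=m.group(1), text=m.group(2)))
    return turns


def synthesize(turns: list[Turn]) -> list[bytes]:
    """Placeholder for a TTS backend that renders each turn with a distinct voice."""
    return [f"<audio for speaker {t.speaker}>".encode() for t in turns]


if __name__ == "__main__":
    raw = generate_conversation("Generate a 2-speaker conversation about travel.")
    turns = parse_conversation(raw)
    if turns is not None:
        audio_segments = synthesize(turns)
        print(f"Synthesized {len(audio_segments)} turns.")
```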
Stats
The average signal-to-noise ratio (SNR) across all evaluated audio files was 93.49 dB.
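For reference, SNR in decibels is conventionally 10 * log10(P_signal / P_noise). The sketch below assumes a separate noise estimate is available, which may not match the paper's exact measurement procedure; the synthetic example simply illustrates why clean synthesized audio yields very high values.

```python
# Illustrative per-file SNR computation in dB (assumed convention, not the paper's code).
import numpy as np


def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """SNR = 10 * log10(P_signal / P_noise), with power as mean squared amplitude."""
    p_signal = np.mean(signal.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)


# Synthetic example: a 440 Hz tone over a very small noise floor gives an SNR
# in the same range as the ~93 dB reported for the generated audio.
t = np.linspace(0, 1, 16000, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
noise = 1e-5 * np.random.randn(16000)
print(f"SNR: {snr_db(tone + noise, noise):.2f} dB")
```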
The dataset includes 189 conversations totaling 4.01 hours.
Two-speaker segments make up 27.5% of the dataset, while three-, four-, and five-speaker segments comprise 22.2%, 25.9%, and 24.3%, respectively.
Out of the 200 generated conversations, 189 adhered to the correct format, resulting in a success rate of 94.5%.
The total time taken for text generation was 3,730.06 seconds, averaging approximately 18.65 seconds per conversation.
The process of converting the generated text conversations into audio took 4,457.94 seconds in total, averaging approximately 22.29 seconds per audio file.