Core Concepts
Detailed music descriptions that an LLM generates from in-game dialogue improve the audio quality of long-form music produced by text-to-music models. Keeping those descriptions consistent across consecutive segments leads to smoother transitions between musical segments.
Stats
MusicGen, a text-to-music model with 1.3B parameters, was used for generating 30-second audio segments.
Llama 3.1, a 70B-parameter LLM, was used to generate the music descriptions.
Fréchet Audio Distance (FAD) scores indicated that Babel Bardo-DC, the version with detailed descriptions and continuation, produced the highest audio quality.
Kullback-Leibler Divergence (KLD) scores showed that Babel Bardo-Emotion, using emotion as the primary prompt, achieved the best story alignment.
Babel Bardo-Emotion also exhibited the smoothest transitions between music segments, as measured by KLD.
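For readers unfamiliar with KLD, the following is a minimal sketch of how a Kullback-Leibler divergence between two discrete probability distributions can be computed. It is not the evaluation code behind the scores above. The function name kl_divergence, the smoothing constant eps, and the example probability vectors are all assumptions made for illustration; in practice the two distributions would typically come from a classifier applied to the audio or text being compared. Lower values mean the two distributions are closer.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return D_KL(P || Q) in nats for two discrete distributions.

    p and q are sequences of non-negative weights of equal length.
    eps is a hypothetical smoothing constant that avoids log(0).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Smooth, then renormalize so each vector sums to 1.
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    # Sum p_i * log(p_i / q_i) over all classes.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Made-up class probabilities for two consecutive segments (illustrative only).
segment_a = [0.5, 0.5]
segment_b = [0.9, 0.1]
print(round(kl_divergence(segment_a, segment_b), 4))
print(round(kl_divergence(segment_a, segment_a), 4))
```

Note that KLD is asymmetric: swapping the two arguments generally gives a different value, so the order in which the two distributions are compared has to be fixed when reporting scores.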
Quotes
"Results suggest that while detailed music descriptions contribute to improved audio quality, maintaining consistency across consecutive descriptions helps achieve smoother transitions between musical segments."
"Furthermore, our findings indicate that emotion serves as an effective signal for aligning generated music with TRPG narratives."