What happens when you point Meta’s MusicGen at one of the world’s most underrepresented vocal traditions?

Meta’s MusicGen can generate a convincing jazz piano solo, a lo-fi beat, or a Celtic folk riff in seconds. Ask it for Mongolian or Tuvan throat singing (khoomei, sygyt, kargyraa) and you get something that sounds like a modem drowning. These styles barely exist on the internet, so they barely exist in the training data.
This is the classic low-resource domain adaptation problem, just for audio instead of text.

The goal: fine-tune MusicGen-small (300M parameters) on a self-built dataset of Mongolian and Tuvan throat singing on my machine, and actually get it to produce something recognizable.
No dataset existed, so I built one.
Using yt-dlp + ffmpeg, I downloaded audio from YouTube, segmented it into 10-second clips at 32kHz, and filtered for quality. Final count: 3,546 training clips and 393 validation clips (~11 hours total), split across three styles: khoomei, sygyt, and kargyraa.
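The segmentation step can be sketched roughly like this, assuming ffmpeg is on the PATH and the downloaded audio sits in a `raw/` directory. The directory names and helper functions are hypothetical; only the clip length and sample rate reflect the pipeline described above.

```python
from pathlib import Path
import subprocess

def segment_cmd(src: Path, out_dir: Path) -> list[str]:
    """Build an ffmpeg command that slices one file into 10 s, 32 kHz mono clips."""
    return [
        "ffmpeg", "-i", str(src),
        "-f", "segment",          # segment muxer: fixed-length output chunks
        "-segment_time", "10",    # 10-second clips
        "-ar", "32000",           # resample to 32 kHz (MusicGen's native rate)
        "-ac", "1",               # downmix to mono
        str(out_dir / f"{src.stem}_%04d.wav"),
    ]

def segment_all(raw_dir: Path, clip_dir: Path) -> None:
    """Run the segmentation command over every downloaded file."""
    clip_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(raw_dir.glob("*.wav")):
        subprocess.run(segment_cmd(src, clip_dir), check=True)
```

The quality filter (dropping clips that are mostly silence, applause, or speech) runs as a separate pass afterward.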
Keeping styles separate matters because MusicGen is text-conditioned: each style gets its own prompt during training, so the model can learn to steer toward the right acoustic character at inference time rather than averaging everything into one indistinct sound.
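A minimal sketch of that per-style conditioning, assuming the style is encoded in each clip's filename prefix (a hypothetical convention, e.g. `khoomei_0007.wav`); the prompt wording itself is illustrative, not the exact text used in training:

```python
# Map each style to a fixed training prompt. Prompt text is illustrative;
# the "<style>_<id>.wav" filename convention is an assumption for this sketch.
STYLE_PROMPTS = {
    "khoomei":  "Mongolian khoomei throat singing, deep drone with overtone harmonics",
    "sygyt":    "Tuvan sygyt throat singing, high whistling overtone melody over a drone",
    "kargyraa": "Tuvan kargyraa throat singing, very low growling undertone",
}

def prompt_for(clip_name: str) -> str:
    """Return the text condition for a clip based on its style prefix."""
    style = clip_name.split("_", 1)[0]
    return STYLE_PROMPTS[style]

def build_pairs(clip_names: list[str]) -> list[tuple[str, str]]:
    """Pair every clip with its prompt to form the text-conditioned training set."""
    return [(name, prompt_for(name)) for name in clip_names]
```

Because the prompt is part of every training example, the same strings (or close variants) can be reused at inference time to pull generations toward one specific style.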