While many AI tools are already capable of solving real-world tasks, such as summarizing a paper with an LLM or training an object detection model with YOLO, other modalities of AI are not nearly as mature. For example, there is an ongoing race to build the best possible Text-To-Speech (TTS) framework, where recent players like Sesame have burst to the forefront.
Enter music generation, which promises audio synthesis for use cases like video games, video editing, or simply creating original songs. This has proven a real challenge for developers. As the Multimodal Art Projection team describes it, generating songs complete with vocals and accompaniment, from lyrics and control signals, is one of the most challenging tasks in music synthesis. Despite its significance, no open-source system had managed to achieve this at scale until now. While closed-source projects like SUNO have made great strides here, the lack of open-source alternatives leaves a big gap for the community and slows innovation in the space.

With the YuE music generator, this space has taken a meaningful step forward. In this article, we will cover what the YuE music generator is and how it works before diving into a coding demo showing how to run YuE on NVIDIA H200s.
YuE Music Generator
“YuE is a groundbreaking series of open-source foundation models designed for music generation, specifically for transforming lyrics into full songs (lyrics2song). It can generate a complete song, lasting several minutes, that includes both a catchy vocal track and accompaniment track. YuE is capable of modeling diverse genres/languages/vocal techniques” (source: https://github.com/multimodal-art-projection/YuE)
To achieve this, the developers of YuE made several notable contributions:
- Track-Decoupled Next-Token Prediction: A novel dual-token strategy that separates the model’s audio component tracks (vocals, accompaniment) at the frame level. In practice, this makes the models resilient to challenging low vocal-to-accompaniment ratio scenarios, such as metal music (see the sketch after this list).
- Structural Progressive Conditioning: A progressive conditioning strategy for long-form music generation, enabling song-level lyrics following and structure control. “As of its release on Jan. 28, 2025, YuE family is the first publicly available, open-source lyrics-to-song model capable of full-song generation with quality on par with commercial systems.” (source: https://arxiv.org/pdf/2503.08638)
- Redesigned In-Context Learning for Music: A novel In-Context Learning framework. This enables advanced style transfer of voice characteristics across tracks, fundamental voice cloning, and bidirectional content creation.
- Multitask Multiphase Pre-training: A training strategy that converges and generalizes on real-world data.
- Strong Performance: “YuE demonstrates strong results in musicality, vocal agility, and generation duration compared to proprietary systems, supports multilingual lyrics following, while also excelling in music understanding tasks on representation learning benchmark MARBLE.” (source: https://arxiv.org/pdf/2503.08638)
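To build intuition for the track-decoupled strategy, here is a minimal Python sketch of the idea, not YuE's actual implementation: the vocal and accompaniment tokens for each audio frame are interleaved into a single sequence, so one autoregressive model predicts both tracks in lockstep.
# Minimal sketch of frame-level track interleaving (illustrative, not YuE's code).
from typing import List, Tuple

def interleave_tracks(vocal: List[int], accomp: List[int]) -> List[int]:
    """Place the vocal and accompaniment token for each frame side by side."""
    assert len(vocal) == len(accomp), "tracks must be frame-aligned"
    mixed = []
    for v, a in zip(vocal, accomp):
        mixed.extend([v, a])  # frame t -> [vocal_t, accomp_t]
    return mixed

def split_tracks(mixed: List[int]) -> Tuple[List[int], List[int]]:
    """Recover the two tracks from an interleaved sequence."""
    return mixed[0::2], mixed[1::2]

# Example with made-up codebook-0 token IDs for four frames
print(interleave_tracks([101, 102, 103, 104], [201, 202, 203, 204]))
# -> [101, 201, 102, 202, 103, 203, 104, 204]
Because both tracks share one sequence, the model sees the accompaniment context for every vocal token it predicts (and vice versa), which is what helps in low vocal-to-accompaniment ratio genres.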
Architecture

YuE is an autoregressive, language-model-based framework tailored for lyrics-to-song generation. As depicted in the image here, YuE has four main components: an audio tokenizer (with a lightweight upsampler), a text tokenizer, and two language models (LMs). “The audio tokenizer converts waveforms into discrete tokens using a semantic acoustic fused approach. The Stage-1 LM is track-decoupled, trained on text tokens and semantic-rich bottom-level audio tokens (codebook-0 from residual VQ-VAE), modeling lyrics-to-song generation as an AR next-token prediction (NTP) task. In Stage-2, a smaller LM predicts residual tokens from codebook-0 tokens to reconstruct audio. Both LMs follow the widely-adopted LLaMA2 architecture [Touvron et al., 2023a, Team, 2024]. Finally, a lightweight vocoder upsamples Stage2’s 16 kHz audio to 44.1 kHz output.” (source: https://arxiv.org/pdf/2503.08638)
In practice, the model takes song lyrics and a music genre primer to create a song that effectively captures both inputs.
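To make the data flow concrete, here is an illustrative, runnable Python sketch of the two-stage pipeline described above. Every function in it is a placeholder stub rather than the real YuE components or API; it only shows how the pieces hand data to one another.
# Illustrative sketch of the two-stage pipeline (placeholder stubs, not the real API).
import random

def text_tokenize(text):                       # stands in for the text tokenizer
    return [ord(c) % 256 for c in text]

def stage1_generate(text_tokens, n_frames=8):  # Stage-1 LM: AR next-token prediction
    # Emits interleaved vocal/accompaniment codebook-0 tokens (track-decoupled).
    return [random.randrange(1024) for _ in range(2 * n_frames)]

def stage2_generate(codebook0_tokens):         # Stage-2 LM: residual token prediction
    return [[(t + k) % 1024 for t in codebook0_tokens] for k in range(1, 8)]

def decode_and_upsample(codebook0_tokens, residual_tokens):
    # Stands in for the audio tokenizer's decoder plus the 16 kHz -> 44.1 kHz upsampler.
    return f"44.1 kHz waveform built from {len(codebook0_tokens)} codebook-0 tokens"

def generate_song(lyrics, genre_prompt):
    text_tokens = text_tokenize(genre_prompt + "\n" + lyrics)
    codebook0 = stage1_generate(text_tokens)
    residuals = stage2_generate(codebook0)
    return decode_and_upsample(codebook0, residuals)

print(generate_song("[verse] ...", "uplifting pop"))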
Code Demo
Running YuE is relatively simple, thanks to the comprehensive code released and open-sourced by the Multimodal Art Projection team. Follow along in this section to learn how to generate your own songs with YuE.
First, access your machine through SSH on your local terminal. Then, you may want to follow the steps outlined here in our R1 article, where we discuss how to mount a volume for storage onto the machine. After the machine is ready, we can begin setting up the environment.
Set Up Your Environment
Navigate to the directory of your choice on the machine. We recommend working on a mounted drive, but the machine should have sufficient storage to handle this without one. Once there, copy and paste the following code into the terminal:
pip install -r <(curl -sSL https://raw.githubusercontent.com/multimodal-art-projection/YuE/main/requirements.txt)
pip install flash-attn --no-build-isolation
sudo apt update
sudo apt install git-lfs
git lfs install
git clone https://github.com/multimodal-art-projection/YuE.git
cd YuE/inference/
git clone https://huggingface.co/m-a-p/xcodec_mini_infer
This will install all the required packages, including Flash Attention 2. It then clones the YuE repository and the xcodec_mini_infer sub-repository into the inference folder. From here, we have everything we need to start generating music.
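Before launching inference, it can help to confirm that PyTorch sees the GPU and that Flash Attention 2 imported cleanly. An optional sanity check, run inside python3, might look like this:
# Optional environment check (assumes the pip installs above succeeded).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

try:
    import flash_attn  # installed with `pip install flash-attn --no-build-isolation`
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn not found; re-run the install step above")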
Generate music with YuE
To keep things simple, the creators of YuE provide an infer.py script that we can use to generate our music. The script handles everything from downloading the models to running inference. They also provide sample lyrics and genre selections for our first run.
Run the script below to generate your first song with YuE:
python3 infer.py \
--cuda_idx 0 \
--stage1_model m-a-p/YuE-s1-7B-anneal-en-cot \
--stage2_model m-a-p/YuE-s2-1B-general \
--genre_txt ../prompt_egs/genre.txt \
--lyrics_txt ../prompt_egs/lyrics.txt \
--run_n_segments 2 \
--stage2_batch_size 4 \
--output_dir ../output \
--max_new_tokens 3000 \
--repetition_penalty 1.1
This will leave us with a new MP3 file that we can use for whatever purpose we need!
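If you want to inspect the results from the terminal, a small Python snippet like the one below lists whatever audio files landed in the output directory (the path matches the --output_dir flag above; exact filenames vary from run to run):
# List generated audio files and their sizes.
from pathlib import Path

output_dir = Path("../output")
for f in sorted(output_dir.rglob("*.mp3")):
    size_mb = f.stat().st_size / 1e6
    print(f"{f}  ({size_mb:.1f} MB)")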
Next, we can generate our own custom songs. We recommend using a Large Language Model to quickly draft lyrics, such as a deployment of DeepSeek R1 on our H200 GPUs. We can then take those lyrics and replace the contents of lyrics.txt. Next, we can change the generation settings by editing the genre prompt in genre.txt. This lets us use plain language to describe the type of song we want. For example, the default is “inspiring female uplifting pop airy vocal electronic bright vocal vocal”. We could change this to something like “somber male country singer gruff acoustic vocal” to get almost the opposite of our original song's vibe. Additionally, we can change values like run_n_segments and stage2_batch_size in the infer.py parameters to control generation length and speed.
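As a rough sketch of that workflow, the snippet below overwrites the two prompt files and re-runs the same inference command via subprocess. The lyric text and genre tags are only placeholders (follow the section-label format used in the sample lyrics.txt), and it assumes you are still in the YuE/inference/ directory.
# Sketch: swap in custom prompts, then re-run infer.py (run from YuE/inference/).
from pathlib import Path
import subprocess

Path("../prompt_egs/genre.txt").write_text(
    "somber male country singer gruff acoustic vocal"
)
Path("../prompt_egs/lyrics.txt").write_text(
    "[verse]\nExample lyric line one\nExample lyric line two\n"
)

subprocess.run([
    "python3", "infer.py",
    "--cuda_idx", "0",
    "--stage1_model", "m-a-p/YuE-s1-7B-anneal-en-cot",
    "--stage2_model", "m-a-p/YuE-s2-1B-general",
    "--genre_txt", "../prompt_egs/genre.txt",
    "--lyrics_txt", "../prompt_egs/lyrics.txt",
    "--run_n_segments", "2",
    "--stage2_batch_size", "4",
    "--output_dir", "../output",
    "--max_new_tokens", "3000",
    "--repetition_penalty", "1.1",
], check=True)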
Here is a sample song we generated: