# MusicGen-Style
Welcome to MusicGen-Style's demo Jupyter notebook. Here you will find a series of self-contained examples showing how to use MusicGen-Style in different settings.

First, we start by initializing MusicGen-Style.

In [1]:
from audiocraft.models import MusicGen
from audiocraft.models import MultiBandDiffusion

USE_DIFFUSION_DECODER = False

model = MusicGen.get_pretrained('facebook/musicgen-style')
if USE_DIFFUSION_DECODER:
 mbd = MultiBandDiffusion.get_mbd_musicgen()

Next, let us configure the generation parameters. Specifically, you can control the following:
* `use_sampling` (bool, optional): use sampling if True, else do argmax decoding. Defaults to True.
* `top_k` (int, optional): top_k used for sampling. Defaults to 250.
* `top_p` (float, optional): top_p used for sampling, when set to 0 top_k is used. Defaults to 0.0.
* `temperature` (float, optional): softmax temperature parameter. Defaults to 1.0.
* `duration` (float, optional): duration of the generated waveform. Defaults to 30.0.
* `cfg_coef` (float, optional): coefficient used for classifier free guidance. Defaults to 3.0.
* `cfg_coef_beta` (float, optional): if not None, double CFG is used. `cfg_coef_beta` is the coefficient that boosts the text condition. Defaults to None; 5 is a good starting value.
 If the generated music adheres too much to the text, reduce this parameter. If it adheres too much to the style conditioning,
 increase it.

When left unchanged, MusicGen will revert to its default parameters.
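
To make the sampling parameters above more concrete, here is a minimal pure-Python sketch of how `use_sampling`, `top_k`, `top_p` and `temperature` usually interact when a single token is picked from a list of logits. The helper names `filter_probs` and `sample_token` are hypothetical and are not part of audiocraft; the sketch only assumes the behaviour stated in the list (argmax when `use_sampling` is False, `top_k` used when `top_p` is 0).

```python
import math
import random


def filter_probs(logits, top_k=250, top_p=0.0, temperature=1.0):
    """Return the renormalized distribution {token: prob} that sampling draws from."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_p > 0.0:
        # Nucleus sampling: keep the smallest prefix whose mass reaches top_p.
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
    else:
        # top_p == 0 means we fall back to the top_k most likely tokens.
        kept = order[:top_k]

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_token(logits, use_sampling=True, rng=random, **kwargs):
    """Pick one token index, by argmax or by sampling the filtered distribution."""
    if not use_sampling:
        return max(range(len(logits)), key=lambda i: logits[i])
    dist = filter_probs(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_probs(toy_logits, top_k=2))        # only the two most likely tokens remain
print(sample_token(toy_logits, use_sampling=False))  # argmax decoding
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens carry most of the probability mass.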

These are the conditioner parameters for the style conditioner:
* `eval_q` (int): integer between 1 and 6 (inclusive) that sets how many quantizers are used in the RVQ bottleneck
 of the style conditioner. The higher `eval_q` is, the more style information passes through the model.
* `excerpt_length` (float): float between 1.5 and 4.5 that sets the length, in seconds, of the excerpt taken from the audio
 conditioning to extract style.
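
As a rough illustration of why a higher `eval_q` lets more information through, here is a toy residual quantizer on a single scalar. This is only a sketch: the real style conditioner uses learned vector codebooks, and the function `residual_quantize` and the hand-picked codebooks below are hypothetical.

```python
def residual_quantize(x, codebooks, eval_q):
    """Approximate scalar x using only the first eval_q codebooks."""
    residual, reconstruction = x, 0.0
    for codebook in codebooks[:eval_q]:
        # Pick the code closest to what is still unexplained.
        code = min(codebook, key=lambda c: abs(residual - c))
        reconstruction += code
        residual -= code
    return reconstruction


# Six hypothetical codebooks, each finer than the previous one.
codebooks = [[-s, 0.0, s] for s in (1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125)]

x = 0.8
for q in range(1, 7):
    approx = residual_quantize(x, codebooks, eval_q=q)
    print(q, approx, abs(x - approx))
```

Each additional quantizer encodes what the previous ones left over, so the reconstruction error never grows as `eval_q` increases. A low `eval_q` therefore acts as a tighter bottleneck on the style signal.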


In [None]:
model.set_generation_params(
 use_sampling=True,
 top_k=250,
 duration=30
)

The model can perform text-to-music, style-to-music and text-and-style-to-music.
* Text-to-music can be done using `model.generate`, or `model.generate_with_chroma` with the wav condition set to None.
* Style-to-music and text-and-style-to-music can be done using `model.generate_with_chroma`.

### Text-to-Music

In [None]:
from audiocraft.utils.notebook import display_audio

model.set_generation_params(
 duration=8, # generate 8 seconds, can go up to 30
 use_sampling=True, 
 top_k=250,
 cfg_coef=3., # Classifier Free Guidance coefficient 
 cfg_coef_beta=None, # double CFG is only useful for text-and-style conditioning
)

output = model.generate(
 descriptions=[
 '80s pop track with bassy drums and synth',
 '90s rock song with loud guitars and heavy drums',
 'Progressive rock drum and bass solo',
 'Punk Rock song with loud drum and power guitar',
 'Bluesy guitar instrumental with soulful licks and a driving rhythm section',
 'Jazz Funk song with slap bass and powerful saxophone',
 'drum and bass beat with intense percussions'
 ],
 progress=True, return_tokens=True
)
display_audio(output[0], sample_rate=32000)
if USE_DIFFUSION_DECODER:
 out_diffusion = mbd.tokens_to_wav(output[1])
 display_audio(out_diffusion, sample_rate=32000)

### Style-to-Music
For Style-to-Music, we don't need double CFG. 

In [None]:
import torchaudio
from audiocraft.utils.notebook import display_audio

model.set_generation_params(
 duration=8, # generate 8 seconds, can go up to 30
 use_sampling=True, 
 top_k=250,
 cfg_coef=3., # Classifier Free Guidance coefficient 
 cfg_coef_beta=None, # double CFG is only useful for text-and-style conditioning
)

model.set_style_conditioner_params(
 eval_q=1, # integer between 1 and 6
 # eval_q is the number of quantizers whose output passes
 # through the conditioner. When it is low, the model adheres less to the
 # audio conditioning
 excerpt_length=3., # length in seconds of the excerpt the model takes from the provided audio
 )

melody_waveform, sr = torchaudio.load("../assets/electronic.mp3")
melody_waveform = melody_waveform.unsqueeze(0).repeat(2, 1, 1)
output = model.generate_with_chroma(
 descriptions=[None, None], 
 melody_wavs=melody_waveform,
 melody_sample_rate=sr,
 progress=True, return_tokens=True
)
display_audio(output[0], sample_rate=32000)
if USE_DIFFUSION_DECODER:
 out_diffusion = mbd.tokens_to_wav(output[1])
 display_audio(out_diffusion, sample_rate=32000)

### Text-and-Style-to-Music
For Text-and-Style-to-Music, if we use simple classifier-free guidance, the model tends to ignore the text conditioning. We therefore introduce double classifier-free guidance:
$$l_{\text{double CFG}} = l_{\emptyset} + \alpha [l_{\text{style}} + \beta(l_{\text{text, style}} - l_{\text{style}}) - l_{\emptyset}]$$

Here $\alpha$ corresponds to `cfg_coef` and $\beta$ to `cfg_coef_beta`. For $\beta=1$ we recover classic CFG, while for $\beta > 1$ we boost the text condition.
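
The formula above can be sketched on plain Python lists of logits. This is only an illustration of the arithmetic, not the audiocraft implementation; the function names `double_cfg` and `classic_cfg` and the toy logits are hypothetical.

```python
def classic_cfg(l_uncond, l_cond, alpha=3.0):
    """Classic CFG: move from the unconditional logits towards the conditional ones."""
    return [u + alpha * (c - u) for u, c in zip(l_uncond, l_cond)]


def double_cfg(l_uncond, l_style, l_text_style, alpha=3.0, beta=5.0):
    """Double CFG: beta scales the text contribution on top of the style logits."""
    return [
        u + alpha * (s + beta * (ts - s) - u)
        for u, s, ts in zip(l_uncond, l_style, l_text_style)
    ]


# Toy logits for a three-token vocabulary (made-up values).
l_uncond = [0.0, 0.5, 1.0]
l_style = [0.5, 0.5, 2.0]
l_text_style = [1.0, 0.0, 2.0]

print(double_cfg(l_uncond, l_style, l_text_style, alpha=3.0, beta=1.0))
print(classic_cfg(l_uncond, l_text_style, alpha=3.0))
print(double_cfg(l_uncond, l_style, l_text_style, alpha=3.0, beta=5.0))
```

With `beta=1.0` the style terms cancel and the first two results agree. With a larger `beta`, the tokens where the text changes the prediction relative to style alone are pushed further.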

In [None]:
import torchaudio
from audiocraft.utils.notebook import display_audio

model.set_generation_params(
 duration=8, # generate 8 seconds, can go up to 30
 use_sampling=True, 
 top_k=250,
 cfg_coef=3., # Classifier Free Guidance coefficient 
 cfg_coef_beta=5., # double CFG is necessary for text-and-style conditioning
 # This is beta in the double CFG formula, between 1 and 9.
 # When set to 1, it is equivalent to normal CFG.
)

model.set_style_conditioner_params(
 eval_q=1, # integer between 1 and 6
 # eval_q is the number of quantizers whose output passes
 # through the conditioner. When it is low, the model adheres less to the
 # audio conditioning
 excerpt_length=3., # length in seconds of the excerpt the model takes from the provided audio
 )

melody_waveform, sr = torchaudio.load("../assets/electronic.mp3")
melody_waveform = melody_waveform.unsqueeze(0).repeat(3, 1, 1)

descriptions = ["8-bit old video game music", "Chill lofi remix", "80s New wave with synthesizer"]

output = model.generate_with_chroma(
 descriptions=descriptions,
 melody_wavs=melody_waveform,
 melody_sample_rate=sr,
 progress=True, return_tokens=True
)
display_audio(output[0], sample_rate=32000)
if USE_DIFFUSION_DECODER:
 out_diffusion = mbd.tokens_to_wav(output[1])
 display_audio(out_diffusion, sample_rate=32000)