{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MusicGen-Style\n", "Welcome to MusicGen-Style's demo jupyter notebook. Here you will find a series of self-contained examples of how to use MusicGen-Style in different settings.\n", "\n", "First, we start by initializing MusicGen-Style." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from audiocraft.models import MusicGen\n", "from audiocraft.models import MultiBandDiffusion\n", "\n", "USE_DIFFUSION_DECODER = False\n", "\n", "model = MusicGen.get_pretrained('facebook/musicgen-style')\n", "if USE_DIFFUSION_DECODER:\n", " mbd = MultiBandDiffusion.get_mbd_musicgen()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let us configure the generation parameters. Specifically, you can control the following:\n", "* `use_sampling` (bool, optional): use sampling if True, else do argmax decoding. Defaults to True.\n", "* `top_k` (int, optional): top_k used for sampling. Defaults to 250.\n", "* `top_p` (float, optional): top_p used for sampling, when set to 0 top_k is used. Defaults to 0.0.\n", "* `temperature` (float, optional): softmax temperature parameter. Defaults to 1.0.\n", "* `duration` (float, optional): duration of the generated waveform. Defaults to 30.0.\n", "* `cfg_coef` (float, optional): coefficient used for classifier free guidance. Defaults to 3.0.\n", "* `cfg_coef_beta` (float, optional): If not None, we use double CFG. cfg_coef_beta is the parameter that pushes the text. Defaults to None, user should start at 5.\n", " If the generated music adheres to much to the text, the user should reduce this parameter. If the music adheres too much to the style conditioning, \n", " the user should increase it\n", "\n", "When left unchanged, MusicGen will revert to its default parameters.\n", "\n", "These are the conditioner parameters for the style conditioner:\n", "* `eval_q` (int): integer between 1 and 6 included that tells how many quantizers are used in the RVQ bottleneck\n", " of the style conditioner. The higher eval_q is, the more style information passes through the model.\n", "* `excerpt_length` (float): float between 1.5 and 4.5 that indicates which length is taken from the audio \n", " conditioning to extract style. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.set_generation_params(\n", " use_sampling=True,\n", " top_k=250,\n", " duration=30\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model can perform text-to-music, style-to-music and text-and-style-to-music.\n", "* Text-to-music can be done using `model.generate`, or `model.generate_with_chroma` with the wav condition being None. 
\n", "* Style-to-music and Text-and-Style-to-music can be done using `model.generate_with_chroma`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text-to-Music" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from audiocraft.utils.notebook import display_audio\n", "\n", "model.set_generation_params(\n", " duration=8, # generate 8 seconds, can go up to 30\n", " use_sampling=True, \n", " top_k=250,\n", " cfg_coef=3., # Classifier Free Guidance coefficient \n", " cfg_coef_beta=None, # double CFG is only useful for text-and-style conditioning\n", ")\n", "\n", "output = model.generate(\n", " descriptions=[\n", " '80s pop track with bassy drums and synth',\n", " '90s rock song with loud guitars and heavy drums',\n", " 'Progressive rock drum and bass solo',\n", " 'Punk Rock song with loud drum and power guitar',\n", " 'Bluesy guitar instrumental with soulful licks and a driving rhythm section',\n", " 'Jazz Funk song with slap bass and powerful saxophone',\n", " 'drum and bass beat with intense percussions'\n", " ],\n", " progress=True, return_tokens=True\n", ")\n", "display_audio(output[0], sample_rate=32000)\n", "if USE_DIFFUSION_DECODER:\n", " out_diffusion = mbd.tokens_to_wav(output[1])\n", " display_audio(out_diffusion, sample_rate=32000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Style-to-Music\n", "For Style-to-Music, we don't need double CFG. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torchaudio\n", "from audiocraft.utils.notebook import display_audio\n", "\n", "model.set_generation_params(\n", " duration=8, # generate 8 seconds, can go up to 30\n", " use_sampling=True, \n", " top_k=250,\n", " cfg_coef=3., # Classifier Free Guidance coefficient \n", " cfg_coef_beta=None, # double CFG is only useful for text-and-style conditioning\n", ")\n", "\n", "model.set_style_conditioner_params(\n", " eval_q=1, # integer between 1 and 6\n", " # eval_q is the level of quantization that passes\n", " # through the conditioner. When low, the models adheres less to the \n", " # audio conditioning\n", " excerpt_length=3., # the length in seconds that is taken by the model in the provided excerpt\n", " )\n", "\n", "melody_waveform, sr = torchaudio.load(\"../assets/electronic.mp3\")\n", "melody_waveform = melody_waveform.unsqueeze(0).repeat(2, 1, 1)\n", "output = model.generate_with_chroma(\n", " descriptions=[None, None], \n", " melody_wavs=melody_waveform,\n", " melody_sample_rate=sr,\n", " progress=True, return_tokens=True\n", ")\n", "display_audio(output[0], sample_rate=32000)\n", "if USE_DIFFUSION_DECODER:\n", " out_diffusion = mbd.tokens_to_wav(output[1])\n", " display_audio(out_diffusion, sample_rate=32000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text-and-Style-to-Music\n", "For Text-and-Style-to-Music, if we use simple classifier free guidance, the models tends to ignore the text conditioning. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Text-and-Style-to-Music\n", "For Text-and-Style-to-Music, if we use simple classifier-free guidance, the model tends to ignore the text conditioning. We therefore introduce double classifier-free guidance:\n", "$$l_{\\text{double CFG}} = l_{\\emptyset} + \\alpha [l_{\\text{style}} + \\beta(l_{\\text{text, style}} - l_{\\text{style}}) - l_{\\emptyset}]$$\n", "\n", "For $\\beta=1$ we recover classic CFG, while for $\\beta > 1$ we boost the text condition." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torchaudio\n", "from audiocraft.utils.notebook import display_audio\n", "\n", "model.set_generation_params(\n", "    duration=8,  # generate 8 seconds, can go up to 30\n", "    use_sampling=True,\n", "    top_k=250,\n", "    cfg_coef=3.,  # Classifier Free Guidance coefficient\n", "    cfg_coef_beta=5.,  # double CFG is necessary for text-and-style conditioning;\n", "                       # this is beta in the double CFG formula, between 1 and 9.\n", "                       # When set to 1, it is equivalent to normal CFG.\n", ")\n", "\n", "model.set_style_conditioner_params(\n", "    eval_q=1,  # integer between 1 and 6: the level of quantization that passes\n", "               # through the conditioner. When low, the model adheres less to the\n", "               # audio conditioning.\n", "    excerpt_length=3.,  # the length in seconds of the excerpt taken from the provided audio\n", ")\n", "\n", "melody_waveform, sr = torchaudio.load(\"../assets/electronic.mp3\")\n", "melody_waveform = melody_waveform.unsqueeze(0).repeat(3, 1, 1)\n", "\n", "descriptions = [\"8-bit old video game music\", \"Chill lofi remix\", \"80s New wave with synthesizer\"]\n", "\n", "output = model.generate_with_chroma(\n", "    descriptions=descriptions,\n", "    melody_wavs=melody_waveform,\n", "    melody_sample_rate=sr,\n", "    progress=True, return_tokens=True\n", ")\n", "display_audio(output[0], sample_rate=32000)\n", "if USE_DIFFUSION_DECODER:\n", "    out_diffusion = mbd.tokens_to_wav(output[1])\n", "    display_audio(out_diffusion, sample_rate=32000)" ] } ],
"metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "vscode": { "interpreter": { "hash": "b02c911f9b3627d505ea4a19966a915ef21f28afb50dbf6b2115072d27c69103" } } }, "nbformat": 4, "nbformat_minor": 2 }