---
license: apache-2.0
language:
- en
pipeline_tag: text-to-audio
tags:
- music_generation
---

# InspireMusic
Figure 1. An overview of the InspireMusic framework.

We introduce InspireMusic, a unified framework for music, song, and audio generation capable of producing 48 kHz long-form audio. InspireMusic employs an autoregressive transformer to generate music tokens conditioned on textual input. Complementing this, an ODE-based diffusion model, specifically flow matching, reconstructs latent features from these generated music tokens; a vocoder then generates audio waveforms from the reconstructed features. InspireMusic supports text-to-music, music continuation, music reconstruction, and music super-resolution tasks. It employs WavTokenizer as an audio tokenizer to convert 24 kHz audio into 75 Hz discrete tokens, while HifiCodec serves as a music tokenizer, transforming 48 kHz audio into 150 Hz latent features compatible with the flow matching model.
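The tokenizer rates above imply a fixed hop size: both WavTokenizer (24 kHz → 75 Hz) and HifiCodec (48 kHz → 150 Hz) cover 320 audio samples per token or latent frame. A minimal sketch of that arithmetic, with illustrative helper names that are not part of any released InspireMusic API:

```python
def hop_size(sample_rate_hz: int, token_rate_hz: int) -> int:
    """Audio samples covered by one discrete token (or latent frame)."""
    assert sample_rate_hz % token_rate_hz == 0, "expected an integer hop size"
    return sample_rate_hz // token_rate_hz

def tokens_for_duration(token_rate_hz: int, seconds: float) -> int:
    """Number of tokens/frames the tokenizer produces for a clip."""
    return int(token_rate_hz * seconds)

# WavTokenizer: 24 kHz audio -> 75 Hz discrete tokens.
print(hop_size(24_000, 75))              # 320 samples per token
print(tokens_for_duration(75, 60.0))     # 4500 tokens per minute of audio

# HifiCodec: 48 kHz audio -> 150 Hz latent frames (same 320-sample hop).
print(hop_size(48_000, 150))             # 320 samples per frame
print(tokens_for_duration(150, 60.0))    # 9000 frames per minute of audio
```

This is why long-form generation remains tractable: the autoregressive transformer operates on a sequence two to three orders of magnitude shorter than the raw waveform.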