sanchit-gandhi
committed on
Commit
·
2f57de7
1
Parent(s):
4966e19
Create README.md
README.md
ADDED
@@ -0,0 +1,117 @@
---
inference: false
---

![encodec image](https://github.com/facebookresearch/encodec/raw/2d29d9353c2ff0ab1aeadc6a3d439854ee77da3e/architecture.png)

# Model Card for EnCodec

This model card provides details and information about EnCodec 32kHz, a state-of-the-art real-time audio codec developed by Meta AI.
This EnCodec checkpoint was trained specifically as part of the [MusicGen project](https://huggingface.co/docs/transformers/main/model_doc/musicgen),
and is intended to be used in conjunction with the MusicGen models.

## Model Details

### Model Description

EnCodec is a high-fidelity audio codec leveraging neural networks. It introduces a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
The model simplifies and speeds up training using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples.
It also includes a novel loss balancer mechanism that stabilizes training by decoupling the choice of hyperparameters from the typical scale of the loss.
Additionally, lightweight Transformer models are used to further compress the obtained representation while maintaining real-time performance. This variant of EnCodec is
trained on 20K hours of music data, consisting of an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music datasets.

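The loss balancer idea mentioned above can be illustrated with a minimal NumPy sketch. It assumes the balancer acts on the gradient of each loss with respect to the model output, rescales every gradient to a common norm, and then mixes them according to the chosen weights, so that a weight expresses a share of the total gradient regardless of the raw scale of the loss. The function name `balance_gradients`, the `reference_norm` parameter, and the use of the instantaneous gradient norm (rather than a running average) are illustrative assumptions, not the EnCodec training code.

```python
import numpy as np


def balance_gradients(grads, weights, reference_norm=1.0, eps=1e-12):
    """Combine per-loss gradients so each contributes its weight's share.

    grads:   dict mapping loss name -> gradient w.r.t. the model output
    weights: dict mapping loss name -> non-negative weight
    Hypothetical helper: normalising by the instantaneous gradient norm
    is a simplifying assumption made for this sketch.
    """
    total_weight = sum(weights.values())
    combined = np.zeros_like(next(iter(grads.values())), dtype=np.float64)
    for name, grad in grads.items():
        norm = np.linalg.norm(grad)
        share = weights[name] / total_weight
        # Dividing by the norm removes the raw scale of this loss.
        combined += share * reference_norm * grad / (norm + eps)
    return combined


# Two toy losses whose raw gradient scales differ by a factor of 1000.
grads = {
    "reconstruction": np.array([3.0, 4.0]),  # norm 5
    "adversarial": np.array([0.0, 0.005]),   # norm 0.005
}
weights = {"reconstruction": 1.0, "adversarial": 3.0}
balanced = balance_gradients(grads, weights)
```

In this toy case the adversarial term supplies three quarters of the combined gradient even though its raw gradient is a thousand times smaller, which is the sense in which the weights become independent of the loss scale.
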
- **Developed by:** Meta AI
- **Model type:** Audio Codec

### Model Sources

- **Repository:** [GitHub Repository](https://github.com/facebookresearch/audiocraft)
- **Paper:** [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284)

## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

EnCodec can be used directly as an audio codec for real-time compression and decompression of audio signals.
It provides high-quality audio compression and efficient decoding. The model was trained on various bandwidths, which can be specified when encoding (compressing) and decoding (decompressing).
Two different setups exist for EnCodec:

- Non-streamable: the input audio is split into chunks of 1 second, with an overlap of 10 ms, which are then encoded.
- Streamable: weight normalization is used on the convolution layers, and the input is not split into chunks but rather padded on the left.

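The two setups listed above can be sketched with plain NumPy. This is only an illustration of the input handling, not the library's implementation: `split_into_chunks` and `left_pad_for_causal_conv` are hypothetical helper names, the last chunk is zero-padded as a simplifying assumption, and the 3-tap kernel stands in for a real convolution layer.

```python
import numpy as np


def split_into_chunks(audio, sampling_rate, chunk_seconds=1.0, overlap_seconds=0.010):
    """Non-streamable setup: fixed-length chunks that overlap slightly."""
    chunk_length = int(round(chunk_seconds * sampling_rate))
    overlap = int(round(overlap_seconds * sampling_rate))
    stride = chunk_length - overlap
    chunks = []
    for start in range(0, len(audio), stride):
        chunk = audio[start:start + chunk_length]
        if len(chunk) < chunk_length:
            # Assumption for this sketch: zero-pad the final short chunk.
            chunk = np.pad(chunk, (0, chunk_length - len(chunk)))
        chunks.append(chunk)
    return chunks


def left_pad_for_causal_conv(audio, kernel_size):
    """Streamable setup: pad only on the left so outputs never see the future."""
    return np.pad(audio, (kernel_size - 1, 0))


# Two seconds of a toy ramp signal at 32 kHz.
audio = np.arange(2 * 32000, dtype=np.float32)

chunks = split_into_chunks(audio, sampling_rate=32000)

# A toy causal filter: each output sample mixes the current and past inputs.
kernel = np.array([0.5, 0.3, 0.2])
padded = left_pad_for_causal_conv(audio, len(kernel))
filtered = np.convolve(padded, kernel, mode="valid")
```

With left padding the filtered signal has the same length as the input, and each output sample depends only on the current and earlier input samples, which is what makes sample-by-sample streaming possible.
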
### Downstream Use

This variant of EnCodec is designed to be used in conjunction with the official [MusicGen checkpoints](https://huggingface.co/models?search=facebook/musicgen-).
However, it can also be used standalone to encode audio files.

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

## How to Get Started with the Model

Use the following code to get started with the EnCodec model using a dummy example from the LibriSpeech dataset (~9MB). First, install the required Python packages:

```
pip install --upgrade pip
pip install --upgrade transformers "datasets[audio]"
```

Then load an audio sample, and run a forward pass of the model:

```python
from datasets import load_dataset, Audio
from transformers import EncodecModel, AutoProcessor


# load a demonstration dataset
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# load the model + processor (for pre-processing the audio)
model = EncodecModel.from_pretrained("facebook/encodec_32khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_32khz")

# cast the audio data to the correct sampling rate for the model
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
audio_sample = librispeech_dummy[0]["audio"]["array"]

# pre-process the inputs
inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")

# explicitly encode then decode the audio inputs
encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0]

# or the equivalent with a forward pass
audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
```

## Evaluation

For evaluation results, refer to the [MusicGen evaluation scores](https://huggingface.co/facebook/musicgen-large#evaluation-results).

## Summary

EnCodec is a state-of-the-art real-time neural audio compression model that excels in producing high-fidelity audio samples at various sample rates and bandwidths.
The model's performance was evaluated across different settings, ranging from 24kHz monophonic at 1.5 kbps to 48kHz stereophonic, showcasing both subjective and
objective results. Notably, EnCodec incorporates a novel spectrogram-only adversarial loss, effectively reducing artifacts and enhancing sample quality.
Training stability and interpretability were further enhanced through the introduction of a gradient balancer for the loss weights.
Additionally, the study demonstrated that a compact Transformer model can be employed to achieve an additional bandwidth reduction of up to 40% without compromising
quality, particularly in applications where low latency is not critical (e.g., music streaming).


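As a rough aid for reading the bandwidth figures in the summary, the bitrate of a codec with a quantized latent space can be estimated from the number of frames per second, the number of codebooks used per frame, and the codebook size. The sketch below is a back-of-the-envelope calculation only; the specific values (75 frames per second, 2 codebooks, 1024 entries per codebook) are assumptions chosen to illustrate the 24kHz, 1.5 kbps setting and are not stated in this model card.

```python
import math


def bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second needed to transmit the discrete codes.

    Each code costs log2(codebook_size) bits, and every frame carries
    one code per codebook.
    """
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code


# Assumed illustrative configuration (not taken from this model card).
low_rate = bitrate_bps(frame_rate_hz=75, num_codebooks=2, codebook_size=1024)
```

Under these assumed values the estimate is 1500 bits per second, i.e. 1.5 kbps. Using more codebooks per frame raises the bitrate proportionally, which is one way a single codec can expose several bandwidths.
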
## Citation

**BibTeX:**

```
@misc{copet2023simple,
      title={Simple and Controllable Music Generation},
      author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
      year={2023},
      eprint={2306.05284},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```