sanchit-gandhi committed on
Commit
2f57de7
·
1 Parent(s): 4966e19

Create README.md

  1. README.md +117 -0
README.md ADDED
@@ -0,0 +1,117 @@
1
+ ---
2
+ inference: false
3
+ ---
4
+
5
+ ![encodec image](https://github.com/facebookresearch/encodec/raw/2d29d9353c2ff0ab1aeadc6a3d439854ee77da3e/architecture.png)
6
+
7
+ # Model Card for EnCodec
8
+
9
+ This model card provides details and information about EnCodec 32kHz, a state-of-the-art real-time audio codec developed by Meta AI.
10
+ This EnCodec checkpoint was trained specifically as part of the [MusicGen project](https://huggingface.co/docs/transformers/main/model_doc/musicgen),
11
+ and is intended to be used in conjunction with the MusicGen models.
12
+
13
+ ## Model Details
14
+
15
+ ### Model Description
16
+
17
+ EnCodec is a high-fidelity audio codec leveraging neural networks. It introduces a streaming encoder-decoder architecture with quantized latent space, trained in an end-to-end fashion.
18
+ The model simplifies and speeds up training using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples.
19
+ It also includes a novel loss balancer mechanism that stabilizes training by decoupling the choice of hyperparameters from the typical scale of the loss.
20
+ Additionally, lightweight Transformer models are used to further compress the obtained representation while maintaining real-time performance. This variant of EnCodec is
21
+ trained on 20K hours of music data, consisting of an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music datasets.
22
+
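The loss balancer described above can be illustrated with a minimal sketch. It assumes the balancer rescales each loss gradient (taken with respect to the model output) by its own norm, so that every loss contributes a share of the total gradient proportional to its weight. The function name `balance_gradients`, the loss names, and the numbers are hypothetical, and the sketch uses the instantaneous gradient norm where a real implementation would more likely track a moving average.

```python
import math


def balance_gradients(grads, weights, total_norm=1.0, eps=1e-12):
    """Rescale per-loss gradients so each one gets a fixed share of the total.

    grads:   dict mapping a loss name to its gradient (list of floats)
    weights: dict mapping a loss name to its non-negative weight
    Hypothetical helper for illustration only, not the EnCodec implementation.
    """
    weight_sum = sum(weights.values())
    balanced = {}
    for name, grad in grads.items():
        # Dividing by the norm removes the natural scale of this loss.
        norm = math.sqrt(sum(g * g for g in grad))
        # The weight then fixes the fraction of the total gradient it receives.
        scale = total_norm * weights[name] / weight_sum / (norm + eps)
        balanced[name] = [scale * g for g in grad]
    return balanced


# Two losses whose raw gradients differ in scale by a factor of 100.
grads = {"reconstruction": [3.0, 4.0], "adversarial": [300.0, 400.0]}
weights = {"reconstruction": 1.0, "adversarial": 3.0}
balanced = balance_gradients(grads, weights)

for name, grad in balanced.items():
    print(name, math.sqrt(sum(g * g for g in grad)))
```

After balancing, the gradient norms depend only on the weights (one quarter and three quarters of `total_norm` here), not on the raw scale of each loss, which is what makes the weights easier to interpret.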
23
+ - **Developed by:** Meta AI
24
+ - **Model type:** Audio Codec
25
+
26
+ ### Model Sources
27
+
28
+ - **Repository:** [GitHub Repository](https://github.com/facebookresearch/audiocraft)
29
+ - **Paper:** [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284)
30
+
31
+ ## Uses
32
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
33
+
34
+ ### Direct Use
35
+
36
+ EnCodec can be used directly as an audio codec for real-time compression and decompression of audio signals.
37
+ It provides high-quality audio compression and efficient decoding. The model was trained on various bandwidths, which can be specified when encoding (compressing) and decoding (decompressing).
38
+ Two different setups exist for EnCodec:
39
+
40
+ - Non-streamable: the input audio is split into chunks of 1 second, with an overlap of 10 ms, which are then encoded.
41
+ - Streamable: weight normalization is used on the convolution layers, and the input is not split into chunks but rather padded on the left.
42
+
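The non-streamable setup above can be sketched roughly as follows. This is only an illustration of splitting a signal into 1-second chunks that overlap by 10 ms. The helper `chunk_with_overlap` is hypothetical, the 32 kHz sampling rate is assumed from the checkpoint name, and leaving the final chunk short (rather than padding it) is a simplification, so this should not be read as the library's own chunking code.

```python
def chunk_with_overlap(samples, sampling_rate, chunk_s=1.0, overlap_s=0.01):
    """Split a 1-D sequence of samples into overlapping chunks.

    Hypothetical helper for illustration; consecutive chunks share
    `overlap_s` seconds of audio, and the last chunk may be shorter.
    """
    chunk_len = int(round(chunk_s * sampling_rate))
    overlap = int(round(overlap_s * sampling_rate))
    stride = chunk_len - overlap
    if stride <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    chunks = []
    for start in range(0, len(samples), stride):
        chunks.append(samples[start:start + chunk_len])
        # Stop once a chunk has reached the end of the signal.
        if start + chunk_len >= len(samples):
            break
    return chunks


# 2.5 seconds of dummy samples at an assumed 32 kHz sampling rate.
sampling_rate = 32_000
samples = list(range(int(2.5 * sampling_rate)))
chunks = chunk_with_overlap(samples, sampling_rate)
print([len(chunk) for chunk in chunks])
```

Each chunk starts 990 ms after the previous one, so the last 10 ms of one chunk are repeated at the start of the next.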
43
+ ### Downstream Use
44
+
45
+ This variant of EnCodec is designed to be used in conjunction with the official [MusicGen checkpoints](https://huggingface.co/models?search=facebook/musicgen-).
46
+ However, it can also be used standalone to encode audio files.
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
52
+ ## How to Get Started with the Model
53
+
54
+ Use the following code to get started with the EnCodec model using a dummy example from the LibriSpeech dataset (~9MB). First, install the required Python packages:
55
+
56
+ ```
57
+ pip install --upgrade pip
58
+ pip install --upgrade transformers datasets[audio]
59
+ ```
60
+
61
+ Then load an audio sample, and run a forward pass of the model:
62
+
63
+ ```python
64
+ from datasets import load_dataset, Audio
65
+ from transformers import EncodecModel, AutoProcessor
66
+
67
+
68
+ # load a demonstration dataset
69
+ librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
70
+
71
+ # load the model + processor (for pre-processing the audio)
72
+ model = EncodecModel.from_pretrained("facebook/encodec_32khz")
73
+ processor = AutoProcessor.from_pretrained("facebook/encodec_32khz")
74
+
75
+ # cast the audio data to the correct sampling rate for the model
76
+ librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
77
+ audio_sample = librispeech_dummy[0]["audio"]["array"]
78
+
79
+ # pre-process the inputs
80
+ inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")
81
+
82
+ # explicitly encode then decode the audio inputs
83
+ encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
84
+ audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0]
85
+
86
+ # or the equivalent with a forward pass
87
+ audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
88
+ ```
89
+
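To get a feel for how compact the `audio_codes` produced by `model.encode` are, the bitrate of the discrete representation can be estimated with simple arithmetic. The figures below (50 frames per second, 4 codebooks, 2048 entries per codebook) are assumptions taken from the MusicGen paper's description of its 32 kHz EnCodec tokenizer, not values stated in this card, and `codec_bitrate_bps` is a hypothetical helper.

```python
import math


def codec_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second needed to store the codebook indices.

    Each frame stores one index per codebook, and each index
    costs log2(codebook_size) bits.
    """
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_index


# Assumed configuration (from the MusicGen paper, not from this card).
bitrate_bps = codec_bitrate_bps(frame_rate_hz=50, num_codebooks=4, codebook_size=2048)

# Reference point: uncompressed 16-bit mono PCM at 32 kHz.
raw_pcm_bps = 32_000 * 16

print(f"codec bitrate: {bitrate_bps / 1000:.1f} kbps")
print(f"compression ratio vs. 16-bit PCM: {raw_pcm_bps / bitrate_bps:.0f}x")
```

Under these assumptions the codes take about 2.2 kbps. The actual configuration can be checked on the loaded model's `config` rather than relying on the numbers above.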
90
+ ## Evaluation
91
+
92
+ For evaluation results, refer to the [MusicGen evaluation scores](https://huggingface.co/facebook/musicgen-large#evaluation-results).
93
+
94
+ ## Summary
95
+
96
+ EnCodec is a state-of-the-art real-time neural audio compression model that excels in producing high-fidelity audio samples at various sample rates and bandwidths.
97
+ The model's performance was evaluated across different settings, ranging from 24kHz monophonic at 1.5 kbps to 48kHz stereophonic, showcasing both subjective and
98
+ objective results. Notably, EnCodec incorporates a novel spectrogram-only adversarial loss, effectively reducing artifacts and enhancing sample quality.
99
+ Training stability and interpretability were further enhanced through the introduction of a gradient balancer for the loss weights.
100
+ Additionally, the study demonstrated that a compact Transformer model can be employed to achieve an additional bandwidth reduction of up to 40% without compromising
101
+ quality, particularly in applications where low latency is not critical (e.g., music streaming).
102
+
103
+
104
+ ## Citation
105
+
106
+ **BibTeX:**
107
+
108
+ ```
109
+ @misc{copet2023simple,
110
+ title={Simple and Controllable Music Generation},
111
+ author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
112
+ year={2023},
113
+ eprint={2306.05284},
114
+ archivePrefix={arXiv},
115
+ primaryClass={cs.SD}
116
+ }
117
+ ```