Update README.md
README.md
CHANGED
@@ -15,6 +15,12 @@ tags:
 
 These are model weights originally provided by the authors of the paper [Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction](https://arxiv.org/pdf/2201.02184.pdf).
 
+<figure>
+<img src="https://huggingface.co/vumichien/AV-HuBERT/blob/main/HuBert.png" alt="Audio-visual HuBERT">
+<figcaption>Audio-visual HuBERT
+</figcaption>
+</figure>
+
 Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker’s lip
 movements and the produced sound.
 