Update README.md
README.md
CHANGED
@@ -16,8 +16,9 @@ tags:
 These are model weights originally provided by the authors of the paper [Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction](https://arxiv.org/pdf/2201.02184.pdf).
 
 Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip
-movements and the produced sound.
-
+movements and the produced sound.
+
+Audio-Visual Hidden Unit BERT (AV-HuBERT) is a self-supervised representation learning framework for audio-visual speech that masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT
 learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition.
 
 ## Datasets
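
For reference, a minimal sketch of how weights like these are typically loaded with fairseq and the authors' upstream av_hubert codebase. The checkpoint path, the `user_dir` location, and the registration step are assumptions based on that repository, not anything this README specifies.

```python
# Minimal sketch, assuming fairseq and the authors' av_hubert repo
# (https://github.com/facebookresearch/av_hubert) are installed.
# All paths below are placeholders.
from argparse import Namespace

from fairseq import checkpoint_utils, utils

# Register the AV-HuBERT task/model implementations with fairseq;
# the "avhubert" directory comes from the authors' repository.
utils.import_user_module(Namespace(user_dir="/path/to/av_hubert/avhubert"))

# Load the pretrained checkpoint: returns the model(s), the saved config, and the task.
models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/checkpoint.pt"]
)
model = models[0].eval()  # encoder ready for feature extraction or fine-tuning
```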