---
license: apache-2.0
language:
- en
pipeline_tag: automatic-speech-recognition
datasets:
- LRS3
tags:
- Audio Visual to Text
- Automatic Speech Recognition
---
## Model Description
These are model weights originally provided by the authors of the paper [Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction](https://arxiv.org/pdf/2201.02184.pdf).
<figure>
<img src="https://huggingface.co/vumichien/AV-HuBERT/resolve/main/HuBert.png" alt="Audio-visual HuBERT">
  <figcaption>Audio-visual HuBERT</figcaption>
</figure>
Video recordings of speech contain correlated audio and visual information, providing a strong signal for learning speech representations from the speaker's lip movements and the produced sound.
Audio-Visual Hidden Unit BERT (AV-HuBERT) is a self-supervised representation learning framework for audio-visual speech: it masks multi-stream video input and predicts automatically discovered, iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representations that benefit both lip reading and automatic speech recognition.
The official code for this paper is available [here](https://github.com/facebookresearch/av_hubert).
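As a minimal sketch of how these weights might be loaded, the snippet below follows the fairseq pattern used in the official repository's demo. The checkpoint filename and the `user_dir` path to a local clone of the repository are assumptions; adjust them to your setup.

```python
from argparse import Namespace

from fairseq import checkpoint_utils, utils

# Register AV-HuBERT's custom tasks/models with fairseq. Assumes the official
# repo has been cloned locally (path is an assumption):
#   git clone https://github.com/facebookresearch/av_hubert
utils.import_user_module(Namespace(user_dir="av_hubert/avhubert"))

# Load the pretrained checkpoint (filename is an assumption; use the .pt file
# downloaded from this model repo).
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["avhubert_checkpoint.pt"]
)
model = models[0].eval()
```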
## Example
<figure>
<img src="https://huggingface.co/vumichien/AV-HuBERT/resolve/main/lipreading.gif" alt="Audio-Visual Speech Recognition">
  <figcaption>Speech recognition from visual lip movements</figcaption>
</figure>
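As a rough sketch of video-only feature extraction (reusing `model` from the loading snippet above), the call below follows the repository's demo notebook; the `extract_finetune` name and signature are assumptions and may vary between versions, and real input comes from the repo's preprocessing pipeline rather than random tensors.

```python
import torch

# Dummy mouth-ROI clip with shape (batch, channel, frames, height, width).
# A real clip comes from the repo's face-detection and lip-cropping pipeline,
# which produces grayscale 88x88 crops of the mouth region.
frames = torch.randn(1, 1, 100, 88, 88)

# Extract frame-level representations from the visual stream only
# (audio=None). Method name and arguments follow the demo notebook and are
# an assumption, not a stable API.
with torch.no_grad():
    feature, _ = model.extract_finetune(
        source={"video": frames, "audio": None},
        padding_mask=None,
        output_layer=None,
    )
print(feature.shape)  # (1, num_frames, hidden_dim)
```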
## Datasets
The authors trained the model on the LRS3 lip-reading benchmark dataset (433 hours).