Update README.md
README.md
CHANGED
@@ -16,8 +16,9 @@ tags:
 These are model weights originally provided by the authors of the paper [Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction](https://arxiv.org/pdf/2201.02184.pdf).
 
 Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip
-movements and the produced sound.
-
+movements and the produced sound.
+
+Audio-Visual Hidden Unit BERT (AV-HuBERT) is a self-supervised representation learning framework for audio-visual speech that masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT
 learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition.
 
 ## Datasets
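
For reference, a minimal sketch of how weights like these are typically loaded with fairseq and the authors' upstream av_hubert codebase. The checkpoint path, the `user_dir` location, and the registration step are assumptions based on that repository, not anything this README specifies.

```python
# Minimal sketch, assuming fairseq and the authors' av_hubert repo
# (https://github.com/facebookresearch/av_hubert) are installed.
# All paths below are placeholders.
from argparse import Namespace

from fairseq import checkpoint_utils, utils

# Register the AV-HuBERT task/model implementations with fairseq;
# the "avhubert" directory comes from the authors' repository.
utils.import_user_module(Namespace(user_dir="/path/to/av_hubert/avhubert"))

# Load the pretrained checkpoint: returns the model(s), the saved config, and the task.
models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/checkpoint.pt"]
)
model = models[0].eval()  # encoder ready for feature extraction or fine-tuning
```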