TVLT

Textless Vision-Language Transformer (TVLT) model, pre-trained only. It was introduced in the paper TVLT: Textless Vision-Language Transformer by Tang et al. and first released in this repository.

Disclaimer: The team releasing TVLT did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

TVLT is based on the masked autoencoder (MAE) model and extends it to audio-visual pre-training.
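
To illustrate the MAE-style idea behind this pre-training, the sketch below randomly hides a share of patches from both a video input and an audio spectrogram, leaving only the remainder visible to an encoder. It is a minimal, standard-library illustration and not the TVLT implementation: the helper `random_mask`, the patch counts, and the 0.75 mask ratio are assumptions chosen for the example.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Split patch indices into (kept, masked) by hiding a random subset."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked


rng = random.Random(0)  # fixed seed so the example is repeatable

# Hypothetical sizes: 8 frames of 14 x 14 patches, and 128 audio spectrogram patches.
video_patches = 8 * 14 * 14
audio_patches = 128

# Assumed mask ratio of 0.75 for both modalities (illustrative value).
video_kept, video_masked = random_mask(video_patches, 0.75, rng)
audio_kept, audio_masked = random_mask(audio_patches, 0.75, rng)

# In a masked autoencoder, the encoder processes only the kept patches;
# a decoder is then trained to reconstruct the masked ones.
encoder_input_length = len(video_kept) + len(audio_kept)
print(encoder_input_length)
```

With these assumed sizes, only a quarter of the patches from each modality reach the encoder, which is what keeps this style of pre-training comparatively cheap.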

Intended uses & limitations

It's recommended to fine-tune the model on a task that involves audio and/or video.

How to use

For code examples, see the documentation.

BibTeX entry and citation info

@misc{https://doi.org/10.48550/arxiv.2209.14156,
  doi = {10.48550/ARXIV.2209.14156},
  url = {https://arxiv.org/abs/2209.14156},
  author = {Tang, Zineng and Cho, Jaemin and Nie, Yixin and Bansal, Mohit},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {TVLT: Textless Vision-Language Transformer},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}