	---
	language: en
	datasets:
	- librispeech_asr
	tags:
	- speech
	- audio
	- automatic-speech-recognition
	- hf-asr-leaderboard
	license: apache-2.0
	---

	# Wav2Vec2-Conformer-Large-100h with Rotary Position Embeddings

	Wav2Vec2-Conformer with rotary position embeddings, pretrained on 960 hours of LibriSpeech and fine-tuned on 100 hours of LibriSpeech, using speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
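If your recordings use a different sampling rate, they need to be resampled before being passed to the model. The following is a minimal sketch of one way to do that, assuming a mono waveform held in a NumPy array and using `scipy.signal.resample_poly`. The helper name `to_16khz` and the synthetic test tone are hypothetical and only serve as illustration; other resampling libraries would work just as well.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # sampling rate the model expects


def to_16khz(waveform: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz (hypothetical helper)."""
    if orig_sr == TARGET_SR:
        return waveform.astype(np.float32)
    # reduce the rate ratio to the smallest integer up/down factors
    g = gcd(orig_sr, TARGET_SR)
    resampled = resample_poly(waveform, TARGET_SR // g, orig_sr // g)
    return resampled.astype(np.float32)


# stand-in audio: one second of a 440 Hz tone recorded at 44.1 kHz
sr = 44_100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

audio_16k = to_16khz(tone, sr)
print(audio_16k.shape, audio_16k.dtype)
```

The resulting array can then be passed to the processor in place of the dataset sample used in the usage example.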

	Paper: [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171)

	Authors: Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino

	The results of Wav2Vec2-Conformer can be found in Table 3 and Table 4 of the [official paper](https://arxiv.org/abs/2010.05171).

	The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

	# Usage

	To transcribe audio files, the model can be used as a standalone acoustic model as follows:

	```python
	from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC
	from datasets import load_dataset
	import torch

	# load model and processor
	processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rope-large-100h-ft")
	model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rope-large-100h-ft")

	# load dummy dataset and read soundfiles
	ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

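	# extract input features; the audio must be sampled at 16 kHz
	input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

	# retrieve logits (gradients are not needed for inference)
	with torch.no_grad():
	    logits = model(input_values).logits

	# take the most likely token per frame and decode to text
	predicted_ids = torch.argmax(logits, dim=-1)
	transcription = processor.batch_decode(predicted_ids)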
	```
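To make the final "argmax and decode" step more concrete, here is a rough sketch of greedy CTC decoding in plain Python. It is not the library's implementation: the toy vocabulary, the choice of ID 0 as the blank token, and the function name `greedy_ctc_decode` are all assumptions made for illustration. The idea is that consecutive repeated frame predictions are collapsed, blank tokens are dropped, and the word delimiter `|` is turned into a space.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the checkpoint.
VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
BLANK_ID = 0  # assumed: the padding token doubles as the CTC blank


def greedy_ctc_decode(ids):
    """Collapse repeats, drop blanks, and map token IDs to text."""
    # 1) merge runs of identical consecutive predictions into one token
    collapsed = [token for token, _ in groupby(ids)]
    # 2) remove blanks, which only separate repeated characters
    chars = [VOCAB[t] for t in collapsed if t != BLANK_ID]
    # 3) the word delimiter "|" becomes a space
    return "".join(chars).replace("|", " ").strip()


# per-frame argmax IDs, as they might look for the word "HELLO"
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1]
text = greedy_ctc_decode(frame_ids)
print(text)
```

Note how the blank between the two runs of `L` is what allows a genuine double letter to survive the collapsing step.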