---
base_model:
- ginic/hyperparam_tuning_1_wav2vec2-large-xlsr-buckeye-ipa
language:
- en
license: mpl-2.0
metrics:
- cer
pipeline_tag: automatic-speech-recognition
---
# XLSR-TIMIT-B0: Fine-tuned on TIMIT for Phonemic Transcription
This model leverages the pretrained checkpoint [ginic/hyperparam_tuning_1_wav2vec2-large-xlsr-buckeye-ipa](https://huggingface.co/ginic/hyperparam_tuning_1_wav2vec2-large-xlsr-buckeye-ipa) and is fine-tuned on the [TIMIT DARPA English Corpus](https://github.com/philipperemy/timit) to transcribe English audio into phonemic (IPA) representations.

**Performance**

- Training Loss: 4.73
- Validation Loss: 1.048
- Test Results (TIMIT test set):
  - Average Weighted Feature Edit Distance: 18.06
  - Standard Deviation (Weighted Feature Edit Distance): 12.9
  - Average Character Error Rate (CER): 0.14
  - Standard Deviation (CER): 0.07

**Model Information**

- Number of Epochs: 40
- Learning Rate: 5e-6
- Optimizer: Adam
- Dataset Used: TIMIT DARPA English Corpus
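
The fine-tuning script itself is not included in this card. As a rough, hypothetical sketch, the reported hyperparameters map onto Hugging Face `TrainingArguments` as follows (the output directory and batch size are illustrative assumptions, not reported values):

```python
from transformers import TrainingArguments

# Hypothetical sketch: only the epoch count, learning rate, and Adam-family
# optimizer come from this card; everything else is an assumption.
training_args = TrainingArguments(
    output_dir="xlsr-timit-b0",     # assumption: output path not reported
    num_train_epochs=40,            # reported: 40 epochs
    learning_rate=5e-6,             # reported: 5e-6
    optim="adamw_torch",            # reported optimizer: Adam (AdamW variant here)
    per_device_train_batch_size=8,  # assumption: batch size not reported
)
```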

**Example Outputs**

1. - **Prediction**: `lizteɪkðɪsdɹɾiteɪbklɔθiðiklinizfɹmi`
   - **Ground Truth**: `lizteɪkðɪsdɹɾiteɪbəklɔtiðiklinizfɹmi`
   - **Weighted Feature Edit Distance**: 7.875
   - **CER**: 0.0556
2. - **Prediction**: `ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiɾimpɛɾikoʊts`
   - **Ground Truth**: `ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiŋinpɛɾikoʊts`
   - **Weighted Feature Edit Distance**: 2.375
   - **CER**: 0.0588
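
Both metrics in these examples can be reproduced with off-the-shelf tools. Below is a minimal sketch using `panphon` for the feature-based distance and `jiwer` for CER; it assumes the card's weighted distance corresponds to panphon's `weighted_feature_edit_distance`, whose default feature weights may differ from the authors' setup:

```python
import panphon.distance
import jiwer

# Assumption: the card's "Weighted Feature Edit Distance" matches panphon's
# implementation; panphon's default weights may differ from the authors'.
dst = panphon.distance.Distance()

pairs = [
    # (prediction, ground truth) from the examples above
    ("lizteɪkðɪsdɹɾiteɪbklɔθiðiklinizfɹmi", "lizteɪkðɪsdɹɾiteɪbəklɔtiðiklinizfɹmi"),
    ("ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiɾimpɛɾikoʊts", "ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiŋinpɛɾikoʊts"),
]

for prediction, ground_truth in pairs:
    distance = dst.weighted_feature_edit_distance(ground_truth, prediction)
    cer = jiwer.cer(ground_truth, prediction)
    print(f"weighted feature edit distance: {distance:.3f}, CER: {cer:.4f}")
```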

## Limitations

This phonemic transcription model is fine-tuned on an English speech corpus that does not cover all dialects, and it may significantly underperform on unseen dialects and languages. We aim to release models and datasets that better serve all populations and languages in the future.

## Usage
To transcribe audio files, this model can be used as follows:
```python
import torch
import torchaudio
from transformers import AutoModelForCTC, AutoProcessor

# Load model and processor
model = AutoModelForCTC.from_pretrained("KoelLabs/xlsr-timit-b0")
processor = AutoProcessor.from_pretrained("KoelLabs/xlsr-timit-b0")

# Load the audio and resample it to the 16 kHz rate the model expects
waveform, sample_rate = torchaudio.load("path_to_your_audio_file.wav")  # Replace with your file
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Prepare input (the processor expects an array of samples, not a file path)
input_values = processor(
    waveform.squeeze(0).numpy(), return_tensors="pt", sampling_rate=16000
).input_values

# Retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# Decode predictions (greedy CTC decoding)
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```
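
`batch_decode` returns a list with one IPA string per input example. Note that the decoding above is plain greedy CTC decoding (`argmax` over the logits); no language model or beam search is applied.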