Wav2Vec2-XLS-R-300M fine-tuned on Korean speech produced by native English speakers.
This repository contains a Wav2Vec2-XLS-R-300M model fine-tuned for phoneme recognition. The model was trained and evaluated on the dataset "the spoken Korean voice of native English speakers" provided by AIHub: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=data&dataSetSn=71469
Creator & Uploader: Sehyun Oh ([email protected])
Data Information
Dataset Name: the spoken Korean voice of native English speakers.
Data Type: Speech recordings of English speakers speaking Korean.
Annotation: Each utterance is annotated with Korean words and phoneme sequences.
Train Set: 124,626 samples, 121.75 hours
Valid Set: 15,066 samples, 14.94 hours
Test Set: 15,091 samples, 14.78 hours
Training Procedure
The model was fine-tuned for phoneme recognition using the Hugging Face transformers library. Below are the training steps:
- Data preprocessing to align audio with phoneme labels.
- Wav2Vec2-XLS-R-300M model fine-tuning with CTC loss.
- Evaluation on validation and test sets.
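To illustrate what a CTC-trained model produces at inference time, the sketch below shows greedy CTC decoding: take the most likely label per audio frame, collapse consecutive repeats, then drop the blank label. The function name `ctc_greedy_decode`, the integer label ids, and `blank_id=0` are assumptions for illustration; the actual blank id depends on the tokenizer vocabulary used for this model, which is not stated here.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse repeated frame labels, then remove CTC blanks.

    frame_ids: per-frame argmax label ids (hypothetical values below).
    blank_id:  id of the CTC blank label (assumed to be 0 here).
    """
    # Step 1: merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(frame_ids)]
    # Step 2: drop blanks; a blank between two equal labels keeps both.
    return [label for label in collapsed if label != blank_id]


# Hypothetical per-frame predictions: the blank between the two 3s
# preserves a genuine repeated phoneme.
frames = [0, 3, 3, 0, 3, 5, 5, 0, 0, 2]
decoded = ctc_greedy_decode(frames)
print(decoded)  # [3, 3, 5, 2]
```

The blank label is what lets CTC align a long frame sequence with a much shorter phoneme sequence without frame-level labels.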
Training Hyperparameters
- Epochs: 50
- Learning Rate: 0.0001
- Warmup Ratio: 0.1
- Scheduler: Linear
- Batch Size: 8
- Loss Reduction: Mean
- Feature Extractor Freeze: Enabled
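As a rough illustration of how the learning rate, warmup ratio, and linear scheduler listed above interact, the sketch below computes the learning rate at a given optimizer step: a linear ramp from 0 to the peak during warmup, then a linear decay to 0. The function name `linear_warmup_lr` and the `total_steps=1000` used in the example are hypothetical; the real number of steps depends on the number of devices and any gradient accumulation, which are not stated here.

```python
import math


def linear_warmup_lr(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr to 0 over the remaining steps.
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Hypothetical run of 1000 steps: warmup covers the first 100 steps.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, total_steps=1000))
```

With these settings the peak rate of 0.0001 is reached once, at the end of warmup, and the rate returns to zero at the final step.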
Training Results
The following metrics were achieved during training:
- Final Training Loss: 0.0473
- Validation Loss: 0.1541
- Word Error Rate (WER) on Validation Set: 0.1156
Test Results
The model was evaluated on the test dataset with the following performance:
- Word Error Rate (WER): 0.0315
- Character Error Rate (CER): 0.0230
- Phoneme Error Rate (PER): 0.0315
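For readers unfamiliar with these metrics, the sketch below shows one common way to compute a phoneme error rate: the total edit distance (substitutions, insertions, deletions) between reference and predicted phoneme sequences, divided by the total number of reference phonemes. The evaluation script used for the figures above is not described here, so the function names and the space-separated phoneme format are assumptions. If each space-separated phoneme is treated as a word, WER and PER computed this way coincide.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Corpus-level PER over space-separated phoneme strings."""
    edits = sum(edit_distance(r.split(), h.split())
                for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return edits / total


# Hypothetical example: one substitution over six reference phonemes.
refs = ["ㅂ ㅏ ㅂ", "ㅎ ㅏ ㄴ"]
hyps = ["ㅂ ㅏ ㅍ", "ㅎ ㅏ ㄴ"]
per = phoneme_error_rate(refs, hyps)
print(round(per, 4))  # 0.1667
```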
Phoneme Data Example
Below is an example of how the dataset is structured for this phoneme recognition task:
- Prompt: the sentence or word that the participant was asked to read.
- True Phonemes: the phonemes transcribed from how the participant actually pronounced the prompt.
- Predicted Phonemes: the phonemes predicted by the fine-tuned model.
- Sample:
- Prompt: 여기 비빔밥 한 그릇 갖다 주세요.
- True Phonemes of Korean pronunciation: γ γ± γ £ γ γ £ γ γ £ γ γ γ γ γ γ γ΄ γ± γ ‘ γΉ γ ‘ γ· γ± γ γ· γ γ· γ γ γ γ
- Predicted Phonemes: γ γ± γ £ γ γ £ γ γ £ γ γ γ γ γ γ γ΄ γ± γ ‘ γΉ γ ‘ γ· γ± γ γ· γΈ γ γ γ γ γ γ
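The phoneme sequences in this example are written as space-separated Hangul jamo. As a minimal sketch of that representation, the code below splits precomposed Hangul syllables into their jamo using standard Unicode arithmetic. This is a purely orthographic decomposition and not the dataset's annotation procedure: the True Phonemes follow the speaker's actual pronunciation and can differ from the spelling. The function names are hypothetical, and keeping the silent initial ㅇ is an assumption.

```python
# Compatibility jamo in Unicode syllable-composition order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

HANGUL_BASE = 0xAC00  # first precomposed syllable
HANGUL_LAST = 0xD7A3  # last precomposed syllable


def syllable_to_jamo(syllable):
    """Split one precomposed Hangul syllable into its jamo."""
    index = ord(syllable) - HANGUL_BASE
    lead, rest = divmod(index, 21 * 28)
    vowel, tail = divmod(rest, 28)
    jamo = [CHOSEONG[lead], JUNGSEONG[vowel]]
    if tail:  # index 0 means no final consonant
        jamo.append(JONGSEONG[tail])
    return jamo


def text_to_jamo(text):
    """Space-separated jamo for all Hangul syllables; other characters are skipped."""
    out = []
    for ch in text:
        if HANGUL_BASE <= ord(ch) <= HANGUL_LAST:
            out.extend(syllable_to_jamo(ch))
    return " ".join(out)


print(text_to_jamo("비빔밥"))  # ㅂ ㅣ ㅂ ㅣ ㅁ ㅂ ㅏ ㅂ
```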
Training Logs
TensorBoard logs are available for detailed training analysis:
events.out.tfevents.1741331703.oem-WS-C621E-SAGE-Series.3197499.0
events.out.tfevents.1741696761.oem-WS-C621E-SAGE-Series.3197499.1
Use the following command to visualize logs:
tensorboard --logdir=./logs/