Wav2Vec2-XLS-R-300M fine-tuned on Korean speech produced by native English speakers.

This repository contains a Wav2Vec2-XLS-R-300M model fine-tuned for phoneme recognition. The model was trained and evaluated on the “the spoken Korean voice of native English speakers” dataset provided by AIHub: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=data&dataSetSn=71469

Creator & Uploader: Sehyun Oh ([email protected])

Data Information

  • Dataset Name: the spoken Korean voice of native English speakers.

  • Data Type: Speech recordings of English speakers speaking Korean.

  • Annotation: Each utterance is annotated with Korean words and phoneme sequences.

  • Train Set: 124,626 samples, 121.75 hours

  • Valid Set: 15,066 samples, 14.94 hours

  • Test Set: 15,091 samples, 14.78 hours

Training Procedure

The model was fine-tuned for phoneme recognition using the Hugging Face transformers library. Below are the training steps:

  1. Data preprocessing to align audio with phoneme labels.
  2. Wav2Vec2-XLS-R-300M model fine-tuning with CTC loss.
  3. Evaluation on validation and test sets.
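
Step 2 trains with CTC, in which the model emits one label per audio frame, including a special blank label. A phoneme sequence is then recovered by merging consecutive repeats and dropping blanks. The sketch below illustrates that collapse rule in plain Python. The tiny vocabulary, the blank index 0, and the frame-level IDs are made up for illustration and are not taken from this repository's tokenizer.

```python
from itertools import groupby

# Hypothetical vocabulary: index 0 is assumed to be the CTC blank.
VOCAB = ["<blank>", "ㄱ", "ㅏ", "ㄷ", "ㄸ"]
BLANK_ID = 0


def ctc_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive duplicate IDs, then remove blanks (greedy CTC decoding)."""
    merged = [label for label, _ in groupby(frame_ids)]
    return [label for label in merged if label != blank_id]


# Made-up per-frame argmax output for a short stretch of audio.
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 0, 3, 3]
phonemes = [VOCAB[i] for i in ctc_collapse(frames)]
print(" ".join(phonemes))
```

The blank between the two runs of ID 3 is what allows the same phoneme to be emitted twice in a row; without it, the two runs would merge into one.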

Training Hyperparameters

  • Epochs: 50
  • Learning Rate: 0.0001
  • Warmup Ratio: 0.1
  • Scheduler: Linear
  • Batch Size: 8
  • Loss Reduction: Mean
  • Feature Extractor Freeze: Enabled
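
To make the schedule concrete, the sketch below computes a linear warmup followed by a linear decay from the values listed above. It assumes that the batch size of 8 is the effective batch size per optimizer step (one device, no gradient accumulation) and that the last partial batch of each epoch is kept; neither is stated in this card. Under those assumptions, 124,626 training samples give 15,579 steps per epoch and 778,950 steps over 50 epochs. The function names are illustrative.

```python
import math

PEAK_LR = 1e-4         # Learning Rate
WARMUP_RATIO = 0.1     # Warmup Ratio
EPOCHS = 50
BATCH_SIZE = 8
TRAIN_SAMPLES = 124_626


def total_steps(samples, batch_size, epochs):
    """Optimizer steps, assuming no gradient accumulation and no dropped batch."""
    return math.ceil(samples / batch_size) * epochs


def linear_lr(step, total, warmup_ratio=WARMUP_RATIO, peak=PEAK_LR):
    """Linear ramp from 0 to peak during warmup, then linear decay to 0."""
    warmup = round(total * warmup_ratio)
    if step < warmup:
        return peak * step / max(1, warmup)
    return peak * max(0.0, (total - step) / max(1, total - warmup))


steps = total_steps(TRAIN_SAMPLES, BATCH_SIZE, EPOCHS)
for s in (0, steps // 20, steps // 10, steps // 2, steps):
    print(f"step {s:>7}: lr = {linear_lr(s, steps):.2e}")
```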

Training Results

The following metrics were achieved during training:

  • Final Training Loss: 0.0473
  • Validation Loss: 0.1541
  • Word Error Rate (WER) on Validation Set: 0.1156

Test Results

The model was evaluated on the test dataset with the following performance:

  • Word Error Rate (WER): 0.0315
  • Character Error Rate (CER): 0.0230
  • Phoneme Error Rate (PER): 0.0315
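
These error rates are edit-distance metrics: the number of substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the reference length. The sketch below shows one common way to compute such a rate over space-separated phoneme tokens. It is not the evaluation script used for this model, and the two short sequences are made up. If phoneme transcripts are stored as space-separated tokens, a WER computed on them counts the same units as PER, which may be why the two figures above coincide; that is an assumption, not something this card states.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(reference, hypothesis):
    """Edit distance over space-separated phonemes, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one phoneme")
    return edit_distance(ref, hyp) / len(ref)


# Made-up pair: one substitution (ㅂ -> ㅃ) and one insertion (ㄸ).
reference = "ㅂ ㅏ ㅂ ㄱ ㅏ ㄷ ㅏ"
hypothesis = "ㅃ ㅏ ㅂ ㄱ ㅏ ㄷ ㄸ ㅏ"
print(f"PER = {phoneme_error_rate(reference, hypothesis):.4f}")
```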

Phoneme Data Example

Below is an example of how the dataset is structured for this phoneme recognition task:

  • Prompt: the sentence or word that the participant is asked to read.
  • True Phonemes: the phonemes transcribed from how the participant's pronunciation actually sounds.
  • Predicted Phonemes: the phonemes predicted by the fine-tuned model.
  • Sample:
      • Prompt: 여기 비빔밥 한 그릇 갖다 주세요. ("Please bring me a bowl of bibimbap here.")
      • True Phonemes of Korean pronunciation: ㅕ ㄱ ㅣ ㅂ ㅣ ㅂ ㅣ ㅁ ㅂ ㅏ ㅂ ㅎ ㅏ ㄴ ㄱ ㅡ ㄹ ㅡ ㄷ ㄱ ㅏ ㄷ ㅏ ㄷ ㅗ ㅅ ㅔ ㅛ
      • Predicted Phonemes: ㅕ ㄱ ㅣ ㅂ ㅣ ㅂ ㅣ ㅁ ㅃ ㅏ ㅂ ㅎ ㅏ ㄴ ㄱ ㅡ ㄹ ㅡ ㄷ ㄱ ㅏ ㄷ ㄸ ㅏ ㅈ ㅜ ㅅ ㅔ ㅛ
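
To see where the prediction departs from the reference in the sample above, the two sequences can be aligned token by token. The sketch below uses Python's standard `difflib.SequenceMatcher`; it is an illustration and not part of the model's evaluation code. The phoneme strings are copied from the sample.

```python
from difflib import SequenceMatcher

true_phonemes = (
    "ㅕ ㄱ ㅣ ㅂ ㅣ ㅂ ㅣ ㅁ ㅂ ㅏ ㅂ ㅎ ㅏ ㄴ "
    "ㄱ ㅡ ㄹ ㅡ ㄷ ㄱ ㅏ ㄷ ㅏ ㄷ ㅗ ㅅ ㅔ ㅛ"
).split()
pred_phonemes = (
    "ㅕ ㄱ ㅣ ㅂ ㅣ ㅂ ㅣ ㅁ ㅃ ㅏ ㅂ ㅎ ㅏ ㄴ "
    "ㄱ ㅡ ㄹ ㅡ ㄷ ㄱ ㅏ ㄷ ㄸ ㅏ ㅈ ㅜ ㅅ ㅔ ㅛ"
).split()

# Align the two token lists and keep only the spans that differ.
matcher = SequenceMatcher(a=true_phonemes, b=pred_phonemes, autojunk=False)
diffs = [
    (tag, true_phonemes[i1:i2], pred_phonemes[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
for tag, ref_span, hyp_span in diffs:
    print(f"{tag:8} reference={ref_span} predicted={hyp_span}")
```

For this sample the alignment reports ㅂ replaced by ㅃ, an inserted ㄸ, and ㄷ ㅗ replaced by ㅈ ㅜ, which amounts to 4 edits against 28 reference phonemes.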

Training Logs

TensorBoard logs are available for detailed training analysis:

  • events.out.tfevents.1741331703.oem-WS-C621E-SAGE-Series.3197499.0
  • events.out.tfevents.1741696761.oem-WS-C621E-SAGE-Series.3197499.1

Use the following command to visualize logs:

tensorboard --logdir=./logs/