Wav2Vec2-XLS-R-300m fine-tuned on Korean speech produced by native English speakers.

This repository contains a Wav2Vec2-XLS-R-300m model fine-tuned for the phoneme recognition task. The model was trained and evaluated on the “the spoken Korean voice of native English speakers” dataset provided by AIHub: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=data&dataSetSn=71469

Creator & Uploader: Sehyun Oh

Data Information

Dataset Name: the spoken Korean voice of native English speakers.
Data Type: Speech recordings of English speakers speaking Korean.
Annotation: Each utterance is annotated with Korean words and phoneme sequences.
Train Set: 124,626 samples, 121.75 hours
Valid Set: 15,066 samples, 14.94 hours
Test Set: 15,091 samples, 14.78 hours
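
As a quick sanity check on the split sizes listed above, the following sketch computes each split's share of the total sample count and the mean utterance length. The numbers are copied from the list; the variable names (`splits`, `summary`) are illustrative and not part of the dataset or of this repository.

```python
# Split sizes copied from the list above: (number of samples, hours of audio).
splits = {
    "train": (124_626, 121.75),
    "valid": (15_066, 14.94),
    "test": (15_091, 14.78),
}

total_samples = sum(n for n, _ in splits.values())
total_hours = sum(h for _, h in splits.values())

# For each split: fraction of all samples, and mean utterance length in seconds.
summary = {}
for name, (n, hours) in splits.items():
    summary[name] = {
        "share": n / total_samples,
        "mean_seconds": hours * 3600 / n,
    }

print(f"total: {total_samples} samples, {total_hours:.2f} hours")
for name, stats in summary.items():
    print(f"{name}: {stats['share']:.1%} of samples, "
          f"{stats['mean_seconds']:.2f} s per utterance on average")
```

By this arithmetic the data is split roughly 80/10/10 by sample count, and utterances average about 3.5 seconds in every split.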

The model was fine-tuned for phoneme recognition using the Hugging Face transformers library. Below are the training steps:

The following metrics were achieved during training:

The model was evaluated on the test dataset with the following performance:

Below is an example of how the dataset is structured for this phoneme recognition task:

Prompt is the specific sentence or word that the participant is asked to read.
True Phonemes are the phonemes transcribed from how the participant's pronunciation actually sounds.
Predicted Phonemes are the phonemes predicted by the fine-tuned model.
Sample:
Prompt: 여기 비빔밥 한 그릇 갖다 주세요.
True Phonemes of Korean pronunciation: ㅕ ㄱ ㅣ ㅂ ㅣ ㅂ ㅣ ㅁ ㅂ ㅏ ㅂ ㅎ ㅏ ㄴ ㄱ ㅡ ㄹ ㅡ ㄷ ㄱ ㅏ ㄷ ㅏ ㄷ ㅗ ㅅ ㅔ ㅛ
Predicted Phonemes: ㅕ ㄱ ㅣ ㅂ ㅣ ㅂ ㅣ ㅁ ㅃ ㅏ ㅂ ㅎ ㅏ ㄴ ㄱ ㅡ ㄹ ㅡ ㄷ ㄱ ㅏ ㄷ ㄸ ㅏ ㅈ ㅜ ㅅ ㅔ ㅛ
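
One common way to compare a predicted phoneme sequence with a reference is the phoneme error rate (PER): the Levenshtein edit distance between the two sequences divided by the reference length. The sketch below applies it to the sample above. It is only an illustration, not necessarily the metric or code used to evaluate this model, and the function names `edit_distance` and `phoneme_error_rate` are hypothetical helpers written for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions) between token lists."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Sequences copied from the sample above, split on whitespace.
true_phonemes = "ㅕ ㄱ ㅣ ㅂ ㅣ ㅂ ㅣ ㅁ ㅂ ㅏ ㅂ ㅎ ㅏ ㄴ ㄱ ㅡ ㄹ ㅡ ㄷ ㄱ ㅏ ㄷ ㅏ ㄷ ㅗ ㅅ ㅔ ㅛ".split()
predicted_phonemes = "ㅕ ㄱ ㅣ ㅂ ㅣ ㅂ ㅣ ㅁ ㅃ ㅏ ㅂ ㅎ ㅏ ㄴ ㄱ ㅡ ㄹ ㅡ ㄷ ㄱ ㅏ ㄷ ㄸ ㅏ ㅈ ㅜ ㅅ ㅔ ㅛ".split()

errors = edit_distance(true_phonemes, predicted_phonemes)
per = phoneme_error_rate(true_phonemes, predicted_phonemes)
print(f"{errors} edits / {len(true_phonemes)} reference phonemes = PER {per:.3f}")
```

For this single sample the two sequences differ by 4 edits against 28 reference phonemes, a PER of about 0.143. Note that the reference here is the learner's actual pronunciation, so the prediction can differ from it while still matching the prompt's standard pronunciation (for example ㅈ ㅜ for 주).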

TensorBoard logs are available for detailed training analysis:

Use the following command to visualize logs:

tensorboard --logdir=./logs/