metadata
language:
  - ko
license: apache-2.0
library_name: transformers
tags:
  - audio
  - automatic-speech-recognition
datasets:
  - KsponSpeech
metrics:
  - wer

ko-spelling-wav2vec2-conformer-del-1s

Model Details

  • Model Description: This model was pre-trained from scratch on the wav2vec2-conformer base architecture.
    It was then fine-tuned on KsponSpeech using Wav2Vec2ConformerForCTC.

  • Dataset: AIHub KsponSpeech
    The datasets were built by the authors by preprocessing this data.
    "del-1s" means that samples of 1 second or shorter were filtered out.
    The model was trained on data that follows the orthographic (spelling) transcription standard. (Numbers and English follow their own written notation.)

  • Developed by: TADev (@lIlBrother, @ddobokki, @jp42maru)

  • Language(s): Korean

  • License: apache-2.0

  • Parent Model: See wav2vec2-conformer for more information about the pre-trained base model. (This model was pre-trained from scratch on the wav2vec2-conformer base architecture.)
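The "del-1s" filtering mentioned in the dataset notes can be illustrated with a minimal sketch. This is not the authors' preprocessing code: the helper names, the 16 kHz sampling rate, and the strict "longer than 1 second" comparison are assumptions made for illustration.

```python
from typing import List, Sequence

# Assumption: threshold implied by the "del-1s" suffix (clips of 1 s or less are dropped).
MIN_DURATION_SEC = 1.0


def clip_duration(num_samples: int, sampling_rate: int) -> float:
    """Duration of a clip in seconds."""
    return num_samples / sampling_rate


def drop_short_clips(
    clips: Sequence[Sequence[float]],
    sampling_rate: int = 16000,
    min_duration: float = MIN_DURATION_SEC,
) -> List[Sequence[float]]:
    """Keep only clips strictly longer than min_duration seconds."""
    return [c for c in clips if clip_duration(len(c), sampling_rate) > min_duration]


# Three dummy clips of 0.5 s, 1.0 s and 1.5 s at 16 kHz.
clips = [[0.0] * 8000, [0.0] * 16000, [0.0] * 24000]
kept = drop_short_clips(clips)
print([len(c) for c in kept])
```

Only the 1.5-second clip survives here, because clips of exactly 1 second are also treated as "1 second or shorter".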

Evaluation

WER is computed with load_metric("wer") from the Hugging Face datasets library.
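For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of a corpus-level word error rate: the total word-level edit distance divided by the total number of reference words. It is an illustration of the idea, not the implementation used by the library, and the function names and sample sentences are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], predictions: List[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        total += len(ref_words)
    return errors / total


# Hypothetical sample: one substitution and one deletion over five reference words.
score = corpus_wer(["the cat sat", "hello world"], ["the cat sit", "hello"])
print(score)
```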

How to Get Started With the Model

import unicodedata

import librosa
from pyctcdecode import build_ctcdecoder
from transformers import (
    AutoFeatureExtractor,
    AutoModelForCTC,
    AutoTokenizer,
    Wav2Vec2ProcessorWithLM,
)
from transformers.pipelines import AutomaticSpeechRecognitionPipeline

audio_path = ""  # set this to the path of the audio file to transcribe

# Load the model, the tokenizer, and the other modules needed for prediction.
model = AutoModelForCTC.from_pretrained("42MARU/ko-spelling-wav2vec2-conformer-del-1s")
feature_extractor = AutoFeatureExtractor.from_pretrained("42MARU/ko-spelling-wav2vec2-conformer-del-1s")
tokenizer = AutoTokenizer.from_pretrained("42MARU/ko-spelling-wav2vec2-conformer-del-1s")
beamsearch_decoder = build_ctcdecoder(
    labels=list(tokenizer.encoder.keys()),
    kenlm_model_path=None,
)
processor = Wav2Vec2ProcessorWithLM(
    feature_extractor=feature_extractor, tokenizer=tokenizer, decoder=beamsearch_decoder
)

# Plug the modules defined above into the pipeline used for the actual prediction.
asr_pipeline = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    decoder=processor.decoder,
    device=-1,
)

# Load the audio file and run prediction with explicit beam search parameters.
raw_data, _ = librosa.load(audio_path, sr=16000)
kwargs = {"decoder_kwargs": {"beam_width": 100}}
pred = asr_pipeline(inputs=raw_data, **kwargs)["text"]
# The model emits jamo-decomposed Unicode text, so it must be normalized into a regular (composed) string.
result = unicodedata.normalize("NFC", pred)
print(result)
# 안녕하세요 123 테스트입니다.
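The final normalization step can be seen in isolation with the standard library alone. The sketch below assumes the decomposed output behaves like Unicode NFD text; the sample string is arbitrary and no model is involved.

```python
import unicodedata

composed = "안녕"  # two precomposed Hangul syllables

# NFD splits each syllable into its individual jamo (initial, vowel, final).
decomposed = unicodedata.normalize("NFD", composed)

# NFC recombines the jamo sequence into precomposed syllables.
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(composed), len(decomposed), len(recomposed))  # 2 6 2
print(recomposed == composed)  # True
```

Each of the two syllables has an initial consonant, a vowel, and a final consonant, so the decomposed form has six code points, and NFC restores the original two.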

Beam-100 Result (WER):

| "clean" | "other" |
| --- | --- |
| 22.01 | 27.34 |