---
license: apache-2.0
language:
  - ko
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
  - speech
  - audio
---

# hubert-base-korean

## Model Details

HuBERT (Hidden-Unit BERT) is a speech representation learning model proposed by Facebook.
Unlike conventional speech recognition models, HuBERT is trained with self-supervised learning directly on the raw waveform of the speech signal.

This model uses https://huggingface.co/team-lucid/hubert-base-korean as its base model.

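Because the model consumes raw waveforms, the length of its output sequence follows from the strides of its convolutional feature encoder. The sketch below works that out in plain Python. It assumes the default HuBERT base encoder layout (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, no padding); whether this checkpoint uses exactly these values should be confirmed from its config. The name `num_output_frames` is illustrative and not part of any library.

```python
# Assumed HuBERT base feature-encoder layout (not read from this checkpoint's config).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Return how many frames the encoder emits for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Unpadded 1-D convolution: floor((length - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    one_second = 16000  # samples at 16 kHz
    print(num_output_frames(one_second))  # 49 frames, roughly one every 20 ms
```

Under these assumptions, one second of 16 kHz audio becomes 49 hidden-state frames, which is the sequence that a classification head on top of the encoder has to summarize.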

## How to Get Started with the Model

### PyTorch

```py
import torch
import librosa
import pytorch_lightning as pl
from torch import nn
from transformers import AutoConfig, AutoFeatureExtractor, HubertForSequenceClassification

class MyLitModel(pl.LightningModule):
    def __init__(self, audio_model_name, num_label2s, n_layers=1, projector=True, classifier=True, dropout=0.07, lr_decay=1):
        super().__init__()
        self.config = AutoConfig.from_pretrained(audio_model_name)
        # Return the hidden states of every layer so the heads can read the last one.
        self.config.output_hidden_states = True
        self.audio_model = HubertForSequenceClassification.from_pretrained(audio_model_name, config=self.config)
        # Two heads on the shared encoder: emotion class logits and a scalar intensity.
        self.label2_classifier = nn.Linear(self.audio_model.config.hidden_size, num_label2s)
        self.intensity_regressor = nn.Linear(self.audio_model.config.hidden_size, 1)

    def forward(self, audio_values, audio_attn_mask=None):
        outputs = self.audio_model(input_values=audio_values, attention_mask=audio_attn_mask)
        # Use the first frame of the last hidden layer as the utterance representation.
        first_frame = outputs.hidden_states[-1][:, 0, :]
        label2_logits = self.label2_classifier(first_frame)
        intensity_preds = self.intensity_regressor(first_frame).squeeze(-1)
        return label2_logits, intensity_preds

# ๋ชจ๋ธ ๊ด€๋ จ ์„ค์ •
audio_model_name = "team-lucid/hubert-base-korean"
NUM_LABELS = 7
SAMPLING_RATE = 16000

# Hubert ๋ชจ๋ธ ๋กœ๋“œ
pretrained_model_path = "" # ๋ชจ๋ธ ์ฒดํฌํฌ์ธํŠธ
hubert_model = MyLitModel.load_from_checkpoint(
    pretrained_model_path,
    audio_model_name=audio_model_name,
    num_label2s=NUM_LABELS,
)
hubert_model.eval()
hubert_model.to("cuda" if torch.cuda.is_available() else "cpu")

# Load the feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained(audio_model_name)

# Process the audio file
audio_path = ""  # path to the audio file to process
audio_np, _ = librosa.load(audio_path, sr=SAMPLING_RATE, mono=True)
inputs = feature_extractor(raw_speech=audio_np, return_tensors="pt", sampling_rate=SAMPLING_RATE)
audio_values = inputs["input_values"].to(hubert_model.device)
audio_attn_mask = inputs.get("attention_mask", None)
if audio_attn_mask is not None:
    audio_attn_mask = audio_attn_mask.to(hubert_model.device)

# ๊ฐ์ • ๋ถ„์„
with torch.no_grad():
    if audio_attn_mask is None:
        label2_logits, intensity_preds = hubert_model(audio_values)
    else:
        label2_logits, intensity_preds = hubert_model(audio_values, audio_attn_mask)

emotion_label = torch.argmax(label2_logits, dim=-1).item()
emotion_intensity = intensity_preds.item()

print(f"Emotion Label: {emotion_label}, Emotion Intensity: {emotion_intensity}")
```
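
The snippet above prints a raw class index and an unbounded regression output. A minimal post-processing sketch in plain Python follows: it turns the logits into a confidence with a softmax and clamps the intensity to the 0-2 range described under Training Procedure. The index-to-name order in `EMOTIONS` is an assumption, since this card lists the seven emotions but not which index each one maps to, and `postprocess` is a hypothetical helper name.

```python
import math

# Hypothetical index-to-name order; the card lists the emotions but not their indices.
EMOTIONS = ["happiness", "anger", "disgust", "fear", "neutral", "sadness", "surprise"]


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, intensity, low=0.0, high=2.0):
    """Return (emotion name, confidence, intensity clamped to [low, high])."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    clamped = min(max(intensity, low), high)
    return EMOTIONS[index], probs[index], clamped


if __name__ == "__main__":
    # With the tensors from the snippet above, one would pass
    # label2_logits.squeeze(0).tolist() and intensity_preds.item().
    name, confidence, strength = postprocess([0.1, 2.0, -1.0, 0.3, 0.0, -0.5, 0.2], 2.4)
    print(name, round(confidence, 3), strength)
```

The label order should be checked against the mapping used when the checkpoint was trained before the names are shown to users.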

## Training Details

### Training Data

This model was trained on the AI Hub conversational speech dataset for emotion classification (https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=263),
using 1,000 samples for each label, 7,000 samples in total.

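Drawing the same number of samples for every label can be sketched as follows. This is an illustration only: the card does not describe how the subset was selected, so the random choice, the fixed seed, the `(path, label)` record layout, and the helper name `sample_per_label` are all assumptions.

```python
import random
from collections import defaultdict


def sample_per_label(records, per_label, seed=0):
    """Pick per_label random records for every label (records are (path, label) pairs)."""
    by_label = defaultdict(list)
    for path, label in records:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for label in sorted(by_label):
        paths = by_label[label]
        if len(paths) < per_label:
            raise ValueError(f"label {label!r} has only {len(paths)} records")
        subset.extend((path, label) for path in rng.sample(paths, per_label))
    return subset


if __name__ == "__main__":
    # Toy stand-in for the real metadata: 7 labels with 1,500 clips each.
    records = [(f"clip_{label}_{i}.wav", label) for label in range(7) for i in range(1500)]
    subset = sample_per_label(records, per_label=1000)
    print(len(subset))  # 7000
```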

### Training Procedure

๊ฐ 7๊ฐ€์ง€ ๊ฐ์ • (ํ–‰๋ณต, ๋ถ„๋…ธ, ํ˜์˜ค, ๊ณตํฌ, ์ค‘๋ฆฝ, ์Šฌํ””, ๋†€๋žŒ)๊ณผ ๊ฐ ๊ฐ์ •์˜ ๊ฐ•๋„(0-2)๋ฅผ ๋™์‹œ์— ํ•™์Šตํ•˜๋Š” ๋ฉ€ํ‹ฐํ…Œ์Šคํฌ ๋ชจ๋ธ๋กœ ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.

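One common way to train a classification head and a regression head together is to add their losses. The sketch below shows that for a single example in plain Python. The card does not state which losses or weights were used, so cross-entropy for the emotion, squared error for the intensity, and the `intensity_weight` parameter are assumptions rather than the author's recipe.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of target_index under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def multitask_loss(logits, target_index, intensity_pred, intensity_target, intensity_weight=1.0):
    """Classification loss plus weighted squared error on the intensity (assumed form)."""
    classification = cross_entropy(logits, target_index)
    regression = (intensity_pred - intensity_target) ** 2
    return classification + intensity_weight * regression


if __name__ == "__main__":
    # Uniform logits over 7 classes give a classification loss of log(7).
    print(multitask_loss([0.0] * 7, target_index=3, intensity_pred=1.5, intensity_target=2.0))
```

In a PyTorch training step the same idea would typically be written with `nn.CrossEntropyLoss` and `nn.MSELoss` applied to the two outputs of `forward`.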
#### Training Hyperparameters

| Hyperparameter      | Base    |
|:--------------------|---------|
| Learning Rate       | 1e-5    |
| Learning Rate Decay | 0.8     |
| Batch Size          | 8       |
| Weight Decay        | 0.01    |
| Epochs              | 30      |
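
To get a feel for these numbers, the sketch below computes the number of optimizer steps and one possible learning-rate schedule. It rests on two assumptions the card does not confirm: that all 7,000 samples are used for training with no held-out split, and that the decay factor is applied multiplicatively once per epoch (a layer-wise decay is another plausible reading). The function names are illustrative.

```python
import math

TRAIN_EXAMPLES = 7000  # assumes every sample is used for training (no held-out split)
BATCH_SIZE = 8
EPOCHS = 30
BASE_LR = 1e-5
LR_DECAY = 0.8


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps per epoch without gradient accumulation."""
    return math.ceil(num_examples / batch_size)


def lr_at_epoch(epoch, base_lr=BASE_LR, decay=LR_DECAY):
    """Learning rate for a zero-based epoch under per-epoch multiplicative decay (assumed)."""
    return base_lr * decay ** epoch


if __name__ == "__main__":
    per_epoch = steps_per_epoch(TRAIN_EXAMPLES, BATCH_SIZE)
    print(per_epoch, per_epoch * EPOCHS)  # 875 26250
    for epoch in (0, 1, 29):
        print(epoch, lr_at_epoch(epoch))
```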