---
license: apache-2.0
tags:
- automatic-speech-recognition
- smi
- sami
library_name: transformers
language: fi
base_model:
- GetmanY1/wav2vec2-base-sami-22k
model-index:
- name: wav2vec2-base-sami-22k-finetuned
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sami-1h-test
      type: sami-1h-test
      args: fi
    metrics:
    - name: Test WER
      type: wer
      value: 43.07
    - name: Test CER
      type: cer
      value: 16.5
---

# Sámi Wav2vec2-Base ASR

[GetmanY1/wav2vec2-base-sami-22k](https://huggingface.co/GetmanY1/wav2vec2-base-sami-22k) fine-tuned on 20 hours of 16kHz sampled speech audio from the [Sámi Parliament sessions](https://sametinget.kommunetv.no/archive).

When using the model, make sure that your speech input is also sampled at 16kHz.
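
If your audio is stored at a different sampling rate, it can be resampled on the fly. Below is a minimal sketch using 🤗 Datasets; the dataset name is a placeholder for your own data:

```
from datasets import load_dataset, Audio

# placeholder dataset; use any dataset that has an "audio" column
ds = load_dataset("your_username/your_sami_speech_dataset", split="test")

# decode and resample the audio to 16kHz whenever it is accessed
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
```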

## Model description

The Sámi Wav2Vec2 Base has the same architecture and uses the same training objective as the English one described in [Paper](https://arxiv.org/abs/2006.11477).

[GetmanY1/wav2vec2-base-sami-22k](https://huggingface.co/GetmanY1/wav2vec2-base-sami-22k) is a large-scale, 95-million-parameter monolingual model pre-trained on 22.4k hours of unlabeled Sámi speech from [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/).
You can read more about the pre-trained model in [this paper](TODO).

The model was evaluated on 1 hour of out-of-domain read-aloud and spontaneous speech of varying audio quality.
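
Word and character error rates like those reported for this model can be computed with the 🤗 Evaluate library once you have paired reference transcripts and model predictions; the strings below are placeholders, not the actual test data:

```
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# placeholder transcripts; in practice, use your test-set references and model outputs
references = ["dat lea buorre beaivi"]
predictions = ["dat lea buorre beaivvi"]

print("WER:", wer_metric.compute(references=references, predictions=predictions))
print("CER:", cer_metric.compute(references=references, predictions=predictions))
```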

## Intended uses

You can use this model for Sámi ASR (speech-to-text). 

### How to use

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-base-sami-22k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-base-sami-22k-finetuned")

# ds is assumed to be a dataset with an "audio" column sampled at 16kHz, e.g.
# ds = load_dataset(...)

# extract input features (batch size 1)
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
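
Alternatively, the model can be wrapped in the Transformers ASR `pipeline`, which handles feature extraction and greedy decoding internally; a minimal sketch (the audio file path is a placeholder):

```
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="GetmanY1/wav2vec2-base-sami-22k-finetuned")

# "sample.wav" is a placeholder; any 16kHz mono recording works
print(asr("sample.wav")["text"])
```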

### Prefix Beam Search

In our experiments (see [paper](TODO)), we observed a slight improvement in Character Error Rate (CER) when using prefix beam search compared to greedy decoding, primarily due to a reduction in deletions. Below is our adapted version of [corticph/prefix-beam-search](https://github.com/corticph/prefix-beam-search) for use with wav2vec 2.0 in Hugging Face Transformers.
Note that an external language model (LM) **is not required**, as the function defaults to a uniform probability when none is provided.

```
import re
import numpy as np
import torch
from collections import defaultdict, Counter

def prefix_beam_search(ctc, lm=None, k=25, alpha=0.30, beta=5, prune=0.001):
	"""
	Performs prefix beam search on the output of a CTC network.

	Args:
		ctc (np.ndarray): The CTC output. Should be a 2D array (timesteps x alphabet_size)
		lm (func): Language model function. Should take as input a string and output a probability.
		k (int): The beam width. Will keep the 'k' most likely candidates at each timestep.
		alpha (float): The language model weight. Should usually be between 0 and 1.
		beta (float): The language model compensation term. The higher the 'alpha', the higher the 'beta'.
		prune (float): Only extend prefixes with chars with an emission probability higher than 'prune'.

	Returns:
		string: The decoded CTC output.
	"""

	lm = (lambda l: 1) if lm is None else lm # if no LM is provided, just set to function returning 1
	W = lambda l: re.findall(r'\w+[\s|>]', l)
	alphabet = list({k: v for k, v in sorted(processor.tokenizer.vocab.items(), key=lambda item: item[1])})
	alphabet = list(map(lambda x: x.replace(processor.tokenizer.special_tokens_map['eos_token'], '>') \
						.replace(processor.tokenizer.special_tokens_map['pad_token'], '%') \
						.replace('|', ' '), alphabet))


	F = ctc.shape[1]
	ctc = np.vstack((np.zeros(F), ctc)) # just add an imaginative zero'th step (will make indexing more intuitive)
	T = ctc.shape[0]

	# STEP 1: Initialization
	O = ''
	Pb, Pnb = defaultdict(Counter), defaultdict(Counter)
	Pb[0][O] = 1
	Pnb[0][O] = 0
	A_prev = [O]
	# END: STEP 1

	# STEP 2: Iterations and pruning
	for t in range(1, T):
		pruned_alphabet = [alphabet[i] for i in np.where(ctc[t] > prune)[0]]
		for l in A_prev:
			if len(l) > 0 and l.endswith('>'):
				Pb[t][l] = Pb[t - 1][l]
				Pnb[t][l] = Pnb[t - 1][l]
				continue  
			for c in pruned_alphabet:
				c_ix = alphabet.index(c)
				# END: STEP 2
				
				# STEP 3: “Extending” with a blank
				if c == '%':
					Pb[t][l] += ctc[t][0] * (Pb[t - 1][l] + Pnb[t - 1][l]) 
				# END: STEP 3
				
				# STEP 4: Extending with the end character
				else:
					l_plus = l + c
					if len(l) > 0 and l.endswith(c):
						Pnb[t][l_plus] += ctc[t][c_ix] * Pb[t - 1][l]
						Pnb[t][l] += ctc[t][c_ix] * Pnb[t - 1][l]
				# END: STEP 4

					# STEP 5: Extending with any other non-blank character and LM constraints
					elif len(l.replace(' ', '')) > 0 and c in (' ', '>'):
						lm_prob = lm(l_plus.strip(' >')) ** alpha
						Pnb[t][l_plus] += lm_prob * ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
					else:
						Pnb[t][l_plus] += ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
					# END: STEP 5

					# STEP 6: Make use of discarded prefixes
					if l_plus not in A_prev:
						Pb[t][l_plus] += ctc[t][0] * (Pb[t - 1][l_plus] + Pnb[t - 1][l_plus])
						Pnb[t][l_plus] += ctc[t][c_ix] * Pnb[t - 1][l_plus]
					# END: STEP 6

		# STEP 7: Select most probable prefixes
		A_next = Pb[t] + Pnb[t]
		sorter = lambda l: A_next[l] * (len(W(l)) + 1) ** beta
		A_prev = sorted(A_next, key=sorter, reverse=True)[:k]
		# END: STEP 7

	return A_prev[0].strip('>')

def map_to_pred_prefix_beam_search(batch):
    # assumes the dataset has a "speech" column holding 16kHz waveform arrays
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)  # make sure the model and the inputs live on the same device
    input_values = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    probs = torch.softmax(logits, dim=-1)
    transcription = [prefix_beam_search(probs[0].cpu().numpy(), lm=None)]
    batch["transcription"] = transcription
    return batch

result = ds.map(map_to_pred_prefix_beam_search, batched=True, batch_size=1, remove_columns=["speech"])
```
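
If you do have an external language model, any callable that maps a string prefix to a probability can be passed as `lm`. A hypothetical sketch using the KenLM Python bindings (the model path and function name are illustrative, not part of this repository):

```
import kenlm

# hypothetical word-level n-gram model trained on Sámi text
kenlm_model = kenlm.Model("sami_word_3gram.arpa")

def sami_lm(prefix):
    # KenLM returns a log10 probability; prefix_beam_search expects a plain probability
    return 10 ** kenlm_model.score(prefix, bos=False, eos=False)

# then call prefix_beam_search(probs[0].cpu().numpy(), lm=sami_lm) inside map_to_pred_prefix_beam_search
```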

## Team Members

- Yaroslav Getman, [Hugging Face profile](https://huggingface.co/GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/)
- Tamas Grosz, [Hugging Face profile](https://huggingface.co/Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/)

Feel free to contact us for more details 🤗