---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: audio-classification
tags:
- wavlm
- wav2vec2
- msp-podcast
- emotion-recognition
- speech
- valence
- arousal
- dominance
- speech-emotion-recognition
- dkounadis
---

# Arousal - Dominance - Valence

Dimensional Speech Emotion Recognition model that runs [WavLM](https://huggingface.co/3loi/SER-Odyssey-Baseline-WavLM-Multi-Attributes) and [Wav2Vec2.0](https://hf.rst.im/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) simultaneously and averages their predictions.

Achieves `0.6760566` valence CCC on [MSP Podcast Test 1](https://paperswithcode.com/sota/speech-emotion-recognition-on-msp-podcast). Used as the teacher model for [Wav2Small](https://arxiv.org/abs/2408.13920).
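
CCC is the Concordance Correlation Coefficient, the standard agreement metric for dimensional emotion labels. A minimal NumPy sketch of the metric for reference (illustrative only, not part of this model's code):

```python
import numpy as np

def ccc(pred: np.ndarray, true: np.ndarray) -> float:
    """Concordance Correlation Coefficient: 1 = perfect agreement, 0 = none."""
    mu_p, mu_t = pred.mean(), true.mean()
    cov = ((pred - mu_p) * (true - mu_t)).mean()
    return 2 * cov / (pred.var() + true.var() + (mu_p - mu_t) ** 2)
```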

**[PapersWithCode](https://paperswithcode.com/dataset/msp-podcast)**

```
Wav2Small: Distilling Wav2Vec2 to 72K parameters for Low-Resource Speech Emotion Recognition.
Dionyssos Kounadis-Bastian, Oliver Schrüfer, Anna Derington, Hagen Wierstorf,
Florian Eyben, Felix Burkhardt, Björn Schuller.
2024, arXiv preprint
```

<table style="width:500px">
<tr><th colspan=6 align="center">CCC MSP Podcast v1.7</th></tr>
<tr><th colspan=3 align="center">Test 1</th><th colspan=3 align="center">Test 2</th></tr>
<tr> <td>Val</td> <td>Dom</td> <td>Aro</td> <td>Val</td> <td>Dom</td> <td>Aro</td> </tr>
<tr> <td>0.6760566</td> <td>0.6840044</td> <td>0.7620181</td> <td>0.4229267</td> <td>0.4684658</td> <td>0.4857733</td> </tr>
</table>

# How To

```python
import librosa
import torch
import types
import torch.nn as nn
from transformers import AutoModelForAudioClassification
from transformers.models.wav2vec2.modeling_wav2vec2 import (Wav2Vec2Model,
                                                            Wav2Vec2PreTrainedModel)

# load mono audio at the 16 kHz both backbones expect; shape (1, num_samples)
signal = torch.from_numpy(
    librosa.load('test.wav', sr=16000)[0])[None, :]
device = 'cpu'


class ADV(nn.Module):
    """Regression head predicting arousal, dominance, valence."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        x = self.dense(x)
        x = torch.tanh(x)
        return self.out_proj(x)


class Dawn(Wav2Vec2PreTrainedModel):
    r"""https://arxiv.org/abs/2203.07378"""

    def __init__(self, config):
        super().__init__(config)
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = ADV(config)

    def forward(self, x):
        # z-normalise the waveform out-of-place, so the caller's tensor
        # is left untouched for the second model in the ensemble
        x = x - x.mean(1, keepdim=True)
        variance = (x * x).mean(1, keepdim=True) + 1e-7
        x = self.wav2vec2(x / variance.sqrt())
        return self.classifier(x.last_hidden_state.mean(1))


def _forward(self, x):
    '''x: (batch, audio-samples-16KHz)'''
    # scale waveform using the normalisation stats stored in the model config
    x = (x + self.config.mean) / self.config.std
    x = self.ssl_model(x, attention_mask=None).last_hidden_state
    # attentive statistics pooling: weighted mean and weighted std
    h = self.pool_model.sap_linear(x).tanh()
    w = torch.matmul(h, self.pool_model.attention).softmax(1)
    mu = (x * w).sum(1)
    x = torch.cat(
        [
            mu,
            ((x * x * w).sum(1) - mu * mu).clamp(min=1e-7).sqrt()
        ], 1)
    return self.ser_model(x)


# WavLM backbone with attention-pool head
base = AutoModelForAudioClassification.from_pretrained(
    '3loi/SER-Odyssey-Baseline-WavLM-Multi-Attributes',
    trust_remote_code=True).to(device).eval()
# swap in the explicit pooling forward defined above
base.forward = types.MethodType(_forward, base)

# Wav2Vec2 backbone
dawn = Dawn.from_pretrained(
    'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
).to(device).eval()


def wav2small(x):
    # equal-weight ensemble of the two backbones
    return .5 * dawn(x) + .5 * base(x)


pred = wav2small(signal.to(device))
print(f'Arousal={pred[0, 0]} '
      f'Dominance={pred[0, 1]} '
      f'Valence={pred[0, 2]}')
```
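
A minimal inference sketch for an arbitrary audio file (the file name `speech.wav` is a placeholder; `librosa` resamples to the 16 kHz both backbones expect, and `torch.no_grad()` skips gradient tracking):

```python
wav, _ = librosa.load('speech.wav', sr=16000, mono=True)  # placeholder file name

with torch.no_grad():  # inference only
    adv = wav2small(torch.from_numpy(wav)[None, :].to(device))

print(adv.shape)  # torch.Size([1, 3]) -> arousal, dominance, valence
```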