Emotion Detection From Speech

This model is a fine-tuned version of DistilHuBERT that classifies emotions from audio input.

Approach

  1. Dataset: The RAVDESS dataset, comprising 1,440 audio files with 8 emotion labels: calm, happy, sad, angry, fearful, surprise, neutral, and disgust.
  2. Model Fine-Tuning: The DistilHuBERT model was fine-tuned for 7 epochs with a learning rate of 5e-5, achieving an accuracy of 98% on the test dataset.

Data Preprocessing

  • Sampling Rate: Audio files were resampled to 16 kHz to match the model's requirements.
  • Padding: Audio clips shorter than 30 seconds were zero-padded.
  • Train-Test Split: 80% of the samples were used for training, and 20% for testing.
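
The padding and split steps above can be sketched in plain Python. This is a minimal illustration, not the author's preprocessing code: the helper names `pad_to_length` and `split_train_test`, the fixed shuffle seed, and the use of plain lists for samples are assumptions, and the clips are assumed to have been resampled to 16 kHz already.

```python
import random

SAMPLE_RATE = 16_000   # Hz, as stated above
MAX_SECONDS = 30       # padding target, as stated above
TARGET_LEN = SAMPLE_RATE * MAX_SECONDS


def pad_to_length(samples, target_len=TARGET_LEN):
    """Append zeros so that a clip has at least target_len samples."""
    missing = max(0, target_len - len(samples))
    return list(samples) + [0.0] * missing


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle items reproducibly and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# A 3-second clip of placeholder samples is zero-padded to 30 seconds.
clip = [0.1] * (3 * SAMPLE_RATE)
padded = pad_to_length(clip)

# 1,440 file indices split 80/20.
train_ids, test_ids = split_train_test(range(1440))
```

With 1,440 files, an 80/20 split gives 1,152 training samples and 288 test samples.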

Model Architecture

  • DistilHuBERT: A lightweight variant of HuBERT, fine-tuned for emotion classification.
  • Fine-Tuning Setup:
    • Optimizer: AdamW
    • Loss Function: Cross-Entropy
    • Learning Rate: 5e-5
    • Warm-up Ratio: 0.1
    • Epochs: 7
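
To make the warm-up ratio concrete, the sketch below computes the learning rate at a given optimizer step. It assumes a linear warm-up followed by a linear decay to zero, which the source does not state, and a hypothetical batch size of 8; only the learning rate, warm-up ratio, and epoch count come from the setup above. The function name `lr_at_step` is illustrative.

```python
import math

BASE_LR = 5e-5        # from the setup above
WARMUP_RATIO = 0.1    # from the setup above
EPOCHS = 7            # from the setup above
TRAIN_SAMPLES = 1152  # 80% of 1,440 files
BATCH_SIZE = 8        # assumption: not stated in the source


def lr_at_step(step, total_steps, base_lr=BASE_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warm-up to base_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
```

Under these assumptions the rate climbs from zero to 5e-5 over the first tenth of training and then falls back to zero by the final step.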

Results

  • Accuracy: 0.98 on the test dataset
  • Loss: 0.10 on the test dataset
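
For readers who want to see how the two metrics relate, here is a minimal sketch of computing accuracy and mean cross-entropy loss from per-class logits. The function names and the toy logits are illustrative only; they are not the model's outputs or the author's evaluation code.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label]


def evaluate(batch_logits, labels):
    """Return (accuracy, mean cross-entropy loss) over a set of examples."""
    correct = 0
    total_loss = 0.0
    for logits, label in zip(batch_logits, labels):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        correct += int(predicted == label)
        total_loss += cross_entropy(logits, label)
    n = len(labels)
    return correct / n, total_loss / n


# Toy 3-class example (made-up logits, not model outputs).
toy_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 0.0, 0.5]]
toy_labels = [0, 1, 2]
accuracy, loss = evaluate(toy_logits, toy_labels)
```

Accuracy counts only whether the top-scoring class is right, while the loss also penalizes correct predictions made with low confidence.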

Usage

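from transformers import pipeline

# Load the fine-tuned audio-classification pipeline from the Hub
pipe = pipeline(
    "audio-classification",
    model="BilalHasan/distilhubert-finetuned-ravdess",
)

# Placeholder: replace with the path to your own audio file
path_to_your_audio = "path/to/audio.wav"

# Returns a list of {"label": ..., "score": ...} dicts
emotion = pipe(path_to_your_audio)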
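
An audio-classification pipeline returns a list of dictionaries, each with a `label` and a `score`. The sketch below shows one way to reduce that list to a single label; the helper `top_emotion`, its `min_score` threshold, and the example scores are hypothetical and not part of the model or the library.

```python
def top_emotion(predictions, min_score=0.0):
    """Return the highest-scoring label, or None if its score is below min_score."""
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else None


# Hypothetical output in the pipeline's format (scores are made up).
example = [
    {"label": "happy", "score": 0.91},
    {"label": "calm", "score": 0.05},
    {"label": "neutral", "score": 0.04},
]
print(top_emotion(example))                  # happy
print(top_emotion(example, min_score=0.95))  # None
```

A threshold like `min_score` can be useful when low-confidence predictions should be treated as "unknown" rather than forced into one of the eight classes.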

Demo

You can access the live demo of the app on Hugging Face Spaces.
