prosody_asrtbsc_distilbert-uncased-best

Normalized ASR translations with prosody encoding cross attention multi-label DAC

Model description

Prosody encoder: 2 layer transformer encoder with initial dense projection Backbone: DistilBert uncased
Pooling: Self attention
Multi-label classification head: 2 dense layers with two dropouts 0.3 and Tanh activation inbetween

Training and evaluation data

Trained on normalized Whisper small transcripts.
Evaluated on ground truth (GT) and normalized Whisper small transcripts (E2E).

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.001
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 20
  • mixed_precision_training: Native AMP

Framework versions

  • Transformers 4.41.1
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1
Downloads last month
41
Safetensors
Model size
2.16M params
Tensor type
F32
·
Inference API
Unable to determine this model’s pipeline type. Check the docs .

Dataset used to train Masioki/prosody_asrtbsc_distilbert-uncased-best

Evaluation results