DiCoW v3.3 – Target-Speaker ASR
This repository hosts DiCoW v3.3, a Target-Speaker ASR (TS-ASR) model developed by BUT Speech@FIT. It is designed to transcribe the speech of a specific speaker within a multi-talker mixture by conditioning on speaker diarization outputs.
This model version incorporates the refinements and training strategies described in the paper SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper.
What's New in v3.3?
This version represents a significant stabilization and enhancement over the original DiCoW (v1):
- Improved Conditioning: Introduces FDDT (Frame-Level Diarization-Dependent Transformation) layers before the positional embeddings for better signal modulation (see the sketch after this list).
- Reduced Error: Achieves a ~50% relative reduction in tcpWER on Libri3Mix compared to v1.
- Training Stability: Uses less suppressive initialization and flexible data segmentation (no forced end-timestamps).
- Robustness: Trained with STNO noise injection and SpecAugment to handle imperfect diarization.
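To make the FDDT idea concrete, here is a minimal sketch of a diagonal FDDT-style layer: each STNO class gets its own element-wise affine transform of the encoder frames, and the per-frame class probabilities mix the results. The class count, tensor shapes, and identity initialization are simplifying assumptions for illustration; the actual layers are implemented in the TS-ASR-Whisper training code.

```python
import torch
import torch.nn as nn

class DiagonalFDDT(nn.Module):
    """Illustrative frame-level diarization-dependent transformation (sketch).

    Each STNO class (silence, target, non-target, overlap) gets its own
    element-wise affine transform; the outputs are mixed by the per-frame
    class probabilities. Initialized at identity here so the pretrained
    encoder is unchanged at the start of training (a simplification).
    """

    def __init__(self, d_model: int, num_classes: int = 4):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_classes, d_model))   # per-class scaling
        self.shift = nn.Parameter(torch.zeros(num_classes, d_model))  # per-class bias

    def forward(self, hidden: torch.Tensor, stno: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, frames, d_model) encoder features
        # stno:   (batch, frames, num_classes) per-frame STNO probabilities
        per_class = hidden.unsqueeze(2) * self.scale + self.shift     # (B, T, K, D)
        return (stno.unsqueeze(-1) * per_class).sum(dim=2)            # (B, T, D)
```

In DiCoW, transformations of this kind are inserted into the Whisper encoder (in v3.3, before the positional embeddings, as noted above), so target-speaker selection happens directly in feature space.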
Quick Usage
1. Run Interactive Demo (Gradio)
The easiest way to use this model is via the DiCoW inference repository. We provide a Gradio app that handles diarization and STNO mask generation automatically:
```bash
python app.py
```
2. Load in Python
If you want to download and load the model manually for your own scripts:
```python
from transformers import AutoModelForSpeechSeq2Seq

# Load the model (requires remote code for the custom FDDT layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/DiCoW_v3_3",
    trust_remote_code=True,
)

# Note: the model expects specific STNO conditioning inputs.
# See inference.py in the GitHub repo for the full pipeline.
```
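For orientation only, the sketch below shows how the pieces could be wired together. The processor choice (openai/whisper-large-v3-turbo), the mask shape, and the `stno_mask` keyword are illustrative assumptions, not the model's documented API; the authoritative pipeline is inference.py in the GitHub repository.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Assumption: feature extraction follows the Whisper base model.
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3-turbo")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/DiCoW_v3_3", trust_remote_code=True
)

# Placeholder 30 s mono waveform at 16 kHz (replace with real audio).
audio = torch.zeros(16000 * 30)
features = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

# Hypothetical STNO conditioning: per-frame probabilities for
# (silence, target, non-target, overlap) derived from diarization.
num_frames = 1500  # Whisper encoder frames for a 30 s window
stno_mask = torch.zeros(1, num_frames, 4)
stno_mask[..., 0] = 1.0  # all-silence placeholder

# The exact argument name and layout are defined by the model's remote code;
# `stno_mask` is an assumed name used here for illustration only.
generated = model.generate(features.input_features, stno_mask=stno_mask)
print(processor.batch_decode(generated, skip_special_tokens=True))
```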
Want to build your own DiCoW?
It's all yours with just two commands! This model is fully open-source and reproducible using our toolkit.
1. Data Preparation
Clone the mt-asr-data-prep repository and run the setup script to generate the required manifests:
```bash
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```
2. Training
Clone the training repository TS-ASR-Whisper and launch the experiment using the pre-configured dicow_v3 recipe:
```bash
sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=dicow_v3
```
Performance Snapshot (tcpWER)
Metric: Time-Constrained Minimum Permutation WER (5s collar)
| Dataset | DiCoW v1 (Baseline) | DiCoW v3.3 (This Model) |
|---|---|---|
| Libri2Mix (Both) | 21.6% | 9.7% |
| LibriSpeechMix (2) | 17.9% | 3.1% |
| AMI (SDM) | 21.4% | 18.7% |
| NOTSOFAR-1 (Small-SC) | 29.8% | 26.6% |
Scores are based on DiariZen diarization; see the paper for real-diarization results. View Full Leaderboard
Model Details
- Base Architecture: Whisper large-v3-turbo
- Conditioning: Frame-Level Diarization-Dependent Transformations (FDDT)
- Input: 30 s audio + 4-channel STNO (silence, target, non-target, overlap) mask; see the sketch after this list
- Training Data: AMI, NOTSOFAR-1, LibriMix (2/3 spk), Synthetic LibriSpeech Mixtures.
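As a rough illustration of how the 4-channel STNO mask relates to diarization output, the toy function below converts per-speaker frame activities into the four classes for one chosen target speaker. The hard one-hot encoding, frame resolution, and function name are assumptions; the real mask generation (including the STNO noise injection used during training) lives in the DiCoW / TS-ASR-Whisper repositories.

```python
import numpy as np

def stno_mask_from_diarization(activity: np.ndarray, target_idx: int) -> np.ndarray:
    """Toy STNO mask builder (illustrative only).

    activity: (num_speakers, num_frames) binary speech activity from diarization.
    Returns:  (num_frames, 4) one-hot classes in the order
              silence, target-only, non-target, overlap.
    """
    target = activity[target_idx].astype(bool)
    others = activity.astype(bool).copy()
    others[target_idx] = False
    any_other = others.any(axis=0)

    mask = np.zeros((activity.shape[1], 4), dtype=np.float32)
    mask[~target & ~any_other, 0] = 1.0  # silence
    mask[target & ~any_other, 1] = 1.0   # target speaking alone
    mask[~target & any_other, 2] = 1.0   # only non-target speakers
    mask[target & any_other, 3] = 1.0    # overlap involving the target
    return mask

# Example: 2 speakers, 6 frames; build the mask for speaker 0.
activity = np.array([[1, 1, 0, 0, 1, 0],
                     [0, 1, 1, 0, 1, 0]])
print(stno_mask_from_diarization(activity, target_idx=0))
```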
Limitations
- Diarization Dependent: Performance is heavily dependent on the quality of the input diarization.
- Ambiguity: In scenarios with >2 fully overlapping speakers, the model may struggle to distinguish the target (addressed in the SE-DiCoW model).
Citations
If you use this model, please cite the following papers:
```bibtex
@article{polok2026sedicow,
  title   = {SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  author  = {Alexander Polok and Dominik Klement and Samuele Cornell and Matthew Wiesner and Jan Černocký and Sanjeev Khudanpur and Lukáš Burget},
  journal = {arXiv preprint arXiv:2601.19194},
  year    = {2026}
}

@article{POLOK2026101841,
  title   = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  author  = {Alexander Polok and others},
  journal = {Computer Speech \& Language},
  volume  = {95},
  year    = {2026},
  doi     = {10.1016/j.csl.2025.101841}
}

@inproceedings{10887683,
  title     = {Target Speaker ASR with Whisper},
  author    = {Polok, Alexander and others},
  booktitle = {ICASSP 2025},
  year      = {2025},
  doi       = {10.1109/ICASSP49660.2025.10887683}
}
```
Contact
- Issues: GitHub Issues
- Email: [email protected]