Word-level timestamps & speaker identification

#29
by chan-K - opened

Are there any functions for time stamp and speaker identification?

Timestamp prediction should be supported in the official Transformers release in the coming weeks; however, speaker identification was not really part of the official release. Pretty sure you could train the model to predict a speaker_token before the predicted transcription!
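To make the speaker-token idea concrete, here is a minimal sketch of how training targets could be formatted and how predictions could be parsed back, assuming each utterance is already split into `(speaker_id, text)` segments. The `<|speaker_N|>` token format and the helpers `build_target` / `parse_prediction` are hypothetical, not part of Whisper or Transformers.

```python
import re
from typing import List, Tuple

# Hypothetical speaker-token format; not a real Whisper special token.
SPEAKER_TOKEN = "<|speaker_{}|>"
SPEAKER_RE = re.compile(r"<\|speaker_(\d+)\|>")


def build_target(segments: List[Tuple[int, str]]) -> str:
    """Prefix each segment's transcription with its speaker token."""
    return " ".join(f"{SPEAKER_TOKEN.format(spk)} {text.strip()}" for spk, text in segments)


def parse_prediction(decoded: str) -> List[Tuple[int, str]]:
    """Split a decoded string back into (speaker_id, text) pairs."""
    # With a capturing group, re.split returns [prefix, spk1, text1, spk2, text2, ...]
    parts = SPEAKER_RE.split(decoded)
    return [(int(parts[i]), parts[i + 1].strip()) for i in range(1, len(parts), 2)]


target = build_target([(0, "Hello there."), (1, " Hi!")])
print(target)
print(parse_prediction(target))
```

For actual fine-tuning you would presumably also register these strings with the tokenizer (e.g. `tokenizer.add_tokens(..., special_tokens=True)`) and call `model.resize_token_embeddings(len(tokenizer))`; whether the model learns reliable speaker labels this way is untested here.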

@ArthurZ are you referring to chunk-level timestamps (as in the original Whisper repo) or word-level timestamps?

@tdeboissiere That's what I wanted to ask. I mean word-level timestamps.

Has anyone found any further information on this subject?

@tdeboissiere @chan-K Any further info regarding word-level timestamps?

Hey all! @ArthurZ is integrating timestamp prediction into Transformers and should have it finished fairly shortly: https://github.com/huggingface/transformers/pull/20620#issuecomment-1344452967
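Once timestamp prediction is in Transformers, the ASR pipeline can be called with `return_timestamps=True` (chunk level) or, in recent versions, `return_timestamps="word"`; the output is a dict whose `"chunks"` entries each carry a `"text"` and a `(start, end)` `"timestamp"` tuple in seconds. As a sketch assuming that output shape, the helper below turns the chunks into SRT subtitles. `chunks_to_srt` and `to_srt_time` are illustrative names, not library functions.

```python
from typing import Dict, List


def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def chunks_to_srt(chunks: List[Dict]) -> str:
    """Convert pipeline-style chunks into numbered SRT cues."""
    cues = []
    for i, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # the last chunk can lack an end time; fall back to its start
            end = start
        cues.append(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{chunk['text'].strip()}\n")
    return "\n".join(cues)


example = [
    {"text": " Hello", "timestamp": (0.0, 0.5)},
    {"text": " world.", "timestamp": (0.5, None)},
]
print(chunks_to_srt(example))
```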

For word-level timestamps, you can check out the WhisperX repo: https://github.com/m-bain/whisperX
This workflow combines Whisper's sequence-level timestamps with word-level timestamps from a CTC model to give accurate timestamps and text predictions.
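The word-level part of this kind of workflow rests on CTC forced alignment: given per-frame log-probabilities from a CTC acoustic model and the transcript mapped to that model's vocabulary ids, find the best monotonic path and read off which frames each token occupies. Below is a minimal pure-Python Viterbi sketch of that step, not WhisperX's actual code; `ctc_forced_align` is a hypothetical helper and blank id 0 is an assumption.

```python
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def ctc_forced_align(
    log_probs: Sequence[Sequence[float]],
    targets: Sequence[int],
    blank: int = 0,
) -> List[Tuple[int, int]]:
    """Return a (start_frame, end_frame) span per target token; end is exclusive."""
    if not targets:
        return []
    # Interleave blanks: [blank, t1, blank, t2, ..., tN, blank]
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    num_frames, num_states = len(log_probs), len(ext)

    score = [[NEG_INF] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            # Predecessors: stay, advance one state, or skip a blank between different labels
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(s - 2)
            best_prev = max(candidates, key=lambda p: score[t - 1][p])
            if score[t - 1][best_prev] == NEG_INF:
                continue
            score[t][s] = score[t - 1][best_prev] + log_probs[t][ext[s]]
            back[t][s] = best_prev

    # A valid path ends on the trailing blank or on the last label
    last = max(num_states - 1, num_states - 2, key=lambda s: score[-1][s])
    if score[-1][last] == NEG_INF:
        raise ValueError("too few frames to align all target tokens")

    # Backtrack to recover the state occupied at every frame
    states = [0] * num_frames
    s = last
    for t in range(num_frames - 1, -1, -1):
        states[t] = s
        s = back[t][s]

    # Target k sits at state 2k + 1; its frames are contiguous because the path is monotonic
    spans = []
    for k in range(len(targets)):
        frames = [t for t, st in enumerate(states) if st == 2 * k + 1]
        spans.append((frames[0], frames[-1] + 1))
    return spans
```

Frame spans become seconds by multiplying by the acoustic model's frame stride (about 20 ms for wav2vec2-style models, but check your model), and character spans can then be grouped into words at separator tokens.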

Here is a repository to estimate word-level timestamps and confidence scores with Whisper: https://github.com/Jeronymous/whisper-timestamped

Unlike WhisperX, this approach does not need an additional wav2vec model, so it should be more robust.
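As far as I know, whisper-timestamped works from Whisper's own cross-attention weights using Dynamic Time Warping (DTW) instead of a second model. A minimal sketch of the DTW step, assuming you already have a tokens × frames cost matrix (for example the negated, normalized cross-attention weights); `dtw_path` and `token_spans` are hypothetical helpers, not that repo's API.

```python
from typing import List, Sequence, Tuple


def dtw_path(cost: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Lowest-cost monotonic path from (0, 0) to (n-1, m-1) through a cost matrix."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack; on ties the diagonal step (0) is preferred
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((acc[i - 1][j - 1], 0), (acc[i - 1][j], 1), (acc[i][j - 1], 2))[1]
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def token_spans(path: List[Tuple[int, int]], num_tokens: int) -> List[Tuple[int, int]]:
    """(first_frame, last_frame + 1) for each token row on the path."""
    spans = []
    for tok in range(num_tokens):
        frames = [j for i, j in path if i == tok]
        spans.append((min(frames), max(frames) + 1))
    return spans
```

Whisper's encoder frames are 20 ms apart, so frame spans convert to seconds with a factor of 0.02; word timestamps then follow from the first and last token of each word.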

That's very cool @Jeronymous ! Gonna check out the repo 🙌

@Jeronymous: just wanted to say thank you for this repo. Great work!

@sanchit-gandhi How can one highlight words while running WhisperX locally?

@sanchit-gandhi This should also alleviate some of Whisper's timestamp issues, especially around pauses. It would be cool to also have this evaluated on the ASR leaderboard. We also found that removing symbols with no clear acoustic representation, such as punctuation, from the DTW alignment improves results slightly, even for the original models. We will open a PR in the future :)
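A minimal sketch of the punctuation-removal idea, assuming a list of decoded token strings and a matching tokens × frames DTW cost matrix: punctuation-only tokens are dropped before alignment so they cannot claim audio frames. `drop_punctuation_rows` is a hypothetical helper, not the authors' implementation, and "punctuation" here means Unicode categories starting with `P`.

```python
import unicodedata
from typing import List, Sequence, Tuple


def is_punctuation(token: str) -> bool:
    """True if the token, ignoring surrounding whitespace, is only punctuation."""
    stripped = token.strip()
    return bool(stripped) and all(unicodedata.category(ch).startswith("P") for ch in stripped)


def drop_punctuation_rows(
    tokens: Sequence[str], cost: Sequence[Sequence[float]]
) -> Tuple[List[int], List[Sequence[float]]]:
    """Return the kept token indices and the cost-matrix rows to feed into DTW."""
    kept = [i for i, tok in enumerate(tokens) if not is_punctuation(tok)]
    return kept, [cost[i] for i in kept]


kept, rows = drop_punctuation_rows(["Hello", ",", " world", "."], [[0], [1], [2], [3]])
print(kept, rows)
```

After aligning the remaining tokens, one option is to give each dropped punctuation token the end time of the preceding word; other choices are possible.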

Accompanying Interspeech paper: https://arxiv.org/abs/2408.16589

Some further explanations of how the final model was created: https://huggingface.co/nyrahealth/CrisperWhisper

Model: https://github.com/nyrahealth/CrisperWhisper/tree/main
