USCD REACH

university

http://repmus.ircam.fr/reach

Activity Feed Request to join this org

AI & ML interests

None defined yet.

Recent Activity

haoheliu authored a paper 5 months ago

Efficient Audio Captioning with Encoder-Level Knowledge Distillation

haoheliu authored a paper 8 months ago

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

haoheliu authored a paper 8 months ago

FlashSpeech: Efficient Zero-Shot Speech Synthesis

View all activity

ucsd-reach's activity

haoheliu

authored a paper 5 months ago

Efficient Audio Captioning with Encoder-Level Knowledge Distillation

Paper • 2407.14329 • Published Jul 19 • 4

haoheliu

authored 2 papers 8 months ago

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

Paper • 2405.00233 • Published Apr 30 • 13

FlashSpeech: Efficient Zero-Shot Speech Synthesis

Paper • 2404.14700 • Published Apr 23 • 29

MasalaDosa1337

authored a paper 9 months ago

Alt-Text with Context: Improving Accessibility for Images on Twitter

Paper • 2305.14779 • Published May 24, 2023

sanchit-gandhi

posted an update 10 months ago

Post

Why does returning timestamps help Whisper reduce hallucinations? 🧐

Empirically, most practitioners have found that setting return_timestamps=True helps reduce hallucinations, particularly when doing long-form evaluation with Transformers’ “chunked” algorithm.

But why does this work?..

My interpretation is that forcing the model to predict timestamps is contradictory to hallucinations. Suppose you have the transcription:

The cat sat on the on the on the mat.

Where we have a repeated hallucination for “on the”. If we ask the model to predict timestamps, then the “on the” has to contribute to the overall segment-level timing, e.g.:

<|0.00|> The cat sat on the on the on the mat.<|5.02|>

However, it’s impossible to fit 3 copies of “on the” within the time allocation given to the segment, so the probability for this hallucinatory sequence becomes lower, and the model actually predicts the correct transcription with highest probability:

<|0.00|> The cat sat on the mat.<|5.02|>

In this sense, the end timestamp is of the opposite of the initial timestamp constraint they describe in Section 4.5 of the paper Robust Speech Recognition via Large-Scale Weak Supervision (2212.04356) → it helps the model remove extra words at the end of the sequence (rather than the initial timestamp which helps when the model ignores words at the start), but the overall principle is the same (using timestamps to improve the probability of more realistic sequences).

Leaving it open to you: why do you think timestamps reduces Whisper hallucinations?