Automatic Speech Recognition - Flax Examples
Sequence to Sequence
The script run_flax_speech_recognition_seq2seq.py can be used to fine-tune any Flax Speech Sequence-to-Sequence Model for automatic speech recognition on one of the official speech recognition datasets or a custom dataset. This includes the Whisper model from OpenAI and warm-started Speech-Encoder-Decoder models, an example of which is included below.
Whisper Model
We can load all components of the Whisper model directly from the pretrained checkpoint, including the pretrained model weights, feature extractor and tokenizer. We only have to specify the id of the fine-tuning dataset and the necessary training hyperparameters.
The following example shows how to fine-tune the Whisper small checkpoint on the Hindi subset of the Common Voice 13 dataset. Note that before running this script you must accept the dataset's terms of use and register your Hugging Face Hub token on your device by running huggingface-cli login.
python run_flax_speech_recognition_seq2seq.py \
--model_name_or_path="openai/whisper-small" \
--dataset_name="mozilla-foundation/common_voice_13_0" \
--dataset_config_name="hi" \
--language="hindi" \
--train_split_name="train+validation" \
--eval_split_name="test" \
--output_dir="./whisper-small-hi-flax" \
--per_device_train_batch_size="16" \
--per_device_eval_batch_size="16" \
--num_train_epochs="10" \
--learning_rate="1e-4" \
--warmup_steps="500" \
--logging_steps="25" \
--generation_max_length="40" \
--preprocessing_num_workers="32" \
--dataloader_num_workers="32" \
--max_duration_in_seconds="30" \
--text_column_name="sentence" \
--overwrite_output_dir \
--do_train \
--do_eval \
--predict_with_generate \
--push_to_hub \
--use_auth_token
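The --learning_rate and --warmup_steps flags above interact through the learning-rate schedule. As a rough illustration, the sketch below assumes a linear warmup from zero to the peak rate followed by a linear decay back to zero. This is a common choice, but it is an assumption here rather than a confirmed description of the script's schedule. The function name learning_rate_at and the total_steps value are hypothetical; the real step count depends on dataset size, batch size and device count.

```python
def learning_rate_at(step, peak_lr=1e-4, warmup_steps=500, total_steps=5000):
    """Assumed schedule: linear warmup to peak_lr, then linear decay to zero.

    peak_lr and warmup_steps mirror --learning_rate and --warmup_steps.
    total_steps is a hypothetical placeholder, not a value from the script.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must be larger than warmup_steps")
    if step < warmup_steps:
        # Warmup phase: scale the peak rate by the fraction of warmup completed.
        return peak_lr * step / warmup_steps
    # Decay phase: scale the peak rate by the fraction of decay steps remaining.
    decay_steps = total_steps - warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


if __name__ == "__main__":
    for step in (0, 250, 500, 2750, 5000):
        print(step, learning_rate_at(step))
```

Under these assumptions the rate peaks at step 500 and then shrinks for the rest of training, so a longer warmup delays the point at which the model sees the full 1e-4 rate.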
On a TPU v4-8, training should take approximately 25 minutes, with a final cross-entropy loss of 0.02 and word error rate of 34%. See the checkpoint sanchit-gandhi/whisper-small-hi-flax for an example training run.
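To make the reported word error rate concrete, here is a minimal sketch of the metric for a single reference/hypothesis pair: the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The function name word_error_rate is illustrative, and the sketch skips any text normalization; it is not necessarily how the training script computes its reported figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A word error rate of 34% therefore corresponds to roughly one word-level error for every three reference words. Over a whole evaluation set, the metric is usually computed from the total error count and total reference word count rather than by averaging per-utterance values.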