Speech Recognition Pre-Training

Wav2Vec2 Speech Pre-Training

The script run_wav2vec2_pretraining_no_trainer.py can be used to pre-train a Wav2Vec2 model from scratch.

In run_wav2vec2_pretraining_no_trainer.py, a Wav2Vec2 model is pre-trained on audio data alone using Wav2Vec2's contrastive loss objective.
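
To make the objective concrete, the following sketch runs a single pre-training forward pass with Wav2Vec2ForPreTraining: spans of latent time steps are masked, negative (distractor) quantized representations are sampled, and the returned loss is the contrastive (plus diversity) loss that the script optimizes. It mirrors the model's documented usage with a dummy waveform and facebook/wav2vec2-base; it is an illustration, not code taken from the training script.

import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices, _sample_negative_indices

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")

# a dummy batch with one 2-second, 16 kHz waveform stands in for real LibriSpeech audio
input_values = feature_extractor(torch.randn(32000).numpy(), sampling_rate=16000, return_tensors="pt").input_values

batch_size, raw_sequence_length = input_values.shape
sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length).item()

# mask spans of latent time steps (cf. --mask_time_prob and --mask_time_length in the commands below)
mask_time_indices = _compute_mask_indices(shape=(batch_size, sequence_length), mask_prob=0.65, mask_length=10)
# sample negative quantized vectors that the model must distinguish from the true (masked) targets
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, sequence_length),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)
mask_time_indices = torch.tensor(mask_time_indices, device=input_values.device, dtype=torch.long)
sampled_negative_indices = torch.tensor(sampled_negative_indices, device=input_values.device, dtype=torch.long)

outputs = model(input_values, mask_time_indices=mask_time_indices, sampled_negative_indices=sampled_negative_indices)
print(outputs.loss)  # contrastive loss plus weighted codebook-diversity loss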

The following examples show how to pre-train a "base"-sized Wav2Vec2 model as well as a "large"-sized Wav2Vec2 model using accelerate.


NOTE 1

Wav2Vec2's pre-training is known to be quite unstable. It is advised to do a couple of test runs with a smaller dataset, i.e. --dataset_config_names clean clean, --dataset_split_names validation test, to find good hyper-parameters for learning_rate, batch_size, num_warmup_steps, and the optimizer. A good metric to observe during training is the gradient norm, which should ideally stay between 0.5 and 2.
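
As a reference for what is meant by the gradient norm, a small standalone helper along the following lines can log the global L2 norm of all gradients after the backward pass. This is only a sketch (the names grad_norm and scale are ours, not taken from the script):

import torch

def grad_norm(parameters, scale=1.0):
    """Global L2 norm of all gradients, optionally undoing a loss-scaling factor."""
    total = 0.0
    for p in parameters:
        if p.grad is not None:
            total += (p.grad.detach() / scale).norm(2).item() ** 2
    return total ** 0.5

# usage inside a training loop, after the backward pass:
# norm = grad_norm(model.parameters())  # ideally stays roughly between 0.5 and 2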



NOTE 2

When training a model on large datasets, it is recommended to run the data preprocessing in a first, non-distributed run via --preprocessing_only, so that in a second, distributed run the already preprocessed data can easily be loaded on each device.


Demo

In this demo run we pre-train a "base"-sized Wav2Vec2 model only on the validation and test data of librispeech_asr.

The demo is run on two Titan RTX GPUs (24 GB RAM each). In case you have less RAM available per device, consider reducing --per_device_train_batch_size and/or --max_duration_in_seconds.

accelerate launch run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name="librispeech_asr" \
    --dataset_config_names clean clean \
    --dataset_split_names validation test \
    --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
    --output_dir="./wav2vec2-pretrained-demo" \
    --max_train_steps="20000" \
    --num_warmup_steps="32000" \
    --gradient_accumulation_steps="8" \
    --learning_rate="0.005" \
    --weight_decay="0.01" \
    --max_duration_in_seconds="20.0" \
    --min_duration_in_seconds="2.0" \
    --logging_steps="1" \
    --saving_steps="10000" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="8" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --adam_epsilon="1e-06" \
    --gradient_checkpointing \
    --mask_time_prob="0.65" \
    --mask_time_length="10"

The results of this run can be seen here.
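
Assuming the run above has saved its final model into the --output_dir given in the command (./wav2vec2-pretrained-demo), the resulting checkpoint can be reloaded for inspection or as the starting point for fine-tuning. A minimal sketch:

from transformers import Wav2Vec2ForPreTraining, Wav2Vec2Model

# reload the full pre-training model (encoder + quantizer) from the demo output directory
model = Wav2Vec2ForPreTraining.from_pretrained("./wav2vec2-pretrained-demo")

# or keep only the pre-trained encoder, e.g. as the backbone for downstream fine-tuning
encoder = Wav2Vec2Model.from_pretrained("./wav2vec2-pretrained-demo")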

Base

To pre-train a "base"-sized Wav2Vec2 model, e.g. facebook/wav2vec2-base, on librispeech_asr, the following command can be run:

accelerate launch run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name=librispeech_asr \
    --dataset_config_names clean clean other \
    --dataset_split_names train.100 train.360 train.500 \
    --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
    --output_dir="./wav2vec2-pretrained-demo" \
    --max_train_steps="200000" \
    --num_warmup_steps="32000" \
    --gradient_accumulation_steps="4" \
    --learning_rate="0.001" \
    --weight_decay="0.01" \
    --max_duration_in_seconds="20.0" \
    --min_duration_in_seconds="2.0" \
    --logging_steps="1" \
    --saving_steps="10000" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="8" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --adam_epsilon="1e-06" \
    --gradient_checkpointing \
    --mask_time_prob="0.65" \
    --mask_time_length="10"

The experiment was run on 8 V100 GPUs (16 GB RAM each) for 4 days. In case you have more than 8 GPUs available (and thus a higher effective batch size), it is recommended to increase the learning_rate to 0.005 for faster convergence.
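
For reference, the command above corresponds to an effective batch size of 8 GPUs × per_device_train_batch_size 8 × gradient_accumulation_steps 4 = 256 audio samples per optimizer update.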

The results of this run can be seen here, and the checkpoint pre-trained for 85,000 steps can be accessed here.

Large

To pre-train a "large"-sized Wav2Vec2 model, e.g. facebook/wav2vec2-large-lv60, on librispeech_asr, the following command can be run:

accelerate launch run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name=librispeech_asr \
    --dataset_config_names clean clean other \
    --dataset_split_names train.100 train.360 train.500 \
    --output_dir=./test \
    --max_train_steps=200000 \
    --num_warmup_steps=32000 \
    --gradient_accumulation_steps=8 \
    --learning_rate=0.001 \
    --weight_decay=0.01 \
    --max_duration_in_seconds=20.0 \
    --min_duration_in_seconds=2.0 \
    --model_name_or_path=./ \
    --logging_steps=1 \
    --saving_steps=10000 \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=4 \
    --adam_beta1=0.9 \
    --adam_beta2=0.98 \
    --adam_epsilon=1e-06 \
    --gradient_checkpointing \
    --mask_time_prob=0.65 \
    --mask_time_length=10

The experiment was run on 8 V100 GPUs (16 GB RAM each) for 7 days. In case you have more than 8 GPUs available (and thus a higher effective batch size), it is recommended to increase the learning_rate to 0.005 for faster convergence.
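
For reference, the command above corresponds to an effective batch size of 8 GPUs × per_device_train_batch_size 2 × gradient_accumulation_steps 8 = 128 audio samples per optimizer update.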

The results of this run can be seen here, and the checkpoint pre-trained for 120,000 steps can be accessed here.