Spaces:
Paused
Speech Recognition Pre-Training
Wav2Vec2 Speech Pre-Training
The script run_speech_wav2vec2_pretraining_no_trainer.py
can be used to pre-train a Wav2Vec2 model from scratch.
In the script run_speech_wav2vec2_pretraining_no_trainer
, a Wav2Vec2 model is pre-trained on audio data alone using Wav2Vec2's contrastive loss objective.
The following examples show how to fine-tune a "base"
-sized Wav2Vec2 model as well as a "large"
-sized Wav2Vec2 model using accelerate
.
NOTE 1
Wav2Vec2's pre-training is known to be quite unstable.
It is advised to do a couple of test runs with a smaller dataset,
i.e. --dataset_config_names clean clean
, --dataset_split_names validation test
to find good hyper-parameters for learning_rate
, batch_size
, num_warmup_steps
,
and the optimizer.
A good metric to observe during training is the gradient norm which should ideally be between 0.5 and 2.
NOTE 2
When training a model on large datasets it is recommended to run the data preprocessing
in a first run in a non-distributed mode via --preprocessing_only
so that
when running the model in distributed mode in a second step the preprocessed data
can easily be loaded on each distributed device.
Demo
In this demo run we pre-train a "base-sized"
Wav2Vec2 model simply only on the validation
and test data of librispeech_asr.
The demo is run on two Titan RTX (24 GB RAM each). In case you have less RAM available
per device, consider reducing --batch_size
and/or the --max_duration_in_seconds
.
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
--dataset_name="librispeech_asr" \
--dataset_config_names clean clean \
--dataset_split_names validation test \
--model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
--output_dir="./wav2vec2-pretrained-demo" \
--max_train_steps="20000" \
--num_warmup_steps="32000" \
--gradient_accumulation_steps="8" \
--learning_rate="0.005" \
--weight_decay="0.01" \
--max_duration_in_seconds="20.0" \
--min_duration_in_seconds="2.0" \
--logging_steps="1" \
--saving_steps="10000" \
--per_device_train_batch_size="8" \
--per_device_eval_batch_size="8" \
--adam_beta1="0.9" \
--adam_beta2="0.98" \
--adam_epsilon="1e-06" \
--gradient_checkpointing \
--mask_time_prob="0.65" \
--mask_time_length="10"
The results of this run can be seen here.
Base
To pre-train "base-sized"
Wav2Vec2 model, e.g. facebook/wav2vec2-base
on librispeech_asr, the following command can be run:
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
--dataset_name=librispeech_asr \
--dataset_config_names clean clean other \
--dataset_split_names train.100 train.360 train.500 \
--model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
--output_dir="./wav2vec2-pretrained-demo" \
--max_train_steps="200000" \
--num_warmup_steps="32000" \
--gradient_accumulation_steps="4" \
--learning_rate="0.001" \
--weight_decay="0.01" \
--max_duration_in_seconds="20.0" \
--min_duration_in_seconds="2.0" \
--logging_steps="1" \
--saving_steps="10000" \
--per_device_train_batch_size="8" \
--per_device_eval_batch_size="8" \
--adam_beta1="0.9" \
--adam_beta2="0.98" \
--adam_epsilon="1e-06" \
--gradient_checkpointing \
--mask_time_prob="0.65" \
--mask_time_length="10"
The experiment was run on 8 GPU V100 (16 GB RAM each) for 4 days.
In case you have more than 8 GPUs available for a higher effective batch_size
,
it is recommended to increase the learning_rate
to 0.005
for faster convergence.
The results of this run can be seen here and the checkpoint pretrained for 85,000 steps can be accessed here
Large
To pre-train "large-sized"
Wav2Vec2 model, e.g. facebook/wav2vec2-large-lv60,
on librispeech_asr, the following command can be run:
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
--dataset_name=librispeech_asr \
--dataset_config_names clean clean other \
--dataset_split_names train.100 train.360 train.500 \
--output_dir=./test \
--max_train_steps=200000 \
--num_warmup_steps=32000 \
--gradient_accumulation_steps=8 \
--learning_rate=0.001 \
--weight_decay=0.01 \
--max_duration_in_seconds=20.0 \
--min_duration_in_seconds=2.0 \
--model_name_or_path=./
--logging_steps=1 \
--saving_steps=10000 \
--per_device_train_batch_size=2 \
--per_device_eval_batch_size=4 \
--adam_beta1=0.9 \
--adam_beta2=0.98 \
--adam_epsilon=1e-06 \
--gradient_checkpointing \
--mask_time_prob=0.65 \
--mask_time_length=10
The experiment was run on 8 GPU V100 (16 GB RAM each) for 7 days.
In case you have more than 8 GPUs available for a higher effective batch_size
,
it is recommended to increase the learning_rate
to 0.005
for faster convergence.
The results of this run can be seen here and the checkpoint pretrained for 120,000 steps can be accessed here