# Kaldi-style all-in-one recipes
This repository provides [Kaldi](https://github.com/kaldi-asr/kaldi)-style recipes, similar to [ESPnet](https://github.com/espnet/espnet).
Currently, the following recipes are supported.
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): English female speaker
- [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut): Japanese female speaker
- [JSSS](https://sites.google.com/site/shinnosuketakamichi/research-topics/jsss_corpus): Japanese female speaker
- [CSMSC](https://www.data-baker.com/open_source.html): Mandarin female speaker
- [CMU Arctic](http://www.festvox.org/cmu_arctic/): English speakers
- [JNAS](http://research.nii.ac.jp/src/en/JNAS.html): Japanese multi-speaker
- [VCTK](https://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html): English multi-speaker
- [LibriTTS](https://arxiv.org/abs/1904.02882): English multi-speaker
- [YesNo](https://arxiv.org/abs/1904.02882): English speaker (For debugging)
## How to run the recipe
```bash
# Let us move on to the recipe directory
$ cd egs/ljspeech/voc1
# Run the recipe from scratch
$ ./run.sh
# You can change config via command line
$ ./run.sh --conf <your_customized_yaml_config>
# You can select the stage to start and stop
$ ./run.sh --stage 2 --stop_stage 2
# If you want to specify the gpu
$ CUDA_VISIBLE_DEVICES=1 ./run.sh --stage 2
# If you want to resume training from 10000 steps checkpoint
$ ./run.sh --stage 2 --resume <path>/<to>/checkpoint-10000steps.pkl
```
You can check the command line options in `run.sh`.
The integration with job schedulers such as [slurm](https://slurm.schedmd.com/documentation.html) can be done via `cmd.sh` and `conf/slurm.conf`.
If you want to use it, please check [this page](https://kaldi-asr.org/doc/queue.html).
All hyperparameters are written in a single YAML-format configuration file.
Please check [this example](https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/ljspeech/voc1/conf/parallel_wavegan.v1.yaml) in the ljspeech recipe.
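For orientation, the sketch below shows one way to peek at a few of the most commonly edited fields; the field names come from the linked config, but the printed values here are only illustrative and may not match the file you actually have.
```bash
# Show a few frequently edited fields (illustrative output; check the actual file)
$ grep -E "^(sampling_rate|hop_size|num_mels|batch_size|train_max_steps):" conf/parallel_wavegan.v1.yaml
sampling_rate: 22050
hop_size: 256
num_mels: 80
batch_size: 6
train_max_steps: 400000
```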
You can monitor the training progress via tensorboard.
```bash
$ tensorboard --logdir exp
```
![](https://user-images.githubusercontent.com/22779813/68100080-58bbc500-ff09-11e9-9945-c835186fd7c2.png)
If you want to accelerate the training, you can try distributed multi-gpu training based on apex.
Distributed training requires apex, so please make sure it is installed beforehand.
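A typical apex installation looks roughly like the sketch below; the exact pip options depend on your CUDA / PyTorch setup, so please follow the instructions in the [apex repository](https://github.com/NVIDIA/apex).
```bash
# Typical apex installation with CUDA extensions (sketch; see the apex README)
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```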
Then you can run distributed multi-gpu training via the following command:
```bash
# in the case of the number of gpus = 8
$ CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" ./run.sh --stage 2 --n_gpus 8
```
In the case of distributed training, the batch size will be automatically multiplied by the number of gpus (e.g., `batch_size: 6` in the config with 8 gpus gives an effective batch size of 48), so please be careful.
## How to make the recipe for your own dataset
Here, I will show how to make the recipe for your own dataset.
1. Set up your dataset with the following structure.
```bash
# For single-speaker case
$ tree /path/to/database
/path/to/database
├── utt_1.wav
├── utt_2.wav
│   ...
└── utt_N.wav
# The directory can be nested, but each filename must be unique
# For multi-speaker case
$ tree /path/to/database
/path/to/database
├── spk_1
│   └── utt1.wav
├── spk_2
│   └── utt1.wav
│   ...
└── spk_N
    └── utt1.wav
    ...
# The directory under each speaker can be nested, but each filename in each speaker directory must be unique
```
2. Copy the template directory.
```bash
cd egs
# For single speaker case
cp -r template_single_spk <your_dataset_name>
# For multi speaker case
cp -r template_multi_spk <your_dataset_name>
# Move on to your recipe (we are already inside egs)
cd <your_dataset_name>/voc1
```
3. Modify the options in `run.sh`.
What you need to change at a minimum in `run.sh` is as follows (see the sketch below):
- `db_root`: Root path of the database.
- `num_dev`: Number of utterances for the development set.
- `num_eval`: Number of utterances for the evaluation set.
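As a sketch, the edited variables near the top of `run.sh` would look like this (the values below are placeholders, not recommendations):
```bash
# Inside run.sh (placeholder values; adjust to your dataset)
db_root=/path/to/database  # root path of the database
num_dev=100                # number of utterances for the development set
num_eval=100               # number of utterances for the evaluation set
```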
4. Modify the hyperparameters in `conf/parallel_wavegan.v1.yaml`.
What you need to change at a minimum in the config is as follows (see the example below):
- `sampling_rate`: If you specify a sampling rate lower than that of the original audio, the audio will be downsampled by sox.
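For example, to train a 16 kHz model on higher-rate recordings, you could lower `sampling_rate` as in the sketch below (an illustration only; related feature settings such as `hop_size` may also need to be adjusted to stay consistent with your features):
```bash
# Sketch: set a lower sampling rate so the audio will be downsampled by sox
$ sed -i 's/^sampling_rate: .*/sampling_rate: 16000/' conf/parallel_wavegan.v1.yaml
```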
5. (Optional) Change command backend in `cmd.sh`.
If you are not familiar with Kaldi and run the recipe in your local environment, you do not need to change anything.
See https://kaldi-asr.org/doc/queue.html for more information.
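As a rough sketch (assuming an ESPnet-style `cmd.sh`; the actual variable names in this repository's `cmd.sh` may differ), switching from the local backend to Slurm looks like this:
```bash
# Inside cmd.sh (sketch; check the actual file for the exact variable names)
# "local" runs jobs on the current machine, while "slurm" submits them
# through slurm.pl using conf/slurm.conf
cmd_backend="slurm"
```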
6. Run your recipe.
```bash
# Run all stages from the first stage
./run.sh
# If you want to specify CUDA device
CUDA_VISIBLE_DEVICES=0 ./run.sh
```
If you want to try other, more advanced models, please check the config files in `egs/ljspeech/voc1/conf`.
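For example, you could reuse one of the ljspeech configs in your own recipe via the `--conf` option shown above (a sketch; `<selected_config>` is whichever config file you want to try):
```bash
# From egs/<your_dataset_name>/voc1: borrow a config from the ljspeech recipe
$ cp ../../ljspeech/voc1/conf/<selected_config>.yaml conf/
$ ./run.sh --conf conf/<selected_config>.yaml
```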
## Run training using ESPnet2-TTS recipe within 5 minutes
Make sure you have already finished the ESPnet2-TTS recipe experiments (at least the training has been started).
```bash
cd egs
# Please use the single-speaker template for both single- and multi-speaker cases
cp -r template_single_spk <recipe_name>
# Move on to your recipe
cd egs/<recipe_name>/voc1
# Make symlinks to the data directories (better to use absolute paths)
mkdir dump data
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw dump/
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/tr_no_dev data/train_nodev
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/dev data/dev
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/eval1 data/eval
# Edit config to match TTS model setting
vim conf/parallel_wavegan.v1.yaml
# Run from stage 1
./run.sh --stage 1 --conf conf/parallel_wavegan.v1.yaml
```
That's it!