# Sharded Feature Extraction and K-means Application

This folder contains scripts for preparing HUBERT labels from tsv files. The
steps are:
1. feature extraction
2. k-means clustering
3. k-means application

## Data preparation

`*.tsv` files contain a list of audio files, where the first line is the root
directory and each following line is the subpath of one audio file:
```
<root-dir>
<audio-path-1>
<audio-path-2>
...
```
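
The manifest layout above can be read with a few lines of Python. This is a minimal sketch rather than the loader the scripts use: `read_manifest` is a hypothetical helper, and it assumes that if a line carries tab-separated columns, the first column is the audio subpath.

```python
import os


def read_manifest(tsv_path):
    """Return (root, paths) for a manifest laid out as described above.

    Assumption: if a line has tab-separated columns, the first one is the
    audio subpath; any further columns are ignored here.
    """
    with open(tsv_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    root, entries = lines[0], lines[1:]
    # Join each subpath onto the root directory from the first line.
    paths = [os.path.join(root, entry.split("\t")[0]) for entry in entries]
    return root, paths
```

Calling `read_manifest("train.tsv")` would return the root directory and the full path of every listed audio file.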

## Feature extraction

### MFCC feature
Suppose the tsv file is at `${tsv_dir}/${split}.tsv`. To extract 39-D
MFCC+delta+ddelta features for the first iteration of HUBERT training, run:
```sh
python dump_mfcc_feature.py ${tsv_dir} ${split} ${nshard} ${rank} ${feat_dir}
```
This shards the tsv file into `${nshard}` shards and extracts features for the
`${rank}`-th shard, where `${rank}` is an integer in `[0, nshard-1]`. Features
are saved at `${feat_dir}/${split}_${rank}_${nshard}.{npy,len}`.
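
To make the sharding concrete, here is a sketch of how a list of utterances could be divided among ranks. `shard_range` is a hypothetical helper, and it assumes contiguous, near-equal shards; the script's own boundary rounding may differ.

```python
def shard_range(num_items, nshard, rank):
    """Return the [start, end) index range of the rank-th of nshard shards.

    Assumption: shards are contiguous and near-equal in size; the script's
    own boundary rounding may differ.
    """
    if not 0 <= rank < nshard:
        raise ValueError(f"rank must be in [0, {nshard - 1}], got {rank}")
    start = num_items * rank // nshard
    end = num_items * (rank + 1) // nshard
    return start, end


# Ten utterances split over three shards.
ranges = [shard_range(10, 3, rank) for rank in range(3)]
print(ranges)  # [(0, 3), (3, 6), (6, 10)]
```

Under this scheme every utterance falls in exactly one shard, so running all ranks covers the whole split once.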

### HUBERT feature
To extract features from the `${layer}`-th transformer layer of a trained
HUBERT model saved at `${ckpt_path}`, run:
```sh
python dump_hubert_feature.py ${tsv_dir} ${split} ${ckpt_path} ${layer} ${nshard} ${rank} ${feat_dir}
```
Features are also saved at `${feat_dir}/${split}_${rank}_${nshard}.{npy,len}`.

- If you run out of memory, decrease the chunk size with `--max_chunk`.
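
Because each invocation of the HUBERT command handles a single shard, covering a whole split takes one run per rank. The sketch below builds those argument lists in the positional order shown in the command above. `hubert_dump_commands` is a hypothetical helper and the values passed to it are placeholders.

```python
import shlex


def hubert_dump_commands(tsv_dir, split, ckpt_path, layer, nshard, feat_dir):
    """Build one dump_hubert_feature.py argument list per shard rank."""
    return [
        [
            "python", "dump_hubert_feature.py",
            tsv_dir, split, ckpt_path, str(layer),
            str(nshard), str(rank), feat_dir,
        ]
        for rank in range(nshard)
    ]


# Placeholder values, chosen only for illustration.
for cmd in hubert_dump_commands("tsv", "train", "hubert.pt", 6, 2, "feat"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run`, one after another or in parallel on separate workers.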

## K-means clustering
To fit a k-means model with `${n_clusters}` clusters on 10% of the `${split}` data, run:
```sh
python learn_kmeans.py ${feat_dir} ${split} ${nshard} ${km_path} ${n_clusters} --percent 0.1
```
This saves the k-means model to `${km_path}`.

- Set `--percent -1` to use all data.
- More k-means options can be found with the `-h` flag.
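
The `--percent` semantics can be illustrated as follows: a fraction selects a random subset and `-1` keeps everything. This is a sketch under assumptions, not the script's logic. `subsample` and its `seed` parameter are hypothetical, sampling is assumed uniform without replacement, and whether the script samples utterances or frames is not stated here.

```python
import random


def subsample(items, percent, seed=0):
    """Pick a random fraction of items; a negative percent keeps everything.

    Assumption: sampling is uniform without replacement; the seed parameter
    is hypothetical and only makes this sketch reproducible.
    """
    items = list(items)
    if percent < 0:
        return items
    if not 0 < percent <= 1:
        raise ValueError("percent must be negative or in (0, 1]")
    rng = random.Random(seed)
    # Keep at least one item, but never more than are available.
    k = min(len(items), max(1, int(len(items) * percent)))
    return rng.sample(items, k)


utterances = [f"utt{i}" for i in range(50)]
print(len(subsample(utterances, 0.1)))  # 5
print(len(subsample(utterances, -1)))   # 50
```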

## K-means application
To apply a trained k-means model `${km_path}` to obtain labels for `${split}`, run:
```sh
python dump_km_label.py ${feat_dir} ${split} ${km_path} ${nshard} ${rank} ${lab_dir}
```
This extracts labels for the `${rank}`-th shard out of `${nshard}` shards
and dumps them to `${lab_dir}/${split}_${rank}_${nshard}.km`.
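
Applying a k-means model means labelling each feature frame with the index of its nearest centroid. The pure-Python sketch below shows that assignment on toy data. `assign_labels` is a hypothetical helper, and the script itself presumably works on arrays loaded from the `.npy` files.

```python
def assign_labels(frames, centroids):
    """Label each frame with the index of its nearest centroid."""
    def sq_dist(a, b):
        # Squared Euclidean distance; the square root does not change the argmin.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


# Toy 2-D data; real features would be 39-D MFCCs or HUBERT activations.
centroids = [(0.0, 0.0), (10.0, 10.0)]
frames = [(1.0, -1.0), (9.0, 8.0), (4.0, 4.0)]
print(assign_labels(frames, centroids))  # [0, 1, 0]
```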

Finally, merge the shards for `${split}` by running:
```sh
for rank in $(seq 0 $((nshard - 1))); do
  cat "${lab_dir}/${split}_${rank}_${nshard}.km"
done > "${lab_dir}/${split}.km"
```
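
For environments without a POSIX shell, the same merge can be sketched in Python. `merge_shards` is a hypothetical helper that concatenates the per-shard files in ascending rank order, as the loop above does.

```python
import os
import shutil


def merge_shards(lab_dir, split, nshard):
    """Concatenate per-shard .km files in rank order into {split}.km."""
    out_path = os.path.join(lab_dir, f"{split}.km")
    with open(out_path, "wb") as out:
        for rank in range(nshard):
            shard_path = os.path.join(lab_dir, f"{split}_{rank}_{nshard}.km")
            with open(shard_path, "rb") as shard:
                # Stream the shard into the output without loading it fully.
                shutil.copyfileobj(shard, out)
    return out_path
```

A missing shard file raises `FileNotFoundError`, which is a useful signal that one of the ranks has not finished.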