---
license: mit
---

# ClassicVC

ClassicVC is an any-to-any voice conversion model that lets users design original speaker styles 
by selecting coordinates in continuous latent spaces. 
The model components are implemented in PyTorch and are fully compatible with ONNX export.
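The "design your own speaker style" idea can be sketched as simple coordinate arithmetic in the style latent space. This is a minimal sketch only: the 128-dimensional size and the random vectors below are placeholders, not the model's real embeddings, whose dimensionality is fixed by the ClassicVC style encoder.

```python
import numpy as np

# Placeholder style coordinates standing in for the embeddings of two
# reference speakers (the real vectors come from the style encoder).
rng = np.random.default_rng(0)
style_a = rng.normal(size=128).astype(np.float32)
style_b = rng.normal(size=128).astype(np.float32)

def blend_styles(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Linearly interpolate between two style coordinates (0 <= t <= 1)."""
    return ((1.0 - t) * a + t * b).astype(np.float32)

# Halfway between the two speakers defines a new, original style.
new_style = blend_styles(style_a, style_b, 0.5)
```

Any point in the latent space is a valid style coordinate, so users are not limited to styles of speakers seen during training.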

[MMCXLI](https://github.com/lyodos/mmcxli) provides a dedicated graphical user interface (GUI) for ClassicVC. 
It runs on wxPython and ONNX Runtime. 
Users can download the ONNX files and try out speech conversion 
without installing PyTorch or training a model on their own voice data.


## Model Details

### Model Description


- **Developed by:** Lyodos (Lyodos the City of the Museum)

### Model Sources


- **Repository:** [GitHub](https://github.com/lyodos/classic-vc)

----

## Uses

Under the MIT License, users may use the model code and checkpoints for research purposes.
The model is provided without any guarantees.


### Direct Use


The [MMCXLI](https://github.com/lyodos/mmcxli) GUI is the intended entry point for direct use of the published checkpoints.

### Out-of-Scope Use

This model was prototyped as a hobbyist's research project in any-to-any voice conversion, 
and we make no guarantees, especially regarding its reliability or real-time operation. 

As for use in settings involving an unspecified number of people (such as web broadcasting) 
or in mission-critical applications (including medical, transportation, infrastructure, and weapon systems): 
since the MIT License is the only stated license, we cannot prohibit such use as the developer, but we do not encourage it.


## Bias, Risks, and Limitations


We trained on three large-scale speech corpora (LibriSpeech, Samrómur Children 21.09, and VoxCeleb 1 and 2) 
so that the speaker latent space learned by the ClassicVC style encoder 
covers natural human voices as inclusively as possible.


### Recommendations


Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

[Notebook 01 of the ClassicVC repository](https://github.com/lyodos/classic-vc) walks through offline (non-real-time) voice conversion.

[The MMCXLI repository](https://github.com/lyodos/mmcxli) provides the GUI, which depends on a local Python environment.

----

## Training Details

### Training Data

The model checkpoints provided here were trained on the following three datasets.

1. LibriSpeech ASR corpus
* V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 2015, pp. 5206-5210, doi: 10.1109/ICASSP.2015.7178964.
* https://ieeexplore.ieee.org/document/7178964
* https://openslr.org/12/

2. Samrómur Children 21.09

* Mena, Carlos; et al., 2021, Samromur Children 21.09, CLARIN-IS, http://hdl.handle.net/20.500.12537/185.
* https://repository.clarin.is/repository/xmlui/handle/20.500.12537/185
* https://openslr.org/117/

3. VoxCeleb 1 and 2

* A. Nagrani*, J. S. Chung*, A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset", Interspeech 2017
* J. S. Chung*, A. Nagrani*, A. Zisserman, "VoxCeleb2: Deep Speaker Recognition", Interspeech 2018
* A. Nagrani*, J. S. Chung*, W. Xie, A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild", Computer Speech and Language, 2019
* https://huggingface.co/datasets/ProgramComputer/voxceleb/tree/main/vox2


### Training Procedure

The [Notebook 02 of the ClassicVC repository](https://github.com/lyodos/classic-vc) provides the procedure for data preparation.

The [Notebook 03 of the ClassicVC repository](https://github.com/lyodos/classic-vc) provides the training code.