File size: 4,139 Bytes
d3913c7
 
 
 
 
 
 
 
8b0c89d
612d33a
2ab9d6a
 
 
8b0c89d
 
 
 
 
2ab9d6a
8b0c89d
 
2ab9d6a
 
 
 
 
 
 
 
 
 
 
 
 
21d6e52
2ab9d6a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
---
language:
- ja
library_name: transformers
tags:
- jvs
- pyopenjtalk
- speech-to-text
pipeline_tag: text-to-speech
inference: false
---

# SpeechT5 (TTS task) for Japanese
SpeechT5 model fine-tuned for Japanese speech synthesis (text-to-speech) on [JVS]("https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus").
This model utilizes the JVS dataset which encompasses 100 speakers.
From this dataset, speaker embeddings were crafted, segregating them based on male and female voice types, and producing a unique speaker embedding vector.
This 16-dimensional speaker embedding vector is designed with an aim to provide a voice quality that is independent of any specific speaker.

Trained from [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts).
Modified tokenizer powered by [Open Jtalk](https://open-jtalk.sp.nitech.ac.jp/).


# Model description
See [original model card](https://huggingface.co/microsoft/speecht5_tts#model-description)
My modified codes licensed under MIT Licence.

# Usage
Install requirements
```bash
pip install transformers sentencepiece pyopnjtalk # or pyopenjtalk-prebuilt
```

Download a modified code.
```bash
curl -O https://huggingface.co/esnya/japanese_speecht5_tts/resolve/main/speecht5_openjtalk_tokenizer.py
```

(`SpeechToTextPipeline` is not released yet.)
```py
import numpy as np
from transformers import (
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan,
    SpeechT5FeatureExtractor,
    SpeechT5Processor,
)
from speecht5_openjtalk_tokenizer import SpeechT5OpenjtalkTokenizer
import soundfile
import torch

model_name = "esnya/japanese_speecht5_tts"
with torch.no_grad():

    model = SpeechT5ForTextToSpeech.from_pretrained(
        model_name, device_map="cuda", torch_dtype=torch.bfloat16
    )

    tokenizer = SpeechT5OpenjtalkTokenizer.from_pretrained(model_name)
    feature_extractor = SpeechT5FeatureExtractor.from_pretrained(model_name)
    processor = SpeechT5Processor(feature_extractor, tokenizer)
    vocoder = SpeechT5HifiGan.from_pretrained(
        "microsoft/speecht5_hifigan", device_map="cuda", torch_dtype=torch.bfloat16
    )

    input = "εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚εε‰γ―γΎγ η„‘γ„γ€‚γ©γ“γ§η”Ÿγ‚ŒγŸγ‹γ¨γ‚“γ¨θ¦‹ε½“γŒγ€γ‹γ¬γ€‚"
    input_ids = processor(text=input, return_tensors="pt").input_ids.to(model.device)

    speaker_embeddings = np.random.uniform(
        -1, 1, (1, 16)
    )  # (batch_size, speaker_embedding_dim = 16), first dimension means male (-1.0) / female (1.0)
    speaker_embeddings = torch.FloatTensor(speaker_embeddings).to(
        device=model.device, dtype=model.dtype
    )

    waveform = model.generate_speech(
        input_ids,
        speaker_embeddings,
        vocoder=vocoder,
    )

    waveform = waveform / waveform.abs().max()  # normalize
    waveform = waveform.reshape(-1).cpu().float().numpy()

    soundfile.write(
        "output.wav",
        waveform,
        vocoder.config.sampling_rate,
    )
```

# Background

The motivation behind developing this model stems from the noticeable lack of Japanese generation models in SpeechT5 TTS, or their scarcity at best. Additionally, the g2p functionality of Open Jtalk (pyopenjtalk) enabled us to achieve a vocabulary closely resembling English models. It's important to note that the special modifications and enhancements were primarily applied to the tokenizer. Unlike the default setup, our modified tokenizer separately extracts and retains characters other than phonation to ensure more accurate text-to-speech conversion.

# Limitations

One known issue with this model is that when multiple sentences are fed into it, the latter parts may result in extended silences. As a temporary solution, until this is rectified, it is recommended to split and generate each sentence individually.

# License
Model inherits [JVS Corpus](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus).

# See also
- Shinnosuke Takamichi, Kentaro Mitsui, Yuki Saito, Tomoki Koriyama, Naoko Tanji, and Hiroshi Saruwatari, "JVS corpus: free Japanese multi-speaker voice corpus," arXiv preprint, 1908.06248, Aug. 2019.