File size: 9,810 Bytes
fe98679
874d788
fe98679
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
caed5a1
78740e3
fe98679
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
874d788
fe98679
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
<!DOCTYPE html>
<html>
  <head>
    <title>Hibiki</title>
    <link
      href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css"
      rel="stylesheet"
    />
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/font/bootstrap-icons.min.css">
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>
    <script src="https://unpkg.com/wavesurfer.js@7"></script>
    <script src="helper.js" defer></script>
    <script>
      function _setup_callback(elem, elems) {
        elem.addEventListener("play", function () {
          for (other of elems) {
            if (other !== elem) {
              other.pause();
            }
          }
        });
      }

      document.addEventListener('DOMContentLoaded', function () {
        var elems = document.body.getElementsByTagName("audio");
        for (elem of elems) {
          _setup_callback(elem, elems);
        }
      });
    </script>
    <style>
      td {
        vertical-align: middle;
        text-align: center;
      }
      audio {
        width: 20vw;
        min-width: 100px;
        max-width: 100%;
      }
      h1, h2, h3, h4, h5, h6, body, b, strong, th {
        color: #595959;
      }
      .ratio-8x5 {
        --bs-aspect-ratio: 62.5%;
      }
      .btn-secondary {
        padding: 0.1rem 0.8rem;
        font-size: small
      }
      .container {
        max-width: 1620px;
      }
    </style>
  </head>
  <body>
    <div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
      <div class="text-center">
        <h1>High-Fidelity Simultaneous Speech-To-Speech Translation</h1>
        <p class="lead">
          <a href="https://kyutai.org" target="_blank" rel="noopener noreferrer">Kyutai</a>
          - code on <a href="https://github.com/kyutai-labs/hibiki" target="_blank" rel="noopener noreferrer">github</a>
        </p>
      </div>
      <p>
      <b>Abstract.</b>
        We introduce <i>Hibiki</i> ('echo' in Japanese)
        Hibiki leverages a multistream language model to synchronously process
        source and target speech, and jointly produces text and audio tokens to
        perform speech-to-text and speech-to-speech translation.
        We furthermore address the fundamental challenge of <i>simultaneous</i> interpretation,
        which unlike its <i>consecutive</i> counterpart---where one waits for
        the end of the source utterance to start translating--- adapts its flow
        to accumulate just enough context to produce a correct translation in
        real-time, chunk by chunk. <br />
        To do so, we introduce a weakly-supervised method that leverages the
        perplexity of an off-the-shelf text translation system to identify
        optimal delays on a per-word basis and create aligned synthetic data.
        After supervised training, Hibiki performs adaptive, simultaneous
        speech translation with vanilla temperature sampling. On a
        French-English simultaneous speech translation task, Hibiki demonstrates
        state-of-the-art performance in translation quality, speaker fidelity
        and naturalness. Moreover, the simplicity of its inference process
        makes it compatible with batched translation and even real-time
        on-device deployment.
      </p>
    </div>

    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>In the Wild Examples<a id="vis"/></h3>
      <p class="mb-0">
      </p>
      <div class="container pt-3 table-responsive">
        <table class="table table-hover" width="100%">
          <tr>
            <td witdth="50%">
            <video class="embed-responsive-item" style="max-width: 80%; min-width: 400px;" controls>
              <source src="videos/RPckvIkNWhE_ss301_to390_babel_numerique_arte.mp4" type="video/mp4">
              Your browser does not support HTML video.
            </video>
            </td>
            <td width="50%">
            <video class="embed-responsive-item" style="max-width: 80%; min-width: 400px;" controls>
              <source src="videos/uNAmODXvAiQ_ss9_message_a_caractere_informatif.mp4" type="video/mp4">
              Your browser does not support HTML video.
            </video>
            </td>
          <tr>
            <td>
            This example comes from a video explaining automated translation.
            (<a href="https://www.youtube.com/watch?v=RPckvIkNWhE" target="_blank">source</a>, original video (c) Arte)
            </td>
            <td>
            This example comes from a humoristic video. The source voice is high pitch on purpose,
            it is a good showcase of how well Hibiki replicates pitch and prosody and how robust it is to
            background noise <b>as no denoising is applied to the audio which is fed raw to Hibiki</b>.
            (<a href="https://www.youtube.com/watch?v=uNAmODXvAiQ" target="_blank">source</a>, original video (c) Canal+)
            </td>
          </tr>
        </table>
      </div>
    </div>

    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>Examples with Ground Truth Interpretation<a id="vis"/></h3>
      <p class="mb-0">
      These samples come from the VoxPopuli dataset where the ground truth is real human 
      interpretation.
      The volume for the sources has been reduced so that it's easier to hear the translations.
      </p>
      <div class="container pt-3 table-responsive">
        <table class="table table-hover" id="vis-table2"></table>
      </div>
    </div>

    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>Multistream Visualization<a id="vis"/></h3>
      <p class="mb-0">
      The audio for the source and translated versions are on different channels. Use headphones
      to hear both at the same time. These samples are the same as in the voxpopuli section with CFG
      set to 3.
      </p>
      <div class="container pt-3 table-responsive">
        <table class="table table-hover" id="vis-table"></table>
      </div>
    </div>


    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>Impact of Classifier-Free Guidance<a id="voxpopuli"/></h3>
      <p class="mb-0">
        Samples taken from the VoxPopuli dataset. The Hibiki samples are presented with different levels
        of classifier-free guidance (CFG). The higher the CFG value, the closer the generated voice will
        be to the original voice. This results in very strong accents for the generations with the higher
        values.
      </p>

      <div class="container pt-3 table-responsive">
        <table
          class="table table-hover"
          id="voxpopuli-table"
        >
          <thead>
            <tr>
              <th style="text-align: center">Source</th>
              <th style="text-align: center">Hibiki CFG-1</th>
              <th style="text-align: center">Hibiki CFG-3</th>
              <th style="text-align: center">Hibiki CFG-10</th>
              <th style="text-align: center">Seamless</th>
            </tr>
          </thead>
          <tbody>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr>
          </tbody>
        </table>
      </div>
    </div>
    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>Long-form Simultaneous Translations<a id="ntrex"/></h3>
      <p class="mb-0">
        Samples taken from the audio NTREX dataset.
      </p>

      <div class="container pt-3 table-responsive">
        <table
          class="table table-hover"
          id="ntrex-table"
        >
          <thead>
            <tr>
              <th style="text-align: center;min-width: 200px;">Source</th>
              <th style="text-align: center;">Hibiki</th>
              <th style="text-align: center">Seamless</th>
            </tr>
          </thead>
          <tbody>
             <tr> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td></tr>
          </tbody>
        </table>
      </div>
    </div>

    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>Short-form Simultaneous Translations<a id="cvss-c"/></h3>
      <p class="mb-0">
        Samples taken from the CVSS-C dataset.
      </p>

      <div class="container pt-3 table-responsive">
        <table
          class="table table-hover"
          id="cvss-table"
        >
          <thead>
            <tr>
              <th style="text-align: center;min-width: 200px;">Source</th>
              <th style="text-align: center;">Hibiki</th>
              <th style="text-align: center">Seamless</th>
            </tr>
          </thead>
          <tbody>
             <tr> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td></tr>
             <tr> <td></td> <td></td> <td></td></tr>
          </tbody>
        </table>
      </div>
    </div>

    <div class="container p-5 mb-5 bg-white rounded">
      <p class="mb-0">
        This page was adapted from the <a href="https://google-research.github.io/seanet/soundstorm/examples">SoundStorm project page</a>.
      </p>
    </div>


  </body>
</html>