Spaces:
Running
Running
<html> | |
<head> | |
<title>Hibiki</title> | |
<link | |
href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" | |
rel="stylesheet" | |
/> | |
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/font/bootstrap-icons.min.css"> | |
<meta charset="utf-8" /> | |
<meta name="viewport" content="width=device-width, initial-scale=1" /> | |
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script> | |
<script src="https://unpkg.com/wavesurfer.js@7"></script> | |
<script src="helper.js" defer></script> | |
<script> | |
function _setup_callback(elem, elems) { | |
elem.addEventListener("play", function () { | |
for (other of elems) { | |
if (other !== elem) { | |
other.pause(); | |
} | |
} | |
}); | |
} | |
document.addEventListener('DOMContentLoaded', function () { | |
var elems = document.body.getElementsByTagName("audio"); | |
for (elem of elems) { | |
_setup_callback(elem, elems); | |
} | |
}); | |
</script> | |
<style> | |
td { | |
vertical-align: middle; | |
text-align: center; | |
} | |
audio { | |
width: 20vw; | |
min-width: 100px; | |
max-width: 100%; | |
} | |
h1, h2, h3, h4, h5, h6, body, b, strong, th { | |
color: #595959; | |
} | |
.ratio-8x5 { | |
--bs-aspect-ratio: 62.5%; | |
} | |
.btn-secondary { | |
padding: 0.1rem 0.8rem; | |
font-size: small | |
} | |
.container { | |
max-width: 1620px; | |
} | |
</style> | |
</head> | |
<body> | |
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded"> | |
<div class="text-center"> | |
<h1>High-Fidelity Simultaneous Speech-To-Speech Translation</h1> | |
<p class="lead"> | |
<a href="https://kyutai.org" target="_blank" rel="noopener noreferrer">Kyutai</a> | |
- code on <a href="https://github.com/kyutai-labs/hibiki" target="_blank" rel="noopener noreferrer">github</a> | |
</p> | |
</div> | |
<p> | |
<b>Abstract.</b> | |
We introduce <i>Hibiki</i> ('echo' in Japanese) | |
Hibiki leverages a multistream language model to synchronously process | |
source and target speech, and jointly produces text and audio tokens to | |
perform speech-to-text and speech-to-speech translation. | |
We furthermore address the fundamental challenge of <i>simultaneous</i> interpretation, | |
which unlike its <i>consecutive</i> counterpart---where one waits for | |
the end of the source utterance to start translating--- adapts its flow | |
to accumulate just enough context to produce a correct translation in | |
real-time, chunk by chunk. <br /> | |
To do so, we introduce a weakly-supervised method that leverages the | |
perplexity of an off-the-shelf text translation system to identify | |
optimal delays on a per-word basis and create aligned synthetic data. | |
After supervised training, Hibiki performs adaptive, simultaneous | |
speech translation with vanilla temperature sampling. On a | |
French-English simultaneous speech translation task, Hibiki demonstrates | |
state-of-the-art performance in translation quality, speaker fidelity | |
and naturalness. Moreover, the simplicity of its inference process | |
makes it compatible with batched translation and even real-time | |
on-device deployment. | |
</p> | |
</div> | |
<div class="container shadow p-5 mb-5 bg-white rounded"> | |
<h3>In the Wild Examples<a id="vis"/></h3> | |
<p class="mb-0"> | |
</p> | |
<div class="container pt-3 table-responsive"> | |
<table class="table table-hover" width="100%"> | |
<tr> | |
<td witdth="50%"> | |
<video class="embed-responsive-item" style="max-width: 80%; min-width: 400px;" controls> | |
<source src="videos/RPckvIkNWhE_ss301_to390_babel_numerique_arte.mp4" type="video/mp4"> | |
Your browser does not support HTML video. | |
</video> | |
</td> | |
<td width="50%"> | |
<video class="embed-responsive-item" style="max-width: 80%; min-width: 400px;" controls> | |
<source src="videos/uNAmODXvAiQ_ss9_message_a_caractere_informatif.mp4" type="video/mp4"> | |
Your browser does not support HTML video. | |
</video> | |
</td> | |
<tr> | |
<td> | |
This example comes from a video explaining automated translation. | |
(<a href="https://www.youtube.com/watch?v=RPckvIkNWhE" target="_blank">source</a>, original video (c) Arte) | |
</td> | |
<td> | |
This example comes from a humoristic video. The source voice is high pitch on purpose, | |
it is a good showcase of how well Hibiki replicates pitch and prosody and how robust it is to | |
background noise <b>as no denoising is applied to the audio which is fed raw to Hibiki</b>. | |
(<a href="https://www.youtube.com/watch?v=uNAmODXvAiQ" target="_blank">source</a>, original video (c) Canal+) | |
</td> | |
</tr> | |
</table> | |
</div> | |
</div> | |
<div class="container shadow p-5 mb-5 bg-white rounded"> | |
<h3>Examples with Ground Truth Interpretation<a id="vis"/></h3> | |
<p class="mb-0"> | |
These samples come from the VoxPopuli dataset where the ground truth is real human | |
interpretation. | |
The volume for the sources has been reduced so that it's easier to hear the translations. | |
</p> | |
<div class="container pt-3 table-responsive"> | |
<table class="table table-hover" id="vis-table2"></table> | |
</div> | |
</div> | |
<div class="container shadow p-5 mb-5 bg-white rounded"> | |
<h3>Multistream Visualization<a id="vis"/></h3> | |
<p class="mb-0"> | |
The audio for the source and translated versions are on different channels. Use headphones | |
to hear both at the same time. These samples are the same as in the voxpopuli section with CFG | |
set to 3. | |
</p> | |
<div class="container pt-3 table-responsive"> | |
<table class="table table-hover" id="vis-table"></table> | |
</div> | |
</div> | |
<div class="container shadow p-5 mb-5 bg-white rounded"> | |
<h3>Impact of Classifier-Free Guidance<a id="voxpopuli"/></h3> | |
<p class="mb-0"> | |
Samples taken from the VoxPopuli dataset. The Hibiki samples are presented with different levels | |
of classifier-free guidance (CFG). The higher the CFG value, the closer the generated voice will | |
be to the original voice. This results in very strong accents for the generations with the higher | |
values. | |
</p> | |
<div class="container pt-3 table-responsive"> | |
<table | |
class="table table-hover" | |
id="voxpopuli-table" | |
> | |
<thead> | |
<tr> | |
<th style="text-align: center">Source</th> | |
<th style="text-align: center">Hibiki CFG-1</th> | |
<th style="text-align: center">Hibiki CFG-3</th> | |
<th style="text-align: center">Hibiki CFG-10</th> | |
<th style="text-align: center">Seamless</th> | |
</tr> | |
</thead> | |
<tbody> | |
<tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td> <td></td> <td></td></tr> | |
</tbody> | |
</table> | |
</div> | |
</div> | |
<div class="container shadow p-5 mb-5 bg-white rounded"> | |
<h3>Long-form Simultaneous Translations<a id="ntrex"/></h3> | |
<p class="mb-0"> | |
Samples taken from the audio NTREX dataset. | |
</p> | |
<div class="container pt-3 table-responsive"> | |
<table | |
class="table table-hover" | |
id="ntrex-table" | |
> | |
<thead> | |
<tr> | |
<th style="text-align: center;min-width: 200px;">Source</th> | |
<th style="text-align: center;">Hibiki</th> | |
<th style="text-align: center">Seamless</th> | |
</tr> | |
</thead> | |
<tbody> | |
<tr> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td></tr> | |
</tbody> | |
</table> | |
</div> | |
</div> | |
<div class="container shadow p-5 mb-5 bg-white rounded"> | |
<h3>Short-form Simultaneous Translations<a id="cvss-c"/></h3> | |
<p class="mb-0"> | |
Samples taken from the CVSS-C dataset. | |
</p> | |
<div class="container pt-3 table-responsive"> | |
<table | |
class="table table-hover" | |
id="cvss-table" | |
> | |
<thead> | |
<tr> | |
<th style="text-align: center;min-width: 200px;">Source</th> | |
<th style="text-align: center;">Hibiki</th> | |
<th style="text-align: center">Seamless</th> | |
</tr> | |
</thead> | |
<tbody> | |
<tr> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td></tr> | |
<tr> <td></td> <td></td> <td></td></tr> | |
</tbody> | |
</table> | |
</div> | |
</div> | |
<div class="container p-5 mb-5 bg-white rounded"> | |
<p class="mb-0"> | |
This page was adapted from the <a href="https://google-research.github.io/seanet/soundstorm/examples">SoundStorm project page</a>. | |
</p> | |
</div> | |
</body> | |
</html> | |