license: cc-by-nc-4.0
pipeline_tag: automatic-speech-recognition
SimulSeamless
Code for the paper: "SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation" published at IWSLT 2024.
📎 Requirements
To run the 🤖 Inference using your environment, please make sure that FBK-fairseq, SimulEval v1.1.0 and HuggingFace Transformers are installed.
To run the 💬 Inference using docker, install SimulEval v1.1.0 using commit f1f5b9a69a47496630aa43605f1bd46e5484a2f4.
🤖 Inference using your environment
Set --source and --target as described in the Fairseq Simultaneous Translation repository: ${LIST_OF_AUDIO} is the list of audio paths and ${TGT_FILE} contains the segment-wise references in the target language.
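As a minimal sketch of preparing these two files, the helpers below write one audio path per line and check that the list and the references have the same number of lines. The function names, the directory layout, and the .wav extension are assumptions made for illustration; they are not part of SimulEval or of this repository.

```shell
# Hypothetical helper: collect the audio segments under a directory into a
# list file, one path per line. The .wav extension is an assumption.
# The references in ${TGT_FILE} must follow the same (sorted) order.
make_audio_list() {
    find "$1" -type f -name '*.wav' | sort > "$2"
}

# Hypothetical helper: fail if the audio list and the reference file do not
# have the same number of lines (one reference per audio segment).
check_same_length() {
    n_audio=$(grep -c '' "$1")
    n_refs=$(grep -c '' "$2")
    if [ "$n_audio" -ne "$n_refs" ]; then
        echo "mismatch: $n_audio audio paths vs $n_refs references" >&2
        return 1
    fi
    return 0
}

# Example usage (paths are placeholders):
#   make_audio_list /path/to/segments "${LIST_OF_AUDIO}"
#   check_same_length "${LIST_OF_AUDIO}" "${TGT_FILE}"
```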
Set ${TGT_LANG} to the 3-character code of the target language. The list of supported language codes is available here.
For the source language, no language code has to be specified.
Depending on the target language, set ${LATENCY_UNIT} to either word (e.g., for German) or char (e.g., for Japanese), and ${BLEU_TOKENIZER} to either 13a (i.e., the standard sacreBLEU tokenizer used, for example, to evaluate German) or char (e.g., to evaluate character-level languages such as Chinese or Japanese).
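The choice above can be sketched as a small lookup keyed on the target language code. The function name is hypothetical, and the entries for eng and cmn extend the examples given in the text by analogy (word-level for English, character-level for Chinese); unknown codes are rejected instead of guessed.

```shell
# Hypothetical helper: set LATENCY_UNIT and BLEU_TOKENIZER from a
# 3-character target language code. Only the languages discussed in this
# README are covered; the eng and cmn entries are assumptions by analogy.
set_eval_settings() {
    case "$1" in
        deu|eng)
            LATENCY_UNIT=word
            BLEU_TOKENIZER=13a
            ;;
        jpn|cmn)
            LATENCY_UNIT=char
            BLEU_TOKENIZER=char
            ;;
        *)
            echo "no evaluation settings known for '$1'" >&2
            return 1
            ;;
    esac
}

# Example usage:
#   set_eval_settings "${TGT_LANG}"
```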
The simultaneous inference of SimulSeamless is based on AlignAtt, thus the f parameter (${FRAME}) and the layer from which to extract the attention scores (${LAYER}) have to be set accordingly.
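For orientation, the role of the two parameters can be summarized with the AlignAtt decision rule, restated here from the AlignAtt paper rather than from this repository, so the notation is an assumption. Let $x_1, \dots, x_n$ be the audio frames (encoder states) received so far, $y_i$ a candidate output token, and $A(y_i, x_j)$ the cross-attention score between them taken from decoder layer ${LAYER}. With $f$ = ${FRAME}:

$$
\mathrm{Align}_i = \arg\max_{j \in \{1, \dots, n\}} A(y_i, x_j),
\qquad
\text{emit } y_i \iff \mathrm{Align}_i \notin \{n - f + 1, \dots, n\}
$$

Emission stops at the first token whose most-attended frame falls within the last $f$ frames, and the system then waits for more audio. Under this rule, a larger $f$ makes the policy more conservative and therefore tends to increase latency.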
Instructions to replicate IWSLT 2024 results ↙️
To replicate the results obtained at 2 seconds of latency (measured by AL) on the test sets used by the IWSLT 2024 Simultaneous track, use the following values:
- en-de: ${TGT_LANG}=deu, ${FRAME}=6, ${LAYER}=3, ${SEG_SIZE}=1000
- en-ja: ${TGT_LANG}=jpn, ${FRAME}=1, ${LAYER}=0, ${SEG_SIZE}=400
- en-zh: ${TGT_LANG}=cmn, ${FRAME}=1, ${LAYER}=3, ${SEG_SIZE}=800
- cs-en: ${TGT_LANG}=eng, ${FRAME}=9, ${LAYER}=3, ${SEG_SIZE}=1000
❗️Please note that ${FRAME} can be adjusted to achieve lower or higher latency.
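The values listed above can be kept in a small shell function so that they are set consistently before each run. This is only a convenience sketch: the function name is hypothetical, and the values are copied from the list above.

```shell
# Hypothetical helper: set the IWSLT 2024 values for a language pair.
# The numbers are the ones listed in this README for about 2 s of AL.
iwslt2024_preset() {
    case "$1" in
        en-de) TGT_LANG=deu FRAME=6 LAYER=3 SEG_SIZE=1000 ;;
        en-ja) TGT_LANG=jpn FRAME=1 LAYER=0 SEG_SIZE=400 ;;
        en-zh) TGT_LANG=cmn FRAME=1 LAYER=3 SEG_SIZE=800 ;;
        cs-en) TGT_LANG=eng FRAME=9 LAYER=3 SEG_SIZE=1000 ;;
        *)
            echo "no IWSLT 2024 preset for '$1'" >&2
            return 1
            ;;
    esac
}

# Example usage:
#   iwslt2024_preset en-de
#   echo "${TGT_LANG} ${FRAME} ${LAYER} ${SEG_SIZE}"
```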
SimulSeamless can be run with:
simuleval \
--agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_alignatt_seamlessm4t.AlignAttSeamlessS2T \
--source ${LIST_OF_AUDIO} \
--target ${TGT_FILE} \
--data-bin ${DATA_ROOT} \
--model-size medium --target-language ${TGT_LANG} \
--extract-attn-from-layer ${LAYER} --num-beams 5 \
--frame-num ${FRAME} \
--source-segment-size ${SEG_SIZE} \
--quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
--eval-latency-unit ${LATENCY_UNIT} --sacrebleu-tokenizer ${BLEU_TOKENIZER} \
--output ${OUT_DIR} \
--device cuda:0
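Since the command above relies on many shell variables, an unset one would silently expand to an empty argument. A possible pre-flight check, run before the command, is sketched below; the function name is hypothetical and not part of SimulEval.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Returns non-zero if at least one is missing.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage before calling simuleval:
#   require_vars LIST_OF_AUDIO TGT_FILE DATA_ROOT TGT_LANG LAYER FRAME \
#       SEG_SIZE LATENCY_UNIT BLEU_TOKENIZER OUT_DIR || exit 1
```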
If not already stored on your system, the SeamlessM4T model will be downloaded automatically when running the script. The output will be saved in ${OUT_DIR}.
We suggest running the inference on a GPU to speed up the process, but the system can be run on any device (e.g., CPU) supported by SimulEval and HuggingFace.
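One way to choose the value passed to --device is to fall back to the CPU when no NVIDIA GPU is visible. The sketch below assumes that nvidia-smi is the right probe for your setup and that cpu is accepted as a device string; the function name is hypothetical.

```shell
# Hypothetical helper: print "cuda:0" if nvidia-smi can list a GPU,
# otherwise print "cpu".
pick_device() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
        echo "cuda:0"
    else
        echo "cpu"
    fi
}

# Example usage:
#   DEVICE=$(pick_device)
#   simuleval ... --device "${DEVICE}"
```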
💬 Inference using docker
To run SimulSeamless using docker, follow the steps below:
- Download the docker file by cloning this repository
- Load the docker image:
docker load -i simulseamless.tar
- Start the SimulEval standalone with GPU enabled:
docker run -e TGTLANG=${TGT_LANG} -e FRAME=${FRAME} -e LAYER=${LAYER} \
-e BLEU_TOKENIZER=${BLEU_TOKENIZER} -e LATENCY_UNIT=${LATENCY_UNIT} \
-e DEV=cuda:0 --gpus all --shm-size 32G \
-p 2024:2024 simulseamless:latest
- Start the remote evaluation with:
simuleval \
--remote-eval --remote-port 2024 \
--source ${LIST_OF_AUDIO} --target ${TGT_FILE} \
--source-type speech --target-type text \
--source-segment-size ${SEG_SIZE} \
--eval-latency-unit ${LATENCY_UNIT} --sacrebleu-tokenizer ${BLEU_TOKENIZER} \
--output ${OUT_DIR}
To set ${TGT_LANG}, ${FRAME}, ${LAYER}, ${BLEU_TOKENIZER}, ${LATENCY_UNIT}, ${LIST_OF_AUDIO}, ${TGT_FILE}, ${SEG_SIZE}, and ${OUT_DIR}, refer to 🤖 Inference using your environment.
📍Citation
@inproceedings{papi-etal-2024-simulseamless,
title = "{S}imul{S}eamless: {FBK} at {IWSLT} 2024 Simultaneous Speech Translation",
author = "Papi, Sara and
Gaido, Marco and
Negri, Matteo and
Bentivogli, Luisa",
editor = "Salesky, Elizabeth and
Federico, Marcello and
Carpuat, Marine",
booktitle = "Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)",
month = aug,
year = "2024",
address = "Bangkok, Thailand (in-person and online)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.iwslt-1.11",
pages = "72--79",
}