metadata

license: cc-by-nc-4.0
pipeline_tag: automatic-speech-recognition

SimulSeamless

Code for the paper: "SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation" published at IWSLT 2024.

📎 Requirements

To run the 🤖 Inference using your environment, please make sure that FBK-fairseq, SimulEval v1.1.0 and HuggingFace Transformers are installed.
To run the 💬 Inference using docker, install SimulEval v1.1.0 using commit f1f5b9a69a47496630aa43605f1bd46e5484a2f4.
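
As a minimal sketch of pinning SimulEval to the commit above, the helper below clones the repository, checks out that commit, and installs it in editable mode. The function name `install_simuleval` is hypothetical, and the clone URL assumes the upstream facebookresearch/SimulEval repository on GitHub; adjust both to your setup.

```shell
# Commit of SimulEval v1.1.0 named in the requirements above.
SIMULEVAL_COMMIT=f1f5b9a69a47496630aa43605f1bd46e5484a2f4

# Hypothetical helper: clone SimulEval, pin the commit, install in editable mode.
# The subshell keeps the caller's working directory unchanged.
install_simuleval() {
    (
        git clone https://github.com/facebookresearch/SimulEval.git &&
        cd SimulEval &&
        git checkout "${SIMULEVAL_COMMIT}" &&
        pip install -e .
    )
}

# Run `install_simuleval` inside the Python environment you plan to use.
```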

🤖 Inference using your environment

Set --source and --target as described in the Fairseq Simultaneous Translation repository: ${LIST_OF_AUDIO} is the list of audio paths and ${TGT_FILE} contains the segment-wise references in the target language.
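
Before launching a run, it can help to check that the two files line up. The sketch below assumes that ${LIST_OF_AUDIO} holds one audio path per line and ${TGT_FILE} one reference per line, so both should have the same number of lines; `check_pairing` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper: verify that the audio list and the reference file
# have the same number of lines (assumes one entry per line in both files).
check_pairing() {
    n_audio=$(( $(wc -l < "$1") ))
    n_refs=$(( $(wc -l < "$2") ))
    if [ "$n_audio" -ne "$n_refs" ]; then
        echo "Mismatch: $n_audio audio paths vs. $n_refs references" >&2
        return 1
    fi
    echo "OK: $n_audio segments"
}

# Usage: check_pairing "${LIST_OF_AUDIO}" "${TGT_FILE}"
```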

Set ${TGT_LANG} to the 3-character code of the target language. The list of supported language codes is available here. No language code has to be specified for the source language.

Depending on the target language, set ${LATENCY_UNIT} to either word (e.g., for German) or char (e.g., for Japanese), and ${BLEU_TOKENIZER} to either 13a (i.e., the standard sacreBLEU tokenizer used, for example, to evaluate German) or char (e.g., to evaluate character-level languages such as Chinese or Japanese).
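
The choice described above can be sketched as a small lookup on the target language code. `set_eval_units` is a hypothetical helper; the text only names German, Japanese, and Chinese explicitly, so treating `eng` like `deu` and using the `char` latency unit for `cmn` are assumptions.

```shell
# Hypothetical helper: derive LATENCY_UNIT and BLEU_TOKENIZER from ${TGT_LANG}.
set_eval_units() {
    case "$1" in
        jpn|cmn)
            # Character-level languages: latency and BLEU computed on characters.
            LATENCY_UNIT=char
            BLEU_TOKENIZER=char
            ;;
        deu|eng)
            # Word-level languages (eng is an assumption): standard 13a tokenizer.
            LATENCY_UNIT=word
            BLEU_TOKENIZER=13a
            ;;
        *)
            echo "No default for '$1'; set LATENCY_UNIT and BLEU_TOKENIZER by hand." >&2
            return 1
            ;;
    esac
}

set_eval_units jpn
echo "LATENCY_UNIT=${LATENCY_UNIT} BLEU_TOKENIZER=${BLEU_TOKENIZER}"
```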

The simultaneous inference of SimulSeamless is based on AlignAtt, thus the f parameter (${FRAME}) and the layer from which to extract the attention scores (${LAYER}) have to be set accordingly.

Instructions to replicate IWSLT 2024 results ↙️

To replicate the results at 2 seconds of latency (measured by AL) on the test sets of the IWSLT 2024 Simultaneous track, use the following values:

en-de: ${TGT_LANG}=deu, ${FRAME}=6, ${LAYER}=3, ${SEG_SIZE}=1000
en-ja: ${TGT_LANG}=jpn, ${FRAME}=1, ${LAYER}=0, ${SEG_SIZE}=400
en-zh: ${TGT_LANG}=cmn, ${FRAME}=1, ${LAYER}=3, ${SEG_SIZE}=800
cs-en: ${TGT_LANG}=eng, ${FRAME}=9, ${LAYER}=3, ${SEG_SIZE}=1000

❗️Please note that ${FRAME} can be adjusted to achieve lower or higher latency.
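
For convenience, the four settings listed above can be wrapped in a lookup so that a single language-pair name sets all variables at once. This is only a sketch: `load_iwslt2024_preset` is a hypothetical helper and is not shipped with the repository.

```shell
# Hypothetical helper: set the IWSLT 2024 values listed above for one language pair.
load_iwslt2024_preset() {
    case "$1" in
        en-de) TGT_LANG=deu; FRAME=6; LAYER=3; SEG_SIZE=1000 ;;
        en-ja) TGT_LANG=jpn; FRAME=1; LAYER=0; SEG_SIZE=400 ;;
        en-zh) TGT_LANG=cmn; FRAME=1; LAYER=3; SEG_SIZE=800 ;;
        cs-en) TGT_LANG=eng; FRAME=9; LAYER=3; SEG_SIZE=1000 ;;
        *)
            echo "Unknown language pair: $1" >&2
            return 1
            ;;
    esac
}

load_iwslt2024_preset en-ja
echo "TGT_LANG=${TGT_LANG} FRAME=${FRAME} LAYER=${LAYER} SEG_SIZE=${SEG_SIZE}"
```

After loading a preset, ${FRAME} can still be overridden by hand to trade latency for quality.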

SimulSeamless can be run with:

simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_alignatt_seamlessm4t.AlignAttSeamlessS2T \
    --source ${LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${DATA_ROOT} \
    --model-size medium --target-language ${TGT_LANG} \
    --extract-attn-from-layer ${LAYER} --num-beams 5 \
    --frame-num ${FRAME} \
    --source-segment-size ${SEG_SIZE} \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --eval-latency-unit ${LATENCY_UNIT} --sacrebleu-tokenizer ${BLEU_TOKENIZER} \
    --output ${OUT_DIR} \
    --device cuda:0

If it is not already stored on your system, the SeamlessM4T model is downloaded automatically when the script runs. The output is saved in ${OUT_DIR}.

We suggest running the inference on a GPU to speed up the process, but the system can be run on any device (e.g., CPU) supported by SimulEval and HuggingFace.
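
One way to choose the value passed to --device is to fall back to the CPU when no NVIDIA GPU is visible. The sketch below is an assumption-laden convenience: `pick_device` is a hypothetical helper, it relies on `nvidia-smi` being on the PATH when a GPU is present, and it assumes that `cpu` is accepted as a device string.

```shell
# Hypothetical helper: use the first GPU if nvidia-smi can list one, else the CPU.
pick_device() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
        DEVICE=cuda:0
    else
        DEVICE=cpu   # assumption: "cpu" is a valid value for --device
    fi
}

pick_device
echo "Running on ${DEVICE}"
# Then pass it to the command above: --device "${DEVICE}"
```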

💬 Inference using docker

To run SimulSeamless using docker, follow the steps below:

Download the docker file by cloning this repository
Load the docker image:

docker load -i simulseamless.tar

Start the SimulEval standalone with GPU enabled:

docker run -e TGTLANG=${TGT_LANG} -e FRAME=${FRAME} -e LAYER=${LAYER} \
    -e BLEU_TOKENIZER=${BLEU_TOKENIZER} -e LATENCY_UNIT=${LATENCY_UNIT} \
    -e DEV=cuda:0 --gpus all --shm-size 32G \
    -p 2024:2024 simulseamless:latest

Start the remote evaluation with:

simuleval \
    --remote-eval --remote-port 2024 \
    --source ${LIST_OF_AUDIO} --target ${TGT_FILE} \
    --source-type speech --target-type text \
    --source-segment-size ${SEG_SIZE} \
    --eval-latency-unit ${LATENCY_UNIT} --sacrebleu-tokenizer ${BLEU_TOKENIZER} \
    --output ${OUT_DIR}

To set ${TGT_LANG}, ${FRAME}, ${LAYER}, ${BLEU_TOKENIZER}, ${LATENCY_UNIT}, ${LIST_OF_AUDIO}, ${TGT_FILE}, ${SEG_SIZE}, and ${OUT_DIR}, refer to 🤖 Inference using your environment.

📍Citation

@inproceedings{papi-etal-2024-simulseamless,
    title = "{S}imul{S}eamless: {FBK} at {IWSLT} 2024 Simultaneous Speech Translation",
    author = "Papi, Sara  and
      Gaido, Marco  and
      Negri, Matteo  and
      Bentivogli, Luisa",
    editor = "Salesky, Elizabeth  and
      Federico, Marcello  and
      Carpuat, Marine",
    booktitle = "Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand (in-person and online)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.iwslt-1.11",
    pages = "72--79",
}