FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
Abstract
Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples, code and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.
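The bitrate range quoted in the abstract can be sanity-checked with simple arithmetic: for a single-codebook codec, bitrate is the token rate multiplied by the bits per token. The sketch below is illustrative only. The 13-bit code width and the 12.5, 25, and 50 Hz token rates are assumptions not stated in this abstract, chosen because they reproduce the 0.16 to 0.65 kbps range. The helper names `bitrate_kbps` and `binary_code_index` are hypothetical and are not part of the FocalCodec API. The sign-based indexing shows one common way a binary codebook can map a latent vector to a token; it is not confirmed here as the authors' exact quantizer.

```python
def bitrate_kbps(token_rate_hz: float, bits_per_token: int) -> float:
    """Bitrate of a single-codebook codec: tokens per second times bits per token."""
    return token_rate_hz * bits_per_token / 1000.0


def binary_code_index(latent: list[float]) -> int:
    """Map a latent vector to an integer token by taking the sign of each dimension.

    Each dimension contributes one bit (1 if positive, else 0), so a
    D-dimensional latent addresses an implicit codebook of 2**D entries.
    """
    index = 0
    for value in latent:
        index = (index << 1) | (1 if value > 0 else 0)
    return index


if __name__ == "__main__":
    BITS_PER_TOKEN = 13  # assumption: 13 binary dimensions, i.e. 2**13 = 8192 codes
    for rate_hz in (12.5, 25.0, 50.0):  # assumption: illustrative token rates
        print(f"{rate_hz:>5} Hz x {BITS_PER_TOKEN} bits = "
              f"{bitrate_kbps(rate_hz, BITS_PER_TOKEN):.4f} kbps")

    # A toy 3-dimensional latent; real latents would have BITS_PER_TOKEN dimensions.
    print(binary_code_index([0.3, -1.2, 0.8]))
```

Under these assumptions, a single token stream carries the whole signal, which is why downstream models need only one prediction head per time step instead of one per codebook.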
Community
Project Page: https://lucadellalib.github.io/focalcodec-web/
GitHub: https://github.com/lucadellalib/focalcodec
Downstream Tasks: https://github.com/lucadellalib/audiocodecs
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models (2024)
- Long-Form Speech Generation with Spoken Language Models (2024)
- Metis: A Foundation Speech Generation Model with Masked Generative Pre-training (2025)
- ComplexDec: A Domain-robust High-fidelity Neural Audio Codec with Complex Spectrum Modeling (2025)
- Whisper-GPT: A Hybrid Representation Audio Large Language Model (2024)
- CLAP-S: Support Set Based Adaptation for Downstream Fiber-optic Acoustic Recognition (2025)
- GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling (2025)