XCodec Mini - Neural Audio Codec

Model Description

XCodec Mini is a state-of-the-art neural audio codec designed for high-quality music compression and reconstruction. It combines semantic and acoustic encoding approaches to achieve efficient compression while maintaining audio quality.

Key Features

  • Dual Encoding Architecture

    • Semantic encoder for high-level musical features
    • Acoustic encoder for detailed sound information
    • Multi-scale processing for efficient compression
  • Advanced Compression

    • Multiple codebooks for a flexible quality/size tradeoff (see the bitrate sketch after this list)
    • Support for 44.1kHz high-fidelity audio
    • Separate processing paths for vocals and instrumentals
  • Technical Specifications

    • Input: Raw audio at 44.1kHz
    • Output: Compressed representations and reconstructed audio
    • Model Size: [Add total size]
    • Compression Ratio: [Add typical ratio]
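
The typical compression ratio is left unspecified above. As a rough illustration of how a multi-codebook codec trades quality for size, the sketch below uses assumed example values for frame rate, codebook count, and codebook size; they are not measured specifications of XCodec Mini.

```python
# Illustrative arithmetic only: frame rate, codebook count, and codebook size
# are assumed example values, not specifications of XCodec Mini.
import math

sample_rate   = 44_100   # Hz, per the card
frame_rate    = 50       # codec frames per second (assumed)
num_codebooks = 8        # codebooks in use (assumed)
codebook_size = 1024     # entries per codebook (assumed)

codec_bps = frame_rate * num_codebooks * math.log2(codebook_size)
pcm_bps = sample_rate * 16                       # 16-bit mono PCM baseline
print(f"codec bitrate: {codec_bps / 1000:.1f} kbps")
print(f"compression vs. 16-bit PCM: {pcm_bps / codec_bps:.0f}x")
```

Using fewer codebooks lowers the bitrate at the cost of reconstruction fidelity, which is the quality/size tradeoff referenced above.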

Intended Uses

  • High-quality music compression
  • Audio archival and storage
  • Music streaming applications
  • Audio processing pipelines

Training Data

The model was trained on a diverse dataset of music, including:

  • Various genres and styles
  • Vocal and instrumental tracks
  • High-quality studio recordings

Performance and Limitations

Strengths

  • High-quality audio reconstruction
  • Efficient compression ratios
  • Separate handling of vocals and instrumentals
  • Support for high sample rates

Limitations

  • Computationally intensive for real-time applications
  • Requires significant GPU memory
  • Best suited for offline processing
  • May introduce artifacts in extreme compression settings

Technical Specifications

Model Architecture

  1. Semantic Encoder

    • Based on HuBERT architecture
    • Captures high-level musical features
    • Outputs semantic tokens
  2. Acoustic Encoder

    • Multi-scale convolutional architecture
    • Processes detailed sound information
    • Generates acoustic tokens
  3. Dual Decoders

    • Separate decoders for vocals and instrumentals
    • Multi-stage reconstruction process
    • Quality-focused design (a structural sketch of all three components follows this list)
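
The implementation itself is not reproduced in this card. The PyTorch skeleton below only mirrors the three components described above; module names, layer choices, and dimensions are assumptions, and the quantization step that produces the tokens is omitted.

```python
# Structural sketch only: module names, layer choices, and dimensions are
# assumptions mirroring the description above, not the actual implementation.
import torch
import torch.nn as nn

class XCodecMiniSketch(nn.Module):
    def __init__(self, semantic_dim=768, acoustic_channels=256):
        super().__init__()
        # 1. Semantic encoder: stand-in for the HuBERT-based feature extractor.
        self.semantic_encoder = nn.Sequential(
            nn.Conv1d(1, semantic_dim, kernel_size=10, stride=5),
            nn.GELU(),
        )
        # 2. Acoustic encoder: multi-scale strided convolutions over the waveform.
        self.acoustic_encoder = nn.Sequential(
            nn.Conv1d(1, acoustic_channels, kernel_size=7, stride=2),
            nn.ELU(),
            nn.Conv1d(acoustic_channels, acoustic_channels, kernel_size=7, stride=4),
        )
        # 3. Dual decoders: separate reconstruction paths for vocals and instrumentals.
        self.vocal_decoder = nn.ConvTranspose1d(acoustic_channels, 1, kernel_size=8, stride=8)
        self.instrumental_decoder = nn.ConvTranspose1d(acoustic_channels, 1, kernel_size=8, stride=8)

    def forward(self, waveform):                    # waveform: (batch, 1, samples)
        semantic = self.semantic_encoder(waveform)  # high-level musical features
        acoustic = self.acoustic_encoder(waveform)  # detailed acoustic features
        return semantic, self.vocal_decoder(acoustic), self.instrumental_decoder(acoustic)
```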

Input Requirements

  • Audio Format: WAV/MP3
  • Sample Rate: 44.1kHz
  • Channels: Mono/Stereo
  • Bit Depth: 16-bit

Output Format

  • Reconstructed Audio: 44.1kHz WAV
  • Intermediate Representations: Compressed semantic and acoustic token streams (an I/O handling sketch follows)
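
A minimal input/output handling sketch using torchaudio, assuming the requirements listed above. File names are placeholders, and the codec call itself is replaced by a placeholder because its Python interface is not documented in this card.

```python
# Prepares input per the stated requirements (44.1 kHz, mono/stereo) and writes
# a 16-bit 44.1 kHz WAV output. File paths are placeholders.
import torchaudio
import torchaudio.functional as AF

TARGET_SR = 44_100

waveform, sr = torchaudio.load("input.mp3")        # (channels, samples), WAV or MP3
if sr != TARGET_SR:
    waveform = AF.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)

# ... run the codec here to obtain `reconstruction` (interface not documented in this card) ...
reconstruction = waveform                          # placeholder so the sketch runs end to end

torchaudio.save("reconstructed.wav", reconstruction, TARGET_SR,
                encoding="PCM_S", bits_per_sample=16)
```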

Usage Guidelines

Hardware Requirements

  • GPU: NVIDIA GPU with 8GB+ VRAM
  • RAM: 16GB+ recommended
  • Storage: SSD recommended for faster processing

Software Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • CUDA 11.0+
  • Additional dependencies as listed in the installation guide (a quick environment check follows this list)
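
A quick check that the environment matches the requirements above; the thresholds mirror the listed minimums and can be adjusted for your setup.

```python
# Checks the environment against the stated requirements: Python 3.8+,
# PyTorch 2.0+, CUDA, and a GPU with 8 GB+ VRAM.
import sys
import torch

assert sys.version_info >= (3, 8), "Python 3.8+ required"
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (2, 0), "PyTorch 2.0+ required"

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name} | VRAM: {vram_gb:.1f} GB | CUDA: {torch.version.cuda}")
    if vram_gb < 8:
        print("Warning: the card recommends 8 GB+ of VRAM.")
else:
    print("No CUDA device found; GPU inference is recommended for this codec.")
```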

Ethical Considerations

  • Copyright: Users should ensure they have proper rights to process copyrighted material
  • Attribution: Proper attribution should be given when using this model
  • Data Privacy: Consider data privacy implications when processing sensitive audio

Additional Information

Model Weights

The model requires several checkpoint files (a loading check is sketched after this list):

  • Semantic Encoder: semantic_ckpts/hf_1_325000/pytorch_model.bin
  • Vocal Decoder: decoders/decoder_131000.pth
  • Instrumental Decoder: decoders/decoder_151000.pth
  • Final Checkpoint: final_ckpt/ckpt_00360000.pth
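
To confirm the checkpoints are in place before inference, a small sketch over the paths listed above. Loading to CPU avoids touching GPU memory; only presence and deserializability are checked, since the state-dict layout is not documented here.

```python
# Verifies that the checkpoint files listed above exist and can be deserialized.
# Paths are relative to the repository root.
from pathlib import Path
import torch

CHECKPOINTS = {
    "semantic encoder":     "semantic_ckpts/hf_1_325000/pytorch_model.bin",
    "vocal decoder":        "decoders/decoder_131000.pth",
    "instrumental decoder": "decoders/decoder_151000.pth",
    "final checkpoint":     "final_ckpt/ckpt_00360000.pth",
}

for name, rel_path in CHECKPOINTS.items():
    path = Path(rel_path)
    if not path.is_file():
        print(f"missing: {name} ({rel_path})")
        continue
    state = torch.load(path, map_location="cpu")   # CPU load; no GPU memory used
    print(f"loaded {name}: {type(state).__name__}")
```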

Contact

For issues and questions, please use the GitHub repository's issue tracker.
