XCodec Mini - Neural Audio Codec
Model Description
XCodec Mini is a neural audio codec designed for high-quality music compression and reconstruction. It pairs a semantic encoder with an acoustic encoder to compress audio efficiently while preserving quality.
Key Features
Dual Encoding Architecture
- Semantic encoder for high-level musical features
- Acoustic encoder for detailed sound information
- Multi-scale processing for efficient compression
Advanced Compression
- Multiple codebooks for flexible quality/size tradeoff
- Support for 44.1kHz high-fidelity audio
- Separate processing paths for vocals and instrumentals
Technical Specifications
- Input: Raw audio at 44.1kHz
- Output: Compressed representations and reconstructed audio
- Model Size: [Add total size]
- Compression Ratio: [Add typical ratio]
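The compression ratio is not stated in this card. As a rough sketch of how such a figure could be estimated for a codebook-based codec, the snippet below compares the bitrate of raw PCM audio with the bitrate of a token stream. The frame rate, number of codebooks, and codebook size used in the example are hypothetical placeholders, not published values for XCodec Mini.

```python
import math


def pcm_bitrate_bps(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bitrate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bit_depth * channels


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a token stream that emits one index per codebook per frame."""
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_index


def compression_ratio(raw_bps: float, coded_bps: float) -> float:
    """How many times smaller the token stream is than the raw audio."""
    return raw_bps / coded_bps


if __name__ == "__main__":
    # 44.1 kHz, 16-bit, mono input, as listed in this card.
    raw = pcm_bitrate_bps(44100, 16, 1)
    # ASSUMED values for illustration only: 50 frames/s, 8 codebooks, 1024 entries each.
    coded = codec_bitrate_bps(50, 8, 1024)
    print(compression_ratio(raw, coded))  # 176.4 under these assumed values
```

Using fewer codebooks lowers the coded bitrate and raises the ratio, which is the quality/size tradeoff mentioned under Key Features.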
Intended Uses
- High-quality music compression
- Audio archival and storage
- Music streaming applications
- Audio processing pipelines
Training Data
The model was trained on a diverse dataset of music, including:
- Various genres and styles
- Vocal and instrumental tracks
- High-quality studio recordings
Performance and Limitations
Strengths
- High-quality audio reconstruction
- Efficient compression ratios
- Separate handling of vocals and instrumentals
- Support for high sample rates
Limitations
- Computationally intensive for real-time applications
- Requires significant GPU memory
- Best suited for offline processing
- May introduce artifacts in extreme compression settings
Technical Specifications
Model Architecture
Semantic Encoder
- Based on HuBERT architecture
- Captures high-level musical features
- Outputs semantic tokens
Acoustic Encoder
- Multi-scale convolutional architecture
- Processes detailed sound information
- Generates acoustic tokens
Dual Decoders
- Separate decoders for vocals and instrumentals
- Multi-stage reconstruction process
- Quality-focused design
Input Requirements
- Audio Format: WAV/MP3
- Sample Rate: 44.1kHz
- Channels: Mono/Stereo
- Bit Depth: 16-bit
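The input requirements above can be checked before running the model. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav` is hypothetical and not part of the XCodec Mini code. It covers WAV files only, since the standard library cannot read MP3.

```python
import wave

EXPECTED_SAMPLE_RATE = 44100      # 44.1 kHz
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16-bit samples
ALLOWED_CHANNELS = (1, 2)         # mono or stereo


def check_wav(path):
    """Return a list of human-readable problems; an empty list means the file conforms."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_SAMPLE_RATE}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"bit depth is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() not in ALLOWED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels, expected mono or stereo")
    return problems
```

A file that fails the sample-rate check would need to be resampled to 44.1 kHz with a separate tool before processing.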
Output Format
- Reconstructed Audio: 44.1kHz WAV
- Intermediate Representations: Compressed tokens
Usage Guidelines
Hardware Requirements
- GPU: NVIDIA GPU with 8GB+ VRAM
- RAM: 16GB+ recommended
- Storage: SSD recommended for faster processing
Software Requirements
- Python 3.8+
- PyTorch 2.0+
- CUDA 11.0+
- Additional dependencies listed in installation guide
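A quick way to compare a machine against the hardware and software requirements above is sketched below. The helper name `check_environment` is hypothetical. The script only reports what it finds, and it skips the PyTorch and GPU checks when PyTorch is not installed.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 8)):
    """Collect basic facts about the local Python, PyTorch, and CUDA setup."""
    report = {"python_ok": sys.version_info >= min_python}
    report["torch_installed"] = importlib.util.find_spec("torch") is not None
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_version"] = torch.version.cuda  # None on CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            # Total memory of the first GPU, in GiB, to compare against the 8GB+ guideline.
            total_bytes = torch.cuda.get_device_properties(0).total_memory
            report["gpu_vram_gib"] = total_bytes / (1024 ** 3)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```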
Ethical Considerations
- Copyright: Users should ensure they have proper rights to process copyrighted material
- Attribution: Proper attribution should be given when using this model
- Data Privacy: Consider data privacy implications when processing sensitive audio
Additional Information
Model Weights
The model requires several checkpoint files:
- Semantic Encoder: semantic_ckpts/hf_1_325000/pytorch_model.bin
- Vocal Decoder: decoders/decoder_131000.pth
- Instrumental Decoder: decoders/decoder_151000.pth
- Final Checkpoint: final_ckpt/ckpt_00360000.pth
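Before loading the model, it can help to confirm that all four checkpoint files are present. The sketch below assumes the files sit under a single root directory with the relative paths listed above; the function name `missing_checkpoints` and the dictionary keys are hypothetical.

```python
from pathlib import Path

# Relative paths taken from the list above; the keys are illustrative labels.
REQUIRED_CHECKPOINTS = {
    "semantic_encoder": "semantic_ckpts/hf_1_325000/pytorch_model.bin",
    "vocal_decoder": "decoders/decoder_131000.pth",
    "instrumental_decoder": "decoders/decoder_151000.pth",
    "final_checkpoint": "final_ckpt/ckpt_00360000.pth",
}


def missing_checkpoints(root):
    """Return the labels of required checkpoint files not found under `root`."""
    root = Path(root)
    return [
        name
        for name, relative_path in REQUIRED_CHECKPOINTS.items()
        if not (root / relative_path).is_file()
    ]
```

An empty list means every file was found; otherwise the returned labels show which downloads are still needed.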
Contact
For issues and questions, please use the GitHub repository's issue tracker.