@singhsidhukuldeep on Hugging Face

Good folks at @nvidia and @Tsinghua_Uni have released LLaMA-Mesh: a revolutionary approach to 3D content generation!

The framework lets a language model generate 3D meshes directly, as plain text, from natural language prompts while keeping its general language capabilities.

Here are the architecture and implementation details!

>> Core Components

Model Foundation
- If you haven't guessed it yet, it's built on the LLaMA-3.1-8B-Instruct base model
- Maintains original language capabilities while adding 3D generation
- Context length is set to 8,000 tokens

3D Representation Strategy
- Uses the OBJ file format for mesh representation
- Quantizes vertex coordinates into 64 discrete bins per axis
- Sorts vertices by z-y-x coordinates, from lowest to highest
- Sorts faces by the lowest vertex indices for consistency
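
As a rough illustration of the representation steps above, here is a minimal Python sketch. The helper names (`quantize`, `mesh_to_obj_text`), the single bounding range shared by all three axes, and the rule of rotating each face so its lowest index comes first are assumptions made for illustration. They are not details confirmed by the post.

```python
def quantize(value, lo, hi, bins=64):
    """Map a float in [lo, hi] to an integer bin index in [0, bins - 1]."""
    if hi == lo:
        return 0
    scaled = (value - lo) / (hi - lo)          # normalize to [0, 1]
    return min(bins - 1, int(scaled * bins))   # fold the upper edge into the last bin


def mesh_to_obj_text(vertices, faces, bins=64):
    """Quantize, sort, and serialize a mesh as OBJ-style text.

    vertices: list of (x, y, z) floats; faces: list of 0-based index tuples.
    """
    # Assumption: one shared range for all axes, so the aspect ratio is kept.
    lo = min(c for v in vertices for c in v)
    hi = max(c for v in vertices for c in v)
    quantized = [tuple(quantize(c, lo, hi, bins) for c in v) for v in vertices]

    # Sort vertices by z, then y, then x, lowest first.
    order = sorted(
        range(len(quantized)),
        key=lambda i: (quantized[i][2], quantized[i][1], quantized[i][0]),
    )
    remap = {old: new for new, old in enumerate(order)}

    # Re-index faces, rotate each so its lowest index leads (winding is kept),
    # then sort the faces by their indices.
    new_faces = []
    for face in faces:
        idx = [remap[i] for i in face]
        k = idx.index(min(idx))
        new_faces.append(tuple(idx[k:] + idx[:k]))
    new_faces.sort()

    lines = [f"v {x} {y} {z}" for x, y, z in (quantized[i] for i in order)]
    # OBJ face indices are 1-based.
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in new_faces]
    return "\n".join(lines)
```

With 64 bins every coordinate becomes a small integer, which keeps the text short enough for a mesh to fit in the model's context window. This sketch does not merge vertices that collapse onto the same bin.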

Data Processing Pipeline
- Filters meshes to a maximum of 500 faces for computational efficiency
- Applies random rotations (0°, 90°, 180°, 270°) for data augmentation
- Generates ~125k mesh variations from 31k base meshes
- Uses Cap3D-generated captions for text descriptions
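
The filtering and rotation steps above could look roughly like the following sketch. The rotation axis (z here), the function names, and the choice to emit all four rotations for every retained mesh are assumptions. The post only says that rotations are random and drawn from the four listed angles.

```python
def rotate_z(vertices, quarter_turns):
    """Rotate (x, y, z) points about the z-axis by quarter_turns * 90 degrees."""
    out = [tuple(v) for v in vertices]
    for _ in range(quarter_turns % 4):
        # Exact 90-degree counter-clockwise turn, no trigonometry needed.
        out = [(-y, x, z) for x, y, z in out]
    return out


def augment(meshes, max_faces=500):
    """Yield (vertices, faces) variants for meshes with at most max_faces faces."""
    for vertices, faces in meshes:
        if len(faces) > max_faces:
            continue                      # drop meshes that are too large
        for k in range(4):                # 0, 90, 180, 270 degrees
            yield rotate_z(vertices, k), faces
```

Under that all-four-rotations assumption, 31k base meshes would give about 124k variants, which is in line with the ~125k figure quoted above.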

>> Training Framework

Dataset Composition
- 40% Mesh Generation tasks
- 20% Mesh Understanding tasks
- 40% General Conversation (UltraChat dataset)
- 8x training turns for generation, 4x for understanding
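
One simple way to realize a 40/20/40 mix like this is weighted sampling over task types. The sketch below uses only the Python standard library. The task labels and the `sample_tasks` helper are hypothetical, and the post does not say how the authors actually interleave the data.

```python
import random
from collections import Counter

# Mixture weights taken from the list above; the labels are illustrative.
TASK_MIX = {
    "mesh_generation": 0.40,
    "mesh_understanding": 0.20,
    "general_conversation": 0.40,
}


def sample_tasks(n, rng):
    """Draw n task labels according to TASK_MIX using the given random.Random."""
    tasks = list(TASK_MIX)
    weights = [TASK_MIX[t] for t in tasks]
    return rng.choices(tasks, weights=weights, k=n)


if __name__ == "__main__":
    counts = Counter(sample_tasks(10_000, random.Random(0)))
    for task, count in counts.items():
        print(f"{task}: {count / 10_000:.1%}")
```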

Training Configuration
- Trained on 32 A100 GPUs (for Nvidia, this is literally in-house)
- 21,000 training iterations
- Global batch size: 128
- AdamW optimizer with a 1e-5 learning rate
- 30-step warmup with cosine scheduling
- Total training time: approximately 3 days (based on the paper)
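
To make the schedule concrete, here is a small sketch of a warmup-plus-cosine learning rate curve using the numbers above. The linear warmup shape, the decay to zero, and the absence of gradient accumulation are assumptions, since the post gives only the step counts, the peak rate, and the batch size.

```python
import math

PEAK_LR = 1e-5
WARMUP_STEPS = 30
TOTAL_STEPS = 21_000
GLOBAL_BATCH = 128
NUM_GPUS = 32


def learning_rate(step, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1 + math.cos(math.pi * progress))


# Simple arithmetic implied by the configuration.
samples_seen = TOTAL_STEPS * GLOBAL_BATCH    # 2,688,000 sequences in total
per_gpu_batch = GLOBAL_BATCH // NUM_GPUS     # 4, assuming no gradient accumulation
```

A 30-step warmup is under 0.2% of the run, so nearly all of training happens on the cosine part of the curve.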

This research opens exciting possibilities for intuitive 3D content creation through natural language interaction. The future of digital design is conversational!
