|
--- |
|
library_name: transformers |
|
tags: |
|
- EchoLLaMA |
|
license: apache-2.0 |
|
datasets: |
|
- AquaLabs/Spatial-DPO-Dataset |
|
language: |
|
- en |
|
base_model: |
|
- meta-llama/Llama-3.2-1B-Instruct |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# EchoLLaMA: 3D-to-Speech with Multimodal AI |
|
|
|
[Model: EchoLLaMA-1B](https://huggingface.co/AquaLabs/EchoLLaMA-1B)

[TTS: Orpheus-3B-0.1-ft-Elise](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)

[Dataset: Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/)
|
|
|
## Overview |
|
|
|
EchoLLaMA is a multimodal AI system that transforms 3D visual data into natural spoken descriptions while enabling interactive dialogue through speech input. This repository contains the implementation of the LLaMA-3.2-1B-Instruct model fine-tuned with Direct Preference Optimization (DPO) for generating rich textual descriptions of 3D scenes. |
|
|
|
## Model Architecture |
|
|
|
The EchoLLaMA pipeline integrates four specialized models: |
|
|
|
1. **Image Analysis**: |
|
- DETR (DEtection TRansformer) for object detection |
|
- MiDaS for monocular depth estimation |
|
- Moondream for holistic image captioning |
|
|
|
2. **Text Generation**: |
|
- LLaMA-3.2-1B-Instruct fine-tuned with DPO |
|
|
|
3. **Speech Synthesis**: |
|
- Orpheus-3B-0.1-ft TTS model fine-tuned on the Elise English speech dataset |
|
|
|
4. **Speech Recognition**: |
|
- SpeechRecognition package for transcribing user speech input |
|
|
|
## Key Features |
|
|
|
- **3D Object Detection Matrix**: Constructs a grid-based representation of detected objects with spatial coordinates |
|
- **Depth-Aware Scene Understanding**: Incorporates relative depth values to capture 3D relationships |
|
- **Natural Language Generation**: Produces coherent and contextually rich descriptions |
|
- **High-Quality Speech Synthesis**: Converts textual descriptions into natural-sounding speech |
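The grid-based detection matrix can be sketched in a few lines. This is a hypothetical illustration, assuming detections arrive as `(label, bounding box, relative depth)` triples; in the real pipeline the boxes come from DETR and the depth values from MiDaS.

```python
# Hypothetical sketch: bucket each detection into a grid cell by its
# box centre, keeping the label and MiDaS-style relative depth.
# The (label, (x, y, w, h), depth) input shape is an assumption.

def build_detection_matrix(detections, image_size, grid=3):
    """Return a grid x grid matrix of detected objects with depth."""
    width, height = image_size
    matrix = [[[] for _ in range(grid)] for _ in range(grid)]
    for label, (x, y, w, h), depth in detections:
        cx, cy = x + w / 2, y + h / 2          # box centre in pixels
        col = min(int(cx / width * grid), grid - 1)
        row = min(int(cy / height * grid), grid - 1)
        matrix[row][col].append({"label": label, "depth": depth})
    return matrix

detections = [
    ("person", (100, 200, 80, 160), 0.35),
    ("dog", (480, 300, 120, 90), 0.60),
]
matrix = build_detection_matrix(detections, image_size=(640, 480))
```

A matrix like this, serialized into the prompt, is what lets the language model reason about left/right, near/far relationships between objects.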
|
|
|
## Training Details |
|
|
|
### LLaMA Model |
|
|
|
The LLaMA-3.2-1B-Instruct model was fine-tuned using: |
|
|
|
- **Technique**: Direct Preference Optimization (DPO) with LoRA |
|
- **Dataset**: 2000 samples from COCO 2017 processed with DETR, MiDaS, and Moondream
|
- **Chosen Responses**: Generated by DeepSeek-V3-0324 |
|
- **Rejected Responses**: Generated by the base LLaMA-3.2-1B-Instruct (before fine-tuning)
|
- **Training Parameters**: |
|
- LoRA Rank: 8 |
|
- β (DPO): 0.1 |
|
- Learning Rate: 2×10⁻⁵ with cosine decay |
|
- Batch Size: 16 (with 2×8 accumulation) |
|
- Sequence Length: 8192 |
|
- **Hardware**: 2×T4 GPUs
|
- **Training Time**: 1 hour 40 minutes |
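The settings above could be expressed with the TRL and PEFT libraries roughly as follows. Model and dataset loading are elided, argument names follow recent `trl`/`peft` releases, and the per-device/accumulation split is inferred from "16 (with 2×8 accumulation)"; treat this as an illustration of the hyperparameters, not the exact training script.

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

lora_config = LoraConfig(r=8, task_type="CAUSAL_LM")

training_args = DPOConfig(
    beta=0.1,                        # DPO temperature
    learning_rate=2e-5,
    lr_scheduler_type="cosine",      # cosine decay
    per_device_train_batch_size=2,   # 2 per step ...
    gradient_accumulation_steps=8,   # ... x 8 accumulation = 16 effective
    max_length=8192,
)

# trainer = DPOTrainer(model, args=training_args,
#                      train_dataset=dpo_dataset, peft_config=lora_config)
# trainer.train()
```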
|
|
|
### Orpheus Model |
|
|
|
The Orpheus-3B-0.1-ft TTS model was fine-tuned using: |
|
|
|
- **Technique**: Low-Rank Adaptation (LoRA) |
|
- **Dataset**: Elise English speech dataset |
|
- **Training Parameters**: |
|
- LoRA Rank (r): 64 |
|
- LoRA Alpha (α): 64 |
|
- LoRA Dropout: 0 |
|
- Learning Rate: 2×10⁻⁴ |
|
- **Hardware**: 2×T4 GPUs
|
- **Training Time**: 47 minutes |
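The Orpheus LoRA settings above map directly onto a PEFT config. The `target_modules` list is an assumption (the usual attention projections), since the exact set is not stated here.

```python
from peft import LoraConfig

orpheus_lora = LoraConfig(
    r=64,                # LoRA rank
    lora_alpha=64,       # LoRA alpha
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```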
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
```bash |
|
# Clone the repository |
|
git clone https://github.com/The-Aqua-Labs/EchoLLaMA-Pipeline.git |
|
cd EchoLLaMA-Pipeline |
|
``` |
|
|
|
Then open and run the Jupyter notebook in the repository.
|
|
|
## Pipeline Flow |
|
|
|
1. Image is processed with DETR for object detection and MiDaS for depth estimation |
|
2. Moondream generates a caption describing the image content |
|
3. The object detection matrix and caption are combined into a prompt |
|
4. LLaMA-3.2-1B-Instruct generates a detailed textual description |
|
5. Orpheus-3B-0.1-ft converts the text into speech |
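The five steps above can be sketched as a chain of functions. Each stub stands in for one of the real models (DETR/MiDaS, Moondream, EchoLLaMA-1B, Orpheus); the function names, prompt template, and return values are illustrative assumptions, not the shipped API.

```python
def detect_objects(image):
    """Stub for DETR + MiDaS: returns the object/depth matrix as text."""
    return "person at (1,0) depth 0.35; dog at (2,2) depth 0.60"

def caption_image(image):
    """Stub for Moondream: returns a holistic caption."""
    return "A person walking a dog in a park."

def describe_scene(matrix, caption):
    """Stub for EchoLLaMA-1B: matrix + caption are combined into a prompt."""
    prompt = f"Objects: {matrix}\nCaption: {caption}\nDescribe the 3D scene."
    return f"[description generated from a prompt of {len(prompt)} chars]"

def synthesize_speech(text):
    """Stub for Orpheus-3B-0.1-ft: returns audio bytes."""
    return b"RIFF...placeholder-wav..."

image = object()  # placeholder input image
description = describe_scene(detect_objects(image), caption_image(image))
audio = synthesize_speech(description)
```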
|
|
|
## Dataset |
|
|
|
The training dataset contains 1999 samples, each consisting of: |
|
- An image-derived prompt with object detection matrix and caption |
|
- A chosen response from DeepSeek-V3-0324 |
|
- A rejected response from LLaMA-3.2-1B-Instruct |
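A single sample therefore looks roughly like the record below. The field names follow the common DPO convention (`prompt`/`chosen`/`rejected`) and the strings are invented for illustration; the actual dataset schema may differ.

```python
# Assumed shape of one Spatial-DPO-Dataset sample (illustrative only).
sample = {
    "prompt": "Objects: person at (1,0) depth 0.35 ...\n"
              "Caption: A person walking a dog.\n"
              "Describe the 3D scene.",
    "chosen": "In the foreground, slightly left of centre, a person "
              "strolls beside a dog ...",          # from DeepSeek-V3-0324
    "rejected": "There is a person and a dog.",    # from base LLaMA-3.2-1B
}
```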
|
|
|
You can access the dataset at [AquaLabs/Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/) |
|
|
|
## Model Weights |
|
|
|
- LLaMA-3.2-1B-Instruct (fine-tuned): [AquaLabs/EchoLLaMA-1B](https://huggingface.co/AquaLabs/EchoLLaMA-1B) |
|
- Orpheus-3B-0.1-ft (fine-tuned): [AquaLabs/Orpheus-3B-0.1-ft-Elise](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise) |
|
|
|
## Contributors |
|
|
|
- Ahmet Erdem Pamuk - [GitHub](https://github.com/ahmeterdempmk) | [Hugging Face](https://huggingface.co/ahmeterdempmk) |
|
- Emir Kaan Özdemir - [GitHub](https://github.com/emirkaanozdemr) | [Hugging Face](https://huggingface.co/emirkaanozdemr) |
|
- Şuayp Talha Kocabay - [GitHub](https://github.com/suayptalha) | [Hugging Face](https://huggingface.co/suayptalha) |
|
|
|
## License |
|
|
|
This project is licensed under the Apache-2.0 License. |
|
|
|
Details are provided in the [paper](). |