---
library_name: transformers
tags:
- EchoLLaMA
license: apache-2.0
datasets:
- AquaLabs/Spatial-DPO-Dataset
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---
# EchoLLaMA: 3D-to-Speech with Multimodal AI
[Model: EchoLLaMA-1B](https://huggingface.co/AquaLabs/EchoLLaMA-1B)
[TTS Model: Orpheus-3B-0.1-ft-Elise](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)
[Dataset: Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/)
## Overview
EchoLLaMA is a multimodal AI system that transforms 3D visual data into natural spoken descriptions while enabling interactive dialogue through speech input. This repository contains the implementation of the LLaMA-3.2-1B-Instruct model fine-tuned with Direct Preference Optimization (DPO) for generating rich textual descriptions of 3D scenes.
## Model Architecture
The EchoLLaMA pipeline integrates four specialized models:
1. **Image Analysis**:
- DETR (DEtection TRansformer) for object detection
- MiDaS for monocular depth estimation
- Moondream for holistic image captioning
2. **Text Generation**:
- LLaMA-3.2-1B-Instruct fine-tuned with DPO
3. **Speech Synthesis**:
- Orpheus-3B-0.1-ft TTS model fine-tuned on the Elise English speech dataset
4. **Speech Recognition**:
- SpeechRecognition package for transcribing user speech input
## Key Features
- **3D Object Detection Matrix**: Constructs a grid-based representation of detected objects with spatial coordinates
- **Depth-Aware Scene Understanding**: Incorporates relative depth values to capture 3D relationships
- **Natural Language Generation**: Produces coherent and contextually rich descriptions
- **High-Quality Speech Synthesis**: Converts textual descriptions into natural-sounding speech
## Training Details
### LLaMA Model
The LLaMA-3.2-1B-Instruct model was fine-tuned using:
- **Technique**: Direct Preference Optimization (DPO) with LoRA
- **Dataset**: 2,000 samples from COCO 2017, processed with DETR and Moondream
- **Chosen Responses**: Generated by DeepSeek-V3-0324
- **Rejected Responses**: Generated by the base LLaMA-3.2-1B-Instruct model (before fine-tuning)
- **Training Parameters**:
- LoRA Rank: 8
- β (DPO): 0.1
- Learning Rate: 2×10⁻⁵ with cosine decay
- Batch Size: 16 (with 2×8 accumulation)
- Sequence Length: 8192
- **Hardware**: 2× NVIDIA T4 GPUs
- **Training Time**: 1 hour 40 minutes
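For reference, this configuration maps naturally onto TRL's `DPOTrainer`. The snippet below is a minimal sketch under the hyperparameters listed above, not the exact training script; the LoRA alpha, target modules, output directory, and per-device batch split are assumptions.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("AquaLabs/Spatial-DPO-Dataset", split="train")

# LoRA rank 8 as reported; alpha and default target modules are assumptions.
peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="echollama-1b-dpo",     # placeholder path
    beta=0.1,                          # DPO beta
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=1,     # effective batch 16 = 1 x 8 accumulation x 2 GPUs (assumed split)
    gradient_accumulation_steps=8,
    max_length=8192,                   # sequence length
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,        # `tokenizer=` on older TRL versions
    peft_config=peft_config,
)
trainer.train()
```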
### Orpheus Model
The Orpheus-3B-0.1-ft TTS model was fine-tuned using:
- **Technique**: Low-Rank Adaptation (LoRA)
- **Dataset**: Elise English speech dataset
- **Training Parameters**:
- LoRA Rank (r): 64
- LoRA Alpha (α): 64
- LoRA Dropout: 0
- Learning Rate: 2×10⁻⁴
- **Hardware**: 2× NVIDIA T4 GPUs
- **Training Time**: 47 minutes
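With PEFT, an adapter matching these values would look roughly like the following sketch. The base checkpoint name and target modules are assumptions; the actual fine-tuning ran on the Elise speech data with a standard supervised trainer at the learning rate listed above.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Orpheus-3B is a Llama-architecture model that emits audio tokens, so it loads
# as a causal LM. The checkpoint and target modules below are assumptions.
base = AutoModelForCausalLM.from_pretrained("canopylabs/orpheus-3b-0.1-ft")

lora = LoraConfig(
    r=64,              # LoRA rank
    lora_alpha=64,     # LoRA alpha
    lora_dropout=0.0,  # no dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()
# Training then proceeds with a standard Trainer at learning_rate=2e-4.
```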
## Usage
### Installation
```bash
# Clone the repository
git clone https://github.com/The-Aqua-Labs/EchoLLaMA-Pipeline.git
cd EchoLLaMA-Pipeline
```
Then open the repository's Jupyter notebook and run its cells.
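If you only need the fine-tuned text model, it loads directly with `transformers`. The prompt below is illustrative only; the real template (object detection matrix plus Moondream caption) is assembled by the notebook and may differ in wording.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AquaLabs/EchoLLaMA-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative prompt: object matrix + caption, formatted as plain text.
prompt = (
    "Objects: person at (0.24, 0.55), depth 0.71; dog at (0.62, 0.48), depth 0.43\n"
    "Caption: A person walking a dog in a park.\n"
    "Describe the 3D scene in detail."
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```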
## Pipeline Flow
1. Image is processed with DETR for object detection and MiDaS for depth estimation
2. Moondream generates a caption describing the image content
3. The object detection matrix and caption are combined into a prompt
4. LLaMA-3.2-1B-Instruct generates a detailed textual description
5. Orpheus-3B-0.1-ft converts the text into speech
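Steps 1–3 can be approximated with standard `transformers` pipelines. The checkpoints and the matrix format in this sketch are assumptions for illustration; the actual grid construction and Moondream captioning live in the pipeline repository.

```python
from PIL import Image
from transformers import pipeline

image = Image.open("scene.jpg")  # placeholder input image

# Object detection with DETR and monocular depth with a MiDaS/DPT-style model;
# these public checkpoints are assumptions and may differ from the notebook.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")

detections = detector(image)
depth_map = depth_estimator(image)["predicted_depth"]  # relative depth tensor

# Build a simple object matrix: label, normalized box center, mean depth in the box.
h, w = depth_map.shape[-2:]
matrix = []
for det in detections:
    box = det["box"]
    cx = (box["xmin"] + box["xmax"]) / 2 / image.width
    cy = (box["ymin"] + box["ymax"]) / 2 / image.height
    x0, x1 = int(box["xmin"] / image.width * w), int(box["xmax"] / image.width * w)
    y0, y1 = int(box["ymin"] / image.height * h), int(box["ymax"] / image.height * h)
    d = depth_map[..., y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)].mean().item()
    matrix.append((det["label"], round(cx, 2), round(cy, 2), round(d, 2)))

print(matrix)  # combined with the Moondream caption to form the LLaMA prompt
```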
## Dataset
The training dataset contains 1999 samples, each consisting of:
- An image-derived prompt with object detection matrix and caption
- A chosen response from DeepSeek-V3-0324
- A rejected response from LLaMA-3.2-1B-Instruct
You can access the dataset at [AquaLabs/Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/).
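The dataset loads with the `datasets` library; the field names in the comment below follow the usual DPO convention and should be verified against the dataset card.

```python
from datasets import load_dataset

ds = load_dataset("AquaLabs/Spatial-DPO-Dataset", split="train")
print(ds.num_rows)      # expected: 1999
print(ds.column_names)  # assumed to include prompt, chosen, rejected fields
```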
## Model Weights
- LLaMA-3.2-1B-Instruct (fine-tuned): [AquaLabs/EchoLLaMA-1B](https://huggingface.co/AquaLabs/EchoLLaMA-1B)
- Orpheus-3B-0.1-ft (fine-tuned): [AquaLabs/Orpheus-3B-0.1-ft-Elise](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)
## Contributors
- Ahmet Erdem Pamuk - [GitHub](https://github.com/ahmeterdempmk) | [Hugging Face](https://huggingface.co/ahmeterdempmk)
- Emir Kaan Özdemir - [GitHub](https://github.com/emirkaanozdemr) | [Hugging Face](https://huggingface.co/emirkaanozdemr)
- Şuayp Talha Kocabay - [GitHub](https://github.com/suayptalha) | [Hugging Face](https://huggingface.co/suayptalha)
## License
This project is licensed under the Apache-2.0 License.
Details are provided in the [paper]().