---
library_name: transformers
tags:
- EchoLLaMA
license: apache-2.0
datasets:
- AquaLabs/Spatial-DPO-Dataset
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---

# EchoLLaMA: 3D-to-Speech with Multimodal AI

[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-EchoLLaMA--1B-yellow)](https://huggingface.co/AquaLabs/EchoLLaMA-1B)
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Orpheus--3B--0.1--ft--Elise-blue)](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)
[![Hugging Face](https://img.shields.io/badge/Dataset-Spatial--DPO--Dataset-green)](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/)

## Overview

EchoLLaMA is a multimodal AI system that transforms 3D visual data into natural spoken descriptions while enabling interactive dialogue through speech input. This repository contains the implementation of the LLaMA-3.2-1B-Instruct model fine-tuned with Direct Preference Optimization (DPO) for generating rich textual descriptions of 3D scenes.

## Model Architecture

The EchoLLaMA pipeline integrates four specialized models:

1. **Image Analysis**: 
   - DETR (DEtection TRansformer) for object detection
   - MiDaS for monocular depth estimation
   - Moondream for holistic image captioning

2. **Text Generation**:
   - LLaMA-3.2-1B-Instruct fine-tuned with DPO

3. **Speech Synthesis**:
   - Orpheus-3B-0.1-ft TTS model fine-tuned on the Elise English speech dataset

4. **Speech Recognition**:
   - SpeechRecognition package for transcribing user speech input
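
The image-analysis stage can be assembled from publicly available checkpoints. The snippet below is a minimal sketch, assuming the standard `facebook/detr-resnet-50` checkpoint for DETR and the `MiDaS_small` variant from `torch.hub`; the exact checkpoints and revisions used in EchoLLaMA may differ, and the Moondream captioner is only noted in a comment because its API varies by model revision.

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

image = Image.open("scene.jpg")  # any input image

# Object detection with DETR (assumed checkpoint: facebook/detr-resnet-50)
detr_processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detr = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
inputs = detr_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detr(**inputs)
detections = detr_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.9
)[0]  # dict with "scores", "labels", "boxes" in pixel coordinates

# Monocular depth estimation with MiDaS (assumed variant: MiDaS_small via torch.hub)
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas_transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

# Moondream captioning can be loaded analogously with
# AutoModelForCausalLM.from_pretrained("vikhyatk/moondream2", trust_remote_code=True);
# its captioning API differs between revisions, so consult that model card.
```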

## Key Features

- **3D Object Detection Matrix**: Constructs a grid-based representation of detected objects with spatial coordinates
- **Depth-Aware Scene Understanding**: Incorporates relative depth values to capture 3D relationships
- **Natural Language Generation**: Produces coherent and contextually rich descriptions
- **High-Quality Speech Synthesis**: Converts textual descriptions into natural-sounding speech
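
The exact matrix layout is not specified on this card, but the idea can be sketched as follows: each detected object is assigned to a cell of a small grid based on its box centre and annotated with the mean relative depth inside its box. The function name `build_detection_matrix` and the 3×3 grid size are illustrative assumptions, not the pipeline's exact format.

```python
import numpy as np

def build_detection_matrix(detections, depth_map, id2label, grid=3):
    """Illustrative sketch: place each detection into a grid cell with its mean depth.

    detections: dict with "labels" and "boxes" (as returned by DETR post-processing)
    depth_map:  HxW array of relative depths (e.g. from MiDaS), normalised to [0, 1]
    id2label:   mapping from class id to class name (e.g. detr.config.id2label)
    """
    h, w = depth_map.shape
    matrix = [[[] for _ in range(grid)] for _ in range(grid)]
    for label, box in zip(detections["labels"], detections["boxes"]):
        x0, y0, x1, y1 = [float(v) for v in box]
        # Grid cell of the box centre
        col = min(int((x0 + x1) / 2 / w * grid), grid - 1)
        row = min(int((y0 + y1) / 2 / h * grid), grid - 1)
        # Mean relative depth inside the box as a coarse distance cue
        patch = depth_map[int(y0):int(y1), int(x0):int(x1)]
        depth = float(patch.mean()) if patch.size else 0.0
        matrix[row][col].append(f"{id2label[int(label)]} (depth={depth:.2f})")
    return matrix
```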

## Training Details

### LLaMA Model

The LLaMA-3.2-1B-Instruct model was fine-tuned using:

- **Technique**: Direct Preference Optimization (DPO) with LoRA
- **Dataset**: 2000 samples from COCO 2017, processed with DETR and Moondream
- **Chosen Responses**: Generated by DeepSeek-V3-0324
- **Rejected Responses**: Generated by the base LLaMA-3.2-1B-Instruct before fine-tuning
- **Training Parameters**:
  - LoRA Rank: 8
  - β (DPO): 0.1
  - Learning Rate: 2×10⁻⁵ with cosine decay
  - Batch Size: 16 (with 2×8 accumulation)
  - Sequence Length: 8192
- **Hardware**: 2×T4 GPUs
- **Training Time**: 1 hour 40 minutes
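
For reference, the stated hyperparameters map onto a TRL `DPOTrainer` run roughly as below. This is a minimal sketch assuming a recent TRL version, not the exact training script: the LoRA alpha and target modules, the per-device batch split (assumed 2 × 8 gradient accumulation for an effective batch of 16), and the epoch count are assumptions.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset = load_dataset("AquaLabs/Spatial-DPO-Dataset", split="train")

# LoRA rank 8 as stated above; alpha and target modules are assumptions
peft_config = LoraConfig(
    r=8, lora_alpha=16, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

args = DPOConfig(
    output_dir="echollama-dpo",
    beta=0.1,                        # DPO beta
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=2,   # 2 x 8 accumulation -> effective batch 16
    gradient_accumulation_steps=8,
    max_length=8192,
    num_train_epochs=1,              # assumption; not stated on this card
)

trainer = DPOTrainer(
    model=model, args=args, train_dataset=dataset,
    processing_class=tokenizer, peft_config=peft_config,
)
trainer.train()
```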

### Orpheus Model

The Orpheus-3B-0.1-ft TTS model was fine-tuned using:

- **Technique**: Low-Rank Adaptation (LoRA)
- **Dataset**: Elise English speech dataset
- **Training Parameters**:
  - LoRA Rank (r): 64
  - LoRA Alpha (α): 64
  - LoRA Dropout: 0
  - Learning Rate: 2×10⁻⁴
- **Hardware**: 2×T4 GPUs
- **Training Time**: 47 minutes
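
The LoRA settings above correspond to a `peft` configuration along these lines; the target modules are an assumption, and the actual Orpheus fine-tuning script is not reproduced here.

```python
from peft import LoraConfig

orpheus_lora = LoraConfig(
    r=64,             # LoRA rank, as stated above
    lora_alpha=64,    # LoRA alpha
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
```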

## Usage

### Installation

```bash
# Clone the repository
git clone https://github.com/The-Aqua-Labs/EchoLLaMA-Pipeline.git
cd EchoLLaMA-Pipeline
```

Then open and run the Jupyter notebook in the repository.
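
If you only want the text-generation stage, the fine-tuned model can also be loaded directly with `transformers`. The snippet below is a minimal sketch; the prompt shown is purely illustrative, since in the full pipeline it is built from the object detection matrix and the Moondream caption.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AquaLabs/EchoLLaMA-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Illustrative prompt: the pipeline normally supplies the detection matrix and caption here.
prompt = (
    "Describe the scene: a person on a bench in the centre, "
    "a dog to the left, trees in the background."
)
messages = [{"role": "user", "content": prompt}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```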

## Pipeline Flow

1. Image is processed with DETR for object detection and MiDaS for depth estimation
2. Moondream generates a caption describing the image content
3. The object detection matrix and caption are combined into a prompt
4. LLaMA-3.2-1B-Instruct generates a detailed textual description
5. Orpheus-3B-0.1-ft converts the text into speech
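
As a rough illustration of steps 3–4, the detection matrix and caption can be folded into a single prompt for the fine-tuned model. The wording of this template is an assumption for illustration, not the exact prompt used in training.

```python
def build_prompt(matrix, caption):
    """Illustrative prompt assembly (step 3); the real template may differ."""
    rows = []
    for r, row in enumerate(matrix):
        for c, cell in enumerate(row):
            if cell:
                rows.append(f"cell ({r}, {c}): " + ", ".join(cell))
    matrix_text = "\n".join(rows) if rows else "no objects detected"
    return (
        "Image caption: " + caption + "\n"
        "Object detection matrix (row, column, relative depth):\n"
        + matrix_text + "\n"
        "Describe the 3D scene in natural language."
    )
```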

## Dataset

The training dataset contains 1999 samples, each consisting of:
- An image-derived prompt with object detection matrix and caption
- A chosen response from DeepSeek-V3-0324
- A rejected response from LLaMA-3.2-1B-Instruct

You can access the dataset at [AquaLabs/Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/).
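
A quick way to inspect it is via the `datasets` library; the field names are expected to follow the usual DPO convention (prompt, chosen, rejected), but check `column_names` to be sure.

```python
from datasets import load_dataset

ds = load_dataset("AquaLabs/Spatial-DPO-Dataset", split="train")
print(ds.column_names)   # expected DPO-style fields, e.g. prompt / chosen / rejected
print(ds[0])             # one preference pair derived from a COCO 2017 image
```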

## Model Weights

- LLaMA-3.2-1B-Instruct (fine-tuned): [AquaLabs/EchoLLaMA-1B](https://huggingface.co/AquaLabs/EchoLLaMA-1B)
- Orpheus-3B-0.1-ft (fine-tuned): [AquaLabs/Orpheus-3B-0.1-ft-Elise](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)

## Contributors

- Ahmet Erdem Pamuk - [GitHub](https://github.com/ahmeterdempmk) | [Hugging Face](https://huggingface.co/ahmeterdempmk)
- Emir Kaan Özdemir - [GitHub](https://github.com/emirkaanozdemr) | [Hugging Face](https://huggingface.co/emirkaanozdemr)
- Şuayp Talha Kocabay - [GitHub](https://github.com/suayptalha) | [Hugging Face](https://huggingface.co/suayptalha)

## License

This project is licensed under the Apache-2.0 License.

Details are provided in the [paper]().