---
library_name: transformers
tags:
- EchoLLaMA
license: apache-2.0
datasets:
- AquaLabs/Spatial-DPO-Dataset
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---
# EchoLLaMA: 3D-to-Speech with Multimodal AI

## Overview
EchoLLaMA is a multimodal AI system that transforms 3D visual data into natural spoken descriptions while enabling interactive dialogue through speech input. This repository contains the implementation of the LLaMA-3.2-1B-Instruct model fine-tuned with Direct Preference Optimization (DPO) for generating rich textual descriptions of 3D scenes.
## Model Architecture

The EchoLLaMA pipeline integrates four specialized models:

**Image Analysis:**
- DETR (DEtection TRansformer) for object detection
- MiDaS for monocular depth estimation
- Moondream for holistic image captioning

**Text Generation:**
- LLaMA-3.2-1B-Instruct fine-tuned with DPO

**Speech Synthesis:**
- Orpheus-3B-0.1-ft TTS model fine-tuned on the Elise English speech dataset

**Speech Recognition:**
- SpeechRecognition package for transcribing user speech input
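The notebook in the companion repository wires these components together. As a rough sketch of how they can be loaded with `transformers` and `torch.hub` (the DETR, MiDaS, and Moondream checkpoint IDs are assumptions based on their common public releases, while `AquaLabs/EchoLLaMA-1B` is the fine-tuned model released with this card):

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DetrForObjectDetection,
    DetrImageProcessor,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Object detection: DETR (assumed checkpoint: facebook/detr-resnet-50)
detr_processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detr = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").to(device)

# Monocular depth estimation: MiDaS via torch.hub (assumed variant: MiDaS_small)
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").to(device).eval()
midas_transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

# Holistic captioning: Moondream (assumed checkpoint: vikhyatk/moondream2)
moondream = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2", trust_remote_code=True
)

# Text generation: the DPO fine-tuned LLaMA released with this card
llama_tokenizer = AutoTokenizer.from_pretrained("AquaLabs/EchoLLaMA-1B")
llama = AutoModelForCausalLM.from_pretrained("AquaLabs/EchoLLaMA-1B").to(device)
```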
## Key Features
- 3D Object Detection Matrix: Constructs a grid-based representation of detected objects with spatial coordinates
- Depth-Aware Scene Understanding: Incorporates relative depth values to capture 3D relationships
- Natural Language Generation: Produces coherent and contextually rich descriptions
- High-Quality Speech Synthesis: Converts textual descriptions into natural-sounding speech
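This card does not fix an exact matrix layout. The sketch below shows one plausible way to assemble the grid-based detection matrix described above from DETR-style bounding boxes and a MiDaS-style relative depth map; the grid size and the per-cell string format are illustrative assumptions:

```python
import numpy as np

def build_detection_matrix(boxes, labels, depth_map, grid_size=4):
    """Place each detected object into a spatial grid cell together with its
    relative depth, yielding a compact 3D-aware scene representation.

    boxes: list of (x_min, y_min, x_max, y_max) pixel boxes (e.g. from DETR)
    labels: class names aligned with boxes
    depth_map: 2D array of relative depth values (e.g. from MiDaS)
    """
    h, w = depth_map.shape
    # Normalize depth to [0, 1] so values are comparable across images
    depth_norm = (depth_map - depth_map.min()) / (depth_map.max() - depth_map.min() + 1e-8)

    grid = [[[] for _ in range(grid_size)] for _ in range(grid_size)]
    for (x0, y0, x1, y1), label in zip(boxes, labels):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2              # object center
        col = min(int(cx / w * grid_size), grid_size - 1)  # grid column
        row = min(int(cy / h * grid_size), grid_size - 1)  # grid row
        iy, ix = min(int(cy), h - 1), min(int(cx), w - 1)
        grid[row][col].append(f"{label} (depth {float(depth_norm[iy, ix]):.2f})")
    return grid

# Example with a synthetic detection on a 480x640 image
depth = np.random.rand(480, 640)
print(build_detection_matrix([(50, 100, 200, 300)], ["person"], depth))
```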
## Training Details

### LLaMA Model
The LLaMA-3.2-1B-Instruct model was fine-tuned using:
- Technique: Direct Preference Optimization (DPO) with LoRA
- Dataset: 2000 samples from COCO 2017 processed with DETR and Moondream
- Chosen Responses: Generated by DeepSeek-V3-0324
- Rejected Responses: Generated by the base (not yet fine-tuned) LLaMA-3.2-1B-Instruct
- Training Parameters (see the configuration sketch below):
  - LoRA Rank: 8
  - β (DPO): 0.1
  - Learning Rate: 2×10⁻⁵ with cosine decay
  - Effective Batch Size: 16 (2 × 8 gradient accumulation)
  - Sequence Length: 8192
- Hardware: 2×T4 GPUs
- Training Time: 1 hour 40 minutes
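As a rough illustration of how the reported hyperparameters map onto a `trl`/`peft` setup (the exact training script is not part of this card, and argument names such as `processing_class` vary across `trl` versions):

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference data: prompt / chosen / rejected triples (split name assumed)
dataset = load_dataset("AquaLabs/Spatial-DPO-Dataset", split="train")

peft_config = LoraConfig(r=8, task_type="CAUSAL_LM")  # LoRA rank 8

training_args = DPOConfig(
    output_dir="echollama-dpo",
    beta=0.1,                       # DPO beta
    learning_rate=2e-5,
    lr_scheduler_type="cosine",     # cosine decay
    per_device_train_batch_size=2,  # 2 x 8 accumulation -> effective batch of 16
    gradient_accumulation_steps=8,
    max_length=8192,                # sequence length
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```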
### Orpheus Model
The Orpheus-3B-0.1-ft TTS model was fine-tuned using:
- Technique: Low-Rank Adaptation (LoRA)
- Dataset: Elise English speech dataset
- Training Parameters (see the sketch below):
  - LoRA Rank (r): 64
  - LoRA Alpha (α): 64
  - LoRA Dropout: 0
  - Learning Rate: 2×10⁻⁴
- Hardware: 2×T4 GPUs
- Training Time: 47 minutes
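A minimal `peft` sketch of the reported LoRA configuration; the base checkpoint ID below is an assumption, and the audio/data handling used for the actual fine-tune is omitted:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Orpheus is a LLaMA-style speech-token LM, so it can be wrapped with peft.
# The base checkpoint ID below is an assumption.
base = AutoModelForCausalLM.from_pretrained("canopylabs/orpheus-3b-0.1-ft")

lora_config = LoraConfig(
    r=64,             # LoRA rank
    lora_alpha=64,    # LoRA alpha
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```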
## Usage

### Installation

```bash
# Clone the repository
git clone https://github.com/The-Aqua-Labs/EchoLLaMA-Pipeline.git
cd EchoLLaMA-Pipeline
```

Then run the Jupyter notebook included in the repository.
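For a quick standalone check of the fine-tuned text model outside the notebook, a minimal `transformers` generation example (the prompt here is illustrative; the real pipeline builds it from the detection matrix and caption):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AquaLabs/EchoLLaMA-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative prompt: in the full pipeline this is built from the
# DETR/MiDaS detection matrix and the Moondream caption.
messages = [
    {
        "role": "user",
        "content": "Describe the 3D scene: a person (depth 0.30) stands to the "
                   "left of a bicycle (depth 0.55) on a sunny street.",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```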
### Pipeline Flow

1. The image is processed with DETR for object detection and MiDaS for depth estimation
2. Moondream generates a caption describing the image content
3. The object detection matrix and the caption are combined into a prompt (see the sketch below)
4. LLaMA-3.2-1B-Instruct generates a detailed textual description
5. Orpheus-3B-0.1-ft converts the text into speech
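The prompt-assembly step (steps 3 and 4) could look roughly like the following; the template wording and matrix format are assumptions, since the exact prompt construction lives in the notebook:

```python
def build_prompt(detection_matrix, caption):
    """Combine the grid-based detection matrix and the Moondream caption into
    a single instruction for the fine-tuned LLaMA model."""
    cells = []
    for r, row in enumerate(detection_matrix):
        for c, cell in enumerate(row):
            if cell:
                cells.append(f"cell ({r}, {c}): " + ", ".join(cell))
    matrix_text = "\n".join(cells) if cells else "no objects detected"
    return (
        "You are given a 3D analysis of an image.\n"
        f"Object detection matrix (with relative depths):\n{matrix_text}\n"
        f"Caption: {caption}\n"
        "Describe the scene in rich, natural language."
    )

print(build_prompt(
    [[["person (depth 0.30)"], []], [[], ["bicycle (depth 0.55)"]]],
    "A person stands next to a bicycle on a sunny street.",
))
```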
## Dataset
The training dataset contains 1999 samples, each consisting of:
- An image-derived prompt with object detection matrix and caption
- A chosen response from DeepSeek-V3-0324
- A rejected response from LLaMA-3.2-1B-Instruct
The dataset is available at [AquaLabs/Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset).
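It can be loaded directly with the `datasets` library; the split and column layout below should be confirmed against the dataset card:

```python
from datasets import load_dataset

ds = load_dataset("AquaLabs/Spatial-DPO-Dataset", split="train")
print(ds.column_names)  # expect a prompt / chosen / rejected DPO-style layout
print(ds[0])            # inspect one sample
```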
## Model Weights

- LLaMA-3.2-1B-Instruct (fine-tuned): [AquaLabs/EchoLLaMA-1B](https://huggingface.co/AquaLabs/EchoLLaMA-1B)
- Orpheus-3B-0.1-ft (fine-tuned): [AquaLabs/Orpheus-3B-0.1-ft-Elise](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)
## Contributors
- Ahmet Erdem Pamuk - GitHub | Hugging Face
- Emir Kaan Özdemir - GitHub | Hugging Face
- Şuayp Talha Kocabay - GitHub | Hugging Face
## License

This project is licensed under the Apache-2.0 License. Further details are provided in the accompanying paper.