|
--- |
|
library_name: transformers |
|
tags: |
|
- EchoLLaMA |
|
license: apache-2.0 |
|
datasets: |
|
- AquaLabs/Spatial-DPO-Dataset |
|
language: |
|
- en |
|
base_model: |
|
- meta-llama/Llama-3.2-1B-Instruct |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# EchoLLaMA: 3D-to-Speech with Multimodal AI |
|
|
|
[Model: EchoLLaMA-1B](https://huggingface.co/AquaLabs/EchoLLaMA-1B)

[TTS: Orpheus-3B-0.1-ft-Elise](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)

[Dataset: Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/)
|
|
|
## Overview |
|
|
|
EchoLLaMA is a multimodal AI system that transforms 3D visual data into natural spoken descriptions while enabling interactive dialogue through speech input. This repository contains the implementation of the LLaMA-3.2-1B-Instruct model fine-tuned with Direct Preference Optimization (DPO) for generating rich textual descriptions of 3D scenes. |
|
|
|
## Model Architecture |
|
|
|
The EchoLLaMA pipeline integrates four specialized models: |
|
|
|
1. **Image Analysis**: |
|
- DETR (DEtection TRansformer) for object detection |
|
- MiDaS for monocular depth estimation |
|
- Moondream for holistic image captioning |
|
|
|
2. **Text Generation**: |
|
- LLaMA-3.2-1B-Instruct fine-tuned with DPO |
|
|
|
3. **Speech Synthesis**: |
|
- Orpheus-3B-0.1-ft TTS model fine-tuned on the Elise English speech dataset |
|
|
|
4. **Speech Recognition**: |
|
- SpeechRecognition package for transcribing user speech input |
|
|
|
## Key Features |
|
|
|
- **3D Object Detection Matrix**: Constructs a grid-based representation of detected objects with spatial coordinates |
|
- **Depth-Aware Scene Understanding**: Incorporates relative depth values to capture 3D relationships |
|
- **Natural Language Generation**: Produces coherent and contextually rich descriptions |
|
- **High-Quality Speech Synthesis**: Converts textual descriptions into natural-sounding speech |
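The grid-based detection matrix can be sketched in a few lines. This is a hypothetical illustration, assuming detections arrive as `(label, bounding box, relative depth)` triples; in the real pipeline the boxes come from DETR and the depth values from MiDaS.

```python
# Hypothetical sketch: bucket each detection into a grid cell by its
# box centre, keeping the label and MiDaS-style relative depth.
# The (label, (x, y, w, h), depth) input shape is an assumption.

def build_detection_matrix(detections, image_size, grid=3):
    """Return a grid x grid matrix of detected objects with depth."""
    width, height = image_size
    matrix = [[[] for _ in range(grid)] for _ in range(grid)]
    for label, (x, y, w, h), depth in detections:
        cx, cy = x + w / 2, y + h / 2          # box centre in pixels
        col = min(int(cx / width * grid), grid - 1)
        row = min(int(cy / height * grid), grid - 1)
        matrix[row][col].append({"label": label, "depth": depth})
    return matrix

detections = [
    ("person", (100, 200, 80, 160), 0.35),
    ("dog", (480, 300, 120, 90), 0.60),
]
matrix = build_detection_matrix(detections, image_size=(640, 480))
```

A matrix like this, serialized into the prompt, is what lets the language model reason about left/right, near/far relationships between objects.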
|
|
|
## Training Details |
|
|
|
### LLaMA Model |
|
|
|
The LLaMA-3.2-1B-Instruct model was fine-tuned using: |
|
|
|
- **Technique**: Direct Preference Optimization (DPO) with LoRA |
|
- **Dataset**: 2000 samples from COCO 2017 processed with DETR, MiDaS, and Moondream
|
- **Chosen Responses**: Generated by DeepSeek-V3-0324 |
|
- **Rejected Responses**: Generated by the base LLaMA-3.2-1B-Instruct (before fine-tuning)
|
- **Training Parameters**: |
|
- LoRA Rank: 8 |
|
- β (DPO): 0.1 |
|
- Learning Rate: 2×10⁻⁵ with cosine decay |
|
- Batch Size: 16 (with 2×8 accumulation) |
|
- Sequence Length: 8192 |
|
- **Hardware**: 2×T4 GPUs
|
- **Training Time**: 1 hour 40 minutes |
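The settings above could be expressed with the TRL and PEFT libraries roughly as follows. Model and dataset loading are elided, argument names follow recent `trl`/`peft` releases, and the per-device/accumulation split is inferred from "16 (with 2×8 accumulation)"; treat this as an illustration of the hyperparameters, not the exact training script.

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

lora_config = LoraConfig(r=8, task_type="CAUSAL_LM")

training_args = DPOConfig(
    beta=0.1,                        # DPO temperature
    learning_rate=2e-5,
    lr_scheduler_type="cosine",      # cosine decay
    per_device_train_batch_size=2,   # 2 per step ...
    gradient_accumulation_steps=8,   # ... x 8 accumulation = 16 effective
    max_length=8192,
)

# trainer = DPOTrainer(model, args=training_args,
#                      train_dataset=dpo_dataset, peft_config=lora_config)
# trainer.train()
```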
|
|
|
### Orpheus Model |
|
|
|
The Orpheus-3B-0.1-ft TTS model was fine-tuned using: |
|
|
|
- **Technique**: Low-Rank Adaptation (LoRA) |
|
- **Dataset**: Elise English speech dataset |
|
- **Training Parameters**: |
|
- LoRA Rank (r): 64 |
|
- LoRA Alpha (α): 64 |
|
- LoRA Dropout: 0 |
|
- Learning Rate: 2×10⁻⁴ |
|
- **Hardware**: 2×T4 GPUs
|
- **Training Time**: 47 minutes |
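The Orpheus LoRA settings above map directly onto a PEFT config. The `target_modules` list is an assumption (the usual attention projections), since the exact set is not stated here.

```python
from peft import LoraConfig

orpheus_lora = LoraConfig(
    r=64,                # LoRA rank
    lora_alpha=64,       # LoRA alpha
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```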
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
```bash |
|
# Clone the repository |
|
git clone https://github.com/The-Aqua-Labs/EchoLLaMA-Pipeline.git |
|
cd EchoLLaMA-Pipeline |
|
``` |
|
|
|
Then open and run the Jupyter notebook in the repository.
|
|
|
## Pipeline Flow |
|
|
|
1. Image is processed with DETR for object detection and MiDaS for depth estimation |
|
2. Moondream generates a caption describing the image content |
|
3. The object detection matrix and caption are combined into a prompt |
|
4. LLaMA-3.2-1B-Instruct generates a detailed textual description |
|
5. Orpheus-3B-0.1-ft converts the text into speech |
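The five steps above can be sketched as a chain of functions. Each stub stands in for one of the real models (DETR/MiDaS, Moondream, EchoLLaMA-1B, Orpheus); the function names, prompt template, and return values are illustrative assumptions, not the shipped API.

```python
def detect_objects(image):
    """Stub for DETR + MiDaS: returns the object/depth matrix as text."""
    return "person at (1,0) depth 0.35; dog at (2,2) depth 0.60"

def caption_image(image):
    """Stub for Moondream: returns a holistic caption."""
    return "A person walking a dog in a park."

def describe_scene(matrix, caption):
    """Stub for EchoLLaMA-1B: matrix + caption are combined into a prompt."""
    prompt = f"Objects: {matrix}\nCaption: {caption}\nDescribe the 3D scene."
    return f"[description generated from a prompt of {len(prompt)} chars]"

def synthesize_speech(text):
    """Stub for Orpheus-3B-0.1-ft: returns audio bytes."""
    return b"RIFF...placeholder-wav..."

image = object()  # placeholder input image
description = describe_scene(detect_objects(image), caption_image(image))
audio = synthesize_speech(description)
```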
|
|
|
## Dataset |
|
|
|
The training dataset contains 1999 samples, each consisting of: |
|
- An image-derived prompt with object detection matrix and caption |
|
- A chosen response from DeepSeek-V3-0324 |
|
- A rejected response from LLaMA-3.2-1B-Instruct |
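A single sample therefore looks roughly like the record below. The field names follow the common DPO convention (`prompt`/`chosen`/`rejected`) and the strings are invented for illustration; the actual dataset schema may differ.

```python
# Assumed shape of one Spatial-DPO-Dataset sample (illustrative only).
sample = {
    "prompt": "Objects: person at (1,0) depth 0.35 ...\n"
              "Caption: A person walking a dog.\n"
              "Describe the 3D scene.",
    "chosen": "In the foreground, slightly left of centre, a person "
              "strolls beside a dog ...",          # from DeepSeek-V3-0324
    "rejected": "There is a person and a dog.",    # from base LLaMA-3.2-1B
}
```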
|
|
|
You can access the dataset at [AquaLabs/Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/) |
|
|
|
## Model Weights |
|
|
|
- LLaMA-3.2-1B-Instruct (fine-tuned): [AquaLabs/EchoLLaMA-1B](https://huggingface.co/AquaLabs/EchoLLaMA-1B) |
|
- Orpheus-3B-0.1-ft (fine-tuned): [AquaLabs/Orpheus-3B-0.1-ft-Elise](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise) |
|
|
|
## Contributors |
|
|
|
- Ahmet Erdem Pamuk - [GitHub](https://github.com/ahmeterdempmk) | [Hugging Face](https://huggingface.co/ahmeterdempmk) |
|
- Emir Kaan Özdemir - [GitHub](https://github.com/emirkaanozdemr) | [Hugging Face](https://huggingface.co/emirkaanozdemr) |
|
- Şuayp Talha Kocabay - [GitHub](https://github.com/suayptalha) | [Hugging Face](https://huggingface.co/suayptalha) |
|
|
|
## License |
|
|
|
This project is licensed under the Apache-2.0 License. |
|
|
|
Details are provided in the [paper](). |