---
library_name: transformers
tags:
  - EchoLLaMA
license: apache-2.0
datasets:
  - AquaLabs/Spatial-DPO-Dataset
language:
  - en
base_model:
  - meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---

# EchoLLaMA: 3D-to-Speech with Multimodal AI

## Overview

EchoLLaMA is a multimodal AI system that transforms 3D visual data into natural spoken descriptions while enabling interactive dialogue through speech input. This repository contains the LLaMA-3.2-1B-Instruct model fine-tuned with Direct Preference Optimization (DPO) to generate rich textual descriptions of 3D scenes.

## Model Architecture

The EchoLLaMA pipeline integrates four specialized models:

1. **Image Analysis** (see the sketch after this list):
   - DETR (DEtection TRansformer) for object detection
   - MiDaS for monocular depth estimation
   - Moondream for holistic image captioning
2. **Text Generation**:
   - LLaMA-3.2-1B-Instruct fine-tuned with DPO
3. **Speech Synthesis**:
   - Orpheus-3B-0.1-ft TTS model fine-tuned on the Elise English speech dataset
4. **Speech Recognition**:
   - SpeechRecognition package for transcribing user speech input
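
For illustration, here is a minimal sketch of the image-analysis stage using the `transformers` pipelines. The specific checkpoints are assumptions (the repository may use different ones), and `scene.jpg` is a placeholder input.

```python
# Requires: pip install transformers timm pillow torch
from PIL import Image
from transformers import pipeline

image = Image.open("scene.jpg")  # placeholder input image

# DETR for object detection; facebook/detr-resnet-50 is an assumed checkpoint
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
# MiDaS-family DPT model for monocular depth; checkpoint is an assumption
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")

objects = detector(image)                    # [{"label", "score", "box"}, ...]
depth_map = depth_estimator(image)["depth"]  # PIL image of relative depth

for obj in objects:
    print(obj["label"], round(obj["score"], 2), obj["box"])
```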

## Key Features

- **3D Object Detection Matrix**: Constructs a grid-based representation of detected objects with spatial coordinates
- **Depth-Aware Scene Understanding**: Incorporates relative depth values to capture 3D relationships
- **Natural Language Generation**: Produces coherent and contextually rich descriptions
- **High-Quality Speech Synthesis**: Converts textual descriptions into natural-sounding speech

## Training Details

### LLaMA Model

The LLaMA-3.2-1B-Instruct model was fine-tuned with the following setup (a training sketch follows the list):

- **Technique**: Direct Preference Optimization (DPO) with LoRA
- **Dataset**: 2000 samples from COCO 2017, processed with DETR and Moondream
- **Chosen Responses**: Generated by DeepSeek-V3-0324
- **Rejected Responses**: Generated by the base (pre-fine-tuning) LLaMA-3.2-1B-Instruct
- **Training Parameters**:
  - LoRA Rank: 8
  - β (DPO): 0.1
  - Learning Rate: 2×10⁻⁵ with cosine decay
  - Effective Batch Size: 16 (per-device batch of 2 with 8 gradient accumulation steps)
  - Sequence Length: 8192
- **Hardware**: 2×T4 GPUs
- **Training Time**: 1 hour 40 minutes
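
Below is a minimal sketch of how this DPO + LoRA setup maps onto the `trl` and `peft` libraries. The hyperparameters mirror the list above; the LoRA alpha value and the dataset column names are assumptions, as the card does not specify them.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects "prompt"/"chosen"/"rejected" columns (assumed standard DPO layout)
dataset = load_dataset("AquaLabs/Spatial-DPO-Dataset", split="train")

peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")  # alpha assumed

training_args = DPOConfig(
    output_dir="echollama-dpo",
    beta=0.1,                       # DPO beta
    learning_rate=2e-5,
    lr_scheduler_type="cosine",     # cosine decay
    per_device_train_batch_size=2,  # 2 x 8 accumulation = effective batch 16
    gradient_accumulation_steps=8,
    max_length=8192,                # sequence length
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```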

### Orpheus Model

The Orpheus-3B-0.1-ft TTS model was fine-tuned with the following setup (a configuration sketch follows the list):

- **Technique**: Low-Rank Adaptation (LoRA)
- **Dataset**: Elise English speech dataset
- **Training Parameters**:
  - LoRA Rank (r): 64
  - LoRA Alpha (α): 64
  - LoRA Dropout: 0
  - Learning Rate: 2×10⁻⁴
- **Hardware**: 2×T4 GPUs
- **Training Time**: 47 minutes
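
A minimal `peft` configuration matching these parameters might look like the following; the `target_modules` are an assumption (typical attention projections for a LLaMA-style decoder such as Orpheus), not taken from the card.

```python
from peft import LoraConfig

orpheus_lora = LoraConfig(
    r=64,                 # LoRA rank
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```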

## Usage

### Installation

```bash
# Clone the repository
git clone https://github.com/The-Aqua-Labs/EchoLLaMA-Pipeline.git
cd EchoLLaMA-Pipeline
```

Then open and run the Jupyter notebook in the repository.

### Pipeline Flow

1. The image is processed with DETR for object detection and MiDaS for depth estimation.
2. Moondream generates a caption describing the image content.
3. The object detection matrix and caption are combined into a prompt.
4. LLaMA-3.2-1B-Instruct generates a detailed textual description (see the example after this list).
5. Orpheus-3B-0.1-ft converts the text into speech.
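
As an illustration of step 4, the description model can be called through the standard `transformers` text-generation pipeline. The model id `AquaLabs/EchoLLaMA-1B` and the prompt layout below are assumptions for illustration; see the repository notebook for the exact template.

```python
from transformers import pipeline

# DPO-tuned description model; the model id is an assumption
generator = pipeline("text-generation", model="AquaLabs/EchoLLaMA-1B")

# Illustrative prompt combining a caption with a detection matrix (step 3)
prompt = (
    "Caption: a wooden table with two chairs in a bright kitchen\n"
    "Objects (x, y, depth): table (0.4, 0.6, 0.35); chair (0.2, 0.7, 0.42)\n"
    "Describe the 3D scene in detail."
)

output = generator(prompt, max_new_tokens=256)
print(output[0]["generated_text"])
```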

## Dataset

The training dataset contains 1999 samples, each consisting of:

- An image-derived prompt with object detection matrix and caption
- A chosen response from DeepSeek-V3-0324
- A rejected response from LLaMA-3.2-1B-Instruct

You can access the dataset at [AquaLabs/Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset).
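
A quick way to inspect the samples, assuming the standard `prompt`/`chosen`/`rejected` DPO columns:

```python
from datasets import load_dataset

ds = load_dataset("AquaLabs/Spatial-DPO-Dataset", split="train")
print(ds)                     # features and sample count
print(ds[0]["prompt"][:300])  # column names are an assumption
```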

## Model Weights

## Contributors

## License

This project is licensed under the Apache-2.0 License.

Details are provided in the paper.