Fine-Tuned Language Model for Direct Preference Optimization (DPO)

Model Overview

This model is a fine-tuned version of Llama 3.2-3B-Instruct, aligned with Direct Preference Optimization (DPO) on pairs of preferred and rejected responses. It was trained with memory-efficient techniques including 4-bit quantization, gradient checkpointing, and parameter-efficient fine-tuning (PEFT). The model is suited to tasks requiring language comprehension, instruction-based response generation, and preference-based ranking of responses.

Model Details

  • Base Model: unsloth/Llama-3.2-3B-Instruct
  • Fine-Tuning Objective: Direct Preference Optimization (DPO) using pairs of accepted and rejected responses.
  • Training Framework: Built on Unsloth, integrated with Hugging Face Datasets and Transformers.
  • Quantization: Uses 4-bit quantization to reduce memory usage, making the model suitable for low-VRAM devices.
  • Optimizations: Includes gradient checkpointing for reduced training memory usage, along with PEFT methods such as LoRA (Low-Rank Adaptation).
  • Training Data: Trained on the Intel/orca_dpo_pairs dataset, which pairs each prompt with a chosen and a rejected response for preference-based learning (see the data-preparation sketch after this list).
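
As referenced above, a minimal data-preparation sketch follows. The column names assume the published schema of Intel/orca_dpo_pairs (system, question, chosen, rejected), and the prompt/chosen/rejected mapping illustrates how such a preference dataset is typically arranged for DPO training; it is not necessarily the exact preprocessing used for this model.

```python
from datasets import load_dataset

# Load the preference-pair dataset used for DPO fine-tuning.
dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# Map the raw columns into the prompt/chosen/rejected format expected by
# most DPO trainers. Column names assume the dataset's published schema
# (system, question, chosen, rejected); verify against your local copy.
def to_dpo_format(example):
    prompt = example["question"]
    if example["system"]:
        prompt = example["system"] + "\n\n" + prompt
    return {
        "prompt": prompt,
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

dataset = dataset.map(to_dpo_format, remove_columns=dataset.column_names)
```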

Model Capabilities

  • Text Generation: Capable of generating detailed and coherent text responses based on instructions or prompts.
  • Preference-Based Optimization: Fine-tuned to rank responses based on user feedback (chosen vs. rejected).
  • Context Length: Efficiently processes inputs of up to 2048 tokens, aided by Unsloth's internal RoPE scaling.
  • Fast Inference: Optimized for real-time text generation, with streaming output and low-latency responses.

Intended Use

This model can be applied to various natural language processing (NLP) tasks, including:

  • Question Answering: Responding to user queries with detailed and contextually accurate information.
  • Instruction Following: Generating responses based on user-defined tasks.
  • Preference Modeling: Ranking different responses based on preferences provided in training data.
  • Text Completion: Completing partially given texts based on provided instructions.

Limitations

  • Context Length: While capable of processing up to 2048 tokens, extremely long texts may require additional optimization or truncation.
  • Precision: 4-bit quantization can introduce small losses in accuracy relative to full-precision weights, which may matter in edge cases requiring high fidelity.
  • Dataset Bias: The model reflects biases present in the preference-pair dataset used for training.

Technical Details

  • Model Architecture: Based on Llama 3.2 with 3 billion parameters.
  • Training Method: Fine-tuned using Direct Preference Optimization (DPO).
  • Optimizer: Utilizes AdamW optimizer with 8-bit precision for efficiency.
  • Batch Size: Effective batch size of 8 (2 per device with 4-step gradient accumulation).
  • Training Configuration (mirrored in the sketch after this list):
    • Learning rate: 5e-6
    • Warm-up ratio: 0.1
    • Epochs: 1
    • Max sequence length: 2048 tokens
  • Mixed Precision Training: Supports FP16 and BFloat16 depending on hardware.
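
For reference, the hyperparameters above can be expressed as a training configuration roughly as follows. This is a sketch assuming TRL's DPOConfig is used for the DPO run; output_dir and the choice between bf16 and fp16 are illustrative assumptions, while the remaining values mirror the card.

```python
from trl import DPOConfig

# Illustrative mapping of the reported hyperparameters onto TRL's DPOConfig.
# output_dir is hypothetical; the other values mirror the Technical Details above.
training_args = DPOConfig(
    output_dir="llama_3_2_3B-dpo",   # hypothetical output path
    per_device_train_batch_size=2,   # 2 per device ...
    gradient_accumulation_steps=4,   # ... x 4 accumulation = effective batch size 8
    learning_rate=5e-6,
    warmup_ratio=0.1,
    num_train_epochs=1,
    max_length=2048,                 # max sequence length
    optim="adamw_8bit",              # 8-bit AdamW
    bf16=True,                       # or fp16=True, depending on hardware
)
```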

Usage Instructions

Install Dependencies

Ensure torch, transformers, unsloth, and other required libraries are installed for inference and fine-tuning.
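
A representative environment setup is shown below; exact versions depend on your CUDA toolchain, and the datasets and trl packages are included here on the assumption that they are used alongside Unsloth for DPO training.

```bash
pip install torch transformers datasets trl unsloth
```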

Load Pretrained Model

You can load the model using FastLanguageModel.from_pretrained() by specifying the model name and optimization settings.
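
A minimal loading sketch, assuming the published repository id SURESHBEEKHANI/llama_3_2_3B-dpo-rlhf-fine-tuning; adjust max_seq_length and dtype to your hardware.

```python
from unsloth import FastLanguageModel

# Load the fine-tuned model in 4-bit with Unsloth's fast loader.
# dtype=None lets Unsloth choose FP16 or BFloat16 based on the GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="SURESHBEEKHANI/llama_3_2_3B-dpo-rlhf-fine-tuning",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
```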

Fine-Tuning

Apply PEFT (e.g., LoRA) together with 4-bit quantization and gradient checkpointing, then fine-tune on the dataset of preference pairs.
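
A sketch of the DPO fine-tuning setup, assuming the base model unsloth/Llama-3.2-3B-Instruct, TRL's DPOTrainer, and the dataset and training_args objects from the earlier sketches. The LoRA rank, alpha, and target modules shown are common Unsloth defaults, not confirmed values from the original run.

```python
from unsloth import FastLanguageModel, PatchDPOTrainer
from trl import DPOTrainer

PatchDPOTrainer()  # apply Unsloth's DPO patches before building the trainer

# Start from the base instruct model in 4-bit.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; rank, alpha, and target modules are illustrative defaults.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,              # the implicit reference model is handled via PEFT
    args=training_args,          # DPOConfig from the Technical Details sketch
    train_dataset=dataset,       # prompt/chosen/rejected columns
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```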

Inference

Use the FastLanguageModel.for_inference() method to enable optimized text generation, which supports streaming inference for real-time output.
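
A minimal inference sketch using transformers' TextStreamer for streaming output; the example prompt and generation settings (max_new_tokens) are illustrative.

```python
from transformers import TextStreamer
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # enable Unsloth's optimized inference path

messages = [{"role": "user",
             "content": "Explain Direct Preference Optimization in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Stream tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(input_ids=input_ids, streamer=streamer, max_new_tokens=256)
```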

Performance Metrics

  • Training Loss: 1.19
  • Training Runtime: 1974.06 seconds (approximately 33 minutes)
  • Steps Per Second: 0.063
  • Samples Per Second: 0.507

Model Version

  • Version: Unsloth 2025.1.7 (Patched version)
  • Training Date: January 2025

Acknowledgements

This model was trained with the Unsloth framework, using Intel's orca_dpo_pairs dataset and Hugging Face libraries and tooling.

Notebook

Access the implementation notebook for this model here. This notebook provides detailed steps for fine-tuning and deploying the model.
