# ORPO-Tuned Llama-3.2-1B-Instruct
NB: Done purely as a fine-tuning exercise. Not intended for any practical use.
This model is a fine-tuned version of Meta's Llama-3.2-1B-Instruct using ORPO (Odds Ratio Preference Optimization). The model was trained to better align with human preferences using a curated preference dataset, mlabonne/orpo-dpo-mix-40k.
## Model Details
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Training Method: ORPO (Odds Ratio Preference Optimization) with LoRA
- Training Dataset: mlabonne/orpo-dpo-mix-40k (subset of 100 examples)
- Framework: Hugging Face Transformers, TRL, PEFT
- Training Date: November 2024
- License: Same as base model (Llama 3.2 Community License)
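For reference, a minimal inference sketch using the `transformers` pipeline. The model id `illeto/finetunning-week2` is taken from this repo; whether the LoRA adapter was merged before upload is an assumption, so adjust the loading code if you need to attach the adapter separately with PEFT.

```python
# Minimal inference sketch; assumes the fine-tuned weights are available
# on the Hub under this repo's id (merged adapter assumed).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="illeto/finetunning-week2",
    torch_dtype="auto",
    device_map="auto",
)

# Recent transformers versions accept chat-style message lists directly.
messages = [{"role": "user", "content": "Explain LoRA in one sentence."}]
print(generator(messages, max_new_tokens=128)[0]["generated_text"])
```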
## Training Process
The model was fine-tuned using LoRA (Low-Rank Adaptation) with the following configuration:
### LoRA Parameters
- r=16 (rank)
- lora_alpha=32
- lora_dropout=0.05
- bias="none"
- task_type="CAUSAL_LM"
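The parameters above map directly onto a PEFT `LoraConfig`; a sketch is below. Note that `target_modules` is an assumption (a common choice for Llama-style models) since the original configuration did not list it.

```python
from peft import LoraConfig

# LoRA configuration matching the parameters listed above.
# target_modules is an assumption, not part of the documented config.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```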
### Training Parameters
- Learning rate: 1e-5
- Batch size: 4
- Gradient accumulation steps: 4
- Maximum steps: 100
- Warmup steps: 10
- Gradient checkpointing: Enabled
- FP16 training: Enabled
- Maximum sequence length: 512
- Maximum prompt length: 512
- Optimizer: AdamW
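Put together, these settings correspond roughly to TRL's `ORPOTrainer`. The sketch below is illustrative, not the exact training script: the output directory and LoRA `target_modules` are assumptions, and older TRL versions pass the tokenizer via `tokenizer=` rather than `processing_class=`.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

base = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA settings from the section above (target_modules is an assumption).
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Training arguments matching the parameters listed above.
training_args = ORPOConfig(
    output_dir="orpo-llama-1b",  # illustrative name
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_steps=100,
    warmup_steps=10,
    gradient_checkpointing=True,
    fp16=True,
    max_length=512,
    max_prompt_length=512,
    optim="adamw_torch",
)

# 100-example subset of the preference dataset, as described above.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train[:100]")

trainer = ORPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```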
## Evaluation Results
The model was evaluated on the HellaSwag benchmark with the following configuration:
- Batch size: 64 (auto-detected)
- Full evaluation set
- Zero-shot setting
- FP16 precision
Results:
| Metric | Value | Standard Error |
|---|---|---|
| Accuracy | 45.20% | ±0.50% |
| Normalized Accuracy | 60.78% | ±0.49% |
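The evaluation harness is not named above; assuming EleutherAI's lm-evaluation-harness (the usual source of HellaSwag acc/acc_norm numbers), the configuration described corresponds roughly to:

```shell
# Hypothetical reproduction command; assumes lm-evaluation-harness is
# installed and the fine-tuned model is on the Hub under this repo's id.
lm_eval --model hf \
  --model_args pretrained=illeto/finetunning-week2,dtype=float16 \
  --tasks hellaswag \
  --num_fewshot 0 \
  --batch_size auto
```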