---
library_name: transformers
datasets:
- mlabonne/orpo-dpo-mix-40k
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---

# ORPO-Tuned Llama-3.2-1B-Instruct

NB: Done purely as a fine-tuning exercise. Not intended for any practical use.

This model is a fine-tuned version of Meta's Llama-3.2-1B-Instruct using ORPO (Odds Ratio Preference Optimization). The model was trained to better align with human preferences using the curated preference dataset mlabonne/orpo-dpo-mix-40k.

## Model Details

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **Training Method**: ORPO (Odds Ratio Preference Optimization) with LoRA
- **Training Dataset**: mlabonne/orpo-dpo-mix-40k (subset of 100 examples)
- **Framework**: Hugging Face Transformers, TRL, PEFT
- **Training Date**: November 2024
- **License**: Same as base model (Llama 3.2 Community License)
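
For quick reference, the snippet below is a minimal inference sketch using the Transformers `pipeline` API. The repo id is a placeholder, not the published name of this model; substitute the actual repository id (or a local path to the merged weights).

```python
# Minimal inference sketch; the model id below is a placeholder.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="your-username/llama-3.2-1b-instruct-orpo",  # placeholder id
    torch_dtype=torch.float16,
    device_map="auto",  # requires `accelerate`
)

# Llama-3.2 Instruct models expect chat-formatted input; recent transformers
# versions accept a list of messages and apply the chat template automatically.
messages = [{"role": "user", "content": "Explain LoRA fine-tuning in two sentences."}]
output = generator(messages, max_new_tokens=128)

# The pipeline returns the full conversation; the last message is the reply.
print(output[0]["generated_text"][-1]["content"])
```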

## Training Process

The model was fine-tuned using LoRA (Low-Rank Adaptation) with the following configuration; an illustrative code sketch follows each parameter list.

### LoRA Parameters
- r=16 (rank)
- lora_alpha=32
- lora_dropout=0.05
- bias="none"
- task_type="CAUSAL_LM"
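
A minimal sketch of this adapter configuration expressed with PEFT is shown below. Note that `target_modules` is not recorded in this card, so the value here is an assumption (the attention projection names commonly targeted for Llama models).

```python
from peft import LoraConfig

# LoRA adapter configuration matching the parameters listed above.
# NOTE: target_modules is an assumption; the card does not record which
# modules the adapters were attached to.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)
```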

### Training Parameters
- Learning rate: 1e-5
- Batch size: 4
- Gradient accumulation steps: 4
- Maximum steps: 100
- Warmup steps: 10
- Gradient checkpointing: Enabled
- FP16 training: Enabled
- Maximum sequence length: 512
- Maximum prompt length: 512
- Optimizer: AdamW
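
Putting the two parameter lists together, the training setup likely resembled the sketch below using TRL's `ORPOTrainer`. This is a reconstruction under stated assumptions (the output directory is a placeholder, and the exact TRL version and dataset preprocessing are not recorded in this card), not the exact training script.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

base_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

# 100-example subset of the preference dataset, as noted in Model Details.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train").select(range(100))

# Training arguments mirroring the list above; output_dir is a placeholder.
orpo_args = ORPOConfig(
    output_dir="./orpo-llama-3.2-1b",   # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_steps=100,
    warmup_steps=10,
    gradient_checkpointing=True,
    fp16=True,
    max_length=512,
    max_prompt_length=512,
    optim="adamw_torch",
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL releases use `tokenizer=` instead
    peft_config=peft_config,     # LoraConfig from the sketch above
)
trainer.train()
```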

## Evaluation Results

The model was evaluated on the HellaSwag benchmark with the following configuration (a reproduction sketch follows this list):
- Batch size: 64 (auto-detected)
- Full evaluation set
- Zero-shot setting
- FP16 precision
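
The auto-detected batch size suggests the EleutherAI lm-evaluation-harness, but the card does not name the tool, so the sketch below is an assumption about how an equivalent run could be reproduced. The model path is a placeholder.

```python
# Hypothetical reproduction of the evaluation, assuming lm-evaluation-harness
# (pip install lm-eval). The model path is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./orpo-llama-3.2-1b,dtype=float16",  # placeholder path
    tasks=["hellaswag"],
    num_fewshot=0,      # zero-shot setting
    batch_size="auto",  # auto-detected batch size
)
print(results["results"]["hellaswag"])
```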

Results:

| Metric | Value | Standard Error |
|--------|-------|----------------|
| Accuracy | 45.20% | ±0.50% |
| Normalized Accuracy | 60.78% | ±0.49% |