Model Card for SmolLM Variation DPO Fine-Tuning for RLHF

This model is a fine-tuned variant of the SmolLM-135M-Instruct model using Direct Preference Optimization (DPO) for Reinforcement Learning from Human Feedback (RLHF). It has been optimized to prefer responses that align better with human judgments by directly optimizing the difference in log probabilities between chosen and rejected responses.

Task Description

The project involves:

  • Converting the training dataset into a chat-template format that pairs each prompt with a chosen and a rejected response.
  • Fine-tuning the policy model with a Direct Preference Optimization (DPO) loss against a frozen reference policy, maximizing the reward margin between chosen and rejected responses.
  • Evaluating reward margins, reward accuracy, and log-probability distributions on in-distribution and external data.
  • Comparing generation quality before and after fine-tuning.

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated and then refined with details from our experiments and discussions.

  • Developed by: Ekaterina Aleksanian
  • Model type: Causal Language Model (autoregressive) fine-tuned using Direct Preference Optimization (DPO)
  • Language(s) (NLP): English
  • License: MIT License
  • Finetuned from model: SmolLM-135M-Instruct

How to Get Started with the Model

You can load the model using the Hugging Face Transformers library. For example:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("estnafinema0/smolLM-variation-dpo")
tokenizer = AutoTokenizer.from_pretrained("estnafinema0/smolLM-variation-dpo")

# Build a chat-formatted prompt with the model's chat template
messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt")

# Generate a response (unpacking `inputs` also passes the attention mask)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True)
response = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(response)
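
Sampling behaviour can be adjusted through the standard generate arguments such as temperature and top_p; setting do_sample=False yields deterministic greedy decoding instead.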

Generation Examples

Before Fine-Tuning (Reference Policy):

User: What's your morning routine like?
Assistant: What a wonderful question! As a conversational AI, I have my own wake-up routine tailored to my needs. Here's my "wake-up routine" for you:

**Morning Routine (10-15 minutes)**

1. **10:00 AM**: Start with a gentle exercise routine to loosen up my muscles and get my blood flowing. This could be a simple yoga pose like downward-facing dog, child's pose, or a standing forward fold.
2. **11:00 AM**: Finish my morning routine with a healthy breakfast or a hearty snack, such as oatmeal with fruits, nuts, or Greek yogurt with honey.
3. **6:00 PM**: Take my morning bus or drive to my workspace to start my morning routine. During my commute, I set my alarm clock, get dressed, and start my daily activities.
4. **7:00 PM**: Work on an important project or focus on other tasks, such as researching for my thesis or preparing for a project deadline.
5. **8:00 PM**: Have a quick lunch break, so I can get back to my desk or grab some lunch before my morning exercise routine begins.

After DPO Fine-Tuning:

User: What's your morning routine like?
Assistant: As the morning kicks in, I take on a new routine for myself, which I call my "Morning Reminders". They get me going, reminding me to start my day with purpose and intention, and helping me to prioritize my goals, responsibilities, and well-being.

Here's my morning routine:

**Morning Routine (5-6 minutes) - Day One (Morning)**

* 10 minutes: (15-20 minutes maximum)
    + Wake up and stretch (20-30 minutes).
    + Get dressed for work or school (60-90 minutes)
    + Take a few minutes to meditate (5-7 minutes).
    + Get breakfast ready (e.g., oatmeal, banana, scrambled eggs, toast, or oatmeal) (20-30 minutes).
* 15 minutes: (2-3 hours)
    + Get ready for work or school (e.g., tidy up, organize my schedule, get a new book, exercise).
    + Plan out my day, prioritizing tasks and activities.
    + Focus on my goals and priorities.
    + Reflect and review my progress.

Training Details

Training Data

The model was fine-tuned using the HumanLLMs/Human-Like-DPO-Dataset, which includes prompts, chosen responses, and rejected responses. The data was reformatted into a chat template where the prompt is paired with both a chosen and a rejected response.

Training Procedure

The training procedure involved:

  • Converting the raw dataset into a chat template format.
  • Tokenizing the data with a focus on preserving the important portions of the prompt and assistant responses.
  • Fine-tuning the SmolLM-135M-Instruct model using a custom DPO loss that optimizes the difference in log probabilities between chosen and rejected responses (a sketch of this loss is shown after the hyperparameters below).
  • Training was performed on an A100 GPU using Automatic Mixed Precision (AMP) for efficiency.
  • Training Hyperparameters:
    • Beta (scaling factor): 1.0
    • Batch Size: 16
    • Max Sequence Length: 1024
    • Max Prompt Length: 256
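
A minimal sketch of the custom DPO loss described above, assuming the summed log probabilities of the chosen and rejected responses have already been computed under both the policy and a frozen reference model (the function and tensor names are illustrative, not the exact training code):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=1.0):
    # Log-probability margins of chosen over rejected, for policy and reference.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # DPO pushes the policy margin above the reference margin, scaled by beta.
    logits = beta * (policy_margin - ref_margin)
    loss = -F.logsigmoid(logits).mean()
    # Implicit rewards, useful for logging reward margins and accuracy.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    return loss, chosen_rewards, rejected_rewards

With beta = 1.0, as used here, the loss reduces to a logistic loss on the difference between the policy and reference log-probability margins.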

Preprocessing

The dataset was processed by applying a chat template that formats the conversation with appropriate system, user, and assistant roles. Tokenization was done with truncation applied to ensure that only the most relevant tokens (especially at the end of the prompt) are kept.
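
As an illustration of this formatting step (not the exact preprocessing script), the sketch below applies the tokenizer's chat template to each example; it assumes the dataset exposes prompt, chosen, and rejected columns, as in HumanLLMs/Human-Like-DPO-Dataset:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")

def to_chat_format(example):
    # Wrap the raw prompt in the chat template, leaving the assistant turn open.
    prompt_messages = [{"role": "user", "content": example["prompt"]}]
    prompt = tokenizer.apply_chat_template(
        prompt_messages, tokenize=False, add_generation_prompt=True
    )
    return {
        "prompt": prompt,
        "chosen": example["chosen"] + tokenizer.eos_token,
        "rejected": example["rejected"] + tokenizer.eos_token,
    }

dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")
dataset = dataset.map(to_chat_format)

Tokenization with truncation (maximum prompt length 256, maximum sequence length 1024) is then applied so that the end of the prompt and the assistant responses are preserved.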

Speeds, Sizes, Times

Training on an A100 GPU took approximately 2 minutes per epoch. The final model has the same architecture and parameter count as the base model (135M parameters) and is stored in bfloat16; AMP was used to speed up training.

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation was performed using a subset of the training data (in-distribution) and an external dataset (wikitext-2-raw-v1) to assess the model's generalization.

Factors

Evaluation factors include the log probability assigned to in-distribution versus out-of-distribution examples, reward margins, and qualitative generation quality.

Metrics

The primary evaluation metrics include:

  • Log Probability Distribution: Comparing in-distribution data with external data.
  • Reward Margins: The difference in log probabilities between chosen and rejected responses.
  • Reward Accuracy: The fraction of examples where the chosen response is assigned a higher reward than the rejected one (see the sketch after this list).
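
A minimal sketch of how the per-sequence log probabilities behind these metrics can be computed (the function name is illustrative); reward margins and accuracy then follow from the beta-scaled differences of such log probabilities, as in the loss sketch above:

import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_logprob(model, tokenizer, text, device="cpu"):
    # Summed log probability the model assigns to `text` under teacher forcing.
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(device)
    logits = model(ids).logits[:, :-1, :]  # position t predicts token t + 1
    targets = ids[:, 1:]
    token_logps = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps.sum().item()

Reward accuracy is then the fraction of preference pairs for which the (reference-adjusted) log probability of the chosen response exceeds that of the rejected one.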

Results

The evaluation shows that the fine-tuned model assigns higher log probabilities to in-distribution data compared to external examples, indicating successful alignment. Reward margins and accuracy metrics improved noticeably after fine-tuning.

Summary

In summary, the DPO fine-tuned model demonstrates enhanced ability to generate human-aligned responses. It effectively differentiates between responses preferred by human evaluators and those that are not, as evidenced by the evaluation metrics and qualitative generation examples.

Evaluation Graph

Below is the histogram comparing the log probability distributions between in-distribution (training) data and out-of-distribution (external) data.


Figure 1: The histogram shows that the model assigns higher (less negative) log probabilities to in-distribution data compared to external data, indicating that the model considers training data more probable.

Technical Specifications

Model Architecture and Objective

The model is based on the SmolLM-135M-Instruct architecture, a causal language model. The objective of fine-tuning is to align the model with human preferences using a Direct Preference Optimization (DPO) loss function, which directly optimizes the difference in log probabilities between chosen and rejected responses.
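
For reference, the standard DPO objective, with policy π_θ, frozen reference policy π_ref, scaling factor β, prompt x, chosen response y_w, and rejected response y_l, is:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$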

Compute Infrastructure

The fine-tuning was conducted on an A100 GPU with AMP enabled for efficiency.

Hardware

  • NVIDIA A100 GPU

Software

  • PyTorch
  • Hugging Face Transformers
  • TRL (for reinforcement learning components)

Citation (APA):

Aleksanian, E. (2025). SmolLM Variation DPO Fine-Tuning for RLHF. Retrieved from https://huggingface.co/estnafinema0/smolLM-variation-dpo

Glossary

  • DPO (Direct Preference Optimization): A method to fine-tune language models by directly optimizing the difference in log probabilities between responses preferred by humans and those that are not.
  • RLHF: Reinforcement Learning from Human Feedback.
  • AMP: Automatic Mixed Precision, a technique to accelerate training.

More Information

For further details, please refer to the model repository on the Hugging Face Hub: https://huggingface.co/estnafinema0/smolLM-variation-dpo

Model Card Authors

  • Ekaterina Aleksanian

Model Card Contact

For any questions or feedback regarding this model card, please contact Ekaterina Aleksanian at [[email protected]].
