Model Card for online-dpo-qwen2-4

This model is a fine-tuned version of Qwen/Qwen2-0.5B-Instruct on the trl-lib/ultrafeedback-prompt dataset. It has been trained using TRL.

Quick start

from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="qgallouedec/online-dpo-qwen2-4", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=500)[0]
print(output["generated_text"][1]["content"])

Training procedure

This model was trained with Online DPO, a method introduced in the paper "Direct Language Model Alignment from Online AI Feedback".
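As a rough illustration of the idea behind Online DPO, the sketch below samples two completions from the current policy, asks a judge which one is better, and applies the DPO loss to the resulting pair. It is not the TRL implementation: the callables `sample`, `logp`, `ref_logp`, and `judge` are hypothetical placeholders, and `beta=0.1` is an assumed value that does not come from this card.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    beta is an assumed value, not the one used to train this model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


def online_dpo_step(prompt, sample, logp, ref_logp, judge, beta=0.1):
    """One online step for a single prompt (all callables are hypothetical).

    sample(prompt)      -> a completion drawn from the current policy
    logp(prompt, c)     -> log-probability of c under the current policy
    ref_logp(prompt, c) -> log-probability of c under the frozen reference
    judge(prompt, a, b) -> 0 if completion a is preferred, 1 if b is
    """
    # Both completions come from the policy being trained (the "online" part).
    a, b = sample(prompt), sample(prompt)
    chosen, rejected = (a, b) if judge(prompt, a, b) == 0 else (b, a)
    return dpo_loss(
        logp(prompt, chosen), logp(prompt, rejected),
        ref_logp(prompt, chosen), ref_logp(prompt, rejected),
        beta,
    )
```

In a real training loop the loss would be computed on batched tensors and backpropagated through the policy; plain floats are used here only to keep the sketch self-contained.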

Framework versions

TRL: 0.12.0.dev0
Transformers: 4.45.0.dev0
PyTorch: 2.4.1
Datasets: 3.0.0
Tokenizers: 0.19.1
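
To compare a local environment against the versions listed above, something like the following sketch may help. It assumes the usual PyPI distribution names (`trl`, `transformers`, `torch`, `datasets`, `tokenizers`); the helper names are illustrative only. The `.dev0` versions are development builds, so a released install is not expected to match them exactly.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed on this card, keyed by assumed PyPI distribution name.
CARD_VERSIONS = {
    "trl": "0.12.0.dev0",
    "transformers": "4.45.0.dev0",
    "torch": "2.4.1",
    "datasets": "3.0.0",
    "tokenizers": "0.19.1",
}


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def compare_with_card(card_versions=CARD_VERSIONS):
    """Map each package to (card version, installed version, exact match)."""
    report = {}
    for name, expected in card_versions.items():
        found = installed_version(name)
        report[name] = (expected, found, found == expected)
    return report


if __name__ == "__main__":
    for name, (expected, found, match) in compare_with_card().items():
        status = "ok" if match else "differs"
        print(f"{name}: card={expected} installed={found} ({status})")
```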

Model size: 494M params (Safetensors)
Tensor type: F32

Model tree for qgallouedec/online-dpo-qwen2-4

Base model: Qwen/Qwen2-0.5B
Finetuned: Qwen/Qwen2-0.5B-Instruct
Finetuned: this model
