---
base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
- trl
- grpo
- Reinforcement
license: apache-2.0
language:
- en
datasets:
- openai/gsm8k
---

# Uploaded model

- **Developed by:** vishal042002
- **License:** apache-2.0
- **Finetuned from model:** unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit

This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

## Run the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(
    "vishal042002/Qwen2.5-3B-GRPO",
    torch_dtype="auto",
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("vishal042002/Qwen2.5-3B-GRPO")

text = "Look at this series: 36, 34, 30, 28, 24, … What number should come next?"

# Tokenize the prompt and move it to the GPU.
# For chat-style prompting, the question can instead be wrapped with tokenizer.apply_chat_template.
inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Sample a completion
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```

## Training Details

This model was fine-tuned using Group Relative Policy Optimization (GRPO), a reinforcement learning technique that scores groups of sampled completions with reward functions and updates the policy toward completions that score above their group's average, without training a separate value model.

### Training Process

- Base model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit (Qwen 2.5 3B Instruct, 4-bit)
- Method: GRPO (Group Relative Policy Optimization)
- Dataset: openai/gsm8k (grade-school math word problems)
- Learning approach: for each prompt, several completions are sampled and scored with reward functions, and the model is updated toward the completions whose rewards exceed the group mean (see the sketch below)

GRPO enhances the training setup by:

- Computing advantages relative to each group of sampled completions, so no value model is needed
- Lowering memory and compute requirements compared with PPO-style RLHF
- Regularizing updates with a KL penalty toward the reference model, reducing the potential for reward hacking while preserving desired model behaviors
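
The snippet below is a minimal sketch of what such a GRPO run on GSM8K can look like with TRL's `GRPOTrainer`. It is illustrative only: the exact reward functions, prompt formatting, LoRA/Unsloth configuration, and hyperparameters used for this model are not documented here, so `correctness_reward`, the plain `Qwen/Qwen2.5-3B-Instruct` checkpoint, and all `GRPOConfig` values are assumptions.

```python
# Minimal GRPO sketch with TRL on GSM8K (illustrative; not the exact script used for this model).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K answers end with "#### <number>"; keep the final number as the gold answer.
def extract_gold(answer: str) -> str:
    return answer.split("####")[-1].strip()

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
    lambda x: {"prompt": x["question"], "gold": extract_gold(x["answer"])}
)

# Hypothetical reward: 1.0 if the gold answer string appears in the completion, else 0.0.
# Extra dataset columns (here "gold") are passed to reward functions as keyword arguments.
def correctness_reward(completions, gold, **kwargs):
    return [1.0 if g in c else 0.0 for c, g in zip(completions, gold)]

training_args = GRPOConfig(
    output_dir="qwen2.5-3b-grpo-gsm8k",
    num_generations=8,             # completions sampled per prompt (the "group")
    max_completion_length=256,
    per_device_train_batch_size=8, # must be divisible by num_generations
    learning_rate=5e-6,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",   # the actual run started from the Unsloth 4-bit checkpoint
    reward_funcs=correctness_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```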