everybody and their dog is fine-tuning Gemma 3 today, so I thought I'd do a longer post on the tips and sharp edges I've found. let's go!
1. you have to install everything from main and nightly. this is what I'm working with to get Unsloth and TRL running:
git+https://github.com/huggingface/transformers@main
git+https://github.com/huggingface/trl.git@main
bitsandbytes
peft
plus these, installed with --no-deps:
git+https://github.com/unslothai/unsloth-zoo.git@nightly
git+https://github.com/unslothai/unsloth.git@nightly
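a quick sanity check that the main-branch builds are actually the ones that got picked up (git installs report dev versions):

import transformers
import trl

# installs from main show a .dev suffix, e.g. something like 4.50.0.dev0
print("transformers:", transformers.__version__)
print("trl:", trl.__version__)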
2. Will Brown's code to turn GSM8K into a reasoning dataset is a nice toy experiment: https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb
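the core idea, as a minimal sketch (the system prompt, tag format, and column names are my simplification; the gist has the full version plus the reward functions):

from datasets import load_dataset

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>...</reasoning>
<answer>...</answer>"""

def extract_gold_answer(answer_text):
    # GSM8K stores the final answer after "####"
    return answer_text.split("####")[-1].strip()

def to_reasoning_example(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["question"]},
        ],
        "answer": extract_gold_answer(example["answer"]),
    }

dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_reasoning_example)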
3. with a learning rate of 5e-6, rewards and loss stayed flat for the first 100 or so steps.
4. so far none of my runs have undermined the outputs after 1 epoch, so I'm mainly experimenting with bigger LoRA adapters (sketch after the config below). the config for these runs:
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",  # 8-bit AdamW via bitsandbytes
    logging_steps = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 1,
    num_generations = 2,  # completions sampled per prompt for GRPO
    max_prompt_length = 256,
    max_completion_length = 1024 - 256,
    num_train_epochs = 1,
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none",
)
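on the bigger adapters: that just means raising the LoRA rank (and alpha with it). a sketch with peft, where the rank and target modules are example choices rather than tuned values:

from peft import LoraConfig

peft_config = LoraConfig(
    r = 64,            # adapter rank; larger = more trainable capacity
    lora_alpha = 64,
    lora_dropout = 0.0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",
)

this gets passed to the trainer as peft_config (see the last sketch below).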
5. vision fine-tuning isn't available in TRL's GRPOTrainer, so stick to text datasets. but there's no need to load the model differently in transformers or Unsloth:
from transformers import AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it")
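to tie it together, a rough sketch of the trainer wiring (the reward function here is a placeholder; the gist above has proper ones that check the answer and the tag format):

from trl import GRPOTrainer

def correctness_reward(prompts, completions, answer, **kwargs):
    # placeholder reward: 1.0 if the gold answer appears in the completion, else 0.0
    responses = [completion[0]["content"] for completion in completions]
    return [1.0 if gold in response else 0.0 for response, gold in zip(responses, answer)]

trainer = GRPOTrainer(
    model = model,
    reward_funcs = [correctness_reward],
    args = training_args,
    train_dataset = dataset,
    peft_config = peft_config,
)
trainer.train()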
if you want an introduction to GRPO, check out the reasoning course. it walks you through the algorithm, theory, and implementation in a smooth way.
https://huggingface.co/reasoning-course