GRPO helped DeepSeek R1 learn to reason. Can it also help VLMs perform better on general computer vision tasks?
The answer is YES, and it generalizes better than SFT. We trained Qwen 2.5 VL 3B with GRPO on RefCOCO (a visual grounding task) and evaluated on the RefCOCO val split and on RefGTA (an out-of-distribution task).
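For a rough idea of the setup, here is a minimal sketch of GRPO training with an IoU-based grounding reward using TRL's GRPOTrainer. It is not the exact recipe behind the results above; the dataset object, column names, the box format parsed from completions, and the trainer's handling of image inputs for this model are all assumptions.

```python
# Sketch only: GRPO with an IoU reward for visual grounding (assumptions noted above).
import re

from trl import GRPOConfig, GRPOTrainer


def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(completions, solution, **kwargs):
    """Reward each completion by IoU between its predicted box and the ground truth."""
    rewards = []
    for completion, gt_box in zip(completions, solution):
        nums = re.findall(r"-?\d+\.?\d*", str(completion))
        if len(nums) >= 4:
            pred = [float(n) for n in nums[:4]]  # assumed output format: four coordinates
            rewards.append(iou(pred, gt_box))
        else:
            rewards.append(0.0)  # no parseable box -> zero reward
    return rewards


args = GRPOConfig(
    output_dir="qwen2.5-vl-3b-grpo-refcoco",
    num_generations=8,              # group size used for the relative-advantage baseline
    per_device_train_batch_size=8,  # must be divisible by num_generations on one device
    max_completion_length=128,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    reward_funcs=grounding_reward,
    args=args,
    train_dataset=refcoco_train,  # assumed: a dataset with "prompt" and "solution" columns
)
trainer.train()
```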
Supercharge your LLM apps with Langfuse on Hugging Face Spaces!
Langfuse brings end-to-end observability and tooling to accelerate your dev workflow from experiments through production.
Now available as a Docker Space directly on the HF Hub!
- Trace everything: monitor LLM calls, retrieval, and agent actions with popular frameworks
- One-click deployment: on Spaces with persistent storage and integrated OAuth
- Simple prompt management: version, edit, and update without redeployment
- Intuitive evals: collect user feedback, run model/prompt evaluations, and improve quality
- Dataset creation: build datasets directly from production data to enhance future performance
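As a taste of the tracing workflow, here is a minimal sketch assuming the Langfuse Python SDK (v2) pointed at a self-hosted Langfuse Space; the Space URL and keys are placeholders, and the traced function is a stand-in for your own LLM call.

```python
# Sketch: trace a function end-to-end with the Langfuse @observe() decorator.
import os

from langfuse.decorators import observe, langfuse_context

os.environ["LANGFUSE_HOST"] = "https://your-langfuse-space.hf.space"  # your Docker Space URL (placeholder)
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."  # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."  # placeholder


@observe()  # records this call, its inputs/outputs, and any nested observed calls as one trace
def answer(question: str) -> str:
    # Replace with your actual LLM / retrieval / agent calls; nested @observe()-decorated
    # functions show up as spans under the same trace in the Langfuse UI.
    return f"Echo: {question}"


answer("What does Langfuse trace?")
langfuse_context.flush()  # make sure buffered events are sent before the script exits
```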
Kudos to the Langfuse team for this collab and the awesome, open-first product they're building! @marcklingen @Clemo @MJannik
Small but mighty: you can fine-tune SmolVLM on an L4 with a batch size of 4, and it only takes 16.4 GB of VRAM. With gradient accumulation, the simulated batch size is 16. I made a notebook that includes all the goodies (QLoRA, gradient accumulation, gradient checkpointing) with explanations of how they work: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
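For reference, a minimal sketch of the memory-saving pieces mentioned above (QLoRA, gradient accumulation, gradient checkpointing); see the linked notebook for the full recipe. The LoRA target modules and hyperparameters here are assumptions, not the notebook's exact values.

```python
# Sketch: the memory-saving configuration for fine-tuning SmolVLM on a single L4.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig, TrainingArguments

# QLoRA: load the base model in 4-bit so only the small LoRA adapters are trained.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target modules
    task_type="CAUSAL_LM",
))

training_args = TrainingArguments(
    output_dir="smolvlm-ft",
    per_device_train_batch_size=4,   # real batch size of 4 per step
    gradient_accumulation_steps=4,   # 4 x 4 = simulated batch size of 16
    gradient_checkpointing=True,     # recompute activations to save memory
    bf16=True,
    num_train_epochs=1,
)
```

Gradient accumulation keeps the per-step memory of a batch of 4 while averaging gradients over 4 steps, and gradient checkpointing trades a bit of extra compute for a large cut in activation memory, which is how the whole run stays within the L4's VRAM budget.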