Custom Vibe Coding Quest Part 2: 🚙 Fine-Tuning Gemma 3 for Code Reasoning

Last week I started a quest to build custom coding models. I’m inspired by the whole vibe coding thing, but focusing on personalization and fine-tuning. I wrote more about the quest in the previous blog post.
This week, I'm releasing the first model and sharing the journey's ups and downs.
TLDR: The model is way better at hard reasoning (+11 points on LiveCodeBench), but loses out on latency. The model is here: https://huggingface.co/burtenshaw/GemmaCoder3-12B
The Gemma 3 on Open R1 Mission
The first mission is to build the core coding LLM that I can use for most code generation and understanding. This model needs to do the heavy lifting of starting new projects and generating functional completions.
The recently released Gemma 3 seemed like a great starting point. If you want more details on Gemma 3, check out this blog post. Gemma 3 shows strong general performance and has a long context length of 128K, which is ideal for working with large codebases and complex code entities.
I noticed two areas for improvement if I'm going to use it for a code-specific project. First, there's no code variant like in the Qwen series, just a general model that I expect contains a load of parameters I don't need. Second, it doesn't use reasoning tokens to 'think' on hard problems. OpenAI and Anthropic have both released thinking variants of their core models, and they're handy in Cursor when working on tough problems. So it would be useful to have a model that can think.
Just a reminder, my prerequisites for this quest are that the model can be trained in Google Colab and runs on my laptop. Therefore, I'll start with the 12 billion parameter variant of Gemma 3, which is feasible to train on a single A100 and, when quantized to 4 or 8 bit, runs in 32GB of RAM. This is probably the upper bound of what we want to deal with because we won't get to do many ablations, but it's worth a try!
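To give a concrete sense of the 'runs on my laptop' half of that requirement, here's a rough sketch of local inference on a 4-bit GGUF quant using llama-cpp-python. LMStudio, which I actually use, wraps the same llama.cpp machinery, and the GGUF file name below is a placeholder rather than a real artifact.

```python
# Rough sketch: run a 4-bit quant of Gemma 3 12B locally with llama-cpp-python.
# Assumes llama-cpp-python is installed; the GGUF file name is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-12b-it-Q4_K_M.gguf",  # hypothetical local file
    n_ctx=8192,       # context window; raise it if you have the RAM to spare
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function to reverse a linked list."}],
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```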
Supervised Fine-tuning to think on code
I used Supervised Fine-tuning (SFT) to adapt the pre-trained Gemma 3 model. SFT involves training the model on a dataset of instruction and response pairs – in this case, one focused on code reasoning.
SFT is not the most advanced algorithm in the post-training toolkit, but it's straightforward to implement and interpret. Based on research like the DeepSeek-R1 paper, we know that models can learn from more sophisticated models through SFT on their reasoning chains.
We’ll use the codeforces-cots dataset, which is based on problems from the Codeforces competitive programming platform. Fine-tuning on codeforces-cots will allow the model to learn to reason on hard coding problems. The same strategy has already been used by the Open-R1 team to create the OlympicCoder series of models.
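For the curious, here's a minimal sketch of what this kind of SFT run looks like with TRL's SFTTrainer. The hyperparameters are illustrative rather than the exact values I used, the dataset subset may need specifying, and in practice you would add LoRA or other memory tricks to squeeze the 12B model onto a single A100.

```python
# Minimal SFT sketch: fine-tune Gemma 3 12B on codeforces-cots reasoning traces.
# Hyperparameters are placeholders, not the exact values used for GemmaCoder.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("open-r1/codeforces-cots", split="train")  # check the dataset card for subset names

args = SFTConfig(
    output_dir="gemmacoder3-12b",
    max_seq_length=8192,               # the 8k context limit mentioned later in the post
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
)

trainer = SFTTrainer(
    model="google/gemma-3-12b-it",     # start from the instruction-tuned checkpoint
    args=args,
    train_dataset=dataset,
)
trainer.train()
```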
Evaluating Code and General Performance
We need some way to test the generated code, and that's where LiveCodeBench comes in. LiveCodeBench uses a continuously updated list of code challenges meant to test competitors in coding competitions like Codeforces. This has two important effects on our experiment:
- The challenges are contamination-free and constantly updated, unlike alternatives like HumanEval and MBPP.
- Our training and evaluation sets are very similar, but might not reflect real-world usage. I'll explore this below, and it's expanded on in the post on the OlympicCoder series of models.
GemmaCoder achieved a 32.9% average pass rate on the LiveCodeBench problems. This was a significant improvement over the base model's 21.9%, but still less than models like OlympicCoder-7B, which achieved 40.7%. We can probably attribute that gap to two main factors:
- OlympicCoder-7B is based on Qwen2.5-Coder-7B, which has had extra rounds of pre-training and post-training for code-specific tasks.
- OlympicCoder-7B was fine-tuned on codeforces-cots with a 32k token context instead of the 8k I was limited to here.
All is not lost: OlympicCoder-7B is optimized for competitive coding via the codeforces-cots dataset, and we're trying to build a general-purpose coding model. There's a chance that GemmaCoder has retained some of the broader skills we need. Let's explore its performance on more general benchmarks to evaluate its abilities beyond code.
| Benchmark | OlympicCoder-7B | GemmaCoder-12B | Gemma3-12B-it |
|---|---|---|---|
| ARC | 36.2% | 49.6% | 54.2% |
| Winogrande | 52.9% | 63.9% | 63.5% |
| MMLU | 52.9% | 61.0% | 69.5% |
| HellaSwag | 44.2% | 54.0% | 63.5% |
| LiveCodeBench | 40.7% | 32.9%* | 21.9% |

*Evaluated with thinking 'always on'
We can see that GemmaCoder's abilities beyond coding have suffered from the fine-tuning, but it still performs reasonably on the general benchmarks. For evaluation, I set thinking to 'always on' for LiveCodeBench by prepending the prompt with a `<think>` token. The OlympicCoder work found that this improves performance on the LiveCodeBench problems. On other benchmarks, the model was able to generate a response without the `<think>` token.
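Concretely, 'always on' thinking just means appending a `<think>` token after the chat template's generation prompt, so decoding starts inside a reasoning block. Here's a minimal sketch of that trick; the real evals ran through an evaluation harness, so treat this as illustrative.

```python
# Sketch: force 'thinking mode' by appending <think> to the rendered prompt.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("burtenshaw/GemmaCoder3-12B")

messages = [{"role": "user", "content": "Solve this Codeforces problem: ..."}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
prompt += "<think>\n"  # the model now starts by reasoning before it answers

# `prompt` can then be sent to whatever backend you generate with
# (transformers, vLLM, the LMStudio server, ...).
```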
Vibe Testing
Let's look beyond the scores and try GemmaCoder out in the IDE to understand the vibe.
Task 1: Base64 Encode function for production
The first task I wanted to try is a simple day-to-day coding task that starts from a very basic prompt and leaves a lot open to interpretation and style.
Write a python function to Base64 an image in production
Gemma3-12B-IT does a fine job of this, generating a function that does the job with a general error catch. Not something you would want in production, but it encodes your images well enough.
```python
import base64

def image_to_base64(image_path: str) -> str:
    """
    Encodes an image file in Base64.

    Parameters:
        image_path (str): The path to the image file.

    Returns:
        str: The Base64 encoded string of the image, or None if an error occurred.
    """
    try:
        with open(image_path, "rb") as image_file:
            encoded_bytes = base64.b64encode(image_file.read())
            return encoded_bytes.decode('utf-8')
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
```
GemmaCoder without thinking takes a similar approach, working from a file path and returning a string. It does add some more error handling, catching `FileNotFoundError`, which is handy, along with a lot more inline comments, which may or may not be your vibe.
```python
import base64

def image_to_base64(image_path):
    """
    Encodes an image file into a Base64 string.

    Args:
        image_path: The path to the image file.

    Returns:
        A Base64 encoded string representation of the image, or None if there was an error.
    """
    try:
        with open(image_path, "rb") as image_file:
            encoded_string = base64.b64encode(image_file.read()).decode('utf-8')  # Decode to string for easier handling
            return encoded_string
    except FileNotFoundError:
        print(f"Error: Image file not found at {image_path}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
```
If we let GemmaCoder think, it goes way further and implements a script with more error handling, file type checks, and a function that takes the image object rather than a path. Personally, I'm not sure all of this is necessary, but maybe we're overusing the thinking abilities here.
```python
import base64
from PIL import Image
from io import BytesIO

def image_to_base64(image_object):
    """
    Encodes a PIL Image object into a Base64 string.

    Args:
        image_object: A PIL Image object.

    Returns:
        A Base64 encoded string representation of the image, or None if there was an error.
    """
    try:
        buffered = BytesIO()
        image_object.save(buffered, format=image_object.format if image_object.format else "PNG")  # Save to BytesIO object
        img_byte = buffered.getvalue()
        encoded_string = base64.b64encode(img_byte).decode('utf-8')  # Decode to string for easier handling
        return encoded_string
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

if __name__ == '__main__':
    # Example usage:
    try:
        # Open an image using PIL
        img = Image.open("your_image.jpg")  # Replace "your_image.jpg" with an actual image path

        # Convert the PIL Image object to Base64
        base64_string = image_to_base64(img)

        if base64_string:
            print("Image encoded to Base64 successfully!")
            # You can now use the base64_string (e.g., embed in HTML, send in JSON)
            # print(base64_string[:50])  # Print the first 50 characters as an example

            # To decode the Base64 string back to an image (example):
            img_data = base64.b64decode(base64_string)
            img_from_base64 = Image.open(BytesIO(img_data))
            img_from_base64.show()  # This will open the image using your default image viewer

    except FileNotFoundError:
        print("Error: Example image file not found.")
    except Exception as e:
        print(f"An error occurred during example usage: {e}")
```
The main disadvantage of thinking here is latency: GemmaCoder took an extra minute to respond to this prompt when thinking.
Task 2: Unicorns!
Qualitatively, OlympicCoder-7B was the best at handling problems needing logical reasoning and multi-step solutions, as we saw in the LiveCodeBench scores. However, it still struggled with tasks that involve layout. Let's explore GemmaCoder's visual abilities. I'm a fan of Adam K Dean's GPT Unicorn project, where he challenges LLMs to draw unicorns. I thought this would be a fun test for GemmaCoder.
First, here's an example from Gemma3-12B-IT.
That's clearly not a unicorn, but if we squint we can see some relevant aspects:
- it has a body
- there are four legs below the body
- it has a tail
- there's a horn like thing above the head
Sure, we're stretching here, but if you take a look at GPT Unicorn, you'll see that it's not the worst effort.
Now, let's look at GemmaCoder-12B. It does a slightly better job of drawing a unicorn, with these improvements:
- the unicorn is white
- it has independent legs, head, and body
I also explored GemmaCoder-12B on a number of competitive coding problems. I could not identify clear shortcomings in relation to Gemma or OlympicCoder, though I expect I didn't push its reasoning abilities to the max the way the LiveCodeBench evals do.
Development Setup
To use local models in my IDE I use this setup, which I also describe in this blog post:
- VSCode for development
- LMStudio for local inference
- Continue.dev VSCode extension for chatting, agents, and completion.
I also customised the chat template to either include the `<think>` token or not, and loaded both variations into LMStudio as separate models. Within Continue, I could then define two models and swap them out for problems that need more thinking.
```json
[
  {
    "apiBase": "http://localhost:1234/v1/",
    "model": "gemmacoder3-12b-thinking",
    "title": "gemmacoder3-12b-thinking",
    "provider": "lmstudio"
  },
  {
    "apiBase": "http://localhost:1234/v1/",
    "model": "gemmacoder3-12b",
    "title": "gemmacoder3-12b",
    "provider": "lmstudio"
  }
]
```
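The template tweak itself is the same `<think>` trick used in the evaluation, just baked into the chat template before exporting each variant. Here's a rough sketch, assuming Gemma 3's `<start_of_turn>model` generation prompt; the GGUF export for LMStudio is a separate step not shown here.

```python
# Sketch: produce 'plain' and 'thinking' tokenizer variants by editing the chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("burtenshaw/GemmaCoder3-12B")

# Plain variant: keep the stock template.
tok.save_pretrained("gemmacoder3-12b")

# Thinking variant: open a <think> block right after the generation prompt,
# so every assistant turn starts in reasoning mode by default.
tok.chat_template = tok.chat_template.replace(
    "<start_of_turn>model\n", "<start_of_turn>model\n<think>\n"
)
tok.save_pretrained("gemmacoder3-12b-thinking")
```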
Conclusion
This small experiment shows that supervised fine-tuning can significantly improve Gemma 3's code reasoning on hard competitive coding problems by learning from the codeforces-cots reasoning traces. We can also retain some of the base model's general abilities, which is great.
That said, the jury is still out on whether this is the kind of model I would want to use on a daily basis. Thinking is great for hard problems, but in most cases I want a 'good enough' answer quickly rather than the perfect one a minute later. In the next round of this quest, I'll explore helpfulness and latency to see if I can get a model that completes code quickly and accurately.