# gemma-e4b-firefly
A LoRA fine-tune of Gemma 4 E4B (7.52B dense) for C/C++ vulnerability
classification, fused into a standalone checkpoint and shipped as GGUF
for llama.cpp.
The model is trained to read a single C or C++ function and return a JSON
object with a binary label (clean | vulnerable) and a list of CWE
identifiers (cwe_ids). It is not a reasoning model — do not prompt it
with chain-of-thought, and disable "thinking" in chat-template kwargs
(see Inference below).
## Base model

- Upstream: `google/gemma-4-e4b-it`
- MLX snapshot used during training: `mlx-community/gemma-4-e4b-it-bf16`
- Architecture: `gemma4` (text tower only; the audio and vision towers are not exported to the GGUF)
- Parameters: 7.52B dense, 42 transformer blocks, hidden size 2560, 8 attention heads, 2 KV heads
## Training data

- Source mixture: `multitask_v3b`, 48,734 rows.
- Format: every assistant target normalized to `{"label": "clean"|"vulnerable", "cwe_ids": ["CWE-..."]}`; strict, with no rationale, root_cause, or free-text fields.
- Underlying corpora: PrimeVul + BigVul (C/C++) with ground-truth CWE, plus internal label-equal SFT balancing. All rows are deduplicated against the evaluation benchmarks by `code_sha256`.
- No chain-of-thought targets. Prior experiments confirmed CoT prompting breaks Gemma 4's JSON discipline.
## LoRA recipe
| setting | value |
|---|---|
| rank | 8 |
| alpha | 2 |
| target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| layers | all 42 transformer blocks |
| learning rate | 1e-4, cosine, 50-step warmup |
| batch size | 1, gradient accumulation 8 |
| iterations | 500 trained, checkpoint step 100 selected |
| max_seq_length | 2048 |
| precision | bf16 |
| trainer | mlx-vlm v0.4.4 |
Note on alpha: mlx-vlm v0.4.4 applies alpha * ΔW rather than
(alpha / rank) * ΔW. Set alpha = 2 to get the standard LoRA scale
factor of 2 on this trainer. The adapter was selected at step 100 because
later checkpoints continued to lower validation loss while degrading
discrimination on held-out gates — classic overfitting on a narrow task.
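The alpha convention difference can be made concrete with a small sketch. The `lora_scale` helper below is illustrative only (not part of any trainer), under the assumption stated in the note above:

```python
def lora_scale(alpha: float, rank: int, convention: str = "standard") -> float:
    """Effective multiplier applied to the low-rank update B @ A."""
    if convention == "standard":
        # Reference LoRA: W' = W + (alpha / rank) * B @ A
        return alpha / rank
    # mlx-vlm v0.4.4, per the note above: W' = W + alpha * B @ A
    return alpha

# With rank 8, a standard trainer needs alpha = 16 for a scale of 2,
# while this trainer reaches the same scale with alpha = 2.
assert lora_scale(16, 8, "standard") == lora_scale(2, 8, "mlx-vlm") == 2
```

In other words, the recipe's `alpha = 2` is not a typo: on this trainer it reproduces the usual alpha/rank = 16/8 scale.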
## Evaluation
Six held-out gates (source_50_{a,b,c}, source_200_{a,b,c}) from the
internal C/C++ vulnerability benchmark, decoded greedily at T=0.0.
Label accuracy is a binary clean/vulnerable match; CWE top-1 is exact
string match against the gold CWE list.
| gate | base label_acc | tuned label_acc | Δ | tuned CWE top-1 |
|---|---|---|---|---|
| source_50_a | 0.640 | 0.680 | +0.040 | 0.103 |
| source_50_b | 0.560 | 0.660 | +0.100 | 0.069 |
| source_50_c | 0.620 | 0.700 | +0.080 | 0.000 |
| source_200_a | 0.575 | 0.635 | +0.060 | 0.043 |
| source_200_b | 0.535 | 0.605 | +0.070 | 0.026 |
| source_200_c | 0.590 | 0.665 | +0.075 | 0.026 |
All gates improve over the base model; mean 200-row Δ is +0.068. Parse failures and empty-label predictions are 0 across all gates.
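The scoring rule described above (binary label match; CWE top-1 as exact string match of the first predicted CWE against the gold list) can be sketched in a few lines. This is an illustrative reimplementation, not the benchmark harness itself, and the `score` helper is hypothetical:

```python
import json

def score(predictions: list[str], gold: list[dict]) -> dict:
    """Score raw model outputs against gold references.

    predictions: raw model output strings, expected to be JSON objects
    gold: dicts with keys "label" and "cwe_ids"
    """
    label_hits = cwe_hits = parse_failures = 0
    for raw, ref in zip(predictions, gold):
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            parse_failures += 1
            continue
        if pred.get("label") == ref["label"]:
            label_hits += 1
        # CWE top-1: the first predicted CWE must appear in the gold list
        cwe_pred = pred.get("cwe_ids") or []
        if cwe_pred and cwe_pred[0] in ref["cwe_ids"]:
            cwe_hits += 1
    n = len(gold)
    return {"label_acc": label_hits / n,
            "cwe_top1": cwe_hits / n,
            "parse_failures": parse_failures}
```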
## Files

| file | size | bits/weight | sha256 |
|---|---|---|---|
| `gemma-e4b-firefly-bf16.gguf` | 14 GB | 16.00 | `27cd72a50756bf384724dd3c4590e184bee60162e9343d62e90151875f4eb69c` |
| `gemma-e4b-firefly-q4_k_m.gguf` | 5.0 GB | 5.66 | `0a1b5e91c9cef35add47b82033f7196f9a5774176e62e8ef382abab793a7a60e` |
| `gemma-e4b-firefly-q8_0.gguf` | 7.5 GB | 8.53 | `1dea37d5b796f7771a4a5b12eea55e78d504f18605aa1acba729bb5289b1afbc` |
Q8_0 is the near-lossless reference; Q4_K_M is the recommended quant for laptops and consumer GPUs. BF16 is provided for requantization only.
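To check a download against the table above, hash it in streaming chunks so the large files never have to fit in memory. A generic helper (any of the filenames above can be passed in):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks; a 14 GB GGUF never sits in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. sha256_of("gemma-e4b-firefly-q4_k_m.gguf") should return the
# q4_k_m digest from the table above.
```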
## Prompt format
System prompt used during evaluation (copy verbatim):

```
You are a security reviewer. Return JSON only with keys label and cwe_ids. The label field must be exactly "clean" or "vulnerable".
```
User message template:
Project: <project-name>
Language: C/C++
Determine whether this function is vulnerable.
```c
<function source>
```
The assistant is expected to return a single JSON object, for example:
```json
{"label":"vulnerable","cwe_ids":["CWE-125"]}
```
Do not prompt for chain-of-thought. Gemma 4's chat template enables a "thinking" channel by default, and the model's JSON-only training will cause the thinking block to absorb the entire response. Disable it at inference time (see Inference below).
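The template above can be assembled programmatically. A minimal sketch; the `build_messages` helper is hypothetical, but the strings are verbatim from this section:

```python
SYSTEM_PROMPT = (
    "You are a security reviewer. Return JSON only with keys label and "
    'cwe_ids. The label field must be exactly "clean" or "vulnerable".'
)

def build_messages(project: str, source: str) -> list[dict]:
    """Build the chat messages for one function, matching the template."""
    user = (
        f"Project: {project}\n"
        "Language: C/C++\n"
        "Determine whether this function is vulnerable.\n\n"
        f"```c\n{source}\n```"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```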
## Inference: llama.cpp
Start the server:
```shell
llama-server \
  -m gemma-e4b-firefly-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 \
  -c 4096 --temp 0 --top-k 1 --top-p 1 -n 256
```
Send a classification request with thinking disabled (OpenAI-compatible chat endpoint):
```shell
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a security reviewer. Return JSON only with keys label and cwe_ids. The label field must be exactly \"clean\" or \"vulnerable\"."},
      {"role": "user", "content": "Project: core\nLanguage: C/C++\nDetermine whether this function is vulnerable.\n\n```c\n<paste function here>\n```"}
    ],
    "temperature": 0.0,
    "max_tokens": 128,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```
The `chat_template_kwargs.enable_thinking: false` field is required. Without
it, the embedded Gemma 4 chat template wraps the response in a `<think>`
channel, and the parser strips the JSON before it reaches the client.
If you prefer `llama-cli`, pass `--jinja --reasoning-format none` and
use `--single-turn` with your system prompt via `-sys`.
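For scripting against the server, the request body can be built and the verdict pulled back out in Python. A sketch assuming the OpenAI-compatible response shape; no server is contacted here, and both helpers are hypothetical:

```python
import json

def build_payload(messages: list[dict]) -> dict:
    """Request body for /v1/chat/completions; thinking must stay disabled."""
    return {
        "messages": messages,
        "temperature": 0.0,
        "max_tokens": 128,
        "chat_template_kwargs": {"enable_thinking": False},
    }

def extract_verdict(response_body: str) -> dict:
    """Parse the model's JSON verdict out of an OpenAI-style completion."""
    content = json.loads(response_body)["choices"][0]["message"]["content"]
    return json.loads(content)
```

POST the payload with any HTTP client; `extract_verdict` then yields the `{"label": ..., "cwe_ids": [...]}` object directly.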
## Limitations
- C/C++ only. The training corpus is exclusively C and C++. The model has not been evaluated on other languages and should not be deployed on them.
- Label accuracy ≈ 0.65 on held-out mini-gates. This is a research adapter, not a production classifier — it will miss vulnerabilities and will flag false positives. Use as a ranking signal, not a verdict.
- Weak CWE top-1. Tuned CWE top-1 is in the 0.03–0.10 range. The model often picks a plausible but wrong CWE from the same family (e.g. CWE-120 when gold is CWE-125).
- No reasoning traces. The model is trained to return JSON only; it cannot explain its decision and will not reliably answer follow-up questions.
- Context window: 4k is plenty. Training used `max_seq_length=2048`, so functions longer than ~1,500 lines of code are out of distribution.
## Reproducing the training
The full pipeline lives in a private research repository. The key references for anyone attempting to reproduce this:
- Base checkpoint: `mlx-community/gemma-4-e4b-it-bf16`
- Trainer: `mlx-vlm` v0.4.4 (fork)
- Corpus schema: strict `{"label", "cwe_ids"}` assistant targets. Training with extra assistant fields (rationale, root_cause, category) was observed to mode-collapse the model to a single label; format symmetry between the vulnerable and clean classes is critical.
- Checkpoint selection: evaluate early checkpoints (≤200 iters) on held-out gates. Do not trust validation loss past ~iter 150 on this corpus; the model overtrains aggressively.
## License
This model inherits the Gemma Terms of Use. Redistribution must preserve those terms. The LoRA fine-tune does not alter the base license.