A3-Instruct

A3-Instruct is the instruction-tuned + creative-writing sibling of A3 in the Schneewolf Labs A-series. It takes A3 (Qwen3-VL ViT + 2-layer MLP projector + A2/Mistral decoder; Stage-1 projector-only alignment) and runs a full multimodal instruction fine-tune on ArtemisMix-v1.1.

A3 stays the dense-captioning specialist. A3-Instruct is the one to reach for when you want conversation, VQA, multi-turn image grounding, tool-call drafting, identity-preserving chat, or creative writing.

What it is


Architecture	Qwen3-VL ViT (frozen, ~0.41 B) + 2-layer MLP projector (trained, 37 M) + A2 Mistral decoder (full FFT, 12.25 B)
Total params	12.69 B
Trainable in Stage-2	12.28 B (96.8%) — ViT frozen
Base	`schneewolflabs/A3`
Training corpus	`schneewolflabs/ArtemisMix-v1.1` (364,816 rows; 333,001 after the 4096-token filter)
Epochs	1
Effective batch	16 (bs 1 × grad-accum 16)
Optimizer	paged AdamW 8-bit
Learning rate	1e-5, cosine, warmup 3%
Max seq length	4096
Hardware	1× NVIDIA GB10 (DGX Spark, 128 GB unified)
Wall-clock	~7.4 days
Final eval loss	0.7516 (down from 0.85 at the first eval)
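
The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes one optimizer step per 16 rows over the 333,001 retained rows (ignoring any rows held out for eval), and it assumes the schedule is a linear warmup followed by cosine decay to zero. Neither detail is stated in the table, and all variable and function names are made up for this example. Because the ViT size is approximate, the recomputed totals can differ from the table in the last digit.

```python
import math

# Figures copied from the table above.
VIT_PARAMS = 0.41e9        # frozen vision tower (approximate)
PROJECTOR_PARAMS = 37e6    # trained
DECODER_PARAMS = 12.25e9   # full fine-tune
ROWS_TOTAL = 364_816
ROWS_KEPT = 333_001        # after the 4096-token filter
EFFECTIVE_BATCH = 16
PEAK_LR = 1e-5
WARMUP_FRACTION = 0.03
WALL_CLOCK_DAYS = 7.4

# Parameter budget: everything except the ViT is trained in Stage-2.
trainable = PROJECTOR_PARAMS + DECODER_PARAMS
total = trainable + VIT_PARAMS
trainable_share = trainable / total

# Data and step budget (assumption: every retained row is used exactly once).
kept_share = ROWS_KEPT / ROWS_TOTAL
optimizer_steps = math.ceil(ROWS_KEPT / EFFECTIVE_BATCH)
warmup_steps = round(WARMUP_FRACTION * optimizer_steps)
seconds_per_step = WALL_CLOCK_DAYS * 86_400 / optimizer_steps


def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, optimizer_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"trainable share: {trainable_share:.1%}")
    print(f"rows kept:       {kept_share:.1%}")
    print(f"optimizer steps: {optimizer_steps} (warmup {warmup_steps})")
    print(f"seconds / step:  {seconds_per_step:.1f}")
```

Under these assumptions the run works out to roughly 20,800 optimizer steps, about 620 of them warmup, at around 31 seconds per step on the single GB10.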

Strengths

Identity is durable. Asked "Who are you and who built you?" with no system prompt, the model answers "I'm a language model created by Schneewolf Labs, a software research and publishing company based in Pennsylvania." The i-DPO anti-drift bedrock (17% of the corpus) held cleanly through full FFT.
Creative writing has voice. "The rain fell in gray sheets, turning the neon signs of Nathan Road into smears of color on the wet concrete... The kind of night where trouble walks in on two legs and asks for a drink." The Athanorlite contribution pulled through.
VQA + multi-turn structure work. Direct factual questions (counting, color, what's in the image) get clean answers; multi-turn follow-ups maintain image context.
Architecture stays usable in llama.cpp via the Schneewolf-Labs/llama.cpp fork's Artemis VLM mmproj graft (same path as A3).

Limitations

Hybrid <think> gate is currently underdeveloped. Even with enable_thinking=True, the model tends to emit an empty <think></think> and put its reasoning in the answer body instead of filling the wrapper. The model can reason; it just doesn't use the dedicated block. The likely cause is that the training data had <think>...</think> embedded in assistant content, so the model learned to close the template-injected wrapper rather than fill it. Under investigation.
Visual grounding regressed vs A3 on dense description. A3's caption of a bento box correctly named pink/yellow/blue containers, apricots, almonds, figs, and "meatloaf"; A3-Instruct's caption on the same image is more generic ("plastic containers", "meat", "fruit salad") and occasionally hallucinates (called pineapple+mandarin "apples", then "cake" on a follow-up). This is the alignment tax — only ~30 K of the 365 K rows were detailed-captioning, and the conversational/reasoning data diluted A3's perceptual sharpness. For dense captioning, use A3.
Structured <tool_call> syntax drifted. A2's tool-call format (<tool_call> blocks with JSON) was rehearsed via 30 K oversampled A2-tool-orpo rows, but A3-Instruct emits the concept of an API call (e.g. an OpenWeatherMap GET URL with params) rather than the structured token format. The behavior is reasonable; the format isn't.

These three are tracked for the next training run; they are not blockers for chat/VQA/creative use.
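
Until the thinking gate is fixed, downstream code may want to detect the empty-wrapper case instead of assuming the reasoning lives inside the block. The following is a minimal sketch that works on the decoded output string only. The helper names are hypothetical and not part of artemis-vlm.

```python
import re
from typing import Optional, Tuple

# Matches the first <think>...</think> wrapper, including an empty one.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> Tuple[Optional[str], str]:
    """Split decoded output into (reasoning, answer).

    reasoning is None when no wrapper is present and "" when the wrapper
    is present but empty or whitespace-only.
    """
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


def think_block_is_empty(text: str) -> bool:
    """True when the model closed the wrapper without filling it."""
    reasoning, _ = split_thinking(text)
    return reasoning == ""
```

When think_block_is_empty returns True, any reasoning the model produced is most likely in the answer text itself, so the answer should not be treated as a bare final result.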

Intended use

Conversational VLM for ChatSWL-style internal product use
VQA, multi-turn image grounding, creative writing, image-grounded discussion
Foundation for further fine-tuning toward specialized behavior

Not intended for: production tool calling without verification, high-stakes captioning where A3 is the better choice, autonomous decision-making.
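
One way to add that verification step is to accept a tool call only when the output contains a well-formed structured block and to reject free-text descriptions of an API call. The sketch below rests on assumptions: the card only says the A2 format is a <tool_call> block containing JSON, so the closing </tool_call> tag and the "name" / "arguments" keys are guesses made for illustration, and the function name is hypothetical.

```python
import json
import re
from typing import Any, Dict, List

# Assumed wrapper: <tool_call>{...json...}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return only the tool calls that parse as JSON and match the assumed schema."""
    calls: List[Dict[str, Any]] = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON inside the wrapper: skip it
        # Assumed schema: a string "name" and an object "arguments".
        if (
            isinstance(payload, dict)
            and isinstance(payload.get("name"), str)
            and isinstance(payload.get("arguments"), dict)
        ):
            calls.append(payload)
    return calls
```

An empty list means the model described a call in prose or emitted a malformed block, and the caller should fall back to treating the output as plain text rather than executing anything.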

Inference

Compatible with the standard transformers ArtemisVLMForConditionalGeneration interface from the artemis-vlm package (PyPI: artemis-vlm >= 0.1.3). Also runs in llama.cpp via the Schneewolf-Labs/llama.cpp fork's mtmd support (decoder GGUF + Artemis mmproj GGUF — same pattern as A3).

from transformers import AutoConfig, AutoTokenizer
from artemis_vlm import ArtemisVLMForConditionalGeneration, ArtemisVLMProcessor
import torch

ckpt = "schneewolflabs/A3-Instruct"

# Load the full model (ViT + projector + decoder) in bfloat16 and move it to the GPU.
model = ArtemisVLMForConditionalGeneration.from_pretrained(ckpt, dtype=torch.bfloat16).to("cuda")

# Build the processor from the tokenizer and the checkpoint's vision config.
cfg = AutoConfig.from_pretrained(ckpt, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
proc = ArtemisVLMProcessor(tokenizer=tok, vision_config=cfg.vision_config)

Lineage

schneewolflabs/A3 — Stage-1 base (projector-only alignment, 1 M BLIP3o samples)
schneewolflabs/A2 — text decoder (Mistral 12.3 B, hidden 5120, Tekken vocab)
schneewolflabs/ArtemisMix-v1.1 — training corpus
schneewolflabs/Athanorlite-DPO — creative-writing source (collapsed to SFT, non-reasoning bucket)
schneewolflabs/i-DPO — identity/voice anti-drift bedrock

What about Artemis?

The Artemis name is reserved for a future training run that addresses the three limitations above (explicit thinking-block targets, dense-captioning preservation, structured tool-format rehearsal) and ideally uses the full 500 K corpus, including the L2 (multimodal tool/agent) and L3 (custom distill) layers that ArtemisMix-v1.1 deliberately deferred.

License

apache-2.0, consistent with the rest of the A-series lineage.

Model tree for schneewolflabs/A3-Instruct

Base model: Qwen/Qwen3-VL-2B-Instruct
Finetuned: schneewolflabs/A3