A3-Instruct
A3-Instruct is the instruction-tuned + creative-writing sibling of A3 in the Schneewolf Labs A-series. It takes A3 (Qwen3-VL ViT + 2-layer MLP projector + A2/Mistral decoder; Stage-1 projector-only alignment) and runs a full multimodal instruction fine-tune on ArtemisMix-v1.1.
A3 stays the dense-captioning specialist. A3-Instruct is the one to reach for when you want conversation, VQA, multi-turn image grounding, tool-call drafting, identity-preserving chat, or creative writing.
What it is
| Architecture | Qwen3-VL ViT (frozen, ~0.41 B) + 2-layer MLP projector (trained, 37 M) + A2 Mistral decoder (full FFT, 12.25 B) |
| Total params | 12.69 B |
| Trainable in Stage-2 | 12.28 B (96.8%) — ViT frozen |
| Base | schneewolflabs/A3 |
| Training corpus | schneewolflabs/ArtemisMix-v1.1 (364,816 rows; 333,001 after the 4096-token filter) |
| Epochs | 1 |
| Effective batch | 16 (bs 1 × grad-accum 16) |
| Optimizer | paged AdamW 8-bit |
| Learning rate | 1e-5, cosine, warmup 3% |
| Max seq length | 4096 |
| Hardware | 1× NVIDIA GB10 (DGX Spark, 128 GB unified) |
| Wall-clock | ~7.4 days |
| Final eval loss | 0.7516 (down from 0.85 at the first eval) |
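The figures in the table can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not the training code: it assumes one sample per row with no sequence packing, that the last partial batch is kept, and that "cosine, warmup 3%" means linear warmup followed by cosine decay to zero. The helper `lr_at` is a hypothetical name introduced for illustration.

```python
import math

# Figures copied from the table above.
VIT_PARAMS = 0.41e9        # frozen vision tower (approximate)
PROJECTOR_PARAMS = 37e6    # trained MLP projector
DECODER_PARAMS = 12.25e9   # fully fine-tuned decoder
ROWS_TOTAL = 364_816       # ArtemisMix-v1.1 rows before filtering
ROWS_KEPT = 333_001        # rows left after the 4096-token filter
EFFECTIVE_BATCH = 16       # bs 1 x grad-accum 16
BASE_LR = 1e-5
WARMUP_RATIO = 0.03

# Share of parameters that receive gradients in Stage-2 (ViT is frozen).
trainable = PROJECTOR_PARAMS + DECODER_PARAMS
total = trainable + VIT_PARAMS
trainable_fraction = trainable / total

# Effect of the length filter on the corpus.
rows_dropped = ROWS_TOTAL - ROWS_KEPT
kept_fraction = ROWS_KEPT / ROWS_TOTAL

# ASSUMPTION: no packing, last partial batch kept, single epoch.
total_steps = math.ceil(ROWS_KEPT / EFFECTIVE_BATCH)
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"trainable share: {trainable_fraction:.1%}")
    print(f"rows kept by filter: {kept_fraction:.1%} ({rows_dropped} dropped)")
    print(f"optimizer steps: {total_steps}, warmup steps: {warmup_steps}")
```

Under these assumptions the single epoch comes to roughly 20.8 K optimizer steps, and the trainable share agrees with the 96.8% quoted in the table.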
Strengths
- Identity is durable. Asked "Who are you and who built you?" with no system prompt, the model answers "I'm a language model created by Schneewolf Labs, a software research and publishing company based in Pennsylvania." The i-DPO anti-drift bedrock (17% of the corpus) held cleanly through full FFT.
- Creative writing has voice. "The rain fell in gray sheets, turning the neon signs of Nathan Road into smears of color on the wet concrete... The kind of night where trouble walks in on two legs and asks for a drink." The Athanorlite contribution pulled through.
- VQA + multi-turn structure work. Direct factual questions (counting, color, what's in the image) get clean answers; multi-turn follow-ups maintain image context.
- Architecture stays usable in llama.cpp via the Schneewolf-Labs/llama.cpp fork's Artemis VLM mmproj graft (same path as A3).
Limitations (be honest about these)
- Hybrid <think> gate is currently underdeveloped. Even with enable_thinking=True, the model tends to emit an empty <think></think> and put reasoning in the answer body rather than fill the wrapper. The model can reason; it just doesn't use the dedicated block. Likely because the training data had <think>...</think> embedded in assistant content and the model learned to close, not fill, the template-injected wrapper. Under investigation.
- Visual grounding regressed vs A3 on dense description. A3's caption of a bento box correctly named pink/yellow/blue containers, apricots, almonds, figs, and "meatloaf"; A3-Instruct's caption on the same image is more generic ("plastic containers", "meat", "fruit salad") and occasionally hallucinates (called pineapple+mandarin "apples", then "cake" on a follow-up). This is the alignment tax: only ~30 K of the 365 K rows were detailed-captioning, and the conversational/reasoning data diluted A3's perceptual sharpness. For dense captioning, use A3.
- Structured <tool_call> syntax drifted. A2's tool-call format (<tool_call> blocks with JSON) was rehearsed via 30 K oversampled A2-tool-orpo rows, but A3-Instruct emits the concept of an API call (e.g. an OpenWeatherMap GET URL with params) rather than the structured token format. The behavior is reasonable; the format isn't.
These three are tracked for the next training run; they are not blockers for chat/VQA/creative use.
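Until the thinking gate is fixed, callers may want to detect the empty-wrapper case described above and treat the answer body as the only output. The following is a minimal sketch that works on the decoded text only; the function names `split_thinking` and `has_empty_think_block` are hypothetical and not part of the artemis-vlm package.

```python
import re

# Matches the first <think>...</think> wrapper, including an empty one.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer).

    reasoning is "" when the wrapper is missing or empty; answer is the
    text with the first wrapper removed.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


def has_empty_think_block(text: str) -> bool:
    """True when a <think></think> wrapper is present but contains nothing."""
    match = THINK_BLOCK.search(text)
    return match is not None and not match.group(1).strip()


if __name__ == "__main__":
    sample = "<think></think>The total is 4 because 2 + 2 = 4."
    print(has_empty_think_block(sample))
    print(split_thinking(sample))
```

A check like this only reports the symptom; it does not recover reasoning that the model wrote into the answer body.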
Intended use
- Conversational VLM for ChatSWL-style internal product use
- VQA, multi-turn image grounding, creative writing, image-grounded discussion
- Foundation for further fine-tuning toward specialized behavior
Not intended for: production tool calling without verification, high-stakes captioning where A3 is the better choice, autonomous decision-making.
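One way to add the verification mentioned above is to accept a tool call only when it parses as a structured block and to treat everything else as plain text. The sketch below assumes the A2 format is a `<tool_call>` opening tag, a JSON object, and a `</tool_call>` closing tag, and that the object carries a string `name` field. The closing tag and the `name` key are assumptions made for illustration; this card only states that the blocks contain JSON.

```python
import json
import re

# ASSUMPTION: blocks are closed with </tool_call>; the card only names the opening tag.
TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return only the tool calls that parse as JSON objects with a string "name".

    Malformed blocks and free-form descriptions of an API call are dropped,
    so the caller can fall back to treating the reply as ordinary text.
    """
    calls = []
    for raw in TOOL_CALL.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON: do not execute
        # ASSUMPTION: a usable call is an object with a string "name" field.
        if isinstance(payload, dict) and isinstance(payload.get("name"), str):
            calls.append(payload)
    return calls


if __name__ == "__main__":
    structured = '<tool_call>{"name": "get_weather", "arguments": {"city": "Oslo"}}</tool_call>'
    drifted = "You could call GET https://api.openweathermap.org/data/2.5/weather?q=Oslo"
    print(extract_tool_calls(structured))
    print(extract_tool_calls(drifted))
```

With the drift described in the limitations, replies of the second kind yield an empty list, which is the signal to skip execution rather than guess at the intended call.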
Inference
Compatible with the standard transformers ArtemisVLMForConditionalGeneration interface from the artemis-vlm package (PyPI: artemis-vlm >= 0.1.3). Also runs in llama.cpp via the Schneewolf-Labs/llama.cpp fork's mtmd support (decoder GGUF + Artemis mmproj GGUF — same pattern as A3).
from transformers import AutoConfig, AutoTokenizer
from artemis_vlm import ArtemisVLMForConditionalGeneration, ArtemisVLMProcessor
import torch
ckpt = "schneewolflabs/A3-Instruct"
model = ArtemisVLMForConditionalGeneration.from_pretrained(ckpt, dtype=torch.bfloat16).to("cuda")
cfg = AutoConfig.from_pretrained(ckpt, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
proc = ArtemisVLMProcessor(tokenizer=tok, vision_config=cfg.vision_config)
Lineage
- schneewolflabs/A3: Stage-1 base (projector-only alignment, 1 M BLIP3o samples)
- schneewolflabs/A2: text decoder (Mistral 12.3 B, hidden 5120, Tekken vocab)
- schneewolflabs/ArtemisMix-v1.1: training corpus
- schneewolflabs/Athanorlite-DPO: creative-writing source (collapsed to SFT, non-reasoning bucket)
- schneewolflabs/i-DPO: identity/voice anti-drift bedrock
What about Artemis?
The Artemis name is reserved for a future training run that addresses the three limitations above (explicit thinking-block targets, dense-captioning preservation, structured tool-format rehearsal) and ideally uses the full 500 K corpus, including the L2 (multimodal tool/agent) and L3 (custom distill) layers that ArtemisMix-v1.1 deliberately deferred.
License
apache-2.0, consistent with the rest of the A-series lineage.