Atem-4B

Ancient logic. Modern intelligence.

A 4B reasoning model trained via a single CoT-preserving SFT pass directly on Qwen3-4B, distilling multi-domain reasoning capability from frontier teacher models while keeping the base model's native thinking capability intact.


Overview

Atem-4B is a 4B-parameter reasoning model built via a single supervised fine-tuning pass on raw Qwen3-4B. Unlike the earlier 0.6B Atem line, which erased Qwen3's native thinking mode in Stage 1 and re-imposed an externally distilled reasoning style in Stage 2, Atem-4B is trained in one combined CoT-preserving pass, building reasoning capability on top of the base model's intact native foundation rather than on a cleared one.

This is currently the flagship model in the Atem series and the first to use full 16-bit LoRA (not QLoRA) on a Qwen3 base, enabled by the available VRAM headroom at this scale.


Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen3-4B |
| Training method | Single-pass CoT-Preserving LoRA SFT |
| LoRA config | r=64, alpha=128, dropout=0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Parameters | ~4.02B |
| Trainable (LoRA) params | 125,042,688 (3.11% of base) |
| Training records | 56,573 (after token-length filtering) |
| Think / No-think split | 85% / 15% |
| Epochs | 2 (ceiling; early stopping patience=3) |
| Effective batch size | 64 (batch 8 × grad accum 8) |
| Learning rate | 1e-4, cosine schedule, 5% warmup |
| Max sequence length | 6,144 tokens |
| Precision | bfloat16 (full 16-bit LoRA, not QLoRA) |
| Hardware | NVIDIA A100-SXM4 80GB |
| License | Apache 2.0 |

Design Notes

Why a single combined pass? The 0.6B Atem pipeline (Stage 1: erase thinking → Stage 2: rebuild with external CoT) created two compounding problems: the erasure pass cost real capability on multi-step reasoning relative to the base model, and the rebuild pass re-imposed an external reasoning style on a foundation that had already been structurally altered. Qualitative benchmarking confirmed this — base Qwen3-0.6B in thinking mode self-corrected on problems the no-think Atem-0.6B got wrong, and lm-eval showed a real ARC-Challenge regression after Stage 2. Atem-4B skips the erasure entirely: one pass, intact native reasoning, externally-distilled CoT styles layered on top of something that still works.

Why full 16-bit LoRA? Every prior Atem script used QLoRA because VRAM was the binding constraint at 0.6B–3B. At 4B with an 80GB A100, full 16-bit LoRA requires ~33GB — comfortably inside budget. Full 16-bit is both marginally faster and marginally more accurate than QLoRA at equivalent effective batch sizes, since QLoRA pays a real compute cost on quantize/dequantize operations at each step.

Why r=64? r=32 represented 3.54% of the 0.6B model but only 0.82% of a 14B model — LoRA params scale roughly linearly with hidden size while total model params scale closer to quadratically. At 4B, r=64 recovers 3.11% proportional capacity, close to the proven 0.6B baseline, without jumping to the "very complex" territory of r=128+.
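
The scaling argument can be illustrated with a toy calculation. The sketch below assumes a hypothetical model in which every adapted weight is a square hidden × hidden matrix, so its percentages will not reproduce the figures quoted above; it only shows that the LoRA share falls as hidden size grows and is restored by raising the rank in proportion.

```python
def lora_fraction(hidden: int, rank: int, layers: int = 1, mats_per_layer: int = 7) -> float:
    """LoRA params as a fraction of base params in a toy model of square matrices."""
    # Base weights: each adapted matrix is hidden x hidden (quadratic in hidden size).
    base = layers * mats_per_layer * hidden * hidden
    # LoRA adds A (rank x hidden) and B (hidden x rank) per matrix (linear in hidden size).
    lora = layers * mats_per_layer * rank * (hidden + hidden)
    return lora / base


# Hypothetical hidden sizes, chosen only to show the trend.
for hidden in (1024, 2560, 5120):
    print(
        hidden,
        f"r=32: {lora_fraction(hidden, 32):.2%}",
        f"r=64: {lora_fraction(hidden, 64):.2%}",
    )
```

In this toy setting the fraction reduces to 2r / hidden, so doubling the hidden size halves the adapter's relative capacity unless the rank doubles with it.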


Intended Use

Atem-4B is designed for general reasoning tasks where structured, step-by-step thinking adds value:

  • Multi-step mathematical reasoning
  • Code explanation, implementation, and debugging
  • Analytical reasoning and argument evaluation
  • Scientific explanation requiring technical depth
  • Logic, fallacy identification, and conditional reasoning
  • Concept explanation across diverse domains

Atem-4B is not designed for real-time information retrieval, factual lookup requiring post-training knowledge, or tasks where a fast direct answer without reasoning is preferred — use Atem-0.6B or Atem-Savant-0.6B for those.


Training Data

Atem-4B was trained on a corpus assembled from eight sources covering mathematics, coding, general reasoning, scientific reasoning, and medical reasoning. All sources include explicit chain-of-thought reasoning traces; 85% of training records were formatted with full think traces and 15% as direct answers to maintain no-think capability.

| Dataset | Records | Source / Teacher |
| --- | --- | --- |
| mitroitskii/OpenR1-Math-220k-formatted | 10,000 | DeepSeek-R1: Mathematics (correctness-filtered) |
| Jackrong/Claude-opus-4.6-TraceInversion-9000x | 7,000 | Claude Opus 4.6: Trace Inversion |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Math) | 8,000 | Kimi K2.5: Mathematical Reasoning |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Distillation) | 8,000 | Kimi K2.5: General Reasoning |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (PHD-Science) | 8,000 | Kimi K2.5: Scientific Reasoning |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | MiniMax M2.7 |
| FreedomIntelligence/medical-o1-reasoning-SFT | 6,000 | Medical reasoning (English config) |
| Modotte/CodeX-2M-Thinking | 15,000 | Mixed: Coding with CoT |
| trjxter/DeepSeek-V4-Pro-Reasoning-8000x | ~8,000 | DeepSeek-V4-Pro |
| nvidia/OpenCodeReasoning | 15,000 | Mixed: Competitive coding |
| Total (post-filter) | 56,573 | |

Non-English reasoning traces (primarily CJK) were filtered at the trace level using an ASCII-ratio threshold; records with CJK traces were retained as no-think records rather than discarded entirely. Traces from all sources were passed via the reasoning_content message key rather than manual <think>...</think> concatenation, preventing silent truncation when reasoning traces contain the literal substring </think>.
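
A minimal sketch of the two ideas in this paragraph, assuming a hypothetical threshold of 0.9 (the actual value is not stated here) and hypothetical helper names `ascii_ratio` and `build_messages`. It keeps a trace only when it looks predominantly ASCII and stores it under `reasoning_content` instead of splicing it into the answer text.

```python
ASCII_THRESHOLD = 0.9  # hypothetical value; the real threshold is not given


def ascii_ratio(text: str) -> float:
    """Share of characters in `text` that are plain ASCII (1.0 for empty text)."""
    if not text:
        return 1.0
    return sum(ch.isascii() for ch in text) / len(text)


def build_messages(question: str, trace: str, answer: str) -> list[dict]:
    """Return chat messages; low-ASCII traces are dropped, making a no-think record."""
    assistant = {"role": "assistant", "content": answer}
    if trace and ascii_ratio(trace) >= ASCII_THRESHOLD:
        # The trace travels in its own key, so a literal "</think>" inside it
        # cannot be mistaken for the end of the reasoning block.
        assistant["reasoning_content"] = trace
    return [{"role": "user", "content": question}, assistant]


english = build_messages("q", "The HTML-like tag </think> appears in this trace.", "42")
cjk = build_messages("q", "首先我们计算总和然后再除以二", "42")
print("reasoning_content" in english[1], "reasoning_content" in cjk[1])
```

With manual concatenation, any parser that splits on the first `</think>` would cut the English trace above in half; keeping the trace in a separate key avoids that parsing step.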


Training Configuration

# Key hyperparameters
lora_r             = 64
lora_alpha         = 128
lora_dropout       = 0.05
max_seq_length     = 6144
learning_rate      = 1e-4
lr_scheduler       = 'cosine'
warmup_ratio       = 0.05
batch_size         = 8
grad_accumulation  = 8          # effective batch size: 64
num_epochs         = 2          # ceiling — early stopping may cut short
eval_steps         = 150
early_stopping_patience   = 3
early_stopping_threshold  = 0.001
nothink_ratio      = 0.15
load_in_4bit       = False      # full 16-bit LoRA
dtype              = bfloat16
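
As a rough guide to what these settings imply, the step counts can be worked out by hand. The sketch below assumes that all 56,573 records sit in the training split and that no partial batch is dropped; neither is stated here, so treat the numbers as approximate.

```python
import math

records = 56_573            # training records reported in this card
effective_batch = 8 * 8     # batch_size x grad_accumulation
num_epochs = 2
warmup_ratio = 0.05
eval_steps = 150
patience = 3

steps_per_epoch = math.ceil(records / effective_batch)
max_steps = steps_per_epoch * num_epochs
warmup_steps = round(max_steps * warmup_ratio)      # exact rounding depends on the trainer
evals_at_most = max_steps // eval_steps
# The first evaluation sets the baseline; `patience` further evaluations
# without improvement are then needed before training can stop.
earliest_stop_step = eval_steps * (patience + 1)

print(steps_per_epoch, max_steps, warmup_steps, evals_at_most, earliest_stop_step)
```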

Training used Unsloth with train_on_responses_only masking — loss was computed exclusively on assistant response tokens including reasoning traces for think records. Early stopping halts training automatically when validation loss fails to improve by more than 0.001 across three consecutive evaluation windows, preventing wasted compute at plateau. load_best_model_at_end=True ensures the merged checkpoint reflects the best-validation-loss state rather than the final training step.
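
The stopping rule described above can be sketched in a few lines. This is a simplified stand-in, not the trainer's own callback: library implementations may track the best value slightly differently, and the loss values in the example are invented.

```python
def early_stop_index(eval_losses, patience=3, threshold=0.001):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best - loss > threshold:      # improvement must exceed the threshold
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:    # three stale evaluations in a row
                return i
    return None


# Invented validation losses: real gains, then a plateau of sub-threshold changes.
losses = [1.00, 0.90, 0.85, 0.8495, 0.8493, 0.8492, 0.80]
print(early_stop_index(losses))
```

In this example training halts during the plateau and never reaches the final value of 0.80, which is the trade-off a patience setting accepts in exchange for saved compute.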

Atem's default identity is baked into the chat template via a Jinja else-branch injection: callers without an explicit system message receive the Atem persona by default, while an explicit system message overrides it as usual. This behaviour was verified against the real Qwen3-4B tokenizer before training began.
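
The branching logic can be mirrored in plain Python. The function below is a hypothetical stand-in for the Jinja template, not the template itself: `render_prompt` and the abbreviated `DEFAULT_SYSTEM` string are illustrative, and the ChatML-style markers are assumed to match Qwen3's format.

```python
# Abbreviated persona, for illustration only.
DEFAULT_SYSTEM = "You are Atem, a precise and analytical reasoning assistant."


def render_prompt(messages):
    """Render messages in ChatML style, injecting a default system turn if none is given."""
    if messages and messages[0]["role"] == "system":
        system_text = messages[0]["content"]   # explicit system message wins
        rest = messages[1:]
    else:
        system_text = DEFAULT_SYSTEM           # the else branch: fall back to the persona
        rest = messages
    parts = [f"<|im_start|>system\n{system_text}<|im_end|>\n"]
    for message in rest:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")    # generation prompt
    return "".join(parts)


default_prompt = render_prompt([{"role": "user", "content": "hi"}])
custom_prompt = render_prompt(
    [{"role": "system", "content": "Be terse."}, {"role": "user", "content": "hi"}]
)
```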


Evaluation

Benchmark Results

Evaluated against base Qwen3-4B (Qwen/Qwen3-4B) using lm-evaluation-harness. Both models were loaded in 4-bit for evaluation.

| Task | Base (Qwen3-4B) | Atem-4B | Delta |
| --- | --- | --- | --- |
| ARC-Challenge (0-shot, acc_norm) | 53.1% | 54.5% | +1.5pp ✓ |
| GSM8K (5-shot, strict-match) | 82.8% | 80.1% | −2.7pp ⚠ |
| HellaSwag (0-shot, acc_norm) | 66.9% | 69.8% | +2.9pp |
| MMLU (0-shot, acc) | 68.4% | 68.7% | +0.3pp (within noise) |

HellaSwag (+2.9pp, 6.2σ) is the headline lm-eval result. This benchmark uses normalised log-likelihood scoring over multiple-choice options — it is insensitive to output formatting and cannot be influenced by style changes in generation. A ~3pp improvement here represents real commonsense reasoning transfer from the training corpus.
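
To make the scoring idea concrete, here is a minimal sketch of length-normalised log-likelihood selection. It assumes normalisation by the character length of each option, which is my understanding of how acc_norm is computed in lm-evaluation-harness; the options and log-likelihood values are invented.

```python
def pick_choice(choices, logliks, normalise=True):
    """Return the index of the highest-scoring option."""
    scores = [
        loglik / len(text) if normalise else loglik
        for text, loglik in zip(choices, logliks)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Invented example: the longer ending has a lower total log-likelihood
# but a better per-character score.
choices = ["sits.", "sits down on the bench and opens a book."]
logliks = [-6.0, -20.0]

raw_pick = pick_choice(choices, logliks, normalise=False)
norm_pick = pick_choice(choices, logliks, normalise=True)
```

Because the model never generates free text under this scheme, the style of its answers has no way to enter the score.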

GSM8K note: The base Qwen3-4B achieves 82.8% on GSM8K — near the ceiling for this model class (comparable to Llama-3.1-8B-Instruct at 78.0%). The −2.7pp regression is likely a formatting artifact: Qwen3-4B in default thinking mode naturally produces #### {answer} format that lm-eval's strict-match regex extracts cleanly, while Atem-4B's CoT training may have shifted terminal answer formatting slightly. The math capability itself does not appear regressed — the qualitative benchmark showed correct solutions to the same category of problems, and HellaSwag's +2.9pp gain (format-independent, log-likelihood scoring) is the stronger signal for genuine reasoning improvement.
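
The formatting sensitivity is easy to see with a small extractor. The pattern below is assumed to resemble the strict-match filter used for GSM8K in lm-evaluation-harness; check the task configuration for the exact expression. The sample outputs are invented.

```python
import re

# Assumed to resemble the harness's strict-match pattern.
STRICT = re.compile(r"#### (\-?[0-9\.\,]+)")


def strict_match(output: str):
    """Return the extracted answer string, or None if the format is absent."""
    match = STRICT.search(output)
    return match.group(1) if match else None


hit = strict_match("6 * 7 = 42, so she has 42 apples.\n#### 42")
miss = strict_match("6 * 7 = 42, so the final answer is **42**.")
```

Both outputs contain the right number, but only the first would be scored as correct under strict matching.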

MMLU (+0.3pp) is within the combined bootstrap standard error of ±0.5pp (0.6σ) and should not be read as a reliable improvement at the aggregate level. The category breakdown tells a more consistent story: Other (+0.8pp) and Social Sciences (+0.7pp) both improved while STEM was flat (−0.3pp), matching the pattern seen across the other benchmarks: reasoning-oriented gains, no curriculum-knowledge gains. This is expected given the training corpus composition.

| MMLU Category | Base | Atem-4B | Delta |
| --- | --- | --- | --- |
| STEM | 69.1% | 68.8% | −0.3pp |
| Other | 71.3% | 72.2% | +0.8pp |
| Social Sciences | 77.9% | 78.7% | +0.7pp |
| Humanities | 59.7% | 59.7% | +0.1pp |
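
The σ figures quoted in this section are a delta divided by a combined standard error. The sketch below assumes the two models' errors are independent; the per-model value of 0.35pp is hypothetical and only shows how a combined error near 0.5pp could arise.

```python
import math


def combined_se(se_a: float, se_b: float) -> float:
    """Standard error of a difference between two independent estimates."""
    return math.sqrt(se_a**2 + se_b**2)


def sigma(delta_pp: float, se_pp: float) -> float:
    """Size of a delta expressed in standard errors."""
    return delta_pp / se_pp


mmlu_sigma = sigma(0.3, 0.5)             # MMLU figures from the text
hellaswag_implied_se = 2.9 / 6.2         # SE implied by +2.9pp at 6.2 sigma
example_se = combined_se(0.35, 0.35)     # hypothetical per-model errors

print(round(mmlu_sigma, 1), round(hellaswag_implied_se, 2), round(example_se, 2))
```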

Qualitative Evaluation

Atem-4B was evaluated against base Qwen3-4B across 30 domain-representative questions spanning coding, mathematics, analytical reasoning, general knowledge, and language. Key findings:

| Domain | Questions | Outcome |
| --- | --- | --- |
| Coding | 8 | Atem: more precise, correctly uses time.perf_counter(), cleaner implementations |
| Mathematics | 6 | Atem: correct on train speed (72 km/h), circular seating (12), Monty Hall (full simulation proof) |
| Analytical Reasoning | 6 | Atem: notably stronger on multi-explanation paradoxes and policy consequence chains |
| General Knowledge | 5 | Atem: Rayleigh scattering with λ⁻⁴ scaling, Hamiltonian dynamics on perpetual motion |
| Language & Logic | 5 | Comparable |

The CoT-preserving design is visible in the outputs: Atem-4B produces selective think traces — engaging reasoning for complex problems, suppressing it for straightforward ones — rather than the uniform marathon traces (2000+ words) that base Qwen3-4B produces by default. This reflects the 85/15 think/no-think training split working as intended.


Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-4B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Explain the Monty Hall problem and why switching doors gives 2/3 probability."
    }
]

# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(
    output[0][prompt_length:],
    skip_special_tokens=True
)
print(response)
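
If the reasoning trace needs to be separated from the final answer, a small helper is usually enough. This sketch assumes the decoded text keeps Qwen3-style `<think>` and `</think>` markers; `split_thinking` is a hypothetical name, and it splits on the first closing tag, so a trace that itself contains that literal string would need more careful handling.

```python
def split_thinking(text: str):
    """Split decoded output into (reasoning, answer)."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No think block was produced: everything is the answer.
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


with_trace = split_thinking("<think>\n2/3 by cases\n</think>\n\nSwitch.")
without_trace = split_thinking("Paris.")
```

For the example above, the call would look like `reasoning, answer = split_thinking(response)`.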

Unsloth (faster inference)

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-4B",
    max_seq_length=6144,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Given a sorted array with one duplicate, find it in O(log n) time."
    }
]

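# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(
    output[0][prompt_length:],
    skip_special_tokens=True
))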

Ollama

# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-4B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-4B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-4B:Q8_0

llama.cpp

llama-server -hf EphAsad/Atem-4B:Q4_K_M
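
Once the server is running, it can be queried from Python through its OpenAI-compatible endpoint. The sketch below assumes the default address `http://localhost:8080` and that the server accepts `top_k` alongside the standard fields; `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080"):
    """Build a POST request for the server's chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,   # sampling values recommended in this card
        "top_p": 0.95,
        "top_k": 20,          # assumed to be accepted as an extra field
        "max_tokens": 2000,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Explain the Monty Hall problem.")` requires the server from the command above to be running; `build_chat_request` alone makes no network call.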

Sampling Parameters

Qwen3's own published guidance for thinking mode recommends temperature=0.6, top_p=0.95, top_k=20. These are the parameters used in the qualitative benchmark above and are recommended for general use. Do not use greedy decoding (temperature=0) with thinking mode — Qwen3's documentation explicitly warns this can cause repetition and degeneration.

System Prompt

Atem-4B's identity is baked into the chat template and activates automatically when no system message is provided. For manual override:

You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.

Available Files

| File | Size | Description |
| --- | --- | --- |
| model.safetensors | ~8 GB | Full bfloat16 merged weights |
| Atem-4b.Q4_K_M.gguf | ~2.5 GB | 4-bit quantised (recommended) |
| Atem-4b.Q5_K_M.gguf | ~2.89 GB | 5-bit quantised |
| Atem-4b.Q8_0.gguf | ~4.5 GB | 8-bit quantised (near-lossless) |

Known Limitations

GSM8K formatting sensitivity. As noted in the evaluation section, the GSM8K regression is likely a formatting artifact rather than a capability regression. For production math applications, verify that your prompting setup elicits answers in the format your extraction logic expects.

6,144 token sequence length ceiling. The training corpus's longest reasoning traces (competitive programming, advanced mathematics) exceed 6,144 tokens and were dropped during formatting. The model has not been exposed to very long chain-of-thought traces and may perform below its potential on problems that genuinely require extended reasoning chains. At inference time, raising max_seq_length (and max_new_tokens in the generation call) allows longer generation budgets, although behaviour beyond the trained length is untested.

CoT activation is selective. Atem-4B does not think on every question — the 85/15 training split means it has learned to engage reasoning for complex problems and skip it for simpler ones. If you require thinking traces on a specific query type that the model is treating as simple, an explicit instruction in the system prompt can encourage it.

Single-pass SFT, no RLHF or DPO. Atem-4B has not undergone preference optimisation. Responses are accurate and structured but may not be as reliably aligned with user preferences in open-ended creative or instructional tasks compared to models that have undergone preference training.


Roadmap

  • Atem-14B: Single CoT-preserving pass on Qwen3-14B with the same methodology, full 16-bit LoRA, r=64

Citation

@misc{atem_4b_2026,
  author       = {Asad, Zain},
  title        = {Atem-4B: A 4B CoT-Preserving Reasoning Model via
                  Single-Pass SFT on Qwen3},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-4B}},
}

License

Released under the Apache 2.0 License, consistent with the base model Qwen/Qwen3-4B.


Built independently by Zain Asad — EphAsad
