Atem-4B
Ancient logic. Modern intelligence.
A 4B reasoning model trained via a single CoT-preserving SFT pass directly on Qwen3-4B, distilling multi-domain reasoning capability from frontier teacher models while keeping the base model's native thinking capability intact.
Overview
Atem-4B is a 4B parameter reasoning model built via a single supervised fine-tuning pass on raw Qwen3-4B. Unlike the earlier 0.6B Atem line, which erased Qwen3's native thinking mode in Stage 1 and re-imposed an externally-distilled reasoning style in Stage 2, Atem-4B is trained in one combined CoT-preserving pass — building reasoning capability on top of the base model's intact native foundation rather than on a cleared one.
This is currently the flagship model in the Atem series and the first to use full 16-bit LoRA (not QLoRA) on a Qwen3 base, enabled by the available VRAM headroom at this scale.
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-4B |
| Training method | Single-pass CoT-Preserving LoRA SFT |
| LoRA config | r=64, alpha=128, dropout=0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Parameters | ~4.02B |
| Trainable (LoRA) params | 125,042,688 (3.11% of base) |
| Training records | 56,573 (after token-length filtering) |
| Think / No-think split | 85% / 15% |
| Epochs | 2 (ceiling; early stopping patience=3) |
| Effective batch size | 64 (batch 8 × grad accum 8) |
| Learning rate | 1e-4, cosine schedule, 5% warmup |
| Max sequence length | 6,144 tokens |
| Precision | bfloat16 (full 16-bit LoRA, not QLoRA) |
| Hardware | NVIDIA A100-SXM4 80GB |
| License | Apache 2.0 |
Design Notes
Why a single combined pass? The 0.6B Atem pipeline (Stage 1: erase thinking → Stage 2: rebuild with external CoT) created two compounding problems: the erasure pass cost real capability on multi-step reasoning relative to the base model, and the rebuild pass re-imposed an external reasoning style on a foundation that had already been structurally altered. Qualitative benchmarking confirmed this — base Qwen3-0.6B in thinking mode self-corrected on problems the no-think Atem-0.6B got wrong, and lm-eval showed a real ARC-Challenge regression after Stage 2. Atem-4B skips the erasure entirely: one pass, intact native reasoning, externally-distilled CoT styles layered on top of something that still works.
Why full 16-bit LoRA? Every prior Atem script used QLoRA because VRAM was the binding constraint at 0.6B–3B. At 4B with an 80GB A100, full 16-bit LoRA requires ~33GB — comfortably inside budget. Full 16-bit is both marginally faster and marginally more accurate than QLoRA at equivalent effective batch sizes, since QLoRA pays a real compute cost on quantize/dequantize operations at each step.
Why r=64? r=32 represented 3.54% of the 0.6B model but only 0.82% of a 14B model — LoRA params scale roughly linearly with hidden size while total model params scale closer to quadratically. At 4B, r=64 recovers 3.11% proportional capacity, close to the proven 0.6B baseline, without jumping to the "very complex" territory of r=128+.
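The scaling argument above can be made concrete with a short calculation. A LoRA adapter on a linear layer of shape (d_in, d_out) adds r × (d_in + d_out) parameters, so the adapter size grows linearly in r and in the layer widths. The sketch below is illustrative only: `lora_param_count` is a helper name introduced here, and the layer shapes are hypothetical placeholders, not the real Qwen3-4B dimensions. The final line only re-derives the 3.11% figure from the two numbers in the Model Details table.

```python
from typing import Dict, Tuple


def lora_param_count(r: int, shapes: Dict[str, Tuple[int, int]], n_layers: int) -> int:
    """Count LoRA parameters: each adapted linear layer adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


# Hypothetical (d_in, d_out) shapes for the seven adapted projections of one block.
# Placeholder values for illustration, not the actual Qwen3-4B configuration.
hidden, inter, kv = 1024, 4096, 256
example_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv),
    "v_proj": (hidden, kv),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj": (hidden, inter),
    "down_proj": (inter, hidden),
}

r32 = lora_param_count(32, example_shapes, n_layers=24)
r64 = lora_param_count(64, example_shapes, n_layers=24)
print(r32, r64)  # doubling r doubles the adapter size

# Proportion from the Model Details table: trainable params / approximate base params.
trainable, base = 125_042_688, 4.02e9
print(f"{trainable / base:.2%}")
```

Because the adapter count is linear in r, holding the trainable fraction roughly constant as the base model grows means raising r, which is the trade-off the paragraph above describes.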
Intended Use
Atem-4B is designed for general reasoning tasks where structured, step-by-step thinking adds value:
- Multi-step mathematical reasoning
- Code explanation, implementation, and debugging
- Analytical reasoning and argument evaluation
- Scientific explanation requiring technical depth
- Logic, fallacy identification, and conditional reasoning
- Concept explanation across diverse domains
Atem-4B is not designed for real-time information retrieval, factual lookup requiring post-training knowledge, or tasks where a fast direct answer without reasoning is preferred — use Atem-0.6B or Atem-Savant-0.6B for those.
Training Data
Atem-4B was trained on a corpus assembled from eight sources covering mathematics, coding, general reasoning, scientific reasoning, and medical reasoning. All sources include explicit chain-of-thought reasoning traces; 85% of training records were formatted with full think traces and 15% as direct answers to maintain no-think capability.
| Dataset | Records | Source / Teacher |
|---|---|---|
| mitroitskii/OpenR1-Math-220k-formatted | 10,000 | DeepSeek-R1 — Mathematics (correctness-filtered) |
| Jackrong/Claude-opus-4.6-TraceInversion-9000x | 7,000 | Claude Opus 4.6 — Trace Inversion |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Math) | 8,000 | Kimi K2.5 — Mathematical Reasoning |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Distillation) | 8,000 | Kimi K2.5 — General Reasoning |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (PHD-Science) | 8,000 | Kimi K2.5 — Scientific Reasoning |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | MiniMax M2.7 |
| FreedomIntelligence/medical-o1-reasoning-SFT | 6,000 | Medical reasoning (English config) |
| Modotte/CodeX-2M-Thinking | 15,000 | Mixed — Coding with CoT |
| trjxter/DeepSeek-V4-Pro-Reasoning-8000x | ~8,000 | DeepSeek-V4-Pro |
| nvidia/OpenCodeReasoning | 15,000 | Mixed — Competitive coding |
| Total (post-filter) | 56,573 | |
Non-English reasoning traces (primarily CJK) were filtered at the trace level using an ASCII-ratio threshold; records with CJK traces were retained as no-think records rather than discarded entirely. Traces from all sources were passed via the reasoning_content message key rather than manual <think>...</think> concatenation, preventing silent truncation when reasoning traces contain the literal substring </think>.
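A minimal sketch of the filtering and formatting steps described above, under stated assumptions: the 0.9 threshold, the helper names `ascii_ratio` and `build_messages`, and the sample strings are all hypothetical, since the card does not publish the actual values or code. It shows the two ideas together: a trace that fails the ASCII-ratio check is dropped so the record becomes a no-think record, and a trace that passes is attached under the `reasoning_content` key instead of being concatenated into the answer.

```python
from typing import Dict, List, Optional


def ascii_ratio(text: str) -> float:
    """Fraction of characters in `text` that are ASCII (1.0 for empty text)."""
    if not text:
        return 1.0
    return sum(ch.isascii() for ch in text) / len(text)


def build_messages(question: str, answer: str, trace: Optional[str],
                   min_ascii_ratio: float = 0.9) -> List[Dict[str, str]]:
    """Build one chat record; keep the trace only if it looks like English text.

    The 0.9 default is an assumed value for illustration, not the card's setting.
    """
    assistant: Dict[str, str] = {"role": "assistant", "content": answer}
    if trace and ascii_ratio(trace) >= min_ascii_ratio:
        # Pass the trace under its own key rather than wrapping it in <think> tags,
        # so a literal "</think>" inside the trace cannot cut the trace short.
        assistant["reasoning_content"] = trace
    return [{"role": "user", "content": question}, assistant]


english = build_messages("2+2?", "4", "Add the two numbers: 2 + 2 = 4.")
cjk = build_messages("2+2?", "4", "把两个数相加,二加二等于四。")
print("reasoning_content" in english[1], "reasoning_content" in cjk[1])
```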
Training Configuration
# Key hyperparameters
lora_r = 64
lora_alpha = 128
lora_dropout = 0.05
max_seq_length = 6144
learning_rate = 1e-4
lr_scheduler = 'cosine'
warmup_ratio = 0.05
batch_size = 8
grad_accumulation = 8 # effective batch size: 64
num_epochs = 2 # ceiling — early stopping may cut short
eval_steps = 150
early_stopping_patience = 3
early_stopping_threshold = 0.001
nothink_ratio = 0.15
load_in_4bit = False # full 16-bit LoRA
dtype = bfloat16
Training used Unsloth with train_on_responses_only masking — loss was computed exclusively on assistant response tokens including reasoning traces for think records. Early stopping halts training automatically when validation loss fails to improve by more than 0.001 across three consecutive evaluation windows, preventing wasted compute at plateau. load_best_model_at_end=True ensures the merged checkpoint reflects the best-validation-loss state rather than the final training step.
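The stopping rule described above can be illustrated with a small simulation. This is a simplified model of patience-based early stopping, not the trainer's actual callback, which may differ in detail; the function name and the loss values are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def simulate_early_stopping(eval_losses: Iterable[float], patience: int = 3,
                            threshold: float = 0.001) -> Tuple[Optional[int], int]:
    """Return (stop_index, best_index) for a sequence of validation losses.

    stop_index is None when training would run through every evaluation window.
    """
    best = float("inf")
    best_index = -1
    stale = 0  # consecutive evaluations without a large enough improvement
    for i, loss in enumerate(eval_losses):
        if best - loss > threshold:
            best, best_index, stale = loss, i, 0
        else:
            stale += 1
            if stale >= patience:
                return i, best_index
    return None, best_index


# Hypothetical validation losses, one per evaluation window (eval_steps = 150).
losses = [0.92, 0.85, 0.81, 0.8095, 0.8101, 0.8099, 0.80]
print(simulate_early_stopping(losses))  # (5, 2)
```

In this run the loss plateaus after the third evaluation, so training stops at index 5 and the checkpoint from index 2 is the one a best-model reload would restore.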
Atem's default identity is baked into the chat template via a Jinja else-branch injection: callers without an explicit system message receive the Atem persona by default; explicit system messages override it normally. Verified against the real Qwen3-4B tokenizer before training began.
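The template itself is not reproduced in this card, so the following is only a Python analogue of the if/else behaviour described above, not the Jinja source. `with_default_system` is a hypothetical helper, and `DEFAULT_SYSTEM` is shortened from the persona text in the System Prompt section.

```python
from typing import Dict, List

# First sentence of the persona from the System Prompt section (shortened here).
DEFAULT_SYSTEM = "You are Atem, a precise and analytical reasoning assistant."


def with_default_system(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep an explicit system message; otherwise prepend the default persona."""
    if messages and messages[0]["role"] == "system":
        return list(messages)
    return [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)


user_only = [{"role": "user", "content": "Who are you?"}]
custom = [{"role": "system", "content": "You are a terse assistant."}] + user_only

print(with_default_system(user_only)[0]["content"])
print(with_default_system(custom)[0]["content"])
```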
Evaluation
Benchmark Results
Evaluated against base Qwen3-4B (Qwen/Qwen3-4B) using lm-evaluation-harness. Both models were loaded in 4-bit for evaluation.
| Task | Base (Qwen3-4B) | Atem-4B | Delta |
|---|---|---|---|
| ARC-Challenge (0-shot, acc_norm) | 53.1% | 54.5% | +1.5pp ✓ |
| GSM8K (5-shot, strict-match) | 82.8% | 80.1% | −2.7pp ⚠ |
| HellaSwag (0-shot, acc_norm) | 66.9% | 69.8% | +2.9pp ✓ |
| MMLU (0-shot, acc) | 68.4% | 68.7% | +0.3pp — |
HellaSwag (+2.9pp, 6.2σ) is the headline lm-eval result. This benchmark uses normalised log-likelihood scoring over multiple-choice options — it is insensitive to output formatting and cannot be influenced by style changes in generation. A ~3pp improvement here represents real commonsense reasoning transfer from the training corpus.
GSM8K note: The base Qwen3-4B achieves 82.8% on GSM8K — near the ceiling for this model class (comparable to Llama-3.1-8B-Instruct at 78.0%). The −2.7pp regression is likely a formatting artifact: Qwen3-4B in default thinking mode naturally produces #### {answer} format that lm-eval's strict-match regex extracts cleanly, while Atem-4B's CoT training may have shifted terminal answer formatting slightly. The math capability itself does not appear regressed — the qualitative benchmark showed correct solutions to the same category of problems, and HellaSwag's +2.9pp gain (format-independent, log-likelihood scoring) is the stronger signal for genuine reasoning improvement.
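To see how answer formatting alone can move a strict-match score, consider the sketch below. The patterns are simplified assumptions written for this example, not copied from lm-evaluation-harness, and the two completions are made up. Both contain the correct answer, but only the first ends in the `#### {answer}` form that a strict matcher looks for.

```python
import re
from typing import Optional

# Simplified strict pattern: requires the literal "#### <number>" form (an assumption).
STRICT = re.compile(r"#### (-?[0-9.,]+)")
# Looser fallback: any number in the text; the last one is taken as the answer.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def strict_extract(text: str) -> Optional[str]:
    match = STRICT.search(text)
    return match.group(1).replace(",", "") if match else None


def flexible_extract(text: str) -> Optional[str]:
    numbers = NUMBER.findall(text)
    return numbers[-1].replace(",", "") if numbers else None


gsm_style = "She sells 9 eggs at $2 each, so 9 * 2 = 18.\n#### 18"
prose_style = "She sells 9 eggs at $2 each, so 9 * 2 = 18.\nThe final answer is **18**."

print(strict_extract(gsm_style), strict_extract(prose_style))  # 18 None
print(flexible_extract(prose_style))                           # 18
```

A completion like `prose_style` would be scored as wrong under strict matching even though the arithmetic is right, which is the kind of artifact the note above suspects.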
MMLU (+0.3pp) is within the combined bootstrap standard error of ±0.5pp (0.6σ) and should not be read as a reliable improvement at the aggregate level. The category breakdown tells a more consistent story: Other (+0.8pp) and Social Sciences (+0.7pp) both improved, while STEM was flat (−0.3pp) and Humanities was essentially unchanged (+0.1pp). This matches the pattern seen across the other benchmarks: reasoning-oriented gains, no curriculum-knowledge gains. This is expected given the training corpus composition.
| MMLU Category | Base | Atem-4B | Delta |
|---|---|---|---|
| STEM | 69.1% | 68.8% | −0.3pp |
| Other | 71.3% | 72.2% | +0.8pp |
| Social Sciences | 77.9% | 78.7% | +0.7pp |
| Humanities | 59.7% | 59.7% | +0.1pp |
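For readers who want to reproduce the "within the combined standard error" check on the aggregate MMLU delta, here is a minimal sketch. It assumes the two evaluations are independent, so their standard errors add in quadrature. The per-model standard errors below are hypothetical values chosen for illustration; the card reports only the combined figure of about ±0.5pp.

```python
import math


def combined_se(se_a: float, se_b: float) -> float:
    """Standard error of the difference of two independent estimates."""
    return math.sqrt(se_a ** 2 + se_b ** 2)


def delta_in_sigma(delta: float, se_a: float, se_b: float) -> float:
    """Express a score difference in units of its combined standard error."""
    return delta / combined_se(se_a, se_b)


# Aggregate MMLU accuracies from the benchmark table, in percentage points.
delta_pp = 68.7 - 68.4
# Hypothetical per-model standard errors in percentage points (not reported in the card).
se_base, se_atem = 0.35, 0.35

print(round(combined_se(se_base, se_atem), 2))               # 0.49
print(round(delta_in_sigma(delta_pp, se_base, se_atem), 1))  # 0.6
```

A delta well under one combined standard error, as here, is indistinguishable from evaluation noise.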
Qualitative Evaluation
Atem-4B was evaluated against base Qwen3-4B across 30 domain-representative questions spanning coding, mathematics, analytical reasoning, general knowledge, and language. Key findings:
| Domain | Questions | Outcome |
|---|---|---|
| Coding | 8 | Atem — more precise, correctly uses time.perf_counter(), cleaner implementations |
| Mathematics | 6 | Atem — correct on train speed (72 km/h), circular seating (12), Monty Hall (full simulation proof) |
| Analytical Reasoning | 6 | Atem — notably stronger on multi-explanation paradoxes and policy consequence chains |
| General Knowledge | 5 | Atem — Rayleigh scattering with λ⁻⁴ scaling, Hamiltonian dynamics on perpetual motion |
| Language & Logic | 5 | Comparable |
The CoT-preserving design is visible in the outputs: Atem-4B produces selective think traces — engaging reasoning for complex problems, suppressing it for straightforward ones — rather than the uniform marathon traces (2000+ words) that base Qwen3-4B produces by default. This reflects the 85/15 think/no-think training split working as intended.
Usage
Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EphAsad/Atem-4B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Explain the Monty Hall problem and why switching doors gives 2/3 probability.",
    }
]

# Apply the chat template; return_dict=True yields input_ids and attention_mask.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(response)
Unsloth (faster inference)
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-4B",
    max_seq_length=6144,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
# Switch Unsloth into its inference mode.
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Given a sorted array with one duplicate, find it in O(log n) time.",
    }
]

# Apply the chat template; return_dict=True yields input_ids and attention_mask.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
))
Ollama
# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-4B:Q4_K_M
# Higher quality
ollama run hf.co/EphAsad/Atem-4B:Q5_K_M
# Near-lossless
ollama run hf.co/EphAsad/Atem-4B:Q8_0
llama.cpp
llama-server -hf EphAsad/Atem-4B:Q4_K_M
Sampling Parameters
Qwen3's own published guidance for thinking mode recommends temperature=0.6, top_p=0.95, top_k=20. These are the parameters used in the qualitative benchmark above and are recommended for general use. Do not use greedy decoding (temperature=0) with thinking mode — Qwen3's documentation explicitly warns this can cause repetition and degeneration.
System Prompt
Atem-4B's identity is baked into the chat template and activates automatically when no system message is provided. For manual override:
You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.
Available Files
| File | Size | Description |
|---|---|---|
| model.safetensors | ~8 GB | Full bfloat16 merged weights |
| Atem-4b.Q4_K_M.gguf | ~2.5 GB | 4-bit quantised (recommended) |
| Atem-4b.Q5_K_M.gguf | ~2.89 GB | 5-bit quantised |
| Atem-4b.Q8_0.gguf | ~4.5 GB | 8-bit quantised (near-lossless) |
Known Limitations
GSM8K formatting sensitivity. As noted in the evaluation section, the GSM8K regression is likely a formatting artifact rather than a capability regression. For production math applications, verify that your prompting setup elicits answers in the format your extraction logic expects.
6,144-token sequence length ceiling. The training corpus's longest reasoning traces (competitive programming, advanced mathematics) exceed 6,144 tokens and were dropped during formatting. The model has therefore not been exposed to very long chain-of-thought traces and may perform below its potential on problems that genuinely require extended reasoning chains. At inference time, raise max_new_tokens (and max_seq_length when loading with Unsloth) to allow a longer generation budget.
CoT activation is selective. Atem-4B does not think on every question — the 85/15 training split means it has learned to engage reasoning for complex problems and skip it for simpler ones. If you require thinking traces on a specific query type that the model is treating as simple, an explicit instruction in the system prompt can encourage it.
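Because thinking is selective, downstream code should handle responses both with and without a trace. The helper below is a minimal sketch, assuming the decoded text still contains the literal `<think>` and `</think>` tags when a trace is present; `split_thinking` is a name introduced here, not part of any library.

```python
from typing import Tuple

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def split_thinking(text: str) -> Tuple[str, str]:
    """Split decoded output into (reasoning_trace, final_answer).

    Returns an empty trace when the model answered without a think block.
    """
    start = text.find(THINK_OPEN)
    end = text.rfind(THINK_CLOSE)
    if start == -1 or end == -1 or end < start:
        return "", text.strip()
    trace = text[start + len(THINK_OPEN):end].strip()
    answer = text[end + len(THINK_CLOSE):].strip()
    return trace, answer


print(split_thinking("<think>\n6 * 7 = 42\n</think>\n\nThe answer is 42."))
print(split_thinking("Paris."))
```

An empty first element is a simple signal that the model treated the query as straightforward, which can be logged to decide whether a system-prompt nudge is needed for that query type.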
Single-pass SFT, no RLHF or DPO. Atem-4B has not undergone preference optimisation. Responses are accurate and structured but may not be as reliably aligned with user preferences in open-ended creative or instructional tasks compared to models that have undergone preference training.
Roadmap
- Atem-14B: Single CoT-preserving pass on Qwen3-14B with the same methodology, full 16-bit LoRA, r=64
Citation
@misc{atem_4b_2026,
author = {Asad, Zain},
title = {Atem-4B: A 4B CoT-Preserving Reasoning Model via
Single-Pass SFT on Qwen3},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/EphAsad/Atem-4B}},
}
License
Released under the Apache 2.0 License, consistent with the base model Qwen/Qwen3-4B.
Built independently by Zain Asad — EphAsad
