Atem-4B

Ancient logic. Modern intelligence.

A 4B reasoning model trained via a single CoT-preserving SFT pass directly on Qwen3-4B, distilling multi-domain reasoning capability from frontier teacher models while keeping the base model's native thinking capability intact.

Overview

Atem-4B is a 4B parameter reasoning model built via a single supervised fine-tuning pass on unmodified Qwen3-4B. Unlike the earlier 0.6B Atem line, which erased Qwen3's native thinking mode in Stage 1 and re-imposed an externally-distilled reasoning style in Stage 2, Atem-4B is trained in one combined CoT-preserving pass, building reasoning capability on top of the base model's intact native foundation rather than on a cleared one.

This is currently the flagship model in the Atem series and the first to use full 16-bit LoRA (not QLoRA) on a Qwen3 base, enabled by the available VRAM headroom at this scale.

Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen3-4B |
| Training method | Single-pass CoT-Preserving LoRA SFT |
| LoRA config | r=64, alpha=128, dropout=0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Parameters | ~4.02B |
| Trainable (LoRA) params | 125,042,688 (3.11% of base) |
| Training records | 56,573 (after token-length filtering) |
| Think / No-think split | 85% / 15% |
| Epochs | 2 (ceiling; early stopping patience=3) |
| Effective batch size | 64 (batch 8 × grad accum 8) |
| Learning rate | 1e-4, cosine schedule, 5% warmup |
| Max sequence length | 6,144 tokens |
| Precision | bfloat16 (full 16-bit LoRA, not QLoRA) |
| Hardware | NVIDIA A100-SXM4 80GB |
| License | Apache 2.0 |

Design Notes

Why a single combined pass? The 0.6B Atem pipeline (Stage 1: erase thinking → Stage 2: rebuild with external CoT) created two compounding problems: the erasure pass cost real capability on multi-step reasoning relative to the base model, and the rebuild pass re-imposed an external reasoning style on a foundation that had already been structurally altered. Qualitative benchmarking confirmed this — base Qwen3-0.6B in thinking mode self-corrected on problems the no-think Atem-0.6B got wrong, and lm-eval showed a real ARC-Challenge regression after Stage 2. Atem-4B skips the erasure entirely: one pass, intact native reasoning, externally-distilled CoT styles layered on top of something that still works.

Why full 16-bit LoRA? Every prior Atem script used QLoRA because VRAM was the binding constraint at 0.6B–3B. At 4B with an 80GB A100, full 16-bit LoRA requires ~33GB — comfortably inside budget. Full 16-bit is both marginally faster and marginally more accurate than QLoRA at equivalent effective batch sizes, since QLoRA pays a real compute cost on quantize/dequantize operations at each step.

Why r=64? r=32 represented 3.54% of the 0.6B model but only 0.82% of a 14B model — LoRA params scale roughly linearly with hidden size while total model params scale closer to quadratically. At 4B, r=64 recovers 3.11% proportional capacity, close to the proven 0.6B baseline, without jumping to the "very complex" territory of r=128+.
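The scaling argument can be checked with a rough parameter count. The sketch below assumes a simplified transformer in which every layer has four square attention projections and three MLP projections of width 4× hidden, with a LoRA adapter on all seven. The hidden sizes and layer counts are hypothetical and do not reproduce Qwen3's real shapes (which use grouped-query attention and a different MLP width), so the percentages will not match the figures quoted above; only the trend is the point.

```python
def lora_params(hidden: int, layers: int, rank: int, mlp_mult: int = 4) -> int:
    """LoRA parameters for adapters on q, k, v, o, gate, up, down.

    An adapter on a (d_in x d_out) weight adds rank * (d_in + d_out) parameters.
    """
    inter = mlp_mult * hidden
    attn = 4 * rank * (hidden + hidden)  # q, k, v, o (simplified: all square)
    mlp = 3 * rank * (hidden + inter)    # gate, up, down
    return layers * (attn + mlp)


def base_params(hidden: int, layers: int, mlp_mult: int = 4) -> int:
    """Parameters in the adapted weight matrices of the simplified model."""
    inter = mlp_mult * hidden
    attn = 4 * hidden * hidden
    mlp = 3 * hidden * inter
    return layers * (attn + mlp)


def lora_fraction(hidden: int, layers: int, rank: int) -> float:
    """Share of LoRA parameters relative to the adapted base weights."""
    return lora_params(hidden, layers, rank) / base_params(hidden, layers)


if __name__ == "__main__":
    # Hypothetical (hidden, layers) pairs, chosen only to show the trend.
    for hidden, layers in [(1024, 28), (2048, 36), (4096, 40)]:
        for rank in (32, 64):
            share = lora_fraction(hidden, layers, rank)
            print(f"hidden={hidden:5d} rank={rank:3d} -> {share:.2%}")
```

In this simplified model the share reduces to 23r / (16h), so at a fixed rank it halves each time the hidden size doubles, and doubling the rank restores it.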

Intended Use

Atem-4B is designed for general reasoning tasks where structured, step-by-step thinking adds value:

Multi-step mathematical reasoning
Code explanation, implementation, and debugging
Analytical reasoning and argument evaluation
Scientific explanation requiring technical depth
Logic, fallacy identification, and conditional reasoning
Concept explanation across diverse domains

Atem-4B is not designed for real-time information retrieval, factual lookup that depends on knowledge newer than its training data, or tasks where a fast direct answer without reasoning is preferred. For those, use Atem-0.6B or Atem-Savant-0.6B.

Training Data

Atem-4B was trained on a corpus assembled from eight sources covering mathematics, coding, general reasoning, scientific reasoning, and medical reasoning. All sources include explicit chain-of-thought reasoning traces; 85% of training records were formatted with full think traces and 15% as direct answers to maintain no-think capability.

| Dataset | Records | Source / Teacher |
| --- | --- | --- |
| mitroitskii/OpenR1-Math-220k-formatted | 10,000 | DeepSeek-R1: Mathematics (correctness-filtered) |
| Jackrong/Claude-opus-4.6-TraceInversion-9000x | 7,000 | Claude Opus 4.6: Trace Inversion |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Math) | 8,000 | Kimi K2.5: Mathematical Reasoning |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Distillation) | 8,000 | Kimi K2.5: General Reasoning |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (PHD-Science) | 8,000 | Kimi K2.5: Scientific Reasoning |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | MiniMax M2.7 |
| FreedomIntelligence/medical-o1-reasoning-SFT | 6,000 | Medical reasoning (English config) |
| Modotte/CodeX-2M-Thinking | 15,000 | Mixed: Coding with CoT |
| trjxter/DeepSeek-V4-Pro-Reasoning-8000x | ~8,000 | DeepSeek-V4-Pro |
| nvidia/OpenCodeReasoning | 15,000 | Mixed: Competitive coding |
| Total (post-filter) | 56,573 | |

Non-English reasoning traces (primarily CJK) were filtered at the trace level using an ASCII-ratio threshold; records with CJK traces were retained as no-think records rather than discarded entirely. Traces from all sources were passed via the reasoning_content message key rather than manual <think>...</think> concatenation, preventing silent truncation when reasoning traces contain the literal substring </think>.
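The routing described above can be sketched as follows. This is an illustration rather than the actual training script: the helper names and the 0.9 threshold are hypothetical (the real threshold is not stated here), and the record layout simply follows the `reasoning_content` message key mentioned in the text.

```python
ASCII_THRESHOLD = 0.9  # hypothetical value; the real threshold is not published here


def ascii_ratio(text: str) -> float:
    """Fraction of characters in `text` that are ASCII (1.0 for empty text)."""
    if not text:
        return 1.0
    return sum(ch.isascii() for ch in text) / len(text)


def build_messages(question: str, answer: str, trace: str,
                   threshold: float = ASCII_THRESHOLD) -> list:
    """Build one chat record, keeping the trace only if it is mostly ASCII.

    Records whose trace falls below the threshold are kept as no-think
    records (answer only) instead of being dropped.
    """
    assistant = {"role": "assistant", "content": answer}
    if trace and ascii_ratio(trace) >= threshold:
        # The trace travels in its own key, so a literal "</think>" inside
        # it is never confused with a closing tag.
        assistant["reasoning_content"] = trace
    return [{"role": "user", "content": question}, assistant]


if __name__ == "__main__":
    think = build_messages("2 + 2?", "4", "Add the numbers; note </think> is just text here.")
    nothink = build_messages("2 + 2?", "4", "首先我们计算两个数的和")
    print("reasoning_content" in think[1], "reasoning_content" in nothink[1])
```

Keeping the trace in a separate field leaves tag placement to the chat template, which is what avoids the truncation problem with hand-concatenated `<think>` blocks.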

Training Configuration

# Key hyperparameters
lora_r             = 64
lora_alpha         = 128
lora_dropout       = 0.05
max_seq_length     = 6144
learning_rate      = 1e-4
lr_scheduler       = 'cosine'
warmup_ratio       = 0.05
batch_size         = 8
grad_accumulation  = 8          # effective batch size: 64
num_epochs         = 2          # ceiling — early stopping may cut short
eval_steps         = 150
early_stopping_patience   = 3
early_stopping_threshold  = 0.001
nothink_ratio      = 0.15
load_in_4bit       = False      # full 16-bit LoRA
dtype              = bfloat16

Training used Unsloth with train_on_responses_only masking — loss was computed exclusively on assistant response tokens including reasoning traces for think records. Early stopping halts training automatically when validation loss fails to improve by more than 0.001 across three consecutive evaluation windows, preventing wasted compute at plateau. load_best_model_at_end=True ensures the merged checkpoint reflects the best-validation-loss state rather than the final training step.
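The early-stopping rule can be restated as a few lines of plain Python. This is a simplified sketch of the behaviour described above, not the trainer's own callback; the function name and the loss values are made up for illustration.

```python
def stop_step(eval_losses, patience=3, threshold=0.001, eval_steps=150):
    """Return the training step at which early stopping would trigger.

    An evaluation counts as an improvement only if it beats the best loss
    so far by more than `threshold`. After `patience` consecutive
    evaluations without improvement, training stops. Returns None if the
    rule never fires.
    """
    best = float("inf")
    stale = 0
    for i, loss in enumerate(eval_losses, start=1):
        if best - loss > threshold:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return i * eval_steps
    return None


if __name__ == "__main__":
    # Illustrative validation losses, one per evaluation window.
    losses = [1.20, 1.10, 1.05, 1.0495, 1.0493, 1.0499]
    print(stop_step(losses))
```

With these illustrative losses the last real improvement is the third evaluation, so the rule fires three evaluations later. Restoring the best checkpoint afterwards is what `load_best_model_at_end=True` handles in the real run.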

Atem's default identity is baked into the chat template via a Jinja else-branch injection: callers without an explicit system message receive the Atem persona by default, while an explicit system message overrides it. This behaviour was verified against the real Qwen3-4B tokenizer before training began.
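The effect of that else-branch can be mimicked in plain Python. The sketch below is not the Jinja template itself: the function name is hypothetical, and the default prompt is abbreviated to the first sentence of the persona text given under System Prompt.

```python
# Abbreviated stand-in for the persona text; the template's exact wording may differ.
DEFAULT_SYSTEM = "You are Atem, a precise and analytical reasoning assistant."


def with_default_system(messages, default=DEFAULT_SYSTEM):
    """Return a copy of `messages` with a system message guaranteed at the front.

    If the caller already supplied a system message, it is left untouched;
    otherwise the default persona is prepended.
    """
    if messages and messages[0]["role"] == "system":
        return list(messages)
    return [{"role": "system", "content": default}] + list(messages)


if __name__ == "__main__":
    print(with_default_system([{"role": "user", "content": "Who are you?"}])[0])
    custom = [{"role": "system", "content": "You are a terse assistant."},
              {"role": "user", "content": "Who are you?"}]
    print(with_default_system(custom)[0])
```

Because the real fallback lives in the chat template, callers do not need a helper like this; it only makes the two cases explicit.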

Evaluation

Benchmark Results

Evaluated against base Qwen3-4B (Qwen/Qwen3-4B) using lm-evaluation-harness. Both models were loaded in 4-bit for evaluation.

| Task | Base (Qwen3-4B) | Atem-4B | Delta |
| --- | --- | --- | --- |
| ARC-Challenge (0-shot, acc_norm) | 53.1% | 54.5% | +1.5pp ✓ |
| GSM8K (5-shot, strict-match) | 82.8% | 80.1% | −2.7pp ⚠ |
| HellaSwag (0-shot, acc_norm) | 66.9% | 69.8% | +2.9pp ✓ |
| MMLU (0-shot, acc) | 68.4% | 68.7% | +0.3pp (within noise) |

HellaSwag (+2.9pp, 6.2σ) is the headline lm-eval result. This benchmark uses normalised log-likelihood scoring over multiple-choice options — it is insensitive to output formatting and cannot be influenced by style changes in generation. A ~3pp improvement here represents real commonsense reasoning transfer from the training corpus.

GSM8K note: The base Qwen3-4B achieves 82.8% on GSM8K — near the ceiling for this model class (comparable to Llama-3.1-8B-Instruct at 78.0%). The −2.7pp regression is likely a formatting artifact: Qwen3-4B in default thinking mode naturally produces #### {answer} format that lm-eval's strict-match regex extracts cleanly, while Atem-4B's CoT training may have shifted terminal answer formatting slightly. The math capability itself does not appear regressed — the qualitative benchmark showed correct solutions to the same category of problems, and HellaSwag's +2.9pp gain (format-independent, log-likelihood scoring) is the stronger signal for genuine reasoning improvement.
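To see why terminal formatting matters for this metric, here is a small sketch of strict-match style extraction. The regular expression is an approximation of the pattern the harness uses for GSM8K, reproduced from memory, so treat it as an assumption and check the task config before relying on it. The sample completions are invented.

```python
import re

# Approximation of the GSM8K strict-match pattern (assumption, see lead-in).
STRICT_MATCH = re.compile(r"#### (\-?[0-9\.\,]+)")


def strict_match(completion: str):
    """Return the number following '#### ', or None if the marker is absent."""
    found = STRICT_MATCH.search(completion)
    return found.group(1) if found else None


if __name__ == "__main__":
    samples = [
        "18 * 4 = 72\n#### 72",      # GSM8K-style terminal line
        "So the answer is **72**.",  # correct, but no '#### ' marker
        "Therefore \\boxed{72}",     # correct, LaTeX-style answer
    ]
    for text in samples:
        print(repr(strict_match(text)))
```

All three completions carry the right number, but only the first is extracted. A model that drifts away from the `#### {answer}` line therefore loses points without any loss of arithmetic ability.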

MMLU (+0.3pp) is within the combined bootstrap standard error of ±0.5pp (0.6σ) and should not be read as a reliable improvement at the aggregate level. The category breakdown tells a more consistent story: Social Sciences (+0.7pp) and Other (+0.8pp) both improved while STEM was flat (−0.3pp), matching the pattern seen across all other benchmarks: reasoning-oriented gains, no curriculum-knowledge gains. This is expected given the training corpus composition.

| MMLU Category | Base | Atem-4B | Delta |
| --- | --- | --- | --- |
| STEM | 69.1% | 68.8% | −0.3pp |
| Other | 71.3% | 72.2% | +0.8pp |
| Social Sciences | 77.9% | 78.7% | +0.7pp |
| Humanities | 59.7% | 59.7% | +0.1pp |
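The σ figures quoted for HellaSwag and MMLU are a delta divided by a combined standard error. The sketch below shows that arithmetic under the usual assumption that the two runs are independent, so their standard errors add in quadrature. The per-model standard errors in the example are hypothetical, since only the combined MMLU value (±0.5pp) is reported above.

```python
import math


def combined_se(se_base: float, se_tuned: float) -> float:
    """Standard error of a difference between two independent estimates."""
    return math.sqrt(se_base ** 2 + se_tuned ** 2)


def sigma(delta_pp: float, se_pp: float) -> float:
    """Size of a delta in units of its standard error."""
    return delta_pp / se_pp


if __name__ == "__main__":
    # MMLU: +0.3pp against the reported combined SE of 0.5pp.
    print(round(sigma(0.3, 0.5), 1))
    # HellaSwag: +2.9pp at 6.2 sigma implies this combined SE (in pp).
    print(round(2.9 / 6.2, 2))
    # Hypothetical per-model SEs of 0.3pp and 0.4pp combine to about 0.5pp.
    print(round(combined_se(0.3, 0.4), 2))
```

A delta below roughly 2σ is hard to distinguish from sampling noise, which is why the MMLU aggregate is treated as flat while the HellaSwag gain is not.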

Qualitative Evaluation

Atem-4B was evaluated against base Qwen3-4B across 30 domain-representative questions spanning coding, mathematics, analytical reasoning, general knowledge, and language. Key findings:

| Domain | Questions | Outcome |
| --- | --- | --- |
| Coding | 8 | Atem: more precise, correctly uses `time.perf_counter()`, cleaner implementations |
| Mathematics | 6 | Atem: correct on train speed (72 km/h), circular seating (12), Monty Hall (full simulation proof) |
| Analytical Reasoning | 6 | Atem: notably stronger on multi-explanation paradoxes and policy consequence chains |
| General Knowledge | 5 | Atem: Rayleigh scattering with λ⁻⁴ scaling, Hamiltonian dynamics on perpetual motion |
| Language & Logic | 5 | Comparable |

The CoT-preserving design is visible in the outputs: Atem-4B produces selective think traces — engaging reasoning for complex problems, suppressing it for straightforward ones — rather than the uniform marathon traces (2000+ words) that base Qwen3-4B produces by default. This reflects the 85/15 think/no-think training split working as intended.

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-4B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Explain the Monty Hall problem and why switching doors gives 2/3 probability."
    }
]

# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)

Unsloth (faster inference)

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-4B",
    max_seq_length=6144,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Given a sorted array with one duplicate, find it in O(log n) time."
    }
]

# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
))

Ollama

# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-4B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-4B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-4B:Q8_0

llama.cpp

llama-server -hf EphAsad/Atem-4B:Q4_K_M

Sampling Parameters

Qwen3's own published guidance for thinking mode recommends temperature=0.6, top_p=0.95, top_k=20. These are the parameters used in the qualitative benchmark above and are recommended for general use. Do not use greedy decoding (temperature=0) with thinking mode — Qwen3's documentation explicitly warns this can cause repetition and degeneration.
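When the model is served behind an OpenAI-compatible endpoint, the same settings go into the request body. The sketch below only builds that body. The helper name is hypothetical, and `top_k` is not part of the standard OpenAI schema; it is assumed here that the server accepts it as an extension, as llama-server is understood to do.

```python
import json

# Thinking-mode sampling settings recommended above.
THINKING_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}


def chat_request(prompt: str, max_tokens: int = 2000) -> dict:
    """Build a chat-completions request body with the recommended sampling."""
    return {
        "model": "EphAsad/Atem-4B:Q4_K_M",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **THINKING_SAMPLING,  # top_k is a server-side extension (assumption)
    }


if __name__ == "__main__":
    body = json.dumps(chat_request("Explain the Monty Hall problem."), indent=2)
    print(body)
```

The resulting JSON can be sent with any HTTP client to the server's `/v1/chat/completions` route, for example on port 8080 if llama-server is running with its default settings.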

System Prompt

Atem-4B's identity is baked into the chat template and activates automatically when no system message is provided. To set it manually, pass the following as the system message:

You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.

Available Files

| File | Size | Description |
| --- | --- | --- |
| `model.safetensors` | ~8 GB | Full bfloat16 merged weights |
| `Atem-4b.Q4_K_M.gguf` | ~2.5 GB | 4-bit quantised (recommended) |
| `Atem-4b.Q5_K_M.gguf` | ~2.89 GB | 5-bit quantised |
| `Atem-4b.Q8_0.gguf` | ~4.5 GB | 8-bit quantised (near-lossless) |

Known Limitations

GSM8K formatting sensitivity. As noted in the evaluation section, the GSM8K regression is likely a formatting artifact rather than a capability regression. For production math applications, verify that your prompting setup elicits answers in the format your extraction logic expects.

6,144 token sequence length ceiling. The training corpus's longest reasoning traces (competitive programming, advanced mathematics) exceed 6,144 tokens and were dropped during formatting. The model has not been exposed to very long chain-of-thought traces and may perform below its potential on problems that genuinely require extended reasoning chains. Raise max_seq_length at inference time to allow longer generation budgets.

CoT activation is selective. Atem-4B does not think on every question — the 85/15 training split means it has learned to engage reasoning for complex problems and skip it for simpler ones. If you require thinking traces on a specific query type that the model is treating as simple, an explicit instruction in the system prompt can encourage it.
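Because thinking is selective, downstream code should not assume a trace is always present. The sketch below separates the trace from the final answer in a decoded response. It assumes the decoded text keeps the literal `<think>` and `</think>` tags; the function name and sample strings are illustrative.

```python
def split_think(text: str):
    """Split a decoded response into (trace, answer).

    The split happens at the LAST closing tag, so a literal "</think>"
    quoted inside the trace does not cut the trace short. Responses
    without a closing tag are returned as answer-only.
    """
    head, sep, tail = text.rpartition("</think>")
    if not sep:
        return "", text.strip()
    trace = head.replace("<think>", "", 1).strip()
    return trace, tail.strip()


if __name__ == "__main__":
    with_trace = "<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4."
    without_trace = "Paris."
    for response in (with_trace, without_trace):
        trace, answer = split_think(response)
        print(f"thought={bool(trace)} answer={answer!r}")
```

Checking whether the trace is empty gives a simple way to log how often the model chose to think on a given query type before deciding whether a system-prompt nudge is needed.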

Single-pass SFT, no RLHF or DPO. Atem-4B has not undergone preference optimisation. Responses are accurate and structured but may not be as reliably aligned with user preferences in open-ended creative or instructional tasks compared to models that have undergone preference training.

Roadmap

Atem-14B: Single CoT-preserving pass on Qwen3-14B with the same methodology, full 16-bit LoRA, r=64

Citation

@misc{atem_4b_2026,
  author       = {Asad, Zain},
  title        = {Atem-4B: A 4B CoT-Preserving Reasoning Model via
                  Single-Pass SFT on Qwen3},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-4B}},
}

License

Released under the Apache 2.0 License, consistent with the base model Qwen/Qwen3-4B.

Built independently by Zain Asad — EphAsad
