Terminator-Qwen3-14B

Terminator is a lightweight neural module that predicts when a reasoning language model has reached its final answer during chain-of-thought (CoT) generation. When the Terminator detects the model has committed to an answer, it truncates the remaining reasoning and forces the model to begin its response, thereby delivering the same answer with significantly less computation.

This repository contains everything needed to run Terminator-Qwen3-14B:

  • Trained Terminator checkpoint (1 extra transformer layer + prediction head)
  • vLLM plugin code (vllm_terminator/) for high-performance serving
  • Server launcher and streaming client
  • Standalone HuggingFace inference script (no server required)
  • Automated setup script

Note: Terminator currently supports single-GPU, single-sequence inference only.


Quick Start

# 1. Clone the repository (requires Git LFS: https://git-lfs.com)
git lfs install
git clone https://huggingface.co/acnagle/Terminator-Qwen3-14B
cd Terminator-Qwen3-14B

# 2. Run automated setup (creates conda env, installs vllm, downloads base model)
./setup.sh

# 3. Start the server
./start_server.sh

# 4. In another terminal, chat with the model
python client.py --interactive

Requirements

  • GPU: Single NVIDIA GPU with at least ~40GB VRAM (e.g., A100 40GB)
  • CUDA: A compatible CUDA driver; version 12.9 or newer is recommended
  • Python: 3.12
  • OS: Linux (recommended) or any OS supported by vLLM

Installation

Option A: Automated Setup

The setup.sh script handles everything:

./setup.sh

This will:

  1. Create a conda environment called terminator with Python 3.12
  2. Install uv, vLLM, and openai
  3. Download Qwen3-14B base model weights (~28GB) from HuggingFace
  4. Create the model directory (model_dir/)

Option B: Manual Setup

1. Create a Python environment

Using conda or micromamba:

conda create -n terminator python=3.12 -y
conda activate terminator

2. Install uv

pip install --upgrade uv

Or see the uv installation guide.

3. Install vLLM

uv pip install vllm --torch-backend=auto

See the vLLM installation guide for alternative installation methods (ROCm, CPU, etc.).

4. Install openai (for the client)

uv pip install openai

5. Set up the model directory

This downloads the base Qwen3-14B weights and creates a vLLM-ready model directory:

python setup_model_dir.py

The script accepts optional arguments:

Argument        Default             Description
--checkpoint    ./terminator.pt     Path to the Terminator checkpoint
--output-dir    ./model_dir         Output model directory
--threshold     0.7                 Prediction threshold for Terminator activation
--window-size   10                  Sliding window size for majority vote
--exit-message  (built-in message)  Message injected when Terminator fires
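
The command-line interface above can be sketched with argparse. This is an illustrative reconstruction from the options table, not the actual source of setup_model_dir.py:

```python
import argparse

def build_parser():
    # Flags and defaults mirror the setup_model_dir.py options table above.
    p = argparse.ArgumentParser(description="Set up a vLLM-ready model directory")
    p.add_argument("--checkpoint", default="./terminator.pt",
                   help="Path to the Terminator checkpoint")
    p.add_argument("--output-dir", default="./model_dir",
                   help="Output model directory")
    p.add_argument("--threshold", type=float, default=0.7,
                   help="Prediction threshold for Terminator activation")
    p.add_argument("--window-size", type=int, default=10,
                   help="Sliding window size for majority vote")
    p.add_argument("--exit-message", default=None,
                   help="Message injected when Terminator fires (None = built-in)")
    return p

# Example: override only the threshold; everything else keeps its default.
args = build_parser().parse_args(["--threshold", "0.8"])
```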

Starting the Server

./start_server.sh

Or with custom configuration:

VLLM_GPU_UTIL=0.70 VLLM_MAX_MODEL_LEN=8192 ./start_server.sh

The server exposes an OpenAI-compatible API on the configured port (default: 8000).

Configuration

Set these environment variables before running start_server.sh or serve.py:

Variable            Default               Description
VLLM_GPU_UTIL       0.90                  Fraction of GPU memory to use for the model
VLLM_MAX_MODEL_LEN  (auto)                Maximum context length in tokens
VLLM_PORT           8000                  Server port
VLLM_ENFORCE_EAGER  0                     Set to 1 to disable CUDA graphs
VLLM_API_KEY        (none)                Require this API key from clients
VLLM_SERVED_NAME    Terminator-Qwen3-14B  Model name reported by the API
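
The environment-variable handling can be pictured as follows. This is an illustrative sketch mirroring the table above, not the actual parsing code in start_server.sh or serve.py:

```python
import os

def server_config(env=None):
    # Read server settings from environment variables, falling back to the
    # defaults listed in the configuration table.
    if env is None:
        env = os.environ
    return {
        "gpu_util": float(env.get("VLLM_GPU_UTIL", "0.90")),
        "max_model_len": env.get("VLLM_MAX_MODEL_LEN"),   # None means "auto"
        "port": int(env.get("VLLM_PORT", "8000")),
        "enforce_eager": env.get("VLLM_ENFORCE_EAGER", "0") == "1",
        "api_key": env.get("VLLM_API_KEY"),               # None means no auth
        "served_name": env.get("VLLM_SERVED_NAME", "Terminator-Qwen3-14B"),
    }

# Example: override the port and disable CUDA graphs, keep other defaults.
cfg = server_config({"VLLM_PORT": "8080", "VLLM_ENFORCE_EAGER": "1"})
```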

Standalone Inference (No Server)

Recommendation: For the best performance, use the vLLM server described above. vLLM uses KV caching, CUDA graphs, and optimized kernels, making it significantly faster than HuggingFace-native inference.

For quick testing and demos where spinning up a server is inconvenient, use the HuggingFace-native inference script:

python inference_hf.py --prompt "What is the sum of the first 100 natural numbers?"

This loads the model directly via HuggingFace transformers and runs token-by-token generation with the Terminator head. Thinking content is streamed in dimmed text; the final answer is shown in bold.

Argument        Default             Description
--prompt        (required)          Input prompt
--model         Qwen/Qwen3-14B      HuggingFace model name or path
--checkpoint    ./terminator.pt     Path to the Terminator checkpoint
--threshold     0.7                 Prediction threshold
--window-size   10                  Sliding window size for majority vote
--exit-message  (built-in message)  Message injected when Terminator fires (empty string to disable)
--max-tokens    32768               Maximum tokens to generate
--temperature   0.6                 Sampling temperature

Using the Client (vLLM Server)

Single Prompt

python client.py --prompt "What is the sum of the first 100 natural numbers?"

Interactive Mode

python client.py --interactive

This starts a multi-turn conversation with the model. Thinking content is displayed in dimmed text; the final answer is shown in bold.

Client Options

Argument       Default                   Description
--base-url     http://localhost:8000/v1  Server URL
--max-tokens   (server default)          Maximum tokens to generate
--temperature  0.6                       Sampling temperature

Using the API Directly

The server is OpenAI-compatible. You can use any OpenAI client library. Replace localhost with your server's address if connecting remotely:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Terminator-Qwen3-14B",
    messages=[{"role": "user", "content": "What is 25 * 37?"}],
    temperature=0.6,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

# Thinking content (chain-of-thought)
print(response.choices[0].message.reasoning)

# Final answer
print(response.choices[0].message.content)

How Terminator Works

Terminator is a single transformer layer followed by a prediction head, trained on top of a frozen Qwen3-14B base model. The transformer layer (initialized as a copy of the base model's final layer, then fine-tuned) takes the hidden states from the LLM and processes them before the prediction head, which outputs a per-token binary prediction: has the model reached its final answer?
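
As a rough illustration of the per-token prediction idea, a linear probe over hidden states followed by a sigmoid yields one "answer reached?" probability per token. This is a simplified sketch with random placeholder weights; the real head sits behind a fine-tuned transformer layer in vllm_terminator/:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def terminator_probs(hidden_states, w, b=0.0):
    # hidden_states: (seq_len, d_model) hidden states from the frozen base model.
    # Returns one probability per token: has the model reached its final answer?
    return sigmoid(hidden_states @ w + b)

rng = np.random.default_rng(0)
d_model = 8
hidden = rng.normal(size=(4, d_model))   # 4 generated tokens (placeholder values)
w = rng.normal(size=d_model)             # placeholder probe weights
probs = terminator_probs(hidden, w)
```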

During generation, Terminator maintains a sliding window of the most recent predictions. When a majority of predictions in the window exceed the threshold (default: 0.7), the model is considered to have reached its final answer. At that point:

  1. A short exit message is injected into the reasoning (e.g., "I've run out of thinking tokens. I need to commit to a final answer.") to help the model transition smoothly.
  2. The </think> token is forced, ending the reasoning phase.
  3. The model generates its final answer normally.

This allows the model to skip potentially thousands of redundant reasoning tokens while preserving answer quality.
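
The window-and-threshold decision rule described above can be sketched in a few lines. This is a minimal illustration of the majority vote; the actual logic lives in the vllm_terminator/ plugin:

```python
from collections import deque

def make_terminator_monitor(threshold=0.7, window_size=10):
    # Sliding window over the most recent per-token Terminator probabilities.
    window = deque(maxlen=window_size)

    def should_terminate(prob):
        window.append(prob)
        votes = sum(p > threshold for p in window)
        # Fire only once the window is full and a majority of the most
        # recent predictions exceed the threshold.
        return len(window) == window.maxlen and votes > window.maxlen // 2

    return should_terminate

# Example: with a window of 5, the monitor fires at the first step where
# the window is full and most recent probabilities exceed 0.7.
check = make_terminator_monitor(threshold=0.7, window_size=5)
probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.85, 0.9]
fired_at = next((i for i, p in enumerate(probs) if check(p)), None)
```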


File Structure

Terminator-Qwen3-14B/
β”œβ”€β”€ README.md                This file
β”œβ”€β”€ terminator.pt            Trained Terminator checkpoint
β”œβ”€β”€ vllm_terminator/         vLLM plugin package
β”‚   β”œβ”€β”€ __init__.py          Registers the model architecture with vLLM
β”‚   β”œβ”€β”€ model.py             Qwen3TerminatorForCausalLM model class
β”‚   └── terminator_head.py   FFN classifier and checkpoint loading
β”œβ”€β”€ inference_hf.py          Standalone HuggingFace inference (no server)
β”œβ”€β”€ serve.py                 vLLM server launcher
β”œβ”€β”€ setup_model_dir.py       Model directory setup (downloads base weights)
β”œβ”€β”€ client.py                Streaming chat client (connects to vLLM server)
β”œβ”€β”€ setup.sh                 Automated setup script
└── start_server.sh          Server launcher with sensible defaults

Citation

Coming soon.


License

This project builds on Qwen3-14B by the Qwen team. Please refer to the Qwen3 license for base model usage terms.
