Serving
Transformer models can be efficiently deployed using libraries such as vLLM, Text Generation Inference (TGI), and others. These libraries are designed for production-grade user-facing services, and can scale to multiple servers and millions of concurrent users.
You can also serve transformer models easily using the transformers serve CLI. This is ideal for experimentation purposes, or to run models locally for personal and private use.
TGI
TGI can serve models that aren’t natively implemented by falling back on the Transformers implementation of the model. Some of TGI’s high-performance features aren’t available in the Transformers implementation, but other features like continuous batching and streaming are still supported.
Refer to the Non-core model serving guide for more details.
Serve a Transformers implementation the same way you’d serve a TGI model.
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
Add --trust-remote-code to the command to serve a custom Transformers model.
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id <CUSTOM_MODEL_ID> --trust-remote-code
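Once the container is running, you can send it a quick test request from Python. The snippet below is a minimal sketch against TGI’s /generate route using the requests library; the prompt and generation parameters are only illustrative, and the port matches the -p 8080:80 mapping above.

import requests

# The docker command above maps container port 80 to host port 8080.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {"max_new_tokens": 64},
    },
)
response.raise_for_status()
print(response.json()["generated_text"])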
vLLM
vLLM can also serve a Transformers implementation of a model if it isn’t natively implemented in vLLM.
Many features like quantization, LoRA adapters, and distributed inference and serving are supported for the Transformers implementation.
Refer to the Transformers fallback section for more details.
By default, vLLM serves the native implementation, falling back on the Transformers implementation if one doesn’t exist. You can also set --model-impl transformers to explicitly use the Transformers model implementation.
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers
Add the --trust-remote-code flag to enable loading a remote code model.
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers \
    --trust-remote-code
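Once the server is up, requests look the same regardless of which implementation vLLM picked. The snippet below is a minimal sketch that assumes the default OpenAI-compatible endpoint on localhost:8000 and uses the openai Python client; the API key value is a placeholder, since the server only checks it if you configure one.

from openai import OpenAI

# vLLM exposes an OpenAI-compatible server, on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Briefly explain what continuous batching is."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)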
Serve CLI
This section is experimental and subject to change in future versions.
You can serve LLMs supported by transformers with the transformers serve CLI. It spawns a local server that offers a chat Completions API compatible with the OpenAI SDK, which is the de facto standard for LLM conversations. This way, you can use the server from many third-party applications, or test it using the transformers chat CLI (docs).
To launch a server, simply use the transformers serve CLI command:
transformers serve
The simplest way to interact with the server is through our transformers chat CLI:
transformers chat localhost:8000 --model-name-or-path Qwen/Qwen3-4B
or by sending an HTTP request with cURL, e.g.
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'
from which you’ll receive multiple chunks in the Completions API format:
data: {"object": "chat.completion.chunk", "id": "req_0", "created": 1751377863, "model": "Qwen/Qwen2.5-0.5B-Instruct", "system_fingerprint": "", "choices": [{"delta": {"role": "assistant", "content": "", "tool_call_id": null, "tool_calls": null}, "index": 0, "finish_reason": null, "logprobs": null}]}
data: {"object": "chat.completion.chunk", "id": "req_0", "created": 1751377863, "model": "Qwen/Qwen2.5-0.5B-Instruct", "system_fingerprint": "", "choices": [{"delta": {"role": "assistant", "content": "", "tool_call_id": null, "tool_calls": null}, "index": 0, "finish_reason": null, "logprobs": null}]}
(...)
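Because the server implements the Completions API, you can also talk to it with the official openai Python SDK instead of cURL. The snippet below is a small sketch mirroring the cURL request above; the base URL assumes the default localhost:8000, and the API key is an arbitrary placeholder that the server does not validate.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "system", "content": "hello"}],
    temperature=0.9,
    max_tokens=1000,
    stream=True,
)
for chunk in stream:
    # Each chunk corresponds to one chat.completion.chunk payload shown above.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)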
The server is also an MCP client, so it can interact with MCP tools in agentic use cases. This, of course, requires the use of an LLM that is designed to use tools.
At the moment, MCP tool usage in transformers is limited to the qwen family of models.
Usage example 1: apps with local requests (feat. Jan)
This example shows how to use transformers serve as a local LLM provider for the Jan app. Jan is a ChatGPT-alternative graphical interface, fully running on your machine. The requests to transformers serve come directly from the local app. While this section focuses on Jan, you can extrapolate some of the instructions to other apps that make local requests.
To connect transformers serve with Jan, you’ll need to set up a new model provider (“Settings” > “Model Providers”). Click on “Add Provider”, and set a new name. In your new model provider page, all you need to set is the “Base URL” to the following pattern:
http://[host]:[port]/v1
where host and port are the transformers serve CLI parameters (localhost:8000 by default). After setting this up, you should be able to see some models in the “Models” section after hitting “Refresh”. Make sure you add some text in the “API key” text field too; this data is not actually used, but the field can’t be empty. Your custom model provider page should look like this:
You are now ready to chat!
You can add any transformers-compatible model to Jan through transformers serve. In the custom model provider you created, click on the ”+” button in the “Models” section and add its Hub repository name, e.g. Qwen/Qwen3-4B.
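If no models appear after hitting “Refresh”, you can check what the server advertises directly. The snippet below is a sketch that assumes transformers serve exposes the standard /v1/models route of the Completions API (presumably what Jan’s “Refresh” button queries) and that the server runs on the default localhost:8000.

from openai import OpenAI

# Any non-empty API key works; the server does not use it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for model in client.models.list():
    print(model.id)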
To conclude this example, let’s look into a more advanced use case. If you have a beefy machine to serve models with, but prefer using Jan on a different device, you need to add port forwarding. If you have ssh access from your Jan machine into your server, this can be accomplished by typing the following in your Jan machine’s terminal:
ssh -N -f -L 8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server
Port forwarding is not Jan-specific: you can use it to connect transformers serve running on a different machine with an app of your choice.
Usage example 2: apps with external requests (feat. Cursor)
This example shows how to use transformers serve as a local LLM provider for Cursor, the popular IDE. Unlike in the previous example, requests to transformers serve will come from an external IP (Cursor’s server IPs), which requires some additional setup. Furthermore, some of Cursor’s requests require CORS, which is disabled by default for security reasons.
To launch our server with CORS enabled, run
transformers serve --enable-cors
We’ll also need to expose our server to external IPs. A potential solution is to use ngrok, which has a permissive free tier. After setting up your ngrok account and authenticating on your server machine, you run
ngrok http [port]
where port is the port used by transformers serve (8000 by default). On the terminal where you launched ngrok, you’ll see an https address in the “Forwarding” row, as in the image below. This is the address to send requests to.
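Before configuring Cursor, you can optionally confirm that the tunnel works from any machine by sending a request to that forwarding address. This is a sketch: the ngrok subdomain below is a placeholder, and the model name assumes you want to serve Qwen/Qwen3-4B.

from openai import OpenAI

# Replace the base URL with the https "Forwarding" address printed by ngrok.
client = OpenAI(base_url="https://YOUR-SUBDOMAIN.ngrok-free.app/v1", api_key="unused")

completion = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=16,
)
print(completion.choices[0].message.content)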
We’re now ready to set things up on the app side! In Cursor, while we can’t set a new provider, we can change the endpoint for OpenAI requests in the model selection settings. First, navigate to “Settings” > “Cursor Settings”, “Models” tab, and expand the “API Keys” collapsible. To set our transformers serve endpoint, follow this order:
- Unselect ALL models in the list above (e.g. gpt4, …);
- Add and select the model you want to use (e.g. Qwen/Qwen3-4B);
- Add some random text to OpenAI API Key. This field won’t be used, but it can’t be empty;
- Add the https address from ngrok to the “Override OpenAI Base URL” field, appending /v1 to the address (i.e. https://(...).ngrok-free.app/v1);
- Hit “Verify”.
After you follow these steps, your “Models” tab should look like the image below. Your server should also have received a few requests from the verification step.
You are now ready to use your local model in Cursor! For instance, if you toggle the AI Pane, you can select the model you added and ask it questions about your local files.
Usage example 3: tiny-agents CLI and MCP Tools
To showcase the use of MCP tools, let’s see how to integrate the transformers serve server with the tiny-agents CLI.
Many Hugging Face Spaces can be used as MCP servers, as in this example. You can find all compatible Spaces here.
The first step to use MCP tools is to let the model know which tools are available. As an example, let’s consider a tiny-agents configuration file with a reference to an image generation MCP server.
{
  "model": "Menlo/Jan-nano",
  "endpointUrl": "http://localhost:8000",
  "servers": [
    {
      "type": "sse",
      "url": "https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse"
    }
  ]
}
You can then launch your tiny-agents chat interface with the following command.
tiny-agents run path/to/your/config.json
If you have transformers serve running in the background, you’re ready to use MCP tools from a local model! For instance, here’s an example of a chat session with tiny-agents:
Agent loaded with 1 tools:
• flux1_schnell_infer
» Generate an image of a cat on the moon
<Tool req_0_tool_call>flux1_schnell_infer {"prompt": "a cat on the moon", "seed": 42, "randomize_seed": true, "width": 1024, "height": 1024, "num_inference_steps": 4}
Tool req_0_tool_call
[Binary Content: Image image/webp, 57732 bytes]
The task is complete and the content accessible to the User
Image URL: https://evalstate-flux1-schnell.hf.space/gradio_api/file=/tmp/gradio/3dbddc0e53b5a865ed56a4e3dbdd30f3f61cf3b8aabf1b456f43e5241bd968b8/image.webp
380576952
I have generated an image of a cat on the moon using the Flux 1 Schnell Image Generator. The image is 1024x1024 pixels and was created with 4 inference steps. Let me know if you would like to make any changes or need further assistance!