Serving

Transformer models can be efficiently deployed using libraries such as vLLM, Text Generation Inference (TGI), and others. These libraries are designed for production-grade user-facing services, and can scale to multiple servers and millions of concurrent users.

You can also serve transformer models easily using the transformers serve CLI. This is ideal for experimentation purposes, or to run models locally for personal and private use.

TGI

TGI can serve models that aren’t natively implemented by falling back on the Transformers implementation of the model. Some of TGI’s high-performance features aren’t available in the Transformers implementation, but other features like continuous batching and streaming are still supported.

Refer to the Non-core model serving guide for more details.

Serve a Transformers implementation the same way you’d serve a TGI model.

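volume=$PWD/data  # directory shared with the container so weights aren't re-downloaded on every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2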

Add --trust-remote-code to the command to serve a custom Transformers model.

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id <CUSTOM_MODEL_ID> --trust-remote-code
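Once the container is up, you can query it over HTTP. The sketch below is a minimal illustration rather than an official client: it assumes the container from the commands above is reachable on port 8080 of the local machine (from -p 8080:80) and targets TGI's /generate endpoint. The helper names build_generate_request and generate are made up for this example.

```python
import json
import urllib.request


def build_generate_request(prompt: str,
                           base_url: str = "http://localhost:8080",
                           max_new_tokens: int = 20) -> urllib.request.Request:
    """Build (but do not send) a POST request for TGI's /generate endpoint."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as response:
        return json.loads(response.read())["generated_text"]
```

With the server running, generate("What is deep learning?") should return the completion as a string.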

vLLM

vLLM can also serve a Transformers implementation of a model if it isn’t natively implemented in vLLM.

Many features like quantization, LoRA adapters, and distributed inference and serving are supported for the Transformers implementation.

Refer to the Transformers fallback section for more details.

By default, vLLM serves the native implementation and falls back on the Transformers implementation if a native one doesn’t exist. You can also set --model-impl transformers to explicitly use the Transformers model implementation.

vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers

Add the --trust-remote-code flag to load a model that relies on remote code.

vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers \
    --trust-remote-code

Serve CLI

This section is experimental and subject to change in future versions.

You can serve LLMs supported by transformers with the transformers serve CLI. It spawns a local server that offers a Chat Completions API compatible with the OpenAI SDK, which is the de facto standard for LLM conversations. This way, you can use the server from many third-party applications, or test it using the transformers chat CLI.

To launch a server, simply use the transformers serve CLI command:

transformers serve

The simplest way to interact with the server is through the transformers chat CLI:

transformers chat localhost:8000 --model-name-or-path Qwen/Qwen3-4B

or by sending an HTTP request with cURL, e.g.

curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'

The server streams the response back as multiple chunks in the Chat Completions API format:

data: {"object": "chat.completion.chunk", "id": "req_0", "created": 1751377863, "model": "Qwen/Qwen2.5-0.5B-Instruct", "system_fingerprint": "", "choices": [{"delta": {"role": "assistant", "content": "", "tool_call_id": null, "tool_calls": null}, "index": 0, "finish_reason": null, "logprobs": null}]}

data: {"object": "chat.completion.chunk", "id": "req_0", "created": 1751377863, "model": "Qwen/Qwen2.5-0.5B-Instruct", "system_fingerprint": "", "choices": [{"delta": {"role": "assistant", "content": "", "tool_call_id": null, "tool_calls": null}, "index": 0, "finish_reason": null, "logprobs": null}]}

(...)
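To turn a stream like this back into plain text, you can read the data: lines and concatenate the content of each delta. The following is a minimal sketch, not part of transformers: collect_stream_text is a hypothetical helper, and the handling of a closing data: [DONE] line is an assumption based on the OpenAI streaming convention (it is harmless if the server never sends one).

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate the delta contents found in a stream of `data:` lines."""
    pieces = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and anything that isn't a data line
        body = line[len("data:"):].strip()
        if body == "[DONE]":  # assumed end-of-stream marker (OpenAI convention)
            break
        chunk = json.loads(body)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:  # the first chunks may carry an empty string
                pieces.append(content)
    return "".join(pieces)
```

You can feed it any iterable of decoded text lines, for instance the lines of an HTTP response body read from the /v1/chat/completions endpoint.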

The server is also an MCP client, so it can interact with MCP tools in agentic use cases. This, of course, requires the use of an LLM that is designed to use tools.

At the moment, MCP tool usage in transformers is limited to the Qwen family of models.

Usage example 1: apps with local requests (feat. Jan)

This example shows how to use transformers serve as a local LLM provider for the Jan app. Jan is a ChatGPT-alternative graphical interface that runs fully on your machine. The requests to transformers serve come directly from the local app. While this section focuses on Jan, you can extrapolate some instructions to other apps that make local requests.

To connect transformers serve with Jan, you’ll need to set up a new model provider (“Settings” > “Model Providers”). Click on “Add Provider”, and set a new name. In your new model provider page, all you need to set is the “Base URL” to the following pattern:

http://[host]:[port]/v1

where host and port are the transformers serve CLI parameters (localhost:8000 by default). Make sure you also add some text in the “API key” text field: this value is not actually used, but the field can’t be empty. After setting this up and hitting “Refresh”, you should be able to see some models in the “Models” section. Your custom model provider page should look like this:

You are now ready to chat!

You can add any transformers-compatible model to Jan through transformers serve. In the custom model provider you created, click on the “+” button in the “Models” section and add its Hub repository name, e.g. Qwen/Qwen3-4B.
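If the “Models” section stays empty, it can help to query the server directly. The sketch below assumes that transformers serve exposes an OpenAI-style /v1/models route returning a list under the "data" key. That is an assumption inferred from how OpenAI-compatible clients usually discover models, so check it against your installed version. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import List


def base_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the Base URL in the http://[host]:[port]/v1 pattern."""
    return f"http://{host}:{port}/v1"


def model_ids(payload: dict) -> List[str]:
    """Extract model names from an OpenAI-style model list (assumed format)."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(host: str = "localhost", port: int = 8000) -> List[str]:
    """Fetch the model list from a running server (assumed /v1/models route)."""
    with urllib.request.urlopen(f"{base_url(host, port)}/models") as response:
        return model_ids(json.loads(response.read()))
```

Calling list_models() with the server running should return names similar to the ones Jan displays.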

To conclude this example, let’s look into a more advanced use case. If you have a beefy machine to serve models with, but prefer using Jan on a different device, you need to add port forwarding. If you have ssh access from your Jan machine into your server, this can be accomplished by running the following in your Jan machine’s terminal:

ssh -N -f -L 8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server

Port forwarding is not Jan-specific: you can use it to connect transformers serve running on a different machine with an app of your choice.
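Before pointing an app at the forwarded port, you may want to confirm that something is actually listening on it. Here is a small sketch using only the Python standard library. is_port_open is a hypothetical helper, and the defaults mirror the localhost:8000 address used above.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A True result only tells you that the tunnel endpoint accepts connections. It doesn't verify that transformers serve is the process answering on the other side.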

Usage example 2: apps with external requests (feat. Cursor)

This example shows how to use transformers serve as a local LLM provider for Cursor, the popular IDE. Unlike in the previous example, requests to transformers serve will come from an external IP (Cursor’s server IPs), which requires some additional setup. Furthermore, some of Cursor’s requests require CORS, which is disabled by default for security reasons.

To launch our server with CORS enabled, run

transformers serve --enable-cors

We’ll also need to expose our server to external IPs. A potential solution is to use ngrok, which has a permissive free tier. After setting up your ngrok account and authenticating on your server machine, run

ngrok http [port]

where port is the port used by transformers serve (8000 by default). On the terminal where you launched ngrok, you’ll see an https address in the “Forwarding” row, as in the image below. This is the address to send requests to.

We’re now ready to set things up on the app side! In Cursor, while we can’t set a new provider, we can change the endpoint for OpenAI requests in the model selection settings. First, navigate to “Settings” > “Cursor Settings”, “Models” tab, and expand the “API Keys” collapsible. To set our transformers serve endpoint, follow this order:

1. Unselect ALL models in the list above (e.g. gpt4, …);
2. Add and select the model you want to use (e.g. Qwen/Qwen3-4B);
3. Add some random text to the “OpenAI API Key” field. This field won’t be used, but it can’t be empty;
4. Add the https address from ngrok to the “Override OpenAI Base URL” field, appending /v1 to the address (i.e. https://(...).ngrok-free.app/v1);
5. Hit “Verify”.
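The /v1 suffix is easy to forget or to add twice. As a small illustration, the hypothetical helper below normalizes the forwarding address printed by ngrok into the value expected by the “Override OpenAI Base URL” field. The example subdomain used with it is made up.

```python
def override_base_url(forwarding_address: str) -> str:
    """Turn an ngrok https forwarding address into an OpenAI-style base URL."""
    address = forwarding_address.strip().rstrip("/")
    if not address.startswith("https://"):
        raise ValueError("expected the https forwarding address shown by ngrok")
    if address.endswith("/v1"):
        return address  # suffix already present, nothing to add
    return f"{address}/v1"
```

For a made-up address such as https://example.ngrok-free.app/, the helper returns https://example.ngrok-free.app/v1.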

After you follow these steps, your “Models” tab should look like the image below. Your server should also have received a few requests from the verification step.

You are now ready to use your local model in Cursor! For instance, if you toggle the AI Pane, you can select the model you added and ask it questions about your local files.

Usage example 3: tiny-agents CLI and MCP Tools

To showcase the use of MCP tools, let’s see how to integrate the transformers serve server with the tiny-agents CLI.

Many Hugging Face Spaces can be used as MCP servers, as in this example. You can find all compatible Spaces here.

The first step to use MCP tools is to let the model know which tools are available. As an example, let’s consider a tiny-agents configuration file with a reference to an image generation MCP server.

{
    "model": "Menlo/Jan-nano",
    "endpointUrl": "http://localhost:8000",
    "servers": [
        {
            "type": "sse",
            "url": "https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse"
        }
    ]
}

You can then launch your tiny-agents chat interface with the following command.

tiny-agents run path/to/your/config.json
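If you switch between models or MCP servers often, you can generate the configuration file instead of editing it by hand. The sketch below writes the same JSON structure shown above. write_agent_config is a hypothetical helper, and its defaults simply repeat the values from that example.

```python
import json
from pathlib import Path
from typing import Sequence

# Default MCP server taken from the configuration example above.
DEFAULT_SSE_URL = "https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse"


def write_agent_config(path: str,
                       model: str = "Menlo/Jan-nano",
                       endpoint_url: str = "http://localhost:8000",
                       sse_urls: Sequence[str] = (DEFAULT_SSE_URL,)) -> dict:
    """Write a tiny-agents configuration file and return its content."""
    config = {
        "model": model,
        "endpointUrl": endpoint_url,
        "servers": [{"type": "sse", "url": url} for url in sse_urls],
    }
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

The resulting file can be passed to tiny-agents run in the same way as a hand-written one.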

If you have transformers serve running in the background, you’re ready to use MCP tools from a local model! For instance, here’s an example of a chat session with tiny-agents:

Agent loaded with 1 tools:
 • flux1_schnell_infer
»  Generate an image of a cat on the moon
<Tool req_0_tool_call>flux1_schnell_infer {"prompt": "a cat on the moon", "seed": 42, "randomize_seed": true, "width": 1024, "height": 1024, "num_inference_steps": 4}

Tool req_0_tool_call
[Binary Content: Image image/webp, 57732 bytes]
The task is complete and the content accessible to the User
Image URL: https://evalstate-flux1-schnell.hf.space/gradio_api/file=/tmp/gradio/3dbddc0e53b5a865ed56a4e3dbdd30f3f61cf3b8aabf1b456f43e5241bd968b8/image.webp
380576952

I have generated an image of a cat on the moon using the Flux 1 Schnell Image Generator. The image is 1024x1024 pixels and was created with 4 inference steps. Let me know if you would like to make any changes or need further assistance!
