Deploying OLMo-7B with Text Generation Inference (TGI) on Hugging Face Spaces
Published February 2, 2025
Hugging Face's Text Generation Inference (TGI) is the go-to solution for production-ready LLM deployments. Unlike our previous FastAPI-based prototype, this guide shows how to deploy OLMo-7B-Instruct on Hugging Face Spaces using TGI, making the deployment scalable, optimized, and efficient.
What's Different?
- TGI-backed API: optimized for inference, not just a proof of concept.
- Hugging Face Transformers-compatible: works with any TGI-supported LLM.
- Auto-optimizations: continuous batching, token streaming, and quantization out of the box.
1️⃣ Setting Up the Space
Go to Hugging Face Spaces and create a new Space.
- Choose Docker as the SDK.
- Set `app_port: 8080` in `README.md`.
Important: Unlike other Spaces, TGI requires `app_port: 8080` (or your port of choice) in the README metadata for proper routing; see the example front matter below.
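For reference, the metadata block at the top of the Space's `README.md` might look like the sketch below. Only `sdk: docker` and `app_port: 8080` are the settings this step actually requires; the title, emoji, and color values are illustrative placeholders.

```yaml
---
title: OLMo TGI Demo   # illustrative
emoji: 🚀              # illustrative
colorFrom: blue        # illustrative
colorTo: purple        # illustrative
sdk: docker            # tells Spaces to build the Dockerfile
app_port: 8080         # the port TGI listens on (see Step 2)
---
```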
2️⃣ Writing the Dockerfile
TGI provides a pre-built inference server for Hugging Face models.
We just need to set up a Dockerfile that pulls the correct image and configures the model.
Dockerfile
# Use Hugging Face TGI as the base image
FROM ghcr.io/huggingface/text-generation-inference:3.0.2
# Set working directory
WORKDIR /app
# Create and set permissions for cache directories
RUN mkdir -p /data && chmod 777 /data
RUN mkdir -p /.cache && chmod 777 /.cache
RUN mkdir -p /.triton && chmod 777 /.triton
# Expose the model API on port 8080
EXPOSE 8080
# Set Hugging Face token for private models
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}
# Run the TGI server with OLMo-7B
CMD ["--model-id", "allenai/OLMo-7B-0724-Instruct-hf", "--port", "8080"]
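Before pushing to the Space, you can optionally smoke-test the same image locally. This is a sketch, not part of the Space setup: the `olmo-tgi` tag and the `hf_xxx` token are placeholders, and it assumes Docker with the NVIDIA container runtime plus a GPU large enough for a 7B model.

```bash
# Build the image, passing a (placeholder) Hugging Face token as the build arg
docker build --build-arg HF_TOKEN=hf_xxx -t olmo-tgi .

# Run it locally; TGI serves on port 8080, matching the Dockerfile above
docker run --gpus all -p 8080:8080 olmo-tgi
```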
[Placeholder for Screenshot: Space Deploying with TGI]
3️⃣ Testing the API
Once deployed, TGI's OpenAI-compatible Messages API is automatically available at:
https://your-space-url.hf.space/v1/chat/completions
(The native /generate and /generate_stream endpoints are also exposed at the Space root.)
Using curl
curl https://arig23498-tgi-docker-olmo.hf.space/v1/chat/completions \
-X POST \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is deep learning?"
}
],
"stream": true,
"max_tokens": 20
}' \
-H 'Content-Type: application/json'
Using Python
from huggingface_hub import InferenceClient

# Point the client at the Space's OpenAI-compatible /v1/ base URL
client = InferenceClient(
    base_url="https://arig23498-tgi-docker-olmo.hf.space/v1/",
)

# Stream the chat completion token by token
output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content, end="")
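Because the Space exposes TGI's OpenAI-compatible Messages API, you can also point the `openai` Python client at it. The sketch below assumes the `openai` package is installed; the API key is an arbitrary placeholder since the Space does not enforce authentication.

```python
from openai import OpenAI

# Point the OpenAI client at the Space's /v1/ base URL
client = OpenAI(
    base_url="https://arig23498-tgi-docker-olmo.hf.space/v1/",
    api_key="not-needed",  # placeholder; TGI ignores it unless you add auth in front
)

response = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```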
More TGI Features: TGI v3.0.2 Release Notes
Next Steps?
- Deploy other LLMs like Llama 3, Falcon, or Mistral using TGI.
- Add GPU support for blazing-fast inference.
- Build a frontend chatbot using Gradio or Streamlit (a minimal Gradio sketch follows this list).
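As a starting point for that last item, here is a minimal Gradio sketch (not a full app) that chats with the Space through the same InferenceClient used above; it assumes Gradio ≥ 4.44 for `type="messages"` and reuses the example Space URL.

```python
import gradio as gr
from huggingface_hub import InferenceClient

# Same example Space URL as in the testing section above
client = InferenceClient(base_url="https://arig23498-tgi-docker-olmo.hf.space/v1/")

def respond(message, history):
    # With type="messages", history arrives as {"role", "content"} dicts
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    messages += [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})

    stream = client.chat.completions.create(
        model="tgi", messages=messages, stream=True, max_tokens=512
    )
    partial = ""
    for chunk in stream:
        partial += chunk.choices[0].delta.content or ""
        yield partial  # Gradio renders the growing reply as it streams

gr.ChatInterface(respond, type="messages").launch()
```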
Try it out & scale your own LLM API today!