🚀 Deploying OLMo-7B with Text Generation Inference (TGI) on Hugging Face Spaces

Community Article · Published February 2, 2025

Hugging Face's Text Generation Inference (TGI) is the go-to solution for production-ready LLM deployments. Unlike our previous FastAPI-based prototype, this guide shows how to deploy OLMo-7B-Instruct on Hugging Face Spaces using TGI, making it scalable, optimized, and efficient.

📌 What's Different?

  • TGI-backed API → Optimized for inference, not just a proof-of-concept.
  • Hugging Face Transformers-Compatible → Works with any TGI-supported LLM.
  • Auto-Optimizations → Continuous batching, token streaming, and quantization support out of the box.

1️⃣ Setting Up the Space

Go to Hugging Face Spaces and create a new Space.

  • Choose Docker as the SDK.
  • Set app_port: 8080 in the README.md front matter, as shown below.

🔥 Important: Unlike other Spaces, a Docker Space must declare the port the server listens on via app_port (8080 here, or your port of choice) so that requests are routed correctly.
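
For reference, the top of the Space's README.md should contain a YAML front-matter block roughly like the following (title and emoji are placeholders):

---
title: TGI OLMo
emoji: 🚀
sdk: docker
app_port: 8080
---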

2️⃣ Writing the Dockerfile

TGI provides a pre-built inference server for Hugging Face models.
We just need to set up a Dockerfile that pulls the correct image and configures the model.

📜 Dockerfile

# Use Hugging Face TGI as the base image
FROM ghcr.io/huggingface/text-generation-inference:3.0.2

# Set working directory
WORKDIR /app

# Create and open up permissions for the cache directories TGI writes to
RUN mkdir -p /data /.cache /.triton && chmod 777 /data /.cache /.triton

# Expose the model API on port 8080
EXPOSE 8080

# Set Hugging Face token for gated or private models
# (on Spaces, prefer adding HF_TOKEN as a Space secret instead of a build arg)
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}

# Run the TGI server with OLMo-7B; these arguments are passed to the
# image's text-generation-launcher entrypoint
CMD ["--model-id", "allenai/OLMo-7B-0724-Instruct-hf", "--port", "8080"]
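
Any extra text-generation-launcher flags can simply be appended to the CMD arguments. For example, the values below (illustrative, not tuned for OLMo-7B) would cap sequence lengths to fit a tighter memory budget:

# Optional: cap sequence lengths to fit a smaller memory budget
# CMD ["--model-id", "allenai/OLMo-7B-0724-Instruct-hf", "--port", "8080", \
#      "--max-input-tokens", "2048", "--max-total-tokens", "4096"]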

📷 [Placeholder for Screenshot: Space Deploying with TGI]

3️⃣ Testing the API

Once deployed, TGI automatically serves its OpenAI-compatible Messages API at:

https://your-space-url.hf.space/v1/chat/completions

along with the native TGI route at https://your-space-url.hf.space/generate.

✅ Using curl

curl https://arig23498-tgi-docker-olmo.hf.space/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "tgi",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"}
        ],
        "stream": true,
        "max_tokens": 20
    }'
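
If you prefer TGI's native schema over the OpenAI-compatible one, the /generate route takes a plain inputs string:

curl https://arig23498-tgi-docker-olmo.hf.space/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'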

✅ Using Python

from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="https://arig23498-tgi-docker-olmo.hf.space/v1/",
)

output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    # The final streamed chunk may arrive without content, so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
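
Because the endpoint is OpenAI-compatible, the openai Python client works as well. A minimal sketch, assuming openai is installed (TGI does not validate the API key, so any placeholder value works):

from openai import OpenAI

# Point the OpenAI client at the Space's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://arig23498-tgi-docker-olmo.hf.space/v1/",
    api_key="-",  # placeholder; TGI does not check it
)

chat = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in chat:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")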

🔗 More TGI Features: TGI v3.0.2 Release Notes

🚀 Next Steps?

  • Deploy other LLMs like Llama 3, Falcon, or Mistral using TGI.
  • Add GPU hardware to the Space for much faster inference.
  • Build a frontend chatbot using Gradio or Streamlit (see the sketch below).
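
As a starting point for that last item, here is a minimal Gradio sketch (assuming Gradio's pair-style chat history and the Space URL from the examples above):

import gradio as gr
from huggingface_hub import InferenceClient

# Reuse the TGI Space endpoint from the examples above
client = InferenceClient(base_url="https://arig23498-tgi-docker-olmo.hf.space/v1/")

def respond(message, history):
    # Rebuild the conversation as an OpenAI-style messages list
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})

    # Stream partial responses back to the chat UI as tokens arrive
    partial = ""
    for chunk in client.chat.completions.create(
        model="tgi", messages=messages, stream=True, max_tokens=512
    ):
        if chunk.choices and chunk.choices[0].delta.content:
            partial += chunk.choices[0].delta.content
            yield partial

gr.ChatInterface(respond).launch()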

🔹 Try it out & scale your own LLM API today! 🚀
