Deploying OLMo-7B with Text Generation Inference (TGI) on Hugging Face Spaces
Published February 2, 2025
Hugging Face's Text Generation Inference (TGI) is the go-to solution for production-ready LLM deployments. Unlike our previous FastAPI-based prototype, this guide shows how to deploy OLMo-7B-Instruct on Hugging Face Spaces using TGI, making the deployment scalable, optimized, and efficient.
What's Different?
- TGI-backed API: optimized for inference, not just a proof of concept.
- Hugging Face Transformers-compatible: works with any TGI-supported LLM.
- Auto-optimizations: continuous batching, token streaming, and quantization out of the box.
1️⃣ Setting Up the Space
Go to Hugging Face Spaces and create a new Space.
- Choose Docker as the SDK.
- Set `app_port: 8080` in `README.md`.
Important: Unlike other Spaces, TGI requires `app_port: 8080` (or your port of choice) in the README metadata for proper routing; see the example front matter below.
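For reference, the metadata block at the top of the Space's `README.md` might look like the sketch below. Only `sdk: docker` and `app_port: 8080` are the settings this step actually requires; the title, emoji, and color values are illustrative placeholders.

```yaml
---
title: OLMo TGI Demo   # illustrative
emoji: 🚀              # illustrative
colorFrom: blue        # illustrative
colorTo: purple        # illustrative
sdk: docker            # tells Spaces to build the Dockerfile
app_port: 8080         # the port TGI listens on (see Step 2)
---
```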
2️⃣ Writing the Dockerfile
TGI provides a pre-built inference server for Hugging Face models.
We just need to set up a Dockerfile that pulls the correct image and configures the model.
Dockerfile
# Use Hugging Face TGI as the base image
FROM ghcr.io/huggingface/text-generation-inference:3.0.2
# Set working directory
WORKDIR /app
# Create and set permissions for cache directories
RUN mkdir -p /data && chmod 777 /data
RUN mkdir -p /.cache && chmod 777 /.cache
RUN mkdir -p /.triton && chmod 777 /.triton
# Expose the model API on port 8080
EXPOSE 8080
# Set Hugging Face token for private models
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}
# Run the TGI server with OLMo-7B
CMD ["--model-id", "allenai/OLMo-7B-0724-Instruct-hf", "--port", "8080"]
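Before pushing to the Space, you can optionally smoke-test the same image locally. This is a sketch, not part of the Space setup: the `olmo-tgi` tag and the `hf_xxx` token are placeholders, and it assumes Docker with the NVIDIA container runtime plus a GPU large enough for a 7B model.

```bash
# Build the image, passing a (placeholder) Hugging Face token as the build arg
docker build --build-arg HF_TOKEN=hf_xxx -t olmo-tgi .

# Run it locally; TGI serves on port 8080, matching the Dockerfile above
docker run --gpus all -p 8080:8080 olmo-tgi
```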
[Placeholder for Screenshot: Space Deploying with TGI]
3️⃣ Testing the API
Once deployed, TGI's OpenAI-compatible Messages API is automatically available at:
https://your-space-url.hf.space/v1/chat/completions
(The native /generate and /generate_stream endpoints are also exposed at the Space root.)
Using curl
curl https://arig23498-tgi-docker-olmo.hf.space/v1/chat/completions \
-X POST \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is deep learning?"
}
],
"stream": true,
"max_tokens": 20
}' \
-H 'Content-Type: application/json'
Using Python
from huggingface_hub import InferenceClient

# Point the client at the Space's OpenAI-compatible /v1/ base URL
client = InferenceClient(
    base_url="https://arig23498-tgi-docker-olmo.hf.space/v1/",
)

# Stream the chat completion token by token
output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content, end="")
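Because the Space exposes TGI's OpenAI-compatible Messages API, you can also point the `openai` Python client at it. The sketch below assumes the `openai` package is installed; the API key is an arbitrary placeholder since the Space does not enforce authentication.

```python
from openai import OpenAI

# Point the OpenAI client at the Space's /v1/ base URL
client = OpenAI(
    base_url="https://arig23498-tgi-docker-olmo.hf.space/v1/",
    api_key="not-needed",  # placeholder; TGI ignores it unless you add auth in front
)

response = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```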
More TGI Features: TGI v3.0.2 Release Notes
Next Steps?
- Deploy other LLMs like Llama 3, Falcon, or Mistral using TGI.
- Add GPU support for blazing-fast inference.
- Build a frontend chatbot using Gradio or Streamlit (a minimal Gradio sketch follows this list).
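As a starting point for that last item, here is a minimal Gradio sketch (not a full app) that chats with the Space through the same InferenceClient used above; it assumes Gradio ≥ 4.44 for `type="messages"` and reuses the example Space URL.

```python
import gradio as gr
from huggingface_hub import InferenceClient

# Same example Space URL as in the testing section above
client = InferenceClient(base_url="https://arig23498-tgi-docker-olmo.hf.space/v1/")

def respond(message, history):
    # With type="messages", history arrives as {"role", "content"} dicts
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    messages += [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})

    stream = client.chat.completions.create(
        model="tgi", messages=messages, stream=True, max_tokens=512
    )
    partial = ""
    for chunk in stream:
        partial += chunk.choices[0].delta.content or ""
        yield partial  # Gradio renders the growing reply as it streams

gr.ChatInterface(respond, type="messages").launch()
```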
Try it out & scale your own LLM API today!