---
title: ZeroGPU-LLM-Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
license: apache-2.0
short_description: Chat inference for GGUF models with llama.cpp & Gradio
---

This Gradio app enables chat-based inference on various GGUF models using llama.cpp and llama-cpp-python; supported models and features are detailed below.
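The app's rate-limit handling (exponential backoff for DuckDuckGo API limits) could be sketched as a generic retry helper. This is a hypothetical sketch, not the app's actual code; the `DDGS().text(...)` call shown in the comment is the duckduckgo_search API the app presumably wraps.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(fn: Callable[[], T], retries: int = 4, base_delay: float = 1.0) -> T:
    """Call fn(), retrying on failure with exponential backoff plus jitter.

    Hypothetical helper: in the app, fn would wrap the DuckDuckGo search call,
    e.g. retry_with_backoff(lambda: DDGS().text(query, max_results=5)).
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            # Sleep 1x, 2x, 4x, ... the base delay, with a little jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("unreachable")

# Demo: a function that fails twice (as if rate-limited) before succeeding.
attempts = {"count": 0}
def flaky_search():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("rate limited")
    return ["result"]

result = retry_with_backoff(flaky_search, base_delay=0.01)  # succeeds on the third try
```

Capping the retry count keeps a persistent outage from blocking the UI indefinitely; the jitter avoids synchronized retries from concurrent users.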

🔄 Supported Models:

  • Qwen/Qwen2.5-7B-Instruct-GGUF → qwen2.5-7b-instruct-q2_k.gguf
  • unsloth/gemma-3-4b-it-GGUF → gemma-3-4b-it-Q4_K_M.gguf
  • unsloth/Phi-4-mini-instruct-GGUF → Phi-4-mini-instruct-Q4_K_M.gguf
  • MaziyarPanahi/Meta-Llama-3.1-8B-Instruct-GGUF → Meta-Llama-3.1-8B-Instruct.Q2_K.gguf
  • unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF → DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf
  • MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF → Mistral-7B-Instruct-v0.3.IQ3_XS.gguf
  • Qwen/Qwen2.5-Coder-7B-Instruct-GGUF → qwen2.5-coder-7b-instruct-q2_k.gguf
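The repo-to-file mapping above boils down to a simple lookup table. The sketch below mirrors the list exactly, but the `MODELS` dict and `gguf_filename` helper are hypothetical names; the app may structure this differently.

```python
# Mapping of Hugging Face repo IDs to their GGUF filenames (from the list above).
MODELS = {
    "Qwen/Qwen2.5-7B-Instruct-GGUF": "qwen2.5-7b-instruct-q2_k.gguf",
    "unsloth/gemma-3-4b-it-GGUF": "gemma-3-4b-it-Q4_K_M.gguf",
    "unsloth/Phi-4-mini-instruct-GGUF": "Phi-4-mini-instruct-Q4_K_M.gguf",
    "MaziyarPanahi/Meta-Llama-3.1-8B-Instruct-GGUF": "Meta-Llama-3.1-8B-Instruct.Q2_K.gguf",
    "unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF": "DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf",
    "MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF": "Mistral-7B-Instruct-v0.3.IQ3_XS.gguf",
    "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF": "qwen2.5-coder-7b-instruct-q2_k.gguf",
}

def gguf_filename(repo_id: str) -> str:
    """Return the GGUF filename for a given repo ID (hypothetical helper)."""
    return MODELS[repo_id]

# Fetching the selected model would then use huggingface_hub, e.g.:
#   from huggingface_hub import hf_hub_download
#   path = hf_hub_download(repo_id=repo, filename=gguf_filename(repo))
# and the path would be passed to llama_cpp.Llama(model_path=path, ...).
```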

βš™οΈ Features:

  • Model Selection: Select from multiple GGUF models.
  • Customizable Prompts & Parameters: Set a system prompt (e.g., automatically including today’s date), adjust temperature, token limits, and more.
  • Chat-style Interface: Interactive Gradio UI with streaming token-by-token responses.
  • Real-Time Web Search & Debug Output: Leverages DuckDuckGo to fetch recent context, with a dedicated debug panel showing web search progress and results.
  • Response Cancellation: Cancel in-progress answer generation using a cancel button.
  • Memory-Safe & Rate-Limit Resilient: Loads one model at a time with proper cleanup and incorporates exponential backoff to handle API rate limits.
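The streaming and cancellation behavior can be sketched generically: the generator below accumulates tokens from any token iterator and stops early once a cancel flag is set. Names are hypothetical and the token source is stubbed so the sketch stays self-contained; in the real app, tokens would come from llama-cpp-python's `create_chat_completion(..., stream=True)`.

```python
import threading
from typing import Iterable, Iterator

def stream_response(tokens: Iterable[str], cancel: threading.Event) -> Iterator[str]:
    """Yield the growing response text token by token, stopping if cancelled."""
    text = ""
    for tok in tokens:
        if cancel.is_set():      # the UI's cancel button would set this event
            break
        text += tok
        yield text               # Gradio re-renders the chat box on each yield

# Example: cancel mid-stream after the second token.
cancel = threading.Event()
out = []
for i, partial in enumerate(stream_response(["Hel", "lo", " wor", "ld"], cancel)):
    out.append(partial)
    if i == 1:
        cancel.set()             # simulate pressing the cancel button
# out[-1] is "Hello": generation stopped after two tokens.
```

Checking the event between tokens (rather than killing the worker) lets the model finish its current token cleanly, which keeps the one-model-at-a-time memory management safe.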

Ideal for deploying multiple GGUF chat models on Hugging Face Spaces with a robust, user-friendly interface!

For further details, check the Spaces configuration guide.