Transformers Streaming Output

Community Article Published March 15, 2025

Introduction

With the advancement of AI-driven chatbots, interactive learning has become more engaging. In this blog, we will explore how to build a chatbot with streaming output using Python, Gradio, and a Qwen-based language model.

Prerequisites

Before we start, ensure you have the following installed (accelerate and bitsandbytes are also needed because the checkpoint used below is 4-bit quantized):

pip install gradio transformers torch accelerate bitsandbytes

Code Implementation

import gradio as gr  # Import the Gradio library for creating user interfaces
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer  # Import necessary classes from the transformers library
from threading import Thread  # Import Thread for concurrent execution
import time  # Import time for adding delays

model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-unsloth-bnb-4bit"  # 4-bit quantized DeepSeek-R1 distill of Qwen-1.5B

# Load the pre-trained model with automatic data type and device mapping
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

def QwenChat(message, history):  # Chat handler called by Gradio; yields partial responses so the UI can stream them
    # Construct the messages list with system, history, and user message
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    ]
    messages.extend(history)  # Add chat history to the messages list
    messages.append({"role": "user", "content": message})  # Append the user's message

    # Apply chat template to format the messages for the model
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Set up the streamer for token generation
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Prepare model inputs by tokenizing the text and moving it to the model's device
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Set up generation arguments including max tokens and streamer
    generation_args = {
        "max_new_tokens": 512,
        "streamer": streamer,
        **model_inputs
    }

    # Start a separate thread for model generation to allow streaming output
    thread = Thread(
        target=model.generate,
        kwargs=generation_args,
    )
    thread.start()

    # Accumulate and yield text tokens as they are generated
    acc_text = ""
    for text_token in streamer:
        time.sleep(0.01)  # Short delay to smooth the visual streaming effect
        acc_text += text_token  # Append the generated token to the accumulated text
        yield acc_text  # Yield the accumulated text

    # Ensure the generation thread completes
    thread.join()

# Create a Gradio chat interface with the QwenChat function
demo = gr.ChatInterface(fn=QwenChat, type="messages")

# Launch the Gradio interface on all available network interfaces
demo.launch(server_name="0.0.0.0")
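
To try it, save the code as a script (a filename like app.py is assumed here) and run it; Gradio serves the interface on port 7860 by default:

python app.py

Because server_name="0.0.0.0" binds to all network interfaces, the UI is reachable at http://localhost:7860 locally, or at the host's IP on port 7860 from other machines. Running model.generate in a background thread is what makes streaming work: generate blocks until it finishes, so the worker thread feeds tokens into the TextIteratorStreamer while the main thread iterates over the streamer and yields partial text to Gradio.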

Features of This AI Tutor

  • Real-time response: Text streams token-by-token as the model generates it.
  • Interactive learning: Users can practice conversations with an AI tutor.
  • Customizable: Modify the system prompt to tailor the teaching style (see the sketch after this list).
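
For example, turning the assistant into a language tutor only requires editing the system message inside QwenChat. The prompt below is an illustrative sketch, not part of the original code:

messages = [
    {"role": "system", "content": "You are a patient English tutor. Correct the learner's mistakes and briefly explain each correction."},
]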

How It Works

  1. The user enters a message.
  2. The system builds a messages list from the system prompt, the previous conversation, and the new message, then formats it with the chat template (see the sketch after this list).
  3. The AI model processes the input and generates a response token-by-token in real time.
  4. The response appears gradually, simulating a natural conversation.
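
For illustration, with type="messages" Gradio passes history as a list of role/content dictionaries, so the messages list built in step 2 looks roughly like this (the conversation contents are made up for the example):

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},                      # from history
    {"role": "assistant", "content": "Hi! How can I help?"},    # from history
    {"role": "user", "content": "Teach me a new word."},        # the current message
]

tokenizer.apply_chat_template then flattens this list into the model's prompt format and appends the generation prompt so the model responds as the assistant.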

Conclusion

This approach offers an engaging way to learn using AI. By integrating streaming output, students can experience dynamic, realistic interactions rather than static responses.

Try it out and start your AI-powered learning journey today! 🚀
