---
license: apache-2.0
language:
- en
- zh
base_model:
- meta-llama/Llama-3.2-3B-Instruct
library_name: transformers
tags:
- CoT
- LongCoT
- o1
pipeline_tag: text-generation
---

# Llama-3.2-3B-LongCoT

A small model with **LongCoT** (long chain-of-thought) capability.

![Example Image](images/example.jpg)

## Features

- Fine-tuned on high-quality synthetic data.
- Decides on its own whether to use LongCoT, based on the complexity of the question.
- Strong at mathematics and reasoning.

## Benchmark

| Benchmark | Llama-3.2-3B-Instruct | Llama-3.2-3B-LongCoT |
|-----------|-----------------------|----------------------|
| Math      | 35.5                  | **52.0**             |
| GSM8K     | 77.3                  | **82.3**             |

## Inference

Example of streaming inference:

```python
import time

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TextStreamer,
)

# Model ID on Hugging Face
model_id = "Kadins/Llama-3.2-3B-LongCoT"

# Load the pre-trained model with an appropriate data type and device mapping
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for optimized performance
    device_map="auto",           # Automatically map the model to available devices
)

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_id)


def stream_chat(messages, max_new_tokens=8192, top_p=0.95, temperature=0.6):
    """
    Generates a response using streaming inference.

    Args:
        messages (list): A list of dictionaries containing the conversation so far.
        max_new_tokens (int): Maximum number of tokens to generate.
        top_p (float): Nucleus sampling parameter for controlling diversity.
        temperature (float): Sampling temperature to control response creativity.
    """
    # Apply the chat template and tokenize the conversation
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,  # Return a dict with input_ids and attention_mask
    ).to(model.device)  # Move the inputs to the same device as the model

    # Initialize the TextStreamer for real-time output
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Record the start time for performance measurement
    start_time = time.time()

    # Generate the response, streaming tokens to the console as they are decoded
    model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        repetition_penalty=1.1,
        top_p=top_p,
        temperature=temperature,
        streamer=streamer,  # Enable streaming of the generated tokens
    )

    # Calculate and print the total response time
    total_time = time.time() - start_time
    print(f"\n--- Response finished in {total_time:.2f} seconds ---")


def chat_loop():
    """
    Runs an interactive chat session with the model.
    Reads user input and streams model responses until the user exits.
    """
    while True:
        # Start each turn with a fresh system message (no history is kept between turns)
        messages = [
            {"role": "system", "content": "You are a reasoning expert and helpful assistant."},
        ]

        # Prompt the user for input
        user_input = input("\nUser: ")
        if user_input.strip().lower() in ["exit", "quit"]:
            print("Exiting chat...")
            break

        # Append the user's message to the current turn
        messages.append({"role": "user", "content": user_input})

        print("Assistant: ", end="", flush=True)

        # Generate and stream the assistant's response
        stream_chat(messages)
        # Note: the assistant's reply is streamed directly to the console.
        # Storing the reply in the conversation history requires additional handling.

if __name__ == "__main__":
    # Start the interactive chat loop when the script is executed
    chat_loop()
```
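
The note in `chat_loop` points out that extra handling is needed to keep the assistant's replies in the conversation history. One way to do this, sketched below assuming the `model` and `tokenizer` from the script above are already loaded, is to swap `TextStreamer` for `TextIteratorStreamer` and run `generate` in a background thread so the streamed text can be printed and collected at the same time. The helper name `stream_chat_with_history` is illustrative and not part of the released script.

```python
from threading import Thread

from transformers import TextIteratorStreamer


def stream_chat_with_history(messages, max_new_tokens=8192, top_p=0.95, temperature=0.6):
    """Stream a reply to the console and append it to `messages` so it persists across turns."""
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    # TextIteratorStreamer yields decoded text chunks that can be both printed and collected
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    generation_kwargs = dict(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        repetition_penalty=1.1,
        top_p=top_p,
        temperature=temperature,
        streamer=streamer,
    )

    # Run generation in a background thread so the streamer can be consumed here
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    reply = ""
    for chunk in streamer:
        print(chunk, end="", flush=True)
        reply += chunk
    thread.join()
    print()

    # Append the assistant's reply so the next turn sees the full conversation
    messages.append({"role": "assistant", "content": reply})
    return reply
```

To use it, initialize `messages` once before the `while` loop in `chat_loop` and call `stream_chat_with_history(messages)` in place of `stream_chat(messages)`.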