---
base_model:
- Qwen/Qwen2.5-3B-Instruct
tags:
- text-generation-inference
- transformers
- qwen2
- trl
- grpo
license: apache-2.0
language:
- en
---

# Uploaded model

- **Developed by:** TethysAI
- **License:** apache-2.0
- **Finetuned from model:** Qwen/Qwen2.5-3B-Instruct

# Follow the structure below to call the model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("saishshinde15/TethysAI_Base_Reasoning")
model = AutoModelForCausalLM.from_pretrained("saishshinde15/TethysAI_Base_Reasoning")

# Prepare the input prompt using the chat template
SYSTEM_PROMPT = """
Respond in the following format:
...
...
"""

text = tokenizer.apply_chat_template([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is 2x+3=4"},
], tokenize=False, add_generation_prompt=True)

# Tokenize the input
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
input_ids = input_ids.to(device)

# Generate a response with standard Hugging Face sampling.
# Note: vLLM's SamplingParams does not apply to a transformers model;
# it is only used in the fast-inference path shown in the next section.
output_ids = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=1024,  # equivalent to vLLM's max_tokens
)

# Decode and print the output
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
```
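If you prefer to see the answer as it is produced rather than after generation finishes, transformers' `TextStreamer` can be attached to the same `generate` call. This is an optional sketch, not part of the original card; it reuses `tokenizer`, `model`, and `input_ids` from the block above.

```python
from transformers import TextStreamer

# Print tokens as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=1024,
    streamer=streamer,
)
```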
# Fast inference

Install the dependencies (the CUDA 11.8 wheels are used here as an example):

```bash
pip install transformers vllm[lora] torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

The `fast_generate` and `load_lora` helpers below are not part of plain `transformers`; they are provided by a vLLM-backed runtime such as Unsloth's `FastLanguageModel` (see the loading sketch after this block). The `grpo_saved_lora` folder refers to a locally saved LoRA adapter.

```python
from vllm import SamplingParams

text = tokenizer.apply_chat_template([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is 2x+3=4"},
], tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1024,
)

output = model.fast_generate(
    text,
    sampling_params=sampling_params,
    lora_request=model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

print(output)
```
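The card does not show how `model` is loaded for this fast-inference path. A minimal loading sketch, assuming Unsloth's vLLM integration (the `fast_inference` flag and the `load_lora` helper come from Unsloth, and the `max_seq_length` / `load_in_4bit` values below are illustrative choices, not settings from the card):

```python
from unsloth import FastLanguageModel

# Assumption: the model is loaded through Unsloth so that
# model.fast_generate and model.load_lora are available.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="saishshinde15/TethysAI_Base_Reasoning",
    max_seq_length=2048,   # adjust to your context needs
    load_in_4bit=True,     # optional 4-bit loading to save memory
    fast_inference=True,   # enables the vLLM backend used by fast_generate
)
```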
# Use this prompt for more detailed and personalised results. This is the recommended prompt, as the model was tuned on it.

```text
You are a reasoning model made by researcher at TethysAI and your role is to respond in the following format only and in detail :
...
...
```
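To use the recommended prompt, pass it as the system message in the same chat-template call shown earlier. A short usage sketch (the `RECOMMENDED_PROMPT` variable name and the user question are illustrative):

```python
# Recommended system prompt from the card (format placeholders kept as-is)
RECOMMENDED_PROMPT = """You are a reasoning model made by researcher at TethysAI and your role is to respond in the following format only and in detail :
...
...
"""

text = tokenizer.apply_chat_template([
    {"role": "system", "content": RECOMMENDED_PROMPT},
    {"role": "user", "content": "What is 2x+3=4"},
], tokenize=False, add_generation_prompt=True)
```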