Error during usage

#1 opened by Zolotosss

When I try to load the model using transformers, I encounter an error.
OSError: issai/LLama-3.1-KazLLM-1.0-70B-GGUF4 does not appear to have a file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt or flax_model.msgpack.
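
The repository only hosts GGUF quantizations, so transformers' default loader finds none of the standard checkpoint files it expects. Recent transformers releases (4.41+) can load a GGUF file directly via the gguf_file argument; a minimal sketch, assuming the filename from this repo's file listing:

# Hedged sketch: load the GGUF file directly with transformers (>= 4.41).
# The filename below is assumed from this repo's file listing.
# Note: transformers dequantizes GGUF weights on load, so a 70B model
# requires a large amount of memory this way.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "issai/LLama-3.1-KazLLM-1.0-70B-GGUF4"
gguf_file = "Nemotron_70B_instruct_corex5_mcq_cleaned_old_torchtune_cabinet_28112024_18000-Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)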

Institute of Smart Systems and Artificial Intelligence, Nazarbayev University (org)

Hello.

You can run KazLLM with vLLM instead, like this:

Cell 1:

# Set up the environment:
!conda create -n vllm_test python=3.10 -y
!pip install vllm==0.6.3
!pip install ipykernel
!python -m ipykernel install --user --name vllm_test
# After this cell, switch the notebook kernel to "vllm_test" before running the next cells.

Cell 2:

# Load the model.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # select which GPU to use
from vllm import LLM, SamplingParams

# Example conversation passed to the chat method:
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM from the local GGUF file.
llm = LLM(
    model="Nemotron_70B_instruct_corex5_mcq_cleaned_old_torchtune_cabinet_28112024_18000-Q4_K_M.gguf",
    gpu_memory_utilization=0.95,
    max_model_len=32000,
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)
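
If the GGUF file is not already on disk, it can be fetched from the Hub first; a minimal sketch using huggingface_hub (repo id and filename assumed from this repo's file listing):

# Optional: download the GGUF file from the Hub (run once before the cell above).
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="issai/LLama-3.1-KazLLM-1.0-70B-GGUF4",
    filename="Nemotron_70B_instruct_corex5_mcq_cleaned_old_torchtune_cabinet_28112024_18000-Q4_K_M.gguf",
)
# Then pass gguf_path as the model argument to LLM(...) above.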

Cell 3:

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Alternatively, you can run the model with llama.cpp, since vLLM's GGUF support is not yet fully optimized.
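
For example, a minimal sketch using the llama-cpp-python bindings (the model path and sampling values are carried over from the vLLM example above):

# Minimal llama.cpp example via the llama-cpp-python bindings
# (pip install llama-cpp-python). The model path is assumed to be the
# same local GGUF file used in the vLLM example.
from llama_cpp import Llama

llm = Llama(
    model_path="Nemotron_70B_instruct_corex5_mcq_cleaned_old_torchtune_cabinet_28112024_18000-Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if built with GPU support
)
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Write an essay about the importance of higher education."},
    ],
    temperature=0.8,
    top_p=0.95,
)
print(response["choices"][0]["message"]["content"])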
