Introduction
GemSUra-edu is a large language model based on GemSUra 2B, a pre-trained model developed by the URA research group at Ho Chi Minh City University of Technology (HCMUT). It is fine-tuned on a dataset of HCMUT FAQs.
Inference (with Unsloth for higher speed)
from unsloth import FastLanguageModel
import torch

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="IAmSkyDra/GemSUra-edu",
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

query_template = "<start_of_turn>user\n{query}<end_of_turn>\n<start_of_turn>model\n"

while True:
    query = input("Query: ")
    if query.lower() == "exit":
        break
    prompt = query_template.format(query=query)
    # Move the tokenized inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=4096, use_cache=True)
    # Decode only the newly generated tokens, skipping the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    print(answer)
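When only the fully decoded string is available (prompt and answer together, with special tokens removed), the prompt can be stripped at the string level instead. The following is a minimal sketch, not part of the original script: `build_prompt` and `strip_prompt` are hypothetical helper names, and the sketch assumes the decoded text echoes the prompt exactly as `user\n{query}\nmodel\n`, which may not hold for every tokenizer version.

```python
QUERY_TEMPLATE = "<start_of_turn>user\n{query}<end_of_turn>\n<start_of_turn>model\n"


def build_prompt(query: str) -> str:
    """Wrap a user query in the turn template used by the scripts in this README."""
    return QUERY_TEMPLATE.format(query=query)


def strip_prompt(decoded: str, query: str) -> str:
    """Remove the echoed prompt from text decoded with skip_special_tokens=True.

    Assumption: with special tokens removed, the echoed prompt reads
    "user\\n{query}\\nmodel\\n". If that prefix is not found, fall back to
    splitting once on the first "model\\n" marker.
    """
    echoed = f"user\n{query}\nmodel\n"
    if decoded.startswith(echoed):
        return decoded[len(echoed):].strip()
    return decoded.split("model\n", 1)[-1].strip()


if __name__ == "__main__":
    print(repr(build_prompt("What is HCMUT?")))
    # A query that itself ends in "model" is handled by the prefix check.
    print(strip_prompt("user\nmodel\nmodel\nHello", "model"))
```

Matching the whole echoed prefix, rather than splitting on the first `model\n`, keeps the extraction correct when the query or the answer happens to contain the word "model" followed by a newline.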
Inference (with Transformers)
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

pipeline_kwargs = {
    "temperature": 0.1,
    "max_new_tokens": 4096,
    "do_sample": True,
}

if __name__ == "__main__":
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        "IAmSkyDra/GemSUra-edu",
        device_map="auto",
    )
    model.eval()

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        "IAmSkyDra/GemSUra-edu",
        trust_remote_code=True,
    )

    pipeline = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        return_full_text=False,
        task="text-generation",
        **pipeline_kwargs,
    )

    query_template = "<start_of_turn>user\n{query}<end_of_turn>\n<start_of_turn>model\n"

    while True:
        query = input("Query: ")
        if query.lower() == "exit":
            break
        prompt = query_template.format(query=query)
        # return_full_text=False returns only the generated answer,
        # so there is no echoed prompt to strip
        answer = pipeline(prompt)[0]["generated_text"].strip()
        print(answer)
Note
If you quantize the model for deployment on local devices, use a precision of at least 8 bits.
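One possible way to follow this recommendation with Transformers is sketched below. It is an illustration under assumptions rather than the authors' deployment recipe: it assumes the `bitsandbytes` package and a CUDA GPU are available, and `quantization_kwargs` and `load_quantized` are hypothetical helper names introduced here. Rejecting widths below 8 bits is simply this sketch's way of encoding the note above.

```python
def quantization_kwargs(bits: int) -> dict:
    """Map a requested bit width to BitsAndBytesConfig keyword arguments.

    Hypothetical helper: rejects anything below 8 bits, following the
    recommendation in this README.
    """
    if bits < 8:
        raise ValueError("Quantize GemSUra-edu to at least 8 bits.")
    if bits == 8:
        return {"load_in_8bit": True}
    return {}  # wider than 8 bits: load without bitsandbytes quantization


def load_quantized(model_id: str = "IAmSkyDra/GemSUra-edu", bits: int = 8):
    """Load the model, applying 8-bit quantization when bits == 8."""
    kwargs = quantization_kwargs(bits)
    # Imported lazily so quantization_kwargs works without transformers installed
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(**kwargs) if kwargs else None
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quant_config,
    )


if __name__ == "__main__":
    model = load_quantized(bits=8)  # downloads the weights on first use
    model.eval()
```

The loaded model can then be passed to `transformers.pipeline` in the same way as in the Transformers example.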