
PMC_LLaMA_13B - AWQ

Description

This repository contains AWQ-quantized (4-bit) model files for PMC_LLaMA_13B.

About AWQ

Activation-aware Weight Quantization (AWQ) identifies, from activation statistics, the small fraction of weights that matter most for LLM performance and protects them during quantization, rather than treating all weights equally. This targeted approach minimizes quantization loss, allowing models to run in 4-bit precision with little degradation in output quality.
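
As a rough sketch of how AWQ files like these are typically produced, the AutoAWQ library can quantize a base checkpoint as below. The base model name and quantization settings here are assumptions for illustration, not the exact recipe used for this repository.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_model = "chaoyi-wu/PMC_LLAMA_13B"  # assumed base checkpoint
quant_path = "pmc-llama-13b-awq"

# Common AWQ settings: 4-bit weights, group size 128, zero-point quantization.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# Calibrate on a small dataset and apply activation-aware scaling before quantizing.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)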

Example usage with the vLLM library:

from vllm import LLM, SamplingParams

# Alpaca-style instruction template expected by PMC_LLaMA_13B.
prompt_input = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)

examples = [
    {
        "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
        "input": "What is the mechanism of action of antibiotics?",
    },
    {
        "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
        "input": "How do statins work to lower cholesterol levels?",
    },
    {
        "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
        "input": "Tell me about Paracetamol",
    },
]

# Fill the template with each example to build a batch of prompts.
prompt_batch = [prompt_input.format_map(example) for example in examples]

sampling_params = SamplingParams(temperature=0.8, max_tokens=512)

# Load the AWQ checkpoint; the AWQ kernels run with half-precision activations.
llm = LLM(model="disi-unibo-nlp/pmc-llama-13b-awq", quantization="awq", dtype="half")

# Generate completions for the whole batch in one call.
outputs = llm.generate(prompt_batch, sampling_params)

# Print each prompt alongside its generated answer.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(generated_text)
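
Alternatively, a minimal sketch of Transformers-style inference with the AutoAWQ library (not taken from this model card; generation parameters are chosen to mirror the vLLM example above):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "disi-unibo-nlp/pmc-llama-13b-awq"

model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = (
    "### Instruction:\nYou're a doctor, kindly address the medical queries "
    "according to the patient's account. Answer the question.\n\n"
    "### Input:\nTell me about Paracetamol\n\n### Response:"
)
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Sample up to 512 new tokens at the same temperature used with vLLM above.
output_ids = model.generate(tokens, max_new_tokens=512, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))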