---
license: openrail
model_creator: axiong
model_name: PMC_LLaMA_13B
---
# PMC_LLaMA_13B - AWQ
- Model creator: [axiong](https://huggingface.co/axiong)
- Original model: [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B)

## Description

This repository contains AWQ model files for [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B).

### About AWQ

[Activation-aware Weight Quantization (AWQ)](https://arxiv.org/abs/2306.00978) uses activation statistics to identify the small fraction of weights that matter most for LLM performance and protects them during quantization through per-channel scaling, rather than treating all weights equally. This targeted approach reduces quantization error, allowing models to run with 4-bit weights with little loss in accuracy.
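
The idea can be illustrated with a small NumPy sketch. This is a simplified toy, not the reference AWQ implementation: the helper names `quantize_rtn` and `output_error`, the group size of 32, the fixed scale of 2.0, and the "5x the median" threshold for picking salient channels are all assumptions made for illustration (AWQ itself searches for the scaling factors). It quantizes a random weight matrix to 4 bits with and without scaling the input channels that see large activations, and prints both output errors so the effect can be inspected.

```python
import numpy as np


def quantize_rtn(w, n_bits=4, group_size=32):
    """Round-to-nearest quantization per group of input channels, then dequantize."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    levels = 2 ** n_bits - 1  # 15 steps, i.e. 16 representable values
    step = np.maximum(w_max - w_min, 1e-8) / levels
    codes = np.clip(np.round((groups - w_min) / step), 0, levels)
    return (codes * step + w_min).reshape(out_features, in_features)


def output_error(w, x, scale=None):
    """Mean squared error of the layer output after quantizing w.

    If `scale` is given, input channels of w are multiplied by it before
    quantization and the activations are divided by it, so the full-precision
    product is unchanged.
    """
    if scale is None:
        scale = np.ones(w.shape[1])
    w_q = quantize_rtn(w * scale)
    y_ref = x @ w.T
    y_q = (x / scale) @ w_q.T
    return float(np.mean((y_ref - y_q) ** 2))


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))   # toy weight matrix: 64 outputs, 128 inputs
x = rng.normal(size=(256, 128))  # toy calibration activations
x[:, :2] *= 20.0                 # two input channels carry much larger activations

# Pick salient channels from activation magnitude (assumed heuristic).
act_mag = np.abs(x).mean(axis=0)
scale = np.where(act_mag > 5 * np.median(act_mag), 2.0, 1.0)

err_plain = output_error(w, x)
err_scaled = output_error(w, x, scale)
print(f"output MSE without scaling: {err_plain:.4f}")
print(f"output MSE with scaling:    {err_scaled:.4f}")
```

Scaling a salient channel up before rounding shrinks its relative rounding error, at the cost of a slightly coarser grid for other weights in the same group whenever the scaled weight widens the group's range.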

Example usage with the vLLM library:

```python
from vllm import LLM, SamplingParams

# Instruction-style prompt template with slots for the instruction and the input.
prompt_input = (
    '### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:'
)

examples = [
    {
        "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
        "input": "What is the mechanism of action of antibiotics?"
    },
    {
        "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
        "input": "How do statins work to lower cholesterol levels?"
    },
    {
        "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
        "input": "Tell me about Paracetamol"
    }
]

# Fill the template once per example to build the batch of prompts.
prompt_batch = [prompt_input.format_map(example) for example in examples]

sampling_params = SamplingParams(temperature=0.8, max_tokens=512)

# Load the AWQ-quantized weights and run them in half precision.
llm = LLM(model="disi-unibo-nlp/pmc-llama-13b-awq", quantization="awq", dtype="half")

outputs = llm.generate(prompt_batch, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Response: {generated_text}")
```
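
Because the model is sensitive to the exact prompt layout, it can help to check the formatted prompt before loading any weights. The following is a minimal sketch that reuses the template from the example above; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced here for illustration and are not part of vLLM or of this repository.

```python
# Same template as in the vLLM example, kept as a module-level constant.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"


def build_prompt(instruction: str, question: str) -> str:
    """Fill the instruction/input template (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(instruction=instruction, input=question)


prompt = build_prompt("Answer the question.", "Tell me about Paracetamol")
print(prompt)
```

The prompt ends with `### Response:` and no trailing text, so the model's generation starts directly at the answer.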