---
language:
- en
license: other
library_name: transformers
tags:
- chat
- qwen
- qwen2.5
- finetune
- english
base_model:
- MaziyarPanahi/calme-3.2-instruct-78b
model_name: calme-3.2-instruct-78b
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
pipeline_tag: text-generation
inference: false
model_creator: MaziyarPanahi
quantized_by: MaziyarPanahi
model-index:
- name: calme-3.2-instruct-78b
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 80.63
      name: strict accuracy
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 62.61
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 39.95
      name: exact match
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 20.36
      name: acc_norm
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 38.53
      name: acc_norm
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 70.03
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
---
# EXL2 4.5bpw Quantization of calme-3.2-instruct-78b
<img src="./calme_3.png" alt="Calme-3 Models" width="200" style="margin-left:auto; margin-right:auto; display:block"/>

This repository hosts the **4.5 bits per weight (bpw)** quantization of the [calme-3.2-instruct-78b](https://huggingface.co/MaziyarPanahi/calme-3.2-instruct-78b) model, a Qwen 2.5 finetune, in the **ExLlamaV2 (EXL2)** format for efficient inference at long context lengths.
## Quantization Details
- **Format:** ExLlamaV2 4.5bpw
- **Version:** ExLlamaV2 0.2.6
- **Model Size:** 78 billion parameters
- **VRAM Usage:** approx. **44GB** at a 32,000-token context window
- **Calibration:**
  - Rows: 115
  - Length: 2048
  - Dataset: (default)
The quantization process reduces memory usage and inference latency while maintaining high performance for generative text tasks.
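As a rough sanity check on these figures, the weight footprint follows directly from the parameter count and bits per weight; the estimate below is illustrative only, since KV cache and activations come on top:

```python
# Back-of-envelope weight footprint for a 4.5 bpw quant of a 78B model
params = 78e9   # parameter count
bpw = 4.5       # bits per weight after quantization
weights_gib = params * bpw / 8 / 2**30
print(f"Weights alone: ~{weights_gib:.1f} GiB")  # ~40.9 GiB, matching the ~40GB floor
```

The few additional gigabytes up to the ~44GB figure are mostly KV cache at the 32,000-token context.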
## Prompt Template
This model uses the ChatML prompt template for interaction:
```
<|im_start|>system
{System}
<|im_end|>
<|im_start|>user
{User}
<|im_end|>
<|im_start|>assistant
{Assistant}
```
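If you assemble prompts programmatically, a small helper keeps the template consistent. This is an illustrative sketch; the `to_chatml` helper is not part of any library:

```python
# Render a list of chat messages into the ChatML format shown above
def to_chatml(messages):
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n"
    return prompt + "<|im_start|>assistant\n"

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is EXL2 quantization?"},
]))
```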
## Model Usage
### Example: Inference with ExLlamaV2
To use this quantized model, ensure you have the **ExLlamaV2** library installed:
```bash
pip install exllamav2
```
The snippet below is a minimal sketch using ExLlamaV2's dynamic generator API; it assumes the model files have already been downloaded to `./local-folder` (see Download Instructions below):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Load the quantized model from a local directory
config = ExLlamaV2Config("./local-folder")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # spreads layers across all visible GPUs
tokenizer = ExLlamaV2Tokenizer(config)

# Build a generator and run a ChatML-formatted prompt
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
prompt = "<|im_start|>user\nWhat is EXL2 quantization?\n<|im_end|>\n<|im_start|>assistant\n"
print(generator.generate(prompt=prompt, max_new_tokens=256, add_bos=True))
```

`load_autosplit` is what makes a ~41 GiB weight load workable across multiple GPUs when no single card has enough VRAM.
## Features
- The EXL2 format requires NVIDIA hardware, but typically runs faster and uses less memory than comparable GGUF quantizations.
- Fits in approx. **44GB** of VRAM at a **32,000-token** context window (see the KV-cache note after this list for trimming that further).
- Requires a minimum of approx. **40GB** of VRAM at a **1,024-token** context window.
- Highly optimized for inference, making it practical in VRAM-constrained environments.
- Compatible with ChatML-based prompting systems.
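Most of the gap between the ~40GB floor and the ~44GB long-context figure is KV cache. If VRAM is tight, ExLlamaV2's quantized cache class `ExLlamaV2Cache_Q4` can be swapped in for the cache in the inference example above; a sketch, assuming the same `model` object:

```python
from exllamav2 import ExLlamaV2Cache_Q4

# Drop-in replacement for ExLlamaV2Cache: stores keys/values in 4 bits,
# shrinking KV-cache VRAM substantially at long context lengths
cache = ExLlamaV2Cache_Q4(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)
```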
## Acknowledgments
- **Original Model Creator:** [MaziyarPanahi](https://huggingface.co/MaziyarPanahi)
- **Quantization by:** [DavidCatalano](https://huggingface.co/DavidCatalano)
- **Quantization Tool:** ExLlamaV2 0.2.6
## Download Instructions
To download the model files:
```bash
pip install huggingface_hub
huggingface-cli login
huggingface-cli download DavidCatalano/calme-3.2-instruct-78b-exl2-4.5bpw --include "*" --local-dir ./local-folder
```
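Equivalently, from Python via `huggingface_hub` (a short sketch using `snapshot_download`):

```python
from huggingface_hub import snapshot_download

# Download every file in the repo into ./local-folder (same result as the CLI above)
snapshot_download(
    repo_id="DavidCatalano/calme-3.2-instruct-78b-exl2-4.5bpw",
    local_dir="./local-folder",
)
```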