|
--- |
|
license: apache-2.0 |
|
language: |
|
- pl |
|
base_model: |
|
- CYFRAGOVPL/PLLuM-8x7B-chat |
|
tags: |
|
- polish |
|
- llm |
|
- quantized |
|
- gguf |
|
- mixtral |
|
- llama |
|
library_name: transformers |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
<p align="center"> |
|
<img src="https://i.imgur.com/e9226KU.png"> |
|
</p> |
|
|
|
# PLLuM-8x7B-chat GGUF (Unofficial) |
|
|
|
This repository contains quantized versions of the [PLLuM-8x7B-chat](https://huggingface.co/CYFRAGOVPL/PLLuM-8x7B-chat) model in GGUF format, optimized for local execution with [llama.cpp](https://github.com/ggerganov/llama.cpp) and related tools. Quantization significantly reduces model size while preserving good generation quality, so the model can run on standard consumer hardware.
|
|
|
This is the only repository that provides the PLLuM-8x7B-chat model in both full-precision **F16** and **BF16** GGUF versions, as well as an **IQ3_S** quantization.
|
|
|
The GGUF files can be run in tools such as [LM Studio](https://lmstudio.ai/) or [Ollama](https://ollama.com/).
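For example, to run the model in Ollama you can point a Modelfile at a downloaded GGUF file. Below is a minimal sketch, assuming Ollama is installed and the q4_k_m file is in the current directory; the local model name `pllum-8x7b-chat` and the temperature value are just examples:

```bash
# Create a Modelfile that points at the local GGUF file
cat > Modelfile << 'EOF'
FROM ./PLLuM-8x7B-chat-gguf-q4_k_m.gguf
PARAMETER temperature 0.7
EOF

# Register the model under a local name and start a conversation
ollama create pllum-8x7b-chat -f Modelfile
ollama run pllum-8x7b-chat "Jakie są największe miasta w Polsce?"
```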
|
|
|
## Available models |
|
|
|
| Filename | Size | Quantization type | Recommended hardware | Usage |
|----------|------|-------------------|----------------------|-------|
| [PLLuM-8x7B-chat-gguf-q2_k.gguf](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-q2_k.gguf) | 17 GB | Q2_K | CPU, min. 20 GB RAM | Very low-end machines, lowest quality |
| [**PLLuM-8x7B-chat-gguf-iq3_s.gguf**](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-iq3_s.gguf) | 20.4 GB | IQ3_S | CPU, min. 24 GB RAM | Weaker machines, acceptable quality |
| [PLLuM-8x7B-chat-gguf-q3_k_m.gguf](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-q3_k_m.gguf) | 22.5 GB | Q3_K_M | CPU, min. 26 GB RAM | Good compromise between size and quality |
| [PLLuM-8x7B-chat-gguf-q4_k_m.gguf](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-q4_k_m.gguf) | 28.4 GB | Q4_K_M | CPU/GPU, min. 32 GB RAM | Recommended for most applications |
| [PLLuM-8x7B-chat-gguf-q5_k_m.gguf](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-q5_k_m.gguf) | 33.2 GB | Q5_K_M | CPU/GPU, min. 40 GB RAM | High quality with reasonable size |
| [PLLuM-8x7B-chat-gguf-q8_0.gguf](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-q8_0.gguf) | 49.6 GB | Q8_0 | GPU, min. 52 GB RAM | Highest quality, close to the original |
| [**PLLuM-8x7B-chat-gguf-F16**](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/tree/main/PLLuM-8x7B-chat-gguf-F16) | ~85 GB | F16 | GPU, min. 85 GB VRAM | Reference model without quantization |
| [**PLLuM-8x7B-chat-gguf-bf16**](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/tree/main/PLLuM-8x7B-chat-gguf-bf16) | ~85 GB | BF16 | GPU, min. 85 GB VRAM | Alternative full-precision format |
|
|
|
## What is quantization? |
|
|
|
Quantization is the process of reducing the precision of model weights, which lowers memory requirements while maintaining acceptable quality of the generated text. The GGUF format (GPT-Generated Unified Format) is the successor to GGML and enables large language models to run efficiently on consumer hardware.
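As a back-of-envelope illustration (assuming the `bc` calculator is available): a Mixtral-style 8x7B architecture has roughly 46.7 billion parameters in total, and the bits-per-weight figure for Q4_K_M below is approximate; real GGUF files differ slightly because some tensors are kept at higher precision.

```bash
# Approximate model size = parameters * bits per weight / 8  (result in GB)
echo "F16    : $(echo "scale=1; 46.7 * 16 / 8"   | bc) GB"   # ~93 GB, the full-precision files
echo "Q4_K_M : $(echo "scale=1; 46.7 * 4.85 / 8" | bc) GB"   # ~28 GB, close to the 28.4 GB file above
```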
|
|
|
## Which model to choose? |
|
|
|
- **Q2_K, IQ3_S and Q3_K_M**: The smallest versions of the model, ideal when memory savings are a priority |
|
- **Q4_K_M**: Recommended for most applications - good balance between quality and size |
|
- **Q5_K_M**: Choose when you care about better quality and have the appropriate amount of memory |
|
- **Q8_0**: Highest-quality quantization, with the smallest quality loss compared to the original
|
- **F16/BF16**: Full precision, reference versions without quantization |
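Before choosing a file, it can help to check how much memory your machine actually has, since the table above lists minimums. A quick check with standard OS commands (Linux and macOS shown):

```bash
# Linux: look at the "available" column
free -h

# macOS: total installed RAM in bytes
sysctl -n hw.memsize
```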
|
|
|
## Downloading the model using huggingface-cli
|
|
|
<details> |
|
<summary>Click to see download instructions</summary> |
|
|
|
First, make sure you have the huggingface-cli tool installed: |
|
```bash |
|
pip install -U "huggingface_hub[cli]" |
|
``` |
|
|
|
### Downloading smaller models |
|
To download a specific quantization smaller than 50 GB (e.g., q4_k_m):
|
```bash |
|
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q4_k_m.gguf" --local-dir ./ |
|
``` |
|
|
|
You can also download other quantizations by changing the filename: |
|
```bash |
|
# For q3_k_m version (22.5 GB) |
|
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q3_k_m.gguf" --local-dir ./ |
|
|
|
# For iq3_s version (20.4 GB) |
|
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-iq3_s.gguf" --local-dir ./ |
|
|
|
# For q5_k_m version (33.2 GB) |
|
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q5_k_m.gguf" --local-dir ./ |
|
``` |
|
|
|
### Downloading larger models (split into parts) |
|
The large F16 and BF16 versions are split into multiple parts. To download all parts into a local folder:
|
|
|
```bash |
|
# For F16 version (~85 GB) |
|
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-F16/*" --local-dir ./F16/ |
|
|
|
# For bf16 version (~85 GB) |
|
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-bf16/*" --local-dir ./bf16/ |
|
``` |
|
|
|
### Faster downloads with hf_transfer |
|
To speed up downloads significantly (up to about 1 GB/s), you can use the hf_transfer library:
|
|
|
```bash |
|
# Install hf_transfer |
|
pip install hf_transfer |
|
|
|
# Download with hf_transfer enabled (much faster) |
|
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q4_k_m.gguf" --local-dir ./ |
|
``` |
|
|
|
### Joining split files after downloading |
|
If you downloaded a split model, you can join it using: |
|
|
|
```bash |
|
# On Linux/Mac systems |
|
cat PLLuM-8x7B-chat-gguf-F16.part-* > PLLuM-8x7B-chat-gguf-F16.gguf |
|
|
|
# On Windows systems |
|
copy /b PLLuM-8x7B-chat-gguf-F16.part-* PLLuM-8x7B-chat-gguf-F16.gguf |
|
``` |
|
</details> |
|
|
|
## How to run the model |
|
|
|
### Using llama.cpp |
|
|
|
In these examples, we will use the PLLuM model from our unofficial repository. You can download your preferred quantization from the available models table above. |
|
|
|
Once downloaded, place your model in the `models` directory. |
|
|
|
#### Unix-based systems (Linux, macOS, etc.): |
|
Single prompt (one-shot run):
|
|
|
```bash |
|
./llama-cli -m models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf --prompt "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:" |
|
``` |
|
#### Windows: |
|
Single prompt (one-shot run):
|
|
|
```bash |
|
./llama-cli.exe -m models\PLLuM-8x7B-chat-gguf-q4_k_m.gguf --prompt "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:" |
|
``` |
|
|
|
For detailed and up-to-date information, please refer to the official [llama.cpp documentation](https://github.com/ggml-org/llama.cpp/blob/master/examples/main/README.md). |
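llama.cpp also includes `llama-server`, which exposes an OpenAI-compatible HTTP API on top of a GGUF model. A minimal sketch (the context size, port, and sampling values below are arbitrary examples):

```bash
# Start the server with the chosen quantization
./llama-server -m models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf -c 4096 --port 8080

# In a second terminal: send a chat request to the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Jakie są największe miasta w Polsce?"}],
        "temperature": 0.7,
        "max_tokens": 256
      }'
```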
|
|
|
### Using text-generation-webui |
|
|
|
```bash |
|
# Install text-generation-webui |
|
git clone https://github.com/oobabooga/text-generation-webui.git |
|
cd text-generation-webui |
|
pip install -r requirements.txt |
|
|
|
# Run the server with the selected model |
|
python server.py --model path/to/PLLuM-8x7B-chat-gguf-q4_k_m.gguf |
|
``` |
|
|
|
### Using python and llama-cpp-python |
|
|
|
```python |
|
from llama_cpp import Llama |
|
|
|
# Load the model |
|
llm = Llama( |
|
model_path="path/to/PLLuM-8x7B-chat-gguf-q4_k_m.gguf", |
|
n_ctx=4096, # Context size |
|
n_threads=8, # Number of CPU threads |
|
n_batch=512 # Batch size |
|
) |
|
|
|
# Example usage |
|
prompt = "Pytanie: Jakie są najciekawsze zabytki w Krakowie? Odpowiedź:" |
|
output = llm( |
|
prompt, |
|
max_tokens=512, |
|
temperature=0.7, |
|
top_p=0.95 |
|
) |
|
|
|
print(output["choices"][0]["text"]) |
|
``` |
|
|
|
## About the PLLuM model |
|
|
|
PLLuM (Polish Large Language Model) is an advanced family of Polish language models developed by a consortium of Polish research institutions in a project commissioned by the Polish Ministry of Digital Affairs. This variant (8x7B-chat) has been optimized for conversational (chat) use.
|
|
|
### Model capabilities: |
|
- Generating text in Polish |
|
- Answering questions |
|
- Summarizing texts |
|
- Creating content |
|
- Translation |
|
- Explaining concepts |
|
- Conducting conversations |
|
|
|
## License |
|
|
|
The base PLLuM 8x7B-chat model is distributed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt). Quantized versions are subject to the same license. |
|
|
|
## Authors |
|
|
|
The repository and the quantizations were prepared by [Piotr Bednarski](https://github.com/piotrmaciejbednarski).