---
license: apache-2.0
language:
- pl
base_model:
- CYFRAGOVPL/PLLuM-8x7B-chat
tags:
- polish
- llm
- quantized
- gguf
- mixtral
- llama
library_name: transformers
pipeline_tag: text-generation
---

<p align="center">
  <img src="https://i.imgur.com/e9226KU.png">
</p>

# PLLuM-8x7B-chat GGUF (Unofficial)

This repository contains quantized versions of the [PLLuM-8x7B-chat](https://huggingface.co/CYFRAGOVPL/PLLuM-8x7B-chat) model in GGUF format, optimized for local execution using [llama.cpp](https://github.com/ggerganov/llama.cpp) and related tools. Quantization allows for a significant reduction in model size while maintaining good quality of generated text, enabling it to run on standard hardware.

This is the only repository that provides the PLLuM-8x7B-chat model in both the full-precision **F16** and **BF16** formats, as well as an **IQ3_S** quantization.

The GGUF files can be run in, among other tools, [LM Studio](https://lmstudio.ai/) or [Ollama](https://ollama.com/) (see the Ollama example in the "How to run the model" section below).

## Available models

| Filename | Size | Quantization type | Recommended hardware | Usage |
|-------------|---------|-----------------|-----------------|--------------|
| [PLLuM-8x7B-chat-gguf-q2_k.gguf](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-q2_k.gguf) | 17 GB | Q2_K | CPU, min. 20 GB RAM | Low-end machines, lowest quality |
| [**PLLuM-8x7B-chat-gguf-iq3_s.gguf**](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-iq3_s.gguf) | 20.4 GB | IQ3_S | CPU, min. 24 GB RAM | Weaker machines, acceptable quality |
| [PLLuM-8x7B-chat-gguf-q3_k_m.gguf](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-q3_k_m.gguf) | 22.5 GB | Q3_K_M | CPU, min. 26 GB RAM | Good compromise between size and quality |
| [PLLuM-8x7B-chat-gguf-q4_k_m.gguf](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-q4_k_m.gguf) | 28.4 GB | Q4_K_M | CPU/GPU, min. 32 GB RAM | Recommended for most applications |
| [PLLuM-8x7B-chat-gguf-q5_k_m.gguf](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-q5_k_m.gguf) | 33.2 GB | Q5_K_M | CPU/GPU, min. 40 GB RAM | High quality with reasonable size |
| [PLLuM-8x7B-chat-gguf-q8_0.gguf](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/blob/main/PLLuM-8x7B-chat-gguf-q8_0.gguf) | 49.6 GB | Q8_0 | GPU, min. 52 GB RAM | Highest quality, close to the original |
| [**PLLuM-8x7B-chat-gguf-F16**](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/tree/main/PLLuM-8x7B-chat-gguf-F16) | ~85 GB | F16 | GPU, min. 85 GB VRAM | Reference model without quantization |
| [**PLLuM-8x7B-chat-gguf-bf16**](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF/tree/main/PLLuM-8x7B-chat-gguf-bf16) | ~85 GB | BF16 | GPU, min. 85 GB VRAM | Alternative full-precision format |

## What is quantization?

Quantization is the process of reducing the precision of model weights, which decreases memory requirements while maintaining acceptable quality of generated text. The GGUF (GPT-Generated Unified Format) format is the successor to the GGML format, which enables efficient running of large language models on consumer hardware.
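
Quantized GGUF files like the ones in this repository are typically produced with llama.cpp's own conversion and quantization tools. A rough sketch of that workflow, assuming a recent llama.cpp checkout (script and binary names have changed between versions, so treat this as illustrative rather than the exact commands used here):

```bash
# Convert the original Hugging Face checkpoint to a full-precision GGUF file
# (convert_hf_to_gguf.py ships with llama.cpp; older checkouts call it convert-hf-to-gguf.py)
python convert_hf_to_gguf.py /path/to/PLLuM-8x7B-chat --outtype f16 --outfile PLLuM-8x7B-chat-gguf-F16.gguf

# Quantize the F16 file down to one of the formats listed above, e.g. Q4_K_M
./llama-quantize PLLuM-8x7B-chat-gguf-F16.gguf PLLuM-8x7B-chat-gguf-q4_k_m.gguf Q4_K_M
```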

## Which model to choose?

- **Q2_K, IQ3_S and Q3_K_M**: The smallest versions of the model, ideal when memory savings are a priority
- **Q4_K_M**: Recommended for most applications - good balance between quality and size
- **Q5_K_M**: Choose when you care about better quality and have the appropriate amount of memory
- **Q8_0**: Highest quality on GPU, smallest quality decrease compared to the original
- **F16/BF16**: Full precision, reference versions without quantization

# Downloading the model using huggingface-cli

<details>
  <summary>Click to see download instructions</summary>

First, make sure you have the huggingface-cli tool installed:
```bash
pip install -U "huggingface_hub[cli]"
```

### Downloading smaller models
To download a specific model smaller than 50GB (e.g., q4_k_m):
```bash
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q4_k_m.gguf" --local-dir ./
```

You can also download other quantizations by changing the filename:
```bash
# For q3_k_m version (22.5 GB)
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q3_k_m.gguf" --local-dir ./

# For iq3_s version (20.4 GB)
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-iq3_s.gguf" --local-dir ./

# For q5_k_m version (33.2 GB)
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q5_k_m.gguf" --local-dir ./
```

### Downloading larger models (split into parts)
For large models, such as the F16 or BF16 versions, the files are split into smaller parts. To download all parts to a local folder:

```bash
# For F16 version (~85 GB)
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-F16/*" --local-dir ./F16/

# For bf16 version (~85 GB)
huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-bf16/*" --local-dir ./bf16/
```

### Faster downloads with hf_transfer
To significantly speed up downloading (up to 1GB/s), you can use the hf_transfer library:

```bash
# Install hf_transfer
pip install hf_transfer

# Download with hf_transfer enabled (much faster)
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF --include "PLLuM-8x7B-chat-gguf-q4_k_m.gguf" --local-dir ./
```

### Joining split files after downloading
If you downloaded a split model, you can join it using:

```bash
# On Linux/Mac systems
cat PLLuM-8x7B-chat-gguf-F16.part-* > PLLuM-8x7B-chat-gguf-F16.gguf

# On Windows systems
copy /b PLLuM-8x7B-chat-gguf-F16.part-* PLLuM-8x7B-chat-gguf-F16.gguf
```
</details>

## How to run the model

### Using llama.cpp

In these examples, we will use the PLLuM model from our unofficial repository. You can download your preferred quantization from the available models table above.

Once downloaded, place your model in the `models` directory.

#### Unix-based systems (Linux, macOS, etc.):
Input prompt (One-and-done)

```bash
./llama-cli -m models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf --prompt "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
```
#### Windows:
Input prompt (One-and-done)

```bash
.\llama-cli.exe -m models\PLLuM-8x7B-chat-gguf-q4_k_m.gguf --prompt "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
```
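
If you have a GPU, you can also offload part of the model and run an interactive chat session. A minimal sketch, assuming a recent llama.cpp build (adjust `-ngl` to however many layers fit in your VRAM):

```bash
# Interactive conversation mode, 4096-token context, 16 layers offloaded to the GPU
./llama-cli -m models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf -cnv -c 4096 -ngl 16
```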

For detailed and up-to-date information, please refer to the official [llama.cpp documentation](https://github.com/ggml-org/llama.cpp/blob/master/examples/main/README.md).

### Using text-generation-webui

```bash
# Install text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt

# Run the server with the selected model
python server.py --model path/to/PLLuM-8x7B-chat-gguf-q4_k_m.gguf
```

### Using Python and llama-cpp-python

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="path/to/PLLuM-8x7B-chat-gguf-q4_k_m.gguf",
    n_ctx=4096,     # Context size
    n_threads=8,    # Number of CPU threads
    n_batch=512     # Batch size
)

# Example usage
prompt = "Pytanie: Jakie są najciekawsze zabytki w Krakowie? Odpowiedź:"
output = llm(
    prompt,
    max_tokens=512,
    temperature=0.7,
    top_p=0.95
)

print(output["choices"][0]["text"])
```
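
### Using Ollama

The GGUF files also work with [Ollama](https://ollama.com/), mentioned above. A minimal sketch, assuming Ollama is installed and the q4_k_m file has been downloaded to the current directory (the model name `pllum-8x7b-chat` is arbitrary):

```bash
# Create a minimal Modelfile that points at the local GGUF file
cat > Modelfile <<'EOF'
FROM ./PLLuM-8x7B-chat-gguf-q4_k_m.gguf
PARAMETER temperature 0.7
EOF

# Register the model locally and start chatting
ollama create pllum-8x7b-chat -f Modelfile
ollama run pllum-8x7b-chat "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:"
```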

## About the PLLuM model

PLLuM (Polish Large Language Model) is a family of Polish language models developed by a consortium of Polish research institutions under the auspices of the Polish Ministry of Digital Affairs. This version of the model (8x7B-chat) has been optimized for conversation (chat).

### Model capabilities:
- Generating text in Polish
- Answering questions
- Summarizing texts
- Creating content
- Translation
- Explaining concepts
- Conducting conversations

## License

The base PLLuM 8x7B-chat model is distributed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt). Quantized versions are subject to the same license.

## Authors

The author of this repository and the quantizations is [Piotr Bednarski](https://github.com/piotrmaciejbednarski).