---
tags:
- ctranslate2
---
"Ctranslate2" is an amazing library that runs these models.  They are faster, more accurate, and use less VRAM/RAM than GGML and GPTQ models.

Instructions on how to run these models: https://github.com/BBC-Esq
- COMING SOON

Learn more about the amazing CTranslate2 technology:
- https://github.com/OpenNMT/CTranslate2
- https://opennmt.net/CTranslate2/index.html
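
As a quick illustration of what running a CTranslate2 model looks like, here is a minimal sketch. The directory name `llama-2-7b-ct2` is a placeholder for a model already converted to the CTranslate2 format, and the tokenizer is loaded with `transformers`:

```python
import ctranslate2
from transformers import AutoTokenizer

# "llama-2-7b-ct2" is a hypothetical local directory containing a model
# already converted to the CTranslate2 format.
generator = ctranslate2.Generator("llama-2-7b-ct2", device="cuda", compute_type="int8")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

prompt = "What is CTranslate2?"
# generate_batch works on token strings, not raw text.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([tokens], max_length=128, sampling_temperature=0.7)
output_ids = results[0].sequences_ids[0]
print(tokenizer.decode(output_ids))
```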

<details>
<summary><b>Compatibility and Data Formats</b></summary>

| Format          | Approximate Size vs. `float32` | Minimum Nvidia Compute Capability | Accuracy Summary |
|-----------------|--------------------------------|-----------------------------------|------------------|
| `float32`       | 100%                       | 1.0        | Full precision and the widest range; most un-quantized models use this. |
| `int16`         | 51.37%                     | 1.0        | Same precision tier as `int8` but with a much larger range. |
| `float16`       | 50.00%                     | 5.3 (e.g. Nvidia 10 Series and higher)  | Half-precision float; a balance between precision and memory. |
| `bfloat16`      | 50.00%                     | 8.0 (e.g. Nvidia 30 Series and higher)  | Often used in neural network training; a larger exponent range than `float16`, with less precision. |
| `int8_float32`  | 27.47%                     | test manually (see below)             | Low-precision integer weights combined with high-precision float computation. |
| `int8_float16`  | 26.10%                     | test manually (see below)             | Low-precision integer weights combined with medium-precision float computation; saves memory. |
| `int8_bfloat16` | 26.10%                     | test manually (see below)             | Low-precision integer weights combined with reduced-precision float computation; efficient for neural nets. |
| `int8`          | 25%                        | 1.0        | Lowest precision; suitable for values within a limited range. Often used where memory is crucial. |

| Web Link                                                                      | Description                                                                                                         |
|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| [CUDA GPUs Supported](https://en.wikipedia.org/wiki/CUDA#GPUs_supported)       | See which compute capability your Nvidia GPU supports.                                                               |
| [CTranslate2 Quantization](https://opennmt.net/CTranslate2/quantization.html#implicit-type-conversion-on-load) | Even if your GPU/CPU doesn't support the data type of the model you download, "ctranslate2" will automatically run the model in a way that's compatible. |
| [Bfloat16 Floating-Point Format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format#bfloat16_floating-point_format) | Visualize data formats.                                                                                                           |
| [Nvidia Floating-Point](https://docs.nvidia.com/cuda/floating-point/index.html) | Technical discussion.                                                                                    |
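
Because of the implicit type conversion described above, you can also ask CTranslate2 to pick a compute type for you at load time. A minimal sketch (the directory name `model-ct2` is a placeholder for any converted model):

```python
import ctranslate2

# "model-ct2" stands in for any directory containing a converted model.
# compute_type="auto" asks CTranslate2 for the fastest type this device
# supports; requesting an unsupported type is converted implicitly on load.
generator = ctranslate2.Generator("model-ct2", device="cuda", compute_type="auto")
```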
</details>

<details>
<summary><b>Check Compatibility Manually</b></summary>

Open a command prompt and run the following commands (the CUDA Toolkit and cuDNN may also need to be installed; double-check this on your system):

   ```bash
   pip install ctranslate2
   ```

   ```bash
   python
   ```

   ```python
   import ctranslate2
   ```

Check GPU/CUDA compatibility:

   ```python
   ctranslate2.get_supported_compute_types("cuda")
   ```

Check CPU compatibility:

   ```python
   ctranslate2.get_supported_compute_types("cpu")
   ```

Each call prints the compute types your CPU or GPU supports. For example, a system with an RTX 4090 GPU and an Intel i9-13900K CPU has the following compatibility:

|                     | **CPU** | **GPU** |
|---------------------|---------|---------|
| **`float32`**       | ✅      | ✅      |
| **`int16`**         | ✅      |         |
| **`float16`**       |         | ✅      |
| **`bfloat16`**      |         | ✅      |
| **`int8_float32`**  | ✅      | ✅      |
| **`int8_float16`**  |         | ✅      |
| **`int8_bfloat16`** |         | ✅      |
| **`int8`**          | ✅      | ✅      |
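
If you want to act on this result programmatically, here is a small sketch; the preference order below is just one reasonable choice (smallest memory footprint first), not something mandated by the library:

```python
import ctranslate2

# Hypothetical preference order: most memory-efficient types first.
preferred = ["int8_bfloat16", "int8_float16", "int8", "bfloat16", "float16", "float32"]

supported = ctranslate2.get_supported_compute_types("cuda")
compute_type = next(t for t in preferred if t in supported)
print(f"Loading with compute_type={compute_type}")
```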
</details>

![Comparison of ctranslate2 and ggml](https://huggingface.co/ctranslate2-4you/Llama-2-7b-chat-hf-ct2-int8/resolve/main/comparison%20of%20ctranslate2%20and%20ggml.png)