# Maximum Batch Size Analysis for Llama2 Models

This document summarizes performance testing results for Llama2 models under various configurations. The focus is on identifying the maximum batch size that can be processed without errors in each configuration and on documenting the corresponding generation time in seconds.

## Experiment Details

The experiment varied the model size, the number of new tokens (`num_new_tokens`), the key-value (KV) bit width (`kv_bits`), and the batch size. "Unquantized" denotes configurations without quantization. The objective was to find, for each configuration, the largest batch size at which a fixed number of tokens can be generated without errors.
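The source does not say how each maximum was located. One common approach, shown here as a minimal sketch, is to double the batch size until a run fails and then binary-search the gap. The `fits` callable is hypothetical: in practice it would run generation at the given batch size and return `False` on an error such as out-of-memory. The sketch also assumes failures are monotonic, so once a batch size fails, every larger one fails too.

```python
from typing import Callable


def find_max_batch_size(fits: Callable[[int], bool], upper_limit: int = 4096) -> int:
    """Return the largest batch size in [1, upper_limit] for which fits() is True.

    Returns 0 if even a batch size of 1 fails.
    `fits` is a hypothetical predicate and is assumed to be monotonic.
    """
    if not fits(1):
        return 0

    # Exponential phase: double until a run fails or the limit is passed.
    lo = 1  # largest batch size known to succeed
    hi = 2  # candidate to try next
    while hi <= upper_limit and fits(hi):
        lo = hi
        hi *= 2

    # hi is now either a known failure or beyond the limit; cap it just
    # past the limit so the search never probes above upper_limit.
    hi = min(hi, upper_limit + 1)

    # Binary search: invariant is that lo succeeds and hi is out of bounds.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Stand-in predicate that pretends anything above 51 sequences fails.
    print(find_max_batch_size(lambda batch_size: batch_size <= 51))
```

The doubling phase keeps the number of failed (and typically slow) runs small, since only one failure is needed before the binary search takes over.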

### Models and Configurations

- **Models Tested:** Llama2 7B and 13B.
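- **Measurements:** Generation times are reported in seconds, as provided by the dataset. The speedup column is the ratio of each configuration's maximum batch size to that of the unquantized run with the same model size and `num_new_tokens`.

## Results: Llama2 7B Model Performance
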
| Model Size | num_new_tokens | KV Bits     | Max Batch Size | Generation Time (s) | Speedup (Batch Size) |
|------------|----------------|-------------|----------------|----------------------|-----------------------|
| 7B         | 256            | 1           | 764            | 257                  | 14.98x                |
| 7B         | 256            | 2           | 384            | 124                  | 7.53x                 |
| 7B         | 256            | 4           | 204            | 99                   | 4.00x                 |
| 7B         | 256            | Unquantized | 51             | 75                   | 1x                    |
| 7B         | 512            | 1           | 437            | 352                  | 15.07x                |
| 7B         | 512            | 2           | 223            | 178                  | 7.69x                 |
| 7B         | 512            | 4           | 114            | 148                  | 3.93x                 |
| 7B         | 512            | Unquantized | 29             | 122                  | 1x                    |
| 7B         | 1024           | 1           | 247            | 454                  | 15.44x                |
| 7B         | 1024           | 2           | 126            | 300                  | 7.88x                 |
| 7B         | 1024           | 4           | 65             | 283                  | 4.06x                 |
| 7B         | 1024           | Unquantized | 16             | 224                  | 1x                    |


## Results: Llama2 13B Model Performance

| Model Size | num_new_tokens | KV Bits     | Max Batch Size | Generation Time (s) | Speedup (Batch Size) |
|------------|----------------|-------------|----------------|----------------------|-----------------------|
| 13B        | 256            | 1           | 154            | 83                   | 14.00x                |
| 13B        | 256            | 2           | 88             | 63                   | 8.00x                 |
| 13B        | 256            | 4           | 45             | 62                   | 4.09x                 |
| 13B        | 256            | Unquantized | 11             | 33                   | 1x                    |
| 13B        | 512            | 1           | 100            | 144                  | 16.67x                |
| 13B        | 512            | 2           | 51             | 98                   | 8.50x                 |
| 13B        | 512            | 4           | 26             | 108                  | 4.33x                 |
| 13B        | 512            | Unquantized | 6              | 60                   | 1x                    |
| 13B        | 1024           | 1           | 58             | 260                  | 19.33x                |
| 13B        | 1024           | 2           | 29             | 173                  | 9.67x                 |
| 13B        | 1024           | 4           | 15             | 216                  | 5.00x                 |
| 13B        | 1024           | Unquantized | 3              | 118                  | 1x                    |



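## Recommendations

1. **KV Bits Influence**: Quantized configurations handle much larger batches than unquantized ones, and the gain grows as the bit width shrinks: roughly 4x to 5x at 4 bits, 7.5x to 9.7x at 2 bits, and 14x to 19x at 1 bit. This highlights the importance of key/value storage management in batch processing.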

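2. **Optimal Configuration Selection**: Choose the KV bits setting according to operational needs (e.g., low latency vs. high throughput). Where throughput is critical, a lower KV bits setting is advisable because it admits a larger batch. Note that generation time at the maximum batch size also rises (e.g., 257 s at 1 bit vs. 75 s unquantized for 7B with 256 new tokens), so latency-sensitive workloads may prefer a higher bit width or a smaller batch.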
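To weigh latency against throughput, it can help to convert a table row into generated tokens per second. The sketch below assumes that the reported generation time covers the whole batch and that every sequence in the batch produces `num_new_tokens` tokens; neither is stated explicitly in the source. The function name is illustrative.

```python
def tokens_per_second(batch_size: int, num_new_tokens: int, generation_time_s: float) -> float:
    """Aggregate throughput for one run.

    Assumes every sequence in the batch generates num_new_tokens tokens and
    that generation_time_s is the wall-clock time for the whole batch.
    """
    return batch_size * num_new_tokens / generation_time_s


# (kv_bits, max_batch_size, generation_time_s) copied from the
# Llama2 7B rows with num_new_tokens = 256.
ROWS_7B_256 = [
    ("1", 764, 257),
    ("2", 384, 124),
    ("4", 204, 99),
    ("Unquantized", 51, 75),
]

if __name__ == "__main__":
    for kv_bits, batch_size, seconds in ROWS_7B_256:
        rate = tokens_per_second(batch_size, 256, seconds)
        print(f"kv_bits={kv_bits:<12} batch={batch_size:<4} {rate:8.1f} tokens/s")
```

Under these assumptions, the setting with the largest batch is not automatically the one with the highest throughput, so it is worth checking both numbers for the target workload.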

## Averaged Speedup Analysis
- **1-bit Quantization:** Averaging the speedup column over all six tested configurations (two model sizes, three `num_new_tokens` values) gives approximately 15.92x the unquantized maximum batch size.

- **2-bit Quantization:** The same average is approximately 8.21x.

- **4-bit Quantization:** The same average is approximately 4.24x.
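The averages can be recomputed from the tabulated maximum batch sizes. This minimal sketch assumes an unweighted mean of the per-configuration ratios (quantized maximum batch size divided by the unquantized one); the source does not state which averaging method was used, and the dictionary layout and function name are illustrative.

```python
from statistics import mean

# Maximum batch sizes copied from the two results tables, keyed by
# (model size, num_new_tokens), then by KV bits setting.
MAX_BATCH = {
    ("7B", 256): {"1": 764, "2": 384, "4": 204, "Unquantized": 51},
    ("7B", 512): {"1": 437, "2": 223, "4": 114, "Unquantized": 29},
    ("7B", 1024): {"1": 247, "2": 126, "4": 65, "Unquantized": 16},
    ("13B", 256): {"1": 154, "2": 88, "4": 45, "Unquantized": 11},
    ("13B", 512): {"1": 100, "2": 51, "4": 26, "Unquantized": 6},
    ("13B", 1024): {"1": 58, "2": 29, "4": 15, "Unquantized": 3},
}


def average_speedup(kv_bits: str) -> float:
    """Unweighted mean of max-batch-size ratios versus the unquantized runs."""
    return mean(row[kv_bits] / row["Unquantized"] for row in MAX_BATCH.values())


if __name__ == "__main__":
    for kv_bits in ("1", "2", "4"):
        print(f"{kv_bits}-bit: {average_speedup(kv_bits):.2f}x")
```

Other choices, such as averaging per model size or weighting by batch size, would give somewhat different figures.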