Maximum Batch Size Analysis for Llama2 Models
This document summarizes performance testing results for Llama2 models under various configurations. It identifies the maximum batch size each configuration can process without errors and records the corresponding generation time in seconds.
Experiment Details
The experiment varied model size, number of new tokens (num_new_tokens), key-value bit size (kv_bits), and batch size. "Unquantized" indicates configurations without quantization. The objective was to determine stable operating conditions for generating a fixed number of tokens under these configurations.
Models and Configurations
- Models Tested: Llama2 7B and 13B.
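- Measurements: Generation times are reported in seconds, exactly as recorded in the dataset.
- Speedup (Batch Size): The ratio of a configuration's maximum batch size to that of the unquantized configuration with the same model size and num_new_tokens. It measures batch capacity, not generation time.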
Results: Llama2 7B Model Performance
Model Size | num_new_tokens | KV Bits | Max Batch Size | Generation Time (s) | Speedup (Batch Size) |
---|---|---|---|---|---|
7B | 256 | 1 | 764 | 257 | 14.98x |
7B | 256 | 2 | 384 | 124 | 7.53x |
7B | 256 | 4 | 204 | 99 | 4.00x |
7B | 256 | Unquantized | 51 | 75 | 1x |
7B | 512 | 1 | 437 | 352 | 15.07x |
7B | 512 | 2 | 223 | 178 | 7.69x |
7B | 512 | 4 | 114 | 148 | 3.93x |
7B | 512 | Unquantized | 29 | 122 | 1x |
7B | 1024 | 1 | 247 | 454 | 15.44x |
7B | 1024 | 2 | 126 | 300 | 7.88x |
7B | 1024 | 4 | 65 | 283 | 4.06x |
7B | 1024 | Unquantized | 16 | 224 | 1x |
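The Speedup (Batch Size) column can be reproduced from the Max Batch Size column alone. The sketch below recomputes it for the 7B rows, assuming the column is the ratio to the unquantized baseline at the same num_new_tokens; the names `MAX_BATCH_7B` and `batch_size_speedup` are illustrative and not part of the original experiment code.

```python
# Max batch sizes for Llama2 7B, copied from the table above.
# Layout: num_new_tokens -> {KV bits label: max batch size}
MAX_BATCH_7B = {
    256: {"1": 764, "2": 384, "4": 204, "Unquantized": 51},
    512: {"1": 437, "2": 223, "4": 114, "Unquantized": 29},
    1024: {"1": 247, "2": 126, "4": 65, "Unquantized": 16},
}


def batch_size_speedup(max_batch, baseline_key="Unquantized"):
    """Return {KV bits label: max batch size / baseline max batch size}."""
    baseline = max_batch[baseline_key]
    return {kv: round(size / baseline, 2) for kv, size in max_batch.items()}


if __name__ == "__main__":
    for tokens, sizes in MAX_BATCH_7B.items():
        print(tokens, batch_size_speedup(sizes))
```

The same function applies to the 13B results if their batch sizes are supplied in the same layout.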
Results: Llama2 13B Model Performance
Model Size | num_new_tokens | KV Bits | Max Batch Size | Generation Time (s) | Speedup (Batch Size) |
---|---|---|---|---|---|
13B | 256 | 1 | 154 | 83 | 14.00x |
13B | 256 | 2 | 88 | 63 | 8.00x |
13B | 256 | 4 | 45 | 62 | 4.09x |
13B | 256 | Unquantized | 11 | 33 | 1x |
13B | 512 | 1 | 100 | 144 | 16.67x |
13B | 512 | 2 | 51 | 98 | 8.50x |
13B | 512 | 4 | 26 | 108 | 4.33x |
13B | 512 | Unquantized | 6 | 60 | 1x |
13B | 1024 | 1 | 58 | 260 | 19.33x |
13B | 1024 | 2 | 29 | 173 | 9.67x |
13B | 1024 | 4 | 15 | 216 | 5.00x |
13B | 1024 | Unquantized | 3 | 118 | 1x |
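Because larger batches also take longer to finish, it can help to convert each row into an aggregate throughput figure. The sketch below does this for a few rows, under the assumption (not stated in the source) that the reported generation time covers the whole batch and that every sequence in the batch generates num_new_tokens tokens. The function name `tokens_per_second` is illustrative.

```python
def tokens_per_second(batch_size, num_new_tokens, generation_time_s):
    """Aggregate generated tokens per second for one batch.

    Assumes generation_time_s is the wall-clock time for the full batch.
    """
    return batch_size * num_new_tokens / generation_time_s


# (model, num_new_tokens, KV bits, max batch size, generation time in s),
# copied from the tables above.
ROWS = [
    ("7B", 256, "1", 764, 257),
    ("7B", 256, "Unquantized", 51, 75),
    ("13B", 256, "1", 154, 83),
    ("13B", 256, "Unquantized", 11, 33),
]

if __name__ == "__main__":
    for model, tokens, kv_bits, batch, seconds in ROWS:
        rate = tokens_per_second(batch, tokens, seconds)
        print(f"{model} kv_bits={kv_bits}: {rate:.1f} tokens/s")
```

Under that assumption, the 1-bit rows deliver several times the aggregate throughput of the unquantized rows even though each batch takes longer to complete.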
Recommendations
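KV Bits Influence: Lower KV bit widths allow larger maximum batch sizes. Relative to the unquantized configuration, 1-bit settings fit roughly 14x to 19x the batch size, 2-bit settings roughly 7.5x to 9.7x, and 4-bit settings roughly 4x to 5x. This highlights the importance of key/value cache memory in batch processing.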
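Optimal Configuration Selection: Choose the KV bits setting according to operational needs. Where throughput is critical, a lower KV bits setting is advisable because it fits more sequences into each batch. Where latency is critical, note that the unquantized runs finished fastest in every tested scenario (for example, 75 s versus 257 s for the 7B model at 256 new tokens), although at much smaller batch sizes.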
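One way to see why KV bits drive the maximum batch size is to estimate the KV cache footprint per token. The sketch below is a back-of-the-envelope estimate, not the experiment's own accounting. It assumes the unquantized cache is stored in 16 bits per element, uses the published Llama2 shapes (32 layers with hidden size 4096 for 7B, 40 layers with hidden size 5120 for 13B), and ignores quantization metadata such as scales and zero points, as well as weights and activations. `ModelShape` and `kv_bytes_per_token` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    n_layers: int
    hidden_size: int


# Published Llama2 shapes; both sizes use standard multi-head attention.
LLAMA2_7B = ModelShape(n_layers=32, hidden_size=4096)
LLAMA2_13B = ModelShape(n_layers=40, hidden_size=5120)


def kv_bytes_per_token(shape, kv_bits):
    """Idealized KV cache bytes per token (no quantization metadata)."""
    # Each layer stores one key and one value vector of hidden_size elements.
    elements = 2 * shape.n_layers * shape.hidden_size
    return elements * kv_bits / 8


if __name__ == "__main__":
    # 16 bits stands in for "Unquantized" (an assumption).
    for bits in (16, 4, 2, 1):
        mib = kv_bytes_per_token(LLAMA2_7B, bits) / 2**20
        print(f"7B, {bits}-bit KV: {mib:.4f} MiB per token")
```

Under these assumptions the idealized capacity gains are 16/kv_bits, that is 16x, 8x, and 4x, which is close to most of the measured batch size ratios in the tables.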
Averaged Speedup Analysis
1-bit Quantization: Averaged across all tested scenarios, the maximum batch size is approximately 15.58x that of the unquantized configuration.
2-bit Quantization: The maximum batch size averages approximately 8.02x that of the unquantized configuration.
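An average of this kind can be recomputed directly from the Speedup (Batch Size) column. The sketch below takes an unweighted arithmetic mean over the six rows per KV bits setting (three token lengths for each of the two models). This aggregation is an assumption; the source does not say how its averages were formed, so the results need not match the figures quoted above. `SPEEDUPS` and `average_speedups` are illustrative names.

```python
from statistics import mean

# Speedup (Batch Size) values copied from the 7B and 13B tables,
# ordered as 7B at 256/512/1024 tokens, then 13B at 256/512/1024 tokens.
SPEEDUPS = {
    "1-bit": [14.98, 15.07, 15.44, 14.00, 16.67, 19.33],
    "2-bit": [7.53, 7.69, 7.88, 8.00, 8.50, 9.67],
    "4-bit": [4.00, 3.93, 4.06, 4.09, 4.33, 5.00],
}


def average_speedups(speedups):
    """Unweighted arithmetic mean of the tabulated speedups per setting."""
    return {setting: mean(values) for setting, values in speedups.items()}


if __name__ == "__main__":
    for setting, avg in average_speedups(SPEEDUPS).items():
        print(f"{setting}: {avg:.2f}x")
```

A weighted mean, a geometric mean, or a ratio of summed batch sizes would each give somewhat different numbers, so it is worth stating the aggregation alongside any averaged figure.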