Maximum Batch Size Analysis for Llama2 Models

This document summarizes performance testing results for Llama2 models under various configurations. For each configuration, it identifies the maximum batch size that can be processed without errors and reports the corresponding generation time in seconds.

Experiment Details

The experiment varied model size, the number of new tokens (num_new_tokens), the key-value bit size (kv_bits), and the batch size. "Unquantized" denotes configurations run without quantization. The objective was to find, for each configuration, the largest batch size that generates the fixed number of tokens without errors.
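The source does not describe how each maximum batch size was located. One plausible approach is a binary search over batch sizes, sketched below under the assumption that if a batch size fails, every larger batch size fails too. The function `find_max_batch_size`, the callable `fits`, and the `upper` bound are all hypothetical names introduced for illustration; a real `fits` would run generation and report whether it completed without errors.

```python
from typing import Callable


def find_max_batch_size(fits: Callable[[int], bool], upper: int = 4096) -> int:
    """Return the largest batch size in [0, upper] for which fits(batch_size) is True.

    Assumes monotonic behaviour: once a batch size fails, all larger ones fail.
    Returns 0 if no batch size in the range succeeds.
    """
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid       # mid works; the answer is mid or larger
        else:
            hi = mid - 1   # mid fails; the answer is strictly smaller
    return lo


# Stand-in for a real generation run (hypothetical): pretend that batches
# of up to 51 sequences succeed and anything larger raises an error.
def fake_fits(batch_size: int) -> bool:
    return batch_size <= 51


print(find_max_batch_size(fake_fits))  # prints 51
```

A binary search needs only about a dozen trial runs for an upper bound of 4096, which matters when each trial is a full generation pass.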

Models and Configurations

  • Models Tested: Llama2 7B and 13B.
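  • Measurements: Generation times are directly reported in seconds as provided by the dataset.
  • Speedup (Batch Size): The ratio of a configuration's maximum batch size to the unquantized maximum batch size for the same model and num_new_tokens.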

Results: Llama2 7B Model Performance

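| Model Size | num_new_tokens | KV Bits | Max Batch Size | Generation Time (s) | Speedup (Batch Size) |
|---|---|---|---|---|---|
| 7B | 256 | 1 | 764 | 257 | 14.98x |
| 7B | 256 | 2 | 384 | 124 | 7.53x |
| 7B | 256 | 4 | 204 | 99 | 4.00x |
| 7B | 256 | Unquantized | 51 | 75 | 1x |
| 7B | 512 | 1 | 437 | 352 | 15.07x |
| 7B | 512 | 2 | 223 | 178 | 7.69x |
| 7B | 512 | 4 | 114 | 148 | 3.93x |
| 7B | 512 | Unquantized | 29 | 122 | 1x |
| 7B | 1024 | 1 | 247 | 454 | 15.44x |
| 7B | 1024 | 2 | 126 | 300 | 7.88x |
| 7B | 1024 | 4 | 65 | 283 | 4.06x |
| 7B | 1024 | Unquantized | 16 | 224 | 1x |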
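The Speedup (Batch Size) column in the 7B table can be reproduced by dividing each maximum batch size by the unquantized maximum batch size for the same num_new_tokens. The sketch below does this with values copied from the table; the dictionary layout and the helper name `batch_speedup` are illustrative choices, not part of the original experiment.

```python
# (num_new_tokens, kv_bits) -> max batch size, copied from the 7B table
max_batch_7b = {
    (256, "1"): 764, (256, "2"): 384, (256, "4"): 204, (256, "unquantized"): 51,
    (512, "1"): 437, (512, "2"): 223, (512, "4"): 114, (512, "unquantized"): 29,
    (1024, "1"): 247, (1024, "2"): 126, (1024, "4"): 65, (1024, "unquantized"): 16,
}


def batch_speedup(table: dict, num_new_tokens: int, kv_bits: str) -> float:
    """Ratio of a configuration's max batch size to the unquantized baseline."""
    baseline = table[(num_new_tokens, "unquantized")]
    return table[(num_new_tokens, kv_bits)] / baseline


for (tokens, bits), _ in max_batch_7b.items():
    ratio = batch_speedup(max_batch_7b, tokens, bits)
    print(f"{tokens:>5} tokens, KV bits {bits:>11}: {ratio:.2f}x")
```

The same calculation applies to the 13B results by substituting that table's batch sizes.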

Results: Llama2 13B Model Performance

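| Model Size | num_new_tokens | KV Bits | Max Batch Size | Generation Time (s) | Speedup (Batch Size) |
|---|---|---|---|---|---|
| 13B | 256 | 1 | 154 | 83 | 14.00x |
| 13B | 256 | 2 | 88 | 63 | 8.00x |
| 13B | 256 | 4 | 45 | 62 | 4.09x |
| 13B | 256 | Unquantized | 11 | 33 | 1x |
| 13B | 512 | 1 | 100 | 144 | 16.67x |
| 13B | 512 | 2 | 51 | 98 | 8.50x |
| 13B | 512 | 4 | 26 | 108 | 4.33x |
| 13B | 512 | Unquantized | 6 | 60 | 1x |
| 13B | 1024 | 1 | 58 | 260 | 19.33x |
| 13B | 1024 | 2 | 29 | 173 | 9.67x |
| 13B | 1024 | 4 | 15 | 216 | 5.00x |
| 13B | 1024 | Unquantized | 3 | 118 | 1x |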

Recommendations

  1. KV Bits Influence: Configurations with quantized KV bits support substantially larger maximum batch sizes than unquantized ones: roughly 4x at 4 bits, 8x at 2 bits, and 14x or more at 1 bit. This highlights the importance of key/value storage management in batch processing.

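  2. Optimal Configuration Selection: Choose the KV bits setting according to operational needs (e.g., low latency vs. high throughput). Where throughput is critical, a lower KV bits setting is advisable because it admits larger batches. Where latency is critical, note that generation time at the maximum batch size also grows as KV bits decrease (for 7B with 256 new tokens, 75 s unquantized vs. 257 s at 1 bit).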
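To make the latency versus throughput trade-off concrete, a rough tokens-per-second figure can be derived from the tables. The sketch below assumes that every sequence in a batch generates exactly num_new_tokens and that the reported generation time covers the whole batch; neither assumption is stated in the source, and the helper `tokens_per_second` is a hypothetical name. The rows are the 7B, 256-token configurations.

```python
def tokens_per_second(max_batch: int, num_new_tokens: int, gen_time_s: float) -> float:
    """Aggregate generated tokens per second for one full batch.

    Assumes each of the max_batch sequences produces num_new_tokens tokens
    and that gen_time_s is the wall-clock time for the entire batch.
    """
    return max_batch * num_new_tokens / gen_time_s


# (kv_bits, max batch size, generation time in seconds) for 7B, 256 new tokens
rows_7b_256 = [
    ("1", 764, 257),
    ("2", 384, 124),
    ("4", 204, 99),
    ("unquantized", 51, 75),
]

for kv_bits, batch, seconds in rows_7b_256:
    rate = tokens_per_second(batch, 256, seconds)
    print(f"KV bits {kv_bits:>11}: {rate:7.1f} tokens/s, {seconds} s per batch")
```

Under these assumptions the four rows work out to roughly 761, 793, 528, and 174 tokens per second. Every quantized setting beats the unquantized baseline on aggregate throughput, though in this particular row the 2-bit setting edges out the 1-bit setting, so it is worth computing the figure for the configuration of interest instead of assuming the lowest bit setting always wins.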

Averaged Speedup Analysis

  • 1-bit Quantization: On average, supports a maximum batch size approximately 15.58x larger than the unquantized configuration across all tested scenarios.

  • 2-bit Quantization: On average, supports a maximum batch size approximately 8.02x larger than the unquantized configuration.
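The source does not state how these averages were computed. As a cross-check, the sketch below takes the unweighted arithmetic mean of the Speedup (Batch Size) values listed in the two results tables (three 7B rows followed by three 13B rows per bit setting). The choice of an unweighted mean is an assumption.

```python
from statistics import mean

# Speedup (Batch Size) values copied from the 7B and 13B tables,
# ordered as 7B 256/512/1024 tokens, then 13B 256/512/1024 tokens.
speedups = {
    "1-bit": [14.98, 15.07, 15.44, 14.00, 16.67, 19.33],
    "2-bit": [7.53, 7.69, 7.88, 8.00, 8.50, 9.67],
    "4-bit": [4.00, 3.93, 4.06, 4.09, 4.33, 5.00],
}

averages = {bits: mean(values) for bits, values in speedups.items()}

for bits, avg in averages.items():
    print(f"{bits}: {avg:.2f}x average batch-size speedup")
```

With this method the tabulated values average to about 15.9x for 1-bit, 8.2x for 2-bit, and 4.2x for 4-bit. The 1-bit and 2-bit results are slightly higher than the figures quoted above, which may reflect a different averaging method or unrounded underlying data.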