---
pipeline_tag: token-classification
datasets:
- conll2003
metrics:
- precision
- recall
- f1
- accuracy
tags:
- distilbert
---

**Task:** `token-classification`  
**Backend:** `sagemaker-training`  
**Backend args:** `{'instance_type': 'ml.m5.2xlarge', 'supported_instructions': 'avx512'}`  
**Number of evaluation samples:** `All dataset`  

Fixed parameters:
* **model_name_or_path**: `elastic/distilbert-base-uncased-finetuned-conll03-english`
* **dataset**:
    * **path**: `conll2003`
    * **eval_split**: `validation`
    * **data_keys**: `{'primary': 'tokens'}`
    * **ref_keys**: `['ner_tags']`
    * **calibration_split**: `train`
* **node_exclusion**: `[]`
* **per_channel**: `False`
* **calibration**:
    * **method**: `minmax`
    * **num_calibration_samples**: `100`
* **framework**: `onnxruntime`
* **framework_args**:
    * **opset**: `11`
    * **optimization_level**: `1`
* **aware_training**: `False`
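
The `minmax` calibration method listed above derives the quantization range from the smallest and largest values observed on the calibration samples. The following is a minimal sketch of that idea in plain Python, assuming asymmetric 8-bit unsigned quantization. The function names and the sample activations are hypothetical, and this is not the ONNX Runtime implementation.

```python
def minmax_qparams(values, qmin=0, qmax=255):
    """Derive (scale, zero_point) from the observed minimum and maximum."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, qmin
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an integer in [qmin, qmax], clamping out-of-range values."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to an approximate float."""
    return (q - zero_point) * scale


# Made-up activations standing in for values collected during calibration.
calibration_activations = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zero_point = minmax_qparams(calibration_activations)

for x in calibration_activations:
    q = quantize(x, scale, zero_point)
    print(x, q, dequantize(q, scale, zero_point))
```

Because the range is set by the extremes alone, a single outlier among the calibration samples stretches the range and coarsens the resolution available to all other values.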

Benchmarked parameters:
* **quantization_approach**: `dynamic`,  `static`
* **operators_to_quantize**: `['Add', 'MatMul']`,  `['Add']`
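
The two benchmarked parameters each take two values, which gives the four configurations that appear as rows in every table of this report. A short sketch of how such a grid can be enumerated; the dictionary layout is an assumption for illustration, not the benchmark tool's own format.

```python
from itertools import product

# Values copied from the "Benchmarked parameters" list above.
benchmarked = {
    "quantization_approach": ["dynamic", "static"],
    "operators_to_quantize": [["Add", "MatMul"], ["Add"]],
}

# Cartesian product of all parameter values, one dict per configuration.
configs = [
    dict(zip(benchmarked, combo))
    for combo in product(*benchmarked.values())
]

for config in configs:
    print(config)
```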

# Evaluation
## Non-time metrics
| quantization_approach | operators_to_quantize |     | precision (original) | precision (optimized) |     | recall (original) | recall (optimized) |     | f1 (original) | f1 (optimized) |     | accuracy (original) | accuracy (optimized) |
| :-------------------: | :-------------------: | :-: | :------------------: | :-------------------: | :-: | :---------------: | :----------------: | :-: | :-----------: | :------------: | :-: | :-----------------: | :------------------: |
|       `dynamic`       |  `['Add', 'MatMul']`  |  \|  |        0.936         |         0.935         |  \|  |       0.944       |       0.943        |  \|  |     0.940     |     0.939      |  \|  |        0.988        |        0.988         |
|       `dynamic`       |       `['Add']`       |  \|  |        0.936         |         0.936         |  \|  |       0.944       |       0.944        |  \|  |     0.940     |     0.940      |  \|  |        0.988        |        0.988         |
|       `static`        |  `['Add', 'MatMul']`  |  \|  |        0.936         |         0.063         |  \|  |       0.944       |       0.246        |  \|  |     0.940     |     0.100      |  \|  |        0.988        |        0.343         |
|       `static`        |       `['Add']`       |  \|  |        0.936         |         0.050         |  \|  |       0.944       |       0.160        |  \|  |     0.940     |     0.076      |  \|  |        0.988        |        0.311         |
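
As a sanity check on the table above, the reported F1 values can be compared with the harmonic mean of the reported precision and recall. The sketch below assumes that F1 here is the usual `2PR / (P + R)`; the helper name is hypothetical, and the inputs are the rounded values from the "optimized" columns.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (approach, operators, precision, recall, reported F1), optimized models only.
rows = [
    ("dynamic", "Add, MatMul", 0.935, 0.943, 0.939),
    ("dynamic", "Add", 0.936, 0.944, 0.940),
    ("static", "Add, MatMul", 0.063, 0.246, 0.100),
    ("static", "Add", 0.050, 0.160, 0.076),
]

for approach, operators, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"{approach:8s} {operators:12s} computed={computed:.3f} reported={reported:.3f}")
```

With these inputs, each computed value agrees with the reported one to within rounding. The two `static` configurations lose most of their accuracy, while the `dynamic` ones stay within 0.001 of the original model on every metric.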

## Time metrics
Time benchmarks were run for 15 seconds per config.


Below, time metrics for batch size = 1, input length = 32.

| quantization_approach | operators_to_quantize |     | latency_mean (original, ms) | latency_mean (optimized, ms) |     | throughput (original, /s) | throughput (optimized, /s) |
| :-------------------: | :-------------------: | :-: | :-------------------------: | :--------------------------: | :-: | :-----------------------: | :------------------------: |
|       `dynamic`       |  `['Add', 'MatMul']`  |  \|  |            46.38            |             9.96             |  \|  |           21.60           |           100.47           |
|       `dynamic`       |       `['Add']`       |  \|  |            36.59            |            13.98             |  \|  |           27.33           |           71.60            |
|       `static`        |  `['Add', 'MatMul']`  |  \|  |            33.84            |            14.46             |  \|  |           29.60           |           69.20            |
|       `static`        |       `['Add']`       |  \|  |            33.23            |            20.11             |  \|  |           30.13           |           49.73            |
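
The latency columns can be turned into a speedup factor (original latency divided by optimized latency). The sketch below does this for the batch size = 1, input length = 32 table; the helper names are hypothetical. It also prints `1000 / latency_ms`, which lands close to the reported throughput figures and suggests that throughput is counted in forward passes per second rather than samples per second. That reading is an inference from the numbers, not something the report states.

```python
def speedup(original_ms, optimized_ms):
    """How many times faster the optimized model is than the original."""
    return original_ms / optimized_ms


def passes_per_second(latency_ms):
    """Forward passes per second implied by a mean latency in milliseconds."""
    return 1000.0 / latency_ms


# (approach, operators, original latency ms, optimized latency ms)
rows = [
    ("dynamic", "Add, MatMul", 46.38, 9.96),
    ("dynamic", "Add", 36.59, 13.98),
    ("static", "Add, MatMul", 33.84, 14.46),
    ("static", "Add", 33.23, 20.11),
]

for approach, operators, original_ms, optimized_ms in rows:
    print(
        f"{approach:8s} {operators:12s} "
        f"speedup={speedup(original_ms, optimized_ms):.2f}x "
        f"implied throughput={passes_per_second(optimized_ms):.2f}/s"
    )
```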


Below, time metrics for batch size = 1, input length = 64.

| quantization_approach | operators_to_quantize |     | latency_mean (original, ms) | latency_mean (optimized, ms) |     | throughput (original, /s) | throughput (optimized, /s) |
| :-------------------: | :-------------------: | :-: | :-------------------------: | :--------------------------: | :-: | :-----------------------: | :------------------------: |
|       `dynamic`       |  `['Add', 'MatMul']`  |  \|  |            58.92            |            19.68             |  \|  |           17.00           |           50.87            |
|       `dynamic`       |       `['Add']`       |  \|  |            58.59            |            24.81             |  \|  |           17.13           |           40.33            |
|       `static`        |  `['Add', 'MatMul']`  |  \|  |            51.41            |            29.36             |  \|  |           19.47           |           34.07            |
|       `static`        |       `['Add']`       |  \|  |            44.22            |            38.56             |  \|  |           22.67           |           25.93            |


Below, time metrics for batch size = 1, input length = 128.

| quantization_approach | operators_to_quantize |     | latency_mean (original, ms) | latency_mean (optimized, ms) |     | throughput (original, /s) | throughput (optimized, /s) |
| :-------------------: | :-------------------: | :-: | :-------------------------: | :--------------------------: | :-: | :-----------------------: | :------------------------: |
|       `dynamic`       |  `['Add', 'MatMul']`  |  \|  |            72.38            |            36.47             |  \|  |           13.87           |           27.47            |
|       `dynamic`       |       `['Add']`       |  \|  |            70.21            |            46.30             |  \|  |           14.27           |           21.60            |
|       `static`        |  `['Add', 'MatMul']`  |  \|  |            70.76            |            48.24             |  \|  |           14.13           |           20.80            |
|       `static`        |       `['Add']`       |  \|  |            72.47            |            71.10             |  \|  |           13.80           |           14.07            |


Below, time metrics for batch size = 4, input length = 32.

| quantization_approach | operators_to_quantize |     | latency_mean (original, ms) | latency_mean (optimized, ms) |     | throughput (original, /s) | throughput (optimized, /s) |
| :-------------------: | :-------------------: | :-: | :-------------------------: | :--------------------------: | :-: | :-----------------------: | :------------------------: |
|       `dynamic`       |  `['Add', 'MatMul']`  |  \|  |            69.76            |            38.50             |  \|  |           14.40           |           26.00            |
|       `dynamic`       |       `['Add']`       |  \|  |            56.02            |            51.32             |  \|  |           17.87           |           19.53            |
|       `static`        |  `['Add', 'MatMul']`  |  \|  |            55.05            |            46.80             |  \|  |           18.20           |           21.40            |
|       `static`        |       `['Add']`       |  \|  |            71.03            |            56.82             |  \|  |           14.13           |           17.67            |


Below, time metrics for batch size = 4, input length = 64.

| quantization_approach | operators_to_quantize |     | latency_mean (original, ms) | latency_mean (optimized, ms) |     | throughput (original, /s) | throughput (optimized, /s) |
| :-------------------: | :-------------------: | :-: | :-------------------------: | :--------------------------: | :-: | :-----------------------: | :------------------------: |
|       `dynamic`       |  `['Add', 'MatMul']`  |  \|  |           119.91            |            61.51             |  \|  |           8.40            |           16.27            |
|       `dynamic`       |       `['Add']`       |  \|  |           108.43            |            105.65            |  \|  |           9.27            |            9.47            |
|       `static`        |  `['Add', 'MatMul']`  |  \|  |           119.89            |            86.76             |  \|  |           8.40            |           11.53            |
|       `static`        |       `['Add']`       |  \|  |            96.99            |            102.03            |  \|  |           10.33           |            9.87            |


Below, time metrics for batch size = 4, input length = 128.

| quantization_approach | operators_to_quantize |     | latency_mean (original, ms) | latency_mean (optimized, ms) |     | throughput (original, /s) | throughput (optimized, /s) |
| :-------------------: | :-------------------: | :-: | :-------------------------: | :--------------------------: | :-: | :-----------------------: | :------------------------: |
|       `dynamic`       |  `['Add', 'MatMul']`  |  \|  |           219.78            |            123.71            |  \|  |           4.60            |            8.13            |
|       `dynamic`       |       `['Add']`       |  \|  |           220.13            |            187.21            |  \|  |           4.60            |            5.40            |
|       `static`        |  `['Add', 'MatMul']`  |  \|  |           186.39            |            176.99            |  \|  |           5.40            |            5.67            |
|       `static`        |       `['Add']`       |  \|  |           219.57            |            203.71            |  \|  |           4.60            |            4.93            |


Below, time metrics for batch size = 8, input length = 32.

| quantization_approach | operators_to_quantize |     | latency_mean (original, ms) | latency_mean (optimized, ms) |     | throughput (original, /s) | throughput (optimized, /s) |
| :-------------------: | :-------------------: | :-: | :-------------------------: | :--------------------------: | :-: | :-----------------------: | :------------------------: |
|       `dynamic`       |  `['Add', 'MatMul']`  |  \|  |           118.32            |            59.22             |  \|  |           8.47            |           16.93            |
|       `dynamic`       |       `['Add']`       |  \|  |           116.52            |            80.17             |  \|  |           8.60            |           12.53            |
|       `static`        |  `['Add', 'MatMul']`  |  \|  |           116.59            |            83.55             |  \|  |           8.60            |           12.00            |
|       `static`        |       `['Add']`       |  \|  |           115.81            |            126.53            |  \|  |           8.67            |            7.93            |


Below, time metrics for batch size = 8, input length = 64.

| quantization_approach | operators_to_quantize |     | latency_mean (original, ms) | latency_mean (optimized, ms) |     | throughput (original, /s) | throughput (optimized, /s) |
| :-------------------: | :-------------------: | :-: | :-------------------------: | :--------------------------: | :-: | :-----------------------: | :------------------------: |
|       `dynamic`       |  `['Add', 'MatMul']`  |  \|  |           172.71            |            117.89            |  \|  |           5.80            |            8.53            |
|       `dynamic`       |       `['Add']`       |  \|  |           166.05            |            156.99            |  \|  |           6.07            |            6.40            |
|       `static`        |  `['Add', 'MatMul']`  |  \|  |           215.00            |            148.93            |  \|  |           4.67            |            6.73            |
|       `static`        |       `['Add']`       |  \|  |           214.55            |            200.16            |  \|  |           4.67            |            5.00            |


Below, time metrics for batch size = 8, input length = 128.

| quantization_approach | operators_to_quantize |     | latency_mean (original, ms) | latency_mean (optimized, ms) |     | throughput (original, /s) | throughput (optimized, /s) |
| :-------------------: | :-------------------: | :-: | :-------------------------: | :--------------------------: | :-: | :-----------------------: | :------------------------: |
|       `dynamic`       |  `['Add', 'MatMul']`  |  \|  |           403.69            |            307.36            |  \|  |           2.53            |            3.27            |
|       `dynamic`       |       `['Add']`       |  \|  |           372.85            |            317.53            |  \|  |           2.73            |            3.20            |
|       `static`        |  `['Add', 'MatMul']`  |  \|  |           352.18            |            320.85            |  \|  |           2.87            |            3.13            |
|       `static`        |       `['Add']`       |  \|  |           403.55            |            410.17            |  \|  |           2.53            |            2.47            |
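
To compare the tables above at a glance, the sketch below collects the latency pairs for one configuration (`dynamic`, `['Add', 'MatMul']`) across all batch sizes and input lengths and computes the speedup for each. The data structure is an assumption for illustration; the numbers are copied from the tables.

```python
# (batch size, input length) -> (original latency ms, optimized latency ms)
dynamic_add_matmul = {
    (1, 32): (46.38, 9.96),
    (1, 64): (58.92, 19.68),
    (1, 128): (72.38, 36.47),
    (4, 32): (69.76, 38.50),
    (4, 64): (119.91, 61.51),
    (4, 128): (219.78, 123.71),
    (8, 32): (118.32, 59.22),
    (8, 64): (172.71, 117.89),
    (8, 128): (403.69, 307.36),
}

# Speedup = original latency / optimized latency.
speedups = {
    shape: original_ms / optimized_ms
    for shape, (original_ms, optimized_ms) in dynamic_add_matmul.items()
}

best = max(speedups, key=speedups.get)
worst = min(speedups, key=speedups.get)

for (batch_size, length), factor in speedups.items():
    print(f"batch={batch_size} length={length:3d} speedup={factor:.2f}x")
print("largest speedup at", best, "smallest at", worst)
```

For this configuration the largest speedup is at batch size 1, input length 32, and the smallest at batch size 8, input length 128. Note that the "original" latencies differ between rows of the same table even though the baseline model is the same, so individual figures carry some measurement noise.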