---
license: apache-2.0
base_model: tiiuae/falcon-7b
language:
- en
tags:
- falcon-7b
- falcon
- onnxruntime
- onnx
- llm
---

# falcon-7b for ONNX Runtime

## Introduction

This repository hosts an optimized version of **falcon-7b** to accelerate inference with the ONNX Runtime CUDA execution provider.

See the [usage instructions](#usage-example) for how to run inference on this model with the ONNX files hosted in this repository.
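
The ONNX files can be fetched locally with the `huggingface_hub` client before running inference. The snippet below is a minimal sketch; the repository id shown is a placeholder (an assumption, not necessarily the id of this repository) and should be replaced accordingly.
```python
# Minimal sketch: download the ONNX files from the Hugging Face Hub.
# NOTE: "your-org/falcon-7b-onnx" is a placeholder repository id (assumption);
# replace it with the id of this repository.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/falcon-7b-onnx",                   # placeholder id
    allow_patterns=["*.onnx", "*.onnx.data", "*.json"],  # model weights + config files
)
print(local_dir)  # directory containing the downloaded ONNX files
```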

## Model Description

- **Developed by:** TIIUAE
- **Model type:** Pretrained generative text model
- **License:** Apache 2.0 License
- **Model Description:** This is a conversion of the [falcon-7b](https://huggingface.co/tiiuae/falcon-7b) model for [ONNX Runtime](https://github.com/microsoft/onnxruntime) inference with the CUDA execution provider.

## Performance Comparison

#### Latency for token generation

Below is the average latency of generating a token for prompts of varying lengths, measured on an NVIDIA A100-SXM4-80GB GPU:

| Prompt Length | Batch Size | PyTorch 2.1 torch.compile | ONNX Runtime CUDA |
|---------------|------------|---------------------------|-------------------|
| 16            | 1          | N/A                       | N/A               |
| 256           | 1          | N/A                       | N/A               |
| 1024          | 1          | N/A                       | N/A               |
| 2048          | 1          | N/A                       | N/A               |
| 16            | 4          | N/A                       | N/A               |
| 256           | 4          | N/A                       | N/A               |
| 1024          | 4          | N/A                       | N/A               |
| 2048          | 4          | N/A                       | N/A               |
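
For reference, per-token latency of the kind reported above can be estimated by timing a fixed-length generation and dividing by the number of new tokens. The sketch below is illustrative only, not the benchmark script used to produce the table; it assumes `optimum` with ONNX Runtime GPU support is installed and that the ONNX files sit in a local directory (the path is a placeholder).
```python
# Illustrative sketch (not the official benchmark): estimate the average latency per
# generated token by timing a fixed-length generation on the CUDA execution provider.
import time

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = ORTModelForCausalLM.from_pretrained(
    "./falcon-7b-onnx",                 # placeholder: local directory with the ONNX files
    provider="CUDAExecutionProvider",   # run on the GPU via ONNX Runtime
)

prompt = "Instruct: What is the Fermi paradox?\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # inputs on the GPU to match the provider

new_tokens = 32
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens)
elapsed = time.perf_counter() - start
print(f"~{elapsed / new_tokens * 1000:.1f} ms per generated token")
```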

## Usage Example

1. Clone the onnxruntime repository.
```shell
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
```

2. Install the required dependencies.
```shell
python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
```

3. Run inference with the custom model API, or use Hugging Face's `ORTModelForCausalLM` as shown below.
```python
from optimum.onnxruntime import ORTModelForCausalLM
from onnxruntime import InferenceSession
from transformers import AutoConfig, AutoTokenizer

# Create an ONNX Runtime session for the exported model on the CUDA execution provider.
sess = InferenceSession("falcon-7b.onnx", providers=["CUDAExecutionProvider"])
config = AutoConfig.from_pretrained("tiiuae/falcon-7b")

# Wrap the session with optimum's causal-LM interface; use_cache enables the past
# key/value inputs and use_io_binding binds tensors directly to GPU buffers.
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# Place the inputs on the GPU to match the CUDA execution provider.
inputs = tokenizer("Instruct: What is the Fermi paradox?\nOutput:", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
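
In this example, `use_cache=True` runs the decoder with past key/value inputs so earlier tokens are not recomputed at each step, and `use_io_binding=True` lets ONNX Runtime bind inputs and outputs directly to GPU buffers instead of copying tensors between host and device. The example also assumes the `optimum` package with ONNX Runtime GPU support is available; if it is not pulled in by the requirements file in step 2, it can be installed separately (a suggested command, using the extra documented by `optimum`):
```shell
pip install optimum[onnxruntime-gpu]
```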