---
license: apache-2.0
base_model: tiiuae/falcon-7b
language:
- en
tags:
- falcon-7b
- falcon
- onnxruntime
- onnx
- llm
---

# falcon-7b for ONNX Runtime

## Introduction

This repository hosts an optimized version of **falcon-7b** to accelerate inference with the ONNX Runtime CUDA execution provider.

See the [usage instructions](#usage-example) for how to run inference on this model with the ONNX files hosted in this repository.
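
The ONNX files can be fetched locally with the `huggingface_hub` client before running inference. The snippet below is a minimal sketch; the repository id shown is a placeholder (an assumption, not necessarily the id of this repository) and should be replaced accordingly.
```python
# Minimal sketch: download the ONNX files from the Hugging Face Hub.
# NOTE: "your-org/falcon-7b-onnx" is a placeholder repository id (assumption);
# replace it with the id of this repository.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/falcon-7b-onnx",                   # placeholder id
    allow_patterns=["*.onnx", "*.onnx.data", "*.json"],  # model weights + config files
)
print(local_dir)  # directory containing the downloaded ONNX files
```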

## Model Description

- **Developed by:** TIIUAE
- **Model type:** Pretrained generative text model
- **License:** Apache 2.0 License
- **Model Description:** This is a conversion of the [falcon-7b](https://huggingface.co/tiiuae/falcon-7b) model for [ONNX Runtime](https://github.com/microsoft/onnxruntime) inference with the CUDA execution provider.

## Performance Comparison

#### Latency for token generation

Below is the average latency of generating a token for prompts of varying lengths, measured on an NVIDIA A100-SXM4-80GB GPU:

| Prompt Length | Batch Size | PyTorch 2.1 torch.compile | ONNX Runtime CUDA |
|---------------|------------|---------------------------|-------------------|
| 16            | 1          | N/A                       | N/A               |
| 256           | 1          | N/A                       | N/A               |
| 1024          | 1          | N/A                       | N/A               |
| 2048          | 1          | N/A                       | N/A               |
| 16            | 4          | N/A                       | N/A               |
| 256           | 4          | N/A                       | N/A               |
| 1024          | 4          | N/A                       | N/A               |
| 2048          | 4          | N/A                       | N/A               |
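
For reference, per-token latency of the kind reported above can be estimated by timing a fixed-length generation and dividing by the number of new tokens. The sketch below is illustrative only, not the benchmark script used to produce the table; it assumes `optimum` with ONNX Runtime GPU support is installed and that the ONNX files sit in a local directory (the path is a placeholder).
```python
# Illustrative sketch (not the official benchmark): estimate the average latency per
# generated token by timing a fixed-length generation on the CUDA execution provider.
import time

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = ORTModelForCausalLM.from_pretrained(
    "./falcon-7b-onnx",                 # placeholder: local directory with the ONNX files
    provider="CUDAExecutionProvider",   # run on the GPU via ONNX Runtime
)

prompt = "Instruct: What is the Fermi paradox?\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # inputs on the GPU to match the provider

new_tokens = 32
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens)
elapsed = time.perf_counter() - start
print(f"~{elapsed / new_tokens * 1000:.1f} ms per generated token")
```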

## Usage Example

1. Clone the onnxruntime repository.
```shell
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
```

2. Install the required dependencies.
```shell
python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
```

3. Run inference with the custom model API, or use Hugging Face's `ORTModelForCausalLM` as shown below.
```python
from optimum.onnxruntime import ORTModelForCausalLM
from onnxruntime import InferenceSession
from transformers import AutoConfig, AutoTokenizer

# Create an ONNX Runtime session for the exported model on the CUDA execution provider.
sess = InferenceSession("falcon-7b.onnx", providers=["CUDAExecutionProvider"])
config = AutoConfig.from_pretrained("tiiuae/falcon-7b")

# Wrap the session with optimum's causal-LM interface; use_cache enables the past
# key/value inputs and use_io_binding binds tensors directly to GPU buffers.
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# Place the inputs on the GPU to match the CUDA execution provider.
inputs = tokenizer("Instruct: What is the Fermi paradox?\nOutput:", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
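
In this example, `use_cache=True` runs the decoder with past key/value inputs so earlier tokens are not recomputed at each step, and `use_io_binding=True` lets ONNX Runtime bind inputs and outputs directly to GPU buffers instead of copying tensors between host and device. The example also assumes the `optimum` package with ONNX Runtime GPU support is available; if it is not pulled in by the requirements file in step 2, it can be installed separately (a suggested command, using the extra documented by `optimum`):
```shell
pip install optimum[onnxruntime-gpu]
```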