---
license: apache-2.0
base_model: tiiuae/falcon-7b
language:
- en
tags:
- falcon-7b
- falcon
- onnxruntime
- onnx
- llm
---

# falcon-7b for ONNX Runtime

## Introduction

This repository hosts the optimized version of **falcon-7b** to accelerate inference with the ONNX Runtime CUDA execution provider.

See the [usage instructions](#usage-example) for how to run inference on this model with the ONNX files hosted in this repository.
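
As a quick sanity check, you can confirm that your ONNX Runtime build exposes the CUDA execution provider before loading the model. This is a minimal sketch and assumes the `onnxruntime-gpu` package is installed:

```python
import onnxruntime as ort

# "CUDAExecutionProvider" must appear in this list for GPU inference to work;
# otherwise sessions fall back to the CPU execution provider.
print(ort.get_available_providers())

# Reports "GPU" when this ONNX Runtime build can use a CUDA device.
print(ort.get_device())
```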
## Model Description

- **Developed by:** TIIUAE
- **Model type:** Pretrained generative text model
- **License:** Apache 2.0 License
- **Model Description:** This is a conversion of [falcon-7b](https://huggingface.co/tiiuae/falcon-7b) for [ONNX Runtime](https://github.com/microsoft/onnxruntime) inference with the CUDA execution provider.

## Performance Comparison

#### Latency for token generation

Below is the average latency of generating a token for prompts of varying length, measured on an NVIDIA A100-SXM4-80GB GPU:

| Prompt Length | Batch Size | PyTorch 2.1 torch.compile | ONNX Runtime CUDA |
|---------------|------------|---------------------------|-------------------|
| 32            | 1          | 53.64ms                   | 15.68ms           |
| 256           | 1          | 59.55ms                   | 26.05ms           |
| 1024          | 1          | 89.82ms                   | 99.05ms           |
| 2048          | 1          | 208.0ms                   | 227.0ms           |
| 32            | 4          | 70.8ms                    | 19.62ms           |
| 256           | 4          | 78.6ms                    | 81.29ms           |
| 1024          | 4          | 373.7ms                   | 369.6ms           |
| 2048          | 4          | N/A                       | 879.2ms           |
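
For context, per-token latency can be estimated with a simple timing loop. The sketch below is illustrative only: it assumes the `model` and `tokenizer` objects from the usage example below, and it is not the harness used to produce the numbers above.

```python
import time

def average_token_latency(model, tokenizer, prompt, new_tokens=32):
    """Rough average per-token generation latency for one prompt (illustrative only)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # Warm-up run so kernels, IO-binding buffers, and caches are initialized before timing.
    model.generate(**inputs, min_new_tokens=new_tokens, max_new_tokens=new_tokens)
    start = time.perf_counter()
    model.generate(**inputs, min_new_tokens=new_tokens, max_new_tokens=new_tokens)
    return (time.perf_counter() - start) / new_tokens

latency = average_token_latency(model, tokenizer, "Hello " * 32)
print(f"{latency * 1000:.2f} ms per generated token")
```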
## Usage Example

1. Clone the onnxruntime repository.
```shell
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
```

2. Install the required dependencies.
```shell
python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
```

3. Run inference using the custom model API, or use Hugging Face's `ORTModelForCausalLM`.
```python
from optimum.onnxruntime import ORTModelForCausalLM
from onnxruntime import InferenceSession
from transformers import AutoConfig, AutoTokenizer

# Create an ONNX Runtime session for the exported model on the CUDA execution provider.
sess = InferenceSession("falcon-7b.onnx", providers=["CUDAExecutionProvider"])
config = AutoConfig.from_pretrained("tiiuae/falcon-7b")

# Wrap the session so the model can be driven through the Hugging Face generate() API.
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

inputs = tokenizer("Instruct: What is the Fermi paradox?\nOutput:", return_tensors="pt")

outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
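
Depending on your `optimum` version and device setup, you may need to move the tokenized inputs onto the GPU before calling `generate` when `use_io_binding=True` (for example, `inputs = inputs.to("cuda")`). If generation fails with device placement errors, that is the first thing to check.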
|