---
license: apache-2.0
base_model: tiiuae/falcon-7b
language:
  - en
tags:
  - falcon-7b
  - falcon
  - onnxruntime
  - onnx
  - llm
---

# falcon-7b for ONNX Runtime

## Introduction

This repository hosts an optimized version of **falcon-7b** for accelerated inference with the ONNX Runtime CUDA execution provider.

See the [usage instructions](#usage-example) for how to run inference on this model with the ONNX files hosted in this repository.

## Model Description

- **Developed by:** Technology Innovation Institute (TII)
- **Model type:** Pretrained generative text model
- **License:** Apache 2.0 License
- **Model Description:** This is a conversion of the [falcon-7b](https://huggingface.co/tiiuae/falcon-7b) model for [ONNX Runtime](https://github.com/microsoft/onnxruntime) inference with the CUDA execution provider.


## Performance Comparison

### Latency for token generation

Below is the average latency of generating a token for prompts of varying lengths, measured on an NVIDIA A100-SXM4-80GB GPU:

| Prompt Length | Batch Size | PyTorch 2.1 torch.compile | ONNX Runtime CUDA |
|---------------|------------|---------------------------|-------------------|
| 32            | 1          | 53.64 ms                  | 15.68 ms          |
| 256           | 1          | 59.55 ms                  | 26.05 ms          |
| 1024          | 1          | 89.82 ms                  | 99.05 ms          |
| 2048          | 1          | 208.0 ms                  | 227.0 ms          |
| 32            | 4          | 70.8 ms                   | 19.62 ms          |
| 256           | 4          | 78.6 ms                   | 81.29 ms          |
| 1024          | 4          | 373.7 ms                  | 369.6 ms          |
| 2048          | 4          | N/A                       | 879.2 ms          |
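
The table reports numbers measured by the model publishers. As a rough, minimal sketch of how per-token latency can be measured with the Optimum wrapper used later in this card (the local file name `falcon-7b.onnx`, the prompt, and the token counts are assumptions, not the publishers' benchmark script):

```python
import time
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# Load the exported ONNX model with the CUDA execution provider.
# The directory and file name are assumptions; point them at the downloaded files.
model = ORTModelForCausalLM.from_pretrained(
    ".",
    file_name="falcon-7b.onnx",
    provider="CUDAExecutionProvider",
    use_io_binding=True,
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# A short prompt; with the CUDA execution provider the input tensors should live on the GPU.
inputs = tokenizer("The Fermi paradox asks", return_tensors="pt").to("cuda")

# Warm up once so one-time allocation does not skew the timing.
model.generate(**inputs, max_new_tokens=8)

new_tokens = 64
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
elapsed = time.perf_counter() - start
print(f"average latency per generated token: {1000.0 * elapsed / new_tokens:.2f} ms")
```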

## Usage Example

1. Clone the onnxruntime repository.
```shell
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
```

2. Install the required dependencies.
```shell
python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
```
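
The inference step below loads the ONNX file from a local directory. If you have not already fetched the files from this repository, one way to do so is with the `huggingface_hub` client; the repository id below is a placeholder, so replace it with this model card's actual repo id:

```python
from huggingface_hub import snapshot_download

# Download every file in the model repository into a local folder.
# "your-org/falcon-7b-onnx" is a placeholder id, not this repository's real id.
local_dir = snapshot_download(repo_id="your-org/falcon-7b-onnx", local_dir="falcon-7b-onnx")
print("model files downloaded to:", local_dir)
```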

3. Run inference using ONNX Runtime's custom model API, or with Hugging Face Optimum's `ORTModelForCausalLM` as shown below:
```python
from optimum.onnxruntime import ORTModelForCausalLM
from onnxruntime import InferenceSession
from transformers import AutoConfig, AutoTokenizer

# Create an ONNX Runtime session for the exported model on the CUDA execution provider.
sess = InferenceSession("falcon-7b.onnx", providers=["CUDAExecutionProvider"])
config = AutoConfig.from_pretrained("tiiuae/falcon-7b")

# Wrap the session so it exposes the Hugging Face generate() API.
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# With the CUDA execution provider, the input tensors should live on the GPU.
inputs = tokenizer("Instruct: What is the Fermi paradox?\nOutput:", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
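
By default `generate()` uses greedy decoding and a short output length. Continuing from the snippet above, the usual `transformers` generation arguments can be passed through; the values below are illustrative, not the publishers' recommended settings:

```python
# Continuing from the snippet above: sample a longer completion.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```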