---
license: apache-2.0
base_model: tiiuae/falcon-7b
language:
  - en
tags:
  - falcon-7b
  - falcon
  - onnxruntime
  - onnx
  - llm
---

# falcon-7b for ONNX Runtime

## Introduction

This repository hosts an optimized version of falcon-7b to accelerate inference with the ONNX Runtime CUDA execution provider.

See the Usage Example below for how to run inference on this model with the ONNX files hosted in this repository.

## Model Description

- Developed by: TIIUAE
- Model type: Pretrained generative text model
- License: Apache 2.0 License
- Model Description: This is a conversion of falcon-7b for ONNX Runtime inference with the CUDA execution provider.

## Performance Comparison

### Latency for token generation

Below is the average latency of generating one token, for prompts of varying length and batch size, measured on an NVIDIA A100-SXM4-80GB GPU:

| Prompt Length | Batch Size | PyTorch 2.1 torch.compile | ONNX Runtime CUDA |
|---------------|------------|---------------------------|-------------------|
| 32            | 1          | 53.64ms                   | 15.68ms           |
| 256           | 1          | 59.55ms                   | 26.05ms           |
| 1024          | 1          | 89.82ms                   | 99.05ms           |
| 2048          | 1          | 208.0ms                   | 227.0ms           |
| 32            | 4          | 70.8ms                    | 19.62ms           |
| 256           | 4          | 78.6ms                    | 81.29ms           |
| 1024          | 4          | 373.7ms                   | 369.6ms           |
| 2048          | 4          | N/A                       | 879.2ms           |
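
The exact benchmarking harness behind these numbers is not included in this repository. The snippet below is a minimal sketch of how comparable per-token latencies could be measured with the same Optimum wrapper used in the Usage Example; the local path `falcon-7b.onnx`, the synthetic prompt built from random token ids, and the choice of 32 generated tokens are all assumptions for illustration, not the original measurement setup.

```python
import time

import torch
from onnxruntime import InferenceSession
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

# Load the exported model on the CUDA execution provider
# ("falcon-7b.onnx" is a placeholder path for the ONNX files in this repository).
sess = InferenceSession("falcon-7b.onnx", providers=["CUDAExecutionProvider"])
config = AutoConfig.from_pretrained("tiiuae/falcon-7b")
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

prompt_length, batch_size, new_tokens = 256, 1, 32

# Build a synthetic prompt of the desired length from random token ids.
input_ids = torch.randint(0, config.vocab_size, (batch_size, prompt_length), dtype=torch.long, device="cuda")
attention_mask = torch.ones_like(input_ids)

def run():
    # Force exactly `new_tokens` generated tokens so runs are comparable.
    return model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        min_new_tokens=new_tokens,
        max_new_tokens=new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

run()  # warm-up so session and kernel initialization is not counted

start = time.perf_counter()
run()
elapsed = time.perf_counter() - start
print(f"~{elapsed / new_tokens * 1000:.2f} ms per generated token")
```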

## Usage Example

1. Clone the onnxruntime repository.

   ```bash
   git clone https://github.com/microsoft/onnxruntime
   cd onnxruntime
   ```

2. Install the required dependencies.

   ```bash
   python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
   ```
3. Run inference with Hugging Face's Optimum `ORTModelForCausalLM` (or your own custom model API).

   ```python
   from onnxruntime import InferenceSession
   from optimum.onnxruntime import ORTModelForCausalLM
   from transformers import AutoConfig, AutoTokenizer

   # Create an ONNX Runtime session for the exported model on the CUDA execution provider.
   sess = InferenceSession("falcon-7b.onnx", providers=["CUDAExecutionProvider"])
   config = AutoConfig.from_pretrained("tiiuae/falcon-7b")

   # Wrap the session in Optimum's causal-LM class with KV cache and I/O binding enabled.
   model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

   tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

   # Move the tokenized prompt to the GPU to match the CUDA execution provider.
   inputs = tokenizer("Instruct: What is the Fermi paradox?\nOutput:", return_tensors="pt").to("cuda")

   outputs = model.generate(**inputs)

   print(tokenizer.decode(outputs[0], skip_special_tokens=True))
   ```
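
As a usage note, `ORTModelForCausalLM` follows the standard `transformers` generation API, so the usual decoding arguments can be passed to `generate`. The values below are illustrative assumptions, not settings from this repository:

```python
# Illustrative decoding settings (assumed values, tune as needed).
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9)
```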