|
--- |
|
license: mit |
|
language: en |
|
tags: |
|
- LLM |
|
- LLaMA |
|
- Baichuan |
|
- Baichuan2 |
|
- XVERSE |
|
--- |
|
# Model Card for lyraLLMs |
|
|
|
## Introduction |
|
|
|
We have released **lyraLLMs**, a highly optimized and easy-to-use inference engine for LLMs. |
|
|
|
**lyraLLMs** is suitable for NVIDIA GPUs: |
|
- Volta (V100) |
|
- Turing (T4) |
|
- Ampere (A100/A10) |
|
- Ada Lovelace (RTX 4090, etc.) |
|
|
|
**lyraLLMs** supports many popular HuggingFace models, including:
|
- [BELLE](https://huggingface.co/TMElyralab/lyraBELLE) |
|
- [ChatGLM](https://huggingface.co/TMElyralab/lyraChatGLM) |
|
- LLaMA |
|
- LLaMA 2 |
|
- XVERSE |
|
- Baichuan 1 & 2 |
|
|
|
**lyraLLMs** is fast, memory-efficient & easy to use with: |
|
- State-of-the-art throughput (up to 7K tokens/s for LLaMA 13B)
|
- Memory-efficient attention with FlashAttention2
|
- Quantization: MEMOPT mode (W8A16, W4A16) and KVCache Int8 (illustrated by the sketch after this list)
|
- Easy-to-use Python API to serve LLMs |
|
- Streaming outputs |
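
To make the MEMOPT terminology concrete: W8A16 keeps activations in fp16 and stores weights in int8, dequantizing them on the fly inside the GEMM. The snippet below is a minimal NumPy sketch of that idea under the usual symmetric per-channel scheme; it is an illustration, not the actual lyraLLMs kernels, and all names in it are ours.

```python
import numpy as np

def quantize_w8(w):
    """Symmetric per-output-channel int8 quantization of an fp16 weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True).astype(np.float32) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)  # int8 weights + one fp16 scale per channel

def matmul_w8a16(x, q, scale):
    """W8A16 GEMM: weights are dequantized to fp16 at compute time, activations stay fp16."""
    w_deq = q.astype(np.float16) * scale  # on-the-fly dequantization
    return x @ w_deq.T

w = np.random.randn(256, 256).astype(np.float16)      # fp16 weights (out_features x in_features)
x = np.random.randn(4, 256).astype(np.float16)        # fp16 activations
q, s = quantize_w8(w)
print(np.abs(x @ w.T - matmul_w8a16(x, q, s)).max())  # quantization error stays small
```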
|
|
|
If you like our work and are interested in joining us, feel free to drop a line at [email protected].
|
|
|
## Speed |
|
|
|
### Settings |
|
* Throughput is measured in tokens/s, counting both input and output tokens (see the sketch after this list)
* Tested on an A100 40G with CUDA 12.0
* MEMOPT mode and KVCache Int8 enabled
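
For reference, the reported figure is total processed tokens divided by wall-clock time; a minimal sketch of that computation (the helper below is ours, not part of lyraLLMs):

```python
import time

def measure_throughput(generate_fn, prompts, prompt_token_counts, output_length):
    """Tokens/s as used in the tables: (input + output tokens) / wall-clock seconds."""
    start = time.perf_counter()
    generate_fn(prompts)  # one batched generation call
    elapsed = time.perf_counter() - start
    total_tokens = sum(prompt_token_counts) + len(prompts) * output_length
    return total_tokens / elapsed
```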
|
|
|
### Throughputs |
|
|
|
#### XVERSE-13B-Chat
|
|
|
**Input:** `北京的景点:故宫、天坛、万里长城等。\n深圳的景点:` (Attractions in Beijing: the Forbidden City, the Temple of Heaven, the Great Wall, etc. Attractions in Shenzhen:)
|
|
|
| Version | Batch Size 1 | Batch Size 64 | Batch Size 128 | Batch Size 256 | Batch Size 512 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.1.0 | 52.9 | 2308.1 | OOM | OOM | OOM |
| lyraXVERSE | 200.4 | 4624.8 | 5759.7 | 6075.6 | 5733.0 |
|
|
|
#### Baichuan2-7B-Base
|
|
|
**Input:** `登鹳雀楼->王之涣\n夜雨寄北->` (a poem-title to poet completion prompt)
|
|
|
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |
|
|
|
#### Baichuan2-13B-Base
|
|
|
**Input:** `登鹳雀楼->王之涣\n夜雨寄北->` (a poem-title to poet completion prompt)
|
|
|
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |
|
|
|
#### Yi-6B
|
|
|
**Input:** `# write the quick sort algorithm`
|
|
|
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.1.0 | 31.4 | 247.5 | 490.4 | 987.2 | 1796.3 |
| lyraLLaMA | 93.8 | 735.6 | 2339.8 | 3020.9 | 4630.8 |
|
|
|
#### Yi-34B
|
|
|
Due to VRAM limitations, we could not profile the throughput of Yi-34B on the A100 40G with Torch.
|
|
|
**Input:** `Let me tell you an interesting story about cat Tom and mouse Jerry,`
|
|
|
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| lyraLLaMA | 52.5 | 399.4 | 753.0 | 1138.2 | 1926.2 |
|
|
|
## Usage |
|
|
|
### Environment (Docker recommended) |
|
|
|
- For CUDA 11.x, we recommend `nvcr.io/nvidia/pytorch:22.12-py3`
- For CUDA 12.0, we recommend `nvcr.io/nvidia/pytorch:23.02-py3`
|
|
|
```bash |
|
# Pull the recommended image and start a container with this repo mounted
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v ./:/lyraLLMs nvcr.io/nvidia/pytorch:23.02-py3

# Inside the container, install the Python dependencies
cd /lyraLLMs
pip install -r requirements.txt
|
``` |
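
Before converting or running models, it is worth confirming that the container actually sees the GPU; a quick check with the PyTorch bundled in the image:

```python
import torch

# Expect True and your device name, e.g. an A100 or RTX 4090
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
```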
|
|
|
### Convert Models |
|
|
|
We have released multiple optimized models converted from original HuggingFace ones: |
|
- ChatGLM-6B |
|
- XVERSE-13B-Chat |
|
- LLaMA-Ziya-13B |
|
- Baichuan-7B, Baichuan-13B-Base, Baichuan-13B-Chat, Baichuan2-7B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Base and Baichuan2-13B-Chat
|
- Yi-6B, Yi-34B |
|
|
|
Feel free to contact us if you would like us to convert a finetuned LLM.
|
|
|
### Inference |
|
|
|
Refer to [README.md](./lyrallms/README.md) for inference of converted models with **lyraLLMs**. |
|
|
|
### Python Demo |
|
|
|
```python |
|
from lyra_llama import lyraLlama

model_path = 'XXX'  # directory containing the converted model weights, config and tokenizer files
data_type = 'fp16'
memopt_mode = 0     # set memopt_mode=1 to run inference in MEMOPT mode

model = lyraLlama(model_path, data_type, memopt_mode)

# "List 3 different machine learning algorithms and explain where each is applicable."
prompts = '列出3个不同的机器学习算法,并说明它们的适用范围.'
prompts = [prompts,] * 64  # batch of 64 identical prompts

output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
print(output_texts)
|
|
|
``` |
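
To reproduce numbers like those in the Speed section, you can time the batched call from the demo above; a rough sketch that only counts generated tokens (the tables above also count input tokens, so they read somewhat higher):

```python
import time

# Reuses `model` and `prompts` from the demo above
output_length = 150
start = time.perf_counter()
model.generate(prompts, output_length=output_length, do_sample=False,
               top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
elapsed = time.perf_counter() - start
print(f'{len(prompts) * output_length / elapsed:.1f} generated tokens/s')
```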
|
|
|
## Citation |
|
```bibtex
|
@Misc{lyraLLMs2024, |
|
author = {Kangjian Wu and Zhengtao Wang and Yibo Lu and Haoxiong Su and Bin Wu},
|
title = {lyraLLMs: A highly optimized and easy-to-use inference engine for LLMs}, |
|
howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLMs}}, |
|
year = {2024} |
|
} |
|
``` |
|
|
|
## Report Bugs

- Start a discussion at https://huggingface.co/TMElyralab/lyraLLMs/discussions to report any bugs.
- Add a `[bug]` tag in the discussion title.
|
|