---
license: mit
language: en
tags:
- LLM
- LLaMA
- Baichuan
- Baichuan2
- XVERSE
---
# Model Card for lyraLLMs
## Introduction
We have released **lyraLLMs**, a highly optimized and easy-to-use inference engine for LLMs.
**lyraLLMs** runs on the following NVIDIA GPU architectures:
- Volta (V100)
- Turing (T4)
- Ampere (A100/A10)
- Ada Lovelace (RTX 4090, etc.)
**lyraLLMs** supports many popular HuggingFace models, including:
- [BELLE](https://huggingface.co/TMElyralab/lyraBELLE)
- [ChatGLM](https://huggingface.co/TMElyralab/lyraChatGLM)
- LLaMA
- LLaMA 2
- XVERSE
- Baichuan 1 & 2
**lyraLLMs** is fast, memory-efficient & easy to use with:
- State-of-the-art throughput (up to 7K tokens/s for LLaMA 13B)
- Efficient memory usage of attention with FlashAttention2
- Quantization: MEMOPT mode (W8A16, W4A16), KVCache Int8
- Easy-to-use Python API to serve LLMs
- Streaming outputs
If you like our work and would like to join us, feel free to drop us a line at [email protected]
## Speed
### Settings
* Throughput measured in tokens/s (input + output tokens); see the sketch after this list
* Tested on an A100 40G, CUDA 12.0
* MEMOPT mode and KVCache Int8 enabled
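For reference, here is a minimal sketch of how this metric can be computed: count both input and output tokens, summed over the whole batch, and divide by wall-clock time. The numbers below are illustrative placeholders, not benchmark results.

```python
def tokens_per_second(n_input_tokens: int, n_output_tokens: int,
                      batch_size: int, elapsed_s: float) -> float:
    """Throughput in tokens/s, counting input + output tokens across the whole batch."""
    return (n_input_tokens + n_output_tokens) * batch_size / elapsed_s

# Illustrative placeholders: a batch of 64 prompts of 16 tokens each,
# 150 generated tokens per prompt, finishing in 2.4 s of wall-clock time.
print(tokens_per_second(16, 150, 64, 2.4))  # ~4426.7 tokens/s
```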
### Throughputs
#### XVERSE-13B-Chat
##### Input
北京的景点:故宫、天坛、万里长城等。\n深圳的景点:
*(Translation: "Attractions in Beijing: the Forbidden City, the Temple of Heaven, the Great Wall, etc.\nAttractions in Shenzhen:")*
| Version | Batch Size 1 | Batch Size 64 | Batch Size 128 | Batch Size 256 | Batch Size 512 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.1.0 | 52.9 | 2308.1 | OOM | - | - |
| lyraXVERSE | 200.4 | 4624.8 | 5759.7 | 6075.6 | 5733.0 |
#### Baichuan2-7B-Base
##### Input
北京的景点:登鹳雀楼->王之涣\n夜雨寄北->
*(Translation: "Attractions in Beijing: 'Climbing Stork Tower' -> Wang Zhihuan\n'Night Rain Sent North' ->")*
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |
#### Baichuan2-13B-Base
##### Input
北京的景点:登鹳雀楼->王之涣\n夜雨寄北->
*(Translation: "Attractions in Beijing: 'Climbing Stork Tower' -> Wang Zhihuan\n'Night Rain Sent North' ->")*
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |
#### Yi-6B
##### Input
\# write the quick sort algorithm
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.1.0 | 31.4 | 247.5 | 490.4 | 987.2 | 1796.3 |
| lyraLLaMA | 93.8 | 735.6 | 2339.8 | 3020.9 | 4630.8 |
#### Yi-34B
Due to VRAM limitations, we could not profile the throughput of Yi-34B on an A100 40G with Torch.
##### Input
Let me tell you an interesting story about cat Tom and mouse Jerry,
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| lyraLLaMA | 52.5 | 399.4 | 753.0 | 1138.2 | 1926.2 |
## Usage
### Environment (Docker recommended)
- For CUDA 11.x: we recommend ```nvcr.io/nvidia/pytorch:22.12-py3```
- For CUDA 12.0: we recommend ```nvcr.io/nvidia/pytorch:23.02-py3```
```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v $(pwd):/lyraLLMs nvcr.io/nvidia/pytorch:23.02-py3
# inside the container:
cd /lyraLLMs
pip install -r requirements.txt
```
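Once inside the container, a quick sanity check that the GPU is visible (a generic PyTorch check, not part of lyraLLMs; the base images ship with PyTorch preinstalled):

```python
# Run inside the container to confirm CUDA and the GPU are visible.
import torch

assert torch.cuda.is_available(), "No CUDA device visible; check `--gpus all`"
print(torch.cuda.get_device_name(0))  # e.g. an A100/A10/T4/V100 from the supported list
print(torch.version.cuda)             # should match the CUDA 11.x / 12.0 image chosen above
```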
### Convert Models
We have released multiple optimized models converted from original HuggingFace ones:
- ChatGLM-6B
- XVERSE-13B-Chat
- LLaMA-Ziya-13B
- Baichuan-7B, Baichuan-13B-Base, Baichuan-13B-Chat, Baichuan2-7B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Base and Baichuan2-13B-Chat
- Yi-6B, Yi-34B
Feel free to contact us if you would like us to convert a fine-tuned LLM of your own.
### Inference
Refer to [README.md](./lyrallms/README.md) for inference of converted models with **lyraLLMs**.
### Python Demo
```python
from lyra_llama import lyraLlama

model_path = 'XXX'  # directory containing the converted model weights, config and tokenizer files
data_type = 'fp16'
memopt_mode = 0     # set memopt_mode=1 to run inference in MEMOPT mode

model = lyraLlama(model_path, data_type, memopt_mode)

# "List 3 different machine learning algorithms and explain where each applies."
prompts = '列出3个不同的机器学习算法,并说明它们的适用范围.'
prompts = [prompts,] * 64

output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
print(output_texts)
```
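To roughly reproduce the Speed tables above, one can sweep batch sizes with the same API. A minimal sketch, assuming a converted model directory as in the demo and MEMOPT mode enabled via `memopt_mode=1`; for simplicity it counts only output tokens, whereas the tables above also include input tokens:

```python
import time

from lyra_llama import lyraLlama

model = lyraLlama('XXX', 'fp16', 1)  # memopt_mode=1 enables MEMOPT mode
prompt = 'List 3 different machine learning algorithms and explain where each applies.'

for batch_size in (1, 8, 16, 32, 64):
    prompts = [prompt] * batch_size
    start = time.perf_counter()
    model.generate(prompts, output_length=150, do_sample=False, top_k=30,
                   top_p=0.85, temperature=1.0, repetition_penalty=1.0)
    elapsed = time.perf_counter() - start
    # output_length tokens per prompt; input tokens omitted for simplicity.
    print(f'batch {batch_size}: {batch_size * 150 / elapsed:.1f} output tokens/s')
```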
## Citation
```bibtex
@Misc{lyraLLMs2024,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Haoxiong Su and Bin Wu},
  title =        {lyraLLMs: A highly optimized and easy-to-use inference engine for LLMs},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLMs}},
  year =         {2024}
}
```
## Report Bugs
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraLLMs/discussions
- Include a `[bug]` mark in the title of your report.