lyraLLMs / README.md

carsonhxsu

Update README

22a0289 12 months ago

4.78 kB

	---
	license: mit
	language: en
	tags:
	- LLM
	- LLaMA
	- Baichuan
	- Baichuan2
	- XVERSE
	---
	# Model Card for lyraLLMs

	## Introduction

	We have released lyraLLMs, a highly optimized and easy-to-use inference engine for LLMs.

	lyraLLMs is suitable for NVIDIA GPUs:
	- Volta (V100)
	- Turing (T4)
	- Ampere (A100/A10)
	- Ada Lovelace (RTX 4090, etc.)

	lyraLLMs supports many popular HuggingFace models as follows:
	- [BELLE](https://huggingface.co/TMElyralab/lyraBELLE)
	- [ChatGLM](https://huggingface.co/TMElyralab/lyraChatGLM)
	- LLaMA
	- LLaMA 2
	- XVERSE
	- Baichuan 1 & 2

	lyraLLMs is fast, memory-efficient & easy to use with:
	- State-of-the-art throughtput (up to 7K tokens/s for LLaMA 13B)
	- Efficient memory usage of attention with FlashAttention2
	- Quantization: MEMOPT mode (W8A16, W4A16), KVCache Int8
	- Easy-to-use Python API to serve LLMs
	- Streaming outputs

	If you like our work and consider to join us, feel free to drop a line at [email protected]

	## Speed

	### Settings
	* Evaluated at tokens/s (input + output)
	* Test on A100 40G, CUDA 12.0
	* Enable the use of MEMOPT mode and KVCache Int8

	### Throughputs

	### XVERSE-13B-Chat

	#### Input
	北京的景点：故宫、天坛、万里长城等。\n深圳的景点：

	\| Version \| Batch Size 1 \| Batch Size 64 \| Batch Size 128 \| Batch Size 256 \| Batch Size 512 \|
	\| --- \| --- \| --- \| --- \| --- \| --- \|
	\| Torch 2.1.0 \| 52.9 \| 2308.1 \| OOM \| \| \|
	\| lyraXVERSE \| 200.4 \| 4624.8 \| 5759.7 \| 6075.6 \| 5733 \|

	### Baichuan2-7B-Base

	#### Input
	北京的景点：登鹳雀楼->王之涣\n夜雨寄北->

	\| Version \| Batch Size 1 \| Batch Size 8 \| Batch Size 16 \| Batch Size 32 \| Batch Size 64 \|
	\| --- \| --- \| --- \| --- \| --- \| --- \|
	\| Torch 2.0.1 \| 41.2 \| 323.2 \| 640.0 \| 1256.8 \| 2231.0 \|
	\| lyraBaichuan \| 125.9 \| 948.1 \| 1749.3 \| 2974.0 \| 4370.1 \|

	### Baichuan2-13B-Base

	#### Input
	北京的景点：登鹳雀楼->王之涣\n夜雨寄北->

	\| Version \| Batch Size 1 \| Batch Size 8 \| Batch Size 16 \| Batch Size 32 \| Batch Size 64 \|
	\| --- \| --- \| --- \| --- \| --- \| --- \|
	\| Torch 2.0.1 \| 40.9 \| 307.9 \| 555.6 \| 1010.4 \| 1601.0 \|
	\| lyraBaichuan \| 80.0 \| 568.2 \| 1124.4 \| 1942.6 \| 2828.0 \|

	### Yi-6B

	#### Input
	\# write the quick sort algorithm

	\| Version \| Batch Size 1 \| Batch Size 8 \| Batch Size 16 \| Batch Size 32 \| Batch Size 64 \|
	\| --- \| --- \| --- \| --- \| --- \| --- \|
	\| Torch 2.1.0 \| 31.4 \| 247.5 \| 490.4 \| 987.2 \| 1796.3 \|
	\| lyraLLaMA \| 93.8 \| 735.6 \| 2339.8 \| 3020.9 \| 4630.8 \|

	### Yi-34B

	Due to limitation of VRAM, we cannot profile the throughputs of Yi-34B on A100 40G using Torch.

	#### Input
	Let me tell you an interesting story about cat Tom and mouse Jerry,

	\| Version \| Batch Size 1 \| Batch Size 8 \| Batch Size 16 \| Batch Size 32 \| Batch Size 64 \|
	\| --- \| --- \| --- \| --- \| --- \| --- \|
	\| lyraLLaMA \| 52.5 \| 399.4 \| 753.0 \| 1138.2 \| 1926.2 \|

	## Usage

	### Environment (Docker recommended)

	- For Cuda 11.X: we recommend ```nvcr.io/nvidia/pytorch:22.12-py3```
	- For Cuda 12.0: we recommend ```nvcr.io/nvidia/pytorch:23.02-py3```

	```bash
	docker pull nvcr.io/nvidia/pytorch:23.02-py3
	docker run --rm -it --gpus all -v ./:/lyraLLMs nvcr.io/nvidia/pytorch:23.02-py3

	pip install -r requirements.txt
	```

	### Convert Models

	We have released multiple optimized models converted from original HuggingFace ones:
	- ChatGLM-6B
	- XVERSE-13B-Chat
	- LLaMA-Ziya-13B
	- Baichuan-7B, Baichuan-13B-Base, Baichuan-13B-Chat, Baichuan2-7B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Base and lyraBaichuan2-13B-Chat
	- Yi-6B, Yi-34B

	Feel free to contact us if you would like to convert a finetuned version of LLMs.

	### Inference

	Refer to [README.md](./lyrallms/README.md) for inference of converted models with lyraLLMs.

	### Python Demo

	```python
	from lyra_llama import lyraLlama

	model_path = 'XXX' # 包含转换后的模型参数，配置，tokenizer文件目录
	data_type = 'fp16'
	memopt_mode = 0 # 如需使用MEMOPT模式推理, memopt_mode=1

	model = lyraLlama(model_path, data_type, memopt_mode)

	prompts = '列出3个不同的机器学习算法，并说明它们的适用范围.'
	prompts = [prompts,] * 64

	output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
	print(output_texts)

	```

	## Citation
	``` bibtex
	@Misc{lyraLLMs2024,
	author = {Kangjian Wu, Zhengtao Wang, Yibo Lu, Haoxiong Su, Bin Wu},
	title = {lyraLLMs: A highly optimized and easy-to-use inference engine for LLMs},
	howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLMs}},
	year = {2024}
	}
	```

	## Report bug
	- start a discussion to report any bugs!--> https://huggingface.co/TMElyralab/lyraLLMs/discussions
	- report bug with a `[bug]` mark in the title.