Update README.md
## Quick Start

We provide example code to run `InternVL2-Llama3-76B` using `transformers`.

> Please use transformers>=4.37.2 to ensure the model works normally.

```python
model = AutoModel.from_pretrained(
    ...,  # the model path and other loading arguments are elided in this excerpt
    trust_remote_code=True).eval()
```

#### Multiple GPUs

The reason for writing the code this way is to avoid errors during multi-GPU inference caused by tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors; the sketch below illustrates the idea.
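
The full loading code that this paragraph refers to is not shown in this excerpt. Purely as an illustration, a `device_map` built along those lines might look like the following sketch; the module names, layer count, and GPU count are assumptions rather than values taken from the model card:

```python
import math

import torch
from transformers import AutoModel

def split_model(num_layers: int, world_size: int) -> dict:
    # Spread the LLM layers across GPUs while pinning the vision encoder, the
    # embeddings, and the first and last LLM layers to GPU 0, so that the
    # model's inputs and outputs stay on the same device.
    device_map = {
        'vision_model': 0,                        # assumed module name
        'mlp1': 0,                                # assumed projector name
        'language_model.model.embed_tokens': 0,
        'language_model.model.norm': 0,
        'language_model.lm_head': 0,
    }
    per_gpu = math.ceil(num_layers / world_size)
    for i in range(num_layers):
        device_map[f'language_model.model.layers.{i}'] = min(i // per_gpu, world_size - 1)
    # Keep the first and last transformer layers on GPU 0.
    device_map['language_model.model.layers.0'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

device_map = split_model(num_layers=80, world_size=4)  # assumed: 80 LLM layers, 4 GPUs
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL2-Llama3-76B',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map).eval()
```
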

```python
# Multi-round conversation over video frames (the preceding frame loading and
# question construction are elided in this excerpt).
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```

To deploy the model with LMDeploy, first install it with `pip install lmdeploy>=0.5.3`.

LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

#### A 'Hello, world' Example

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2-Llama3-76B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
response = pipe(('describe this image', image))
print(response.text)
```

If an `ImportError` occurs when running this example, please install the required dependency packages as prompted.

#### Multi-images Inference

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images lead to a higher number of input tokens, so the size of the context window (set via `session_len` in `TurbomindEngineConfig`) typically needs to be increased.

> Warning: Due to the scarcity of multi-image conversation data, performance on multi-image tasks may be unstable, and multiple attempts may be needed to achieve satisfactory results.

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2-Llama3-76B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image_urls = [
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering the images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
```

#### Batch Prompts Inference

Conducting inference with batch prompts is straightforward; just place them in a list:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2-Llama3-76B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image_urls = [
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)
```

#### Multi-turn Conversation

There are two ways to run multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and use the method introduced above (a sketch of this approach follows the example below); the other is to use the `pipeline.chat` interface:

```python
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2-Llama3-76B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
```

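For the message-based approach mentioned above, the conversation can be expressed as OpenAI-format messages and passed to the pipeline directly. The snippet below is only a sketch; the exact message layout is an assumption based on LMDeploy's OpenAI-style message format and is not part of this excerpt:

```python
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig

model = 'OpenGVLab/InternVL2-Llama3-76B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)

# First turn: a user message carrying both text and an image URL.
messages = [dict(role='user', content=[
    dict(type='text', text='describe this image'),
    dict(type='image_url', image_url=dict(
        url='https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg'))
])]
response = pipe(messages, gen_config=gen_config)
print(response.text)

# Second turn: append the assistant reply and the follow-up question, then call again.
messages.append(dict(role='assistant', content=response.text))
messages.append(dict(role='user', content='What is the woman doing?'))
response = pipe(messages, gen_config=gen_config)
print(response.text)
```
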
#### Service

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service:

```shell
lmdeploy serve api_server OpenGVLab/InternVL2-Llama3-76B --backend turbomind --server-port 23333 --tp 4
```

To use the OpenAI-style interface, you need to install OpenAI:
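It can be installed with `pip install openai`. A minimal sketch of querying the server started above through its OpenAI-compatible endpoint might then look like this; the placeholder API key, port, and image URL are assumptions based on the startup command and the examples earlier in this section:

```python
from openai import OpenAI

# The api_server launched above listens on port 23333; adjust if you changed it.
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
    temperature=0.8,
    top_p=0.8)
print(response.choices[0].message.content)
```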