czczup committed · Commit d044c87 · verified · 1 Parent(s): 963014e

Update README.md

Files changed (1)
  1. README.md +82 -9
README.md CHANGED
@@ -115,7 +115,7 @@ Limitations: Although we have made efforts to ensure the safety of the model dur
 
  ## Quick Start
 
- We provide an example code to run InternVL2-Llama3-76B using `transformers`.
 
  > Please use transformers>=4.37.2 to ensure the model works normally.
 
@@ -150,10 +150,6 @@ model = AutoModel.from_pretrained(
  trust_remote_code=True).eval()
  ```
 
- #### BNB 4-bit Quantization
-
- > **⚠️ Warning:** Due to significant quantization errors with BNB 4-bit quantization on InternViT-6B, the model may produce nonsensical outputs and fail to understand images. Therefore, please avoid using BNB 4-bit quantization.
-
  #### Multiple GPUs
 
  The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors.
@@ -443,7 +439,7 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
  num_patches_list=num_patches_list, history=None, return_history=True)
  print(f'User: {question}\nAssistant: {response}')
 
- question = 'Describe this video in detail. Don\'t repeat.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
  num_patches_list=num_patches_list, history=history, return_history=True)
  print(f'User: {question}\nAssistant: {response}')
@@ -494,14 +490,91 @@ pip install lmdeploy>=0.5.3
 
  LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
 
  #### Service
 
  LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
 
- > **⚠️ Warning**: Please make sure to install Flash Attention; otherwise, using `--tp` will cause errors.
-
  ```shell
- CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server OpenGVLab/InternVL2-Llama3-76B --backend turbomind --server-port 23333 --tp 4
  ```
 
  To use the OpenAI-style interface, you need to install OpenAI:
 
 
  ## Quick Start
 
+ We provide example code to run `InternVL2-Llama3-76B` using `transformers`.
 
  > Please use transformers>=4.37.2 to ensure the model works normally.
 
  trust_remote_code=True).eval()
  ```
 
  #### Multiple GPUs
 
  The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors.
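The full multi-GPU loading code is not shown in this excerpt. As a rough sketch of the idea described above (module names and layer counts below are illustrative assumptions, not taken from this README), a hand-built `device_map` can spread the LLM layers across GPUs while keeping the vision encoder, the projector, the embeddings, and the first and last LLM layers together on GPU 0:

```python
import math
import torch
from transformers import AutoModel

def split_model(num_layers, num_gpus):
    # Spread LLM layers across GPUs, but pin the modules that exchange tensors
    # directly (vision encoder, projector, embeddings, first and last LLM layer)
    # to GPU 0 so they always share a device.
    device_map = {
        'vision_model': 0,
        'mlp1': 0,
        'language_model.model.embed_tokens': 0,
        'language_model.model.norm': 0,
        'language_model.lm_head': 0,
        'language_model.model.layers.0': 0,
        f'language_model.model.layers.{num_layers - 1}': 0,
    }
    per_gpu = math.ceil(num_layers / num_gpus)
    for i in range(num_layers):
        device_map.setdefault(f'language_model.model.layers.{i}', i // per_gpu)
    return device_map

# Illustrative numbers: an 80-layer LLM split across 4 GPUs.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL2-Llama3-76B',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=split_model(num_layers=80, num_gpus=4)).eval()
```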
 
  num_patches_list=num_patches_list, history=None, return_history=True)
  print(f'User: {question}\nAssistant: {response}')
 
+ question = 'Describe this video in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
  num_patches_list=num_patches_list, history=history, return_history=True)
  print(f'User: {question}\nAssistant: {response}')
 
 
  LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
 
+ #### A 'Hello, world' Example
+
+ ```python
+ from lmdeploy import pipeline, TurbomindEngineConfig
+ from lmdeploy.vl import load_image
+
+ model = 'OpenGVLab/InternVL2-Llama3-76B'
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
+ response = pipe(('describe this image', image))
+ print(response.text)
+ ```
+
+ If an `ImportError` occurs while running this example, install the missing dependency packages as prompted.
+
+ #### Multi-image Inference
+
+ When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
+
+ > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.
+
+ ```python
+ from lmdeploy import pipeline, TurbomindEngineConfig
+ from lmdeploy.vl import load_image
+ from lmdeploy.vl.constants import IMAGE_TOKEN
+
+ model = 'OpenGVLab/InternVL2-Llama3-76B'
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
+
+ image_urls = [
+     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
+     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
+ ]
+
+ images = [load_image(img_url) for img_url in image_urls]
+ # Numbering the images in the prompt improves multi-image conversations
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
+ print(response.text)
+ ```
+
+ #### Batch Prompts Inference
+
+ Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
+
+ ```python
+ from lmdeploy import pipeline, TurbomindEngineConfig
+ from lmdeploy.vl import load_image
+
+ model = 'OpenGVLab/InternVL2-Llama3-76B'
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
+
+ image_urls = [
+     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
+     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
+ ]
+ prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
+ response = pipe(prompts)
+ print(response)
+ ```
+
+ #### Multi-turn Conversation
+
+ There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to OpenAI's format and use the method introduced above; the other is to use the `pipeline.chat` interface.
+
+ ```python
+ from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
+ from lmdeploy.vl import load_image
+
+ model = 'OpenGVLab/InternVL2-Llama3-76B'
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
+
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
+ gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
+ sess = pipe.chat(('describe this image', image), gen_config=gen_config)
+ print(sess.response.text)
+ sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
+ print(sess.response.text)
+ ```
+
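The snippet above covers only the `pipeline.chat` interface. As a minimal sketch of the first approach, the same pipeline can also be fed OpenAI-style (GPT-4V format) messages directly, with the conversation history resent on each turn; the exact message layout here is an assumption and may need adjusting to your LMDeploy version:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

model = 'OpenGVLab/InternVL2-Llama3-76B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image_url = 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg'

# First turn: one user message carrying text plus an image reference.
messages = [dict(role='user', content=[
    dict(type='text', text='describe this image'),
    dict(type='image_url', image_url=dict(url=image_url)),
])]
response = pipe(messages)
print(response.text)

# Second turn: append the assistant reply and the follow-up question,
# then send the whole history through the pipeline again.
messages += [
    dict(role='assistant', content=response.text),
    dict(role='user', content='What is the woman doing?'),
]
response = pipe(messages)
print(response.text)
```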
  #### Service
 
  LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
 
  ```shell
+ lmdeploy serve api_server OpenGVLab/InternVL2-Llama3-76B --backend turbomind --server-port 23333 --tp 4
  ```
 
  To use the OpenAI-style interface, you need to install OpenAI:
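Once the `openai` package is installed (for example with `pip install openai`), a request against the server started above might look like the following sketch; the base URL assumes the default host and the `--server-port 23333` used in the command above, and the API key is a placeholder since a local server typically does not check it:

```python
from openai import OpenAI

# Point the OpenAI client at the local LMDeploy api_server (port 23333 above).
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
    temperature=0.8,
    top_p=0.8)
print(response.choices[0].message.content)
```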