---
datasets:
- shenxq/OneVision
- shenxq/VideoChat2
base_model:
- Qwen/Qwen2-7B-Instruct
model-index:
- name: llava-onevision-qwen-7b-ov
  results:
  - task:
      type: multimodal
    dataset:
      name: EgoSchema
      type: egoschema
    metrics:
    - type: accuracy
      value: 67.6
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 65.4
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 66.9
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME
      type: videomme
    metrics:
    - type: accuracy
      value: 60.6
      name: accuracy
      verified: true
---
# LongVU

Play with the model on the [HF demo](https://huggingface.co/spaces/Vision-CAIR/LongVU).

<div align="left">
<a href='https://vision-cair.github.io/LongVU'><img src="https://longvu.s3.amazonaws.com/assets/demo.gif" alt="Demo GIF" style="width: 100%; max-width: 650px;"></a>
</div>

# Use

We provide a minimal example of how to run generation with our model. For more details, please refer to the [GitHub repository](https://github.com/Vision-CAIR/LongVU).

```python
# git clone https://github.com/Vision-CAIR/LongVU
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

# Load the tokenizer, model, and image processor from a local checkpoint
tokenizer, model, image_processor, context_len = load_pretrained_model(
    "./checkpoints/longvu_qwen", None, "cambrian_qwen",
)

model.eval()
video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

# Sample roughly one frame per second from the video
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps))])
video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()
    video.append(img)
video = np.stack(video)

# Preprocess the sampled frames for the model
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

# Build the Qwen chat prompt with the image token prepended to the question
qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Tokenize the prompt, run generation, and decode the answer
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=video,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0.2,
        max_new_tokens=128,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```
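
If you want to ask several questions about the same clip, the video decoding and frame preprocessing above only need to run once. The sketch below is our own illustrative wrapper, not part of the official example: the `ask` helper is a hypothetical name, and it simply reuses the `video` and `image_sizes` tensors produced above together with the same generation settings.

```python
# Hypothetical convenience wrapper (not part of the official LongVU example).
# It reuses the preprocessed `video` / `image_sizes` from the snippet above,
# so the frames are read and processed only once per clip.
def ask(question: str, max_new_tokens: int = 128) -> str:
    conv = conv_templates["qwen"].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
    ).unsqueeze(0).to(model.device)
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=video,
            image_sizes=image_sizes,
            do_sample=False,
            temperature=0.2,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            stopping_criteria=[stopping_criteria],
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

print(ask("Describe this video in detail"))
print(ask("What happens at the end of the video?"))
```

For very long videos, you can also trade coverage for memory by increasing the sampling stride in the snippet above, e.g. `range(0, len(vr), 2 * round(fps))` keeps roughly one frame every two seconds instead of one per second.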