---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
model-index:
- name: InternVideo2.5
  results:
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 72.8
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 75.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: Perception Test
      type: percepTest
    metrics:
    - type: accuracy
      value: 74.9
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LongVideoBench
      type: longvideobench
    metrics:
    - type: accuracy
      value: 60.6
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME (w/o sub)
      type: videomme
    metrics:
    - type: accuracy
      value: 65.1
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LVBench
      type: lvbench
    metrics:
    - type: accuracy
      value: 46.4
      name: accuracy
      verified: true
---

# 📕InternVideo2.5⚡
<!-- [\[📰 Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) -->
[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5)
[\[📜 Tech Report\]](https://arxiv.org/abs/2501.12386)
<!-- [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) -->

InternVideo2.5 is a video multimodal large language model (MLLM, built upon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves on existing MLLMs by strengthening their ability to perceive fine-grained details and to capture long-form temporal structure. We achieve this through dense vision task annotations via task preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo).

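To give a rough intuition for the token-compression idea (an illustrative sketch only, not the model's actual HiCo implementation; the function name, pooling ratios, and shapes below are invented for the example), vision tokens can be pooled first within each frame and then across neighboring frames:

```python
# Illustrative sketch of hierarchical token compression (NOT InternVideo2.5's real
# HiCo code): pool tokens spatially inside each frame, then merge adjacent frames,
# so the token count grows sub-linearly with video length.
import torch
import torch.nn.functional as F

def compress_tokens(frame_tokens: torch.Tensor, spatial_ratio: int = 4, temporal_ratio: int = 2) -> torch.Tensor:
    """frame_tokens: [T, N, D] = frames x tokens-per-frame x channels (T and N divisible by the ratios)."""
    T, N, D = frame_tokens.shape
    # Level 1: average-pool tokens within each frame (N -> N // spatial_ratio).
    x = F.avg_pool1d(frame_tokens.transpose(1, 2), kernel_size=spatial_ratio).transpose(1, 2)
    # Level 2: merge adjacent frames (T -> T // temporal_ratio).
    x = x.reshape(T // temporal_ratio, temporal_ratio, N // spatial_ratio, D).mean(dim=1)
    return x.flatten(0, 1)  # one compressed token sequence for the whole clip

tokens = torch.randn(16, 256, 1024)   # 16 frames x 256 tokens each = 4096 tokens
print(compress_tokens(tokens).shape)  # torch.Size([512, 1024]) -> 8x fewer tokens
```
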
## 📈 Performance

| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
| --- | --- | --- | --- |
| InternVideo2.5 | 75.7 | 60.6 | 65.1 |

## 🚀 How to use the model

First, install [flash attention 2](https://github.com/Dao-AILab/flash-attention) and a few other dependencies. A simple installation example is given below:
```bash
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
pip install flash-attn --no-build-isolation
```
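Optionally, you can verify the environment before loading the model; the following minimal check assumes a CUDA-capable GPU and a successful flash-attn install:
```python
# Optional sanity check: PyTorch should see a GPU and flash-attn should import cleanly.
import torch
import flash_attn

print("CUDA available:", torch.cuda.is_available())
print("flash-attn version:", flash_attn.__version__)
```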
Then you can use our model:
```python
from transformers import AutoModel, AutoTokenizer

# model setting
model_path = 'OpenGVLab/InternVideo2.5'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
image_processor = model.get_vision_tower().image_processor

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1,
)

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question1,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config,
)
print(output1)

# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question2,
    chat_history=chat_history,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config,
)
print(output2)
```
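
Before running the comparatively slow model call on a long video, it can help to check that the file decodes; a minimal check with decord (already in the install list above; `your_video.mp4` is a placeholder) might look like this:

```python
# Optional: verify the video opens and inspect its length before inference.
from decord import VideoReader, cpu

vr = VideoReader("your_video.mp4", ctx=cpu(0))
fps = vr.get_avg_fps()
print(f"{len(vr)} frames at {fps:.1f} fps (~{len(vr) / fps:.1f} s)")
```

If GPU memory is tight, lowering `max_num_frames` from the 512 shown above should be the simplest way to reduce the number of tokens processed per video.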

## ✏️ Citation

```bibtex
@article{wang2025internvideo,
  title={InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling},
  author={Wang, Yi and Li, Xinhao and Yan, Ziang and He, Yinan and Yu, Jiashuo and Zeng, Xiangyu and Wang, Chenting and Ma, Changlian and Huang, Haian and Gao, Jianfei and Dou, Min and Chen, Kai and Wang, Wenhai and Qiao, Yu and Wang, Yali and Wang, Limin},
  journal={arXiv preprint arXiv:2501.12386},
  year={2025}
}
```