mmiemon committed (verified) · commit f09f08d · 1 parent: bc0d7e6

Update README.md

Files changed (1): README.md (+98 −6)
# BIMBA

[**BIMBA: Selective-Scan Compression for Long-Range Video Question Answering**](https://arxiv.org/abs/2503.09590)\
Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, and Lorenzo Torresani\
<span style="color:red">**Accepted by CVPR 2025**</span>

[**🌐 Homepage**](https://sites.google.com/view/bimba-mllm) | [**📖 arXiv**](https://arxiv.org/abs/2503.09590) | [**💻 GitHub**](https://github.com/md-mohaiminul/BIMBA) | [**🤗 Model**](https://huggingface.co/mmiemon/BIMBA-LLaVA-Qwen2-7B) | [**🌟 Demo**](BIMBA-LLaVA-NeXT/demo_selective_scan_compression.ipynb)

BIMBA is a multimodal large language model (MLLM) capable of efficiently processing long-range videos. Our model leverages the selective-scan mechanism of [Mamba](https://arxiv.org/abs/2312.00752) to effectively select critical information from high-dimensional video and transform it into a reduced token sequence for efficient LLM processing. Extensive experiments demonstrate that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including [PerceptionTest](https://arxiv.org/abs/2305.13786), [NExT-QA](https://arxiv.org/abs/2105.08276), [EgoSchema](https://arxiv.org/abs/2308.09126), [VNBench](https://arxiv.org/abs/2406.09367), [LongVideoBench](https://arxiv.org/abs/2407.15754), [Video-MME](https://arxiv.org/abs/2405.21075), and [MLVU](https://arxiv.org/abs/2406.04264).

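To make the idea concrete, here is a minimal, illustrative sketch of selective-scan token compression: learnable query tokens are appended to the long video-token sequence, a Mamba (selective-scan) layer scans the combined sequence, and only the outputs at the query positions are kept as the compressed tokens. This is **not** the exact BIMBA module (the real design, including how queries and video tokens are arranged and projected, is in the GitHub repo); it assumes the `mamba_ssm` package, and the dimensions shown are placeholders.

```python
# Illustrative sketch only -- not the exact BIMBA compression module.
# Assumes the `mamba_ssm` package (pip install mamba-ssm), which requires a CUDA GPU.
import torch
import torch.nn as nn
from mamba_ssm import Mamba


class SelectiveScanCompressor(nn.Module):
    """Compress a long video-token sequence into a small set of query tokens."""

    def __init__(self, dim=1024, num_queries=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)  # learnable query tokens
        self.mamba = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2)    # selective-scan (Mamba) layer

    def forward(self, video_tokens):                      # video_tokens: (B, N, dim), N >> num_queries
        b = video_tokens.size(0)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Place queries at the end so the causal scan lets them aggregate the whole video.
        x = torch.cat([video_tokens, queries], dim=1)
        y = self.mamba(x)
        return y[:, -queries.size(1):]                    # keep only the query positions: (B, num_queries, dim)


# e.g. 64 frames x 196 patch tokens -> 256 compressed tokens for the LLM
compressor = SelectiveScanCompressor(dim=1024, num_queries=256).cuda()
video_tokens = torch.randn(1, 64 * 196, 1024, device="cuda")
print(compressor(video_tokens).shape)                     # torch.Size([1, 256, 1024])
```
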
# Quick Start

The snippet below loads the BIMBA-LLaVA-Qwen2-7B LoRA checkpoint on top of its LLaVA-Video-7B-Qwen2 base model, uniformly samples 64 frames from a video, and generates a detailed description.

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
import copy
import warnings

import numpy as np
import torch
from decord import VideoReader, cpu

warnings.filterwarnings("ignore")


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Decode a video and return (frames, frame timestamps, total duration in seconds)."""
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        # Fall back to uniform sampling of exactly max_frames_num frames.
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time


# BIMBA is loaded as a LoRA checkpoint ("llava_qwen_lora") on top of the LLaVA-Video-7B-Qwen2 base.
model_path = "checkpoints/BIMBA-LLaVA-Qwen2-7B"
model_base = "lmms-lab/LLaVA-Video-7B-Qwen2"
model_name = "llava_qwen_lora"

device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path=model_path,
    model_base=model_base,
    model_name=model_name,
    torch_dtype="bfloat16",
    device_map=device_map,
    attn_implementation=None,
)
model.eval()

# Sample 64 frames uniformly from the example video and preprocess them.
video_path = "assets/example.mp4"
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]

# Build the prompt with the Qwen conversation template.
conv_template = "qwen_1_5"
time_instruction = (
    f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
    f"These frames are located at {frame_time}. Please answer the following questions related to this video."
)
question = DEFAULT_IMAGE_TOKEN + f"{time_instruction}\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

# Generate greedily and decode the answer.
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
```
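
The script above expects the checkpoint under `checkpoints/BIMBA-LLaVA-Qwen2-7B`. If you do not have it on disk yet, one way to fetch it from this Hub repository (the local path is just an assumption matching the script above) is:

```python
# Download the BIMBA checkpoint from the Hugging Face Hub into the path used in Quick Start.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mmiemon/BIMBA-LLaVA-Qwen2-7B",
    local_dir="checkpoints/BIMBA-LLaVA-Qwen2-7B",
)
```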
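Since BIMBA is evaluated on multiple-choice VQA benchmarks, you may also want to ask a question with candidate answers. The snippet below is a hypothetical example (the question and options are made up, and the exact prompt format used for benchmark evaluation may differ; see the GitHub repo); it reuses the `tokenizer`, `model`, `video`, and prompt-building variables from the Quick Start script.

```python
# Hypothetical multiple-choice question, reusing objects from the Quick Start script.
options = ["A. Cooking a meal", "B. Assembling furniture", "C. Playing a board game", "D. Painting a wall"]
mcq = (
    DEFAULT_IMAGE_TOKEN
    + f"{time_instruction}\nWhat is the person in the video doing?\n"
    + "\n".join(options)
    + "\nAnswer with the option's letter from the given choices directly."
)
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], mcq)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
out = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=16,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0].strip())
```
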
## Citation

If you find BIMBA useful in your research, please cite it with the following BibTeX entry.

```bibtex
@article{islam2025bimba,
  title={BIMBA: Selective-Scan Compression for Long-Range Video Question Answering},
  author={Islam, Md Mohaiminul and Nagarajan, Tushar and Wang, Huiyu and Bertasius, Gedas and Torresani, Lorenzo},
  journal={arXiv preprint arXiv:2503.09590},
  year={2025}
}
```