ZhangYuanhan committed
Commit 5a16914 · 1 Parent(s): b05b2ee

Update README.md

Files changed (1): README.md (+5, -5)
README.md CHANGED
@@ -11,7 +11,7 @@ metrics:
 tags:
 - multimodal
 model-index:
-- name: LLaVA-NeXT-Video-7B-Qwen2
+- name: LLaVA-Video-7B-Qwen2
   results:
   - task:
       type: multimodal
@@ -117,7 +117,7 @@ base_model:
 - lmms-lab/llava-onevision-qwen2-7b-si
 ---
 
-# LLaVA-NeXT-Video-7B-Qwen2
+# LLaVA-Video-7B-Qwen2
 
 ## Table of Contents
 
@@ -130,7 +130,7 @@ base_model:
 
 ## Model Summary
 
-The LLaVA-NeXT-Video models are 7/72B parameter models trained on [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Video-SFT-Data) and [LLaVA-OneVision Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), based on Qwen2 language model with a context window of 32K tokens.
+The LLaVA-Video models are 7/72B parameter models trained on [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) and [LLaVA-OneVision Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), based on Qwen2 language model with a context window of 32K tokens.
 
 This model support at most 64 frames.
 
@@ -143,7 +143,7 @@ This model support at most 64 frames.
 
 ### Intended use
 
-The model was trained on [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Video-SFT-Data) and [LLaVA-OneVision Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), having the ability to interact with images, multi-image and videos, but specific to videos.
+The model was trained on [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) and [LLaVA-OneVision Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), having the ability to interact with images, multi-image and videos, but specific to videos.
 
 
 
@@ -186,7 +186,7 @@ def load_video(self, video_path, max_frames_num,fps=1,force_sample=False):
     spare_frames = vr.get_batch(frame_idx).asnumpy()
     # import pdb;pdb.set_trace()
     return spare_frames,frame_time,video_time
-pretrained = "lmms-lab/LLaVA-NeXT-Video-7B-Qwen2"
+pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
 model_name = "llava_qwen"
 device = "cuda"
 device_map = "auto"
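For context on the last hunk: the renamed `pretrained` assignment sits at the tail of the README's video-preprocessing helper, of which the diff shows only the final three lines. Below is a minimal sketch of what such a `load_video` helper can look like using decord (the library implied by `vr.get_batch(frame_idx).asnumpy()`): frames are sampled at a target fps and capped at `max_frames_num`, matching the README's note that the model supports at most 64 frames. Everything outside the three lines visible in the diff is an illustrative reconstruction, not a verbatim copy of the README.

```python
import numpy as np
from decord import VideoReader, cpu


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Sample frames from `video_path` at roughly `fps` frames per second,
    falling back to uniform sampling when that would exceed `max_frames_num`
    (the model accepts at most 64 frames)."""
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    # Keep one decoded frame every `step` frames to approximate the target fps.
    step = max(1, round(vr.get_avg_fps() / fps))
    frame_idx = list(range(0, total_frame_num, step))
    if len(frame_idx) > max_frames_num or force_sample:
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
    # Timestamps (seconds) of the sampled frames, passed to the prompt as text.
    frame_time = ",".join(f"{i / vr.get_avg_fps():.2f}s" for i in frame_idx)
    spare_frames = vr.get_batch(frame_idx).asnumpy()  # (N, H, W, 3) uint8 array
    return spare_frames, frame_time, video_time
```

The renamed checkpoint is then consumed by the lines that close the hunk. A minimal continuation, assuming the `load_pretrained_model` builder from the LLaVA-NeXT codebase (its exact signature may differ between revisions):

```python
from llava.model.builder import load_pretrained_model  # LLaVA-NeXT repository

pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
# torch_dtype="bfloat16" is an assumption; pick what your hardware supports.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map
)
model.eval()
```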