---
license: mit
---

## Model Summary

Video-CCAM-9B is a Video-MLLM built on [Yi-1.5-9B-Chat](https://huggingface.co/01-ai/Yi-1.5-9B-Chat) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384).

## Usage

Inference uses Hugging Face Transformers on NVIDIA GPUs. Requirements, tested with Python 3.10:

```
torch==2.1.0
torchvision==0.16.0
transformers==4.40.2
peft==0.10.0
```

A minimal, unofficial inference sketch is included at the end of this card.

## Inference & Evaluation

Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for inference and evaluation.

### Video-MME

| # Frames | 32 | 96 |
|:-:|:-:|:-:|
| w/o subtitles | 50.0 | 50.6 |
| w/ subtitles | 53.1 | 54.9 |

### MVBench

60.70 (16 frames)

## Acknowledgement

* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-9B is trained with the xtuner framework. Thanks for their excellent work!
* [Yi-1.5-9B-Chat](https://huggingface.co/01-ai/Yi-1.5-9B-Chat): a great language model developed by [01.AI](https://www.lingyiwanwu.com/).
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): an outstanding vision encoder developed by Google.

## License

The model is licensed under the MIT license.
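
## Inference Sketch (Unofficial)

The sketch below is a rough illustration only; the actual entry points are defined in the [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) repository. It assumes the model ships custom modeling code loadable via `trust_remote_code=True`, uses `decord` for frame sampling (not among the pinned requirements above), and the `model.chat` call and repo id are hypothetical stand-ins, not a documented API.

```python
# Unofficial, hypothetical sketch -- see https://github.com/QQ-MM/Video-CCAM
# for the supported inference code. `model.chat` and the repo id below are
# illustrative assumptions, not a documented API.
import torch
from PIL import Image
from decord import VideoReader  # assumption: decord is installed for video decoding
from transformers import AutoModel

# Assumption: the checkpoint ships custom modeling code, as many Video-MLLMs
# on the Hub do, so loading requires trust_remote_code=True.
model = AutoModel.from_pretrained(
    "Video-CCAM-9B",            # hypothetical local path or Hub id
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda").eval()

# Uniformly sample 32 frames from the video as PIL images
# (32 frames matches the first Video-MME setting reported above).
vr = VideoReader("example.mp4")
indices = torch.linspace(0, len(vr) - 1, steps=32).long().tolist()
frames = [Image.fromarray(vr[i].asnumpy()) for i in indices]

# Hypothetical chat call: ask a question about the sampled frames.
response = model.chat(frames, "Describe what happens in this video.")
print(response)
```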