---
license: mit
---

![header](./assets/header.png)

📃 [Paper](https://arxiv.org/abs/2409.02889) • 🌐 Demo • 🤗 LongLLaVA

![efficiency](./assets/singleGPU.png)

## 🌈 Update

* **[2024.09.05]** The LongLLaVA repo is published! 🎉 The code will be released soon.

## Architecture
![Architecture Image](./assets/arch.png)
## Results
- Main Results
  ![Main Results](./assets/result1.png)
- Diagnostic Results
  ![Diagnostic Results](./assets/diaresult.png)
- Video-NIAH
  ![Video-NIAH](./assets/NIAH.png)
## Results Reproduction

### Data Download and Construction
**Dataset Taxonomy**

![Dataset](./assets/dataset.png)

**Dataset Downloading and Construction**

> Coming Soon~
### Training

> Coming Soon~

- Stage I: Single-image Alignment.
  ```bash
  bash Pretrain.sh
  ```
- Stage II: Single-image Instruction-tuning.
  ```bash
  bash SingleImageSFT.sh
  ```
- Stage III: Multi-image Instruction-tuning.
  ```bash
  bash MultiImageSFT.sh
  ```

### Evaluation

> Coming Soon~

```bash
bash Eval.sh
```

## TO DO

- [ ] Release Model Evaluation Code
- [ ] Release Data Construction Code
- [ ] Release Model Training Code

## Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA): Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

## Citation

```
@misc{wang2024longllavascalingmultimodalllms,
      title={LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture},
      author={Xidong Wang and Dingjie Song and Shunian Chen and Chen Zhang and Benyou Wang},
      year={2024},
      eprint={2409.02889},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.02889},
}
```
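Once the stage scripts listed under Training and Evaluation are released, an end-to-end reproduction run might look like the sketch below. This is only an assumption based on the stage order above (not a released workflow): it presumes the scripts are run from the repository root and that each stage picks up the checkpoint produced by the previous one; the actual entry points and arguments may differ.

```bash
# Hypothetical end-to-end reproduction run (sketch, not a released workflow).
# Assumes the stage scripts are executed from the repository root and that each
# stage loads the checkpoint produced by the previous stage.
set -e                   # abort if any stage fails

bash Pretrain.sh         # Stage I: single-image alignment
bash SingleImageSFT.sh   # Stage II: single-image instruction-tuning
bash MultiImageSFT.sh    # Stage III: multi-image instruction-tuning
bash Eval.sh             # evaluate the final checkpoint
```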