---
base_model: Qwen/Qwen2-VL-72B-Instruct
library_name: transformers
license: apache-2.0
tags:
- llama-factory
- full
- generated_from_trainer
model-index:
- name: TVC-72B
  results: []
pipeline_tag: image-text-to-text
---
## Model Summary

TVC-72B is a 72B-parameter model based on Qwen2-VL-72B-Instruct with a context window of 8K tokens.

- **Repository:** https://github.com/sun-hailong/TVC
- **Languages:** English, Chinese
- **Paper:** https://arxiv.org/abs/2503.13360
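
Below is a minimal inference sketch using the standard Qwen2-VL interface in Transformers (`Qwen2VLForConditionalGeneration` and `AutoProcessor`), which this card's `base_model` and `library_name` imply. The repo id `TVC/TVC-72B` and the image URL are placeholders, not values stated in this card; substitute the actual checkpoint path and your own input.

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Placeholder repo id; substitute the actual TVC-72B checkpoint path.
model_id = "TVC/TVC-72B"

# bfloat16 matches the training precision listed below; device_map="auto"
# shards the 72B weights across all visible GPUs.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Illustrative image URL; replace with a real multimodal reasoning problem.
image_url = "https://example.com/geometry_problem.png"
image = Image.open(requests.get(image_url, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem in the image. Reason step by step."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Long chain-of-thought outputs can be lengthy, so allow a generous token budget.
output_ids = model.generate(**inputs, max_new_tokens=2048)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```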

### Model Architecture

- **Architecture:** Qwen2-VL-72B-Instruct
- **Data:** a mixture of 300K long chain-of-thought reasoning samples
- **Precision:** BFloat16

#### Hardware & Software

- **Hardware:** 64 × NVIDIA H20 GPUs
- **Orchestration:** Hugging Face Trainer
- **Code:** PyTorch

### Framework versions

- Transformers 4.46.1
- PyTorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.20.3

## Citation

```
@article{sun2024mitigating,
  title={Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning},
  author={Sun, Hai-Long and Sun, Zhun and Peng, Houwen and Ye, Han-Jia},
  journal={arXiv preprint arXiv:2503.13360},
  year={2025}
}
```