---
base_model:
- google/siglip-so400m-patch14-384
language:
- en
- zh
license: apache-2.0
pipeline_tag: image-feature-extraction
---

# Oryx-ViT

## Model Summary

Oryx-ViT is a visual encoder trained on 200M samples. It seamlessly and efficiently processes visual inputs of arbitrary spatial size and temporal length, as described in the paper [Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution](https://arxiv.org/abs/2409.12961).

- **Repository:** https://github.com/Oryx-mllm/Oryx
- **Project Page:** https://oryx-mllm.github.io
- **Languages:** English, Chinese
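
Since the checkpoint follows the SigLIP vision architecture it builds on, features can be extracted with the standard `transformers` SigLIP classes. The snippet below is a minimal sketch, assuming the released weights are drop-in compatible with `SiglipVisionModel`; it loads the base SigLIP checkpoint as a stand-in, and the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

# Preprocessing and vision tower from the base SigLIP checkpoint that
# Oryx-ViT builds on; substitute the Oryx-ViT weights if they are
# compatible with this class (assumption, since the architectures match).
processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")
model = SiglipVisionModel.from_pretrained(
    "google/siglip-so400m-patch14-384",
    torch_dtype=torch.bfloat16,  # the model card lists BFloat16 precision
)
model.eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(pixel_values=inputs.pixel_values.to(torch.bfloat16))

# Patch-level features: (1, 729, 1152) for a 384x384 input with 14x14 patches.
features = outputs.last_hidden_state
print(features.shape)
```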

### Model Architecture

- **Architecture:** SigLIP
- **Data:** a mixture of 200M samples, trained for 2 epochs
- **Precision:** BFloat16
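
As noted in the summary, Oryx-ViT handles inputs of arbitrary spatial size, which requires resizing the learned position embeddings for resolutions other than 384×384. The sketch below uses the `interpolate_pos_encoding` flag of the `transformers` SigLIP implementation (available in recent versions) to illustrate this; whether the released Oryx code uses this exact mechanism is an assumption.

```python
import torch
from transformers import SiglipVisionModel

model = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
model.eval()

# A non-square, non-native resolution: 448x644 pixels -> a 32x46 grid of
# 14x14 patches. (For a real image, resize and normalize it yourself; the
# default processor would rescale everything to 384x384.)
pixel_values = torch.randn(1, 3, 448, 644)

with torch.no_grad():
    # interpolate_pos_encoding=True resizes the learned position embeddings
    # to the new patch grid instead of failing on a size mismatch.
    outputs = model(pixel_values=pixel_values, interpolate_pos_encoding=True)

print(outputs.last_hidden_state.shape)  # (1, 1472, 1152), since 32 * 46 = 1472
```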

#### Hardware & Software

- **Hardware:** 64 × NVIDIA A100 GPUs
- **Orchestration:** Hugging Face Trainer
- **Code:** PyTorch

## Citation

```bibtex
@article{liu2024oryx,
  title={Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution},
  author={Liu, Zuyan and Dong, Yuhao and Liu, Ziwei and Hu, Winston and Lu, Jiwen and Rao, Yongming},
  journal={arXiv preprint arXiv:2409.12961},
  year={2024}
}
```