---
language: en
tags:
- multimodal
- text
- image
- image-to-text
datasets:
- HuggingFaceM4/OBELICS
- laion/laion2B-en
- coyo-700m
- mmc4
pipeline_tag: text-generation
inference: true
---

## Paper

More details can be found in our paper at https://arxiv.org/abs/2403.01487. We have released the pretrained model and the PyTorch code at https://github.com/InfiMM/infimm-hd/. Feel free to build your own model on top of our pretrained checkpoint.

## Quickstart

Use the code below to get started with the base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("Infi-MM/infimm-hd", trust_remote_code=True)

prompts = [
    {
        "role": "user",
        "content": [
            {"image": "/xxx/test.jpg"},  # replace with the path to your image
            "Please describe the image in detail.",
        ],
    }
]
inputs = processor(prompts)

# Load the model in bfloat16 and place it on GPU 0.
model = AutoModelForCausalLM.from_pretrained(
    "Infi-MM/infimm-hd",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(0).eval()

# Match the image tensors to the model's dtype, then move all inputs to its device.
inputs["batch_images"] = inputs["batch_images"].to(torch.bfloat16)
for k in inputs:
    inputs[k] = inputs[k].to(model.device)

generated_ids = model.generate(
    **inputs,
    min_new_tokens=0,
    max_new_tokens=256,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
```
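
Because `content` is a list that interleaves image entries and text, you can in principle pass several images in one turn. This is a minimal sketch assuming the processor accepts multiple `{"image": ...}` entries per turn (the interleaved pretraining data suggests so, but verify the outputs); the file paths are placeholders:

```python
# Sketch: a multi-image prompt, reusing the processor and model loaded above.
# Assumption: the processor accepts several {"image": ...} entries in one turn.
prompts = [
    {
        "role": "user",
        "content": [
            {"image": "/xxx/first.jpg"},
            {"image": "/xxx/second.jpg"},
            "What are the differences between these two images?",
        ],
    }
]
inputs = processor(prompts)
# Then apply the same dtype/device handling and generate() call as above.
```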
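
On GPUs without native bfloat16 support, loading in float16 may be an option. Only the bfloat16 path above is the one we ship, so the float16 variant below is an untested assumption; check the generations for numerical issues before relying on it:

```python
# Untested assumption: the checkpoint is numerically stable in float16.
model = AutoModelForCausalLM.from_pretrained(
    "Infi-MM/infimm-hd",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to(0).eval()

inputs = processor(prompts)
inputs["batch_images"] = inputs["batch_images"].to(torch.float16)  # match the model dtype
for k in inputs:
    inputs[k] = inputs[k].to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```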
## License

<a href="https://creativecommons.org/licenses/by-nc/4.0/deed.en">
  <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Cc_by-nc_icon.svg/600px-Cc_by-nc_icon.svg.png" width="160">
</a>

This project is licensed under the **CC BY-NC 4.0** license.

The copyright of the images belongs to the original authors.

See [LICENSE](LICENSE) for more information.

## Contact Us

Please feel free to contact us via email at [[email protected]](mailto:[email protected]) if you have any questions.

## Citation

```bibtex
@misc{liu2024infimmhd,
  title={InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding},
  author={Haogeng Liu and Quanzeng You and Xiaotian Han and Yiqi Wang and Bohan Zhai and Yongfei Liu and Yunzhe Tao and Huaibo Huang and Ran He and Hongxia Yang},
  year={2024},
  eprint={2403.01487},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```