---
language:
- pt
datasets:
- TucanoBR/GigaVerbo
- TucanoBR/ViTucano-Pretrain
- TucanoBR/ViTucano-SFT
pipeline_tag: image-text-to-text
license: apache-2.0
tags:
- vision
- image-text-to-text
library_name: transformers
base_model:
- TucanoBR/Tucano-1b1
co2_eq_emissions:
  emissions: 14100
  source: CodeCarbon
  geographical_location: Germany
  hardware_used: NVIDIA A40
---
# ViTucano-1b5-v1

<img src="ViTucano-logo.png" alt="Uma ilustração de um tucano usando um elegante terno. O tucano está olhando para o lado, o que mostra o monóculo em seu olho direito." height="200">

## Model Summary

**ViTucano** is our first attempt at creating a vision assistant natively pretrained in Portuguese. **ViTucano** is built on top of the [Tucano series](https://arxiv.org/abs/2411.07854) using the [TinyLLaVA Factory](https://arxiv.org/abs/2405.11788). ViTucano integrates visual understanding with linguistic capabilities, creating a tool for multimodal tasks (e.g., image captioning, visual question answering, etc.).

## Details

- **Architecture:** [`TinyLlavaForConditionalGeneration`](https://github.com/Nkluge-correa/TinyLLaVA_Factory/blob/main/tinyllava/model/modeling_tinyllava.py)
- **Vision Tower:** [`google/siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Size:** 1,534,831,680 parameters
- **Context length:** 2048 tokens
- **Datasets:**
  - [GigaVerbo](https://huggingface.co/datasets/TucanoBR/GigaVerbo)
  - [ViTucano-Pretrain](https://huggingface.co/datasets/TucanoBR/ViTucano-Pretrain)
  - [ViTucano-SFT](https://huggingface.co/datasets/TucanoBR/ViTucano-SFT)
- **Language:** Portuguese
- **GPU:** 8 NVIDIA A40
- **Training time:** ~14 hours
- **Emissions:** 14.10 KgCO2eq (Germany)
- **Total energy consumption:** 37 kWh

This repository contains the [source code](https://github.com/Nkluge-correa/TinyLLaVA_Factory) used to train this model.

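As a quick sanity check on the numbers above, the snippet below loads the checkpoint and reports its parameter count. This is a minimal sketch: it only uses the standard `transformers` loading path shown in the usage section, and the exact config attribute names (e.g., for the context length) depend on this repository's remote code.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "TucanoBR/ViTucano-1b5-v1"

# The config alone is cheap to fetch and shows the architecture and vision tower settings
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)

# Loading the full model lets us count parameters (1,534,831,680 expected per the card)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {n_params:,}")
```
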
## Intended Uses

The primary intended use of the ViTucano models is to serve as foundations for research and development involving native Portuguese foundation models. You may also fine-tune and adapt ViTucano models for deployment if your use follows the Apache 2.0 license. If you decide to use the ViTucano models as a basis for your fine-tuned model, please conduct your own risk and bias assessment.

## Out-of-scope Use

- ViTucano models are **not intended for deployment**. They are not an out-of-the-box product and should not be used for human-facing interactions.

- ViTucano models are for **the Portuguese language only** and are unsuitable for image-to-text generation tasks in other languages.

- ViTucano models have **not been fine-tuned** for any specific downstream task.

## Basic usage

⚠️ Using ViTucano models through the `transformers` library requires executing remote code (`trust_remote_code=True`). The executed files are `configuration.py` and `modeling_tinyllava_tucano.py`, both available in this repository. ⚠️

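Since `trust_remote_code=True` runs code downloaded from this repository, you may want to review those two files before executing anything. The sketch below uses `huggingface_hub` (installed alongside `transformers`) to fetch and print them; the file names are the ones listed in the warning above.

```python
from huggingface_hub import hf_hub_download

# Download and review the remote-code files before enabling trust_remote_code
for filename in ("configuration.py", "modeling_tinyllava_tucano.py"):
    path = hf_hub_download(repo_id="TucanoBR/ViTucano-1b5-v1", filename=filename)
    print(f"--- {filename} -> {path} ---")
    with open(path, encoding="utf-8") as f:
        print(f.read())
```
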
<details>
<summary>Run inference using <code>tinyllava</code></summary>

```python
from tinyllava.eval.run_tiny_llava import eval_model

model_path = "TucanoBR/ViTucano-1b5-v1"
prompt = "Quais são as coisas com as quais devo ter cuidado quando estiver aqui?"
image_file = "https://raw.githubusercontent.com/Nkluge-correa/TinyLLaVA_Factory/refs/heads/main/assets/sample.jpg"
conv_mode = "llama"  # conversation template used to format the prompt

# eval_model expects an argparse-style namespace, so we build one inline
args = type('Args', (), {
    "model_path": model_path,
    "model": None,
    "query": prompt,
    "conv_mode": conv_mode,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

eval_model(args)
```

</details>

<details>
<summary>Run inference using <code>transformers</code></summary>

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "TucanoBR/ViTucano-1b5-v1"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    # torch_dtype=torch.bfloat16,              # for optimized inference 🚀
    # attn_implementation="flash_attention_2", # for optimized inference 🚀
    trust_remote_code=True)
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(model_path)
prompt = "Quais são as coisas com as quais devo ter cuidado quando estiver aqui?"
image_file = "https://raw.githubusercontent.com/Nkluge-correa/TinyLLaVA_Factory/refs/heads/main/assets/sample.jpg"

# `chat` is provided by this repository's remote code; it handles image loading and prompt formatting
output_text, _ = model.chat(prompt=prompt, image=image_file, tokenizer=tokenizer)

print(output_text)
```

</details>

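If GPU memory is tight, quantized loading may help. The following is an untested sketch, not part of the original card: it assumes the standard `transformers` + `bitsandbytes` integration (extra dependencies `bitsandbytes` and `accelerate`) also works with this repository's custom architecture.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_path = "TucanoBR/ViTucano-1b5-v1"

# NOTE: untested assumption — 4-bit NF4 quantization via bitsandbytes, computing in bfloat16
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",  # requires the accelerate package
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "Quais são as coisas com as quais devo ter cuidado quando estiver aqui?"
image_file = "https://raw.githubusercontent.com/Nkluge-correa/TinyLLaVA_Factory/refs/heads/main/assets/sample.jpg"

output_text, _ = model.chat(prompt=prompt, image=image_file, tokenizer=tokenizer)
print(output_text)
```
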
## Limitations

Like almost all other multimodal language models trained on large datasets scraped from the web, the ViTucano models exhibit behavior that makes them far from an out-of-the-box solution for many real-world applications, especially those requiring factual, reliable, and nontoxic text generation. ViTucano models are all subject to the following:

- **Hallucinations:** ViTucano models may generate misleading or entirely false information when interpreting or describing visual inputs, leading to hallucinations that could be mistaken for accurate observations or factual statements.

- **Biases and Toxicity:** ViTucano models inherit social and historical stereotypes from the training data. These biases can manifest in harmful, offensive, or misleading descriptions or analyses of visual or textual content.

- **Unreliable Visual Interpretations:** ViTucano models may produce inaccurate interpretations of visual elements, including objects, scenes, or text within images. Such outputs should not be considered reliable without human verification.

- **Multimodal Language Limitations:** While ViTucano models are optimized for Portuguese, handling multilingual visual and textual contexts may lead to errors, misinterpretations, or inadequate responses, especially with non-Portuguese content.

- **Repetition and Irrelevant Details:** ViTucano models can exhibit repetitive response patterns or generate verbose descriptions unrelated to the given visual or textual input, particularly under specific hyperparameter configurations.

Hence, even though our models are released with a permissive license, we urge users to perform their own risk analysis before using them for real-world applications.

## Cite as 🤗

### ViTucano

```bibtex
@misc{correa2024vitucano,
  author = {Corr{\^e}a, Nicholas Kluge and Sen, Aniket and Falk, Sophia and Fatimah, Shiza},
  title = {{ViTucano: A Portuguese Vision Assistant}},
  year = {2024},
  howpublished = {\url{https://huggingface.co/TucanoBR}},
}
```

### Tucano

```bibtex
@misc{correa2024tucanoadvancingneuraltext,
  title = {{Tucano: Advancing Neural Text Generation for Portuguese}},
  author = {Corr{\^e}a, Nicholas Kluge and Sen, Aniket and Falk, Sophia and Fatimah, Shiza},
  year = {2024},
  eprint = {2411.07854},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2411.07854},
}
```

### TinyLLaVA Factory

```bibtex
@article{jia2024tinyllava,
  title = {TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models},
  author = {Jia, Junlong and Hu, Ying and Weng, Xi and Shi, Yiming and Li, Miao and Zhang, Xingjian and Zhou, Baichuan and Liu, Ziyu and Luo, Jie and Huang, Lei and Wu, Ji},
  journal = {arXiv preprint arXiv:2405.11788},
  year = {2024}
}
```

### LLaVA

```bibtex
@misc{liu2023llava,
  title = {Visual Instruction Tuning},
  author = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  publisher = {NeurIPS},
  year = {2023},
}
```

## Acknowledgments

We gratefully acknowledge the access granted to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by the [University of Bonn](https://www.uni-bonn.de/en), along with the support provided by its High Performance Computing & Analytics Lab.

## License

ViTucano is licensed under the Apache License, Version 2.0. For more details, see the [LICENSE](LICENSE) file.