Update README.md
README.md
@@ -78,16 +78,15 @@ SmolVLM is not intended for high-stakes scenarios or critical decision-making pr
 
 SmolVLM2 is built upon [SigLIP](https://huggingface.co/google/siglip-base-patch16-512) as image encoder and [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) for text decoder part.
 
-We release the
+We release the SmolVLM2 checkpoints under the Apache 2.0 license.
 
 ## Training Data
 
-SmolVLM2 used 3.3M samples for training
-
-### General split
-
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
+SmolVLM2 used 3.3M samples for training originally from ten different datasets: LlaVa Onevision, M4-Instruct, Mammoth, LlaVa Video 178K, FineVideo, VideoStar, VRipt, Vista-400K, MovieChat and ShareGPT4Video.
+In the following plots we give a general overview of the samples across modalities and the source of those samples.
 
+<center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
+</center>
 ### Text mixture
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_text.png" width="auto" height="auto" alt="Image description">
 
@@ -98,5 +97,4 @@ SmolVLM2 used 3.3M samples for training coming from ten datasets: LlaVa Onevisio
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_multiimage.png" width="auto" height="auto" alt="Image description">
 
 ### Video mixture
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_video.png" width="auto" height="auto" alt="Image description">
-
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_video.png" width="auto" height="auto" alt="Image description">
|
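The training-data paragraph names the ten sources, but the diff carries no recipe for combining them; the per-source ratios are only shown in the plots. Purely as a hedged illustration of how a multi-source mixture like this can be expressed, here is a sketch using the `datasets` library's `interleave_datasets`. The dataset ids and sampling probabilities are invented placeholders, not SmolVLM2's actual mixture.

```python
# Illustration only: weighted interleaving of several sources into one stream.
# The ids and probabilities are invented placeholders, NOT the SmolVLM2 recipe.
from datasets import load_dataset, interleave_datasets

sources = {
    "org/image-text-sft": 0.5,  # placeholder id and weight
    "org/video-sft": 0.3,       # placeholder id and weight
    "org/text-sft": 0.2,        # placeholder id and weight
}

# Stream each source so nothing is downloaded up front; note that the column
# schemas must align across sources for interleaving to succeed.
streams = [load_dataset(repo, split="train", streaming=True) for repo in sources]

mixture = interleave_datasets(
    streams,
    probabilities=list(sources.values()),
    seed=42,
    stopping_strategy="all_exhausted",  # keep drawing until every source is exhausted
)

# Peek at the first few mixed examples.
for example in mixture.take(3):
    print(sorted(example))
```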