mfarre (HF staff) committed
Commit b338671 · verified · 1 Parent(s): 070aa5f

Update README.md

Files changed (1)
  1. README.md +6 -8
README.md CHANGED
@@ -78,16 +78,15 @@ SmolVLM is not intended for high-stakes scenarios or critical decision-making pr
 
 SmolVLM2 is built upon [SigLIP](https://huggingface.co/google/siglip-base-patch16-512) as image encoder and [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) for text decoder part.
 
- We release the SmolVLM 2checkpoints under the Apache 2.0 license.
+ We release the SmolVLM2 checkpoints under the Apache 2.0 license.
 
 ## Training Data
 
- SmolVLM2 used 3.3M samples for training coming from ten datasets: LlaVa Onevision, M4-Instruct, Mammoth, LlaVa Video 178K, FineVideo, VideoStar, VRipt, Vista-400K, MovieChat and ShareGPT4Video.
- 
- ### General split
- 
- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
+ SmolVLM2 used 3.3M samples for training, drawn from ten datasets: LLaVA-OneVision, M4-Instruct, MAmmoTH, LLaVA-Video-178K, FineVideo, Video-STaR, Vript, Vista-400K, MovieChat and ShareGPT4Video.
+ In the following plots we give a general overview of the samples across modalities and their sources.
 
+ <center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
+ </center>
 ### Text mixture
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_text.png" width="auto" height="auto" alt="Image description">
 
@@ -98,5 +97,4 @@ SmolVLM2 used 3.3M samples for training coming from ten datasets: LlaVa Onevisio
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_multiimage.png" width="auto" height="auto" alt="Image description">
 
 ### Video mixture
- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_video.png" width="auto" height="auto" alt="Image description">
- 
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_video.png" width="auto" height="auto" alt="Image description">
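
For context on the architecture line in the hunk above (a SigLIP image encoder paired with a SmolLM2 text decoder), here is a minimal sketch of loading a released checkpoint and inspecting both halves. The model id and the config attribute names below are assumptions for illustration, not something stated in this commit.

```python
# Minimal sketch (assumptions: the model id below and the vision_config /
# text_config attribute names; any released SmolVLM2 checkpoint should work).
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# The config exposes the two halves named in the README:
print(model.config.vision_config.model_type)  # SigLIP-derived image encoder
print(model.config.text_config.model_type)    # SmolLM2 (Llama-architecture) decoder
```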