mfarre (HF staff) committed
Commit b338671 · verified · 1 Parent(s): 070aa5f

Update README.md

Files changed (1)
  1. README.md +6 -8
README.md CHANGED
@@ -78,16 +78,15 @@ SmolVLM is not intended for high-stakes scenarios or critical decision-making pr
 
 SmolVLM2 is built upon [SigLIP](https://huggingface.co/google/siglip-base-patch16-512) as image encoder and [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) for text decoder part.
 
- We release the SmolVLM 2checkpoints under the Apache 2.0 license.
+ We release the SmolVLM2 checkpoints under the Apache 2.0 license.
 
 ## Training Data
 
- SmolVLM2 used 3.3M samples for training coming from ten datasets: LlaVa Onevision, M4-Instruct, Mammoth, LlaVa Video 178K, FineVideo, VideoStar, VRipt, Vista-400K, MovieChat and ShareGPT4Video.
- 
- ### General split
- 
- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
+ SmolVLM2 used 3.3M samples for training, drawn from ten datasets: LLaVA-OneVision, M4-Instruct, MAmmoTH, LLaVA-Video-178K, FineVideo, Video-STaR, Vript, Vista-400K, MovieChat and ShareGPT4Video.
+ In the following plots we give a general overview of the samples across modalities and their sources.
 
+ <center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
+ </center>
 ### Text mixture
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_text.png" width="auto" height="auto" alt="Image description">
 
@@ -98,5 +97,4 @@ SmolVLM2 used 3.3M samples for training coming from ten datasets: LlaVa Onevisio
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_multiimage.png" width="auto" height="auto" alt="Image description">
 
 ### Video mixture
- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_video.png" width="auto" height="auto" alt="Image description">
- 
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_video.png" width="auto" height="auto" alt="Image description">
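
For context on the architecture line in the hunk above (a SigLIP image encoder paired with a SmolLM2 text decoder), here is a minimal sketch of loading a released checkpoint and inspecting both halves. The model id and the config attribute names below are assumptions for illustration, not something stated in this commit.

```python
# Minimal sketch (assumptions: the model id below and the vision_config /
# text_config attribute names; any released SmolVLM2 checkpoint should work).
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# The config exposes the two halves named in the README:
print(model.config.vision_config.model_type)  # SigLIP-derived image encoder
print(model.config.text_config.model_type)    # SmolLM2 (Llama-architecture) decoder
```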