HuggingFaceM4/idefics-80b

Leyo committed on Jul 11, 2023

Commit 436c345 · 1 Parent(s): 2f0f4fc

fix proportion numbers

Files changed (1)

README.md +2 -2

@@ -116,8 +116,8 @@ The model is trained on the following data mixture of openly accessible English
 | Data Source | Type of Data                             | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
 |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
 | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC)     | Unstructured Multimodal Web Documents    | TODO                      | TODO                      | 1      | 73.85%                                  |
-| [Wikipedia](https://huggingface.co/datasets/wikipedia)   | Unstructured Multimodal Web Documents    | TODO                      | TODO                      | 3      | 17.18%                                  |
-| [LAION](https://huggingface.co/datasets/laion/laion2B-en)       | Image-Text Pairs                         | TODO                      | TODO                      | 1      | 6.15%
+| [Wikipedia](https://huggingface.co/datasets/wikipedia)   | Unstructured Multimodal Web Documents    | TODO                      | TODO                      | 3      | 6.15%                                  |
+| [LAION](https://huggingface.co/datasets/laion/laion2B-en)       | Image-Text Pairs                         | TODO                      | TODO                      | 1      | 17.18%
 | [PMD](https://huggingface.co/datasets/facebook/pmd)         | Image-Text Pairs                         | TODO                      | TODO                      | 3      | 2.82%                                   |                                |
 **OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).
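
As a quick sanity check on the figures in this commit, the four effective proportions can be summed to confirm that the mixture accounts for all training tokens. The sketch below is illustrative only: the values are copied from the added (+) rows, while the `proportions` dictionary and `total` variable are hypothetical names that do not appear in the README.

```python
from decimal import Decimal

# Effective proportions as listed in the added (+) rows of this commit.
# The dictionary name and layout are illustrative, not part of the README.
proportions = {
    "OBELISC": Decimal("73.85"),
    "Wikipedia": Decimal("6.15"),
    "LAION": Decimal("17.18"),
    "PMD": Decimal("2.82"),
}

# Decimal avoids binary floating-point rounding when adding percentages.
total = sum(proportions.values(), Decimal("0"))
print(f"total = {total}%")
```

Because the commit only swaps the Wikipedia and LAION values, the sum is the same before and after the change. A check like this confirms that the mixture is complete but cannot tell which row should hold which value.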