Wikimedia Datasets - a frimelle Collection

updated May 16, 2024

Datasets on the Hugging Face Hub drawn from different Wikimedia projects, across languages and modalities. Not all of them have been tested.

legacy-datasets/wikipedia

Updated Mar 11, 2024 • 34.7k • 577

Note Wikipedia: The largest and most widely used Wikipedia dataset on Hugging Face.
copenlu/wiki-stance

Preview • Updated May 17, 2024 • 279 • 3

Note Wikipedia: Articles-for-deletion discussions from the English, Turkish, and German Wikipedia communities, with a focus on stance and policy prediction.
wikimedia/wikipedia

Viewer • Updated Jan 9, 2024 • 61.6M • 107k • 750

Note Wikipedia: Dataset of Wikipedia articles across all languages, with Wikitext (the Wikimedia markup language) removed.
Salesforce/wikitext

Viewer • Updated Jan 4, 2024 • 3.71M • 468k • 413

Note Wikipedia: Dataset based on good and featured articles on Wikipedia.
wikimedia/wit_base

Viewer • Updated Nov 4, 2022 • 108k • 2.47k • 57

Note Wikipedia: An image-text dataset based on Wikipedia articles and the associated Wikipedia images.
aiintelligentsystems/vel_commons_wikidata

Viewer • Updated May 17, 2024 • 772k • 466 • 1

Note Wikimedia Commons: Dataset for visual entity linking, built on the structured data from Wikidata that the community adds to describe images, including license information.
MLCommons/speech-wikimedia

Viewer • Updated Jun 29, 2023 • 16 • 718 • 11

Note Wikimedia Commons: A dataset of Wikimedia Commons audio files and transcriptions across languages.
calm-and-collected/knives_and_time

Viewer • Updated May 5, 2024 • 325 • 64

Note Wikimedia Commons: Dataset of public domain images, manually collected from Wikimedia Commons, covering damaged images, paintings, and photographs (https://huggingface.co/calm-and-collected/knives_and_time).
kdm-daiict/freebase-wikidata-mapping

Viewer • Updated Feb 27, 2024 • 2.08M • 66 • 4

Note Wikidata: Maps Wikidata to the widely used but outdated Freebase knowledge graph.
rvashurin/wikidata_simplequestions

Updated May 29, 2023 • 54 • 2

Note Wikidata: The SimpleQuestions dataset, based on Wikidata, hosted on Hugging Face.
rayliuca/WikidataLabels

Viewer • Updated Jan 11, 2024 • 654M • 19.4k • 1

Note Wikidata: Dataset of entity labels across languages, extracted from Wikidata.
imvladikon/paranames

Viewer • Updated Jan 13, 2023 • 78M • 116 • 1

Note Wikidata: Dataset of 118 million names across 400 languages.
wikimedia/wikisource

Viewer • Updated Dec 8, 2023 • 1.66M • 1.49k • 79

Note Wikisource: Dataset of all Wikisource articles across all languages, without Wikitext.
mostol/wiktionary-ipa

Viewer • Updated Feb 2, 2022 • 80.1k • 35 • 5

Note Wiktionary: IPA strings and their respective pronunciations as audio files.
malteos/wikinews

Viewer • Updated Apr 16, 2024 • 249k • 2.68k • 2

Note Wikinews: Dataset of Wikinews articles without Wikitext across different languages, including the revision timestamp, categories, and sources.
taln-ls2n/wikinews-fr-100

Updated Sep 23, 2022 • 166 • 1

Note Wikinews: Dataset for keyphrase extraction and generation models, including 100 French Wikinews articles.
erhwenkuo/wikinews-zhtw

Viewer • Updated Oct 10, 2023 • 9.83k • 120 • 3

Note Wikinews: Dataset of cleaned Chinese Wikinews articles from 2023.
caretech-owl/wikiquote-de-quotes

Viewer • Updated Dec 22, 2023 • 16.2k • 85

Note Wikiquote: Dataset of German quotes and their authors from Wikiquote.
domenicrosati/TruthfulQA

Viewer • Updated Jul 1, 2022 • 817 • 201 • 10

Note Wikiquote: Dataset testing how humans' false beliefs are represented in language models. One of the sources is Wikiquote.
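
The entries above are Hub repository IDs, so most of them can presumably be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, not a tested recipe for every dataset in the list. It assumes that `wikimedia/wikipedia` exposes configurations named `<dump date>.<language code>` (for example `20231101.en`), and that its records carry a `title` field. The helper names `hub_config` and `stream_first_records` are hypothetical and introduced here only for illustration.

```python
def hub_config(dump_date: str, language: str) -> str:
    """Build a config name of the assumed form '<dump_date>.<language>'."""
    return f"{dump_date}.{language}"


def stream_first_records(repo_id: str, config: str, n: int = 3) -> list:
    """Return the first n records of a Hub dataset without a full download."""
    # Imported lazily so that hub_config works without the library installed.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of downloading all files.
    dataset = load_dataset(repo_id, config, split="train", streaming=True)
    return list(dataset.take(n))
```

Calling `stream_first_records("wikimedia/wikipedia", hub_config("20231101", "en"))` would then fetch a few English articles over the network. Other datasets in the list may use different configuration names or splits, so their dataset cards are worth checking first.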