Recommend some datasets
#5, opened by chadqiu
Some open-source datasets:
SkyPile-150B, a Chinese dataset from Skywork-13B: https://huggingface.co/datasets/Skywork/SkyPile-150B
WanJuan, a Chinese and English dataset from InternLM: https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0
Dolma, an English dataset of 3T tokens: https://huggingface.co/datasets/allenai/dolma