Data Repositories and Size of Dataset
Hi, could you please specify which repositories the Orca and Dolphin data you used come from, and the total size of your unpublished dataset? Thanks a lot!
OK, the sources are:
- garage-bAInd/Open-Platypus
- ehartford/dolphin: flan1m-alpaca-uncensored-deduped.jsonl
- Open-Orca/OpenOrca: 1M-GPT4-Augmented.parquet
The amount is around 50K.
Thank you for your detailed information!
However, I have another question: you said in your model card that you selected 5% of the Dolphin data and 7% of OpenOrca, which is about 120K in total, so how could the final amount be around 50K? I am not sure whether I misunderstood your comment or your model card.
Looking forward to your reply!
Oh, I see, that's a mistake: I forgot to revise the README. I first used 5% of Dolphin and 7% of OpenOrca mixed with Platypus for training, but found inferior performance (we suspect duplicate or near-duplicate data still existed), so I filtered the remaining data again, and in the end only ~1% of the Dolphin and ~1% of the OpenOrca data remained. I will update the README file later, sorry for that.
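For anyone curious, a minimal sketch of the kind of duplicate filtering described above (the function names, normalization, and exact-match hashing here are illustrative assumptions, not the actual pipeline, which may well use fuzzy or MinHash-style similarity instead):

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash the same.
    return " ".join(text.lower().split())

def dedupe_against(base: list[str], new: list[str]) -> list[str]:
    """Keep only items of `new` whose normalized form appears neither in
    `base` nor earlier in `new` (exact-match dedup after normalization)."""
    seen = {hashlib.sha256(normalize(t).encode()).hexdigest() for t in base}
    kept = []
    for t in new:
        h = hashlib.sha256(normalize(t).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(t)
    return kept

# Toy example: filter a "new" subset against an already-selected base set.
platypus = ["What is 2+2?", "Explain photosynthesis."]
dolphin = ["what is 2+2?", "Describe gravity.", "Describe gravity."]
print(dedupe_against(platypus, dolphin))  # ['Describe gravity.']
```

Aggressive filtering like this is one plausible way a 120K mix could shrink to ~50K once overlap with the base set is removed.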