Jorge De Corte PRO

JorgeDeC

Organizations

BigCode, ReBatch, ZeroGPU Explorers, Hugging Face for Legal

JorgeDeC's activity

updated a Space 8 months ago
New activity in huggingchat/chat-ui-template 8 months ago

Create INCLUDE_DB

1
#9 opened 8 months ago by JorgeDeC
New activity in ReBatch/Llama-3-8B-dutch 8 months ago

License

1
#1 opened 8 months ago by CorporateVero
replied to BramVanroy's post 8 months ago
A QLoRA and ORPO finetune on your ultrafeedback dataset.
It now defaults to Dutch more often, even when asked questions in English (sometimes :) )

https://huggingface.co/ReBatch/Llama-3-8B-dutch

I am surprised there is a (small) improvement on dutch_social and hellaswag with only 200k examples for one epoch. All other benchmarks saw a drop; I will have to investigate that.

replied to BramVanroy's post 8 months ago
Great, thank you very much!
We were in the process of translating the original ultrachat and ultrafeedback datasets to Dutch ourselves, using models whose licenses permit commercial use.

But now we don't have to. Looking forward to using this!

reacted to BramVanroy's post with 🔥 8 months ago
🥳 New license for datasets: Apache 2.0!

I have been struggling mentally for many months now with the OpenAI terms of use that indicate that their model outputs cannot be used to build "competing models". This leads to many questions:

- what is the definition of competing? Is it the same as "commercial"?
- since this is part of the terms of use between OpenAI and the API user, can a third party still use the generated dataset to build competing models?
- are such restrictions even legal in the first place?

Trying to "follow the rules" as much as possible despite wanting to be as open as possible, I kept releasing my datasets under non-commercial licenses (which are too restrictive anyhow - nothing should prevent you from using the data in non-LM commercial settings), just like models trained on these datasets. This has put me at a competitive disadvantage compared to creators who do not follow the same approach and release their data/models under Apache 2.0 despite the OpenAI "restrictions". Moreover, I fear (https://twitter.com/BramVanroy/status/1780220420316164246) that my approach blocks adaptation of my data/models for (commercial) applications/integrations.

Thankfully @Rijgersberg noted that these OpenAI terms of use are NOT explicit in the Azure OpenAI API (https://twitter.com/E_Rijgersberg/status/1780308971762450725). Since my latest datasets were created via Azure, this comes as a relief. As far as I can tell after digging through the Azure docs, this allows me to change all recent GPT-4-generated datasets to Apache 2.0! 🥳

- BramVanroy/ultrachat_200k_dutch
- BramVanroy/orca_dpo_pairs_dutch
- BramVanroy/ultra_feedback_dutch
- BramVanroy/ultra_feedback_dutch_cleaned
- BramVanroy/no_robots_dutch

I will have to mull over what to do with the older GPT-3.5 datasets. What do you think I should do?