About the dataset used for training

#3
by JavierCastellD - opened

Your description of the dataset used to train the model states that it includes web-sourced content and publicly available documents. Does this dataset contain potentially copyrighted material? If not, what steps did you take to exclude it?

Language Technologies Unit @ Barcelona Supercomputing Center org

Hi @JavierCastellD! We will soon publish a technical report with more detailed information about how the training data was processed.

jsaizant changed discussion status to closed

Thanks! Great job and I'm looking forward to it.
