mapama247 committed on
Commit
f254fca
1 Parent(s): cc2c6f0

Update README.md

Files changed (1)
  1. README.md +7 -6
README.md CHANGED
@@ -223,7 +223,7 @@ Feel free to click the expand button below to see the full list of sources.
223
  | MC4-Legal | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelito/legal-mc4) |
224
  | CURLICAT Corpus | bg, hr, hu, pl, ro, sk, sl | (Váradi et al., 2022) |
225
  | CATalog | ca | (Palomar-Giner et al., 2024) |
226
- | Spanish Crawling | ca, es, eu, gl | - |
227
  | Starcoder | code | (Li et al., 2023) |
228
  | SYN v9: large corpus of written Czech | cs | (Křen et al., 2021) |
229
  | Welsh-GOV | cy | Crawling from [Link](https://www.llyw.cymru) |
@@ -245,13 +245,13 @@ Feel free to click the expand button below to see the full list of sources.
245
  | The Pile (PhilPapers subset) | en | (Gao et al., 2021) |
246
  | Spanish Legal Domain Corpora | es | (Gutiérrez-Fandiño et al., 2021) |
247
  | HPLTDatasets v1 - Spanish | es | (de Gibert et al., 2024) |
248
- | Legal | es | BOE, BORME, Senado, Congreso, sentencias (ULPGC) |
249
- | Biomedical | es | - |
250
- | Scientific | es | - |
251
  | Estonian National Corpus 2021 | et | (Koppel & Kallas, 2022) |
252
  | Estonian Reference Corpus | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
253
  | EusCrawl (filtered: no Wikipedia, no NC-licenses) | eu | (Artetxe et al., 2022) |
254
- | GAITU | eu | Compilation of CulturaX, Booktegi, some dumps of Colossal Oscar, Egunkaria, Euscrawl, HPLT and Wikipedia in Basque. |
255
  | Yle Finnish News Archive | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |
256
  | CaBeRnet: a New French Balanced Reference Corpus | fr | (Popa-Fabre et al., 2020) |
257
  | French Public Domain Newspapers | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
@@ -265,7 +265,7 @@ Feel free to click the expand button below to see the full list of sources.
265
  | Korpus Malti | mt | (Micallef et al., 2022) |
266
  | SoNaR Corpus NC 1.2 | nl | [Link](https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/) |
267
  | Norwegian Colossal Corpus | nn, no | (Kummervold et al., 2021) |
268
- | Occitan Corpus | oc | - |
269
  | Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego | pl | (Ogrodniczuk, 2018) |
270
  | NKJP-PodkorpusMilionowy-1.2 (National Corpus of Polish) | pl | (Lewandowska-Tomaszczyk et al., 2013) |
271
  | Brazilian Portuguese Web as Corpus | pt | (Wagner Filho et al., 2018) |
@@ -291,6 +291,7 @@ Feel free to click the expand button below to see the full list of sources.
291
  - Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., & Gardner, M. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In M.-F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1286–1305). Association for Computational Linguistics. [Link](https://doi.org/10.18653/v1/2021.emnlp-main.98)
292
  - Erjavec, T., Ljubešić, N., & Logar, N. (2015). The slWaC corpus of the Slovene web. Informatica (Slovenia), 39, 35–42.
293
  - Erjavec, T., Ogrodniczuk, M., Osenova, P., Ljubešić, N., Simov, K., Grigorova, V., Rudolf, M., Pančur, A., Kopp, M., Barkarson, S., Steingrímsson, S. hór, van der Pol, H., Depoorter, G., de Does, J., Jongejan, B., Haltrup Hansen, D., Navarretta, C., Calzada Pérez, M., de Macedo, L. D., … Rayson, P. (2021). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1. [Link](http://hdl.handle.net/11356/1431)
 
294
  - Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., & Leahy, C. (2021). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR, abs/2101.00027. [Link](https://arxiv.org/abs/2101.00027)
295
  - Gutiérrez-Fandiño, A., Armengol-Estapé, J., Gonzalez-Agirre, A., & Villegas, M. (2021). Spanish Legalese Language Model and Corpora.
296
  - Hansen, D. H. (2018). The Danish Parliament Corpus 2009—2017, v1. [Link](http://hdl.handle.net/20.500.12115/8)
 
223
  | MC4-Legal | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelito/legal-mc4) |
224
  | CURLICAT Corpus | bg, hr, hu, pl, ro, sk, sl | (Váradi et al., 2022) |
225
  | CATalog | ca | (Palomar-Giner et al., 2024) |
226
+ | Spanish Crawling | ca, es, eu, gl | Relevant Spanish websites crawling |
227
  | Starcoder | code | (Li et al., 2023) |
228
  | SYN v9: large corpus of written Czech | cs | (Křen et al., 2021) |
229
  | Welsh-GOV | cy | Crawling from [Link](https://www.llyw.cymru) |
 
245
  | The Pile (PhilPapers subset) | en | (Gao et al., 2021) |
246
  | Spanish Legal Domain Corpora | es | (Gutiérrez-Fandiño et al., 2021) |
247
  | HPLTDatasets v1 - Spanish | es | (de Gibert et al., 2024) |
248
+ | Legal | es | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC |
249
+ | Biomedical | es | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |
250
+ | Scientific | es | Internally generated scientific dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
251
  | Estonian National Corpus 2021 | et | (Koppel & Kallas, 2022) |
252
  | Estonian Reference Corpus | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
253
  | EusCrawl (filtered: no Wikipedia, no NC-licenses) | eu | (Artetxe et al., 2022) |
254
+ | Latxa Corpus v1.1 | eu | (Etxaniz et al., 2024) [Link](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1) |
255
  | Yle Finnish News Archive | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |
256
  | CaBeRnet: a New French Balanced Reference Corpus | fr | (Popa-Fabre et al., 2020) |
257
  | French Public Domain Newspapers | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
 
265
  | Korpus Malti | mt | (Micallef et al., 2022) |
266
  | SoNaR Corpus NC 1.2 | nl | [Link](https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/) |
267
  | Norwegian Colossal Corpus | nn, no | (Kummervold et al., 2021) |
268
+ | Occitan Corpus | oc | Provided by [IEA](https://www.institutestudisaranesi.cat/) |
269
  | Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego | pl | (Ogrodniczuk, 2018) |
270
  | NKJP-PodkorpusMilionowy-1.2 (National Corpus of Polish) | pl | (Lewandowska-Tomaszczyk et al., 2013) |
271
  | Brazilian Portuguese Web as Corpus | pt | (Wagner Filho et al., 2018) |
 
291
  - Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., & Gardner, M. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In M.-F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1286–1305). Association for Computational Linguistics. [Link](https://doi.org/10.18653/v1/2021.emnlp-main.98)
292
  - Erjavec, T., Ljubešić, N., & Logar, N. (2015). The slWaC corpus of the Slovene web. Informatica (Slovenia), 39, 35–42.
293
  - Erjavec, T., Ogrodniczuk, M., Osenova, P., Ljubešić, N., Simov, K., Grigorova, V., Rudolf, M., Pančur, A., Kopp, M., Barkarson, S., Steingrímsson, S. hór, van der Pol, H., Depoorter, G., de Does, J., Jongejan, B., Haltrup Hansen, D., Navarretta, C., Calzada Pérez, M., de Macedo, L. D., … Rayson, P. (2021). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1. [Link](http://hdl.handle.net/11356/1431)
294
+ - Etxaniz, J., Sainz, O., Perez, N., Aldabe, I., Rigau, G., Agirre, E., Ormazabal, A., Artetxe, M., & Soroa, A. (2024). Latxa: An Open Language Model and Evaluation Suite for Basque. [Link](https://arxiv.org/abs/2403.20266)
295
  - Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., & Leahy, C. (2021). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR, abs/2101.00027. [Link](https://arxiv.org/abs/2101.00027)
296
  - Gutiérrez-Fandiño, A., Armengol-Estapé, J., Gonzalez-Agirre, A., & Villegas, M. (2021). Spanish Legalese Language Model and Corpora.
297
  - Hansen, D. H. (2018). The Danish Parliament Corpus 2009—2017, v1. [Link](http://hdl.handle.net/20.500.12115/8)