---
license: gpl-3.0
language:
- nl
base_model:
- CLTL/MedRoBERTa.nl
tags:
- medical
- healthcare
metrics:
- perplexity
library_name: transformers
---

Continued, off-premise pre-training of [MedRoBERTa.nl](https://huggingface.co/CLTL/MedRoBERTa.nl) on about 50 GB of open Dutch and translated English corpora.

# Data statistics

Sources:

* Dutch: medical guidelines (FMS, NHG)
* Dutch: [NtvG](https://www.ntvg.nl/) papers
* English: PubMed abstracts
* English: PMC abstracts, translated using DeepL
* English: Apollo guidelines, papers and books
* English: Meditron guidelines
* English: MIMIC-III
* English: MIMIC-CXR
* English: MIMIC-IV

All English sources not translated with DeepL were translated with a combination of Gemini Flash 1.5 / GPT-4o mini, MarianNMT, and NLLB-200.

* Number of tokens: 15B
* Number of documents: 27M

# Training

* Effective batch size: 5120
* Learning rate: 2e-4
* Weight decay: 1e-3
* Learning-rate schedule: linear, with 5,000 warmup steps
* Number of epochs: ~3

Train perplexity: 3.0

Validation perplexity: 3.0

# Acknowledgement

This work was done together with Amsterdam UMC, in the context of the [DataTools4Heart](https://www.datatools4heart.eu/) project.

We were happy to be able to use the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for training the model.
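# Usage

A minimal masked-language-modelling sketch using the `transformers` library. The repository id below is a placeholder (this card does not state the final hub id), and the example sentence is illustrative only.

```python
from transformers import pipeline

# Placeholder id -- replace with the actual hub id of this model.
MODEL_ID = "your-org/MedRoBERTa.nl-continued"

# Fill-mask pipeline; the RoBERTa tokenizer uses "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model=MODEL_ID)

print(fill_mask("De patiënt werd behandeld met <mask>."))
```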
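The reported perplexities can be approximated as the exponential of the mean masked-LM cross-entropy. The sketch below assumes that convention and reuses the placeholder id; the exact value will depend on the evaluation corpus and the random masking.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

MODEL_ID = "your-org/MedRoBERTa.nl-continued"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# Randomly mask 15% of tokens, the standard RoBERTa pre-training setting.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

texts = [
    "De patiënt heeft last van kortademigheid en pijn op de borst.",
    "Er werd gestart met een lage dosering metoprolol.",
]
features = [tokenizer(t, truncation=True, max_length=512) for t in texts]
batch = collator(features)

with torch.no_grad():
    # The loss is the mean cross-entropy over the masked positions.
    loss = model(**batch).loss

print(f"masked-LM perplexity: {math.exp(loss.item()):.2f}")
```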