# Fine-tuning ModernBERT on a Large Dataset with Masked Language Modelling
This guide demonstrates how to fine-tune the ModernBERT-large model on a Dutch dataset using the code from the s-smits/modernbert-finetune repository and the Hugging Face Transformers library. We'll walk through the steps of setting up your environment, preparing the dataset, configuring the training process, and running the fine-tuning script.
## Prerequisites
- Hugging Face Account: You'll need a Hugging Face account to access models and datasets and to push your fine-tuned model to the Hub. Sign up at huggingface.co.
- Hugging Face API Token: Generate a User Access Token (with "write" access) from your Hugging Face profile settings. This token will be used to authenticate your interactions with the Hugging Face Hub.
- WandB Account (Optional but Recommended): Weights & Biases (WandB) is a great tool for tracking and visualizing your training runs. Create a free account at wandb.ai.
- WandB API Key: If you're using WandB, get your API key from your WandB settings.
- Environment: A GPU environment is strongly recommended. We suggest using the latest PyTorch version.
## Installation
Clone the Repository:
git clone https://github.com/s-smits/modernbert-finetune.git
cd modernbert-finetune
Install Dependencies:
pip install -r requirements.txt
This command installs all the packages listed in the `requirements.txt` file, including `torch`, `datasets`, `huggingface-hub`, `transformers`, and `wandb`. It also installs `transformers` from the main branch to get the latest features.
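To confirm the installation worked before starting a long run, a quick import check can help. The sketch below is not part of the repository; the helper name `missing_modules` is hypothetical, and the module list simply mirrors the packages named above.

```python
from importlib.util import find_spec

# Import names for the packages the guide says requirements.txt installs.
# The PyPI package "huggingface-hub" is imported as "huggingface_hub".
REQUIRED_MODULES = ["torch", "datasets", "huggingface_hub", "transformers", "wandb"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported missing, re-running `pip install -r requirements.txt` inside the active environment is the first thing to try.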
## Configuration
Environment Variables:
Set the following environment variables:
export HUGGINGFACE_TOKEN="your_huggingface_token"
export WANDB_API_KEY="your_wandb_api_key" # Optional
Replace `"your_huggingface_token"` with your actual Hugging Face token and `"your_wandb_api_key"` with your WandB API key.
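A script can fail early with a clear message if the token was never exported. The following is a minimal sketch, assuming the two variable names shown above; the function `read_credentials` is hypothetical and is not taken from `train.py`.

```python
import os


def read_credentials(env=None):
    """Collect the tokens the guide exports; only the Hugging Face token is required."""
    env = os.environ if env is None else env
    hf_token = env.get("HUGGINGFACE_TOKEN", "").strip()
    if not hf_token:
        raise RuntimeError("HUGGINGFACE_TOKEN is not set; export it before running train.py")
    # WandB is optional, so a missing key simply means no key is returned.
    wandb_key = env.get("WANDB_API_KEY", "").strip() or None
    return {"hf_token": hf_token, "wandb_key": wandb_key}
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try out without touching the real environment.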
Script Parameters:
The `train.py` script defines several configurable parameters. Here are some of the most important ones:
- `model_checkpoint`: "answerdotai/ModernBERT-large" (default, the large ModernBERT model).
- `dataset_name`: "ssmits/fineweb-2-dutch" (default, a Dutch dataset). You can change this to any other dataset on the Hugging Face Hub.
- `num_train_epochs`: 1 (default). Increase for longer training, but be mindful of overfitting.
- `chunk_size`: 8192 (default). Adjust based on your GPU memory.
- `gradient_accumulation_steps`: 32 (default). Modify based on your desired effective batch size and GPU memory.
- `per_device_train_batch_size`: 1 (default). Adjust based on your GPU memory.
- `eval_size_ratio`: 0.05 (default). The proportion of the dataset used for evaluation.
- `masking_probabilities`: `[0.3, 0.2, 0.18, 0.16, 0.14]` (default). The curriculum learning masking probabilities.
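The default masking probabilities decrease from 0.3 to 0.14, which suggests training starts with heavy masking and moves to lighter masking. How `train.py` actually switches between them is not described here, so the sketch below is only one plausible reading: it assumes training is split into equal-length phases, one per probability. The function name `masking_probability` is hypothetical.

```python
def masking_probability(step, total_steps, probabilities):
    """Pick the masking probability for a step, assuming equal-length phases."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so the final step (and any overshoot) stays in the last phase.
    step = min(max(step, 0), total_steps - 1)
    phase = step * len(probabilities) // total_steps
    return probabilities[phase]


schedule = [0.3, 0.2, 0.18, 0.16, 0.14]
for step in (0, 250, 999):
    print(step, masking_probability(step, 1000, schedule))
```

With 1000 total steps and five probabilities, each phase under this assumption lasts 200 steps, so step 250 falls in the second phase.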
You can modify these parameters directly in the `train.py` file or by using environment variables.
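To see how these settings interact, it helps to compute how many tokens contribute to a single weight update. The sketch below uses the defaults listed above. The environment variable names (upper-cased parameter names such as `CHUNK_SIZE`) are an assumption made for illustration; check `train.py` for the names it actually reads.

```python
import os

# Defaults as listed in this guide.
DEFAULTS = {
    "num_train_epochs": 1,
    "chunk_size": 8192,
    "gradient_accumulation_steps": 32,
    "per_device_train_batch_size": 1,
}


def load_int_settings(env=None, defaults=DEFAULTS):
    """Overlay integer settings from upper-cased environment variables (hypothetical naming)."""
    env = os.environ if env is None else env
    settings = dict(defaults)
    for key in defaults:
        raw = env.get(key.upper())
        if raw is not None:
            settings[key] = int(raw)
    return settings


def tokens_per_optimizer_step(settings, num_devices=1):
    """Upper bound on tokens per weight update, reached when every sequence fills chunk_size."""
    sequences = (
        settings["per_device_train_batch_size"]
        * settings["gradient_accumulation_steps"]
        * num_devices
    )
    return sequences * settings["chunk_size"]
```

With the defaults on a single GPU this gives 1 × 32 × 8192 = 262,144 tokens per update, so halving `chunk_size` while doubling `gradient_accumulation_steps` leaves that budget unchanged.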
## Running the Fine-tuning Script
Log in to the Hugging Face Hub:
huggingface-cli login --token $HUGGINGFACE_TOKEN
Log in to WandB (Optional):
wandb login --relogin
Run the Script:
python train.py
## Monitoring and Evaluation
- WandB Dashboard: If you're using WandB, monitor your training progress in real-time on your WandB project dashboard.
- Hugging Face Hub: Once training is complete, your fine-tuned model is automatically pushed to your Hugging Face Hub profile under the repository name specified in the script (`repo_name`).
## Using Your Fine-tuned Model
You can then use your fine-tuned model for various downstream tasks using the Hugging Face Transformers library:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "your_username/modernbert-large-language" # Replace with your model name (e.g., your username and the repo name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
# Use the model for inference, e.g., filling in masked tokens
inputs = tokenizer("Het weer is vandaag [MASK].", return_tensors="pt")
outputs = model(**inputs)
# ... process the outputs ...
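One way to process the outputs is to look at the scores at the masked position and keep the highest-ranked token ids. The helper below is a dependency-free sketch that works on plain Python lists, so it can be read and tested without loading a model; the name `top_k_at_mask` is hypothetical. In practice the same ranking is usually done directly on tensors, for example with `torch.topk`.

```python
def top_k_at_mask(input_ids, logits, mask_token_id, k=5):
    """Return the k highest-scoring (token_id, score) pairs at the first masked position.

    input_ids: list of token ids for one sequence.
    logits: one list of vocabulary scores per position in that sequence.
    """
    position = input_ids.index(mask_token_id)  # raises ValueError if no mask token
    scores = logits[position]
    ranked = sorted(range(len(scores)), key=lambda token_id: scores[token_id], reverse=True)
    return [(token_id, scores[token_id]) for token_id in ranked[:k]]
```

With the objects from the snippet above, the inputs would be `inputs["input_ids"][0].tolist()`, `outputs.logits[0].tolist()`, and `tokenizer.mask_token_id`; each returned token id can then be turned back into text with `tokenizer.decode([token_id])`.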
## Tips and Considerations
- GPU Memory: ModernBERT is a large model. Adjust `chunk_size`, `per_device_train_batch_size`, and `gradient_accumulation_steps` to fit your GPU's memory.
- Dataset Size: The script is designed for large, streaming datasets. Adjust `estimated_dataset_size_in_rows` if you're using a smaller dataset.
- Hyperparameter Tuning: Experiment with different hyperparameters (learning rate, masking probabilities, etc.) to find the best settings for your task.
- Evaluation: The script performs periodic evaluations. You can customize the evaluation frequency using `eval_interval`.
- Saving: The script automatically saves intermediate and final models to the Hugging Face Hub. You can adjust the saving frequency using `save_interval`.
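A streaming dataset does not report its length up front, which is presumably why the script asks for `estimated_dataset_size_in_rows`. The sketch below shows how such an estimate could translate into a number of weight updates. It rests on a simplifying assumption that each dataset row yields exactly one training sequence, which may not match how `train.py` chunks text; the function name `estimate_optimizer_steps` is hypothetical.

```python
import math


def estimate_optimizer_steps(
    estimated_rows,
    eval_size_ratio=0.05,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=1,
    num_devices=1,
):
    """Rough count of weight updates, assuming one training sequence per dataset row."""
    # Hold out the evaluation share first, using integer row counts.
    eval_rows = round(estimated_rows * eval_size_ratio)
    train_rows = estimated_rows - eval_rows
    sequences_per_step = (
        per_device_train_batch_size * gradient_accumulation_steps * num_devices
    )
    return math.ceil(train_rows / sequences_per_step) * num_train_epochs
```

A figure like this is useful for choosing `eval_interval` and `save_interval` values that trigger a sensible number of times over a run.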
## Troubleshooting
- CUDA Errors: If you encounter CUDA errors such as out-of-memory failures, reduce `per_device_train_batch_size` or `chunk_size`. Increasing `gradient_accumulation_steps` does not lower memory use by itself, but it lets you keep the same effective batch size after shrinking the other two.
- Shape Errors: The `StableDataCollator` is designed to handle most shape-related issues. If you encounter any, ensure your dataset is properly formatted and that you're using the latest version of the `transformers` library.
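When memory runs out, one option is to shrink what is held on the GPU at once while accumulating over more micro-batches. The sketch below is an illustration of that trade, not code from the repository; the helper name `reduce_memory_footprint` is hypothetical, and it assumes batch sizes and chunk sizes are powers of two so that the token budget per weight update stays exactly the same.

```python
def reduce_memory_footprint(per_device_train_batch_size, gradient_accumulation_steps, chunk_size):
    """Return a lighter configuration with the same token budget per weight update.

    The budget is preserved exactly when the halved value is even, roughly otherwise.
    """
    if per_device_train_batch_size > 1:
        # Halve the number of sequences held in memory at once.
        per_device_train_batch_size //= 2
    else:
        # Batch size is already minimal, so shorten each sequence instead.
        chunk_size //= 2
    # Accumulate over twice as many micro-batches to compensate.
    gradient_accumulation_steps *= 2
    return per_device_train_batch_size, gradient_accumulation_steps, chunk_size
```

Starting from the defaults (batch size 1, 32 accumulation steps, chunk size 8192), one application gives batch size 1, 64 accumulation steps, and chunk size 4096. Note that a shorter `chunk_size` also changes the longest context the model sees during fine-tuning.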
This guide provides a comprehensive overview of how to use the provided code to fine-tune ModernBERT. Remember to adapt the instructions and parameters to your specific needs and dataset. Good luck!
You are using the dataset ssmits/fineweb-2-dutch which was created using data from CommonCrawl.
Could you share the Python code used to generate this dataset from Common Crawl?
Hello!
I believe the dataset is simply the Dutch subsection of the Fineweb 2 dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
This dataset card has a lot of information on how it was generated.
- Tom Aarsen