# Fine-tuning ModernBERT on a Large Dataset with Masked Language Modelling

#4
by ssmits - opened

This guide demonstrates how to fine-tune the ModernBERT-large model on a Dutch dataset using the code from the s-smits/modernbert-finetune repository and the Hugging Face Transformers library. We'll walk through the steps of setting up your environment, preparing the dataset, configuring the training process, and running the fine-tuning script.

Prerequisites

  • Hugging Face Account: You'll need a Hugging Face account to access models, datasets, and push your fine-tuned model to the Hub. Sign up at huggingface.co.
  • Hugging Face API Token: Generate a User Access Token (with "write" access) from your Hugging Face profile settings. This token will be used to authenticate your interactions with the Hugging Face Hub.
  • WandB Account (Optional but Recommended): Weights & Biases (WandB) is a great tool for tracking and visualizing your training runs. Create a free account at wandb.ai.
  • WandB API Key: If you're using WandB, get your API key from your WandB settings.
  • Environment: A GPU environment is strongly recommended. We suggest using the latest stable PyTorch release.

Installation

  1. Clone the Repository:

    git clone https://github.com/s-smits/modernbert-finetune.git
    cd modernbert-finetune
    
  2. Install Dependencies:

    pip install -r requirements.txt
    

    This command installs the packages listed in requirements.txt, including torch, datasets, huggingface-hub, transformers, and wandb. The transformers package is installed from the main branch to get the latest features.
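To confirm that the install worked, a quick check along these lines can report which of the packages named above are present. This is a minimal sketch rather than part of the repository: the helper name `installed_versions` is hypothetical, and it relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata

# Distribution names taken from the package list mentioned in this guide.
PACKAGES = ["torch", "datasets", "huggingface-hub", "transformers", "wandb"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```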

Configuration

  1. Environment Variables:

    • Set the following environment variables:

      export HUGGINGFACE_TOKEN="your_huggingface_token"
      export WANDB_API_KEY="your_wandb_api_key" # Optional
      

      Replace "your_huggingface_token" with your actual Hugging Face token and "your_wandb_api_key" with your WandB API key.

  2. Script Parameters:

    • The train.py script defines several configurable parameters. Here are some of the most important ones:

      • model_checkpoint: "answerdotai/ModernBERT-large" (default, the large ModernBERT model).
      • dataset_name: "ssmits/fineweb-2-dutch" (default, a Dutch dataset). You can change this to any other dataset on the Hugging Face Hub.
      • num_train_epochs: 1 (default). Increase for longer training, but be mindful of overfitting.
      • chunk_size: 8192 (default). Adjust based on your GPU memory.
      • gradient_accumulation_steps: 32 (default). Modify based on your desired effective batch size and GPU memory.
      • per_device_train_batch_size: 1 (default). Adjust based on your GPU memory.
      • eval_size_ratio: 0.05 (default). The proportion of the dataset used for evaluation.
      • masking_probabilities: [0.3, 0.2, 0.18, 0.16, 0.14] (default). The curriculum learning masking probabilities.
    • You can modify these parameters directly in the train.py file or by using environment variables.
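The interaction between per_device_train_batch_size, gradient_accumulation_steps, and chunk_size is easier to see as arithmetic. The sketch below computes the effective batch per optimizer step from the defaults listed above. It assumes a single GPU and that every sequence is filled to chunk_size tokens; the function name `effective_batch` and the `num_devices` argument are illustrative and do not come from train.py.

```python
def effective_batch(per_device_train_batch_size=1,
                    gradient_accumulation_steps=32,
                    chunk_size=8192,
                    num_devices=1):
    """Return (sequences, tokens) consumed per optimizer step."""
    sequences = per_device_train_batch_size * gradient_accumulation_steps * num_devices
    # Upper bound on tokens: assumes every sequence is exactly chunk_size long.
    tokens = sequences * chunk_size
    return sequences, tokens


sequences, tokens = effective_batch()
print(f"{sequences} sequences, up to {tokens:,} tokens per optimizer step")
```

With the defaults this gives 32 sequences (up to 262,144 tokens) per optimizer step. Halving chunk_size halves the token count without changing the number of sequences.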

Running the Fine-tuning Script

  1. Login to Hugging Face Hub:

    huggingface-cli login --token $HUGGINGFACE_TOKEN
    
  2. Login to WandB (Optional):

    wandb login --relogin
    
  3. Run the Script:

    python train.py
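Before launching a long run, it can help to confirm that the credentials from the Configuration section are actually set. The sketch below is a hypothetical preflight helper, not something shipped with the repository; it treats HUGGINGFACE_TOKEN as required and WANDB_API_KEY as optional, matching this guide.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(["HUGGINGFACE_TOKEN"])
    if missing:
        print("Missing required variables:", ", ".join(missing))
    elif not os.environ.get("WANDB_API_KEY"):
        print("WANDB_API_KEY is not set (optional).")
    else:
        print("Environment looks ready for: python train.py")
```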
    

Monitoring and Evaluation

  • WandB Dashboard: If you're using WandB, monitor your training progress in real-time on your WandB project dashboard.
  • Hugging Face Hub: Once the training is complete, your fine-tuned model will be automatically pushed to your Hugging Face Hub profile under the repository name specified in the script (repo_name).

Using Your Fine-tuned Model

You can then use your fine-tuned model for various downstream tasks using the Hugging Face Transformers library:

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "your_username/modernbert-large-language"  # Replace with your model name (e.g., your username and the repo name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

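# Fill in a masked token; tokenizer.mask_token avoids hard-coding "[MASK]"
inputs = tokenizer(f"Het weer is vandaag {tokenizer.mask_token}.", return_tensors="pt")
outputs = model(**inputs)

# Locate the masked position and decode the highest-scoring token
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))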

Tips and Considerations

  • GPU Memory: ModernBERT-large is memory-hungry, especially at long sequence lengths. Reduce chunk_size or per_device_train_batch_size to fit your GPU's memory, and raise gradient_accumulation_steps to recover the effective batch size.
  • Dataset Size: The script is designed for large, streaming datasets. Adjust estimated_dataset_size_in_rows if you're using a smaller dataset.
  • Hyperparameter Tuning: Experiment with different hyperparameters (learning rate, masking probabilities, etc.) to find the best settings for your task.
  • Evaluation: The script performs periodic evaluations. You can customize the evaluation frequency using eval_interval.
  • Saving: The script automatically saves intermediate and final models to the Hugging Face Hub. You can adjust the saving frequency using save_interval.
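One way to picture the masking_probabilities curriculum is as a schedule that moves from the first probability to the last as training progresses. The sketch below assumes equal-length stages over a known number of steps. That is an illustrative assumption, and train.py may divide the stages differently; the function name `masking_probability_for_step` is hypothetical.

```python
def masking_probability_for_step(step, total_steps, probabilities):
    """Pick the masking probability for `step`, assuming equal-length stages."""
    if total_steps <= 0 or not probabilities:
        raise ValueError("total_steps must be positive and probabilities non-empty")
    # Clamp so steps past the planned total keep using the final probability.
    step = min(max(step, 0), total_steps - 1)
    stage = step * len(probabilities) // total_steps
    return probabilities[stage]


schedule = [0.3, 0.2, 0.18, 0.16, 0.14]  # defaults listed in this guide
for step in (0, 250, 500, 750, 999):
    print(step, masking_probability_for_step(step, 1000, schedule))
```

Under this assumption, the heaviest masking is applied early and the lightest at the end of training.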

Troubleshooting

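  • CUDA Out-of-Memory Errors: If you run out of GPU memory, reduce per_device_train_batch_size or chunk_size. Increasing gradient_accumulation_steps does not lower memory use by itself, but it lets you keep the same effective batch size after reducing the per-device batch.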
  • Shape Errors: The StableDataCollator is designed to handle most shape-related issues. If you encounter any, ensure your dataset is properly formatted and that you're using the latest version of the transformers library.
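The memory advice above can be expressed as a small rule: shrink the per-device batch first and compensate with more accumulation steps, and only then shorten the sequences. The helper below is a sketch of that rule, assuming the goal is to keep the number of sequences per optimizer step from dropping. It is not part of train.py, and the name `reduce_memory_footprint` and the `min_chunk_size` floor are hypothetical.

```python
def reduce_memory_footprint(per_device_train_batch_size,
                            gradient_accumulation_steps,
                            chunk_size,
                            min_chunk_size=512):
    """Return a smaller (batch, accumulation, chunk) setting to retry with."""
    if per_device_train_batch_size > 1:
        # Halve the batch (rounding up) and scale accumulation so that
        # batch * accumulation does not drop below its previous value.
        new_batch = (per_device_train_batch_size + 1) // 2
        total = per_device_train_batch_size * gradient_accumulation_steps
        new_accum = -(-total // new_batch)  # ceiling division
        return new_batch, new_accum, chunk_size
    if chunk_size > min_chunk_size:
        # Batch is already 1, so shorter sequences are the remaining lever.
        return 1, gradient_accumulation_steps, max(chunk_size // 2, min_chunk_size)
    raise ValueError("Already at the smallest configuration in this sketch")


print(reduce_memory_footprint(4, 8, 8192))
print(reduce_memory_footprint(1, 32, 8192))
```

Note that the second branch halves the tokens per optimizer step, so with the default batch size of 1 a shorter chunk_size changes the training signal as well as the memory use.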

This guide provides a comprehensive overview of how to use the provided code to fine-tune ModernBERT. Remember to adapt the instructions and parameters to your specific needs and dataset. Good luck!

You are using the dataset ssmits/fineweb-2-dutch which was created using data from CommonCrawl.
Could you share the Python code used to generate this dataset from Common Crawl?

Answer.AI org

Hello!
I believe the dataset is simply the Dutch subset of the FineWeb 2 dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
This dataset card has a lot of information on how it was generated.

  • Tom Aarsen
