JuriBERT: A Masked-Language Model Adaptation for French Legal Text

Introduction

JuriBERT is a set of BERT models (tiny, mini, small and base) pre-trained from scratch on French legal-domain corpora. The models are pre-trained on 6.3 GB of raw French legal text from two sources: the first dataset is crawled from Légifrance, and the second consists of anonymized court decisions and pleadings (mémoires ampliatifs) from the Court of Cassation. The latter contains more than 100k long documents from different court cases.

The models are available on Hugging Face in four versions with varying numbers of parameters.

JuriBERT Pre-trained models

Model                   #params   Architecture
dascim/juribert-tiny    6M        Tiny (L=2, H=128, A=2)
dascim/juribert-mini    15M       Mini (L=4, H=256, A=4)
dascim/juribert-small   42M       Small (L=6, H=512, A=8)
dascim/juribert-base    110M      Base (L=12, H=768, A=12)
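
In the usual BERT notation, L is the number of Transformer layers, H the hidden size and A the number of attention heads. The sketch below gives a rough sense of how L and H drive model size by counting the weights of the encoder layers only. It assumes the standard BERT layer layout with a feed-forward size of 4H, which this page does not confirm for JuriBERT, and it leaves out the embeddings because the vocabulary size is not stated here. The function name `encoder_params` is hypothetical, and the results are not expected to match the #params column.

```python
def encoder_params(num_layers: int, hidden: int) -> int:
    """Approximate weight count of the Transformer layers alone.

    Assumption (not confirmed by the model card): standard BERT layers
    with a feed-forward size of 4 * hidden. Embeddings are excluded.
    """
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections: hidden -> 4*hidden -> hidden, with biases.
    feed_forward = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer_norms = 2 * 2 * hidden
    return num_layers * (attention + feed_forward + layer_norms)


# (L, H) pairs taken from the table above.
configs = {"tiny": (2, 128), "mini": (4, 256), "small": (6, 512), "base": (12, 768)}
for name, (layers, hidden) in configs.items():
    print(f"{name}: {encoder_params(layers, hidden):,} encoder weights")
```

The number of attention heads A does not appear in the count, since splitting the hidden size across heads leaves the projection shapes unchanged.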

JuriBERT Usage

Load JuriBERT and its sub-word tokenizer:
from transformers import AutoModel, AutoTokenizer

# You can replace "dascim/juribert-base" with any other model from the table, e.g. "dascim/juribert-small".
tokenizer = AutoTokenizer.from_pretrained("dascim/juribert-base")
juribert = AutoModel.from_pretrained("dascim/juribert-base")

juribert.eval()  # disable dropout (or leave in train mode to finetune)

Fill masks using the pipeline:
from transformers import pipeline

juribert_fill_mask = pipeline("fill-mask", model="dascim/juribert-base", tokenizer="dascim/juribert-base")
results = juribert_fill_mask("la chambre <mask> est une chambre de la cour de cassation.")
# results
# [{'score': 0.3455437421798706, 'token': 579, 'token_str': ' civile', 'sequence': 'la chambre civile est une chambre de la cour de cassation.'}, 
# {'score': 0.13046401739120483, 'token': 397, 'token_str': ' qui', 'sequence': 'la chambre qui est une chambre de la cour de cassation.'}, 
# {'score': 0.12387491017580032, 'token': 1060, 'token_str': ' sociale', 'sequence': 'la chambre sociale est une chambre de la cour de cassation.'}, 
# {'score': 0.05491165071725845, 'token': 266, 'token_str': ' c', 'sequence': 'la chambre c est une chambre de la cour de cassation.'},
# {'score': 0.04244831204414368, 'token': 2421, 'token_str': ' commerciale', 'sequence': 'la chambre commerciale est une chambre de la cour de cassation.'}]
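
The fill-mask pipeline returns a list of dictionaries, and each `token_str` carries a leading space. A minimal sketch of post-processing that output follows: it keeps only predictions above a score threshold and strips the whitespace. The helper name `top_tokens` and the 0.1 threshold are illustrative assumptions, and the sample data is copied from the output shown above (with the `sequence` field omitted) so the block runs without downloading the model.

```python
from typing import Dict, List, Tuple


def top_tokens(results: List[Dict], min_score: float = 0.1) -> List[Tuple[str, float]]:
    """Return (token, score) pairs whose score is at least min_score."""
    return [
        (r["token_str"].strip(), r["score"])
        for r in results
        if r["score"] >= min_score
    ]


# Predictions copied from the fill-mask output above.
sample_results = [
    {"score": 0.3455437421798706, "token": 579, "token_str": " civile"},
    {"score": 0.13046401739120483, "token": 397, "token_str": " qui"},
    {"score": 0.12387491017580032, "token": 1060, "token_str": " sociale"},
    {"score": 0.05491165071725845, "token": 266, "token_str": " c"},
    {"score": 0.04244831204414368, "token": 2421, "token_str": " commerciale"},
]

kept = top_tokens(sample_results, min_score=0.1)
print([token for token, _ in kept])  # ['civile', 'qui', 'sociale']
```

With the real pipeline, the `results` list can be passed to the helper in place of `sample_results`.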

Extract contextual embedding features from JuriBERT output:
encoded_sentence = tokenizer.encode("Les articles 21 et 22 de la présente annexe sont applicables au titre V de la loi du 1er juin 1924 mettant en vigueur la législation civile française dans les départements du Bas-Rhin, du Haut-Rhin et de la Moselle, et relatif à l'exécution forcée sur les immeubles, à la procédure en matière de purge des hypothèques et à la procédure d'ordre.", return_tensors='pt')

embeddings = juribert(encoded_sentence).last_hidden_state
print(embeddings)
# tensor([[[-0.5490, -1.4505, -0.6244,  ..., -0.9739,  0.4767, -0.0655],
#          [ 0.6415, -1.4368,  0.8708,  ..., -0.4093,  0.6691,  0.7238],
#          [-0.2195, -0.1235,  0.2674,  ...,  0.5372, -0.4903,  0.5960],
#          ...,
#          [-1.4168, -1.3238,  1.1748,  ...,  0.7590,  1.0338, -0.4865],
#          [-0.5240, -0.7168,  0.8667,  ..., -0.5848,  1.0086, -1.3153],
#          [ 0.2743, -0.3438,  1.1101,  ..., -0.5587,  0.0830, -0.3144]]],
#        grad_fn=<NativeLayerNormBackward0>)
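
The output above holds one vector per token. When a single vector per sentence is needed, one common convention is to average the token vectors while ignoring padding positions. This is a sketch of that convention, not a method described by the JuriBERT authors. The function name `mean_pool` is hypothetical, and random tensors stand in for the model output so the block runs without downloading the model.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Stand-in tensors: 2 sentences, 5 token positions, hidden size 8.
torch.manual_seed(0)
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])  # second sentence has 2 padding positions

pooled = mean_pool(hidden, mask)
print(tuple(pooled.shape))  # (2, 8)
```

With the real model, the same function would take `last_hidden_state` from the `juribert` output and the `attention_mask` returned by calling `tokenizer` with `padding=True` and `return_tensors="pt"`. Wrapping the forward pass in `torch.no_grad()` avoids tracking gradients when the embeddings are only used as features.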

Authors

JuriBERT was trained and evaluated at École Polytechnique in collaboration with HEC Paris by Stella Douka, Hadi Abdine, Michalis Vazirgiannis, Rajaa El Hamdani and David Restrepo Amariles.

Citation

If you use our work, please cite:

@inproceedings{douka-etal-2021-juribert,
         title = "{J}uri{BERT}: A Masked-Language Model Adaptation for {F}rench Legal Text",
         author="Douka, Stella and Abdine, Hadi and Vazirgiannis, Michalis and El Hamdani, Rajaa and Restrepo Amariles, David",
         booktitle="Proceedings of the Natural Legal Language Processing Workshop 2021",
         month=nov,
         year="2021",
         address = "Punta Cana, Dominican Republic",
         publisher = "Association for Computational Linguistics",
         url = "https://aclanthology.org/2021.nllp-1.9",
         pages = "95--101",
         }