	---
	language: it
	license: afl-3.0
	widget:
	- text: Il <mask> ha chiesto revocarsi l'obbligo di pagamento
	---
	<img src="https://huggingface.co/dlicari/Italian-Legal-BERT-SC/resolve/main/ITALIAN_LEGAL_BERT-SC.jpg" width="600"/>

	# ITALIAN-LEGAL-BERT-SC
	ITALIAN-LEGAL-BERT-SC (ITA-LEGAL-BERT-SC) is the variant of [ITALIAN-LEGAL-BERT](https://huggingface.co/dlicari/Italian-Legal-BERT) pre-trained from scratch on Italian legal documents. It is based on the CamemBERT architecture.

	## Training procedure
	The model was trained from scratch on a larger training dataset: 6.6 GB of civil and criminal cases.
	We used the [CamemBERT](https://huggingface.co/docs/transformers/main/en/model_doc/camembert) architecture with a language modeling head on top, the AdamW optimizer, an initial learning rate of 2e-5 (with linear learning rate decay), a sequence length of 512, a batch size of 18, and 1 million training steps.
	Training ran on 8 NVIDIA A100 40GB GPUs using distributed data parallel (each step processes 8 batches). The tokenizer is a SentencePiece model trained from scratch on a subset of the training set (5 million sentences), with a vocabulary size of 32,000.
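
As a rough sanity check on the figures above, the sketch below works out the effective batch size and the learning rate at a given step. It assumes that the batch size of 18 is per GPU, and that the linear decay runs to zero over the full 1 million steps with no warmup. Neither detail is stated explicitly, and `linear_decay_lr` is an illustrative helper, not the original training code.

```python
# Hyperparameters quoted in the training description.
PER_DEVICE_BATCH_SIZE = 18
NUM_DEVICES = 8            # 8 x NVIDIA A100 40GB
SEQUENCE_LENGTH = 512
TOTAL_STEPS = 1_000_000
INITIAL_LR = 2e-5

# Assumption: "batch size 18" is per device, so one step sees 8 batches of 18.
effective_batch_size = PER_DEVICE_BATCH_SIZE * NUM_DEVICES
# Upper bound on tokens per step (every sequence at full length).
max_tokens_per_step = effective_batch_size * SEQUENCE_LENGTH


def linear_decay_lr(step: int,
                    initial_lr: float = INITIAL_LR,
                    total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate after `step` updates, decaying linearly to 0.

    Assumption: no warmup phase and decay over exactly `total_steps`.
    """
    remaining = max(0, total_steps - step)
    return initial_lr * (remaining / total_steps)


print(effective_batch_size)      # 144
print(max_tokens_per_step)       # 73728
print(linear_decay_lr(500_000))  # 1e-05
```

Under these assumptions, each optimizer step covers 144 sequences, and the learning rate is halved by the midpoint of training.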


	## Usage

	The ITALIAN-LEGAL-BERT-SC model and its tokenizer can be loaded as follows:

	```python
	from transformers import AutoModel, AutoTokenizer
	model_name = "dlicari/Italian-Legal-BERT-SC"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModel.from_pretrained(model_name)
	```

	You can run inference with ITALIAN-LEGAL-BERT-SC through the Transformers `fill-mask` pipeline:

	```python
	# Install the dependencies first (notebook syntax):
	# %pip install sentencepiece
	# %pip install transformers

	from transformers import pipeline

	model_name = "dlicari/Italian-Legal-BERT-SC"
	fill_mask = pipeline("fill-mask", model=model_name)

	# Predict the most likely replacements for the <mask> token.
	fill_mask("Il <mask> ha chiesto revocarsi l'obbligo di pagamento")
	# Output (score and token_str fields only):
	# [{'score': 0.6529251933097839, 'token_str': 'ricorrente'},
	#  {'score': 0.0380014143884182, 'token_str': 'convenuto'},
	#  {'score': 0.0360226035118103, 'token_str': 'richiedente'},
	#  {'score': 0.023908283561468124, 'token_str': 'Condominio'},
	#  {'score': 0.020863816142082214, 'token_str': 'lavoratore'}]
	```
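
The pipeline returns a list of dictionaries, so the predictions are easy to post-process. The sketch below keeps only the tokens above a score threshold. `confident_tokens` and the 0.03 default threshold are hypothetical choices for illustration, and the input list is copied from the example output above (scores rounded to four decimals) so that the snippet runs without downloading the model.

```python
from typing import Dict, List

# Predictions copied from the fill-mask example (scores rounded to 4 decimals).
predictions = [
    {"score": 0.6529, "token_str": "ricorrente"},
    {"score": 0.0380, "token_str": "convenuto"},
    {"score": 0.0360, "token_str": "richiedente"},
    {"score": 0.0239, "token_str": "Condominio"},
    {"score": 0.0209, "token_str": "lavoratore"},
]


def confident_tokens(preds: List[Dict], min_score: float = 0.03) -> List[str]:
    """Return the predicted tokens whose score is at least `min_score`.

    Hypothetical helper: the threshold is arbitrary and should be tuned per task.
    """
    return [p["token_str"] for p in preds if p["score"] >= min_score]


print(confident_tokens(predictions))       # ['ricorrente', 'convenuto', 'richiedente']
print(confident_tokens(predictions, 0.5))  # ['ricorrente']
```

In practice, the list returned by `fill_mask(...)` can be passed to the helper directly in place of `predictions`.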