---
language: en
pipeline_tag: fill-mask
tags:
- legal
license: mit
---

### InCaseLawBERT

Model and tokenizer files for the InCaseLawBERT model.

### Training Data

To build the pre-training corpus of Indian legal text, we collected a large number of case documents from the Indian Supreme Court and many High Courts of India.
These documents were collected from diverse publicly available sources on the Web, such as the official websites of these courts (e.g., [the website of the Indian Supreme Court](https://main.sci.gov.in/)), the erstwhile website of the Legal Information Institute of India, the popular legal repository [IndianKanoon](https://www.indiankanoon.org), and so on.
The court cases in our dataset range from 1950 to 2019 and belong to all legal domains, such as Civil, Criminal, Constitutional, and so on.
Additionally, we collected 1,113 Central Government Acts, which are the documents codifying the laws of the country. Each Act is a collection of related laws, called Sections. These 1,113 Acts contain a total of 32,021 Sections.
In total, our dataset contains around 5.4 million Indian legal documents (all in the English language).
The raw text corpus size is around 27 GB.

### Training Objective

This model is initialized with the [Legal-BERT model](https://huggingface.co/zlucia/legalbert) from the paper [When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings](https://dl.acm.org/doi/abs/10.1145/3462757.3466088). In our work, we refer to this model as CaseLawBERT, and our re-trained model as InCaseLawBERT. We further pre-train CaseLawBERT on our Indian legal corpus (described above) with the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives.

### Usage

Using the tokenizer (same as [CaseLawBERT](https://huggingface.co/zlucia/legalbert))
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("law-ai/InCaseLawBERT")
```
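
Below is a minimal sketch of encoding a sentence with the tokenizer loaded above; the example sentence and the `max_length` value are illustrative, not prescribed by the model.
```python
# Encode an example legal sentence into model-ready tensors
text = "The appellant was convicted under Section 302 of the Indian Penal Code."
encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

print(encoded["input_ids"].shape)  # (1, sequence_length)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])[:10])  # first few sub-word tokens
```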

Using the model to get embeddings/representations for a sentence
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("law-ai/InCaseLawBERT")
```
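
Loading the model only returns the encoder; a minimal sketch of actually extracting a sentence representation with the tokenizer and model loaded above is shown below. Using the first ([CLS]) token's hidden state as the sentence embedding is one common choice, not the only option.
```python
import torch

text = "The appellant was convicted under Section 302 of the Indian Penal Code."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Token-level representations: (batch_size, sequence_length, hidden_size)
token_embeddings = outputs.last_hidden_state
# One common sentence-level representation: the hidden state of the first ([CLS]) token
sentence_embedding = token_embeddings[:, 0, :]
print(sentence_embedding.shape)
```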

Using the model for further pre-training with MLM and NSP
```python
from transformers import BertForPreTraining

model_with_pretraining_heads = BertForPreTraining.from_pretrained("law-ai/InCaseLawBERT")
```
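
As a rough sketch, the snippet below shows how one further pre-training step could be computed with the model loaded above; the sentence pair, the hand-crafted masking, and the `next_sentence_label` value are purely illustrative (in practice, masking is usually delegated to a data collator such as `DataCollatorForLanguageModeling`).
```python
import torch

# One (sentence A, sentence B) pair for the NSP head; in BertForPreTraining,
# next_sentence_label = 0 means "B really follows A"
encoding = tokenizer(
    "The appeal was filed before the High Court.",
    "The High Court dismissed the appeal.",
    return_tensors="pt",
)

# MLM labels: ignore every position (-100) except one hand-picked token,
# which is then replaced with [MASK] in the inputs (illustrative only)
labels = torch.full_like(encoding["input_ids"], -100)
masked_index = 5  # hypothetical position to mask
labels[0, masked_index] = encoding["input_ids"][0, masked_index]
encoding["input_ids"][0, masked_index] = tokenizer.mask_token_id

outputs = model_with_pretraining_heads(
    **encoding,
    labels=labels,
    next_sentence_label=torch.tensor([0]),
)
print(outputs.loss)  # combined MLM + NSP loss for this toy batch
```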

### Citation