Model Card
NER Model
Date Extraction for Sentencias from DR
Choose a PDF or DOCX file to extract text, clean it, and perform Named Entity Recognition (NER) for date extraction.
Model Details
Model Description
This is a Named Entity Recognition (NER) model that identifies and extracts date entities from Spanish-language legal documents from the Dominican Republic. It is based on MMG/xlm-roberta-large-ner-spanish and was fine-tuned on boletines judiciales (judicial bulletins).
- Developed by: Victor Fernandez, Alejandro Gomez, Karol Gutierrez, Nathan Dahlberg, Bree Shi, Dr. Charlotte Alexander
- Model type: NER
- Language(s) (NLP): Spanish
- License:
- Finetuned from model: MMG/xlm-roberta-large-ner-spanish which is a derivative of FacebookAI/xlm-roberta-large
Model Sources
- Repository: Coming Soon
- Paper: Coming Soon
- Demo: Try it out
Uses
This NER model is intended for use in processing and analyzing legal documents from the Dominican Republic to extract date-related information. It is particularly useful for legal professionals, researchers, and organizations that need to automate the extraction of dates for case management, compliance, and archival purposes.
Direct Use
- Legal professionals working with documents in Spanish
- Researchers analyzing legal texts in Spanish
Out-of-Scope Use
- Extraction of non-date entities (e.g., persons, locations, organizations)
- High risk or critical applications
Bias, Risks, and Limitations
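- Limited training data: the model was fine-tuned on only 3 boletines judiciales.
- Date format variations: dates written in formats not represented in the training data may be missed or only partially extracted.
- Potential for misclassification: non-date text may be labeled as a date, and real dates may be missed.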
Recommendations
- Human QA and due diligence should follow the NER extraction.
How to Get Started with the Model
Use the code below to get started with the model.
import json
import logging

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

logger = logging.getLogger(__name__)

REPO = "agomez302/nlp-dr-ner"


class NerProcessor:
    def __init__(self):
        # Load the fine-tuned tokenizer and model, then wrap them in an NER pipeline.
        self.deployed_tokenizer = AutoTokenizer.from_pretrained(REPO)
        self.deployed_model = AutoModelForTokenClassification.from_pretrained(REPO)
        self.deployed_ner_pipeline = pipeline(
            "ner",
            model=self.deployed_model,
            tokenizer=self.deployed_tokenizer,
            aggregation_strategy="simple",
        )

    def process_text(self, text):
        """Runs NER model on text and returns a JSON string."""
        try:
            chunks = self.split_text_with_overlap(text)
            all_predictions = []
            for chunk in chunks:
                preds = self.deployed_ner_pipeline(chunk)
                all_predictions.extend(preds)
            all_predictions = self.deduplicate_entities(all_predictions)
            formatted_output = {
                "entities": self.run_predictions(all_predictions)
            }
            return json.dumps(formatted_output)
        except Exception as e:
            logger.error(f"Failed to run NER model on extracted text: {e}")
            raise  # re-raise so callers do not silently receive None

    def split_text_with_overlap(self, text, max_tokens=450, overlap=50):
        """Split text into chunks with overlap to handle long sequences."""
        if not text:
            return []
        max_tokens = min(max_tokens, 512)
        tokenizer = self.deployed_tokenizer
        tokens = tokenizer.encode(text, truncation=False)
        if len(tokens) <= max_tokens:
            return [text]
        chunks = []
        i = 0
        while i < len(tokens):
            chunk = tokenizer.decode(tokens[i:i + max_tokens], skip_special_tokens=True)
            chunks.append(chunk)
            i += max_tokens - overlap
        return chunks

    def deduplicate_entities(self, predictions):
        """Remove duplicate entities from overlapping chunks."""
        # Note: the pipeline's start/end offsets are relative to each chunk,
        # not to the full document.
        unique = []
        seen = set()
        for entity in predictions:
            key = (entity['entity_group'], entity['word'], entity['start'], entity['end'])
            if key not in seen:
                unique.append(entity)
                seen.add(key)
        return unique

    def run_predictions(self, predictions: list):
        """Format predictions for output, converting float32 to regular float."""
        try:
            processed_predictions = []
            for pred in predictions:
                pred_dict = dict(pred)
                pred_dict['score'] = float(pred_dict['score'])
                processed_predictions.append(pred_dict)
            return processed_predictions
        except Exception as e:
            logger.error(f"Failed to process predictions: {e}")
            raise


def main():
    text = "SENTENCIA DEL 31 DE ENERO DE 2024 ... que la sentencia que antecede fue dada y firmada por los jueces que figuran en ella, en la fecha arriba indicada. www.poderjudicial.gob.do\n"
    ner_processor = NerProcessor()
    ner_output = ner_processor.process_text(text)
    print(ner_output)


if __name__ == '__main__':
    main()
Sample output
{
  "entities": [
    {
      "entity_group": "DATE",
      "score": 0.9878288507461548,
      "word": "veintitrés (23) días del mes de mayo del año dos mil veintitrés (2023)",
      "start": 290,
      "end": 360
    },
    {
      "entity_group": "DATE",
      "score": 0.9994959831237793,
      "word": "23 de mayo del año 2023",
      "start": 1058,
      "end": 1081
    }
  ]
}
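Since the Recommendations section calls for human QA after extraction, one way to consume this output is to route lower-scoring entities to a reviewer. The sketch below is only an illustration: the function name `split_for_review` and the cutoff `REVIEW_THRESHOLD` are hypothetical and not part of the model or its demo, and the cutoff value is arbitrary rather than tuned.

```python
import json

# Hypothetical cutoff for this sketch; not a value recommended by the model authors.
REVIEW_THRESHOLD = 0.99


def split_for_review(ner_json, threshold=REVIEW_THRESHOLD):
    """Split entities from a {"entities": [...]} JSON string by confidence score."""
    entities = json.loads(ner_json)["entities"]
    confident = [e for e in entities if e["score"] >= threshold]
    needs_review = [e for e in entities if e["score"] < threshold]
    return confident, needs_review


# Usage with the two entities from the sample output.
sample = json.dumps({
    "entities": [
        {"entity_group": "DATE", "score": 0.9878288507461548,
         "word": "veintitrés (23) días del mes de mayo del año dos mil veintitrés (2023)",
         "start": 290, "end": 360},
        {"entity_group": "DATE", "score": 0.9994959831237793,
         "word": "23 de mayo del año 2023", "start": 1058, "end": 1081},
    ]
})
confident, needs_review = split_for_review(sample)
```

With this arbitrary cutoff, the long written-out date falls into `needs_review` and the short numeric date into `confident`.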
Training Details
Training Data
The training data consists of a JSON Lines (.jsonl) file specifically designed for Named Entity Recognition (NER) tasks in Spanish legal texts. Each entry in the dataset includes the text and corresponding entities labeled with their respective types.
{"text": "SENTENCIA DEL 31 DE ENERO DE 2024 ... que la sentencia que antecede fue dada y firmada por los jueces que figuran en ella, en la fecha arriba indicada. www.poderjudicial.gob.do\n", "entities": [{"start": 113, "end": 132, "label": "DATE"}, {"start": 271, "end": 292, "label": "DATE"}, {"start": 2009, "end": 2029, "label": "DATE"}, {"start": 2246, "end": 2265, "label": "DATE"}, {"start": 3083, "end": 3102, "label": "DATE"}, {"start": 3281, "end": 3300, "label": "DATE"}, {"start": 3479, "end": 3497, "label": "DATE"}, {"start": 3569, "end": 3588, "label": "DATE"}, {"start": 3872, "end": 3891, "label": "DATE"}, {"start": 7936, "end": 7955, "label": "DATE"}]}
// and so forth with further json lines
Dataset Path: ner_dataset.jsonl
Description: The dataset contains annotated legal documents with entities related to dates (e.g., B-DATE, I-DATE). This focused annotation helps the model accurately recognize and classify date-related entities within legal texts.
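To make the record layout concrete, here is a minimal sketch that parses JSON Lines in this shape and slices out each annotated span. The two records are invented for the example (short texts with hand-computed offsets), and `iter_spans` is a hypothetical helper, not part of the training code.

```python
import json

# Invented records in the same shape as the training data (text + character-offset entities).
sample_jsonl = "\n".join([
    json.dumps({
        "text": "SENTENCIA DEL 31 DE ENERO DE 2024",
        "entities": [{"start": 14, "end": 33, "label": "DATE"}],
    }),
    json.dumps({
        "text": "Dada el 23 de mayo del año 2023.",
        "entities": [{"start": 8, "end": 31, "label": "DATE"}],
    }),
])


def iter_spans(jsonl_text):
    """Yield (label, surface text) for every annotated entity in a JSONL string."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for ent in record["entities"]:
            # "end" is exclusive, so a plain slice recovers the annotated text.
            yield ent["label"], record["text"][ent["start"]:ent["end"]]


spans = list(iter_spans(sample_jsonl))
```

Printing the sliced spans is a quick sanity check that offsets still line up with the text after any cleaning step.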
Data Preprocessing:
- Chunking: The text data is split into manageable chunks to handle long sequences effectively. Each chunk maintains an overlap to ensure entities are not fragmented across chunks.
- Tokenization: The AutoTokenizer from Hugging Face is used to tokenize the text, aligning labels with tokenized inputs while handling special tokens and padding appropriately.
- Filtering: Chunks that begin with partial entities are discarded to maintain the integrity of entity recognition.
Training Procedure
The training procedure involves fine-tuning a pre-trained XLM-RoBERTa model for the specific NER task. The process is orchestrated through the NerFinetuner class, which manages data loading, preprocessing, model training, evaluation, and saving.
Preprocessing
Loading the Dataset: The dataset is loaded using the datasets library's load_dataset function, targeting the train split from the specified JSON Lines file.
Chunking Texts: Texts are divided into chunks of a maximum of 128 tokens with an overlap of 50 tokens to preserve entity continuity. Entities are adjusted to align with the chunked text segments. Chunks with incomplete entity annotations are filtered out to ensure consistency.
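The chunking step can be sketched roughly as follows. This is not the NerFinetuner implementation: it uses whitespace-separated words as a stand-in for subword tokens, the function name `chunk_with_entities` is hypothetical, and it assumes that any chunk cutting through an entity is dropped entirely.

```python
import re


def chunk_with_entities(text, entities, max_tokens=128, overlap=50):
    """Split text into overlapping windows and re-base entity offsets per chunk."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    # Whitespace "tokens" with character offsets (stand-in for subword tokens).
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    step = max_tokens - overlap
    chunks = []
    for i in range(0, len(spans), step):
        window = spans[i:i + max_tokens]
        c_start, c_end = window[0][0], window[-1][1]
        kept, complete = [], True
        for ent in entities:
            if ent["end"] <= c_start or ent["start"] >= c_end:
                continue  # entity lies entirely outside this chunk
            if ent["start"] < c_start or ent["end"] > c_end:
                complete = False  # entity is cut by a chunk boundary
                break
            kept.append({"start": ent["start"] - c_start,
                         "end": ent["end"] - c_start,
                         "label": ent["label"]})
        if complete:
            chunks.append({"text": text[c_start:c_end], "entities": kept})
        if i + max_tokens >= len(spans):
            break  # the last window already reached the end of the text
    return chunks


# Small windows so the behavior is visible on a short, invented sentence.
text = "Fallo del 23 de mayo de 2023 y audiencia el 5 de junio de 2023"
dates = ["23 de mayo de 2023", "5 de junio de 2023"]
entities = [{"start": text.index(d), "end": text.index(d) + len(d), "label": "DATE"}
            for d in dates]
chunks = chunk_with_entities(text, entities, max_tokens=8, overlap=3)
```

In this example the middle window starts inside the first date, so it is filtered out, and the two surviving chunks each carry one complete, re-based DATE span.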
Tokenization and Label Alignment: The AutoTokenizer tokenizes the text, and labels are aligned with the tokenized output. Special tokens and padding are handled by assigning a label of -100 to ignore them during training.
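Label alignment can be illustrated with a small sketch that works on token character offsets, in the form fast Hugging Face tokenizers return when called with `return_offsets_mapping=True` (special and padding tokens come back as `(0, 0)`). The helper `align_labels` and the id order in `LABEL2ID` are assumptions made for this example, and the offsets below are written by hand rather than produced by the real tokenizer.

```python
# Assumed id order for this sketch; the card lists the labels but not their ids.
LABEL2ID = {"O": 0, "B-DATE": 1, "I-DATE": 2}
IGNORE_INDEX = -100  # label value ignored by the loss during training


def align_labels(offset_mapping, entities):
    """Map character-level entity spans onto per-token BIO label ids."""
    labels = []
    for tok_start, tok_end in offset_mapping:
        if tok_start == tok_end:
            labels.append(IGNORE_INDEX)  # special or padding token
            continue
        label = "O"
        for ent in entities:
            if tok_start >= ent["start"] and tok_end <= ent["end"]:
                prefix = "B-" if tok_start == ent["start"] else "I-"
                label = prefix + ent["label"]
                break
        labels.append(LABEL2ID[label])
    return labels


# "Dada el 23 de mayo" with a DATE span over "23 de mayo" (characters 8..18).
offsets = [(0, 0), (0, 4), (5, 7), (8, 10), (11, 13), (14, 18), (0, 0)]
entities = [{"start": 8, "end": 18, "label": "DATE"}]
label_ids = align_labels(offsets, entities)
```

Here `label_ids` is `[-100, 0, 0, 1, 2, 2, -100]`: the two special-token positions are masked, and the date is tagged B-DATE followed by I-DATE.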
Training Hyperparameters
- Training regime: mixed precision, FP16 (fp16: True)
- Output directory: ./_ner_results
- Evaluation strategy: evaluates the model at the end of each epoch (eval_strategy: epoch)
- Save strategy: saves the model at the end of each epoch (save_strategy: epoch)
- Learning rate: 2e-5
- Training batch size: 16 per device
- Evaluation batch size: 16 per device
- Number of epochs: 10
- Weight decay: 0.01
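Expressed as configuration, these settings might look like the sketch below. The keyword names are those of `transformers.TrainingArguments`; `eval_strategy` is the spelling in recent releases (older releases use `evaluation_strategy`). The dict name `training_kwargs` and the helper `build_training_arguments` are hypothetical, and the actual NerFinetuner code may pass additional options.

```python
# Hyperparameters as listed in this card, keyed by TrainingArguments keyword names.
training_kwargs = {
    "output_dir": "./_ner_results",
    "eval_strategy": "epoch",   # older transformers releases: "evaluation_strategy"
    "save_strategy": "epoch",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 10,
    "weight_decay": 0.01,
    "fp16": True,               # mixed precision; requires a supported GPU
}


def build_training_arguments(kwargs=training_kwargs):
    """Build TrainingArguments from the dict (imported lazily; needs transformers)."""
    from transformers import TrainingArguments
    return TrainingArguments(**kwargs)
```

Keeping the values in a plain dict makes them easy to log alongside results before they are handed to the trainer.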
Evaluation
Testing Data, Factors & Metrics
Testing Data
The testing data was held out from the original dataset, which was split in code as follows:
# Split the dataset into training and testing sets (e.g., 80% train, 20% test)
split_dataset = dataset.train_test_split(test_size=0.2)
train_dataset = split_dataset["train"]
validation_dataset = split_dataset["test"]
Factors
- Entity Type: The primary focus is on detecting DATE entities within the legal texts.
Metrics
The evaluation employs the following metrics using the seqeval library:
- Precision: Measures the accuracy of the positive predictions.
- Recall: Assesses the ability of the model to find all relevant instances.
- F1 Score: Combines precision and recall into a single metric.
- Accuracy: Evaluates the overall correctness of the predictions.
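To show what these metrics measure on BIO-tagged output, here is a simplified, pure-Python stand-in. It is not seqeval's code: precision, recall, and F1 are computed over exact-match entity spans, accuracy over individual tags, and a stray I- tag with no preceding B- tag is simply ignored. The function names are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((etype, start, i))  # end index is exclusive
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_scores(y_true, y_pred):
    """Entity-level precision/recall/F1 plus tag-level accuracy."""
    true_spans, pred_spans = set(), set()
    n_tokens = n_correct = 0
    for sent_id, (t_tags, p_tags) in enumerate(zip(y_true, y_pred)):
        true_spans |= {(sent_id, *s) for s in extract_entities(t_tags)}
        pred_spans |= {(sent_id, *s) for s in extract_entities(p_tags)}
        n_tokens += len(t_tags)
        n_correct += sum(t == p for t, p in zip(t_tags, p_tags))
    tp = len(true_spans & pred_spans)  # spans with identical type and boundaries
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = n_correct / n_tokens if n_tokens else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Invented example: the second predicted date is one token too short.
y_true = [["O", "B-DATE", "I-DATE", "I-DATE", "O"], ["B-DATE", "I-DATE", "O"]]
y_pred = [["O", "B-DATE", "I-DATE", "I-DATE", "O"], ["B-DATE", "O", "O"]]
scores = entity_scores(y_true, y_pred)
```

The truncated date counts as both a false positive and a false negative, so precision, recall, and F1 are each 0.5 while tag-level accuracy is 0.875. This is why entity-level scores are usually the more informative ones for date extraction.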
Results
Coming Soon
Summary
Model Architecture and Objective
The model architecture is based on XLMRobertaForTokenClassification, a transformer-based model from Hugging Face tailored for token classification tasks such as NER.
- Base model: MMG/xlm-roberta-large-ner-spanish
- Number of labels: 3 (O, B-DATE, I-DATE)
- Label mappings: O = outside of any entity; B-DATE = beginning of a date entity; I-DATE = inside of a date entity
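The three-label scheme could be wired into a token classification head along these lines. This is a sketch under assumptions: the id order (O=0, B-DATE=1, I-DATE=2) is not stated in this card, and `decode_tags` and `load_base_model` are hypothetical helpers. `ignore_mismatched_sizes=True` is passed on the assumption that the base checkpoint's classifier has a different number of labels.

```python
LABELS = ["O", "B-DATE", "I-DATE"]  # assumed id order: 0, 1, 2
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_tags(label_ids, ignore_index=-100):
    """Turn label ids back into BIO tags, skipping ignored positions."""
    return [ID2LABEL[i] for i in label_ids if i != ignore_index]


def load_base_model(base="MMG/xlm-roberta-large-ner-spanish"):
    """Load the base checkpoint with a fresh 3-label head (downloads weights)."""
    from transformers import AutoModelForTokenClassification
    return AutoModelForTokenClassification.from_pretrained(
        base,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
        ignore_mismatched_sizes=True,  # replace the classifier if its size differs
    )


tags = decode_tags([-100, 0, 1, 2, 2, -100])
```

Storing `id2label` and `label2id` in the model config is what lets the inference pipeline report readable entity groups such as DATE instead of raw ids.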
Citation
BibTeX: Coming Soon
APA: Coming Soon
Glossary
sentencia - a document that is a formal judicial decision or judgment issued by a court at the conclusion of a legal proceeding.
More Information
This was developed as part of the HAAG Fall 2024 NLP-DR cohort under Dr. Charlotte Alexander and Bree Shi.
Model Card Contact
Reach out to the HAAG team or Dr. Alexander at Georgia Tech with any inquiries.