Citation Pre-Screening

Overview

  • Model type: Language Model
  • Architecture: DistilBERT
  • Language: Multilingual
  • License: Apache 2.0
  • Task: Binary Classification (Citation Pre-Screening)
  • Dataset: SIRIS-Lab/citation-parser-TYPE

Model description

The Citation Pre-Screening model is part of the Citation Parser package. It is a DistilBERT-based model fine-tuned to classify citation texts as valid or invalid, and it serves as the pre-screening step in the Citation Parser tool's workflow for citation metadata extraction and validation.

The model was trained on a dataset of citation texts labeled True (valid citation) or False (invalid citation). The dataset contains 3599 training samples and 400 test samples; each example consists of a citation-related text and its label.

Fine-tuning started from the distilbert-base-multilingual-cased checkpoint, so the model can handle multilingual text. Note, however, that it was evaluated on English citation data.

Intended Usage

This model classifies raw citation text as either a valid or an invalid citation. It is suited to automating the pre-screening step in citation databases or manuscript workflows.

How to use

from transformers import pipeline

# Load the model
citation_classifier = pipeline("text-classification", model="sirisacademic/citation-pre-screening")

# Example citation text
citation_text = "MURAKAMI, H等: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》"

# Classify the citation
result = citation_classifier(citation_text)
print(result)
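
When pre-screening many strings at once, it can help to wrap the classifier in a small filter. The following is a minimal sketch, not part of the Citation Parser package: the helper names `keep_valid` and `load_classifier`, the `min_score` threshold, and the assumption that the pipeline reports the positive class as the string "True" are all illustrative. Check the `label` values the model actually returns before relying on it.

```python
from typing import Callable, Iterable, List

# Assumption: the positive class is reported as the string "True".
# Inspect one real pipeline result and adjust if the label differs.
VALID_LABEL = "True"


def keep_valid(
    texts: Iterable[str],
    classify: Callable[[List[str]], List[dict]],
    valid_label: str = VALID_LABEL,
    min_score: float = 0.5,
) -> List[str]:
    """Return the texts that the classifier marks as valid citations."""
    batch = list(texts)
    if not batch:
        return []
    # One result dict per input, each with "label" and "score" keys.
    results = classify(batch)
    return [
        text
        for text, res in zip(batch, results)
        if res["label"] == valid_label and res["score"] >= min_score
    ]


def load_classifier() -> Callable[[List[str]], List[dict]]:
    """Load the real model (downloads the weights on first use)."""
    from transformers import pipeline

    clf = pipeline(
        "text-classification",
        model="sirisacademic/citation-pre-screening",
    )
    # truncation=True guards against inputs longer than the model's limit.
    return lambda batch: clf(batch, truncation=True)
```

A call such as `keep_valid(candidate_strings, load_classifier())` would then return only the strings classified as valid citations, under the label assumption stated above.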

Training

The model was trained using the Citation Pre-Screening Dataset consisting of:

  • Training data: 3599 samples
  • Test data: 400 samples

The following hyperparameters were used for training:

  • Model Path: distilbert/distilbert-base-multilingual-cased
  • Batch Size: 32
  • Number of Epochs: 4
  • Learning Rate: 2e-5
  • Max Sequence Length: 512
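
To make these settings concrete, the sketch below collects them in one place and derives the number of optimizer steps they imply. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; none of these is stated in the card. The card also does not say which training loop was used, so `build_training_args` (a hypothetical helper with a hypothetical output directory) only shows how the values would map onto the Hugging Face `TrainingArguments` class if the `Trainer` API were chosen.

```python
import math

# Values copied from the hyperparameter list above.
HPARAMS = {
    "model_path": "distilbert/distilbert-base-multilingual-cased",
    "batch_size": 32,
    "epochs": 4,
    "learning_rate": 2e-5,
    "max_seq_length": 512,
}
TRAIN_SAMPLES = 3599


def optimizer_steps(n_samples: int, batch_size: int, epochs: int) -> int:
    """Total update steps, assuming one device and no gradient accumulation."""
    return math.ceil(n_samples / batch_size) * epochs


def build_training_args(output_dir: str = "citation-pre-screening-out"):
    """Map the settings onto TrainingArguments (assumes the Trainer API)."""
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,  # hypothetical path
        per_device_train_batch_size=HPARAMS["batch_size"],
        num_train_epochs=HPARAMS["epochs"],
        learning_rate=HPARAMS["learning_rate"],
    )


if __name__ == "__main__":
    total = optimizer_steps(
        TRAIN_SAMPLES, HPARAMS["batch_size"], HPARAMS["epochs"]
    )
    print(total)
```

Under these assumptions, one epoch is 113 steps and the full run is 452 steps. The maximum sequence length is applied at tokenization time rather than through the training arguments.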

Evaluation Metrics

On the 400-sample test set, the model obtained the following results:

Metric            Value
Accuracy          0.95
Macro avg F1      0.94
Weighted avg F1   0.95
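
For readers unfamiliar with the difference between the two F1 averages, here is a minimal pure-Python sketch of how the three metrics are defined for a labeled test set. The function names are illustrative, and the card does not state which library produced the reported figures; the toy labels in the usage note are made up for demonstration and are not the model's predictions.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def f1_for_class(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], cls: Hashable
) -> float:
    """F1 for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Dict[str, float]:
    """Accuracy, macro-averaged F1, and support-weighted F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    f1 = {c: f1_for_class(y_true, y_pred, c) for c in support}
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        # Macro: every class counts equally.
        "macro_f1": sum(f1.values()) / len(f1),
        # Weighted: each class counts in proportion to its support.
        "weighted_f1": sum(f1[c] * support[c] for c in support) / n,
    }
```

Calling `summarize([True, True, True, False], [True, True, False, False])` on this toy input gives an accuracy of 0.75, a macro F1 of about 0.733, and a weighted F1 of about 0.767. A macro average slightly below the weighted average, as in the table above, is consistent with the less frequent class having the lower per-class F1.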

Additional information

Authors

  • SIRIS Lab, Research Division of SIRIS Academic.

License

This work is distributed under the Apache License, Version 2.0.

Contact

For further information, send an email to either [email protected] or [email protected].

Model size: 135M params (Safetensors, tensor type F32)