SciBERT

This is the pretrained model presented in SciBERT: A Pretrained Language Model for Scientific Text, which is a BERT model trained on scientific text.

The training corpus was papers taken from Semantic Scholar. Corpus size is 1.14M papers, 3.1B tokens. We use the full text of the papers in training, not just abstracts.

SciBERT has its own wordpiece vocabulary (scivocab) that's built to best match the training corpus. We trained cased and uncased versions.

Available models include:

  • scibert_scivocab_cased
  • scibert_scivocab_uncased

The original repo can be found here.

If using these models, please cite the following paper:

@inproceedings{beltagy-etal-2019-scibert,
    title = "SciBERT: A Pretrained Language Model for Scientific Text",
    author = "Beltagy, Iz  and Lo, Kyle  and Cohan, Arman",
    booktitle = "EMNLP",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1371"
}
Downloads last month
755,906
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The model has no pipeline_tag.

Model tree for allenai/scibert_scivocab_uncased

Adapters
1 model
Finetunes
69 models
Quantizations
1 model

Spaces using allenai/scibert_scivocab_uncased 17