---
language: en
tags:
- transformers
- feature-extraction
- materials
license: other
---

# MaterialsBERT

This model is a fine-tuned version of the [PubMedBERT model](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext), further trained on a corpus of 2.4 million materials science abstracts.
It was introduced in [this paper](https://www.nature.com/articles/s41524-023-01003-w). The model is uncased.

## Model description

Domain-specific fine-tuning has been [shown](https://arxiv.org/abs/2007.15779) to improve downstream performance on a variety of NLP tasks. MaterialsBERT fine-tunes PubMedBERT, a language model pre-trained on biomedical literature; this starting point was chosen because the biomedical domain is close to the materials science domain. When further fine-tuned on a variety of downstream sequence-labeling tasks in materials science, MaterialsBERT outperformed the other baseline language models tested on three out of five datasets.

## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to
be fine-tuned on downstream tasks relevant to materials science.

Note that this model is primarily aimed at being fine-tuned on tasks that use a sentence or a paragraph (potentially masked)
to make decisions, such as sequence classification, token classification or question answering.
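
As an illustration of the raw masked-language-modeling use mentioned above, the model can be queried through the `fill-mask` pipeline. This is a minimal sketch; the example sentence is illustrative and not taken from the paper.

```python
from transformers import pipeline

# Query the raw masked-LM head of MaterialsBERT.
fill_mask = pipeline('fill-mask', model='pranav-s/MaterialsBERT')

# Illustrative example sentence (not from the paper).
print(fill_mask("Polystyrene is an amorphous [MASK]."))
```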


## How to Use

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
model = BertModel.from_pretrained('pranav-s/MaterialsBERT')

text = "Enter any text you like"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.last_hidden_state holds the token-level features
```
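
For many applications a single fixed-size vector per text is more convenient than per-token features. One common approach (not prescribed by the paper) is to mean-pool the last hidden state over non-padding tokens, as sketched below; the example sentences are illustrative only.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
model = BertModel.from_pretrained('pranav-s/MaterialsBERT')
model.eval()

sentences = [
    "The glass transition temperature of polystyrene is around 100 C.",
    "Polyethylene is a widely used thermoplastic.",
]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    hidden = model(**encoded).last_hidden_state            # (batch, seq_len, 768)

# Mean-pool over real tokens only, using the attention mask to ignore padding.
mask = encoded['attention_mask'].unsqueeze(-1).float()      # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, 768)
print(embeddings.shape)
```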

## Training data

A fine-tuning corpus of 2.4 million materials science abstracts was used. The DOIs of the journal articles in this corpus are provided in the file `training_DOI.txt`.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training; a sketch of the equivalent `TrainingArguments` follows the list:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
- mixed_precision_training: Native AMP
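
As a hedged sketch, these hyperparameters map onto the Transformers `TrainingArguments` API roughly as follows; the output directory is a placeholder, and the data and model wiring of the original training script is not reproduced here.

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="materialsbert-finetuning",  # placeholder path
    learning_rate=5e-05,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
    fp16=True,  # native AMP mixed-precision training
)
```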


### Framework versions

- Transformers 4.17.0
- Pytorch 1.10.2
- Datasets 1.18.3
- Tokenizers 0.11.0


## Citation

If you find MaterialsBERT useful in your research, please cite the following paper:

```bibtex
@article{materialsbert,
  title={A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing},
  author={Shetty, Pranav and Rajan, Arunkumar Chitteth and Kuenneth, Chris and Gupta, Sonakshi and Panchumarti, Lakshmi Prerana and Holm, Lauren and Zhang, Chao and Ramprasad, Rampi},
  journal={npj Computational Materials},
  volume={9},
  number={1},
  pages={52},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
```

<a href="https://huggingface.co/exbert/?model=pranav-s/MaterialsBERT">
	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>