|
--- |
|
language: en |
|
tags: |
|
- transformers |
|
- feature-extraction |
|
- materials |
|
license: other |
|
--- |
|
|
|
# MaterialsBERT |
|
|
|
This model is a fine-tuned version of the [PubMedBERT model](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) on a dataset of 2.4 million materials science abstracts.

It was introduced in [this paper](https://www.nature.com/articles/s41524-023-01003-w). The model is uncased.
|
|
|
## Model description |
|
|
|
Domain-specific fine-tuning has been [shown](https://arxiv.org/abs/2007.15779) to improve downstream performance on a variety of NLP tasks. MaterialsBERT was obtained by fine-tuning PubMedBERT, a language model pre-trained on biomedical literature; PubMedBERT was chosen as the starting point because the biomedical domain is close to the materials science domain. When further fine-tuned on a variety of downstream sequence-labeling tasks in materials science, MaterialsBERT outperformed the other baseline language models tested on three out of five datasets.
|
|
|
## Intended uses & limitations |
|
|
|
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on downstream tasks relevant to materials science.
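
For a quick check of the raw masked-language-modeling head, a `fill-mask` pipeline can be used. This is a minimal sketch; the example sentence is purely illustrative and not from the original card:

```python
from transformers import pipeline

# Illustrative fill-mask usage; the model is uncased, so lowercase input works fine
fill_mask = pipeline('fill-mask', model='pranav-s/MaterialsBERT')
predictions = fill_mask("the glass transition temperature of the [MASK] was measured.")
for p in predictions:
    print(p['token_str'], p['score'])
```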
|
|
|
Note that this model is primarily aimed at being fine-tuned on tasks that use a sentence or a paragraph (potentially masked) |
|
to make decisions, such as sequence classification, token classification or question answering. |
|
|
|
|
|
## How to Use |
|
|
|
Here is how to use this model to get the features of a given text in PyTorch: |
|
|
|
```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
model = BertModel.from_pretrained('pranav-s/MaterialsBERT')

text = "Enter any text you like"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output.last_hidden_state holds one feature vector per input token
```
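
If you need a single vector per text rather than per-token features, one common approach is mean pooling over the attention mask. Continuing from the snippet above, a minimal sketch (not the method prescribed by the original card):

```python
# Average the token features, ignoring padding positions
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # (1, hidden_size)
```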
|
|
|
## Training data |
|
|
|
A fine-tuning corpus of 2.4 million materials science abstracts was used. The DOIs of the journal articles used are provided in the file `training_DOI.txt`.
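
Assuming `training_DOI.txt` is hosted in this model repository, the DOI list can be fetched and read as follows (a hedged sketch):

```python
from huggingface_hub import hf_hub_download

# Download the DOI list from the model repository (assumes the file lives there)
path = hf_hub_download(repo_id='pranav-s/MaterialsBERT', filename='training_DOI.txt')
with open(path) as f:
    dois = [line.strip() for line in f if line.strip()]
print(len(dois))
```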
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
|
- learning_rate: 5e-05 |
|
- train_batch_size: 32 |
|
- eval_batch_size: 32 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 3.0 |
|
- mixed_precision_training: Native AMP |
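
As a rough guide, these settings map onto the following `transformers` `TrainingArguments`. This is a sketch of a possible configuration, not the exact training script used; the output directory name is assumed:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the fine-tuning configuration listed above
training_args = TrainingArguments(
    output_dir="materialsbert-mlm",   # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
    fp16=True,                        # Native AMP mixed precision
)
```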
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.17.0 |
|
- Pytorch 1.10.2 |
|
- Datasets 1.18.3 |
|
- Tokenizers 0.11.0 |
|
|
|
|
|
## Citation |
|
|
|
If you find MaterialsBERT useful in your research, please cite the following paper: |
|
|
|
```bibtex
|
@article{materialsbert, |
|
title={A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing}, |
|
author={Shetty, Pranav and Rajan, Arunkumar Chitteth and Kuenneth, Chris and Gupta, Sonakshi and Panchumarti, Lakshmi Prerana and Holm, Lauren and Zhang, Chao and Ramprasad, Rampi}, |
|
journal={npj Computational Materials}, |
|
volume={9}, |
|
number={1}, |
|
pages={52}, |
|
year={2023}, |
|
publisher={Nature Publishing Group UK London} |
|
} |
|
``` |
|
|
|
<a href="https://huggingface.co/exbert/?model=pranav-s/MaterialsBERT"> |
|
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png"> |
|
</a> |
|
|