	---
	license: cc-by-nc-2.0
	library_name: transformers
	datasets:
	- CCDS
	- Ensembl
	pipeline_tag: feature-extraction
	tags:
	- protein language model
	- biology
	widget:
	- text: >-
	    ( Z E V L P Y G D E K L S P Y G D G G D V G Q I F s C B L Q D T N N F F G A
	    g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D K @ p p
	    ^ v
	  example_title: Feature extraction
	---

	# cdsBERT
	<img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/yA-f7tnvNNV52DK2QYNq_.png" width="350">

	## Model description

	[cdsBERT+](https://doi.org/10.1101/2023.09.15.558027) is a protein language model (pLM) with a codon vocabulary. It was seeded with [ProtBERT](https://huggingface.co/Rostlab/prot_bert_bfd) and trained with MELD, a novel vocabulary extension pipeline. cdsBERT+ offers a highly biologically relevant latent space and outperforms ProtBERT on Enzyme Commission (EC) number prediction.
	Specifically, this is the half-precision checkpoint after student-teacher knowledge distillation with Ankh-base.

	## How to use

	```python
	# Imports
	import torch
	from transformers import BertModel, BertTokenizer

	model = BertModel.from_pretrained('lhallee/cdsBERT') # load model
	tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
	device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
	model.to(device) # move to device
	model.eval() # put in eval mode

	sequence = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]' # CCDS207.1|Hs110|chr1
	sequence = ' '.join(list(sequence)) # need spaces in-between codons

	example = tokenizer(sequence, return_tensors='pt', padding=False).to(device) # tokenize example
	with torch.no_grad():
	    # last_hidden_state has shape (1, seq_len, hidden_size); drop the batch dimension
	    matrix_embedding = model(**example).last_hidden_state.squeeze(0).cpu()

	vector_embedding = matrix_embedding.mean(dim=0) # average over tokens -> (hidden_size,)
	```
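
The snippet above needs a space between every codon symbol before tokenization. The following is a minimal sketch of that preparation step for one or more sequences. It assumes each character of the encoded string stands for exactly one codon, as the comment in the example suggests. The helper names `to_codon_tokens` and `batch_to_codon_tokens` are hypothetical and not part of the released code.

```python
from typing import Iterable, List


def to_codon_tokens(encoded: str) -> str:
    """Return the space-separated form of a one-character-per-codon string.

    Assumption: every non-whitespace character is a single codon symbol.
    """
    # Drop whitespace that is already present so the helper is safe to re-run.
    symbols = [ch for ch in encoded if not ch.isspace()]
    if not symbols:
        raise ValueError("sequence contains no codon symbols")
    return " ".join(symbols)


def batch_to_codon_tokens(sequences: Iterable[str]) -> List[str]:
    """Apply to_codon_tokens to each sequence, preserving order."""
    return [to_codon_tokens(seq) for seq in sequences]


if __name__ == "__main__":
    print(to_codon_tokens("(ZEV"))  # ( Z E V
```

The resulting list can be passed to the tokenizer in one call with `padding=True`. In that case the padded positions should probably be excluded (for example via the attention mask) before averaging token embeddings.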

	## Intended use and limitations
	cdsBERT+ serves as a general-purpose protein language model with a codon vocabulary. Out of the box, it supports feature extraction. Fine-tuning with Hugging Face transformers classes such as `BertForSequenceClassification` enables downstream classification and regression tasks. The base checkpoint after MLM, cdsBERT, can also perform mask-filling.

	## Our lab
	The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware that focuses on solving translational problems with our expertise in engineering, biology, and chemistry. We develop inexpensive and reliable tools to study organ development, maternal-fetal health, and drug delivery. Recently we have begun exploration into protein language models and strive to make protein design and annotation accessible.

	## Please cite
	@article{Hallee_cds_2023,
	  author = {Logan Hallee and Nikolaos Rafailidis and Jason P. Gleghorn},
	  title = {cdsBERT - Extending Protein Language Models with Codon Awareness},
	  year = {2023},
	  doi = {10.1101/2023.09.15.558027},
	  publisher = {Cold Spring Harbor Laboratory},
	  journal = {bioRxiv}
	}