---
license: cc-by-nc-2.0
library_name: transformers
datasets:
- CCDS
- Ensembl
pipeline_tag: feature-extraction
tags:
- protein language model
- biology
widget:
- text: >-
    ( Z E V L P Y G D E K L S P Y G D G G D V G Q I F s C B L Q D T N N F F G A
    g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D K @ p p
    ^ v
  example_title: Feature extraction
---

# cdsBERT
<img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/yA-f7tnvNNV52DK2QYNq_.png" width="350">

## Model description

[cdsBERT+](https://doi.org/10.1101/2023.09.15.558027) is a pLM with a codon vocabulary that was seeded with [ProtBERT](https://huggingface.co/Rostlab/prot_bert_bfd) and trained with a novel vocabulary extension pipeline called MELD. cdsBERT+ offers a highly biologically relevant latent space and surpasses ProtBERT at EC number prediction.
Specifically, this is the half-precision checkpoint after student-teacher knowledge distillation with Ankh-base.

## How to use

```python
# Imports
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('lhallee/cdsBERT') # load model
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]' # CCDS207.1|Hs110|chr1
sequence = ' '.join(list(sequence)) # the tokenizer expects spaces between codon tokens

example = tokenizer(sequence, return_tensors='pt', padding=False).to(device) # tokenize example
with torch.no_grad():
    matrix_embedding = model(**example).last_hidden_state.squeeze(0).cpu() # (seq_len, hidden_size)

vector_embedding = matrix_embedding.mean(dim=0) # average over sequence positions -> (hidden_size,)
```
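For multiple sequences, padding has to be accounted for when averaging. The sketch below is one way to do this with the same `lhallee/cdsBERT` checkpoint; the attention-mask-weighted pooling and the toy input sequences are our own assumptions, not part of the published recipe.

```python
# Hedged sketch: batched feature extraction with attention-mask-aware mean pooling.
# The pooling strategy and example sequences are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('lhallee/cdsBERT')
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT')
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.eval()

sequences = ['(ZEVLPYGDEKL', 'MSKGEELFT']  # hypothetical inputs; real inputs use the codon vocabulary
sequences = [' '.join(list(seq)) for seq in sequences]  # spaces between codon tokens

batch = tokenizer(sequences, return_tensors='pt', padding=True).to(device)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

mask = batch['attention_mask'].unsqueeze(-1).float()  # (batch, seq_len, 1)
vector_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # ignore padding when averaging
```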

## Intended use and limitations
cdsBERT+ serves as a general-purpose protein language model with a codon vocabulary. Out of the box it is intended for feature extraction; fine-tuning with Hugging Face transformers classes like BertForSequenceClassification enables downstream classification and regression tasks, as sketched below. The base checkpoint after MLM, cdsBERT, can perform mask-filling.
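As a rough illustration of the fine-tuning route mentioned above, the snippet below wires the checkpoint into `BertForSequenceClassification`. The two-label setup, the toy sequences, and the placeholder labels are assumptions for demonstration only, not a published protocol.

```python
# Hedged sketch: fine-tuning setup with BertForSequenceClassification.
# num_labels, the toy batch, and the labels are illustrative assumptions.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT')
model = BertForSequenceClassification.from_pretrained('lhallee/cdsBERT', num_labels=2)  # classification head is newly initialized

sequences = [' '.join(list(seq)) for seq in ['(ZEVLPYGDEKL', 'MSKGEELFT']]  # hypothetical codon sequences
labels = torch.tensor([0, 1])  # placeholder labels

batch = tokenizer(sequences, return_tensors='pt', padding=True)
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()  # one illustrative backward step; use Trainer or a full training loop in practice
```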

## Our lab
The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware that focuses on solving translational problems with our expertise in engineering, biology, and chemistry. We develop inexpensive and reliable tools to study organ development, maternal-fetal health, and drug delivery. Recently, we have begun exploring protein language models and strive to make protein design and annotation accessible.

## Please cite
```bibtex
@article{Hallee_cds_2023,
	author = {Logan Hallee and Nikolaos Rafailidis and Jason P. Gleghorn},
	title = {cdsBERT - Extending Protein Language Models with Codon Awareness},
	year = {2023},
	doi = {10.1101/2023.09.15.558027},
	publisher = {Cold Spring Harbor Laboratory},
	journal = {bioRxiv}
}
```