lhallee committed · Commit b4a005e · 1 Parent(s): 2ef2239

Update README.md

  1. README.md +2 -1
README.md CHANGED
@@ -22,6 +22,7 @@ widget:
  ## Model description
 
  [cdsBERT+](https://doi.org/10.1101/2023.09.15.558027) is a pLM with a codon vocabulary that was seeded with [ProtBERT](https://huggingface.co/Rostlab/prot_bert_bfd) and trained with a novel vocabulary extension pipeline called MELD. cdsBERT+ offers a highly biologically relevant latent space with excellent EC number prediction surpassing ProtBERT.
 
  ## How to use
 
@@ -49,7 +50,7 @@ vector_embedding = matrix_embedding.mean(dim=0)
  ```
 
  ## Intended use and limitations
- cdsBERT serves as a general-purpose protein language model with a codon vocabulary. Fine-tuning with Huggingface transformers models like BertForSequenceClassification enables downstream classification and regression tasks. Currently, the base capability enables feature extraction and mask filling.
 
  ## Our lab
  The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware that focuses on solving translational problems with our expertise in engineering, biology, and chemistry. We develop inexpensive and reliable tools to study organ development, maternal-fetal health, and drug delivery. Recently we have begun exploration into protein language models and strive to make protein design and annotation accessible.
 
  ## Model description
 
  [cdsBERT+](https://doi.org/10.1101/2023.09.15.558027) is a pLM with a codon vocabulary that was seeded with [ProtBERT](https://huggingface.co/Rostlab/prot_bert_bfd) and trained with a novel vocabulary extension pipeline called MELD. cdsBERT+ offers a highly biologically relevant latent space with excellent EC number prediction surpassing ProtBERT.
+ Specifically, this is the half-precision checkpoint after student-teacher knowledge distillation with Ankh-base.
 
  ## How to use
 
 
  ```
 
  ## Intended use and limitations
+ cdsBERT+ serves as a general-purpose protein language model with a codon vocabulary. Fine-tuning with Hugging Face transformers models like BertForSequenceClassification enables downstream classification and regression tasks. Currently, the base capability enables feature extraction. The base checkpoint after MLM, cdsBERT, can perform mask filling.
 
  ## Our lab
  The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware that focuses on solving translational problems with our expertise in engineering, biology, and chemistry. We develop inexpensive and reliable tools to study organ development, maternal-fetal health, and drug delivery. Recently we have begun exploration into protein language models and strive to make protein design and annotation accessible.
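
For the feature-extraction use named in the updated "Intended use and limitations" text, the second hunk header shows the README reducing a per-token embedding matrix to one vector with `vector_embedding = matrix_embedding.mean(dim=0)`. The following is a minimal pure-Python sketch of that mean pooling step, not the README's own code: `mean_pool` and `toy_matrix` are hypothetical names, and the numbers are made-up toy values rather than real cdsBERT+ outputs.

```python
from typing import List


def mean_pool(matrix_embedding: List[List[float]]) -> List[float]:
    """Average per-token vectors over the sequence axis (the dim=0 axis).

    matrix_embedding is assumed to have shape (sequence_length, hidden_size),
    i.e. one row per token (codon) and one column per hidden dimension.
    """
    if not matrix_embedding:
        raise ValueError("matrix_embedding must contain at least one token vector")
    hidden_size = len(matrix_embedding[0])
    if any(len(row) != hidden_size for row in matrix_embedding):
        raise ValueError("all token vectors must have the same length")
    seq_len = len(matrix_embedding)
    # Column-wise mean: one averaged value per hidden dimension.
    return [
        sum(row[j] for row in matrix_embedding) / seq_len
        for j in range(hidden_size)
    ]


# Toy stand-in for a (3 tokens x 4 hidden dims) embedding matrix.
toy_matrix = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [5.0, 6.0, 7.0, 8.0],
]
vector_embedding = mean_pool(toy_matrix)
print(vector_embedding)  # [3.0, 4.0, 5.0, 6.0]
```

The pooled vector has length `hidden_size` regardless of sequence length, which is what makes it usable as a fixed-size feature for downstream classifiers or regressors.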