Bo1015 committed on
Commit
1bfafb5
1 Parent(s): 875f1e6

Update README.md

Files changed (1)
  1. README.md +11 -13
README.md CHANGED
@@ -3,20 +3,21 @@ tags:
  - biology
  ---

- # xTrimoPGLM-1B-MLM
+ # ProteinPGLM-1B-MLM

  ## Model Introduction

- **xTrimoPGLM-1B-MLM** is the open-source version of the latest masked protein language models designed for protein understanding tasks. The xTrimoPGLM family of models is developed by BioMap and Tsinghua University. Alongside this model, we have released the int4-quantized xTrimoPGLM-100B weights and other xTrimo-series smaller models, including 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.
+ **ProteinPGLM-1B-MLM** is the open-source version of the latest masked protein language models designed for protein understanding tasks. The ProteinPGLM family of models is developed by Tsinghua University. Alongside this model, we have released the int4-quantized ProteinPGLM-100B weights and other smaller models, including 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.

  ### Out-of-Distribution Perplexity Evaluation

- We evaluated the xTrimoPGLM-MLM (xTMLM) and xTrimoPGLM (100B) models on two out-of-distribution (OOD) test sets: one with sequence identity below 0.9 relative to the training set (<0.9 ID) and one with sequence identity below 0.5 (<0.5 ID). Each OOD dataset comprises approximately 10,000 protein sequences. The perplexity results, compared against ESM2-3B and ESM2-15B, are as follows (lower is better):
+ We evaluated the ProteinPGLM-MLM (PGLM) and ProteinPGLM-INT4 (100B) models on two out-of-distribution (OOD) test sets: one with sequence identity below 0.9 relative to the training set (<0.9 ID) and one with sequence identity below 0.5 (<0.5 ID). Each OOD dataset comprises approximately 10,000 protein sequences. The perplexity results, compared against ESM2-3B and ESM2-15B, are as follows (lower is better):

- | Model | ESM2 (3B) | ESM2 (15B) | xTMLM (1B) | xTMLM (3B) | xTMLM (10B) | xT (100B)-INT4 |
+ | Model | ESM2 (3B) | ESM2 (15B) | PGLM (1B) | PGLM (3B) | PGLM (10B) | PGLM-INT4 (100B) |
  |:--------------------|:----------:|:----------:|:----------:|:----------:|:--------------------:|:--------------------:|
- | < 0.9 ID | 7.7 | 7.3 | 9.3 | 7.8 | 7.6 | **6.8** |
- | < 0.5 ID | 11.5 | 11.0 | 13.5 | 11.9 | 11.6 | **10.8** |
+ | < 0.9 ID | 7.7 | 7.3 | 9.3 | 7.8 | 7.6 | **6.8** |
+ | < 0.5 ID | 11.5 | 11.0 | 13.5 | 11.9 | 11.6 | **10.8** |
+


  ## Downstream Protein Understanding Tasks Evaluation
@@ -30,8 +31,8 @@ We evaluated the xTrimoPGLM-MLM (xTMLM) and xTrimoPGLM(100B) models on two OOD t
  from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoTokenizer, AutoConfig
  import torch

- tokenizer = AutoTokenizer.from_pretrained("biomap-research/xtrimopglm-1b-mlm", trust_remote_code=True, use_fast=True)
- model = AutoModelForMaskedLM.from_pretrained("biomap-research/xtrimopglm-1b-mlm", trust_remote_code=True, torch_dtype=torch.bfloat16)
+ tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-1b-mlm", trust_remote_code=True, use_fast=True)
+ model = AutoModelForMaskedLM.from_pretrained("Bo1015/proteinglm-1b-mlm", trust_remote_code=True, torch_dtype=torch.bfloat16)
  if torch.cuda.is_available():
      model = model.cuda()
  model.eval()
@@ -44,16 +45,13 @@ with torch.inference_mode():


  # model for the sequence-level tasks
- model = AutoModelForSequenceClassification.from_pretrained("biomap-research/xtrimopglm-1b-mlm", trust_remote_code=True, torch_dtype=torch.bfloat16)
+ model = AutoModelForSequenceClassification.from_pretrained("Bo1015/proteinglm-10b-mlm", trust_remote_code=True, torch_dtype=torch.bfloat16)

  # model for the token-level tasks
- model = AutoModelForTokenClassification.from_pretrained("biomap-research/xtrimopglm-1b-mlm", trust_remote_code=True, torch_dtype=torch.bfloat16)
+ model = AutoModelForTokenClassification.from_pretrained("Bo1015/proteinglm-10b-mlm", trust_remote_code=True, torch_dtype=torch.bfloat16)

  ```

-
- For more inference or fine-tuning code, datasets, and requirements, please visit our [GitHub page](https://github.com/biomap-research/xTrimoPGLM).
-
  ## LICENSE

  The code in this repository is open source under the [Creative Commons Attribution-NonCommercial 4.0 International License](./LICENSE).
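For readers who want a number comparable to the perplexity table above, the following is a minimal sketch of one common way to score a protein sequence with the masked model loaded in the README snippet: mask one residue at a time and average the negative log-likelihood of the true residue. It assumes the `Bo1015/proteinglm-1b-mlm` tokenizer defines a standard mask token and that the remote-code model returns `logits` like other Hugging Face masked LMs; the exact protocol behind the reported numbers is not specified in this diff.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumption: same checkpoint as in the README snippet above.
model_name = "Bo1015/proteinglm-1b-mlm"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()


def pseudo_perplexity(sequence: str) -> float:
    """Mask each residue in turn and average the NLL of the true residue (sketch)."""
    enc = tokenizer(sequence, return_tensors="pt").to(device)
    input_ids = enc["input_ids"]
    nlls = []
    with torch.inference_mode():
        for pos in range(input_ids.size(1)):
            true_id = input_ids[0, pos].item()
            # Skip special tokens; assumes the tokenizer marks them as special.
            if true_id in tokenizer.all_special_ids:
                continue
            masked = input_ids.clone()
            masked[0, pos] = tokenizer.mask_token_id  # assumes a mask token is defined
            logits = model(input_ids=masked, attention_mask=enc.get("attention_mask")).logits
            log_probs = torch.log_softmax(logits[0, pos].float(), dim=-1)
            nlls.append(-log_probs[true_id].item())
    return float(torch.tensor(nlls).mean().exp())


print(pseudo_perplexity("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))  # toy sequence, not from the OOD sets
```

The exponentiated mean negative log-likelihood is the (pseudo-)perplexity, so lower values mean the model assigns higher likelihood to the held-out sequence, matching the "lower is better" convention of the table.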
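The updated snippet also loads `AutoModelForSequenceClassification` and `AutoModelForTokenClassification` heads for downstream tasks. As a hedged illustration, the sketch below wires a hypothetical two-class head through `AutoConfig` (already imported in the README) and runs a forward pass to show the output shapes; `num_labels=2`, the 1B checkpoint, and the toy sequence are placeholder assumptions, and both heads are randomly initialised until fine-tuned on labelled protein data.

```python
import torch
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoTokenizer,
)

# Assumptions: the 1B checkpoint for brevity (the README's downstream examples use the
# 10B variant) and a made-up 2-label task; neither value comes from the model card.
model_name = "Bo1015/proteinglm-1b-mlm"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, num_labels=2)

seq_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, config=config, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval()
tok_model = AutoModelForTokenClassification.from_pretrained(
    model_name, config=config, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval()

inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")  # toy sequence
with torch.inference_mode():
    seq_logits = seq_model(input_ids=inputs["input_ids"]).logits  # (batch, num_labels)
    tok_logits = tok_model(input_ids=inputs["input_ids"]).logits  # (batch, seq_len, num_labels)

print(seq_logits.shape, tok_logits.shape)  # heads are untrained until fine-tuned
```

Sharing a single `config` between the sequence-level and token-level heads keeps the sketch short; in practice each downstream task would define its own label space and fine-tuning setup.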