Bo1015 committed · Commit 74246c0 · verified · 1 Parent(s): aa44c2f

Update README.md

Files changed (1): README.md +73 -3
README.md CHANGED

---
license: apache-2.0
tags:
- biology
---

# xTrimoPGLM-1B-MLM

## Model Introduction

**xTrimoPGLM-1B-MLM** is an open-source masked protein language model designed for protein understanding tasks. The xTrimoPGLM family of models is developed by BioMap and Tsinghua University. Along with this model, we have released the int4-quantized xTrimoPGLM-100B weights and the other xTrimo-series models, which include 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.

### Out-of-Distribution Perplexity Evaluation

We evaluated the xTrimoPGLM-MLM (xTMLM) and xTrimoPGLM (100B) models on two OOD test sets: one with sequence identity lower than 0.9 to the training set (<0.9 ID) and the other with sequence identity lower than 0.5 to the training set (<0.5 ID). Each OOD dataset comprises approximately 10,000 protein sequences. The perplexity results, compared against ESM2-3B and ESM2-15B, are as follows (lower is better):

| Test set | ESM2 (3B) | ESM2 (15B) | xTMLM (1B) | xTMLM (3B) | xTMLM (10B) | xT (100B) |
|:---------|:---------:|:----------:|:----------:|:----------:|:-----------:|:---------:|
| < 0.9 ID | 7.7       | 7.3        | 9.3        | 7.8        | 7.6         | **6.7**   |
| < 0.5 ID | 11.5      | 11.0       | 13.5       | 11.9       | 11.6        | **10.8**  |
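
For reference, perplexity for a masked language model of this kind is typically estimated by masking one residue at a time and averaging the negative log-likelihood of the true residue at each masked position ("pseudo-perplexity"). The snippet below is a minimal sketch of that procedure, not the exact evaluation script behind the table; it assumes the Hugging Face interface shown in the usage section below, that the tokenizer exposes `tokenizer.mask_token_id`, and that residue tokens come before any appended special tokens.

```python
# Minimal pseudo-perplexity sketch for a masked protein LM (illustrative only).
# Assumptions (not guaranteed by the xTrimoPGLM remote code): the tokenizer
# defines `mask_token_id`, residue tokens occupy positions 0..len(seq)-1, and
# the model output exposes `.logits` like standard Hugging Face masked LMs.
import torch
import torch.nn.functional as F

def pseudo_perplexity(model, tokenizer, seq):
    enc = tokenizer(seq, add_special_tokens=True, return_tensors="pt")
    input_ids = enc["input_ids"].to(model.device)
    attention_mask = enc["attention_mask"].to(model.device)
    nlls = []
    with torch.inference_mode():
        for pos in range(len(seq)):  # mask one residue at a time
            masked = input_ids.clone()
            masked[0, pos] = tokenizer.mask_token_id
            logits = model(input_ids=masked, attention_mask=attention_mask).logits
            log_probs = F.log_softmax(logits[0, pos].float(), dim=-1)
            nlls.append(-log_probs[input_ids[0, pos]])
    return torch.exp(torch.stack(nlls).mean()).item()  # exp(mean NLL)
```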

## Downstream Protein Understanding Tasks Evaluation
(TODO)


## How to use

```python
# Obtain residue embeddings
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoTokenizer, AutoConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("biomap-research/xtrimopglm-1b-mlm", trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained("biomap-research/xtrimopglm-1b-mlm", trust_remote_code=True, torch_dtype=torch.bfloat16)
# Load the pretrained weights (from_config alone would initialize a randomly weighted model)
model = AutoModelForMaskedLM.from_pretrained("biomap-research/xtrimopglm-1b-mlm", config=config, trust_remote_code=True, torch_dtype=torch.bfloat16)
if torch.cuda.is_available():
    model = model.cuda()
model.eval()

seq = 'MILMCQHFSGQFSKYFLAVSSDFCHFVFPIILVSHVNFKQMKRKGFALWNDRAVPFTQGIFTTVMILLQYLHGTG'
output = tokenizer(seq, add_special_tokens=True, return_tensors='pt')
with torch.inference_mode():
    inputs = {"input_ids": output["input_ids"].to(model.device), "attention_mask": output["attention_mask"].to(model.device)}
    output_embeddings = model(**inputs, output_hidden_states=True, return_last_hidden_state=True).hidden_states[:-1, 0]  # get rid of the <eos> token

# model for sequence-level tasks (the classification head is randomly initialized and must be fine-tuned)
model = AutoModelForSequenceClassification.from_config(config, trust_remote_code=True, torch_dtype=torch.bfloat16)

# model for token-level tasks (the classification head is randomly initialized and must be fine-tuned)
model = AutoModelForTokenClassification.from_config(config, trust_remote_code=True, torch_dtype=torch.bfloat16)
```
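
As a complement to the snippet above, the sketch below shows one way the sequence-level model could be fine-tuned with the standard Hugging Face `Trainer`. The tiny in-memory dataset, label count, output directory, and hyperparameters are placeholder assumptions for illustration (the tokenizer is also assumed to define a pad token so batches can be collated); the official fine-tuning code and datasets are on the GitHub page linked below.

```python
# Hypothetical fine-tuning sketch for a sequence-level task (binary labels).
# The toy dataset, hyperparameters, and output directory are placeholders,
# not the authors' actual training setup.
import torch
from datasets import Dataset
from transformers import (AutoConfig, AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

model_id = "biomap-research/xtrimopglm-1b-mlm"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True, num_labels=2)
# Load the pretrained backbone; the classification head starts randomly initialized.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, config=config, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Two made-up sequences with made-up labels, just to show the data format.
raw = Dataset.from_dict({
    "sequence": ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MILMCQHFSGQFSKYFLAVSSDFCHF"],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["sequence"], add_special_tokens=True, truncation=True, max_length=1024)

train_ds = raw.map(tokenize, batched=True, remove_columns=["sequence"])

args = TrainingArguments(
    output_dir="xtrimopglm-1b-seqcls",  # hypothetical output directory
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5,
    bf16=True,
    logging_steps=1,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer)
trainer.train()
```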

For more inference or fine-tuning code, datasets, and requirements, please visit our [GitHub page](https://github.com/biomap-research/xTrimoPGLM).

## LICENSE

The code in this repository is open source under the [Apache-2.0 license](./LICENSE).

## Citations

If you find our work useful, please consider citing the following paper:

```
@article{chen2024xtrimopglm,
  title={xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein},
  author={Chen, Bo and Cheng, Xingyi and Li, Pan and Geng, Yangli-ao and Gong, Jing and Li, Shen and Bei, Zhilei and Tan, Xu and Wang, Boyan and Zeng, Xin and others},
  journal={arXiv preprint arXiv:2401.06199},
  year={2024}
}
```