---
license: mit
datasets:
  - chandar-lab/UR100P
language:
  - en
tags:
  - biology
---

## AMPLIFY

AMPLIFY is an efficient, state-of-the-art protein language model pre-trained with masked language modeling on UniRef100, OAS, and SCOP ([UR100P](https://huggingface.co/datasets/chandar-lab/UR100P)). AMPLIFY can generate residue and protein embeddings, suggest mutations, distinguish disordered proteins from non-protein sequences, and much more. It is available in two sizes, 120M and 350M parameters; the `_base` checkpoints correspond to Stage 1 of pre-training and are not extended beyond 512 residues. The model architecture and pre-training procedure are detailed below, and usage examples are provided under Get Started. For more details, please refer to the [accompanying paper](https://www.biorxiv.org/content/10.1101/2024.09.23.614603v1).

- [`AMPLIFY_350M`](https://huggingface.co/chandar-lab/AMPLIFY_350M)
- [`AMPLIFY_350M_base`](https://huggingface.co/chandar-lab/AMPLIFY_350M_base)
- [`AMPLIFY_120M`](https://huggingface.co/chandar-lab/AMPLIFY_120M)
- [`AMPLIFY_120M_base`](https://huggingface.co/chandar-lab/AMPLIFY_120M_base)

### Model Description

|                                | AMPLIFY 120M | AMPLIFY 350M |
| :----------------------------- | -----------: | -----------: |
| `hidden-size`                  |          640 |          960 |
| `num-hidden-layers`            |           24 |           32 |
| `num-attention-heads`          |           10 |           15 |
| `intermediate-size`            |         2560 |         3840 |
| `max-position-embeddings`      |         2048 |         2048 |
| `vocab-size`                   |           27 |           27 |
| `rope-theta`                   |        10000 |        10000 |
| `dropout-prob`                 |            0 |            0 |
| `embedding-init-range`         |         0.02 |         0.02 |
| `norm-eps`                     |      1.0e-05 |      1.0e-05 |
| `hidden-act`                   |       swiglu |       swiglu |
| `pre-activation-layer-norm`    |         true |         true |
| `layer-norm-after-embedding`   |        false |        false |
| `layer-norm-before-last-layer` |         true |         true |
| `rms-norm`                     |         true |         true |
| `ffn-bias`                     |        false |        false |
| `attn-bias`                    |        false |        false |

### Training Description

|                     |     Stage 1 |                      Stage 2 |
| :------------------ | ----------: | ---------------------------: |
| `dataset`           |      UR100P |                       UR100P |
| `max-steps`         |     1000000 | 25000 (120M) or 50000 (350M) |
| `max-length`        |         512 |                         2048 |
| `optimizer`         |       adamw |                        adamw |
| `lr`                |       0.001 |                        0.001 |
| `betas`             | (0.9, 0.95) |                  (0.9, 0.95) |
| `eps`               |     1.0e-08 |                      1.0e-08 |
| `weight-decay`      |        0.01 |                         0.01 |
| `scheduler`         | cosinedecay |                         none |
| `warmup-steps`      |        1000 |                         none |
| `final-step`        |      900000 |                         none |
| `gradient-clipping` |         1.0 |                          1.0 |
| `tf32`              |        true |                         true |
| `mixed-precision`   |        bf16 |                         bf16 |
| `padding`           |  max-length |                   max-length |
| `random-truncate`   |        true |                         true |
| `mask-probability`  |        0.15 |                         0.15 |
| `total-batch-size`  |        4096 |                         4096 |
| `deepspeed`         |        true |                         true |
| `zero-stage`        |           3 |                            3 |

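The Stage 1 recipe above maps onto standard PyTorch components. The sketch below is a minimal illustration of that recipe (AdamW, linear warmup, cosine decay to step 900,000, gradient clipping at 1.0) applied to a placeholder module; it is not the authors' training code, which additionally relies on DeepSpeed ZeRO-3, bf16 mixed precision, and a 4096-sequence global batch.

```python
import torch

# Placeholder module standing in for AMPLIFY; the hyperparameters mirror Stage 1 above.
model = torch.nn.Linear(960, 27)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.01,
)

# Linear warmup for 1,000 steps, then cosine decay until step 900,000.
warmup_steps, final_step = 1_000, 900_000
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=final_step - warmup_steps)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[warmup_steps])

# Inside the training loop, gradients are clipped before each optimizer step, e.g.:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
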
## Get Started

```python
import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

# Load AMPLIFY and its tokenizer (trust_remote_code is required for the custom model class)
model = AutoModel.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)

# Move the model to GPU (required due to Flash Attention)
model = model.to("cuda")

# Load the UniProt validation set
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein
    input_ids = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input_ids)

    # Move to the GPU and make a prediction
    input_ids = input_ids.to("cuda")
    with torch.no_grad():
        output = model(input_ids)
    print("Output: ", output)

    break
```
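
The introduction mentions residue and protein embeddings as well as mutation suggestion. The sketch below shows one way to obtain both from the masked-language-modeling head: mean-pooled hidden states as a protein embedding, and masked-token probabilities as a mutation ranking. It assumes the remote-code forward pass accepts `output_hidden_states=True` and returns an object exposing `hidden_states` and `logits`, and that the tokenizer defines a mask token; check the model's remote code if the interface differs. The example sequence and masked position are placeholders.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
model = model.to("cuda")

sequence = "MSVKGKTSLQAHELAVLG"  # placeholder: substitute a protein of interest

with torch.no_grad():
    # Residue embeddings: one vector per token from the last hidden layer.
    input_ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")
    output = model(input_ids, output_hidden_states=True)
    residue_embeddings = output.hidden_states[-1][0]       # (sequence length, hidden size)
    protein_embedding = residue_embeddings.mean(dim=0)     # mean-pooled protein embedding

    # Mutation suggestion: mask one residue and rank amino acids by model probability.
    position = 5                                            # placeholder residue index
    masked_ids = input_ids.clone()
    masked_ids[0, position + 1] = tokenizer.mask_token_id   # +1 assumes a leading special token
    logits = model(masked_ids).logits[0, position + 1]
    probabilities = torch.softmax(logits, dim=-1)
    top = torch.topk(probabilities, k=5)
    for token_id, probability in zip(top.indices.tolist(), top.values.tolist()):
        print(tokenizer.decode([token_id]), round(probability, 3))
```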

## Citations

If you find the models useful in your research, we ask that you cite the paper:

```bibtex
@article{Fournier2024.09.23.614603,
	title        = {Protein Language Models: Is Scaling Necessary?},
	author       = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
	year         = {2024},
	journal      = {bioRxiv},
	publisher    = {Cold Spring Harbor Laboratory},
	doi          = {10.1101/2024.09.23.614603},
	url          = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
	elocation-id = {2024.09.23.614603},
	eprint       = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
}
```