metadata
license: mit
datasets:
- drug-discovery/UR100P
language:
- en
tags:
- biology
AMPLIFY
AMPLIFY is an efficient, state-of-the-art protein language model pre-trained using masked language modeling on UniRef100, OAS, and SCOP (UR100P). AMPLIFY can generate residue and protein embeddings, suggest mutations, differentiate disordered proteins from non-protein sequences, and much more. AMPLIFY is available in two sizes, 120M and 350M parameters, with the _base
models not extended beyond 512 residues (Stage 1). The model architecture and pre-training procedure are detailed below. For more details, please refer to the accompanying paper.
Model Descritpion
AMPLIFY 120M | AMPLIFY 350M | |
---|---|---|
hidden-size |
640 | 960 |
num-hidden-layers |
24 | 32 |
num-attention-heads |
10 | 15 |
intermediate-size |
2560 | 3840 |
max-position-embeddings |
2048 | 2048 |
vocab-size |
27 | 27 |
rope-theta |
10000 | 10000 |
dropout-prob |
0 | 0 |
embedding-init-range |
0.02 | 0.02 |
norm-eps |
1.0e-05 | 1.0e-05 |
hidden-act |
swiglu | swiglu |
pre-activation-layer-norm |
true | true |
layer-norm-after-embedding |
false | false |
layer-norm-before-last-layer |
true | true |
rms-norm |
true | true |
ffn-bias |
false | false |
attn-bias |
false | false |
Training Descritpion
Stage 1 | Stage 2 | |
---|---|---|
dataset |
UR100P | UR100P |
max-steps |
1000000 | 25000 (120M) or 50000 (350M) |
max-length |
512 | 2048 |
optimizer |
adamw | adamw |
lr |
0.001 | 0.001 |
betas |
(0.9, 0.95) | (0.9, 0.95) |
eps |
1.0e-08 | 1.0e-08 |
weight-decay |
0.01 | 0.01 |
scheduler |
cosinedecay | none |
warmup-steps |
1,000 | none |
final-step |
900,000 | none |
warmup-steps |
1,000 | none |
gradient-clipping |
1.0 | 1.0 |
tf32 |
true | true |
mixed-precision |
bf16 | bf16 |
padding |
max-length | max-length |
random-truncate |
true | true |
mask-probability |
0.15 | 0.15 |
total-batch-size |
4096 | 4096 |
deepspeed |
true | true |
zero-stage |
3 | 3 |
Get Started
from transformers import AutoModel
from transformers import AutoTokenizer
from datasets import load_dataset
# Load AMPLIFY and tokenizer
model = AutoModel.from_pretrained("drug-discovery/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("drug-discovery/AMPLIFY_350M", trust_remote_code=True)
# Move the model to GPU (required due to Flash Attention)
model = model.to("cuda")
# Load the UniProt validation set
dataset = load_dataset("drug-discovery/UR100P", data_dir="UniProt", split="test")
for sample in dataset:
# Protein
print("Sample: ", sample["name"], sample["sequence"])
# Tokenize the protein
input = tokenizer.encode(sample["sequence"], return_tensors="pt")
print("Input: ", input)
# Move to the GPU and make a prediction
input = input.to("cuda")
output = model(input)
print("Output: ", output)
break
Citations
If you find the models useful in your research, we ask that you cite the paper:
@article{Fournier2024.09.23.614603,
title = {Protein Language Models: Is Scaling Necessary?},
author = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
year = {2024},
journal = {bioRxiv},
publisher = {Cold Spring Harbor Laboratory},
doi = {10.1101/2024.09.23.614603},
url = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
elocation-id = {2024.09.23.614603},
eprint = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
}