AMPLIFY_350M_base / README.md
qfournier's picture
Update links
908a709 verified
---
license: mit
datasets:
- chandar-lab/UR100P
language:
- en
tags:
- biology
---
## AMPLIFY
AMPLIFY is an efficient, state-of-the-art protein language model pre-trained using masked language modeling on UniRef100, OAS, and SCOP ([UR100P](https://huggingface.co/datasets/chandar-lab/UR100P)). AMPLIFY can generate residue and protein embeddings, suggest mutations, differentiate disordered proteins from non-protein sequences, and much more. AMPLIFY is available in two sizes, 120M and 350M parameters, with the `_base` models not extended beyond 512 residues (Stage 1). The model architecture and pre-training procedure are detailed below. For more details, please refer to the [accompanying paper](https://www.biorxiv.org/content/10.1101/2024.09.23.614603v1).
- [`AMPLIFY_350M`](https://huggingface.co/chandar-lab/AMPLIFY_350M)
- [`AMPLIFY_350M_base`](https://huggingface.co/chandar-lab/AMPLIFY_350M_base)
- [`AMPLIFY_120M`](https://huggingface.co/chandar-lab/AMPLIFY_120M)
- [`AMPLIFY_120M_base`](https://huggingface.co/chandar-lab/AMPLIFY_120M_base)
### Model Descritpion
| | AMPLIFY 120M | AMPLIFY 350M |
| :----------------------------- | -----------: | -----------: |
| `hidden-size` | 640 | 960 |
| `num-hidden-layers` | 24 | 32 |
| `num-attention-heads` | 10 | 15 |
| `intermediate-size` | 2560 | 3840 |
| `max-position-embeddings` | 2048 | 2048 |
| `vocab-size` | 27 | 27 |
| `rope-theta` | 10000 | 10000 |
| `dropout-prob` | 0 | 0 |
| `embedding-init-range` | 0.02 | 0.02 |
| `norm-eps` | 1.0e-05 | 1.0e-05 |
| `hidden-act` | swiglu | swiglu |
| `pre-activation-layer-norm` | true | true |
| `layer-norm-after-embedding` | false | false |
| `layer-norm-before-last-layer` | true | true |
| `rms-norm` | true | true |
| `ffn-bias` | false | false |
| `attn-bias` | false | false |
### Training Descritpion
| | Stage 1 | Stage 2 |
| :------------------ | ----------: | ---------------------------: |
| `dataset` | UR100P | UR100P |
| `max-steps` | 1000000 | 25000 (120M) or 50000 (350M) |
| `max-length` | 512 | 2048 |
| `optimizer` | adamw | adamw |
| `lr` | 0.001 | 0.001 |
| `betas` | (0.9, 0.95) | (0.9, 0.95) |
| `eps` | 1.0e-08 | 1.0e-08 |
| `weight-decay` | 0.01 | 0.01 |
| `scheduler` | cosinedecay | none |
| `warmup-steps` | 1,000 | none |
| `final-step` | 900,000 | none |
| `warmup-steps` | 1,000 | none |
| `gradient-clipping` | 1.0 | 1.0 |
| `tf32` | true | true |
| `mixed-precision` | bf16 | bf16 |
| `padding` | max-length | max-length |
| `random-truncate` | true | true |
| `mask-probability` | 0.15 | 0.15 |
| `total-batch-size` | 4096 | 4096 |
| `deepspeed` | true | true |
| `zero-stage` | 3 | 3 |
## Get Started
```python
from transformers import AutoModel
from transformers import AutoTokenizer
from datasets import load_dataset
# Load AMPLIFY and tokenizer
model = AutoModel.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
# Move the model to GPU (required due to Flash Attention)
model = model.to("cuda")
# Load the UniProt validation set
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")
for sample in dataset:
# Protein
print("Sample: ", sample["name"], sample["sequence"])
# Tokenize the protein
input = tokenizer.encode(sample["sequence"], return_tensors="pt")
print("Input: ", input)
# Move to the GPU and make a prediction
input = input.to("cuda")
output = model(input)
print("Output: ", output)
break
```
## Citations
If you find the models useful in your research, we ask that you cite the paper:
```bibtex
@article{Fournier2024.09.23.614603,
title = {Protein Language Models: Is Scaling Necessary?},
author = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
year = {2024},
journal = {bioRxiv},
publisher = {Cold Spring Harbor Laboratory},
doi = {10.1101/2024.09.23.614603},
url = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
elocation-id = {2024.09.23.614603},
eprint = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
}
```