|
---
license: mit
datasets:
- chandar-lab/UR100P
language:
- en
tags:
- biology
---
|
|
|
## AMPLIFY |
|
|
|
AMPLIFY is an efficient, state-of-the-art protein language model pre-trained using masked language modeling on UniRef100, OAS, and SCOP ([UR100P](https://huggingface.co/datasets/chandar-lab/UR100P)). AMPLIFY can generate residue and protein embeddings, suggest mutations, differentiate disordered proteins from non-protein sequences, and much more. AMPLIFY is available in two sizes, 120M and 350M parameters; the `_base` checkpoints correspond to Stage 1 of pre-training only and were not extended beyond 512 residues. The model architecture and pre-training procedure are summarized below; for full details, please refer to the [accompanying paper](https://www.biorxiv.org/content/10.1101/2024.09.23.614603v1).
|
|
|
- [`AMPLIFY_350M`](https://huggingface.co/chandar-lab/AMPLIFY_350M) |
|
- [`AMPLIFY_350M_base`](https://huggingface.co/chandar-lab/AMPLIFY_350M_base) |
|
- [`AMPLIFY_120M`](https://huggingface.co/chandar-lab/AMPLIFY_120M) |
|
- [`AMPLIFY_120M_base`](https://huggingface.co/chandar-lab/AMPLIFY_120M_base) |
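
As an example of the embedding use case mentioned above, the snippet below extracts per-residue and mean-pooled protein embeddings. It is a minimal sketch, not the official API: it assumes the remote-code model accepts `output_hidden_states=True` and returns a standard `hidden_states` tuple, and the sequence is an arbitrary placeholder.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
model = model.to("cuda").eval()

sequence = "MSVVGIDLGFQSCYVAVARAGG"  # placeholder sequence, not from the paper

input_ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")
with torch.no_grad():
    # Assumption: hidden states are exposed like standard transformers models
    output = model(input_ids, output_hidden_states=True)

residue_embeddings = output.hidden_states[-1].squeeze(0)  # (seq_len, hidden_size)
protein_embedding = residue_embeddings.mean(dim=0)        # mean-pooled, (hidden_size,)
```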
|
|
|
### Model Description
|
|
|
|                                | AMPLIFY 120M | AMPLIFY 350M |
| :----------------------------- | -----------: | -----------: |
| `hidden-size`                  | 640          | 960          |
| `num-hidden-layers`            | 24           | 32           |
| `num-attention-heads`          | 10           | 15           |
| `intermediate-size`            | 2560         | 3840         |
| `max-position-embeddings`      | 2048         | 2048         |
| `vocab-size`                   | 27           | 27           |
| `rope-theta`                   | 10000        | 10000        |
| `dropout-prob`                 | 0            | 0            |
| `embedding-init-range`         | 0.02         | 0.02         |
| `norm-eps`                     | 1.0e-05      | 1.0e-05      |
| `hidden-act`                   | swiglu       | swiglu       |
| `pre-activation-layer-norm`    | true         | true         |
| `layer-norm-after-embedding`   | false        | false        |
| `layer-norm-before-last-layer` | true         | true         |
| `rms-norm`                     | true         | true         |
| `ffn-bias`                     | false        | false        |
| `attn-bias`                    | false        | false        |
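
If you want to verify these hyperparameters programmatically, the configuration can be loaded directly. Note that the attribute names below are an assumption (the usual `transformers` snake_case convention); check the checkpoint's `config.json` for the exact keys.

```python
from transformers import AutoConfig

# Assumed attribute names; the actual config keys may differ
config = AutoConfig.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
print(config.hidden_size)          # expected: 960
print(config.num_hidden_layers)    # expected: 32
print(config.num_attention_heads)  # expected: 15
print(config.vocab_size)           # expected: 27
```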
|
|
|
### Training Description
|
|
|
|                     | Stage 1     | Stage 2                        |
| :------------------ | ----------: | -----------------------------: |
| `dataset`           | UR100P      | UR100P                         |
| `max-steps`         | 1,000,000   | 25,000 (120M) or 50,000 (350M) |
| `max-length`        | 512         | 2048                           |
| `optimizer`         | adamw       | adamw                          |
| `lr`                | 0.001       | 0.001                          |
| `betas`             | (0.9, 0.95) | (0.9, 0.95)                    |
| `eps`               | 1.0e-08     | 1.0e-08                        |
| `weight-decay`      | 0.01        | 0.01                           |
| `scheduler`         | cosinedecay | none                           |
| `warmup-steps`      | 1,000       | none                           |
| `final-step`        | 900,000     | none                           |
| `gradient-clipping` | 1.0         | 1.0                            |
| `tf32`              | true        | true                           |
| `mixed-precision`   | bf16        | bf16                           |
| `padding`           | max-length  | max-length                     |
| `random-truncate`   | true        | true                           |
| `mask-probability`  | 0.15        | 0.15                           |
| `total-batch-size`  | 4096        | 4096                           |
| `deepspeed`         | true        | true                           |
| `zero-stage`        | 3           | 3                              |
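
To illustrate the 0.15 mask probability used in both stages, the snippet below reproduces a generic masked-language-modeling collation step. This is illustrative only, not the authors' training pipeline: it assumes the AMPLIFY tokenizer defines a mask token, and the generic collator's 80/10/10 replacement scheme may differ from the procedure used in the paper.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)

# Generic MLM collator using the 15% masking rate from the table above
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer("MSVVGIDLGFQSCYVAVARAGG")  # placeholder sequence
batch = collator([encoded])
print(batch["input_ids"])  # some positions replaced by the mask token
print(batch["labels"])     # -100 everywhere except the masked positions
```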
|
|
|
## Get Started |
|
|
|
```python
import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

# Load AMPLIFY and its tokenizer
model = AutoModel.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)

# Move the model to GPU (required due to Flash Attention)
model = model.to("cuda")
model.eval()

# Load the UniProt validation set
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein
    input_ids = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input_ids)

    # Move to the GPU and make a prediction
    input_ids = input_ids.to("cuda")
    with torch.no_grad():
        output = model(input_ids)
    print("Output: ", output)

    break
```
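
Beyond a forward pass, the model can be used to suggest mutations, as noted above. The sketch below scores a single substitution with masked-marginal log-likelihoods (the approach popularized for ESM-style models). It is hedged: the `.logits` attribute, the mask token, and the single leading special token (hence the `+ 1` offset) are all assumptions about the remote-code interface.

```python
import torch

def score_mutation(sequence, position, mutant, model, tokenizer):
    """Return log p(mutant) - log p(wild type) at `position` (0-indexed)."""
    input_ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")
    masked = input_ids.clone()
    masked[0, position + 1] = tokenizer.mask_token_id  # +1 assumes one leading special token
    with torch.no_grad():
        logits = model(masked).logits  # assumes the output exposes `.logits`
    log_probs = torch.log_softmax(logits[0, position + 1], dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(sequence[position])
    mut_id = tokenizer.convert_tokens_to_ids(mutant)
    return (log_probs[mut_id] - log_probs[wt_id]).item()

# A positive score indicates the model prefers the mutant residue at that position.
```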
|
|
|
## Citations |
|
|
|
If you find the models useful in your research, we ask that you cite the paper: |
|
|
|
```bibtex
@article{Fournier2024.09.23.614603,
  title = {Protein Language Models: Is Scaling Necessary?},
  author = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
  year = {2024},
  journal = {bioRxiv},
  publisher = {Cold Spring Harbor Laboratory},
  doi = {10.1101/2024.09.23.614603},
  url = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
  elocation-id = {2024.09.23.614603},
  eprint = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
}
```