# PreMode
This is the repository for our manuscript "PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context" posted on bioRxiv: https://www.biorxiv.org/content/10.1101/2024.02.20.581321v2
# Data
Please use Git LFS to download all files in the `data.files/` folder.
Then unzip the files with this script:
```
bash unzip.files.sh
```
Unfortunately, we are not allowed to share the HGMD data, so we removed all pathogenic variants from HGMD (49218 in total) from the `data.files/pretrain/training.*` files. This might affect the plots in `analysis/figs/fig.sup.12.pdf` and `analysis/figs/fig.sup.13.pdf` if you re-run the R code in the `analysis/` folder.
Instead, we share the weights of our models that were trained using HGMD.
# Install Packages
Please install the necessary packages using
```
mamba env create -f PreMode.yaml
mamba env create -f r4-base.yaml
```
You can check the installation by running
```
conda activate PreMode
python train.py --conf scripts/TEST.yaml --mode train
```
If no error occurs, the installation was successful.
# New Experiment
## Start from scratch and use our G/LoF datasets
1. Create a folder under `scripts/` and add a file named `pretrain.seed.0.yaml` to it; see `scripts/PreMode/pretrain.seed.0.yaml` for an example.
2. Run pretraining on the pathogenicity task:
```
python train.py --conf scripts/NEW_FOLDER/pretrain.seed.0.yaml
```
3. Prepare transfer learning config files:
```
bash scripts/DMS.prepare.yaml.sh scripts/NEW_FOLDER/
```
4. Run transfer learning:
```
bash scripts/DMS.5fold.run.sh scripts/NEW_FOLDER TASK_NAME GPU_ID
```
If you have multiple tasks, separate them with commas in TASK_NAME, like "task_1,task_2,task_3".
5. (Optional) To rerun the transfer learning tasks from our paper using 8 GPU cards, run
```
bash transfer.all.sh scripts/NEW_FOLDER
```
If you only have one GPU card, run
```
bash transfer.all.in.one.card.sh scripts/NEW_FOLDER
```
6. Save inference results:
```
bash scripts/DMS.5fold.inference.sh scripts/NEW_FOLDER analysis/NEW_FOLDER TASK_NAME GPU_ID
```
7. You'll get a folder `analysis/NEW_FOLDER/TASK_NAME` with 5 `.csv` files, each with 4 columns `logits.FOLD.[0-3]`. Each column represents the G/LoF prediction from one cross-validation run (closer to 0 means more likely GoF, closer to 1 means more likely LoF). We suggest averaging the predictions across the 4 columns.
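
The averaging suggested in step 7 can be sketched in a few lines of Python. This is only an illustration, not part of the repository: it assumes the four columns are literally named `logits.FOLD.0` to `logits.FOLD.3`, and the `variant` column and all values in the toy data are made up.

```python
import csv
import io
import re

# Assumption: the four prediction columns are named logits.FOLD.0 ... logits.FOLD.3.
FOLD_COLUMN = re.compile(r"^logits\.FOLD\.[0-3]$")


def average_fold_logits(rows):
    """Return one averaged prediction per row of an inference CSV."""
    averaged = []
    for row in rows:
        values = [float(v) for name, v in row.items() if FOLD_COLUMN.match(name)]
        if not values:
            raise ValueError("no logits.FOLD.[0-3] columns found")
        averaged.append(sum(values) / len(values))
    return averaged


# Toy stand-in for one of the five .csv files (hypothetical rows and values).
example_csv = io.StringIO(
    "variant,logits.FOLD.0,logits.FOLD.1,logits.FOLD.2,logits.FOLD.3\n"
    "var_a,0.1,0.2,0.1,0.2\n"
    "var_b,0.9,0.8,1.0,0.9\n"
)
scores = average_fold_logits(csv.DictReader(example_csv))
```

For a real result file, replace the `io.StringIO` object with `open(path, newline="")`. An averaged value near 0 then points toward GoF and a value near 1 toward LoF, as described in step 7.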
## Only transfer learning, user defined mode-of-action datasets
1. Prepare a `.csv` file for training and inference. Two formats are accepted:
+ Format 1 (only for missense variants):
| uniprotID | aaChg | score | ENST |
| :-: | :-: | :-: | :-: |
| P15056 | p.V600E | 1 | ENST00000646891 |
| P15056 | p.G446V | -1 | ENST00000646891 |
+ `uniprotID`: the UniProt ID of the protein.
+ `aaChg`: the amino acid change induced by the missense variant.
+ `score`: 1 for GoF, -1 for LoF. Not required for inference. For DMS, this could be experimental readouts. If you have multiplexed assays, replace it with columns `score.1, score.2, score.3, ..., score.N`.
+ `ENST` (optional): the Ensembl transcript ID that matches the `uniprotID`.
+ Format 2 (can be missense variant or multiple variants):
| uniprotID | ref | alt | pos.orig | score | ENST | wt.orig | sequence.len.orig |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| P15056 | V | E | 600 | 1 | ENST00000646891 | ... | 766 |
| P15056 | G | V | 446 | -1 | ENST00000646891 | ... | 766 |
| P15056 | G;V | V;F | 446;471 | -1 | ENST00000646891 | ... | 766 |
+ `uniprotID`: the UniProt ID of the protein.
+ `ref`: the reference amino acid; for multiple variants, separate the values with ";".
+ `alt`: the alternative amino acid; for multiple variants, separate the values with ";" in the same order as `ref`.
+ `pos.orig`: the position of the amino acid change; for multiple variants, separate the values with ";" in the same order as `ref`.
+ `score`: same as above.
+ `ENST` (optional): same as above.
+ `wt.orig`: the wild-type protein sequence, in the UniProt format.
+ `sequence.len.orig`: the length of the wild-type protein sequence.
+ If you prepared your input in Format 1, please run
```
bash parse.input.table/parse.input.table.sh YOUR_FILE TRANSFORMED_FILE
```
to transform it to Format 2. Note that it will drop lines whose `aaChg` doesn't match the corresponding AlphaFold sequence.
2. Prepare a config file for training the model and inference.
```
bash scripts/prepare.new.task.yaml.sh PRETRAIN_MODEL_NAME YOUR_TASK_NAME YOUR_TRAINING_FILE YOUR_INFERENCE_FILE TASK_TYPE MODE_OF_ACTION_N
```
+ `PRETRAIN_MODEL_NAME` could be one of the following:
+ `scripts/PreMode`: Default PreMode
+ `scripts/PreMode.ptm`: PreMode + ptm as input
+ `scripts/PreMode.noStructure`: PreMode without structure input
+ `scripts/PreMode.noESM`: PreMode with the ESM2 input replaced by one-hot encodings of the 20 amino acids
+ `scripts/PreMode.noMSA`: PreMode without MSA input
+ `scripts/ESM.SLP`: ESM embedding + Single Layer Perceptron
+ `YOUR_TASK_NAME` can be any name you like.
+ `YOUR_TRAINING_FILE` is the training file you prepared in step 1.
+ `YOUR_INFERENCE_FILE` is the inference file you prepared in step 1.
+ `TASK_TYPE` could be `DMS` or `GLOF`.
+ `MODE_OF_ACTION_N` is the number of mode-of-action dimensions. For `GLOF` this is usually 1. For a multiplexed `DMS` dataset, this could be the number of biochemical properties measured. If it is larger than 1, make sure the `score` column from step 1 is replaced with `score.1, score.2, ..., score.N` accordingly.
3. Run your config file
```
conda activate PreMode
bash scripts/run.new.task.sh PRETRAIN_MODEL_NAME YOUR_TASK_NAME OUTPUT_FOLDER GPU_ID
```
This should take about 30 minutes on an NVIDIA A40 GPU, depending on the size of your dataset.
4. You'll get a file in `OUTPUT_FOLDER` named `YOUR_TASK_NAME.inference.result.csv`.
+ If your `TASK_TYPE` is `GLOF`, the column `logits` contains the inference results. Closer to 0 means GoF, closer to 1 means LoF.
+ If your `TASK_TYPE` is `DMS` and `MODE_OF_ACTION_N` is 1, the column `logits` contains the inference results. If `MODE_OF_ACTION_N` is larger than 1, you will get multiple `logits.*` columns, each representing one predicted DMS measurement.
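
To make the relation between the two input formats in step 1 concrete, here is a minimal Python sketch of how a Format 1 `aaChg` value maps onto the `ref`, `alt` and `pos.orig` columns of Format 2. It is not the repository's `parse.input.table.sh`; the function names are hypothetical, it assumes single-letter amino acid codes and 1-based positions, and it does not look up `wt.orig` or `ENST`.

```python
import re

# Assumption: aaChg uses single-letter codes, e.g. "p.V600E".
AA_CHANGE = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def split_aa_change(aa_chg):
    """Split one aaChg string into Format 2 style fields."""
    match = AA_CHANGE.match(aa_chg)
    if match is None:
        raise ValueError(f"unrecognized aaChg: {aa_chg!r}")
    ref, pos, alt = match.groups()
    return {"ref": ref, "alt": alt, "pos.orig": int(pos)}


def join_variants(aa_changes):
    """Combine several changes into the ';'-separated multi-variant form."""
    parts = [split_aa_change(change) for change in aa_changes]
    return {
        "ref": ";".join(p["ref"] for p in parts),
        "alt": ";".join(p["alt"] for p in parts),
        "pos.orig": ";".join(str(p["pos.orig"]) for p in parts),
    }


def matches_wild_type(wt_seq, ref, pos):
    """Check that the reference residue sits at the 1-based position pos."""
    return 1 <= pos <= len(wt_seq) and wt_seq[pos - 1] == ref


single = split_aa_change("p.V600E")
double = join_variants(["p.G446V", "p.V471F"])
```

Here `single` holds `ref` = `V`, `alt` = `E` and `pos.orig` = 600, and `double` reproduces the `G;V`, `V;F`, `446;471` row of the Format 2 table. A check like `matches_wild_type` illustrates why rows can be dropped during the transformation: a row whose reference residue disagrees with the sequence cannot be placed.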
# Models & Figures in our manuscript
## Pretrained Models
Here is the list of models in our manuscript:
+ `scripts/PreMode/`: PreMode. It takes 250 GB of RAM and 4 NVIDIA A40 GPUs to run and finishes in ~50 h.
+ `scripts/ESM.SLR/`: Baseline model, ESM2 (650M) + Single Layer Perceptron.
+ `scripts/PreMode.large.window/`: PreMode with the window size set to 1251 AA.
+ `scripts/PreMode.noESM/`: PreMode with the ESM2 embeddings replaced by one-hot encodings of the 20 AAs.
+ `scripts/PreMode.noMSA/`: PreMode without the MSA input.
+ `scripts/PreMode.noPretrain/`: PreMode without pretraining on ClinVar/HGMD.
+ `scripts/PreMode.noStructure/`: PreMode without the AF2 predicted structure input.
+ `scripts/PreMode.ptm/`: PreMode with the one-hot encoding of post-translational modification sites added as input.
+ `scripts/PreMode.mean.var/`: PreMode that outputs both the predicted value (mean) and the confidence (var), used in adaptive learning tasks.
## Predicted mode-of-action
| gene | file |
| :-: | :-: |
| BRAF | `analysis/5genes.all.mut/PreMode/P15056.logits.csv` |
| RET | `analysis/5genes.all.mut/PreMode/P07949.logits.csv` |
| TP53 | `analysis/5genes.all.mut/PreMode/P04637.logits.csv` |
| KCNJ11 | `analysis/5genes.all.mut/PreMode/Q14654.logits.csv` |
| CACNA1A | `analysis/5genes.all.mut/PreMode/O00555.logits.csv` |
| SCN5A | `analysis/5genes.all.mut/PreMode/Q14524.logits.csv` |
| SCN2A | `analysis/5genes.all.mut/PreMode/Q99250.logits.csv` |
| ABCC8 | `analysis/5genes.all.mut/PreMode/Q09428.logits.csv` |
| PTEN | `analysis/5genes.all.mut/PreMode/P60484.logits.csv` |
For each file, column `logits.0` is the predicted pathogenicity, `logits.1` is the predicted LoF probability, and `logits.2` is the predicted GoF probability.
For PTEN, `logits.1` is the predicted stability (0 is loss, 1 is neutral) and `logits.2` is the predicted enzyme activity (0 is loss, 1 is neutral).
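
As a reading aid for these files, the column meanings above can be attached to a row with a small Python helper. This is a hypothetical sketch, not code from the repository: the function names are made up, and the numbers in the example row are invented rather than taken from any of the listed files.

```python
# Column meanings as described above; PTEN uses different ones for logits.1/2.
DEFAULT_COLUMNS = {
    "logits.0": "pathogenicity",
    "logits.1": "LoF probability",
    "logits.2": "GoF probability",
}
PTEN_COLUMNS = {
    "logits.0": "pathogenicity",
    "logits.1": "stability (0 = loss, 1 = neutral)",
    "logits.2": "enzyme activity (0 = loss, 1 = neutral)",
}


def column_meanings(gene):
    """Return the column-to-meaning mapping for a gene symbol."""
    return PTEN_COLUMNS if gene.upper() == "PTEN" else DEFAULT_COLUMNS


def label_row(gene, row):
    """Re-key one CSV row (as a dict) by the meaning of each logits column."""
    meanings = column_meanings(gene)
    return {meanings[col]: float(row[col]) for col in meanings if col in row}


# Invented values for illustration only.
labeled = label_row("BRAF", {"logits.0": "0.9", "logits.1": "0.2", "logits.2": "0.7"})
```

Rows read with `csv.DictReader` from one of the `*.logits.csv` files can be passed to `label_row` directly, since it converts the string values to floats.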
## Figures
Please go to `analysis/` folder and run the corresponding R scripts.