	# PreMode
	This is the repository for our manuscript "PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context" posted on bioRxiv: https://www.biorxiv.org/content/10.1101/2024.02.20.581321v2

	# Data
	Please use Git LFS to download all files in the `data.files/` folder.

	Unzip the files with this script:
	```
	bash unzip.files.sh
	```

	Unfortunately, we are not allowed to share the HGMD data, so we removed all pathogenic variants from HGMD (49218 in total) from the `data.files/pretrain/training.*` files. This might affect the plots in `analysis/figs/fig.sup.12.pdf` and `analysis/figs/fig.sup.13.pdf` if you re-run the R scripts in the `analysis/` folder.

	Instead, we share the weights of our models that were trained with the HGMD data.

	# Install Packages
	Please install the necessary packages using
	```
	mamba env create -f PreMode.yaml
	mamba env create -f r4-base.yaml
	```

	You can check the installation by running
	```
	conda activate PreMode
	python train.py --conf scripts/TEST.yaml --mode train
	```
	If the command runs without errors, the installation succeeded.

	# New Experiment
	## Start from scratch and use our G/LoF datasets
	1. Create a folder under `scripts/` and add a file named `pretrain.seed.0.yaml` to it. See `scripts/PreMode/pretrain.seed.0.yaml` for an example.
	2. Run pretraining on the pathogenicity task:
	```
	python train.py --conf scripts/NEW_FOLDER/pretrain.seed.0.yaml
	```
	3. Prepare transfer learning config files:
	```
	bash scripts/DMS.prepare.yaml.sh scripts/NEW_FOLDER/
	```
	4. Run transfer learning:
	```
	bash scripts/DMS.5fold.run.sh scripts/NEW_FOLDER TASK_NAME GPU_ID
	```
	If you have multiple tasks, separate them with commas in `TASK_NAME`, for example "task_1,task_2,task_3".
	5. (Optional) To rerun the transfer learning tasks from our paper on 8 GPU cards, run
	```
	bash transfer.all.sh scripts/NEW_FOLDER
	```
	If you only have one GPU card, run
	```
	bash transfer.all.in.one.card.sh scripts/NEW_FOLDER
	```
	6. Save inference results:
	```
	bash scripts/DMS.5fold.inference.sh scripts/NEW_FOLDER analysis/NEW_FOLDER TASK_NAME GPU_ID
	```
	7. You'll get a folder `analysis/NEW_FOLDER/TASK_NAME` with 5 `.csv` files. Each file has 4 columns named `logits.FOLD.[0-3]`, and each column holds the G/LoF prediction from one cross-validation run (closer to 0 means more likely GoF, closer to 1 means more likely LoF). We suggest averaging the predictions across the 4 columns.
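The suggested averaging can be sketched as follows. This is a minimal illustration, not code from the repository: the function name `average_fold_logits` is hypothetical, it assumes that every column whose name starts with `logits.` is one cross-validation prediction, and the toy input (including its `aaChg` column and all values) is made up.

```python
import csv
import io
from statistics import mean


def average_fold_logits(csv_text: str) -> list[float]:
    """Return the row-wise mean of all `logits.*` columns in a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: every per-fold prediction column starts with "logits.".
    fold_columns = [name for name in reader.fieldnames if name.startswith("logits.")]
    if not fold_columns:
        raise ValueError("no logits.* columns found")
    return [mean(float(row[col]) for col in fold_columns) for row in reader]


# Toy input with illustrative values, not real model output.
example = (
    "aaChg,logits.FOLD.0,logits.FOLD.1,logits.FOLD.2,logits.FOLD.3\n"
    "p.V600E,0.25,0.25,0.5,0.0\n"
    "p.G446V,1.0,0.5,1.0,0.5\n"
)
print(average_fold_logits(example))
```

For a real file, read its contents with `open(path).read()` and pass the text to the function.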

	## Transfer learning only, with user-defined mode-of-action datasets
	1. Prepare a `.csv` file for training and inference. Two formats are accepted:
	+ Format 1 (missense variants only):
	| uniprotID | aaChg | score | ENST |
	| :-: | :-: | :-: | :-: |
	| P15056 | p.V600E | 1 | ENST00000646891 |
	| P15056 | p.G446V | -1 | ENST00000646891 |
	  + `uniprotID`: the UniProt ID of the protein.
	  + `aaChg`: the amino acid change induced by the missense variant.
	  + `score`: 1 for GoF, -1 for LoF. Not required for inference. For DMS, this can be the experimental readout. For multiplexed assays, replace it with the columns `score.1, score.2, score.3, ..., score.N`.
	  + `ENST` (optional): the Ensembl transcript ID that matches the UniProt ID.
	+ Format 2 (a single missense variant or multiple variants):
	| uniprotID | ref | alt | pos.orig | score | ENST | wt.orig | sequence.len.orig |
	| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
	| P15056 | V | E | 600 | 1 | ENST00000646891 | ... | 766 |
	| P15056 | G | V | 446 | -1 | ENST00000646891 | ... | 766 |
	| P15056 | G;V | V;F | 446;471 | -1 | ENST00000646891 | ... | 766 |
	  + `uniprotID`: the UniProt ID of the protein.
	  + `ref`: the reference amino acid. For multiple variants, separate the values with ";".
	  + `alt`: the alternative amino acid. For multiple variants, separate the values with ";" in the same order as `ref`.
	  + `pos.orig`: the position of the amino acid change. For multiple variants, separate the values with ";" in the same order as `ref`.
	  + `score`: same as above.
	  + `ENST` (optional): same as above.
	  + `wt.orig`: the wild-type protein sequence, in the UniProt format.
	  + `sequence.len.orig`: the length of the wild-type protein sequence.

	+ If you prepared your input in Format 1, please run
	```
	bash parse.input.table/parse.input.table.sh YOUR_FILE TRANSFORMED_FILE
	```
	to transform it to Format 2. Note that lines are dropped if their `aaChg` does not match the corresponding AlphaFold sequence.
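To make the relationship between the two formats concrete, here is a small sketch of how the variant fields map onto each other. It is not the logic of `parse.input.table.sh` (which also checks variants against the AlphaFold sequence); the function names are hypothetical, and it assumes `aaChg` uses single-letter amino acid codes as in the examples above.

```python
import re

# Single-letter missense notation such as "p.V600E": ref residue, position, alt residue.
AA_CHANGE = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def parse_aa_change(aa_chg: str) -> tuple[str, int, str]:
    """Split a Format 1 `aaChg` value into (ref, pos, alt)."""
    match = AA_CHANGE.match(aa_chg)
    if match is None:
        raise ValueError(f"unsupported aaChg: {aa_chg!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def split_multi_variant(ref: str, alt: str, pos: str) -> list[tuple[str, int, str]]:
    """Expand ';'-separated Format 2 fields into one (ref, pos, alt) tuple per variant."""
    refs, alts, positions = ref.split(";"), alt.split(";"), pos.split(";")
    if not (len(refs) == len(alts) == len(positions)):
        raise ValueError("ref, alt and pos.orig must list the same number of variants")
    return [(r, int(p), a) for r, a, p in zip(refs, alts, positions)]


print(parse_aa_change("p.V600E"))
print(split_multi_variant("G;V", "V;F", "446;471"))
```

The length check in `split_multi_variant` catches rows where the ";"-separated fields are out of step, which is an easy mistake to make when editing Format 2 files by hand.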
	2. Prepare a config file for training the model and inference.
	```
	bash scripts/prepare.new.task.yaml.sh PRETRAIN_MODEL_NAME YOUR_TASK_NAME YOUR_TRAINING_FILE YOUR_INFERENCE_FILE TASK_TYPE MODE_OF_ACTION_N
	```
	+ `PRETRAIN_MODEL_NAME` can be one of the following:
	  + `scripts/PreMode`: Default PreMode
	  + `scripts/PreMode.ptm`: PreMode + PTM as input
	  + `scripts/PreMode.noStructure`: PreMode without structure input
	  + `scripts/PreMode.noESM`: PreMode with the ESM2 input replaced by one-hot encodings of the 20 AAs
	  + `scripts/PreMode.noMSA`: PreMode without MSA input
	  + `scripts/ESM.SLP`: ESM embedding + Single Layer Perceptron
	+ `YOUR_TASK_NAME` can be any name you like.
	+ `YOUR_TRAINING_FILE` is the training file you prepared in step 1.
	+ `YOUR_INFERENCE_FILE` is the inference file you prepared in step 1.
	+ `TASK_TYPE` can be `DMS` or `GLOF`.
	+ `MODE_OF_ACTION_N`: the number of mode-of-action dimensions. For `GLOF` this is usually 1. For a multiplexed `DMS` dataset, this can be the number of biochemical properties measured. If it is larger than 1, make sure the `score` column from step 1 is replaced by `score.1, score.2, ..., score.N` accordingly.
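Before preparing the config, it can help to check that the training file has the score columns that `MODE_OF_ACTION_N` implies. The following is a minimal sketch under that reading of the naming rule; both function names are hypothetical and nothing here is part of the repository.

```python
def expected_score_columns(mode_of_action_n: int) -> list[str]:
    """Return the score column names implied by MODE_OF_ACTION_N."""
    if mode_of_action_n < 1:
        raise ValueError("MODE_OF_ACTION_N must be at least 1")
    if mode_of_action_n == 1:
        # A single mode-of-action dimension uses the plain `score` column.
        return ["score"]
    return [f"score.{i}" for i in range(1, mode_of_action_n + 1)]


def missing_score_columns(header: list[str], mode_of_action_n: int) -> list[str]:
    """List the expected score columns that are absent from a CSV header."""
    return [col for col in expected_score_columns(mode_of_action_n) if col not in header]


print(expected_score_columns(3))
print(missing_score_columns(["uniprotID", "aaChg", "score.1", "score.2"], 3))
```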

	3. Run your config file
	```
	conda activate PreMode
	bash scripts/run.new.task.sh PRETRAIN_MODEL_NAME YOUR_TASK_NAME OUTPUT_FOLDER GPU_ID
	```
	This should take ~30 min on an NVIDIA A40 GPU, depending on the size of your dataset.

	4. You'll get a file in `OUTPUT_FOLDER` named `YOUR_TASK_NAME.inference.result.csv`.
	+ If your `TASK_TYPE` is `GLOF`, the column `logits` holds the inference results. Closer to 0 means GoF, closer to 1 means LoF.
	+ If your `TASK_TYPE` is `DMS` and `MODE_OF_ACTION_N` is 1, the column `logits` holds the inference results. If `MODE_OF_ACTION_N` is larger than 1, you will get multiple `logits.*` columns, each representing one predicted DMS measurement.
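For a `GLOF` task, the `logits` column can be turned into labels as sketched below. The 0.5 cutoff is an assumption chosen as the midpoint between the two ends of the scale, not a threshold recommended by the authors; the function names, the `aaChg` column in the toy input, and the values are illustrative only.

```python
import csv
import io


def label_glof(logit: float, cutoff: float = 0.5) -> str:
    """Map a GLOF logit to a label: below `cutoff` leans GoF, otherwise LoF."""
    # Assumption: 0.5 is only the midpoint of the 0 (GoF) to 1 (LoF) scale.
    return "GoF" if logit < cutoff else "LoF"


def label_results(csv_text: str, cutoff: float = 0.5) -> list[str]:
    """Label every row of an inference result given as CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [label_glof(float(row["logits"]), cutoff) for row in reader]


# Toy input with illustrative values, not real model output.
example = "aaChg,logits\np.V600E,0.08\np.G446V,0.93\n"
print(label_results(example))
```

Predictions near the cutoff are ambiguous, so it may be more informative to keep the raw `logits` values alongside any label.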


	# Models & Figures in our manuscript
	## Pretrained Models
	Here is the list of models in our manuscript:

	`scripts/PreMode/` PreMode. It takes 250 GB of RAM and 4 NVIDIA A40 GPUs to run, and finishes in ~50 h.

	`scripts/ESM.SLR/` Baseline Model, ESM2 (650M) + Single Layer Perceptron

	`scripts/PreMode.large.window/` PreMode, window size set to 1251 AA.

	`scripts/PreMode.noESM/` PreMode, with the ESM2 embeddings replaced by one-hot encodings of the 20 AAs.

	`scripts/PreMode.noMSA/` PreMode, remove the MSA input.

	`scripts/PreMode.noPretrain/` PreMode, without pretraining on ClinVar/HGMD.

	`scripts/PreMode.noStructure/` PreMode, remove the AF2 predicted structure input.

	`scripts/PreMode.ptm/` PreMode, with a one-hot encoding of post-translational modification sites added as input.

	`scripts/PreMode.mean.var/` PreMode, which outputs both a predicted value (mean) and a confidence (var); used in adaptive learning tasks.

	## Predicted mode-of-action
	| gene | file |
	| :-: | :-: |
	| BRAF | `analysis/5genes.all.mut/PreMode/P15056.logits.csv` |
	| RET | `analysis/5genes.all.mut/PreMode/P07949.logits.csv` |
	| TP53 | `analysis/5genes.all.mut/PreMode/P04637.logits.csv` |
	| KCNJ11 | `analysis/5genes.all.mut/PreMode/Q14654.logits.csv` |
	| CACNA1A | `analysis/5genes.all.mut/PreMode/O00555.logits.csv` |
	| SCN5A | `analysis/5genes.all.mut/PreMode/Q14524.logits.csv` |
	| SCN2A | `analysis/5genes.all.mut/PreMode/Q99250.logits.csv` |
	| ABCC8 | `analysis/5genes.all.mut/PreMode/Q09428.logits.csv` |
	| PTEN | `analysis/5genes.all.mut/PreMode/P60484.logits.csv` |

	For each file, column `logits.0` is the predicted pathogenicity, `logits.1` is the predicted LoF probability, and `logits.2` is the predicted GoF probability.
	For PTEN, column `logits.1` is the predicted stability and `logits.2` is the predicted enzyme activity; for both, 0 means loss and 1 means neutral.
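Because the meaning of `logits.1` and `logits.2` differs for PTEN, a small lookup like the one below can help keep the columns straight when reading these files. It is a sketch only: the function name `column_meanings` is hypothetical, and the mapping simply restates the description above.

```python
PTEN_UNIPROT_ID = "P60484"  # PTEN, as listed in the table above


def column_meanings(uniprot_id: str) -> dict[str, str]:
    """Return what each `logits.*` column means for the given protein's file."""
    if uniprot_id == PTEN_UNIPROT_ID:
        return {
            "logits.0": "pathogenicity",
            "logits.1": "stability (0 = loss, 1 = neutral)",
            "logits.2": "enzyme activity (0 = loss, 1 = neutral)",
        }
    return {
        "logits.0": "pathogenicity",
        "logits.1": "LoF probability",
        "logits.2": "GoF probability",
    }


for column, meaning in column_meanings("P15056").items():
    print(f"{column}: {meaning}")
```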

	## Figures
	Please go to the `analysis/` folder and run the corresponding R scripts.