## Introduction
<p align="center">
<br>
<img src="assets/FAPM.png"/>
<br>
</p>

Huggingface repo: *https://huggingface.co/wenkai/FAPM/*

## Installation

1. (Optional) Create a conda environment

```bash
conda create -n lavis python=3.8
conda activate lavis
```

2. For development, you may build from source

```bash
git clone https://github.com/xiangwenkai/FAPM.git
cd FAPM
pip install -e .

pip install Biopython
pip install fair-esm
```
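
To confirm the main dependencies are importable after installation, you can run a quick check (a minimal sketch; `Bio` comes from Biopython and `esm` from fair-esm):

```python
# Post-install sanity check (sketch): verify the key packages import correctly.
import torch  # installed with the FAPM/LAVIS package
import Bio    # from `pip install Biopython`
import esm    # from `pip install fair-esm`

print("torch", torch.__version__)
print("biopython", Bio.__version__)
print("fair-esm import OK")
```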

### Datasets
#### 1. Raw dataset
Raw data are available at *https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2023_04/knowledgebase/*. This file is very large and needs to be processed to extract each protein's name, sequence, GO labels, function description, and prompt.
The domain-level protein dataset we used is available at *https://ftp.ebi.ac.uk/pub/databases/interpro/releases/95.0/protein2ipr.dat.gz*.
In this repository, we provide the experimental train/val/test sets of Swiss-Prot, which are available at data/swissprot_exp.
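
The exact preprocessing script is not shown here; a minimal parsing sketch with Biopython (installed above) illustrates how the name, sequence, GO labels, and function description can be pulled from the Swiss-Prot flat file. The file name and the fields kept are assumptions, not the repository's actual pipeline:

```python
# Illustrative sketch only: parse the (decompressed) Swiss-Prot flat file with Biopython
# and collect the fields mentioned above. File name and field choices are assumptions.
from Bio import SwissProt

records = []
with open("uniprot_sprot.dat") as handle:  # assumed name of the decompressed release file
    for rec in SwissProt.parse(handle):
        go_labels = [xref[1] for xref in rec.cross_references if xref[0] == "GO"]
        function = next((c for c in rec.comments if c.startswith("FUNCTION:")), "")
        records.append({
            "name": rec.entry_name,
            "sequence": rec.sequence,
            "go_labels": go_labels,
            "description": function,
        })

print(len(records), "proteins parsed")
```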
#### 2. ESM2 embeddings
Source code for ESM2 embedding generation: *https://github.com/facebookresearch/esm*
The generation command:
```bash
conda activate FAPM
python esm_scripts/extract.py esm2_t36_3B_UR50D your_path/protein.fasta your_path_to_save_embedding_files --repr_layers 36 --truncation_seq_length 1024 --include per_tok
```
Example:
```bash
conda activate FAPM
python esm_scripts/extract.py esm2_t36_3B_UR50D data/fasta/example.fasta data/emb_esm2_3b --repr_layers 36 --truncation_seq_length 1024 --include per_tok
```

The default path to save embedding files is **data/emb_esm2_3b**.
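
To sanity-check a generated embedding you can load one of the saved files; the dictionary layout below assumes fair-esm's usual `extract.py` output (a `label` plus per-layer `representations`):

```python
# Sketch: inspect one generated ESM2 embedding file
# (layout assumed from fair-esm's extract.py; layer 36 matches --repr_layers 36 above).
import torch

emb = torch.load("data/emb_esm2_3b/P18281.pt")
per_tok = emb["representations"][36]   # per-token embeddings from layer 36
print(emb["label"], per_tok.shape)     # (sequence_length, 2560) for esm2_t36_3B_UR50D
```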

## Pretrained language model
Source: *https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B*

## Training
- Data config: lavis/configs/datasets/protein/GO_defaults_cap.yaml
- Stage 1 config: lavis/projects/blip2/train/protein_pretrain_stage1.yaml
- Stage 1 training command: run_scripts/blip2/train/protein_pretrain_domain_stage1.sh
- Stage 2 config: lavis/projects/blip2/train/protein_pretrain_stage2.yaml
- Stage 2 training/finetuning command: run_scripts/blip2/train/protein_pretrain_domain_stage2.sh

## Trained models
The models are available at **https://huggingface.co/wenkai/FAPM/tree/main/model**.
You can also download our trained models from Google Drive: *https://drive.google.com/drive/folders/1aA0eSYxNw3DvrU5GU1Cu-4q2kIxxAGSE?usp=drive_link*
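
If you prefer to fetch the checkpoints programmatically, a hedged sketch using `huggingface_hub` (not part of this repository's own scripts) is:

```python
# Sketch: download only the model/ folder from the Hugging Face repo.
# huggingface_hub is an assumed extra dependency (`pip install huggingface_hub`).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="wenkai/FAPM", allow_patterns=["model/*"])
print("checkpoints downloaded to", local_dir)
```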

## Testing
- Config: lavis/projects/blip2/eval/caption_protein_eval.yaml
- Command: run_scripts/blip2/eval/eval_cap_protein.sh

## Inference example
```bash
python FAPM_inference.py \
    --model_path model/checkpoint_mf2.pth \
    --example_path data/emb_esm2_3b/P18281.pt \
    --device cuda \
    --prompt Acanthamoeba \
    --prop True
```