	---
	tags:
	- Causal Language Modeling
	- GPT2
	- ESM2
	- Proteins
	- GNN
	library_name: transformers
	pipeline_tag: text-generation
	language:
	- en
	license: cc-by-nc-4.0
	datasets:
	- habdine/Prot2Text-Data
	metrics:
	- bertscore
	- bleu
	- rouge
	---
	# Prot2Text Model Card

	![](Prot2Text.drawio.png)

	## Model Information

	Model Page: [Prot2Text](http://nlp.polytechnique.fr/prot2text#proteins) <br>
	Paper: [https://arxiv.org/abs/2307.14367](https://arxiv.org/abs/2307.14367) <br>
	Github: [https://github.com/hadi-abdine/Prot2Text](https://github.com/hadi-abdine/Prot2Text) <br>
	Authors: Hadi Abdine<sup>(1)</sup>, Michail Chatzianastasis<sup>(1)</sup>, Costas Bouyioukos<sup>(2, 3)</sup>, Michalis Vazirgiannis<sup>(1)</sup><br>
	<sup>(1)</sup>DaSciM, LIX, École Polytechnique, Institut Polytechnique de Paris, France.<br>
	<sup>(2)</sup>Epigenetics and Cell Fate, CNRS UMR7216, Université Paris Cité, Paris, France.<br>
	<sup>(3)</sup>Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus.<br>

	The Prot2Text paper was published at AAAI 2024. Preliminary versions of the paper were accepted as spotlights at [DGM4H@NeurIPS 2023](https://sites.google.com/ethz.ch/dgm4h-neurips2023/home?authuser=0) and [AI4Science@NeurIPS 2023](https://ai4sciencecommunity.github.io/neurips23.html).

	```
	@inproceedings{abdine2024prot2text,
	title={Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers},
	author={Abdine, Hadi and Chatzianastasis, Michail and Bouyioukos, Costas and Vazirgiannis, Michalis},
	booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
	volume={38},
	pages={10757--10765},
	year={2024}
	}
	```

	### Description

	Prot2Text is a family of models that predict a protein's function as free text, moving beyond conventional binary or categorical classification. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, Prot2Text integrates diverse data types, including protein sequence, structure, and textual annotation and description. This multimodal approach gives a holistic representation of a protein's function and enables the generation of detailed functional descriptions.

	Prot2Text is trained on a [multimodal dataset](https://huggingface.co/datasets/habdine/Prot2Text-Data) of 256,690 proteins. For each protein, the dataset provides three pieces of information: the corresponding sequence, the AlphaFold accession ID, and the textual description. To build this dataset, we used SwissProt, the only curated protein knowledge base with full textual descriptions of proteins included in UniProtKB (Consortium, 2016), Release 2022_04.

	### Models and Results


	| Model | #params | BLEU Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BERT Score | Link |
	|:--------------------------:|:--------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:------------:|
	| Prot2Text<sub>SMALL</sub> | 256M | 30.01 | 45.78 | 38.08 | 43.97 | 82.60 | [v1.0](https://huggingface.co/habdine/Prot2Text-Small-v1-0) - [v1.1](https://huggingface.co/habdine/Prot2Text-Small-v1-1) |
	| Prot2Text<sub>BASE</sub> | 283M | 35.11 | 50.59 | 42.71 | 48.49 | 84.30 | [v1.0](https://huggingface.co/habdine/Prot2Text-Base-v1-0) - [v1.1](https://huggingface.co/habdine/Prot2Text-Base-v1-1) |
	| Prot2Text<sub>MEDIUM</sub> | 398M | 36.51 | 52.13 | 44.17 | 50.04 | 84.83 | [v1.0](https://huggingface.co/habdine/Prot2Text-Medium-v1-0) - [v1.1](https://huggingface.co/habdine/Prot2Text-Medium-v1-1) |
	| Prot2Text<sub>LARGE</sub> | 898M | 36.29 | 53.68 | 45.60 | 51.40 | 85.20 | [v1.0](https://huggingface.co/habdine/Prot2Text-Large-v1-0) - [v1.1](https://huggingface.co/habdine/Prot2Text-Large-v1-1) |
	| Esm2Text<sub>BASE</sub> | 225M | 32.11 | 47.46 | 39.18 | 45.31 | 83.21 | [v1.0](https://huggingface.co/habdine/Esm2Text-Base-v1-0) - [v1.1](https://huggingface.co/habdine/Esm2Text-Base-v1-1) |

	The reported results were computed with the v1.0 models.
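To see at a glance which model leads on each metric, the scores in the table can be compared programmatically. The sketch below simply copies the reported numbers and picks the top model per column. It is illustrative arithmetic only, and the names `RESULTS`, `METRICS` and `best_per_metric` are made up for this example.

```python
# Scores copied from the results table (v1.0 models), in METRICS order.
METRICS = ["BLEU", "ROUGE-1", "ROUGE-2", "ROUGE-L", "BERT Score"]
RESULTS = {
    "Prot2Text-Small": [30.01, 45.78, 38.08, 43.97, 82.60],
    "Prot2Text-Base": [35.11, 50.59, 42.71, 48.49, 84.30],
    "Prot2Text-Medium": [36.51, 52.13, 44.17, 50.04, 84.83],
    "Prot2Text-Large": [36.29, 53.68, 45.60, 51.40, 85.20],
    "Esm2Text-Base": [32.11, 47.46, 39.18, 45.31, 83.21],
}


def best_per_metric(results=RESULTS, metrics=METRICS):
    """Return a dict mapping each metric name to the highest-scoring model."""
    best = {}
    for column, metric in enumerate(metrics):
        # max over model names, keyed by that model's score in this column
        best[metric] = max(results, key=lambda model: results[model][column])
    return best


if __name__ == "__main__":
    for metric, model in best_per_metric().items():
        print(f"{metric}: {model}")
```

On the reported numbers, Prot2Text-Large leads on all three ROUGE variants and on BERT Score, while Prot2Text-Medium has the highest BLEU score.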

	### Usage

	Below are some code snippets to get started quickly with the models. First, install the Transformers library, graphein, DSSP, torch and torch geometric with:
	```sh
	pip install -U transformers
	git clone https://github.com/a-r-j/graphein.git
	pip install -e graphein/
	pip install torch
	pip install torch_geometric
	pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv
	sudo apt-get install dssp
	sudo ln -s /usr/bin/mkdssp /usr/bin/dssp
	```
	You might need to install different versions or variants of these packages depending on your environment.
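Because the right package variants depend on the environment, it can help to confirm that everything is visible from Python before loading a model. The following is a minimal sketch and not part of the Prot2Text repository: the helper name `check_environment` is hypothetical, and the import names (`transformers`, `graphein`, `torch`, `torch_geometric`) and executable names (`dssp`, `mkdssp`) are assumptions taken from the install commands above.

```python
import importlib.util
import shutil

# Import names assumed from the pip commands above.
REQUIRED_MODULES = ["transformers", "graphein", "torch", "torch_geometric"]
# Executable names assumed from the apt-get / ln -s commands above.
DSSP_BINARIES = ["dssp", "mkdssp"]


def check_environment(modules=REQUIRED_MODULES, binaries=DSSP_BINARIES):
    """Report which modules are importable and which binaries are on PATH."""
    report = {}
    for name in modules:
        # find_spec locates a top-level module without importing it.
        report[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        report[name] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

The check only tells you whether a package or binary can be found, not whether its version is compatible with the others.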

	Then, copy the snippet from the section that is relevant to your use case.

	#### Running Prot2Text to generate a protein's function using both its structure and sequence
	To generate a protein's function using both its structure and amino-acid sequence, load one of the Prot2Text models and provide the AlphaFold database ID of the protein.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained(
	    'habdine/Prot2Text-Base-v1-1', trust_remote_code=True
	)
	model = AutoModelForCausalLM.from_pretrained(
	    'habdine/Prot2Text-Base-v1-1', trust_remote_code=True
	)

	function = model.generate_protein_description(
	    protein_pdbID='Q10MK9',
	    tokenizer=tokenizer,
	    device='cuda',  # replace with 'mps' to run on a Mac device
	)

	print(function)
	# 'Carboxylate--CoA ligase that may use 4-coumarate as substrate. Follows a two-step reaction mechanism, wherein the carboxylate substrate first undergoes adenylation by ATP, followed by a thioesterification in the presence of CoA to yield the final CoA thioester.'
	```
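The ID passed as `protein_pdbID` in the example above, `Q10MK9`, has the form of a UniProt accession, which is how AlphaFold database entries are keyed. A cheap format check before calling the model can catch typos early. The sketch below uses the accession pattern documented by UniProt. The helper name `is_uniprot_accession` is hypothetical and not part of Prot2Text, and the check covers the format only, not whether an AlphaFold entry actually exists for that accession.

```python
import re

# Accession format as documented by UniProt (6 or 10 characters).
_UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(candidate: str) -> bool:
    """Return True if the whole string is a well-formed UniProt accession."""
    return _UNIPROT_ACCESSION.fullmatch(candidate) is not None


if __name__ == "__main__":
    for candidate in ["Q10MK9", "q10mk9", "Q10MK"]:
        print(candidate, is_uniprot_accession(candidate))
```

The pattern is case-sensitive, so lower-case input such as `q10mk9` is rejected. Normalize with `.upper()` first if your IDs come from user input.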
	<br>

	#### Running Esm2Text to generate a protein's function using only its sequence
	To generate a protein's function using only its amino-acid sequence, load the Esm2Text-Base model and pass an amino-acid sequence.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained(
	    'habdine/Esm2Text-Base-v1-1', trust_remote_code=True
	)
	model = AutoModelForCausalLM.from_pretrained(
	    'habdine/Esm2Text-Base-v1-1', trust_remote_code=True
	)

	function = model.generate_protein_description(
	    protein_sequence='AEQAERYEEMVEFMEKL',
	    tokenizer=tokenizer,
	    device='cuda',  # replace with 'mps' to run on a Mac device
	)

	print(function)
	# 'A cytochrome b6-f complex catalyzes the calcium-dependent hydrolysis of the 2-acyl groups in 3-sn-phosphoglycerides. Its physiological function is not known.'
	```
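Sequences copied from FASTA files or spreadsheets often carry line breaks, spaces or lower-case letters. The sketch below normalizes such input before it is passed as `protein_sequence`. It is an assumption-laden helper, not part of Prot2Text: the name `clean_sequence` is hypothetical, and because this card does not say how Esm2Text handles ambiguous or rare residue codes (such as X, B, Z, U or O), the sketch conservatively accepts only the 20 standard amino-acid letters.

```python
# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Strip whitespace, upper-case, and reject non-standard residue letters."""
    # str.split() with no arguments drops spaces, tabs and newlines.
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("empty sequence")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"non-standard residue letters: {', '.join(invalid)}")
    return sequence


if __name__ == "__main__":
    print(clean_sequence("aeqaeryee\nMVEFMEKL"))
```

The returned string can then be passed as the `protein_sequence` argument shown in the snippet above.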
	<br>

	## Notice
	THE INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD BE EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY OF ANY KIND, WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT USE OF THE INFORMATION SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD PARTY. THE INFORMATION IS NOT INTENDED TO BE A SUBSTITUTE FOR PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR TREATMENT, AND DOES NOT CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE.