---
tags:
- Causal Language Modeling
- GPT2
- ESM2
- Proteins
- GNN
library_name: transformers
pipeline_tag: text-generation
language:
- en
license: cc-by-nc-4.0
datasets:
- habdine/Prot2Text-Data
metrics:
- bertscore
- bleu
- rouge
---
# Prot2Text Model Card
![](Prot2Text.drawio.png)
## Model Information
**Model Page:** [Prot2Text](http://nlp.polytechnique.fr/prot2text#proteins) <br>
**Paper:** [https://arxiv.org/abs/2307.14367](https://arxiv.org/abs/2307.14367) <br>
**Github:** [https://github.com/hadi-abdine/Prot2Text](https://github.com/hadi-abdine/Prot2Text) <br>
**Authors:** Hadi Abdine<sup>(1)</sup>, Michail Chatzianastasis<sup>(1)</sup>, Costas Bouyioukos<sup>(2, 3)</sup>, Michalis Vazirgiannis<sup>(1)</sup><br>
<sup>**(1)**</sup>DaSciM, LIX, École Polytechnique, Institut Polytechnique de Paris, France.<br>
<sup>**(2)**</sup>Epigenetics and Cell Fate, CNRS UMR7216, Université Paris Cité, Paris, France.<br>
<sup>**(3)**</sup>Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus.<br>
The **Prot2Text** paper was published at **AAAI 2024**. Preliminary versions of the paper were accepted as spotlights at [DGM4H@NeurIPS 2023](https://sites.google.com/ethz.ch/dgm4h-neurips2023/home?authuser=0) and [AI4Science@NeurIPS 2023](https://ai4sciencecommunity.github.io/neurips23.html).
```
@inproceedings{abdine2024prot2text,
title={Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers},
author={Abdine, Hadi and Chatzianastasis, Michail and Bouyioukos, Costas and Vazirgiannis, Michalis},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
pages={10757--10765},
year={2024}
}
```
### Description
Prot2Text is a family of models that predict a protein's function as free text, moving beyond conventional binary or categorical classification. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, Prot2Text integrates diverse data types, including protein sequence, structure, and textual annotations and descriptions. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions.
Prot2Text is trained on a [multimodal dataset](https://huggingface.co/datasets/habdine/Prot2Text-Data) of 256,690 proteins. For each protein, we have three pieces of information: the corresponding sequence, the AlphaFold accession ID, and the textual description. To build this dataset, we used the SwissProt database, the only curated protein knowledge base with full textual descriptions of proteins, included in UniProtKB (Consortium, 2016), Release 2022_04.
### Models and Results
| Model | #params | BLEU Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BERT Score | Link |
|:--------------------------:|:--------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:------------:|
| Prot2Text<sub>SMALL</sub> | 256M | 30.01 | 45.78 | 38.08 | 43.97 | 82.60 | [v1.0](https://huggingface.co/habdine/Prot2Text-Small-v1-0)- [v1.1](https://huggingface.co/habdine/Prot2Text-Small-v1-1) |
| Prot2Text<sub>BASE</sub> | 283M | 35.11 | 50.59 | 42.71 | 48.49 | 84.30 | [v1.0](https://huggingface.co/habdine/Prot2Text-Base-v1-0)- [v1.1](https://huggingface.co/habdine/Prot2Text-Base-v1-1) |
| Prot2Text<sub>MEDIUM</sub>| 398M | 36.51 | 52.13 | 44.17 | 50.04 | 84.83 | [v1.0](https://huggingface.co/habdine/Prot2Text-Medium-v1-0)- [v1.1](https://huggingface.co/habdine/Prot2Text-Medium-v1-1) |
| Prot2Text<sub>LARGE</sub> | 898M | 36.29 | 53.68 | 45.60 | 51.40 | 85.20 | [v1.0](https://huggingface.co/habdine/Prot2Text-Large-v1-0)- [v1.1](https://huggingface.co/habdine/Prot2Text-Large-v1-1) |
| Esm2Text<sub>BASE</sub> | 225M | 32.11 | 47.46 | 39.18 | 45.31 | 83.21 | [v1.0](https://huggingface.co/habdine/Esm2Text-Base-v1-0)- [v1.1](https://huggingface.co/habdine/Esm2Text-Base-v1-1) |
The reported results are computed using the v1.0 models.
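For readers unfamiliar with the overlap metrics in the table, here is a minimal sketch of ROUGE-1 F1 computed on whitespace tokens. It is an illustration only and not the evaluation code behind the reported scores; the function name `rouge1_f1` and the lowercase, whitespace-based tokenization are assumptions made for this example. The function returns a value in [0, 1], whereas the table lists scores on a 0 to 100 scale.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated text and a reference text."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("binds ATP and magnesium", "binds ATP")
print(round(score, 3))  # 0.667
```

Here the candidate shares 2 of its 4 tokens with the reference (precision 0.5) and covers both reference tokens (recall 1.0), which gives an F1 of about 0.667.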
### Usage
The code snippets below show how to get started quickly with the model. First, install the Transformers library, Graphein, DSSP, PyTorch, and PyTorch Geometric:
```sh
pip install -U transformers
git clone https://github.com/a-r-j/graphein.git
pip install -e graphein/
pip install torch
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv
sudo apt-get install dssp
sudo ln -s /usr/bin/mkdssp /usr/bin/dssp
```
You might need to install different versions or variants depending on your environment.
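A quick way to confirm the installation is to check that the Python packages can be found and that a DSSP executable is on the `PATH`. The sketch below uses only the standard library; the helper `missing_dependencies` and the two lists are hypothetical names introduced for this example, and the module names are assumed to match the packages installed above.

```python
import importlib.util
import shutil

# Assumed import names for the packages installed above.
REQUIRED_MODULES = ["transformers", "graphein", "torch", "torch_geometric"]
# The install step symlinks mkdssp to dssp, so either name is accepted here.
DSSP_BINARIES = ["dssp", "mkdssp"]


def missing_dependencies(modules=REQUIRED_MODULES, binaries=DSSP_BINARIES):
    """Return (list of modules that cannot be found, True if no binary is on PATH)."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    binary_missing = not any(shutil.which(b) for b in binaries)
    return missing_modules, binary_missing


missing, dssp_missing = missing_dependencies()
print("Missing Python packages:", missing or "none")
print("DSSP executable found:", not dssp_missing)
```

This only checks that the packages and executable are discoverable; it does not verify that the installed versions are compatible with each other.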
Then, copy the snippet from the section that matches your use case.
#### Running Prot2Text to generate a protein's function using both its structure and sequence
To generate a protein's function from both its structure and its amino-acid sequence, load one of the Prot2Text models and pass the protein's AlphaFold database ID.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is needed because the model class is defined in the model repository.
tokenizer = AutoTokenizer.from_pretrained('habdine/Prot2Text-Base-v1-1',
                                          trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('habdine/Prot2Text-Base-v1-1',
                                             trust_remote_code=True)

# Generate the description from the protein's AlphaFold database ID.
function = model.generate_protein_description(protein_pdbID='Q10MK9',
                                              tokenizer=tokenizer,
                                              device='cuda'  # replace with 'mps' to run on a Mac device
                                              )
print(function)
# 'Carboxylate--CoA ligase that may use 4-coumarate as substrate. Follows a two-step reaction mechanism, wherein the carboxylate substrate first undergoes adenylation by ATP, followed by a thioesterification in the presence of CoA to yield the final CoA thioester.'
```
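Rather than hard-coding the device string, it can be picked at runtime. The following is a minimal sketch; `pick_device` is a hypothetical helper, and the `'cpu'` fallback is an assumption, since this card only shows `'cuda'` and `'mps'` being passed to `generate_protein_description`.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no accelerator can be queried.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"  # assumption: the model also runs on CPU


device = pick_device()
print(device)
```

The returned string can then be passed as `device=device` in the call above.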
<br>
#### Running Esm2Text to generate a protein's function using only its sequence
To generate a protein's function from its amino-acid sequence alone, load the Esm2Text-Base model and pass an amino-acid sequence.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is needed because the model class is defined in the model repository.
tokenizer = AutoTokenizer.from_pretrained('habdine/Esm2Text-Base-v1-1',
                                          trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('habdine/Esm2Text-Base-v1-1',
                                             trust_remote_code=True)

# Generate the description from the raw amino-acid sequence (one-letter codes).
function = model.generate_protein_description(protein_sequence='AEQAERYEEMVEFMEKL',
                                              tokenizer=tokenizer,
                                              device='cuda'  # replace with 'mps' to run on a Mac device
                                              )
print(function)
# 'A cytochrome b6-f complex catalyzes the calcium-dependent hydrolysis of the 2-acyl groups in 3-sn-phosphoglycerides. Its physiological function is not known.'
```
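Sequences copied from FASTA files or web pages often carry line breaks, spaces, or lowercase letters. The sketch below normalizes a sequence before it is passed as `protein_sequence`. The helper `clean_sequence` is hypothetical, and restricting input to the 20 standard one-letter codes is an assumption made for this example; the model's tokenizer may accept additional symbols.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Strip whitespace, uppercase, and reject non-standard residue codes."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    unexpected = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if unexpected:
        raise ValueError(f"non-standard residue codes: {unexpected}")
    return seq


print(clean_sequence("aeqaeryee\nMVEFMEKL"))  # AEQAERYEEMVEFMEKL
```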
<br>
## Notice
THE INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD BE EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY OF ANY KIND, WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT USE OF THE INFORMATION SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD PARTY. THE INFORMATION IS NOT INTENDED TO BE A SUBSTITUTE FOR PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR TREATMENT, AND DOES NOT CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE.