---
license: mit
metrics:
- accuracy
tags:
- biology
pipeline_tag: text-classification
---

# Model description

**PPPSL** (Prediction of Prokaryotic Protein Subcellular Localization) is a protein language model fine-tuned from the pretrained [**ESM2**](https://github.com/facebookresearch/esm) model [(***facebook/esm2_t36_3B_UR50D***)](https://huggingface.co/facebook/esm2_t36_3B_UR50D) on a prokaryotic protein subcellular localization dataset.

**PPPSL** achieved the following results:

- Train Loss: 0.0148
- Train Accuracy: 0.9923
- Validation Loss: 0.0718
- Validation Accuracy: 0.9893
- Epoch: 20

# The dataset for training **PPPSL**

The full dataset contains 11,970 protein sequences in six classes: Cell wall (87), Cytoplasmic (6,905), Cytoplasmic Membrane (2,567), Extracellular (1,085), Outer Membrane (758), and Periplasmic (568).

The class sizes are highly imbalanced (87 Cell wall sequences versus 6,905 Cytoplasmic sequences), which poses a significant challenge for classification.
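
To make the imbalance concrete, the following sketch computes each class's share of the dataset and an inverse-frequency weight, one common way to make rare classes count more in a loss function. It is an illustration only: it is not taken from the PPPSL training code, the helper name `inverse_frequency_weights` is hypothetical, and the dictionary keys simply reuse the label spellings from the inference example.

```python
# Class counts as listed in the dataset description.
class_counts = {
    "Cellwall": 87,
    "Cytoplasmic": 6905,
    "CYtoplasmicMembrane": 2567,
    "Extracellular": 1085,
    "OuterMembrane": 758,
    "Periplasmic": 568,
}


def inverse_frequency_weights(counts):
    """Return n_total / (n_classes * n_class) per class (hypothetical helper)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


total = sum(class_counts.values())
weights = inverse_frequency_weights(class_counts)

# Print the share of the dataset and the resulting weight for every class.
for label, n in class_counts.items():
    print(f"{label:<20} n={n:>5}  share={n / total:6.2%}  weight={weights[label]:.3f}")
```

Weights of this kind could be passed to a weighted cross-entropy loss. Whether PPPSL training used any such reweighting is not stated in this card.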

The dataset was downloaded from the [**DeepLocPro - 1.0**](https://services.healthtech.dtu.dk/services/DeepLocPro-1.0/) website.

# Model training code at GitHub

https://github.com/pengsihua2023/PPPSL-ESM2

# How to use **PPPSL**

### An example

The PyTorch and transformers libraries must be installed on your system.

### Install PyTorch

```
pip install torch torchvision torchaudio
```

### Install transformers

```
pip install transformers
```

### Run the following code

```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_name = "sihuapeng/PPPSL-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

# Sample protein sequence
protein_sequence = "MSKKVLITGGAGYIGSVLTPILLEKGYEVCVIDNLMFDQISLLSCFHNKNFTFINGDAMDENLIRQEVAKADIIIPLAALVGAPLCKRNPKLAKMINYEAVKMISDFASPSQIFIYPNTNSGYGIGEKDAMCTEESPLRPISEYGIDKVHAEQYLLDKGNCVTFRLATVFGISPRMRLDLLVNDFTYRAYRDKFIVLFEEHFRRNYIHVRDVVKGFIHGIENYDKMKGQAYNMGLSSANLTKRQLAETIKKYIPDFYIHSANIGEDPDKRDYLVSNTKLEATGWKPDNTLEDGIKELLRAFKMMKVNRFANFN"

# Encode the sequence as model input
inputs = tokenizer(protein_sequence, return_tensors="pt")

# Perform inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Get the prediction result: the index of the largest logit
logits = outputs.logits
predicted_class_id = torch.argmax(logits, dim=-1).item()

# Map the class index to its label
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic', 3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}
predicted_label = id2label[predicted_class_id]

# Output the predicted class
print("=" * 80)
print(f"Predicted class label: {predicted_label}")
print("=" * 80)
```
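
The example above reports only the top class. If a ranking over all six classes is useful, the logits can be passed through a softmax. The sketch below does this in plain Python so it runs without the model; `rank_classes` is a hypothetical helper and the logits are made-up values for illustration. In practice the input would be something like `outputs.logits[0].tolist()`.

```python
import math

# Label mapping copied from the inference example.
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic',
            3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}


def rank_classes(logit_row, id2label):
    """Softmax a list of logits and return (label, probability) pairs, best first."""
    peak = max(logit_row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logit_row]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], p) for i, p in ranked]


# Made-up logits, not real model output.
example_logits = [0.3, -1.2, 4.1, 0.0, -0.7, 0.5]
ranking = rank_classes(example_logits, id2label)

for label, prob in ranking:
    print(f"{label:<20} {prob:.4f}")
```

Softmax scores are not necessarily calibrated probabilities, so they are best read as a relative ranking rather than as confidence estimates.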

## Funding

This project was supported by CDC funding to Justin Bahl (BAA 75D301-21-R-71738).

### Model architecture, coding and implementation

Sihua Peng

## Group, Department and Institution

### Lab: [Justin Bahl](https://bahl-lab.github.io/)

### Department: [College of Veterinary Medicine, Department of Infectious Diseases](https://vet.uga.edu/education/academic-departments/infectious-diseases/)

### Institution: [The University of Georgia](https://www.uga.edu/)