Model description
PPPSL (Prediction of Prokaryotic Protein Subcellular Localization) is a protein language model fine-tuned from the pretrained ESM2 model (facebook/esm2_t36_3B_UR50D) on a prokaryotic protein subcellular localization dataset.
PPPSL achieved the following results:
Train Loss: 0.0148
Train Accuracy: 0.9923
Validation Loss: 0.0718
Validation Accuracy: 0.9893
Epoch: 20
The dataset for training PPPSL
The full dataset contains 11,970 protein sequences in six classes: Cell Wall (87), Cytoplasmic (6,905), Cytoplasmic Membrane (2,567), Extracellular (1,085), Outer Membrane (758), and Periplasmic (568). The class sizes are highly imbalanced (the largest class is about 79 times the size of the smallest), which poses a significant challenge for classification.
The dataset was downloaded from the DeepLocPro 1.0 website.
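To make the imbalance concrete, the sketch below uses the class counts reported above to compute two reference numbers: the accuracy of a trivial classifier that always predicts the largest class (roughly 58%), and "balanced" inverse-frequency class weights. The weighting formula is a common heuristic and is shown only as an illustration; it is an assumption, not a description of how PPPSL was actually trained. The variable names are hypothetical.

```python
# Class counts as reported for the PPPSL dataset.
# Keys follow the label spellings used in the inference example.
class_counts = {
    "Cellwall": 87,
    "Cytoplasmic": 6905,
    "CYtoplasmicMembrane": 2567,
    "Extracellular": 1085,
    "OuterMembrane": 758,
    "Periplasmic": 568,
}

total = sum(class_counts.values())
n_classes = len(class_counts)

# Accuracy of a trivial classifier that always predicts the largest class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_label] / total

# "Balanced" inverse-frequency weights: total / (n_classes * count).
# Assumption: a common heuristic, not necessarily what PPPSL used.
class_weights = {
    label: total / (n_classes * count)
    for label, count in class_counts.items()
}

print(f"Total sequences: {total}")
print(f"Majority-class baseline ({majority_label}): {majority_baseline:.4f}")
for label, weight in class_weights.items():
    print(f"{label:>20}: weight {weight:.3f}")
```

Rare classes such as Cellwall receive the largest weights. Weights of this kind could, for example, be passed to a weighted cross-entropy loss during fine-tuning. The baseline also gives context for the reported accuracy figures: overall accuracy alone can hide poor performance on the small classes.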
Model training code at GitHub
https://github.com/pengsihua2023/PPPSL-ESM2
How to use PPPSL
An example
The PyTorch and transformers libraries must be installed on your system.
Install PyTorch
pip install torch torchvision torchaudio
Install transformers
pip install transformers
Run the following code
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load the fine-tuned model and tokenizer
model_name = "sihuapeng/PPPSL-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Sample protein sequence
protein_sequence = "MSKKVLITGGAGYIGSVLTPILLEKGYEVCVIDNLMFDQISLLSCFHNKNFTFINGDAMDENLIRQEVAKADIIIPLAALVGAPLCKRNPKLAKMINYEAVKMISDFASPSQIFIYPNTNSGYGIGEKDAMCTEESPLRPISEYGIDKVHAEQYLLDKGNCVTFRLATVFGISPRMRLDLLVNDFTYRAYRDKFIVLFEEHFRRNYIHVRDVVKGFIHGIENYDKMKGQAYNMGLSSANLTKRQLAETIKKYIPDFYIHSANIGEDPDKRDYLVSNTKLEATGWKPDNTLEDGIKELLRAFKMMKVNRFANFN"
# Encode the sequence as model input
inputs = tokenizer(protein_sequence, return_tensors="pt")
# Perform inference using the model
with torch.no_grad():
    outputs = model(**inputs)
# Get the prediction result
logits = outputs.logits
predicted_class_id = torch.argmax(logits, dim=-1).item()
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic', 3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}
predicted_label = id2label[predicted_class_id]
# Output the predicted class
print("===========================================================================================================================================")
print(f"Predicted class label: {predicted_label}")
print("===========================================================================================================================================")
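The example above reports only the top-scoring class. As a minimal sketch, the logits can also be turned into a ranked list of per-class scores with a softmax. The code below uses only the Python standard library and a made-up logits vector so that it runs without downloading the model. The helper names `softmax` and `rank_predictions` and the values in `example_logits` are hypothetical. In practice the logits would come from `outputs.logits[0].tolist()`.

```python
import math

# Label mapping copied from the inference example.
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic',
            3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def rank_predictions(logits, id2label):
    """Return (label, probability) pairs, most likely first."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[idx], prob) for idx, prob in ranked]


# Hypothetical logits for a single sequence (illustration only).
example_logits = [0.3, -2.1, 4.2, -0.5, -1.7, 0.1]

for label, prob in rank_predictions(example_logits, id2label):
    print(f"{label:>20}: {prob:.4f}")
```

The softmax values are model scores rather than calibrated probabilities, so treat them as a relative ranking. A small gap between the top two classes suggests the prediction deserves a closer look.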
Funding
This project was supported by CDC funding to Justin Bahl (BAA 75D301-21-R-71738).
Model architecture, coding and implementation
Sihua Peng
Group, Department and Institution
Lab: Justin Bahl
Department: Department of Infectious Diseases, College of Veterinary Medicine
Institution: The University of Georgia