Model description

PPPSL (Prediction of Prokaryotic Protein Subcellular Localization) is a protein language model fine-tuned from the pretrained ESM2 model (facebook/esm2_t36_3B_UR50D) on a prokaryotic protein subcellular localization dataset.

PPPSL achieved the following results:
Train Loss: 0.0148
Train Accuracy: 0.9923
Validation Loss: 0.0718
Validation Accuracy: 0.9893
Epoch: 20

The dataset for training PPPSL

The full dataset contains 11,970 protein sequences in six categories: Cell Wall (87), Cytoplasmic (6,905), Cytoplasmic Membrane (2,567), Extracellular (1,085), Outer Membrane (758), and Periplasmic (568). The class sizes are highly imbalanced: the largest category (Cytoplasmic) has roughly 79 times as many sequences as the smallest (Cell Wall), which makes classification of the rare categories considerably harder.
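
To make the imbalance concrete, the sketch below computes each category's share of the dataset from the counts above, together with inverse-frequency class weights. The weighting scheme is a common way to counter imbalance and is shown here only as an illustration; it is an assumption and not necessarily what the PPPSL training code does.

```python
# Class counts as reported for the full dataset.
class_counts = {
    "Cell Wall": 87,
    "Cytoplasmic": 6905,
    "Cytoplasmic Membrane": 2567,
    "Extracellular": 1085,
    "Outer Membrane": 758,
    "Periplasmic": 568,
}

total = sum(class_counts.values())
num_classes = len(class_counts)

# Share of the dataset held by each class.
proportions = {name: count / total for name, count in class_counts.items()}

# Inverse-frequency ("balanced") weights: total / (num_classes * count).
# Illustrative assumption; not taken from the PPPSL training code.
class_weights = {
    name: total / (num_classes * count) for name, count in class_counts.items()
}

for name in class_counts:
    print(f"{name:22s} share={proportions[name]:.3f} weight={class_weights[name]:.2f}")
```

Weights of this kind could, for example, be passed to a weighted cross-entropy loss so that errors on rare categories such as Cell Wall count more heavily.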

The dataset was downloaded from the DeepLocPro 1.0 website.

Model training code at GitHub

https://github.com/pengsihua2023/PPPSL-ESM2

How to use PPPSL

An example

The PyTorch and transformers libraries must be installed on your system.

Install PyTorch

pip install torch torchvision torchaudio

Install transformers

pip install transformers
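
After installing, it can be useful to confirm that both packages are visible to the Python interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and introduced just for this check.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the status of the two libraries the example depends on.
for package in ("torch", "transformers"):
    found = installed_version(package)
    print(f"{package}: {found if found else 'not installed'}")
```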

Run the following code

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_name = "sihuapeng/PPPSL-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample protein sequence
protein_sequence = "MSKKVLITGGAGYIGSVLTPILLEKGYEVCVIDNLMFDQISLLSCFHNKNFTFINGDAMDENLIRQEVAKADIIIPLAALVGAPLCKRNPKLAKMINYEAVKMISDFASPSQIFIYPNTNSGYGIGEKDAMCTEESPLRPISEYGIDKVHAEQYLLDKGNCVTFRLATVFGISPRMRLDLLVNDFTYRAYRDKFIVLFEEHFRRNYIHVRDVVKGFIHGIENYDKMKGQAYNMGLSSANLTKRQLAETIKKYIPDFYIHSANIGEDPDKRDYLVSNTKLEATGWKPDNTLEDGIKELLRAFKMMKVNRFANFN"

# Encode the sequence as model input
inputs = tokenizer(protein_sequence, return_tensors="pt")

# Perform inference using the model
with torch.no_grad():
    outputs = model(**inputs)

# Get the prediction result
logits = outputs.logits
predicted_class_id = torch.argmax(logits, dim=-1).item()
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic', 3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}
predicted_label = id2label[predicted_class_id]

# Output the predicted class between two separator lines
separator = "=" * 80
print(separator)
print(f"Predicted class label: {predicted_label}")
print(separator)
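
The argmax step above returns only the top class. To see how confident the prediction is, the logits can be passed through a softmax and the classes ranked by probability. The sketch below does this in plain Python on a list of six logits; the logit values are made up for illustration, and the helper name `rank_labels` is hypothetical. In practice the list would come from `outputs.logits[0].tolist()`.

```python
import math

# Same label mapping as in the inference example.
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic',
            3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}


def rank_labels(logits):
    """Softmax the logits and return (label, probability) pairs, most likely first."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    pairs = [(id2label[i], p) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up logits for illustration; in practice use outputs.logits[0].tolist().
example_logits = [0.3, -2.1, 4.7, -0.5, -1.8, 0.9]
ranking = rank_labels(example_logits)
for label, prob in ranking:
    print(f"{label:20s} {prob:.4f}")
```

The same probabilities can be obtained directly in PyTorch with `torch.softmax(logits, dim=-1)`. A low top probability may indicate a sequence worth checking manually, particularly for the rare categories.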


Funding

Model architecture, coding and implementation

Group, Department and Institution

Lab: Justin Bahl

Department: Department of Infectious Diseases, College of Veterinary Medicine

Institution: The University of Georgia