Model description

PPPSL (Prediction of Prokaryotic Protein Subcellular Localization) is a protein language model fine-tuned from the pretrained ESM2 model (facebook/esm2_t36_3B_UR50D) on a prokaryotic protein subcellular localization dataset.

PPPSL achieved the following results:
Train Loss: 0.0148
Train Accuracy: 0.9923
Validation Loss: 0.0718
Validation Accuracy: 0.9893
Epoch: 20

The dataset for training PPPSL

The full dataset contains 11,970 protein sequences in six categories: Cell wall (87), Cytoplasmic (6,905), Cytoplasmic Membrane (2,567), Extracellular (1,085), Outer Membrane (758), and Periplasmic (568). The class sizes are highly imbalanced (the largest class has roughly 79 times as many sequences as the smallest), which makes classification more challenging.
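
To make the imbalance concrete, the short sketch below computes each class's share of the dataset, the accuracy a trivial majority-class predictor would reach, and inverse-frequency class weights. The weighting scheme is an illustrative assumption (a common "balanced" heuristic), not something the source states was used to train PPPSL; the names class_counts and balanced_weights are hypothetical.

```python
# Class counts as reported for the full dataset above.
class_counts = {
    "Cell wall": 87,
    "Cytoplasmic": 6905,
    "Cytoplasmic Membrane": 2567,
    "Extracellular": 1085,
    "Outer Membrane": 758,
    "Periplasmic": 568,
}

total = sum(class_counts.values())

# Share of each class in the dataset.
proportions = {name: count / total for name, count in class_counts.items()}

# Accuracy obtained by always predicting the largest class.
majority_baseline = max(class_counts.values()) / total

# Assumption: "balanced" inverse-frequency weights, total / (n_classes * count).
# The source does not say whether any class weighting was used in training.
n_classes = len(class_counts)
balanced_weights = {
    name: total / (n_classes * count) for name, count in class_counts.items()
}

for name in class_counts:
    print(f"{name:22s} {proportions[name]:6.2%}  weight={balanced_weights[name]:.2f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

With these counts the majority class (Cytoplasmic) covers about 58% of the sequences, so overall accuracy alone can hide weak performance on rare classes such as Cell wall.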

The dataset was downloaded from the DeepLocPro 1.0 website.

Model training code at GitHub

https://github.com/pengsihua2023/PPPSL-ESM2

How to use PPPSL

An example

The PyTorch and transformers libraries must be installed on your system.

Install PyTorch

pip install torch torchvision torchaudio

Install transformers

pip install transformers
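
Before loading the model, it can help to confirm that both packages are visible to the Python interpreter you intend to use. The following is a minimal sketch using only the standard library; missing_packages is a hypothetical helper name, not part of either library.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Check the two libraries required by the inference example.
missing = missing_packages(["torch", "transformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("torch and transformers are both importable.")
```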

Run the following code

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_name = "sihuapeng/PPPSL-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample protein sequence
protein_sequence = "MSKKVLITGGAGYIGSVLTPILLEKGYEVCVIDNLMFDQISLLSCFHNKNFTFINGDAMDENLIRQEVAKADIIIPLAALVGAPLCKRNPKLAKMINYEAVKMISDFASPSQIFIYPNTNSGYGIGEKDAMCTEESPLRPISEYGIDKVHAEQYLLDKGNCVTFRLATVFGISPRMRLDLLVNDFTYRAYRDKFIVLFEEHFRRNYIHVRDVVKGFIHGIENYDKMKGQAYNMGLSSANLTKRQLAETIKKYIPDFYIHSANIGEDPDKRDYLVSNTKLEATGWKPDNTLEDGIKELLRAFKMMKVNRFANFN"

# Encode the sequence as model input
inputs = tokenizer(protein_sequence, return_tensors="pt")

# Perform inference using the model
with torch.no_grad():
    outputs = model(**inputs)

# Get the prediction result
logits = outputs.logits
predicted_class_id = torch.argmax(logits, dim=-1).item()
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic', 3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}
predicted_label = id2label[predicted_class_id]

# Output the predicted class between two separator lines
separator = "=" * 80
print(separator)
print(f"Predicted class Label: {predicted_label}")
print(separator)
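
The example above reports only the top class. If a confidence estimate is also useful, the logits can be converted to probabilities with a softmax. The following is a minimal sketch in plain Python: rank_predictions is a hypothetical helper, and the logit values are made up for illustration. In practice you would pass outputs.logits[0].tolist() from the example above.

```python
import math

# Label mapping copied from the inference example above.
id2label = {0: 'CYtoplasmicMembrane', 1: 'Cellwall', 2: 'Cytoplasmic',
            3: 'Extracellular', 4: 'OuterMembrane', 5: 'Periplasmic'}


def rank_predictions(logits, id2label):
    """Return (label, probability) pairs sorted from most to least likely."""
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], p) for i, p in ranked]


# Made-up logits for illustration only; these are not real model output.
example_logits = [0.3, -2.1, 4.2, -0.5, -1.0, 0.1]
for label, prob in rank_predictions(example_logits, id2label):
    print(f"{label:20s} {prob:.4f}")
```

Softmax probabilities from a fine-tuned classifier are not necessarily well calibrated, so treat them as a relative ranking rather than a measured error rate.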

Funding

This project was funded by the CDC to Justin Bahl (BAA 75D301-21-R-71738).

Model architecture, coding and implementation

Sihua Peng

Group, Department and Institution

Lab: Justin Bahl

Department: Department of Infectious Diseases, College of Veterinary Medicine

Institution: The University of Georgia

