Model description

In biology, "targeting peptides" (also called targeting sequences, signal peptides, or signal sequences) are short amino acid sequences located at the N-terminus or C-terminus of a protein that direct it to a specific location within the cell, such as the mitochondria, chloroplasts and other plastids, or the endoplasmic reticulum. Targeting peptides play a crucial signaling role during protein synthesis, ensuring that a protein is correctly localized to its intended cellular destination.

TarPepSubLoc-ESM2 (TarPepSubLoc, Targeting Peptide Subcellular Localization) is a protein language model fine-tuned from the pretrained ESM2 model (facebook/esm2_t36_3B_UR50D) on a five-class targeting-peptide subcellular localization dataset.

TarPepSubLoc-ESM2 achieved the following results:

  • Train Loss: 0.0385
  • Train Accuracy: 0.9881
  • Validation Loss: 0.0566
  • Validation Accuracy: 0.9812
  • Epoch: 20
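
For context, the following is a minimal sketch of how such a fine-tune can be set up with the Transformers Trainer API. The dataset preparation and hyperparameters shown here are illustrative assumptions only; the actual training script is in the GitHub repository linked below.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Illustrative sketch only; see the GitHub repository below for the real script.
base_model = "facebook/esm2_t36_3B_UR50D"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model,
    num_labels=5,  # CH, MT, Other, SP, TH
)

training_args = TrainingArguments(
    output_dir="tarpepsubloc-esm2",
    num_train_epochs=20,            # matches the epoch count reported above
    per_device_train_batch_size=1,  # a 3B-parameter model needs very small batches
    evaluation_strategy="epoch",
    fp16=True,
)

# train_ds / eval_ds would be tokenized (sequence, label) datasets built from
# the TargetP-2.0 data; they are not defined in this sketch.
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=eval_ds,
#                   tokenizer=tokenizer)
# trainer.train()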

The dataset for training TarPepSubLoc-ESM2

The full dataset contains 13,005 protein sequences: SP (2,697), MT (499), CH (227), TH (45), and Other (9,537). The highly imbalanced sample sizes across these five categories pose a significant challenge for classification.

  • "SP" for signal peptide,
  • "MT" for mitochondrial transit peptide (mTP),
  • "CH" for chloroplast transit peptide (cTP),
  • "TH" for thylakoidal lumen composite transit peptide (lTP),
  • "Other" for no targeting peptide (in this case, the length is given as 0).

The dataset was downloaded from the TargetP-2.0 website.
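
One common way to mitigate such class imbalance is to weight the loss by inverse class frequency. The sketch below only illustrates that idea using the counts listed above; it is not necessarily the approach used to train TarPepSubLoc-ESM2.

import torch

# Class counts from the dataset description above.
counts = {"SP": 2697, "MT": 499, "CH": 227, "TH": 45, "Other": 9537}
total = sum(counts.values())  # 13,005 sequences

# Inverse-frequency weights, scaled so that they average to 1.
weights = torch.tensor([total / (len(counts) * n) for n in counts.values()])
print(dict(zip(counts, weights.tolist())))

# The weights could then be passed to torch.nn.CrossEntropyLoss(weight=weights).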

Model training code at GitHub

https://github.com/pengsihua2023/TarPepSubLoc-ESM2

How to use TarPepSubLoc-ESM2

An example

The PyTorch and Transformers libraries must be installed on your system.

Install PyTorch

pip install torch torchvision torchaudio

Install Transformers

pip install transformers
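
To verify that both libraries are importable, you can print their versions (the exact values will differ on your system):

import torch
import transformers

# Confirm the environment is set up; printed versions will vary.
print(torch.__version__, transformers.__version__)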

Run the following code

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer from Hugging Face
model_name = "sihuapeng/TarPepSubLoc-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Define the amino acid sequence
sequence = "MNSLLMITACLALVGTVWAKEGYLVNSYTGCKFECFKLGDNDYCLRECRQQYGKGSGGYCYAFGCWCTHLYEQAVVWPLPNKTCNGK"

# Tokenize the sequence
inputs = tokenizer(sequence, return_tensors="pt")

# Make the prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class_id = logits.argmax(dim=-1).item()

# Define the ID to Label mapping
id2label = {0: 'CH', 1: 'MT', 2: 'Other', 3: 'SP', 4: 'TH'}

# Get the predicted label
predicted_label = id2label[predicted_class_id]

print(f"The predicted class for the sequence is: {predicted_label}")
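
To inspect per-class probabilities rather than only the top label, the logits from the example above can be passed through a softmax:

# Convert the logits from the example above into per-class probabilities
probabilities = torch.softmax(logits, dim=-1).squeeze()
for class_id, label in id2label.items():
    print(f"{label}: {probabilities[class_id]:.4f}")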

Funding

This project was funded by the CDC through an award to Justin Bahl (BAA 75D301-21-R-71738).

Model architecture, coding and implementation

Sihua Peng

Group, Department and Institution

Lab: Justin Bahl

Department: Department of Infectious Diseases, College of Veterinary Medicine

Institution: The University of Georgia
