---
license: mit
metrics:
- accuracy
tags:
- biology
pipeline_tag: text-classification
---

# Model description

In biology, "targeting peptides" (also called "targeting signal peptides," "targeting sequences," "signal peptides," or "signal sequences") are short amino acid sequences located at the N-terminus or C-terminus of a protein that direct the protein to specific locations within the cell, such as the mitochondria, chloroplasts, plastids, and endoplasmic reticulum. Targeting peptides play a crucial signaling role during protein synthesis, ensuring that each protein is correctly localized to its intended cellular destination.

**TarPepSubLoc-ESM2** (TarPepSubLoc: Targeting Peptide Subcellular Localization) is a protein language model fine-tuned from the [**ESM2**](https://github.com/facebookresearch/esm) pretrained model [(***facebook/esm2_t36_3B_UR50D***)](https://huggingface.co/facebook/esm2_t36_3B_UR50D) on a targeting-peptide subcellular localization dataset with five classes.

**TarPepSubLoc-ESM2** achieved the following results:

- Train Loss: 0.0385
- Train Accuracy: 0.9881
- Validation Loss: 0.0566
- Validation Accuracy: 0.9812
- Epoch: 20

# The dataset for training **TarPepSubLoc-ESM2**

The full dataset contains 13,005 protein sequences: SP (2,697), MT (499), CH (227), TH (45), and Other (9,537). The highly imbalanced sample sizes across the five categories pose a significant challenge for classification (a class-weighting sketch addressing this imbalance appears after the usage example below).

- "SP" for signal peptide
- "MT" for mitochondrial transit peptide (mTP)
- "CH" for chloroplast transit peptide (cTP)
- "TH" for thylakoidal lumen composite transit peptide (lTP)
- "Other" for no targeting peptide (in this case, the peptide length is given as 0)

The dataset was downloaded from the [**TargetP - 2.0**](https://services.healthtech.dtu.dk/services/TargetP-2.0/) website.

# Model training code at GitHub

https://github.com/pengsihua2023/TarPepSubLoc-ESM2

# How to use **TarPepSubLoc-ESM2**

### An example

The PyTorch and Transformers libraries must be installed on your system.

### Install PyTorch
```
pip install torch torchvision torchaudio
```

### Install Transformers
```
pip install transformers
```

### Run the following code
```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer from Hugging Face
model_name = "sihuapeng/TarPepSubLoc-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Define the amino acid sequence
sequence = "MNSLLMITACLALVGTVWAKEGYLVNSYTGCKFECFKLGDNDYCLRECRQQYGKGSGGYCYAFGCWCTHLYEQAVVWPLPNKTCNGK"

# Tokenize the sequence
inputs = tokenizer(sequence, return_tensors="pt")

# Make the prediction
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_class_id = logits.argmax().item()

# Define the ID-to-label mapping
id2label = {0: 'CH', 1: 'MT', 2: 'Other', 3: 'SP', 4: 'TH'}

# Get the predicted label
predicted_label = id2label[predicted_class_id]
print(f"The predicted class for the sequence is: {predicted_label}")
```
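If per-class confidence scores are useful, the logits from the example above can be converted into probabilities with a softmax. The snippet below is a minimal sketch that reuses the `logits` and `id2label` objects already defined; applying `torch.softmax` over the class dimension is a generic PyTorch operation, not something specific to this model.

```
import torch

# Convert the logits from the example above into per-class probabilities
probabilities = torch.softmax(logits, dim=-1).squeeze()

# Report the probability assigned to each subcellular localization class
for class_id, label in id2label.items():
    print(f"{label}: {probabilities[class_id].item():.4f}")
```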
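The class imbalance described in the dataset section matters if you reproduce the fine-tuning. One common mitigation, sketched below, is to weight the loss by inverse class frequency; the counts come from the dataset section above, while the weighted `CrossEntropyLoss` is an illustrative assumption, not necessarily the setup used to train **TarPepSubLoc-ESM2**.

```
import torch
import torch.nn as nn

# Sample counts from the dataset section, in id2label order:
# {0: 'CH', 1: 'MT', 2: 'Other', 3: 'SP', 4: 'TH'}
counts = torch.tensor([227.0, 499.0, 9537.0, 2697.0, 45.0])

# Inverse-frequency weights: n_samples / (n_classes * count),
# so rare classes (e.g., TH) contribute more to the loss
weights = counts.sum() / (len(counts) * counts)

loss_fn = nn.CrossEntropyLoss(weight=weights)
```

During fine-tuning, this `loss_fn` would replace the default unweighted cross-entropy when computing the loss from the model's logits.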
## Funding

This project was funded by the CDC to Justin Bahl (BAA 75D301-21-R-71738).

### Model architecture, coding and implementation

Sihua Peng

## Group, Department and Institution

### Lab: [Justin Bahl](https://bahl-lab.github.io/)
### Department: [College of Veterinary Medicine, Department of Infectious Diseases](https://vet.uga.edu/education/academic-departments/infectious-diseases/)
### Institution: [The University of Georgia](https://www.uga.edu/)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c56e2d2d07296c7e35994f/2rlokZM1FBTxibqrM8ERs.png)