---
license: mit
metrics:
- accuracy
tags:
- biology
pipeline_tag: text-classification
---

# Model description

In biology, "targeting peptides" (also called "targeting signal peptides," "targeting sequences," "signal peptides," or "signal sequences") are short amino acid sequences located at the N-terminus or C-terminus of a protein that direct the protein to specific locations within the cell, such as the mitochondria, chloroplasts, plastids, and endoplasmic reticulum. Targeting peptides play a crucial signaling role during protein synthesis, ensuring that each protein is correctly localized to its intended cellular destination.

**TarPepSubLoc-ESM2** (TarPepSubLoc: Targeting Peptide Subcellular Localization) is a protein language model fine-tuned from the [**ESM2**](https://github.com/facebookresearch/esm) pretrained model [(***facebook/esm2_t36_3B_UR50D***)](https://huggingface.co/facebook/esm2_t36_3B_UR50D) on a targeting-peptide subcellular localization dataset with five classes.

**TarPepSubLoc-ESM2** achieved the following results:

- Train Loss: 0.0385
- Train Accuracy: 0.9881
- Validation Loss: 0.0566
- Validation Accuracy: 0.9812
- Epoch: 20

# The dataset for training **TarPepSubLoc-ESM2**

The full dataset contains 13,005 protein sequences: SP (2,697), MT (499), CH (227), TH (45), and Other (9,537). The highly imbalanced sample sizes across the five categories pose a significant challenge for classification (a class-weighting sketch addressing this imbalance appears after the usage example below).

- "SP" for signal peptide
- "MT" for mitochondrial transit peptide (mTP)
- "CH" for chloroplast transit peptide (cTP)
- "TH" for thylakoidal lumen composite transit peptide (lTP)
- "Other" for no targeting peptide (in this case, the peptide length is given as 0)

The dataset was downloaded from the [**TargetP - 2.0**](https://services.healthtech.dtu.dk/services/TargetP-2.0/) website.

# Model training code at GitHub

https://github.com/pengsihua2023/TarPepSubLoc-ESM2

# How to use **TarPepSubLoc-ESM2**

### An example

The PyTorch and Transformers libraries must be installed on your system.

### Install PyTorch
```
pip install torch torchvision torchaudio
```

### Install Transformers
```
pip install transformers
```

### Run the following code
```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer from Hugging Face
model_name = "sihuapeng/TarPepSubLoc-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Define the amino acid sequence
sequence = "MNSLLMITACLALVGTVWAKEGYLVNSYTGCKFECFKLGDNDYCLRECRQQYGKGSGGYCYAFGCWCTHLYEQAVVWPLPNKTCNGK"

# Tokenize the sequence
inputs = tokenizer(sequence, return_tensors="pt")

# Make the prediction
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_class_id = logits.argmax().item()

# Define the ID-to-label mapping
id2label = {0: 'CH', 1: 'MT', 2: 'Other', 3: 'SP', 4: 'TH'}

# Get the predicted label
predicted_label = id2label[predicted_class_id]
print(f"The predicted class for the sequence is: {predicted_label}")
```
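If per-class confidence scores are useful, the logits from the example above can be converted into probabilities with a softmax. The snippet below is a minimal sketch that reuses the `logits` and `id2label` objects already defined; applying `torch.softmax` over the class dimension is a generic PyTorch operation, not something specific to this model.

```
import torch

# Convert the logits from the example above into per-class probabilities
probabilities = torch.softmax(logits, dim=-1).squeeze()

# Report the probability assigned to each subcellular localization class
for class_id, label in id2label.items():
    print(f"{label}: {probabilities[class_id].item():.4f}")
```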
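The class imbalance described in the dataset section matters if you reproduce the fine-tuning. One common mitigation, sketched below, is to weight the loss by inverse class frequency; the counts come from the dataset section above, while the weighted `CrossEntropyLoss` is an illustrative assumption, not necessarily the setup used to train **TarPepSubLoc-ESM2**.

```
import torch
import torch.nn as nn

# Sample counts from the dataset section, in id2label order:
# {0: 'CH', 1: 'MT', 2: 'Other', 3: 'SP', 4: 'TH'}
counts = torch.tensor([227.0, 499.0, 9537.0, 2697.0, 45.0])

# Inverse-frequency weights: n_samples / (n_classes * count),
# so rare classes (e.g., TH) contribute more to the loss
weights = counts.sum() / (len(counts) * counts)

loss_fn = nn.CrossEntropyLoss(weight=weights)
```

During fine-tuning, this `loss_fn` would replace the default unweighted cross-entropy when computing the loss from the model's logits.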
## Funding

This project was funded by the CDC to Justin Bahl (BAA 75D301-21-R-71738).

### Model architecture, coding and implementation

Sihua Peng

## Group, Department and Institution

### Lab: [Justin Bahl](https://bahl-lab.github.io/)
### Department: [College of Veterinary Medicine, Department of Infectious Diseases](https://vet.uga.edu/education/academic-departments/infectious-diseases/)
### Institution: [The University of Georgia](https://www.uga.edu/)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c56e2d2d07296c7e35994f/2rlokZM1FBTxibqrM8ERs.png)