metadata
license: mit
language:
- en
library_name: transformers
tags:
- esm
- esm2
- protein language model
- biology
ESM-2 (esm2_t6_8M_UR50D
)
This is a fine-tuned version of ESM-2 for sequence classification that categorizes protein sequences into two classes, either "cystolic" or "membrane".
Training and Accuracy
The model is trained using this notebook and achieved an eval accuracy of 94.83163664839468 %.
Using the Model
To use try running:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Initialize the tokenizer and model
model_path_directory = "AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-localization"
tokenizer = AutoTokenizer.from_pretrained(model_path_directory)
model = AutoModelForSequenceClassification.from_pretrained(model_path_directory)
# Define a function to predict the category of a protein sequence
def predict_category(sequence):
# Tokenize the sequence and convert it to tensor format
inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=512, padding="max_length")
# Make prediction
with torch.no_grad():
logits = model(**inputs).logits
# Determine the category with the highest score
predicted_class = torch.argmax(logits, dim=1).item()
# Return the category: 0 for cytosolic, 1 for membrane
return "cytosolic" if predicted_class == 0 else "membrane"
# Example sequence
new_protein_sequence = "MTQRAGAAMLPSALLLLCVPGCLTVSGPSTVMGAVGESLSVQCRYEEKYKTFNKYWCRQPCLPIWHEMVETGGSEGVVRSDQVIITDHPGDLTFTVTLENLTADDAGKYRCGIATILQEDGLSGFLPDPFFQVQVLVSSASSTENSVKTPASPTRPSQCQGSLPSSTCFLLLPLLKVPLLLSILGAILWVNRPWRTPWTES"
# Predict the category
category = predict_category(new_protein_sequence)
print(f"The predicted category for the sequence is: {category}")