---
license: mit
language:
  - en
library_name: transformers
tags:
  - esm
  - esm2
  - protein language model
  - biology
---

# ESM-2 (esm2_t6_8M_UR50D)

This is a fine-tuned version of ESM-2 for sequence classification. It categorizes protein sequences into two classes, "cytosolic" or "membrane".

## Training and Accuracy

The model was trained using this notebook and achieved an eval accuracy of approximately 94.83%.
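
The training notebook itself is not reproduced here, but the sketch below shows one way such a fine-tune could be set up with the Hugging Face `Trainer`. The base checkpoint name (`facebook/esm2_t6_8M_UR50D`), the hyperparameters, and the toy dataset are illustrative assumptions, not the exact training configuration.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Toy dataset for illustration only: sequences labelled 0 (cytosolic) or 1 (membrane)
toy_data = Dataset.from_dict({
    "sequence": [
        "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "MAGAASPCANGCGPGAPSDAEVLHLCRSLEVGTVMTLFYSKKSQRPERKTFQVKLETRQITWSRGADKIEG",
    ],
    "label": [0, 1],
})

# Load the base ESM-2 checkpoint with a 2-class classification head
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/esm2_t6_8M_UR50D", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["sequence"], truncation=True, max_length=512, padding="max_length")

train_dataset = toy_data.map(tokenize, batched=True)

# Hyperparameters here are illustrative, not the ones used for the released model
training_args = TrainingArguments(
    output_dir="esm2_t6_8M_UR50D-finetuned-localization",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```

In practice an evaluation split and a `compute_metrics` callback would also be passed to the `Trainer` so that accuracy can be tracked during training.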

## Using the Model

To use the model, try running:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Initialize the tokenizer and model
model_path_directory = "AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-localization"
tokenizer = AutoTokenizer.from_pretrained(model_path_directory)
model = AutoModelForSequenceClassification.from_pretrained(model_path_directory)

# Define a function to predict the category of a protein sequence
def predict_category(sequence):
    # Tokenize the sequence and convert it to tensor format
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=512, padding="max_length")

    # Make prediction
    with torch.no_grad():
        logits = model(**inputs).logits

    # Determine the category with the highest score
    predicted_class = torch.argmax(logits, dim=1).item()

    # Return the category: 0 for cytosolic, 1 for membrane
    return "cytosolic" if predicted_class == 0 else "membrane"

# Example sequence
new_protein_sequence = "MTQRAGAAMLPSALLLLCVPGCLTVSGPSTVMGAVGESLSVQCRYEEKYKTFNKYWCRQPCLPIWHEMVETGGSEGVVRSDQVIITDHPGDLTFTVTLENLTADDAGKYRCGIATILQEDGLSGFLPDPFFQVQVLVSSASSTENSVKTPASPTRPSQCQGSLPSSTCFLLLPLLKVPLLLSILGAILWVNRPWRTPWTES"

# Predict the category
category = predict_category(new_protein_sequence)
print(f"The predicted category for the sequence is: {category}")
```