---
library_name: peft
tags:
- esm
- esm2
- ESM-2
- protein language model
- LoRA
- Low Rank Adaptation
- biology
- CAFA-5
- protein function prediction
datasets:
- AmelieSchreiber/cafa_5
license: mit
language:
- en
---

# ESM-2 LoRA for CAFA-5 Protein Function Prediction

This is a Low Rank Adaptation (LoRA) of [cafa_5_protein_function_prediction](https://huggingface.co/AmelieSchreiber/cafa_5_protein_function_prediction), which is a fine-tuned (without LoRA) version of `facebook/esm2_t6_8M_UR50D`, for the same task. For more information on training a sequence classifier language model with LoRA, [see here](https://github.com/huggingface/peft/blob/main/examples/sequence_classification/LoRA.ipynb). Note that the linked example targets natural language processing and must be adapted to our use case with a protein language model such as ESM-2.

## Training procedure

Using Hugging Face's Parameter Efficient Fine-Tuning (PEFT) library, a Low Rank Adaptation was trained for 3 epochs on the CAFA-5 protein sequences dataset with an 80/20 train/test split. The dataset can be [found here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5). Somewhat naively, the model was trained on the `train_sequences.fasta` file of protein sequences, with the `train_terms.tsv` file serving as the labels. The Gene Ontology used is a hierarchy, so labels lower in the hierarchy should be weighted more heavily, or the graph structure should be taken into account. The model achieved the following metrics:

```
Epoch: 3,
Validation Loss: 0.0031,
Validation Micro F1: 0.3752,
Validation Macro F1: 0.9968,
Validation Micro Precision: 0.5287,
Validation Macro Precision: 0.9992,
Validation Micro Recall: 0.2911,
Validation Macro Recall: 0.9968
```

Future iterations of this model will likely need to take class weighting into account.
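As a rough illustration of the class weighting mentioned above, the sketch below computes one weight per GO term from how often the term is annotated. It is not the procedure used to train this model. The function name `positive_class_weights`, the `max_weight` cap, and the toy annotations are all assumptions made for the example.

```python
from collections import Counter


def positive_class_weights(term_lists, max_weight=100.0):
    """Map each term to (#proteins without it / #proteins with it), clamped to [1, max_weight]."""
    num_proteins = len(term_lists)
    # Count each term at most once per protein.
    counts = Counter(term for terms in term_lists for term in set(terms))
    weights = {}
    for term, positives in counts.items():
        negatives = num_proteins - positives
        weights[term] = min(max(negatives / positives, 1.0), max_weight)
    return weights


# Hypothetical toy annotations: one list of GO term IDs per protein.
toy_annotations = [
    ["GO:0008150", "GO:0005515"],
    ["GO:0008150"],
    ["GO:0008150"],
    ["GO:0008150", "GO:0003674"],
]
toy_weights = positive_class_weights(toy_annotations)
```

Rare terms receive larger weights than terms annotated on almost every protein. Arranged in the same order as the model's output labels and converted to a tensor, such weights could be passed as the `pos_weight` argument of `torch.nn.functional.binary_cross_entropy_with_logits`. Whether this improves the metrics above is untested here.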
### Framework versions

- PEFT 0.4.0

## Using the Model

To use the model, download the data [from here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5), adjust the file paths in the code below to match their locations on your machine, and run:

```python
import torch
from Bio import SeqIO
from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Step 1: Data preprocessing
fasta_file = "data/Train/train_sequences.fasta"
tsv_file = "data/Train/train_terms.tsv"

fasta_data = {}
tsv_data = {}

for record in SeqIO.parse(fasta_file, "fasta"):
    fasta_data[record.id] = str(record.seq)

with open(tsv_file, "r") as f:
    for line in f:
        parts = line.strip().split("\t")
        tsv_data[parts[0]] = parts[1:]

# NOTE: the position of each entry must match the label order used in training.
# Set iteration order for strings can differ between Python processes, so this
# mapping is not guaranteed to be reproducible.
unique_terms = list(set(term for terms in tsv_data.values() for term in terms))


def parse_fasta(file_path):
    """
    Parses a FASTA file and returns a list of sequences.
    """
    with open(file_path, "r") as f:
        content = f.readlines()

    sequences = []
    current_sequence = ""

    for line in content:
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current_sequence:
                sequences.append(current_sequence)
                current_sequence = ""
        else:
            current_sequence += line.strip()

    if current_sequence:
        sequences.append(current_sequence)

    return sequences


# Parse the provided FASTA file
fasta_file_path = "data/Test/testsuperset.fasta"
protein_sequences = parse_fasta(fasta_file_path)
# protein_sequences[:3]  # Display the first 3 sequences for verification


# 1. Parsing the go-basic.obo file
def parse_obo_file(file_path):
    """
    Parses a GO .obo file and returns a list of term dictionaries.
    """
    with open(file_path, "r") as f:
        data = f.read().split("[Term]")

    terms = []
    for entry in data[1:]:
        # Drop any trailing [Typedef] stanzas so they cannot overwrite term fields.
        lines = entry.split("[Typedef]")[0].strip().split("\n")
        term = {}
        for line in lines:
            if line.startswith("id:"):
                term["id"] = line.split("id:")[1].strip()
            elif line.startswith("name:"):
                term["name"] = line.split("name:")[1].strip()
            elif line.startswith("namespace:"):
                term["namespace"] = line.split("namespace:")[1].strip()
            elif line.startswith("def:"):
                # The definition text is the first double-quoted string on the line.
                term["definition"] = line.split("def:")[1].split('"')[1]
        terms.append(term)
    return terms


# Path to go-basic.obo (modify if yours is different)
obo_file_path = "data/Train/go-basic.obo"
parsed_terms = parse_obo_file(obo_file_path)

# 2. Load the saved model and tokenizer
model_id = "AmelieSchreiber/esm2_t6_8M_UR50D_cafa5_lora"  # Replace with your Hugging Face hub model name
tokenizer = AutoTokenizer.from_pretrained(model_id)

# First, we load the underlying base model
base_model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Then, we load the model with PEFT
model = PeftModel.from_pretrained(base_model, model_id)


# 3. The predict_protein_function function
def predict_protein_function(sequence, model, tokenizer, go_terms):
    inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True, max_length=1022)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.sigmoid(outputs.logits)

    # Keep every label whose predicted probability exceeds 0.05.
    predicted_indices = torch.where(predictions > 0.05)[1].tolist()

    functions = []
    for idx in predicted_indices:
        term_id = unique_terms[idx]  # Use the unique_terms list from your training script
        for term in go_terms:
            if term["id"] == term_id:
                functions.append(term["name"])
                break

    return functions


# 4. Predicting protein function for the sequences in the FASTA file
protein_functions = {}
for seq in protein_sequences[:20]:  # Using only the first 20 sequences for demonstration
    predicted_functions = predict_protein_function(seq, model, tokenizer, parsed_terms)
    protein_functions[seq[:20] + "..."] = predicted_functions  # Using first 20 characters as key

print(protein_functions)
```
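The script above rebuilds `unique_terms` with `list(set(...))`, so the mapping from output index to GO term depends on set iteration order. The sketch below shows one way to make that mapping deterministic by sorting the term IDs. It assumes `train_terms.tsv` has a header row with `EntryID`, `term`, and `aspect` columns and one row per annotation. The helper names and the toy rows are hypothetical. This is not necessarily the label order this model was trained with, so it is most useful when training a new adapter.

```python
import csv
import io
from collections import defaultdict


def load_term_annotations(tsv_handle):
    """Group GO terms by protein ID from a tab-separated file with a header row."""
    annotations = defaultdict(list)
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        annotations[row["EntryID"]].append(row["term"])
    return dict(annotations)


def build_label_index(annotations):
    """Assign each term a fixed column index by sorting the term IDs."""
    terms = sorted({term for terms in annotations.values() for term in terms})
    return {term: idx for idx, term in enumerate(terms)}


def multi_hot(terms, label_index):
    """Encode one protein's terms as a 0/1 vector over all labels."""
    vector = [0] * len(label_index)
    for term in terms:
        vector[label_index[term]] = 1
    return vector


# Hypothetical three-row excerpt in the assumed column layout.
toy_tsv = io.StringIO(
    "EntryID\tterm\taspect\n"
    "P12345\tGO:0008150\tBPO\n"
    "P12345\tGO:0005515\tMFO\n"
    "Q67890\tGO:0008150\tBPO\n"
)
toy_annotations = load_term_annotations(toy_tsv)
toy_index = build_label_index(toy_annotations)
toy_vector = multi_hot(toy_annotations["Q67890"], toy_index)
```

Saving the resulting index alongside the adapter (for example as JSON) would let inference code recover the exact label order without re-deriving it from the training files.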