|
--- |
|
library_name: peft |
|
tags: |
|
- esm |
|
- esm2 |
|
- ESM-2 |
|
- protein language model |
|
- LoRA |
|
- Low Rank Adaptation |
|
- biology |
|
- CAFA-5 |
|
- protein function prediction |
|
datasets: |
|
- AmelieSchreiber/cafa_5 |
|
license: mit |
|
language: |
|
- en |
|
--- |
|
|
|
# ESM-2 LoRA for CAFA-5 Protein Function Prediction |
|
This is a Low Rank Adaptation (LoRA) of [cafa_5_protein_function_prediction](https://huggingface.co/AmelieSchreiber/cafa_5_protein_function_prediction), |
|
which is a fine-tuned (without LoRA) version of `facebook/esm2_t6_8M_UR50D`, for the same task. For more information |
|
on training a sequence classifier langauge model with LoRA [see here](https://github.com/huggingface/peft/blob/main/examples/sequence_classification/LoRA.ipynb). |
|
Note, this is for natural language processing and must be adapted to our use case using a protein language model like ESM-2. |
|
|
|
## Training procedure |
|
Using Hugging Face's Parameter Efficient Fine-Tuning (PEFT) library, a Low Rank Adaptation was trained for |
|
3 epochs on the CAFA-5 protein sequences dataset at an 80/20 train/test split. The dataset can be |
|
[found here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5). Somewhat naively, the model was trained on |
|
the `train_sequences.fasta` file of protein sequences, with the `train_terms.tsv` file serving as the labels. |
|
The gene ontology used is a hierarchy, and so the labels lower in the hierchay should be weighted more, or the |
|
graph structure should be taken into account. The model achieved the following metrics: |
|
|
|
``` |
|
Epoch: 3, |
|
Validation Loss: 0.0031, |
|
Validation Micro F1: 0.3752, |
|
Validation Macro F1: 0.9968, |
|
Validation Micro Precision: 0.5287, |
|
Validation Macro Precision: 0.9992, |
|
Validation Micro Recall: 0.2911, |
|
Validation Macro Recall: 0.9968 |
|
``` |
|
|
|
Future iterations of this model will likely need to take into account class weighting. |
|
|
|
### Framework versions |
|
|
|
- PEFT 0.4.0 |
|
|
|
## Using the Model |
|
|
|
To use the model, try downloading the data [from here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5), |
|
adjust the paths to the files in the code below to their local paths on your machine, and try running: |
|
|
|
```python |
|
import os |
|
import numpy as np |
|
import torch |
|
from transformers import AutoTokenizer, EsmForSequenceClassification, AdamW |
|
from torch.nn.functional import binary_cross_entropy_with_logits |
|
from sklearn.model_selection import train_test_split |
|
from sklearn.metrics import f1_score, precision_score, recall_score |
|
from accelerate import Accelerator |
|
from Bio import SeqIO |
|
|
|
# Step 1: Data Preprocessing |
|
fasta_file = "data/Train/train_sequences.fasta" |
|
tsv_file = "data/Train/train_terms.tsv" |
|
|
|
fasta_data = {} |
|
tsv_data = {} |
|
|
|
for record in SeqIO.parse(fasta_file, "fasta"): |
|
fasta_data[record.id] = str(record.seq) |
|
|
|
with open(tsv_file, 'r') as f: |
|
for line in f: |
|
parts = line.strip().split("\t") |
|
tsv_data[parts[0]] = parts[1:] |
|
|
|
unique_terms = list(set(term for terms in tsv_data.values() for term in terms)) |
|
|
|
def parse_fasta(file_path): |
|
""" |
|
Parses a FASTA file and returns a list of sequences. |
|
""" |
|
with open(file_path, 'r') as f: |
|
content = f.readlines() |
|
|
|
sequences = [] |
|
current_sequence = "" |
|
|
|
for line in content: |
|
if line.startswith(">"): |
|
if current_sequence: |
|
sequences.append(current_sequence) |
|
current_sequence = "" |
|
else: |
|
current_sequence += line.strip() |
|
|
|
if current_sequence: |
|
sequences.append(current_sequence) |
|
|
|
return sequences |
|
|
|
# Parse the provided FASTA file |
|
fasta_file_path = "data/Test/testsuperset.fasta" |
|
protein_sequences = parse_fasta(fasta_file_path) |
|
# protein_sequences[:3] # Displaying the first 3 sequences for verification |
|
|
|
import torch |
|
from transformers import AutoTokenizer, EsmForSequenceClassification |
|
from sklearn.metrics import precision_recall_fscore_support |
|
|
|
# 1. Parsing the go-basic.obo file (Assuming this is still needed) |
|
def parse_obo_file(file_path): |
|
with open(file_path, 'r') as f: |
|
data = f.read().split("[Term]") |
|
|
|
terms = [] |
|
for entry in data[1:]: |
|
lines = entry.strip().split("\n") |
|
term = {} |
|
for line in lines: |
|
if line.startswith("id:"): |
|
term["id"] = line.split("id:")[1].strip() |
|
elif line.startswith("name:"): |
|
term["name"] = line.split("name:")[1].strip() |
|
elif line.startswith("namespace:"): |
|
term["namespace"] = line.split("namespace:")[1].strip() |
|
elif line.startswith("def:"): |
|
term["definition"] = line.split("def:")[1].split('"')[1] |
|
terms.append(term) |
|
return terms |
|
|
|
# Let's assume the path to go-basic.obo is as follows (please modify if different) |
|
obo_file_path = "data/Train/go-basic.obo" |
|
parsed_terms = parse_obo_file("data/Train/go-basic.obo") # Replace with your path |
|
|
|
# 2. Load the saved model and tokenizer |
|
# Assuming the model path provided is correct |
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
from peft import PeftModel, PeftConfig |
|
|
|
# Load the tokenizer and model |
|
model_id = "AmelieSchreiber/esm2_t6_8M_UR50D_cafa5_lora" # Replace with your Hugging Face hub model name |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
|
# First, we load the underlying base model |
|
base_model = AutoModelForSequenceClassification.from_pretrained(model_id) |
|
|
|
# Then, we load the model with PEFT |
|
model = PeftModel.from_pretrained(base_model, model_id) |
|
loaded_model = model |
|
loaded_tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
|
# 3. The predict_protein_function function |
|
def predict_protein_function(sequence, model, tokenizer, go_terms): |
|
inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True, max_length=1022) |
|
model.eval() |
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
predictions = torch.sigmoid(outputs.logits) |
|
predicted_indices = torch.where(predictions > 0.05)[1].tolist() |
|
|
|
functions = [] |
|
for idx in predicted_indices: |
|
term_id = unique_terms[idx] # Use the unique_terms list from your training script |
|
for term in go_terms: |
|
if term["id"] == term_id: |
|
functions.append(term["name"]) |
|
break |
|
|
|
return functions |
|
|
|
# 4. Predicting protein function for the sequences in the FASTA file |
|
protein_functions = {} |
|
for seq in protein_sequences[:20]: # Using only the first 3 sequences for demonstration |
|
predicted_functions = predict_protein_function(seq, loaded_model, loaded_tokenizer, parsed_terms) |
|
protein_functions[seq[:20] + "..."] = predicted_functions # Using first 20 characters as key |
|
|
|
protein_functions |
|
``` |