|
--- |
|
datasets: |
|
- lamm-mit/protein_secondary_structure_from_PDB |
|
library_name: transformers |
|
--- |
|
|
|
# Predict dominant secondary structure from protein sequence |
|
|
|
This model is instruction-tuned on top of `lamm-mit/BioinspiredLlama-3-1-8B-128k` to predict the dominant secondary structure of a protein from its amino acid sequence.
|
|
|
Sample instruction: |
|
```raw |
|
Dominant secondary structure of < N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E > |
|
``` |
|
Response: |
|
```raw |
|
AH |
|
``` |
|
Raw format of training data (in Llama 3.1 chat template format): |
|
```raw |
|
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nDominant secondary structure of < V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D ><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nUNSTRUCTURED<|eot_id|> |
|
``` |
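For reference, the same prompt string can be reconstructed with the tokenizer's chat template. This is a minimal sketch that assumes the tokenizer loaded in the section below; the actual training pipeline is not shown here.

```python
# Rebuild the training-format prompt from a chat message list.
# The sequence is space-separated single-letter amino acid codes in angle brackets.
sequence = "V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D"
messages = [{"role": "user", "content": f"Dominant secondary structure of < {sequence} >"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # templated prompt string, ending with the assistant header
```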
|
|
|
Here is a visual representation of what the model predicts: |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/LGuWyzE6LZt4WqUaRUwwH.png) |
|
|
|
## How to load the model |
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure',
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained('lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure')
```
|
Alternatively, load the base model in 4-bit quantization and attach the fine-tuned adapter (the model was trained with QLoRA):
|
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_name = "lamm-mit/BioinspiredLlama-3-1-8B-128k"
new_model = 'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure'

# NF4 4-bit quantization with bfloat16 compute and double quantization
bnb_config4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config4bit,
    torch_dtype=torch.bfloat16,
)

# Attach the fine-tuned adapter on top of the quantized base model
model = PeftModel.from_pretrained(model, new_model)
```
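The 4-bit snippet above loads only the model; the tokenizer can be loaded as in the full-precision example. The memory-footprint check is an optional sanity check, not part of the original recipe.

```python
from transformers import AutoTokenizer

# The adapter is built on the same base model, so either repo's tokenizer should work.
tokenizer = AutoTokenizer.from_pretrained(new_model)

# Optional: confirm the quantized model's size in memory
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```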
|
|
|
## Example |
|
|
|
Inference function for convenience: |
|
```python
def generate_response(text_input="What is spider silk?",
                      system_prompt='You are a biological materials scientist.',
                      num_return_sequences=1,
                      temperature=1.0,  # higher temperature -> more creative output
                      max_new_tokens=127,
                      device='cuda',
                      num_beams=1,
                      eos_token_id=[128001, 128008, 128009],
                      top_k=50,
                      top_p=0.9,
                      repetition_penalty=1.1,
                      messages=None,
                      ):
    # Start a new conversation, or continue the one passed in
    if not messages:
        messages = [{"role": "system", "content": system_prompt},
                    {"role": "user", "content": text_input}]
    else:
        messages.append({"role": "user", "content": text_input})

    text_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # The chat template already inserts <|begin_of_text|>, so skip special tokens here
    inputs = tokenizer([text_input], add_special_tokens=False, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs,
                                 max_new_tokens=max_new_tokens,
                                 temperature=temperature,
                                 num_beams=num_beams,
                                 top_k=top_k,
                                 eos_token_id=eos_token_id,
                                 top_p=top_p,
                                 num_return_sequences=num_return_sequences,
                                 do_sample=True,
                                 repetition_penalty=repetition_penalty,
                                 )

    # Keep only the newly generated tokens, stripping the prompt
    outputs = outputs[:, inputs["input_ids"].shape[1]:]

    return tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True), messages
```
|
Usage: |
|
```python
AA_sequence = 'N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E'

answer, _ = generate_response(text_input='Dominant secondary structure of < ' + AA_sequence + ' >',
                              max_new_tokens=16, temperature=0.1)

print(f"Prediction: {answer[0]}")
```
|
The output is: |
|
```raw |
|
Prediction: AH |
|
``` |
|
To verify the prediction, here is a visualization of the protein:
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/aO-0BbS8Sp_dV796w-Hm0.png) |
|
|
|
As predicted, this protein (PDB ID 6N7P, https://www.rcsb.org/structure/6N7P) is primarily alpha-helical. |
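To screen several sequences at once, the same helper can be called in a loop. This is a minimal sketch; the second sequence is the repetitive placeholder from the training-data example above.

```python
# Predict the dominant secondary structure for a list of sequences.
sequences = [
    'N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E',
    'V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D',
]
for seq in sequences:
    answer, _ = generate_response(text_input='Dominant secondary structure of < ' + seq + ' >',
                                  max_new_tokens=16, temperature=0.1)
    print(f"{seq[:30]}... -> {answer[0]}")
```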
|
|
|
## Notes |
|
|
|
This model was trained using QLoRA on sequences shorter than 128 amino acids, so predictions for longer inputs are out of distribution; a simple input check is sketched below.
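A simple guard can flag out-of-distribution inputs before querying the model. This is a hypothetical helper for illustration, not part of the model or its training code.

```python
def within_training_regime(aa_sequence: str, limit: int = 128) -> bool:
    """Return True if the space-separated sequence has fewer residues than the training limit."""
    n_residues = len(aa_sequence.split())
    if n_residues >= limit:
        print(f"Warning: {n_residues} residues; the model was trained on sequences shorter than {limit}.")
        return False
    return True
```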
|
|
|
## Reference |
|
|
|
```bibtex |
|
@article{Buehler_2024, |
|
title={Fine-tuning LLMs for protein feature predictions}, |
|
author={Markus J. Buehler}, |
|
journal={}, |
|
year={2024} |
|
} |
|
``` |