---
datasets:
- lamm-mit/protein_secondary_structure_from_PDB
library_name: transformers
---
# Predict dominant secondary structure from protein sequence
This model is instruction-tuned on top of `lamm-mit/BioinspiredLlama-3-1-8B-128k` to predict the dominant secondary structure of a protein from its amino acid sequence.
Sample instruction:
```raw
Dominant secondary structure of < N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E >
```
Response:
```raw
AH
```
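The instruction wraps the sequence in `< ... >` with the one-letter residue codes separated by single spaces. A minimal sketch of a helper that produces this format from a compact sequence string follows; the name `build_instruction` is hypothetical, and upper-casing the input is an assumption based on the samples shown here.

```python
def build_instruction(sequence: str) -> str:
    """Format a one-letter amino acid sequence as the sample instruction."""
    # Drop any existing whitespace and upper-case the residues (assumption:
    # the model expects upper-case one-letter codes, as in the samples).
    residues = [ch for ch in sequence.upper() if not ch.isspace()]
    # Residues are separated by single spaces and wrapped in "< ... >".
    return "Dominant secondary structure of < " + " ".join(residues) + " >"


print(build_instruction("NERRILEQ"))
```

Input that is already space-separated passes through unchanged, so the helper can be applied to either form.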
Raw format of training data (in Llama 3.1 chat template format):
```raw
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nDominant secondary structure of < V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D ><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nUNSTRUCTURED<|eot_id|>
```
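For illustration, the raw string above can be assembled by hand from an instruction and a label. This is only a sketch: the function name `format_training_example` is hypothetical, and it assumes the `\n\n` in the raw sample stands for two newline characters. In practice, `tokenizer.apply_chat_template` builds the prompt part of this string.

```python
def format_training_example(instruction: str, label: str) -> str:
    """Assemble one user/assistant exchange in the raw format shown above."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{instruction}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{label}<|eot_id|>"
    )


example = format_training_example(
    "Dominant secondary structure of < V V F D V V F D >",
    "UNSTRUCTURED",
)
print(example)
```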
Here is a visual representation of what the model predicts:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/LGuWyzE6LZt4WqUaRUwwH.png)
## How to load the model
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure'

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
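To load the model with 4-bit quantization, load the quantized base model and apply the fine-tuned weights with PEFT:
```
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base_model_name = "lamm-mit/BioinspiredLlama-3-1-8B-128k"
# Fine-tuned weights, applied on top of the base model via PEFT
new_model = 'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure'

# 4-bit NF4 quantization with bfloat16 compute and double quantization
bnb_config4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config4bit,
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(model, new_model)
```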
## Example
Inference function for convenience:
```
def generate_response(
    text_input="What is spider silk?",
    system_prompt='You are a biological materials scientist.',
    num_return_sequences=1,
    temperature=1.,  # higher values give more varied output
    max_new_tokens=127,
    device='cuda',
    num_beams=1,
    eos_token_id=[128001, 128008, 128009],
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
    messages=None,
):
    # Start a new conversation unless a message history is passed in
    if not messages:
        messages = [{"role": "system", "content": system_prompt},
                    {"role": "user", "content": text_input}]
    else:
        messages.append({"role": "user", "content": text_input})

    # Render the conversation with the model's chat template
    text_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer([text_input], add_special_tokens=True, return_tensors='pt').to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            num_beams=num_beams,
            top_k=top_k,
            top_p=top_p,
            eos_token_id=eos_token_id,
            num_return_sequences=num_return_sequences,
            do_sample=True,
            repetition_penalty=repetition_penalty,
        )

    # Keep only the newly generated tokens, dropping the prompt
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True), messages
```
Usage:
```
AA_sequence = 'N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E'
answer, _ = generate_response(text_input='Dominant secondary structure of < ' + AA_sequence + ' >', max_new_tokens=16, temperature=0.1)
print(f"Prediction: {answer[0]}")
```
The output is:
```raw
Prediction: AH
```
To check the prediction, here is a visualization of the protein:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/aO-0BbS8Sp_dV796w-Hm0.png)
As predicted, this protein (PDB ID 6N7P, https://www.rcsb.org/structure/6N7P) is primarily alpha-helical.
## Notes
This model was trained with QLoRA on sequences shorter than 128 amino acids.
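Since training covered only sequences shorter than 128 amino acids, it can be useful to check the length of an input before asking for a prediction. The following is a minimal sketch; the function name `within_training_length` and the constant `MAX_TRAINING_LENGTH` are hypothetical, and the check counts residues after removing whitespace.

```python
MAX_TRAINING_LENGTH = 128  # training used sequences shorter than this


def within_training_length(sequence: str, max_length: int = MAX_TRAINING_LENGTH) -> bool:
    """Return True if the sequence has fewer residues than max_length."""
    # Count residues only, ignoring the spaces used in the prompt format.
    n_residues = sum(1 for ch in sequence if not ch.isspace())
    return n_residues < max_length


print(within_training_length("N E R R I L E Q K K H Y F W"))
```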
## Reference
```bibtex
@article{Buehler_2024,
title={Fine-tuning LLMs for protein feature predictions},
author={Markus J. Buehler},
journal={},
year={2024}
}
```