---
datasets:
- lamm-mit/protein_secondary_structure_from_PDB
library_name: transformers
---

# Predict dominant secondary structure from protein sequence

This model is instruction-tuned on top of `lamm-mit/BioinspiredLlama-3-1-8B-128k` to predict the dominant secondary structure of a protein from its amino acid sequence.

Sample instruction:
```raw
Dominant secondary structure of < N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E >
```
Response:
```raw
AH
```
Raw format of training data (in Llama 3.1 chat template format):
```raw
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nDominant secondary structure of < V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D ><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nUNSTRUCTURED<|eot_id|>
```
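
For reference, a prompt in this format can be produced with the tokenizer's chat template. A minimal sketch (assuming `tokenizer` has been loaded as shown in the section below; the exact header lines depend on the chat template stored with this tokenizer):
```python
# Minimal sketch: format a query with the chat template and compare it against the
# raw training format shown above. Assumes `tokenizer` is already loaded
# (see "How to load the model" below).
sequence = "V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D"
messages = [{"role": "user", "content": f"Dominant secondary structure of < {sequence} >"}]
raw_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(raw_prompt)
```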

Here is a visual representation of what the model predicts:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/LGuWyzE6LZt4WqUaRUwwH.png)

## How to load the model 

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure',
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained('lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure')
```
Load in 4-bit quantization:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_name = "lamm-mit/BioinspiredLlama-3-1-8B-128k"
adapter_name = 'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure'

bnb_config4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# Load the base model with 4-bit NF4 quantization
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config4bit,
    torch_dtype=torch.bfloat16,
)
# Attach the fine-tuned LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(model, adapter_name)
```
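
As an optional sanity check (not part of the original recipe), the memory footprint of the quantized model can be inspected with the standard `transformers` helper:
```python
# Optional check: report the approximate memory used by the quantized weights and buffers.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```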

## Example

Inference function for convenience:
```python
def generate_response(text_input="What is spider silk?",
                      system_prompt='You are a biological materials scientist.',
                      num_return_sequences=1,
                      temperature=1.,  # higher temperature -> more creative output
                      max_new_tokens=127, device='cuda',
                      num_beams=1,
                      eos_token_id=[128001, 128008, 128009],
                      top_k=50,
                      top_p=0.9,
                      repetition_penalty=1.1,
                      messages=[],
                      ):
    # Start a new conversation or append the query to an existing message history
    if messages == []:
        messages = [{"role": "system", "content": system_prompt},
                    {"role": "user", "content": text_input}]
    else:
        messages.append({"role": "user", "content": text_input})

    text_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # The chat template already prepends <|begin_of_text|>, so no extra special tokens are added here
    inputs = tokenizer([text_input], add_special_tokens=False, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs,
                                 max_new_tokens=max_new_tokens,
                                 temperature=temperature,
                                 num_beams=num_beams,
                                 top_k=top_k,
                                 eos_token_id=eos_token_id,
                                 top_p=top_p,
                                 num_return_sequences=num_return_sequences,
                                 do_sample=True,
                                 repetition_penalty=repetition_penalty,
                                 )

    # Keep only the newly generated tokens before decoding
    outputs = outputs[:, inputs["input_ids"].shape[1]:]

    return tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True), messages
```
Usage:
```python
AA_sequence = 'N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E'

answer, _ = generate_response(text_input='Dominant secondary structure of < ' + AA_sequence + ' >',
                              max_new_tokens=16, temperature=0.1)

print(f"Prediction:     {answer[0]}")
```
The output is:
```raw
Prediction:     AH
```
A visualization of the protein, to check:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/aO-0BbS8Sp_dV796w-Hm0.png)

As predicted, this protein (PDB ID 6N7P, https://www.rcsb.org/structure/6N7P) is primarily alpha-helical. 
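
Note that the model returns a short class label rather than a per-residue annotation. A minimal sketch of a lookup for the two labels that appear in this card (the label set used during fine-tuning may include additional secondary-structure classes):
```python
# Hypothetical lookup covering only the class labels shown in this card; the trained
# label set may contain additional classes.
LABELS = {
    "AH": "alpha-helix dominant",
    "UNSTRUCTURED": "no dominant secondary structure",
}

prediction = answer[0].strip()
print(LABELS.get(prediction, f"Unrecognized label: {prediction}"))
```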

## Notes

This model was trained using QLoRA on sequences shorter than 128 amino acids.
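
Since training only covered sequences shorter than 128 residues, it can be useful to check the input length before querying the model. A minimal, illustrative helper (the 128-residue cut-off comes from the training setup above; the helper itself is hypothetical):
```python
# Illustrative helper: warn when a space-separated amino acid sequence exceeds the
# 128-residue training limit mentioned above.
def check_sequence_length(aa_sequence: str, max_residues: int = 128) -> int:
    n_residues = len(aa_sequence.split())
    if n_residues >= max_residues:
        print(f"Warning: {n_residues} residues; the model was trained on sequences "
              f"shorter than {max_residues} amino acids, so predictions may be less reliable.")
    return n_residues

check_sequence_length(AA_sequence)
```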

## Reference

```bibtex
@article{Buehler_2024,
  title={Fine-tuning LLMs for protein feature predictions},
  author={Markus J. Buehler},
  journal={},
  year={2024}
}
```