---
datasets:
- lamm-mit/protein_secondary_structure_from_PDB
library_name: transformers
---

# Predict dominant secondary structure from protein sequence

This model is instruction-tuned from `lamm-mit/BioinspiredLlama-3-1-8B-128k` to predict the dominant secondary structure of a protein from its amino acid sequence.

Sample instruction:
```raw
Dominant secondary structure of < N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E >
```
Response:
```raw
AH
```
Raw format of training data (in Llama 3.1 chat template format):
```raw
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nDominant secondary structure of < V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D ><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nUNSTRUCTURED<|eot_id|>
```
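The instruction and raw training formats shown above can be assembled programmatically. The following is a minimal sketch, not the author's data pipeline: the helper names `build_instruction` and `format_training_example` are hypothetical, and it assumes the `\n\n` in the raw format stands for two newline characters and that residues are single uppercase letters separated by single spaces.

```python
def build_instruction(sequence: str) -> str:
    """Format an amino acid sequence as the instruction string shown above."""
    # Drop any whitespace, then separate residues with single spaces.
    residues = [ch for ch in sequence.upper() if not ch.isspace()]
    return "Dominant secondary structure of < " + " ".join(residues) + " >"


def format_training_example(sequence: str, label: str) -> str:
    """Assemble one raw training string in the Llama 3.1 chat layout shown above."""
    # Assumption: the literal "\n\n" in the raw example denotes two newlines.
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        + build_instruction(sequence)
        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        + label
        + "<|eot_id|>"
    )


print(build_instruction("VVFD"))
```

For inference, only `build_instruction` is needed; the tokenizer's chat template adds the special tokens.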

Here is a visual representation of what the model predicts:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/LGuWyzE6LZt4WqUaRUwwH.png)

## How to load the model 

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure'

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
```
To load with 4-bit quantization, load the quantized base model and apply the fine-tuned weights with PEFT:
```
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base_model_name = 'lamm-mit/BioinspiredLlama-3-1-8B-128k'
new_model = 'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure'

# NF4 4-bit quantization with double quantization and bfloat16 compute
bnb_config4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized base model, then apply the fine-tuned weights on top
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config4bit,
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(model, new_model)
```

## Example

Inference function for convenience:
```
def generate_response(text_input="What is spider silk?",
                      system_prompt='You are a biological materials scientist.',
                      num_return_sequences=1,
                      temperature=1.,  # higher temperature gives more varied output
                      max_new_tokens=127,
                      device='cuda',
                      num_beams=1,
                      eos_token_id=[128001, 128008, 128009],
                      top_k=50,
                      top_p=0.9,
                      repetition_penalty=1.1,
                      messages=None,  # pass an existing conversation to continue it
                      ):

    # Start a new conversation, or append the new user turn to an existing one
    if not messages:
        messages = [{"role": "system", "content": system_prompt},
                    {"role": "user", "content": text_input}]
    else:
        messages.append({"role": "user", "content": text_input})

    # Render the conversation with the chat template, ending with the assistant header
    text_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer([text_input], add_special_tokens=True, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs,
                                 max_new_tokens=max_new_tokens,
                                 temperature=temperature,
                                 num_beams=num_beams,
                                 top_k=top_k,
                                 top_p=top_p,
                                 eos_token_id=eos_token_id,
                                 num_return_sequences=num_return_sequences,
                                 do_sample=True,
                                 repetition_penalty=repetition_penalty,
                                 )

    # Keep only the newly generated tokens (drop the prompt)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]

    return tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True), messages
```
Usage:
```
AA_sequence='N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E'

answer, _ = generate_response(text_input='Dominant secondary structure of < ' + AA_sequence + ' >',
                              max_new_tokens=16, temperature=0.1)

print(f"Prediction:     {answer[0]}")
```
The output is:
```raw
Prediction:     AH
```
A visualization of the protein, to check:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/aO-0BbS8Sp_dV796w-Hm0.png)

As predicted, this protein (PDB ID 6N7P, https://www.rcsb.org/structure/6N7P) is primarily alpha-helical. 

## Notes

This model was trained with QLoRA on sequences shorter than 128 amino acids.
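
Since training covered only sequences shorter than 128 amino acids, it may be useful to check an input's length before querying the model. The sketch below is one way to do that; the helper name `is_within_training_length` and the constant `MAX_TRAINING_LENGTH` are hypothetical, and how the model behaves on longer inputs is not stated here, so the check is only a guard.

```python
MAX_TRAINING_LENGTH = 128  # from the note above: training sequences were shorter than this


def is_within_training_length(sequence: str, limit: int = MAX_TRAINING_LENGTH) -> bool:
    """Return True if the sequence is non-empty and shorter than `limit` residues."""
    # Count residues only; the prompt format separates them with spaces.
    residues = [ch for ch in sequence if not ch.isspace()]
    return 0 < len(residues) < limit


AA_sequence = 'N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E'
print(is_within_training_length(AA_sequence))
```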

## Reference

```bibtex
@article{Buehler_2024,
  title={Fine-tuning LLMs for protein feature predictions},
  author={Markus J. Buehler},
  journal={},
  year={2024}
}
```