|
--- |
|
datasets: |
|
- lamm-mit/protein_secondary_structure_from_PDB |
|
library_name: transformers |
|
--- |
|
|
|
# Predict dominant secondary structure from protein sequence |
|
|
|
This model is instruction-tuned on top of `lamm-mit/BioinspiredLlama-3-1-8B-128k` to predict the dominant secondary structure of a protein from its amino acid sequence.
|
|
|
Sample instruction: |
|
```raw |
|
Dominant secondary structure of < N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E > |
|
``` |
|
Response: |
|
```raw |
|
AH |
|
``` |
|
Raw format of training data (in Llama 3.1 chat template format): |
|
```raw |
|
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nDominant secondary structure of < V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D ><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nUNSTRUCTURED<|eot_id|> |
|
``` |
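For reference, the same prompt string can be reconstructed with the tokenizer's chat template. This is a minimal sketch that assumes the tokenizer loaded in the section below; the actual training pipeline is not shown here.

```python
# Rebuild the training-format prompt from a chat message list.
# The sequence is space-separated single-letter amino acid codes in angle brackets.
sequence = "V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D"
messages = [{"role": "user", "content": f"Dominant secondary structure of < {sequence} >"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # templated prompt string, ending with the assistant header
```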
|
|
|
Here is a visual representation of what the model predicts: |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/LGuWyzE6LZt4WqUaRUwwH.png) |
|
|
|
## How to load the model |
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure',
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained('lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure')
```
|
Alternatively, load the base model in 4-bit quantization and attach the fine-tuned adapter (the model was trained with QLoRA):
|
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_name = "lamm-mit/BioinspiredLlama-3-1-8B-128k"
new_model = 'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure'

# NF4 4-bit quantization with bfloat16 compute and double quantization
bnb_config4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config4bit,
    torch_dtype=torch.bfloat16,
)

# Attach the fine-tuned adapter on top of the quantized base model
model = PeftModel.from_pretrained(model, new_model)
```
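The 4-bit snippet above loads only the model; the tokenizer can be loaded as in the full-precision example. The memory-footprint check is an optional sanity check, not part of the original recipe.

```python
from transformers import AutoTokenizer

# The adapter is built on the same base model, so either repo's tokenizer should work.
tokenizer = AutoTokenizer.from_pretrained(new_model)

# Optional: confirm the quantized model's size in memory
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```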
|
|
|
## Example |
|
|
|
Inference function for convenience: |
|
```python
def generate_response(text_input="What is spider silk?",
                      system_prompt='You are a biological materials scientist.',
                      num_return_sequences=1,
                      temperature=1.0,  # higher temperature -> more creative output
                      max_new_tokens=127,
                      device='cuda',
                      num_beams=1,
                      eos_token_id=[128001, 128008, 128009],
                      top_k=50,
                      top_p=0.9,
                      repetition_penalty=1.1,
                      messages=None,
                      ):
    # Start a new conversation, or continue the one passed in
    if not messages:
        messages = [{"role": "system", "content": system_prompt},
                    {"role": "user", "content": text_input}]
    else:
        messages.append({"role": "user", "content": text_input})

    text_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # The chat template already inserts <|begin_of_text|>, so skip special tokens here
    inputs = tokenizer([text_input], add_special_tokens=False, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs,
                                 max_new_tokens=max_new_tokens,
                                 temperature=temperature,
                                 num_beams=num_beams,
                                 top_k=top_k,
                                 eos_token_id=eos_token_id,
                                 top_p=top_p,
                                 num_return_sequences=num_return_sequences,
                                 do_sample=True,
                                 repetition_penalty=repetition_penalty,
                                 )

    # Keep only the newly generated tokens, stripping the prompt
    outputs = outputs[:, inputs["input_ids"].shape[1]:]

    return tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True), messages
```
|
Usage: |
|
```python
AA_sequence = 'N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E'

answer, _ = generate_response(text_input='Dominant secondary structure of < ' + AA_sequence + ' >',
                              max_new_tokens=16, temperature=0.1)

print(f"Prediction: {answer[0]}")
```
|
The output is: |
|
```raw |
|
Prediction: AH |
|
``` |
|
To verify the prediction, here is a visualization of the protein:
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/aO-0BbS8Sp_dV796w-Hm0.png) |
|
|
|
As predicted, this protein (PDB ID 6N7P, https://www.rcsb.org/structure/6N7P) is primarily alpha-helical. |
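To screen several sequences at once, the same helper can be called in a loop. This is a minimal sketch; the second sequence is the repetitive placeholder from the training-data example above.

```python
# Predict the dominant secondary structure for a list of sequences.
sequences = [
    'N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E',
    'V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D',
]
for seq in sequences:
    answer, _ = generate_response(text_input='Dominant secondary structure of < ' + seq + ' >',
                                  max_new_tokens=16, temperature=0.1)
    print(f"{seq[:30]}... -> {answer[0]}")
```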
|
|
|
## Notes |
|
|
|
This model was trained using QLoRA on sequences shorter than 128 amino acids, so predictions for longer inputs are out of distribution; a simple input check is sketched below.
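A simple guard can flag out-of-distribution inputs before querying the model. This is a hypothetical helper for illustration, not part of the model or its training code.

```python
def within_training_regime(aa_sequence: str, limit: int = 128) -> bool:
    """Return True if the space-separated sequence has fewer residues than the training limit."""
    n_residues = len(aa_sequence.split())
    if n_residues >= limit:
        print(f"Warning: {n_residues} residues; the model was trained on sequences shorter than {limit}.")
        return False
    return True
```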
|
|
|
## Reference |
|
|
|
```bibtex |
|
@article{Buehler_2024, |
|
title={Fine-tuning LLMs for protein feature predictions}, |
|
author={Markus J. Buehler}, |
|
journal={}, |
|
year={2024} |
|
} |
|
``` |