LoRA Fine Tuning for PTM Site Prediction
Hi there,
My team and I are working on training a model for PTM site prediction, and we are considering using ProLLaMA as the base model.
Would you recommend this model for such a task? Additionally, is it feasible to fine-tune ProLLaMA using LoRA for this specific problem?
We are currently getting familiar with LLaMA Factory and building an instruction dataset for fine-tuning. Any insights or recommendations would be greatly appreciated!
Hello!
Thanks for your interest. In general, I think it is feasible to use ProLLaMA with LoRA for PTM site prediction. Here are my suggestions:
- I recommend fine-tuning ProLLaMA_Stage_1 rather than ProLLaMA, since the latter has already been fine-tuned on downstream tasks.
- Please follow the formatting requirements: the protein sequence needs to be wrapped with the prefix "Seq=<" and the suffix ">".
- Using LLaMA-Factory is a good choice. In fact, both ProLLaMA_Stage_1 and ProLLaMA are compatible with most LLM training libraries. I also provide a fine-tuning script in my GitHub repo.
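For reference, a record that follows the wrapping convention above could be built like this (make_record is a hypothetical helper for illustration, not part of the ProLLaMA repo or LLaMA-Factory):

```python
# Illustrative helper: wrap a raw protein sequence in the "Seq=<" prefix
# and ">" suffix required by ProLLaMA, and emit one Alpaca-style record.

def make_record(sequence: str, mask: str,
                instruction: str = "[Predict PTM site mask]") -> dict:
    """Build one instruction-tuning record for a PTM-site dataset."""
    return {
        "instruction": instruction,
        "input": f"Seq=<{sequence}>",     # sequence wrapped per ProLLaMA format
        "output": f"Mask=<{mask}>",       # 0/1 mask aligned to the sequence
    }

record = make_record("QKMFSPKKHSVSTSDRNQEER", "000000000001000000000")
```

A list of such records can then be dumped with `json.dump` into the dataset file that LLaMA-Factory reads.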
Feel free to keep in touch about any issues you run into.
Best regards!
Hi there,
We were able to fine-tune the ProLLaMA_Stage_1 model on our instruction dataset for PTM site prediction. However, accuracy is low: only 35% of PTM sites are correctly predicted. We used LLaMA-Factory with this command:
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path GreatCaptainNemo/ProLLaMA_Stage_1 \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template alpaca \
--flash_attn auto \
--dataset_dir data \
--dataset adpr_train \
--cutoff_len 2048 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--max_samples 100000 \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 20 \
--packing False \
--report_to none \
--output_dir saves/Custom/lora/train_2025-03-11-22-40-04 \
--bf16 True \
--plot_loss True \
--trust_remote_code True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--optim adamw_torch \
--lora_rank 64 \
--lora_alpha 128 \
--lora_dropout 0.01 \
--lora_target q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj \
--val_size 0.1 \
--eval_strategy steps \
--eval_steps 100 \
--per_device_eval_batch_size 8
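As an aside on scoring: one way to compute a site-level number like the 35% above is to parse the generated Mask=<...> strings and measure recall over the true PTM positions. A minimal sketch (parse_mask and site_recall are hypothetical helpers, not part of LLaMA-Factory or ProLLaMA):

```python
# Hypothetical scoring sketch: extract the 0/1 string from a "Mask=<...>"
# wrapper and compute site-level recall (fraction of true PTM sites that
# the model also marks as 1).
import re

def parse_mask(text: str) -> str:
    """Return the 0/1 string inside "Mask=<...>", or '' if absent."""
    m = re.search(r"Mask=<([01]+)>", text)
    return m.group(1) if m else ""

def site_recall(pred: str, gold: str) -> float:
    """Fraction of gold PTM sites (1s) that are predicted as 1."""
    positives = [i for i, c in enumerate(gold) if c == "1"]
    if not positives:
        return 0.0
    hits = sum(1 for i in positives if i < len(pred) and pred[i] == "1")
    return hits / len(positives)

pred = parse_mask("Mask=<000000000001000000000>")
gold = "000000000001000000000"
```

Tracking precision alongside recall is worthwhile here, since a model that outputs all zeros on a heavily imbalanced mask can still look deceptively good on token-level accuracy.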
Our instruction dataset looks like this:
[
{
"input": "Seq=<QKMFSPKKHSVSTSDRNQEER>",
"output": "Mask=<000000000001000000000>",
"instruction": "[Predict PTM site mask]"
},
{
"input": "Seq=<PLSLAGDEETECQSSPKHSSR>",
"output": "Mask=<000000000000000000100>",
"instruction": "[Predict PTM site mask]"
},
...
]
Do you have any suggestions on how to improve the accuracy from here? We thought we might need to customize the loss function to up-weight the PTM-site tokens and/or use an LM head specialized for binary classification. We are considering moving away from LLaMA-Factory, as it doesn't seem to offer this level of customization (we did try overriding the Trainer's compute_loss() method, but without success so far).
Would your instruction tuning script help facilitate some of these modifications?
Any guidance you can provide would be much appreciated! Thank you in advance.
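On the up-weighting idea: if you move to a plain transformers Trainer, the per-token weighted loss could look like the sketch below. This is an assumption-laden illustration, not part of ProLLaMA or LLaMA-Factory; inside a Trainer subclass you would call it from an overridden compute_loss, with weights set larger at the positions corresponding to PTM-site ("1") tokens.

```python
# Sketch of a per-token weighted cross-entropy for causal-LM fine-tuning.
# The weighting scheme (e.g. a pos_weight at PTM-site tokens) is an
# assumption for illustration, not ProLLaMA's actual training loss.
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, labels, weights):
    """Causal-LM loss where each target token carries its own weight.

    logits:  (batch, seq, vocab) raw model outputs
    labels:  (batch, seq) target token ids, -100 = ignore
    weights: (batch, seq) per-token weights
    """
    # Standard causal shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    shift_weights = weights[:, 1:]

    per_tok = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
        reduction="none",          # keep per-token losses for weighting
    )
    valid = (shift_labels.reshape(-1) != -100).float()
    w = shift_weights.reshape(-1) * valid
    return (per_tok * w).sum() / w.sum().clamp(min=1e-8)

# Tiny usage example with random logits.
torch.manual_seed(0)
logits = torch.randn(1, 4, 5)
labels = torch.tensor([[2, 3, -100, 1]])
weights = torch.ones(1, 4)
loss = weighted_lm_loss(logits, labels, weights)
```

With uniform weights this reduces to the standard mean cross-entropy over non-ignored tokens; raising the weight on PTM-site tokens shifts the gradient toward the rare positive class, which may help with the imbalance visible in your mostly-zero masks.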