Thanks, how to fine-tune?
...
Hi there,
Thank you for your interest in Phi-4-multimodal.
There are some example fine-tuning scripts in the repo, for example:
https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_finetune_speech.py
https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_finetune_vision.py
I hope you find them helpful.
Thanks, is this training only the LLM or also the speech adapter? If not, how do I fine-tune the speech adapter for a new spoken language?
@SamuelAzran
This example focuses on fine-tuning the LLM (Speech LoRA) only. If you would like to fine-tune the speech encoder and adapter for new spoken languages, you may unfreeze the parameters of `model.embed_tokens_extend.audio_embed` by setting `requires_grad` to `True`.
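For reference, here is a minimal sketch of that unfreezing step, assuming the model is loaded the same way as in the sample fine-tuning script; matching parameters by name avoids depending on the exact module nesting:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the model as in the sample fine-tuning script.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Unfreeze the speech encoder + adapter (embed_tokens_extend.audio_embed)
# so they receive gradient updates in addition to the speech LoRA.
num_unfrozen = 0
for name, param in model.named_parameters():
    if "embed_tokens_extend.audio_embed" in name:
        param.requires_grad = True
        num_unfrozen += 1

print(f"Unfroze {num_unfrozen} parameter tensors")
```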
Thank you for your informative and quick response! I will try it.
I found that during the evaluation loop inside the Trainer, GPU memory consumption increases incrementally. Maybe the CUDA cache or memory is not handled properly. I created my own evaluation loop to override the Hugging Face one, in case anyone needs it:
import gc
from typing import List, Optional

import sacrebleu
import torch
from accelerate.utils import gather_object
from torchmetrics.text import CharErrorRate, WordErrorRate
from tqdm import tqdm
from transformers import StoppingCriteriaList, Trainer
from transformers.trainer_utils import EvalLoopOutput

# MultipleTokenBatchStoppingCriteria is the stopping-criteria helper defined
# in the sample_finetune_speech.py script linked above.


class CustomTrainer(Trainer):
    def __init__(self, stopping_criteria_list=None, processor=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.processor = processor
        stop_tokens = ["<|end|>", self.processor.tokenizer.eos_token]
        stop_tokens_ids = self.processor.tokenizer(
            stop_tokens, add_special_tokens=False, padding="longest", return_tensors="pt"
        )["input_ids"]
        stop_tokens_ids = stop_tokens_ids.to("cuda:0")
        self.stop_tokens_ids = stop_tokens_ids
        self.stopping_criteria_list = stopping_criteria_list

    def evaluation_loop(
        self,
        dataloader,
        description: str,
        prediction_loss_only: Optional[bool] = None,
        ignore_keys: Optional[List[str]] = None,
        metric_key_prefix: str = "eval",
    ) -> EvalLoopOutput:
        """
        Optimized evaluation loop that only runs the model once per input.
        """
        model = self.model
        processor = self.processor
        accelerator = self.accelerator

        # Ensure the model is in evaluation mode
        model.eval()

        all_generated_texts = []
        all_labels = []
        total_eval_loss = 0
        num_eval_steps = 0

        # Progress bar for main process only
        progress_bar = tqdm(
            enumerate(dataloader),
            disable=not accelerator.is_local_main_process,
            total=len(dataloader),
            desc=f"Evaluation ({metric_key_prefix})",
        )

        for step, inputs in progress_bar:
            with torch.no_grad():
                # Move inputs to the appropriate device
                inputs = self._prepare_inputs(inputs)

                # Set up stopping criteria for generation
                if not self.stopping_criteria_list:
                    stop_criteria = MultipleTokenBatchStoppingCriteria(
                        self.stop_tokens_ids,
                        batch_size=inputs["input_ids"].size(0),
                    )
                    self.stopping_criteria_list = StoppingCriteriaList([stop_criteria])

                # Run generation with return_dict_in_generate=True to get scores
                generation_outputs = model.generate(
                    **inputs,
                    eos_token_id=processor.tokenizer.eos_token_id,
                    max_new_tokens=500,
                    stopping_criteria=self.stopping_criteria_list,
                    return_dict_in_generate=True,
                    output_scores=True,
                )

                # Calculate loss from the generation outputs' scores.
                # This is model-dependent and might require adjustments
                # if your model doesn't return logits in a compatible format.
                generated_ids = generation_outputs.sequences

                # Get the actual labels for loss calculation
                labels = inputs["labels"].detach().clone()

                # Process the generated output for evaluation
                if self.stopping_criteria_list and hasattr(self.stopping_criteria_list[0], "stop_tokens_idx"):
                    stop_tokens_idx = self.stopping_criteria_list[0].stop_tokens_idx.reshape(
                        inputs["input_ids"].size(0), -1
                    )[:, 0]
                    stop_tokens_idx = torch.where(
                        stop_tokens_idx > 0,
                        stop_tokens_idx - self.stop_tokens_ids.shape[-1],
                        generated_ids.shape[-1],
                    )
                    generated_text = [
                        processor.decode(
                            _pred_ids[inputs["input_ids"].shape[1]:_stop_tokens_idx],
                            skip_special_tokens=True,
                            clean_up_tokenization_spaces=False,
                        )
                        for _pred_ids, _stop_tokens_idx in zip(generated_ids, stop_tokens_idx)
                    ]
                else:
                    # Fallback if no stopping criteria with stop_tokens_idx
                    generated_text = processor.batch_decode(
                        generated_ids[:, inputs["input_ids"].shape[1]:],
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=False,
                    )
                all_generated_texts.extend(generated_text)

                # Process labels
                labels[labels == -100] = processor.tokenizer.pad_token_id
                label_text = processor.batch_decode(labels, skip_special_tokens=True)
                # If you have a specific suffix to remove
                if hasattr(self, "ANSWER_SUFFIX"):
                    label_text = [text.rstrip(self.ANSWER_SUFFIX) for text in label_text]
                all_labels.extend(label_text)

                # Calculate loss using the original inputs.
                # Run a separate forward pass just for loss calculation;
                # this is more efficient than two full generate() calls.
                outputs = model(**inputs)
                loss = outputs.loss
                # Scale the loss
                if accelerator.use_distributed:
                    loss = loss.mean()
                total_eval_loss += loss.detach().float()

                # Explicit memory cleanup after each batch
                del generated_ids, generated_text, labels, label_text, outputs, generation_outputs
                torch.cuda.empty_cache()
                gc.collect()
                num_eval_steps += 1

        # Gather results from all processes if distributed
        all_generated_texts = gather_object(all_generated_texts)
        all_labels = gather_object(all_labels)

        # Compute metrics
        cer = CharErrorRate()(all_generated_texts, all_labels)
        wer = WordErrorRate()(all_generated_texts, all_labels)
        bleu = sacrebleu.corpus_bleu(all_generated_texts, [all_labels])

        # Convert tensor metrics to native Python types
        metrics = {
            f"{metric_key_prefix}_loss": float(total_eval_loss.item() / num_eval_steps),
            f"{metric_key_prefix}_cer": float(cer.item()) if isinstance(cer, torch.Tensor) else float(cer),
            f"{metric_key_prefix}_wer": float(wer.item()) if isinstance(wer, torch.Tensor) else float(wer),
            f"{metric_key_prefix}_bleu": float(bleu.score),
        }

        # Clean up memory
        del all_generated_texts, all_labels
        gc.collect()
        torch.cuda.empty_cache()

        # Required format for the output
        return EvalLoopOutput(
            predictions=None,
            label_ids=None,
            metrics=metrics,
            num_samples=len(dataloader.dataset),
        )
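A rough sketch of how this trainer might be wired up; the model, datasets, collator, and processor names below are placeholders from your own fine-tuning script, not part of the snippet above:

```python
from transformers import TrainingArguments

# Placeholder values; model, eval_dataset, collate_fn, and processor come
# from your own Phi-4-multimodal fine-tuning setup.
training_args = TrainingArguments(
    output_dir="./phi4mm-eval",        # placeholder path
    per_device_eval_batch_size=4,
    remove_unused_columns=False,
)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    processor=processor,               # AutoProcessor for Phi-4-multimodal-instruct
)

metrics = trainer.evaluate()
print(metrics)
```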
@nguyenbh Can you kindly explain how to fine-tune the "base weights"/text base? Also, if I fine-tune the base weights, I'm guessing I need to fine-tune both the vision and audio LoRAs as well?
It seems that ms-swift also supports training Phi-4-multimodal: https://github.com/modelscope/ms-swift/pull/3350.
@ysdede
I came across this repo: https://huggingface.co/ysdede/Phi-4-mm-inst-asr-turkish-3
The results on fine-tuning for Turkish ASR look very promising.
Before Fine-Tuning:
+ WER: 153.84
+ CER: 82.57
After Fine-Tuning:
+ WER: 64.76
+ CER: 29.85
The results from @seastar105 on extending Phi-4-multimodal to Korean ASR and En-Ko speech translation are also very promising:
https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor
@nguyenbh I would like to do continual pretraining on a Vietnamese corpus (text only) and then fine-tune the LLM base (text only). Is there any guideline for this, since Phi-4 has absolutely no Vietnamese? After that, I will fine-tune on mixed image/voice data. Thank you in advance.
@nguyenbh
@shtpgshus
Hi, I want to add Persian to Phi-4-multimodal-instruct for text, speech, and vision. I have a few questions:
For text: Is adding extra_special_tokens to the tokenizer enough for Persian, or do I need to retrain the tokenizer?
For speech: Is setting model.embed_tokens_extend.audio_embed.requires_grad = True sufficient, or are other changes to the script needed?
For vision: For fine-tuning on Persian data (e.g., images with Persian text), should I also unfreeze parts of the model?
Thanks for your help!
Progress Update on Turkish ASR Fine-Tuning
I am now getting 9–20% WER scores by unfreezing the audio encoder. Initially, I experimented with various approaches such as selectively unfreezing only audio-related layers, separating speech LoRA, and storing speech LoRA independently after fine-tuning. However, I still observed some unintended unfreezing of vision-related layers.
After further experimentation, I simplified the approach by unfreezing all relevant layers (see this list) and increasing the learning rate to enhance ASR performance.
For detailed benchmark results, please refer to the results page and explore the finetuning Colab notebook.
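A rough sketch of the "unfreeze everything audio-related and raise the learning rate" idea (the name filter and hyperparameters below are illustrative, not the exact values from the notebook; `model` is assumed to be the loaded Phi-4-multimodal-instruct model):

```python
from transformers import TrainingArguments

# Unfreeze audio-related modules by substring match on parameter names.
# The exact substrings should follow the layer list linked above; this one
# is illustrative.
AUDIO_KEYWORDS = ("embed_tokens_extend.audio_embed",)

for name, param in model.named_parameters():
    if any(keyword in name for keyword in AUDIO_KEYWORDS):
        param.requires_grad = True

# A higher learning rate than the LoRA-only default; tune for your data.
training_args = TrainingArguments(
    output_dir="./phi4mm-asr-tr",   # placeholder path
    learning_rate=1e-4,             # illustrative
    warmup_ratio=0.1,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    bf16=True,
    remove_unused_columns=False,
)
```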
@ysdede This is awesome! Thank you for sharing the Turkish finetuning recipe and notebook with the community.
I also see other very cool fine-tuned models for Korean language tasks from @junnei https://huggingface.co/junnei/Phi-4-multimodal-instruct-ko-asr and @daekeun-ml https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech
@thusinh1969 We do not release the base language model, so continual pre-training will be a challenge for Vietnamese. Phi-4-mini's pretraining data does contain some Vietnamese, so I would suggest running SFT training on Vietnamese with lots of high-quality data.
I am interested in teaching the model a new language (Icelandic, somewhat related to Swedish, Danish, and Norwegian, which it has been trained on) and then fine-tuning it for ASR in that language.
You @nguyenbh mention that SFT would be a better approach than continued pretraining for Vietnamese. Does the same hold for Icelandic? Furthermore, how do you suggest that the SFT prompt be structured?
The plan is to add a new LoRA for the Icelandic capability on the LM and then use it and add a new LoRA for the ASR training (or perhaps only the already defined audio LoRA). It also sounds like it might be a good idea to train the audio encoder as well, but hopefully that won't be needed.
I think this is a great model and I would rather not ruin it by fully fine-tuning all parameters and throwing it way off with the SFT, so thoughts are welcome.
@haukurpj
Thank you for your interest in Phi-4-multimodal.
This is a single model with 3 modalities. The languages that each modality supports are the following:
- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
So the good news on Text is that the model does support Danish, Swedish, and Norwegian. The model also has a 200K vocabulary, so I think it may work if you extend to Icelandic with SFT. TBH, I do not know what the quality will be. One of the challenges I can see is that the model may end up fluent in Icelandic but hallucinate a lot, since the pretraining data does not contain much Icelandic.
For Audio, we can see examples of extending to Korean and Turkish in this community. A write-up contributed by the community is here.
For Vision, it will be harder to support more languages, in general.
Having said that, I would give Icelandic a try with both LoRA and full fine-tuning, given that the text modality has been trained on Danish, Swedish, and Norwegian.
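On the question of how the SFT prompt might be structured: a minimal sketch of one possible ASR-style training example, following the <|user|> ... <|end|><|assistant|> chat format used by the sample fine-tuning scripts (the instruction wording, function name, and label-masking scheme are illustrative, not an official recipe):

```python
import torch

# Illustrative instruction; the exact wording is up to you.
INSTRUCTION = "Transcribe the audio clip into text."

def build_asr_example(processor, audio_array, sampling_rate, transcript):
    """Build a single (input_ids, labels) example for ASR-style SFT."""
    prompt = f"<|user|><|audio_1|>{INSTRUCTION}<|end|><|assistant|>"
    answer = f"{transcript}<|end|>"

    inputs = processor(
        text=prompt + answer,
        audios=[(audio_array, sampling_rate)],
        return_tensors="pt",
    )

    # Supervise only the answer tokens; everything before them is masked out.
    answer_ids = processor.tokenizer(
        answer, add_special_tokens=False, return_tensors="pt"
    )["input_ids"]
    labels = torch.full_like(inputs["input_ids"], -100)
    labels[:, -answer_ids.shape[1]:] = inputs["input_ids"][:, -answer_ids.shape[1]:]
    inputs["labels"] = labels
    return inputs
```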
@nguyenbh I was just wondering if you could suggest how to get bounding-box information for any VQA. From my experiments with this model so far, I did not find any such capability. In case I want to train/fine-tune with a custom dataset that has bounding boxes, would that even be possible? Looking forward to your reply. I am interested in the Vision part only.
Thanks for your interest. Unfortunately, Phi-4-MM does not support visual bounding boxes, and there is no plan to officially support them in the near future. For vision-centric tasks, you might want to check out Florence-2.
I am interested in finetuning Phi-4-MM for a specific ASR task and would like to make it work on audio streams. Do you have any suggestions on how to adapt the model or pipeline for realtime ASR?