---
base_model: unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- mllama
license: apache-2.0
language:
- en
datasets:
- unsloth/Radiology_mini
library_name: transformers
---
# Uploaded finetuned model
- **Developed by:** Haq Nawaz Malik
- **License:** apache-2.0
- **Fine-tuned from model:** unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
# Documentation: Hnm_Llama3.2_(11B)-Vision_lora_model
## Overview
The **Hnm_Llama3.2_(11B)-Vision_lora_model** is a fine-tuned version of **Llama 3.2 (11B) Vision** with **LoRA-based parameter-efficient fine-tuning (PEFT)**. It specializes in **vision-language tasks**, particularly for **medical image captioning and understanding**.
This model was fine-tuned on a **Tesla T4 (Google Colab)** using **Unsloth**, a framework designed for efficient fine-tuning of large models.
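
For reference, the sketch below shows how such a setup is typically created with Unsloth's `FastVisionModel` API: the base model is loaded in 4-bit (QLoRA-style) and small LoRA adapter matrices are attached. The rank, alpha, and layer choices are illustrative defaults from Unsloth's examples, not the recorded training configuration of this checkpoint.

```python
from unsloth import FastVisionModel

# Load the 4-bit quantized base model (fits on a single Tesla T4).
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",  # reduces memory use during training
)

# Attach LoRA adapters; only these small matrices are updated during training.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,      # also adapt the vision encoder
    finetune_language_layers=True,    # adapt the language backbone
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,                             # illustrative rank, not the recorded value
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=3407,
)
```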
---
## Features
- **Fine-tuned on Radiology Images**: Trained using the **Radiology_mini** dataset.
- **Supports Image Captioning**: Can describe medical images.
- **4-bit Quantization (QLoRA)**: Memory efficient, runs on consumer GPUs.
- **LoRA-based PEFT**: Trains roughly **1% of the parameters**, significantly reducing computational cost (see the sketch after this list).
- **Multi-modal Capabilities**: Works with both **text and image** inputs.
- **Supports both Vision and Language fine-tuning**.
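
The "roughly 1%" figure can be checked directly once the LoRA adapters are attached. The snippet below is a generic PyTorch sketch and assumes `model` is the adapter-wrapped model returned by `FastVisionModel.get_peft_model`.

```python
# Count trainable (LoRA) parameters versus the full model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```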
---
## Model Details
- **Base Model**: `unsloth/Llama-3.2-11B-Vision-Instruct`
- **Fine-tuning Method**: LoRA + 4-bit Quantization (QLoRA)
- **Dataset**: `unsloth/Radiology_mini` (a conversion sketch follows this list)
- **Framework**: Unsloth + Hugging Face Transformers
- **Training Environment**: Google Colab (Tesla T4 GPU)
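
Unsloth's vision fine-tuning expects each sample as a chat-style `messages` record. The conversion sketch below assumes the `unsloth/Radiology_mini` samples expose `image` and `caption` fields; adjust the field names if your copy of the dataset differs.

```python
from datasets import load_dataset

dataset = load_dataset("unsloth/Radiology_mini", split="train")
instruction = "Describe this medical image accurately."

def convert_to_conversation(sample):
    # Pair each image with its reference caption as one user/assistant turn.
    return {"messages": [
        {"role": "user", "content": [
            {"type": "text", "text": instruction},
            {"type": "image", "image": sample["image"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["caption"]},  # assumed column name
        ]},
    ]}

converted_dataset = [convert_to_conversation(s) for s in dataset]
```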
---
## Usage
### 1. Load the Model
```python
from unsloth import FastVisionModel

# Load the fine-tuned model (base weights + LoRA adapters) from the Hub
model, tokenizer = FastVisionModel.from_pretrained(
    "Omarrran/Hnm_Llama3_2_Vision_lora_model",  # this repository
    load_in_4bit=True,  # set to False for full precision (requires far more VRAM)
)
```
### 2. Image Captioning Example
```python
import torch
from datasets import load_dataset
from transformers import TextStreamer
from unsloth import FastVisionModel

FastVisionModel.for_inference(model)  # switch the loaded model into inference mode

# Load a sample image from the training dataset
dataset = load_dataset("unsloth/Radiology_mini", split="train")
image = dataset[0]["image"]

instruction = "Describe this medical image accurately."
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction},
    ]}
]

# Build the prompt and tokenize the image/text pair
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

# Stream the generated caption token by token
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128,
                   use_cache=True, temperature=1.5, min_p=0.1)
```
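
To capture the caption as a string instead of streaming it, drop the streamer and decode only the newly generated tokens. This assumes the `tokenizer` returned by Unsloth is a processor that exposes `batch_decode`, as used in the example above.

```python
# Generate without streaming and decode only the newly generated tokens.
output_ids = model.generate(**inputs, max_new_tokens=128, use_cache=True)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
caption = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(caption)
```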
## Notes
- This model is optimized for vision-language tasks in the medical field but can be adapted for other applications.
- Uses **LoRA adapters**, so it can be fine-tuned further with modest GPU resources.
- Supports the **Hugging Face Model Hub** for deployment and sharing (see the sketch below).
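
A minimal sketch for saving or sharing the LoRA adapters using the standard `save_pretrained` / `push_to_hub` calls; the repository name below is a placeholder, not an existing repo.

```python
# Save the (small) LoRA adapter files locally ...
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# ... or push them to the Hugging Face Hub (run `huggingface-cli login` first).
model.push_to_hub("your-username/your-lora-repo")      # placeholder repo ID
tokenizer.push_to_hub("your-username/your-lora-repo")  # placeholder repo ID
```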
---
## Citation
If you use this model, please cite:
```bibtex
@misc{Hnm_Llama3_2_11B_Vision,
  author = {Haq Nawaz Malik},
  title  = {Fine-tuned Llama 3.2 (11B) Vision Model},
  year   = {2025},
  url    = {https://huggingface.co/Omarrran/Hnm_Llama3_2_Vision_lora_model}
}
```
---
## Contact
For any questions or support, reach out via:
- **GitHub**: [Haq-Nawaz-Malik](https://github.com/Haq-Nawaz-Malik)
- **Hugging Face**: [Omarrran](https://huggingface.co/Omarrran)