---
base_model: unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- mllama
license: apache-2.0
language:
- en
datasets:
- unsloth/Radiology_mini
library_name: transformers
---

# Uploaded finetuned model

- **Developed by:** Haq Nawaz Malik
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit

# Documentation: Hnm_Llama3.2_(11B)-Vision_lora_model

## Overview

The **Hnm_Llama3.2_(11B)-Vision_lora_model** is a fine-tuned version of **Llama 3.2 (11B) Vision** using **LoRA-based parameter-efficient fine-tuning (PEFT)**. It specializes in **vision-language tasks**, particularly **medical image captioning and understanding**.

The model was fine-tuned on a **Tesla T4 (Google Colab)** using **Unsloth**, a framework for memory-efficient fine-tuning of large models.

---

## Features

- **Fine-tuned on radiology images**: trained on the **Radiology_mini** dataset.
- **Image captioning**: can describe medical images.
- **4-bit quantization (QLoRA)**: memory-efficient; runs on consumer GPUs.
- **LoRA-based PEFT**: trains only about **1% of the parameters**, significantly reducing computational cost.
- **Multi-modal**: works with both **text and image** inputs.
- **Supports both vision and language fine-tuning** (see the fine-tuning sketch at the end of this card).

---

## Model Details

- **Base Model**: `unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit`
- **Fine-tuning Method**: LoRA + 4-bit quantization (QLoRA)
- **Dataset**: `unsloth/Radiology_mini`
- **Framework**: Unsloth + Hugging Face Transformers
- **Training Environment**: Google Colab (Tesla T4 GPU)

---

## Usage

### **1. Load the Model**

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Omarrran/Hnm_Llama3_2_Vision_lora_model",
    load_in_4bit=True,  # Set to False for full precision
)
```

### **2. Image Captioning Example**

```python
import torch
from datasets import load_dataset
from transformers import TextStreamer

FastVisionModel.for_inference(model)  # Enable inference mode

# Load a sample image from the dataset
dataset = load_dataset("unsloth/Radiology_mini", split="train")
image = dataset[0]["image"]

instruction = "Describe this medical image accurately."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
```

---

## Notes

- The model is optimized for vision-language tasks in the medical domain but can be adapted to other applications.
- It uses **LoRA adapters**, so it can be fine-tuned further with modest GPU resources (see the sketch at the end of this card).
- It can be deployed and shared via the **Hugging Face Model Hub**.

---

## Citation

If you use this model, please cite:

```
@misc{Hnm_Llama3.2_11B_Vision,
  author = {Haq Nawaz Malik},
  title  = {Fine-tuned Llama 3.2 (11B) Vision Model},
  year   = {2025},
  url    = {https://huggingface.co/Omarrran/Hnm_Llama3_2_Vision_lora_model}
}
```

---

## Contact

For questions or support, reach out via:

- **GitHub**: [view](https://github.com/Haq-Nawaz-Malik)
- **Hugging Face**: [view](https://huggingface.co/Omarrran)
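
---

## Appendix: Fine-Tuning Sketch

This adapter was trained with Unsloth on `unsloth/Radiology_mini`. For reference, below is a minimal sketch of a comparable LoRA fine-tuning run, following Unsloth's standard vision fine-tuning flow. The hyperparameters (`r`, `lora_alpha`, batch size, step count, learning rate) are illustrative assumptions, not the exact values used to produce this model; the sketch also assumes the dataset's `image` and `caption` columns.

```python
# Minimal LoRA fine-tuning sketch (illustrative hyperparameters, not the
# exact configuration used to produce this model).
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Load the 4-bit quantized base model.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit",
    load_in_4bit=True,
)

# Attach LoRA adapters to both the vision and language layers.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,           # LoRA rank (assumption)
    lora_alpha=16,  # LoRA scaling factor (assumption)
    lora_dropout=0,
    bias="none",
    random_state=3407,
)

# Convert each (image, caption) pair into a chat-style conversation.
dataset = load_dataset("unsloth/Radiology_mini", split="train")
instruction = "Describe this medical image accurately."

def to_conversation(sample):
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": sample["image"]},
                {"type": "text", "text": instruction},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["caption"]},
            ]},
        ]
    }

train_data = [to_conversation(sample) for sample in dataset]

# Train with TRL's SFTTrainer and Unsloth's vision data collator.
FastVisionModel.for_training(model)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=train_data,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=30,
        learning_rate=2e-4,
        optim="adamw_8bit",
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        logging_steps=1,
        output_dir="outputs",
        # Required for vision fine-tuning:
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
    ),
)
trainer.train()
```

After training, the LoRA adapter can be saved locally or pushed to the Hugging Face Model Hub. The repository name below is a placeholder, not this model's repository:

```python
model.save_pretrained("lora_model")      # Save the LoRA adapter locally
tokenizer.save_pretrained("lora_model")

# Placeholder repository name; replace with your own.
model.push_to_hub("your-username/your-lora-model", token="hf_...")
tokenizer.push_to_hub("your-username/your-lora-model", token="hf_...")
```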