metadata
license: apache-2.0
language:
  - en
pipeline_tag: image-text-to-text
tags:
  - multimodal
library_name: transformers

Qwen2-VL-7B-Instruct-abliterated

Introduction

Abliterated version of Qwen2-VL-7B-Instruct, an advanced multimodal large language model. Weight orthogonalization has been applied to inhibit the model's ability to express refusals while preserving the model's text and multimodal capabilities. Nonetheless, the model may still refuse your request, misunderstand your intent, or provide unsolicited advice regarding ethics or safety.
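The exact procedure used for this checkpoint is not described here. As a rough illustration only, weight orthogonalization is commonly formulated as projecting a single "refusal direction" r out of a weight matrix W that writes to the residual stream, giving W' = (I - r rᵀ) W. The sketch below uses plain Python lists to stay dependency-free; the matrix, the direction, and the helper names are hypothetical and are not taken from the model or its training code.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def orthogonalize(weight, direction):
    """Return (I - r r^T) W, where r is the unit-normalized direction.

    weight is a list of rows with shape (d_out, d_in); direction has length d_out.
    """
    r = normalize(direction)
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r
    proj = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Subtract the component along r from every column
    return [[weight[i][j] - r[i] * proj[j] for j in range(n_cols)] for i in range(n_rows)]

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

# Toy values, chosen only for illustration
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_direction = [1.0, 1.0, 0.0]

W_abl = orthogonalize(W, refusal_direction)
y = matvec(W_abl, [0.5, -2.0])

# The output of the modified matrix has no component along the direction
component = sum(a * b for a, b in zip(normalize(refusal_direction), y))
print(abs(component) < 1e-9)
```

Because the projection is applied to the weights themselves, no hook or extra computation is needed at inference time. In a real model the same operation would be applied with tensor operations to every matrix that writes to the residual stream.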

Requirements

If you encounter errors such as KeyError: 'qwen2_vl' or ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers', your installed transformers version predates Qwen2-VL support. Try installing transformers from source with the command pip install git+https://github.com/huggingface/transformers
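To check the environment before loading any weights, a small probe along the following lines may help. It is a sketch, not an official diagnostic: the function name has_qwen2_vl_support is hypothetical, and the check only tests whether the installed transformers package exposes the Qwen2VLForConditionalGeneration class.

```python
import importlib

def has_qwen2_vl_support() -> bool:
    """Return True if transformers is installed and exposes the Qwen2-VL model class."""
    try:
        module = importlib.import_module("transformers")
        return hasattr(module, "Qwen2VLForConditionalGeneration")
    except Exception:
        # transformers is missing or failed to import one of its dependencies
        return False

if not has_qwen2_vl_support():
    print("Qwen2-VL support not found; consider installing transformers from source.")
```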

Quickstart

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

path = "natong19/Qwen2-VL-7B-Instruct-abliterated"

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    path, torch_dtype="auto", device_map="auto"
)

# Bound the per-image resolution, which in turn bounds the number of visual tokens
min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained(path, min_pixels=min_pixels, max_pixels=max_pixels)

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
# Move the inputs to the device the model was placed on
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens from each output sequence
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)

The above code runs on a GPU with 24 GB of VRAM. For more usage examples, such as multi-image inference, video inference and batch inference, please refer to the Qwen2-VL-7B-Instruct repo.
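The min_pixels and max_pixels values in the Quickstart are written as multiples of 28*28, which suggests a budget of roughly 256 to 1280 visual tokens per image, assuming each 28x28 pixel block maps to one visual token. The sketch below works through that arithmetic. The helper approx_visual_tokens is hypothetical and only an estimate: the actual processor also rounds each side of the image to a multiple of 28 while preserving aspect ratio, so real counts can differ slightly.

```python
BLOCK = 28  # assumed side length, in pixels, of the block behind one visual token

min_pixels = 256 * BLOCK * BLOCK   # lower bound on image area after resizing
max_pixels = 1280 * BLOCK * BLOCK  # upper bound on image area after resizing

def approx_visual_tokens(height: int, width: int) -> int:
    """Rough estimate: clamp the pixel area to the budget, then divide by block area."""
    area = min(max(height * width, min_pixels), max_pixels)
    return area // (BLOCK * BLOCK)

print(min_pixels, max_pixels)
for size in [(100, 100), (560, 840), (3000, 4000)]:
    print(size, approx_visual_tokens(*size))
```

Lowering max_pixels reduces the number of visual tokens and therefore memory use, at the cost of detail in large images.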

Evaluation

Evaluation framework: lm-evaluation-harness 0.4.2 and lmms-eval 0.2.1

| Datasets | Qwen2-VL-7B-Instruct | Qwen2-VL-7B-Instruct-abliterated |
| :--- | :---: | :---: |
| Text benchmarks | | |
| ARC (25-shot) | 57.8 | 57.8 |
| MMLU (5-shot) | 69.7 | 68.4 |
| TruthfulQA (0-shot) | 49.5 | 45.4 |
| Winogrande (5-shot) | 72.6 | 72.8 |
| Multimodal benchmarks | | |
| AI2D (lite) | 78.8 | 79.8 |
| GQA (lite) | 73.2 | 73.6 |
| MMBench (EN dev, lite) | 84.1 | 82.6 |
| MMMU (val) | 50.8 | 51.6 |
| OCRBench | 77.7 | 78.1 |
| VQAv2 (val, lite) | 79.9 | 79.8 |