Chatvector-llava-v1.5-plus-Houou-v3-7b Model Card

Model Details

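Note: this model was born out of curiosity. Its accuracy is not guaranteed, but it seems to perform better than the variant built on v1.6.
chatvector-llava-v1.5-plus-houou-v3-7b is a VLM that can describe images in Japanese.
It is inspired by the Chat Vector technique. Following Chat Vector, this model was created by adding and subtracting the weights of llava-v1.5-7b, houou-instruction-7b-v3, and Llama-2-7b-hf as follows: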

houou-instruction-7b-v3 + (llava-v1.5-7b - Llama-2-7b-hf)
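
The weight arithmetic above can be sketched roughly as follows. This is a minimal illustration and not the author's actual merge script: the function name `apply_chat_vector`, the `skip_keys` parameter, and the toy parameter names are hypothetical. It also assumes that parameters present only in the LLaVA checkpoint (for example a vision projector) are kept unchanged, and that parameters whose shapes differ across the three models are skipped.

```python
from typing import Any, Dict, Iterable


def apply_chat_vector(
    target: Dict[str, Any],
    tuned: Dict[str, Any],
    base: Dict[str, Any],
    skip_keys: Iterable[str] = (),
) -> Dict[str, Any]:
    """Return target + (tuned - base) for every parameter the three models share."""
    skip = set(skip_keys)
    # Assumption: start from the tuned model so parameters only it owns are kept as-is.
    merged = dict(tuned)
    for name, tuned_weight in tuned.items():
        if name in skip or name not in target or name not in base:
            continue
        target_weight, base_weight = target[name], base[name]
        # Tensors expose .shape; skip parameters whose shapes do not all match.
        shapes = {
            getattr(w, "shape", None)
            for w in (target_weight, tuned_weight, base_weight)
        }
        if len(shapes) != 1:
            continue
        merged[name] = target_weight + (tuned_weight - base_weight)
    return merged


# Toy stand-ins for the three state dicts; real ones would map names to tensors.
houou = {"layer.weight": 10.0}
llava = {"layer.weight": 3.0, "mm_projector.weight": 7.0}
llama2 = {"layer.weight": 1.0}

merged = apply_chat_vector(houou, llava, llama2)
print(merged)
```

With real checkpoints, the three dictionaries would typically come from each model's `state_dict()`, and the result would be loaded back into the LLaVA model with `load_state_dict`.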

The program below is based on code from the site credited in the bibliography. Please also see the references listed there.

Uses

Install LLaVA from source first:

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .

Then run the following Python script:

import requests
import torch
import transformers
from PIL import Image

from transformers.generation.streamers import TextStreamer
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM
from llava.mm_utils import tokenizer_image_token, process_images

model_path = "shinyice/chatvector-llava-v1.5-plus-houou-v3-7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LlavaLlamaForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    torch_dtype=torch.float16,
).eval()
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_path,
    model_max_length=1024,
    padding_side="right",
    use_fast=False,
)
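# The vision tower weights are not loaded by from_pretrained, so load them
# explicitly, then move the whole model (vision tower included) to the device.
model.get_model().vision_tower.load_model()
model = model.to(device)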

# Stop generation when either the EOS or the BOS token is produced.
eos_token_id_list = [
    tokenizer.eos_token_id,
    tokenizer.bos_token_id,
]

image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')

if not isinstance(image, list):
    image = [image]

image_tensor = process_images(image, model.get_model().vision_tower.image_processor, model.config)
image_sizes = [img.size for img in image]

if isinstance(image_tensor, list):
    image_tensor = [img.to(model.device, dtype=torch.float16) for img in image_tensor]
else:
    image_tensor = image_tensor.to(device, dtype=torch.float16)

image_sizes_tensor = torch.tensor(image_sizes, dtype=torch.int32, device=device)

conv_mode = "v1"  # Vicuna-v1-style conversation template
conv = conv_templates[conv_mode].copy()
prompt = "猫の隣には何がありますか?"  # "What is next to the cat?"
inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(
    prompt,
    tokenizer,
    IMAGE_TOKEN_INDEX,
    return_tensors='pt'
).unsqueeze(0)
if device == "cuda":
    input_ids = input_ids.to(device)

temperature = 0.0
top_p = 1.0
max_new_tokens = 256

with torch.inference_mode():
    output = model.generate(
        inputs=input_ids,
        images=image_tensor,
        image_sizes=image_sizes_tensor,
        do_sample=temperature > 0,  # greedy decoding when temperature is 0
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
        use_cache=True,
        eos_token_id=eos_token_id_list,
    )

print(tokenizer.decode(output[0]))

Bibliography
