LLM-jp-3 VILA 14B

This repository provides a large vision language model (VLM) developed by the Research and Development Center for Large Language Models at the National Institute of Informatics, Japan.

Usage

Python version: 3.10.12

  1. Clone the repository and install the libraries.

    git clone git@github.com:llm-jp/llm-jp-VILA.git
    cd llm-jp-VILA
    
    python3 -m venv venv
    source venv/bin/activate
    
    pip install --upgrade pip
    wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
    pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
    pip install -e .
    pip install -e ".[train]"
    
    pip install git+https://github.com/huggingface/[email protected]
    cp -rv ./llava/train/transformers_replace/* ./venv/lib/python3.10/site-packages/transformers/
    
  2. Run the following Python script. Set image_path and query to your own image and prompt.

    from io import BytesIO
    
    import requests
    import torch
    from PIL import Image
    
    from llava.constants import IMAGE_TOKEN_INDEX
    from llava.conversation import conv_templates
    from llava.mm_utils import (get_model_name_from_path,
                                process_images, tokenizer_image_token)
    from llava.model.builder import load_pretrained_model
    from llava.utils import disable_torch_init
    
    
    def load_image(image_file):
        # Fetch remote images over HTTP(S); otherwise read from the local path.
        if image_file.startswith(("http://", "https://")):
            response = requests.get(image_file, timeout=30)
            response.raise_for_status()  # fail early on a bad download
            image = Image.open(BytesIO(response.content)).convert("RGB")
        else:
            image = Image.open(image_file).convert("RGB")
        return image
    
    
    def load_images(image_files):
        out = []
        for image_file in image_files:
            image = load_image(image_file)
            out.append(image)
        return out
    
    
    disable_torch_init()
    
    model_checkpoint_path = "llm-jp/llm-jp-3-vila-14b"
    model_name = get_model_name_from_path(model_checkpoint_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_checkpoint_path, model_name)
    
    image_path = "path/to/image"
    image_files = [
        image_path
    ]
    images = load_images(image_files)
    
    query = "<image>\nこの画像について説明してください。"
    
    conv_mode = "llmjp_v3"
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], query)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    
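    # Preprocess the images into float16 tensors on the model's device.
    images_tensor = process_images(images, image_processor, model.config).to(model.device, dtype=torch.float16)
    # Tokenize the prompt, replacing the <image> placeholder with IMAGE_TOKEN_INDEX, and add a batch dimension.
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)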
    
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=[
                images_tensor,
            ],
            do_sample=False,
            num_beams=1,
            max_new_tokens=256,
            use_cache=True,
        )
    
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
    print(outputs)
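
When you substitute your own query, note that the example prompt starts with the `<image>` placeholder followed by a newline. The following is a minimal sketch of a helper that enforces this layout for a single image. The names `build_query` and `IMAGE_PLACEHOLDER` are hypothetical and are not part of the repository, and the assumption that the placeholder should lead the prompt is taken only from the example query above.

```python
# Placeholder string copied from the example query above.
# Assumption: exactly one image accompanies each query.
IMAGE_PLACEHOLDER = "<image>"


def build_query(text: str) -> str:
    """Return `text` prefixed with the image placeholder, as in the example query."""
    text = text.strip()
    if IMAGE_PLACEHOLDER in text:
        # The caller already positioned the placeholder; leave the prompt untouched.
        return text
    return f"{IMAGE_PLACEHOLDER}\n{text}"


print(build_query("この画像には何が写っていますか?"))
```

The returned string can be assigned to `query` in the script before the conversation template is filled in.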
    

Model Details

| Model components | Model / Architecture | Parameters |
|---|---|---|
| Vision encoder | siglip-so400m-patch14-384 | 428M |
| Projector | 2-layer MLP | 32M |
| LLM | llm-jp-3-13b-instruct | 13B |
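
As a quick arithmetic check, the component sizes can be summed to estimate the total model size. This is only a sketch based on the rounded figures listed above, so the result is approximate; the dictionary keys are illustrative names, not identifiers from the repository.

```python
# Rounded parameter counts copied from the model components listed above.
COMPONENT_PARAMS = {
    "vision_encoder": 428_000_000,  # siglip-so400m-patch14-384
    "projector": 32_000_000,        # 2-layer MLP
    "llm": 13_000_000_000,          # llm-jp-3-13b-instruct
}

total_params = sum(COMPONENT_PARAMS.values())
print(f"Approximate total: {total_params / 1e9:.2f}B parameters")
```

The sum is roughly 13.46B parameters, which is presumably what the "14B" in the model name rounds up from.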

Datasets

The model was trained in three stages.

Step-0

We used the following datasets to tune the parameters in the projector.

| Language | Dataset | Images |
|---|---|---|
| Japanese | Japanese image text pairs | 558K |
| English | LLaVA-Pretrain | 558K |

Step-1

We used the following datasets to tune the parameters in the projector and LLM.

| Language | Dataset | Images |
|---|---|---|
| Japanese | Japanese image text pairs | 6M |
| Japanese | Japanese interleaved data | 6M |
| English | coyo (subset) | 6M |
| English | mmc4-core (subset) | 6M |

Step-2

We used the following datasets to tune the parameters in the projector and LLM.

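| Language | Dataset | Images |
|---|---|---|
| Japanese | llava-instruct-ja | 156K |
| Japanese | japanese-photos-conv | 12K |
| Japanese | ja-vg-vqa | 99K |
| Japanese | synthdog-ja (subset) | 102K |
| English | LLaVA | 158K |
| English | VQAv2 | 53K |
| English | GQA | 46K |
| English | OCRVQA | 80K |
| English | TextVQA | 22K |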
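
To see how the Japanese and English data are balanced across the three stages, the image counts listed for Step-0, Step-1, and Step-2 can be totaled per language. The sketch below uses only the rounded figures given above (in thousands of images); `STAGES` and `totals_by_language` are illustrative names introduced here.

```python
# Image counts in thousands, copied from the Step-0, Step-1, and Step-2 listings.
STAGES = {
    "Step-0": {"Japanese": [558], "English": [558]},
    "Step-1": {"Japanese": [6000, 6000], "English": [6000, 6000]},
    "Step-2": {"Japanese": [156, 12, 99, 102], "English": [158, 53, 46, 80, 22]},
}


def totals_by_language(stages):
    """Sum the per-dataset image counts for each language within each stage."""
    return {
        stage: {lang: sum(counts) for lang, counts in langs.items()}
        for stage, langs in stages.items()
    }


totals = totals_by_language(STAGES)
for stage, per_lang in totals.items():
    print(stage, per_lang)
```

With these figures, each stage uses a roughly equal number of Japanese and English images: 558K each in Step-0, 12M each in Step-1, and about 369K versus 359K in Step-2.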

Evaluations

We evaluated our model on Heron Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA-500. We used gpt-4o-2024-05-13 as the judge model for the LLM-as-a-judge scores.

Heron Bench

| Models | LLM-as-a-judge score (%) |
|---|---|
| Japanese InstructBLIP Alpha | 14.0 |
| Japanese Stable VLM | 24.2 |
| Llama-3-EvoVLM-JP-v2 | 39.3 |
| LLaVA-CALM2-SigLIP | 43.3 |
| llm-jp-3-vila-14b (Ours) | 57.2 |
| GPT-4o | 87.6 |

JA-VLM-Bench-In-the-Wild

| Models | ROUGE-L | LLM-as-a-judge score (/5.0) |
|---|---|---|
| Japanese InstructBLIP Alpha | 20.8 | 2.42 |
| Japanese Stable VLM | 23.3 | 2.47 |
| Llama-3-EvoVLM-JP-v2 | 41.4 | 2.92 |
| LLaVA-CALM2-SigLIP | 47.2 | 3.15 |
| llm-jp-3-vila-14b (Ours) | 52.3 | 3.69 |
| GPT-4o | 37.6 | 3.85 |

JA-VG-VQA-500

| Models | ROUGE-L | LLM-as-a-judge score (/5.0) |
|---|---|---|
| Japanese InstructBLIP Alpha | -- | -- |
| Japanese Stable VLM | -- | -- |
| Llama-3-EvoVLM-JP-v2 | 23.5 | 2.96 |
| LLaVA-CALM2-SigLIP | 17.4 | 3.21 |
| llm-jp-3-vila-14b (Ours) | 16.2 | 3.62 |
| GPT-4o | 12.1 | 3.58 |
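
For readers unfamiliar with ROUGE-L, the metric scores a candidate answer by the longest common subsequence (LCS) it shares with the reference. The sketch below computes an F1-style ROUGE-L over characters. It is not the evaluation code used for the results above: the tokenization (characters rather than words or morphemes), the F1 weighting, and the function names are all assumptions made for illustration, and the reported scores appear to be on a 0 to 100 scale rather than 0 to 1.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over characters (assumed tokenization), in the range 0 to 1."""
    ref_tokens, cand_tokens = list(reference), list(candidate)
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of the candidate covered by the LCS
    recall = lcs / len(ref_tokens)      # share of the reference covered by the LCS
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("abcd", "abxd"))
```

Because ROUGE-L rewards surface overlap with the reference, a model that gives longer or differently worded answers can score lower on it while still receiving a high LLM-as-a-judge score, as the GPT-4o rows suggest.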

Risks and Limitations

The model released in this repository is in the early stages of our research and development. It has not been tuned to align its outputs with social norms, ethical standards, and the law.

License

The weights of this model are released under the Apache License, Version 2.0. In addition, users of this model must comply with the OpenAI Terms of Use, because synthetic data generated by OpenAI GPT-4 was used to train the model.

Additional information

Regarding the license of the synthdog-ja dataset, there is no explicit license statement in the dataset documentation. While we attempted to contact the main corresponding author of "OCR-free Document Understanding Transformer" for clarification, we received no response.

Based on the following considerations:

  1. The donut-base model trained on this dataset is released under the MIT license
  2. The Wikipedia articles used in the dataset are licensed under CC-BY-SA

We have determined that the synthdog-ja dataset is most likely governed by the CC-BY-SA license, and proceeded with training under this assumption.
