---
language:
- ja
pipeline_tag: image-text-to-text
---

# LLM-jp-3 VILA 14B

This repository provides a large vision language model (VLM) developed by the [Research and Development Center for Large Language Models](https://llmc.nii.ac.jp/) at the [National Institute of Informatics](https://www.nii.ac.jp/en/), Japan.

## Usage

Python version: 3.10.12

1. Clone the repository and install the libraries.
```bash
# Clone the repository and move into it
git clone git@github.com:llm-jp/llm-jp-VILA.git
cd llm-jp-VILA
```

```bash
# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
```

```bash
# Install the pre-built flash-attention wheel and the project itself
pip install --upgrade pip
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e .
pip install -e ".[train]"
```

```bash
# Pin transformers to v4.36.2 and apply the project's patched files
pip install git+https://github.com/huggingface/transformers@v4.36.2
cp -rv ./llava/train/transformers_replace/* ./venv/lib/python3.10/site-packages/transformers/
```
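As a quick sanity check that the installation succeeded, the core packages should be importable from inside the activated virtual environment. This is only a minimal check and not part of the official instructions:

```python
# Minimal sanity check: run inside the activated venv after installation.
# All three imports should succeed if the steps above completed without errors.
import torch
import flash_attn  # the wheel installed above
import llava       # installed via `pip install -e .`

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```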
2. Run the Python script below. You can change `image_path` and `query` to your own image and question.
```python
from io import BytesIO

import requests
import torch
from PIL import Image

from llava.constants import IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import (get_model_name_from_path, process_images,
                            tokenizer_image_token)
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init


def load_image(image_file):
    if image_file.startswith("http") or image_file.startswith("https"):
        response = requests.get(image_file)
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_file).convert("RGB")
    return image


def load_images(image_files):
    out = []
    for image_file in image_files:
        image = load_image(image_file)
        out.append(image)
    return out


disable_torch_init()

model_checkpoint_path = "llm-jp/llm-jp-3-vila-14b"

model_name = get_model_name_from_path(model_checkpoint_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_checkpoint_path, model_name
)

image_path = "path/to/image"
image_files = [image_path]
images = load_images(image_files)

# "<image>" marks where the image tokens are inserted by tokenizer_image_token().
query = "<image>\nこの画像について説明してください。"  # "Please describe this image."

conv_mode = "llmjp_v3"
conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], query)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

images_tensor = process_images(images, image_processor, model.config).to(
    model.device, dtype=torch.float16
)
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=[images_tensor],
        do_sample=False,
        num_beams=1,
        max_new_tokens=256,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(outputs)
```
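Since `load_image` also accepts `http(s)` URLs, the script can point at a remote image. The snippet below is a small continuation of the script above with a placeholder URL and an alternative query, both purely illustrative:

```python
# Continuation of the script above (the model, tokenizer, image_processor and
# helper functions are assumed to be already defined). The URL is a placeholder.
image_files = ["https://example.com/sample.jpg"]  # load_image() also handles http(s) URLs
images = load_images(image_files)

# Any instruction can be used as the query; "<image>" again marks where the
# image tokens are inserted.
query = "<image>\nこの画像に写っているものを教えてください。"  # "Tell me what is shown in this image."

conv = conv_templates["llmjp_v3"].copy()
conv.append_message(conv.roles[0], query)
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())  # inspect the final prompt format before generation
```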
## Model Details

|Model components|Model / Architecture|Parameters|
|:---:|:---:|:---:|
|Vision encoder|[siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)|428M|
|Projector|2-layer MLP|32M|
|LLM|[llm-jp-3-13b-instruct](https://huggingface.co/llm-jp/llm-jp-3-13b-instruct)|13B|

## Datasets

The model was trained in three stages.

### Step-0

We used the following datasets to tune the parameters of the projector.

| Language | Dataset | Images |
|:---|:---|---:|
|Japanese|[Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs)| 558K |
|English|[LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)| 558K |

### Step-1

We used the following datasets to tune the parameters of the projector and the LLM.

| Language | Dataset | Images |
|:---|:---|---:|
|Japanese|[Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs)| 6M |
| |[Japanese interleaved data](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-interleaved-data)| 6M |
|English|[coyo](https://github.com/kakaobrain/coyo-dataset) (subset)| 6M |
| |[mmc4-core](https://github.com/allenai/mmc4) (subset)| 6M |

### Step-2

We used the following datasets to tune the parameters of the projector and the LLM.

| Language | Dataset | Images |
|:---|:---|---:|
|Japanese|[llava-instruct-ja](https://huggingface.co/datasets/llm-jp/llava-instruct-ja)| 156K |
| |[japanese-photos-conv](https://huggingface.co/datasets/llm-jp/japanese-photos-conversation)| 12K |
| |[ja-vg-vqa](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)| 99K |
| |[synthdog-ja](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja) (subset)| 102K |
|English|[LLaVA](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)| 158K |
| |[VQAv2](https://visualqa.org/)| 53K |
| |[GQA](https://cs.stanford.edu/people/dorarad/gqa/index.html)| 46K |
| |[OCRVQA](https://ocr-vqa.github.io/)| 80K |
| |[TextVQA](https://textvqa.org/dataset/)| 22K |

## Evaluations

We evaluated our model using [Heron Bench](https://huggingface.co/datasets/turing-motors/Japanese-Heron-Bench), [JA-VLM-Bench-In-the-Wild](https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild), and [JA-VG-VQA-500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500). We used `gpt-4o-2024-05-13` as the judge model (LLM-as-a-judge).
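For reference, a generic LLM-as-a-judge call looks roughly like the sketch below. This is only a hedged illustration using the OpenAI Python SDK; the rubric and the `judge_answer` helper are assumptions for illustration and are not the exact judging prompts or scoring rules of the benchmarks above.

```python
# Hedged illustration of an LLM-as-a-judge call with the OpenAI Python SDK.
# The rubric below is NOT the official protocol of Heron Bench or the other
# benchmarks; it only shows the general shape of such an evaluation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, reference: str, answer: str) -> str:
    prompt = (
        "You are grading an answer to a question about an image.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Rate the model answer on a scale of 1 to 5 and reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",  # the judge model used in our evaluations
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```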
### Heron Bench

| Models | LLM-as-a-judge score (%) |
|---|:---:|
| [Japanese InstructBLIP Alpha](https://huggingface.co/stabilityai/japanese-instructblip-alpha) | 14.0 |
| [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | 24.2 |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 39.3 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 43.3 |
| **llm-jp-3-vila-14b (Ours)** | 57.2 |
| GPT-4o | 87.6 |

### JA-VLM-Bench-In-the-Wild

| Models | ROUGE-L | LLM-as-a-judge score (/5.0) |
|---|:---:|:---:|
| [Japanese InstructBLIP Alpha](https://huggingface.co/stabilityai/japanese-instructblip-alpha) | 20.8 | 2.42 |
| [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | 23.3 | 2.47 |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 41.4 | 2.92 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 47.2 | 3.15 |
| **llm-jp-3-vila-14b (Ours)** | 52.3 | 3.69 |
| GPT-4o | 37.6 | 3.85 |

### JA-VG-VQA-500

| Models | ROUGE-L | LLM-as-a-judge score (/5.0) |
|---|:---:|:---:|
| [Japanese InstructBLIP Alpha](https://huggingface.co/stabilityai/japanese-instructblip-alpha) | -- | -- |
| [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | -- | -- |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 23.5 | 2.96 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 17.4 | 3.21 |
| **llm-jp-3-vila-14b (Ours)** | 16.2 | 3.62 |
| GPT-4o | 12.1 | 3.58 |

## Risks and Limitations

The model released in this repository is in an early stage of our research and development. It has not been tuned to ensure that its outputs are aligned with social norms, ethical standards, and the law.

## License

The weights of this model are released under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). In addition, users of this model must comply with [the OpenAI terms of use](https://openai.com/policies/terms-of-use) because the model was trained on synthetic data generated by OpenAI GPT-4.

## Additional information

Regarding the license of the [synthdog-ja](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja) dataset, there is no explicit license statement in the dataset documentation. We attempted to contact the main corresponding author of "OCR-free Document Understanding Transformer" for clarification, but received no response.

Based on the following considerations:

1. The [donut-base](https://huggingface.co/naver-clova-ix/donut-base) model trained on this dataset is released under the MIT license.
2. The Wikipedia articles used in the dataset are licensed under CC-BY-SA.

we have determined that the synthdog-ja dataset is most likely governed by the CC-BY-SA license, and proceeded with training under this assumption.