LLaVA Model Card

SGLang

This repository contains the files needed to run LLaVA-1.6 34B on SGLang. You can start the server with the following command:

python -m sglang.launch_server --model-path dillonlaird/hf-llava-v1.6-34b --port 30000

There seem to be issues with the chat formatting when using the SGLang interface, so I recommend querying the server directly and formatting the prompt string yourself:

import requests
from transformers import AutoTokenizer


def generate(image_path: str, prompt: str, tokenizer):
    chat = [
        {"role": "system", "content": "Answer the question."},
        {"role": "user", "content": "<image>\n" + prompt},
    ]
    chat_str = tokenizer.apply_chat_template(chat, tokenize=False)
    # Open the assistant turn using the ChatML start-of-turn marker.
    chat_str += "<|im_start|>assistant\n"
    sampling_params = {"temperature": 0.2, "max_new_tokens": 1536}
    res = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": chat_str,
            "image_data": image_path,
            "sampling_params": sampling_params,
        },
    )
    return res.json()["text"]


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b")
    image_path = "path/to/image.jpg"
    prompt = "What is the name of the mountain?"
    desc = generate(image_path, prompt, tokenizer)
    print(desc)
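
If you want to inspect the prompt string, or build it without loading the tokenizer, the following is a minimal sketch of the formatting step. It assumes the tokenizer's chat template is plain ChatML (`<|im_start|>role`, content, `<|im_end|>`), which is the format documented for the base LLM; I have not verified it against this tokenizer. `build_chatml_prompt` and `build_payload` are hypothetical helper names for illustration, not part of SGLang or transformers.

```python
from typing import Dict, List


def build_chatml_prompt(messages: List[Dict[str, str]]) -> str:
    """Render messages as ChatML and open the assistant turn (assumed format)."""
    parts = []
    for message in messages:
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>\n"
        )
    # Generation prompt: the model continues from here as the assistant.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def build_payload(image_path: str, prompt: str) -> dict:
    """Assemble the JSON body for POST /generate, mirroring the example above."""
    chat = [
        {"role": "system", "content": "Answer the question."},
        {"role": "user", "content": "<image>\n" + prompt},
    ]
    return {
        "text": build_chatml_prompt(chat),
        "image_data": image_path,
        "sampling_params": {"temperature": 0.2, "max_new_tokens": 1536},
    }


if __name__ == "__main__":
    payload = build_payload("path/to/image.jpg", "What is the name of the mountain?")
    print(payload["text"])
```

The returned dictionary can be passed as the `json=` argument of `requests.post`. Before relying on it, compare `payload["text"]` with the output of `tokenizer.apply_chat_template` for the same messages.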

Model details

Model type: LLaVA is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

Base LLM: NousResearch/Nous-Hermes-2-Yi-34B

Model date: LLaVA-v1.6-34B was trained in December 2023.

Paper or resources for more information: https://llava-vl.github.io/

License

NousResearch/Nous-Hermes-2-Yi-34B license.

Where to send questions or comments about the model: https://github.com/haotian-liu/LLaVA/issues

Intended use

Primary intended uses: The primary use of LLaVA is research on large multimodal models and chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

  • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
  • 158K GPT-generated multimodal instruction-following data.
  • 500K academic-task-oriented VQA data mixture.
  • 50K GPT-4V data mixture.
  • 40K ShareGPT data.

Evaluation dataset

A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.
