---
license: cc
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
---

# Model Card for LLaVA-Llama-3-8B

A reproduced LLaVA LVLM built on the Llama-3-8B LLM backbone. This is not an official implementation; please follow Haotian Liu's [official implementation](https://github.com/haotian-liu/LLaVA/tree/main) for more details and information.

## Model Details

Follows the LLaVA-1.5 pre-training and supervised fine-tuning data.

## How to Use

First, install `llava` via:

```
pip install git+https://github.com/Victorwz/LLaVA-Llama-3.git
```

You can then load the model and perform inference as follows:

```python
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from PIL import Image
import requests
import torch
from io import BytesIO

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/LLaVA-Llama-3-8B")
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LLaVA-Llama-3-8B", None, model_name, False, False, device=device)

# prepare text input; the "<image>" placeholder marks where the image features are spliced in
text = DEFAULT_IMAGE_TOKEN + '\n' + "Describe the image."
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(device)

# prepare image input
url = "https://upload.wikimedia.org/wikipedia/en/thumb/7/7d/Lenna_%28test_image%29.png/330px-Lenna_%28test_image%29.png"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert('RGB')
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().to(device)

# autoregressively generate text
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)
print(outputs[0])
```

Please refer to the forked [LLaVA-Llama-3](https://github.com/Victorwz/LLaVA-Llama-3) repo for further usage details. The data loading function and the fastchat conversation template were changed to accommodate the Llama-3 tokenizer.
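For context on that template change: the `llama_3` conversation template emits Llama-3 chat special tokens rather than the Vicuna-style separators used by LLaVA-1.5. Below is a rough sketch of what `conv.get_prompt()` returns for the example above, assuming the template follows the standard Llama-3 chat format; the fork's `llava/conversation.py` is the authoritative definition.

```
<|start_header_id|>user<|end_header_id|>

<image>
Describe the image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```

`tokenizer_image_token` then splits this prompt on the `<image>` placeholder and splices `IMAGE_TOKEN_INDEX` (-200) into the token ids, which the model swaps for the projected image features at generation time.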