	---
	license: llama3.2
	base_model:
	- meta-llama/Llama-3.2-11B-Vision-Instruct
	language:
	- en
	- ko
	tags:
	- vlm-ko
	- meta
	- llama-3.2
	- llama-3.2-ko
	datasets:
	- maum-ai/General-Evol-VQA
	---
	<p align="left">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/646484cfb90150b2706df03b/BEOyMpnnY9VY2KXlc3V2F.png" width="20%"/>
	</p>

	# Llama-3.2-MAAL-11B-Vision-v0.1
	Llama-3.2-MAAL-11B-Vision-v0.1 is a bilingual multimodal model trained for text and visual understanding in Korean and English. We are releasing a [model](https://huggingface.co/maum-ai/Llama-3.2-MAAL-11B-Vision-v0.1), a subset of the [training dataset](https://huggingface.co/datasets/maum-ai/General-Evol-VQA), and a [leaderboard](https://huggingface.co/spaces/maum-ai/KOFFVQA-Leaderboard) to promote and accelerate the development of Korean Vision-Language Models (VLMs).

	- Developed by: [maum.ai Brain NLP](https://maum-ai.github.io). Jaeyoon Jung, Yoonshik Kim, Yekyung Nah
	- Language(s) (NLP): Korean, English (currently, bilingual)


	## Model Description

	Version 0.1 is fine-tuned on English and Korean VQA datasets, together with other datasets (OCR, math, etc.).

	- We trained this model on 8 H100-80G GPUs for 2 days on a multimodal fine-tuning dataset of image-text pairs.
	- [maum-ai/General-Evol-VQA](https://huggingface.co/datasets/maum-ai/General-Evol-VQA) is one of the datasets that we used for fine-tuning.

	## Sample inference code (GPU)
	With transformers >= 4.45.0, you can run inference to generate text from an image and a prompt you supply.

	```python
	import requests
	import torch
	from PIL import Image
	from transformers import MllamaForConditionalGeneration, AutoProcessor

	model_id = "maum-ai/Llama-3.2-MAAL-11B-Vision-v0.1"

	# Load the weights in bfloat16 and place them on the available GPU(s).
	model = MllamaForConditionalGeneration.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)
	processor = AutoProcessor.from_pretrained(model_id)

	# Download the sample image.
	url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
	image = Image.open(requests.get(url, stream=True).raw)

	# One user turn: an image placeholder followed by the text prompt.
	# The Korean prompt means "Write a poem about this image".
	messages = [
	    {"role": "user", "content": [
	        {"type": "image"},
	        {"type": "text", "text": "이 이미지에 대해서 시를 써줘"}
	    ]}
	]
	# The chat template already adds the special tokens,
	# so the processor is called with add_special_tokens=False.
	input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
	inputs = processor(
	    image,
	    input_text,
	    add_special_tokens=False,
	    return_tensors="pt"
	).to(model.device)

	output = model.generate(**inputs, max_new_tokens=200)
	# The decoded string contains the prompt followed by the generated answer.
	print(processor.decode(output[0]))
	```
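
The sample above decodes `output[0]`, which holds the prompt tokens followed by the newly generated ones. If only the model's answer is wanted, the prompt can be sliced off before decoding. The following is a minimal sketch of that slicing step; `strip_prompt` is a hypothetical helper name, not part of transformers, and plain lists stand in for token-id tensors so the snippet runs without a GPU.

```python
def strip_prompt(sequence, prompt_length):
    """Return only the tokens that follow the first `prompt_length` entries."""
    return sequence[prompt_length:]


# Toy token ids (made up for illustration) standing in for real tensors.
prompt_ids = [101, 7, 8, 9]
full_output = prompt_ids + [42, 43, 2]  # prompt followed by generated tokens

new_tokens = strip_prompt(full_output, len(prompt_ids))
print(new_tokens)  # [42, 43, 2]
```

With the inference code above, the same slice should apply to the real tensors: take `prompt_length = inputs["input_ids"].shape[-1]`, then call `processor.decode(strip_prompt(output[0], prompt_length), skip_special_tokens=True)` to print the answer without the prompt or special tokens.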

	## Evaluation Results
	The main goal of version 0.1 is to strengthen Korean VQA and OCR capabilities for real-world business use cases, so we use [KOFFVQA](https://huggingface.co/spaces/maum-ai/KOFFVQA-Leaderboard) to evaluate the model's Korean instruction-following skills.

	|Model|Params (B)|Average (↑)|
	|-|-|-|
	|NCSOFT/VARCO-VISION-14B|15.2|66.69|
	|Qwen/Qwen2-VL-7B-Instruct|8.3|63.53|
	|maum-ai/Llama-3.2-MAAL-11B-Vision-v0.1|10.7|61.13|
	|meta-llama/Llama-3.2-11B-Vision-Instruct|10.7|50.36|
	|mistralai/Pixtral-12B-2409|12.7|44.62|
	|llava-onevision-qwen2-7b-ov|8|43.78|
	|InternVL2-8b|8.1|32.76|
	|MiniCPM-V-2_6|8.1|32.69|

	Our model scores 61.13 on average, up from 50.36 for the base model, meta-llama/Llama-3.2-11B-Vision-Instruct: a gain of 10.77 points, or about 21% in relative terms.
	More results are available on the [KOFFVQA leaderboard](https://huggingface.co/spaces/maum-ai/KOFFVQA-Leaderboard).
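
The improvement over the base model can be recomputed from the two average scores in the table above. The short sketch below shows the arithmetic; the variable names are illustrative, and the scores are copied from the table rather than fetched from the leaderboard.

```python
# Average KOFFVQA scores taken from the table above.
maal_score = 61.13  # maum-ai/Llama-3.2-MAAL-11B-Vision-v0.1
base_score = 50.36  # meta-llama/Llama-3.2-11B-Vision-Instruct

# Absolute gain in points and relative gain as a percentage of the base score.
absolute_gain = maal_score - base_score
relative_gain_pct = absolute_gain / base_score * 100

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative gain: {relative_gain_pct:.1f}%")
```

The absolute gain is 10.77 points and the relative gain is about 21.4%, so the two figures should not be read interchangeably.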

	### We will release an enhanced model, v0.2, soon