EraX-VL-2B-V1.5
Introduction
Hot on the heels of the popular EraX-VL-7B-V1.0 model, we proudly present EraX-VL-2B-V1.5. This enhanced multimodal model offers robust OCR and VQA capabilities across diverse languages, with a significant advantage in processing Vietnamese. The EraX-VL-2B model stands out for its precise recognition across a range of documents, including medical forms, invoices, bills of sale, quotes, and medical records. This functionality is expected to be highly beneficial for hospitals, clinics, insurance companies, and similar applications. Built on the solid foundation of Qwen/Qwen2-VL-2B-Instruct [1], which we found to be of high quality and fluent in Vietnamese, EraX-VL-2B has been fine-tuned to enhance its performance. We plan to continue improving and releasing new versions for free, and to share performance benchmarks in the near future.
One standout feature of EraX-VL-2B-V1.5 is its ability to handle multi-turn Q&A with reasonable reasoning capability despite its small size of just over 2 billion parameters.
NOTA BENE:
- EraX-VL-2B-V1.5 is NOT a typical OCR-only tool like Tesseract; it is a multimodal LLM-based model. To use it effectively, you may have to tweak your prompt carefully depending on your task.
- This model has NOT been fine-tuned on medical (X-ray) or car accident datasets (yet). Stay tuned for an updated version coming sometime in 2025.
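Because the model is prompt-driven, one way to tailor it to a task is to name the fields you want back. The sketch below is only an illustration: `build_extraction_prompt` and its wording are hypothetical, not prompts published by the authors, and the best phrasing for a given document type is something to find by experiment.

```python
def build_extraction_prompt(document_type, fields):
    """Compose a task-specific instruction (hypothetical wording, not an official prompt)."""
    if not fields:
        raise ValueError("fields must contain at least one field name")
    # Quote each field name so the model sees the exact keys to return.
    field_list = ", ".join(f'"{name}"' for name in fields)
    return (
        f"You are reading a scanned {document_type}. "
        f"Extract the following fields and return them as a JSON object "
        f"with exactly these keys: {field_list}. "
        "Use an empty string for any field that is not visible."
    )


prompt = build_extraction_prompt("invoice", ["invoice_number", "issue_date", "total_amount"])
print(prompt)
```

The resulting string would go into the `"text"` entry of the user message shown in the Quickstart section.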
EraX-VL-2B-V1.5 is a young, compact member of EraX's LànhGPT collection of LLMs.
- Developed by:
  - Nguyễn Anh Nguyên ([email protected])
  - Nguyễn Hồ Nam (BCG)
  - Phạm Huỳnh Nhật ([email protected])
  - Phạm Đình Thục ([email protected])
- Funded by: Bamboo Capital Group and EraX
- Model type: Multimodal Transformer with over 2B parameters
- Languages (NLP): Primarily Vietnamese with multilingual capabilities
- License: Apache 2.0
- Fine-tuned from: Qwen/Qwen2-VL-2B-Instruct
- Prompt examples: Some popular prompt examples.
Benchmarks
Leaderboard
Models | Open-Source | VI-MTVQA |
---|---|---|
EraX-VL-7B-V1.5 ๐ฅ | โ | 47.2 |
Qwen2-VL 72B ๐ฅ | โ | 41.6 |
ViGPT-VL ๐ฅ | โ | 39.1 |
EraX-VL-2B-V1.5 | โ | 38.2 |
EraX-VL-7B-V1 | โ | 37.6 |
Vintern-1B-V2 | โ | 37.4 |
Qwen2-VL 7B | โ | 30.0 |
Claude3 Opus | โ | 29.1 |
GPT-4o mini | โ | 29.1 |
GPT-4V | โ | 28.9 |
Gemini Ultra | โ | 28.6 |
InternVL2 76B | โ | 26.9 |
QwenVL Max | โ | 23.5 |
Claude3 Sonnet | โ | 20.8 |
QwenVL Plus | โ | 18.1 |
MiniCPM-V2.5 | โ | 15.3 |
The test code for evaluating models in the paper can be found in: EraX-JS-Company/EraX-MTVQA-Benchmark
API trial
Please contact [email protected] for API access inquiries.
Examples
1. OCR - Optical Character Recognition for Multi-Images
Example 01: Citizen identification card
Front View
Back View
Source: Google Support
{
  "Số thẻ":"037094012351"
  "Họ và tên":"TRỊNH QUANG DUY"
  "Ngày sinh":"04/09/1994"
  "Giới tính":"Nam"
  "Quốc tịch":"Việt Nam"
  "Quê quán / Place of origin":"Tân Thành, Kim Sơn, Ninh Bình"
  "Nơi thường trú / Place of residence":"Xóm 6 Tân Thành, Kim Sơn, Ninh Bình"
  "Có giá trị đến":"04/09/2034"
  "Đặc điểm nhân dạng / Personal identification":"seo chấm c:1cm trên đuôi mắt trái"
  "Cục trưởng cục cảnh sát quản lý hành chính về trật tự xã hội":"Nguyễn Quốc Hùng"
  "Ngày cấp":"10/12/2022"
}
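Both sides of a card can be sent in a single request by placing several image entries in one user message. The sketch below assumes the Qwen2-VL message format used in the Quickstart section; `image_to_data_uri`, `build_multi_image_message`, the placeholder URIs and the prompt text are hypothetical and only meant to illustrate the idea.

```python
import base64
from pathlib import Path


def image_to_data_uri(image_path):
    """Read an image file and wrap its bytes in a base64 data URI."""
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:image;base64,{encoded}"


def build_multi_image_message(image_uris, prompt):
    """Build one user turn holding several images followed by a single text prompt."""
    content = [{"type": "image", "image": uri} for uri in image_uris]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Placeholder URIs; in practice use image_to_data_uri("front.jpg") and so on.
messages = build_multi_image_message(
    ["data:image;base64,AAAA", "data:image;base64,BBBB"],
    "Extract the information from the front and back of this ID card as JSON.",
)
print(len(messages[0]["content"]))  # 2 images + 1 text entry
```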
Example 01: Identity Card
Front View
Back View
Source: Internet
{
  "Số":"272737384"
  "Họ tên":"PHẠM NHẬT TRƯỜNG"
  "Sinh ngày":"08-08-2000"
  "Nguyên quán":"Tiền Giang"
  "Nơi ĐKHK thường trú":"393, Tân Xuân, Bảo Bình, Cẩm Mỹ, Đồng Nai"
  "Dân tộc":"Kinh"
  "Tôn giáo":"Không"
  "Đặc điểm nhận dạng":"Nốt ruồi c.3,5cm trên sau cánh mũi phải."
  "Ngày cấp":"30 tháng 01 năm 2018"
  "Giám đốc CA":"T.BÌNH ĐỊNH"
}
Example 02: Driver's License
Front View
Back View
Source: Báo Pháp luật
{
  "No.":"400116012313"
  "Fullname":"NGUYỄN VĂN DŨNG"
  "Date_of_birth":"08/06/1979"
  "Nationality":"VIỆT NAM"
  "Address":"X. Quỳnh Hầu, H. Quỳnh Lưu, T. Nghệ An
Nghệ An, ngày/date 23 tháng/month 04 năm/year 2022"
  "Hang_Class":"FC"
  "Expires":"23/04/2027"
  "Place_of_issue":"Nghệ An"
  "Date_of_issue":"ngày/date 23 tháng/month 04 năm/year 2022"
  "Signer":"Trần Anh Tuấn"
  "Các loại xe được phép":"Ô tô hạng C kéo rơmoóc, đầu kéo kéo sơmi rơmoóc và xe hạng B1, B2, C, FB2 (Motor vehicle of class C with a trailer, semi-trailer truck and vehicles of classes B1, B2, C, FB2)"
  "Mã số":""
}
Example 03: Vehicle Registration Certificate
Source: Báo Vietnamnet
{
  "Tên chủ xe":"NGUYỄN TÔN NHUẬN"
  "Địa chỉ":"KE27 Kp3 P.TTTây Q7"
  "Nhãn hiệu":"HONDA"
  "Số loại":"DYLAN"
  "Màu sơn":"Trắng"
  "Số người được phép chở":"02"
  "Nguồn gốc":"Xe nhập mới"
  "Biển số đăng ký":"59V1-498.89"
  "Đăng ký lần đầu ngày":"08/06/2004"
  "Số máy":"F03E-0057735"
  "Số khung":"5A04F-070410"
  "Dung tích":"152"
  "Quản lý":"TRƯỞNG CA QUẬN"
  "Thượng tá":"Trần Văn Hiểu"
}
Example 04: Birth Certificate
Source: https://congchung247.com.vn
{
  "name": "NGUYỄN NAM PHƯƠNG",
  "gender": "Nữ",
  "date_of_birth": "08/6/2011",
  "place_of_birth": "Bệnh viện Việt - Pháp Hà Nội",
  "nationality": "Việt Nam",
  "father_name": "Nguyễn Ninh Hồng Quang",
  "father_dob": "1980",
  "father_address": "309 nhà E2 Bạch Khoa - Hai Bà Trưng - Hà Nội",
  "mother_name": "Phạm Thùy Trang",
  "mother_dob": "1984",
  "mother_address": "309 nhà E2 Bạch Khoa - Hai Bà Trưng - Hà Nội",
  "registration_place": "UBND phường Bạch Khoa - Quận Hai Bà Trưng - Hà Nội",
  "registration_date": "05/8/2011",
  "registration_ralation": "cha",
  "notes": None,
  "certified_by": "Nguyễn Thị Kim Hoa"
}
Quickstart
Install the necessary packages:
python -m pip install git+https://github.com/huggingface/transformers accelerate
python -m pip install qwen-vl-utils
python -m pip install flash-attn --no-build-isolation  # optional, only needed for attn_implementation="flash_attention_2"
Then you can use EraX-VL-2B-V1.5 like this:
import base64

import torch
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_path = "erax/EraX-VL-2B-V1.5"

# Load the model in bfloat16 and let accelerate place it on the available device(s).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # use "flash_attention_2" on Ampere or newer GPUs with flash-attn installed
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Bound the image resolution. Qwen2-VL spends roughly one visual token per
# 28x28-pixel area, so these limits allow about 256 to 1280 visual tokens per image.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

# Encode the image as a base64 data URI that qwen_vl_utils can read.
image_path = "image.jpg"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
decoded_image_text = encoded_image.decode("utf-8")
base64_data = f"data:image;base64,{decoded_image_text}"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": base64_data,
            },
            {
                "type": "text",
                # Vietnamese prompt: "Extract the content from the provided image."
                "text": "Trích xuất thông tin nội dung từ hình ảnh được cung cấp.",
            },
        ],
    }
]

# Prepare the prompt and the vision inputs
tokenized_text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[tokenized_text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

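# Generation settings
generation_config = model.generation_config
generation_config.do_sample = True
generation_config.temperature = 1.0
generation_config.top_k = 1  # with top_k = 1, sampling reduces to picking the most likely token
generation_config.top_p = 0.9
generation_config.min_p = 0.1
# transformers has no best_of option (that setting belongs to serving engines such as vLLM);
# use num_return_sequences if several candidates are needed.
generation_config.max_new_tokens = 2048
generation_config.repetition_penalty = 1.06
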
# Inference
generated_ids = model.generate(**inputs, generation_config=generation_config)
# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
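Multi-turn Q&A can be done by appending the model's reply and the next question to the same message list, then repeating the prompt-preparation and generation steps. The sketch below assumes the chat-message format used above; `add_follow_up` is a hypothetical helper and the example strings are placeholders.

```python
import copy


def add_follow_up(messages, assistant_reply, next_question):
    """Return a new conversation extended with the model's reply and a follow-up question."""
    extended = copy.deepcopy(messages)  # leave the caller's history untouched
    extended.append(
        {"role": "assistant", "content": [{"type": "text", "text": assistant_reply}]}
    )
    extended.append(
        {"role": "user", "content": [{"type": "text", "text": next_question}]}
    )
    return extended


# Placeholder first turn; in practice reuse `messages` and `output_text[0]` from above.
first_turn = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,AAAA"},
            {"type": "text", "text": "Extract the information from this image."},
        ],
    }
]
conversation = add_follow_up(first_turn, '{"name": "..."}', "What is the date of issue?")
print([turn["role"] for turn in conversation])  # ['user', 'assistant', 'user']
```

The extended list can then be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as the single-turn example.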
References
[1] Qwen team. Qwen2-VL. 2024.
[2] Bai, Jinze, et al. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv preprint arXiv:2308.12966 (2023).
[4] Yang, An, et al. "Qwen2 technical report." arXiv preprint arXiv:2407.10671 (2024).
[5] Chen, Zhe, et al. "InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[6] Chen, Zhe, et al. "How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites." arXiv preprint arXiv:2404.16821 (2024).
[7] Tran, Chi, and Huong Le Thanh. "LaVy: Vietnamese Multimodal Large Language Model." arXiv preprint arXiv:2404.07922 (2024).
Contact
- For correspondence regarding this work or inquiries about the API trial, please contact Nguyễn Anh Nguyên at [email protected].
- Follow us on EraX GitHub