---
license: mit
language:
- en
- ar
base_model:
- qwen2-VL-7B
pipeline_tag: image-text-to-text
tags:
- LMM
- Arabic
---
Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often focusing narrowly on a few specific aspects of language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model: an English-Arabic bilingual LMM designed to excel across diverse domains. AIN leverages 3.6 million carefully constructed, high-quality Arabic-English multimodal data samples, demonstrating state-of-the-art Arabic performance while also possessing strong English-language visual capabilities.
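Because AIN is built on the Qwen2-VL-7B backbone and served through the image-text-to-text pipeline, it should be loadable with the standard Qwen2-VL classes in 🤗 Transformers. The snippet below is a minimal sketch under that assumption; the repo id `MBZUAI/AIN-7B`, the image path, and the prompt are illustrative placeholders, not values taken from this card.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Hypothetical checkpoint id; substitute the actual AIN-7B repository path.
MODEL_ID = "MBZUAI/AIN-7B"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One image plus an Arabic prompt ("Describe this image in detail.").
image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "صِف هذه الصورة بالتفصيل."},
        ],
    }
]

# Build the chat-formatted prompt, then pack text and image tensors together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the newly generated answer is decoded.
answer_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```

The same pattern works with an English prompt; only the text content of the user message changes.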
Performance of AIN-7B against leading closed- and open-source LMMs across eight domains (🥇 best and 🥈 second best per column):

| Models | VQA | OCR | Video | RS | CDT | Agro. | Cult. | Med. | Total |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 🥈55.15 | 🥈54.98 | 🥇69.65 | 🥈27.36 | 🥈62.35 | 🥈80.75 | 🥇80.86 | 🥇49.91 | 🥈60.13 |
| GPT-4o-mini | 48.83 | 39.38 | 🥈66.28 | 16.93 | 56.37 | 78.80 | 65.92 | 🥈47.37 | 52.49 |
| Gemini-1.5-Pro | 46.68 | 28.68 | 42.95 | 17.07 | 47.06 | 72.14 | 56.24 | 33.78 | 52.38 |
| Gemini-1.5-flash | 45.59 | 27.58 | 53.31 | 14.95 | 48.26 | 76.07 | 46.54 | 42.87 | 44.40 |
| InternVL-8B | 30.41 | 15.91 | 51.42 | 5.36 | 30.27 | 44.47 | 20.88 | 29.48 | 28.52 |
| InternVL2.5-1B | 27.22 | 19.45 | 38.20 | 3.39 | 30.75 | 39.53 | 35.68 | 21.27 | 26.94 |
| Qwen-VL-2B | 41.02 | 22.93 | 38.90 | 12.56 | 27.83 | 52.02 | 34.28 | 29.12 | 32.33 |
| AIN-7B (ours) | 🥇56.78 | 🥇72.35 | 64.09 | 🥇45.92 | 🥇64.10 | 🥇85.05 | 🥈78.09 | 43.77 | 🏆63.77 |