---
license: other
license_name: krutrim-community-license-agreement-version-1.0
license_link: LICENSE.md
language:
- hi
- bn
- ta
- te
- gu
- or
- en
- as
- ml
- mr
- kn
pipeline_tag: image-text-to-text
---

# Chitrarth: Bridging Vision and Language for a Billion People

[Paper Link👁️](https://arxiv.org/abs/2502.15392) | [Hugging Face](https://huggingface.co/krutrim-ai-labs/chitrarth) | [GitHub](https://github.com/ola-krutrim/Chitrarth) | [API Platform](https://cloud.olakrutrim.com/console/inference-service?section=models&modelName=Krutrim&artifactName=chitrarth&artifactType=model) | [Project Page](https://ai-labs.olakrutrim.com/models/Chitrarth-1)

## 1. Introduction

Chitrarth (Chitra: Image; Artha: Meaning) is a multilingual VLM that integrates a state-of-the-art multilingual Large Language Model (LLM) with a vision module. The model is trained primarily on multilingual image-text data and is designed to work across 10 prominent Indian languages (Hindi, Bengali, Telugu, Tamil, Marathi, Gujarati, Kannada, Malayalam, Odia, and Assamese) as well as English.

[Watch on YouTube](https://www.youtube.com/watch?v=TmzEweLIgsc)

## 2. Model Summary

### Key Features

- **Model:** Krutrim-1 as the base LLM, SigLIP as the visual encoder with a 2-layer MLP
- **Languages Supported:** 10 Indic languages (Hindi, Bengali, Telugu, Tamil, Marathi, Gujarati, Kannada, Malayalam, Odia, and Assamese) as well as English
- **Usage:** General-purpose VLM

## 3. API Platform

Visit [Chitrarth Online](https://cloud.olakrutrim.com/console/inference-service?section=models&modelName=Krutrim&artifactName=chitrarth&artifactType=model) to access the model via the web interface.

## 4. Inference Code

```
git clone https://github.com/ola-krutrim/Chitrarth.git
conda create --name chitrarth python=3.10
conda activate chitrarth
cd Chitrarth
pip install -e .
python chitrarth/inference.py --model-path "krutrim-ai-labs/chitrarth" --image-file "assets/govt_school.jpeg" --query "Explain the image."
```
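If you prefer to call the model from Python instead of the shell, the snippet below is a minimal sketch that simply wraps the same `inference.py` command with `subprocess`. It assumes the setup steps above have been completed inside the `chitrarth` environment; the script path and flags are taken from the command shown above, and the helper name is purely illustrative.

```python
# Minimal sketch: wrap the CLI command above in a Python helper.
# Assumes the repository is cloned and installed as described above;
# the script path and flags come from the inference command shown in section 4.
import subprocess

def run_chitrarth(image_path: str, query: str) -> str:
    """Run chitrarth/inference.py on one image-query pair and return its stdout."""
    result = subprocess.run(
        [
            "python", "chitrarth/inference.py",
            "--model-path", "krutrim-ai-labs/chitrarth",
            "--image-file", image_path,
            "--query", query,
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(run_chitrarth("assets/govt_school.jpeg", "Explain the image."))
```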
## 5. Evaluation Results

Chitrarth performs strongly against SOTA VLMs on standard academic multimodal tasks: it consistently outperforms IDEFICS 2 (7B) and PALO 7B across benchmarks while remaining competitive on TextVQA and VizWiz.

We also introduce **BharatBench**, a comprehensive evaluation benchmark suite designed for **10 under-resourced Indic languages** across **3 tasks**. Chitrarth's performance on the BharatBench evaluation framework sets a strong baseline for future research in this domain, and our model is unique in its ability to handle all of the included languages.

Below are the results of **Chitrarth** on BharatBench across the three evaluation tasks: **POPE**, **LLaVA-Bench**, and **MMVet**.

| **Language** | **POPE** | **LLaVA-Bench** | **MMVet** |
|----------------|----------|-----------------|-----------|
| **Telugu** | 79.9 | 54.8 | 43.76 |
| **Hindi** | 78.68 | 51.5 | 38.85 |
| **Bengali** | 83.24 | 53.7 | 33.24 |
| **Malayalam** | 85.29 | 55.5 | 25.36 |
| **Kannada** | 85.52 | 58.1 | 46.19 |
| **Assamese** | 55.59 | 59.1 | 37.29 |
| **Tamil** | 83.28 | 58.3 | 34.31 |
| **Marathi** | 79.17 | 52.8 | 40.96 |
| **Gujarati** | 84.75 | 55.9 | 39.03 |
| **Odia** | 82.03 | 62.8 | 19.67 |
| **English** | 87.63 | 67.9 | 30.49 |

## 6. License

This code repository and the model weights are licensed under the [Krutrim Community License](LICENSE.md).

## 7. Citation

```
@inproceedings{khan2024chitrarth,
  title={Chitrarth: Bridging Vision and Language for a Billion People},
  author={Shaharukh Khan and Ayush Tarun and Abhinav Ravi and Ali Faraz and Praveen Kumar Pokala and Anagha Bhangare and Raja Kolla and Chandra Khatri and Shubham Agarwal},
  booktitle={NeurIPS Multimodal Algorithmic Reasoning},
  year={2024}
}
```

## 8. Contact

Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.

## 9. Acknowledgement

Chitrarth is built with reference to the code of the following projects: [Transformers](https://github.com/huggingface/transformers) and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!