---
datasets:
- FoundationVision/groma_instruct
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
---

This repository contains the model from the paper [Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models](https://huggingface.co/papers/2404.13013).