Highlights:

A novel paradigm of MLLMs. Instead of relying on LLMs or external modules for localization, Groma exploits the spatial understanding capability of the visual tokenizer. This perceive-then-understand design also resembles the human visual process.
Framework. Groma encodes the image into both global image tokens and local region tokens. Specifically, a general-purpose region proposer is introduced to discover regions of interest, followed by a lightweight encoder for region tokenization. By integrating region tokens into user instructions and model responses, Groma unlocks referring and grounding abilities.
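To make this flow concrete, here is a minimal sketch of region tokenization. It is illustrative only, not the official Groma implementation: the module name, feature shapes, and the simple average pooling over each proposed box are assumptions made for clarity.

    # Minimal sketch (not the official Groma code): pool image features inside
    # each proposed box into one region token that the LLM can attend to.
    import torch
    import torch.nn as nn


    class RegionEncoder(nn.Module):
        """Turns each proposed box into a single region token."""

        def __init__(self, feat_dim: int, token_dim: int):
            super().__init__()
            self.proj = nn.Linear(feat_dim, token_dim)

        def forward(self, feat_map: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
            # feat_map: (C, H, W) image features; boxes: (N, 4) in normalized xyxy.
            C, H, W = feat_map.shape
            tokens = []
            for x0, y0, x1, y1 in boxes.tolist():
                h0, h1 = int(y0 * H), max(int(y1 * H), int(y0 * H) + 1)
                w0, w1 = int(x0 * W), max(int(x1 * W), int(x0 * W) + 1)
                # Average-pool the features inside the box (simplest possible choice).
                tokens.append(feat_map[:, h0:h1, w0:w1].mean(dim=(1, 2)))
            return self.proj(torch.stack(tokens))  # (N, token_dim)


    if __name__ == "__main__":
        feat_map = torch.randn(256, 24, 24)           # global image features
        boxes = torch.tensor([[0.1, 0.1, 0.5, 0.6],   # proposals from a region proposer
                              [0.4, 0.2, 0.9, 0.8]])
        region_tokens = RegionEncoder(256, 4096)(feat_map, boxes)
        print(region_tokens.shape)  # (2, 4096): interleaved with text tokens for the LLM

In the actual model these region tokens are inserted into the token sequence alongside the global image tokens and the text, so the LLM can both consume user-specified regions (referring) and emit region tokens in its answers (grounding).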
Groma Instruct. We curate 30k visually grounded conversations for instruction finetuning. This is the first grounded chat dataset constructed with both visual and textual prompts, generated with the powerful GPT-4V.
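To illustrate what a visually grounded conversation looks like, below is a hypothetical example of a single training record. The field names and the <r0>/<r1> region markers are assumptions for illustration, not the actual Groma Instruct schema.

    # Hypothetical example of one grounded-conversation record (illustrative only).
    example_record = {
        "image": "images/example.jpg",                 # placeholder image path
        "regions": [                                   # boxes the dialogue can refer to
            {"id": "r0", "bbox": [120, 45, 310, 280]},
            {"id": "r1", "bbox": [400, 60, 520, 300]},
        ],
        "conversation": [
            {"role": "user",
             "content": "What is the person <r0> holding?"},
            {"role": "assistant",
             "content": "The person <r0> is holding an umbrella <r1> above their head."},
        ],
    }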
Citation:

@misc{ma2024groma,
      title={Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models},
      author={Chuofan Ma and Yi Jiang and Jiannan Wu and Zehuan Yuan and Xiaojuan Qi},
      year={2024},
      eprint={2404.13013},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}