Groma: Localized Visual Tokenization
for Grounding Multimodal Large Language Models

Chuofan Ma1,    Yi Jiang2†,    Jiannan Wu1,    Zehuan Yuan2,    Xiaojuan Qi1†
1The University of Hong Kong,    2ByteDance Inc.

Abstract

We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception abilities. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. These capabilities are built upon a localized visual tokenization mechanism, where an image is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, Groma can seamlessly understand user-specified region inputs and ground its textual output to images. In addition, to enhance Groma's grounded chat ability, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external modules for localization, Groma consistently demonstrates superior performance on standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization.

Teaser image.

Groma is a multimodal large language model with exceptional region understanding and visual grounding capabilities. It accepts user-specified region inputs (boxes) and generates long-form responses that are grounded in the visual context.



Method

A novel paradigm for MLLMs. Instead of relying on LLMs or external modules for localization, Groma exploits the spatial understanding capability of the visual tokenizer. This perceive-then-understand design also resembles the human visual process.

A conceptual overview of different grounded MLLMs: (a) LLM for localization (e.g., Kosmos-2, Shikra); (b) external modules for localization (e.g., Lisa); and (c) visual tokenizer for localization (ours).

Framework. Groma encodes the image into both global image tokens and local region tokens. Specifically, a general-purpose region proposer is introduced to discover regions of interest, followed by a lightweight encoder for region tokenization. By integrating region tokens into user instructions and model responses, Groma unlocks referring and grounding abilities.
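
To make this pipeline concrete, below is a minimal PyTorch-style sketch of a localized visual tokenizer. It is an illustrative approximation, not Groma's actual implementation: the module names (backbone, proposer, region_proj), feature dimensions, and pooling choices are assumptions.

import torch.nn as nn
from torchvision.ops import roi_align


class LocalizedTokenizer(nn.Module):
    """Sketch: produce global image tokens plus one token per proposed region."""

    def __init__(self, backbone, proposer, dim=256, downsample=2):
        super().__init__()
        self.backbone = backbone              # high-resolution feature extractor -> (B, C, H, W)
        self.proposer = proposer              # class-agnostic box proposer -> list of (K, 4) boxes
        self.region_proj = nn.Linear(dim * 7 * 7, dim)  # lightweight region encoder (assumes C == dim)
        self.pool = nn.AvgPool2d(downsample)  # downsample image tokens fed to the LLM

    def forward(self, image):
        feat = self.backbone(image)           # (B, C, H, W) features from a high-res input
        boxes = self.proposer(feat)           # proposed regions of interest, in image coordinates
        scale = feat.shape[-1] / image.shape[-1]
        rois = roi_align(feat, boxes, output_size=7, spatial_scale=scale)
        region_tokens = self.region_proj(rois.flatten(1))  # one token per region
        coarse = self.pool(feat)                            # fewer image tokens for the LLM
        image_tokens = coarse.flatten(2).transpose(1, 2)    # (B, H'*W', C)
        return image_tokens, region_tokens, boxes

Both token types would then be projected into the LLM's embedding space, so the language model reasons over a compact global token sequence while localization is already resolved by the tokenizer.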

Highlights:

  • Balancing computation and input resolution. High-resolution images are critical for accurate localization, but processing such images is computationally intensive for LLMs. We address this issue by shifting localization to image tokenization. This allows us to utilize high-resolution image inputs for the image tokenizer while downsampling the image tokens for the LLM, which saves computation without sacrificing localization accuracy.
  • Decoupled design for specialized training. To obtain robust and precise localization capability, we pretrain Groma on large-scale detection data. Thanks to the decoupled design of perception and cognition within Groma, we circumvent the need to involve the LLM during detection pretraining. This allows Groma to benefit from pretraining on millions of bounding box annotations, a task that would be computationally prohibitive for classic MLLMs.
  • Unified interface for referring and grounding. Referring and grounding are two sides of the same coin: although they differ in task form, they demand the same type of knowledge, namely localized understanding. Therefore, instead of having separate designs for referring and grounding, Groma seamlessly unifies the two capabilities with region tokens (illustrated in the sketch after this list).
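
As a hypothetical illustration of this unified interface, the snippet below shows how region tokens might appear in prompts and responses. The <rK> notation and the cited_boxes helper are assumptions made for illustration; Groma's actual token format may differ.

import re

# Referring: the user mentions a proposed region directly in the instruction.
referring_prompt = "What is the person in region <r3> holding?"

# Grounding: the model cites region tokens inline, which map back to boxes.
grounded_response = "A man <r3> is holding a red umbrella <r7> next to a bench <r1>."


def cited_boxes(text, boxes):
    """Map <rK> references in the text back to their bounding boxes."""
    return {ref: boxes[int(ref[2:-1])] for ref in re.findall(r"<r\d+>", text)}


boxes = {i: (0.0, 0.0, 1.0, 1.0) for i in range(8)}  # placeholder (x1, y1, x2, y2) boxes
print(cited_boxes(grounded_response, boxes))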

Data

Groma Instruct. We curate 30k visually grounded conversations for instruction finetuning. To our knowledge, this is the first grounded chat dataset constructed with both visual and textual prompts, leveraging the powerful GPT-4V for data generation.
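
To give a sense of what such data could look like, here is an illustrative record layout for a visually grounded conversation. The field names, path, and values are hypothetical and do not reflect the released schema of Groma Instruct.

# Illustrative sample: an image, candidate regions, and a conversation
# whose assistant turns cite those regions inline.
sample = {
    "image": "images/000001.jpg",          # hypothetical image path
    "regions": {                           # region id -> normalized (x1, y1, x2, y2) box
        "r0": [0.12, 0.30, 0.45, 0.88],
        "r1": [0.55, 0.10, 0.92, 0.64],
    },
    "conversation": [
        {"role": "user",
         "content": "Describe what is happening in this picture."},
        {"role": "assistant",
         "content": "A cyclist <r0> rides past a food stall <r1> on a busy street."},
    ],
}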

Qualitative Results

Referring Expression Comprehension

Region Description

Referential Dialogue

Grounded Chat

Quantitative Results

BibTeX

@misc{ma2024groma,
      title={Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models},
      author={Chuofan Ma and Yi Jiang and Jiannan Wu and Zehuan Yuan and Xiaojuan Qi},
      year={2024},
      eprint={2404.13013},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}