Groma: Localized Visual Tokenization
for Grounding Multimodal Large Language Models

Chuofan Ma1,    Yi Jiang2†,    Jiannan Wu1,    Zehuan Yuan2,    Xiaojuan Qi1†
1The University of Hong Kong,    2ByteDance Inc.

Abstract

We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception abilities. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. These capabilities are built upon a localized visual tokenization mechanism, where an image is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, Groma can seamlessly understand user-specified region inputs and ground its textual output to images. In addition, to enhance Groma's grounded chat ability, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external modules for localization, Groma consistently demonstrates superior performance on standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization.

Teaser image.

Groma is a multimodal large language model with exceptional region understanding and visual grounding capabilities. It accepts user-specified region inputs (boxes) and generates long-form responses that are grounded in the visual context.



Method

A novel paradigm for MLLMs. Instead of relying on LLMs or external modules for localization, Groma exploits the spatial understanding capability of the visual tokenizer. This perceive-then-understand design also resembles the human visual process.

A conceptual overview of different grounded MLLMs: (a) LLM for localization (e.g., Kosmos-2, Shikra); (b) external modules for localization (e.g., Lisa); and (c) visual tokenizer for localization (ours).

Framework. Groma encodes the image into both global image tokens and local region tokens. Specifically, a general-purpose region proposer is introduced to discover regions of interest, followed by a lightweight encoder for region tokenization. By integrating region tokens into user instructions and model responses, Groma unlocks referring and grounding abilities.
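
To make this pipeline concrete, below is a minimal PyTorch-style sketch of a localized visual tokenizer. It is an illustrative approximation, not Groma's actual implementation: the module names (backbone, proposer, region_proj), feature dimensions, and pooling choices are assumptions.

import torch.nn as nn
from torchvision.ops import roi_align


class LocalizedTokenizer(nn.Module):
    """Sketch: produce global image tokens plus one token per proposed region."""

    def __init__(self, backbone, proposer, dim=256, downsample=2):
        super().__init__()
        self.backbone = backbone              # high-resolution feature extractor -> (B, C, H, W)
        self.proposer = proposer              # class-agnostic box proposer -> list of (K, 4) boxes
        self.region_proj = nn.Linear(dim * 7 * 7, dim)  # lightweight region encoder (assumes C == dim)
        self.pool = nn.AvgPool2d(downsample)  # downsample image tokens fed to the LLM

    def forward(self, image):
        feat = self.backbone(image)           # (B, C, H, W) features from a high-res input
        boxes = self.proposer(feat)           # proposed regions of interest, in image coordinates
        scale = feat.shape[-1] / image.shape[-1]
        rois = roi_align(feat, boxes, output_size=7, spatial_scale=scale)
        region_tokens = self.region_proj(rois.flatten(1))  # one token per region
        coarse = self.pool(feat)                            # fewer image tokens for the LLM
        image_tokens = coarse.flatten(2).transpose(1, 2)    # (B, H'*W', C)
        return image_tokens, region_tokens, boxes

Both token types would then be projected into the LLM's embedding space, so the language model reasons over a compact global token sequence while localization is already resolved by the tokenizer.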

Highlights:

  • Balancing computation and input resolution. High-resolution images are critical for accurate localization, but processing such images is computationally intensive for LLMs. We address this issue by shifting localization to image tokenization. This allows us to utilize high-resolution image inputs for the image tokenizer while downsampling the image tokens for the LLM, which saves computation without sacrificing localization accuracy.
  • Decoupled design for specialized training. To obtain robust and precise localization capability, we pretrain Groma on large-scale detection data. Thanks to the decoupled design of perception and cognition within Groma, we circumvent the need to involve the LLM during detection pretraining. This allows Groma to benefit from pretraining on millions of bounding box annotations, a task that would be computationally prohibitive for classic MLLMs.
  • Unified interface for referring and grounding. Referring and grounding are two sides of the same coin: although they differ in task form, they demand the same type of knowledge, namely localized understanding. Therefore, instead of having separate designs for referring and grounding, Groma seamlessly unifies the two capabilities with region tokens (illustrated in the sketch after this list).
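
As a hypothetical illustration of this unified interface, the snippet below shows how region tokens might appear in prompts and responses. The <rK> notation and the cited_boxes helper are assumptions made for illustration; Groma's actual token format may differ.

import re

# Referring: the user mentions a proposed region directly in the instruction.
referring_prompt = "What is the person in region <r3> holding?"

# Grounding: the model cites region tokens inline, which map back to boxes.
grounded_response = "A man <r3> is holding a red umbrella <r7> next to a bench <r1>."


def cited_boxes(text, boxes):
    """Map <rK> references in the text back to their bounding boxes."""
    return {ref: boxes[int(ref[2:-1])] for ref in re.findall(r"<r\d+>", text)}


boxes = {i: (0.0, 0.0, 1.0, 1.0) for i in range(8)}  # placeholder (x1, y1, x2, y2) boxes
print(cited_boxes(grounded_response, boxes))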

Data

Groma Instruct. We curate 30k visually grounded conversations for instruction finetuning. To our knowledge, this is the first grounded chat dataset constructed with both visual and textual prompts, leveraging the powerful GPT-4V for data generation.
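
To give a sense of what such data could look like, here is an illustrative record layout for a visually grounded conversation. The field names, path, and values are hypothetical and do not reflect the released schema of Groma Instruct.

# Illustrative sample: an image, candidate regions, and a conversation
# whose assistant turns cite those regions inline.
sample = {
    "image": "images/000001.jpg",          # hypothetical image path
    "regions": {                           # region id -> normalized (x1, y1, x2, y2) box
        "r0": [0.12, 0.30, 0.45, 0.88],
        "r1": [0.55, 0.10, 0.92, 0.64],
    },
    "conversation": [
        {"role": "user",
         "content": "Describe what is happening in this picture."},
        {"role": "assistant",
         "content": "A cyclist <r0> rides past a food stall <r1> on a busy street."},
    ],
}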

Qualitative Results

Referring Expression Comprehension

Region Description

Referential Dialogue

Grounded Chat

Quantitative Results

BibTeX

@misc{ma2024groma,
      title={Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models},
      author={Chuofan Ma and Yi Jiang and Jiannan Wu and Zehuan Yuan and Xiaojuan Qi},
      year={2024},
      eprint={2404.13013},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}