CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models

Integrated Vision and Language Lab, KAIST
*Indicates Equal Contribution

NeurIPS 2024

We introduce a novel decoding method, CODE: COuntering DEscription Contrastive Decoding.

Abstract

Large Multi-modal Models (LMMs) have recently demonstrated remarkable abilities in visual context understanding and coherent response generation. However, alongside these advancements, hallucination has emerged as a significant challenge, producing erroneous responses unrelated to the visual content. In this paper, we introduce a novel contrastive decoding method, COuntering DEscription Contrastive Decoding (CODE), which leverages self-generated descriptions as contrasting references during the decoding phase of LMMs to address hallucination. CODE uses the model's own comprehensive description as a visual counterpart to correct responses and improve their alignment with the actual visual content. By dynamically adjusting the information flow and the distribution of next-token predictions over the LMM's vocabulary, CODE enhances the coherence and informativeness of generated responses. Extensive experiments demonstrate that our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs. CODE provides a simple yet effective decoding strategy that can be integrated into existing LMM frameworks without additional training.

CODE Overview

Overview

After the LMM generates a comprehensive description of the visual content by itself, the model recursively outputs logits conditioned on the visual content and on the description, respectively. By contrasting the two log-likelihoods, CODE produces more contextual and correct responses that match the given visual content while suppressing inconsistent words (“catching” → “hit”).
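A single decoding step of this contrast can be sketched as follows. This is a minimal illustration of the general description-contrastive idea, not the paper's exact formulation: the function name, the hyperparameters `alpha` (contrast strength) and `beta` (plausibility threshold), and the greedy token selection are all illustrative assumptions; CODE's dynamic adjustment of the contrast is not reproduced here.

```python
import numpy as np

def code_contrast_step(logits_visual, logits_desc, alpha=1.0, beta=0.1):
    """One contrastive decoding step (illustrative sketch, not the paper's exact method).

    logits_visual: next-token logits conditioned on the actual image.
    logits_desc:   next-token logits conditioned on the model's own
                   self-generated description instead of the image.
    alpha, beta:   assumed hyperparameters for contrast strength and
                   plausibility cutoff.
    """
    # Convert both logit vectors to log-probabilities.
    log_pv = logits_visual - np.log(np.sum(np.exp(logits_visual)))
    log_pd = logits_desc - np.log(np.sum(np.exp(logits_desc)))

    # Contrast: amplify evidence from the image branch and penalize tokens
    # that the description-only branch also predicts strongly, since those
    # may reflect language priors rather than the visual content.
    contrast = (1 + alpha) * log_pv - alpha * log_pd

    # Plausibility constraint: only keep tokens whose visual-branch
    # probability is within a factor beta of the most likely token.
    plausible = log_pv >= np.log(beta) + log_pv.max()
    contrast = np.where(plausible, contrast, -np.inf)

    # Greedy pick over the contrasted scores.
    return int(np.argmax(contrast))
```

In the toy call below, token 0 is strongly predicted by both branches (a likely language-prior token) while token 1 is supported mainly by the image branch, so the contrast favors token 1.

```python
next_token = code_contrast_step(np.array([2.0, 1.9, 0.0]),
                                np.array([2.0, 0.0, 0.0]))
```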

Performance Comparison

Comparison

Comprehensive experimental results across 6 baseline LMMs, 6 decoding methods, and 6 hallucination benchmarks, presented as spider charts.

Qualitative Results on MMVP

Qualitative Results on Realworld-QA

Qualitative Results on LLaVA-Bench (In-the-Wild)

BibTeX

@article{kim2024code,
  title={CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models},
  author={Kim, Junho and Kim, Hyunjun and Kim, Yeonju and Ro, Yong Man},
  journal={arXiv preprint arXiv:2406.01920},
  year={2024}
}