EchoSight: Advancing Visual-Language Models with Wiki Knowledge

Shanghai Jiao Tong University, Shanghai AI Lab
arXiv Preprint

Abstract

Knowledge-based Visual Question Answering (KVQA) tasks require answering questions about images using extensive background knowledge. Despite significant advancements, generative models often struggle with these tasks due to the limited integration of external knowledge. In this paper, we introduce EchoSight, a novel multimodal Retrieval-Augmented Generation (RAG) framework that enables large language models (LLMs) to answer visual questions requiring fine-grained encyclopedic knowledge. To achieve high-performing retrieval, EchoSight first searches wiki articles using visual-only information; these candidate articles are then reranked according to their relevance to the combined text-image query. This approach significantly improves the integration of multimodal knowledge, leading to enhanced retrieval outcomes and more accurate VQA responses. Our experimental results on the Encyclopedic VQA and InfoSeek datasets demonstrate that EchoSight establishes new state-of-the-art results in knowledge-based VQA, achieving an accuracy of 41.8% on Encyclopedic VQA and 31.3% on InfoSeek.

VQA with Multimodal Reranking

Teaser

For visual questions such as “When was the 1st ascent of this mountain?”, visual-only search methods consider only image similarity, ignoring the textual details of the accompanying article. By incorporating multimodal reranking, which accounts for both visual and textual information, the correct entry can be accurately identified.

Method

Overall Structure

An overview of our proposed EchoSight. (i) Given a visual question with a reference image, the retriever searches the knowledge base for the top-k images most similar to the reference image and collects their corresponding Wikipedia entries. (ii) The retrieved entries are split into sections, which are reranked by the maximum pairwise similarity between their textual embeddings and the Q-Former query tokens of the reference image and question. (iii) The top-ranked section is then used as the RAG prompt for the LLM to generate the final answer.
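
To make the pipeline concrete, below is a minimal Python sketch of the retrieve-then-rerank procedure, assuming cosine-similarity retrieval over precomputed image embeddings. The helpers encode_image, encode_text, and qformer_query_tokens, as well as the knowledge_base and llm objects, are hypothetical placeholders for illustration, not the released EchoSight implementation.

import torch
import torch.nn.functional as F

# Hypothetical helpers (placeholders, not the released EchoSight API):
#   encode_image(img)            -> (1, d) visual embedding
#   encode_text(text)            -> (1, d) textual embedding
#   qformer_query_tokens(img, q) -> (n_tokens, d) Q-Former query tokens
#   knowledge_base               -> .image_embeddings (N, d), .entries[i].sections
#   llm                          -> .generate(prompt) -> str

def answer_visual_question(ref_image, question, knowledge_base, llm, k=10):
    # (i) Visual-only retrieval: find the top-k knowledge-base images most
    # similar to the reference image and collect their Wikipedia entries.
    img_emb = F.normalize(encode_image(ref_image), dim=-1)              # (1, d)
    kb_embs = F.normalize(knowledge_base.image_embeddings, dim=-1)      # (N, d)
    topk = torch.topk(img_emb @ kb_embs.T, k=k, dim=-1).indices[0]
    entries = [knowledge_base.entries[i] for i in topk]

    # (ii) Multimodal reranking at section granularity: score each section by
    # the maximum pairwise similarity between its text embedding and the
    # Q-Former query tokens of the (image, question) pair.
    q_tokens = F.normalize(qformer_query_tokens(ref_image, question), dim=-1)
    best_section, best_score = None, float("-inf")
    for entry in entries:
        for section in entry.sections:
            text_emb = F.normalize(encode_text(section), dim=-1)        # (1, d)
            score = (q_tokens @ text_emb.T).max().item()
            if score > best_score:
                best_section, best_score = section, score

    # (iii) Retrieval-augmented generation: prompt the LLM with the
    # top-ranked section as context.
    prompt = f"Context: {best_section}\nQuestion: {question}\nAnswer:"
    return llm.generate(prompt)

Keeping the first stage visual-only keeps retrieval cheap over a large knowledge base, while the section-level reranking reintroduces the textual constraints of the question before the LLM is prompted.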

Results

VQA Results

VQA accuracy comparison with state-of-the-art methods. The Google Lens method can be considered an upper bound. The Vanilla method indicates that the LLM generates answers directly from the textual questions only. BLIP-2 and LLaVA are strong vision-language models without retrieval augmentation. Wiki-LLaVA and DPRV+T* are recent works focusing on retrieval-augmented answer generation. Our proposed EchoSight is reported both without and with multimodal reranking.

Qualitative Results

Qualitative VQA Results

Demo Examples

Demo1
Demo2

BibTeX

@misc{yan2024echosightadvancingvisuallanguagemodels,
      title={EchoSight: Advancing Visual-Language Models with Wiki Knowledge},
      author={Yibin Yan and Weidi Xie},
      year={2024},
      eprint={2407.12735},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.12735},
}