RealWorldQA, What's New?

Community Article Published April 25, 2024

This is a short blog that introduces the RealWorldQA benchmark.

What is RealWorldQA?

RealWorldQA is a benchmark designed to evaluate the real-world spatial understanding capabilities of multimodal AI models, contributed by XAI. It assesses how well these models comprehend physical environments. The benchmark consists of 700+ images, each accompanied by a question and a verifiable answer. These images are drawn from real-world scenarios, including those captured from vehicles. The goal is to advance AI models' understanding of our physical world.

Statistics & Info

| Name | Type | #Questions | Data Quality* (manually verified 10% of samples) | Fine-grained Classes |
| --- | --- | --- | --- | --- |
| RealWorldQA | MCQ | 765 | > 97% | No |

TL;DR: **RealWorldQA** is a benchmark that requires VLMs to:

  1. Recognize details in high-resolution images (1080p, etc.).
  2. Reason over the recognition results (which may require commonsense knowledge).

*Data Quality: We manually verified 10% of the samples, checking whether each one is correct and unambiguous. Most samples (>97%) in RealWorldQA are correct and clear.

Here are some cases I found ambiguous:

*(image)*

  • Question: Where is the dog in relation to the door?
  • Choices: A. The dog is behind the door; B. The dog is next to the door; C. The dog is in front of the door.
  • Answer: A
  • Why ambiguous: The dog is actually between two doors.

*(image)*

  • Question: How far from the camera is the rightmost vehicle?
  • Choices: A. 15 meters; B. 35 meters; C. 55 meters.
  • Answer: C
  • Why ambiguous: It is unclear whether the rightmost car is really that far away.

Performance

Questions in RealWorldQA have 2-4 candidate choices (the majority have 3), so the expected top-1 accuracy of random guessing is 37.7%.
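The random-guess baseline is just the average of 1/(number of choices) over all questions. The exact per-choice-count breakdown is not published here, so the helper below only illustrates the computation with a hypothetical mix:

```python
def random_guess_accuracy(counts):
    """Expected top-1 accuracy of uniform random guessing.

    counts: mapping {num_choices: num_questions}, e.g. {2: 50, 3: 600, 4: 115}.
    Each k-choice question is answered correctly with probability 1/k.
    """
    total = sum(counts.values())
    return sum(n / k for k, n in counts.items()) / total


# Hypothetical mix of 2-, 3-, and 4-choice questions (not the real split)
print(random_guess_accuracy({2: 10, 3: 10, 4: 10}))  # → 0.3611...
```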

We perform the evaluation with VLMEvalKit and list the performance of representative VLMs (proprietary and open-source) below:

| Proprietary Models | Acc | Proprietary Models | Acc |
| --- | --- | --- | --- |
| GPT-4v (0409, low-res) | 61.4 | GPT-4v (0409, high-res) | 68.0 |
| GeminiPro-V (1.0) | 60.4 | QwenVLMax | 61.3 |
| **Open-Source Models** | **Acc** | **Open-Source Models** | **Acc** |
| InternLM-XComposer2 | 63.8 | InternVL-Chat-V1.5 | 65.6 |
| IDEFICS2-8B | 60.8 | LLaVA-NeXT (Yi-34B) | 66.0 |
| LLaVA-v1.5 (7B) | 54.8 | LLaVA-v1.5 (13B) | 55.3 |

Grok-v1.5 is not included since it's not publicly available.

Among the evaluated VLMs, GPT-4v (0409, high-res) achieves the best performance and significantly outperforms its low-res version (recall that RealWorldQA requires fine-grained recognition in high-resolution images). Meanwhile, top open-source VLMs also display competitive performance.

Hard Cases

We select the subset of questions that none of the Top-3 VLMs (GPT-4v (0409, high-res), InternVL-Chat-V1.5, LLaVA-NeXT (Yi-34B)) answer correctly. The subset includes 101 samples. We visualize several random samples from the subset below.
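Selecting such a subset is a simple intersection of each model's error set. The sketch below assumes predictions and ground truth are available as plain dictionaries (this is not the actual VLMEvalKit result format):

```python
def hard_subset(gt, preds_by_model):
    """Return sample ids that every model answered incorrectly.

    gt: {sample_id: correct_choice}, e.g. {"q1": "A"}
    preds_by_model: {model_name: {sample_id: predicted_choice}}
    """
    hard = []
    for sid, answer in gt.items():
        # A sample is "hard" only if all models got it wrong.
        if all(preds[sid] != answer for preds in preds_by_model.values()):
            hard.append(sid)
    return hard


# Toy example with two models and three questions
gt = {"q1": "A", "q2": "B", "q3": "C"}
preds = {
    "model_x": {"q1": "B", "q2": "B", "q3": "A"},
    "model_y": {"q1": "C", "q2": "A", "q3": "B"},
}
print(hard_subset(gt, preds))  # → ['q1', 'q3']  (q2 was solved by model_x)
```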

*(image)*

  • Question: Is the car closest to us driving in the same direction as us or in the opposite direction from us?
  • Choices: A. Same direction; B. Opposite direction.
  • Answer: B
  • Requirement: 1. Locate the closest car and determine its direction; 2. Identify the lane we are in and infer our own direction of travel.

*(image)*

  • Question: In which direction is the one-way sign in this scene facing?
  • Choices: A. Left; B. Right
  • Answer: B
  • Requirement: Localize the one-way sign and find its direction

*(image)*

  • Question: Are there some STOP signs?
  • Choices: A. Yes; B. No
  • Answer: A
  • Requirement: Localize the stop sign (which is extremely small)

*(image)*

  • Question: How many arrows are pointing right?
  • Choices: A. 2; B. 3; C. 4
  • Answer: B
  • Requirement: Find all arrows on the road sign and recognize their directions

Takeaway

  • RealWorldQA is a benchmark that requires VLMs to: 1. Recognize details in high-resolution images (1080p, etc.); 2. Perform **reasoning over the recognition results** (may require commonsense knowledge)
  • Performance numbers: Random guess: 37.7%; best proprietary VLM evaluated: GPT-4v (0409, high-res), 68%; best open-source VLM evaluated: LLaVA-NeXT (Yi-34B), 66%
  • You can use VLMEvalKit to evaluate your own VLM on RealWorldQA. Full evaluation results are available at Open VLM Leaderboard.
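Running the evaluation yourself follows VLMEvalKit's standard CLI pattern. The exact entry point, flags, and model keys depend on the installed version, so the invocation below is an illustrative sketch rather than a verified recipe:

```shell
# Install VLMEvalKit from source (check the repo README for current steps)
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

# Evaluate a model on RealWorldQA; "GPT4V" is an example model key,
# replace it with the key for your own VLM
python run.py --data RealWorldQA --model GPT4V --verbose
```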
