# HallusionBench Results

HallusionBench is a benchmark for evaluating hallucination in VLMs. It asks the same set of visual questions about an original image and a modified image (the correct answer to a question may differ between the two, depending on the image content).

Examples in HallusionBench:

| Original Figure | Modified Figure |
| --- | --- |
| Q1. Is the right orange circle the same size as the left orange circle? A1. Yes | Q1. Is the right orange circle the same size as the left orange circle? A1. No |
| Q2. Is the right orange circle larger than the left orange circle? A2. No | Q2. Is the right orange circle larger than the left orange circle? A2. Yes |
| Q3. Is the right orange circle smaller than the left orange circle? A3. No | Q3. Is the right orange circle smaller than the left orange circle? A3. No |

**Metrics:**

- **aAcc**: the overall accuracy over all atomic questions.

- **qAcc**: the mean accuracy over unique questions. One question can be asked multiple times with different figures; the VLM is considered to have solved a unique question only if it succeeds on every <question, figure> pair for that question.

- **fAcc**: the mean accuracy over figures. One figure is associated with multiple questions; the VLM is considered correct on a figure only if it solves all questions about that figure.
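As a rough illustration of how the three metrics relate, the sketch below computes them from a list of per-prediction records. The record layout (`question_id`, `figure_id`, `correct`) is an assumption made for this example and does not reflect the field names used by the actual evaluation code.

```python
from collections import defaultdict

def hallusion_metrics(records):
    """Compute aAcc, qAcc, and fAcc (in percent) from per-prediction records.

    Each record is assumed to be a dict with:
      - 'question_id': id of the unique question (shared across figure variants)
      - 'figure_id':   id of the figure the question was asked about
      - 'correct':     bool, whether this <question, figure> pair was answered correctly
    """
    by_question = defaultdict(list)
    by_figure = defaultdict(list)
    for r in records:
        by_question[r['question_id']].append(r['correct'])
        by_figure[r['figure_id']].append(r['correct'])

    # aAcc: accuracy over all atomic <question, figure> pairs
    a_acc = sum(r['correct'] for r in records) / len(records)
    # qAcc: a unique question counts as solved only if all its figure variants are correct
    q_acc = sum(all(v) for v in by_question.values()) / len(by_question)
    # fAcc: a figure counts as solved only if all questions about it are correct
    f_acc = sum(all(v) for v in by_figure.values()) / len(by_figure)
    return 100 * a_acc, 100 * q_acc, 100 * f_acc
```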

**Evaluation Setting:**

1. No-visual questions (questions asked without the associated figure) in HallusionBench are skipped during evaluation.
2. When we fail to extract Yes / No from the VLM prediction, we use GPT-3.5-Turbo-0613 as the answer extractor.
3. We report aAcc, qAcc, and fAcc for all evaluated VLMs.
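The following is a minimal sketch of the two-stage answer extraction described in point 2: a rule-based pass over the raw prediction, with an LLM fallback for ambiguous outputs. The `llm_fallback` callable is a hypothetical placeholder for a wrapper around GPT-3.5-Turbo-0613, not the repository's actual extractor.

```python
import re

def extract_yes_no(prediction, llm_fallback=None):
    """Return 'Yes', 'No', or 'Unknown' for a raw VLM prediction.

    First tries simple string matching; if the result is ambiguous and an
    LLM fallback (e.g. a GPT-3.5-Turbo-0613 wrapper) is provided, defers to it.
    """
    text = prediction.strip().lower()
    has_yes = re.search(r'\byes\b', text) is not None
    has_no = re.search(r'\bno\b', text) is not None
    if has_yes and not has_no:
        return 'Yes'
    if has_no and not has_yes:
        return 'No'
    if llm_fallback is not None:
        # Hypothetical callable mapping free-form text to 'Yes' / 'No' / 'Unknown'.
        return llm_fallback(prediction)
    return 'Unknown'
```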

## Evaluation Results

Models are sorted in descending order of qAcc.

| Model | aAcc | fAcc | qAcc |
| --- | --- | --- | --- |
| GPT-4v (detail: low) | 65.8 | 38.4 | 35.2 |
| GeminiProVision | 63.9 | 37.3 | 34.3 |
| Qwen-VL-Chat | 56.4 | 27.7 | 26.4 |
| MiniGPT-4-v1-7B | 52.4 | 17.3 | 25.9 |
| CogVLM-17B-Chat | 55.1 | 26.3 | 24.8 |
| InternLM-XComposer-VL | 57 | 26.3 | 24.6 |
| MiniGPT-4-v1-13B | 51.3 | 16.2 | 24.6 |
| SharedCaptioner | 55.6 | 22.8 | 24.2 |
| MiniGPT-4-v2 | 52.6 | 16.5 | 21.1 |
| InstructBLIP-7B | 53.6 | 20.2 | 19.8 |
| Qwen-VL | 57.6 | 12.4 | 19.6 |
| OpenFlamingo v2 | 52.7 | 17.6 | 18 |
| mPLUG-Owl2 | 48.9 | 22.5 | 16.7 |
| VisualGLM | 47.2 | 11.3 | 16.5 |
| IDEFICS-9B-Instruct | 50.1 | 16.2 | 15.6 |
| ShareGPT4V-7B | 48.2 | 21.7 | 15.6 |
| LLaVA-InternLM-7B (LoRA) | 49.1 | 22.3 | 15.4 |
| InstructBLIP-13B | 47.9 | 17.3 | 15.2 |
| LLaVA-v1.5-7B | 48.3 | 19.9 | 14.1 |
| LLaVA-v1.5-13B (LoRA, XTuner) | 46.9 | 17.6 | 14.1 |
| LLaVA-v1.5-7B (LoRA, XTuner) | 46.2 | 16.2 | 13.2 |
| LLaVA-v1.5-13B | 46.7 | 17.3 | 13 |
| IDEFICS-80B-Instruct | 46.1 | 13.3 | 11 |
| TransCore-M | 44.7 | 16.5 | 10.1 |
| LLaVA-v1-7B | 44.1 | 13.6 | 9.5 |
| PandaGPT-13B | 43.1 | 9.2 | 7.7 |