VLMs are Blind

Abstract

While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.57% accurate on average. Sonnet-3.5 performs the best at 74.94% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and the recognition of geometric primitives that overlap or are close together.

Overview of All Tasks

Model									Mean
Random	33.33	50.00	5.77	20.00	20.00	25.00	4.55	33.33	24.00
GPT-4o	41.61	72.67	70.18	42.50	17.50	55.83	39.58	47.89	48.47
Gemini-1.5 Pro	66.94	92.78	92.81	87.08	19.37	80.00	39.39	41.60	65.00
Sonnet-3	43.41	84.52	73.34	31.66	9.79	65.00	36.17	23.24	45.89
Sonnet-3.5	75.36	91.66	89.22	44.16	77.29	92.08	74.26	55.53	74.94
Mean	56.84	85.41	81.39	51.35	30.99	73.29	47.35	42.06	58.57

Accuracy (%) of each model over 7 tasks. The mean accuracy over all four models is 58.57%, substantially better than random chance (24%), which is computed considering each task as a single-label, N-way classification problem. Sonnet-3.5 is the best (74.94% accuracy) but still far from the 100% expected accuracy.
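The Mean column and the 58.57% overall figure can be reproduced from the per-column accuracies. The snippet below is a small sanity check rather than the authors' evaluation code. The model labels attached to each row are an assumption: they are inferred by matching the first column of the overview table against the per-model means reported for Task 1.

```python
# Per-column accuracies (%) copied from the overview table.
# ASSUMPTION: row labels are inferred by matching the first column against
# the Task 1 means (GPT-4o 41.61, Gemini-1.5 Pro 66.94, Sonnet-3 43.41,
# Sonnet-3.5 75.36); they are not printed in the table itself.
rows = {
    "Random":         [33.33, 50.00, 5.77, 20.00, 20.00, 25.00, 4.55, 33.33],
    "GPT-4o":         [41.61, 72.67, 70.18, 42.50, 17.50, 55.83, 39.58, 47.89],
    "Gemini-1.5 Pro": [66.94, 92.78, 92.81, 87.08, 19.37, 80.00, 39.39, 41.60],
    "Sonnet-3":       [43.41, 84.52, 73.34, 31.66, 9.79, 65.00, 36.17, 23.24],
    "Sonnet-3.5":     [75.36, 91.66, 89.22, 44.16, 77.29, 92.08, 74.26, 55.53],
}

# Values reported in the Mean column of the table.
reported_mean = {
    "Random": 24.00,
    "GPT-4o": 48.47,
    "Gemini-1.5 Pro": 65.00,
    "Sonnet-3": 45.89,
    "Sonnet-3.5": 74.94,
}


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# Mean accuracy of each row over the eight value columns.
row_means = {name: mean(values) for name, values in rows.items()}

# Overall mean over the four models (the random baseline is excluded).
models = [name for name in rows if name != "Random"]
overall_mean = mean([row_means[name] for name in models])

for name, value in row_means.items():
    print(f"{name:15s} computed {value:6.2f}  reported {reported_mean[name]:6.2f}")
print(f"Overall mean over models: {overall_mean:.2f}")
```

The recomputed values agree with the reported ones to within rounding, since the table entries are themselves rounded to two decimals.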

Task 1 Line Intersections

Task 2 Two Circles

Task 3 Circled Letter

Task 4 Overlapping Shapes

Task 5 Nested Squares

Task 6 Counting Grid

Task 7 Subway Map

Task 1: Counting line intersections

Given the impressive accuracy of VLMs in answering questions about diagrams and charts (e.g., Sonnet-3.5 scoring 94.7% on AI2D and 90.8% on ChartQA) [1], a reasonable hypothesis is that VLMs must be able to see whether two graphs intersect in a chart. Here, we test this hypothesis by asking VLMs to count the number of intersections between two 2-segment piecewise linear functions.

Images

We create 1800 images of 2D line plots drawn on a white canvas. Each line plot consists of two line segments, defined by three points whose x-coordinates are fixed and equally spaced. The y-coordinates are randomly sampled to create two plots that intersect at exactly 0, 1 or 2 points. See Appendix A for more details.
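As an illustration of the construction described above, the following sketch samples two 2-segment plots that share the same x-coordinates and counts how often they cross. It is not the authors' generator (those details are in Appendix A). The knot positions in `X_KNOTS`, the unit range for the y-coordinates, and the use of rejection sampling are all assumptions made for this example, and the function names are hypothetical.

```python
import random

# ASSUMPTION: three fixed, equally spaced x-coordinates on a unit canvas.
X_KNOTS = (0.0, 0.5, 1.0)


def count_intersections(ya, yb):
    """Count crossings of two plots that share the same x-coordinates.

    On each segment the difference ya - yb is linear, so the plots cross
    exactly once there if the difference changes sign between the endpoints.
    Returns None when the plots touch exactly at a shared x-coordinate,
    which this sketch treats as a degenerate case to be resampled.
    """
    diffs = [a - b for a, b in zip(ya, yb)]
    if any(d == 0 for d in diffs):
        return None
    return sum((d0 > 0) != (d1 > 0) for d0, d1 in zip(diffs, diffs[1:]))


def sample_plot_pair(target, rng, max_tries=10_000):
    """Rejection-sample y-coordinates until the plots cross `target` times."""
    for _ in range(max_tries):
        ya = [rng.random() for _ in X_KNOTS]
        yb = [rng.random() for _ in X_KNOTS]
        if count_intersections(ya, yb) == target:
            return list(zip(X_KNOTS, ya)), list(zip(X_KNOTS, yb))
    raise RuntimeError(f"no pair with {target} intersections found")


rng = random.Random(0)  # fixed seed so the sampled plots are reproducible
for target in (0, 1, 2):
    blue, red = sample_plot_pair(target, rng)
    print(target, blue, red)
```

Because both plots share the same x-coordinates, counting sign changes of the difference at the three knots is enough, and the count can only be 0, 1, or 2, which matches the answer set of the task.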

Fig. 1: Examples of 2D line plots used in the task, showing different numbers of intersections (example 1: 0 intersections; example 4: 2 intersections).

Prompts

We ask each question using two different wordings:

"How many times do the blue and red lines touch each other? Answer with a number in curly brackets, e.g., {5}."
"Count the intersection points where the blue and red lines meet. Put your answer in curly brackets, e.g., {2}."

Groundtruth

The ground-truth answer is always 0, 1, or 2 (random-baseline accuracy: 33%).
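Since both prompts request the answer in curly brackets, a response can be scored by extracting the bracketed number and comparing it with the ground truth. The sketch below shows one way to do this. How the authors actually parse responses is not stated here, so the regular expression, the choice to take the last bracketed number, and the function names are assumptions.

```python
import re

# Matches a non-negative integer wrapped in curly brackets, e.g. "{2}".
ANSWER_PATTERN = re.compile(r"\{\s*(\d+)\s*\}")


def extract_answer(response):
    """Return the last bracketed integer in a response, or None if absent.

    ASSUMPTION: taking the last match treats any earlier bracketed numbers
    as intermediate text and the final one as the model's answer.
    """
    matches = ANSWER_PATTERN.findall(response)
    return int(matches[-1]) if matches else None


def accuracy(responses, ground_truth):
    """Percentage of responses whose extracted answer equals the label."""
    correct = sum(
        extract_answer(response) == label
        for response, label in zip(responses, ground_truth)
    )
    return 100.0 * correct / len(ground_truth)


# Hypothetical model responses and labels, for illustration only.
responses = ["The lines cross twice: {2}", "{0}", "I cannot tell."]
labels = [2, 1, 0]
print(f"{accuracy(responses, labels):.2f}")
```

A response with no bracketed number is counted as incorrect, which is the conservative choice for a model that does not follow the requested format.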

Results

The following table reports the accuracy (%) of the four models on counting line intersections, at two line widths.

Line width	GPT-4o	Gemini-1.5 Pro	Sonnet-3	Sonnet-3.5
0.005 × C	45.00	67.55	45.22	75.83
0.01 × C	38.22	66.33	41.61	74.88
Mean	41.61	66.94	43.41	75.36

Qualitative samples


1✗	0✓	2✗	2✗	4✗	1✗
1✗	1✗	1✓	2✗	1✗	1✗
4✗	1✗	2✗	1✓	4✗	1✗
1✗	0✓	2✗	1✓	3✗	2✓

Models (legend): GPT-4o, Gemini-1.5 Pro, Sonnet-3, Sonnet-3.5

Fig. 2: VLMs cannot reliably count the intersections.

Task 2: Two circles

In contrast to Task 1, where we tested VLMs on thin lines, here we evaluate their ability to perceive interactions between larger objects, specifically two same-sized filled circles. This task assesses VLMs' capability to detect (1) small gaps between circles and (2) overlapping circles.