Vision language models are blind

*Equal contribution
1Auburn University, 2University of Alberta,
17th Asian Conference on Computer Vision (ACCV 2024)
Accepted for Oral Presentation

Abstract

While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they are surprisingly still struggling with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.57% accurate on average. Sonnet-3.5 performs the best at 74.94% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and recognizing geometric primitives that overlap or are close together.


Overview of All Tasks

Model        Line Intersect  Two Circles  Circled Letter  Olympic Rings  Pentagon  Nested Squares  Grid   Path Following  Mean
Random       33.33           50.00        5.77            20.00          20.00     25.00           4.55   33.33           24.00
GPT-4o       41.61           72.67        70.18           42.50          17.50     55.83           39.58  47.89           48.47
Gemini-1.5   66.94           92.78        92.81           87.08          19.37     80.00           39.39  41.60           65.00
Sonnet-3     43.41           84.52        73.34           31.66          9.79      65.00           36.17  23.24           45.89
Sonnet-3.5   75.36           91.66        89.22           44.16          77.29     92.08           74.26  55.53           74.94
Mean         56.84           85.41        81.39           51.35          30.99     73.29           47.35  42.06           58.57
Accuracy (%) of each model over 7 tasks. The mean accuracy over all four models is 58.57%, substantially better than random chance (24%), which is computed considering each task as a single-label, N-way classification problem. Sonnet-3.5 is the best (74.94% accuracy) but still far from the 100% expected accuracy.


Task 1: Counting line intersections

Given the impressive accuracy of VLMs on answering questions on diagrams and charts (e.g., Sonnet-3.5 scoring 94.7% on AI2D and 90.8% on ChartQA) [1], a reasonable hypothesis is that VLMs must be able to see whether two graphs intersect in a chart. Here, we test this hypothesis by asking VLMs to count the number of intersections between two 2-segment piece-wise linear functions.

Images

We create 1800 images of 2D line plots drawn on a white canvas. Each line plot consists of two line segments, defined by three points whose x-coordinates are fixed and equally spaced. The y-coordinates are randomly sampled to create two plots that intersect at exactly 0, 1 or 2 points. See Appendix A for more details.
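Below is a minimal sketch of how such a line-plot image and its groundtruth could be generated. This is our own illustration, not the authors' released code; the exact x-coordinates, sampling range, and plot styling are assumptions.

```python
import random
import matplotlib.pyplot as plt

XS = [0.0, 0.5, 1.0]  # fixed, equally spaced x-coordinates of the three points

def count_intersections(ys_blue, ys_red):
    """Count crossings of two 2-segment piecewise-linear plots defined on XS.
    The difference of the two plots is linear on each segment, so it crosses
    zero exactly once on a segment iff its endpoint values change sign."""
    diffs = [b - r for b, r in zip(ys_blue, ys_red)]
    return sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)

def make_plot(target, path):
    """Resample y-coordinates until the two plots intersect exactly `target` times."""
    while True:
        ys_blue = [random.uniform(0.0, 1.0) for _ in XS]
        ys_red = [random.uniform(0.0, 1.0) for _ in XS]
        if 0.0 in (b - r for b, r in zip(ys_blue, ys_red)):
            continue  # avoid ambiguous touching-without-crossing cases
        if count_intersections(ys_blue, ys_red) == target:
            break
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.plot(XS, ys_blue, color="blue", linewidth=2)
    ax.plot(XS, ys_red, color="red", linewidth=2)
    fig.savefig(path)
    plt.close(fig)

for k in (0, 1, 2):
    make_plot(k, f"lineplot_{k}_intersections.png")
```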

2D line plot example 1
0 intersections
2D line plot example 2
1 intersection
2D line plot example 3
1 intersection
2D line plot example 4
2 intersections
Fig. 1: Examples of 2D line plots used in the task, showing different numbers of intersections.

Prompts

We ask each question using two different wordings:

  1. "How many times do the blue and red lines touch each other? Answer with a number in curly brackets, e.g., {5}."
  2. "Count the intersection points where the blue and red lines meet. Put your answer in curly brackets, e.g., {2}."

Groundtruth

Answers are ∈ {0, 1, 2} (random-baseline accuracy: 33%).
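Because both prompt wordings ask for the answer inside curly brackets, responses can be scored automatically. Below is a minimal extraction sketch; the authors' exact parsing may differ.

```python
import re

def extract_answer(response: str):
    """Return the last integer found inside curly brackets, e.g. '{2}' -> 2."""
    matches = re.findall(r"\{(\d+)\}", response)
    return int(matches[-1]) if matches else None

assert extract_answer("The lines cross twice, so the answer is {2}.") == 2
assert extract_answer("I see no curly brackets here.") is None
```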

Results

The following table shows the performance of the four models on the task of counting line intersections.

Line width  GPT-4o  Gemini-1.5 Pro  Sonnet-3  Sonnet-3.5
0.005 × C   45.00   67.55           45.22     75.83
0.01 × C    38.22   66.33           41.61     74.88
Mean        41.61   66.94           43.41     75.36

Qualitative samples

How many times do the blue and red lines intersect?

              Graph 1  Graph 2  Graph 3  Graph 4  Graph 5  Graph 6
GPT-4o        1✗       0✓       2✗       2✗       4✗       1✗
Gemini-1.5    1✗       1✗       1✓       2✗       1✗       1✗
Sonnet-3      4✗       1✗       2✗       1✓       4✗       1✗       1✗
Sonnet-3.5    1✗       0✓       2✗       1✓       3✗       2✓
Fig. 2: VLMs cannot reliably count the intersections.

Task 2: Two circles

In contrast to Task 1, where we tested VLMs on thin lines, here we evaluate their ability to perceive interactions between larger objects: two same-sized filled circles. This task assesses VLMs' capability to detect (1) small gaps between circles and (2) overlapping circles.

Images

We generate 672 images of two circles on a white canvas. The circles vary in size, distance, and orientation (a generation sketch follows the list below):

  • Circle diameters: 1/4, 1/5, 1/6, or 1/7 of the canvas size
  • Distances between circle perimeters: -0.15 to 0.5 times the diameter
  • Orientations: 90°, 0°, -45°, and 45° angles with the x-axis
  • Canvas sizes: 384, 769, and 1155 pixels
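A minimal sketch of generating one such image from these parameters is shown below. It is an illustration only, assuming PIL and black filled circles; the fill color and exact placement are our assumptions, not the authors' code.

```python
import math
from PIL import Image, ImageDraw

def draw_two_circles(canvas=384, diameter_frac=1/5, gap_frac=0.2, angle_deg=45,
                     path="two_circles.png"):
    """Draw two same-sized filled circles whose perimeter-to-perimeter gap is
    gap_frac * diameter (negative => overlapping), along a given orientation.
    Returns the 'overlapping' groundtruth label."""
    d = canvas * diameter_frac                  # circle diameter in pixels
    center_dist = d + gap_frac * d              # distance between the two centers
    theta = math.radians(angle_deg)
    cx, cy = canvas / 2, canvas / 2             # midpoint between the two circles
    dx = math.cos(theta) * center_dist / 2
    dy = math.sin(theta) * center_dist / 2

    img = Image.new("RGB", (canvas, canvas), "white")
    draw = ImageDraw.Draw(img)
    for sign in (-1, 1):
        x, y = cx + sign * dx, cy - sign * dy   # minus: image y-axis points down
        draw.ellipse([x - d / 2, y - d / 2, x + d / 2, y + d / 2], fill="black")
    img.save(path)
    return gap_frac < 0                         # overlapping iff the gap is negative

draw_two_circles(gap_frac=-0.15)                # overlapping pair
draw_two_circles(gap_frac=0.5)                  # clearly separated pair
```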
Overlapping circles
Overlapping and touching
Touching circles
Non-overlapping but touching
Separated circles
Non-overlapping and non-touching
Diagonal orientation
Different orientation
Fig. 3: Examples of two-circle images used in the task, showing different configurations.

Prompts

We ask each question using two different wordings:

  1. "Are the two circles touching each other? Answer with Yes/No."
  2. "Are the two circles overlapping? Answer with Yes/No."

Groundtruth

Answers are based on the distance d between circle perimeters:

  • d < 0: Overlapping and touching
  • d = 0: Non-overlapping but touching
  • d > 0: Non-overlapping and non-touching

Random-baseline accuracy: 50%.

Results

The following table shows the performance of the four models on the task of judging whether two circles overlap or touch.

             GPT-4o  Gemini-1.5 Pro  Sonnet-3  Sonnet-3.5
Overlapping  71.27   93.30           88.09     88.83
Touching     74.10   92.26           80.95     94.49
Average      72.69   92.78           84.52     91.66

Qualitative samples

Are the two circles overlapping? Answer with Yes/No.

              Circle 1  Circle 2  Circle 3  Circle 4  Circle 5  Circle 6
GPT-4o        Yes✓      Yes✓      Yes✗      Yes✗      No✓       Yes✗
Gemini-1.5    No✗       Yes✓      Yes✗      No✓       No✓       No✓
Sonnet-3      Yes✓      Yes✓      Yes✗      Yes✗      Yes✗      No✓
Sonnet-3.5    No✗       No✗       No✓       No✓       No✓       No✓
Fig. 4: VLMs consistently fail at smaller distances. However, at a large gap, GPT-4o remains unreliable (rightmost). Sonnet-3.5 tends to conservatively answer "No" regardless of the actual distance between the two circles.

Task 3: The circled letter

Consistent with prior reports [2][3][4], we find that VLMs can identify a primitive shape (e.g., a red circle ⭕) with 100% accuracy [2] and can perfectly read an English word (e.g., Subdermatoglyphic) on its own. Here, we superimpose a red oval on each letter of the word, one at a time, and ask VLMs to identify which letter is being circled. While the task is easy for humans, our hypothesis is that if a VLM's vision is "blurry", it may not be able to identify the exact letter being circled, since there is only tiny spacing between adjacent letters.

Images

We choose three strings, Acknowledgement, Subdermatoglyphic, and tHyUiKaRbNqWeOpXcZvM, because they contain characters of variable widths and heights. Furthermore, all four tested VLMs can read out all characters in these strings when they are input to the models as an image. While Acknowledgement is a common English word, Subdermatoglyphic is the longest English word with no repeated letters. We also test VLMs on the random string tHyUiKaRbNqWeOpXcZvM to estimate how much model accuracy is due to familiarity with the word.

For each (string, circled-letter) pair, we render a 512×512 image by choosing among 3 red-oval line-thickness levels, 2 font sizes, and 4 random positions in the canvas, for a total of 24 images per pair. That is, we generate 360, 408, and 480 images for Acknowledgement (15 letters), Subdermatoglyphic (17 letters), and tHyUiKaRbNqWeOpXcZvM (20 letters), respectively. We ensure that each circled letter fits completely inside the oval.
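A minimal sketch of how a circled-letter image could be rendered with PIL follows; this is an illustration only, and the font file, padding, and exact oval size are our assumptions.

```python
from PIL import Image, ImageDraw, ImageFont

def circle_letter(word="Subdermatoglyphic", index=1, font_size=48,
                  oval_width=4, origin=(60, 230), path="circled_letter.png"):
    """Render `word` on a 512x512 white canvas and draw a red oval
    around the letter at position `index` (0-based)."""
    img = Image.new("RGB", (512, 512), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("DejaVuSans.ttf", font_size)  # any available font works

    x, y = origin
    draw.text((x, y), word, fill="black", font=font)

    # Locate the target letter by measuring the text before it and its own width.
    before = draw.textlength(word[:index], font=font)
    width = draw.textlength(word[index], font=font)
    ascent, descent = font.getmetrics()

    # Red oval with padding so the letter fits completely inside it.
    pad = 6
    box = [x + before - pad, y - pad,
           x + before + width + pad, y + ascent + descent + pad]
    draw.ellipse(box, outline="red", width=oval_width)
    img.save(path)

circle_letter()   # circles the 'u' in Subdermatoglyphic
```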

Circled letter example 1
Acknowledgement with 'n' circled
Circled letter example 2
tHyUiKaRbNqWeOpXcZvM with 't' circled
Circled letter example 3
tHyUiKaRbNqWeOpXcZvM with 'X' circled
Circled letter example 4
Subdermatoglyphic with 'u' circled
Fig. 5: Examples of circled letter images used in the task, showing different words and circled letters.

Prompts

We ask each question using two different wordings:

  1. "Which letter is being circled?"
  2. "Which character is being highlighted with a red oval?"

Groundtruth

A prediction is correct only if it exactly matches the circled letter (case-insensitive).

Results

The following table shows the performance of the four models on the task of identifying the circled letter.

Word                  GPT-4o  Gemini-1.5 Pro  Sonnet-3  Sonnet-3.5
Acknowledgement       69.03   97.50           82.64     91.11
Subdermatoglyphic     63.60   91.05           71.45     94.49
tHyUiKaRbNqWeOpXcZvM  77.92   89.90           65.94     82.08
Average               70.18   92.81           73.34     89.22

Qualitative samples

Which letter is being circled?

              Circled Letter 1  Circled Letter 2  Circled Letter 3  Circled Letter 4  Circled Letter 5  Circled Letter 6
GPT-4o        o✗                e✗                t✗                o✗                o✗                z✗
Gemini-1.5    w✗                m✓                n✓                p✓                o✗                v✓
Sonnet-3      o✗                e✗                e✗                y✗                a✗                t✗
Sonnet-3.5    l✓                e✗                t✗                h✗                t✓                m✗
Fig. 6: Identifying the letter being circled is non-trivial for VLMs across both English words (Acknowledgement & Subdermatoglyphic) and a random string (tHyUiKaRbNqWeOpXcZvM). When making mistakes, VLMs tend to predict letters adjacent to the circled one.

Task 4: Counting overlapping shapes

Aligned with prior research [4], we also find that VLMs can count disjoint circles. Yet, here we test VLMs on counting circles that intersect, as in the Olympic logo, a common cognitive-development exercise for preschoolers [5][6]. Our hypothesis is that a "blurry" vision may not see the intersection between two circles clearly and therefore cannot trace and count the circles. To check how general our findings are, we repeat the experiment with pentagons as well.

Images

In an image of size C×C, where C ∈ {384, 769, 1155}px, we draw N ∈ {5, 6, 7, 8, 9} overlapping, same-sized circles arranged in two rows like the Olympic logo. The circle diameter φ ∈ {C/5, C/10}, and each configuration is rendered with two different line thicknesses. This procedure yields 3 resolutions × 5 counts × 2 diameters × 2 line thicknesses = 60 images. We repeat for pentagons in addition to circles, resulting in 60 × 2 shapes = 120 images in total. For pentagons, the side length d ∈ {C/5, C/10}.
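Below is a minimal sketch of rendering an Olympic-like arrangement of N circle outlines. It is an illustration only; the row offsets, spacing constants, and outline color are our assumptions, not the authors' exact values.

```python
from PIL import Image, ImageDraw

def olympic_like(n=5, canvas=384, diameter=None, line_width=4, path="olympic.png"):
    """Draw n same-sized circle outlines arranged in two rows, roughly like the
    Olympic logo, so that circles in adjacent rows overlap."""
    d = diameter or canvas / 10
    img = Image.new("RGB", (canvas, canvas), "white")
    draw = ImageDraw.Draw(img)

    top = (n + 1) // 2                    # number of circles in the top row
    spacing = 1.1 * d                     # centers slightly more than one diameter apart
    start_x = canvas / 2 - (top - 1) * spacing / 2
    for i in range(n):
        row, col = (0, i) if i < top else (1, i - top)
        x = start_x + col * spacing + (spacing / 2 if row else 0)  # bottom row shifted right
        y = canvas / 2 + (0.3 * d if row else -0.3 * d)            # two vertically offset rows
        draw.ellipse([x - d / 2, y - d / 2, x + d / 2, y + d / 2],
                     outline="black", width=line_width)
    img.save(path)

for n in range(5, 10):
    olympic_like(n, path=f"olympic_{n}.png")
```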

Olympic-like logo example 1
5 circles, small diameter
Olympic-like logo example 2
6 circles, large diameter
Olympic-like logo example 3
8 colored circles
Olympic-like logo example 4
9 colored pentagons
Fig. 7: Examples of Olympic-like logo images used in the task, showing different numbers of shapes, sizes, and colors.

Prompts

We ask each question using two different wordings:

  1. "How many {shapes} are in the image? Answer with only the number in numerical format."
  2. "Count the {shapes} in the image. Answer with a number in curly brackets e.g. {3}."

Where {shapes} is either "circles" or "pentagons" depending on the image.

Groundtruth

Answers are ∈ {5, 6, 7, 8, 9} (random-baseline accuracy: 20%).

Results

The following table shows the performance of the four models on the task of counting overlapping shapes.

           GPT-4o  Gemini-1.5 Pro  Sonnet-3  Sonnet-3.5
Circles    42.50   20.83           31.66     44.16
Pentagons  19.16   9.16            11.66     75.83

Qualitative samples

How many circles are in the image? Answer with only the number in numerical format.

              Circle 1  Circle 2  Circle 3  Circle 4  Circle 5  Circle 6
GPT-4o        5✓        5✗        7✗        12✗       11✗       5✗
Gemini-1.5    5✓        5✗        5✗        5✗        5✗        5✗
Sonnet-3      3✗        5✗        5✗        10✗       10✗       5✗
Sonnet-3.5    4✗        6✓        6✓        10✗       9✓        7✓
Fig. 8: Gemini-1.5 Pro often predicts "5" circles.

Task 5: Counting the nested squares

Motivated by the finding that VLMs struggle to count intersecting circles (Task 4), here we arrange the shapes so that their edges do not intersect: each shape is nested entirely inside another. For completeness, we test squares in this task.

Images

In a canvas of size C×C, we render N ∈ {2, 3, 4, 5} nested squares. The outermost square is rendered first using a random edge length d and a line thickness ∈ {2, 3, 4}px. Each of the remaining N-1 squares has its edge length reduced by a factor of 0.75 relative to the enclosing square and is placed at a random coordinate that ensures it does not touch the outer squares. For each line thickness, we generate 10 images (where the squares have different, random locations) to create 3 × 10 = 30 images. Repeating the process for all N values results in 4 × 30 = 120 images.
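A minimal sketch of this nested-square generation is shown below. It is an illustration only; the outermost edge-length range and the margin that keeps inner squares from touching their parent are our assumptions.

```python
import random
from PIL import Image, ImageDraw

def nested_squares(n=4, canvas=512, line_width=3, path="nested_squares.png"):
    """Draw n nested, non-touching square outlines; each inner square's edge is
    0.75x the enclosing one and is placed at a random offset inside it."""
    img = Image.new("RGB", (canvas, canvas), "white")
    draw = ImageDraw.Draw(img)

    edge = random.uniform(0.6, 0.9) * canvas        # outermost edge length (assumed range)
    x = random.uniform(0, canvas - edge)
    y = random.uniform(0, canvas - edge)
    for _ in range(n):
        draw.rectangle([x, y, x + edge, y + edge], outline="black", width=line_width)
        new_edge = 0.75 * edge
        margin = line_width + 1                     # keep a visible gap to the outer square
        x = random.uniform(x + margin, x + edge - new_edge - margin)
        y = random.uniform(y + margin, y + edge - new_edge - margin)
        edge = new_edge
    img.save(path)

nested_squares(n=5)
```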

2 nested squares
2 nested squares
3 nested squares
3 nested squares
4 nested squares
4 nested squares
5 nested squares
5 nested squares
Fig. 9: Examples of nested square images used in the task, showing different numbers of squares.

Prompts

We ask each question using the following wording:

  1. "How many squares are in the image? Please answer with a number in curly brackets e.g., {10}."
  2. "Count total number of squares in the image. Answer with only the number in numerical format in curly brackets e.g. {3}."


Groundtruth

Answers are ∈ {2, 3, 4, 5} (random-baseline accuracy: 25%).

Results

The following table shows the performance of the four models on the task of counting nested squares.

         GPT-4o  Gemini-1.5 Pro  Sonnet-3  Sonnet-3.5
Squares  55.83   87.08           65.00     92.08

Qualitative samples

How many squares are in the image? Please answer with a number in curly brackets e.g. {10}.

              Nested Squares 1  Nested Squares 2  Nested Squares 3  Nested Squares 4  Nested Squares 5  Nested Squares 6  Nested Squares 7  Nested Squares 8
GPT-4o        3✗                5✗                3✓                5✗                5✗                5✗                6✗                6✗
Gemini-1.5    2✓                3✓                2✗                3✓                5✗                5✗                5✓                4✗
Sonnet-3      2✓                4✗                2✗                4✗                5✗                4✓                4✗                5✓
Sonnet-3.5    2✓                3✓                3✓                3✓                4✓                4✓                4✗                5✓
Fig. 10: Only Sonnet-3.5 can count the squares in a majority of the images.

Task 6: Counting the rows and columns of a grid

The results from the prior tasks show that VLMs cannot always count shapes that are overlapping (Task 4) or nested (Task 5). What about adjacent shapes? Here, we tile shapes (specifically, squares) into a grid and challenge VLMs to count them, a task that is supposedly simple for VLMs given their remarkable performance (≥ 90% accuracy) on DocVQA, which includes many questions about tables. To simplify the task, we ask models to count the number of rows and columns in a given grid.

Images

A grid may have N×N, N×N', or N'×N cells, where N ∈ {3, 4, 5, 6, 7, 8, 9} and N' = N + 1. Each grid is rendered with two different line thicknesses on a canvas of size C×C, where C ∈ {500, 1250, 2000}px. Besides empty grids, we also replicate the procedure to make grids containing text (more common in real-world tables), where each cell contains a single random word. The two versions combined total 2 × 222 = 444 images.
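A minimal sketch of rendering an empty or text-filled grid follows. It is an illustration only; the word list, text placement, and default font are our assumptions.

```python
import random
from PIL import Image, ImageDraw, ImageFont

WORDS = ["alpha", "bravo", "delta", "echo", "hotel", "kilo", "lima", "tango"]

def draw_grid(rows=4, cols=5, canvas=500, line_width=2, with_text=False,
              path="grid.png"):
    """Render a rows x cols grid of cells; optionally put a random word in each cell."""
    img = Image.new("RGB", (canvas, canvas), "white")
    draw = ImageDraw.Draw(img)
    cell_w, cell_h = canvas / cols, canvas / rows

    for r in range(rows + 1):                       # horizontal grid lines
        y = min(r * cell_h, canvas - line_width)
        draw.line([(0, y), (canvas, y)], fill="black", width=line_width)
    for c in range(cols + 1):                       # vertical grid lines
        x = min(c * cell_w, canvas - line_width)
        draw.line([(x, 0), (x, canvas)], fill="black", width=line_width)

    if with_text:
        font = ImageFont.load_default()
        for r in range(rows):
            for c in range(cols):
                word = random.choice(WORDS)
                draw.text((c * cell_w + 5, r * cell_h + cell_h / 3),
                          word, fill="black", font=font)
    img.save(path)

draw_grid(rows=3, cols=4, with_text=True)
```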

Text grid 3x3
Text grid (3x3)
Text grid 3x4
Text grid (3x4)
Empty grid 4x4
Empty grid (4x4)
Empty grid 4x5
Empty grid (4x5)
Fig. 11: Examples of grid images used in the task, showing text-filled and empty grids with various dimensions.

Prompts

We ask each question using two different wordings:

  1. "Count the number of rows and columns and answer with numbers in curly brackets. For example, rows={5} columns={6}"
  2. "How many rows and columns are in the table? Answer with only the numbers in a pair (row, column), e.g., (5,6)"

Groundtruth

Answers must include both the number of rows and the number of columns. An answer is counted as correct only when both counts are correct.

Results

The following table shows the performance of the four models on the task of counting rows and columns in grids.

Grid type  GPT-4o  Gemini-1.5 Pro  Sonnet-3  Sonnet-3.5  Mean
Empty      26.13   26.51           25.00     59.84       34.37
Text       53.03   52.27           47.34     88.68       60.33
Mean       39.58   39.39           36.17     74.26       47.35

Qualitative samples

Count the number of rows and columns and answer with numbers in curly brackets. For example, rows={5} columns={6}.

              Grid 1  Grid 2  Grid 3  Grid 4   Grid 5  Grid 6
GT            4×5     6×7     7×6     8×7      3×4     6×7
GPT-4o        4×4✗    6×6✗    7×7✗    6×6✗     3×4✓    7×7✗
Gemini-1.5    5×5✗    6×6✗    7×7✗    10×10✗   3×4✓    7×8✗
Sonnet-3      5×5✗    7×8✗    6×6✗    9×9✗     4×4✗    7×7✗
Sonnet-3.5    4×5✓    6×7✓    7×7✗    8×7✓     3×4✓    7×7✗
Fig. 12: When text is included in the cells of the grid, the performance of all VLMs improves, especially Sonnet-3.5.

Task 7: Following single-colored paths

It is important for VLMs to be able to follow paths in order to read maps or charts, interpret graphs, and understand user annotations (e.g., arrows) in input images. To assess path-following capability, this task asks models to count the single-colored paths between two given stations in a simplified subway map. This is another task that is easy for humans yet challenges VLMs significantly.

Images

We create each subway map on an image of size C×C, where C ∈ {512, 1024}px. We write 4 station names (A, B, C, D) at 4 fixed coordinates. We divide the canvas into an invisible grid of 18×18 cells and initialize 3 path-starting points C/18 px away from each station. We draw a path using a depth-first search algorithm, starting from a random station and a random starting point, where a valid move is one cell in any direction: north, south, east, or west. We repeat the process so that each station has exactly N ∈ {1, 2, 3} outgoing paths, for a total of 180 maps.
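Below is a minimal sketch of drawing one single-colored path between two stations using depth-first search over the 18×18 grid. It is an illustration only; the station coordinates, path color, and the choice to terminate the search at the goal station are our assumptions, not the authors' exact procedure.

```python
import random
from PIL import Image, ImageDraw

GRID = 18  # the canvas is divided into an invisible GRID x GRID grid of cells

def dfs_path(start, goal):
    """Depth-first search over grid cells; moves are one cell N/S/E/W.
    Neighbor order is shuffled so every call yields a different winding path."""
    stack, seen = [(start, [start])], {start}
    while stack:
        cell, path = stack.pop()
        if cell == goal:
            return path
        x, y = cell
        moves = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        random.shuffle(moves)
        for nxt in moves:
            if 0 <= nxt[0] < GRID and 0 <= nxt[1] < GRID and nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    return None

def draw_map(canvas=512, path_width=10, out="subway_map.png"):
    """Draw one single-colored path between stations A and C (a simplified sketch)."""
    img = Image.new("RGB", (canvas, canvas), "white")
    draw = ImageDraw.Draw(img)
    cell = canvas / GRID

    stations = {"A": (1, 1), "B": (16, 1), "C": (1, 16), "D": (16, 16)}  # assumed grid cells
    for name, (gx, gy) in stations.items():
        draw.text((gx * cell, gy * cell), name, fill="black")

    cells = dfs_path(stations["A"], stations["C"])
    points = [((gx + 0.5) * cell, (gy + 0.5) * cell) for gx, gy in cells]
    draw.line(points, fill="green", width=path_width, joint="curve")
    img.save(out)

draw_map()
```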

Station with 1 path
1 path, 10px width
Station with 2 paths
2 paths, 20px width
Station with 2 paths
2 paths, 20px width
Station with 3 paths
3 paths, 10px width
Fig. 13: Examples of subway map images used in the task, showing different numbers of paths and variations in path thickness.

Prompts

We ask each question using two different wordings:

  1. "How many single-colored paths go from A to C? Answer with a number in curly brackets, e.g., {3}"
  2. "Count the one-colored routes that go from A to C. Answer with a number in curly brackets, e.g., {3}."

Groundtruth

Answers are ∈ {0, 1, 2, 3} (random-baseline accuracy: 25%).

Results

The following table shows the performance of the four models on the task of counting single-colored paths between stations.

Paths  GPT-4o  Gemini-1.5 Pro  Sonnet-3  Sonnet-3.5  Mean
1      56.25   64.58           22.91     92.91       59.16
2      47.44   38.35           28.69     48.29       40.69
3      40.00   21.87           18.12     25.41       26.35
Mean   47.89   41.60           23.24     55.53       42.06

Qualitative samples

How many single-color paths go from A to D? Answer with a number in curly brackets e.g. {3}

              Subway Map 1  Subway Map 2  Subway Map 3  Subway Map 4  Subway Map 5  Subway Map 6
GPT-4o        2✗            0✗            2✗            3✗            3✗            1✓
Gemini-1.5    2✗            2✗            4✗            1✓            2✓            5✗
Sonnet-3      2✗            2✗            3✗            2✗            3✗            3✗
Sonnet-3.5    1✓            1✓            3✗            3✗            2✓            1✓
Fig. 14: Some VLMs (Gemini-1.5, Sonnet-3) surprisingly fail even in extremely easy cases (leftmost). As the number of paths exiting each station increases, VLMs tend to perform worse.