Top AI Models Compared: Coding & Reasoning Scores (2025)

Benchmark comparison of 15 leading AI models from 10 organizations, with coding scores, reasoning scores, context windows, and open-weights status.
| # | model | org | release_date | context_window_k | params_b | coding_score | reasoning_score | multimodal | open_weights |
|---|-------------------|-----------|--------------|------------------|----------|--------------|-----------------|------------|--------------|
| 1 | GPT-4o | OpenAI | 2024-05-13 | 128 | unknown | 82 | 88 | yes | no |
| 2 | Claude 3.5 Sonnet | Anthropic | 2024-10-22 | 200 | unknown | 85 | 91 | yes | no |
| 3 | Claude 3.7 Sonnet | Anthropic | 2025-02-24 | 200 | unknown | 88 | 94 | yes | no |
| 4 | Gemini 2.0 Flash | Google | 2025-01-15 | 1000 | unknown | 80 | 86 | yes | no |
| 5 | Gemini 2.5 Pro | Google | 2025-03-25 | 1000 | unknown | 86 | 93 | yes | no |
| 6 | Llama 3.3 70B | Meta | 2024-12-06 | 128 | 70 | 76 | 82 | no | yes |
| 7 | DeepSeek-V3 | DeepSeek | 2024-12-26 | 128 | 671 | 82 | 87 | no | yes |
| 8 | DeepSeek-R1 | DeepSeek | 2025-01-20 | 128 | 671 | 84 | 95 | no | yes |
| 9 | Grok-3 | xAI | 2025-02-17 | 131 | unknown | 83 | 92 | yes | no |
| 10 | Mistral Large 2 | Mistral | 2024-07-24 | 128 | 123 | 78 | 80 | no | yes |
| 11 | Qwen2.5-72B | Alibaba | 2024-09-19 | 128 | 72 | 79 | 83 | no | yes |
| 12 | Phi-4 | Microsoft | 2024-12-12 | 16 | 14 | 77 | 81 | no | yes |
| 13 | o3-mini | OpenAI | 2025-01-31 | 200 | unknown | 87 | 96 | no | no |
| 14 | o1 | OpenAI | 2024-09-12 | 200 | unknown | 85 | 95 | no | no |
| 15 | Command R+ | Cohere | 2024-04-04 | 128 | 104 | 74 | 78 | no | no |
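
To make the comparisons below easy to check, here is a minimal sketch that loads the table into a pandas DataFrame. This is a convenience for readers, not part of the original story; it assumes pandas is installed, and unknown parameter counts are left blank so they parse as NaN:

```python
import io

import pandas as pd

# Benchmark table transcribed from above; blank params_b means "unknown".
CSV = """model,org,release_date,context_window_k,params_b,coding_score,reasoning_score,multimodal,open_weights
GPT-4o,OpenAI,2024-05-13,128,,82,88,yes,no
Claude 3.5 Sonnet,Anthropic,2024-10-22,200,,85,91,yes,no
Claude 3.7 Sonnet,Anthropic,2025-02-24,200,,88,94,yes,no
Gemini 2.0 Flash,Google,2025-01-15,1000,,80,86,yes,no
Gemini 2.5 Pro,Google,2025-03-25,1000,,86,93,yes,no
Llama 3.3 70B,Meta,2024-12-06,128,70,76,82,no,yes
DeepSeek-V3,DeepSeek,2024-12-26,128,671,82,87,no,yes
DeepSeek-R1,DeepSeek,2025-01-20,128,671,84,95,no,yes
Grok-3,xAI,2025-02-17,131,,83,92,yes,no
Mistral Large 2,Mistral,2024-07-24,128,123,78,80,no,yes
Qwen2.5-72B,Alibaba,2024-09-19,128,72,79,83,no,yes
Phi-4,Microsoft,2024-12-12,16,14,77,81,no,yes
o3-mini,OpenAI,2025-01-31,200,,87,96,no,no
o1,OpenAI,2024-09-12,200,,85,95,no,no
Command R+,Cohere,2024-04-04,128,104,74,78,no,no"""

df = pd.read_csv(io.StringIO(CSV), parse_dates=["release_date"])

# Leaders on each axis, matching the headline findings below.
print(df.loc[df["coding_score"].idxmax(), "model"])     # Claude 3.7 Sonnet (88)
print(df.loc[df["reasoning_score"].idxmax(), "model"])  # o3-mini (96)
```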

AI Analysis

OpenAI's o3-mini tops reasoning (96) while Claude 3.7 Sonnet leads coding (88) — no single model dominates both

DeepSeek-R1, an open-weights model, matches o1's reasoning score of 95, narrowing the open/closed gap

Key Findings

  • Reasoning-specialist models (o3-mini, o1, DeepSeek-R1) dominate the reasoning benchmark with scores of 95-96, but none of them leads on coding
  • Claude 3.7 Sonnet is the only model scoring 88+ on coding, edging out o3-mini (87) and Gemini 2.5 Pro (86)
  • Open-weights models average 4.0 points lower on coding and 5.7 points lower on reasoning than closed models (reproduced in the sketch after this list)
  • Google's Gemini models offer 1M-token context — 5x larger than the next tier (200k) — at competitive benchmark scores
  • Only 6 of 15 models support multimodal input, all from closed-source providers (OpenAI, Anthropic, Google, xAI)
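
The averages and the context-window ratio from the bullets above can be reproduced from the table. This snippet continues the loading sketch shown after the table and assumes its df is still in scope:

```python
# Mean scores grouped by open-weights status.
avg = df.groupby("open_weights")[["coding_score", "reasoning_score"]].mean().round(1)
print(avg)
#               coding_score  reasoning_score
# open_weights
# no                    83.3             90.3
# yes                   79.3             84.7   -> gaps of 4.0 and 5.7 points

# Gemini's 1M-token (1000k) window versus the next-largest tier (200k).
tiers = df["context_window_k"].drop_duplicates().nlargest(2)
print(tiers.iloc[0] / tiers.iloc[1])  # 5.0
```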

Visualizations

Interactive charts accompany the original story (titles only; not reproduced here):

  • Context Window Size by Model
  • Open-Weights vs Closed: Average Scores
  • Coding Score by Model
  • Reasoning Score by Model

The DeepSeek-R1 Surprise

DeepSeek-R1 is open-weights and scores 95 on reasoning, tied with OpenAI's o1 and just 1 point behind o3-mini. It is the only open-weights model in the top 5 for reasoning, suggesting that open models are catching up fastest on reasoning-heavy tasks.
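
A quick check of that claim, again assuming the df from the loading sketch after the table:

```python
# Top five models by reasoning score; only one carries open weights.
top5 = df.nlargest(5, "reasoning_score")[["model", "reasoning_score", "open_weights"]]
print(top5.to_string(index=False))  # DeepSeek-R1 is the lone "yes" in open_weights
```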

This dataset contains 15 records across 9 fields: model, org, release_date, context_window_k, params_b, coding_score, reasoning_score, multimodal, and open_weights.

15 rows · 9 columns · 2026-03-24
