
DeepSeek Math V2: The Open-Source Reasoner Achieving Gold-Level IMO Performance
Overview
AI reasoning has taken an exciting turn. DeepSeek, a Chinese research lab that previously astonished the industry with its affordable, high-performance models, has unveiled DeepSeek Math V2—a specialized open-source math reasoner. This model claims to achieve gold-level performance on 2025 International Mathematical Olympiad (IMO) problems, nearly matching Google DeepMind’s Gemini Deep Think on rigorous proof benchmarks, and scoring an impressive 118 out of 120 on the 2024 Putnam exam. Unlike proprietary alternatives, DeepSeek is providing weights, code, and a training recipe for the community to explore, fine-tune, and enhance the model.
This development is significant for more than just accolades. Making high-quality mathematical reasoning available for download has the potential to accelerate advancements in engineering, science, and safety research. However, it also presents new challenges regarding verification, training data integrity, and the responsible use of such systems.
In this article, we delve into what has been released, its performance, the inner workings of the model, comparisons to Google and OpenAI, and best practices for safe usage.
What DeepSeek Released
- Model Family: DeepSeek Math V2 is built on the DeepSeek V3.2 experimental base, a 685-billion-parameter mixture-of-experts (MoE) configuration according to its Hugging Face card. The repository includes a brief paper, figures, and sample outputs, released under the Apache 2.0 license for both code and model.
- Training Recipe: Training uses a verifier model that grades proof steps; the verifier then serves as a reward model, teaching a generator to produce rigorous proofs that survive its scrutiny. This generator-verifier loop is designed to keep verification a step ahead of generation, enforcing detailed, step-by-step rigor.
- Release Artifacts: Available on GitHub is the paper along with outputs, and there’s a Hugging Face listing for downloads and a quick start. Be aware that reproducing their top results will require substantial computational resources due to the system’s demands at test-time.
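The generator-verifier loop described in the training recipe can be sketched in a few lines. The following is a toy illustration, not DeepSeek's actual code: `generate_proof` and `verify_proof` stand in for the real generator and verifier models, and the verifier's score stands in for the RL reward signal.

```python
import random

def generate_proof(problem, temperature):
    # Toy stand-in for the generator LLM: returns a candidate "proof"
    # as a list of steps. (temperature is unused in this toy version;
    # a real generator would sample at different temperatures.)
    n_steps = random.randint(1, 5)
    return [f"step {i} for {problem}" for i in range(n_steps)]

def verify_proof(proof):
    # Toy stand-in for the verifier: scores each step and returns an
    # overall reward in [0, 1]. The real verifier is itself a trained
    # model that judges step-by-step rigor.
    step_scores = [1.0 if "step" in s else 0.0 for s in proof]
    return sum(step_scores) / len(step_scores)

def training_iteration(problems, threshold=0.9):
    # One generator-verifier iteration: sample several candidate
    # proofs per problem, keep the best one if the verifier accepts
    # it, and (in the real recipe) use the verifier's score as the
    # RL reward that updates the generator.
    accepted = []
    for p in problems:
        candidates = [generate_proof(p, t) for t in (0.2, 0.7, 1.0)]
        scored = [(verify_proof(c), c) for c in candidates]
        best_score, best = max(scored, key=lambda x: x[0])
        if best_score >= threshold:
            accepted.append((p, best))
    return accepted
```

The key design point is the asymmetry: checking a proof step is easier than producing one, so a verifier trained to be slightly stronger than the current generator can keep supplying a useful reward as the generator improves.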
Performance Highlights
DeepSeek boasts remarkable results:
- IMO 2025 Problems: The model solved 5 of 6 problems, a score DeepSeek characterizes as gold-level based on its own evaluation.
- Putnam 2024: Achieved 118 of 120, with near-full marks on most problems when given substantial test-time compute.
- CMO 2024: Met gold-level standards.
- IMO-ProofBench: Scored 61.9% on the Advanced Proof set and around 99% on the Basic set, placing it in close proximity to Google’s Gemini Deep Think.
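The heavy test-time compute behind these scores typically amounts to sampling many candidate solutions and reranking them with the verifier; whether DeepSeek's pipeline works exactly this way is an assumption here, but best-of-n with verifier reranking is the standard pattern. A minimal sketch with toy stand-ins:

```python
import random

def best_of_n(problem, generate, score, n=8):
    # Draw n candidate solutions and keep the one the verifier scores
    # highest. Larger n trades more compute for higher reliability.
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins for illustration only; the real generator and
# verifier are large language models.
def toy_generate(problem):
    return {"problem": problem, "quality": random.random()}

def toy_score(candidate):
    return candidate["quality"]
```

With a real verifier, `score` would grade each candidate proof step by step, and the winning candidate's score could also gate whether to keep sampling, so extra compute is spent only on problems that remain unsolved.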
It’s essential to note the benchmarking context. Google DeepMind has published officially graded results for Gemini Deep Think, which achieved gold-level performance working entirely in natural language. DeepSeek’s IMO claims, by contrast, come from its own write-up rather than official competition grading, so the results call for careful interpretation.
Comparison with Google and OpenAI
- Google DeepMind: Gemini Deep Think not only reached gold-level performance by solving 5 out of 6 problems but also did so within the competition’s time limits and fully in natural language—a notable advancement from previous formal-language approaches.
- ProofBench Scores: On the Advanced IMO-ProofBench, Gemini Deep Think is rated at approximately 65.7%, while general-purpose models score considerably lower. DeepSeek Math V2’s 61.9% puts it within striking distance on this stringently evaluated benchmark.
Overall, this data suggests that specialized models like Deep Think can vastly improve performance in complex mathematical tasks, and open-source models like DeepSeek are making strides in this domain through rigorous verification methods.
Why This is a Breakthrough
- Open Access to Advanced Reasoning: While Google and OpenAI have demonstrated what proprietary systems can achieve, DeepSeek is pioneering by releasing an open-source package that performs comparably, paving the way for reproducibility and independent evaluations.
- Emphasis on Proofs Rather than Just Answers: DeepSeek’s methodology prioritizes self-verification and the correctness of full proofs, which aligns more closely with traditional mathematical practice than answer-only benchmarks do.
- Community-Driven Innovation: With publicly available weights and resources, researchers are equipped to experiment with various verification objectives, enhance test-time scaling, and integrate the model into existing formal systems.
Thank You for Reading this Blog and See You Soon! 🙏 👋