From Emerging to Superhuman: Making Sense of DeepMind’s Six Levels of AGI

By Zakariae BEN ALLAL

Published: 2025-08-23

TL;DR: Google DeepMind proposes six capability levels on the path to artificial general intelligence (AGI), defined along capability, generality, and autonomy. The framework is meant to standardize how we talk about progress, not to forecast timelines. For builders, buyers, and policymakers, it’s a useful map—but it’s not a safety scale and doesn’t replace risk evaluations. Here’s what the levels mean, how they compare to Anthropic and OpenAI’s safety frameworks, what we can and can’t measure today, and the practical implications for governance and product strategy.

What DeepMind actually published

Google DeepMind outlined a taxonomy—“Levels of AGI”—to make progress toward general-purpose AI more measurable and comparable. The proposal defines six bands, from Level 0 (No AI) to Level 5 (Superhuman), rated on capability and generality, and pairs them with a notion of autonomy (how independently a system can pursue tasks and goals). It emphasizes that models can straddle levels depending on the task, and that reliability and safety are separate dimensions that must be measured with dedicated evaluations.

A mental model of the taxonomy: capability (from novice to superhuman), generality (number and breadth of domains), and autonomy (ability to plan, act, and use tools). Levels are bands, not hard thresholds.
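
To make those three dimensions concrete, here is a minimal sketch of how a team might record them per domain; the class and field names are assumptions for this illustration, not part of DeepMind’s proposal.

```python
from dataclasses import dataclass, field

# Minimal sketch of the three axes; all names are illustrative, not from the paper.
# One record per evaluated domain, so a single model can sit in different
# capability bands across domains.
@dataclass
class DomainProfile:
    domain: str        # e.g. "coding" or "customer support"
    capability: str    # band label such as "Emerging", "Competent", "Expert"
    autonomy: str      # e.g. "tool", "consultant", "collaborator", "agent"
    notes: str = ""    # reliability caveats, eval sources, etc.

@dataclass
class ModelProfile:
    model: str
    domains: list[DomainProfile] = field(default_factory=list)

    def straddles_levels(self) -> bool:
        """True when the capability band differs across evaluated domains."""
        return len({d.capability for d in self.domains}) > 1
```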

The six levels at a glance

DeepMind’s levels describe what a system can do relative to skilled humans across tasks (capability and generality); autonomy is rated on its own companion scale, from AI as a tool up to AI as an agent. Summarized, with a small illustrative sketch after the list:

  • Level 0 — No AI: No machine learning in the loop (think calculators or rule-based software); the baseline the scale starts from.
  • Level 1 — Emerging: Equal to or somewhat better than an unskilled human. Useful but inconsistent performance with noticeable gaps; heavy human steering in practice.
  • Level 2 — Competent: At least the 50th percentile of skilled adults on the relevant tasks. Solid performance across a broader set of tasks and domains, though still requiring close oversight.
  • Level 3 — Expert: At least the 90th percentile of skilled adults. Can coordinate subtasks and use external tools; reliability and robustness still vary by context.
  • Level 4 — Virtuoso: At least the 99th percentile of skilled adults, i.e., performing at or above top human experts on the relevant tasks.
  • Level 5 — Superhuman: Outperforms 100% of humans on the relevant tasks, exceeding the best human performance outright.
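
For readers who prefer code, the sketch below encodes the capability bands as a simple threshold function over “percentile of skilled adults outperformed.” The thresholds mirror the definitions above; the function name and scoring input are assumptions, and Level 1 is a simplification since the paper anchors it to unskilled rather than skilled humans.

```python
# Illustrative only: map "percentile of skilled adults outperformed" on a task
# family to a capability band. Thresholds mirror the published definitions; the
# function itself is a simplification for this sketch.
def capability_band(percentile_outperformed: float) -> str:
    if percentile_outperformed >= 100:
        return "Level 5: Superhuman"
    if percentile_outperformed >= 99:
        return "Level 4: Virtuoso"
    if percentile_outperformed >= 90:
        return "Level 3: Expert"
    if percentile_outperformed >= 50:
        return "Level 2: Competent"
    if percentile_outperformed > 0:
        return "Level 1: Emerging"
    return "Level 0: No AI"

# A model beating ~92% of skilled adults on a task family:
print(capability_band(92))  # -> "Level 3: Expert"
```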

Where are today’s frontier models? For general-purpose work they sit at Level 1 (Emerging), reaching Level 2–3 on some narrower tasks, with meaningful gains in tool use and planning but clear limits in reliability and autonomy in open-ended, real-world settings.

How this differs from safety scales

It’s easy to conflate capability levels with safety. DeepMind’s taxonomy is not a safety rating; it won’t tell you whether a model is secure against misuse or robust under adversarial pressure. It complements (but doesn’t replace) safety-oriented frameworks such as:

  • Anthropic’s AI Safety Levels (ASL): A “responsible scaling” policy that sets concrete safety evaluations and guardrails tied to capability growth (e.g., thresholds around bio, cyber, and autonomous capabilities).
  • OpenAI’s Preparedness Framework: Risk evaluations focused on categories like cyber offense, CBRN, persuasion, and autonomy, with deployment gating based on measured risk.
  • NIST AI Risk Management Framework (RMF): A general enterprise risk framework for AI systems (govern, map, measure, manage) that organizations can use regardless of model provider.

Bottom line: DeepMind’s levels help us talk about capability trajectory. Safety frameworks govern whether and how to deploy as capability rises.

Measuring “levels” in practice

Turning the taxonomy into numbers requires careful evaluations:

  • Benchmarks: Suites like MMLU, GPQA, HumanEval, and domain-specific tasks provide capability snapshots, but they can saturate, overfit, or fail to reflect real-world complexity.
  • Generalization and reliability: Out-of-distribution tests, long-horizon tasks, tool-use evaluations, and multi-agent settings are critical to distinguish Levels 2–4.
  • Autonomy: Measuring planning, tool orchestration, and safe continuous operation requires scenario-based evals (e.g., constrained web agents or sandboxed workflows) and rigorous red teaming.
  • Independent testing: Third-party audits and shared evaluation protocols are needed so the same model doesn’t score differently across labs.

DeepMind’s paper explicitly calls for standardized, externally credible evaluations and notes that models can sit at different levels across domains—so any single score will be a simplification.
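
As a toy illustration of reporting per-domain bands instead of one overall number, the sketch below aggregates hypothetical evaluation scores by domain; every name, score, and threshold in it is invented for the example.

```python
from statistics import mean

# Toy harness: aggregate benchmark scores per domain and report one band per
# domain instead of a single overall score. All domains, scores, and thresholds
# below are hypothetical.
def summarize(results: dict[str, list[float]], to_band) -> dict[str, str]:
    return {domain: to_band(mean(scores)) for domain, scores in results.items()}

example = {
    "coding": [88.0, 91.5, 93.0],        # e.g. code-generation eval suites
    "customer_support": [55.0, 62.0],    # task-specific, long-horizon evals
    "web_agent_tasks": [20.0, 35.0],     # scenario-based autonomy evals
}

band = lambda s: "Expert" if s >= 90 else "Competent" if s >= 50 else "Emerging"
print(summarize(example, band))
# {'coding': 'Expert', 'customer_support': 'Competent', 'web_agent_tasks': 'Emerging'}
```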

Implications for policy and product teams

  • Roadmapping: Use the levels to express capability goals (e.g., “reach Level 3 in customer support tasks”) while separately tracking safety gates and incident response readiness.
  • Procurement: Ask vendors to characterize where their model sits by task, plus provide safety evaluation results (bio/cyber/persuasion, jailbreak resistance, etc.).
  • Governance triggers: Tie internal controls to both capability and risk—for example, require additional review when models cross a target capability band in sensitive workflows (a minimal sketch follows this list).
  • Evaluation debt: Invest early in domain-relevant evals (reliability under shift, tool-use robustness, long-horizon autonomy) to avoid shipping based on saturated leaderboards.
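
One way to operationalize the governance-trigger idea above, as a hedged sketch with invented workflows, band thresholds, and artifact names:

```python
# Hedged sketch of a governance gate: flag for extra review when a model's
# observed capability band in a sensitive workflow crosses a threshold and a
# required safety artifact is missing. All entries are hypothetical.
LEVEL_ORDER = [
    "Level 0: No AI", "Level 1: Emerging", "Level 2: Competent",
    "Level 3: Expert", "Level 4: Virtuoso", "Level 5: Superhuman",
]

GATES = {
    # workflow: (band that triggers review, safety artifacts required to ship)
    "customer_support": ("Level 3: Expert", {"jailbreak_resistance_report"}),
    "code_execution_agent": ("Level 2: Competent", {"autonomy_eval", "incident_runbook"}),
}

def needs_review(workflow: str, observed_band: str, artifacts: set[str]) -> bool:
    gate_band, required = GATES[workflow]
    crossed = LEVEL_ORDER.index(observed_band) >= LEVEL_ORDER.index(gate_band)
    return crossed and not required.issubset(artifacts)

# An Expert-band support model with no jailbreak report gets flagged for review.
print(needs_review("customer_support", "Level 3: Expert", set()))  # True
```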

Open questions and healthy skepticism

  • Fuzzy boundaries: Levels are bands, not bright lines; organizations must define operational thresholds relevant to their use cases.
  • Benchmark brittleness: Progress can look rapid on static tests but stall in messy real-world environments.
  • Autonomy vs. oversight: Many deployments intentionally limit autonomy; “lower autonomy” may be a design choice, not a capability cap.
  • Safety isn’t implied: Climbing capability levels increases both utility and risk surface; dedicated safety evals and mitigations remain essential.

The bottom line

DeepMind’s six levels won’t settle the AGI debate, but they provide a shared vocabulary for tracking capability, generality, and autonomy. Treat them as a map for progress and a prompt to strengthen safety and evaluation practices—not a scoreboard. If your organization pairs this taxonomy with rigorous, independent safety evaluations and clear governance gates, you’ll be better positioned to harness frontier AI while keeping risks in check.

Sources

  1. AI Business: Google DeepMind’s Six Levels of AGI
  2. Google DeepMind Blog: Levels of AGI — Operationalizing progress on the path to AGI
  3. Paper: Levels of AGI — Operationalizing Progress on the Path to AGI (arXiv)
  4. MIT Technology Review: DeepMind proposes a six-level scale to track progress to AGI
  5. Anthropic: Responsible Scaling Policy (AI Safety Levels)
  6. OpenAI: Preparedness Framework
  7. NIST: AI Risk Management Framework 1.0

Thank You for Reading this Blog and See You Soon! 🙏 👋
