From Emerging to Superhuman: Making Sense of DeepMind’s Six Levels of AGI

By Zakariae BEN ALLAL

Published: 2025-08-23

TL;DR: Google DeepMind proposes six capability levels on the path to artificial general intelligence (AGI), defined along capability, generality, and autonomy. The framework is meant to standardize how we talk about progress, not to forecast timelines. For builders, buyers, and policymakers, it’s a useful map—but it’s not a safety scale and doesn’t replace risk evaluations. Here’s what the levels mean, how they compare to Anthropic and OpenAI’s safety frameworks, what we can and can’t measure today, and the practical implications for governance and product strategy.

What DeepMind actually published

Google DeepMind outlined a taxonomy—“Levels of AGI”—to make progress toward general-purpose AI more measurable and comparable. The proposal defines six bands, from Level 0 (No AI) to Level 5 (Superhuman), rated on capability and generality, and pairs them with a notion of autonomy (how independently a system can pursue tasks and goals). It emphasizes that models can straddle levels depending on the task, and that reliability and safety are separate dimensions that must be measured with dedicated evaluations.

A mental model of the taxonomy: capability (from novice to superhuman), generality (number and breadth of domains), and autonomy (ability to plan, act, and use tools). Levels are bands, not hard thresholds.
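
To make those three dimensions concrete, here is a minimal sketch of how a team might record them per domain; the class and field names are assumptions for this illustration, not part of DeepMind’s proposal.

```python
from dataclasses import dataclass, field

# Minimal sketch of the three axes; all names are illustrative, not from the paper.
# One record per evaluated domain, so a single model can sit in different
# capability bands across domains.
@dataclass
class DomainProfile:
    domain: str        # e.g. "coding" or "customer support"
    capability: str    # band label such as "Emerging", "Competent", "Expert"
    autonomy: str      # e.g. "tool", "consultant", "collaborator", "agent"
    notes: str = ""    # reliability caveats, eval sources, etc.

@dataclass
class ModelProfile:
    model: str
    domains: list[DomainProfile] = field(default_factory=list)

    def straddles_levels(self) -> bool:
        """True when the capability band differs across evaluated domains."""
        return len({d.capability for d in self.domains}) > 1
```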

The six levels at a glance

DeepMind’s levels describe what a system can do relative to skilled humans across tasks (capability and generality); autonomy is rated on its own companion scale, from AI as a tool up to AI as an agent. Summarized, with a small illustrative sketch after the list:

  • Level 0 — No AI: No machine learning in the loop (think calculators or rule-based software); the baseline the scale starts from.
  • Level 1 — Emerging: Equal to or somewhat better than an unskilled human. Useful but inconsistent performance with noticeable gaps; heavy human steering in practice.
  • Level 2 — Competent: At least the 50th percentile of skilled adults on the relevant tasks. Solid performance across a broader set of tasks and domains, though still requiring close oversight.
  • Level 3 — Expert: At least the 90th percentile of skilled adults. Can coordinate subtasks and use external tools; reliability and robustness still vary by context.
  • Level 4 — Virtuoso: At least the 99th percentile of skilled adults, i.e., performing at or above top human experts on the relevant tasks.
  • Level 5 — Superhuman: Outperforms 100% of humans on the relevant tasks, exceeding the best human performance outright.
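
For readers who prefer code, the sketch below encodes the capability bands as a simple threshold function over “percentile of skilled adults outperformed.” The thresholds mirror the definitions above; the function name and scoring input are assumptions, and Level 1 is a simplification since the paper anchors it to unskilled rather than skilled humans.

```python
# Illustrative only: map "percentile of skilled adults outperformed" on a task
# family to a capability band. Thresholds mirror the published definitions; the
# function itself is a simplification for this sketch.
def capability_band(percentile_outperformed: float) -> str:
    if percentile_outperformed >= 100:
        return "Level 5: Superhuman"
    if percentile_outperformed >= 99:
        return "Level 4: Virtuoso"
    if percentile_outperformed >= 90:
        return "Level 3: Expert"
    if percentile_outperformed >= 50:
        return "Level 2: Competent"
    if percentile_outperformed > 0:
        return "Level 1: Emerging"
    return "Level 0: No AI"

# A model beating ~92% of skilled adults on a task family:
print(capability_band(92))  # -> "Level 3: Expert"
```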

Where are today’s frontier models? For general-purpose work they sit at Level 1 (Emerging), reaching Level 2–3 on some narrower tasks, with meaningful gains in tool use and planning but clear limits in reliability and autonomy in open-ended, real-world settings.

How this differs from safety scales

It’s easy to conflate capability levels with safety. DeepMind’s taxonomy is not a safety rating; it won’t tell you whether a model is secure against misuse or robust under adversarial pressure. It complements (but doesn’t replace) safety-oriented frameworks such as:

  • Anthropic’s AI Safety Levels (ASL): A “responsible scaling” policy that sets concrete safety evaluations and guardrails tied to capability growth (e.g., thresholds around bio, cyber, and autonomous capabilities).
  • OpenAI’s Preparedness Framework: Risk evaluations focused on categories like cyber offense, CBRN, persuasion, and autonomy, with deployment gating based on measured risk.
  • NIST AI Risk Management Framework (RMF): A general enterprise risk framework for AI systems (govern, map, measure, manage) that organizations can use regardless of model provider.

Bottom line: DeepMind’s levels help us talk about capability trajectory. Safety frameworks govern whether and how to deploy as capability rises.

Measuring “levels” in practice

Turning the taxonomy into numbers requires careful evaluations:

  • Benchmarks: Suites like MMLU, GPQA, HumanEval, and domain-specific tasks provide capability snapshots, but they can saturate, overfit, or fail to reflect real-world complexity.
  • Generalization and reliability: Out-of-distribution tests, long-horizon tasks, tool-use evaluations, and multi-agent settings are critical to distinguish Levels 2–4.
  • Autonomy: Measuring planning, tool orchestration, and safe continuous operation requires scenario-based evals (e.g., constrained web agents or sandboxed workflows) and rigorous red teaming.
  • Independent testing: Third-party audits and shared evaluation protocols are needed so the same model doesn’t score differently across labs.

DeepMind’s paper explicitly calls for standardized, externally credible evaluations and notes that models can sit at different levels across domains—so any single score will be a simplification.
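
As a toy illustration of reporting per-domain bands instead of one overall number, the sketch below aggregates hypothetical evaluation scores by domain; every name, score, and threshold in it is invented for the example.

```python
from statistics import mean

# Toy harness: aggregate benchmark scores per domain and report one band per
# domain instead of a single overall score. All domains, scores, and thresholds
# below are hypothetical.
def summarize(results: dict[str, list[float]], to_band) -> dict[str, str]:
    return {domain: to_band(mean(scores)) for domain, scores in results.items()}

example = {
    "coding": [88.0, 91.5, 93.0],        # e.g. code-generation eval suites
    "customer_support": [55.0, 62.0],    # task-specific, long-horizon evals
    "web_agent_tasks": [20.0, 35.0],     # scenario-based autonomy evals
}

band = lambda s: "Expert" if s >= 90 else "Competent" if s >= 50 else "Emerging"
print(summarize(example, band))
# {'coding': 'Expert', 'customer_support': 'Competent', 'web_agent_tasks': 'Emerging'}
```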

Implications for policy and product teams

  • Roadmapping: Use the levels to express capability goals (e.g., “reach Level 3 in customer support tasks”) while separately tracking safety gates and incident response readiness.
  • Procurement: Ask vendors to characterize where their model sits by task, plus provide safety evaluation results (bio/cyber/persuasion, jailbreak resistance, etc.).
  • Governance triggers: Tie internal controls to both capability and risk—for example, require additional review when models cross a target capability band in sensitive workflows (a minimal sketch follows this list).
  • Evaluation debt: Invest early in domain-relevant evals (reliability under shift, tool-use robustness, long-horizon autonomy) to avoid shipping based on saturated leaderboards.
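
One way to operationalize the governance-trigger idea above, as a hedged sketch with invented workflows, band thresholds, and artifact names:

```python
# Hedged sketch of a governance gate: flag for extra review when a model's
# observed capability band in a sensitive workflow crosses a threshold and a
# required safety artifact is missing. All entries are hypothetical.
LEVEL_ORDER = [
    "Level 0: No AI", "Level 1: Emerging", "Level 2: Competent",
    "Level 3: Expert", "Level 4: Virtuoso", "Level 5: Superhuman",
]

GATES = {
    # workflow: (band that triggers review, safety artifacts required to ship)
    "customer_support": ("Level 3: Expert", {"jailbreak_resistance_report"}),
    "code_execution_agent": ("Level 2: Competent", {"autonomy_eval", "incident_runbook"}),
}

def needs_review(workflow: str, observed_band: str, artifacts: set[str]) -> bool:
    gate_band, required = GATES[workflow]
    crossed = LEVEL_ORDER.index(observed_band) >= LEVEL_ORDER.index(gate_band)
    return crossed and not required.issubset(artifacts)

# An Expert-band support model with no jailbreak report gets flagged for review.
print(needs_review("customer_support", "Level 3: Expert", set()))  # True
```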

Open questions and healthy skepticism

  • Fuzzy boundaries: Levels are bands, not bright lines; organizations must define operational thresholds relevant to their use cases.
  • Benchmark brittleness: Progress can look rapid on static tests but stall in messy real-world environments.
  • Autonomy vs. oversight: Many deployments intentionally limit autonomy; “lower autonomy” may be a design choice, not a capability cap.
  • Safety isn’t implied: Climbing capability levels increases both utility and risk surface; dedicated safety evals and mitigations remain essential.

The bottom line

DeepMind’s six levels won’t settle the AGI debate, but they provide a shared vocabulary for tracking capability, generality, and autonomy. Treat them as a map for progress and a prompt to strengthen safety and evaluation practices—not a scoreboard. If your organization pairs this taxonomy with rigorous, independent safety evaluations and clear governance gates, you’ll be better positioned to harness frontier AI while keeping risks in check.

Sources

  1. AI Business: Google DeepMind’s Six Levels of AGI
  2. Google DeepMind Blog: Levels of AGI — Operationalizing progress on the path to AGI
  3. Paper: Levels of AGI — Operationalizing Progress on the Path to AGI (arXiv)
  4. MIT Technology Review: DeepMind proposes a six-level scale to track progress to AGI
  5. Anthropic: Responsible Scaling Policy (AI Safety Levels)
  6. OpenAI: Preparedness Framework
  7. NIST: AI Risk Management Framework 1.0

Thank You for Reading this Blog and See You Soon! 🙏 👋
