What Counts As AGI? Inside DeepMind’s Push To Set Common Levels And Benchmarks

Artificial general intelligence, or AGI, gets invoked in product launches, policy debates, and investor calls. But ask five experts what AGI means and you will likely hear five different answers. Google DeepMind is trying to change that by proposing a shared way to define and measure progress toward AGI. Here is what they are suggesting, why it matters, and how it fits into emerging AI safety and governance efforts.
Why the definition of AGI matters
AGI is commonly described as AI that can learn and perform a wide range of tasks at or beyond human level, across domains. That is different from today’s narrow or task-specific AI, which excels in bounded settings. Without a shared yardstick, it is hard to separate real progress from hype, compare systems fairly, or set safety policies that actually map to capabilities.
Widely used benchmarks are already saturating for frontier models, and performance can vary wildly between lab conditions and real-world use. A clearer, operational definition would help researchers track capability trends, help companies market responsibly, and help policymakers target guardrails where they are most needed.
DeepMind’s proposal at a glance
DeepMind researchers have outlined a framework to define and evaluate progress toward AGI with more precision. The proposal aims to move the conversation beyond vague claims and toward testable criteria and levels of capability (MIT Technology Review).
The high-level idea: describe AI systems along multiple dimensions and group them into levels that reflect how capable, general, and autonomous they are in practice, not just on cherry-picked tests. The company has discussed this direction publicly and in research venues, including work on proposed levels of AGI (Google DeepMind blog) and coverage by tech media (The Verge).
Three core dimensions
- Capability – How well does the system perform on tasks of varying difficulty, from novice to expert to cutting-edge levels? Performance would be measured with standardized evaluations and real-world tasks.
- Generality – How broadly does the system perform across domains and modalities (language, code, vision, audio, robotics)? Can it handle novel tasks and transfer skills, not just repeat patterns from training data?
- Autonomy – How independently can the system plan, act, and pursue goals over time? This considers tool use, multi-step planning, and the ability to operate with limited human oversight.
From levels to claims
Organizing systems into levels gives teams a way to make specific, testable claims about progress. For example, a developer might say a model is expert-level on open-ended software engineering tasks across multiple languages, but only novice-level at robotics planning and low in autonomy overall. That is more meaningful than a blanket claim of being an AGI.
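To make that concrete, here is a minimal sketch of how such a claim could be written down as structured data. The level names, fields, and example values are illustrative assumptions for this post, not DeepMind's official schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    # Illustrative level names and ordering; not DeepMind's official scale.
    NOVICE = 1
    COMPETENT = 2
    EXPERT = 3
    SUPERHUMAN = 4


@dataclass
class CapabilityClaim:
    """One specific, testable claim about a system in a single domain."""
    domain: str    # e.g. "software engineering", "robotics planning"
    level: Level   # performance level on that domain's evaluations
    autonomy: str  # e.g. "low", "medium", "high" operational independence
    evidence: list[str] = field(default_factory=list)  # benchmark or trial identifiers


# The example claim from the paragraph above, written down explicitly.
profile = [
    CapabilityClaim("software engineering", Level.EXPERT, "medium",
                    ["multi-language, open-ended coding tasks"]),
    CapabilityClaim("robotics planning", Level.NOVICE, "low"),
]
```

A structured record like this is easier to audit and compare across labs than a press-release sentence, which is the point of the levels approach.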
The goal is to build a consistent yardstick so labs, independent evaluators, and regulators can talk about the same thing when they say AGI.
How would we measure it?
Defining levels is the easy part. Measuring them fairly is harder. DeepMind and others point to a few principles that would make AGI evaluations more robust:
- Multi-domain testing – Mix tasks across disciplines (STEM, humanities, law, healthcare, software, robotics) and modalities, not just text-only benchmarks.
- Adversarial and dynamic evals – Use tests that evolve over time and include adversarial probes to reduce benchmark overfitting and data leakage.
- System-level tasks – Go beyond multiple-choice questions. Evaluate long-horizon planning, tool use, and team coordination in realistic workflows.
- Human-in-the-loop validation – Combine automated scoring with expert review for nuanced judgments like scientific reasoning or safety-critical behavior.
- Transparent reporting – Publish evaluation cards that document methods, uncertainties, and known failure cases.
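To illustrate the transparent-reporting point, here is a minimal sketch of what an evaluation card might contain. The fields, score scale, and the crude breadth metric are assumptions made for this post, not a standard any lab has adopted.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DomainResult:
    domain: str           # e.g. "law", "software", "robotics"
    score: float          # normalized 0.0-1.0 score on that domain's test suite
    dynamic_eval: bool    # True if the tests rotate or include adversarial probes
    human_reviewed: bool  # True if experts validated a sample of outputs


@dataclass
class EvaluationCard:
    model_name: str
    results: list[DomainResult]
    known_failures: list[str]  # documented failure modes
    uncertainty_notes: str     # caveats about leakage, variance, task coverage

    def breadth(self, threshold: float = 0.6) -> float:
        """Fraction of tested domains that clear a (hypothetical) competence bar.

        A rough generality signal: one strong domain score is not enough,
        so the card also reports how many domains pass the threshold.
        """
        if not self.results:
            return 0.0
        return mean(1.0 if r.score >= threshold else 0.0 for r in self.results)
```

Publishing something like this alongside leaderboard numbers would let independent evaluators see which scores came from dynamic, human-reviewed tests and which did not.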
These ideas echo broader best practices from the NIST AI Risk Management Framework, which recommends context-specific, lifecycle evaluations and clear documentation to manage AI risks (NIST AI RMF 1.0).
Why a shared AGI yardstick matters for safety and policy
Regulators are moving quickly, and many new rules hinge on how capable a model is. A working definition and levels of AGI would make it easier to align safety policies with real capabilities:
- EU AI Act – The new law introduces special obligations for general-purpose AI models and extra requirements for models with systemic risk. Clear capability levels could help determine when those stricter rules apply (EU AI Act).
- US Executive Order 14110 – Developers training large dual-use foundation models above certain compute thresholds must report key safety information to the US government. Shared metrics would sharpen what to test and report (White House EO 14110).
- UK Frontier AI policy – The UK has emphasized evaluation of dangerous capabilities and hosted the 2023 AI Safety Summit. A common leveling system could support consistent testing across labs (UK Frontier AI policy).
- Industry safety programs – Companies like Anthropic and OpenAI have published safety levels and preparedness frameworks to assess catastrophic risks. An AGI levels framework could complement these with broader capability coverage (Anthropic AI Safety Levels), (OpenAI Preparedness Framework).
Benefits and risks of standardizing AGI
Potential benefits
- Clarity – Replaces hype with testable claims about what systems can and cannot do.
- Comparability – Enables apples-to-apples comparisons across labs and model versions.
- Better governance – Ties safety obligations to capability thresholds rather than brand names or marketing terms.
- Research focus – Highlights neglected areas such as long-horizon autonomy and cross-domain transfer.
Potential pitfalls
- Goodhart’s law – Over-optimizing for the metric can distort behavior and hide weaknesses.
- Gaming and leakage – Public benchmarks can leak into training data or be gamed by specialized prompting.
- Narrow scope – A levels framework might underweight societal impacts like labor shifts, misinformation, or fairness.
- Overconfidence – A neat taxonomy can create false certainty about systems that remain unpredictable in the wild.
How this connects to the broader AGI debate
There is longstanding debate about whether the term AGI is even useful. Critics argue it obscures limits of current systems and encourages speculative marketing. For example, researchers have warned against extrapolating language model competence beyond what is measured, and have called for rigorous, context-aware evaluations before deploying high-stakes systems in the real world (Bender et al., 2021). DeepMind’s push for explicit levels and benchmarks aims to bring the conversation back to measurable reality.
What to watch next
- Independent evaluation hubs – Expect more third-party auditing and public leaderboards that track capability and autonomy across domains.
- Safety-linked thresholds – Policymakers may tie obligations like red-teaming, secure compute, and incident reporting to capability levels rather than model scale alone.
- Real-world trials – More system-level pilots in areas like software engineering, scientific discovery, and robotics will test whether benchmark gains translate into practical reliability.
- Open measurement science – Shared, rotating test sets and community-run evaluations will be key to fighting benchmark leakage and gaming.
Conclusion
DeepMind’s effort to define what counts as AGI is not about declaring victory. It is about building a common yardstick so researchers, companies, and regulators can talk about progress in the same language. No taxonomy will be perfect, but one that is multi-dimensional, transparent, and tested in the wild can help the field move from hype to accountable progress.
FAQs
What is AGI in simple terms?
AGI refers to AI that can learn and perform well across many different tasks and domains, at or beyond human level, rather than being specialized for one task.
What is DeepMind proposing exactly?
A framework with levels of capability that considers how strong, how general, and how autonomous a system is, evaluated with standardized and real-world tests.
How is this different from existing benchmarks?
Most current benchmarks are narrow and static. The proposal emphasizes multi-domain, dynamic, and system-level evaluations that track real-world usefulness and risks.
Will this affect AI regulation?
Likely yes. Clear capability levels would make it easier for laws like the EU AI Act or US reporting rules to target obligations at models that cross risk thresholds.
Is AGI around the corner?
No one knows. Capability progress has been fast, but generality and safe autonomy are harder. A shared yardstick can clarify where we truly are and what gaps remain.
Sources
- MIT Technology Review – Google DeepMind wants to define what counts as AGI
- Google DeepMind blog – Research and perspectives on AGI and levels
- The Verge – Coverage of DeepMind’s proposed levels of AGI
- NIST AI Risk Management Framework 1.0
- European Parliament – EU AI Act overview
- White House – Executive Order 14110 on AI
- UK Government – Frontier AI capabilities and risks
- Anthropic – AI Safety Levels
- OpenAI – Preparedness Framework
- Bender et al. – On the Dangers of Stochastic Parrots