What Counts As AGI? Inside DeepMind’s Push To Set Common Levels And Benchmarks

Artificial general intelligence, or AGI, gets invoked in product launches, policy debates, and investor calls. But ask five experts what AGI means and you will likely hear five different answers. Google DeepMind is trying to change that by proposing a shared way to define and measure progress toward AGI. Here is what they are suggesting, why it matters, and how it fits into emerging AI safety and governance efforts.
Why the definition of AGI matters
AGI is commonly described as AI that can learn and perform a wide range of tasks at or beyond human level, across domains. That is different from today’s narrow or task-specific AI, which excels in bounded settings. Without a shared yardstick, it is hard to separate real progress from hype, compare systems fairly, or set safety policies that actually map to capabilities.
Widely used benchmarks are already saturating for frontier models, and performance can vary wildly between lab conditions and real-world use. A clearer, operational definition would help researchers track capability trends, help companies market responsibly, and help policymakers target guardrails where they are most needed.
DeepMind’s proposal at a glance
DeepMind researchers have outlined a framework to define and evaluate progress toward AGI with more precision. The proposal aims to move the conversation beyond vague claims and toward testable criteria and levels of capability (MIT Technology Review).
The high-level idea: describe AI systems along multiple dimensions and group them into levels that reflect how capable, general, and autonomous they are in practice, not just on cherry-picked tests. The company has discussed this direction publicly and in research venues, including work on proposed levels of AGI (Google DeepMind blog) and coverage by tech media (The Verge).
Three core dimensions
- Capability – How well does the system perform on tasks of varying difficulty, from novice to expert to cutting-edge levels? Performance would be measured with standardized evaluations and real-world tasks.
- Generality – How broadly does the system perform across domains and modalities (language, code, vision, audio, robotics)? Can it handle novel tasks and transfer skills, not just repeat patterns from training data?
- Autonomy – How independently can the system plan, act, and pursue goals over time? This considers tool use, multi-step planning, and the ability to operate with limited human oversight.
From levels to claims
Organizing systems into levels gives teams a way to make specific, testable claims about progress. For example, a developer might say a model is expert-level on open-ended software engineering tasks across multiple languages, but only novice-level at robotics planning and low in autonomy overall. That is more meaningful than a blanket claim of being an AGI.
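To make that concrete, here is a minimal sketch of how such a claim could be written down as structured data. The level names, fields, and example values are illustrative assumptions for this post, not DeepMind's official schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    # Illustrative level names and ordering; not DeepMind's official scale.
    NOVICE = 1
    COMPETENT = 2
    EXPERT = 3
    SUPERHUMAN = 4


@dataclass
class CapabilityClaim:
    """One specific, testable claim about a system in a single domain."""
    domain: str    # e.g. "software engineering", "robotics planning"
    level: Level   # performance level on that domain's evaluations
    autonomy: str  # e.g. "low", "medium", "high" operational independence
    evidence: list[str] = field(default_factory=list)  # benchmark or trial identifiers


# The example claim from the paragraph above, written down explicitly.
profile = [
    CapabilityClaim("software engineering", Level.EXPERT, "medium",
                    ["multi-language, open-ended coding tasks"]),
    CapabilityClaim("robotics planning", Level.NOVICE, "low"),
]
```

A structured record like this is easier to audit and compare across labs than a press-release sentence, which is the point of the levels approach.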
The goal is to build a consistent yardstick so labs, independent evaluators, and regulators can talk about the same thing when they say AGI.
How would we measure it?
Defining levels is the easy part. Measuring them fairly is harder. DeepMind and others point to a few principles that would make AGI evaluations more robust:
- Multi-domain testing – Mix tasks across disciplines (STEM, humanities, law, healthcare, software, robotics) and modalities, not just text-only benchmarks.
- Adversarial and dynamic evals – Use tests that evolve over time and include adversarial probes to reduce benchmark overfitting and data leakage.
- System-level tasks – Go beyond multiple-choice questions. Evaluate long-horizon planning, tool use, and team coordination in realistic workflows.
- Human-in-the-loop validation – Combine automated scoring with expert review for nuanced judgments like scientific reasoning or safety-critical behavior.
- Transparent reporting – Publish evaluation cards that document methods, uncertainties, and known failure cases.
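To illustrate the transparent-reporting point, here is a minimal sketch of what an evaluation card might contain. The fields, score scale, and the crude breadth metric are assumptions made for this post, not a standard any lab has adopted.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DomainResult:
    domain: str           # e.g. "law", "software", "robotics"
    score: float          # normalized 0.0-1.0 score on that domain's test suite
    dynamic_eval: bool    # True if the tests rotate or include adversarial probes
    human_reviewed: bool  # True if experts validated a sample of outputs


@dataclass
class EvaluationCard:
    model_name: str
    results: list[DomainResult]
    known_failures: list[str]  # documented failure modes
    uncertainty_notes: str     # caveats about leakage, variance, task coverage

    def breadth(self, threshold: float = 0.6) -> float:
        """Fraction of tested domains that clear a (hypothetical) competence bar.

        A rough generality signal: one strong domain score is not enough,
        so the card also reports how many domains pass the threshold.
        """
        if not self.results:
            return 0.0
        return mean(1.0 if r.score >= threshold else 0.0 for r in self.results)
```

Publishing something like this alongside leaderboard numbers would let independent evaluators see which scores came from dynamic, human-reviewed tests and which did not.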
These ideas echo broader best practices from the NIST AI Risk Management Framework, which recommends context-specific, lifecycle evaluations and clear documentation to manage AI risks (NIST AI RMF 1.0).
Why a shared AGI yardstick matters for safety and policy
Regulators are moving quickly, and many new rules hinge on how capable a model is. A working definition and levels of AGI would make it easier to align safety policies with real capabilities:
- EU AI Act – The new law introduces special obligations for general-purpose AI models and extra requirements for models with systemic risk. Clear capability levels could help determine when those stricter rules apply (EU AI Act).
- US Executive Order 14110 – Developers training large dual-use foundation models above certain compute thresholds must report key safety information to the US government. Shared metrics would sharpen what to test and report (White House EO 14110).
- UK Frontier AI policy – The UK has emphasized evaluation of dangerous capabilities and hosted the 2023 AI Safety Summit. A common leveling system could support consistent testing across labs (UK Frontier AI policy).
- Industry safety programs – Companies like Anthropic and OpenAI have published safety levels and preparedness frameworks to assess catastrophic risks. An AGI levels framework could complement these with broader capability coverage (Anthropic AI Safety Levels), (OpenAI Preparedness Framework).
Benefits and risks of standardizing AGI
Potential benefits
- Clarity – Replaces hype with testable claims about what systems can and cannot do.
- Comparability – Enables apples-to-apples comparisons across labs and model versions.
- Better governance – Ties safety obligations to capability thresholds rather than brand names or marketing terms.
- Research focus – Highlights neglected areas such as long-horizon autonomy and cross-domain transfer.
Potential pitfalls
- Goodhart’s law – Over-optimizing for the metric can distort behavior and hide weaknesses.
- Gaming and leakage – Public benchmarks can leak into training data or be gamed by specialized prompting.
- Narrow scope – A levels framework might underweight societal impacts like labor shifts, misinformation, or fairness.
- Overconfidence – A neat taxonomy can create false certainty about systems that remain unpredictable in the wild.
How this connects to the broader AGI debate
There is longstanding debate about whether the term AGI is even useful. Critics argue it obscures limits of current systems and encourages speculative marketing. For example, researchers have warned against extrapolating language model competence beyond what is measured, and have called for rigorous, context-aware evaluations before deploying high-stakes systems in the real world (Bender et al., 2021). DeepMind’s push for explicit levels and benchmarks aims to bring the conversation back to measurable reality.
What to watch next
- Independent evaluation hubs – Expect more third-party auditing and public leaderboards that track capability and autonomy across domains.
- Safety-linked thresholds – Policymakers may tie obligations like red-teaming, secure compute, and incident reporting to capability levels rather than model scale alone.
- Real-world trials – More system-level pilots in areas like software engineering, scientific discovery, and robotics will test whether benchmark gains translate into practical reliability.
- Open measurement science – Shared, rotating test sets and community-run evaluations will be key to fighting benchmark leakage and gaming.
Conclusion
DeepMind’s effort to define what counts as AGI is not about declaring victory. It is about building a common yardstick so researchers, companies, and regulators can talk about progress in the same language. No taxonomy will be perfect, but one that is multi-dimensional, transparent, and tested in the wild can help the field move from hype to accountable progress.
FAQs
What is AGI in simple terms?
AGI refers to AI that can learn and perform well across many different tasks and domains, at or beyond human level, rather than being specialized for one task.
What is DeepMind proposing exactly?
A framework with levels of capability that considers how strong, how general, and how autonomous a system is, evaluated with standardized and real-world tests.
How is this different from existing benchmarks?
Most current benchmarks are narrow and static. The proposal emphasizes multi-domain, dynamic, and system-level evaluations that track real-world usefulness and risks.
Will this affect AI regulation?
Likely yes. Clear capability levels would make it easier for laws like the EU AI Act or US reporting rules to target obligations at models that cross risk thresholds.
Is AGI around the corner?
No one knows. Capability progress has been fast, but generality and safe autonomy are harder. A shared yardstick can clarify where we truly are and what gaps remain.
Sources
- MIT Technology Review – Google DeepMind wants to define what counts as AGI
- Google DeepMind blog – Research and perspectives on AGI and levels
- The Verge – Coverage of DeepMind’s proposed levels of AGI
- NIST AI Risk Management Framework 1.0
- European Parliament – EU AI Act overview
- White House – Executive Order 14110 on AI
- UK Government – Frontier AI capabilities and risks
- Anthropic – AI Safety Levels
- OpenAI – Preparedness Framework
- Bender et al. – On the Dangers of Stochastic Parrots