From Bottleneck to Breakthrough: How AI-Written Empirical Software Accelerates Science

By Zakariae BEN ALLAL · Created on Mon Sep 29 2025
[Figure: An AI system generating and optimizing empirical software to accelerate scientific discovery across multiple benchmarks.]


Artificial intelligence (AI) is evolving from a useful tool into a genuine research collaborator. We’ve seen AI tackle significant challenges in biology, including large-scale protein structure prediction, work recognized with a Nobel Prize. Such achievements point to a broader trend: integrating AI into scientific workflows can significantly accelerate discovery.

Recently, Google Research unveiled an AI system built using Gemini, designed to assist scientists in writing, testing, and optimizing what they term empirical software—code aimed at maximizing a quantifiable metric linked to a scientific goal. During early tests across six diverse benchmark tasks, the system achieved expert-level results, dramatically reducing exploration cycles from months to mere days.

In this article, we will explore the nuances of this system tailored for a curious, non-expert audience. We will define empirical software, explain how the system functions, highlight its significance, and discuss the early results across fields ranging from genomics to geospatial analysis.

What is Empirical Software?

Empirical software is code judged by how well it maximizes a defined score on a specific, measurable task. Rather than focusing on proving formal correctness, its effectiveness is assessed by running it on real data and analyzing the resulting score. Google Research frames scientific challenges that fit this model as scorable tasks: problems with a clear objective, a scoring metric, and data splits for training, validation, and evaluation.
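As a rough illustration, a scorable task bundles a problem statement, data splits, and a scoring function. The field names and structure below are my own assumptions for the sketch, not Google's actual interface:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Illustrative only: these field names are assumptions, not Google's actual interface.
@dataclass
class ScorableTask:
    description: str       # plain-language problem statement
    train_path: str        # data used to develop a solution
    validation_path: str   # data used to guide the search
    test_path: str         # held-out data for the final score
    score: Callable[[Sequence[float], Sequence[float]], float]  # (predictions, targets) -> metric

def mean_absolute_error(preds, targets):
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

task = ScorableTask(
    description="Forecast weekly hospital admissions per jurisdiction",
    train_path="data/train.csv",
    validation_path="data/val.csv",
    test_path="data/test.csv",
    score=mean_absolute_error,
)
```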

Why is This Useful?

Modern scientific research increasingly relies on computational experiments—simulating processes, analyzing datasets, and evaluating hypotheses. Each new idea often requires developing bespoke code, a process that can be prohibitively time-consuming and become a bottleneck. An AI system capable of proposing, implementing, and iterating solutions while continuously measuring progress against a score can transform this bottleneck into an efficient search problem where machines excel.

The Basic Loop: Define, Generate, Search, Improve

At a high level, the system functions like an indefatigable co-investigator, fusing a large language model (LLM) with a guided search approach:

  1. Define a Scorable Task. Provide a problem statement, a scoring metric, and data. Optionally, offer context such as suggested methods or relevant literature.

  2. Generate Ideas and Code. The LLM (Gemini) suggests strategies and generates executable code for each approach. Each suggestion becomes a clear, testable program.

  3. Explore Using Tree Search. The system runs a tree search guided by an upper-confidence strategy akin to AlphaZero. It evaluates numerous code variations, selects promising branches, and delves deeper.

  4. Rewrite for Improvement. With feedback from scores, the LLM refines and optimizes the code, experimenting with new architectures, training frameworks, or data-processing methodologies to improve the score.

  5. Preserve the Best, Iterate. Because the outputs are code, results remain transparent, reproducible, and comparable, and the loop can scale to thousands of candidates without fatigue.
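A greatly simplified sketch of this loop, assuming hypothetical llm_propose, llm_rewrite, and run_and_score helpers (the real system uses a tree search rather than this greedy refinement):

```python
import random

# Hypothetical helpers standing in for Gemini and the scoring harness; not a real API.
def llm_propose(task):         # return candidate source code for the task
    return f"# candidate program for: {task}"

def llm_rewrite(code, score):  # return a revised program, given score feedback
    return code + f"\n# revision after scoring {score:.3f}"

def run_and_score(code):       # execute the program and return its validation score
    return random.random()     # placeholder for a real evaluation

def search(task, n_initial=5, n_rounds=20):
    # Start from several independent LLM-generated candidates.
    pool = [(run_and_score(c), c) for c in (llm_propose(task) for _ in range(n_initial))]
    for _ in range(n_rounds):
        pool.sort(reverse=True, key=lambda pair: pair[0])
        best_score, best_code = pool[0]
        # Ask the LLM to rewrite the current best candidate using its score as feedback.
        revised = llm_rewrite(best_code, best_score)
        pool.append((run_and_score(revised), revised))
    return max(pool)  # (score, code) of the best program found

print(search("integrate scRNA-seq batches")[0])
```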

If you tuned in to Google I/O this year, you may recognize the concept of an AI co-scientist. In parallel efforts, Google and partners have been developing multi-agent AI tools to synthesize literature, generate hypotheses, and propose experiments in fields like biomedicine. The empirical software system discussed here serves as a vital component of that framework by concentrating on the coding and measurement aspects inherent in many computational investigations.

Why This Matters

  • Faster Iteration: Transforming months of manual trial-and-error into days of automation allows researchers to evaluate numerous ideas.
  • Reproducibility by Default: The output—working code linked to a benchmark and score—facilitates simple sharing of code and evaluation setups.
  • Broad Applicability: Because every candidate method must be expressed as code and judged by a score, the same system adapts to domains ranging from single-cell genomics to brain-activity prediction.

Achievements of the System Thus Far

Google Research assessed this approach using six benchmarks representing various aspects of modern computational science. Here are some key findings:

1. Genomics: Integrating Single-Cell RNA-seq Across Batches

Single-cell RNA sequencing (scRNA-seq) allows precise gene expression measurement at the cellular level, yet consolidating data from different labs poses challenges. The OpenProblems initiative maintains rigorous benchmarks for this difficulty.

On the OpenProblems v2.0.0 batch integration benchmark, Google reported the discovery of 40 novel methods that surpassed leading expert-designed techniques. The top solution improved the overall score by about 14% over previous best practices, combining elements from two existing methods (ComBat and BBKNN).

2. Public Health: Forecasting U.S. COVID-19 Hospitalizations

The COVID-19 Forecast Hub, in collaboration with the U.S. CDC, aggregates predictions from multiple teams to create an ensemble that has achieved gold-standard status. Forecasts are assessed via the weighted interval score (WIS).
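For readers curious about the metric, here is a minimal sketch of the weighted interval score for a single forecast, following the standard definition (the Hub's exact implementation may differ in details):

```python
def interval_score(y, lower, upper, alpha):
    """Score for a central (1 - alpha) prediction interval: width plus penalties for misses."""
    return (upper - lower) \
        + (2 / alpha) * max(lower - y, 0) \
        + (2 / alpha) * max(y - upper, 0)

def weighted_interval_score(y, median, intervals):
    """intervals: list of (alpha, lower, upper), e.g. alpha=0.2 for an 80% interval."""
    total = 0.5 * abs(y - median)
    for alpha, lower, upper in intervals:
        total += (alpha / 2) * interval_score(y, lower, upper, alpha)
    return total / (len(intervals) + 0.5)

# Example: observed value 120, median forecast 100, plus 50% and 90% intervals.
print(weighted_interval_score(120, 100, [(0.5, 90, 115), (0.1, 70, 140)]))
```

Lower values are better, and forecasts are rewarded for being both sharp (narrow intervals) and well calibrated (intervals that actually cover the observation).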

In retrospective evaluations, the AI-generated models outperformed the established CovidHub ensemble at forecasting weekly hospital admissions across U.S. jurisdictions, a notable result given how consistently that ensemble has led the field under varying conditions.

3. Geospatial Analysis: Semantic Segmentation of Remote Sensing Images

Precisely labeling pixels in high-resolution aerial and satellite imagery is crucial for land-use monitoring, environmental assessments, and disaster response. Utilizing the DLRSD benchmark and mean Intersection-over-Union (mIoU) as a metric, the system produced three solutions slightly superior to prior benchmarks (mIoU above 0.80).
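As a quick reference, mIoU averages the per-class overlap between predicted and true label masks; a minimal NumPy sketch:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over classes present in either mask."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = (pred == c), (target == c)
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both masks; skip it
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))

pred = np.array([[0, 1], [1, 2]])
truth = np.array([[0, 1], [2, 2]])
print(mean_iou(pred, truth, num_classes=3))
```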

4. Neuroscience: Predicting Whole-Brain Activity in Zebrafish

ZAPBench is a benchmark that challenges models to predict neural activity across an entire zebrafish brain based on 4D recordings. Google’s system outperformed all existing baselines on ZAPBench, revealing hybrid models that incorporate a biophysical neuron simulator, suggesting the potential for interpretable forecasts.

5. Numerical Analysis: Evaluating Challenging Integrals

The system addressed a series of difficult definite integrals, reportedly solving 17 out of 19 challenging cases where traditional numerical methods faltered.

6. Time Series: Building a General-Purpose Forecasting Library

Finally, on the GIFT-Eval benchmark, which spans 28 datasets, seven domains, and ten sampling frequencies, the system developed a unified forecasting library optimized for robust error metrics.
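The post does not spell out which error metrics were optimized; as one example of a robust, scale-free metric widely used in forecasting, here is a minimal MASE sketch (an illustration, not necessarily GIFT-Eval's exact configuration):

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean Absolute Scaled Error: forecast error divided by the in-sample
    seasonal-naive error, so values below 1 beat the naive baseline."""
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / scale

y_train = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0])
print(mase(y_true=[13.0, 15.0], y_pred=[12.5, 14.0], y_train=y_train))
```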

The Power of Pairing Tree Search with LLMs

Classic search methodologies akin to AlphaZero balance exploration and exploitation through principled criteria. When paired with a proficient code-writing model and clear scoring metrics, this approach enables a comprehensive loop that generates varied ideas, tests them expeditiously, and focuses efforts where scores improve most rapidly. Humans often find it challenging to sustain this process at scale; however, machines can navigate this naturally.
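The core idea can be captured by an upper-confidence-bound rule such as UCB1. This is a generic sketch of that trade-off, not necessarily the exact criterion Google's system uses:

```python
import math

def ucb1(total_score, visits, parent_visits, c=1.4):
    """Pick the child with the best trade-off between its average score
    (exploitation) and how rarely it has been tried (exploration)."""
    if visits == 0:
        return float("inf")  # always try an unvisited variant at least once
    return total_score / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Example: three code variants under one parent node visited 30 times.
children = [(4.2, 10), (2.5, 5), (0.0, 0)]  # (sum of scores, visit count)
best = max(range(len(children)), key=lambda i: ucb1(*children[i], parent_visits=30))
print("expand variant", best)
```

Variants that keep scoring well get revisited and refined, while rarely tried variants still receive occasional attention, which is exactly the behavior needed when thousands of candidate programs are in play.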

Recognizing Strengths and Limitations

  • Reproducibility: Outputs consist of code and score results, which enhances sharing and auditing. However, reproducibility still hinges on fixed seeds, pinned dependencies, and reliable data access.
  • Compute Budget: Large-scale exploration can be resource-intensive. Teams will need practical controls over budgetary, temporal, and environmental impacts.
  • Benchmark Overfitting: While scores on public leaderboards can motivate, the ultimate goal is real-world applicability. Benchmarks help by offering diverse and transparent evaluations.
  • Human Oversight: Even with excellent scores, human reasoning remains crucial. Researchers must still validate assumptions, interpret findings, and decide the next steps.

The Broader Picture: AI in Scientific Research

In recent years, AI has evolved from narrow applications to systems that support the complete scientific process: reading literature, forming hypotheses, designing experiments, running simulations, and analyzing results. Google’s AI co-scientist initiatives and AI-harnessed empirical software all aim to augment scientific exploration with robust, testable tools based on code and benchmarks, making progress easier to validate.

The growing acknowledgment of AI-driven scientific contributions within the broader research community is also notable. Breakthroughs, such as those in protein structure prediction, have already reshaped biology and biomedicine, laying a foundation for how AI advancements can transform entire fields.

Practical Takeaways for Researchers

  • Start with a Scorable Task: Articulate a clear objective metric and a comprehensive evaluation split, establishing the framework for your scientific inquiry.
  • Include Prior Art: Supply the system with relevant research papers and known methods for exploration, which enriches its output.
  • Embrace Baselines: Incorporate reliable baselines, with ensembles when appropriate, to assess whether new gains are meaningful (see the sketch after this list).
  • Seek Robust Patterns: Focus on improvements replicating across datasets or settings, as revealed by benchmarks.
  • Maintain Human Involvement: Use the system to broaden your search and hasten iteration while applying domain expertise to interpret results.
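To make the baseline advice concrete, here is a toy check comparing a candidate forecast against a seasonal-naive baseline and their simple ensemble on held-out data. The "candidate" here is just a last-value forecast standing in for a real method:

```python
import numpy as np

def seasonal_naive(history, horizon, season=7):
    """Repeat the most recent seasonal cycle as the forecast."""
    reps = int(np.ceil(horizon / season))
    return np.tile(history[-season:], reps)[:horizon]

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

# Synthetic weekly-seasonal series with a mild trend.
history = np.sin(np.arange(100) * 2 * np.pi / 7) + np.arange(100) * 0.01
y_val = np.sin(np.arange(100, 114) * 2 * np.pi / 7) + np.arange(100, 114) * 0.01

baseline = seasonal_naive(history, horizon=len(y_val))
candidate = np.full(len(y_val), history[-1])  # stand-in for your new method
ensemble = 0.5 * (baseline + candidate)

for name, pred in [("baseline", baseline), ("candidate", candidate), ("ensemble", ensemble)]:
    print(f"{name}: MAE = {mae(y_val, pred):.3f}")
```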

FAQs

What exactly is a scorable task?

A scorable task encompasses a precise problem statement, a dataset, and a metric for scoring solutions. Code execution yields a numerical score, facilitating comparisons across methods.

How is this different from Copilot-style code generation?

Code assistants speed up writing code, but they do not autonomously explore thousands of alternatives or optimize for a scientific objective. This system acts as a research engine: it generates, runs, and rewrites candidate programs until the score improves.

Why use tree search instead of just prompting the LLM repeatedly?

Tree search structures exploration by balancing novelty with focused refinements, employing concepts similar to AlphaZero’s strategic planning. This ensures computational resources are invested where they are likely to yield results.

What about safety, errors, or hallucinations?

Most issues manifest as low scores during evaluation, leading to their elimination. However, safeguarding measures like sandboxes and human review remain essential.

Can I access methods and results?

Yes, Google’s paper and interactive pages present solution trees and top-scoring code for the benchmarks discussed, enabling reproducibility and examination of results.

Conclusion

The main premise of AI-powered empirical software is both simple and profound: establish a scientific objective, let an LLM generate and implement a range of solutions, and use a principled search approach along with hard metrics to guide enhancements. Early findings indicate that this strategy can match or exceed expert baselines across various disciplines—genomics, public health, geospatial analysis, neuroscience, numerical methods, and time-series forecasting—while significantly reducing the time needed for testing concepts.

The broader narrative transcends improved scores; it’s about equipping researchers with a scalable approach to effectively explore solution spaces, enabling the testing of more ideas, faster and more rigorously. This shift hints at a future where scientific fields transition from bottlenecks to breakthroughs.

Thank You for Reading this Blog and See You Soon! 🙏 👋

Let's connect 🚀
