No, Google Doesn’t Own the Internet: What AI Data Lawsuits Mean for Your Business
Article · August 23, 2025

By Zakariae Ben Allal

In a new lawsuit, plaintiffs take aim at Google’s use of web data to train AI, declaring, “Google does not own the internet.” It’s a snappy line with a serious question behind it: who gets to decide how public web content—and even personal data—are used to build AI models? The answer will shape how creators, publishers, and every business with a website operate in the AI era.

This guide unpacks what’s at stake, why the lawsuits matter, and the practical steps you can take now—whether you publish online, deploy AI in your products, or simply care how your data is used.

What the new lawsuit is about

According to reporting on the case, the complaint argues that Google scraped massive amounts of publicly accessible content and personal information to train its AI systems, and did so without meaningful consent—hence the punchy line that Google “does not own the internet.” The plaintiffs are essentially asking courts to draw new lines around what counts as fair and lawful AI training data. See The Register’s coverage for the headline framing of the suit and its central claim (The Register).

Why it matters beyond Google

The outcome won’t just affect one company. It could influence:

  • How any AI developer sources training data
  • What publishers and creators can do to control use of their content
  • Whether businesses face new risks when they deploy AI tools

The bigger picture: AI training data fights are everywhere

Google is not the only company under legal scrutiny for how it uses data to train AI models:

  • The New York Times sued OpenAI and Microsoft, alleging that their systems were trained on Times content without permission and can reproduce it in ways that harm the Times’s business (Reuters).
  • Google updated its privacy policy in mid-2023 to say it can use publicly available information to help train its AI models—fueling concerns that personal data posted online could be swept into training sets (The Verge).
  • Courts have also weighed in on scraping more generally. In a landmark case, the Ninth Circuit held that scraping publicly available web pages doesn’t violate the U.S. anti-hacking law (CFAA), complicating efforts to rely on computer-crime statutes to stop scraping of public content (EFF summary of hiQ v. LinkedIn).

Key legal questions the courts will wrestle with

1) Is training AI on public web content “fair use”?

In the U.S., some developers argue that training on publicly available content is transformative and non-substitutional—making it fair use. Critics say generative systems can output near-verbatim copies or close paraphrases, undermining that defense, and that large-scale copying itself should require permission. The Times lawsuit, for example, points to outputs that closely tracked its articles (Reuters).

2) Does “public” equal “permission”?

Not necessarily. Courts have recognized that publicly viewable pages can be scraped without violating the CFAA, but that ruling does not settle copyright or contract claims (for example, violations of a website’s terms of service), nor does it resolve privacy issues when personal data is involved (EFF on hiQ v. LinkedIn).

3) What about personal data?

Privacy regulators are sharpening their focus on training data. The U.S. Federal Trade Commission has warned that companies can’t “sanitize” unlawfully collected data via AI training and should expect enforcement if they use data in ways that violate privacy promises or laws (FTC).

4) Do robots.txt directives or opt-outs control AI training?

In 2023, Google introduced a mechanism called Google-Extended to let websites tell Google whether their content can be used to train certain AI models. That’s helpful, but it’s voluntary and product-specific—not a universal legal shield. Other crawlers and datasets may ignore it (Google-Extended announcement).

5) How will non-U.S. rules change the calculus?

The EU’s AI Act, approved in 2024, requires transparency from providers of general-purpose AI models, including information about training data sources and how they handle copyrighted content. It also operates alongside EU “text and data mining” rules that allow some scraping, with opt-outs for rightsholders (European Parliament: EU AI Act).

What Google and others argue

Google, like many AI developers, typically argues it uses a mix of licensed, human-created, and publicly available data to train models, and that established web norms (such as crawlers respecting robots.txt and, now, Google-Extended) allow publishers to opt out. Google also points to research and user benefits: better search experiences, safer systems through broad exposure, and new productivity tools. The updated privacy policy stated that public information can be used to “help” train AI tools like Bard (now Gemini) (The Verge).

Bottom line: A key tension is whether long-standing web crawling practices for indexing content carry over to the AI era, or whether AI training is a qualitatively different use that demands new consent, compensation, or both.

Implications for creators, publishers, and startups

While courts sort out where the lines are, businesses shouldn’t wait on the sidelines. Here’s how to stay proactive.

1) Decide your stance on AI training—and express it clearly

  • Use robots.txt and HTTP headers to signal your preferences to major crawlers. If you want to opt out of Google AI training, implement the Google-Extended directive per Google’s guidance (docs).
  • Update your site’s terms of use to prohibit scraping and model training without a license. While not bulletproof, clear terms improve your footing for contract claims.
  • Publish an AI use policy that tells users and partners how you expect your content to be used.
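As a concrete sketch, an opt-out in robots.txt might look like the following. Google-Extended is Google’s published token for AI training; other AI crawlers publish their own tokens (OpenAI’s GPTBot is one real example), and compliance with any of them is voluntary:

```txt
# robots.txt — ask AI-training crawlers to stay out while permitting normal indexing.
# Google-Extended governs use of content for Google's AI models;
# Googlebot (search indexing) is unaffected by this rule.
User-agent: Google-Extended
Disallow: /

# OpenAI's crawler publishes its own token.
User-agent: GPTBot
Disallow: /

# Everything else (including search crawlers) remains allowed.
User-agent: *
Allow: /
```

Keep in mind these directives are requests, not enforcement: compliant crawlers honor them, and everything else falls back to your terms of service and the law.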

2) Treat personal data as radioactive

  • Minimize the personal data you expose publicly. If you don’t need to publish it, don’t.
  • Keep your privacy notices up to date and consistent with how you actually use data. Don’t rely on vague “publicly available” justifications if you’re training or fine-tuning models.
  • For products that ingest user data, implement opt-outs and data deletion controls. Regulators expect it (FTC).

3) If you build with AI, manage your vendor risk

  • Ask vendors for disclosures about training data sources and copyright safeguards. If you’re deploying a model that can generate content at scale, you inherit reputational and legal risk.
  • Negotiate indemnities and clear usage warranties. Many enterprise AI providers offer them—make sure you understand the carve-outs.
  • Turn on filters, citation features, and guardrails to reduce the chance of regurgitating copyrighted content or leaking personal data.

4) If you train your own models, get your data house in order

  • Prefer licensed datasets, public domain materials, and content with clear reuse permissions.
  • Log provenance. Track where data came from, when it was collected, and under what terms. It’s essential for compliance and for responding to takedown or deletion requests.
  • Implement dataset governance: sensitive-data filters, age gates, deduplication, and memorization tests.
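The provenance point above can be made concrete with a minimal sketch. An append-only JSONL log records where each document came from, when it was collected, and under what terms, so you can answer takedown or deletion requests later. The file name, field names, and license labels here are illustrative assumptions, not a standard:

```python
import json
import hashlib
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """One training document's origin and terms, captured at collection time."""
    source_url: str
    collected_at: str    # ISO 8601 UTC timestamp
    license: str         # e.g. "CC-BY-4.0", "licensed", "public-domain"
    content_sha256: str  # hash of the raw text, for dedup and takedown lookups

def log_provenance(text: str, source_url: str, license: str,
                   path: str = "provenance.jsonl") -> ProvenanceRecord:
    record = ProvenanceRecord(
        source_url=source_url,
        collected_at=datetime.now(timezone.utc).isoformat(),
        license=license,
        content_sha256=hashlib.sha256(text.encode("utf-8")).hexdigest(),
    )
    # Append-only JSONL: one record per document, easy to grep when a
    # deletion request names a URL or a content hash.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record
```

Hashing the content (rather than storing it again) keeps the log small while still letting you match a disputed document to the exact record of how and when it entered your corpus.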

5) Consider compensation models

Some AI developers are striking licensing deals with publishers and stock libraries. If your business depends on content, explore licensing and syndication. Even if the law ultimately permits certain training under fair use, a license can be simpler and more predictable.

A practical checklist

  • Pick your policy: opt-in, opt-out, or neutral toward AI training
  • Implement robots.txt and Google-Extended (if opting out of Google AI)
  • Update Terms of Use and Privacy Policy for clarity on scraping and AI use
  • Classify and minimize public exposure of personal data
  • Audit AI vendors for data provenance, indemnities, and safeguards
  • Establish internal dataset governance and logging
  • Educate teams on acceptable AI use and copyright basics

What to watch next

Expect more cases to test the bounds of fair use, contract enforcement via terms of service, and privacy regulators’ appetite for action. In the EU, the AI Act’s transparency and governance requirements will start to bite, forcing model makers to explain how they handle copyrighted data and opt-outs. In the U.S., absent new legislation, courts will be the main arena.

Examples and scenarios

Example: A media startup

You publish investigative articles and see AI chatbots summarizing your scoops without sending traffic. Consider an AI use policy, implement robots.txt rules for AI crawlers, and explore licensing deals. If you find close regurgitation of paywalled content, document it and seek counsel—similar allegations are central to the Times case (Reuters).

Example: An ecommerce brand

Your product pages appear in training sets and chatbots now answer product questions. Decide if that’s acceptable. If not, adjust your robots directives, add terms restricting AI training, and monitor traffic patterns to assess any impact.

Example: A SaaS startup building with AI APIs

You’re shipping features using a third-party LLM. Tighten your data-processing agreements, disable training on your inputs where possible, and log what data leaves your system. Regulators and customers will ask.

Common misconceptions

  • “Public equals free.” Public visibility does not waive copyright or privacy rights, and contract terms can still apply.
  • “Robots.txt is a legal shield.” It’s a signal, not a statute. Some crawlers comply; others may not.
  • “Fair use is a get-out-of-jail-free card.” It’s a fact-intensive defense decided case by case. Outputs that replicate protected works can tip the scales against you.

FAQs

Is scraping public websites legal?

Sometimes. U.S. courts have said scraping public pages isn’t a CFAA crime, but other claims—copyright, contract, and privacy—may still apply. Context and jurisdiction matter (EFF on hiQ v. LinkedIn).

Does robots.txt or Google-Extended stop AI training?

They can stop compliant crawlers like Google’s from using your content to train specific models, but they don’t bind other actors. Think of them as practical controls, not legal guarantees (Google-Extended).

Is training on copyrighted content fair use?

It’s unsettled. Some argue it’s transformative; others point to output that closely tracks copyrighted works. Courts are actively considering this question, including in the Times v. OpenAI case (Reuters).

What should startups do right now?

Map your data flows, tighten terms and privacy notices, vet AI vendors, enable available opt-outs, and log provenance. Those steps reduce risk no matter how the legal landscape evolves.

Can individuals opt out of AI training?

There’s no universal opt-out, but you can minimize your public footprint, use site-specific controls (where available), and exercise rights under privacy laws in your jurisdiction. Expect more tools as regulations mature.

Final take

The line “Google does not own the internet” captures a broader reality: the web was built on open access, but generative AI adds new stakes. Until legislatures and courts draw clearer lines, your best move is to be explicit about your policies, selective with your data, and diligent with your vendors.

Further reading on building responsibly with AI

Explore practical developer guidance and tutorials at AI Developer Code.

Sources

  1. The Register: coverage of the lawsuit asserting “Google does not own the internet”
  2. The Verge: Google updates privacy policy to allow using public data to train AI
  3. Reuters: The New York Times sues OpenAI and Microsoft over AI training
  4. EFF case page: hiQ Labs v. LinkedIn (scraping public websites and the CFAA)
  5. Google: Announcing Google-Extended to control AI training access
  6. FTC Business Blog: AI training data and unfair/deceptive practices
  7. European Parliament: EU AI Act overview and transparency obligations

Thank You for Reading this Blog and See You Soon! 🙏 👋
