Internet infrastructure company Cloudflare has updated the robots.txt files for millions of websites in a bid to pressure Google into changing how its crawlers feed the company’s AI products. Cloudflare calls the change the “Content Signal Policy,” and it comes after publishers and other traffic-dependent businesses have voiced strong dissatisfaction with Google’s AI Overviews and other AI answer engines. These companies say the AI tools are severely damaging their revenue because they answer users’ questions without directing traffic back to the source of the information.
“Almost every existing, sane AI company is saying, ‘Look, if this is a level playing field, we’re happy to pay for content,'” said Cloudflare CEO Matthew Prince. “The problem is, they’re all terrified of Google because if Google can get the content for free and they have to pay, they’ll always be at a natural disadvantage.”
This is happening because Google is leveraging its position in the search space to ensure web publishers allow their content to be used in ways they might not otherwise accept.
Since 2023, Google has offered website administrators an option to opt out of their content being used to train Google’s large language models, such as Gemini. However, allowing a page to be crawled by Google’s search bot and displayed in search results means accepting that it will also be used, through a process called Retrieval-Augmented Generation (RAG), to generate the AI Overviews at the top of search results pages. This has become a major pain point for many website administrators, from news publishing websites to investment banks that produce research reports.
A July study by the Pew Research Center, based on browsing data from 900 U.S. adults, found that AI Overviews cut clickthroughs nearly in half: on results pages with an AI Overview, users clicked a link just 8% of the time, compared with 15% on results pages without those summaries.
In August, Google’s head of search, Liz Reid, questioned the validity and applicability of research and publisher reports about declining link clicks in search. “Overall, the total volume of natural clicks from Google Search to websites has remained relatively stable year over year,” she wrote, adding that reports of significant declines “are often based on flawed methodologies, isolated examples, or traffic changes that occurred before the AI feature was launched in search.”
Publishers are not convinced. Penske Media Corporation, which owns brands such as The Hollywood Reporter and Rolling Stone, sued Google in September over AI Overviews. The lawsuit claims that affiliate link revenue has fallen by more than a third over the past year, largely because of the Overviews.
The Penske lawsuit specifically argues that because Google bundles traditional search indexing with RAG usage, Penske has no choice but to let Google keep summarizing its articles: cutting off Google search referral traffic entirely would be financially fatal.
Referral traffic has been a pillar of the web economy since the early days of digital publishing. Content could be offered free to human readers and crawlers alike because a loose set of norms across the web ensured that information could be traced back to its source, giving that source a chance to monetize its content and sustain itself. Now, as RAG-generated summaries become more common, there is widespread concern that the old bargain is broken. Cloudflare and other players are trying to update those norms to reflect the current reality.
Massive Update to robots.txt
Cloudflare’s “Content Signal Policy,” announced on September 24, is an effort to leverage the company’s influential market position to change how crawlers use content. It involves updating the robots.txt files for millions of websites.
Since 1994, websites have placed a file called “robots.txt” in the root directory of their domains to tell automated web crawlers which parts of the site should be crawled and indexed and which should be ignored. Over the years the standard has become nearly universal, and honoring it has been a core part of how Google’s web crawlers operate.
Historically, robots.txt has contained little more than lists of paths marked as allowed or disallowed for particular crawlers. Complying is not technically mandatory, but it became an effective gentleman’s agreement because it benefits both sides: site owners can regulate access for their own business reasons, and crawlers avoid wasting effort on irrelevant content.
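For illustration only, a traditional robots.txt of this kind might look something like the following (the paths and rules here are hypothetical):

```
# Hypothetical traditional robots.txt: access rules only, no usage terms
User-agent: *            # applies to every crawler
Disallow: /admin/        # keep crawlers out of the admin area
Disallow: /search/       # and out of internal search result pages
Allow: /                 # everything else may be crawled

User-agent: Googlebot    # rules specific to Google's search crawler
Disallow: /drafts/
```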
However, robots.txt only tells crawlers whether they may access something, not what they may use it for. Google, for example, lets site owners disallow the “Google-Extended” user agent to keep their content from being used to train future Gemini large language models, but that rule cannot undo the training Google did before Google-Extended existed, and it does not stop crawling for RAG and AI Overviews.
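In practice, opting out of Gemini training while remaining in Google Search looks something like the sketch below; as noted above, this signal governs training only, not RAG or AI Overviews:

```
# Stay in Google Search, but opt out of Gemini model training
User-agent: Googlebot        # Google's search indexing crawler
Allow: /

User-agent: Google-Extended  # controls use of content for Gemini training
Disallow: /
```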
The “Content Signal Policy” initiative is a new proposed format for robots.txt aimed at addressing these issues. It allows website operators to choose whether to consent to the following use cases, as stated in the policy:
search: Build a search index and provide search results (e.g., return hyperlinks and short excerpts of your site's content). Search does not include providing AI-generated search summaries.
ai-input: Input content into one or more AI models (e.g., retrieval-augmented generation, grounding, or real-time scraping of content to generate generative AI search answers).
ai-train: Train or fine-tune AI models.
Cloudflare has given all of its customers a quick way to set these values to match their needs. It has also automatically updated the robots.txt files it already manages for 3.8 million domains, with search defaulting to “yes,” ai-train to “no,” and ai-input left unset, signaling a neutral stance.
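As a sketch, and assuming the directive syntax shown in Cloudflare’s published examples (the explanatory comment preamble Cloudflare prepends is omitted here), the managed default amounts to something like:

```
# Sketch of Cloudflare's managed default content signals
User-Agent: *
Content-Signal: search=yes, ai-train=no   # ai-input omitted: no signal either way
Allow: /
```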
Threat of Potential Litigation
By making the policy read somewhat like a terms-of-service agreement, Cloudflare is clearly trying to put legal pressure on Google to stop bundling its traditional search crawler with AI Overviews.
“Lawyers have a voice inside Google,” so Cloudflare is trying to design tools “so that it makes it very clear to them that there’s an explicit license agreement if they’re going to crawl any of these sites. If they don’t respect it, it’s going to put them at risk,” Prince said.
The Next Web Paradigm
Only a company of Cloudflare’s scale could attempt this and hope to have an impact. If only a handful of websites made the change, Google could more easily ignore it, or worse, simply stop crawling those sites to avoid the problem. But because Cloudflare sits in front of millions of websites, delisting them all would substantially degrade the quality of Google’s search results.
Cloudflare has its own interest in the overall health of the web, but there are also other strategic considerations at play. The company has been working with Bing, a Microsoft-owned competitor to Google, to develop tools to help client websites handle RAG, and it has tried to build a marketplace where websites could charge crawlers to scrape their content for AI, although its final form is still unclear.
Whatever the motivation, most people seem to agree on one thing: Google should not simply win by default in the future answer-engine-driven web paradigm just because of its existing dominance in the current search-engine-driven one.
If this new robots.txt standard ends up letting content appear in Google search but not in AI Overviews, then whatever the long-term vision, and whether the change comes from the pressure of Cloudflare’s “Content Signal Policy” or from other forces, most people would agree it is a good start.
(Source: https://arstechnica.com/ai/2025/10/inside-the-web-infrastructure-revolt-over-googles-ai-overviews/)