A recent investigation by the Atlantic is raising questions about how some of the world's largest AI systems are trained and who is benefiting from the journalism that fuels them.
The Common Crawl Foundation, which is funded in part by major tech companies, has been collecting and distributing billions of web pages — including paywalled news articles — to AI developers such as OpenAI, Anthropic, and Meta.
What's happening?
According to the Atlantic, the foundation's archives contain millions of articles from news organizations, including the New York Times, the Economist, the New Yorker, and the Atlantic itself.
Common Crawl says it collects only freely available content. But its scraper never executes the browser code that checks subscription status, so it captures the full text of articles that would otherwise sit behind a paywall.
The foundation's executive director, Rich Skrenta, defended the practice.
"The robots are people too," said Skrenta to the Atlantic, arguing that they should be allowed to "read the books" for free.
Multiple publishers have requested content removal, but the Atlantic found that archive files haven't been modified since 2016.
Why is AI data scraping concerning?
Quality investigative reporting requires significant resources, and paywalls help sustain them. When AI companies train their models on this content without compensation, it undermines the business models that fund original reporting.
The issue also connects to broader environmental concerns surrounding AI. Training and deploying large language models requires enormous computational power, leading to increased energy consumption and harmful carbon pollution.
The U.N. Environment Programme reported that the data centers housing AI servers have large, polluting footprints. They generate electronic waste, consume large volumes of water, and rely on minerals and rare earth elements.
A U.N. Conference on Trade and Development report found that making a 2-kilogram computer can require extracting about 800 kilograms of raw materials. According to the MIT Technology Review, data centers consume 4.4% of the electricity in the U.S.
However, AI technology offers potential environmental benefits when applied thoughtfully. It can make energy systems more efficient and affordable and improve logistics. AI could support and even accelerate progress on up to 80% of the Sustainable Development Goals, per the U.N.
What's being done about AI data scraping?
Organizations are pushing for rules on how AI systems are built and powered.
Beyond Fossil Fuels published a joint statement signed by more than 100 organizations. Their demands included phasing out dirty energy sources for powering data centers and keeping AI systems compatible with "planetary boundaries."
Individuals can also do their part by pushing for transparency and ethical practices in their respective organizations.
Online, Redditors expressed their views.
"Swartz was facing 35 years, I wonder what these guys will get," wrote one user, referencing the case of internet activist Aaron Swartz.
"I mean, all the human readers just bypass the paywall too," commented another.