Major news sites are increasingly blocking AI web crawlers, says study

February 25, 2024
A study from the Reuters Institute for the Study of Journalism at the University of Oxford found that more news sites worldwide are blocking AI web crawlers.

The study, authored by Dr. Richard Fletcher, Director of Research at the Reuters Institute for the Study of Journalism, found that nearly half (48%) of the most popular news sites worldwide are now inaccessible to OpenAI’s crawlers, with Google’s AI crawlers being blocked by 24% of sites.
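
The article does not say how sites block crawlers. One widely used mechanism is a robots.txt file that disallows specific crawler user agents. The sketch below assumes that mechanism and uses a made-up robots.txt for a hypothetical site (news.example). GPTBot and Google-Extended are the user-agent tokens published by OpenAI and Google; the file contents and the helper name `blocked_agents` are illustrative only and do not come from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site (not taken from the study).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://news.example/story"):
    """Return the user agents that the given robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


print(blocked_agents(ROBOTS_TXT, ["GPTBot", "Google-Extended", "Googlebot"]))
# → ['GPTBot', 'Google-Extended']
```

Note that robots.txt is a voluntary convention: it signals a site's wishes, and it only works if the crawler chooses to honor it.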


AI crawlers are designed to comb the internet to collect data for AI models like ChatGPT and Gemini. This ensures a steady supply of up-to-date information, pivotal to keeping AI responses accurate and relevant.

Without fresh data, AI models will become locked in time and unable to adapt to the advancements of the real world. If models consume too much poor-quality, synthetic, and AI-generated data rather than new, high-quality, human-produced data, they could even face model collapse.

So, why are news sites blocking AI web crawlers? They're primarily concerned about copyright and fair compensation, the spread of misinformation, and the potential loss of direct traffic to news sites.

The New York Times is suing OpenAI and Microsoft for copyright infringement, joining a host of authors, artists, and businesses who allege AI developers used their data unlawfully.

AI companies understand the problem. That's why they're striking licensing deals with media companies, such as OpenAI's deal with Axel Springer last year.

Content behemoth Reddit is the latest company to tempt AI companies with multimillion-dollar content licensing deals.

Key insights

Here are some key insights from the report:

  • As of late 2023, 48% of prominent news platforms internationally had restricted access to OpenAI's crawlers, while a smaller share (24%) had done the same for Google's AI crawler.
  • Notably, 97% of sites blocking Google’s AI were also found to block OpenAI’s crawlers.
  • The likelihood of websites blocking AI crawlers varied significantly by country, with the highest rates observed in the USA (79%) and the lowest in Mexico and Poland (20%).
  • Throughout 2023, no instances of websites reversing their decision to block AI crawlers were recorded.
  • Larger news outlets demonstrated a slightly higher propensity to block AI crawlers than smaller ones.
  • The tendency to block varies across different types of news organizations. Legacy print outlets (57%) lead in blocking, compared to digital-born outlets (31%).

News companies are evidently fortifying their defenses against AI web crawlers, and AI companies will probably need to strike deals to keep their models up to date.

The alternative is dire. AI model performance will improve, but models' knowledge will slowly become outdated, to the point of unsatisfactory hallucination rates, inaccuracy, redundancy, and irrelevance.


Sam Jeans

Sam is a science and technology writer who has worked in various AI startups. When he’s not writing, he can be found reading medical journals or digging through boxes of vinyl records.
