Reddit blocks the Internet Archive from crawling its data - here's why


ZDNET's key takeaways
- The Internet Archive can now only crawl Reddit's homepage.
- Reddit's goal is to block AI firms from scraping Reddit user data.
- Publishers (and others) are suing AI companies for copyright infringement.
Reddit is defending its data from AI companies that are taking roundabout approaches to scraping its content.
The social media platform, known as a resource where users can post anonymously and find information about virtually any subject, will block the Internet Archive's Wayback Machine from indexing its online data, according to a Monday report from The Verge. The move is in response to the discovery that AI firms, unable to scrape data from Reddit directly due to the platform's prohibitive policies, have instead been retrieving its data from indexed content on the Internet Archive and using it to train models.
The Wayback Machine will now only be able to scrape data from Reddit's homepage, according to The Verge, while access to user profiles, comments, and post detail pages will be blocked.
Launched in 1996, the Internet Archive is a non-profit that operates an enormous digital database of web content. The archive is maintained in part by the Wayback Machine, a piece of web-crawling software that gathers web pages and preserves them as they appeared when they were collected, like digital flies in amber. The archive serves as a resource for researchers studying the evolution of online culture and as a source of digital forensic evidence for law enforcement, among other uses.
What Reddit's move means
Reddit has previously raised concerns with the Internet Archive about the scraping of its content, according to The Verge. The non-profit was also reportedly notified before the web-crawling restrictions started going into effect yesterday.
"We have a longstanding relationship with Reddit and continue to have ongoing discussions about this matter," Wayback Machine director Mark Graham said in a statement to ZDNET.
Growing tension
Reddit's reported decision to block the Wayback Machine from scraping the majority of its content arrives during a moment of mounting tension between AI companies and digital publishers, though Reddit is the first tech company to wade into the debate. The company sued Anthropic in June, alleging that the AI company was illegally scraping its data, but it has also previously signed licensing deals with both Google and OpenAI.
(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
AI developers require access to gargantuan troves of information to train generative AI models, which are designed to identify and replicate subtle mathematical patterns gleaned from those training datasets.
Many of those companies have scraped training data from publicly available websites, including social media sites and news outlets, claiming legal immunity under a concept known in copyright law as fair use. (The courts are still untangling the legitimacy of that argument, and will likely be doing so for some time.)
Many of the organizations whose content has been copiously scraped -- along with a cohort of authors and other artists -- have responded with lawsuits.
Others, meanwhile, have signed content licensing agreements with the likes of OpenAI, Anthropic, and Google, consenting to the use of their organizations' data in exchange for increased visibility in the responses generated by chatbots, or other benefits.