Scrape like a pro... but not like an AI company

Why the LLM's hunger for web data is damaging the scraping industry

Jul 29, 2024


Web scraping has always been perceived as a sort of gray area, and companies sometimes prefer not to say publicly that they’re doing it, even when it’s obvious.

Do you want the latest proof for this theory?

Go to LinkedIn and look for people working at OpenAI, Perplexity, Mistral, or any other company training LLMs today. How many of them have scraping in their job title?

Who’s the “Head of web scraping/web data acquisition” at these companies?

They certainly have scraping teams: a look at their job boards is enough to find open positions on them (if you apply and get hired, a gift is accepted).

Web scraping is just a tool

As said multiple times on this blog, web scraping is just a tool, like a hammer: what you do with it determines whether your behavior is legal or not.

If scraping were illegal in itself, it would mean that thousands of companies are selling and marketing tools for breaking the law: proxies, APIs, datasets, and so on.

In this evergreen post, Sanaea Daruwalla of Zyte explains the legal risks related to web scraping.

The Web Scraping Club

Legal Zyte-geist #1: Step-by-Step Guide to Compliant Web Scraping

Welcome to the monthly column about web scraping and legal themes by Sanaea Daruwalla. She is the Chief Legal & People Officer at Zyte. Sanaea has over 15 years of experience representing a wide variety of clients and is one of the leading experts on web data extraction laws…


We can summarize them in these points:

Scraping personal data is bad.
Scraping copyrighted material can be bad, depending on how you use it. While scraping news or videos to repost them in the same format on other websites is clearly a bad idea, there’s no clear ruling yet for the case where copyrighted data is used to train LLMs, which generate new content based on it.
Scrape only public data, so avoid data behind logins and paywalls.
Scraping publicly available data that contains no personal information and no copyrighted material is good.

Additionally, there are ethical rules that should be applied when doing web scraping:

Do not harm the target server by sending too many requests. Try to be as light as possible on it, in order to minimize your scraper’s footprint.
Follow the robots.txt file. You’re not obliged to, since it is not legally enforceable, but being respectful of the target website doesn’t hurt.
Do not interfere with the business of the target website. Let’s say a scraper needs to add items to the cart to understand inventory levels: if those items become unavailable to other users while the scraper is running, it can be an issue for the website.
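The first two rules can be sketched in a few lines of Python. This is only a minimal illustration, not a complete scraper: `PoliteGate`, the `my-scraper` user agent, and the one-second default delay are hypothetical names and values chosen for the example, and the robots.txt rules are passed in as text so the logic can be tried without touching a real website. It relies only on the standard library's `urllib.robotparser`.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical helper: says whether and when a URL may be fetched."""

    def __init__(self, robots_lines, user_agent, default_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.parser = RobotFileParser()
        self.parser.parse(robots_lines)  # rules supplied by the caller
        self.user_agent = user_agent
        # Honor Crawl-delay when the site declares one, else use our default.
        self.delay = self.parser.crawl_delay(user_agent) or default_delay
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def allowed(self, url):
        """True if robots.txt lets our user agent fetch this URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Block until `delay` seconds have passed since the last request."""
        now = self.clock()
        if self.last_request is not None:
            remaining = self.delay - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request = now


# Example rules: a site that asks for 5 seconds between requests
# and keeps crawlers away from the cart.
EXAMPLE_ROBOTS = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /cart/",
]
```

A scraper would call `allowed(url)` before each request and `wait_turn()` right before sending it. In a real run the rules would come from the target's own robots.txt rather than from a hard-coded list.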

With these rules in mind, over the past ten years we’ve seen more and more players in the scraping industry growing and creating cool stuff.

In fact, web data is fueling thousands of companies building market intelligence tools, price monitoring and comparison, revenue management, dynamic pricing, web data marketplaces, and so on.

When this AI frenzy started, I was sure it would help make scraping even more mainstream, so that the industry could gain full legitimacy in the tech landscape.

As of today, I’m hearing more and more people talking about web scraping, but not for the reasons I’d like to hear.

Getting web data to fuel the AI race

We were all astonished, or at least surprised, when OpenAI released the first version of ChatGPT. All this knowledge could be retrieved just by writing the right prompt! Wow!

They had been scraping the web for years, but no one cared: the results were great!

Well, almost no one: media outlets, journalists and book authors started suing OpenAI for copyright infringement.

With more LLMs in training, issues expanded from copyrighted material to scraping practices.

In fact, following OpenAI’s example, every AI company declares its scraper’s user agent and gives website owners instructions on how to configure their robots.txt file to avoid being scraped.
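From the website owner's side, this kind of opt-out is just a few lines in robots.txt. The sketch below is an illustration under some assumptions: `GPTBot` is the user agent token OpenAI documents for its crawler, while `SomeOtherBot`, the `/private/` path, and the `is_allowed` helper are made up for the example. Each vendor's documentation should be checked for the current token names. The Python part only verifies how a compliant crawler would interpret the rules.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: shut out one declared AI crawler entirely,
# keep everyone else away from /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a crawler honoring robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "SomeOtherBot"):
    print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/blog/post"))
```

Of course, this only describes what a crawler that respects the file would do. Nothing in the file itself stops a bot that decides to ignore it.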

A great sign of maturity, if only it worked.

As of today, the robots.txt file is more a matter of “Netiquette” than something legally enforceable, but ignoring it is nothing to be proud of, especially if you have publicly stated that you would follow the website’s instructions.

But this is only the beginning of the bad web scraping practices by AI companies that have come to light in recent months.

Kyle Wiens, CEO of iFixit, complained that Anthropic’s bot hit their servers one million times in 24 hours, despite being blocked by the robots.txt file.

This complaint echoes others I’ve read on Hacker News and X, where website owners complain that these bots are pulling terabytes of data from their websites with badly engineered scrapers, which run multiple times on the same pages or restart from zero once blocked instead of resuming from the last page crawled.
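Resuming instead of restarting is not hard to engineer. As a rough sketch, assuming a caller-supplied `fetch(url)` function that returns the links found on a page and raises when the site blocks the request, a crawler can save its queue and its list of completed pages after every request. The `crawl` function, the JSON state file, and its layout are all hypothetical choices for this example.

```python
import json
import os
from collections import deque


def crawl(start_urls, fetch, state_path):
    """Breadth-first crawl that saves its progress after every page.

    `fetch(url)` is supplied by the caller: it returns the links found
    on the page and may raise, for example when the site blocks us.
    """
    if os.path.exists(state_path):
        # A previous run left a checkpoint: resume from it.
        with open(state_path) as f:
            state = json.load(f)
        queue, done = deque(state["queue"]), set(state["done"])
    else:
        queue, done = deque(start_urls), set()

    while queue:
        url = queue[0]
        if url not in done:
            # If this raises, the checkpoint on disk still lists `url`
            # at the head of the queue, so the next run retries it.
            links = fetch(url)
            done.add(url)
            for link in links:
                if link not in done and link not in queue:
                    queue.append(link)  # never queue the same page twice
        queue.popleft()
        # Write to a temp file first so a crash cannot corrupt the state.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"queue": list(queue), "done": sorted(done)}, f)
        os.replace(tmp_path, state_path)
    return done
```

Calling `crawl` again with the same `state_path` after a block picks up at the page that failed, and pages already in `done` are never requested a second time.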

This aggressive way of scraping, with no respect for the targets, is making the tech world hate web scraping again. Websites need to handle all this traffic and, eventually, buy a bot mitigation solution, which makes maintaining the website more expensive. On the other side, companies that have always scraped these websites respectfully and under the radar will face higher costs to keep doing it, creating even more inefficiency in the industry.

On top of that, website owners still lack a way to monetize their data. Reddit, which has one of the largest databases of user-generated content, has signed an agreement with OpenAI to provide data feeds. I doubt this will prevent OpenAI’s competitors from scraping it, but at least Reddit has found a solution.

On Databoutique, public web data could be sold directly by the websites themselves, reducing scraping while adding some revenue.

But for other social media or news websites, a solution is far from being found.

Scraping Insights video recording

I’m continuing to record video interviews with key people in the web scraping industry and I hope to publish the first videos during August.

This week I’ll interview:

Andrea Squatrito - CEO of Databoutique
Tuesday, 30th July - 3:00 PM (GMT+2)
We’ll talk about the challenges of the web data economy, from both a seller's and a buyer's perspective

Or Lenchner - CEO of Bright Data
Wednesday, 31st July - 1:30 PM (GMT+2)
The proxy industry: its history, present, and future, seen from one of the historical companies in the landscape

If you’re a paying reader of The Web Scraping Club, you can join these conversations during the recording and ask your questions live.

After the paywall, you will find the link to join us.
