The first rule of web scraping is… do not talk about web scraping
There is one technology that has been widely used since the early days of the Internet yet never gets the attention it deserves, swimming far from the usual hype. While AI seems to be on every startup pitch deck and in every tech product description, often with a thin, almost nonexistent layer of truth to it (much like so many companies used “blockchain” in 2021), this piece of tech sits underneath most of the everyday tools we use, and no one seems fond of openly admitting it: web scraping.
For new readers of this newsletter, our mission is to stimulate open conversations about the technology and ethics surrounding the inevitable practice of data collection.
Yet among web scraping practitioners, a stigma prevails. This sentiment is perfectly summarized in the About section of Reddit’s r/webscraping subreddit:
“The first rule of web scraping is… do not talk about web scraping”.
I guess the success of this newsletter, co-founded with my longtime business partner, stems from breaking that silence. Everyone needed guidance, but no one was willing to speak openly about it.

Here’s a fun fact about secrecy: only 16 people currently working at OpenAI have web scraping in their LinkedIn job description (not even in the job title). It’s as if only 16 people at Shell knew about drilling (spoiler: Shell employs over 6,200 drilling specialists, and they happily say so in their LinkedIn job titles).
Why the Secrecy?
I like to boil the reasons for this behavior down to these two:
1. Legal Ambiguity
People worry—often rightly—that web scraping may break the law. This fear doesn’t deter them from scraping; it just stops them from speaking about it. Whether it’s copyright infringement, violating terms of service, or misunderstanding what a robots.txt file allows, many companies simply want the data without grappling with the ethical or legal implications.
There’s a whisper of doubt about the practice, but the idea that “everyone else does it” soothes the conscience. The mere possibility of being pursued by website owners keeps these activities relegated to back offices. Compliance teams—if they exist—handle these risks quietly.
In reality, much of web data collection falls within a legal framework as long as basic rules are followed. For further reading, check out our earlier article on the legal aspects of web scraping.
Nevertheless, fears about stepping outside that framework keep practitioners hiding behind anonymous nicknames in public forums and discussions whenever they ask how to do it technically.
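As a concrete illustration of what one of those “basic rules” can look like in practice, here is a minimal sketch in Python using only the standard library: it checks a site’s robots.txt before fetching a page. The bot name and URLs are placeholders, and passing this check is not in itself a legal green light; it is simply the kind of baseline courtesy the framework expects.

```python
import urllib.robotparser

# Placeholder identifiers: swap in your real user agent and target URLs.
BOT_NAME = "ExampleResearchBot/1.0"
ROBOTS_URL = "https://example.com/robots.txt"
TARGET_PAGE = "https://example.com/products/page-1"

# Download and parse the site's robots.txt.
parser = urllib.robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()

# Respect the declared rules before making any request.
if parser.can_fetch(BOT_NAME, TARGET_PAGE):
    delay = parser.crawl_delay(BOT_NAME)  # None if the site sets no Crawl-delay
    print(f"Allowed to fetch; honor a crawl delay of {delay or 'your own polite default'}.")
else:
    print("robots.txt disallows this path for this bot; skip it.")
```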
2. Weak Competitive Edge
The other main reason I see for companies not admitting they scrape, or at least trying to sweep it under the rug, is the fear that their competitive edge would erode if they did.
This is unfortunately true when the tech layer on top of the data is made of thin air, or when the defensibility of the model is extremely weak. I went to a dozen retail tech conferences around Europe in the past six months, and I must have seen twenty or thirty companies, technically indistinguishable from one another, all selling price optimization or competitive monitoring to retailers.
While some advanced companies add significant value beyond data collection, many rely on the illusion of proprietary algorithms. Their reluctance to disclose data sources stems from fear that clients would bypass their services to access raw data themselves. This misconception underestimates the skills and costs required to process raw data effectively.
Instead, companies market “magical algorithms” while staying vague about their data origins.
The Side Effects of Secrecy: Lower Data Quality and Cost Overruns
What happens when a widely used technique remains secretive? Professionals lack access to proper training, mentors, and examples. They learn through trial and error, leading to inefficiencies.
Experienced scrapers, raise your hands if you’ve faced these issues: blocked websites, missing content, interrupted feeds, misinterpreted data, or constant rework to maintain stability. Service level agreements become hard to uphold, and data quality suffers. More often than not, when you hear someone shouting “more proxy servers, we need more proxy servers!”, you can see how quickly this knowledge debt turns into cost overruns.
Yes, web scraping has become increasingly complex on the technical side, and yes, money can often compensate for a poorly tuned architecture. But that spending ends up in your product cost and eats into either your business margin or your client’s (which is still your margin, if a proper architecture could have delivered the same data at 80% lower cost).
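To make the “proper architecture” point concrete, here is a small, hypothetical sketch in Python (the function name, delays, and choice of the requests library are my assumptions, not a blueprint): a single retry loop with exponential backoff and jitter is often what keeps a feed stable, and it is far cheaper than reflexively buying more proxy servers.

```python
import random
import time

import requests  # any HTTP client works; requests is assumed here for brevity


def fetch_with_backoff(url: str, max_attempts: int = 4, base_delay: float = 1.0):
    """Hypothetical helper: retry with exponential backoff and jitter
    instead of reflexively adding more proxy servers."""
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            pass  # network error: treat it like any other failed attempt

        # Blocked, throttled, or failed: wait longer each time, with jitter,
        # so the target is not hammered and the feed stays predictable.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))

    return None  # give up after max_attempts and rework the approach instead
```

Nothing in this sketch is exotic; it is exactly the kind of tuning knowledge that stays locked in back offices when no one talks openly about scraping.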
Is There a Way Out?
A controversial practice, performed by many but discussed by few—can web scraping move beyond its “Wild West” phase?
For some, the question may sound like: can web scraping be stopped at all? From a technical perspective, all efforts so far have failed: web scraping still gets past anti-bot measures; it has just become more expensive. From a regulatory perspective, the EU has proved it can be quite hostile to data markets, but regulation looks like an uphill route to take.
Let’s have a look at what industries are functionally dependent on web scraping:
LLMs: None of what we know today, which started on the 30th of November 2022 with the first launch of ChatGPT, would have been technically feasible without web scraping. Blocking web scraping would be like sending a time machine back in time (which I don’t think is possible, but I’m no quantum physicist).
E-commerce: The entire modern market is based on customers’ (and consequently retailers’) awareness of market prices, so products can settle at a fair market price. Prohibiting price collection, comparison, and usage would create distortions in the free market.
Travel: Can you imagine booking a flight or a hotel in the absence of price comparison sites or aggregators?
Market Research: Web scraping is a pillar of market understanding. Without it, the quality of decisions taken, from investment to strategy, would regress by decades.
Lead Generation: The industry customers hate most of all—selling people’s phone numbers and emails—today relies heavily on scraping LinkedIn. GDPR (the EU regulation on data privacy) didn’t stop it; LinkedIn’s policies didn’t either. Money always finds a way.
The way I see it
If there is a half-a-billion-USD market for web scraping tools and a 3.6-billion-USD market for proxy servers, then there is an underlying data market that is off the map. It is kept inefficient by the lack of a proper, stable interface for transferring data between those who have it (the websites) and those who want it (the scrapers), and transaction costs (mainly extraction costs) stay high because the market remains grey instead of becoming an efficient public exchange.
In an ideal world, websites (like Shopee) should sell their data and stop trying to make it inaccessible for two main reasons:
There is a market for it, with buyers willing to pay (right now, they are spending heavily on people and tech to get around your anti-bot systems).
The data gets out anyway, regardless of your anti-bot measures or robots.txt file. You’re only making it harder to get, not impossible. They are going to scrape you. If you openly sell the data to them at a price that meets their ROI (we built Data Boutique exactly for that), you will make money from it, instead of spending money on building walls that won’t keep them out.
I am a realist, though, and I see how the path to transparency is complicated.
Still, I hope this conversation—through platforms like this newsletter—helps elevate web scraping from a secretive, frowned-upon practice to a legitimate profession akin to data engineering.
No, not all of those who scrape are stealing data.