I have been in the web scraping industry since 2014, more or less a geological era ago. The iPhone 6 had just launched, people were pouring buckets of ice water over their heads, and this picture literally broke the internet.
Back then, my friend and business partner Andrea and I were already scraping hundreds of websites with C++ and libCurl.
We started our data company, Re Analytics, that year without knowing anything about web scraping; we learned by doing. Our vision was, and still is, that there's a vast amount of public data out on the web, and few people understand its potential and use cases. Today, the situation is much better than in 2014: most of the people we talk with have at least tried some scraping for fun or for business, but it took ten years for web scraping to become recognized as a common practice.
The company is still running, even though we're shifting to the Data Boutique business model: instead of a data company, we're building a marketplace for web data. Both businesses have pros and cons, but we firmly believe Data Boutique can have a bigger impact on the industry.
I interviewed Andrea in the latest episode of Scraping Insights on YouTube.
We discussed these business models, the key elements data buyers are looking for, the challenges for sellers, and the advantages of entering a marketplace like Data Boutique.
The web data company business model
These ten years of Re Analytics have been a wild ride. Things were not always smooth, but we can say we had more joys than sorrows. The business was good but not great.
Over the years, I've talked with hundreds of peers running similar businesses, sometimes collecting the same data as ours but for different use cases and with very different revenues, and I think we all felt the same pains.
Web data companies cannot scale: more customers = more people to hire = low margins forever, even if you're selling a SaaS service built on web data. I still haven't met a company selling web-scraped data (or services on top of it) that has had huge success. Of course, this is only my experience, and I'd be glad to hear of companies hitting more than 500M USD in revenue in this space. I'd be happy to be wrong.
Let’s look at some examples of these companies' biggest growth challenges.
Example 1: alternative data companies selling to hedge funds
We have direct experience with this world, as we sold (and continue to sell) data to hedge funds. Some of them are okay with “simple” datasets, like pricing data about listed companies in certain industries of interest. However, buyers in this landscape are flooded by data providers proposing any kind of data, so you should make it easy for them to figure out if your dataset is worth buying.
This means having all the docs and papers in place for them, like a due diligence questionnaire on your data collection techniques. One key factor, though, is having a backtest spanning several years to demonstrate that your data is correlated with a stock, or pool of stocks, the fund is interested in. This means scraping data for years before you can close the first contract. Even worse, this is a case where the exclusivity of the information is reflected in the price of the data. If the data is sold to only one fund, which can gain an advantage (alpha) over the others, it can be extremely valuable and very well paid. But if you sell the same information to ten funds, their advantage diminishes, and so does the price they're willing to pay.
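To make the idea of a backtest concrete, here is a minimal sketch, assuming you already have a weekly scraped price index and the stock's weekly closes aligned on the same dates. The file names and column names (price_index, close) are hypothetical, and a fund's due diligence goes far beyond a single correlation number, but this is the kind of history you can only produce by scraping for years before the first contract.

```python
# Minimal backtest sketch (hypothetical files and columns): does a change in the
# scraped signal this week say anything about the stock's return next week?
import pandas as pd

web = pd.read_csv("scraped_price_index.csv", parse_dates=["week"], index_col="week")
stock = pd.read_csv("stock_prices.csv", parse_dates=["week"], index_col="week")

df = web.join(stock, how="inner")                 # keep only overlapping weeks
df["signal_change"] = df["price_index"].pct_change()
df["stock_return"] = df["close"].pct_change()

# Lag the signal by one week so we test anticipation, not coincidence.
corr = df["signal_change"].shift(1).corr(df["stock_return"])
print(f"Lagged correlation over {len(df)} weeks: {corr:.2f}")
```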
By definition, this is the opposite of a scalable company, even though it can still be a lucrative business if you find the right niche to focus on. In fact, web scraping is one of the primary sources of alternative data today.
Example 2: SaaS selling APIs to monitor popular websites
This is probably the most scalable of the three models, but it is still far from generating significant revenues (and, above all, margins).
In this basket we have all the companies that help you monitor a limited set of products on popular websites, from e-commerce sites like Walmart and Amazon to the travel industry.
Just as with hedge funds, companies in this space need to take extra steps to show the value of the data they collect: dashboards, APIs for consuming the data, additional services like product matching across different e-commerce sites, and so on. All these layers have a cost in terms of development and maintenance, but they are needed to stand out from the heavy competition in this space and to avoid competing on price alone.
While scraping costs are directly tied to revenues, since these companies usually charge customers per request, it's also very unlikely that different customers will make the same request in the same timeframe, so the same extracted data can rarely be reused.
In this case, while the data collection could technically scale, since there is a finite number of websites to monitor, all the services built on top of it usually require more people as the number of customers grows.
Example 3: Market intelligence and similar tools
The use case of these tools is similar to the previous one, but with a significant difference in the data collection. Instead of collecting data from single URLs, they scrape the whole website to give the final customers the full picture.
This is what we do at Re Analytics: we scrape hundreds of e-commerce websites in full to give our customers a unique point of view on the luxury market. Focusing on one industry allowed us to become experts in it and to offer unique insights based on the data collected. When companies want to analyze how their competitors are performing on the market, which products they're selling, and at what price, they have a list of competitors in mind. But in this fast-paced world, new competitors can come out of nowhere, and monitoring the whole market allows a brand to see them coming.
We decided to collect full websites so we could resell the same extraction to multiple customers, since we cover every major website in the industry. Despite that, every new customer brings more websites to scrape: there are tens of thousands of brands around the world, and it's impossible for a single company to scrape them all preemptively. In our experience, every new customer brought roughly a 20% increase in the number of websites to monitor.
As you can easily understand, this approach cannot scale. Every new customer requires more websites, raising the overall cost of scraping, which is already higher than usual since we're collecting full snapshots. On top of that, we needed to add further layers of intelligence, whose costs are directly proportional to the number of customers.
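A back-of-the-envelope sketch makes the problem visible. The per-site cost and per-customer revenue below are made up; the only assumption taken from our experience is the ~20% growth in websites per new customer. Revenue grows linearly while scraping costs grow exponentially, so margins eventually collapse.

```python
# Illustrative only: linear revenue vs. exponentially growing scraping costs.
BASE_SITES = 100                 # websites monitored before the first customer
COST_PER_SITE = 50               # hypothetical monthly scraping cost per website
REVENUE_PER_CUSTOMER = 10_000    # hypothetical monthly revenue per customer

sites = BASE_SITES
for customers in range(1, 26):
    sites *= 1.20                # each new customer adds ~20% more websites
    cost = sites * COST_PER_SITE
    revenue = customers * REVENUE_PER_CUSTOMER
    print(f"{customers:>2} customers: {sites:8.0f} sites, margin = {revenue - cost:12,.0f}")
```

With these invented numbers the margin peaks at around a dozen customers and turns negative past twenty; the exact crossover depends on the inputs, but any exponential cost term will eventually outrun linear revenue.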
In our case, and I think many of you can relate, business is good but could be better. It's a price war with competitors, and since our services carry all these layers of value, the price tag is not for every pocket, which restricts the total addressable market.
Insights from these models
Despite their differences, these three models share common ground: they collect the data (and everything that goes with it) and then run their business on top of it.
Everyone starts from a set of target websites, which largely overlap for companies operating in the same industry; each company only starts to differentiate itself in the second step, enriching the data, and in the third, building a product on top of it.
This approach has worked for all these years and is still working, but is it the smartest way to build a business based on web data in 2024?
Does it make any sense for all the companies in the same industry to scrape the same websites, facing the same difficulties and keeping their engineers busy figuring out how to bypass anti-bot systems, instead of focusing on what really makes the difference for their customers: the product?
These are the questions Andrea and I have asked ourselves over the past few years, before concluding that a new way of approaching web scraping is possible.
Web data as a commodity
If ten companies are extracting prices from an e-commerce site, taking for granted that they can all do it completely and correctly, what are the chances that their extractions differ from one another? And what about the reviews of a product? Or the Airbnb locations in a certain city?
Depending on the website, a company could add more or fewer fields to the extraction, but the prices, reviews, or locations are the same for everyone. This is why we can say that web data is a commodity.
Nowadays, if you want to eat apples, you go to the grocery store and buy them instead of planting a tree and waiting years for it to bear fruit, unless you're passionate about gardening. Why can't it be the same for web data?
There will always be peculiar cases that cannot be captured by standard datasets. Still, common uses, like scraping prices, locations, or reviews, can be commoditized and sold on marketplaces.
For all these reasons, we started Databoutique.com, a marketplace for web data. Once we had enough sellers on the platform, we started using it ourselves at Re Analytics.
Like any data company could, we act as both sellers, for the extractions we already have, and buyers, when we need websites we don't cover. This makes our operations much smoother than before: if a website is not yet listed on Data Boutique, the pool of sellers can usually provide it within 24-48 hours, while if a dataset is already there, we can buy it and use it in a few clicks.
Since the data is available to everyone, we're attracting new types of customers who don't want to learn scraping but need data quickly and without any commitment, something that's impossible with a traditional data-company model.
If I were starting a company based on web data today, I would probably begin by buying data instead of learning scraping and trying to do it myself, which could be fun personally but is not wise for the business. If I need Airbnb data, why should I bother understanding how the website works, which proxies I need, how to bypass Akamai, and so on? There are datasets already available on the marketplace, so I can focus on my app instead of on feeding it with data. Data Boutique covers that part.
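As a sketch of what "starting from the data" looks like in practice, assume the purchased Airbnb dataset arrives as a CSV. The file name and column names below are hypothetical, so check the schema of the actual dataset you buy.

```python
# Hypothetical example: go straight to the product question with a purchased
# dataset instead of writing and maintaining a scraper.
import pandas as pd

listings = pd.read_csv("airbnb_rome_listings.csv")      # hypothetical file bought on the marketplace

price_by_area = (
    listings.groupby("neighbourhood")["price_per_night"]  # hypothetical columns
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)
print(price_by_area.head(10))
```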
Final remarks
Is web scraping a profitable industry? Well, if you sell shovels to gold diggers, for sure. All the tools that help companies do web scraping are in great shape, especially now that web scraping has become harder and requires more advanced solutions than a few open-source tools and free IPs. These companies play a crucial role and deserve what they earn.
For the exact same reason, selling web data is more difficult today: despite there being more knowledge about and education in web scraping than a few years ago, margins are being eroded by the need for more advanced solutions.
One possible path to a more efficient web scraping industry is to leverage a marketplace of shared datasets like Data Boutique instead of extracting data in silos. This is no different from what happened with data centers: I'm old enough to remember when, before the advent of cloud computing, every company on earth had to buy thousands of dollars' worth of servers and run them in-house.
Today, everyone runs applications on servers shared with others in data centers. The applications are still distinct and unique, but their hosting and the maintenance of the hardware are completely delegated, so you can focus on what matters most. It took some time to convince the skeptics, but today it seems the most logical option when deciding how to run your hardware.
If you want to be part of this radical change in the industry, consider joining us at Data Boutique as a seller, buyer, or both. If you want to know more, our Discord server is there for you.
Like this article? Share it with friends who might have missed it, or just leave me some feedback about it; it helps me understand how to improve this newsletter.
You can also invite your friends to subscribe to the newsletter: the more you bring, the bigger the prize you get.