Analyzing the cost of a web scraping project
The different cost items that determine the total cost of a web scraping project
Estimating the costs of a web scraping project can be daunting, given the peculiarities of this industry: while there are some variables we can play with to adjust the costs, like the frequency of the extraction and the scope, some others don’t depend on us. If a website has an anti-bot installed, the running costs for our scraper will be higher. If a website changes its HTML often, our maintenance costs will skyrocket, since we need to update our scraper frequently.
Different types of costs
Just from these two examples, we can understand that there are different types of costs, so before diving into the estimation process, let's first try to understand the different cost components of a web scraping project.
Setup cost
Broadly speaking, setup costs can be understood as the costs of creating the whole web data pipeline: from writing the scraper to building the quality control checks and the procedures needed to store the extracted data, together with the time and money spent on a legal assessment to understand whether scraping the target websites is legitimate or not.
In this article, we're focusing more on the costs of creating a single scraper, which can be split into:
studying the website to understand if there's an API or an anti-bot, and which preliminary operations are needed.
coding the scraper according to the previous step.
Typically this cost is based on the number of hours a developer spends on these tasks, and it increases with the difficulty of the website (anti-bot or not, APIs available or not, the complexity of the website structure).
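Just to make this concrete, here's a minimal sketch of how such an estimate could be put into numbers. The hourly rate and the difficulty multipliers are purely hypothetical assumptions, not industry benchmarks.

```python
# Illustrative setup-cost estimate: every figure here is an assumption, not a benchmark.
HOURLY_RATE_USD = 50          # hypothetical developer rate

# Rough multipliers for how website difficulty inflates development time
DIFFICULTY_MULTIPLIER = {
    "api_available": 0.5,     # a usable API found during the study phase
    "plain_html": 1.0,        # standard HTML parsing with selectors
    "anti_bot": 2.5,          # anti-bot present, countermeasures needed
}

def setup_cost(study_hours: float, coding_hours: float, difficulty: str) -> float:
    """Estimate the one-off cost of studying a site and writing its scraper."""
    total_hours = (study_hours + coding_hours) * DIFFICULTY_MULTIPLIER[difficulty]
    return total_hours * HOURLY_RATE_USD

print(setup_cost(study_hours=4, coding_hours=8, difficulty="plain_html"))  # 600.0
print(setup_cost(study_hours=4, coding_hours=8, difficulty="anti_bot"))    # 1500.0
```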
Maintenance cost
Maintenance costs cover all the actions a developer needs to take after the creation of a scraper: does the code need to be updated? How often? Do modifications to the website also change its difficulty level, for example because it introduced an anti-bot?
While a small fix to selectors can be almost irrelevant in terms of costs, if the website radically changes its code or introduces a new level of complexity, the cost can equal or even exceed that of the initial setup phase.
Per usage cost
We can define the per-usage costs as the ones that are directly related to the dimension of our scraping scope.
This cost item mainly covers proxy services, unblockers, and computing costs, which are usually billed per GB of bandwidth, per number of requests, or per hour of compute.
Even if not strictly billed per usage, we can also include anti-detect browsers in this category: they are usually billed per profile or per concurrent session, which means that the more we use them, the more they cost.
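To see how this cost item scales with the scope of the project, here's a small illustrative calculation. Every price in it is an assumption and should be replaced with your providers' actual rates.

```python
# A minimal sketch of how per-usage costs grow with scope; every price is an assumption.
def per_run_cost(
    bandwidth_gb: float,
    proxy_price_per_gb: float = 0.6,       # hypothetical datacenter proxy price (USD)
    compute_hours: float = 1.0,
    compute_price_per_hour: float = 0.05,  # hypothetical small cloud instance (USD)
    unblocker_requests: int = 0,
    unblocker_price_per_1k: float = 2.0,   # hypothetical unblocker pricing (USD)
) -> float:
    proxies = bandwidth_gb * proxy_price_per_gb
    compute = compute_hours * compute_price_per_hour
    unblocker = unblocker_requests / 1000 * unblocker_price_per_1k
    return proxies + compute + unblocker

# Doubling the scope roughly doubles this cost item, unlike setup or maintenance.
print(per_run_cost(bandwidth_gb=10, compute_hours=2))   # 6.1
print(per_run_cost(bandwidth_gb=20, compute_hours=4))   # 12.2
```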
Given these macro-categories of cost, depending on the scope of the project and the difficulty of the website, one cost item could be more relevant than the others.
Different websites, different costs to consider
Let’s see three different scenarios, just to understand when one cost is more important than the others.
Scenario 1: small website, no anti-bot
Let’s say we want to scrape all the product prices from a small e-commerce website like Gucci.com.
The website doesn't have any anti-bot installed, so there's no need to use proxies or browser automation tools like Playwright, which require more computing power than Scrapy. Being a small website, our scraper will run for only a few minutes, making its running cost almost irrelevant.
Given this, the most important cost will be the setup, since creating our scraper with the right selectors will be the most time-consuming activity of the project.
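For this kind of website, the scraper itself can stay very simple. The following is a minimal Scrapy sketch of what it could look like; the start URL and the CSS selectors are hypothetical placeholders, not the real site's markup.

```python
import scrapy

class ProductPricesSpider(scrapy.Spider):
    """Minimal sketch for scenario 1: plain HTML, no anti-bot, no proxies.
    URL and selectors are illustrative placeholders only."""
    name = "product_prices"
    start_urls = ["https://www.example-shop.com/products"]

    def parse(self, response):
        # Extract name and price from each product card (hypothetical selectors)
        for product in response.css("div.product-card"):
            yield {
                "name": product.css("h2.title::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Follow pagination until the catalogue is exhausted
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Most of the setup time goes into finding and validating those selectors, which is exactly why setup dominates this scenario.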
Scenario 2: small website, with anti-bot
This second scenario is similar to the first one, but with a great difference: the presence of an anti-bot.
Let’s say we have a small e-commerce website like Hermes.com, with strong anti-bot protection installed.
In this case, creating a working scraper can already be time-consuming, since we need to find the right countermeasures, and we'll probably need a browser automation tool and proxies to run it.
So we'll have higher setup and running costs, but the most important one will probably be maintenance: when we encounter a strong anti-bot solution, it's almost certain we'll need to update our scraper every month or so, spending time to understand how to fix it and bypass the anti-bot again.
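Just to give an idea of what the run phase looks like in this scenario, here's a hedged Playwright sketch using a proxy: the provider endpoint, the credentials, the target URL and the selector are all placeholders to adapt to your own setup.

```python
from playwright.sync_api import sync_playwright

# Hypothetical residential proxy endpoint and credentials
PROXY = {
    "server": "http://proxy.example-provider.com:8000",
    "username": "YOUR_USER",
    "password": "YOUR_PASSWORD",
}

with sync_playwright() as p:
    # A real browser costs far more CPU and RAM than a plain Scrapy request
    browser = p.chromium.launch(headless=True, proxy=PROXY)
    page = browser.new_page()
    page.goto("https://www.example-shop.com/products")  # hypothetical target URL
    price = page.text_content("span.price")             # illustrative selector
    print(price)
    browser.close()
```

Every time the anti-bot is updated, it's this part of the stack (browser flags, fingerprints, proxies) that typically needs to be revisited, which is where the recurring maintenance hours go.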
Scenario 3: large website with rate limiting per IP
In this third scenario, we'll consider a large e-commerce website like Zalando.com, where the only challenge is to scrape 1 million product prices using datacenter proxies, since there's rate limiting on the number of requests a single IP can make in a certain timeframe.
In this case, the greatest cost will be the proxy usage, since we’ll need many GB of bandwidth per execution.
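A quick back-of-the-envelope calculation shows why. Both the average page size and the proxy price below are assumptions, but they give an idea of the order of magnitude.

```python
# Back-of-the-envelope bandwidth cost for scenario 3; page size and price are assumptions.
pages = 1_000_000
avg_page_size_mb = 0.5      # hypothetical average payload actually downloaded per page
proxy_price_per_gb = 0.6    # hypothetical datacenter proxy price in USD

bandwidth_gb = pages * avg_page_size_mb / 1024
cost = bandwidth_gb * proxy_price_per_gb
print(f"{bandwidth_gb:.0f} GB -> {cost:.0f} USD per full run")  # ~488 GB -> ~293 USD
```

Repeat that daily or weekly and the proxy bill quickly dwarfs both setup and maintenance.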
How to mitigate costs of web scraping
Depending on the website and the type of costs we're facing, we can apply several techniques and tools to mitigate them. Let me give just two examples of how costs can be cut, depending on the peculiarities of your project.
Make vs buy
This is a heated debate in the industry: while no one likes to pay for a scraping tool, we must consider that anti-bot solutions are created by companies that pour millions into their R&D departments, delivering software that becomes more sophisticated over time.
This means that the time needed to create and maintain a web scraper for a website protected by an anti-bot can be considerable. On the other hand, unblocker solutions will in most cases solve the task for us, shifting the cost to a pay-per-usage approach.
There's no general rule for saying which approach is more convenient, since it depends on many factors, like the size of the scraping scope and the cost of the unblocker solution.
The smaller the website, the lower the unblocker usage, and so the more convenient this solution will be.
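A simple breakeven sketch can help with this decision. All the hours, rates and prices below are assumptions to be replaced with your own numbers.

```python
# Illustrative make-vs-buy breakeven: every figure here is an assumption, not a quote.
def yearly_build_cost(setup_hours=40, monthly_fix_hours=8, hourly_rate=50):
    """Build and maintain your own anti-bot bypass for one year."""
    return (setup_hours + monthly_fix_hours * 12) * hourly_rate

def yearly_buy_cost(requests_per_month, price_per_1k_requests=2.0):
    """Pay an unblocker per request for one year."""
    return requests_per_month * 12 / 1000 * price_per_1k_requests

print(yearly_build_cost())                           # 6800 USD
print(yearly_buy_cost(requests_per_month=50_000))    # 1200 USD -> buying wins
print(yearly_buy_cost(requests_per_month=500_000))   # 12000 USD -> building may win
```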
Speaking of unblockers, next Tuesday there will be an update of our Great Web Unblocker Benchmark, where I'll test different web unblockers on the same playground.
Here’s the latest version published in March.
Another way to save money in the development and maintenance phase of scrapers is to buy data directly on websites like Databoutique.com, a marketplace for web data.
If you’re trying to scrape a popular website, like the ones we’ve seen before, there’s a high chance that, in the world, someone else is doing the same, extracting the same data.
Instead of reinventing the wheel, why not think about buying data from someone else, who can keep prices lower by benefiting from multiple sales of the same dataset?
Datacenter proxies vs virtual machines
Let’s consider the Zalando example we’ve seen before: we’re using datacenter proxies that usually cost around 0.5-0.7 USD per GB. This means that for every 100 GB of usage, we’ll spend 50-70 USD.
Datacenter proxies are, as the name says, IP addresses coming from servers in data centers. In this case, though, a different approach could be more convenient: using a tool like Scrapoxy.
Scrapoxy, as we've already seen in The Lab 41 article, is a super proxy manager that lets you connect different proxy providers and manage cloud providers as if they were datacenter proxy providers.
In fact, with Scrapoxy you can spawn a certain number of concurrent virtual machines. Requests sent from the scrapers to Scrapoxy are then routed randomly between these machines, which can be deleted once their IP is blocked by the target website or after a certain uptime.
The advantage of using this approach instead of a traditional datacenter proxy provider is quite easy to understand: you get a defined number of IPs with unlimited bandwidth.
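In practice, your scrapers just point at Scrapoxy as if it were any other proxy endpoint. Here's a minimal sketch; the local port, the credentials and the target URL are assumptions to adapt to the endpoint and credentials your own Scrapoxy project generates.

```python
import requests

# Hypothetical local Scrapoxy endpoint and project credentials
SCRAPOXY_PROXY = "http://PROJECT_USERNAME:PROJECT_PASSWORD@localhost:8888"

response = requests.get(
    "https://www.example-shop.com/products",  # hypothetical target URL
    proxies={"http": SCRAPOXY_PROXY, "https": SCRAPOXY_PROXY},
    verify=False,   # Scrapoxy intercepts HTTPS, so certificate checks may need relaxing
    timeout=30,
)
print(response.status_code)
```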
I’ll write again about Scrapoxy soon, in order to share with you some more use cases.
How LLMs and AI will affect the web scraping costs
This is a great topic that needs to be covered in a dedicated post: next Sunday, together with Marco Vinciguerra, founder of ScrapeGraph.ai, we’ll deep dive into the implications of LLMs and AI in the future costs of web scraping projects.