Optimizing Proxy Usage for Large-Scale Scraping
How to cut scraping costs with a few simple rules
Web scraping at scale involves navigating challenges like IP bans, rate limits, geographical restrictions, and, last but not least, anti-bot systems. Proxies are an essential tool for overcoming these barriers, especially as your project grows.
Since proxy pricing is directly tied to the size of your scraping activity, it’s key to understand how to optimize their usage and save money. Even a marginal percentage saved can translate into a significant amount of dollars when scraping operations are large.
Understanding the different types of proxies
Not all proxies are equal: they differ by type and features, and I wrote an extended article about them in a previous post.
Here’s a small recap of the main differences between them:
Proxies act as intermediaries between your scraping tool and the target website. They provide anonymity, distribute requests across multiple IPs, and help avoid triggering anti-bot mechanisms. The five main options include:
Datacenter Proxies: Fast and cost-effective but more detectable by advanced anti-bot solutions.
ISP Proxies: A hybrid solution, ISP proxies are tied to real internet service providers but hosted in data centers. They combine the reliability of residential proxies with the speed of datacenter proxies, making them a versatile choice for many scraping scenarios.
Residential Proxies: IPs associated with real devices, ideal for bypassing advanced anti-bot systems.
Mobile Proxies: Use mobile carrier IPs, excellent for evading stringent bot detection mechanisms. They don’t usually get blocked by target websites, since banning a single mobile IP could mean blocking hundreds of real devices sharing it.
Web Unblockers: APIs that combine proxies with other techniques to avoid blocks, designed for heavily protected websites.
The list has a precise order: it goes from the most detectable proxy type to the most trusted one. Datacenter IPs are easily detected and blocked by any anti-bot solution, while mobile proxies are rarely banned. Unblockers are designed to beat anti-bot solutions and guarantee the success of your web data extraction. However, this is also the order from the cheapest to the most expensive option, even though prices for mobile proxies have gone down considerably in the past months.
Since web unblockers are designed for a high success rate, you could be tempted to use them everywhere, so you never have to worry about bans again. That’s a fine approach if you have infinitely deep pockets.
But if you’re not in a position to burn money in the fireplace, you should probably consider a smarter approach: I call it “The Proxy Ladder.”
The Proxy Ladder
The Proxy Ladder is the process of choosing the least expensive solution that fits your needs until it stops working. When that happens, you move up to the next step, with a more expensive but more effective solution.
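To make the idea concrete, here’s a minimal sketch of the whole ladder in Python, using the requests library. The tier names, gateway URLs, and credentials are placeholders I made up for illustration, so replace them with your vendors’ real endpoints:

```python
import requests

# Hypothetical gateway URLs, ordered from cheapest to most expensive;
# replace them with your vendors' real endpoints and credentials.
PROXY_LADDER = [
    ("no proxy", None),
    ("datacenter", "http://user:pass@dc-gateway.example.com:8000"),
    ("residential", "http://user:pass@res-gateway.example.com:8000"),
    ("mobile", "http://user:pass@mob-gateway.example.com:8000"),
    ("unblocker", "http://user:pass@unblocker.example.com:8000"),
]

def fetch_with_ladder(url):
    """Try each proxy tier in order until one returns a successful response."""
    for tier, proxy in PROXY_LADDER:
        proxies = {"http": proxy, "https": proxy} if proxy else None
        try:
            response = requests.get(url, proxies=proxies, timeout=30)
            if response.ok:
                print(f"Success with tier: {tier}")
                return response
        except requests.RequestException:
            pass  # connection errors also mean climbing the ladder
        print(f"Tier '{tier}' failed, climbing up...")
    return None
```

In a real project you’d settle on a tier per target website, ideally once during testing, rather than escalating on every single request, but the decision logic is the same.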
Step 1: No proxies
Once you have created your scraper, it’s time to test it locally and in the target running environment.
Let’s say your scraper consists of one simple Scrapy spider that requests 1,000 URLs. Since there’s no browser involved, there’s also no browser fingerprint to worry about.
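As a reference, a bare-bones version of such a spider could look like the sketch below; the URL pattern and the selector are placeholders, not a real target:

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    # Hypothetical list of 1,000 target URLs
    start_urls = [f"https://example.com/product/{i}" for i in range(1000)]

    def parse(self, response):
        # Placeholder selector: adapt it to the real page structure
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
        }
```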
The best case is that your scraper extracts data from all 1,000 URLs both on your local machine and in your target environment in a data center. Good job: you don’t need a proxy!
If your scraper starts returning errors after a while when it runs locally, your IP has probably been rate-limited, so you need to rotate IPs to complete the 1,000-URL extraction.
For this reason, we climb the first step of the ladder and choose datacenter proxies.
Step 2: Datacenter proxies
We now understand we need to rotate IPs, so let’s start using datacenter proxies for this task. The easiest way is to buy rotating datacenter proxies from one of the many vendors out there. Alternatively, you can buy a pool of IPs and rotate them yourself in your scraper. In my opinion, the first solution is better: you get access to a much wider range of IPs, and you don’t need to manage anything yourself.
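With most vendors, a rotating gateway is a single endpoint that assigns a fresh IP to every request, so in Scrapy it can be as simple as setting the proxy key in the request meta, which Scrapy’s built-in HttpProxyMiddleware picks up. The gateway address and credentials below are placeholders:

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    # Hypothetical rotating datacenter gateway from your vendor
    PROXY = "http://user:pass@dc-gateway.example.com:8000"

    def start_requests(self):
        for i in range(1000):
            url = f"https://example.com/product/{i}"
            # Scrapy's HttpProxyMiddleware reads the "proxy" meta key
            yield scrapy.Request(url, meta={"proxy": self.PROXY})

    def parse(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```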
If the scraper can now perform all the requests, we have solved the issue and can deploy our scraper in our target environment without any trouble.
Otherwise, if the scraper fails shortly after starting, returning error codes, we’ve just discovered that the target website blocks datacenter IPs, and we need to move on to the third step of the ladder.
Step 3: Residential proxies
The target website blocks datacenter IPs? Well, it’s a common way to block bots: since every major cloud provider publishes the IP ranges used by its data centers, it’s easy for websites to exclude traffic coming from them.
That’s why residential proxies are so popular, but they are also ten times more expensive than datacenter ones.
Just try rotating residential proxies instead of datacenter ones and see if the scraper now runs to completion. If not, let’s move to the fourth step.
Step 4: Mobile proxies
In 2023, if you had asked me what to use when residential proxies don’t work, I would have said web unblockers. They were more cost-effective than mobile proxies, whose prices were so high! But just as unblocker prices dropped during 2023, with more companies releasing their own solutions, the same happened with mobile proxies in 2024. Not so long ago (maybe two years), mobile proxies were priced at 40 EUR per GB; now you can get a GB for eight euros.
So, if residential proxies aren’t working, try mobile ones and see what happens.
Step 5: Web unblockers
If even mobile proxies don’t work, we have an issue bigger than IP reputation: the website is probably protected by an anti-bot solution, and it requires some extra effort to be scraped. If you arrive at this step of the ladder, your scraper has probably never worked at any previous step, not even from your local machine. In this case, you need to understand which anti-bot is blocking you, read more posts from The Web Scraping Club, and see if the proposed solutions suit your needs and skills.
If you don’t want to spend too much time crafting your own solution, you can use a web unblocker, even if it’s not the cheapest option on the market.
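Integration details vary by vendor, but most unblockers are exposed either as a REST API or as a proxy-style endpoint you route requests through. Here’s a minimal sketch of the proxy-style flavor; the gateway URL and credentials are assumptions, so check your provider’s documentation:

```python
import requests

# Hypothetical unblocker gateway; replace with your vendor's endpoint
UNBLOCKER = "http://user:pass@unblocker.example.com:8000"

response = requests.get(
    "https://heavily-protected-website.com/page",
    proxies={"http": UNBLOCKER, "https": UNBLOCKER},
    # Many unblockers re-encrypt TLS traffic, so vendors often instruct
    # you to disable certificate verification toward their gateway
    verify=False,
    timeout=60,
)
print(response.status_code)
```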
Optimizing the process
Now that we’ve seen the most common path for choosing the right proxy in a smart way, let me share some tips for cutting your proxy expenses:
Don’t pay for datacenter proxies: Tools like Scrapoxy help you manage a fleet of virtual machines on a cloud provider, creating your own datacenter proxy network with unlimited bandwidth.
Mind the pricing structure: if your scraper reads from internal APIs, it consumes fewer GB than one downloading full HTML pages. In the first case, it’s usually better to choose a provider that charges per GB, while in the second a pay-per-request plan is probably better (but do the math to check whether this holds in your specific case, as in the sketch after this list).
Stick with a good provider: while having multiple providers can be a good idea for reliability, every provider offers discounts on large order volumes. Find the right balance between reliability and savings.
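To make the “do the math” advice concrete, here’s a back-of-the-envelope comparison. All the prices and response sizes below are made-up assumptions, so plug in your own numbers:

```python
# Hypothetical pricing: adjust to your providers' actual rates
PRICE_PER_GB = 8.0           # EUR per GB of traffic
PRICE_PER_1K_REQUESTS = 3.0  # EUR per 1,000 requests

requests_count = 1_000_000

# Scenario A: lightweight JSON from an internal API (~20 KB per response)
api_gb = requests_count * 20 / 1024 / 1024
# Scenario B: full HTML pages (~500 KB per response)
html_gb = requests_count * 500 / 1024 / 1024

per_request_cost = requests_count / 1000 * PRICE_PER_1K_REQUESTS

print(f"API scraping, pay per GB:    {api_gb * PRICE_PER_GB:8.2f} EUR")
print(f"HTML scraping, pay per GB:   {html_gb * PRICE_PER_GB:8.2f} EUR")
print(f"Either one, pay per request: {per_request_cost:8.2f} EUR")
```

With these made-up numbers, the per-GB plan wins by a wide margin in the API scenario (about 150 EUR versus 3,000), while the per-request plan is cheaper for full HTML pages (3,000 EUR versus roughly 3,800), which is exactly the trade-off described above.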