The Web Scraping Club

On choosing the right proxy provider for scraping

and investing in IPv4 addresses

Pierluigi Vinciguerra
Oct 2, 2022

Hi, this is Pierluigi from The Web Scraping Club, a newsletter where you can find news, insights, and tutorials with real-world examples about web scraping.

Being a paying user gives:

  • Access to Paid Content, like the post series called “The LAB”, where we’ll dive deep into real-world cases with code (view here as an example).

  • Access to the GitHub repository with the code seen in “The LAB”

  • Access to private channels on our Discord server

But in case you want to read this newsletter for free, you will always get a post per week about:

  • News about web scraping

  • Anti-bot software and techniques insights

  • Interviews with key people in the industry

And you can always join the Web Scraping Club Discord server

Today we'll have a brief follow-up to the previous post, where we talked about proxies, how they work, and their different types.

I felt I'd said enough about the topic until last Thursday, when I attended Extract Summit in London. Among the great talks I attended, the one by Neil Emeigh from Rayobyte covered some aspects of proxies and IPs that I wasn't aware of and that I want to share with you in this post.

The Web Scraping Club is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

What an IP address is and how it works

An Internet Protocol address (IP address) is a numerical label, such as 192.0.2.1, assigned to a network interface that uses the Internet Protocol for communication.

An IPv4 address is a 32-bit number, usually read and written in "dot-decimal" notation: the 32 bits are split into 4 octets of 8 bits each, every octet is written as a decimal number between 0 and 255, and the octets are separated by dots.

Figure: IPv4 address. Pic by Michel Bakni, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=107652628

Due to the rising number of devices connected to the internet, the IPv6 protocol increases the address size from the 32 bits of IPv4 to 128 bits.
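To make the notation concrete, here is a small sketch using Python's standard `ipaddress` module. It takes the 192.0.2.1 example from above and shows it as a single 32-bit integer and as four octets, then compares the address sizes of IPv4 and IPv6. The IPv6 address `2001:db8::1` is just a documentation address I picked for illustration.

```python
import ipaddress

# 192.0.2.1 is the example address used in the text above.
addr = ipaddress.ip_address("192.0.2.1")

# The same address as one 32-bit integer and as four 8-bit octets.
as_int = int(addr)
octets = list(addr.packed)
bits = ".".join(f"{octet:08b}" for octet in octets)

print(as_int)   # 3221225985
print(octets)   # [192, 0, 2, 1]
print(bits)     # 11000000.00000000.00000010.00000001

# Address size in bits for the two protocol versions.
v6 = ipaddress.ip_address("2001:db8::1")  # illustrative IPv6 address
print(addr.max_prefixlen, v6.max_prefixlen)  # 32 128
```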

When a device connects to the internet, the Internet Service Provider assigns it a free IP address, chosen from the range that one of the five regional Internet registries has assigned to the ISP.

What is Rayobyte

Rayobyte is a proxy vendor: they sell data center, residential, and mobile proxies that can be used for scraping the web. Describing the proxy business, Neil said that every potential user of proxies should be aware of two key aspects:

  • diversity, in terms of IPs located in different subnets

  • reputation of the IPs

Let's see them in detail and why we should be careful about them.

Diversity is key

One thing every web scraper developer is well aware of is that we cannot make too many requests from the same IP address in a given timeframe, otherwise we get blocked.

That's the main reason why proxy providers are used when it comes to web scraping.

But I didn't know that some large websites heavily targeted by bots, like Google or Amazon, would temporarily ban not only your IP address but also the other 255 IP addresses in your subnet (that is, the whole /24 block).

Let's take an example: say Amazon tolerates 2,000 requests per hour from a given subnet.

This means that from IP 98.0.0.1 I can make 2,000 requests in one hour before getting blocked. And not only will my IP be blocked: the IPs from 98.0.0.2 to 98.0.0.255 will also be blocked from requesting data from Amazon.

It also means that if I make 1,000 requests from 98.0.0.1 and 1,000 from 98.0.0.2, all the addresses from 98.0.0.1 to 98.0.0.255 will be blocked just the same.
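Here is a minimal sketch of that arithmetic in Python. It is not how Amazon or any real website works internally. It only models the assumption made in this example, a request budget shared by all the addresses in the same /24 block, and `LIMIT_PER_SUBNET`, `subnet_of` and `is_subnet_blocked` are names I made up for the illustration.

```python
import ipaddress
from collections import Counter

LIMIT_PER_SUBNET = 2000  # hypothetical hourly budget taken from the example


def subnet_of(ip, prefix=24):
    """Return the block that contains ip, e.g. '98.0.0.0/24'."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def is_subnet_blocked(requests_per_ip, ip):
    """True once the subnet containing ip has used up the shared budget."""
    totals = Counter()
    for source_ip, count in requests_per_ip.items():
        totals[subnet_of(source_ip)] += count
    return totals[subnet_of(ip)] >= LIMIT_PER_SUBNET


# 1,000 requests from each of two neighbors in the same /24.
traffic = {"98.0.0.1": 1000, "98.0.0.2": 1000}
print(is_subnet_blocked(traffic, "98.0.0.77"))  # True: same /24, budget spent
print(is_subnet_blocked(traffic, "98.0.1.77"))  # False: different /24
```

Note that 98.0.0.77 never sent a single request and still ends up blocked, because the counter is kept per subnet and not per address.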

This leads to the "noisy neighbor problem": I don't know what the other users on the same subnet are doing. If they are scraping Amazon too, they are "burning" part of the total number of requests I can make.

Coming back to Neil's talk, this is why the diversity of the IPs used in rotation, both by scrapers and by proxy providers, is a key success factor in web scraping projects involving large websites.
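One way to reason about diversity in practice is to count how many distinct /24 blocks a proxy pool covers and to rotate across blocks instead of across single addresses. The sketch below is my own illustration, not something shown in the talk. The pool is made of example addresses and the function names are hypothetical.

```python
import ipaddress
from collections import defaultdict
from itertools import zip_longest


def subnet_of(ip, prefix=24):
    """Return the network (by default the /24) that contains ip."""
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


def diversity(pool):
    """Number of distinct /24 subnets covered by the pool."""
    return len({subnet_of(ip) for ip in pool})


def interleave_by_subnet(pool):
    """Round-robin across subnets so requests are spread over all of them."""
    groups = defaultdict(list)
    for ip in pool:
        groups[subnet_of(ip)].append(ip)
    ordered = []
    # Take one IP from each subnet in turn until every group is exhausted.
    for batch in zip_longest(*groups.values()):
        ordered.extend(ip for ip in batch if ip is not None)
    return ordered


pool = ["98.0.0.1", "98.0.0.2", "98.0.1.1", "98.0.2.1"]
print(diversity(pool))             # 3
print(interleave_by_subnet(pool))  # ['98.0.0.1', '98.0.1.1', '98.0.2.1', '98.0.0.2']
```

A pool of four addresses that covers only three subnets is less diverse than it looks, and this is the kind of number worth asking a proxy provider about.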

IP addresses have a reputation

Several services maintain blacklists of IP addresses that have been used for bad actions, like a spam campaign or fraud.

Being on these lists hurts the reputation of an IP, and checking this reputation is one of the measures anti-bot software takes to keep bots from accessing websites.
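Many public blacklists are exposed over DNS (the so-called DNSBLs), and I'm assuming that kind of list here, since the talk didn't go into which services anti-bot vendors actually use. The convention is to reverse the octets of the IPv4 address, append the zone of the list, and resolve the resulting hostname: if it resolves, the address is listed. The sketch below only builds the hostname and performs no lookup. `dnsbl.example.org` is a placeholder zone and not a real service.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the hostname used to query a DNS-based blacklist:
    the IPv4 octets in reverse order, followed by the list's zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on invalid input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# Placeholder zone: replace it with the zone of the list you want to check.
print(dnsbl_query_name("192.0.2.1", "dnsbl.example.org"))
# 1.2.0.192.dnsbl.example.org
```

Running a check like this against the exit IPs of a proxy provider during a trial is a cheap way to get a first idea of their reputation.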

Some years ago, 4 million IP addresses were stolen from AFRINIC, the Regional Internet Registry for Africa, and sold on the black market to be used for fraud and spam.

As a result, these IP addresses and others in the same subnets are almost unusable for web scraping because of their low reputation and, even when browsing, CAPTCHAs are often triggered.

This must be considered when choosing the proxy provider for our web scraping project: when prices look too good to be true, it is usually because of the reputation of the IP addresses underlying the proxies.

Fun fact: as an investment, buying IPv4 addresses in 2021 performed better than the Dow Jones.

Figure: average price of IPs

Due to scarcity and the increasing need for IP addresses, their prices on marketplaces like Neterra Cloud are skyrocketing!

If you're unsure whether to invest your $1,000 in the latest ape NFT or in some IPv4 addresses, I would go for the latter: at least there's a real need and an intrinsic scarcity, until the usage of IPv6 finally takes off.

Jokes aside, that's all for today, thanks for reading this post.

Are any of you working on something spectacular in web scraping that you want to share with us? Please write to pier@thewebscraping.club and let’s talk about it! You could be in the next interview.


Latest posts in The Lab

  • THE LAB #3: scraping Cloudflare protected websites

  • THE LAB #2: scraping data from a website with Datadome and xsrf tokens

  • THE LAB #1: scraping data from an app
