What is an IP address and how it works
An Internet Protocol address (IP address) is a numerical label, such as 192.0.2.1, that is assigned to a network interface that uses the Internet Protocol for communication.
In IPv4, it is a 32-bit number, usually read and written in "dot-decimal" notation, which splits the 32 bits into 4 octets separated by dots, each octet written as a decimal number from 0 to 255.
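To make the notation concrete, here is a minimal sketch using Python's standard ipaddress module. The helper name to_octets is my own, chosen for illustration; it only shows how the four decimal numbers map onto one 32-bit value.

```python
import ipaddress


def to_octets(address: str) -> list[int]:
    """Split an IPv4 address into its four octets (hypothetical helper)."""
    value = int(ipaddress.IPv4Address(address))
    # Take 8 bits at a time, from the most significant octet to the least.
    return [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]


addr = ipaddress.IPv4Address("192.0.2.1")
as_int = int(addr)                # the same address as a single integer
as_bits = format(as_int, "032b")  # the same address as 32 binary digits
octets = to_octets("192.0.2.1")

print(as_int)
print(as_bits)
print(octets)
```

Going the other way, ipaddress.IPv4Address(as_int) rebuilds the dotted form from the integer.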
Due to the rising number of devices connected to the internet, the IPv6 protocol increases the address size from the 32 bits of IPv4 to 128 bits.
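To get a feeling for what the jump from 32 to 128 bits means, this short sketch compares the two address spaces. The sample addresses are documentation addresses picked only for illustration.

```python
import ipaddress

v4 = ipaddress.ip_address("192.0.2.1")
v6 = ipaddress.ip_address("2001:db8::1")

# max_prefixlen is the address length in bits for each protocol version.
ipv4_bits = v4.max_prefixlen
ipv6_bits = v6.max_prefixlen

ipv4_total = 2 ** ipv4_bits  # number of possible IPv4 addresses
ipv6_total = 2 ** ipv6_bits  # number of possible IPv6 addresses

print(f"IPv4: {ipv4_bits} bits, {ipv4_total:,} addresses")
print(f"IPv6: {ipv6_bits} bits, {ipv6_total:,} addresses")
```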
When a device connects to the internet, the Internet Service Provider assigns it a free IP address, chosen from the range that one of the five regional Internet registries has allocated to the ISP.
What is Rayobyte
Rayobyte is a proxy vendor that sells data center, residential, and mobile proxies that can be used for scraping the web. Describing the proxy business, Neil said that every potential user of proxies should be aware of two key aspects:
diversity, in terms of IPs located in different subnets
reputation of the IPs
Let's see them in detail and why we should be careful about them.
Diversity is key
One thing every web scraper developer is well aware of is that we cannot make too many requests from the same IP address in a given timeframe, otherwise we get blocked.
That's the main reason why proxy providers are used when it comes to web scraping.
But I didn't know that some large websites like Google or Amazon, heavily targeted by bots, would temporarily ban not only your IP address but all the other 255 IP addresses in your subnet.
Let's take an example: say Amazon allows 2000 requests per hour from a given subnet.
It means that from IP 98.0.0.1 I can make 2000 requests in one hour before getting blocked. And not only will my IP be blocked: the IPs from 98.0.0.2 to 98.0.0.255 will also be blocked from requesting data from Amazon.
It also means that if I make 1000 requests from 98.0.0.1 and 1000 from 98.0.0.2, all the addresses from 98.0.0.1 to 98.0.0.255 will be blocked just the same.
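The example can be simulated in a few lines. This is only a sketch of how a website might count requests per subnet, not how Amazon actually does it: the class name SubnetRateLimiter, the /24 subnet size, and the lack of an hourly reset are all simplifying assumptions of mine.

```python
import ipaddress
from collections import Counter

SUBNET_LIMIT = 2000  # requests allowed per subnet, as in the example above


class SubnetRateLimiter:
    """Toy model: requests are counted per /24 subnet, not per IP."""

    def __init__(self, limit: int = SUBNET_LIMIT, prefix: int = 24):
        self.limit = limit
        self.prefix = prefix
        self.counts = Counter()

    def allow(self, ip: str) -> bool:
        # strict=False lets us derive the network from a host address.
        subnet = ipaddress.ip_network(f"{ip}/{self.prefix}", strict=False)
        if self.counts[subnet] >= self.limit:
            return False  # the whole subnet has used up its budget
        self.counts[subnet] += 1
        return True


limiter = SubnetRateLimiter()
first = [limiter.allow("98.0.0.1") for _ in range(1000)]
second = [limiter.allow("98.0.0.2") for _ in range(1000)]

neighbor_allowed = limiter.allow("98.0.0.3")     # same subnet, never used before
other_subnet_allowed = limiter.allow("98.0.5.1")  # different /24 subnet

print(all(first), all(second), neighbor_allowed, other_subnet_allowed)
```

Here 98.0.0.3 is refused even though it never sent a request, while an address in a different subnet is still served.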
This leads to the "noisy neighbor problem": I don't know what the other users on the same subnet are doing, and if they are scraping Amazon too, they are "burning" part of the total number of requests I can make.
Coming back to Neil's speech, this is the reason why the diversity of the sources in the IP rotation for the scrapers (and also for proxy providers) is a key success factor in web scraping projects involving large websites.
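One simple way to check the diversity of a proxy pool is to group its IPs by subnet and rotate across subnets instead of across single addresses. The sketch below assumes /24 subnets and an invented pool of addresses; the function names group_by_subnet and interleave_by_subnet are hypothetical.

```python
import ipaddress
from collections import defaultdict
from itertools import zip_longest


def group_by_subnet(ips, prefix=24):
    """Map each subnet to the pool IPs that belong to it."""
    groups = defaultdict(list)
    for ip in ips:
        subnet = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        groups[subnet].append(ip)
    return dict(groups)


def interleave_by_subnet(ips, prefix=24):
    """Order the pool so consecutive requests come from different subnets."""
    groups = group_by_subnet(ips, prefix)
    rotation = []
    # Take one IP from every subnet per round; shorter groups yield None.
    for round_ips in zip_longest(*groups.values()):
        rotation.extend(ip for ip in round_ips if ip is not None)
    return rotation


# Invented pool, for illustration only.
pool = ["98.0.0.1", "98.0.0.2", "98.0.0.3", "98.0.5.1", "203.0.113.9"]
groups = group_by_subnet(pool)
rotation = interleave_by_subnet(pool)

print(f"{len(pool)} IPs in {len(groups)} subnets")
print(rotation)
```

A pool with many IPs but only a few subnets is less diverse than its size suggests, and that is what the subnet count makes visible.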
IP addresses have a reputation
Several services maintain blacklists of IP addresses that have been used for bad actions, like spam campaigns or fraud.
Being on these lists hurts an IP's reputation, and one of the measures that anti-bot software takes to keep bots off websites is to check this reputation.
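Many of these lists can be queried over DNS (so-called DNSBLs): the octets of the IP are reversed, the list's zone is appended, and if the resulting name resolves, the IP is listed. The sketch below assumes this convention; the zone dnsbl.example.org is a placeholder and not a real list, and how a specific anti-bot product checks reputation is not something I can confirm.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address on a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the DNSBL name resolves, meaning the IP is on the list."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No DNS record (or no network): treated here as "not listed".
        return False


# Placeholder zone, for illustration only.
example_name = dnsbl_query_name("192.0.2.1", "dnsbl.example.org")
print(example_name)
```

Before relying on is_listed, check the usage policy of the list you query, since each one has its own terms and rate limits.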
Some years ago, 4 million IP addresses were stolen from AFRINIC, the Regional Internet Registry for Africa, and sold on the black market to be used for fraud and spam.
As a result, these IP addresses and others in the same subnets are almost unusable for web scraping because of their low reputation and, even when browsing, CAPTCHAs are often triggered.
This must be considered when choosing the proxy provider for our web scraping project: usually, when prices are too good to be true, it is because of the reputation of the IP addresses underlying the proxies.
Fun fact: buying an IPv4 address in 2021 performed better than the Dow Jones as an investment.
Due to scarcity and the increasing need for IP addresses, their prices on marketplaces like Neterra Cloud are skyrocketing!
If you're uncertain whether to invest your $1000 in the latest ape NFT or in some IPv4 addresses, I would go for the second: at least there's a real need and an intrinsic scarcity, until the usage of IPv6 finally takes off.
Jokes aside, that's all for today, thanks for reading this post.
Are any of you working on something spectacular in web scraping and want to share it with us? Please write to pier@thewebscraping.club and let’s talk about it! You could be in the next interview.