Web Scraping from 0 to Hero: Everything about proxies
What’s a proxy, how many different types are available, and how do they work?
One of the first answers we gave to the question “Why is my scraper getting blocked?” in the previous post of the course “Web Scraping from 0 to Hero” was to try changing the IP of the scraper by using a proxy.
In this new episode of the course, we’ll see what a proxy is, what kinds of proxies are on the market, and how they work.
What is a proxy?
Straight from Wikipedia, "In computer networking, a proxy server is a server application that acts as an intermediary between a client requesting a resource and the server providing that resource".
It’s like adding a waypoint to a request: instead of going directly to the target, the request is routed through another server before reaching its final destination (and the response travels back the same way). If the proxy is correctly configured, the target server sees only the proxy server’s IP address as the source, not the IP address the request originally came from.
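To make the waypoint idea concrete, here is a minimal sketch that uses only Python’s standard library. The proxy address is a placeholder and `build_proxy_opener` is a hypothetical helper name, so treat both as assumptions rather than a recommended setup.

```python
import urllib.request

# Placeholder address (an assumption): replace it with a proxy you are allowed to use.
PROXY_URL = "http://127.0.0.1:8080"


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_URL)
# A call such as opener.open("https://example.com", timeout=10) would now be
# sent to the proxy first. If the proxy is correctly configured, the target
# server sees the proxy's IP address instead of yours.
```

The request itself is left commented out because it only works once a real proxy is listening at the configured address.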
While the IP address itself is not so important for an anti-bot solution, all the pieces of information that can be derived from it are crucial:
IPs have a reputation, based on their usage history. If they have already been used in spam attacks or botnets, they don’t have many chances of being considered legit by an anti-bot.
IPs also have a geography attached, so using a proxy in a certain region is the fastest way to circumvent geofencing from websites.
IPs are connected to an internet service provider and an owner. Knowing whether an IP comes from a mobile carrier or is owned by a cloud provider makes a big difference when deciding to block bots.
Here’s an example of an IP address from Google Cloud. As you can see, it’s easy to detect that it belongs to a data center region. In this case, the Autonomous System Number (ASN) is 396982, which identifies Google Cloud Platform as an Autonomous System. It means that the IP is managed by Google, together with many others on the same subnet, and that it belongs to its cloud computing unit.
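The lookup described above can be sketched in a few lines. This is only an illustration: the table is hand-written, the network in it is a reserved documentation range (not a real Google Cloud range), and `describe_ip` is a hypothetical helper. Real anti-bot systems rely on full IP-to-ASN databases.

```python
import ipaddress

# Illustrative table only. 203.0.113.0/24 is a reserved documentation range
# (TEST-NET-3), NOT a real Google Cloud network; it stands in for one here.
# The ASN 396982 is the one mentioned in the article.
ASN_TABLE = [
    (ipaddress.ip_network("203.0.113.0/24"), 396982, "Google Cloud Platform", "datacenter"),
]


def describe_ip(ip: str) -> dict:
    """Return the ASN, owner, and kind of network an IP belongs to, if known."""
    address = ipaddress.ip_address(ip)
    for network, asn, owner, kind in ASN_TABLE:
        if address in network:
            return {"ip": ip, "asn": asn, "owner": owner, "kind": kind}
    return {"ip": ip, "asn": None, "owner": None, "kind": "unknown"}


info = describe_ip("203.0.113.7")
# info["kind"] is "datacenter", which is the signal an anti-bot would act on.
```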
How can we categorize proxies?
As we’ve seen, an IP address has many derived pieces of information attached, and by using them we can categorize IPs (and so proxies) according to different aspects.
Anonymity level
Based on the level of anonymization they provide, proxy servers can be classified into three main categories: transparent, anonymous, and high-anonymity proxies. Each type offers a different level of concealment of the user's identity and IP address, impacting the degree of privacy and security for online activities.
Transparent Proxies
Transparent proxies, the least private of the three, do not mask the IP address of the user. They forward the original IP address through the HTTP headers to the destination server. This means that while they can cache web pages and control internet usage, they do not provide anonymization. Transparent proxies are typically used in corporate environments to enforce policy compliance, filter content, and perform caching to expedite data retrieval. However, they are not suitable for users seeking to obscure their IP address for privacy or security reasons.
Anonymous Proxies
Anonymous proxies offer a middle ground in terms of anonymization. These proxies mask the user’s IP address from the destination server, making it appear that the request originates from the proxy server rather than the user’s device. However, anonymous proxies typically still add proxy-related HTTP headers (such as Via), which indicate to the server that the request is being relayed through a proxy. This type of proxy is commonly used to bypass geographical restrictions on content and to prevent websites from tracking a user’s primary IP address. Although anonymous proxies hide the original IP address from most external observers, they do not conceal the fact that a proxy is being used.
High-Anonymity Proxies (Elite Proxies)
High-anonymity proxies, also known as elite proxies, provide the highest level of privacy and security. These proxies do not transmit any identifying information about the original IP address or disclose that a proxy is being used. To the destination server, it appears as if the proxy server’s IP is directly accessing the content, effectively shielding the user’s actual location and IP details. High-anonymity proxies are ideal for users whose primary concern is maintaining complete anonymity online, such as journalists working in sensitive political environments, activists, or individuals in countries with stringent internet censorship laws. Of course, this category is the most important for web scraping.
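The three levels can be told apart by looking at the headers the destination server receives. Below is a rough sketch of that check. The list of headers is an assumption based on the ones commonly associated with proxies, `classify_anonymity` is a hypothetical helper, and real anti-bot systems look at many more signals than these.

```python
def classify_anonymity(headers: dict, real_ip: str) -> str:
    """Classify a proxy from the headers seen by the target server.

    headers: request headers as received by the server.
    real_ip: the client's real public IP address, known to the tester.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    # Headers that commonly carry the original client address (assumed list).
    ip_headers = ("x-forwarded-for", "x-real-ip", "forwarded")
    # Headers that reveal a proxy is involved, even without the client address.
    proxy_headers = ip_headers + ("via",)

    # Simple substring match: good enough for a sketch, not for production.
    if any(real_ip in normalized.get(name, "") for name in ip_headers):
        return "transparent"
    if any(name in normalized for name in proxy_headers):
        return "anonymous"
    return "high-anonymity"
```

To use it, you would send a request through the proxy to a server you control that echoes back the headers it received, then pass those headers and your real IP to the function.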
How the course works
The course is and always will be free. As always, I’m here to share and not to make you buy something. If you want to say “thank you”, consider subscribing to this Substack with a paid plan. It’s not mandatory but appreciated, and you’ll get access to the whole archive of “The LAB” articles, with 40+ practical articles on more complex topics, and its code repository.
We’ll see free-to-use packages and solutions, and if some commercial ones appear, it’s because I’ve already tested them and they solve issues I cannot solve in other ways.
At first, I imagined this course being a monthly issue, but as I was writing down the table of contents, I realized it would take years to complete. So it will probably have a bi-weekly frequency, filling the gaps in the publishing plan without taking too much space at the expense of more in-depth articles.
The collection of articles can be found using the tag WSF0TH, and there will be a section on the main Substack page.
Proxy origin
Proxy servers can also be categorized based on the origin of the IP addresses they provide, which includes datacenter proxies, ISP proxies, residential proxies, and mobile proxies. Each category has unique characteristics that make it suitable for specific applications, including web scraping. Understanding these differences is crucial for selecting the right type of proxy for various online activities.
Datacenter Proxies
Datacenter proxies are provided by servers housed in data centers. These proxies offer IP addresses that are not affiliated with an internet service provider (ISP) and are instead owned by corporations, such as hosting and cloud providers. The main advantage of datacenter proxies is their speed and reliability, as they are hosted on powerful hardware that can handle large volumes of requests with low latency. For web scraping, datacenter proxies are highly efficient thanks to their fast response times and ability to handle concurrent requests, making them suitable for scraping websites that do not employ stringent anti-scraping measures.
ISP Proxies
ISP proxies are a hybrid of residential and datacenter proxies. Their IP addresses are registered to internet service providers, so they look like legitimate residential IPs, but the proxies themselves are hosted in data centers. This combination gives them the credibility of residential IPs with the performance benefits of datacenter proxies. For web scraping, ISP proxies are particularly effective because they are less likely to be blocked or detected as proxies by target websites, given their genuine ISP origin.
Residential Proxies
Residential proxies use IP addresses that are linked to actual residential internet connections, provided by an ISP to a household. These proxies are highly advantageous for web scraping because they appear to websites as regular user connections, which significantly reduces the likelihood of detection and blocking. Residential proxies are ideal for scraping sites with robust anti-bot protections: their traffic is hard to tell apart from that of real people browsing the website, which makes them more effective than datacenter proxies on this kind of target.
Mobile Proxies
To fully understand why mobile proxies are important for web scraping, we first need to understand how a mobile IP address works and the concept of CGNAT.
I talked about it in this previous post, but let’s go through a quick explanation.
CGNAT (Carrier-Grade NAT) is a technology used by mobile and broadband providers to extend the life of IPv4 addresses by allowing multiple devices to share a single IP address. This method is essential due to the limited availability of IPv4 addresses and the gradual transition towards IPv6.
Under CGNAT, mobile devices do not each receive a unique public IP address. Instead, multiple devices on the same mobile network are assigned the same public IP address and are differentiated by unique port numbers. This setup significantly increases the number of devices that can connect to the internet with limited IP resources. When we use mobile proxies, we’re leveraging networks where hundreds or thousands of devices share IP addresses. This makes them particularly hard to block: banning a single IP address used by a mobile proxy can inadvertently block hundreds or even thousands of genuine users who share that address, leading to service disruption and user dissatisfaction. That’s why mobile proxies are a great choice for web scraping when you’re facing a strong anti-bot solution.
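The sharing mechanism can be illustrated with a toy model. This is a deliberately simplified sketch: the class name, the device identifiers, and the public IP (taken from a reserved documentation range) are all made up, and a real CGNAT allocates ports per connection rather than one port per device.

```python
class CarrierGradeNat:
    """Toy CGNAT: many devices share one public IP, told apart by port."""

    def __init__(self, public_ip: str, first_port: int = 1024):
        self.public_ip = public_ip
        self.next_port = first_port
        self.mapping = {}  # device id -> (public IP, public port)

    def connect(self, device_id: str) -> tuple:
        """Return the public (IP, port) pair used by a device, allocating it once."""
        if device_id not in self.mapping:
            self.mapping[device_id] = (self.public_ip, self.next_port)
            self.next_port += 1
        return self.mapping[device_id]


# One thousand phones (hypothetical ids) behind the same public address.
nat = CarrierGradeNat("203.0.113.10")
endpoints = [nat.connect(f"phone-{i}") for i in range(1000)]

# A website that bans this single IP cuts off every device behind it.
blocked_ip = "203.0.113.10"
affected = sum(1 for ip, _port in endpoints if ip == blocked_ip)
print(affected)  # 1000
```

Every device gets a different port but the same IP, so an IP-level ban hits all one thousand of them at once. That collateral damage is what makes websites reluctant to block mobile IPs.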
Where can I find proxies to use for web scraping?
There are several websites like Free Proxy List where you can find free proxies to use in your scraping projects.
While they can be used if you’re just testing some stuff, they’re not reliable enough for recurring usage.
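Because free proxies go offline often, it helps to check a list before using it. The following is a minimal sketch with the standard library only. The test URL, the timeout, and the helper names `is_proxy_alive` and `filter_working` are assumptions chosen for illustration.

```python
import http.client
import urllib.request


def is_proxy_alive(proxy_url: str, test_url: str = "http://example.com/",
                   timeout: float = 5.0) -> bool:
    """Return True if a request to test_url succeeds through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a malformed response.
        return False


def filter_working(proxies, check=is_proxy_alive):
    """Keep only the proxies for which check(proxy) returns True."""
    return [proxy for proxy in proxies if check(proxy)]
```

The `check` parameter is there so the filtering logic can be tried with a stand-in function before pointing it at real proxies.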
In case you need a more reliable solution, there are plenty of vendors that can offer you any type of proxy you need (and on our offer page you can also find some discounts for most of them).
Proxyway is doing a great job of testing and mapping commercial proxy providers, so if you need one, I would give it a try.
On Databoutique.com we’re also creating a broader map of companies involved in web scraping, including of course proxy providers.
We’ve just started, and you can find the partial results on this page. If you work for a proxy vendor or any other company related to web scraping, please consider creating your company page on Databoutique, so you can be part of this great community of scraping professionals and share news and offers with it.