THE LAB #61: Evaluating your proxy provider
Programmatically measuring the quality of the IPs offered by proxy providers
"In God we trust; all others must bring data." – W. Edwards Deming
We all know that when you start a scraping project, you’ll need a proxy provider sooner or later. If you’re lucky, rotating datacenter IPs is all you need, but proxies can quickly become the largest cost item for a web data company once you upgrade to unblockers or mobile proxies.
So, it’s easy to understand why pricing is the main factor when choosing a proxy provider, but is this the only thing to consider? And are we sure we’re aware of all the pricing models and options we have on the market today?
Spoiler: no, and no.
With perfect timing, my interview with Or Lenchner, CEO of Bright Data, is live on the YouTube channel of The Web Scraping Club.
Or is a great product person and has been with the company since its early days, so I learned a lot about the proxy industry by listening to him.
Just give me cheap proxies!
Sure, but the question is: what’s the cheapest proxy for your case? Are you extracting data from API endpoints or from raw HTML? Do you need to bypass bot protections or not? Or do you simply need to rotate IPs?
The answers to these questions determine which proxy is best for you, since providers have different billing systems and plans.
Are you extracting data from API endpoints? Then pay-per-GB plans are probably the best fit, since the responses are small and you don’t carry the overhead of the HTML code.
Conversely, pay-per-request plans are usually more economical when downloading full HTML pages.
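To make the comparison concrete, here is a minimal sketch of the arithmetic. The prices and response sizes are made-up placeholders, not quotes from any provider, and the function names are mine: plug in the numbers from your own plans.

```python
def cost_per_gb_plan(requests, avg_response_kb, price_per_gb):
    """Cost of a pay-per-GB plan (decimal units: 1 GB = 1,000,000 KB)."""
    gigabytes = requests * avg_response_kb / 1_000_000
    return gigabytes * price_per_gb


def cost_per_request_plan(requests, price_per_1k_requests):
    """Cost of a pay-per-request plan, priced per 1,000 requests."""
    return requests / 1000 * price_per_1k_requests


def break_even_kb(price_per_gb, price_per_1k_requests):
    """Average response size (KB) at which both plans cost the same."""
    return price_per_1k_requests * 1000 / price_per_gb


# Hypothetical prices, for illustration only.
PRICE_PER_GB = 8.0          # assumed pay-per-GB price
PRICE_PER_1K_REQUESTS = 1.0  # assumed pay-per-request price

for label, size_kb in [("small JSON API response", 5), ("heavy HTML page", 500)]:
    per_gb = cost_per_gb_plan(1_000_000, size_kb, PRICE_PER_GB)
    per_req = cost_per_request_plan(1_000_000, PRICE_PER_1K_REQUESTS)
    cheaper = "pay-per-GB" if per_gb < per_req else "pay-per-request"
    print(f"{label}: per-GB={per_gb:.2f}, per-request={per_req:.2f} -> {cheaper}")
```

The break-even size is the useful number here: below it the pay-per-GB plan wins, above it the pay-per-request plan does.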
If the website has no bot protection, an unblocker is overkill; conversely, expecting to bypass bot protections with ISP or datacenter IPs alone is wishful thinking.
All this reasoning, together with some tools and deliverables, comes out of the few consultancy tasks I’ve accepted so far. Since more and more of you are asking for this kind of service, I’ve decided to open up a few slots each month for companies that are struggling to collect web data. It can be a cost optimization process, a focus on one website, or an architectural review. You can book the first introductory call at this link.
During this introductory call, we’ll cover the current state of your web data collection infrastructure and decide the best path for solving the most critical issues.
Since we started talking about the proxy cost, let me share with you one of the tools we’re using during the consultancy tasks.
It’s The Proxy Provider Pricebook, September edition. I’ve included all the most important companies in the field and tracked their pricing plans and their variations over time. It may look overwhelming, but my customers and I use this calculator to check which proxy provider is the most cost-effective for them.
It’s not just price
After choosing a proxy provider, assuming we selected the right proxy type, we expect a 100% success rate, right?
Well, not so fast!
There are some other factors to consider. Is the provider’s IP pool large enough for the number of requests we need to make? Is the proxy type consistent with what we selected?
If we asked for IPs from a specific region, are they actually located there, or at least recognized as being there?
If the answer to any of these questions is no, our success rate could drop dramatically, leaving us frustrated.
But how can we test if our proxy provider meets all these requirements?
Well, I’ve created some scripts that you can find in the GitHub repository available to paying readers, inside the folder 61.EVALUATEPROXIES.
If you’re one of them but don’t have access to it, please write to me at pier@thewebscraping.club to get it.
Test Methodology
For this test, I’ve selected three different proxy providers. Provider 1 is one of the top names in the field. Provider 2 is another famous brand, a little less established. Provider 3, instead, is an incumbent player who wants to be the cheapest on the market.
For each of them, I’ve made 50 requests through their rotating residential proxies in the USA and 50 in France, just to compare how they behave with countries of different sizes.
For each request, we retrieve the IP used by the proxy provider, and then we analyze the results to understand the variety of IPs used, their location, reputation, and type.
Retrieving the IPs
After getting the credentials from the three providers, we’ll query the ipify API endpoint through each proxy to get the IP used.
The script is quite simple, and you can find it in the repository under the name ip.py.
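If you don’t have access to the repository, the idea can be sketched roughly as follows. This is not the actual ip.py: the proxy URL format, the function names, and the output file name are assumptions for illustration. It relies only on the Python standard library and the public ipify endpoint.

```python
import json
import urllib.request

IPIFY_URL = "https://api.ipify.org?format=json"  # returns {"ip": "..."}


def fetch_ip(proxy_url, timeout=30):
    """Return the exit IP that ipify sees when the request goes through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(IPIFY_URL, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["ip"]


def collect_ips(fetcher, attempts=50):
    """Call fetcher() `attempts` times and keep the IPs of the successful calls."""
    ips = []
    for _ in range(attempts):
        try:
            ips.append(fetcher())
        except (OSError, ValueError, KeyError):
            # Network errors, timeouts, or malformed responses: skip this attempt.
            continue
    return ips


def save_ips(ips, path):
    """Write one IP per line, ready to be fed to an analysis tool."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(ips) + "\n")
```

A run for one provider and country could then look like `save_ips(collect_ips(lambda: fetch_ip("http://USER:PASS@proxy.example.com:8000")), "provider1_USA.txt")`, where the credentials, host, and file name are placeholders. How you select the country depends on the provider, so check its documentation.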
For the analysis, instead, we’re going to use ASN, an OSINT command-line tool that bundles different services and APIs such as IPInfo, Shodan, and so on.
After running ip.py with the credentials of each proxy provider we’d like to test, we’ll find six files in our folder: two per provider, one with the suffix USA and the other with the suffix FRA.
Test 1: IP rotation
We’ve made only 50 requests, and every proxy provider in this test claims to have thousands of IPs in the two regions, so we expect no IP to be used twice.
For providers 1 and 2 this holds for both extractions, while for provider 3 we get 50 different IPs for the USA and only 46 for France. Not a great start.
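Counting repeats in a list of collected IPs takes a few lines of Python. The sketch below also estimates how likely at least one repeat is for a given pool size, under the simplifying assumption that the provider picks IPs uniformly at random from its pool, which real providers may not do. The function names and the sample pool sizes are hypothetical.

```python
from collections import Counter


def rotation_report(ips):
    """Summarize how many distinct IPs appear and which ones repeat."""
    counts = Counter(ips)
    repeated = {ip: n for ip, n in counts.items() if n > 1}
    return {"requests": len(ips), "unique": len(counts), "repeated": repeated}


def collision_probability(pool_size, draws):
    """Probability of at least one repeated IP in `draws` requests,
    assuming each request picks uniformly at random from `pool_size` IPs."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    p_all_distinct = 1.0
    for i in range(draws):
        p_all_distinct *= (pool_size - i) / pool_size
    return 1.0 - p_all_distinct


# Illustrative pool sizes only; no provider's real pool size is known here.
for pool in (300, 5_000, 100_000):
    print(pool, round(collision_probability(pool, 50), 3))
```

Under that uniform assumption, a single repeat in 50 requests is not impossible even with a few thousand IPs, while several repeats suggest a much smaller effective pool or a rotation that is far from uniform.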
Test 2: Geographical tests
One great feature of ASN is the geographical report it can create from a list of IPs with just one command. You can find all the commands used in this article in the file ASN_commands.sh inside the repository.
Let’s see the results for Provider 1, first the USA and then France.
The IPs are from mixed countries, not exactly what we expected.
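Once each IP has been mapped to a country by whatever geolocation source you use, checking the result against the requested region is a simple tally. The sketch below assumes you already have that mapping as a Python dict of IP to two-letter country code. The function name and the sample values are hypothetical and are not the output of ASN.

```python
from collections import Counter


def country_breakdown(ip_to_country, expected_country):
    """Count IPs per country and compute the share matching the expected one."""
    counts = Counter(code.upper() for code in ip_to_country.values())
    total = sum(counts.values())
    matching = counts.get(expected_country.upper(), 0)
    share = matching / total if total else 0.0
    return {"counts": dict(counts), "matching": matching, "total": total, "share": share}


# Made-up sample using documentation-range IPs, not real test results.
sample = {
    "203.0.113.1": "US",
    "203.0.113.2": "us",
    "203.0.113.3": "CA",
    "203.0.113.4": "US",
}
result = country_breakdown(sample, "US")
print(f"{result['matching']}/{result['total']} IPs in the expected country")
```

A share well below 100% for a country-targeted plan is the kind of mismatch worth raising with the provider, keeping in mind that geolocation databases disagree with each other.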