The Great Web Unblocker Benchmark - Cloudflare Edition

Testing different web unblockers against Indeed.com

Sep 22, 2024

Welcome to a new episode of the Great Web Unblocker Benchmark, our series where we compare the performance of the best-known unblocker solutions on the market.

After the first introductory episode in March and the Kasada edition in June, today it’s time to see who’s the best at bypassing Cloudflare.

Testing methodology and disclaimers

During our web scraping projects, we often encounter anti-bot solutions protecting the target website.

In this case, we have two options: improve our scraper to bypass them, or delegate the task to third-party solutions like Web Unblockers. Both have pros and cons. Building the bypass yourself takes development time and will probably require a browser automation tool, which is slower and more expensive to run. Depending on the size of your scraping scope, and once that development time is factored in, delegating the task to an unblocker could be the cheaper option in the long run.

Web Unblockers work like an API, so we can plug them into our browserless solutions. For this reason, all the tests in this article are run with a simple Scrapy spider.
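As a rough illustration of the "works like an API" idea, the sketch below wraps a target URL in a call to an unblocker endpoint using only the Python standard library. The endpoint `https://unblocker.example.com/v1/` and the `apikey`, `url`, and `js_render` parameter names are all hypothetical: each vendor documents its own, and several expose a proxy endpoint instead of a query-string API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: every real vendor documents its own.
UNBLOCKER_ENDPOINT = "https://unblocker.example.com/v1/"


def build_unblocker_url(target_url: str, api_key: str, js_render: bool = False) -> str:
    """Return the URL a spider would request instead of the target itself."""
    # Parameter names are assumptions made for this sketch.
    params = {"apikey": api_key, "url": target_url}
    if js_render:
        params["js_render"] = "true"
    return f"{UNBLOCKER_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_unblocker_url("https://www.indeed.com/", "MY_KEY"))
```

A spider would then request the returned URL instead of the target, so the scraping code stays the same whichever unblocker sits in between.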

For this episode, I’ve created a scraper that reads the homepage of indeed.com, a well-known job board protected by Cloudflare. The scraper then goes to the listing of remote Python Developer jobs and, from there, extracts the URLs of all the job offers on the first 7 pages.

The number of offers per page is not fixed, since ads sometimes appear between the job offers, but the overall number of job posts should be around 100.

The Unblockers will be tested according to three measures:

A score from 0 to 100 for the success rate in scraping data from the website. The score is 0 if the unblocker could scrape neither the homepage nor a single URL from the list. Otherwise, it is the percentage of requests that succeeded on the first try.
The time spent scraping these URLs, measured only for the unblockers that got past the homepage. The scraper is the same for every unblocker, using 10 parallel threads, each requesting 1 URL per second. Since the number of URLs varies from execution to execution, I’ll use the ratio between the URLs parsed, including retries, and the number of seconds the whole execution lasted.
The cost of the scraping, considering only the pages scraped successfully.
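The first two measures can be sketched in a few lines of Python. This is one reading of the definitions above, not the exact script used for the benchmark: the function names are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
def first_try_score(total_requests: int, retries: int, homepage_ok: bool = True) -> float:
    """Score from 0 to 100: share of requests that did not need a retry."""
    if not homepage_ok or total_requests <= 0:
        return 0.0
    first_try_successes = total_requests - retries
    return 100.0 * first_try_successes / total_requests


def throughput(urls_parsed: int, elapsed_seconds: float) -> float:
    """URLs parsed (retries included) per second of the whole execution."""
    return urls_parsed / elapsed_seconds


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the benchmark.
    print(first_try_score(total_requests=200, retries=10))
    print(throughput(urls_parsed=200, elapsed_seconds=1000))
```

Keeping the two measures separate matters here: an unblocker can score well on first-try success and still be slow, and the ranking changes depending on which one you care about.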

Given these premises, let’s see the rules of engagement:

All the unblockers are tested using the same Scrapy spider, with the options or setup specific to each solution.
To be considered successfully scraped, a request must not only return status code 200 but also be parsed successfully by the scraper. Cloudflare may return an empty rendered page, and this case must be counted as an error.
Results are not shown to vendors in advance, to avoid any influence on the final result. This could also lead to errors on my side: I might have missed an option that would have solved the anti-bot. If this is the case, I’ll update this article, so keep an eye on it even after publication.
The scraping time entered in the benchmark is calculated by Scrapy itself and is the elapsed time of the first successful run.
This is not a paid review of unblockers, but a quantitative test. If your company sells a web unblocker not listed here and wants to participate in the next issues of the test, please write to me at pier@thewebscraping.club
This test is created for educational purposes, to help readers understand which tool best fits their needs when scraping public data. None of these tools should be used to damage the target website’s business.
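The rule about status code 200 not being enough can be sketched as a small validation step. The code below is only an illustration using the standard library: the `/viewjob` marker used to recognize job-offer links is an assumption about the page markup, and the actual spider may rely on different selectors.

```python
from html.parser import HTMLParser


class JobLinkParser(HTMLParser):
    """Collect href values that look like job-offer links."""

    def __init__(self, marker: str = "/viewjob") -> None:
        super().__init__()
        self.marker = marker  # assumed URL fragment identifying a job offer
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            self.links.append(href)


def is_successful(status: int, body: str) -> bool:
    """A response counts only if it is a 200 AND yields at least one job link."""
    if status != 200:
        return False
    parser = JobLinkParser()
    parser.feed(body)
    return len(parser.links) > 0
```

With a check like this, an empty page served with a 200 is counted as a failure and retried, which is what drives the retry counts reported for each unblocker.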

Who’s participating in this round?

In this edition, we’ll see, in strict alphabetical order: Bright Data, Infatica, Oxylabs, Smartproxy, ZenRows, and Zyte API.

While their ultimate goal (bypassing anti-bots) is the same, the technology under the hood differs from provider to provider, and so do their pricing models.

Smartproxy and Oxylabs have a pay-per-GB pricing model, which is convenient when you’re scraping API endpoints but gets expensive when you scrape pure HTML code, like in these tests.

Bright Data, ZenRows, and Zyte, instead, have a pay-per-request model, with some differences: Bright Data charges 3 USD per 1000 requests, which becomes 6 USD when scraping a domain included in their premium list. In our case, the website we’re testing is not on that list.

ZenRows and Infatica use a credit system: you buy credits, and every basic request costs one credit. ZenRows sells 250k credits, enough for 250k basic requests, for 69 EUR; Infatica sells 250k credits for 25 USD.

Zyte API has dynamic pricing calculated internally, so I could get the exact scraping cost from their dashboard.
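To see how these pricing models diverge on the same workload, here is a minimal sketch of the three fixed-price calculations. The function names are hypothetical. The per-request and credit prices come from the paragraphs above, while the page size and the per-GB price in the usage lines are assumptions chosen only to show the effect of heavy HTML pages.

```python
def cost_per_request_plan(billed_requests: int, price_per_1000: float) -> float:
    """Pay-per-request: a flat price for every 1000 billed requests."""
    return billed_requests * price_per_1000 / 1000


def cost_per_gb_plan(requests: int, avg_page_mb: float, price_per_gb: float) -> float:
    """Pay-per-GB: the cost grows with page weight (assuming 1 GB = 1024 MB)."""
    return requests * avg_page_mb / 1024 * price_per_gb


def cost_credit_plan(requests: int, credits_per_request: int,
                     plan_price: float, plan_credits: int) -> float:
    """Credit system: each request consumes credits bought in a plan."""
    return requests * credits_per_request * plan_price / plan_credits


if __name__ == "__main__":
    n = 100  # illustrative workload
    print(cost_per_request_plan(n, 3.0))      # standard domain, USD
    print(cost_per_request_plan(n, 6.0))      # premium domain, USD
    print(cost_credit_plan(n, 1, 69.0, 250_000))  # basic requests, EUR
    print(cost_per_gb_plan(n, avg_page_mb=2.0, price_per_gb=10.0))  # assumed size and price
```

The per-GB formula is the only one that depends on page weight, which is why that model suits lightweight API endpoints better than full HTML pages.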

Cloudflare is the most widespread anti-bot protection, so I expect every unblocker to bypass it; otherwise, its usefulness would be severely limited.

Who got blocked? ❌

As expected, no unblocker was truly blocked while scraping Indeed.
I only got mixed results with Infatica Web Scraper API, which returned fewer pages than expected, although with better retry and error handling it could probably scrape the whole set of URLs needed. Some pages were probably returned before the JS rendering completed, or triggered a challenge, since the scraper could not read some of the details of the job posts.

Bright Data ✅

With Bright Data we were able to bypass Cloudflare and get all the items we wanted. We got 97 items, but it took 127 requests to get them, with 23 retries. The percentage of retries on total requests is 21, so the final score for Bright Data is 79.

It took 4378 seconds to get all 127 responses, with an average time of 34 seconds per request.

Since only the 104 successful requests are billed, at 3 USD per 1000 requests, the overall cost of this extraction was 0.312 USD.

Oxylabs ✅

The Oxylabs Unblocker performed quite well on this occasion.

We got 99 items back from the scraper, using 109 requests overall with only 3 retries, so we got 97% success on the first try.

All these requests were made in 2803 seconds, with an average of 25.7 seconds per request.

The cost shows the limit of the pay-per-GB model: for these 109 requests, we used 0.27 GB. At 15 USD per GB on the smallest plan available, that means about 4 USD for this small extraction.

Smartproxy ✅

Smartproxy also performed well in this test.

We got 96 items back from 106 requests, with only 3 retries, meaning a 96.8% success rate on the first try.

These requests took 3166 seconds to complete, with an average of 30 seconds per request.

Smartproxy made the smart move of letting you choose the billing metric for your unblocker, so instead of paying per GB, we can use the pay-per-request plan, which in this case is much more convenient. Given the price of 2.25 USD per 1000 requests, our extraction cost only 0.238 USD.

ZenRows ✅

Using ZenRows we got 104 items using 112 requests, with only one retry. That means a 99% success rate on the first try, the best result for this round of tests.

The whole scraper took 3245 seconds to complete, meaning an average of 28.9 seconds per request.

Since we needed JS rendering, the cost of our extraction was 0.72 EUR. Websites that require JS rendering consume more credits than basic requests, and 69 EUR buys only 10k advanced requests.

Zyte API ✅

With Zyte API we got 90 items with 100 requests. Three of them were unsuccessful, which gives a 97% success rate. The whole scraper ran for 650 seconds, making it by far the fastest one at only 6.5 seconds per request.

The price is also very attractive: the whole extraction cost 0.063 USD, the best price of this round.

Helping you choose the best tools for your scraping needs, including proxies and unblockers, is one of the activities we’ll cover during a 1-to-1 consultation path.

If you’re struggling with your web scraping operations and need help, or simply want to check that everything is in its place in your company, you can book a consultation call and see what I can do for you.

Final remarks

Here’s the recap of our tests for the unblockers that were able to bypass Cloudflare on the website Indeed.com.

While the success rate is pretty similar for everyone, with ZenRows, Oxylabs, and Zyte on the podium, the average time per request and the costs make a huge difference.

On both of these measures, the clear winner in this case is Zyte API, about 4x faster and cheaper than the second-best unblocker in the pool.
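For readers who want to check the comparison themselves, the figures reported in the per-vendor sections can be fed to a short script. This is only a sketch: the `Result` class and its field names are hypothetical, the Oxylabs cost is entered as the rounded 4 USD stated above, and ZenRows is billed in EUR, so its cost is not directly comparable with the others.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    requests: int
    seconds: int
    cost: float
    currency: str

    @property
    def seconds_per_request(self) -> float:
        return self.seconds / self.requests


# Figures as reported in the per-vendor sections of this article.
RESULTS = [
    Result("Bright Data", 127, 4378, 0.312, "USD"),
    Result("Oxylabs", 109, 2803, 4.0, "USD"),
    Result("Smartproxy", 106, 3166, 0.238, "USD"),
    Result("ZenRows", 112, 3245, 0.72, "EUR"),
    Result("Zyte API", 100, 650, 0.063, "USD"),
]

# Rank by average time per request, fastest first.
by_speed = sorted(RESULTS, key=lambda r: r.seconds_per_request)
fastest, runner_up = by_speed[0], by_speed[1]
speed_ratio = runner_up.seconds_per_request / fastest.seconds_per_request

if __name__ == "__main__":
    for r in by_speed:
        print(f"{r.name:<12} {r.seconds_per_request:6.1f} s/request  {r.cost:.3f} {r.currency}")
    print(f"{fastest.name} is {speed_ratio:.1f}x faster than {runner_up.name}")
```

Sorting on a single derived metric keeps the script short; ranking on cost as well would first require converting the EUR figure at some assumed exchange rate.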

Thanks to all the companies involved in the test. In the next episode, we’ll test the solutions against another anti-bot.

I hope you liked this third edition of “The Great Web Unblocker Benchmark”. We’ll have a fourth one in December.