Scraping Cloudflare websites using an API

And it's not the one you're expecting

Jul 21, 2024

One of the coolest things about being part of a club is meeting people with the same interests: the book club is where you comment on your latest reads and listen to different points of view on the same subjects.

The same happens at The Web Scraping Club: we get in touch with brilliant people who are passionate about web scraping, and who create cool things, like Botasaurus, Botright, Scrapoxy, and many others.

You’re in the right room when you’re not the smartest one in it, and that’s definitely my case: despite my experience in the web scraping industry, I would not be able to create such cool tools, the kind that make our professional lives as scrapers easier.

The latest tool I have been lucky enough to test in preview was created by a person who goes by the pseudonym PixelWhisperer, a cybersecurity expert who runs the Web Scraping and Data Extraction Discord server and his own scraping wiki.

He recently developed an “unblocker API”, currently not publicly available, but I had the opportunity to test it against different anti-bots and we’ll see the results together.

PixelWhisperer API vs Cloudflare

The first test is, of course, against Cloudflare, since it’s the most widespread solution and probably the one where PixelWhisperer focused most of his efforts.

I’ve tested the API against Indeed.com, the US version.

Given 10 URLs, each for a different remote job position such as Python Developer, I created a Scrapy spider that calls the PixelWhisperer API and returns the HTML.

Just like any other similar commercial API, the integration process is quite simple: you send a request to the API endpoint, passing the target URL as a parameter and your API key in the headers.

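import scrapy


class UnblockerSpider(scrapy.Spider):
    # The spider name and input file name are placeholders;
    # see the repository for the real ones.
    name = "unblocker_test"

    API_URL = "TARGET ENDPOINT"
    HEADERS = {"X-Api-Key": "MYAPIKEY"}

    # One test case per line, formatted as url|website|antibot
    with open("urls.txt") as location_file:
        LOCATIONS = location_file.readlines()

    def start_requests(self):
        for line in self.LOCATIONS:
            url, website, antibot = line.split("|")
            params = {"url": url}
            # With method="GET", FormRequest appends formdata to the query string.
            yield scrapy.FormRequest(
                url=self.API_URL,
                method="GET",
                formdata=params,
                callback=self.test_url,  # defined in the full code
                meta={"website": website, "antibot": antibot.strip(), "original_url": url},
                headers=self.HEADERS,
            )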
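
The spider reads its targets from a file with one `url|website|antibot` entry per line. If you want to prepare a similar input file, here is a minimal sketch of how those lines could be parsed a bit more defensively, skipping blank or malformed rows. The `TargetCase` and `parse_cases` names and the sample URLs are my own illustrative assumptions, not part of the repository code.

```python
from typing import NamedTuple


class TargetCase(NamedTuple):
    """One test target (hypothetical helper type, not from the repository)."""
    url: str
    website: str
    antibot: str


def parse_cases(lines):
    """Parse 'url|website|antibot' lines, skipping blank or malformed ones."""
    cases = []
    for line in lines:
        parts = [part.strip() for part in line.split("|")]
        # Keep only rows with exactly three non-empty fields.
        if len(parts) != 3 or not all(parts):
            continue
        cases.append(TargetCase(*parts))
    return cases


# Placeholder entries for illustration only.
sample = [
    "https://example.com/jobs?q=python|example.com|Cloudflare\n",
    "\n",
    "https://example.org/item/1|example.org|Akamai\n",
]
cases = parse_cases(sample)
```

Stripping every field up front also removes the trailing newline from the last column, which the spider above handles with `antibot.strip()`.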

You can find the full code in The Web Scraping Club repository, available to all readers, under the folder PixelWhisperer.

The results? A 100% success rate on the first try! That’s great!

Let’s see what happens against other anti-bots!

PixelWhisperer API vs Akamai

Again, I used the same scraper to run the test, this time on 10 URLs from Zalando.com. As in the previous test, we got a 100% success rate. That is quite impressive, considering that we’re talking about a solo programmer building an API whose results are similar to those of commercial solutions backed by bigger budgets and headcounts.

PixelWhisperer API vs Others

I’ve also tested the API against Datadome, PerimeterX, and Kasada but, at least against the tested websites, with no success.

I’ve been able to scrape the Hermes.com homepage, protected by Datadome, but when I requested some products’ pages, the scraper was blocked.

The same goes for PerimeterX, where all the requests made to Crunchbase.com, even following the guidelines from the latest The Lab article, were unsuccessful.
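
If you want to run a comparison like this yourself, the per-anti-bot success rate can be tallied with a few lines of Python. This is only a sketch under my own assumptions: each result is recorded as an `(antibot, http_status)` pair, and a status of 200 counts as a success. That criterion is a simplification, since a block page can also come back with a 200, so in practice you may want to check the returned HTML as well. The `success_rates` name and the demo values are hypothetical and are not the measurements from these tests.

```python
from collections import defaultdict


def success_rates(results):
    """Return {antibot: fraction of requests whose status was 200}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for antibot, status in results:
        totals[antibot] += 1
        # Assumption: HTTP 200 means the page was retrieved successfully.
        if status == 200:
            successes[antibot] += 1
    return {name: successes[name] / totals[name] for name in totals}


# Made-up statuses for illustration only, not the article's measurements.
demo_results = [("A", 200), ("A", 200), ("B", 403), ("B", 200)]
rates = success_rates(demo_results)
```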

The Lab #56: Bypassing PerimeterX 3

Final remarks

I’m always amazed by the things that a talented programmer can achieve. In this case, PixelWhisperer created an unblocker API by himself that has almost the same results as commercial solutions.

Of course, it’s still a work in progress and, if you want to help him out, or just say “Thank You”, you can reach him by joining the Web Scraping and Data Extraction Discord server. He’s looking for Beta Testers!

Scraping Insights - A new series of video interviews

As mentioned last Wednesday, I’m recording a series of video interviews with key people in the web scraping industry and cyber security experts.

The first video, with Nick Rieniets, CTO of Kasada, has been recorded: I hope it will soon be out on the brand new The Web Scraping Club YouTube channel. If you don’t want to miss it, I suggest subscribing to it.

Next week, there will be two recordings: one with Antoine Vastel, VP of Research at Datadome, and then with Fabien Vauchelles, an anti-ban expert at Wiremind and creator of Scrapoxy. If you’re a paying user, you can ask your questions live: you can join the conversation by using the link in the latest article.

Scraping Insights - A video interview series by The Web Scraping Club - Join us
