THE LAB #3: Scraping Cloudflare-protected websites
Without buying any external software, for real.
Here’s another post of “THE LAB”: in this series, we'll cover real-world use cases, with code and an explanation of the methodology used.
In the future, this kind of content will be available only to paying subscribers. As one of the first posts in the series, this one will be free until the 2nd of Oct 2022, then it will move behind a paywall.
Being a paying user gives:
Access to paid content, like the post series “THE LAB”, where we dive deep into real-world cases with code.
Access to the GitHub repository with the code seen in “THE LAB”
Access to private channels on our Discord server
But in case you want to read this newsletter for free, you will always get a post per week about:
News about web scraping
Anti-bot software and techniques insights
Interviews with key people in the industry
And you can always join the Web Scraping Club Discord server
Enough housekeeping for now; let’s start.
What is Cloudflare?
Cloudflare is an American company, based in San Francisco, that offers services such as DDoS mitigation, distributed DNS, a content delivery network, and anti-bot protection for websites.
For its anti-bot protection, it uses both passive bot detection techniques, like TCP, TLS, and HTTP fingerprinting, and active ones, like canvas fingerprinting and CAPTCHAs. On top of this, it queries the browser to detect automation tools and monitors what happens on the page, tracking mouse movements and any other actions that can give a bot away.
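To make the idea of passive HTTP fingerprinting more concrete, here is a minimal sketch in Python. It is a toy illustration only, not Cloudflare's actual logic, which is not public: the function name `score_headers`, the marker list, and the scoring weights are all assumptions made up for this example. It simply shows how a server could flag a request by looking at nothing more than the headers it receives.

```python
# Toy illustration of passive HTTP header fingerprinting.
# NOT Cloudflare's real logic: names, markers and weights are hypothetical.

# Headers a mainstream browser normally sends on a page navigation.
EXPECTED_BROWSER_HEADERS = ("accept", "accept-language", "accept-encoding")

# Substrings that scraping tools put in their default User-Agent.
AUTOMATION_MARKERS = ("python-requests", "scrapy", "curl", "headlesschrome")


def score_headers(headers: dict) -> int:
    """Return a suspicion score for a request; higher means more bot-like."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    score = 0

    user_agent = normalized.get("user-agent", "").lower()
    if not user_agent:
        score += 2  # browsers always send a User-Agent
    elif any(marker in user_agent for marker in AUTOMATION_MARKERS):
        score += 3  # the client announces itself as a tool

    # Each missing "browser-like" header adds a little suspicion.
    for name in EXPECTED_BROWSER_HEADERS:
        if name not in normalized:
            score += 1

    return score


bot_like = {"User-Agent": "python-requests/2.28.1", "Accept": "*/*"}
browser_like = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

print("bot-like request score:", score_headers(bot_like))
print("browser-like request score:", score_headers(browser_like))
```

A real anti-bot system combines many more signals (TLS handshake details, header order, browser-side checks), so matching browser headers alone should not be expected to get a scraper through.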
At the moment, it's one of the toughest solutions to bypass in a web scraping project. I think anyone with some experience in this field has encountered this screen at least once in their life.
Since there's no silver bullet to avoid being blocked, we'll see 3 similar but not identical solutions for scraping 3 different websites:
As usual, we’re scraping public product price data, without logging in, and at a speed that doesn't harm the business of the target website.