THE LAB #10: Bypass Cloudflare Bot Protection with GoLogin
A new way to scrape Cloudflare-protected websites using anti-detect browsers
This article is sponsored by Serply, the solution to scrape search engine results easily.
Web Scraping Club readers can save 25% on all SERP scraping plans by using the code TWSC25.
Cloudflare anti-bot detection
If you google “Cloudflare bypass”, you will find hundreds of articles and GitHub repositories explaining how to bypass Cloudflare (or selling a solution for doing it). I also wrote another post on this topic some months ago, and it’s one of the most successful in terms of readers coming from search engines.
The reason is pretty straightforward: Cloudflare’s Bot Management solution is one of the strongest and most widely used anti-bot protections on the internet.
Why it is difficult
Unlike traditional security measures, which rely on IP blocking or CAPTCHAs, Cloudflare's Bot Management solution uses advanced machine learning algorithms to analyze the requests made to a website. This allows it to identify bots by looking for telltale patterns in their behavior. For example, bots may make a large number of requests in a short period of time, use a specific type of user agent or IP address, or have inconsistent or suspect fingerprints.
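To make the first of these signals concrete, here is a toy sliding-window counter that flags a client sending too many requests in a short time. It is only an illustration of the idea, not how Cloudflare works internally: the class name RateSignal and its thresholds are made up for this example.

```python
from collections import deque


class RateSignal:
    """Toy sliding-window counter (hypothetical, not Cloudflare's logic)."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def is_suspicious(self, now: float) -> bool:
        """Record a request at time `now` and report whether the rate is too high."""
        self._timestamps.append(now)
        # Forget requests that are older than the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_requests
```

A real anti-bot system combines many signals like this one, which is why slowing down the scraper alone is rarely enough.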
Another reason why Cloudflare's Bot Management solution is hard to bypass is that it is constantly updated to detect new types of bots. The company uses machine learning algorithms to continuously update its detection methods, so it can quickly identify and block new types of bots as they appear.
Last but not least, there’s no silver bullet against Cloudflare Bot Management since it’s a highly customized solution, so what works for one website might not work for another.
As proof of this, in my previous post about Cloudflare, I wrote three similar solutions for three different websites, but only two of them still work. In fact, during the past weeks, I’ve struggled to bypass Cloudflare on the Antonioli website with Playwright: I kept getting blocked after a few pages, especially when the execution was running inside a VM on AWS.
A new approach: anti-detect browsers
Having tried Playwright with different browsers and contexts and on several cloud providers without any success, I decided to try launching an anti-detect browser from Playwright.
What are anti-detect browsers?
Anti-detect browsers are usually forks of Chromium with added features that enhance the privacy of their users. Typically they obfuscate or randomize the fingerprints and the location of the user, and this is the main difference from a classic execution of Playwright or Selenium.
To simplify the comparison: when you use Chrome with Playwright, the server knows you’re using a genuine version of Chrome, but its device fingerprint reveals that it runs on a datacenter machine. With an anti-detect browser, you’re using a version of Chromium set up for maximum privacy that connects using a custom profile and sends custom device fingerprints (e.g. you pretend to be running the browser on a Mac while it’s actually running on a server).
I then needed to test whether this solution could work and, among the several browsers available, I had to choose one with the following specs:
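As a rough illustration of what an "inconsistent fingerprint" means, the sketch below checks whether the operating system named in a User-Agent string matches the value reported by navigator.platform. The mapping and function names are my own simplification for this example; real anti-bot systems look at far more attributes.

```python
# Hypothetical mapping: OS token in the User-Agent -> plausible navigator.platform values.
EXPECTED_PLATFORMS = {
    "Windows": ("Win32",),
    "Macintosh": ("MacIntel",),
    "Linux": ("Linux x86_64", "Linux aarch64"),
}


def os_from_user_agent(user_agent: str):
    """Return the first known OS token found in the User-Agent, or None."""
    for os_name in EXPECTED_PLATFORMS:
        if os_name in user_agent:
            return os_name
    return None


def is_consistent(user_agent: str, platform: str) -> bool:
    """True when navigator.platform is plausible for the OS the User-Agent claims."""
    os_name = os_from_user_agent(user_agent)
    if os_name is None:
        return False
    return platform in EXPECTED_PLATFORMS[os_name]
```

Changing only the User-Agent header on a Linux server produces exactly this kind of mismatch, while an anti-detect profile is meant to keep all the values aligned.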
Has a fully working free demo, to test my solution
Can quickly be integrated with Playwright, minimizing the impact on my production environment
Has a Unix client, again because of my production environment.
Technically, any Chromium-based browser can run with Playwright if the executable_path is specified in the following way:
browser = playwright.chromium.launch(executable_path='/opt/path_to_bin')
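For completeness, here is a minimal sketch of how that line fits into a full script with Playwright's sync API. The helper names launch_options and fetch_title are mine, and the binary path is whatever Chromium-based executable you have installed; treat it as a starting point rather than the exact code I run in production.

```python
from pathlib import Path


def launch_options(executable_path: str, headless: bool = True) -> dict:
    """Build keyword arguments for chromium.launch(), failing early on a bad path."""
    binary = Path(executable_path)
    if not binary.is_file():
        raise FileNotFoundError(f"No browser binary at {binary}")
    return {"executable_path": str(binary), "headless": headless}


def fetch_title(url: str, executable_path: str) -> str:
    """Open `url` with a custom Chromium-based binary and return the page title."""
    # Imported here so launch_options() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(**launch_options(executable_path))
        page = browser.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
        return title
```

Calling fetch_title("https://amiunique.org", "/opt/path_to_bin") would then drive the custom browser instead of the Chromium build bundled with Playwright.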
I chose GoLogin because it meets all the requirements above and because I could create different profiles (and so different fingerprints) to use in my experiments.
After the onboarding for the trial, I used the interface to create my first profile, which mimics a Windows workstation.
Then I downloaded the browser’s client and the Python source code from their repository, which is needed for interacting with Playwright using their API.
Using the tests on amiunique.org we can see the differences between the standard Playwright execution of Chrome from my Mac laptop and the one with a custom profile of a Windows machine using the GoLogin browser.
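A hedged sketch of how the pieces can be wired together follows. It assumes that GoLogin's Python client exposes a GoLogin class taking a token and a profile_id, that its start() method returns a "host:port" debugger address, and that stop() closes the profile; check their repository for the current API. Playwright then attaches to the running browser over CDP. The function names are mine.

```python
def cdp_endpoint(debugger_address: str) -> str:
    """Turn a 'host:port' debugger address into a URL for connect_over_cdp()."""
    if debugger_address.startswith(("http://", "https://", "ws://", "wss://")):
        return debugger_address
    return f"http://{debugger_address}"


def open_with_gologin(token: str, profile_id: str, url: str) -> str:
    """Start a GoLogin profile, attach Playwright to it, and return the page title."""
    # Imported lazily: both packages are third-party dependencies.
    from gologin import GoLogin  # assumption: GoLogin's Python client
    from playwright.sync_api import sync_playwright

    gl = GoLogin({"token": token, "profile_id": profile_id})
    debugger_address = gl.start()  # assumption: returns something like "127.0.0.1:3500"
    try:
        with sync_playwright() as playwright:
            browser = playwright.chromium.connect_over_cdp(cdp_endpoint(debugger_address))
            # Reuse the profile's own context so its fingerprint settings apply.
            context = browser.contexts[0] if browser.contexts else browser.new_context()
            page = context.new_page()
            page.goto(url)
            return page.title()
    finally:
        gl.stop()
```

The token and profile ID come from the GoLogin interface where the profile was created.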
In the first case, we can see the MacIntel platform and the macOS headers, which could be easily changed anyway.
Using GoLogin instead, I am faking the execution from a Windows machine. Of course, there are many more differences than the ones I screenshotted, but you can easily check them yourself using the code I’ve shared in The Web Scraping Club GitHub repository reserved for paying readers.
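If you just want a quick look at the most visible differences without opening amiunique.org, a small sketch like the following reads a few navigator properties from a Playwright page and compares two runs. The property list and the function names are my own choice for this example, not the full set of attributes a fingerprinting service inspects.

```python
# JavaScript evaluated in the page: collects a handful of navigator properties.
NAVIGATOR_JS = """() => ({
    platform: navigator.platform,
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    hardwareConcurrency: navigator.hardwareConcurrency,
})"""


def read_fingerprint(page) -> dict:
    """Return a few navigator properties from an open Playwright page."""
    return page.evaluate(NAVIGATOR_JS)


def diff_fingerprints(first: dict, second: dict) -> dict:
    """Map each differing key to its (first, second) pair of values."""
    keys = sorted(first.keys() | second.keys())
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }
```

Running read_fingerprint once with plain Playwright and once with the GoLogin profile, then passing both results to diff_fingerprints, lists the attributes the profile has changed.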