THE LAB #62: Bypassing Cloudflare with Nodriver
Testing the undetected-chromedriver successor for scraping Cloudflare-protected websites
What is nodriver
Nodriver is an open-source browser automation and web scraping library, the successor of undetected-chromedriver, designed to extract data from websites that rely heavily on JavaScript to render their content. Unlike traditional tools such as Selenium, which need a separate browser driver (a webdriver) to interact with dynamic web pages, Nodriver talks to the browser directly and so avoids the overhead of managing that driver.
Removing the webdriver from the architecture of the solution has two advantages:
with direct access to the browser, the scraper is faster, since commands no longer pass through an intermediate driver
the detection surface is smaller, since anti-bot systems have fewer signals to look for.
Main features
Let’s set the expectations for Nodriver by analyzing its features.
JavaScript Rendering Without a Driver: Enables scraping of JavaScript-heavy websites without needing a separate browser driver.
Lightweight and Efficient: Consumes fewer resources than browser automation tools that require a webdriver.
Ease of Use: Provides a straightforward API for developers to integrate into their projects.
Headless Operation: Can run in the background without opening a visible browser window.
Fully Asynchronous: Can launch several browser windows and handle them concurrently.
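To make the asynchronous feature more concrete, here is a minimal sketch, not taken from the article's repository, that opens a few pages in separate tabs and reads their titles concurrently. The URL list and the helper names (fetch_title, collect_titles, run) are hypothetical; the nodriver calls used are start(), Browser.get(), Tab.evaluate() and Browser.stop(), and the sketch assumes each page has loaded far enough for its title to be available when get() returns.

```python
import asyncio

# Hypothetical target list: replace with pages you are allowed to scrape.
URLS = ["https://example.com", "https://example.org"]


async def fetch_title(browser, url: str) -> str:
    """Open `url` in its own tab and return the document title."""
    tab = await browser.get(url, new_tab=True)
    # evaluate() runs a JavaScript expression inside the page.
    return await tab.evaluate("document.title")


async def collect_titles(urls: list[str]) -> dict[str, str]:
    # Imported lazily so this module loads even without nodriver installed.
    import nodriver as uc

    browser = await uc.start()  # starts a regular Chrome, no webdriver involved
    try:
        # One coroutine per URL, all awaited together on the same event loop.
        titles = await asyncio.gather(*(fetch_title(browser, u) for u in urls))
        return dict(zip(urls, titles))
    finally:
        browser.stop()


def run() -> None:
    """Entry point: uses the event loop helper that nodriver provides."""
    import nodriver as uc

    print(uc.loop().run_until_complete(collect_titles(URLS)))
```

Calling run() should print a dictionary mapping each URL to its page title. Opening one tab per URL is only one possible design; for a large URL list you would probably want to cap the number of tabs open at once.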
On the other hand, this is not an all-in-one solution, since some features are not included:
Fingerprint forging: Nodriver won’t change the fingerprint of the browser it drives, so information about the machine where it’s running stays exposed.
Authenticated proxies: Because of Chromium’s limitations, it is difficult to use proxies that require authentication in your scraper.
Human behavior: Packages like ghost-cursor, which imitate human mouse interactions, are not included.
Benchmark with Playwright
Given these premises, let's see what a standard Playwright scraper looks like today from an anti-bot perspective.
You can find the script in the folder 62.NODRIVER of the GitHub repository, available only for paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please write me at pier@thewebscraping.club so I can include you in the team.
In the script playwright_chrome.py, I’m launching a Playwright session using the Chrome browser with a persistent context and then loading the bot detection tests created by Antoine Vastel on deviceandbrowserinfo.com.
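For readers without access to the repository, the following is a rough sketch of what such a script could look like. It is not the actual playwright_chrome.py. The test page path, the profile directory, the screenshot name and the function name are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed location of the bot detection tests on deviceandbrowserinfo.com.
TEST_URL = "https://deviceandbrowserinfo.com/are_you_a_bot"
# Hypothetical folder where the persistent Chrome profile is stored.
PROFILE_DIR = Path("./chrome-profile")


def run_playwright_check(
    url: str = TEST_URL,
    profile_dir: Path = PROFILE_DIR,
    screenshot: str = "playwright_result.png",
) -> str:
    """Load the test page in Chrome and save a screenshot of the results."""
    # Imported lazily so this module loads even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # channel="chrome" uses the installed Chrome instead of bundled Chromium;
        # a persistent context keeps cookies and storage between runs.
        context = p.chromium.launch_persistent_context(
            user_data_dir=str(profile_dir),
            channel="chrome",
            headless=False,
        )
        # A persistent context usually opens with one page already available.
        page = context.pages[0] if context.pages else context.new_page()
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=screenshot, full_page=True)
        context.close()
    return screenshot
```

Calling run_playwright_check() opens a visible Chrome window, waits for the test page to settle, and writes the screenshot next to the script so the results can be inspected afterwards.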
Since I’m running this script from my local machine, I don’t see any red flags about my device fingerprint, IP, hardware, and so on, except for this one.
While some browser startup options let me circumvent the most basic checks, I’m still getting detected by the CDP (Chrome DevTools Protocol) detection test.
We talked about it in a previous THE LAB article, where we patched our Playwright scraper to avoid being detected through CDP.
But is Nodriver able to pass this test without any additional patches?
If we run the nodriver_test.py script, we can see that the CDP test is also passed, thanks to Nodriver’s architecture.
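As a reference point, a Nodriver version of the same check could look roughly like the sketch below. It is not the actual nodriver_test.py: the test page path, the fixed settle delay, the output file name and the helper names are assumptions, and the screenshot uses a .jpg name to match the default image format of save_screenshot().

```python
import asyncio

# Assumed location of the bot detection tests on deviceandbrowserinfo.com.
TEST_URL = "https://deviceandbrowserinfo.com/are_you_a_bot"


async def snapshot_test_page(
    browser,
    url: str = TEST_URL,
    screenshot: str = "nodriver_result.jpg",
    settle_seconds: float = 5.0,
) -> str:
    """Load the test page in an already started browser and save a screenshot."""
    tab = await browser.get(url)
    # Crude fixed wait so the page's detection scripts can finish (assumption).
    await asyncio.sleep(settle_seconds)
    await tab.save_screenshot(screenshot)
    return screenshot


async def main() -> str:
    # Imported lazily so this module loads even without nodriver installed.
    import nodriver as uc

    browser = await uc.start()  # default settings, no extra patches applied
    try:
        return await snapshot_test_page(browser)
    finally:
        browser.stop()


def run() -> None:
    """Entry point: uses the event loop helper that nodriver provides."""
    import nodriver as uc

    print(uc.loop().run_until_complete(main()))
```

Calling run() starts Chrome with default settings, so whatever the screenshot shows reflects Nodriver's out-of-the-box behavior rather than any custom hardening.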
That’s a great result: by default, we’re creating a scraper that passes the most well-known bot detection tests.
Is this enough to bypass Cloudflare and scrape data from a website it protects?