THE LAB #16: How to scrape Datadome protected websites (early 2023 version)
Tools and techniques to scrape Datadome protected websites
This post is sponsored by Proxyempire, your trusted proxy partner. Sponsorships help keep The Web Scraping Club free, and they are a way to give some value back to readers.
In this case, all The Web Scraping Club readers can save 10% on every purchase with the discount code TWSC10.
The Web Scraping Club is a free weekly newsletter about web scraping. Once every two weeks, I publish The Lab, paid content with deep dives into more technical aspects and solutions to common issues, with accompanying code in a GitHub repository. You can access the following articles with a 7-day trial and then subscribe if you find them useful.
As always, please read the following disclaimer carefully: all the information you will find here is for research purposes and should not be used to cause damage to any website's business or operations. Scrape carefully and ethically, without disrupting the operation of the target website, and collect only publicly available data not protected by copyright.
What is Datadome and how does it work?
Datadome Bot Protection is a comprehensive software solution that is designed to protect your website or application from various types of malicious bots. The solution uses advanced bot detection techniques, such as device fingerprinting, behavior analysis, and machine learning algorithms, to distinguish between human and bot traffic. By identifying and blocking malicious bots, Datadome helps improve website performance, protect sensitive data, and prevent fraud.
One of the key features of Datadome is its ability to detect and block automated attacks that can cause harm to your website or application. These automated attacks can come in many forms, including scraping, account takeover, credential stuffing, and more. Datadome uses a variety of techniques to detect and block these attacks, including analyzing user behavior and patterns, analyzing IP addresses and user agents, and analyzing traffic patterns.
Datadome also includes a real-time dashboard that allows you to monitor bot activity and take action if necessary. This dashboard provides a detailed view of bot traffic, including the number of bots detected, the types of bots detected, and the actions that were taken. You can also set up alerts to notify you when certain bot activity is detected, allowing you to take immediate action to protect your website or application.
Overall, Datadome Bot Protection is a powerful solution that can help protect your website or application from the growing threat of malicious bots. By using advanced bot detection techniques and providing real-time monitoring and alerts, Datadome can help improve website performance, protect sensitive data, and prevent fraud.
How to detect Datadome?
The easiest way is via tools like Wappalyzer that test the tech stack of a website and can detect which anti-bot is used on it.
Another way is to inspect the cookies of the requests made to the target website: as an example, when we browse to Footlocker.it, as a response to the first request we get a Datadome cookie.
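The cookie check can also be scripted. The following is a minimal sketch using only the Python standard library; it assumes the cookie is literally named "datadome" (the name commonly seen in the browser dev tools), and the helper names are made up for this example.

```python
import urllib.error
import urllib.request


def cookie_names(set_cookie_headers):
    """Return the lowercase cookie names found in raw Set-Cookie header values."""
    names = set()
    for header in set_cookie_headers:
        # The cookie itself is the first "name=value" pair, before any attribute.
        first_pair = header.split(";", 1)[0]
        name = first_pair.split("=", 1)[0].strip().lower()
        if name:
            names.add(name)
    return names


def looks_datadome_protected(set_cookie_headers):
    """Assumption: a cookie named 'datadome' signals Datadome protection."""
    return "datadome" in cookie_names(set_cookie_headers)


def fetch_set_cookie_headers(url, user_agent="Mozilla/5.0"):
    """Request `url` once and return every Set-Cookie header of the response."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=15) as response:
            headers = response.headers
    except urllib.error.HTTPError as err:
        # A blocked request (e.g. HTTP 403) still carries response headers.
        headers = err.headers
    return headers.get_all("Set-Cookie") or []
```

Calling looks_datadome_protected(fetch_set_cookie_headers(url)) on the home page of the target website should be enough for a first check. A negative result is not conclusive, since the cookie may be set only by JavaScript or on later requests.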
Also, when browsing a Datadome-protected website in Incognito mode, especially on your first visit, you may encounter one of their slider challenges.
Given that results may vary depending on the target website's configuration and on the environment you're running the tests from, let's try to figure out how to bypass Datadome Bot Protection, starting with some free open-source tools.
Playwright with Chrome ❌
We start our tests on a local machine with Playwright and Chrome. To the standard configuration I've added a package I recently discovered, python_ghost_cursor, which simulates human mouse movements using Bezier curves, a technique we covered in an earlier post.
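To give an idea of what "mouse movements using Bezier curves" means, here is a small illustrative sketch. It is not the code of python_ghost_cursor: the function names, the number of steps and the random spread of the control points are all assumptions chosen for the example. The only Playwright call used is page.mouse.move(x, y).

```python
import random


def bezier_path(start, end, steps=25, spread=100.0, rng=random):
    """Return steps + 1 points along a cubic Bezier curve from start to end.

    The two inner control points are placed at one third and two thirds of
    the straight line and then shifted randomly, so every path bends a bit
    differently instead of being a perfectly straight segment.
    """
    (x0, y0), (x3, y3) = start, end
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)

    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        # Standard cubic Bezier formula, applied to each coordinate.
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points


def move_mouse_along(page, points):
    """Replay the points on a Playwright page, one small move at a time."""
    for x, y in points:
        page.mouse.move(x, y)
```

With an open Playwright page, move_mouse_along(page, bezier_path((10, 10), (400, 300))) moves the pointer along a curved path rather than jumping straight to the target.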
Anyway, this didn’t help: I got the captcha as soon as I tried to open the men’s shoes product list page.
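When running tests like this one, it helps to let the scraper itself recognize that it landed on the challenge page. The sketch below is only a heuristic built on an assumption: that the challenge is loaded from the captcha-delivery.com domain, as commonly observed on Datadome-protected websites. The marker list and the function name are hypothetical and may need adjusting for your target.

```python
# Assumption: the Datadome challenge is served from this domain.
CHALLENGE_MARKERS = ("captcha-delivery.com",)


def looks_like_datadome_challenge(html, markers=CHALLENGE_MARKERS):
    """Return True if the page source contains one of the challenge markers."""
    lowered = html.lower()
    return any(marker in lowered for marker in markers)
```

In a Playwright scraper, the check can be applied to page.content() right after page.goto(), so a blocked run is reported instead of silently returning an empty product list.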
Playwright with Firefox ✅
Things got better after switching to Firefox, although I had to remove the python_ghost_cursor package, since it works only with Chrome.
The results from both a local environment and a VM in a datacenter are great, so this solution is definitely approved. It seems that Chrome leaks some data that Datadome uses to detect automation. Let’s give it another try with another Chromium-based browser like Brave.
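For reference, a Firefox run with Playwright's sync API can be as short as the sketch below. It is not the exact scraper used for these tests: the function name and the choice of waiting for "domcontentloaded" are assumptions, and the browser is launched headful (headless=False) by default.

```python
def scrape_with_firefox(url, headless=False):
    """Open `url` in Playwright's Firefox and return (final_url, html)."""
    # Imported here so the module can be loaded even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.firefox.launch(headless=headless)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        final_url = page.url
        html = page.content()
        browser.close()
    return final_url, html
```

It requires the playwright package and the Firefox build installed with "playwright install firefox".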
Playwright with Brave ✅
I’ll use the same scraper we saw earlier with Chrome, but change the browser's executable path to point to Brave.
I’m able to browse the website both on a local machine and on a VM, so it seems that the data leak is specific to Chrome. Good to know: now we have more free options to use.
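In Playwright, pointing the Chromium driver to a different browser is done with the executable_path argument of launch(). The sketch below shows the idea; the candidate paths are the usual default install locations of Brave and are assumptions, so adjust them to your machine. The helper names are made up for this example.

```python
import os

# Assumption: typical default install locations of Brave, not guaranteed.
BRAVE_CANDIDATES = [
    "/usr/bin/brave-browser",
    "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser",
    r"C:\Program Files\BraveSoftware\Brave-Browser\Application\brave.exe",
]


def find_brave(candidates=BRAVE_CANDIDATES):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def scrape_with_brave(url, executable_path, headless=False):
    """Drive Brave through Playwright's Chromium driver; return the page HTML."""
    # Imported here so find_brave() works even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(
            executable_path=executable_path,
            headless=headless,
        )
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        html = page.content()
        browser.close()
    return html
```

Everything else in the scraper stays the same as in the Chrome version, which is what makes this test cheap to run.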
Final thoughts on the free solutions
Datadome is gaining popularity as an anti-bot solution, and bypassing it has a cost, especially if you need to scrape several large websites in your projects. As we have seen, we need headful browsers, which means allocating more CPU and memory to the scraper, but at least it can be bypassed with free tools.
On the other hand, some commercial solutions allow us to bypass Datadome using a simple Scrapy spider and an API. These API calls have a cost, but they can significantly reduce the cost of the machines hosting our scrapers. How convenient these solutions are varies from case to case, but they are worth noting.