The Web Scraping Club

THE LAB #103: Bypassing DataDome-Protected Websites in the Agentic Era

Fifteen browser configurations, one tough anti-bot, and only a couple made it to the cart

Pierluigi Vinciguerra
Apr 30, 2026

This year every web infrastructure company seems to be shipping a browser. Not a regular browser, though: one designed to be driven by an AI agent and to look human while doing it. We wanted to know whether any of those browsers actually work against a serious anti-bot, so we picked a hard target, leroymerlin.fr behind DataDome, and tested more than a dozen different setups on the same four-step task: open the homepage, search for a product, open the first result, add it to the cart.

Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.

Claim your offer


The short answer is that only a couple of tools finished the task, and just one did so with any consistency. The story behind why is worth telling, because it explains what is happening at the intersection of AI agents and web data right now. We ran a similar exercise against Cloudflare earlier this year, and the conclusion is broadly the same: each anti-bot needs its own answer, and the answer changes every quarter.

From workflows to agents, and why that changes the data problem

Most code shipped under the AI banner is not really agentic. It is workflow code with an LLM dropped into a slot: generate a summary here, classify a record there, draft an email at the end. The control flow is hard-coded, and the model is one component among many.

The definition of an agent is quite different. The model decides the next action, observes the outcome, and decides again. The control flow lives inside the loop, not outside it. The agent has goals rather than scripts, and it picks tools and steps based on what it sees. That is what makes the engineering interesting, that is what makes it hard, and that is what sometimes makes it unreliable.
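
To make the distinction concrete, here is a minimal sketch of that loop in Python. Everything in it is a placeholder: in a real agent, decide() would be an LLM call that picks the next tool invocation from the goal and the history so far, and act() would execute it in a browser or some other environment.

```python
def decide(goal: str, history: list) -> dict:
    """Placeholder for the LLM call that chooses the next action."""
    raise NotImplementedError

def act(action: dict) -> str:
    """Placeholder for executing a tool: click, type, navigate, and so on."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 20) -> list:
    history = []
    for _ in range(max_steps):
        action = decide(goal, history)          # the model picks the next step
        if action.get("type") == "done":        # the model, not the code, decides when to stop
            break
        observation = act(action)               # run the step against the environment
        history.append((action, observation))   # feed the outcome into the next decision
    return history
```

The loop is trivial; what is not trivial is that every iteration depends on what the page actually returned, which is exactly why the agent needs live web access in the first place.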

It also forces a different relationship with data. An agent that only sees its training corpus is stuck in the past. To make decisions worth anything, it has to read prices that change daily, stocks that move minute by minute, listings that did not exist last week. Some of that data sits behind APIs. Most of it does not. The web is still the largest and most current dataset in the world, and most of it is reachable only through a browser. So if we want our agents to act on real information, we have to give them a way to browse: opening a page, reading it, clicking a link, typing into a search bar, following a result, filling a form, all on sites that were never built for machines.


Start your scraping journey with Byteful: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries

Claim your 10GB here


This is the constraint that produced the wave of “agentic browser” launches we have seen over the last twelve months. Y Combinator alone has backed a long string of them. Hyperbrowser (S21) was an early entry: scalable cloud browser infrastructure with built-in CAPTCHA solving, proxy management, and now a multi-agent playground. The newer cohort followed the agent wave more directly: Browserbase (W24) ships a managed browser plus Stagehand for natural-language automation; BrowserOS (S24) is an open-source agentic browser that runs the agent locally on the user’s machine; Browser Use (W25) offers an open-source agent loop on top of Playwright, plus a cloud version. Skyvern is a self-hostable browser agent that uses an LLM and computer vision instead of fixed selectors.

Outside the YC pipeline, Lightpanda is doing something different again: a headless browser engine written from scratch in Zig and aimed squarely at agents and crawlers (claiming roughly 9x faster execution and 16x lower memory than Chrome). It fits the “browser built for machines” line of thought we covered in Rethinking the web browser earlier this year.

And the big AI labs are now in the same space: OpenAI shipped Operator and the ChatGPT Atlas browser, Anthropic shipped Computer Use, Perplexity launched Comet. Each project attacks the same problem from a slightly different angle, but the goal is identical: a browser an agent can drive without immediately tripping every detection mechanism on the other side.


Your scraping workflows deserve a proxy infrastructure that just works. With Swiftproxy on your side, consistency is built-in.

Try Swiftproxy today


The same problem scrapers have been chasing for a decade

For anyone who has worked in web data, none of this is new. The fight over whether a request looks human or automated has been going on as long as commercial scraping has existed. The product names have changed, but the purpose has not.

What has changed is who is selling the bypass. The companies that have spent years selling residential proxies and unblockers noticed quickly that the agentic boom is good for their business. They already have the IP networks, the fingerprint research, the bypass code, the cat-and-mouse experience. They know what TLS handshake Chrome sends in October 2025 and what it sent in October 2024. Pivoting all of that into a managed browser is a smaller leap than building one from scratch. Bright Data, Oxylabs, Rayobyte, ZenRows have all added a managed browser product alongside the proxy.

The other side of the line is moving in the opposite direction. Bot traffic has grown faster than human traffic for years, and the operators of large public sites care more about it than ever. DataDome, Cloudflare Bot Management, Akamai Bot Manager, HUMAN, Kasada: every one of them ships updates that target the exact tools we just listed. Fingerprint checks get stricter. Behavioral models get more sensitive. The JavaScript challenge changes shape every few weeks. There is no silver bullet, and there is no tool, browser, proxy, or service that bypasses every anti-bot on every site at all times. Anyone who claims otherwise is selling something that worked last quarter and might still work this week. The useful question is what works on a given target, today, at what cost.


Check the TWSC YouTube Channel


Picking a hard target

To answer that question concretely, we needed a target where the anti-bot was good and the signal was clean. We picked leroymerlin.fr, the French DIY retailer. Leroy Merlin runs DataDome standalone, with no other anti-bot layer on top, so attribution is straightforward. It also runs one of the more verbose DataDome configurations we have come across: response headers expose x-datadome-riskscore, x-datadome-protection, x-datadome-cid, and x-datadome-endpointid. Most DataDome-protected sites only show us the outcome. Here we see the score the engine assigns at every request, which is rare and very useful when comparing tools side by side.
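
To give an idea of how visible this is, here is a minimal Playwright sketch that prints those headers on every response. The header names are the ones leroymerlin.fr actually returns; everything else is just plumbing.

```python
from playwright.sync_api import sync_playwright

DD_HEADERS = ("x-datadome-riskscore", "x-datadome-protection",
              "x-datadome-cid", "x-datadome-endpointid")

def log_datadome(response):
    # Playwright lower-cases header names, so the lookup is straightforward
    found = {h: response.headers[h] for h in DD_HEADERS if h in response.headers}
    if found:
        print(response.url[:80], found)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", log_datadome)   # fires for every response, XHRs included
    page.goto("https://www.leroymerlin.fr")
    browser.close()
```

Watching the risk score move request by request is what lets us say where a run died, not just that it died.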

The task we picked is small but realistic. From the homepage, the agent has to type “ampoule B22 led blanc” into the search bar, click the first product result, and add the product to the cart. Four steps. We dropped the login step on purpose: leroymerlin.fr requires an OTP to sign in, and we did not want OTP friction to confound an anti-bot test.
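
In Playwright terms, the flow looks roughly like the sketch below. The selectors are placeholders, not the ones we actually used; the real code is in the repository.

```python
from playwright.sync_api import Page

QUERY = "ampoule B22 led blanc"

def run_task(page: Page) -> bool:
    page.goto("https://www.leroymerlin.fr")              # step 1: homepage
    page.fill("input[type=search]", QUERY)               # step 2: search (placeholder selector)
    page.keyboard.press("Enter")
    page.click("article a >> nth=0")                     # step 3: first product result (placeholder)
    page.click("button:has-text('Ajouter au panier')")   # step 4: add to cart (placeholder)
    page.wait_for_selector("text=Panier")                # pass = cart confirmation (placeholder)
    return True
```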

A run is a pass if the agent reaches the cart confirmation. Otherwise we record where it stopped and what DataDome said about it. Each tool runs ten times back to back, and we aggregate the results. Tools that support an external proxy use the same residential pool: Bright Data residential FR for the Bright Data runs, Geonode residential FR for the Geonode runs. Tools that ship their own proxy use it. We used two different providers to diversify the IP addresses, to be sure that blocks were not just a matter of IP reputation.
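
The harness around each tool is equally small. A sketch of the ten-run loop, reusing run_task from the sketch above and with placeholder proxy credentials:

```python
from playwright.sync_api import sync_playwright

PROXY = {"server": "http://proxy.example.com:8000",   # placeholder FR residential endpoint
         "username": "USER", "password": "PASS"}

def run_batch(n: int = 10) -> dict:
    results = {"pass": 0, "fail": 0}
    with sync_playwright() as p:
        for _ in range(n):
            browser = p.chromium.launch(proxy=PROXY)
            page = browser.new_page()
            try:
                results["pass" if run_task(page) else "fail"] += 1
            except Exception:
                results["fail"] += 1   # the real harness also records where it stopped
            finally:
                browser.close()
    return results
```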


Need help with your scraping project?


The contestants

As you’ve seen before, the browser landscape is quite crowded and we could not cover every tool. We picked four open-source projects and seven commercial products. Let’s start with the open-source ones.

Camoufox is the stealth Firefox fork most people in the scraping world have already met (we introduced it on TWSC back in September 2024). It rotates real-world fingerprints, patches the obvious automation tells, and ships a Playwright-compatible API. We pair it with both Bright Data and Geonode residential proxies in France.
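
For reference, the Camoufox setup looks roughly like this, assuming the camoufox Python package (which wraps its Playwright-compatible API) and with placeholder proxy credentials:

```python
from camoufox.sync_api import Camoufox

proxy = {"server": "http://proxy.example.com:8000",   # placeholder FR residential endpoint
         "username": "USER", "password": "PASS"}

with Camoufox(proxy=proxy, geoip=True) as browser:    # geoip aligns the fingerprint with the exit IP
    page = browser.new_page()
    page.goto("https://www.leroymerlin.fr")
```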

Pydoll takes a different route: it drives Chromium directly over CDP without WebDriver, with built-in humanized cursor movement and typing. Importantly, Pydoll implements an explicit Fetch.authRequired handler, which lets it authenticate against proxies that require Basic auth.
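
The sketch below shows that CDP mechanism itself, not Pydoll’s own code, driven here through a raw CDP session in Playwright. Fetch.enable with handleAuthRequests makes Chromium emit Fetch.authRequired instead of failing on a proxy that wants Basic auth; the proxy address and credentials are placeholders.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(args=["--proxy-server=http://proxy.example.com:8000"])
    page = browser.new_page()
    cdp = page.context.new_cdp_session(page)
    cdp.send("Fetch.enable", {"handleAuthRequests": True})

    def on_auth(event):
        # answer the proxy's 407 challenge with credentials
        cdp.send("Fetch.continueWithAuth", {
            "requestId": event["requestId"],
            "authChallengeResponse": {"response": "ProvideCredentials",
                                      "username": "USER", "password": "PASS"}})

    def on_paused(event):
        # Fetch.enable also pauses every request; let them continue untouched
        cdp.send("Fetch.continueRequest", {"requestId": event["requestId"]})

    cdp.on("Fetch.authRequired", on_auth)
    cdp.on("Fetch.requestPaused", on_paused)
    page.goto("https://www.leroymerlin.fr")
```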

Scrapling is a higher-level Python library. We use it in two modes. DynamicFetcher launches vanilla Playwright Chromium driven by Scrapling’s session manager. StealthyFetcher does the same, but under the hood uses patchright, a stealth-patched Playwright fork. Each gets its own row in the comparison.
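
In code, the two modes differ only in the fetcher class. A sketch assuming Scrapling’s class-based fetch API; the exact keyword arguments vary across versions:

```python
from scrapling.fetchers import DynamicFetcher, StealthyFetcher

# Mode 1: vanilla Playwright Chromium under Scrapling's session manager
page = DynamicFetcher.fetch("https://www.leroymerlin.fr")

# Mode 2: same flow, but Chromium launched through patchright's stealth patches
page = StealthyFetcher.fetch("https://www.leroymerlin.fr")

print(page.status)   # Scrapling responses expose the HTTP status plus CSS selectors
```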

RayoBrowse is the self-hosted stealth Chromium fork from Rayobyte, distributed as a Docker container that exposes a CDP endpoint on port 9222. Here we hit a wall worth flagging: for some reason RayoBrowse could not use the Bright Data residential proxy in our setup. Every navigation through that proxy failed instantly, even though the same credentials worked fine through curl from inside the same container. The same RayoBrowse setup worked fine with Geonode. We did not isolate the root cause, so we report RayoBrowse on Geonode only.
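
Attaching to the container is a one-liner in any CDP-capable client. A sketch with Playwright’s connect_over_cdp, assuming the default port mapping:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    # reuse the container's default context if it exposes one
    context = browser.contexts[0] if browser.contexts else browser.new_context()
    page = context.new_page()
    page.goto("https://www.leroymerlin.fr")
```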

The commercial side is more crowded.

Browser Use exists in two flavors, and we tested both. The cloud version is the managed Browser Use, with its own residential proxy, its own stealth fingerprinting, and a fixed set of supported models; we drove it once in raw CDP mode (we steer it ourselves with Playwright) and once in agent mode (we hand the LLM the task in natural language and let it plan the steps).
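
For agent mode, the open-source loop gives the flavor of what the cloud product wraps. A sketch assuming a recent browser-use release; the model-wrapper import has moved between versions, so treat it as an assumption:

```python
import asyncio
from browser_use import Agent
from browser_use.llm import ChatOpenAI   # import path is version-dependent

async def main():
    agent = Agent(
        task=("On leroymerlin.fr, search for 'ampoule B22 led blanc', "
              "open the first product result and add it to the cart."),
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()   # the agent plans and executes the steps itself

asyncio.run(main())
```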

Browserbase is a managed Chromium with optional residential proxies, Cloudflare Web Bot Auth verification, and the Stagehand agent SDK. We discovered during the test that the free tier excludes proxies entirely; without one, the session egresses from a US datacenter. We left this configuration in the test because it is what a free user would experience.

Browserless is a managed browser-as-a-service whose anti-bot story is a stealth path (/chromium/stealth) plus optional residential proxies for paid plans. The free plan caps sessions at 60 seconds, which is tight for a four-step flow. We tested it with the built-in residential proxy targeting France, and tried to test it with our external proxies via the externalProxyServer parameter; the external mode failed at connection time on every run, with the same Chromium-side proxy-authentication problem that broke RayoBrowse, so we dropped those configurations from the comparison.

ZenRows Scraping Browser is a managed Chromium with a built-in residential proxy network and built-in CAPTCHA solving; we connect via the WSS endpoint with proxy_country=fr to get a French exit point.

Bright Data Browser API sits at the other end of the same product category: a managed Chromium with built-in residential rotation and CAPTCHA solving, on a dedicated Browser API zone we configured on their dashboard.
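
Both managed browsers follow the same connection pattern: Playwright attaches over a WSS endpoint that carries the API key and options. A sketch with placeholder endpoints; only the proxy_country=fr parameter reflects our actual setup:

```python
from playwright.sync_api import sync_playwright

ZENROWS_WSS = "wss://browser.zenrows.com?apikey=YOUR_KEY&proxy_country=fr"   # placeholder host
BRIGHT_DATA_WSS = "wss://USER:PASS@brd.superproxy.io:9222"                   # placeholder zone endpoint

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp(ZENROWS_WSS)   # the Bright Data zone connects the same way
    page = browser.new_page()
    page.goto("https://www.leroymerlin.fr")
```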

As always, the code can be found in our GitHub repository reserved for paying users, inside the folder 103.BROWSERS.

What we had to fix before the numbers made sense
