The Web Scraping Club

THE LAB #95: Bypassing Cloudflare in 2026

Testing Open Source Browser Automation Tools Against Real Targets

Pierluigi Vinciguerra
Jan 22, 2026
∙ Paid

In this first article of The Lab series of 2026, we’ll see how to bypass the most common anti-bot measure on the market: Cloudflare. Like every anti-bot defense, it keeps evolving, forcing scraping tools to keep pace: what worked in 2025 may fail in 2026. For scraping professionals, this makes operations more expensive: choosing the wrong tool costs development time and blocks pipelines while you hunt for a working technique, and, as we’ll find out, there’s no silver bullet. We tested the most common open-source browser automation tools against two Cloudflare-protected production sites to identify what actually works.


Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.

Claim your offer


Tool Landscape

We evaluated three browser automation tools that claim to bypass Cloudflare in 2026.

Camoufox is a custom Firefox build with fingerprint rotation and stealth patches. It is driven through Playwright over Firefox’s Juggler protocol and focuses on avoiding detection through realistic Firefox fingerprints and non-default configurations.

Pydoll uses Chrome DevTools Protocol (CDP) for async Chromium automation. It avoids WebDriver entirely, emphasizing human-like interactions and behavioral anti-detection.

undetected-chromedriver provides a patched Selenium ChromeDriver. It modifies startup behavior and WebDriver fingerprints, serving as a drop-in replacement for standard Selenium workflows.

The code used for this test can be found on The Lab GitHub repository, inside the folder 95.CLOUDFLARE-2026, available only to paid subscribers of TWSC.

System Model: Cloudflare’s Detection Layers

Cloudflare operates as a multi-layered defense system. Understanding which layer blocks your requests determines which tool characteristics matter.

Layer 1: TLS and network fingerprinting. Cloudflare analyzes TLS handshakes and HTTP/2 frame ordering, inspecting cipher suites, negotiation order, and HTTP/2 header sequences to identify non-browser clients. Browser automation tools using real Chrome or Firefox inherit legitimate TLS fingerprints, typically passing this layer.

Layer 2: JavaScript Detections (JSD). Cloudflare’s JavaScript Detections engine executes lightweight JavaScript on HTML page requests to identify headless browsers and automated clients. The detection runs via an invisible code snippet that analyzes browser environment properties without visible challenges. When verification succeeds, Cloudflare issues a cf_clearance cookie with cf.bot_management.js.detection.passed = true. Failed verifications block the request or trigger additional challenges.

Layer 3: the JavaScript Challenge (IUAM), the visible “Checking your browser” interstitial. It executes complex JavaScript to validate browser capabilities and measure execution timing, testing Canvas and WebGL rendering, navigator property consistency, and execution patterns.

Layer 4: behavioral analysis. Cloudflare monitors mouse movements, scrolling patterns, click timing, and request sequencing to identify bot-like behavior. These signals feed into the bot scoring system.

Layer 5: machine-learning bot scoring. Cloudflare’s ML engine assigns Bot Scores (1-99) using supervised learning to distinguish human from bot traffic. This layer accounts for the majority of detections and incorporates signals from all other layers plus IP reputation.

For this test, we focused on whether tools can pass the JavaScript detection and challenge layers (Layers 2-3). We assumed clean residential IPs and did not evaluate IP reputation impact.
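Whether Layer 2 was passed is observable from the client side: a successful verification leaves the cf_clearance cookie in the browser context. As a minimal sketch (not part of the test code; plain Playwright, with a placeholder URL you’d swap for a real Cloudflare-protected page), you can inspect the cookies after a page load:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.firefox.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.goto('https://example.com', wait_until='domcontentloaded')  # placeholder URL
    page.wait_for_timeout(3000)  # give the JSD snippet time to run

    cookies = {c['name']: c['value'] for c in context.cookies()}
    if 'cf_clearance' in cookies:
        print('JSD passed: cf_clearance issued')
    else:
        print('No cf_clearance cookie: blocked or challenge still pending')
    browser.close()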

Target Selection and URL Collection

We needed real production targets with active Cloudflare protection, and we chose two very well-known websites.

Harrods.com operates as a high-value e-commerce site where Cloudflare protects product pages and navigation. The sitemap is accessible at https://harrods.com/sitemap.xml.

Indeed.com is a job board with aggressive protection. When we attempted sitemap access, we received 403 Forbidden, confirming active Cloudflare filtering even for automated sitemap requests.

URL Extraction Process

For Harrods.com, standard sitemap parsing works:

from typing import List, Set

def extract_urls_from_domain(domain: str, limit: int = 1000) -> List[str]:
    # fetch_sitemap / parse_sitemap_index / parse_sitemap are helpers (see sketch below)
    sitemap_url = f'https://{domain}/sitemap.xml'
    xml_content = fetch_sitemap(sitemap_url)
    all_urls: Set[str] = set()  # deduplicate URLs across child sitemaps

    if '<sitemapindex' in xml_content:
        # Sitemap index: fetch and parse each child sitemap
        sitemap_urls = parse_sitemap_index(xml_content)
        for sm_url in sitemap_urls:
            sm_content = fetch_sitemap(sm_url)
            urls = parse_sitemap(sm_content)
            all_urls.update(urls)
    else:
        # Single flat sitemap
        urls = parse_sitemap(xml_content)
        all_urls.update(urls)

    return list(all_urls)[:limit]
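The fetch_sitemap, parse_sitemap_index, and parse_sitemap helpers are elided above. A minimal sketch of what they could look like (the versions in the repository may differ):

import requests
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def fetch_sitemap(url: str) -> str:
    # Plain HTTP fetch; works for Harrods, blocked for Indeed (see below)
    resp = requests.get(url, timeout=15, headers={'User-Agent': 'Mozilla/5.0'})
    resp.raise_for_status()
    return resp.text

def parse_sitemap_index(xml_content: str) -> list:
    # A sitemap index lists <sitemap><loc> child sitemap URLs
    root = ET.fromstring(xml_content)
    return [loc.text for loc in root.findall('.//sm:sitemap/sm:loc', NS)]

def parse_sitemap(xml_content: str) -> list:
    # A flat sitemap lists <url><loc> page URLs
    root = ET.fromstring(xml_content)
    return [loc.text for loc in root.findall('.//sm:url/sm:loc', NS)]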

For Indeed.com, sitemap access returned 403, forcing us to generate URLs based on known patterns:

from typing import List

def generate_indeed_urls(limit: int = 1000) -> List[str]:
    urls = []
    # Lists truncated here; the full query/location sets are in the repository
    job_queries = ['software-engineer', 'data-scientist', 'product-manager', ...]
    locations = ['New-York-NY', 'Los-Angeles-CA', 'Chicago-IL', ...]

    for query in job_queries:
        for location in locations:
            urls.append(f'https://www.indeed.com/jobs?q={query}&l={location}')
            # Paginated variants: Indeed pages results in steps of 10
            for start in [10, 20, 30]:
                urls.append(f'https://www.indeed.com/jobs?q={query}&l={location}&start={start}')

    return urls[:limit]

The 403 on sitemap access itself demonstrates Cloudflare’s filtering. Even basic reconnaissance triggers blocks.
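For reference, this probe is trivial to reproduce with a plain HTTP client (a sketch, not the exact test code):

import requests

resp = requests.get('https://www.indeed.com/sitemap.xml', timeout=10)
print(resp.status_code)  # 403 in our runs: Cloudflare filters even the sitemap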

Test Framework Design

We designed our test framework to validate successful page loads by checking for site-specific content elements rather than generic Cloudflare blocking indicators.

Content-Based Validation

We discovered early that generic Cloudflare detection (searching for “checking your browser” or “just a moment” strings) produces false positives. Tools can retrieve partial pages or incomplete JavaScript renders that contain neither Cloudflare challenges nor the actual target content.

Our validation approach checks for critical page elements that must be present in a successfully loaded page:

Indeed.com validation:

# The Indeed search page counts as loaded only if the real search
# button markup is present in the HTML
has_search_button = (
    'yosegi-InlineWhatWhere-primaryButton' in html and
    '<button' in html and
    '>Search</span>' in html
)

result = {
    'success': has_search_button,
    # Inferred status: 403 if we got HTML without the expected content, 0 if nothing came back
    'status_code': 200 if has_search_button else (403 if html else 0),
    'content_length': len(html),
}

The search button element appears only when the full job listing page renders. Its absence indicates either Cloudflare blocking or incomplete JavaScript execution.

Harrods.com validation:

# A Harrods product page counts as loaded only if the product-name <h1> rendered
has_product_name = (
    'data-test-id="pdp-product-name"' in html and
    '<h1' in html
)

result = {
    'success': has_product_name and status_code == 200,  # status_code from the tool's response
    'status_code': status_code,
    'content_length': len(html),
}

Product pages must contain the product name h1 element. Pages without this element are either editorial content, error pages, or Cloudflare blocks.

Metrics Collected

For each request, we recorded a success flag (based on content validation), HTTP status code (or inferred status), content length in bytes, final URL after redirects, and error messages when exceptions occurred.
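As an illustrative sketch (field names are ours, not necessarily the repository’s schema), this per-request record maps naturally to a small dataclass:

from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class RequestResult:
    url: str
    success: bool                 # content-based validation outcome
    status_code: int              # actual or inferred HTTP status
    content_length: int           # bytes of HTML retrieved
    final_url: str                # URL after redirects
    error: Optional[str] = None   # exception message, if any

# asdict() turns each record into a plain dict, ready for CSV/JSON export:
row = asdict(RequestResult('https://example.com', True, 200, 51234, 'https://example.com'))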




Tool Implementation

Camoufox Setup

Camoufox requires downloading the custom Firefox binary:

pip install -U "camoufox[geoip]"
python -m camoufox fetch

We configured it for headless operation with fingerprint rotation:

from typing import Dict

from camoufox.sync_api import Camoufox

def scrape_with_camoufox(url: str) -> Dict:
    with Camoufox(
        headless=True,
        humanize=True,
        os=['macos', 'windows'],
        geoip=False,
    ) as browser:
        page = browser.new_page()
        response = page.goto(url, timeout=30000, wait_until='domcontentloaded')
        page.wait_for_timeout(2000)
        html = page.content()

        return {
            'html': html,
            'status_code': response.status if response else 0,
            'url': page.url,
        }

We set headless=True for server execution. The humanize=True option enables human-like cursor movement to reduce behavioral signals. The os=['macos', 'windows'] parameter rotates OS fingerprints between requests. We disabled geoip due to a lack of proxy infrastructure for this test.

The wait_for_timeout(2000) accounts for JavaScript challenges that execute after page load.
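Tying the pieces together, here is a hedged sketch of the per-URL test loop; validate_indeed is an assumed helper wrapping the content check from the validation section:

# A sketch of the batch loop (validate_indeed is assumed to wrap the
# content-based check shown earlier; names are illustrative)
def run_camoufox_batch(urls):
    results = []
    for url in urls:
        try:
            page_data = scrape_with_camoufox(url)
            result = validate_indeed(page_data['html'])  # content-based validation
            result['url'] = page_data['url']
        except Exception as exc:
            # Record failures instead of aborting the whole batch
            result = {'success': False, 'status_code': 0,
                      'content_length': 0, 'error': str(exc)}
        results.append(result)
    return results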

Undetected-chromedriver Setup

Standard pip installation works:

pip install undetected-chromedriver

Configuration uses the new headless mode to reduce detection surface:

import time
from typing import Dict

import undetected_chromedriver as uc

def scrape_with_undetected_chromedriver(url: str) -> Dict:
    options = uc.ChromeOptions()
    options.add_argument('--headless=new')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--no-sandbox')

    driver = uc.Chrome(options=options, version_main=None)
    driver.get(url)

    time.sleep(2)
    html = driver.page_source
    final_url = driver.current_url

    driver.quit()

    return {
        'html': html,
        'status_code': 200,  # Selenium doesn't expose status
        'url': final_url,
    }

The --headless=new flag uses Chrome’s updated headless mode with reduced fingerprint differences. The --disable-blink-features=AutomationControlled option removes the navigator.webdriver flag. The version_main=None parameter auto-detects the installed Chrome version.

The 2-second sleep allows JavaScript challenges to complete.
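A fixed sleep is the simplest option; for more robustness, Selenium’s standard WebDriverWait can poll for the validation element instead. A sketch, reusing the Indeed selector from the validation section as a drop-in for time.sleep(2):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_indeed_content(driver, timeout: int = 15) -> bool:
    # Poll for the Indeed search button; False means the challenge never cleared
    try:
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, '.yosegi-InlineWhatWhere-primaryButton')
            )
        )
        return True
    except TimeoutException:
        return False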

Pydoll: Chromium CDP-Based Automation

Pydoll uses Chrome DevTools Protocol (CDP) for browser control. We found that the API differs from the documentation, requiring experimentation to identify working methods.

Working configuration:

from pydoll.browser.chromium import Chrome
from pydoll.browser.options import ChromiumOptions

# Invoke with asyncio.run(scrape_with_pydoll(url))
async def scrape_with_pydoll(url: str) -> dict:
    options = ChromiumOptions()
    options.add_argument('--headless=new')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--no-sandbox')

    async with Chrome(options=options) as browser:
        tab = await browser.start()
        await tab.go_to(url)
        html = await tab.page_source          # Use page_source, not tab.html()
        current_url = await tab.current_url   # Attribute access, not a method call
        return {'html': html, 'url': current_url}

We encountered API inconsistencies during testing. Documentation suggests from pydoll import Browser, but the Browser class doesn’t exist. The correct method is tab.page_source, not tab.html(). URL retrieval uses tab.current_url as an attribute, not a method call.

These inconsistencies required us to test against the installed version (2.15.1) to identify working patterns.
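Given this API drift, it’s worth verifying the installed version before reusing these patterns; a sketch (assuming the PyPI distribution name is pydoll-python):

from importlib.metadata import PackageNotFoundError, version

try:
    print(version('pydoll-python'))  # the patterns above were tested against 2.15.1
except PackageNotFoundError:
    print('pydoll not installed under the expected distribution name')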




Test Results: Indeed.com with Camoufox

When we tested Camoufox against Indeed.com, we observed behavioral patterns specific to how Cloudflare’s Turnstile operates on this target.

Continue reading this post for free, courtesy of Pierluigi Vinciguerra.

Or purchase a paid subscription.