How to start with Scrapy and Playwright - Part 2
Best starting configurations for your scrapers, this time with Playwright
In the first part of this article, we saw how to properly configure your Scrapy spider when starting a web scraping project.
As mentioned there, Scrapy alone is enough when we’re targeting websites with little or no anti-bot protection.
But when things get tougher and a modern anti-bot kicks in, depending on how the website is configured, you will probably need a browser automation tool like Playwright to simulate a human-generated browser session.
So the first step in deciding how to configure our scraper is understanding whether we’re in this situation. Installing the free Wappalyzer browser extension is a simple and reasonably accurate way to detect a website’s tech stack: under its Security section, we can find out whether an anti-bot is installed (at least among the most famous ones).
Finding the most efficient way to scrape a website is one of the services we offer in our consulting tasks, in addition to projects aimed at boosting the cost efficiency and scalability of your scraping operations. Want to know more? Let’s get in touch.
Now that we understand the challenges we’re going to face, here’s the rule of thumb I apply against the most common anti-bots when scraping a traditional e-commerce site (so no login and no ticketing/betting sites, which are normally more protected).
PerimeterX → Scrapy, possibly with Scrapy Impersonate and residential proxies (a minimal Scrapy Impersonate setup is sketched after this list).
Akamai → Scrapy + Scrapy Impersonate and residential proxies. Alternatively, web unblockers can be used if stability is crucial.
Cloudflare → Scrapy + Scrapy Impersonate or Playwright with Firefox, always with residential proxies. Alternatively, web unblockers can be used if stability is crucial.
Kasada → Playwright + Anti-Detect Browsers. Alternatively, web unblockers can be used if stability is crucial.
Datadome → Camoufox. Alternatively, web unblockers can be used if stability is crucial. The design of the scraper's journey on the website is crucial.
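Since several of these setups rely on Scrapy Impersonate, here’s a minimal sketch of how it can be wired into a spider. The download-handler path and the "chrome110" impersonation target reflect the scrapy-impersonate package’s README as I recall it, and the spider name and URL are placeholders, so double-check everything against the current docs:

import scrapy

class ImpersonateSpider(scrapy.Spider):
    name = "impersonate_example"
    start_urls = ["https://example.com"]  # placeholder target

    custom_settings = {
        # Route HTTP(S) requests through the impersonating download handler
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_impersonate.ImpersonateDownloadHandler",
            "https": "scrapy_impersonate.ImpersonateDownloadHandler",
        },
        # The handler is asyncio-based, so the asyncio reactor is required
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        for url in self.start_urls:
            # Ask for a real-Chrome TLS/HTTP2 fingerprint on this request
            yield scrapy.Request(url, meta={"impersonate": "chrome110"}, callback=self.parse)

    def parse(self, response):
        self.logger.info("Fetched %s with status %s", response.url, response.status)

Residential proxies then go on top through your usual Scrapy proxy configuration.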
Every website has its own rules, and sometimes things don’t work this way, but this approximates what’s happening today in my codebase. And since this strategy involves using Playwright, let’s see how we can set it up correctly from the start.
Staying undetected with Playwright is hard.
From an anti-bot perspective, detecting a Scrapy spider is simple: throw a JavaScript challenge at it, and since it doesn’t have a built-in JS engine, it will fail. Using scrapy-splash, the JS rendering engine built for Scrapy, doesn’t help in this case. So you must change your strategy and use a browser automation tool like Playwright, which poses other challenges. Browsers are a gold mine of information and features that anti-bots can exploit, and these can be grouped more or less in this way:
Behavior Analysis: Tracks mouse movements and interactions.
Device Fingerprinting: Collects browser and device-specific configurations.
IP Tracking: Monitors IP behavior, patterns, and consistency between the IP and browser geolocation signals.
JavaScript Challenges: Checks execution capabilities and behaviors.
Using a browser increases the attack surface an anti-bot can probe, making it easier to detect your scraper. Luckily, we have many tools and a solid knowledge base for hiding our Playwright scrapers: we can configure Playwright manually or delegate this work to packages that make our scraper stealthier. Depending on the case, I usually combine the two options.
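To make this concrete, here’s a small sketch that prints a few of the fingerprint properties anti-bots commonly read, using a plain Chromium launch against example.com. On a default automated launch you’ll typically see navigator.webdriver reported as true, which is exactly the kind of signal we need to hide.

from playwright.sync_api import sync_playwright

# Print a few of the properties anti-bots commonly inspect.
# Real anti-bots run far more sophisticated checks than this.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    signals = page.evaluate(
        """() => ({
            webdriver: navigator.webdriver,        // true on a default automated launch
            userAgent: navigator.userAgent,        // may contain 'HeadlessChrome'
            languages: navigator.languages,        // an empty list looks suspicious
            pluginCount: navigator.plugins.length, // headless browsers often report 0
        })"""
    )
    print(signals)
    browser.close()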
Configuring Playwright manually
Playwright can spawn a stock Chromium browser or an instance of a Chromium-based browser like Chrome, Edge, Brave, and so on. I usually prefer the second option: our task is to mimic the behavior of a real user, and a real browser is definitely better at that.
In addition, I use a persistent context, meaning that all the cookies and browser session data are stored in a directory on the disk of the machine where the scraper runs. This way, the next session starts with all the info and cookies from the previous run, as a normal user’s browser does, instead of starting from scratch.
Here’s a snippet for doing this:
from playwright.sync_api import sync_playwright

def run_brave():
    with sync_playwright() as p:
        # Specify the path to the Brave executable
        brave_path = "/path/to/brave"  # Replace with the actual path
        browser = p.chromium.launch_persistent_context(
            user_data_dir="path/to/user/data",  # Optional: to use persistent profiles
            headless=False,  # Run in headful mode for visibility
            executable_path=brave_path,  # Set Brave's path
            args=[
                # Left empty for now; the most useful arguments are discussed below
            ]
        )
        page = browser.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()

run_brave()
As you can see, I’ve left the args list empty because it deserves a deeper dive.
When a browser is driven by an automation tool, some of its property values automatically change and, as you can imagine, these are the biggest red flags for anti-bot software.
In addition, we need to reduce our scraper’s detection surface by disabling some of the browser’s features that expose automation signals.
Here’s a list of the most common arguments passed when creating a scraper (a sketch plugging them into the earlier launch call follows the list):
--no-sandbox: Disables the Chrome sandbox, typically used in automated environments. While it can enhance compatibility, it may trigger suspicion if overused, so only use it when necessary.
--disable-dev-shm-usage: Prevents Chrome from writing shared memory files, reducing memory-related crashes on some systems.
--disable-blink-features=AutomationControlled: Hides the navigator.webdriver property, which is a clear sign of automation.
--disable-infobars: Removes the "Chrome is being controlled by automated software" message, which might be a red flag for anti-bot systems.
--window-size=width,height: Sets a consistent browser window size, mimicking real user behavior. Adjust to common screen resolutions like --window-size=1920,1080, or use --start-maximized to open the browser in maximized mode, similar to how users interact with their browsers.
--disable-extensions: Disables extensions not typically present in standard Chrome installations.
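Putting the pieces together, here’s a sketch of the earlier Brave launch with a reasonable default set of these arguments filled in; the paths and the function name are placeholders, and the exact argument set should be tuned to the target site.

from playwright.sync_api import sync_playwright

def run_brave_with_args():
    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(
            user_data_dir="path/to/user/data",  # placeholder: persistent profile directory
            executable_path="/path/to/brave",   # placeholder: path to the Brave executable
            headless=False,
            args=[
                "--disable-blink-features=AutomationControlled",  # hide navigator.webdriver
                "--disable-infobars",       # remove the automation banner
                "--window-size=1920,1080",  # mimic a common screen resolution
                "--disable-extensions",     # no extensions a clean install wouldn't have
                # "--no-sandbox",            # only if your environment requires it
                # "--disable-dev-shm-usage", # useful in memory-constrained containers
            ],
        )
        page = context.new_page()
        page.goto("https://example.com")
        print(page.title())
        context.close()

run_brave_with_args()

Whether --no-sandbox and --disable-dev-shm-usage are needed depends on where the scraper runs, which is why they are left commented out here.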
Another trick I use is the slow_mo parameter when creating a Playwright instance: it adds a pause between the different actions happening inside the browser, so they feel a bit more natural rather than firing in an immediate sequence.
from playwright.sync_api import sync_playwright

def run_with_slow_mo():
    with sync_playwright() as p:
        # Launch the browser with slow_mo
        browser = p.chromium.launch(headless=False, slow_mo=500)  # 500ms delay between actions
        context = browser.new_context()
        page = context.new_page()

        # Open a website and interact
        page.goto("https://example.com")
        page.click("text=More information...")  # Slows down here
        page.screenshot(path="screenshot.png")

        browser.close()

run_with_slow_mo()
Unfortunately, as anti-bots grow more complex, this approach is starting to become obsolete for two main reasons:
Anti-bots nowadays can detect that Playwright is driving the browser through the messages it exchanges over CDP (we wrote about it here).
Some properties, like the browser's fingerprint, cannot be changed with common launch arguments and need to be spoofed by patching Playwright directly (a simplified sketch of the general idea follows this list).
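To illustrate the spoofing idea in its simplest form, the sketch below uses Playwright’s add_init_script to override navigator.webdriver before any page script runs. Real anti-bots look far deeper than this single property, so treat it as an illustration of the concept rather than a working evasion; the dedicated patches and tools mentioned above rewrite much more of the fingerprint.

from playwright.sync_api import sync_playwright

# A simplified illustration of property spoofing: inject JS before any page
# script runs to override a value anti-bots read. Real patches go much deeper.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    context.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
    )
    page = context.new_page()
    page.goto("https://example.com")
    print(page.evaluate("navigator.webdriver"))  # prints None instead of True
    browser.close()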
During these years at The Web Scraping Club, we have seen several of these patches and tools. In next Sunday’s episode, we’ll conclude our Playwright configuration with the newest tools for making your scraper undetectable.
Like this article? Share it with your friends who might have missed it, or leave me some feedback; it helps me understand how to improve this newsletter.
You can also invite your friends to subscribe to the newsletter. The more you bring, the bigger the prize you get.