Web scraping from 0 to hero: Microsoft Playwright
What is Microsoft Playwright and why you need a browser automation tool for web scraping
Welcome back to “Web Scraping from 0 to Hero”, our biweekly free web scraping course provided by The Web Scraping Club.
After a brief introduction to Scrapy, the most important Python framework for writing scrapers, today we’ll see a browser automation tool, Microsoft Playwright, and understand why its relevance in the web scraping industry is growing.
What we’ve learned up to now
Let me recap our journey for the many of you who recently joined The Web Scraping Club (thanks!):
A brief introduction to web scraping: what it is, plus an overview of the legal landscape.
A checklist to tick off before starting a scraping project, with analysis and plans.
Finally, a first example of a scraper built in Scrapy, divided into part 1 and part 2.
But we’ve already seen that Scrapy may not be enough in some cases, so we need to add something else to our toolbelt: today we’ll see Microsoft Playwright.
How the course works
The course is and always will be free. As always, I’m here to share and not to make you buy something. If you want to say “thank you”, consider subscribing to this substack with a paid plan. It’s not mandatory but appreciated, and you’ll get access to the whole “The LAB” articles archive, with 30+ practical articles on more complex topics and its code repository.
We’ll see free-to-use packages and solutions, and if some commercial ones appear, it’s because I’ve already tested them and they solve issues I couldn’t solve in other ways.
At first, I imagined this course being a monthly issue, but as I was writing down the table of contents, I realized it would take years to complete. So it will probably have a biweekly frequency, filling the gaps in the publishing plan without taking too much space at the expense of more in-depth articles.
The collection of articles can be found using the tag WSF0TH and there will be a section on the main substack page.
What is Microsoft Playwright and what are its key features
Microsoft Playwright has lately emerged as a key tool in the world of web automation and scraping. Developed by Microsoft and released in 2020, it is an open-source library engineered to automate interactions with web browsers, originally for the purpose of testing web applications. The project is similar to Puppeteer, and the reason is quite simple: Puppeteer is a browser automation tool developed at Google, and members of that team were later hired by Microsoft to develop Playwright, as this thread on Hacker News states.
While Puppeteer is officially a JavaScript library, Playwright also has official Python bindings, so it can be used by Pythonistas like me. That’s why you won’t find a Puppeteer program on these pages (if someone is willing to take care of the JS section of this newsletter, please reach out to me).
Before diving deep into Playwright’s features and capabilities, let’s understand why browser automation tools are important for the web scraping industry.
Why Browser Automation Tools and Web scraping are a perfect match
The landscape of web scraping has evolved significantly over the years, and browser automation tools have become a cornerstone in this evolution. These tools, including Microsoft Playwright, offer functionalities that are crucial for tackling today's complex web environments. Modern websites are no longer static pages with simple HTML content. They have evolved into complex applications with dynamic content, interactive elements, and sophisticated front-end frameworks. These advancements enhance user experience but pose significant challenges for traditional web scraping methods. Browser automation tools can interpret and interact with these modern web architectures, making them indispensable for effective data extraction.
Dealing with Dynamic Content and JavaScript
A significant portion of the web today relies on JavaScript to load content and manage user interactions. Traditional scraping tools like Scrapy, which only fetch the HTML returned by the server, often miss this dynamically loaded data. Browser automation tools like Playwright execute JavaScript just like a regular browser, so content that is loaded asynchronously can be captured too.
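As a minimal sketch of what "rendering JavaScript" means in practice, the helper below loads a page in a real browser tab and returns the DOM after the scripts have run. It assumes the Python package is installed (`pip install playwright`, then `playwright install chromium`); the helper name `rendered_html` and the URL are placeholders of mine.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from playwright.sync_api import Page


def rendered_html(page: "Page", url: str) -> str:
    """Load `url` in a browser tab and return the DOM after JavaScript ran."""
    # "networkidle" is a heuristic: it waits until the network has been quiet
    # for a short while, which is often (not always) when async content is in.
    page.goto(url, wait_until="networkidle")
    return page.content()


def main() -> None:
    # Imported here so the helper above can be read and tested on its own.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        html = rendered_html(page, "https://example.com")  # placeholder URL
        print(len(html), "characters of rendered HTML")
        browser.close()
```

Calling `main()` launches a headless Chromium, so it only works once the browser binaries have been downloaded.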
Interactive Elements and Complex Workflows
Many websites require interaction to reveal their data, such as clicking buttons, filling out forms, or navigating through menus. Browser automation tools can simulate these interactions, accessing and extracting data that would otherwise be inaccessible. Playwright, for instance, offers robust APIs to handle these complex user interactions.
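Here is a small sketch of such an interaction using Playwright's sync API: fill a search box, submit the form, and read the results. The selectors (`input[name='q']`, `button[type='submit']`, `.result-title`) are hypothetical and would have to match the target website.

```python
def search_and_collect(page, query: str) -> list[str]:
    """Fill a search form, submit it, and return the result titles.

    `page` is a Playwright sync `Page`; every selector here is hypothetical.
    """
    page.fill("input[name='q']", query)      # type the query into the search box
    page.click("button[type='submit']")      # submit the form
    page.wait_for_selector(".result-title")  # wait until a result is visible
    return page.locator(".result-title").all_inner_texts()
```

Passing the page in as an argument keeps the browser setup separate from the scraping logic, which makes the function easier to reuse across sites with a similar flow.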
Bypass Anti-bots
As web scraping has become more prevalent, websites have implemented various anti-scraping measures. These include CAPTCHAs, IP rate limiting, browser fingerprinting, and detecting unnatural browsing patterns. Browser automation tools can mimic human-like browsing behavior, making them less likely to trigger these anti-scraping mechanisms. Since Playwright can drive real browsers to navigate websites, it presents a genuine browser fingerprint to the target website, decreasing the chances of being detected as a bot.
Testing and Debugging
Another important aspect of browser automation tools is their role in testing and debugging scraping scripts. Playwright offers features like screenshot capture and detailed error logging, which are invaluable for diagnosing issues in scraping scripts. These features save time and resources, especially when developing complex scraping operations.
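One way I find this useful, sketched below under my own naming: wrap each scraping step so that a failure is logged and a full-page screenshot is saved before the error propagates. `run_step_with_screenshot` and the default file name are assumptions, not part of Playwright; `page.screenshot` and `page.url` are the real sync API.

```python
import logging

logger = logging.getLogger("scraper")


def run_step_with_screenshot(page, step, screenshot_path="failure.png"):
    """Run `step(page)`; on any error, log it and save a screenshot, then re-raise."""
    try:
        return step(page)
    except Exception:
        # Record the traceback together with the URL the browser was on.
        logger.exception("Step failed at %s", page.url)
        page.screenshot(path=screenshot_path, full_page=True)
        raise
```

The screenshot shows what the browser actually displayed at the moment of failure, which is often enough to tell a changed selector from a block page.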
All these nice features come, of course, at a cost. Running a real browser requires far more resources than a standard scraper written in Scrapy and, because of the resource constraints of the host machine and the time needed to load all the JS on each page, these scrapers are generally much slower. In my experience, the performance of such a scraper can also degrade over time, until it gets stuck because it has consumed all the resources of the hosting machine.
None of this diminishes the crucial role Playwright plays in web scraping nowadays, so let’s discover more about this tool.
Introduction to Microsoft Playwright
At its core, Microsoft Playwright is designed to automate browser-based tasks, but its capabilities extend far beyond typical automation. It is not just a tool for testing web applications: it is a comprehensive solution for automating all browser interactions. Playwright comes with its own browser builds, some of them patched, which means that once you have installed the package and run its `playwright install` command, you can run scripts on Chromium, Firefox, and WebKit (the engine behind Safari) without installing anything else. These are not the official releases of the browsers, since they are built to expose the full functionality of Playwright, but you can still launch a branded Chromium-based browser like Chrome, Edge, or Brave by installing it yourself on your device.
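In the Python API the bundled engines are exposed as `chromium`, `firefox`, and `webkit`, and Chromium accepts a `channel` option (for example `"chrome"` or `"msedge"`) to use a branded browser installed on the machine. The helper below is a small sketch of that choice; `launch_browser` and `SUPPORTED_ENGINES` are my own names.

```python
SUPPORTED_ENGINES = ("chromium", "firefox", "webkit")


def launch_browser(playwright, engine="chromium", channel=None, headless=True):
    """Launch one of the bundled engines, or a branded Chromium-based browser.

    `playwright` is the object yielded by `sync_playwright()`.
    """
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"Unknown engine {engine!r}, expected one of {SUPPORTED_ENGINES}")
    options = {"headless": headless}
    if channel is not None:
        # Channels such as "chrome" or "msedge" select a locally installed browser.
        if engine != "chromium":
            raise ValueError("`channel` only applies to the chromium engine")
        options["channel"] = channel
    return getattr(playwright, engine).launch(**options)
```

Inside a `with sync_playwright() as p:` block, `launch_browser(p, "webkit")` starts the bundled WebKit build, while `launch_browser(p, channel="chrome", headless=False)` opens the Chrome installed on your machine.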
Core Features of Microsoft Playwright
Cross-Browser Compatibility: One of Playwright’s most significant features is its ability to support multiple browser engines: Chromium (the base of Google Chrome and Microsoft Edge), Mozilla Firefox, and WebKit (the engine behind Safari). This compatibility ensures that you can write a single script that works across various browsers, simplifying the testing and scraping process.
Headless Mode: Playwright can operate in headless mode, meaning it can execute browser interactions without the graphical user interface. This helps in creating faster and less resource-intensive scripts, especially when you’re running them on the cloud and the size of the machine matters in terms of dollars.
Rich API for Browser Automation: Playwright provides a rich set of APIs to automate browser interactions. These include navigating to URLs, capturing screenshots, filling out forms, clicking on elements, and handling file downloads, among others.
Handling Dynamic Content: With modern web applications heavily relying on JavaScript, Playwright’s ability to execute it makes it an invaluable tool. Dynamic content loaded via AJAX or built with JavaScript frameworks like React, Angular, or Vue.js can be rendered before scraping. This also means that the JavaScript challenges issued by anti-bot systems can be solved by the browser.
Auto-Wait Features: Playwright’s auto-wait feature ensures that the automation script automatically waits for the necessary elements to load before performing any actions. This feature is critical in dealing with web pages where content loading is dependent on asynchronous calls and dynamic DOM updates.
Language Support: Playwright has official bindings for several languages: JavaScript and TypeScript (on Node.js), Python, C# (.NET), and Java.
Network Interception and Modification: Playwright can intercept network requests, which allows developers to monitor, modify, or block certain network calls. This is particularly useful in testing scenarios and when you need to scrape data from sites that load content dynamically. It also helps if you’re using proxies in a headful script and don’t want the scraper to load images, which would waste bandwidth and cost you money.
Mobile Emulation: It also provides features for mobile emulation, allowing developers to mimic mobile devices, including their viewport sizes and user-agent strings. This is crucial for testing and scraping websites that have different layouts or functionality on mobile devices.
Even though Playwright was created as a web application testing tool, all these features explain why it’s a great tool to be proficient in.
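To give an idea of the network interception feature listed above, here is a sketch of a route handler that aborts requests for images, media, and fonts and lets everything else through. The set of blocked types and the function names are my own choices; `page.route`, `route.abort()`, and `route.continue_()` come from Playwright's Python API.

```python
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def block_heavy_resources(route) -> None:
    """Route handler: abort requests for heavy assets, let the rest through."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    else:
        route.continue_()


def install_resource_blocker(page) -> None:
    """Register the handler for every request made by `page`."""
    page.route("**/*", block_heavy_resources)
```

Call `install_resource_blocker(page)` before `page.goto(...)`. Some websites may behave differently when assets are missing, so it is worth checking that the data you need is still there.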
When to Use Playwright Instead of Scrapy
As we have seen, Playwright has some great features but also some limitations. So when should we choose Playwright instead of Scrapy?
Dynamic and Interactive Web Pages
Dealing with JavaScript-Heavy Sites:
Websites that heavily rely on JavaScript for rendering content pose a significant challenge for traditional scraping tools like Scrapy. Playwright, with its ability to render JavaScript and handle asynchronous requests, can effectively manage these dynamic sites. It ensures that all dynamically loaded elements are captured, which might be missed by Scrapy.
Complex User Interactions:
Scenarios involving interactions like clicking buttons, filling forms, navigating through drop-down menus, or dealing with pop-ups require a tool like Playwright that can automate these processes.
To summarize: if the data you want to scrape is there when you click “View page source” (so it is available in the raw HTML), you can use Scrapy. If it is loaded dynamically and visible only in the rendered page (so only when you use “Inspect”), then you need Playwright.
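The same rule of thumb can be written as a tiny helper: given the raw HTML, the rendered HTML, and a piece of text you expect to scrape, it tells you which tool is enough. `choose_tool` and its return labels are hypothetical names used only for illustration.

```python
def choose_tool(raw_html: str, rendered_html: str, marker: str) -> str:
    """Apply the "view source vs. inspect" rule to one expected value."""
    if marker in raw_html:
        return "scrapy"      # the data is already in the HTML sent by the server
    if marker in rendered_html:
        return "playwright"  # the data only appears after JavaScript has run
    return "not found"       # not in either: check the marker or the page


raw = "<div id='app'></div>"
rendered = "<div id='app'><span class='price'>19.99</span></div>"
print(choose_tool(raw, rendered, "19.99"))  # prints "playwright"
```

In practice the raw HTML is what your HTTP client or Scrapy receives, and the rendered HTML is what the browser holds once the page has finished loading.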
Browser-Specific Data and Features
Rendering Across Different Browsers:
Some websites display differently in various browsers due to specific CSS styles or JavaScript functionalities. Playwright’s support for multiple browser engines (Chromium, Firefox, and WebKit) allows for testing and scraping these browser-specific variations. This is a feature where Scrapy, which primarily deals with static HTML content, falls short.
Mobile Emulation:
Playwright can emulate mobile environments, including device-specific configurations and touch interactions. This is particularly important for scraping sites that have different content or layouts in mobile views, a capability not inherently available in Scrapy.
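Playwright ships a registry of device descriptors (viewport, user agent, touch support, and so on) that can be unpacked into a new browser context. The sketch below assumes the `"iPhone 13"` descriptor and a Chromium or WebKit browser; the helper name `new_mobile_page` is mine.

```python
def new_mobile_page(playwright, browser, device_name="iPhone 13"):
    """Open a page in a context that emulates the given device.

    `playwright` is the object yielded by `sync_playwright()` and `browser`
    is an already launched browser.
    """
    device = playwright.devices[device_name]  # raises KeyError for unknown names
    context = browser.new_context(**device)   # applies viewport, user agent, touch, ...
    return context.new_page()
```

A page created this way requests websites with the mobile user agent and viewport, so you get the layout and content that the website serves to phones.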
Overcoming Anti-Scraping Techniques
Evading Detection:
Playwright’s ability to imitate human-like interactions (like typing speed, mouse movements, and click patterns) makes it more adept at avoiding detection by the sophisticated anti-bot measures employed by many websites. The capability of driving real browsers also makes it harder to detect.
Scrapy, being more focused on static content extraction, does not inherently offer this level of interaction simulation. Since it has no native JavaScript rendering engine, it fails every JavaScript challenge thrown by an anti-bot system, making the scraper easy to detect.
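As a rough illustration of human-like input, the helpers below type a text one character at a time with a random delay and move the mouse through intermediate points instead of jumping. The delay range, step count, and function names are assumptions of mine, and nothing here guarantees that an anti-bot system will be fooled.

```python
import random


def type_like_a_human(page, selector, text, min_delay_ms=60.0, max_delay_ms=180.0):
    """Click the field, then type `text` with a random delay on each key press."""
    page.click(selector)  # focus the input first
    for character in text:
        delay = random.uniform(min_delay_ms, max_delay_ms)
        page.keyboard.type(character, delay=delay)  # delay is in milliseconds


def move_mouse_gradually(page, x, y, steps=25):
    """Move the pointer to (x, y) through `steps` intermediate mouse events."""
    page.mouse.move(x, y, steps=steps)
```

For example, `type_like_a_human(page, "#username", "john")` would fill a hypothetical `#username` field the way a person might, rather than setting its value instantly.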
Performance and Speed Considerations
As mentioned previously, Scrapy performs far better than Playwright. Because it consumes fewer resources and only downloads the HTML instead of rendering the page and running its JS, Scrapy is much faster and, on the same machine, can run many more concurrent requests: its concurrency is built in, while every concurrent Playwright page is a full browser tab that you have to manage yourself.
Ideally, Scrapy is the default choice to start with, but if you find that you need more, Playwright can help you with your scraping project.
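If you do need several pages in flight with Playwright, one common pattern is to cap them with a semaphore, since every open page costs memory and CPU. The sketch below targets Playwright's async API (`browser` would come from `async_playwright()`); the function names and the limit of three pages are arbitrary choices of mine.

```python
import asyncio


async def fetch_title(browser, url, limit):
    """Open `url` in its own page and return the title, respecting `limit`."""
    async with limit:  # at most `max_pages` pages are open at the same time
        page = await browser.new_page()
        try:
            await page.goto(url)
            return await page.title()
        finally:
            await page.close()  # always release the tab, even on errors


async def fetch_titles(browser, urls, max_pages=3):
    """Fetch the titles of all `urls`, returned in the same order as `urls`."""
    limit = asyncio.Semaphore(max_pages)
    return await asyncio.gather(*(fetch_title(browser, url, limit) for url in urls))
```

Raising `max_pages` speeds things up only until the machine runs out of resources, which is exactly the trade-off compared with Scrapy's lightweight requests.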
In the next post of Web Scraping from 0 to Hero, we’ll write our first scraper with Playwright, so prepare to get your hands dirty.