THE LAB #72: Advanced logging in Playwright
RabbitMQ, screenshots and system monitoring for your Playwright scrapers
Before the holiday break, we saw some techniques for logging inside your Scrapy spiders, both in real time and via statistics collected at the end of the run, storing this data in a database through RabbitMQ.
Using Scrapy greatly simplifies the process: in the first instance, few things can go wrong. Your requests either complete correctly or get blocked by an anti-bot. The bandwidth used and all the stats about the response codes are logged by Scrapy itself; it’s only a matter of sending them to the RabbitMQ queue.
Playwright, on the other hand, means dealing with a browser automated by our code, and many things may not work as expected. In this article, we’ll explore the importance of logging in Playwright, the key metrics to monitor, and how to implement a robust logging system. We’ll track system performance and HTTP request metrics, and handle selector failures by uploading snapshots to Amazon S3 for debugging. All of this, of course, will be sent to RabbitMQ using the same data structure as in the previous article.
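As a quick refresher on that part of the pipeline, publishing a stats payload to RabbitMQ from Python takes only a few lines with pika. The snippet below is a minimal sketch: the queue name scraper_stats and the payload fields are placeholders chosen for illustration, not the exact schema of the previous article.

import json
import pika

# Connect to a local RabbitMQ broker (default host and credentials are an assumption)
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="scraper_stats", durable=True)

# Example payload: field names should mirror the structure used in the previous article
stats = {"scraper": "playwright_example", "requests": 120, "bandwidth_bytes": 5242880}
channel.basic_publish(exchange="", routing_key="scraper_stats", body=json.dumps(stats))
connection.close()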
Key Metrics to Monitor
As mentioned, running an automated browser is more complex than running a simple Scrapy spider. First, it is much more resource-intensive, especially in terms of CPU and memory. Finding the correct virtual machine size that balances performance and cost is always a challenge. For this reason, the first key metrics we’ll monitor are the CPU and memory usage of the running environment.
Then, of course, we need to monitor the “classical” metrics on the requests: status codes, bandwidth consumption, number of requests, and so on.
Last but not least, sometimes things don’t go as planned. Maybe the cookie banner took too long to pop up and broke your scraper by hiding the selector you needed, the website ran some A/B test, or the scraper simply didn’t wait for the page to fully load and is now stuck. Finding out what happened can be hard, so taking snapshots and sending them to an S3 bucket in case of failure can be a faster way to start debugging your scraper.
1. Running Environment
The resource-intensive nature of browser automation makes system performance monitoring critical:
CPU Usage: This helps you determine if your scraper is overloading the system, running inefficient operations, or simply loading too many pages at the same time.
Memory Usage: Tracking memory consumption helps detect potential leaks that could crash the scraper during extended runs. This is particularly true if we forget to close pages and contexts when they’re no longer needed, leaving garbage in memory that piles up until the machine stops working. A sketch of how both metrics can be sampled follows this list.
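Here is a minimal sketch of how these two metrics can be sampled with psutil; the five-second interval and the background thread are assumptions to show the idea, not a prescribed setup.

import threading
import time
import psutil

def sample_system_metrics(samples, interval=5):
    # Append a (cpu_percent, memory_percent) tuple every `interval` seconds
    while True:
        cpu = psutil.cpu_percent(interval=None)
        mem = psutil.virtual_memory().percent
        samples.append((cpu, mem))
        time.sleep(interval)

samples = []
# Run the sampler in a daemon thread so it stops together with the scraper
threading.Thread(target=sample_system_metrics, args=(samples,), daemon=True).start()

The collected samples can then be aggregated (for example, average and peak values) and added to the payload sent to RabbitMQ at the end of the run.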
2. Request Metrics
As mentioned in the previous articles, monitoring HTTP requests is vital for evaluating scraper efficiency and detecting issues:
Request Count: The total number of requests made by the scraper, which is particularly useful for estimating the cost of a proxy with a price-per-request plan.
Bandwidth Consumption: Same as above, but needed to understand how much your scraper would cost if it used pay-per-GB proxies.
Response Status Codes: Crucial to understand whether you encountered errors during your scraping session and whether there’s room for improvement. A sketch of how all three metrics can be collected with event listeners follows this list.
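Here is a minimal sketch of how these three metrics can be collected with Playwright event listeners, assuming page is an already created Page object; using the Content-Length header as a bandwidth estimate is a simplification, since not every response carries that header.

from collections import Counter

request_count = 0
bytes_downloaded = 0
status_codes = Counter()

def on_request(request):
    # Count every request issued by the page
    global request_count
    request_count += 1

def on_response(response):
    # Tally status codes and estimate bandwidth from the Content-Length header
    global bytes_downloaded
    status_codes[response.status] += 1
    content_length = response.headers.get("content-length")
    if content_length:
        bytes_downloaded += int(content_length)

page.on("request", on_request)
page.on("response", on_response)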
3. Scraper Failures
Interacting with elements on a webpage is a common task in Playwright, but missing or dynamic selectors can cause failures. For this reason, it’s important to have a clear picture of the circumstances under which an error occurs in order to understand how to improve the scraper. Taking snapshots and saving them to S3 is the starting point for this.
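As an illustration of this approach, here is a minimal sketch of a helper that takes a screenshot when a selector cannot be clicked in time and uploads it to S3 with boto3; the bucket name, the helper name click_or_snapshot, and the timeout are placeholders, and the full script in the repository may be structured differently.

from datetime import datetime, timezone
import boto3
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

s3 = boto3.client("s3")
BUCKET = "my-scraper-snapshots"  # placeholder bucket name

def click_or_snapshot(page, selector, timeout=10000):
    try:
        page.click(selector, timeout=timeout)
    except PlaywrightTimeoutError:
        # Save a full-page screenshot locally, then push it to S3 for later debugging
        filename = f"failure_{datetime.now(timezone.utc):%Y%m%d_%H%M%S}.png"
        page.screenshot(path=filename, full_page=True)
        s3.upload_file(filename, BUCKET, filename)
        raise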
The script is in the GitHub repository's folder 72.PLAYWRIGHTLOGGING, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
How to efficiently log your Playwright scripts
Playwright doesn’t have a built-in stats collector, so we need to create one ourselves by listening to the events that happen during the execution of the scraper.
What is an Event Listener in Playwright?
An event listener in Playwright is a mechanism that allows you to observe and respond to specific events that occur during the execution of a browser session. These events are triggered by various actions or states in the Playwright runtime, such as HTTP requests, page navigations, console logs, or network failures.
Attaching event listeners to a Page, Browser, or Context object allows you to define custom logic to execute when a specific event occurs. This is particularly useful for monitoring, debugging, and logging in web scraping or browser automation.
Here are some frequently used Playwright events and what they track:
request: Emitted when the page issues a network request.
response: Emitted when a response is received, with its status code, headers, and body.
requestfailed: Emitted when a request fails, for example because of a network error or a blocked resource.
console: Emitted when the page writes to the browser console.
pageerror: Emitted when an uncaught exception is thrown inside the page.
close: Emitted when the page is closed.
As an example, we can track the response codes of every request with the following script:
def log_response(response):
    # Log the status code and URL of every response received by the page
    print(f"Response: {response.status} {response.url}")

page.on("response", log_response)
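The same logic can also be attached at the BrowserContext level, so that every page opened in that context is covered automatically. This is just a sketch built on the "page" event, assuming a context object already exists:

def attach_logging(page):
    page.on("response", log_response)

# Attach the listener to every new page created in this context
context.on("page", attach_logging)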
With all these premises in place, let’s start building our Playwright script in Python.