THE LAB #70: Advanced logging in Scrapy
How to extract the most meaningful metrics from your scrapers
Last Thursday, we saw how to visualize your Scrapy spiders’ runtime logs on a Grafana dashboard, using Prometheus as the underlying time-series database.
Visualizing your scrapers’ behavior at runtime can be fascinating, but on its own it doesn’t add much value for your operations team, especially when you’re launching hundreds of scrapers in a distributed environment.
This is the first post of a small masterclass about logging, where we’ll build a modern infrastructure to monitor your distributed scraping operations.
But before diving into technical details, let’s stop and think about what’s actually important for us to know.
Key Metrics to Track
Why Logging Is Essential in Scraping
I know I’ve already said this multiple times, but repetita iuvant (repetition helps). Logging is essential in web scraping because it gives you the data you need to diagnose the health of your web data pipelines. Its main uses fall into these areas:
Monitor performance: See how efficiently your spiders run, how long they take, and whether they complete within your desired timeframe.
Debug errors: Knowing whether your scrapers hit errors during execution is vital for proactively fixing your data pipeline.
Optimize costs: The longer your scraper runs, the more compute you pay for. If it fails, you may need to change your proxy type or provider, which in turn drives your costs up. Which website costs you the most to scrape? What happens if you need to use unblockers on website X, and can you still afford to scrape it? All these questions can be answered only if you have enough data.
For large-scale scraping operations, where multiple spiders run simultaneously, logging is not just a debugging tool—it becomes a strategic necessity.
Two Types of Logs: Runtime and Post-Execution
Logs in web scraping can be broadly categorized into two types:
Live Metrics Logs: These provide real-time feedback during the scraping process.
Summary Metrics Logs: These give a bird’s-eye view of the spider's performance after it completes.
Live Metrics Logs
Live metrics are critical for on-the-fly monitoring and automatic decision-making logic implemented in the scraper. These logs include:
Request status codes: To ensure your spider makes successful requests and identify when errors occur.
Proxy endpoint: To track requests' success or failure rate per proxy endpoint and provider.
Execution IP: To monitor the actual IP address being used.
Tracking these metrics live allows you to:
Rotate Proxies Dynamically: If certain proxies underperform (e.g., return many 4XX or 5XX errors), they can be replaced in real-time using a proxy ladder—a mechanism for grading and prioritizing proxies based on their reliability. This will be the subject of a future article.
Act proactively: If you notice errors while a scraper is running, you can start working on a fix before the run finishes (a minimal sketch of this kind of live tracking follows this list).
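To make this concrete, here is a minimal sketch, not the middleware we’ll build later in the series, of a Scrapy downloader middleware that counts response status codes per proxy at runtime and logs a warning when a proxy starts failing. The class name and the 50% error threshold are arbitrary choices for illustration; the proxy endpoint is read from request.meta["proxy"], the key Scrapy’s HttpProxyMiddleware uses.

```python
# Illustrative sketch: track status codes per proxy while the spider runs.
from collections import Counter, defaultdict
import logging

logger = logging.getLogger(__name__)


class LiveStatusMonitorMiddleware:
    def __init__(self):
        # Status code counts, grouped by proxy endpoint
        self.stats_per_proxy = defaultdict(Counter)

    def process_response(self, request, response, spider):
        proxy = request.meta.get("proxy", "no-proxy")
        self.stats_per_proxy[proxy][response.status] += 1

        counts = self.stats_per_proxy[proxy]
        errors = sum(n for status, n in counts.items() if status >= 400)
        total = sum(counts.values())
        if total >= 20 and errors / total > 0.5:
            # In a real setup this could feed a proxy ladder instead of a log line
            logger.warning("Proxy %s is failing: %d/%d error responses", proxy, errors, total)
        return response
```

To try it out, the class would be registered in the project’s DOWNLOADER_MIDDLEWARES setting like any other downloader middleware.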
Summary Metrics Logs
Post-execution logs help you assess the overall performance of the scraping session. These include:
Total number of requests: Helps measure the scope of the scraping task.
Successful requests: Indicates the scraper's efficiency.
Failed requests: Points to issues with proxies, target site restrictions, or scraper misconfiguration.
Bytes transferred: Helps estimate proxy or bandwidth costs, which is especially important if you’re scaling operations.
In a nutshell, we want to collect execution statistics from our scrapers (number of requests and their status codes, bandwidth used, and context information like IP, scraper name, and proxy provider). We also want to expose return codes during the execution for live monitoring and, eventually, implement an automatic proxy ladder.
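Scrapy already aggregates most of these summary numbers in its stats collector, under keys such as downloader/request_count, downloader/response_status_count/200, and downloader/response_bytes. As a rough sketch of how they could be read once a run ends, an extension hooked to the spider_closed signal might look like this (the class name and the fields included in the summary are illustrative):

```python
# Minimal sketch of a Scrapy extension that dumps summary stats at the end of a run.
from scrapy import signals


class SummaryStatsExtension:
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        stats = self.stats.get_stats()
        summary = {
            "spider": spider.name,
            "finish_reason": reason,
            "total_requests": stats.get("downloader/request_count", 0),
            "successful_responses": stats.get("downloader/response_status_count/200", 0),
            "response_bytes": stats.get("downloader/response_bytes", 0),
        }
        spider.logger.info("Run summary: %s", summary)
```

The extension would be enabled through the EXTENSIONS setting; from there, the same summary could be shipped anywhere, which is exactly what the rest of this series is about.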
Solution architecture
In the previous post about Grafana, we had Prometheus and Scrapy installed on the same machine. Scrapy sent its logs to a local Prometheus instance, from which the central Prometheus server then collected them.
This approach is not scalable in a distributed environment, where we must centralize the log ingestion in only one place.
In our scenario, with such a degree of logging, the number of rows to store could easily reach several million per day, produced by hundreds of scrapers on different machines. Of course, logging should not impact the scrapers' performance, so the most efficient solution is to decouple the log production from its storage, using a Message Queue.
What is a Message Queue, and how does it work?
A Message Queue (MQ) is a system that enables applications to communicate asynchronously by exchanging messages. Producers (such as web scrapers) place messages onto a queue, and consumers (such as data processors) retrieve and process them independently. This decoupling means that producers and consumers don’t need to operate at the same speed or even be online at the same time. MQ systems provide a buffer (the “queue”) where messages are stored until the consumers are ready to process them. They offer features like message persistence, acknowledgments, and dead-letter queues, which ensure reliability and fault tolerance even in the face of system failures or temporary unavailability of consumers.
How It Works:
Producers send messages to a queue, which acts as temporary storage.
Consumers fetch messages from the queue at their own pace.
Messages are processed, and acknowledgments are returned to the queue to confirm successful handling.
Depending on the configuration, the system can retry failed message processing and guarantee that messages are delivered at least once or exactly once.
Best Use Cases:
High-Volume Data Ingestion: Handling millions of logs, metrics, or events generated by distributed systems.
Decoupled Microservices: Enabling different services to communicate without being tightly coupled.
Load Leveling: Managing spikes in data production by queuing messages until consumers catch up.
Real-Time Processing Pipelines: Feeding logs or data into analytics, machine learning, or alerting systems.
MQ systems like RabbitMQ are particularly effective in scenarios requiring high throughput, scalability, and fault tolerance.
Direct database writes or API-based logging could be alternatives to this system, but they would struggle to handle large-scale workloads reliably, as they can overwhelm storage systems or introduce latency in the scrapers.
Streaming platforms like Apache Kafka offer high throughput, but their complexity and infrastructure overhead make them less accessible for more straightforward use cases.
RabbitMQ, by contrast, provides a straightforward and scalable solution for message handling, making it the best choice for applications that need to process millions of messages daily.
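To give an idea of what the producer side looks like, here is a bare-bones sketch using the pika client for RabbitMQ. The host, queue name, and payload are placeholders; the actual integration with the scrapers is the subject of the next article.

```python
# Illustrative RabbitMQ producer using pika (pip install pika).
# Host, queue name, and payload are placeholders for this sketch.
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# durable=True asks RabbitMQ to keep the queue across broker restarts
channel.queue_declare(queue="scrapy_logs", durable=True)

message = {"scraper": "example_spider", "status": 200, "response_bytes": 12345}
channel.basic_publish(
    exchange="",
    routing_key="scrapy_logs",
    body=json.dumps(message),
    properties=pika.BasicProperties(delivery_mode=2),  # mark the message as persistent
)
connection.close()
```

A consumer on the other side would read from the same queue at its own pace and write the messages to the storage of choice.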
The final architecture will look like the following picture.
Implementing the logging system
In this first post of the series, we’ll see how to write a middleware for Scrapy that extracts the information needed for our logging system.
The script is in the 70.LOGGING folder of the GitHub repository, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
For now, we’ll make it print to the screen the messages that, in the next article, we’ll send via RabbitMQ.
Designing the message structure
Our target solution writes structured messages, made of keys and values, to an MQ, so the simplest way to structure them is to use the JSON format.
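While the full script lives in the repository, a stripped-down sketch of the idea could look like the following: a downloader middleware that builds one JSON message per response and, for now, simply prints it. The field names, and the meta keys used for the proxy endpoint and provider, are assumptions made for this example and may differ from the actual implementation.

```python
# Illustrative sketch only: the real middleware lives in the 70.LOGGING folder.
import json
from datetime import datetime, timezone


class JsonLogMiddleware:
    def process_response(self, request, response, spider):
        message = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "scraper": spider.name,
            "url": response.url,
            "status": response.status,
            "response_bytes": len(response.body),
            "proxy": request.meta.get("proxy"),  # set by HttpProxyMiddleware
            "proxy_provider": request.meta.get("proxy_provider"),  # hypothetical custom key
        }
        # In the next article, this print becomes a publish on RabbitMQ
        print(json.dumps(message))
        return response
```

Each message is a flat JSON object, one per response, which maps naturally onto the keys-and-values structure described above.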