THE LAB #71: Sending Scrapy logs to RabbitMQ
Saving the logs of your distributed scraping architecture to your database
In the previous episode of The Lab on this blog, we saw how to improve Scrapy's standard logs to print both “live” data and summary statistics.
Today, we’re following up on that article by modifying the middleware we wrote: instead of printing the logs to the screen, we’ll send them to a RabbitMQ endpoint and, from there, to an AWS Redshift instance.
As mentioned in the previous article, we chose this architecture for several reasons. First, we want this method to work with hundreds of scrapers, so we need a solution capable of ingesting messages with a high level of concurrency. Second, since we don’t want to slow down our scrapers and create bottlenecks when writing messages to the DB, we need to decouple the ingestion and the consumption of the logs. Last, consumption should happen in batches rather than row by row, so we don’t overload the database with millions of small inserts.
RabbitMQ ticks all three boxes: messages are buffered in a queue, and a consumer then reads them in batches and writes them to the database in a fully asynchronous way.
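To make the batching idea concrete, here is a minimal sketch of such a consumer written with Pika (the library we introduce below). The queue name, the batch size, and the write_batch_to_db helper are illustrative assumptions, not the exact setup we build later.

import pika

BATCH_SIZE = 500  # illustrative: flush to the database every 500 messages

def write_batch_to_db(rows):
    # Placeholder: replace with a single bulk INSERT into your database
    print(f"writing {len(rows)} rows")

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

buffer, tags = [], []

def on_message(ch, method, properties, body):
    buffer.append(body)
    tags.append(method.delivery_tag)
    if len(buffer) >= BATCH_SIZE:
        write_batch_to_db(buffer)
        # One multiple-ack confirms every message up to the last delivery tag
        ch.basic_ack(delivery_tag=tags[-1], multiple=True)
        buffer.clear()
        tags.clear()

channel.basic_qos(prefetch_count=BATCH_SIZE)  # never hold more than one batch unacknowledged
channel.basic_consume(queue="scrapy_logs", on_message_callback=on_message)
channel.start_consuming()

A production consumer would also flush on a timer, so a partially filled batch doesn’t sit in memory indefinitely.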
How RabbitMQ, Listeners, and Consumers Work
RabbitMQ acts as a message broker that receives, queues, and dispatches messages between publishers (message senders) and consumers (message receivers). It decouples systems and enables asynchronous communication, ensuring messages are reliably handled without requiring immediate processing.
Publisher: Sends messages to RabbitMQ. Publishers connect to RabbitMQ and post messages to an exchange, which determines how messages are routed to specific queues.
Queue: A buffer where messages wait until they are consumed. Queues can be durable (surviving broker restarts) or temporary, depending on requirements.
Consumer: A service or application that listens for messages from a RabbitMQ queue. Consumers fetch messages from a queue and process them. Acknowledgment (ACK) ensures RabbitMQ knows the message was successfully received and processed.
Listener: A running instance of a consumer continuously listening for new messages on a queue. It can handle one or more messages simultaneously, ensuring efficient and scalable processing.
By linking publishers and consumers, RabbitMQ enables fault-tolerant, distributed workflows. If a consumer fails to process a message, RabbitMQ can requeue the message, ensuring reliable delivery.
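For instance, the acknowledgment and requeue mechanics look like this with Pika; a minimal sketch assuming a local broker and a queue named scrapy_logs:

import pika

def handle(body: bytes) -> None:
    # Placeholder for your real processing logic
    print(body.decode())

def on_message(ch, method, properties, body):
    try:
        handle(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)  # success: remove the message from the queue
    except Exception:
        # Failure: put the message back so it can be retried, possibly by another consumer
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.basic_consume(queue="scrapy_logs", on_message_callback=on_message)
channel.start_consuming()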
So, let’s start completing our solution by installing the prerequisites we need.
Installing prerequisites
Pika: interacting with RabbitMQ with Python
Pika is a lightweight Python library that provides a clean and simple interface to interact with RabbitMQ through the AMQP protocol.
RabbitMQ implements AMQP (Advanced Message Queuing Protocol), allowing reliable message queuing, routing, and delivery between services or applications.
Pika enables developers to connect to a RabbitMQ server, declare exchanges and queues, publish messages, and consume them. Unlike heavier frameworks, Pika is simple and flexible, making it ideal for embedding RabbitMQ into lightweight systems, such as Scrapy spiders.
It supports synchronous and asynchronous communication, allowing fine-grained control over message flow.
We can install it via pip:

pip install pika

and then import it into our Python programs:
import pika
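As a first taste of the API, connecting and publishing a message takes just a few lines. This is a sketch assuming a local broker with the default guest/guest credentials:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# The default exchange ("") routes a message straight to the queue
# named by the routing key.
channel.basic_publish(exchange="", routing_key="scrapy_logs", body=b"hello from Scrapy")
connection.close()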
How to start RabbitMQ and set a queue
RabbitMQ acts as a message broker, queuing and distributing messages efficiently. Below are the essential steps:
Install RabbitMQ: If RabbitMQ isn't already running, you can set it up locally using Docker:
docker run -d --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management
5672 is the default port for messaging, while 15672 opens the web management UI (login with guest/guest). In this article, we’re setting up a local RabbitMQ installation, but, of course, it can be installed on a cluster to improve its reliability as the message volume increases.
Set Up a Queue: Access the RabbitMQ management console at http://localhost:15672 and create a queue named scrapy_logs.
From its web UI, I’ve created the scrapy_logs queue, adding the following parameters (you’ll find a programmatic equivalent right after this list):
x-message-ttl: the time in milliseconds after which an unconsumed message is discarded. I’ve set it to one hour.
x-expires: the time in milliseconds of inactivity after which the queue itself is deleted.
x-max-length: the maximum number of messages the queue can hold before it starts refusing new ones.
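The same queue can also be declared from code with Pika. In this sketch, the TTL matches the one hour mentioned above, while the other two values are illustrative placeholders:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

channel.queue_declare(
    queue="scrapy_logs",
    durable=True,  # survive broker restarts
    arguments={
        "x-message-ttl": 3600000,  # discard messages not consumed within one hour
        "x-expires": 86400000,     # placeholder: delete the queue after a day of inactivity
        "x-max-length": 1000000,   # placeholder: cap the queue at one million messages
    },
)
connection.close()

Keep in mind that RabbitMQ refuses to re-declare an existing queue with different arguments, so these values must match whatever was set in the management console.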
Now that we’ve set up the environment, we can modify the middleware created in the previous article and then write the queue consumer.
The script is in the GitHub repository's folder 71.RABBITMQ, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
Modifying the middleware
In the previous article, we created the function log_metric to print logs on the console.
Today, we’re modifying it to send these messages to the queue we just created.
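The full implementation is in the repository; as a rough idea of the direction, a hypothetical version of log_metric could look something like this. The function signature and the JSON payload shape are assumptions for illustration, not the article’s actual code:

import json
import pika

# Hypothetical sketch: one connection per spider run, rather than one per message
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

def log_metric(spider_name: str, metric: str, value) -> None:
    # Publish one log metric to the scrapy_logs queue instead of printing it
    payload = json.dumps({"spider": spider_name, "metric": metric, "value": value})
    channel.basic_publish(exchange="", routing_key="scrapy_logs", body=payload.encode("utf-8"))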