THE LAB #69: Building a dashboard for your scrapers with Grafana
Visualizing the operations of your Scrapy spider with interactive dashboards
In the latest The Lab post, we saw how to schedule your fleet of scrapers with Airflow and create data pipelines to retrieve your data and store it in an S3 bucket.
Today, we’re covering another crucial aspect of scaling your web data infrastructure: logging.
Now a word from our web scraping consulting arm: RE Analytics
Since 2009, we’ve been at the forefront of web scraping, helping some of the world’s largest companies with sustainable and reliable access to critical web data.
If your web data pipeline is under pressure, be it due to high costs, technical hurdles, or scaling challenges, we’re here to provide unbiased, confidential advice tailored to your case.
The importance of logging in web scraping projects
In web scraping projects, and especially larger ones, logging and monitoring are not just beneficial—they are essential. As the volume of scraped data grows and the number of target websites increases, so does the complexity of managing your scraping operations.
Logging provides a detailed record of your scraper’s activities, such as request counts, error rates, and response times. This information is crucial for identifying bottlenecks, debugging issues, and ensuring the scraper behaves as expected. Logging also directly impacts data quality by letting you address incomplete or inconsistent data quickly: a look at the logs tells you whether a scraper was interrupted midway by errors or anti-bot protections.
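Scrapy gives you much of this out of the box: it logs its activity as it runs and, at the end of each crawl, dumps a stats summary with counters such as downloader/request_count and log_count/ERROR. As a minimal sketch, a logging setup in settings.py could look like this (the file path and format string are illustrative choices, not a prescribed configuration):

# settings.py - basic Scrapy logging configuration (values are illustrative)
LOG_LEVEL = "INFO"  # or DEBUG, WARNING, ERROR
LOG_FILE = "logs/spider.log"  # persist logs to a file instead of stdout
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"

Persisted logs can then be parsed or, better, turned into metrics, which is where the tools covered in this post come in.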
Monitoring the logs and acting when errors are spotted helps deliver timely and complete data to end users. This is critical in industries where up-to-date information is key to decision-making, such as e-commerce, finance, and market research. It also makes your web data pipeline more resilient.
What is Grafana, and why did I choose it for web scraping?
Grafana is an open-source platform designed for monitoring, analyzing, and visualizing metrics from various data sources.
Initially created for system monitoring, Grafana has grown into a versatile tool that supports many use cases, including infrastructure management, application monitoring, and web scraping analytics. Its primary feature is the ability to create highly customizable dashboards, allowing users to track and visualize key metrics in real time. Grafana integrates with numerous data sources, such as Prometheus, Elasticsearch, InfluxDB, and MySQL, making it adaptable to diverse workflows. Its alerting system is another helpful feature, enabling proactive responses to critical issues by sending notifications through email or messaging platforms.
For this article, I chose Prometheus as a data source: it’s open source, designed to store real-time metrics, and, last but not least, fits perfectly with Grafana.
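To make the integration concrete before diving into the setup, here is a minimal sketch of a Scrapy extension that exposes crawl counters over HTTP for Prometheus to scrape. It uses the prometheus_client package; the class name, the PROMETHEUS_PORT setting, and the metric names are my own illustrative choices, not the script from the article’s repository.

# extensions.py - hypothetical Scrapy extension exposing Prometheus metrics
from prometheus_client import Counter, start_http_server
from scrapy import signals


class PrometheusMetrics:
    """Expose basic crawl counters on an HTTP endpoint Prometheus can scrape."""

    def __init__(self, port):
        self.port = port
        self.responses = Counter("scrapy_responses_total", "Responses received", ["spider"])
        self.items = Counter("scrapy_items_total", "Items scraped", ["spider"])
        self.errors = Counter("scrapy_errors_total", "Spider errors", ["spider"])

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(port=crawler.settings.getint("PROMETHEUS_PORT", 8000))
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_opened(self, spider):
        # Serve the /metrics endpoint in a background thread
        start_http_server(self.port)

    def response_received(self, response, request, spider):
        self.responses.labels(spider=spider.name).inc()

    def item_scraped(self, item, spider):
        self.items.labels(spider=spider.name).inc()

    def spider_error(self, failure, response, spider):
        self.errors.labels(spider=spider.name).inc()

Enabling it is a matter of adding the class to the project’s EXTENSIONS setting; Prometheus can then scrape the running spider at http://localhost:8000/metrics.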
Setting up the environment
Installing Grafana
First of all, I need to install Grafana in my environment, in this case a Mac; instructions for other platforms are available on the official website.
brew install grafana
Then I start the local server.
brew services start grafana
Now, I can connect to my dashboard server at http://localhost:3000, Grafana’s default address.
Installing Prometheus
Now the same operation must be done for Prometheus.
brew install prometheus
and then start it as a service, which loads its prometheus.yml configuration file.
brew services start prometheus
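For reference, here is a minimal sketch of what the configuration file might contain. Homebrew installs a default prometheus.yml (typically under $(brew --prefix)/etc/), and the scrapy job below is an assumption matching the hypothetical extension sketched earlier, not a required part of the setup.

# prometheus.yml - minimal sketch; the "scrapy" job assumes the
# hypothetical Scrapy extension above, exposing metrics on port 8000
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "scrapy"
    static_configs:
      - targets: ["localhost:8000"]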
Now, we should be able to see its UI at http://localhost:9090.
The script is in the 69.GRAFANA folder of the GitHub repository, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
Adding Prometheus as a data source in Grafana
We can add Prometheus as a data source in Grafana by following these steps:
Log in to the Grafana web interface.
Navigate to Configuration > Data Sources and click Add data source.
Select Prometheus from the list.
Enter the Prometheus server URL (default is http://localhost:9090) and save the configuration.
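As an alternative to clicking through the UI, Grafana can provision the same data source from a YAML file placed in its provisioning/datasources directory at startup. A minimal sketch, assuming a local Prometheus on the default port (the file name is arbitrary):

# provisioning/datasources/prometheus.yml - hypothetical provisioning file
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true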