THE LAB #68: Scheduling Scrapers with Airflow
How to manage a fleet of scrapers with Apache Airflow
Scraping is like cherries: once you’ve tasted one web data extraction, you want more and more. Today, we’re seeing how to schedule your data extraction using one of the most popular tools available to data engineers: Apache Airflow.
While many tools exist for scheduling tasks, Airflow is a great choice for building flexible and robust data pipelines, from the web to a database (or wherever you want).
Now a word from our web scraping consulting arm: RE Analytics
Since 2009, we’ve been at the forefront of web scraping, helping some of the world’s largest companies with sustainable and reliable access to critical web data.
If your web data pipeline is under pressure—be it due to high costs, technical hurdles, or scaling challenges—we’re here to provide unbiased, confidential advice tailored to your case.
What Is Apache Airflow?
Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Think of it as a task manager on steroids, capable of handling complex dependencies, retries, and multi-step workflows with ease.
In the context of web scraping, Airflow can automate the entire pipeline:
Triggering scrapers at specific intervals.
Handling dependencies between tasks, such as data scraping, processing, and storage.
Alerting on failures and logging all activity for debugging and monitoring.
If you're used to crontab or simple task schedulers, Airflow provides an upgrade with its intuitive Directed Acyclic Graph (DAG) interface, which allows you to visualize and manage workflows effortlessly.
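To make the DAG idea more concrete, here’s a minimal sketch of a scraping workflow with three tasks that run in order, each one starting only after the previous one succeeds. The function names and schedule are placeholders rather than the pipeline built later in this article, and the schedule parameter assumes Airflow 2.4 or newer.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables: swap in your real scraping, cleaning and loading logic
def scrape():
    print("scraping data...")

def process():
    print("processing the scraped data...")

def store():
    print("storing the results...")

with DAG(
    dag_id="example_scraping_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run once a day
    catchup=False,       # don't backfill runs for past dates
) as dag:
    scrape_task = PythonOperator(task_id="scrape", python_callable=scrape)
    process_task = PythonOperator(task_id="process", python_callable=process)
    store_task = PythonOperator(task_id="store", python_callable=store)

    # The dependency chain: scrape -> process -> store
    scrape_task >> process_task >> store_task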
I’m completely new to Airflow and only started studying it to write this article, so I invite you to test it yourself and leave your feedback in the comments or on our Discord server, so I can improve the pipeline I’m building.
Why Use Airflow for Web Scraping?
Scalability: As your scraping needs grow, managing tasks manually or with simpler tools becomes unmanageable. Airflow scales with you.
Error Handling: Airflow has built-in mechanisms to retry tasks on failure, notify you, and log issues comprehensively (see the configuration sketch after this list).
Integration: It integrates seamlessly with Python, making it ideal for orchestrating scrapers built with libraries like Scrapy, Selenium, or BeautifulSoup.
Ease of Use: Airflow's UI clearly visualizes workflows, allowing users to track progress and identify bottlenecks.
Extensibility: You can trigger Airflow tasks based on external events, making it ideal for dynamic scraping workflows.
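To give you an idea of the error-handling point, this is the kind of default_args dictionary you can attach to a DAG so that every task retries on failure and sends an email alert. The values and the email address are purely illustrative, and email alerts only work if SMTP is configured in your Airflow instance.

from datetime import timedelta

# Illustrative defaults applied to every task of the DAG they are attached to
default_args = {
    "owner": "scraping-team",             # hypothetical owner
    "retries": 2,                         # retry each failed task twice...
    "retry_delay": timedelta(minutes=5),  # ...waiting 5 minutes between attempts
    "email": ["alerts@example.com"],      # hypothetical address
    "email_on_failure": True,             # needs SMTP configured in airflow.cfg
}

# Then pass them when defining the DAG:
# with DAG(dag_id="my_scraper", default_args=default_args, ...) as dag:
#     ...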
Writing my first data pipeline with Airflow
Setting up the environment
First of all, I need to set up my environment, so I’m installing Airflow via pip
pip install apache-airflow
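As a side note, the official Airflow documentation recommends installing with a constraints file to avoid dependency conflicts, something along these lines (adjust the Airflow and Python versions to your own setup):

AIRFLOW_VERSION=2.10.3
PYTHON_VERSION=3.11
pip install "apache-airflow==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"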
and then I need to initialize its database.
airflow db init
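On recent Airflow releases (2.7 and later) you may see a deprecation warning here, since this command has been superseded by:

airflow db migrate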
I mentioned that Airflow has its own web interface, so let’s create the credentials to use it.
airflow users create \
--username admin \
--firstname Pier \
--lastname TWSC \
--role Admin \
--email pier@thewebscraping.club
After a few seconds, you will be prompted to enter the password.
Finally, we can also start the web server and the scheduler, each in its own terminal,
airflow webserver --port 8080
airflow scheduler
and log in to http://localhost:8080.
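For a quick local test, there’s also an all-in-one command that initializes the database, creates an admin user, and starts the web server and scheduler together (handy for experiments, not meant for production):

airflow standalone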
The script is in the GitHub repository's folder 68.AIRFLOW, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
Creating my first DAG
Airflow workflows are defined as Directed Acyclic Graphs (DAGs). Each DAG contains tasks, and the relationships between tasks define the workflow. Here’s an example that schedules a Scrapy scraper.
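The complete script is in the 68.AIRFLOW folder mentioned above; what follows is only a minimal sketch of the idea, assuming a Scrapy project located in ~/projects/myscraper with a spider called myspider (both hypothetical names). The DAG simply shells out to scrapy crawl once a day via a BashOperator.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="scrapy_scraper",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run the spider once a day
    catchup=False,
) as dag:
    # Hypothetical project path and spider name: adapt them to your Scrapy project
    run_spider = BashOperator(
        task_id="run_myspider",
        bash_command="cd ~/projects/myscraper && scrapy crawl myspider",
    )

Save the file into Airflow’s dags/ folder and, once the scheduler picks it up, you can trigger and monitor it from the web UI.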