THE LAB #68: Scheduling Scrapers with Airflow
How to manage a fleet of scrapers with Apache Airflow
Scraping is like cherries: once you’ve tasted one web data extraction, you want more and more. Today, we’re seeing how to schedule your data extraction using one of the most popular tools available to data engineers: Apache Airflow.
While many tools exist for scheduling tasks, Airflow is a great choice for building flexible and robust data pipelines, from the web to a database (or wherever you want).
Now a word from our web scraping consulting arm: RE Analytics
Since 2009, we’ve been at the forefront of web scraping, helping some of the world’s largest companies with sustainable and reliable access to critical web data.
If your web data pipeline is under pressure—be it due to high costs, technical hurdles, or scaling challenges—we’re here to provide unbiased, confidential advice tailored to your case.
What Is Apache Airflow?
Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Think of it as a task manager on steroids, capable of handling complex dependencies, retries, and multi-step workflows with ease.
In the context of web scraping, Airflow can automate the entire pipeline:
Triggering scrapers at specific intervals.
Handling dependencies between tasks, such as data scraping, processing, and storage.
Alerting on failures and logging all activity for debugging and monitoring.
If you're used to crontab or simple task schedulers, Airflow provides an upgrade with its intuitive Directed Acyclic Graph (DAG) interface, which allows you to visualize and manage workflows effortlessly.
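To make the DAG idea more concrete, here’s a minimal sketch of a scraping workflow with three tasks that run in order, each one starting only after the previous one succeeds. The function names and schedule are placeholders rather than the pipeline built later in this article, and the schedule parameter assumes Airflow 2.4 or newer.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables: swap in your real scraping, cleaning and loading logic
def scrape():
    print("scraping data...")

def process():
    print("processing the scraped data...")

def store():
    print("storing the results...")

with DAG(
    dag_id="example_scraping_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run once a day
    catchup=False,       # don't backfill runs for past dates
) as dag:
    scrape_task = PythonOperator(task_id="scrape", python_callable=scrape)
    process_task = PythonOperator(task_id="process", python_callable=process)
    store_task = PythonOperator(task_id="store", python_callable=store)

    # The dependency chain: scrape -> process -> store
    scrape_task >> process_task >> store_task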
I’m completely new to Airflow and only started studying it to write this article, so I invite you to test it yourself and leave your feedback in the comments or on our Discord server, so I can improve the pipeline I’m building.
Why Use Airflow for Web Scraping?
Scalability: As your scraping needs grow, managing tasks manually or with simpler tools becomes unmanageable. Airflow scales with you.
Error Handling: Airflow has built-in mechanisms to retry tasks on failure, notify you, and log issues comprehensively (see the configuration sketch after this list).
Integration: It integrates seamlessly with Python, making it ideal for orchestrating scrapers built with libraries like Scrapy, Selenium, or BeautifulSoup.
Ease of Use: Airflow's UI clearly visualizes workflows, allowing users to track progress and identify bottlenecks.
Extensibility: You can trigger Airflow tasks based on external events, making it ideal for dynamic scraping workflows.
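To give you an idea of the error-handling point, this is the kind of default_args dictionary you can attach to a DAG so that every task retries on failure and sends an email alert. The values and the email address are purely illustrative, and email alerts only work if SMTP is configured in your Airflow instance.

from datetime import timedelta

# Illustrative defaults applied to every task of the DAG they are attached to
default_args = {
    "owner": "scraping-team",             # hypothetical owner
    "retries": 2,                         # retry each failed task twice...
    "retry_delay": timedelta(minutes=5),  # ...waiting 5 minutes between attempts
    "email": ["alerts@example.com"],      # hypothetical address
    "email_on_failure": True,             # needs SMTP configured in airflow.cfg
}

# Then pass them when defining the DAG:
# with DAG(dag_id="my_scraper", default_args=default_args, ...) as dag:
#     ...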
Writing my first data pipeline with Airflow
Setting up the environment
First of all, I need to set up my environment, so I’m installing Airflow via pip
pip install apache-airflow
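As a side note, the official Airflow documentation recommends installing with a constraints file to avoid dependency conflicts, something along these lines (adjust the Airflow and Python versions to your own setup):

AIRFLOW_VERSION=2.10.3
PYTHON_VERSION=3.11
pip install "apache-airflow==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"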
and then I need to initialize its database.
airflow db init
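On recent Airflow releases (2.7 and later) you may see a deprecation warning here, since this command has been superseded by:

airflow db migrate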
I mentioned that Airflow has its own web interface, so let’s create the credentials to use it.
airflow users create \
--username admin \
--firstname Pier \
--lastname TWSC \
--role Admin \
--email pier@thewebscraping.club
After a few seconds, you will be prompted to enter the password.
Finally, we can also start the web server and the scheduler, each in its own terminal,
airflow webserver --port 8080
airflow scheduler
and log in to http://localhost:8080.
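For a quick local test, there’s also an all-in-one command that initializes the database, creates an admin user, and starts the web server and scheduler together (handy for experiments, not meant for production):

airflow standalone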
The script is in the GitHub repository's folder 68.AIRFLOW, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
Creating my first DAG
Airflow workflows are defined as Directed Acyclic Graphs (DAGs). Each DAG contains tasks, and the relationships between tasks define the workflow. Here’s an example that schedules a Scrapy scraper.
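The complete script is in the 68.AIRFLOW folder mentioned above; what follows is only a minimal sketch of the idea, assuming a Scrapy project located in ~/projects/myscraper with a spider called myspider (both hypothetical names). The DAG simply shells out to scrapy crawl once a day via a BashOperator.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="scrapy_scraper",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run the spider once a day
    catchup=False,
) as dag:
    # Hypothetical project path and spider name: adapt them to your Scrapy project
    run_spider = BashOperator(
        task_id="run_myspider",
        bash_command="cd ~/projects/myscraper && scrapy crawl myspider",
    )

Save the file into Airflow’s dags/ folder and, once the scheduler picks it up, you can trigger and monitor it from the web UI.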