THE LAB #74: Running scrapers on GitHub Actions
Save money and time by using GitHub infrastructure for running your scrapers
Let’s say we’ve just created our scraper and, with a smile, verified that it runs fine from beginning to end. Now it’s time to schedule it so we can gather data regularly without running it manually.
With the range of services available today for managing hardware infrastructure, we have plenty of choices.
Options for Scheduling Scrapers Today
Depending on the size of our project and its execution time, there are several approaches available, each suited for different needs and levels of complexity:
Cron Jobs on Local or Remote Servers: Set up cron jobs on a server you control to run your scraper at specified intervals. This method requires managing that machine yourself and keeping it always online, or paying for a managed hosting service so you don’t have to worry about the infrastructure. In this case, the scraper will always run from the same data center IP address, making it detectable in the long run unless you use proxies. This approach is quite “old school” but still effective for scheduling short scrapers on websites with no anti-bot protection, where the overhead of managing your own servers isn’t justified.
Cloud Platforms: Virtual machines, containers, and Lambda functions are all available on the major cloud providers (AWS, Azure, GCP) and can be used to schedule your scrapers. While Lambda functions, because of their hardware and execution-time constraints, are suitable mainly for short scrapers, virtual machines and containers can handle larger and more complex projects. If you rotate the machines, you also get a different IP address at each execution at no extra cost, which is always good. We’ve already seen in this article how to schedule a scraper using AWS Lambda.
Third-Party Tools: Platforms like Zapier, IFTTT, or Automate.io can run tasks on a schedule. These tools are user-friendly but may lack the flexibility needed for complex scraping tasks.
GitHub Actions: A developer-friendly solution integrated into GitHub, ideal for managing scraping workflows directly within your code repository. It’s simple to set up and free for public repositories, making it an excellent choice for developers already using GitHub. It has more or less the same limitations as Lambda functions, so it can be a solution for short scrapers.
In this article, we’ll focus on GitHub Actions and how to use them to schedule your web scraper.
What is GitHub Actions?
GitHub Actions is a CI/CD (Continuous Integration and Continuous Deployment) platform provided by GitHub. It allows you to automate workflows such as building, testing, deploying code, or running scripts directly from your GitHub repository.
Pros and Cons of Using GitHub Actions
Like any other solution, GitHub Actions is a powerful and flexible tool with its own advantages and limitations.
Pros
Integration with GitHub: Since it’s built into GitHub, setting up and managing workflows is seamless if your codebase is already hosted there.
Ease of Use: YAML-based configuration is straightforward, even for developers new to CI/CD.
Free for Public Repositories: Unlimited workflow executions for public projects, making it cost-effective for open-source work.
Multi-Platform Support: Supports various languages and operating systems, allowing flexibility in project requirements.
Ephemeral Runners: Each job runs in a clean environment, reducing dependency conflicts and ensuring consistency.
Extensive Marketplace: A wide range of prebuilt actions is available to simplify common tasks.
Cons
Time Constraints: GitHub-hosted runners have a maximum job runtime of 6 hours, which may not be sufficient for long-running scrapers.
Resource Limitations: Runners have limited CPU, memory, and storage, which may impact resource-intensive scrapers.
Free Tier Limits for Private Repositories: Private repositories have a cap on runner minutes (e.g., 2,000 minutes/month for free-tier users).
Ephemeral Environment: Temporary runners mean you need to manage data persistence explicitly using artifacts or external storage.
Learning Curve for Complex Workflows: While simple workflows are easy to configure, advanced setups may require more effort and expertise.
How Does GitHub Actions Work?
GitHub Actions operates as an automation tool that integrates directly into your GitHub repository. The core concept revolves around defining workflows, which are sets of instructions written in a YAML file. These workflows specify what tasks should be performed, when they should occur, and in what environment they should run. The process begins by creating a workflow file in your repository's .github/workflows/ directory. Each workflow file describes the automation in terms of triggers, jobs, and steps.
Triggers are the events that prompt GitHub Actions to start running a workflow. These can include events like pushing code to the repository, opening a pull request, scheduling tasks using cron syntax, or manually triggering workflows via the GitHub Actions interface. This flexibility ensures that automation can align with your development or operational needs.
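For instance, a trigger section combining a cron schedule with a manual trigger could look like the following sketch (the schedule itself is just an illustrative choice; GitHub evaluates cron expressions in UTC):

    # Trigger block of a workflow file (illustrative schedule)
    on:
      schedule:
        - cron: "0 6 * * *"   # run every day at 06:00 UTC
      workflow_dispatch:       # also allow manual runs from the Actions tab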
Within the workflow, jobs represent the broader tasks to be executed. Each job consists of a sequence of steps, which are the specific commands or actions executed on a virtual machine. For instance, a job could include steps to check out the repository code, install dependencies, and execute a script.
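As a sketch of that idea, a job for a Python scraper could check out the code, install the dependencies, and launch the script. The script and requirements file names below are placeholders, not taken from the article’s repository:

    # One job made of sequential steps, running on a GitHub-hosted runner
    jobs:
      scrape:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4             # check out the repository code
          - uses: actions/setup-python@v5         # install a Python interpreter
            with:
              python-version: "3.11"
          - run: pip install -r requirements.txt  # install dependencies (placeholder file)
          - run: python scraper.py                # run the scraper (placeholder name)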
To execute these jobs, GitHub provides runners—virtual machines that act as the execution environment. These runners come pre-configured with popular tools and programming languages, making them ready for use. GitHub maintains hosted runners in its data centers, offering global availability and reliable execution. Alternatively, you can configure self-hosted runners for more control over the environment.
Setting up GitHub Actions is straightforward. Navigate to the Actions tab in your repository, where you can select a predefined workflow template or create your own from scratch. Once the workflow file is committed to your repository, GitHub detects it automatically and makes the workflow available for execution. This tight integration simplifies scheduling and lets you focus on the tasks you want to automate.
Setting Up Our Scraper
First of all, I’ve chosen a working scraper from my collection for this test. It’s a price scraper for the Ounass website, a luxury e-commerce site that is very popular in the Middle East.
I’ve created a new repository called 74.GITHUBACTIONS and put the scraper’s code there.
The script is in the 74.GITHUBACTIONS folder of the GitHub repository, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
Before we can schedule the runs, we need to create a YAML workflow file inside the repository. In this file, we tell GitHub what to do and when to do it.
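To give an idea of what such a file could contain, here is a minimal sketch of a scheduled workflow. The file name, schedule, script name, and output path are assumptions for illustration, not the actual configuration shipped in the 74.GITHUBACTIONS folder:

    # .github/workflows/scraper.yml -- illustrative sketch, not the actual file
    name: Run Ounass price scraper
    on:
      schedule:
        - cron: "0 5 * * *"      # once a day at 05:00 UTC (example schedule)
      workflow_dispatch:          # also allow manual runs
    jobs:
      scrape:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.11"
          - run: pip install -r requirements.txt    # hypothetical requirements file
          - run: python ounass_scraper.py           # hypothetical script name
          - uses: actions/upload-artifact@v4        # persist output, since runners are ephemeral
            with:
              name: scraped-data
              path: output/                         # hypothetical output folder

Since runners are wiped after each run, the final step uploads the scraped files as a workflow artifact so they survive the job; writing the results to an external database or storage bucket would work just as well.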