THE LAB #4: Scrapyd - how to manage and schedule a fleet of scrapers
Pros and cons of the current scheduling solutions for Scrapy
Here’s another post in “THE LAB”: in this series, we’ll cover real-world use cases, with code and an explanation of the methodology used.
In the future, this kind of content will be available only to paying subscribers. As one of the first posts in the series, this one will be free to read until the 19th of October 2022; after that, it will be behind a paywall.
Being a paying user gives you:
Access to Paid Content, like the post series called “The LAB”, where we deep dive into real-world cases with code (view here as an example).
Access to the GitHub repository with the code seen in “The LAB”
Access to private channels on our Discord server
But if you prefer to read this newsletter for free, you will still get one post per week about:
News about web scraping
Anti-bot software and techniques insights
Interviews with key people in the industry
And you can always join the Web Scraping Club Discord server
Enough housekeeping for now, let’s start.
Web scraping is like eating cherries: one website leads to another, and you will soon find yourself with hundreds of scrapers scattered across your servers, scheduled haphazardly via crontab.
In this post, we'll see how to handle this complexity with some tools that use Scrapy's built-in features to create a web dashboard for monitoring and scheduling your scrapers.
Why Scrapy?
Scrapy spiders are much easier to manage because they come bundled with a Telnet console that lets external software query a scraper's status and report it in web dashboards.
From the Telnet console, you can pause, resume, and stop scrapers and monitor statistics about the Scrapy engine and the data collection.
All you need to do is connect via telnet to the address of the machine where Scrapy is running, using the username and password provided in the settings file.
These are Scrapy's default values, which you can override in your project's settings.py file:
TELNETCONSOLE_ENABLED = 1                # the console is enabled by default
TELNETCONSOLE_PORT = [6023, 6073]        # port range: the first free port is used
TELNETCONSOLE_HOST = '127.0.0.1'         # listens only on localhost
TELNETCONSOLE_USERNAME = 'scrapy'
TELNETCONSOLE_PASSWORD = None            # if None, a random password is generated
If the password is not set, Scrapy will automatically generate a random one at the beginning of the execution of the scraper. During startup, you should see a line like the following:
[scrapy.extensions.telnet] INFO: Telnet Password: fe53708491f51304
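If you plan to connect from external scripts or dashboards, it can be handier to pin a fixed port and password in settings.py instead of copying the generated one from the logs at every run. A minimal sketch, where the password value is just a placeholder:

# settings.py - pin the Telnet console to a known port and password
TELNETCONSOLE_PORT = [6023, 6023]              # restrict the range to a single, fixed port
TELNETCONSOLE_USERNAME = 'scrapy'
TELNETCONSOLE_PASSWORD = 'my-secret-password'  # placeholder: choose your own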
Once you connect to the console, you can retrieve your scraper stats with the command
stats.get_stats()
and get the same stats you usually see at the end of the execution:
{'log_count/INFO': 10,
 'start_time': datetime.datetime(2022, 10, 11, 18, 19, 8, 178404),
 'memusage/startup': 56901632,
 'memusage/max': 56901632,
 'scheduler/enqueued/memory': 43,
 'scheduler/enqueued': 43,
 'scheduler/dequeued/memory': 13,
 'scheduler/dequeued': 13,
 'downloader/request_count': 14,
 'downloader/request_method_count/GET': 14,
 'downloader/request_bytes': 11016,
 'robotstxt/request_count': 1,
 'downloader/response_count': 9,
 'downloader/response_status_count/404': 2,
 'downloader/response_bytes': 30301,
 'httpcompression/response_bytes': 199875,
 'httpcompression/response_count': 9,
 'log_count/DEBUG': 14,
 'response_received_count': 9,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'downloader/response_status_count/200': 7,
 'request_depth_max': 2,
 'item_scraped_count': 5,
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/404': 1}
You can also check the status of the Scrapy engine and control its execution, pausing or terminating it, with the following commands:
engine.pause()
engine.unpause()
engine.stop()
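All of this can also be driven from a script rather than an interactive session. As a rough sketch (not part of the original setup), the snippet below logs into the console and dumps the stats; it assumes the console is reachable on 127.0.0.1:6023, that the password has been copied from the scraper's log, and it uses Python's telnetlib, which is deprecated in recent Python versions:

# fetch_stats.py - rough sketch: pull Scrapy stats through the Telnet console
from telnetlib import Telnet  # deprecated since Python 3.11, removed in 3.13

HOST, PORT = '127.0.0.1', 6023
USERNAME = b'scrapy'
PASSWORD = b'fe53708491f51304'  # placeholder: copy the password from the scraper log

with Telnet(HOST, PORT, timeout=10) as tn:
    tn.read_until(b': ')                # username prompt
    tn.write(USERNAME + b'\r\n')
    tn.read_until(b': ')                # password prompt
    tn.write(PASSWORD + b'\r\n')
    tn.read_until(b'>>>')               # wait for the Python console prompt
    tn.write(b'stats.get_stats()\r\n')  # same command used interactively above
    print(tn.read_until(b'>>>').decode('utf-8', errors='replace'))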
Now that we understand how to gather stats and control the scraper remotely, we could certainly build scripts like the one sketched above for every scraper, but there's no need to reinvent the wheel: several open-source solutions already exist that can help us.
Scrapyd
Scrapyd is an application that schedules and monitors Scrapy spiders and comes with a (very) basic web interface. It can also be used as a versioning tool for scrapers, since it allows you to create multiple versions of the same scraper, even if only the latest one can be launched.
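To give an idea of how it is used: Scrapyd runs as a daemon (on port 6800 by default) and exposes a JSON API over HTTP for scheduling and monitoring spiders. A quick sketch with the requests library, where the project and spider names are placeholders:

import requests

SCRAPYD_URL = 'http://localhost:6800'  # default Scrapyd address

# schedule a run of an already-deployed spider
run = requests.post(f'{SCRAPYD_URL}/schedule.json',
                    data={'project': 'myproject', 'spider': 'myspider'})
print(run.json())  # e.g. {'status': 'ok', 'jobid': '...'}

# list pending, running and finished jobs for the project
jobs = requests.get(f'{SCRAPYD_URL}/listjobs.json',
                    params={'project': 'myproject'})
print(jobs.json())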
Let’s configure a basic Scrapyd server together.