What is Scrapy?

Scrapy is an open-source web crawling framework for Python, used to extract data from websites. It is currently maintained by Zyte (formerly Scrapinghub) together with many other contributors.

As described in Wikipedia, Scrapy provides powerful features such as auto-throttling, proxy support, and configurable user agents, along with several other useful options. Being highly modular, it can also be extended through middlewares and extensions with additional features useful for web scraping, such as proxy rotation.

How to use it

To use Scrapy, you need basic knowledge of Python programming. The framework can be installed with pip, the Python package installer (pip install scrapy). Once it is installed, you can create a new project with the Scrapy CLI (scrapy startproject followed by a project name) and start writing the spiders that scrape the data you need.

Scrapy has several built-in features that make it easy to use and effective. It includes support for handling HTTP requests and responses, following links, handling cookies and sessions, and processing data in the desired format. It also provides tools for logging, managing duplicate requests, and handling exceptions.
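Many of these built-in features are controlled through the project settings. The fragment below is a sketch of what a settings.py could contain; the setting names are Scrapy's own, while every value (including the user agent string and the output file name) is a made-up example rather than a recommendation.

```python
# settings.py (fragment): all values are illustrative assumptions.

# Identify the crawler to the sites it visits; this string is a made-up example.
USER_AGENT = "my-crawler (+https://example.com/contact)"

# AutoThrottle adjusts the download delay based on the observed server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0   # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0    # upper bound for the delay, in seconds

# Cookie handling keeps session cookies across requests (enabled by default).
COOKIES_ENABLED = True

# Logging verbosity.
LOG_LEVEL = "INFO"

# Export the scraped items to a JSON file.
FEEDS = {
    "items.json": {"format": "json"},
}
```

Duplicate requests are filtered out by default, so that feature needs no setting here.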

If you want to better understand how to create your first scraper, you can have a look at the official documentation or read my post on The Web Scraping Club, "Create your first python scraper with Scrapy".

Scrapy’s architecture

Again from the Scrapy documentation, we can get a deeper understanding of how Scrapy works by looking at its architecture.

The data flow in Scrapy is controlled by the execution engine, and goes like this:

  1. The Engine gets the initial Requests to crawl from the Spider.

  2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.

  3. The Scheduler returns the next Requests to the Engine.

  4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).

  5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).

  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).

  7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).

  8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.

  9. The process repeats (from step 3) until there are no more requests from the Scheduler.
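The loop described in these steps can be imitated with a few lines of plain Python. This is only a toy simulation, not Scrapy's actual implementation: it uses no Scrapy code, skips the middlewares, runs synchronously, and every name in it (FAKE_SITE, Scheduler, downloader, spider_parse, run_engine) is hypothetical and invented for this sketch.

```python
from collections import deque

# A fake "website": URL -> (page text, outgoing links). Purely illustrative data.
FAKE_SITE = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b"]),
    "/b": ("page b", []),
}


class Scheduler:
    """Queues requests and drops URLs that were already scheduled."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def downloader(url):
    """Steps 4-5: turn a request into a response (no network involved)."""
    text, links = FAKE_SITE[url]
    return {"url": url, "text": text, "links": links}


def spider_parse(response):
    """Step 7: yield one scraped item, then the new requests to follow."""
    yield {"item": {"url": response["url"], "text": response["text"]}}
    for link in response["links"]:
        yield {"request": link}


def run_engine(start_urls):
    scheduler = Scheduler()
    pipeline_output = []                       # stands in for the Item Pipelines
    for url in start_urls:                     # steps 1-2: schedule initial requests
        scheduler.enqueue(url)
    while True:
        url = scheduler.next_request()         # step 3
        if url is None:                        # step 9: nothing left to crawl
            break
        response = downloader(url)             # steps 4-5
        for result in spider_parse(response):  # steps 6-7
            if "item" in result:
                pipeline_output.append(result["item"])  # step 8: items
            else:
                scheduler.enqueue(result["request"])    # step 8: new requests
    return pipeline_output


items = run_engine(["/"])
print([item["url"] for item in items])  # prints ['/', '/a', '/b']
```

Note that "/b" is linked from two pages but is crawled only once, because the toy scheduler discards URLs it has already seen. In the real framework the engine works asynchronously, so several requests are in flight at the same time.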

If you want to see Scrapy in action, I recommend this beginner tutorial video by John Watson Rooney.

This post is written by Pierluigi Vinciguerra (pier@thewebscraping.club)
