Web scraping from 0 to hero: creating our first Scrapy spider - Part 2
Writing a fully working scraper with Scrapy
Today we’re having a new episode of “Web Scraping from 0 to Hero”, the free web scraping course provided by The Web Scraping Club.
In the previous lesson, we started writing our first scraper with Scrapy, and today we'll finish the job, explaining the code line by line.
A brief recap from the past lesson
We needed to scrape the e-commerce website Valentino.com, so after checking with Wappalyzer that there was no anti-bot protection and that the data was publicly accessible from the web, we decided to proceed with creating a Scrapy scraper.
To do so, we launched a few commands to create the skeleton of the Scrapy project and, after having defined the output data structure, we also defined the Scrapy Item we’ll use.
Finally, we started writing the scraper that gets the URL of every product category and prints it.
Starting from there, in this article, instead of only printing the URL of every product category, we'll use those URLs to crawl the category pages, reach every product page, and scrape the data we need for the output.
How the course works
The course is and will be always free. As always, I’m here to share and not to make you buy something. If you want to say “thank you”, consider subscribing to this substack with a paid plan. It’s not mandatory but appreciated, and you’ll get access to the whole “The LAB” articles archive, with 30+ practical articles on more complex topics and its code repository.
We'll see free-to-use packages and solutions, and if some commercial ones show up, it's because they are solutions I've already tested and they solve issues I cannot address in other ways.
At first, I imagined this course as a monthly issue, but as I was writing down the table of contents, I realized it would take years to complete. So it will probably have a bi-weekly frequency, filling the gaps in the publishing plan without taking too much space away from the more in-depth articles.
The collection of articles can be found using the tag WSF0TH and there will be a section on the main substack page.
Crawling product categories
Once we get the product categories’ URLs from the home page, we need to use them as input for requests.
As a first step, we need to import the Scrapy implementation of requests with the following line
from scrapy.http import Request
and then make a request per each URL we retrieved in the previous step.
categories = response.xpath('//a[@class="column-element"]/@href').extract()
for category in categories:
    print(category)
    yield Request(category, callback=self.parse_product_category)
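As a side note, if the href attributes we extract are relative URLs, a convenient alternative is Scrapy's standard response.follow helper, which resolves them against the current page. A minimal sketch of the same loop:

categories = response.xpath('//a[@class="column-element"]/@href').extract()
for category in categories:
    # response.follow builds an absolute URL from a relative href
    yield response.follow(category, callback=self.parse_product_category)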
Scrapy Requests Basics
The syntax we've just seen is quite easy to explain: we pass a URL, in this case stored in the variable named category, and the callback parameter, which indicates the function that will handle the response of the request.
In web scraping projects, two main aspects to consider when making a request are cookie handling and request headers.
Both can be controlled in the settings.py file of our project.
We can set the default headers for every request made by the scraper by setting the DEFAULT_REQUEST_HEADERS variable to a dictionary that mimics real browser request headers.
DEFAULT_REQUEST_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.7",
    "cache-control": "max-age=0",
    "sec-ch-ua": "\"Not_A Brand\";v=\"8\", \"Chromium\";v=\"120\", \"Brave\";v=\"120\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"macOS\"",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "sec-gpc": "1",
    "upgrade-insecure-requests": "1"
}
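These defaults apply to every request the spider makes. If a single request needs different headers, the Request object also accepts a headers argument; here is a minimal sketch (the referer value is just an illustrative placeholder):

yield Request(
    category,
    callback=self.parse_product_category,
    # per-request headers override the defaults for this request only
    headers={"referer": "https://www.valentino.com/en-gb"},
)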
As for cookie handling, Scrapy natively passes cookies from the initial request to the following ones, unless we explicitly set the value
COOKIES_ENABLED = False
again in the settings.py file.
If we wish to monitor how cookies evolve from request to request, we can enable the cookie debugging mode by adding the following variable to the same file:
COOKIES_DEBUG = True
After enabling the cookie debug mode, we can see in the logs how cookies change from request to request.
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://www.valentino.com/en-gb/>
Cookie: affinity="348bf538a97ef22e"; AKA_A2=A
[scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://www.valentino.com/en-gb/>
Set-Cookie: vltn_aka_geo=en-gb; expires=Sun, 05-Jan-2025 10:14:41 GMT; path=/
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.valentino.com/en-gb/> (referer: None)
https://www.valentino.com/en-gb/women/woman-sale-ready-to-wear
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://www.valentino.com/en-gb/women/woman-sale-ready-to-wear>
Cookie: affinity="348bf538a97ef22e"; vltn_aka_geo=en-gb; AKA_A2=A
In this case, we can see that from request to request cookies are being added to the cookie jar and used in the following requests.
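If we ever need to set a cookie by hand on a single request, Request also accepts a cookies argument. A quick sketch, using as a placeholder the cookie we saw in the log above:

yield Request(
    category,
    callback=self.parse_product_category,
    # explicit cookies are added to the cookie jar for this request
    cookies={"vltn_aka_geo": "en-gb"},
)

With requests, headers, and cookies covered, this is how our spider looks so far: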
import scrapy
from scrapy.http import Request


class EcomscraperSpider(scrapy.Spider):
    name = "ecomscraper"
    allowed_domains = ["valentino.com"]
    start_urls = ["https://www.valentino.com/en-gb"]

    def parse(self, response):
        categories = response.xpath('//a[@class="column-element"]/@href').extract()
        for category in categories:
            print(category)
            yield Request(category, callback=self.parse_product_category)

    def parse_product_category(self, response):
        products = response.xpath('//a[@class="productCard__image"]/@href').extract()
        for product in products:
            print(product)
After this adaptation of the scraper, we're able to get all the URLs of the products listed on the website.
Now we’ll need to parse the product pages to gather all the information needed for the output.
Scraping the product data
Traditional scraping techniques rely on writing a specific selector for every output field of the scraper. I'm using the word traditional because AI and LLMs are changing the web scraping landscape, and more and more tools promise to extract meaningful features automatically from the HTML code. This would mean no more selector maintenance, but probably less control over the output; we'll see later this year which products manage to find the right balance between these two aspects.
For today, let’s stick to the basics and see how to write selectors for our Scrapy spider.
Generally speaking, we have two types of selectors: XPATH selectors and CSS selectors.
XPath, also known as XML Path Language, is a language used to identify and select nodes in an XML document. It can be used from different languages like Python, C, C++, and many others, and it is part of the XSLT standard.
CSS selectors instead are part of the CSS language and are used to select elements inside HTML code.
If you're interested in learning more about them, we'll dedicate further lessons of this course to the topic. In the meantime, you can have a look at my previous comparison article and at these two websites to practice with XPath selectors and CSS selectors.
Personally, I find the XPath syntax more readable, and since it allows selecting nodes in both directions (descendants and parents), while CSS only lets you select descendants, I prefer to use XPath over CSS selectors.
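To make the comparison concrete, here is the category-link extraction we already wrote, expressed with both kinds of selectors. The CSS version is an equivalent sketch; note that the CSS class selector matches any element containing that class, while the XPath expression requires the class attribute to be exactly "column-element":

# XPath selector
categories = response.xpath('//a[@class="column-element"]/@href').extract()
# CSS selector equivalent
categories = response.css('a.column-element::attr(href)').getall()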
Writing the selectors
Continuing with our scraper, we extracted all the product URLs and passed them to another function, which will be responsible for scraping the product pages and generating the output.
def parse_product_category(self, response):
    products = response.xpath('//a[@class="productCard__image"]/@href').extract()
    for product in products:
        yield Request(product, callback=self.parse_product_data)
The new function, parse_product_data, will receive the response variable, where the HTML of the product page is stored.
The HTML of the product page can be studied in the browser, by using the view source tool.
It's important to understand that the View Page Source tool shows you the HTML code before browser rendering, while Inspect shows you the page code after it. Since Scrapy does not natively ship with a browser for rendering HTML, you need to write your scraper against the raw HTML.
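A handy way to test selectors against the raw HTML, exactly as the spider sees it, is the Scrapy shell:

scrapy shell "https://www.valentino.com/en-gb"
>>> response.xpath('//a[@class="column-element"]/@href').extract()

If a selector works in the browser's Inspect panel but returns nothing here, the element is probably created by JavaScript after rendering.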
By analyzing the code, we can see that all the information we need is inside a JSON string, assigned to the variable cif_productData, which needs to be cleaned before use.
As we mentioned in the previous articles, using JSON data is preferable to writing selectors, since it changes less frequently than the HTML structure. On this occasion, we'll use both, since this is a scraper for educational purposes.
Working on the product JSON
As a first step, we're extracting and cleaning the JSON: we need to replace the '\x22' string with a double quote so we can then use its fields and map them to the output fields of the spider.
To parse JSON strings in a Python program, we need to import the JSON package into the program.
import json
Now we’re ready to extract the JSON string and manipulate it to format it correctly.
json_data_str= response.xpath('//script[contains(text(), "cif_productData")]/text()').extract()[0].split('cif_productData = "')[1].split('productData')[0].strip()[:-2].replace('\\x22', '"')
The XPath selector looks for the script node in the HTML that contains the string "cif_productData =" in its text.
This string does not contain only the JSON but also some additional text that is not useful for our scope, so we split and adjust it to clean the JSON and make it parsable.
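For readability, here is the same extraction broken into single steps, equivalent to the one-liner we'll use in the spider:

# take the text of the script node containing the product data
raw_script = response.xpath('//script[contains(text(), "cif_productData")]/text()').extract()[0]
# keep only what follows the variable assignment
raw_json = raw_script.split('cif_productData = "')[1]
# drop the trailing text after the JSON and the closing characters
raw_json = raw_json.split('productData')[0].strip()[:-2]
# double quotes are escaped as \x22 in the page source
json_data_str = raw_json.replace('\\x22', '"')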
def parse_product_data(self, response):
    json_data_str = response.xpath('//script[contains(text(), "cif_productData")]/text()').extract()[0].split('cif_productData = "')[1].split('productData')[0].strip()[:-2].replace('\\x22', '"')
    #print(json_data_str)
    json_data = json.loads(json_data_str)
    product_code = json_data['responseData']['sku']
    full_price = json_data['responseData']['price_range']['maximum_price']['regular_price']['value']
    price = json_data['responseData']['price_range']['maximum_price']['final_price']['value']
    currency_code = json_data['responseData']['price_range']['maximum_price']['final_price']['currency']
    product_category = json_data['responseData']['product_hierarchy'].split('/')[3]
    product_subcategory = json_data['responseData']['product_hierarchy'].split('/')[4]
    gender = json_data['responseData']['gender']
    itemurl = response.url
    image_url = json_data['responseData']['image']['responseData']['url'].replace('[image]', 'image').replace('[divArea]', '500x0')
    product_name = response.xpath('//h1[@class="productInfo__title"]/text()').extract()[0]
    item = ValentinoItem(
        product_code=product_code,
        full_price=full_price,
        price=price,
        currency_code=currency_code,
        country_code='GBR',
        item_URL=itemurl,
        image_URL=image_url,
        product_name=product_name,
        gender=gender,
        product_category=product_category,
        product_subcategory=product_subcategory
    )
    yield item
After extracting the string from the first XPATH selector, I passed the resulting string into an online JSON parser for better visualization and understanding.
The JSON navigation is quite plain vanilla; we only needed to work on the image_url variable with some string substitutions.
For the product_name variable, instead, we used an XPath selector: we found out that the product name is inside the text of the H1 node, and we extracted it with a simple expression.
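For reference, the shape of the JSON we navigate looks roughly like this (reconstructed from the keys accessed in the code above; all values are placeholders, not real data):

{
  "responseData": {
    "sku": "XXXXXXXX",
    "gender": "W",
    "product_hierarchy": "/valentino/women/ready-to-wear/dresses",
    "price_range": {
      "maximum_price": {
        "regular_price": {"value": 2500.0},
        "final_price": {"value": 1500.0, "currency": "GBP"}
      }
    },
    "image": {"responseData": {"url": "https://.../[image]/[divArea]/example.jpg"}}
  }
}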
Managing the scraper output
We assigned the variables extracted by the scraper to the item's fields and, with the yield instruction, we export the item to the scraper's output.
With the command
scrapy crawl ecomscraper -o results.txt -t csv
we save the results to the results.txt file in CSV format, while with the -t json parameter we could store them as JSON.
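In more recent Scrapy versions you can also let the file extension pick the format (scrapy crawl ecomscraper -o results.csv), or configure the outputs once in settings.py through the FEEDS setting, something like:

FEEDS = {
    # the format is declared explicitly per output file
    "results.csv": {"format": "csv"},
    # "results.json": {"format": "json"},
}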
That’s it! We wrote our first scraper with Scrapy and you can find it in our GitHub repository available for all readers.
In the next lesson, we’ll see some Scrapy tricks to improve the reliability and the output of the scraper.