Web Scraping from 0 to hero: our first scraper with Microsoft Playwright
An introduction to web scraping with Playwright
Hi everyone and glad to see you again in our course “Web Scraping from 0 to Hero”, our biweekly free web scraping course provided by The Web Scraping Club.
In the previous article, we introduced Microsoft Playwright, a browser automation tool well suited to web scraping. Although its main focus is web application testing, it’s gaining more and more popularity in web scraping communities for its ability to automate different browsers with no hassle.
How the course works
The course is and will always be free. As always, I’m here to share, not to sell you something. If you want to say “thank you”, consider subscribing to this Substack with a paid plan. It’s not mandatory but appreciated, and you’ll get access to the whole “The LAB” archive, with 30+ practical articles on more complex topics, plus its code repository.
We’ll see free-to-use packages and solutions, and if some commercial ones show up, it’s because they are solutions I’ve already tested and that solve issues I cannot work around in any other way.
At first, I imagined this course as a monthly issue, but as I was writing down the table of contents, I realized it would take years to complete. So it will probably have a bi-weekly frequency, filling the gaps in the publishing plan without taking too much space away from more in-depth articles.
The collection of articles can be found using the tag WSF0TH and there will be a section on the main substack page.
When to use Scrapy and when Playwright?
When approaching a web scraping project, the choice between Scrapy and Playwright depends mainly on whether the target website has anti-bot protection and whether the content you need is loaded by JavaScript. Both tools are powerful, but if the data is available in the HTML code and there’s no anti-bot on the target website, there’s no need to use Playwright.
Scrapy’s performance is great: it’s asynchronous by design, you can scrape several pages in parallel, it’s lightweight on the machine where the scraper runs, and it’s designed for web scraping.
Playwright, and all the other browser automation tools like Selenium and Puppeteer, should be used as a last resort, when we’re tackling difficult websites that need a real browser to be scraped.
The initial setup
As an initial step, we should set up our environment. It’s quite straightforward but I’m linking the official documentation in case you need further assistance.
Install Python: Ensure you have Python installed on your system. You can download it from the official Python website.
Install Playwright: Once Python is installed, you can install Playwright using pip. Open your terminal or command prompt and run:
pip install playwright
Install Browsers: Playwright requires specific browser binaries to operate. Install them by running:
playwright install
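If you want a quick sanity check of the installation, a minimal script like the following should print a page title and exit cleanly (just a sanity check, not part of the scraper we’ll build; the URL is only a placeholder):

from playwright.sync_api import sync_playwright

# open a headless Chromium window, load a page and print its title
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')
    print(page.title())
    browser.close()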
With the setup out of the way, let's create our first scraper.
Our first scraper
You can write Playwright scrapers in different programming languages; I’ve opted for Python since it’s the only one I know.
I’ll use the same target website we used for the Scrapy example, valentino.com, so we can also compare the code and see the differences between the two tools. I’m jumping straight to the code this time, but if you want to read about all the preliminary phases you should run through before starting to code, have a look at the first part of the Scrapy tutorial.
Import the packages: sync or async?
In Playwright with Python, you can choose between two ways to write your scraper: synchronous or asynchronous. In the first case the commands are executed sequentially, while in the second you need to handle the parallelism yourself, so you don’t overload your machine or the target server.
For the sake of simplicity, we’ll create a sync scraper, which will be more readable.
from playwright.sync_api import sync_playwright
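Just for reference, the asynchronous flavor would be used roughly like this (a bare skeleton, assuming you manage the event loop with asyncio; we won’t use it in this lesson):

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto('https://www.valentino.com/en-gb/')
        print(await page.title())
        await browser.close()

asyncio.run(main())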
Let’s build the scraper logic
The website has the classic structure of a standard e-commerce site: one main menu, with products divided by main categories and subcategories listed in the sub-menus.
If we have a look at the rendered code of the page, via the Inspect tool, we can see that the sub-menus are visible only after the main menu is clicked.
On the other hand, if we have a look at the raw HTML, via the view source tool, we can find the same selectors we’ve used in the Scrapy spider.
In this part of the tutorial, we’ll use the raw HTML for scraping the data we need, while in the second part, coming in two weeks, we’ll have fun interacting with the rendered page.
Since we’re reading the raw HTML, the logic is quite simple: we’ll iterate over each subcategory and then scrape the data we need from the product list pages.
Parsing the HTML
While retrieving the raw HTML from Playwright is a piece of cake, Playwright doesn’t ship with an HTML parser, so we’ll need to import the one we like the most.
If you’re a fan of CSS selectors, or you already have experience with the package, you can use BeautifulSoup.
from bs4 import BeautifulSoup
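In that case, once a page is loaded you would feed page.content() to BeautifulSoup and use CSS selectors, roughly like this (a sketch only; we won’t use BeautifulSoup in the rest of the tutorial):

soup = BeautifulSoup(page.content(), 'html.parser')
# grab the href of every link with the column-element class
category_links = [a.get('href') for a in soup.select('a.column-element')]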
If instead you’re more proficient with XPath, you can choose between lxml and Scrapy’s HtmlResponse class. I’ll go with the latter for this example.
from scrapy.http import HtmlResponse
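For completeness, the lxml route would look roughly like this (again, just a sketch; the rest of the code uses HtmlResponse):

from lxml import html as lxml_html

# parse the rendered HTML string and run an XPath query on it
tree = lxml_html.fromstring(page.content())
categories = tree.xpath('//a[@class="column-element"]/@href')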
I’ll also include the libraries needed to handle the JSON inside the product pages and the CSV output.
import json
import csv
We create an instance of Playwright in synchronous mode and iterate over the product subcategories, just like we did in the Scrapy example.
with sync_playwright() as p:
    browser = p.chromium.launch(channel='chrome', headless=False)
    page = browser.new_page()
    page.goto('https://www.valentino.com/en-gb/', timeout=0)
    page.wait_for_load_state()
    html = HtmlResponse(url="my HTML string", body=page.content(), encoding='utf-8')
    categories = html.xpath('//a[@class="column-element"]/@href').extract()
    for cat in categories:
        print(cat)
        page.goto(cat, timeout=0)
        page.wait_for_load_state()
The line
browser = p.chromium.launch(channel='chrome', headless=False)
creates a new browser window; in this case, we’re using Chrome by specifying ‘chrome’ in the channel option.
Without this parameter, Playwright would open a Chromium window. We could also create a Firefox, Edge, or WebKit (the Safari engine) window, with a set of parameters you can find in the documentation; we’ll dig deeper into them in a more advanced lesson.
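For example, switching engine is just a matter of changing the launch call (illustrative lines, not used in this scraper):

browser = p.firefox.launch(headless=False)                      # Firefox
browser = p.webkit.launch(headless=False)                       # WebKit (Safari's engine)
browser = p.chromium.launch(channel='msedge', headless=False)   # Edge via the Chromium channel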
Inside this browser window, we opened a page, made it navigate to Valentino’s website URL, and then waited until the load finished.
page = browser.new_page()
page.goto('https://www.valentino.com/en-gb/', timeout=0)
page.wait_for_load_state()
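As a side note, wait_for_load_state() waits for the page’s load event by default; on pages that keep fetching data via JavaScript after that event, you can wait for a stricter state instead (whether it’s needed depends on the target site):

page.wait_for_load_state('networkidle')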
Only now can we read the HTML, extract the subcategory page URLs and, one subcategory at a time, open each of them.
Iterating to get the product
To create a scraper with the same scope as the Scrapy one we built some lessons ago, we need to crawl all the way down to the product pages. Now that we have understood the trick, we just need to repeat the HTML parsing step and iterate over each product in every category. Once we are on a product page, we can finally export the results of the page parsing to a CSV file.
    categories = html.xpath('//a[@class="column-element"]/@href').extract()
    for cat in categories:
        print(cat)
        page.goto(cat, timeout=0)
        page.wait_for_load_state()
        html_plp = HtmlResponse(url="my HTML string", body=page.content(), encoding='utf-8')
        products = html_plp.xpath('//a[@class="productCard__image"]/@href').extract()
        for product in products:
            page.goto(product, timeout=0)
            page.wait_for_load_state()
            html_pdp = HtmlResponse(url="my HTML string", body=page.content(), encoding='utf-8')
            json_data_str = html_pdp.xpath('//script[contains(text(), "cif_productData")]/text()').extract()[0].split('cif_productData = "')[1].split('productData')[0].strip()[:-2].replace('\\x22', '"')
            json_data = json.loads(json_data_str)
            product_code = json_data['responseData']['sku']
            full_price = json_data['responseData']['price_range']['maximum_price']['regular_price']['value']
            price = json_data['responseData']['price_range']['maximum_price']['final_price']['value']
            currency_code = json_data['responseData']['price_range']['maximum_price']['final_price']['currency']
            product_category = json_data['responseData']['product_hierarchy'].split('/')[3]
            product_subcategory = json_data['responseData']['product_hierarchy'].split('/')[4]
            gender = json_data['responseData']['gender']
            itemurl = product
            image_url = json_data['responseData']['image']['responseData']['url'].replace('[image]', 'image').replace('[divArea]', '500x0')
            product_name = html_pdp.xpath('//h1[@class="productInfo__title"]/text()').extract()[0]
            with open("output.txt", "a") as file:
                csv_file = csv.writer(file, delimiter="|")
                csv_file.writerow([product_code, full_price, price, currency_code, product_category, product_subcategory, gender, itemurl, image_url, product_name])
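The long one-liner that builds json_data_str can look intimidating, but it only isolates the escaped JSON assigned to cif_productData inside an inline script tag. Broken into steps, it reads roughly like this (same logic, just more verbose):

script_text = html_pdp.xpath('//script[contains(text(), "cif_productData")]/text()').extract()[0]
# keep what follows the 'cif_productData = "' assignment
after_assignment = script_text.split('cif_productData = "')[1]
# cut at the next occurrence of 'productData', then drop the trailing quote and semicolon
escaped_json = after_assignment.split('productData')[0].strip()[:-2]
# turn the \x22 escape sequences back into real double quotes and parse
json_data = json.loads(escaped_json.replace('\\x22', '"'))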
After the end of the last loop, we’re going to close the page and the browser session.
    page.close()
    browser.close()
Final remarks
We have created a scraper similar to the one we built with Scrapy in the past lessons. I deliberately left out some aspects, like the pagination inside every subcategory, so as not to overcomplicate these examples with details that are not useful right now.
What we can surely note from the comparison of the two scrapers is the speed: with Scrapy we could get the data in minutes, while with Playwright it would take hours to crawl all the single product pages, especially in sync mode as we’re operating here.
That’s exactly what I meant at the beginning of this article when I said that Playwright is surely a useful tool, but it should be used only when really needed; otherwise we end up with longer execution times and more resource-demanding scrapers.