5 useful Playwright features for web scraping
Or: 5 things I wish I had known when I started using it
Microsoft Playwright is a framework for Web Testing and Automation, released in January 2020 and developed by a team of engineers who had previously worked on similar projects like Puppeteer at Google.
Since it’s built to automate Chromium-based browsers as well as Firefox and WebKit (not yet supported by Puppeteer), it can be used to test web apps but also for web scraping when a real browser is needed to bypass anti-bot systems. While it’s not specifically designed for web scraping, it has several features that can be useful when we design our scrapers. In this post, we’ll see five useful Playwright features for our bot design.
Network Intercept
Playwright provides APIs to monitor and modify network traffic, both HTTP and HTTPS, with just a few lines of code.
Let’s say we want to print all the network calls made when loading a page, without opening the network tab in the browser.
With the page.route() method, we can do it using the following code:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=200)
    page = browser.new_page()

    def log_and_continue_request(route, request):
        # Print the URL of the intercepted request, then let it proceed unchanged
        print(request.url)
        route.continue_()

    # Log and continue all network requests
    page.route("**/*", log_and_continue_request)
    page.goto("https://www.immobiliare.it/vendita-case/milano/?criterio=rilevanza")
    browser.close()
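If the scraper needs to, we can also modify the request headers at runtime by passing a headers argument to route.continue_(), or answer the request ourselves with a custom response using route.fulfill().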
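As a minimal sketch of the header override, the handler below merges a few extra headers into each outgoing request and then lets it continue through route.continue_(headers=...). The helper names (merge_headers, make_header_handler, open_with_headers) and the header values you pass in are illustrative assumptions, not part of the original scraper.

```python
def merge_headers(original, extra):
    """Return a copy of the request headers with the extra entries applied."""
    merged = dict(original)
    # Playwright exposes request header names in lower case, so normalise ours too.
    merged.update({name.lower(): value for name, value in extra.items()})
    return merged


def make_header_handler(extra):
    """Build a route handler that overrides headers, then continues the request."""
    def handler(route, request):
        route.continue_(headers=merge_headers(request.headers, extra))
    return handler


def open_with_headers(url, extra):
    """Load url in Chromium with the extra headers applied to every request."""
    # Imported here so the helpers above can be reused without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.route("**/*", make_header_handler(extra))
        page.goto(url)
        browser.close()
```

For example, open_with_headers(url, {"Accept-Language": "en-US,en;q=0.9"}) would send that language preference with every request made by the page.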
Video Record of the bot execution
Struggling to understand what’s happening when your headless scraper runs?
You can record a video of what happens inside a browser context, using this code:
import time
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True, slow_mo=2000)
    # Every page opened in this context is recorded to its own file in videos/
    context = browser.new_context(
        record_video_dir="videos/",
        record_video_size={"width": 640, "height": 480}
    )
    page = context.new_page()
    page.goto("https://www.immobiliare.it/vendita-case/milano/?criterio=rilevanza")
    page.wait_for_load_state()
    page2 = context.new_page()
    page2.goto("https://www.casa.it/")
    page2.wait_for_load_state()
    time.sleep(3)
    # Videos are saved when the context is closed, so close it before the browser
    context.close()
    browser.close()
This feature is new to me too, and I’m still figuring out why my videos are sometimes empty, so if you know how to make it work properly, please let me know in the comment section.
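One thing worth trying for the empty videos: the Playwright documentation states that recordings are saved when the browser context is closed, so closing the context explicitly before the browser may help. The sketch below does that and then lists the recordings that actually contain data. The function names and the default videos/ directory are assumptions made for this example.

```python
from pathlib import Path


def non_empty_videos(video_dir):
    """Return the recorded .webm files in video_dir that contain data."""
    return sorted(
        path for path in Path(video_dir).glob("*.webm") if path.stat().st_size > 0
    )


def record_visit(url, video_dir="videos/"):
    """Visit url in a recording context and return the videos that were saved."""
    # Imported here so non_empty_videos() can be used without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            record_video_dir=video_dir,
            record_video_size={"width": 640, "height": 480},
        )
        page = context.new_page()
        page.goto(url)
        page.wait_for_load_state()
        # Close the context first: this is when the video files are finalised.
        context.close()
        browser.close()
    return non_empty_videos(video_dir)
```

If record_visit() returns an empty list, the recording was either not written or has zero length, which narrows down where the problem is.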
iFrame Handling
iFrames can be a challenge to handle when you need to extract data from them. Playwright makes it easy to select them and query their content via selectors.
Let’s have a look at the test page https://seleniumbase.io/w3schools/iframes.
The text in the box is contained in an iFrame whose content is served from https://seleniumbase.io/w3schools/demo_iframe.htm
Playwright allows us to select an iFrame on a page using its name or its URL, and then query it using selectors.
With this code, I’m printing out the href attribute of the link in the blue box.
import re
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=2000)
    page = browser.new_page()
    page.goto("https://seleniumbase.io/w3schools/iframes")
    # A plain string is treated as a glob pattern, so compile the regex explicitly
    frame = page.frame(url=re.compile(r".*demo_iframe.*"))
    # Run inside the frame: return the href of the first link it contains
    content = frame.eval_on_selector("a", "el => el.href")
    print(content)
    browser.close()
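Note that page.frame() returns None when nothing matches, in which case the eval_on_selector call fails with an AttributeError. A small wrapper can try the name first, then the URL, and raise a clearer error. This is only a sketch: find_frame and first_link_in_frame are hypothetical helper names, not Playwright APIs.

```python
import re


def find_frame(page, name=None, url_pattern=None):
    """Return the first frame matching name or url_pattern, or raise if none does."""
    frame = None
    if name is not None:
        frame = page.frame(name=name)
    if frame is None and url_pattern is not None:
        # A compiled regular expression is matched against each frame's URL.
        frame = page.frame(url=re.compile(url_pattern))
    if frame is None:
        raise LookupError(
            f"no frame found for name={name!r}, url_pattern={url_pattern!r}"
        )
    return frame


def first_link_in_frame(page_url, url_pattern):
    """Open page_url and return the href of the first link inside the matching frame."""
    # Imported here so find_frame() can be tested without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(page_url)
        frame = find_frame(page, url_pattern=url_pattern)
        href = frame.eval_on_selector("a", "el => el.href")
        browser.close()
    return href
```

With the test page above, first_link_in_frame("https://seleniumbase.io/w3schools/iframes", r"demo_iframe") should behave like the original snippet, assuming the page structure has not changed.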
Modify geolocation and timezone
As we saw in a previous post of The Lab, where we changed the geolocation coordinates to scrape the Lowe’s website with different pickup stores, we can give the browser a pair of coordinates that differs from the real one.
We can also override the timezone of our context, to create a more realistic fingerprint.
import time
from playwright.sync_api import sync_playwright

# Placeholder: the Chromium flags used in the original script are not shown in this post
CHROMIUM_ARGS = []
BRAVE_PATH = '/Applications/Brave Browser.app/Contents/MacOS/Brave Browser'

with sync_playwright() as p:
    # First run: grant the geolocation permission without overriding any value
    browser = p.chromium.launch(executable_path=BRAVE_PATH, headless=False, slow_mo=200, args=CHROMIUM_ARGS, ignore_default_args=["--enable-automation"])
    context = browser.new_context(
        permissions=["geolocation"],
    )
    page = context.new_page()
    page.goto("https://browserleaks.com/geo")
    time.sleep(10)
    browser.close()

    # Second run: override both the coordinates and the timezone
    browser = p.chromium.launch(executable_path=BRAVE_PATH, headless=False, slow_mo=200, args=CHROMIUM_ARGS, ignore_default_args=["--enable-automation"])
    context = browser.new_context(
        geolocation={"longitude": -73.935242, "latitude": 40.730610},
        permissions=["geolocation"],
        timezone_id='America/New_York'
    )
    # The coordinates can also be changed after the context has been created
    context.set_geolocation({"longitude": -73.935242, "latitude": 40.730610})
    page = context.new_page()
    page.goto("https://browserleaks.com/geo")
    time.sleep(10)
    browser.close()
With this code, we can test the Browser API responsible for geolocation and timezone, showing original values at first and then the modified ones.
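To avoid repeating these settings in every scraper, the context options can be built by a small helper that also rejects impossible coordinates. The sketch below is one way to do it, and it reads back the timezone seen by JavaScript as a quick check. The function names are hypothetical, and reading Intl.DateTimeFormat is just one convenient way to verify the override.

```python
def geo_context_options(latitude, longitude, timezone_id):
    """Build keyword arguments for browser.new_context() with a spoofed location."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    return {
        "geolocation": {"latitude": latitude, "longitude": longitude},
        "permissions": ["geolocation"],
        "timezone_id": timezone_id,
    }


def reported_timezone(url, options):
    """Open url with the given context options and return the timezone JavaScript sees."""
    # Imported here so geo_context_options() can be used without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(**options)
        page = context.new_page()
        page.goto(url)
        timezone = page.evaluate(
            "() => Intl.DateTimeFormat().resolvedOptions().timeZone"
        )
        context.close()
        browser.close()
    return timezone
```

Calling reported_timezone(url, geo_context_options(40.730610, -73.935242, "America/New_York")) should return the overridden timezone rather than the one of the machine running the scraper.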
Emulating a device
Still using a browser context, we can choose to emulate a specific device for our scraping activity. This means that Playwright will configure our context with a set of values for user agent, screen size, viewport, and other parameters.
import time
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # A device descriptor is a dictionary of context options
    iphone_13 = p.devices['iPhone 13 Pro landscape']
    browser = p.webkit.launch(headless=False)
    context = browser.new_context(**iphone_13)
    page = context.new_page()
    page.goto("https://www.site24x7.com/tools/browser-fingerprint-test.html")
    time.sleep(10)
    browser.close()
The full list of supported devices, with the parameters changed in the browser context, can be found in Playwright’s GitHub repository.
In our code example, the settings changed were the following:
"iPhone 13 Pro landscape": {
    "userAgent": "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
    "screen": {
        "width": 390,
        "height": 844
    },
    "viewport": {
        "width": 750,
        "height": 342
    },
    "deviceScaleFactor": 3,
    "isMobile": true,
    "hasTouch": true,
    "defaultBrowserType": "webkit"
},
This doesn’t cover all the parameters that modern anti-bot systems use to build a complex device fingerprint, but it can be helpful on some occasions.
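Since a device descriptor in the Python API is a plain dictionary of context options, it can be combined with other settings such as locale or timezone before creating the context. The sketch below shows one possible way. customise_device and open_as_device are hypothetical helper names, and which overrides make sense depends on the target website.

```python
def customise_device(descriptor, **overrides):
    """Return a copy of a device descriptor with extra context options applied."""
    options = dict(descriptor)
    options.update(overrides)
    return options


def open_as_device(url, device_name, **overrides):
    """Open url while emulating device_name and return the user agent the page sees."""
    # Imported here so customise_device() can be used without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        options = customise_device(p.devices[device_name], **overrides)
        browser = p.webkit.launch(headless=True)
        context = browser.new_context(**options)
        page = context.new_page()
        page.goto(url)
        user_agent = page.evaluate("() => navigator.userAgent")
        context.close()
        browser.close()
    return user_agent
```

For example, open_as_device(url, "iPhone 13 Pro landscape", locale="it-IT", timezone_id="Europe/Rome") would keep the device values from the descriptor and add an Italian locale and timezone on top.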
Final remarks
This short list of Playwright features is not meant to be exhaustive; the intention here is to highlight some perks of using it.
Doing the same things with Selenium could have been more difficult and time-consuming, and this explains why Playwright is gaining traction as a tool for headful web scraping.