The Lab #38: Bypassing Kasada for web scraping 2024 edition
Another article with tools and techniques to bypass an anti-bot
Let’s continue 2024 by writing another article where we bypass an anti-bot solution to get some public data, like prices from an e-commerce website.
After writing two posts about Cloudflare (here’s part one and at this link you can find part two), today it’s the turn of Kasada, a niche anti-bot solution with a different approach to bot detection. It’s used by large platforms like Twitch and e-commerce websites like Canada Goose, which will be our testing target for this article.
How does Kasada work?
Kasada works in a peculiar way compared to other solutions. Think of it as a sort of firewall: instead of inspecting the connection port, it collects hundreds of data points from our browser, applies some AI, and then lets the request complete successfully or not. Kasada’s core business is fraud prevention, so I assume more checks and tests run at the moment of login or purchase, but to me that’s unexplored territory, so let’s stick to what we can see when browsing public data.
When we send the first request to a website protected by Kasada, we automatically get a 429 response: it’s a challenge triggered by the anti-bot, which passes our parameters to the Kasada system.
If the data this request passes to the Kasada backend is consistent with a plausible fingerprint generated by a human browsing, the request gets redirected to the requested URL; otherwise, you’ll see only a white page.
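To make this first-request behavior concrete, here’s a minimal sketch of how you could observe that status code with Python’s standard library. The function name and headers are mine, not from the original code, and a bare client like this carries no browser fingerprint, so against a Kasada-protected site it typically surfaces the challenge response:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError


def first_request_status(url: str) -> int:
    """Return the HTTP status of a bare, fingerprint-less request.

    Against a Kasada-protected site this usually shows the 429 challenge
    instead of the page you asked for.
    """
    # A User-Agent alone is not enough: Kasada checks many more signals.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is on the exception.
        return err.code
```
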
Given this, I approached scraping the canadagoose.com e-commerce website with three tools, starting with undetected-chromedriver.
All three solutions provided in this article can be found in the private repository on GitHub, available for paying users. If you’re one of them but don’t have access, please write me at firstname.lastname@example.org with your GitHub username since I need to add you manually.
First solution: undetected-chromedriver
Undetected-chromedriver is a Selenium Chromedriver patch that should not trigger anti-bot services.
If you’re not familiar with Selenium and webdrivers, let’s take a step back.
Selenium is a browser automation tool used for testing web apps: it provides a framework for interacting with web pages. It allows you to write code that can perform actions like clicking buttons, entering text, and extracting data from web pages. For this reason, it’s well-known also in the web scraping industry. To communicate with the browser it needs the so-called webdrivers, which are links between the Selenium commands and the browsers. Every browser has its webdriver: ChromeDriver for Google Chrome, GeckoDriver for Mozilla Firefox, and EdgeDriver for Microsoft Edge.
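If you’ve never used Selenium, the smallest useful session looks roughly like this. It’s my own illustration, not code from the article’s repository, and the import happens inside the function only so the sketch loads even where Selenium isn’t installed:

```python
def fetch_title(url: str) -> str:
    """Open a page with Selenium and return its <title> tag.

    The smallest useful Selenium session: start Chrome through ChromeDriver,
    navigate, read a property, and shut the browser down.
    """
    from selenium import webdriver  # deferred import: sketch loads without Selenium

    driver = webdriver.Chrome()  # ChromeDriver bridges these commands to Chrome
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # always release the browser process
```
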
So undetected-chromedriver is a version of the standard chromedriver, more focused on the features and settings needed for web scraping.
The scraper should work as follows: load the home page, get the list of all the product categories, and iterate through it. For every product category, the scraper should scroll down to the bottom to load every product, since the page uses infinite scroll and loads more items as you scroll down, if any are available. Once all the items for a single category are listed on the page, we can scrape their information.
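The infinite-scroll step can be sketched as a small helper that keeps scrolling until the page height stops growing — my own illustration of the technique, not code from the repository; the pause length and round cap are arbitrary:

```python
import time


def scroll_to_bottom(driver, pause: float = 2.0, max_rounds: int = 50) -> int:
    """Scroll until the page height stops growing, i.e. the infinite
    scroll has no more products to load. Returns the final page height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazy-loaded products time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: nothing more was loaded
        last_height = new_height
    return last_height
```

It works with any Selenium-style driver object, since it only relies on `execute_script`.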
Load the home page
import undetected_chromedriver as uc
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options.headless = False  # keep the browser window visible
driver = uc.Chrome(options=options)
driver.get('https://www.canadagoose.com/')
After the usual imports, we create the list of options we need to pass to undetected-chromedriver.
The other option I usually add is '--disable-blink-features=AutomationControlled', which disables the banner on the browser stating that it’s controlled by an automation tool. I’m not aware of any anti-bot detecting this feature, but to make the session look like a human-controlled one, I always disable it. With undetected-chromedriver this is not needed, since it’s disabled by default.
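For reference, with plain Selenium you would pass that flag yourself. A hypothetical helper follows; the deferred import is only so the sketch loads even where Selenium isn’t installed:

```python
def stealth_options():
    """Build ChromeOptions with the automation banner disabled.

    Note: undetected-chromedriver already applies this by default,
    so this is only needed with a stock Selenium ChromeDriver.
    """
    from selenium import webdriver  # deferred import: sketch loads without Selenium

    options = webdriver.ChromeOptions()
    options.add_argument("--disable-blink-features=AutomationControlled")
    return options
```
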
Closing the popup and selecting the categories
The first thing I do after loading the homepage is to close the popup that opens when the homepage is loaded for the first time.
Then, I create a list of all the URLs of the product categories, so I can use it as input for the second part of the scraper.
for cat_url in categories:
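Gathering those category URLs could look like the sketch below. The CSS selector is a placeholder I made up — inspect the live page for the real one — and `"css selector"` is simply the string value behind Selenium’s `By.CSS_SELECTOR`:

```python
def collect_category_urls(driver) -> list:
    """Collect the href of every product-category link in the navigation.

    'nav a.category-link' is a hypothetical selector, not the real one
    from canadagoose.com — adapt it after inspecting the page.
    """
    # "css selector" is the plain-string form of selenium's By.CSS_SELECTOR
    links = driver.find_elements("css selector", "nav a.category-link")
    return [link.get_attribute("href") for link in links]
```

The returned list can then feed the `for cat_url in categories:` loop above.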