THE LAB #7: Scraping PerimeterX-protected websites
Is scraping a PerimeterX-protected website as difficult as it seems?
Here’s another post of “THE LAB”: in this series, we'll cover real-world use cases, with code and an explanation of the methodology used.
Being a paying subscriber gives you:
Access to paid content, like the post series called “The LAB”, where we dive deep into real-world cases with code (view here an example).
Access to the GitHub repository with the code seen in “The LAB”
Access to private channels on our Discord server
But if you prefer to read this newsletter for free, you will still get one post per week about:
News about web scraping
Anti-bot software and techniques insights
Interviews with key people in the industry
And you can always join the Web Scraping Club Discord server
Enough housekeeping for now; let’s start.
What is PerimeterX?
PerimeterX is one of the most well-known anti-bot solutions, used by some of the top-tier websites on the net. They recently merged with Human Security, another company in the anti-bot industry but more focused on fraud prevention and account abuse.
How to detect the PerimeterX anti-bot solution?
If we analyze the tech stack of the target website with Wappalyzer, PerimeterX appears in the security section with a good degree of precision. Detecting it by inspecting the network tab in the browser's developer tools is also pretty easy. When it is active, you will see that PerimeterX sets a cookie with the following format when the first page of the website loads.
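The cookie check can also be scripted. Here is a minimal sketch, assuming that PerimeterX cookie names start with `_px` (for example `_px3` or `_pxhd`); that prefix, the function name `find_px_cookies` and the sample headers are illustrative assumptions, not something taken from the target website.

```python
# Assumption: PerimeterX cookie names start with "_px" (e.g. "_px3", "_pxhd").
PX_COOKIE_PREFIX = "_px"


def find_px_cookies(set_cookie_headers):
    """Return the sorted names of cookies that look like PerimeterX cookies.

    `set_cookie_headers` is a list of raw Set-Cookie header values,
    such as the ones visible in the network tab for the first page load.
    """
    found = set()
    for header in set_cookie_headers:
        # The cookie name is everything before the first "=" sign.
        name = header.split("=", 1)[0].strip()
        if name.lower().startswith(PX_COOKIE_PREFIX):
            found.add(name)
    return sorted(found)


# Hypothetical headers, made up for illustration only.
sample_headers = [
    "_pxhd=abc123; Path=/; Max-Age=31536000",
    "session_id=xyz; Path=/; HttpOnly",
    "_px3=def456; Path=/",
]
print(find_px_cookies(sample_headers))
```

An empty result does not prove that PerimeterX is absent, since the cookie may be set later by JavaScript, so this is only a quick first check next to Wappalyzer.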
The Human Challenge
The Human Challenge is the PerimeterX “trademark” when talking about anti-bot challenges. Instead of throwing a CAPTCHA or a reCAPTCHA, their anti-bot solution shows a big button that a human must keep “pressed” with the mouse until the challenge is solved.
In a 2020 interview published on the company website, Gad Bornstein, product manager at the time and later at Meta, explained that this peculiar solution has several advantages for both website users and owners.
It’s 5x faster for humans to solve compared to other solutions, and this leads to a 10-15x lower abandonment rate with Human Challenge compared to reCAPTCHA.
Another interesting topic in this interview is how the solution works:
Bot Defender also works in real time, so every time a user gets a new page, we calculate their behavior, path, fingerprints and all those machine learning models. Then we get a score that defines whether you're a human or a bot.
And if you are categorized as a bot, the Challenge triggers. This means that scrapers need not only to look like humans but also to act like them.
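To make the idea concrete, here is a toy sketch of a per-request score followed by a challenge decision. It is not PerimeterX's algorithm: the three signals only mirror the ones named in the interview (behavior, path, fingerprints), while the equal weights, the 0-to-1 scale and the threshold are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class RequestSignals:
    """Hypothetical signals, each from 0.0 (bot-like) to 1.0 (human-like)."""
    behavior: float     # e.g. mouse movements and timing
    path: float         # e.g. how plausible the navigation sequence is
    fingerprint: float  # e.g. how consistent the browser fingerprint is


# Made-up threshold: below this score the visitor is treated as a bot.
CHALLENGE_THRESHOLD = 0.5


def human_score(signals: RequestSignals) -> float:
    """Combine the signals with equal weights (an illustrative choice)."""
    return (signals.behavior + signals.path + signals.fingerprint) / 3


def should_challenge(signals: RequestSignals) -> bool:
    """Return True when the score falls below the challenge threshold."""
    return human_score(signals) < CHALLENGE_THRESHOLD


# A perfect fingerprint with bot-like behavior and navigation still fails.
spoofed_only = RequestSignals(behavior=0.1, path=0.1, fingerprint=1.0)
human_like = RequestSignals(behavior=0.9, path=0.8, fingerprint=0.9)
print(should_challenge(spoofed_only), should_challenge(human_like))
```

The point of the sketch is that faking a single signal, such as the fingerprint, is not enough when the decision depends on all of them together and is recomputed on every page.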
Finding a test website for this article has not been easy: the websites I knew were using PerimeterX have now paired it with Cloudflare Bot Management, which would have affected our tests.
We’ll use neimanmarcus.com as the target website, but before we start coding, I’m sharing this good article about PerimeterX by ZenRows. It explains in detail how PerimeterX works and describes a hypothetical approach to reverse engineering it.
In my opinion, although it is very interesting for understanding what happens under the hood, I would not implement this kind of solution in my scrapers. The algorithm can change often, and every change means restarting the reverse engineering process. The most durable method is instead to simulate a real user, using as few resources as possible.