The Lab #35: Bypassing PerimeterX with Python and Playwright
Bypassing PerimeterX with free Python tools in 2023.
What is PerimeterX and how does it work?
PerimeterX (now part of HUMAN Security) is one of the best-known anti-bot solutions on the market.
It employs a sophisticated approach involving behavioral analysis and predictive detection, combined with various fingerprinting methods. These techniques assess multiple factors to distinguish between authentic users and automated bots attempting to access website resources.
The system's defenses are powered by machine learning algorithms that scrutinize requests and predict potential bot activity. If any suspicious behavior is observed, PerimeterX may deploy challenges to verify whether the source is a bot. In clear-cut bot scenarios, it blocks the IP address from accessing site resources.
PerimeterX's major defenses fall into four primary categories:
IP Monitoring: PerimeterX rigorously analyzes the IPs visiting a protected site. It examines past requests from the same IP, the frequency of these requests, and the intervals between them to spot bot-like patterns. The system also investigates the IP's origin, whether it's from an ISP or a data center, checks against known bot networks, and assesses the IP's historical reputation, assigning a reputation score to determine its trustworthiness.
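To make the timing part of this concrete, here is a minimal sketch of how evenly spaced request timestamps can be flagged as machine-like. This is my own illustration of the idea, not PerimeterX's actual algorithm, and the thresholds are arbitrary examples:

```python
import statistics

def looks_bot_like(timestamps, min_requests=5, cv_threshold=0.1):
    """Flag an IP whose request intervals are suspiciously regular.

    Humans produce irregular gaps between requests; a naive bot fires
    at a near-constant rate. We measure the coefficient of variation
    (stdev / mean) of the intervals: a value close to zero means
    machine-like regularity.
    """
    if len(timestamps) < min_requests:
        return False  # not enough data to judge
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(intervals)
    if mean == 0:
        return True  # a burst with no gap at all
    cv = statistics.stdev(intervals) / mean
    return cv < cv_threshold

# A bot hitting the site exactly every 2 seconds...
print(looks_bot_like([0, 2, 4, 6, 8, 10]))     # True
# ...versus a human browsing with irregular pauses.
print(looks_bot_like([0, 3, 11, 14, 30, 41]))  # False
```

A real system would combine this signal with ASN lookups, reputation lists, and per-endpoint rate limits, but the core intuition is the same: regularity is suspicious.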
HTTP Headers: These headers in HTTP requests and responses reveal important details about the request. PerimeterX uses this data to identify bot-like activities, scrutinizing whether the headers are unique to specific browsers or default ones used by HTTP libraries. Inconsistencies or missing headers often lead to denial of access, as legitimate browsers usually send consistent and complete header information.
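The defender's logic here can be sketched as a simple checker. The header names and User-Agent tokens below are illustrative examples of what such a system might look for, not PerimeterX's real rules:

```python
# Headers a real browser always sends; an illustrative, not exhaustive, list.
EXPECTED_BROWSER_HEADERS = {"user-agent", "accept", "accept-language", "accept-encoding"}

def header_red_flags(headers):
    """Return reasons a request's headers look library-generated."""
    lower = {k.lower(): v for k, v in headers.items()}
    flags = []
    ua = lower.get("user-agent", "")
    # Default User-Agents of common HTTP clients are an instant giveaway.
    if any(token in ua.lower() for token in
           ("python-requests", "python-urllib", "curl/", "go-http-client")):
        flags.append(f"library default User-Agent: {ua}")
    missing = EXPECTED_BROWSER_HEADERS - lower.keys()
    if missing:
        flags.append(f"missing browser headers: {sorted(missing)}")
    return flags

# The bare defaults of an HTTP library trip both checks:
print(header_red_flags({"User-Agent": "python-requests/2.31.0",
                        "Accept": "*/*", "Accept-Encoding": "gzip, deflate"}))
```

This is why scrapers copy a full, consistent header set from a real browser session rather than relying on a library's defaults.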
Fingerprinting: This complex defense layer includes several techniques:
Browser Fingerprinting: Gathers details about the browser, its version, settings like screen resolution, operating system, installed plugins, and more to create a unique user fingerprint.
HTTP/2 Fingerprinting: Provides additional request details, similar to HTTP headers, but with more comprehensive information like stream dependencies and flow control signals.
TLS Fingerprinting: Analyzes the initial, unencrypted information shared during the TLS handshake, such as device details and TLS version, to identify unusual request parameters.
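The common thread of all three techniques is that a set of attributes, none unique on its own, combines into a near-unique identifier. A simplified sketch of that idea (real fingerprinting libraries mix in dozens more signals, such as canvas rendering, fonts, and WebGL):

```python
import hashlib
import json

def browser_fingerprint(attrs):
    """Hash a dictionary of browser attributes into a stable identifier.

    Two visits exposing identical attributes collapse to the same ID,
    which is how a site can recognize a returning client even without
    cookies. This is a didactic sketch, not a production fingerprinter.
    """
    canonical = json.dumps(attrs, sort_keys=True)  # order-independent
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

visit_a = {"userAgent": "Mozilla/5.0 ...", "screen": "1920x1080",
           "timezone": "Europe/Rome", "plugins": ["PDF Viewer"]}
visit_b = dict(visit_a)  # same machine, same browser
print(browser_fingerprint(visit_a) == browser_fingerprint(visit_b))  # True
```

The flip side matters for scrapers: change even one attribute inconsistently (say, a Chrome User-Agent with a Firefox TLS signature) and the fingerprint no longer matches any real browser population.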
CAPTCHAs and Behavioral Analysis: PerimeterX uses its own CAPTCHA, 'HUMAN Challenge', to differentiate between humans and bots. It monitors webpage interactions, like mouse movements and keystroke speeds, to detect non-human behavior. This is highly effective as replicating the complexity of genuine user interactions is difficult for bots.
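This is why automation tools that move the cursor in a perfectly straight line at constant speed stand out. A hedged sketch of generating a curved, jittered mouse path (my own illustration; each point could then be fed to something like Playwright's `page.mouse.move(x, y)` with small random delays):

```python
import random

def human_mouse_path(start, end, steps=25, jitter=3.0):
    """Generate a curved, jittered sequence of points between two coordinates.

    A real user never moves the cursor in a straight line at constant
    speed, so we bend the path along a quadratic Bezier curve through a
    random control point and add small per-step noise.
    """
    (x0, y0), (x1, y1) = start, end
    # A random control point pulls the path into a curve.
    cx = (x0 + x1) / 2 + random.uniform(-100, 100)
    cy = (y0 + y1) / 2 + random.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Quadratic Bezier interpolation plus per-point jitter.
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        points.append((x + random.uniform(-jitter, jitter),
                       y + random.uniform(-jitter, jitter)))
    return points

path = human_mouse_path((100, 100), (500, 400))
print(len(path), path[0], path[-1])
```

Whether this fools a given behavioral model is anyone's guess; the point is only that synthetic interaction needs deliberate randomness to resemble human input at all.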
Despite these robust defenses, we can still use some techniques and tools to bypass them and scrape data from target websites.
All the techniques and tools we see in this article are for testing purposes and should not be used to harm any website or its business. Please use web scraping techniques in an ethical way. If you have any doubt about your web scraping operations, please seek legal advice.
Analyze the target website
The target website for this test is Neiman Marcus, a well-known department store that also has an e-commerce website, from where we’ll try to scrape product prices.
We immediately see from Wappalyzer that it's protected by PerimeterX, so a standard Scrapy setup is off the table.
As for the scraping strategy, let's load a product category page and see what happens.
When opening this Women's Boots catalog, we notice that the first page loads with the product data both in the HTML and in an embedded JSON object. If we go to page two, the website calls an API endpoint that returns all the data we need.
We can load the following URL in our browser to get the same results.
This means that all we need is this categoryID: with it, we can happily crawl the website, gathering data from its API endpoint, saving bandwidth and keeping the request count low.
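The crawl loop can then be as simple as iterating over pages against that endpoint. The base URL, parameter names, and page size below are placeholders of my own, not Neiman Marcus's real API shape, and the fetch function is injected so the paging logic stays testable without network access:

```python
from urllib.parse import urlencode

# Hypothetical endpoint shape -- the real URL and parameters are the ones
# observed in the browser's network tab, not reproduced here.
API_BASE = "https://www.example.com/catalog/api/products"

def category_page_url(category_id, page, page_size=60):
    """Build the paginated API URL for a product category."""
    query = urlencode({"categoryId": category_id, "page": page,
                       "pageSize": page_size})
    return f"{API_BASE}?{query}"

def crawl_category(category_id, fetch_page, max_pages=100):
    """Walk a category page by page until the API returns no more products.

    `fetch_page` is injected (e.g. an hRequests or requests call that
    returns the decoded JSON list of products for a URL).
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(category_page_url(category_id, page))
        if not batch:
            break  # an empty page means we've reached the end
        items.extend(batch)
    return items
```

Separating URL building, fetching, and paging also makes it trivial to swap the transport layer later, which turns out to be useful below.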
First solution: hRequests ✅
In our previous The LAB article, where we tested the hRequests package, we already saw that it could be a solution.
In the GitHub repository, available to the paying readers of the newsletter, you will find the full code of the Scrapy spider that uses hRequests to get all the prices from a product category.
Lately, though, after a Python package upgrade on my machine, something broke and not all of hRequests' features work anymore. Still, the scraper partially works on the whole Women's Clothing category (roughly 22k items): running from a local environment without any proxy, I could gather about 5k items before receiving a HUMAN Challenge captcha.
Let's try adding a rotating residential proxy service and some error handling: whenever we get a captcha, we'll close the hRequests session and open a new one, which will load the first page of the product category.
By doing so, we reset the session, clear the previously collected cookies, and get fresh ones. Changing the IP alone is not enough if we don't also get rid of the cookies that record our failure to pass the challenge.
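That reset logic can be sketched as a loop that rebuilds the session from scratch whenever a captcha page comes back. Session creation and captcha detection are injected here, since the hRequests specifics (and the captcha marker in the HTML) depend on the target site; treat the names as illustrative:

```python
def scrape_with_resets(urls, new_session, fetch, is_captcha, max_resets=10):
    """Fetch every URL, replacing the whole session when a captcha appears.

    `new_session` returns a fresh session (new cookies and, behind a
    rotating residential proxy, a new exit IP); `fetch(session, url)`
    returns the response body; `is_captcha(body)` detects the challenge.
    A new IP alone is useless if the old cookies, which record the failed
    challenge, come along for the ride -- hence the full session swap.
    """
    results = {}
    session = new_session()
    resets = 0
    queue = list(urls)
    while queue:
        url = queue[0]
        body = fetch(session, url)
        if is_captcha(body):
            resets += 1
            if resets > max_resets:
                raise RuntimeError("still blocked after max session resets")
            session = new_session()  # fresh cookies + (with rotation) fresh IP
            continue  # retry the same URL with the clean session
        results[url] = body
        queue.pop(0)  # done with this URL, move on
    return results
```

In the real spider, `new_session` would also close the old hRequests session and re-load the category's first page to pick up valid cookies before resuming the API calls.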