The Lab #42: Bypassing PerimeterX without a browser automation tool
Bypassing PerimeterX with free tools and without running a browser
Welcome back to another episode of The Lab, our series of articles where we write some code to solve some common issues in web scraping. And what’s more common than being blocked by an anti-bot?
In the past weeks, we have seen how to bypass the most common anti-bots we can find.
All these articles give some hints on how to bypass a certain anti-bot in a real-world case, but unfortunately there’s no silver bullet available.
Different websites with the same anti-bot installed can set different engagement rules and protection levels, and the countermeasures can change even within the same website.
Let’s take as an example the famous online travel listing Booking.com: the website is protected by PerimeterX and uses some internal APIs to show the results of your queries about your next trips. If you browse the website itself, you need some tool to bypass PerimeterX, but the internal APIs are not protected, probably intentionally.
There could be different reasons for that:
getting data from the APIs is less resource-intensive for both the scraper and the website, so they prefer that people use them to get data
they know that some external applications are using them
web scraping could be nothing more than a bother if done responsibly.
In my opinion, the last point is probably a common pattern: yes, web scraping can be a nuisance, but the most important use case of anti-bots is fraud detection, such as stopping bots that automate the purchase of a certain item as soon as it’s published (think about the sneaker market). This explains why websites selling “rare” items like the most recent drops of sneakers or streetwear, or even Hermes, are the most protected, maybe not on the home page, but for sure when you try to purchase something.
But let’s go back to the 100% legal web scraping of public information, which is the main content of this newsletter.
We just mentioned PerimeterX, which is a widespread anti-bot solution we already covered in the past.
Two months ago we wrote about how to bypass it using Playwright, since almost every anti-bot requires a JS rendering engine to solve its challenges and, in most cases, scrapy_splash is not enough.
But since there are a lot of OSS tools available on GitHub for web scraping, are we sure we cannot use any of them to avoid launching Playwright and make scraping faster and less resource-intensive?
Spoiler: yes, I’ve found one ✅
How to detect a website using PerimeterX?
PerimeterX (now Human Scraping Defense) is known for throwing its Press and Hold CAPTCHA, but we have other ways to detect PerimeterX before we hit it.
First, we can use the free Chrome Extension from Wappalyzer: it’s easy and quite accurate.
Other signals of its presence can be discovered in the cookies, as mentioned on the great Web Scraping Wiki created by Maurice-Michel Didelot, a super expert in cybersecurity and one of the members of our amazing community. In this Wiki, you can find info about deobfuscating and reverse engineering anti-bots, essential for creating new tools and understanding what happens under the hood of our browser.
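As a rough illustration of the cookie check, here is a minimal sketch that scans the cookie names returned by a website for PerimeterX-looking entries. It assumes that PerimeterX cookies share the `_px` prefix (names such as `_px3`, `_pxvid` or `_pxhd` are commonly reported); this prefix and the helper names are assumptions made for the example, not an authoritative list, so check the Wiki for the details.

```python
# Assumption: PerimeterX cookie names start with "_px" (e.g. _px3, _pxvid, _pxhd).
# This is a heuristic, not an exhaustive or official list.
PX_COOKIE_PREFIXES = ("_px",)


def cookie_names_from_header(cookie_header):
    """Extract cookie names from a 'Cookie'-style header string."""
    names = []
    for part in cookie_header.split(";"):
        name, sep, _value = part.strip().partition("=")
        if sep and name:
            names.append(name)
    return names


def find_px_cookies(cookie_names):
    """Return, sorted, the cookie names that look like PerimeterX cookies."""
    return sorted(
        name for name in cookie_names
        if name.lower().startswith(PX_COOKIE_PREFIXES)
    )


if __name__ == "__main__":
    # Hypothetical header copied from the browser dev tools.
    header = "sessionid=abc; _px3=deadbeef; _pxvid=1234; lang=en"
    print(find_px_cookies(cookie_names_from_header(header)))
```

A non-empty result is only a hint that PerimeterX may be installed; an empty one does not prove its absence, since cookies can be set later by JavaScript.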
All the techniques and tools we see in this article are for testing purposes and should not be used to harm any website or its business. Please use web scraping techniques in an ethical way. If you’ve got any doubts about your web scraping operations, please ask for legal advice.
How to bypass PerimeterX with Scrapy?
The challenge of the article is to find a way to bypass PerimeterX on websites and scrape public data from them.
I’ll save you time and jump straight to the conclusion, without telling you all the trials and errors I’ve made to find the right solution, which is Scrapy Impersonate.
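To give an idea of what this looks like in practice, here is a minimal sketch of the configuration, based on the setup described in the scrapy-impersonate README: a custom download handler, the asyncio Twisted reactor, and an `impersonate` key in the request meta. The `IMPERSONATE_SETTINGS` and `build_request_kwargs` names are hypothetical helpers introduced for this example, and `chrome110` is just one browser profile assumed to be available in your installed version; this is not necessarily the exact setup used in the scrapers of this article.

```python
# Settings that route Scrapy downloads through scrapy-impersonate.
# They can go in settings.py or in a spider's custom_settings.
IMPERSONATE_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_impersonate.ImpersonateDownloadHandler",
        "https": "scrapy_impersonate.ImpersonateDownloadHandler",
    },
    # The download handler needs the asyncio-based reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}


def build_request_kwargs(url, browser="chrome110"):
    """Build keyword arguments for scrapy.Request (hypothetical helper).

    The 'impersonate' meta key tells the handler which browser
    fingerprint to mimic; 'chrome110' is an assumed default.
    """
    return {"url": url, "meta": {"impersonate": browser}}


if __name__ == "__main__":
    print(build_request_kwargs("https://example.com"))
```

In a spider, you would set `custom_settings = IMPERSONATE_SETTINGS` and yield `scrapy.Request(**build_request_kwargs(url), callback=self.parse)`, so the request goes out with a browser-like fingerprint without any browser being launched.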
Let me share with you two real-world examples where this solution fits particularly well.
You can find the code of the scrapers on The Web Scraping Club GitHub repository, available for paying readers. If you’re one of them but don’t have access, write me at pier@thewebscraping.club with your GH username.