The Lab #42: Bypassing PerimeterX without a browser automation tool

Bypassing PerimeterX with free tools and without running a browser

Feb 22, 2024


Welcome back to another episode of The Lab, our series of articles where we write some code to solve some common issues in web scraping. And what’s more common than being blocked by an anti-bot?

In the past weeks, we have seen how to bypass the most common anti-bots we can find.

The Lab #38: Bypassing Kasada for web scraping 2024 edition
The Lab #36: Bypassing Cloudflare with anti-detect browsers
The Lab #37: Bypassing Cloudflare with anti-detect browsers - Part 2
The Lab #34: Bypassing Datadome - End of 2023 Version

All these articles give some hints on how to bypass a certain anti-bot in a real-world case, but unfortunately there’s no silver bullet.

Different websites running the same anti-bot can set different engagement rules and protection levels, and the countermeasures can change even between sections of the same website.

Take Booking.com, the well-known online travel platform, as an example: the website is protected by PerimeterX, and it uses internal APIs to show the results of your travel searches. To browse the website you need some tool to bypass PerimeterX, but the internal APIs are not protected, probably intentionally.

There could be different reasons for that:

getting data from the APIs is less resource-intensive for both the scraper and the website, so they prefer that people use them to get data
they know that some external applications are using them
web scraping could be nothing more than a bother if done responsibly.

In my opinion, the last point is probably a common pattern: yes, web scraping can be a nuisance, but the most important use case for anti-bots is fraud detection, such as stopping bots that automate the purchase of an item as soon as it’s published (think of the sneaker market). This explains why websites selling “rare” items, like the latest drops of sneakers or streetwear, or even Hermes, are the most protected: maybe not on the home page, but for sure when you try to purchase something.

But let’s go back to the 100% legal web scraping of public information, which is the main content of this newsletter.

We just mentioned PerimeterX, which is a widespread anti-bot solution we already covered in the past.

The Lab #35: Bypassing PerimeterX with Python and Playwright

What is PerimeterX and how does it work? PerimeterX (now Human Scraping Defense) is one of the most famous anti-bot solutions available on the market. It employs a sophisticated approach involving behavioral analysis and predictive detection, combined with various fingerprinting methods. These techniques assess multiple factors to distinguish between authentic users and automated bots attempting to access website resources…

Two months ago we wrote about how to bypass it using Playwright, since almost every anti-bot requires a JS rendering engine to solve its challenges and, in most cases, scrapy_splash is not enough.

But since there are a lot of OSS tools available on GitHub for web scraping, are we sure we cannot use any of them to avoid launching Playwright and make scraping faster and less resource-intensive?

Spoiler: yes, I’ve found one ✅

How to detect a website using PerimeterX?

PerimeterX (now Human Scraping Defense) is known for its Press and Hold CAPTCHA, but there are ways to detect PerimeterX before that challenge ever shows up.

As a first step, we can use the free Wappalyzer Chrome extension: it’s easy and quite accurate.

Other signals of its presence can be discovered in the cookies, as mentioned on the great Web Scraping Wiki created by Maurice-Michel Didelot, a super expert in cybersecurity and one of the members of our amazing community. In this Wiki, you can find info about deobfuscating and reverse engineering anti-bots, essential for creating new tools and understanding what happens under the hood of our browser.
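To make the cookie check concrete, here is a minimal sketch in Python. It assumes the cookie names commonly reported for PerimeterX (`_px`, `_px2`, `_px3`, `_pxhd`, `_pxvid`, `_pxde`, `pxcts`); this list is my assumption and not taken from the Wiki, so treat it as a starting point and check the Wiki for the current names. The function name `perimeterx_cookies` is hypothetical.

```python
# Assumption: cookie names commonly associated with PerimeterX.
# The list may be incomplete or change over time.
PX_COOKIE_NAMES = {"_px", "_px2", "_px3", "_pxhd", "_pxvid", "_pxde", "pxcts"}


def perimeterx_cookies(set_cookie_headers):
    """Return the sorted PerimeterX-looking cookie names found in a list
    of raw Set-Cookie header values (one string per header)."""
    # The cookie name is everything before the first "=" in the header.
    names = {header.split("=", 1)[0].strip() for header in set_cookie_headers}
    return sorted(names & PX_COOKIE_NAMES)


if __name__ == "__main__":
    # Made-up header values, only to show the expected input shape.
    headers = [
        "_pxhd=abc123; Path=/; Max-Age=31536000",
        "session=1; HttpOnly",
        "_pxvid=xyz789; Path=/",
    ]
    print(perimeterx_cookies(headers))  # ['_pxhd', '_pxvid']
```

An empty result does not prove the website is unprotected, since the cookies may be set only after some JavaScript runs, so it is worth combining this check with Wappalyzer.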

All the techniques and tools we see in this article are for testing purposes and should not be used to harm any website or its business. Please use web scraping techniques in an ethical way. If you have any doubts about your web scraping operations, please ask for legal advice.

How to bypass PerimeterX with Scrapy?

The challenge of this article is to find a way to bypass PerimeterX on protected websites and scrape public data from them.

I’ll save you time and jump straight to the conclusion, skipping all the trial and error it took to find the right solution: Scrapy Impersonate.
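For readers who have never used it, scrapy-impersonate is a Scrapy download handler built on curl_cffi that mimics the TLS and HTTP/2 fingerprints of real browsers. The sketch below follows the setup described in the project's README rather than the scrapers from the repository, so take it as an illustration only. The helper `impersonate_request_kwargs` and the default browser target are my own assumptions for the example.

```python
# settings.py fragment: route HTTP and HTTPS downloads through scrapy-impersonate.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}

# scrapy-impersonate requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def impersonate_request_kwargs(url, browser="chrome110"):
    """Hypothetical helper: build keyword arguments for scrapy.Request so the
    request is sent with the fingerprint of the given browser target."""
    return {
        "url": url,
        # The "impersonate" meta key tells the download handler which browser to mimic.
        "meta": {"impersonate": browser},
        "dont_filter": True,
    }
```

Inside a spider, the helper would be used as `yield scrapy.Request(**impersonate_request_kwargs("https://example.com"), callback=self.parse)`. Which browser target works best depends on the website, so expect some experimentation.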

Let me share with you two real-world examples where this solution fits particularly well.

You can find the code of the scrapers on The Web Scraping Club GitHub repository, available for paying readers. If you’re one of them but don’t have access, write me at pier@thewebscraping.club with your GH username.

