THE LAB #25: Bypassing PerimeterX in 2023
How to bypass PerimeterX anti-bot solution using both free and commercial solutions
This post is sponsored by Oxylabs, your premium proxy provider. Sponsorships help keep The Web Scraping Club free and are a way to give back some value to the readers.
The PerimeterX anti-bot solution, recently acquired by HUMAN, is one of the most widespread anti-bot solutions on the web.
As with every modern anti-bot solution, it uses the most recent techniques to detect bot traffic and scrapers, such as:
fingerprinting on different layers
AI applied to behavioral analysis
In this post we’ll see some techniques, both free and commercial, to bypass a PerimeterX-protected website like Neimanmarcus.com.
These techniques, and the code in our GitHub repository (available for paying readers), should be taken as a starting point, since each website could implement its protection in different ways.
If you’re a paying user and don’t have access to the GitHub repository, write me at firstname.lastname@example.org, since I need to give you access manually.
How to detect if a website uses PerimeterX?
You can use the free Wappalyzer Chrome extension: if a website uses PerimeterX, you should see it listed in the Security section.
If you don’t want to, or are unable to, use Wappalyzer, you can check the cookies stored by the website and look for one called _pxhd, like the following:
Set-Cookie: _pxhd=mepSfyVT0voiGSF9HsgdW4GlbT9YEefstfxozu2ajYq03dJV8h2lkYZyeOKOzoI85m8SLs/5HvDHR7cL3xekUQ==:q54rImD/eeV8qJLNcp3bRbS70bTuuAPgMys0dty8tiWi4DfdxA1bKCqbaFXUNhNIaW3etC3KxGcSDewg7TBKJDu7lhV1MegAxcolO-AJEAE=; Expires=Wed, 14 Aug 2024 16:52:01 GMT; path=/;
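If you prefer a programmatic check, here’s a minimal sketch that looks for the _pxhd cookie in the first plain HTTP response. Keep in mind this is an assumption-heavy shortcut: many PerimeterX-protected sites set their cookies via JavaScript, so a missing _pxhd here doesn’t prove the site is unprotected.

import requests

def uses_perimeterx(url):
    # Look for the _pxhd cookie in the first response (no JavaScript executed).
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    cookie_names = [cookie.name for cookie in response.cookies]
    set_cookie_header = response.headers.get("Set-Cookie", "")
    return any("_pxhd" in name for name in cookie_names) or "_pxhd" in set_cookie_header

print(uses_perimeterx("https://www.neimanmarcus.com/"))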
One of the distinctive traits of this bot protection is the “Press and Hold” button, which looks like the following.
We’ve recently reviewed Undetected Chromedriver version 3.5 and found that it performs quite well against PerimeterX, both in a local environment and on a datacenter, using proper proxies to avoid bans on IP ranges and IP rate limits.
In the GitHub repository, inside the file undetected-chromedriver.py, you'll find a scraper that crawls one item category using undetected-chromedriver.
You will notice a peculiarity: I’ve used Brave Browser instead of Chrome, and this is because after loading the first page, UC gets stuck on a call to an external API. This doesn’t happen when using Brave Browser.
import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.binary_location = '/Applications/Brave Browser.app/Contents/MacOS/Brave Browser'
driver = uc.Chrome(headless=False, use_subprocess=True, options=options)
driver.get("https://www.neimanmarcus.com/")
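If you run the scraper from a datacenter, the proxies mentioned above can be passed to undetected-chromedriver as a regular Chrome argument. The snippet below is just a sketch of the idea: the proxy endpoint, the category URL and the CSS selector are placeholders for illustration, not the ones used in the repository scraper.

import time
from random import randrange

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By

options = uc.ChromeOptions()
options.binary_location = '/Applications/Brave Browser.app/Contents/MacOS/Brave Browser'
# Hypothetical proxy endpoint: replace with your provider's host and port.
options.add_argument('--proxy-server=http://proxy.example.com:8000')

driver = uc.Chrome(headless=False, use_subprocess=True, options=options)
driver.get("https://www.neimanmarcus.com/")
time.sleep(randrange(5, 10))  # random pause to look less robotic

# Hypothetical category URL and selector, for illustration only.
driver.get("https://www.neimanmarcus.com/c/shoes")
time.sleep(randrange(5, 10))
for link in driver.find_elements(By.CSS_SELECTOR, "a.product-thumbnail__link"):
    print(link.get_attribute("href"))

driver.quit()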
One of the reasons I prefer Playwright over tools like Selenium is its flexibility: with a few changes we can test different configurations and browsers without rewriting a scraper.
The first working solution we’ll see can also be implemented using Playwright + Firefox, again with a proxy provider if needed when executed on a datacenter.
This works basically straight out of the box, with only one additional option needed when running the scraper from a server connected to a low-latency network: slow_mo. It artificially slows down Playwright’s execution, and I suppose this tricks the anti-bot by simulating a slower connection.
import time
from random import randrange

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.firefox.launch(headless=False, slow_mo=300)
    page = browser.new_page()
    page.goto('https://www.google.it', timeout=0)
    interval = randrange(10)
    time.sleep(interval)
    page.goto('https://www.neimanmarcus.com', timeout=0)
    page.wait_for_load_state("load")
You can find the full scraper in the repository, inside the file playwright_firefox.py.
Things are not so smooth when using the Chromium browser instead: it loads the website’s main page, but then we get blocked on the first product page.
To make Chrome work against PerimeterX, we need to modify the scraper as follows:
CHROMIUM_ARGS = [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--no-first-run',
    '--disable-blink-features=AutomationControlled'
]
....
with sync_playwright() as p:
    browser = p.chromium.launch_persistent_context(
        user_data_dir='./userdata/',
        headless=False,
        slow_mo=200,
        args=CHROMIUM_ARGS,
        ignore_default_args=["--enable-automation"]
    )
    page = browser.new_page()
    page.goto('https://www.google.it', timeout=0)
    interval = randrange(10)
    time.sleep(interval)
    page.goto('https://www.neimanmarcus.com', timeout=0)
Via the command-line arguments, we’re disabling the Chrome sandbox, the first-run page, and the infobar saying that the browser is controlled by automation software.
On top of that, we’re using a Chrome installation and not a Chromium one, with a persistent context.
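The snippet above doesn’t show how Playwright is pointed at the installed Chrome rather than the bundled Chromium. One way to do it, sketched below, is the channel option of launch_persistent_context; setting executable_path to your Chrome binary, like we’ll do for Brave in a moment, works too.

with sync_playwright() as p:
    browser = p.chromium.launch_persistent_context(
        user_data_dir='./userdata/',
        channel='chrome',  # use the locally installed Google Chrome
        headless=False,
        slow_mo=200,
        args=CHROMIUM_ARGS,
        ignore_default_args=["--enable-automation"]
    )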
With this setup, contained in the file playwright_chrome.py, we can crawl the whole product category.
More or less, the same configuration can be applied to the Brave browser: we’ll explicitly set the executable path for Brave and keep only the compatible options as arguments.
CHROMIUM_ARGS = [
    '--no-first-run',
    '--disable-blink-features=AutomationControlled'
]
.....
with sync_playwright() as p:
    browser = p.chromium.launch_persistent_context(
        user_data_dir='./userdata/',
        executable_path='/Applications/Brave Browser.app/Contents/MacOS/Brave Browser',
        headless=False,
        slow_mo=200,
        args=CHROMIUM_ARGS,
        ignore_default_args=["--enable-automation"]
    )
    page = browser.new_page()
    page.goto('https://www.google.it', timeout=0)
    interval = randrange(10)
    time.sleep(interval)
    page.goto('https://www.neimanmarcus.com', timeout=0)
Final thoughts on free solutions
We’ve seen several approaches to bypassing PerimeterX: as mentioned at the beginning of the article, they should be used as a starting point in your projects, since results may vary depending on the target website and the environment where the scraper runs. Since PerimeterX also uses behavioral analysis to detect bots, you may have noticed I’ve introduced random sleeps between one action and the next, as well as some random mouse scrolling.
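If you want to reproduce that behavior in your own Playwright scrapers, here’s a minimal sketch of what it could look like; the helper name and the timing values are mine and purely illustrative, not the ones used in the repository code.

import time
from random import randrange

def humanize(page):
    # Random pause between two actions.
    time.sleep(randrange(2, 6))
    # A few random mouse-wheel scrolls down the page.
    for _ in range(randrange(1, 4)):
        page.mouse.wheel(0, randrange(200, 800))
        time.sleep(randrange(1, 3))

# Call humanize(page) after each page.goto() or click in the scraper.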
It’s maybe not the most advanced solution, but it seems to be enough to confuse the AI model used for bot detection and get a green flag.