THE LAB 32: hRequests vs anti-bots: a full benchmark
How does it perform against Cloudflare, Akamai, Datadome, PerimeterX and Kasada?
In one of my past articles, I wrote about hRequests (human requests), a Python package that enhances traditional HTTP requests with features like headless browsing, real browser TLS fingerprints, and much more.
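If you've never seen it in action, here's a minimal sketch of what using the package looks like. Note that example.com is just a placeholder URL, while the Firefox session mirrors a pattern we'll use later in this article.
# pip install hrequests  (check the project README for the optional browser extras)
import hrequests

# A plain GET already ships with a real browser TLS fingerprint
resp = hrequests.get('https://example.com')
print(resp.status_code)

# Or mock a specific browser/OS combination
session = hrequests.firefox.Session(os='mac')
resp = session.get('https://example.com')
print(resp.status_code)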
We tested the hRequests package against Akamai, but today I want to dig deeper and see if we can bypass the most used anti-bot solutions: Cloudflare, Datadome, PerimeterX, and Kasada.
For this test, we'll try to scrape the usual websites from our Hands-On article series, in order to keep the same baseline. As a reminder, even if hRequests passes our tests on these websites, it doesn't mean it will pass on every website protected by the same anti-bot solutions. These tools are highly customizable, so two websites with the same anti-bot installed could respond in very different ways.
On top of that, consider that in this case I'm testing hRequests from my laptop at the office, so I'm using a clean IP and the device fingerprint sent by the browser is legitimate.
Different setups could lead to different results, but with this article I hope you'll discover a new tool to add to your toolbelt, one that, with a proper setup, could help in your web scraping projects.
Given these premises, let's see how the hRequests package performs against the most famous anti-bot solutions. The full code can be found in the GitHub repository reserved for paying users.
Akamai ✅
We've already seen this case in a post dedicated to Akamai and hRequests, but let's recap what we did.
We are testing against the website luisaviaroma.com, which has Akamai + reCAPTCHA.
Our plan to bypass this combo is:
Load the homepage to grab the initial Akamai cookie and use a headful Chromium window to bypass the reCAPTCHA.
import hrequests

# Open a headful Chromium window so the reCAPTCHA can be handled like a human would
session = hrequests.BrowserSession(headless=False)
url = 'https://www.luisaviaroma.com/'
akamai_test = session.get(url)
# Render the response in the browser, mimicking human behavior
page = akamai_test.render(mock_human=True)
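Before moving on, it can be useful to check that the Akamai sensor cookie actually landed in the session. This is just a hedged sketch: I'm assuming the session exposes a requests-style cookie jar, and _abck is the cookie name Akamai typically uses.
# Assumption: the session exposes a requests-style cookie jar we can iterate over
cookie_names = [cookie.name for cookie in session.cookies]
print(cookie_names)
if '_abck' in cookie_names:
    print('Akamai cookie collected, we can start scraping headlessly')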
Once the cookies are stored in the session, we can use them to browse the product pages headlessly.
import json

# current_page, total_pages, language, country and category are defined earlier in the script
while current_page < total_pages:
    try:
        # The &ajax=true parameter makes the website return the product listing as JSON
        url = 'https://www.luisaviaroma.com/' + language + '-' + country + '/shop/' + category + '&Page=' + str(current_page) + '&ajax=true'
        akamai_test = session.get(url)
        json_data = json.loads(akamai_test.text)
        products = json_data["Items"]
        current_page += 1
    except Exception as e:
        print(e)
        break
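From here you can handle the Items array however you prefer. As a quick, hedged example (I'm not assuming any specific field names, since they depend on luisaviaroma's JSON payload), the raw records could be dumped to a CSV like this:
import csv

# Write the scraped products to a CSV, using whatever keys the first record exposes
if products:
    with open('products.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=products[0].keys())
        writer.writeheader()
        writer.writerows(products)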
Easy and effective.
Cloudflare ✅
Let’s check if we can use the same approach with Cloudflare, testing the website harrods.com.
Unluckily, it doesn't work as before: we hit a wall while loading the home page.
Playing around with the browsers supported by the package and the operating system to mock, I've modified the starting code as follows:
import time
import hrequests
from random import randrange

# Mock a Firefox browser running on macOS
session = hrequests.firefox.Session(os='mac')
url = 'https://www.harrods.com/'
cloudflare_test = session.get(url)
# Open a headful browser window and mimic human behavior to pass the Turnstile check
page = cloudflare_test.render(mock_human=True, headless=False)
# print(page.content)

url = 'https://www.harrods.com/en-it/shopping/women-clothing?icid=megamenu_shop_women_clothing_all-clothing&pageindex=1'
check = 0
while check == 0:
    page.goto(url)
    # Random pause to let the page render fully before parsing it
    interval = randrange(20, 30)
    time.sleep(interval)
    # ... parse the rendered HTML here and set check = 1 when the job is done
In this way, we open a Firefox browser window and load a product category page in it.
It seems we’re bypassing both the Cloudflare Turnstile check and the hard block by Cloudflare we usually get on this website.
I only needed to add a random sleep interval between pages, since the browser sometimes struggles to render them and we could end up parsing the HTML before the page has fully loaded.
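If the fixed sleep feels too brittle, an alternative is to poll the rendered HTML until the content we expect actually shows up. The sketch below parses page.content with BeautifulSoup and uses a hypothetical .product-card selector (you'd need to swap in the real one from Harrods' markup), so treat it as an illustration rather than the exact code from the article.
import time
from bs4 import BeautifulSoup

def wait_for_products(page, selector='.product-card', timeout=60):
    # Poll the rendered HTML until the (hypothetical) selector appears or we give up
    deadline = time.time() + timeout
    while time.time() < deadline:
        soup = BeautifulSoup(page.content, 'html.parser')
        if soup.select(selector):
            return soup
        time.sleep(2)
    raise TimeoutError('No elements matching ' + selector + ' after ' + str(timeout) + ' seconds')
Calling wait_for_products(page) after each page.goto(url) returns a parsed soup ready to be scraped, instead of hoping the fixed pause was long enough.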
Datadome ❌