hRequests: bypass Akamai with Python requests
Python requests with superpowers and browser automation embedded
“The first rule of web scraping is: you do not talk about web scraping.” That sentence appears in the description of the “webscraping” subreddit. While I understand that secret sauces must remain secret, I believe that sharing techniques, best practices, and tools makes the industry more efficient. That’s the case with the tool we’re going to see today, hrequests, a version of Python Requests on steroids that I didn’t know about until a few days ago, when this post by Ehsan popped up in my LinkedIn timeline.
It’s not the first time I’ve discovered a new package this way, but hrequests is truly interesting, so I’ve decided to write about it and test it against some anti-bot solutions.
What is hrequests?
hrequests, which stands for Human Requests, is a Python package that adds superpowers to traditional Python requests.
With its HTTP backend written in Go, the package covers two aspects that are key for scraping: HTTP request handling and browser automation.
On the request side, we have a set of features that make our scrapers less detectable. Starting from the TLS layer, we can choose a browser and the package will send a TLS fingerprint that mimics it, just like we have seen with the Scrapy Impersonate package, used to bypass Cloudflare. In this case, the TLS spoofing is based on the Go package tls-client by bogdanfinn.
Additionally, the package sends request headers consistent with the chosen browser, just like a real browsing session would.
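A quick way to check the headers side is to hit an echo endpoint and inspect what the session sends without setting anything manually. Here’s a minimal sketch of mine (httpbin.org is just my choice of echo service, not something the package requires):
import hrequests

# The session should already send a Chrome-like User-Agent and companion
# headers, without us setting anything by hand.
session = hrequests.Session('chrome')
resp = session.get('https://httpbin.org/headers')
print(resp.text)  # httpbin echoes back the request headers it received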
Here’s the result returned by the API I used for testing, https://tls.browserleaks.com/json, when impersonating a Chrome browser from my Mac.
{
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5830.2 Safari/537.36",
"ja3_hash": "e5bf60b9da5a6612c428b482319ae86f",
"ja3_text": "771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,5-17513-13-65281-23-16-18-10-11-0-43-35-51-45-27-21,29-23-24,0",
"ja3n_hash": "aa56c057ad164ec4fdcb7a5a283be9fc",
"ja3n_text": "771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,0-5-10-11-13-16-18-21-23-27-35-43-45-51-17513-65281,29-23-24,0",
"akamai_hash": "a345a694846ad9f6c97bcc3c75adbe26",
"akamai_text": "1:65536;2:0;3:1000;4:6291456;6:262144|15663105|0|m,a,s,p"
}
Here’s the code for that request:
import hrequests

session = hrequests.Session('chrome')
tls_test = session.get('https://tls.browserleaks.com/json')
print(tls_test.text)
Unfortunately, I couldn’t find an up-to-date database of JA3 strings to validate the output against, but we’ll see later how well the package performs in bypassing Akamai.
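Even without a reference database, one quick sanity check (my own idea, not something from the docs) is to compare the fingerprint across impersonated browsers: if the JA3 hash changes when we switch profiles, the spoofing is clearly doing its job. Here’s a minimal sketch, assuming the Firefox profile is also available:
import json
import hrequests

# Compare the JA3 hash reported by browserleaks for two impersonated browsers.
for browser in ('chrome', 'firefox'):
    session = hrequests.Session(browser)
    resp = session.get('https://tls.browserleaks.com/json')
    data = json.loads(resp.text)
    print(browser, data.get('ja3_hash'))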
On top of all these cool features on the request side, we also get some for browser automation.
In fact, you can create browser sessions using Firefox or Chrome and control them with simple commands to click, browse pages, and get the rendered content.
In my opinion, this part is too basic for complex web scraping but can be used for simpler projects.
Here’s an example from the official documentation:
>>> session = hrequests.Session(browser='chrome')
>>> resp = session.get('https://www.somewebsite.com/')
>>> with resp.render(mock_human=True) as page:
... page.find('.input#username').type('myuser')
... page.find('.input#password').type('p4ssw0rd')
... page.click('#submit')
# `session` & `resp` now have updated cookies, content, etc.
The most interesting part here is the mock_human=True parameter. Based on the botright package (another piece of software I’m planning to test soon), it uses a wide array of techniques to avoid bot detection. We’ll dig deeper into these features in one of the next posts, since they deserve a “Hands On” episode, so for the moment we’ll focus on the HTTP request side.
Bypass Akamai with hrequests
In one of the latest THE LAB posts about Akamai, I ended up with a solution I honestly didn’t like too much. I had tested Scrapy Impersonate with no luck, since the mobile app’s calls were also protected, and the final result was to hardcode a working Akamai cookie into the scraper’s execution.
This means that if the cookie stops working, I need to replace it after the spider breaks, which is not an optimal solution. Maybe with hrequests I’ll be luckier?
After installing it, I start by simply importing the package into the Scrapy project:
...
import json
import time
import hrequests
Then, instead of using a traditional Scrapy Request to query the API, I loop over a function that calls the API and paginates the requests.
Since the API returns JSON, we only need to parse it to get the data we need. The issue is that, on top of Akamai, there’s also a reCAPTCHA to solve.
For this reason, we load the home page of the website using a headful browser, in order to pass the reCAPTCHA and get a valid token from Akamai.
session = hrequests.BrowserSession(proxy_ip="YOUR_RESIDENTIAL_PROXY_PROVIDER", headless=False)
url = 'https://www.luisaviaroma.com/'
akamai_test = session.get(url)
page = akamai_test.render(mock_human=True)
Once the challenge is bypassed, we can switch back to classic hrequests calls, with no browser needed, and parse the returned JSON.
url = 'https://www.luisaviaroma.com/' + language + '-' + country + '/shop/' + category + '&Page=' + str(current_page) + '&ajax=true'
akamai_test = session.get(url)
print(akamai_test.text)
json_data = json.loads(akamai_test.text)
Iterating over the page numbers until the last one is reached allows us to get all the items from a single category of the target website. In case the reCAPTCHA pops up again, we handle it with a try/except clause where we get a new token by loading the home page once more.
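To make the flow more concrete, here’s a simplified sketch of that loop, stitched together from the snippets above. The function name, the retry logic, and the 'Items' key are my own placeholders, not the exact code from the repository:
import json
import hrequests

# Placeholders: adapt them to your target category. Following the snippet above,
# 'category' is assumed to already contain its query string.
language, country = 'en', 'us'
category = 'YOUR_CATEGORY_PATH'

def refresh_akamai_token(session):
    # Load the home page in the headful browser to pass the reCAPTCHA
    # and get a fresh Akamai cookie attached to the session.
    home = session.get('https://www.luisaviaroma.com/')
    home.render(mock_human=True)

session = hrequests.BrowserSession(headless=False)  # add your residential proxy as shown above
refresh_akamai_token(session)

items, current_page, retries = [], 1, 0
while True:
    url = 'https://www.luisaviaroma.com/' + language + '-' + country + '/shop/' + category + '&Page=' + str(current_page) + '&ajax=true'
    response = session.get(url)
    try:
        json_data = json.loads(response.text)
        retries = 0
    except json.JSONDecodeError:
        # Most likely the reCAPTCHA popped up again: refresh the token and retry.
        if retries >= 3:
            break
        retries += 1
        refresh_akamai_token(session)
        continue
    page_items = json_data.get('Items', [])  # 'Items' is a guess for the products key
    if not page_items:
        break  # last page reached
    items.extend(page_items)
    current_page += 1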
The whole code can be found in the GitHub repository, available to everyone.
The logic of the spider is similar to the previous version: we get the Akamai token and use it for scraping by passing it along in the session. In this newer version, instead of using a fixed cookie, we generate it dynamically by loading the homepage with a browser, a much more resilient solution.
This also highlights the flexibility of the hrequests package, which lets you switch smoothly between a headful browser and plain requests and use both inside a single spider.
How many of you knew about the hrequests package? Is there a package you want to suggest to me? Please write to me at pier@thewebscraping.club