THE LAB #30: How to bypass an Akamai-protected website when nothing else works
And without paying for any commercial solution. An ode to trivial solutions.
In the past few weeks, I’ve had some headaches with Akamai-protected websites: our running scrapers stopped working, collecting very few items before getting blocked, and there seemed to be no way to fix them.
In this post, we’ll walk through the process that brought me to the final solution, which is not the most elegant, but works without using any commercial tool.
Understanding the context
The target website is a fashion retailer, well known in the industry, with around 40k items available. Until a few weeks ago, the whole catalog was scraped by concurrent Scrapy spider runs, one per product category. The website was already protected by Akamai, but splitting the work from a single scraper covering the whole website into separate per-category executions allowed us to collect the whole product catalog before getting blocked.
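For context, the original setup looked roughly like the sketch below: a single spider parameterized by category, launched in parallel with one process per category. The spider name and the category list are made up for illustration.

```python
# Hypothetical sketch of the original setup: one Scrapy process per product
# category, launched concurrently. Spider name and categories are placeholders.
import subprocess

CATEGORIES = ["women-shoes", "men-shoes", "bags", "accessories"]

processes = [
    subprocess.Popen(["scrapy", "crawl", "products", "-a", f"category={cat}"])
    for cat in CATEGORIES
]

# Wait for all the per-category runs to finish
for proc in processes:
    proc.wait()
```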
But a few weeks ago something changed, probably an update to the Akamai anti-bot solution, since not only I but also other people in our Discord server faced the same issue: the scrapers were blocked immediately, at the very first request to the target website.
Nothing new in the life of a professional web scraper, but let’s use this case to detail the process of fixing a scraper, regardless of the specific issue.
The first step is to understand what challenge you’re facing and, as usual, this step is solved by the Wappalyzer browser extension, which shows which anti-bot solutions are installed on the website.
In this case, we have reCAPTCHA in combination with Akamai. I honestly don’t remember if reCAPTCHA was installed before this issue, but the fact that other people are complaining about Akamai these days makes me think the protection has been updated recently.
The debugging process
What I usually do, and I’m not saying it’s the best debugging process, is try to rule out the easy-to-detect issues and causes first. In this case, we’ve got a Scrapy spider running on a virtual machine in a datacenter, so the first thing is to check whether running it in a local environment solves the situation.
If we’re able to scrape some items by running it on a laptop, then we know it’s not an issue with the spider itself but with the IP, and we can take the appropriate actions, like changing the cloud provider where the machine runs or adding a proxy, in case datacenter IPs are banned.
Unfortunately, that was not the case: the scraper didn’t work even when running locally, so we needed to rework it.
Another thing that’s easy to check is whether we need a headful browser: if we can get the full HTML by replicating the browser’s request with cURL, we certainly won’t need one.
Testing this is really simple: from the browser’s developer tools, you can copy the exact cURL command corresponding to the browser’s call.
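If you prefer to script the check instead of pasting the cURL command into a terminal, the same test can be sketched in Python. The URL and headers below are placeholders for the ones you’d copy from the developer tools.

```python
# A minimal sketch of replaying the browser request outside the browser.
# The URL and headers are placeholders: copy the real ones from the
# "Copy as cURL" entry in the browser's developer tools.
import requests

url = "https://www.example-retailer.com/en/women/shoes"  # placeholder
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # from DevTools
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get(url, headers=headers, timeout=30)

# If the response contains the product markup (and not a block or captcha
# page), we don't need a headful browser for this target.
print(response.status_code)
print("full HTML received" if "<html" in response.text.lower() else "blocked or empty")
```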
In this case, we’re lucky enough to see that we can get the full HTML, so we don’t need a headful browser and we can rewrite the Scrapy spider without Playwright or any other browser automation.
From now on, it’s only a matter of trial and error. Since I’ve recently had a lot of luck with scrapy_impersonate and its way of masking the JA3 TLS fingerprint, I decided to try it as the first possible solution, but it didn’t work.
I got some connection errors on the API endpoint used by the Scrapy spider, so the anti-bot protection was probably doing its job properly.
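For reference, the attempt looked roughly like this. The settings follow scrapy_impersonate’s documented usage, while the spider name, endpoint, and browser version are placeholders, not the actual scraper.

```python
# Sketch of the scrapy_impersonate attempt. The settings follow the library's
# documented usage; spider name, endpoint and browser version are placeholders.
import scrapy


class ProductsApiSpider(scrapy.Spider):
    name = "products_api"

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_impersonate.ImpersonateDownloadHandler",
            "https": "scrapy_impersonate.ImpersonateDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        url = "https://www.example-retailer.com/api/products?page=1"  # placeholder
        # The "impersonate" meta key tells the download handler which
        # browser's TLS (JA3) fingerprint to reproduce.
        yield scrapy.Request(url, meta={"impersonate": "chrome110"}, callback=self.parse)

    def parse(self, response):
        yield from response.json().get("products", [])
```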
Since the only way we can actually get data is with a cURL request, I wanted to understand if it still works without passing the cookies we got from the browser. This is because I usually prefer cookies to be generated at runtime, as that’s the common browser behavior, rather than hardcoding cookies that could become stale, no longer valid, and a source of errors.
But in this case, the behavior is the opposite: without cookies, the request didn’t work; reducing the cookies to only the ones related to Akamai, instead, it does.
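In practice, the experiment looked something like the sketch below: replay the request once with no cookies and once with only the Akamai-related ones. The URL and values are placeholders, and `_abck` and `bm_sz` are typical Akamai cookie names; check which ones your target actually sets.

```python
# Sketch of the cookie experiment: replay the request keeping only the
# Akamai-related cookies. URL, headers and cookie values are placeholders.
import requests

url = "https://www.example-retailer.com/api/products?page=1"  # placeholder
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # from DevTools
}

# Only the cookies that look Akamai-related, copied from the browser session
akamai_cookies = {
    "_abck": "<value copied from the browser>",
    "bm_sz": "<value copied from the browser>",
}

no_cookies = requests.get(url, headers=headers, timeout=30)
with_akamai = requests.get(url, headers=headers, cookies=akamai_cookies, timeout=30)

print("no cookies:", no_cookies.status_code)       # blocked in our case
print("akamai cookies:", with_akamai.status_code)  # works in our case
```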
We’re starting to figure out a plausible solution: we only need to get those cookies when the scraper starts.
Filling up the cookie jar
The main issue here is that we cannot load a single page of the target website without getting blocked, so we cannot get the cookies needed to keep scraping it.
Even though we know a browser isn’t needed to get the data, I also tried loading the API with Playwright, hoping the needed cookies would be generated, but I got blocked by a captcha every time.
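The attempt was more or less the following: open the target with Playwright, let Akamai’s JavaScript run, and read the cookies back from the browser context. The URL and the cookie names are placeholders, and in our case this path got stopped by a captcha before any useful cookie appeared.

```python
# Sketch of the (failed) Playwright attempt: load a page and harvest the
# cookies generated by Akamai's JavaScript. URL and cookie names are
# placeholders; here this was blocked by a captcha before it could work.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.example-retailer.com/", wait_until="networkidle")

    # Keep only the cookies that look Akamai-related
    cookies = {
        c["name"]: c["value"]
        for c in context.cookies()
        if c["name"] in ("_abck", "bm_sz")
    }
    print(cookies)

    browser.close()
```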
So it seems we’re a bit stuck at this point. Since the website also has a mobile app, what if we try to scrape data directly from there, using the procedure described in the first episode of The Lab?