What to expect from The Lab posts in 2024
Why I'm writing "The Lab" articles and what to expect in the new year
If you’ve been reading this newsletter for some time, you should be familiar with “The Lab” series of articles, but since many of you joined only recently, here’s a brief description. They are hands-on guides and solutions to common issues in the web scraping world: how can I bypass anti-bot X, how can I make my bot indistinguishable from a human browsing the web, and similar questions.
In most cases, together with the description of the techniques, the code used in the articles is available in the private GitHub repository.
Both the repository and the full articles are behind a paywall, since these articles are the most time-consuming to write and test.
Generally speaking, writing articles for The Web Scraping Club requires hours every day and, while I’m doing my best to let you extract value from this newsletter for free, every kind of support through paid subscriptions is greatly appreciated.
Why am I writing The Lab articles?
To get rich, of course! 🤑🤑🤑
Just kidding. The main reason why I write The Web Scraping Club is to share my experience in web scraping with other professionals who are facing similar challenges. It is something I would have liked to read in the past, since it could have been useful for my career, and I hope my notes are interesting for you now.
In my opinion, the biggest gap in online content about web scraping was the lack of solutions to real issues on real websites. Since it’s a cat-and-mouse game, there’s a common idea that if you talk openly about detailed web scraping solutions, your words will be used by anti-bot companies to fix their products.
While I keep the secret Cola recipe to myself, I think that sharing these techniques in public brings more benefit than harm to the web scraping community. Is it better for you to read a solution to your web scraping issues today, even if it might stop working in the future? Or would you prefer to spend more weeks without delivering data to your final user or customer?
In the past, I really needed articles that clearly described what to do to bypass anti-bot X, but the only ones I could find were written by vendors of commercial solutions and ended by suggesting you buy their services. This is perfectly legitimate, and we’ll keep covering commercial solutions on these pages too, but it was not what I was looking for at that moment.
So I decided to write them myself, and for me this is the most satisfying task, since it forces me to stay up to date, study, do some research and, in the end, become a better web scraping professional. And I hope that by sharing the results of my studies with you, I’m also helping you somehow in your daily tasks.
What did we cover in 2023 with The Lab?
Over the past 35 articles, written mostly in 2023, we’ve faced several challenges, most of them created by anti-bots.
We started 2023 by scraping OpenSea, the NFT marketplace, to understand how the sales of the Bored Ape Yacht Club evolved.
This involved bypassing its Cloudflare protection and, to keep track of the transactions, also scraping Etherscan which, at that time, was protected by Cloudflare too.
From this example, you can understand the pros and cons of writing accurate solutions to scrape real websites: these websites change, and what worked one year ago might not work anymore. But this also means that we always have something to write about and study, since the web is constantly evolving.
In February, I wrote the first edition of “The anti-detect anti-bot matrix”, where I compared 5 solutions (undetected-chromedriver, pyppeteer, and 3 different Playwright setups) against 5 anti-bots to see which one works against which.
This article was extremely interesting and will probably become a recurring one during 2024.
Also in February, we had a great article by Fabien (his LinkedIn here), where he explains how to scrape data from a mobile app using Charles Proxy and Android Studio. This is the power of creating a community: everyone brings their own knowledge and expertise, and Fabien is a master of web scraping techniques.
Of course, the core of The Lab is the articles where we bypass anti-bot solutions like Cloudflare, PerimeterX, Kasada and Akamai. They are the biggest challenges for web scrapers today, so they are the main focus of this series.
The most successful The Lab articles in 2023
Confirming what I just said, the podium of the top 3 The Lab articles for 2023 is occupied by these topics.
In third position we have our article about hRequests, where we benchmark this tool against the top 5 anti-bots on the market, and see where it works and where it’s not enough.
In second position we have the already mentioned Anti-detect Anti-bot matrix, another benchmark against these anti-bots.
In first position we have our early 2023 update of the scraping solution against Cloudflare, which will probably need another update soon for 2024.
What to expect in 2024
Talking about 2024, what’s cooking for this year in The Web Scraping Club kitchen?
As mentioned in our previous 2023 recap, The Lab will become a weekly issue, up from bi-weekly in 2023. This change is meant to bring more value to readers, since I’ve realized that these articles are the ones the community wants most.
As already mentioned, there will be some recurring issues like the Anti-Detect Anti-Bot matrix (maybe with a more readable name), and some other deep dives on anti-bots.
And you, what would you like to read on these pages?