The January 2023 recap for the Web Scraping industry
New tools and resources for starting this new year
This post is sponsored by Smartproxy, the premium proxy and web scraping infrastructure focused on the best price, ease of use, and performance.
All readers of The Web Scraping Club can save 10% on every purchase by using the discount code WEBSCRAPINGCLUB10.
How 2023 started for the web scraping industry
Content, content, content
The web scraping industry is gaining momentum, and this can also be seen in the growing number of podcasts, newsletters, and YouTube channels about it.
I can mention as an example the “Ethical data explained” podcast, curated by SOAX, where experts talk about best practices from a legal point of view for doing web scraping.
Oxylabs also launched a monthly newsletter in December called “The Scraping Digest”, covering various topics about the industry. As an example, the latest issue was dedicated to web scraping and AI.
Zyte stepped up its webinar production, with two published last week and another one coming in the next few days, mixing tutorials about its new API with more general content about web scraping.
Of course, it’s all content marketing, but when it’s done honestly and gives real value to end users, it’s worth spending time reading, watching, or listening to this material.
Another piece of content I really liked this month was this video from the Cobalt Intelligence YouTube channel.
It’s the success story of a person with a passion for web scraping who found a smart way to make a decent amount of money with a side hustle in a super niche market. I found it intriguing and inspiring for all the freelancers in this community who want to make some extra money with a personal project. By the way, you can find many more cases on the channel; to me, it’s one of the best on web scraping.
Another video I loved, a bit more technical, was made by Ben (I suppose there’s no need for any introduction here, but for the few who don’t know him, he writes one of the most interesting and successful Substacks in the tech landscape).
In this video, you can follow Ben as he creates an end-to-end data project, from web scraping with the Bright Data API to loading the data into Snowflake and visualizing it with Tableau.
New year, old problems
January also brought the first lawsuit of the year. Once again Meta is involved, once again personal data was scraped, and once again there was, at least according to what has been reported so far, a clear breach of the Terms and Conditions of the Facebook website.
In this case, Meta alleged that the startup Voyager Labs was improperly creating fake accounts and scraping user data.
As we always say, the fact that data is shown on the internet doesn’t mean it can be acquired without following any rules. Personal data, in my opinion, is a no-go for any web scraping project, because of privacy and legal concerns. But if you really want to do it, please work with a legal team that can clarify what you can and cannot do.
Is PhantomJS now a ghost?
This research, authored by Antoine Vastel, head of research at DataDome, shows that although PhantomJS has not been updated since 2018, a significant amount of traffic on the tested websites can still be traced back to a PhantomJS browser.
Of course, the web has changed a lot in recent years and PhantomJS is showing its age, with Playwright and Puppeteer gaining traction in the industry, but it’s surprising that there are still some PhantomJS solutions up and running.
A deobfuscation essay
Still on the subject of high-quality content, Umasi published the second part of his study of the Kasada virtual machine mechanism that protects Nike’s website from bots. It’s a very technical series of posts, but worth reading.
First part: link
Second part: link
Spoiler alert: I’m in talks with Veritas, another contributor to the same blog, and author of several interesting posts on scraping and deobfuscation like this one on Supreme. I’m working on an interview with him and hope to have him in this newsletter in February.
Most read post of the month
The most loved article of this month was “THE LAB #9” about scraping OpenSea, with some data mingling and analysis.
The Web Scraping Club is a free weekly newsletter about web scraping.
Once every two weeks, I publish The Lab, paid content with deep dives into more technical aspects and solutions to common issues, along with code in a GitHub repository. You can access the following articles with a 7-day trial and then subscribe if you find them useful.
The Lab - premium content with real-world cases
THE LAB #8: Using Bezier curves for human-like mouse movements
THE LAB #6: Changing Ciphers in Scrapy to avoid bans by TLS Fingerprinting
THE LAB #4: Scrapyd - how to manage and schedule a fleet of scrapers
THE LAB #2: scraping data from a website with Datadome and xsrf tokens
Our Discord server is the place where we can share our experiences interactively, have a chat, find deals from our partners, and much more. I’d be glad to see you all there.
Not sure what you mean by Zyte API? I was using BrightData. Is BrightData using Zyte?