The January 2023 recap for the Web Scraping industry
New tools and resources for starting this new year
How 2023 started for the web scraping industry
Content, content, content
The web scraping industry is gaining momentum, and this can also be seen in the growing number of podcasts, newsletters, and YouTube channels about it.
Of course, it’s all content marketing, but when it’s done honestly and gives real value to the end users, it’s worth spending time reading, watching, or listening to this material.
Another piece of content I really liked this month was this video from the Cobalt Intelligence YouTube channel.
It’s the success story of a person with a passion for web scraping who found a smart way to make a decent amount of money with a side hustle in a super niche market. I found it intriguing and inspiring for all the freelancers in the community who want to make some extra money with a personal project. By the way, you can find many more cases on the channel; to me, it’s one of the best web scraping channels.
Another video I loved, a bit more technical, was made by SeattleDataGuy (I suppose there's no need for an introduction here, but for the few who don't know him, he writes SeattleDataGuy’s Newsletter, one of the most interesting and successful Substacks in the tech landscape).
In this video, you can follow Ben creating an end-to-end data project, from web scraping using Bright Data API to loading data in Snowflake and visualizing it with Tableau.
New year, old problems
This January also brings the first lawsuit of the year. Again Meta was involved, again personal data was scraped, and again there was, at least according to the news available at the moment, a clear breach of the Terms and Conditions of the Facebook website.
In this case, Meta alleged that the startup Voyager Labs was improperly creating fake accounts and scraping user data.
As we always say, the fact that data is shown on the internet doesn’t mean it can be acquired without following any rules. Personal data, in my opinion, is a no-go for any web scraping project, for both privacy and legal reasons. But if you really want to do it, please work with a legal team that can clarify what you can and cannot do.
Is PhantomJS now a ghost?
This research, signed by Antoine Vastel, head of research at DataDome, showed that even though PhantomJS has not been updated since 2018, there’s still a significant amount of traffic on the tested websites that can be traced back to a PhantomJS browser.
Of course, the web has changed a lot in these years and PhantomJS is showing its age, with Playwright and Puppeteer gaining traction in the industry, but it’s surprising that there are still some PhantomJS solutions up and running.
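To give an idea of how this kind of traffic can be spotted at the most basic level, here is a minimal Python sketch. It is not the methodology used in the research, which likely relies on many more signals. It only relies on the fact that PhantomJS, by default, announces itself with a "PhantomJS" token in its User-Agent header. The sample user agents and the function names are made up for illustration, and since a User-Agent can be spoofed easily, this only catches setups that never bothered to change it.

```python
# Hypothetical sample of User-Agent headers, as they might appear in access logs.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 "
    "(KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/109.0.0.0 Safari/537.36",
    "python-requests/2.28.1",
]


def is_phantomjs(user_agent: str) -> bool:
    """Return True if the User-Agent carries the default PhantomJS token."""
    return "phantomjs" in user_agent.lower()


def phantomjs_share(user_agents: list[str]) -> float:
    """Return the fraction of requests whose User-Agent looks like PhantomJS."""
    if not user_agents:
        return 0.0
    hits = sum(1 for ua in user_agents if is_phantomjs(ua))
    return hits / len(user_agents)


if __name__ == "__main__":
    share = phantomjs_share(SAMPLE_USER_AGENTS)
    print(f"PhantomJS share of sampled requests: {share:.0%}")
```

On real logs, the list would be replaced by the User-Agent column of your own access logs; anything beyond this naive check (JavaScript fingerprinting, for example) is out of scope for this sketch.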
A deobfuscation essay
Staying on the topic of high-quality content, Umasi published the second part of his study of the virtual machine mechanism Kasada uses to protect Nike’s website from bots. It’s a very technical series of posts, but worth reading.
First part: link
Second part: link
Spoiler alert: I’m in talks with Veritas, another contributor to the same blog, and author of several interesting posts on scraping and deobfuscation like this one on Supreme. I’m working on an interview with him and hope to have him in this newsletter in February.
Most read post of the month
The most loved article of this month was “THE LAB #9” about scraping OpenSea, with some data mingling and analysis.
The Web Scraping Club is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.