The January 2023 recap for the Web Scraping industry

New tools and resources for starting this new year

Jan 29, 2023

Content, content, content

The web scraping industry is gaining momentum, and this can also be seen in the growing number of podcasts, newsletters, and YouTube channels about it.

One example is the “Ethical data explained” podcast, curated by SOAX, where experts discuss best practices for web scraping from a legal point of view.

Oxylabs also launched a monthly newsletter in December called “The Scraping Digest”, covering various topics about the industry. As an example, the latest issue was dedicated to web scraping and AI.

Zyte intensified its webinar production, with two published last week and another one coming in the next few days, mixing tutorials about their new API with some general content about web scraping.

Of course, it’s all content marketing, but when it is done honestly and gives real value to the end users, it’s worth spending time reading, watching, or listening to this material.

Another piece of content I really liked this month was this video from the Cobalt Intelligence YouTube channel.

It’s the success story of a person with a passion for web scraping who found a smart way to make a decent amount of money with a side hustle in a super niche market. I found it intriguing and inspiring for all the freelancers in this community who want to make some extra money with a personal project. By the way, you can find many more cases on the channel; to me, it’s one of the best web scraping channels.

Another video I loved, a bit more technical, was made by SeattleDataGuy (I suppose there's no need for an introduction here, but for the few who don't know him, he writes SeattleDataGuy’s Newsletter, one of the most interesting and successful Substacks in the tech landscape).

In this video, you can follow Ben as he creates an end-to-end data project, from web scraping with the Bright Data API to loading the data into Snowflake and visualizing it with Tableau.

New year, old problems

This January also brings the first lawsuit of the year. Again Meta was involved, again personal data was scraped, and again there was, at least according to what the news reports at the moment, a clear breach of the Terms and Conditions of the Facebook website.

In this case, Meta alleged that the startup Voyager Labs was improperly creating fake accounts and scraping user data.

As we always say, the fact that data is shown on the internet doesn’t mean it can be acquired without following any rules. Personal data, in my opinion, is a no-go for any web scraping project, for both privacy and legal reasons. But if you really want to do it, please work with a legal team that can clarify what you can and cannot do.

Is PhantomJS now a ghost?

An interesting article popped up on Hacker News last week about the usage of PhantomJS, one of the headless browsers with JavaScript APIs used for web scraping.

This research, authored by Antoine Vastel, head of research at DataDome, showed that even though PhantomJS has not been updated since 2018, there is still a significant amount of traffic on the tested websites that can be traced back to a PhantomJS browser.

Of course, the web has changed a lot in these years and PhantomJS is showing its age, with Playwright and Puppeteer gaining traction in the industry, but it’s surprising that there are still some PhantomJS solutions up and running.
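To give a rough idea of how traffic can be traced back to PhantomJS, here is a minimal Python sketch. It is not the method used in the research mentioned above; it only assumes that an unmodified PhantomJS build announces itself with a "PhantomJS/<version>" token in its User-Agent header. The function name and the regular expression are hypothetical, chosen just for illustration.

```python
import re
from typing import Optional

# Assumption: a default PhantomJS build includes a "PhantomJS/<version>"
# token in its User-Agent header. A scraper can override this header,
# so this check only catches unmodified setups.
PHANTOMJS_TOKEN = re.compile(r"PhantomJS/(\d+(?:\.\d+)*)")


def phantomjs_version(user_agent: str) -> Optional[str]:
    """Return the PhantomJS version found in a User-Agent string, or None."""
    match = PHANTOMJS_TOKEN.search(user_agent)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 "
        "(KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1"
    )
    print(phantomjs_version(sample))
```

Since the User-Agent header is trivial to change, a check like this is only a first filter; anti-bot vendors typically combine it with many other signals.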

A deobfuscation essay

Still on the subject of high-quality content, Umasi published the second part of his study of the Kasada virtual machine mechanism used to protect Nike’s website from bots. It’s a very technical series of posts, but worth reading.

First part: link

Second part: link

Spoiler alert: I’m in talks with Veritas, another contributor to the same blog, and author of several interesting posts on scraping and deobfuscation like this one on Supreme. I’m working on an interview with him and hope to have him in this newsletter in February.