The Web Scraping Club

The January 2023 recap for the Web Scraping industry

New tools and resources for starting this new year

Pierluigi Vinciguerra
Jan 29
This post is sponsored by Smartproxy, the premium proxy and web scraping infrastructure focused on the best price, ease of use, and performance.

All The Web Scraping Club readers can save 10% on every purchase by using the discount code WEBSCRAPINGCLUB10.


How 2023 started for the web scraping industry

Content, content, content

The web scraping industry is gaining momentum, and this also shows in the growing number of podcasts, newsletters, and YouTube channels about it.

One example is the “Ethical data explained” podcast, curated by SOAX, where experts discuss best practices for web scraping from a legal point of view.

In December, Oxylabs also launched a monthly newsletter called “The Scraping Digest”, covering various topics about the industry. The latest issue, for example, was dedicated to web scraping and AI.

Zyte stepped up its webinar production, with two published last week and another coming in the next few days, mixing tutorials about its new API with more general content about web scraping.

Of course, it’s all content marketing, but when it is done honestly and gives real value to end users, this material is worth the time spent reading, watching, or listening to it.

Another piece of content I really liked this month was this video from the Cobalt Intelligence YouTube channel.

It’s the success story of a person with a passion for web scraping who found a smart way to make a decent amount of money with a side hustle in a super niche market. I found it intriguing and inspiring for all the freelancers in this community who want to make some extra money with a personal project. By the way, you can find many more cases on the channel; to me, it’s one of the best on web scraping.

Another video I loved, a bit more technical, was made by SeattleDataGuy (I suppose there’s no need for an introduction here, but for the few who don’t know him, he writes SeattleDataGuy’s Newsletter, one of the most interesting and successful Substacks in the tech landscape).

In this video, you can follow Ben as he builds an end-to-end data project, from web scraping with the Bright Data API to loading the data into Snowflake and visualizing it with Tableau.

New year, old problems

This January also brings the first lawsuit of the year. Again Meta is involved, again personal data was scraped, and again there was, at least according to what has been reported in the news so far, a clear breach of the Terms and Conditions of the Facebook website.

In this case, Meta alleged that the startup Voyager Labs improperly created fake accounts and scraped user data.

As we always say, the fact that data is shown on the internet doesn’t mean it can be collected without following any rules. Personal data, in my opinion, is a no-go for any web scraping project, for privacy and legal reasons. But if you really want to do it, please work with a legal team that can clarify what you can and cannot do.

Is PhantomJS now a ghost?

An interesting article popped up on Hacker News last week about the usage of PhantomJS, one of the headless browsers with JavaScript APIs used for web scraping.

This research, by Antoine Vastel, head of research at Datadome, showed that although PhantomJS has not been updated since 2018, a significant amount of traffic on the tested websites can still be traced back to a PhantomJS browser.

Of course, the web has changed a lot in these years and PhantomJS is showing its age, with Playwright and Puppeteer gaining traction in the industry, but it’s surprising that some PhantomJS solutions are still up and running.
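
To give an idea of how such traffic could be spotted in a server log, here is a minimal sketch. It is not the method used in the research mentioned above: it only assumes that the client kept the default PhantomJS User-Agent, which contains a "PhantomJS/<version>" token. The function names and the sample User-Agent strings are made up for illustration, and since a User-Agent is trivial to spoof, a real anti-bot solution would rely on many more signals.

```python
import re
from typing import Iterable, Optional

# Assumption: the client did not override the default PhantomJS User-Agent,
# which carries a "PhantomJS/<version>" token.
_PHANTOMJS_TOKEN = re.compile(r"PhantomJS/(\d+(?:\.\d+)*)")


def phantomjs_version(user_agent: Optional[str]) -> Optional[str]:
    """Return the PhantomJS version advertised in a User-Agent, or None."""
    match = _PHANTOMJS_TOKEN.search(user_agent or "")
    return match.group(1) if match else None


def phantomjs_share(user_agents: Iterable[Optional[str]]) -> float:
    """Return the fraction of User-Agents that advertise PhantomJS."""
    agents = list(user_agents)
    if not agents:
        return 0.0
    hits = sum(1 for ua in agents if phantomjs_version(ua) is not None)
    return hits / len(agents)


if __name__ == "__main__":
    # Hypothetical sample of User-Agent strings taken from an access log.
    sample = [
        "Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 "
        "(KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    ]
    for ua in sample:
        print(phantomjs_version(ua), "<-", ua[:40])
    print("Share of PhantomJS requests:", phantomjs_share(sample))
```

A check like this only catches the laziest setups, which is probably part of why it’s surprising to still see them around.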

A deobfuscation essay

Still on the subject of great-quality content, Umasi published the second part of his study of the Kasada virtual machine mechanism that protects Nike’s website from bots. It’s a very technical series of posts, but worth reading.

First part: link

Second part: link

Spoiler alert: I’m in talks with Veritas, another contributor to the same blog, and author of several interesting posts on scraping and deobfuscation like this one on Supreme. I’m working on an interview with him and hope to have him in this newsletter in February.

Most read post of the month

The most loved article of this month was “THE LAB #9” about scraping OpenSea, with some data mingling and analysis.


The Web Scraping Club is a free weekly newsletter about web scraping.

Once every two weeks, I publish The Lab, paid content with deep dives on more technical aspects and solutions to common issues, along with code in a GitHub repository. You can access the following articles with a 7-day trial and then subscribe if you find them useful.


The Lab - premium content with real-world cases

  • THE LAB #10: Bypass Cloudflare Bot Protection with GoLogin

  • THE LAB #9: Scraping OpenSea NFT's data

  • THE LAB #8: Using Bezier curves for human-like mouse movements

  • THE LAB #7: Scraping PerimeterX protected websites

  • THE LAB #6: Changing Ciphers in Scrapy to avoid bans by TLS Fingerprinting

  • THE LAB #5 - Scraping Airbnb.com using GraphQL

  • THE LAB #4: Scrapyd - how to manage and schedule a fleet of scrapers

  • THE LAB #3: scraping Cloudflare protected websites

  • THE LAB #2: scraping data from a website with Datadome and xsrf tokens

  • THE LAB #1: scraping data from an app


Our Discord server is the place where we can share our experiences interactively or have a chat, find deals from our partners, and much more. I’d be glad to see you all there.

Join our Discord Server

3 Comments
SeattleDataGuy
Writes SeattleDataGuy’s Newsletter
Jan 30

Not sure what you mean by Zyte API? I was using Bright Data. Is Bright Data using Zyte?
