The Web Scraping Club

The 2022 recap for the Web Scraping industry

substack.thewebscraping.club

My end of year remarks, the birth of The Web Scraping Club and more

Dec 30, 2022
This post is sponsored by Oxylabs, your premium proxy provider. Sponsorships help keep The Web Scraping Club free, and they’re a way to give some value back to readers.

In this case, all readers of The Web Scraping Club can save 25% on residential proxies by using the discount code WSC25.


My general feelings about 2022

2022 is coming to a close and, as usual, these last days are spent recapping what we achieved and what happened during the past year.

As some of you may know from my LinkedIn profile, I’ve been in the web scraping industry since 2014 with Re Analytics. Today we scrape hundreds of e-commerce websites every day so, at least in this niche, I can say I’ve got some hands-on experience.

From my experience, I would say that 2022 was the year when anti-bot solutions gave us the most headaches. The difficulty of web scraping increased significantly, and so did the cost of doing it at scale.

But this didn’t stop the web scraping industry; it only pushed developers to use more sophisticated techniques, more proxies, and headful browsers.

As Shane Evans, CEO of Zyte, stated during the 2022 Extract Summit, spending on web data has increased and will keep increasing in the future.

More and more industries are being digitalized, so both the need for web data and its availability keep increasing, and web scraping will remain a hot trend in the coming years.

Three trends I’ve seen in 2022

Web scraping is getting legal recognition

During 2022 there were several court rulings that helped define the boundaries between legal and illegal web scraping.

The US appeals court reaffirmed that HiQ acted lawfully in the LinkedIn case, and Korean Supreme Court decisions reached the same conclusion in cases where companies were scraping competitors’ websites.

Of course, it remains illegal to gather personal data without consent, as Clearview AI did, or to knowingly break the terms of use of the target website.

The rise of AI

Over the last 2-3 years I’ve heard about several solutions that use AI to do web scraping, but I’ve always been skeptical about them. Web scraping is such a generic and open-ended task, with so many variables like anti-bot solutions and different website designs, that an AI able to handle at least 90% of web scraping cases seemed like a utopia to me.

But when I saw ChatGPT in action, with such a powerful engine capable of answering almost anything it is asked, I was shocked. If such powerful models become more accessible and trainable, we’re probably close to turning that utopia into reality.

No Code tools

As with AI, I was skeptical about the adoption of no-code tools, for more or less the same reasons. Web scraping covers a wide array of cases and variables, and on top of that I wasn’t sure about the scalability of these solutions.

But during 2022 I’ve seen great improvements in these tools. Especially if you need to get data from a medium-sized website with no heavy anti-bot software, they could be the solution with the shortest time to value.

The birth of The Web Scraping Club

When someone asks me why I started The Web Scraping Club, I always answer that it began as a personal need. I needed a place to write down all the knowledge and hands-on experience I gather. Sometimes I spend days looking for a solution to a particular challenge and, after solving it, I don’t want to lose the discoveries, tools, and references I’ve found.

I could have kept a personal “digital garden”, accessible only to me, and kept the advantage of my discoveries to myself. But I strongly believe that building a community of people facing the same issues, encouraging them to exchange solutions, and learning from all the members has far more advantages than going solo.

For this reason, besides the newsletter, we have our Discord server, where we can interact and gather all the ideas and challenges we have. I strongly encourage you to join: there are some amazing talents with brilliant ideas.

Join our Discord

Also because I believe in the power of community, in 2023 I’d like to add some more contributors to The Web Scraping Club, to cover topics I’m not proficient in.

If you have experience in:

  • Web Scraping in Node.js

  • Tools for scraping data from apps

  • SERP scraping

and want to help the club grow by writing some articles, please reach out to me at pier@thewebscraping.club

Talking about the growth of the club, we have reached 600 subscribers in only 4 months. I’d like to thank you all personally for the trust you have in me. And special thanks to the paid subscribers, who helped me believe in this project and showed their support in such an amazing way.

And of course, I cannot forget to mention Oxylabs, the first company that believed in this newsletter, which is giving it great support and offering concrete help to our readers. Using the discount code WSC25, you can save 25% on residential proxies.

Spoiler alert: together with Oxylabs, other partners are joining to add value both for the Club and for the readers.

The most-read post of the Club

The growth of The Web Scraping Club is fully organic: I don’t spend anything on advertising and I don’t think I will in the near future. So the views depend strictly on the number of subscribers and on how viral the posts go on websites like Hacker News or on social networks, especially LinkedIn and Facebook.

If you think the content you’re reading is meaningful for you and others in your community and want to help spread it, please consider sharing it on these channels.

The first post that had huge success was the first “The Lab” article, which ranked on the first page of Hacker News and has been read by more than 12k people.

THE LAB #1: Scraping data from an app
This is the first post of “THE LAB”: in this series, we'll cover real-world use cases, with code and an explanation of the methodology used. In the future, this kind of content will be available only to paying subscribers. Being the first of the series, this one will be available for free until Sunday 11th of Sept 2022, then will be behind a paywall…

I didn’t have the same luck with other posts. HN seems to shadowban you if you post articles from the same website too often, so I stopped sharing my articles there, but please feel free to do it if you want.

Talking about Facebook, I’d like to mention Web Scraping World, the group of my friend Wojciech Żeglin, CEO of Dataminers.co. We share the same focus on spreading knowledge about web scraping, and I suggest joining this group to create connections with other professionals.

I hope you’re having a good Xmas break. See you soon with a special issue of “The Lab” on scraping data from OpenSea.


The Web Scraping Club is a free weekly newsletter about web scraping.

Once every two weeks, I publish The Lab, paid content with deep dives into more technical aspects and solutions to common issues, with code in a GitHub repository. You can access the following articles with a 7-day trial and then subscribe if you find them useful.


The Lab - premium content with real-world cases

  • THE LAB #8: Using Bezier curves for human-like mouse movements

  • THE LAB #7: Scraping PerimeterX protected websites

  • THE LAB #6: Changing Ciphers in Scrapy to avoid bans by TLS Fingerprinting

  • THE LAB #5 - Scraping Airbnb.com using GraphQL

  • THE LAB #4: Scrapyd - how to manage and schedule a fleet of scrapers

  • THE LAB #3: scraping Cloudflare protected websites

  • THE LAB #2: scraping data from a website with Datadome and xsrf tokens

  • THE LAB #1: scraping data from an app


Our Discord server is the place where we can share our experiences interactively or have a chat, find deals from our partners, and much more. I’d be glad to see you all there.

Join our Discord Server

© 2023 Pierluigi