The Web Scraping Club

Web Scraping news recap - February 2023

Legal updates and new tools available in February for the web scraping industry

Pierluigi Vinciguerra
Feb 26
This post is sponsored by Smartproxy, the premium proxy and web scraping infrastructure focused on the best price, ease of use, and performance.


All The Web Scraping Club readers can save 10% on every purchase by using the discount code WEBSCRAPINGCLUB10.


Hi everyone and welcome back to The Web Scraping Club. This post is our monthly review of what happened in the web scraping industry in February.

Legal updates

There’s no monthly recap without news about legal actions involving web scraping and privacy violations.


Meta vs Voyager Labs

The first news of the month is Meta suing the Israeli firm Voyager Labs for creating fake Facebook accounts to collect large amounts of personal data. This data was then aggregated with other data points and sold to law enforcement “agencies tasked with public safety.” Meta states that this behavior breaks Facebook's Terms of Service and, in my opinion, there’s almost no doubt about it.

The full article can be found here on Ars Technica.

Meta vs Bright Data

The case between Meta and Bright Data, which started several months ago, has seen some updates in recent weeks.

Meta is suing Bright Data for its scraping activity on Facebook, while Bright Data insists it was scraping only public data from Facebook and therefore was not breaking any terms of service.

During the discussion of the case in the past weeks, a curious detail came out of the papers: Meta has been a customer of Bright Data for 6 years, requesting data to improve its ad system.

Full article here.

Video of the month

Still on the legal side of web scraping, one of the most interesting videos I’ve found this month was made by William Whitman, an attorney who just opened his YouTube channel and also talks about web scraping.

Tech updates

Oxylabs released its web unblocker

Oxylabs revealed its new web unblocker, an AI-powered tool to bypass anti-bot solutions. It selects, rotates, and evaluates the most suitable proxies for a specific site to provide the highest possible success rate along with the lowest response time. The system also selects the right combination of headers, cookies, browser attributes, JavaScript fingerprints, and proxies to appear as a real user, not triggering CAPTCHAs and bypassing target website blocks.

Here’s an image that recaps how it works.

[Image: Oxylabs unblocker functioning]
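To make the "select, rotate, and evaluate" idea more concrete, here is a minimal sketch of how proxies could be ranked by success rate and response time. This is not Oxylabs' implementation, which is not public: the `ProxyStats` class, the `pick_proxy` function, and the proxy names are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProxyStats:
    """Running statistics for one proxy (hypothetical structure)."""
    name: str
    successes: int = 0
    failures: int = 0
    total_latency: float = 0.0  # seconds, summed over successful requests only

    def record(self, ok: bool, latency: float = 0.0) -> None:
        """Store the outcome of one request made through this proxy."""
        if ok:
            self.successes += 1
            self.total_latency += latency
        else:
            self.failures += 1

    @property
    def success_rate(self) -> float:
        attempts = self.successes + self.failures
        return self.successes / attempts if attempts else 0.0

    @property
    def avg_latency(self) -> float:
        # A proxy with no successful request gets the worst possible latency.
        return self.total_latency / self.successes if self.successes else float("inf")


def pick_proxy(pool: List[ProxyStats]) -> ProxyStats:
    """Pick the highest success rate; break ties with the lowest average latency."""
    return max(pool, key=lambda p: (p.success_rate, -p.avg_latency))


# Example with made-up measurements for a single target website.
pool = [ProxyStats("proxy-a"), ProxyStats("proxy-b"), ProxyStats("proxy-c")]
pool[0].record(True, 0.8)
pool[0].record(False)
pool[1].record(True, 0.5)
pool[1].record(True, 0.7)
pool[2].record(True, 0.2)
pool[2].record(True, 0.4)

print(pick_proxy(pool).name)
```

In a real setup the statistics would be kept per target website and refreshed continuously, since a proxy that works well on one site can be blocked on another.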

New Chrome headless released

A few days ago, Antoine Vastel of DataDome analyzed what will happen when the new headless mode of Chrome is rolled out and used more broadly.
Basically, Chrome in headless mode will have a fingerprint much more similar to the headful one, and this will have an impact on anti-bot techniques.

As Vastel says: “The new headless Chrome browser fingerprint is way more realistic than the first/old version of headless Chrome. Depending on the sophistication of your detection engine, it’s going to make it easier for bot developers to bypass detection, particularly detection based on browser fingerprinting signals. As written in Chromium’s code, The new headless mode // is Chrome browser running without any visible UI. Thus, a lot of subtle differences that used to exist between the old headless Chrome and a genuine headful Chrome don’t exist anymore.”

This is an interesting topic to follow for anyone involved in web scraping. You can find the full article here.
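For those who want to try the two modes side by side, here is a small sketch that builds the Chrome command line for each of them. At the time of writing, the new mode is enabled with the `--headless=new` flag, while a plain `--headless` starts the old one; this may change in future Chrome releases. The `chrome_command` helper and the `google-chrome` binary name are assumptions for illustration, so adjust them to your own installation.

```python
from typing import List


def chrome_command(url: str, new_headless: bool = True,
                   binary: str = "google-chrome") -> List[str]:
    """Build the argument list to open a URL in headless Chrome.

    The binary name is an assumption: it differs between operating systems.
    """
    # "--headless=new" selects the new headless mode,
    # a bare "--headless" falls back to the old one (at the time of writing).
    flag = "--headless=new" if new_headless else "--headless"
    return [binary, flag, url]


print(chrome_command("https://example.com"))
print(chrome_command("https://example.com", new_headless=False))
```

The resulting list can be passed to `subprocess.Popen` to start the browser. Comparing the fingerprints collected in the two modes is a quick way to see the differences described in the article.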

The most read article of the month

The most successful article of the month is the Anti-Detect Anti-Bot matrix, a post where we tested several techniques against different anti-bot solutions to find out which configuration works best against each of them.


This concludes my selection of posts and articles for February. If I’ve missed something important, please let me know in the comments or on our Discord server.


The Web Scraping Club is a free weekly newsletter about web scraping. Once every two weeks, I publish The Lab, paid content with deep dives on more technical aspects and solutions to common issues, with accompanying code in a GitHub repository. You can access the following articles with a 7-day trial and then subscribe if you find them useful.


The Lab - premium content with real-world cases

  • THE LAB #12: Reverse-engineering Mobile API

  • THE LAB #11: The Anti-Detect Anti-Bot matrix

  • THE LAB #10: Bypass Cloudflare Bot Protection with GoLogin

  • THE LAB #9: Scraping OpenSea NFT's data

  • THE LAB #8: Using Bezier curves for human-like mouse movements

  • THE LAB #7: Scraping PerimeterX protected websites

  • THE LAB #6: Changing Ciphers in Scrapy to avoid bans by TLS Fingerprinting

  • THE LAB #5 - Scraping Airbnb.com using GraphQL

  • THE LAB #4: Scrapyd - how to manage and schedule a fleet of scrapers

  • THE LAB #3: scraping Cloudflare protected websites

  • THE LAB #2: scraping data from a website with Datadome and xsrf tokens

  • THE LAB #1: scraping data from an app
