Web Scraping news recap - February 2023
Legal updates and new tools available in February for the web scraping industry
This post is sponsored by Smartproxy, the premium proxy and web scraping infrastructure focused on the best price, ease of use, and performance.
All The Web Scraping Club readers can save 10% on every purchase by using the discount code WEBSCRAPINGCLUB10.
Hi everyone and welcome back to The Web Scraping Club. This post is our monthly review of what happened in the web scraping industry in February.
There’s no monthly recap without news about legal actions involving web scraping and privacy violations.
Meta vs Voyager Labs
The first news of the month is Meta suing the Israeli firm Voyager Labs for creating fake accounts on Facebook and using them to collect a large amount of personal data. This data was then aggregated with other data points and sold to law enforcement “agencies tasked with public safety.” Meta states that this behavior breaks Facebook’s Terms of Service and, in my opinion, there’s almost no doubt about it.
The full article can be found here on Ars Technica.
Meta vs Bright Data
In the case between Meta and Bright Data, which started several months ago, there have been some updates in recent weeks.
Meta is suing Bright Data for its scraping activity on Facebook, while Bright Data insists it was scraping only public data from Facebook and therefore was not breaking any terms of service.
As the case was discussed over the past weeks, a curious detail emerged from the court papers: Meta has been a customer of Bright Data for 6 years, requesting data to improve its ad system.
Video of the month
Staying on the legal side of web scraping, one of the most interesting videos I’ve found this month was made by William Whitman, an attorney who has just opened his YouTube channel and also talks about web scraping.
Oxylabs released its web unblocker
Here’s an image that recaps how it works.
New Chrome headless released
Some days ago, Antoine Vastel of DataDome analyzed what will happen when the new version of headless Chrome is rolled out and used more broadly.
Basically, Chrome running in headless mode will have a fingerprint much more similar to the headful one, and this will affect anti-bot techniques.
As Vastel says: “The new headless Chrome browser fingerprint is way more realistic than the first/old version of headless Chrome. Depending on the sophistication of your detection engine, it’s going to make it easier for bot developers to bypass detection, particularly detection based on browser fingerprinting signals. As written in Chromium’s code, the new headless mode ‘is Chrome browser running without any visible UI’. Thus, a lot of subtle differences that used to exist between the old headless Chrome and a genuine headful Chrome don’t exist anymore.”
This is an interesting topic to follow for anyone involved in web scraping. You can find the full article here.
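To make the difference a bit more concrete, here is a minimal Python sketch under a few assumptions. It builds the command-line flag for each headless mode (`--headless=new` for the new one, plain `--headless` for the old one, as they worked at the time of writing) and checks a user agent for the "HeadlessChrome" token that the old headless mode exposes by default. The function names and the sample user agent strings are made up for illustration and are not taken from Vastel's article. Real detection engines rely on many more signals than this single check.

```python
from typing import List

# Token that the old headless Chrome puts in its default user agent.
OLD_HEADLESS_UA_MARKER = "HeadlessChrome"


def chrome_headless_args(new_headless: bool) -> List[str]:
    """Return the Chrome flag for running without a visible UI.

    Assumption: '--headless=new' selects the new headless mode and a bare
    '--headless' selects the old one, as was the case in early 2023.
    """
    return ["--headless=new" if new_headless else "--headless"]


def has_old_headless_marker(user_agent: str) -> bool:
    """Naive fingerprint check: does the user agent contain 'HeadlessChrome'?"""
    return OLD_HEADLESS_UA_MARKER in user_agent


# Illustrative user agent strings (hypothetical samples, not captured data).
old_headless_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/110.0.0.0 Safari/537.36"
)
headful_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
)

if __name__ == "__main__":
    print(chrome_headless_args(new_headless=True))
    print(has_old_headless_marker(old_headless_ua))
    print(has_old_headless_marker(headful_ua))
```

A check this simple only catches the most obvious signal of the old headless mode, which is one reason a more realistic fingerprint matters for both scrapers and anti-bot vendors.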
The most read article of the month
The most successful article of the month is the Anti-detect Anti-bot matrix, a post where we compared several techniques against different anti-bot solutions to find out which configuration can be the best match for them.
This concludes my selection of posts and articles for February. If I’ve missed something important, please let me know in the comments or on our Discord server.
The Web Scraping Club is a free weekly newsletter about web scraping. Once every two weeks, I publish The Lab, paid content with deep dives on more technical aspects and solutions to common issues, along with code in a GitHub repository. You can access the following articles with a 7-day trial and then subscribe if you find them useful.
The Lab - premium content with real-world cases
THE LAB #8: Using Bezier curves for human-like mouse movements
THE LAB #6: Changing Ciphers in Scrapy to avoid bans by TLS Fingerprinting
THE LAB #4: Scrapyd - how to manage and schedule a fleet of scrapers
THE LAB #2: scraping data from a website with Datadome and xsrf tokens