Web Scraping News: October Monthly Recap
Scraping personal data? Not a good idea
Hi, this is Pierluigi from The Web Scraping Club, a newsletter where you can find news, insights, and tutorials with real-world examples about web scraping.
Being a paying subscriber gives you:
Access to Paid Content, like the post series called “The LAB”, where we dive deep into real-world cases with code (view here as an example).
Access to the GitHub repository with the code seen on “The LAB”
Access to private channels on our Discord server
But if you prefer to read this newsletter for free, you will always get one post per week about:
News about web scraping
Anti-bot software and techniques insights
Interviews with key people in the industry
And you can always join the Web Scraping Club Discord server.
Today we’ll see what happened this month in the web scraping industry.
Meta settles a lawsuit against two companies scraping Facebook and Instagram data
original source: TechCrunch
Meta, Facebook's parent company, has settled a lawsuit against two companies that were scraping data from Facebook and Instagram users for marketing intelligence purposes.
The original complaint was filed in October 2020 against BrandTotal Ltd, which claims to offer its customers a real-time competitive intelligence platform to monitor their competitors' social media strategy and paid campaigns.
The second company named in the suit is Unimania, which offered apps to access social networks in different ways, like seeing Instagram stories anonymously.
To evade the websites' protections against scraping, these companies exploited users’ access to the services through a set of browser extensions called “UpVoice” and “Ads Feed”, designed to access and collect data. When people installed the extensions and visited the websites, the extensions used automated programs to scrape their name, user ID, gender, date of birth, relationship status, location information, and other information related to their accounts.
According to the filing that detailed the proposed settlement, both companies agreed to stop scraping or assisting others in data collection practices, delete their software and code, and agreed to a ban on distributing or selling any data they collected through their operations, among other things. It also notes that they agreed to pay monetary damages in a confidential settlement.
This doesn't mean that web scraping is always illegal, but it does mean that once you log in to a website you are accepting its terms and conditions, and Facebook's don't allow web scraping.
French Government Hits Clearview With The Maximum Fine For GDPR Violations
original source: Techdirt
Another day, another fine for Clearview, the US company that maintains a database of ten billion pictures of people's faces, from all over the world, scraped from several sources.
The company uses this biometric data to sell services to law enforcement agencies and retailers, in what seems the worst possible use of web scraping on Earth. The French government has just fined the company 20 million USD for GDPR violations, following the example of Italy (a 21 million fine in March) and the UK (a 9.4 million fine).
Clearview’s CEO Hoan Ton-That commented on the fine saying:
"There is no way to determine if a person has French citizenship, purely from a public photo from the internet, and therefore it is impossible to delete data from French residents. Clearview AI only collects publicly available information from the internet, just like any other search engine like Google, Bing or DuckDuckGo."
If this is true, Clearview’s product is illegal everywhere in Europe. Clearview is admitting it cannot determine the origin of the images and data it scrapes, which also means it can’t comply with its own agreements/legal settlements where it has agreed to stop collecting in certain locales and delete data pertaining to these residents.
The issue here — especially in countries subject to the GDPR — is consent. While scraping data from the open web breaks no US laws, collecting data without consent does violate some state laws and clearly violates the GDPR. The only privacy standard Clearview appears to recognize is that whatever can be scraped without a login isn’t private and that it has a right to collect it, compile it, make it searchable, and sell it to government agencies and other customers.
Web Scraping Adoption in E-commerce keeps increasing
Oxylabs has just released a new white paper about web data adoption in e-commerce in the UK and the US.
Findings indicate that web scraping has firmly entrenched itself within the e-commerce industry – more than three-quarters (75.7%) of companies employ it in their daily operations. Additionally, most have already seen impressive returns from the data collection method, as 32.4% state that web scraping has generated the most revenue.
You can download the full white paper at this page.
That's enough for today. Please be respectful of target websites and privacy laws when scraping: as we have seen, bad things can happen.
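One practical way to be respectful of a target website is to check its robots.txt before fetching a page. Here is a minimal sketch using Python's standard library `urllib.robotparser`. The robots.txt rules, the `my-bot` user agent, and the example.com URLs are all made up for illustration; a real scraper would download the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    for url in (
        "https://example.com/news/october",
        "https://example.com/account/settings",
    ):
        print(url, "allowed:", is_allowed(parser, "my-bot", url))
    # Seconds to wait between requests, if the site declares a delay.
    print("crawl delay:", parser.crawl_delay("my-bot"))
```

Keep in mind that robots.txt only expresses the site owner's crawling preferences. It says nothing about privacy laws such as the GDPR or about the terms and conditions you accept when logging in, so those still need a separate check.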
Are any of you working on something spectacular in web scraping and want to share it with us? Please write to firstname.lastname@example.org and you could be in the next interview!
The Lab - premium content with real-world cases
THE LAB #4: Scrapyd - how to manage and schedule a fleet of scrapers
THE LAB #2: scraping data from a website with Datadome and xsrf tokens
The Web Scraping Club is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.