Legal Zyte-geist #2: Web Scraping and AI 2023 Legal Wrap-Up
Looking back on the legalities of web scraping and AI in 2023
Welcome to the monthly column about web scraping and legal themes by Sanaea Daruwalla. She is the Chief Legal & People Officer at Zyte. Sanaea has over 15 years of experience representing a wide variety of clients and is one of the leading experts on web data extraction laws.
Disclaimer: This post is for informational purposes only. The content is not legal advice and does not create an attorney-client relationship.
Last year we saw a big increase in lawsuits and regulations relating to web scraping, particularly in the area of Generative AI. While a lot of the case law and regulation is still pending, we did start to see some guidance from the US and EU courts and regulators. Below is an overview of the landscape in 2023 and what is to come in 2024.
US Case Law
While the US did not make much movement on the regulatory side (other than the AI Executive Order), we saw a slew of lawsuits targeted at Generative AI companies. Additionally, there were several notable web scraping cases filed last year that we will be watching closely in 2024.
Web Scraping Cases
On the exclusively scraping front, the cases to watch in 2023 were Meta v. Bright Data, Ryanair v. Booking Holdings, X Corp. v. Bright Data, X Corp. v. Center for Countering Digital Hate, X Corp. v. John Does, Air Canada v. Localhost, and Jobiak v. Botmakers and Aspen Tech. All of these cases, aside from the Ryanair case, are relatively new and have not yet yielded any notable results. However, we could see a lot of action in these cases in 2024, so they will be interesting to watch. The most relevant causes of action at play in these suits are copyright infringement, breach of contract, and violations of the Computer Fraud and Abuse Act (CFAA).
Generative AI Cases
On the GenAI side of the house, the lawsuits are frankly too many to list. But we’ve seen class actions across the US against the likes of OpenAI, Microsoft, Meta, Stability AI, and Google, with the New York Times rounding out the year with a very high-profile case against OpenAI. These lawsuits focus on the legality of the inputs and outputs of the defendants’ GenAI applications. For web scrapers, the outputs aren’t as concerning, but if you’re scraping data to train an LLM, it’s all about the legality of those inputs. As a side note, for a bit of detail on the rulings regarding copyright and the outputs of a GenAI system, check out these two posts on the Stability AI and Meta cases.
The most pressing issues for web scrapers are those of copyright infringement and data protection. If you’re scraping copyrighted data and personal data to train an LLM, what are the boundaries? No court or regulator has yet fully answered this question.
For the copyright infringement claims, we’ve seen some of the big tech companies argue that training a model with copyrighted data cannot be considered copyright infringement because the machine is simply learning from the data. A human learning from copyrighted data does not commit a copyright violation, so how can a machine doing the same? While this is a persuasive argument, we have yet to see whether it will succeed in court. Stay tuned, as it’s highly likely we’ll get an answer to this question this year.
The data protection causes of action will prove more complex. With varying data protection laws across the US and other very strict data protection laws across the globe, it’s hard to predict how this will play out. If you’re using training data that contains personal information, what is required? In the US, many states make an exception for publicly available personal data, so does that mean it’s acceptable to train your model with any and all public personal data? In the EU, the GDPR requires a lawful basis for processing, such as consent or legitimate interest. Will GenAI companies be able to establish a lawful basis and remain compliant with all the other GDPR requirements, like notification and data subject access rights?
Currently, data protection authorities in several EU countries, including Poland, Italy, Spain, France, and Germany, are investigating OpenAI’s use of personal data to train its models. The results will be insightful for any company collecting training data for AI models that may include personal data.
We know that companies will need a lawful basis to process the data, but which lawful bases the data protection authorities will deem persuasive, and what levels of notification and access will be required, remain to be determined. Additionally, whether models already developed in violation of the GDPR will need to have personal data removed, or be rebuilt from scratch, is also an open question. The hope is that the EU takes an approach that protects personal data without chilling the use and development of AI in the EU.
EU AI Act
At the end of 2023, the EU reached a provisional agreement to pass the EU AI Act, which will likely take effect later this year. The AI Act takes a risk-based approach, classifying AI systems into four different risk categories depending on their use cases: (1) unacceptable risk, (2) high risk, (3) limited risk, (4) minimal/no risk. No-risk AI will have no requirements under the Act, limited-risk AI (such as chatbots) will face transparency requirements, high-risk AI will have to follow more rigid requirements, and unacceptable-risk AI will be banned.
High-risk systems come with the most stringent requirements. AI falls into this category if (1) it is a safety component of, or a product subject to, existing safety standards and assessments, such as toys or medical devices; or (2) it is used for a specific sensitive purpose. In these cases, companies will need to implement risk and quality management systems, data governance, security, robust documentation, and human oversight.
As the law is finalized this year, we will get more details on how and when it will be implemented.
In summary, 2023 was a wild ride for scraping and AI, and 2024 will give us some more insight as the case law and regulatory landscape mature. Zyte will continue to monitor it all and will provide any relevant updates throughout 2024.