Can I scrape any public data?

The web is the greatest source of information available, like an enormous library, but this doesn’t mean we can scrape whatever we can read.

Photo by Giammarco Boscaro on Unsplash

What are the rules for web scraping then?

When we start a web scraping project, we need to respect the rules we covered in our other post, “Is web scraping legal?”, but they are not the only ones.

In fact, even if our scraper behaves properly, there are still some potential legal issues we need to be aware of. The fact that data is publicly accessible doesn’t mean it can be scraped.

The first rule is to treat personal data carefully and follow privacy laws, especially when scraping social networks. Every country has its own privacy regulations and, depending on the nationality or country of residence of the person whose data is scraped, we must follow the applicable legislation.

But other types of data can hide issues too.

The second rule is not to scrape data protected by copyright. Depending on the context, images or text may be off-limits: for example, we can’t scrape images from Flickr, and we certainly can’t resell them. The photographers own the rights to publish those pictures; we don’t.

The third rule is that we cannot scrape data from a website to gain an advantage over it. For example, we cannot scrape a listing website in order to populate a competing website with the same listings.

The final rules for ethical web scraping

Let’s summarize the rules we have seen for ethical web scraping.

Don’t harm the target website by flooding it with requests
Don’t break ToS that you have explicitly accepted (so no scraping websites where a login is needed)
Don’t scrape private data
Don’t scrape copyrighted data
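
To make the first rule more concrete, here is a minimal Python sketch of one possible way to avoid flooding a website: pausing between two requests to the same host. The class name HostThrottle, the 2-second default and the example.com URLs are our own assumptions for illustration, not values taken from any website’s policy; the right delay depends on the target site.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum pause between two requests to the same host.

    Hypothetical helper for illustration: the default delay is an
    assumption, not a value recommended by any specific website.
    """

    def __init__(self, min_delay_seconds=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_seconds = min_delay_seconds
        self._clock = clock    # injectable, so the logic can be tested without waiting
        self._sleep = sleep
        self._last_request_at = {}  # host -> time of the last request

    def wait(self, url):
        """Block until it is polite to request `url`; return the seconds waited."""
        host = urlparse(url).netloc
        last = self._last_request_at.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay_seconds - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the moment we are allowed to send the request.
        self._last_request_at[host] = self._clock()
        return waited


if __name__ == "__main__":
    throttle = HostThrottle(min_delay_seconds=2.0)
    for page in ("https://example.com/page/1", "https://example.com/page/2"):
        waited = throttle.wait(page)
        # The actual download would go here, with the HTTP library of our choice.
        print(f"{page}: waited {waited:.1f}s before requesting")
```

Calling throttle.wait(url) right before each request keeps us to at most one request per host every min_delay_seconds, whichever HTTP library does the actual download.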

Here’s a video that explains ethical web scraping in more detail.

This post was written by Pierluigi Vinciguerra (pier@thewebscraping.club)

The Web Scraping Club
