Web scraping from 0 to hero: Introduction to web scraping

What is web scraping, why is it relevant today and... is it legal?

Oct 22, 2023

Does the world need another web scraping course?

I’ve been in the industry for so long that I would say No, but the reality proves to me every day that I’m wrong.


In our Discord server and on the webscraping subreddit, I read questions every day that would need long and detailed answers, because these doubts are symptoms of a lack of fundamentals on the subject. This is completely normal, and it’s the same reason I decided to open The Web Scraping Club a year ago.

While the academic world is flooded with courses about Machine Learning, AI, and so on, it’s hard to find a course fully dedicated to web scraping. On the other hand, there’s plenty of marketing material online about web scraping, some of excellent quality, but, in the end, the suggested solution is always to buy a commercial product. The hidden gems are inside “hackers’ forums”, away from prying eyes, because of the fear that once the information becomes public, it loses its value as anti-bot software gets updated.

Last but not least, web scraping is a very large topic: people usually focus only on a certain solution, tool, or issue, missing the big picture. Anyone can make a few bucks on Upwork by doing web scraping, but being a web scraping professional means understanding much more than how to solve a technical challenge: am I doing something illegal? Is there a smarter way to extract data in order to create a more reliable scraper? These are just a few of the questions we need to be able to answer to be considered experts in web scraping.

So, at the end of this long incipit, I would say Yes, there’s a need for another web scraping course.

How the course works

The course is and always will be free. As always, I’m here to share and not to make you buy something. If you want to say “thank you”, consider subscribing to this substack with a paid plan. It’s not mandatory but appreciated, and you’ll get access to the whole “The LAB” articles archive, with almost 30 practical articles on more complex topics and their code repository.
We’ll see free-to-use packages and solutions, and if some commercial ones appear, it’s because they are solutions that I’ve already tested and that solve issues I cannot solve in other ways.
At first, I imagined this course as a monthly issue, but as I was writing down the table of contents, I realized it would take years to complete. So it will probably come out every two weeks, filling the gaps in the publishing plan without taking too much space at the expense of more in-depth articles.
The collection of articles can be found using the tag WSF0TH and there will be a section on the main substack page.

Before starting with the course, a reminder about the events coming up in the next few weeks.

Web Data Extraction Summit - Dublin - 25-26th of October - Hosted by Zyte

I’ll be there in person with Andrea Squatrito, who will host a session about Web Data Marketplaces like Data Boutique. Feel free to drop by and say hi if you’ll be there, I’d be happy to meet in person anyone from this amazing community we’ve built. If you haven’t got your tickets yet, go to the official Web Data Extraction Summit page.

Scrapecon 2023 - Virtual - 7th of November - Hosted by Bright Data

Besides the great panels you can see on the official page of the event, I’ll host a web scraping contest. The prize? $2,000 in Bright Data credits and a year of paid content on The Web Scraping Club.

You can find the rules for joining the contest on this page, while if you want to attend the event you can request your invitation on its official page.

Enough with the introduction, let’s start with the course.

What is web scraping and why is it becoming more and more relevant?

Web scraping is the process of using programs to extract data from a website in an automated way.
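To make the definition concrete, here is a minimal sketch in Python that uses only the standard library. The HTML snippet, the class names `product`, `name` and `price`, and the helper `scrape_products` are all made up for illustration; a real scraper would first download the page and would have to adapt the parsing to the actual structure of the target website.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this HTML first.
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Red shoes</span><span class="price">49.90</span></div>
  <div class="product"><span class="name">Blue hat</span><span class="price">15.00</span></div>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # field we are currently inside, if any
        self.products = []         # one dict per product found

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class
            if css_class == "name":
                # In this sample, a "name" span marks the start of a new product.
                self.products.append({})

    def handle_data(self, data):
        if self.current_field:
            self.products[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def scrape_products(html):
    """Turns raw HTML into structured records (name as text, price as float)."""
    parser = ProductParser()
    parser.feed(html)
    return [{"name": p["name"], "price": float(p["price"])} for p in parser.products]


products = scrape_products(SAMPLE_HTML)
for product in products:
    print(product["name"], product["price"])
```

The point is the transformation: unstructured HTML goes in, structured records come out, with no human copying and pasting in between.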

It’s sometimes confused with screen scraping or web crawling, but they are slightly different practices.

Screen scraping only copies pixels displayed onscreen, without considering the underlying HTML code.

Web crawling is the act of browsing the web, collecting only the links between web pages without looking into their content.
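As a rough illustration of crawling as described here, the sketch below follows links and records only which page points to which, ignoring the rest of the content. The tiny website in `FAKE_SITE`, its URLs, and the `crawl` helper are invented for this example and kept in memory so that it runs without a network connection; in a real crawler, the `fetch` function would download each page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical tiny website kept in memory so the sketch runs offline.
FAKE_SITE = {
    "https://example.com/": '<a href="/shoes">Shoes</a> <a href="/hats">Hats</a>',
    "https://example.com/shoes": '<a href="/">Home</a> <a href="/shoes/red">Red</a>',
    "https://example.com/hats": '<a href="/">Home</a>',
    "https://example.com/shoes/red": "<p>Red shoes, 49.90</p>",
}


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag and nothing else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch):
    """Breadth-first crawl that returns {url: [absolute outgoing links]}."""
    graph = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in graph:  # already visited
            continue
        parser = LinkParser()
        parser.feed(fetch(url))
        links = [urljoin(url, href) for href in parser.links]
        graph[url] = links
        queue.extend(link for link in links if link not in graph)
    return graph


link_graph = crawl("https://example.com/", lambda url: FAKE_SITE.get(url, ""))
for page, links in link_graph.items():
    print(page, "->", links)
```

Note that the product text on the last page is never extracted: the crawler only maps the connections, while a scraper would be the one reading the name and the price.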

Before proceeding with the technicalities of web scraping, let’s understand first why it’s a trending topic today.

Google search trend for expression “web scraping” over the past 5 years

As shown in the previous image, the volume of searches on Google for the term “web scraping” is slowly but steadily growing over time, meaning there’s more and more interest in this subject.

COVID-19 boosted the process of digitalization of the economy: according to the Markinblog research, from 2020 to 2021, the number of worldwide e-commerce websites doubled and kept growing in later years.

This phenomenon made the economy more observable via web scraping and new players entered the “web arena”, which exposed them to new challenges and needs.

In fact, once the prices of the goods you’re selling are publicly available on the web, you need to be sure that they’re competitive. This drives the demand for market intelligence and price comparison services, which in turn require web scraping professionals to implement them.

Today we’re still in a golden age for web scraping, because of the huge popularity of another technology: artificial intelligence.

Every algorithm needs data to be trained on and, in most cases, the source of that data is the web. It could be news, technical articles, e-commerce product data, locations, or reviews, but in the end, some web scraping is needed to collect it.

So it’s a good time to become a web scraping professional since chances of making a living out of it are increasing, considering also the fact that there are more and more ways to get an income from it.

Platforms like Upwork (where today there are 2000+ job offers for the keywords “web scraping”), Fiverr, and Freelancer.com offer a way to prove your web scraping skills and earn from these gigs. Data marketplaces like databoutique.com (disclaimer: I’m one of the founders) were created to deliver high-quality web-scraped data to a wider audience at a fraction of the actual cost, while also being convenient for the sellers.

During a gold rush, sell pickaxes

Is web scraping legal?

To answer this question, a whole new chapter of this course would be needed but since I don’t like to reinvent the wheel, I’m sharing one of the best posts you could find online about the web scraping legal context.

To summarize, we might have legal issues on two different aspects: the type of data collected and how we did it.

Types of data

Web scraping itself is a tool, but we decide how to use it, just like a hammer: we can decide to drive a nail or break a car's glass.

On the web, there are some types of data we don’t have the right to collect, and some we’re allowed to.

Personal data, for example, is a big NO. So every time someone collects emails for lead generation from various sources, that’s something that should not be done, because it can violate privacy rules, depending on the state where the owner of the email lives.

Copyrighted data, again, is something we cannot collect in the wild unless we have an agreement with the source. This includes news, books, and everything that is the result of human creativity.

“Sui generis” databases, which are databases that required substantial effort to build, cannot be scraped, but the rules protecting them only apply in some states.

All the other types of factual data (prices, locations, reviews, and so on) that don’t fall into the previous categories, we’re allowed to scrape, if accessible in the right way.

Data accessibility

The other important aspect to consider before starting a web scraping project is data accessibility. Depending on how accessible the data is, you may or may not run into legal issues.

Non-public data is everything that is intended for internal use and requires a login to be accessed, like internal company portals, bank accounts, and so on. Of course, this type of data cannot be scraped.

Public data, on the other hand, is data that anyone with an internet connection can see. In this case, if you need to explicitly accept Terms of Service that prohibit web scraping (usually when you need to create an account on services like Linkedin or Facebook), then you cannot scrape that data source. Otherwise, even if there are Terms of Service in the footer of the website that forbid scraping, they have no legal value and the data source can be scraped, since you didn’t do anything to accept them.

To make the search engines’ job easier, websites usually create special files like robots.txt and sitemaps to describe their structure and explicitly tell bots where they are welcome and where they are not allowed to crawl, but we’ll see these aspects in a later chapter of the course.
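As a small preview, this is roughly how a program can read those rules with Python’s standard `urllib.robotparser` module. The robots.txt content, the `my-scraper` user agent, and the example.com URLs are invented for illustration; on a real website, the file would be downloaded from the site’s root.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt
SAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Allow: /
"""

robots = RobotFileParser()
# parse() takes the file as a list of lines, so no network request is made here.
robots.parse(SAMPLE_ROBOTS_TXT.strip().splitlines())

# "my-scraper" is a made-up user agent name for this sketch.
can_scrape_products = robots.can_fetch("my-scraper", "https://example.com/products")
can_scrape_private = robots.can_fetch("my-scraper", "https://example.com/private/data")

print("products allowed:", can_scrape_products)
print("private allowed:", can_scrape_private)
```

Checking `can_fetch` before requesting a URL is a cheap way to stay within what the website owner has declared acceptable for bots.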

With great power comes great responsibility

The first chapter of the course ends here.

In the next one, we’ll see why robots.txt and sitemaps could help our web scraping practices, what high-level activities we should do before starting scraping, and the most common technologies used today.

If you think you know someone who could benefit from this course, please share this article with them.

Viet

Nov 16, 2023

I’m stoked …I’d love a second course on Build with Pier series . Community votes on category and site and we can build it together following along via twitch or kick even on Discord (use a gated member function) or keep it free as you always planned
