Web scraping from 0 to hero: a guide to creating your scrapers
Pick the right web scraping tool for starting your project
Welcome back to this free web scraping course provided by The Web Scraping Club.
Let’s have a brief recap of the previous topics we’ve seen:
we started with an introduction to web scraping
we have seen the checklist to complete before starting a web scraping project: auditing the target website from a technical point of view to determine the best approach
we’ve also seen what a modern infrastructure for web scraping should look like
There’s only one step remaining between you and coding your scraper: the legal check-up of your project. Are you going to collect copyrighted information? Are you going to scrape personal information? If you answered yes to either question, it’s probably better to stop and ask for a legal opinion from professionals in this field, or you could get into serious trouble.
Since I’m not a lawyer, I won’t cover these topics in the course, although I’m working on a monthly legal corner where you can better understand the limits of web scraping and submit your questions.
So, assuming you’ve done your homework on the legal side, you can finally put your fingers on the keyboard and start creating your scrapers.
How the course works
The course is, and always will be, free. As always, I’m here to share, not to make you buy something. If you want to say “thank you”, consider subscribing to this Substack with a paid plan. It’s not mandatory but appreciated, and you’ll get access to the whole “The LAB” article archive, with 30+ practical articles on more complex topics, and its code repository.
We’ll mostly see free-to-use packages and solutions; if some commercial ones appear, it’s because they are solutions I’ve already tested and that solve issues I couldn’t solve in any other way.
At first, I imagined this course as a monthly issue, but as I wrote down the table of contents, I realized it would take years to complete. So it will probably run bi-weekly, filling the gaps in the publishing plan without taking too much space away from more in-depth articles.
The collection of articles can be found using the tag WSF0TH, and there will be a dedicated section on the main Substack page.
Creating a scraper: where to start?
Which programming language is the best for scraping?
Technically speaking, you can do some scraping with Ruby, Go, C++, shell scripts, and so on, but in reality two languages have become the most used for scraping: Python and JavaScript (with Node.js).
I cannot say which of the two is the best, for the simple fact that I’ve used Python for most of my 10+ year career in web scraping. The fact that I never needed to switch to another language during this period means I never faced a situation where the Python frameworks I commonly use were not enough.
This doesn’t mean there are no situations where JavaScript outperforms Python; it simply hasn’t happened to me.
Given this huge disclaimer, I would say that unless you have a case where speed and request concurrency are the key factors (there, Node.js probably has a small advantage), both Python and JavaScript are good choices: start with the one you’re more familiar with.
Best web scraping tools in Python
Depending on whether we need to scrape a website using simple HTTP requests or browser automation tools to bypass anti-bot solutions, here’s the list of tools I’d suggest for starting your web scraping learning path.
Scrapy
Scrapy is a successful open-source project created in the mid-2000s, as the internet was growing rapidly, by Shane Evans, Pablo Hoffman, and other contributors who joined over time. At that time, web scraping was often done using ad-hoc scripts and tools that lacked the reliability and flexibility required for large-scale data extraction projects.
Scrapy's development started in 2008, and its first public release was in 2009. From the beginning, Scrapy was open-source, allowing the broader community to contribute to its growth and improvement. Its open nature attracted a diverse group of developers, data scientists, and web scraping enthusiasts who shared their expertise and ideas.
Nowadays, Scrapy is maintained by Zyte, co-founded by Shane (currently the CEO) and Pablo (currently a director), and it remains one of the most trusted and widely used web scraping frameworks in the Python ecosystem.
When a website doesn’t have any advanced anti-bot solution, it’s the preferred tool to use.
Its modular architecture makes it possible to integrate a series of add-on modules that enhance its basic features:
Scrapy-Splash: Splash is a headless browser designed for web scraping and rendering JavaScript-heavy websites. You can easily integrate Splash with Scrapy to scrape dynamic web pages effectively.
scrapy-selenium: Selenium is a web testing framework that can be used to automate web interactions. When paired with Scrapy, it's useful for scraping websites that require user interaction, such as login forms or search bars.
Scrapyd: Scrapyd is a service for running Scrapy spiders in a distributed manner. It allows you to deploy and schedule your spiders, making it easier to manage large-scale web scraping projects.
Scrapy-Redis: If you need to distribute your Scrapy spiders across multiple machines or nodes, Scrapy-Redis is a package that enables distributed crawling using Redis as a message broker.
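To give you a first taste of the framework before we start coding in the next episode, here’s a minimal sketch of a Scrapy spider. It targets books.toscrape.com, a public sandbox site built for scraping practice; the selectors are specific to that site and purely illustrative.

```python
import scrapy


class BooksSpider(scrapy.Spider):
    """A minimal example spider for the books.toscrape.com sandbox site."""

    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Each product card on the page becomes one item
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
        # Follow the pagination link until the last page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

You can run it even without a full project with `scrapy runspider books_spider.py -o books.json`, and Scrapy takes care of scheduling requests, retrying failures, and exporting the items.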
Playwright
Microsoft Playwright is a powerful and innovative automation and testing framework that has gained significant attention and popularity in the world of web development and quality assurance. Developed by Microsoft, Playwright is designed to simplify and streamline browser automation tasks across different web browsers, including Chromium (Google Chrome), Firefox, and WebKit (Safari).
What sets Playwright apart from many other browser automation tools is its ability to provide a consistent and unified API for interacting with multiple browsers. This means that developers and testers can write scripts or code that work seamlessly across various browsers, eliminating the need for complex browser-specific code and reducing the maintenance overhead.
Playwright is not limited to testing: since it automates tasks on real browsers, it’s also an ideal tool for web scraping. It can be used both with Python and Node.js, and even inside Scrapy projects, but I preferred to give it its own space since it’s the tool to choose when we’re facing anti-bot solutions.
In those cases, we need to mimic the real browsers used by humans, and Playwright’s wide range of features helps us do so.
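As a first taste, here’s a minimal sketch using Playwright’s sync API in Python (it assumes you’ve run `pip install playwright` and `playwright install chromium`). It reuses the same sandbox site as above; bypassing a real anti-bot solution would take more than this, such as proxies and fingerprint adjustments.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # headless=False opens a visible window, which is handy when debugging
    # sites that behave differently for headless browsers
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://books.toscrape.com/")
    # The page is rendered by a real browser engine, JavaScript included,
    # so we can extract content exactly as a human visitor would see it
    titles = page.locator("article.product_pod h3 a").all_text_contents()
    print(titles)
    browser.close()
```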
If you’ve been a reader of this newsletter for some time, you’ll have noticed that every month I use different Python packages to bypass certain challenges, and they’re all important tools to have in your toolbelt. But if you’re starting web scraping today, there are two main tools you need to master:
Scrapy for the simpler websites to scrape
Playwright for the websites with anti-bot solutions to bypass.
Best web scraping tools in Node.js
Again, this section is not based on my personal experience but on what I’ve read over the years. If someone is willing to help write this part of the guide, I’d be happy to integrate it into the article.
Axios
Axios is a popular JavaScript library that allows you to make HTTP requests from a Node.js environment, such as a server-side application or a command-line script. It simplifies the process of sending HTTP requests to web servers and handling the responses. Axios is commonly used for fetching data from APIs, making POST requests, and interacting with web services in Node.js applications.
Here are some key features and benefits of Axios in Node.js:
Promise-based: Axios uses promises, which makes it easy to work with asynchronous operations. You can use async/await syntax to write clean and readable code for making HTTP requests.
Support for Browser and Node.js: Axios is designed to work in both web browsers and Node.js environments. This versatility allows you to share code between client-side and server-side applications.
HTTP Methods: Axios supports all HTTP request methods, including GET, POST, PUT, DELETE, and more. You can easily customize headers, request data, and query parameters.
Interceptors: Axios provides a powerful feature called interceptors, which allows you to intercept and modify requests and responses globally or for specific requests. This is useful for adding authentication headers, error handling, and other custom logic.
Automatic JSON Parsing: Axios automatically parses JSON responses, making it straightforward to work with JSON data from APIs without manual parsing.
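Just to give an idea of the API, here’s a minimal sketch of a GET request with Axios; the URL, headers, and timeout value are illustrative.

```javascript
// Assumes: npm install axios
const axios = require('axios');

async function fetchPage(url) {
  try {
    const response = await axios.get(url, {
      // A realistic User-Agent is often needed to get past basic blocks
      headers: { 'User-Agent': 'Mozilla/5.0' },
      timeout: 10000, // fail fast instead of hanging on slow servers
    });
    return response.data; // raw HTML, or already-parsed JSON for APIs
  } catch (error) {
    console.error(`Request failed: ${error.message}`);
    throw error;
  }
}

fetchPage('https://books.toscrape.com/').then((html) => console.log(html.length));
```

Note that Axios only fetches the raw response: to extract data from the HTML, you’ll typically pair it with a parser like Cheerio, described next.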
Cheerio
Cheerio is a popular and lightweight JavaScript library for parsing and manipulating HTML and XML documents in a Node.js environment. It provides a simple and efficient way to traverse the Document Object Model (DOM) of web pages and extract data from them, similar to how you would use jQuery to manipulate web pages in a browser. Cheerio is particularly useful for web scraping and data extraction tasks in Node.js applications.
Here are some key features and benefits of Cheerio:
jQuery-like Syntax: Cheerio's API is similar to jQuery, which means that if you are familiar with jQuery, you can quickly get started with Cheerio. You can use CSS selectors to target specific elements in the HTML or XML document.
Server-Side Parsing: Cheerio is designed to work in server-side JavaScript environments like Node.js. It does not require a web browser, making it a lightweight and efficient choice for parsing documents in the back end.
Parsing and Traversal: You can load an HTML or XML document into Cheerio and then use its methods to traverse and manipulate the document's structure. This includes selecting elements, changing attributes and content, and more.
Support for Complex Documents: Cheerio can handle complex HTML and XML documents, making it suitable for web scraping tasks on a wide range of websites.
Modular and Extensible: Cheerio is modular and allows you to extend its functionality using plugins. You can also combine it with other Node.js libraries to enhance its capabilities.
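Here’s a minimal sketch that pairs Axios (to fetch the page) with Cheerio (to parse it), which together roughly play the role Scrapy plays in Python; the selectors are illustrative and refer to the same sandbox site used earlier.

```javascript
// Assumes: npm install axios cheerio
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeTitles(url) {
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html); // jQuery-like handle on the document
  const titles = [];
  $('article.product_pod h3 a').each((_, el) => {
    titles.push($(el).attr('title')); // plain CSS selectors, as in jQuery
  });
  return titles;
}

scrapeTitles('https://books.toscrape.com/').then(console.log);
```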
Puppeteer
Puppeteer is the original project from which Microsoft Playwright took inspiration.
The purpose is the same, since Puppeteer is also a browser automation tool, and it has a longer history and a wider community.
Unlike Playwright, it’s available only for Node.js.
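A minimal Puppeteer sketch looks very similar to the Playwright one shown earlier; again, the target site and selectors are illustrative.

```javascript
// Assumes: npm install puppeteer (it downloads a bundled Chromium)
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com/', { waitUntil: 'networkidle2' });
  // page.evaluate runs inside the page, so regular DOM APIs are available
  const titles = await page.evaluate(() =>
    Array.from(document.querySelectorAll('article.product_pod h3 a'))
      .map((a) => a.getAttribute('title'))
  );
  console.log(titles);
  await browser.close();
})();
```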
Web scraping is more than scraping tools
Let’s recap: we have a web scraping project in mind, we’re sure it’s completely legal to scrape data from the target, we’ve done our homework and seen that there are no anti-bot solutions installed, and the data we need seems to be right there in the HTML.
All these data points suggest that a Scrapy scraper will do the job. And that may be true, but it’s not guaranteed. Maybe the scraper will work on our laptop but won’t work when deployed on our target infrastructure. Or it will work, but only for the first few requests.
That’s why choosing the right tool at the beginning of the project is important to avoid wasting time, but there’s much more to understand about how complex web scraping is nowadays.
But that’s enough for today: in the next episode, we’ll start coding our first scraper using Scrapy.