Web scraping from 0 to hero: before you start scraping
Tools, best practices, and a checklist to apply before starting your web scraping project.
Let’s continue our course “Web scraping from 0 to hero”, also known as WSF0TH.
In the previous lesson, we saw why web scraping is becoming more and more important in this age of data and when it is legally feasible.
In this episode, we’ll see the best practices to follow when approaching a website before we start coding our scrapers, assuming we are allowed to scrape it.
But why is it important to understand best practices and use the right tools? Why should we spend time on preliminary checks instead of writing our scraper right away?
Web scraping has become a complex subject, and the proof of this claim is this blog. There wouldn’t be almost 2k followers of these pages if web scraping were as plain vanilla as it was several years ago, when there were no anti-bots, no hundreds of tools to choose from, and so on.
Things have changed: as many industries went digital, there are more and more niches to scrape, and even more websites that don’t like being scraped and use anti-bots. This led to a flourishing market of anti-anti-bot solutions, both free and commercial, and tech stack detection tools, as well as best practices to avoid getting lost in this variety of options.
Before going on, let me recap how this course works.
How the course works
The course is and always will be free. As always, I’m here to share, not to make you buy something. If you want to say “thank you”, consider subscribing to this substack with a paid plan. It’s not mandatory but appreciated, and you’ll get access to the whole archive of “The LAB” articles, with 30+ practical articles on more complex topics, and its code repository.
We’ll see free-to-use packages and solutions; if some commercial ones appear, it’s because I’ve already tested them and they solve issues I cannot solve in other ways.
At first, I imagined this course as a monthly issue, but as I was writing down the table of contents, I realized it would take years to complete. So it will probably have a bi-weekly frequency, filling the gaps in the publishing plan without taking too much space at the expense of more in-depth articles.
The collection of articles can be found using the tag WSF0TH and there will be a section on the main substack page.
Define an output data model
Spending time on this phase may seem like overkill, but if you skip it, you’ll likely have to rework your scraper later, when you discover that a field you want to add is on another page or requires a different approach to the website.
Writing down your desired output and locating where each piece of information is on the website doesn’t take much time and helps avoid this rework, since it feeds directly into the next step.
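As a minimal sketch of this step, the output data model for a hypothetical e-commerce scraper could be written down as a Python dataclass, with a comment on where each field is expected to live. The class name, the field names, and their locations are illustrative assumptions, not taken from a real website.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class ProductItem:
    """Desired output for one product (all fields are hypothetical)."""

    product_code: str                  # expected on the product detail page
    name: str                          # expected on the listing page
    price: float                       # expected on the listing page
    currency: str                      # expected on the listing page
    product_url: str                   # expected on the listing page
    stock_level: Optional[int] = None  # may only be exposed by an API


def missing_fields(item: ProductItem) -> list[str]:
    """Return the names of the fields that are still empty."""
    return [f.name for f in fields(item) if getattr(item, f.name) in (None, "")]


item = ProductItem(
    product_code="ABC123",
    name="Example sneaker",
    price=99.9,
    currency="EUR",
    product_url="https://www.website.com/p/abc123",
)
print(asdict(item))
print(missing_fields(item))
```

Listing the empty fields early shows which pieces of information still need a source before any scraper is written.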
Locate the desired data
If you’re reading this substack from the beginning, you’re already aware of these steps, but they’re the fundamentals for creating a durable web scraper.
Is there an API with the needed data?
This is the first question we should ask ourselves when starting a new web scraping project.
APIs are everywhere in our daily digital life. Especially when a website needs to handle dynamic content, like showing products on an e-commerce site or dots on a map, an API is called to fetch this data from the backend.
If so, opening the Network tab in the browser’s developer tools should help us see these calls.
By inspecting the API response, we can understand whether it contains all the data needed to fill the fields of our desired data model.
What are the pros of using the APIs rather than scraping the HTML directly?
They are more resilient over time since they are less prone to changes, allowing us to create a more robust scraper.
Data is already structured and cleaned, so we don’t need any further transformation to get usable data.
They’re often used not only for the website but also internally, and this means we can get some data that is not explicitly shown on the website. One example is the inventory level of products on an e-commerce website.
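To make the check against the data model concrete, here is a small sketch that compares a JSON response with the fields we want. The response body, the "products" key, and the field names are invented for illustration; in practice you would paste a body copied from the browser's network tab.

```python
import json

# Fields of our desired data model (hypothetical names).
REQUIRED_FIELDS = {"product_code", "name", "price", "currency", "stock_level"}

# Sample response body, invented for this example.
SAMPLE_RESPONSE = """
{
  "products": [
    {"product_code": "ABC123", "name": "Example sneaker",
     "price": 99.9, "currency": "EUR", "stock_level": 12},
    {"product_code": "XYZ789", "name": "Example boot",
     "price": 149.0, "currency": "EUR"}
  ]
}
"""


def uncovered_fields(payload: dict, list_key: str = "products") -> set[str]:
    """Return the model fields that at least one record in the response lacks."""
    missing: set[str] = set()
    for record in payload.get(list_key, []):
        missing |= REQUIRED_FIELDS - set(record)
    return missing


payload = json.loads(SAMPLE_RESPONSE)
print(uncovered_fields(payload))
```

An empty result would suggest the API alone can feed the data model; anything else points to fields that need another source.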
Is there an API used by an app?
If we didn’t see any API used by the target website, we could check whether an app exists. Apps communicate with the server mainly via APIs, so this could be a good workaround to get the same benefits described before. To detect the calls made by an app, you need a program like Fiddler Everywhere that acts as a “man-in-the-middle” between your device and the server, intercepting the traffic between them. We have already seen this approach in this article, where we used Fiddler, and in this other one, written by Fabien Vauchelles, where we used Charles Proxy.
In both cases, if the APIs are not protected by tokens generated inside the app itself, they can be used to scrape the target website by mimicking a call from a mobile device. In most cases, it’s enough to change the headers in the call and, if needed, add a mobile proxy.
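As a rough sketch of what "changing the headers" can look like, the snippet below prepares a request that carries app-like headers, using only the Python standard library. The endpoint URL and every header value are hypothetical placeholders; the real ones are whatever you observe in the intercepted traffic. The request is only built here, not sent.

```python
from urllib.request import Request

# Hypothetical endpoint and header values; replace them with the ones
# captured with Fiddler or Charles Proxy.
API_URL = "https://api.website.com/v1/products?page=1"
MOBILE_HEADERS = {
    "User-Agent": "ExampleShopApp/5.2.1 (iPhone; iOS 17.4)",
    "Accept": "application/json",
    "Accept-Language": "en-US",
}


def build_mobile_request(url: str) -> Request:
    """Prepare a GET request carrying the headers sent by the app."""
    return Request(url, headers=MOBILE_HEADERS, method="GET")


request = build_mobile_request(API_URL)
print(request.get_method(), request.full_url)
print(request.header_items())
# Sending it would be: urllib.request.urlopen(request, timeout=10)
```

If a mobile proxy turns out to be necessary, one option in the standard library is to build an opener with urllib.request.ProxyHandler and send the same request through it.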
Anyway, I’ll write an article about this in one of the next posts of this course, so we can dig deeper into the details.
JSON in the HTML
Some modern frameworks used for creating websites, like React or Next.js, generate HTML pages that embed data in JSON format when they need to handle dynamic parts of a website and load data coming from the backend.
For example, in React-generated websites, depending on the implementation, you can find a JSON object preceded by “window.__PRELOADED_STATE__ =”.
Just like for the APIs, this approach allows us to create more resilient scrapers, since changes in the resulting JSON are rarer than in pure HTML and we have data that is already structured and cleaned.
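A minimal sketch of how such embedded JSON could be pulled out of a page follows. The HTML snippet is invented, and the code assumes the value after the marker is strict JSON; some sites embed JavaScript literals (for example undefined) that would need extra handling.

```python
import json
import re

# Invented HTML snippet imitating a server-rendered page.
SAMPLE_HTML = """
<html><body>
<div id="root"></div>
<script>
window.__PRELOADED_STATE__ = {"product": {"code": "ABC123", "name": "Example sneaker", "price": 99.9}};
</script>
</body></html>
"""

STATE_MARKER = re.compile(r"window\.__PRELOADED_STATE__\s*=\s*")


def extract_preloaded_state(html: str):
    """Return the JSON value assigned to window.__PRELOADED_STATE__, or None."""
    match = STATE_MARKER.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ";" and the closing tag).
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


state = extract_preloaded_state(SAMPLE_HTML)
print(state["product"]["name"], state["product"]["price"])
```

Using raw_decode avoids writing a regular expression that has to guess where the JSON object ends.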
If we have neither APIs nor JSON inside the HTML available, then we need to create selectors that read the website’s pages.
Understanding the website tech architecture
Is there any anti-bot?
This is the most important question because the costs and the success of our web scraping project depend on the answer. Luckily, answering it is pretty easy: for the sake of simplicity, I use the free Wappalyzer browser extension.
When you browse a website and open the extension, you immediately get a clear picture of whether any anti-bot is used on it.
Here’s an example of a website with Cloudflare installed
and another one with Kasada.
If you don’t see any well-known anti-bot in this section (Cloudflare, PerimeterX, Datadome, Kasada, Shape/F5), then 99% of the time a Scrapy spider will be enough. Otherwise, we need to pick the proper tool, which is what we do on The Web Scraping Club and will be the subject of future lessons.
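The check above relies on the Wappalyzer extension. As a rough complement (a different technique, not how Wappalyzer works), the sketch below looks for vendor hints in the response header names and cookie names of a page. The marker table is an assumption based on commonly observed names; it is incomplete, may be outdated, and a match or a miss proves nothing on its own.

```python
# Illustrative markers only (assumptions): real deployments vary over time.
ANTIBOT_MARKERS = {
    "Cloudflare": {"headers": ("cf-ray",), "cookies": ("__cf_bm",)},
    "Datadome": {"headers": (), "cookies": ("datadome",)},
    "PerimeterX": {"headers": (), "cookies": ("_px",)},
    "Kasada": {"headers": ("x-kpsdk-",), "cookies": ()},
}


def detect_antibots(headers: dict[str, str], cookie_names: list[str]) -> list[str]:
    """Return the vendors whose marker prefixes appear in headers or cookies."""
    header_names = [name.lower() for name in headers]
    cookies = [name.lower() for name in cookie_names]
    found = []
    for vendor, markers in ANTIBOT_MARKERS.items():
        header_hit = any(
            name.startswith(prefix)
            for name in header_names
            for prefix in markers["headers"]
        )
        cookie_hit = any(
            name.startswith(prefix)
            for name in cookies
            for prefix in markers["cookies"]
        )
        if header_hit or cookie_hit:
            found.append(vendor)
    return found


# Header and cookie names as they might be copied from the network tab.
print(detect_antibots({"CF-RAY": "abc123", "Server": "cloudflare"}, []))
print(detect_antibots({"Content-Type": "text/html"}, ["_px3", "session"]))
```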
Is the robots.txt configured correctly?
It’s always worth having a look at the robots.txt file of a website since it could reveal some nice surprises, especially when not correctly configured.
Robots.txt is a file located at the root level of the website (for example www.website.com/robots.txt) that tells bots that want to read the website how they should behave. Depending on their user agent, bots are allowed into or kept out of certain sections of the website.
Sometimes a tool that crawls the website is given a sort of passe-partout and, by mimicking its user agent, we could bypass the anti-bot protection.
To be honest, this hasn’t happened to me many times, but checking the file is quick and could reveal an easy win.
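To see how rules differ by user agent, the sketch below parses a made-up robots.txt with Python's standard urllib.robotparser and compares two agents. The file content and both user agent names ("ExampleSeoBot", "MyScraper") are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt: one crawler is allowed everywhere, all other
# bots are kept out of two sections.
SAMPLE_ROBOTS = """\
User-agent: ExampleSeoBot
Allow: /

User-agent: *
Disallow: /search
Disallow: /checkout
"""


def allowed_paths(robots_txt: str, user_agent: str, paths: list[str]) -> dict[str, bool]:
    """Map each path to whether the given user agent may fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


paths = ["/search", "/products/1"]
print(allowed_paths(SAMPLE_ROBOTS, "MyScraper", paths))
print(allowed_paths(SAMPLE_ROBOTS, "ExampleSeoBot", paths))
```

For a live site, RobotFileParser can also load the file itself through set_url() and read().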
Is the sitemap helpful for us?
One last piece of advice for today’s chapter is to look at the sitemap of the website. It basically lists all the pages that the website wants search engines to index and, in some cases, it can also help our scrapers.
Let’s think about e-commerce web scraping: if we need to scrape all the items from an e-commerce site and get the data from the product detail pages, a sitemap that lists all these pages simplifies the job of crawling the whole website.
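The e-commerce case can be sketched as follows: parse the sitemap XML and keep only the product detail URLs. The sitemap content and the "/products/" URL pattern are assumptions made for the example; real sites use their own paths and often split the sitemap into several files behind a sitemap index.

```python
import xml.etree.ElementTree as ET

# Invented sitemap, following the standard sitemaps.org format.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.website.com/</loc></url>
  <url><loc>https://www.website.com/products/abc123</loc></url>
  <url><loc>https://www.website.com/products/xyz789</loc></url>
  <url><loc>https://www.website.com/about</loc></url>
</urlset>
"""

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def product_urls(sitemap_xml: bytes, marker: str = "/products/") -> list[str]:
    """Return the sitemap URLs that contain the product path marker."""
    root = ET.fromstring(sitemap_xml)
    locs = (
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    )
    return [url for url in locs if marker in url]


print(product_urls(SAMPLE_SITEMAP))
```

The resulting list can be used directly as the start URLs of a scraper, skipping the category and pagination crawl.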
The second chapter of the course “Web Scraping From 0 to Hero” ends here. In the next one, coming in approximately two weeks, we’ll see an example of web scraping architecture and the different layers of tools that cannot be missing from it.