What do I need for web scraping?
In this article from The Web Scraping 101 Wiki, after seeing what web scraping is and its legal implications, we’ll see what’s needed to start our first scraper.
Tools for website analysis
Before starting a web scraping project, the first step is to analyze the target website to understand the best approach for it.
This means finding out the website’s technology stack: is there an anti-bot solution and, if so, how can it be approached?
Wappalyzer
Wappalyzer is a tool for discovering the technology stack of a website, available as a browser extension or an API. It can detect most of the anti-bot solutions active on a website, which makes it a good starting point for the analysis of the target.
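As a quick illustration, a check like this can also be scripted with the unofficial python-Wappalyzer package (this community package and the example URL are assumptions for the sketch; the official product is the browser extension and API):

```python
# Minimal sketch using the unofficial python-Wappalyzer package
# (pip install python-Wappalyzer); the URL is just a placeholder.
from Wappalyzer import Wappalyzer, WebPage

wappalyzer = Wappalyzer.latest()                     # load the latest fingerprint database
webpage = WebPage.new_from_url("https://example.com")
technologies = wappalyzer.analyze(webpage)           # e.g. {'Cloudflare', 'React', ...}
print(technologies)
```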
Lambdatest
Once you know whether there’s an anti-bot solution, and especially if it’s one of the toughest like Cloudflare, you can use Lambdatest to check which combinations of operating systems and browsers work on the website.
In fact, depending on the configuration, certain operating systems could be marked as suspicious by the website. For example, in some Cloudflare configurations, Windows Server or Ubuntu Server will trigger captchas.
Python tools and frameworks for web scraping
Scrapy
Scrapy is the most popular Python framework for web scraping, an open-source project maintained by Zyte. It’s useful where there are no particular anti-bot countermeasures. See more in our article “What is Scrapy?”
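To give an idea, here’s a minimal sketch of a Scrapy spider; the target site (quotes.toscrape.com, a practice site) and the CSS selectors are only placeholders for your own project:

```python
# Minimal Scrapy spider sketch; run with:
#   scrapy runspider quotes_spider.py -O quotes.json
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```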
Splash
Splash is a headless browser with an HTTP API, also usable from Scrapy via the scrapy-splash plugin, that you can use to scrape dynamic web pages built with JavaScript. It’s useful where there are no particular anti-bot countermeasures but JavaScript rendering is needed.
See more in our article “What is Splash?”
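As a rough idea of how it works, here’s a minimal sketch that calls Splash’s render.html endpoint directly with the requests library, assuming a Splash instance is running locally on port 8050 (for example via its Docker image):

```python
# Minimal sketch assuming Splash runs locally, e.g.:
#   docker run -p 8050:8050 scrapinghub/splash
import requests

response = requests.get(
    "http://localhost:8050/render.html",               # Splash rendering endpoint
    params={"url": "https://example.com", "wait": 2},   # wait 2s for JavaScript to run
)
print(response.text[:500])  # HTML after JavaScript has been executed
```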
Selenium
Selenium WebDriver is a framework for testing web applications that is also useful for web scraping. Since it supports Firefox, Chrome, and Safari, it can be a good solution when you need to tackle an anti-bot by emulating a human using a browser.
See more in our article “What is Selenium?”
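A minimal Selenium sketch could look like this (Selenium 4 syntax; the URL and selector are placeholders for your own target):

```python
# Minimal Selenium sketch; recent Selenium versions download the
# matching driver binary automatically, Chrome must be installed.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")
title = driver.find_element(By.TAG_NAME, "h1").text  # grab the first heading
print(title)
driver.quit()
```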
Undetected-Chromedriver
It’s a patched version of the Chrome WebDriver, modified so that it’s hard to distinguish from a real Chrome browser instance. Paired with Selenium, it can be more effective for scraping than the default Chrome WebDriver.
See more in our article “What is Undetected-Chromedriver?”
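Since it exposes the same interface as Selenium’s Chrome driver, switching to it is usually a small change; a minimal sketch, assuming the package is installed as undetected-chromedriver and the URL is a placeholder:

```python
# Minimal sketch with undetected-chromedriver; it behaves like
# Selenium's Chrome driver, so the rest of the script stays the same.
import undetected_chromedriver as uc

driver = uc.Chrome()
driver.get("https://example.com")
print(driver.title)
driver.quit()
```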
Playwright
Playwright is another framework for testing web applications but, instead of using webdrivers, it drives real browsers like Edge, Firefox, and Chrome directly. Scrapers can be written in Python but also in Node.js, Java, and .NET.
See more in our article “What is Playwright?”
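A minimal sketch using Playwright’s synchronous Python API could look like this (the target URL is a placeholder; remember to run "playwright install" once after installing the package):

```python
# Minimal Playwright sketch using the synchronous API.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # also p.firefox / p.webkit
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```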
Running environment
After choosing the tools to use for web scraping, another important decision is where to run the scrapers. Depending on the anti-bot mechanism of the target site, for example an IP range restriction or a per-IP request limit, we may need to select a specific IP range or proxy provider.
Proxies
A proxy server is a machine that sits between the client and the server, so the server sees the request coming from the IP of the proxy machine and not from the original client. Proxies can be classified by the origin of their IP, for example whether it comes from a datacenter or from a residential connection, or by the level of anonymity they provide to the client.
See more in our article “What is a proxy?”
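As an illustration, here’s a minimal sketch of routing a request through a proxy with the requests library; the proxy address and credentials are placeholders to be replaced with your provider’s:

```python
# Minimal sketch of sending a request through a proxy; the proxy URL
# below is a placeholder, not a real endpoint.
import requests

proxies = {
    "http": "http://user:password@proxy.example.com:8000",
    "https": "http://user:password@proxy.example.com:8000",
}
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
print(response.json())  # should show the proxy's IP, not yours
```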
Conclusions
In this article, we have seen a brief introduction to what is needed for your web scraping projects. For more details on any specific topic, you can read the corresponding detail page. Here’s a short video about how to start web scraping in Python.
This post is written by Pierluigi Vinciguerra (pier@thewebscraping.club)