When approaching a web scraping project from scratch as a Python developer in 2024, two libraries you cannot miss are Scrapy and Playwright.
Scrapy is an open-source and powerful web scraping framework written in Python and maintained by the founding team of Zyte. It's designed to extract data from websites and efficiently handle a wide range of web scraping needs.
Playwright is an open-source end-to-end testing and web scraping framework developed by Microsoft. It is designed to automate browser interactions, allowing you to control and interact with web pages in real time using multiple browsers (Chromium, Firefox, and WebKit). Although originally intended for automated testing, Playwright is widely used in web scraping due to its powerful features and flexibility.
Of course, Selenium is still extremely popular, but I prefer Playwright for its ease of use and speed when writing a scraper. One Selenium feature I'm curious to try is Selenium Grid, which lets you distribute your automation across multiple machines and real devices.
This article examines how to properly set up a Scrapy and a Playwright scraper, including the most common options for making them less detectable.
Scrapy starting setup
When setting up a Scrapy project, you can configure several key options to optimize your web scraper's performance, efficiency, and stealth capabilities.
Usually, a Scrapy spider is enough when the target website isn't protected by an advanced anti-bot solution but only applies some checks on the request headers, some TLS fingerprinting, or rate limits on requests coming from the same IP.
Here are the options you can use in your scraper’s settings for all these cases.
Default headers and User Agent
If you don't override them in your settings.py file, these options will immediately tell the target website that the requests come from a scraper.
The User-Agent will clearly state that it's a request made with Scrapy, while the default headers look like this:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}
Changing these settings to mimic a real person browsing the website is super easy: everything you need is in your browser's network inspection tool, in the details of the request.
To save time, I usually copy the request in fetch format and paste the part of the output containing the headers into my Scrapy spider.
fetch("https://github.com/scrapy/scrapy/blob/master/scrapy/settings/default_settings.py", {
"headers": {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US,en;q=0.8",
"cache-control": "max-age=0",
"if-none-match": "W/\"d6c50c2084e4022b9e0a1b4b96577d82\"",
"priority": "u=0, i",
"sec-ch-ua": "\"Chromium\";v=\"130\", \"Brave\";v=\"130\", \"Not?A_Brand\";v=\"99\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"macOS\"",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "none",
"sec-fetch-user": "?1",
"sec-gpc": "1",
"upgrade-insecure-requests": "1"
},
"referrerPolicy": "strict-origin-when-cross-origin",
"body": null,
"method": "GET",
"mode": "cors",
"credentials": "include"
});
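For reference, here's a minimal sketch of how those copied headers could look once pasted into settings.py. The values below simply mirror the fetch snippet above; refresh them from your own browser session, including the user agent, which the fetch output doesn't contain.

# settings.py - headers copied from the browser's fetch snippet above
DEFAULT_REQUEST_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.8",
    "sec-ch-ua": '"Chromium";v="130", "Brave";v="130", "Not?A_Brand";v="99"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"macOS"',
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
}

# A matching user agent: copy yours from the same request in the network inspector
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
)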
Changing these options seems so trivial that you might think it's useless in 2024, but it still makes a difference. Some months ago, I noticed that some scrapers I wrote several years ago had stopped working. Over the years I had updated the selectors to adapt to new website versions, but I had never touched the scrapers' headers, which still referred to a very old Chrome version. After some trial and error, I discovered that updating the headers was enough to make them run again.
Another trick I've seen is rotating the user agent during the scraping, especially when a high frequency of requests is needed; a per-request rotation sketch follows the snippet below.
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.199 Safari/537.36'

# Alternatively, pick a user agent from a list.
# Note: settings.py is evaluated once at startup, so this chooses one user agent per run, not per request.
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.199 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Mobile/15E148 Safari/604.1',
]
USER_AGENT = random.choice(USER_AGENTS)
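If you want a different user agent on every request rather than one per run, a custom downloader middleware is a common way to do it. Here's a minimal sketch, assuming a middlewares.py module in your project; the module and class names are placeholders.

# middlewares.py - a minimal sketch of per-request user agent rotation
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.199 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is sent
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None

# settings.py - enable it after the built-in UserAgentMiddleware (priority 400)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 543,
}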
Cookies handling
By default, Scrapy stores the cookies received with a response and passes them to the following requests. This is particularly helpful when your target website gives you a clearance cookie marking your session as legitimate, which is then transmitted to all the subsequent requests.
If you want to debug the cookie flow between requests, you can use the option COOKIES_DEBUG in your settings.py file.
COOKIES_ENABLED = True
COOKIES_DEBUG = True
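If you ever need to attach a cookie manually to a specific request (for instance, a clearance cookie you obtained elsewhere), you can pass it through the cookies argument of scrapy.Request. A minimal sketch, with a made-up cookie name and value:

import scrapy

class CookieSpider(scrapy.Spider):
    name = 'cookiespider'

    def start_requests(self):
        # 'clearance' is just an example name; use the cookie your target website actually sets
        yield scrapy.Request(
            'https://example.com',
            cookies={'clearance': 'value-obtained-elsewhere'},
            callback=self.parse,
        )

    def parse(self, response):
        self.log(f"Response from {response.url}")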
Request Throttling
Scrapy is designed to run multiple requests in parallel and gives you plenty of options for setting throttling.
You can use the AutoThrottle extension options, as in the following example.
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1 # Initial download delay
AUTOTHROTTLE_MAX_DELAY = 10 # Maximum download delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0 # Average number of requests to send in parallel to each remote site
AUTOTHROTTLE_DEBUG = False # Enable to see throttling stats
With these options, Scrapy adjusts the delay between requests dynamically, keeping it between the start and maximum values and tuning it based on the latency of the responses it receives.
You can choose a target concurrency that matches the traffic your target website can handle, without flagging your scraper as a bot.
If you prefer to set a fixed delay between consecutive requests instead, you can use these options.
# settings.py
CONCURRENT_REQUESTS = 16            # Global cap on parallel requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # Cap per target domain
CONCURRENT_REQUESTS_PER_IP = 4      # Cap per target IP (when non-zero, the per-domain cap is ignored)
DOWNLOAD_DELAY = 2                  # Seconds to wait between two consecutive requests to the same site
In this way, you can tune the degree of parallelism according to different variables, like the target IP or domain, and decide how many seconds should pass between two consecutive requests to the same site.
Proxy usage
You can (of course) use proxies in your Scrapy spider, and there are several ways to do it.
If you need a proxy for only some of the requests in your scraper, you can set it through the request's meta dictionary.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'proxy': 'http://username:password@proxy1.com:port'},
            )

    def parse(self, response):
        self.log(f"Response from {response.url}")
This approach can save you a lot of money compared to the usual setup via middleware: if only one request in your scraper needs a proxy, you set it just for that request, while with the middleware approach every request goes through the proxies.
On the other hand, there are several packages like Scrapy-Proxies, Scrapy-rotating-proxies, or advanced-scrapy-proxies for handling them on a middleware level, which is the most common approach.
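For reference, here's a minimal sketch of what such a middleware boils down to; the module, class, and proxy URL are placeholders, and the packages above add features like rotation and retries on top of this basic idea.

# middlewares.py - a minimal sketch of setting a proxy at the middleware level
class SimpleProxyMiddleware:
    PROXY = 'http://username:password@proxy1.com:port'

    def process_request(self, request, spider):
        # Attach the proxy to every outgoing request
        request.meta['proxy'] = self.PROXY
        return None

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SimpleProxyMiddleware': 350,
}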
In part two of this post, we’ll see basic Playwright configurations for being undetectable, which you can use as a starting point for your scrapers.