Optimizing Python Scripts for High-Traffic Websites
Tools and libraries for improving your scraper's requests per second.
Scraping high-traffic websites often presents several challenges. Because site owners want their websites to remain available to users at all times, they deliberately make web scraping difficult to perform.
In this article, you will learn the challenges associated with web scraping high-traffic websites and how to optimize your Python scripts in this case. Here is what this article covers:
Challenges of scraping high-traffic websites
Strategies for optimizing Python scripts when scraping high-traffic websites
Why Scrapy can be the optimal solution
Let’s dive into it!
Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.
Challenges of Scraping High-Traffic Websites
Scraping data from high-traffic websites means extracting information from websites that receive many visitors and requests in a short amount of time. Common examples are major e-commerce sites, news portals, social media platforms, and search engines. These sites often use sophisticated measures to prevent or hinder scraping, making the process technically challenging. For this reason, scraping these sites requires advanced techniques, robust infrastructure, and careful handling to avoid detection and blocking.
Common challenges you can face while scraping high-traffic websites are:
Overloaded servers: High-traffic sites are built to handle legitimate user load, so aggressive scraping (many requests from a single source) is more likely to trigger anti-bot systems than to overload the servers themselves. In most cases, the unresponsiveness a scraper encounters is not genuine server overload: it is the site's defense mechanisms detecting the scraping activity and deliberately slowing down, or temporarily blocking, the scraper's IP address.
Rate limits: This is a primary technical defense. Websites limit the number of requests allowed from a single IP address within a given time frame to prevent abuse and overloading the servers. Exceeding these limits is a common reason scrapers get blocked.
Sophisticated bot detection systems: To ensure the availability of their websites, owners invest in bot detection systems like CAPTCHAs, IP address analysis, browser fingerprinting, behavioural analysis, and more. Those anti-detection systems make web scraping difficult, aiming to ensure maximum availability for legitimate users.
Honeypot traps: Honeypots were initially built as a defence against cyber attacks, but they cannot distinguish between a cyber attack and a web scraping request, and some types can identify scrapers before they fall deep into the trap. In the context of web scraping, honeypots are links or page elements placed where only bots will find them. Accessing one of these traps immediately flags the scraper's IP for blocking.
Geographical restrictions: The content available on a website can differ based on the visitor's location. Scraping specific regional data requires using proxies located in those geographic areas.
Thanks to the gold partners of the month: Smartproxy, IPRoyal, Oxylabs, Massive, Rayobyte, Scrapeless, SOAX, ScraperAPI, and Syphoon. They’re offering great deals to the community. Have a look yourself.
Python Scrapers Performance Optimization Strategies
Now that you know the common challenges of scraping high-traffic websites, this section walks you through practical strategies to optimize your Python scrapers’ performance.
Requirements
To replicate the scripts in this section, you must have Python 3.10.1 or higher installed on your machine.
Manage Concurrency With asyncio and aiohttp
Web scraping is often I/O-bound. This means you have to wait for network responses before completing a task. Concurrency allows your script to do other work while waiting, speeding up the process. To do so, you can use the following libraries:
asyncio: Often the most effective approach for I/O-bound tasks.
aiohttp: Used for making asynchronous HTTP requests.
Before proceeding with the script, install aiohttp:
pip install aiohttp
A Python script that manages asynchronous requests can be as follows:
import asyncio
import aiohttp
import time


async def fetch(session, url):
    """Asynchronously fetches a single URL."""
    print(f"Starting fetch for: {url}")
    try:
        # Add a small delay
        await asyncio.sleep(0.5)
        async with session.get(url, timeout=10) as response:
            response.raise_for_status()  # Raise an exception for bad status codes
            content = await response.text()
            print(f"Finished fetch for: {url}, Length: {len(content)}")
            # Process content here
            return url, len(content)
    except aiohttp.ClientError as e:
        print(f"Client error fetching {url}: {e}")
        return url, None
    except asyncio.TimeoutError:
        print(f"Timeout fetching {url}")
        return url, None
    except Exception as e:
        print(f"Other error fetching {url}: {e}")
        return url, None


async def main(urls):
    """Main function to coordinate concurrent fetches."""
    # Create a single session to reuse connections
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)  # Gather results

        print("\n--- Fetching Complete ---")
        for result in results:
            if isinstance(result, Exception):
                print(f"A task failed: {result}")
            elif result and result[1] is not None:
                print(f"Successfully fetched {result[0]} - Size: {result[1]}")
            else:
                print(f"Failed or no content for URL: {result[0] if result else 'Unknown'}")


if __name__ == "__main__":
    target_urls = [
        "https://httpbin.org/delay/1",  # Simulates a slow response
        "https://httpbin.org/delay/2",
        "https://httpbin.org/html",
        "https://httpbin.org/status/404",  # Example of a bad status
        "https://httpbin.org/delay/0.5",  # Simulates a fast response
    ]

    start_time = time.time()
    asyncio.run(main(target_urls))
    end_time = time.time()
    print(f"\nTotal time using asyncio: {end_time - start_time:.2f} seconds")
Here is an explanation of the code:
async def declares an asynchronous function, called a coroutine, which can be paused and resumed using await.
The fetch() function fetches the content of a single URL. It does so through session.get(), which uses the aiohttp.ClientSession() object (created in the main() function) to make an asynchronous GET request to the URL.
The main() function manages the concurrent fetching of all URLs provided in the urls list. In particular, tasks = [fetch(session, url) for url in urls] creates a list of coroutines. It doesn't run the fetch coroutines yet; it just prepares them, one fetch task for each URL in the input list. Then, asyncio.gather() is the key function for running multiple coroutines concurrently.
target_urls defines the list of URLs to be fetched. Note that it includes URLs designed to take different amounts of time (/delay/...) and one that returns a 404 error (/status/404), to simulate different cases.
So, instead of waiting for each URL to download completely before starting the next, asyncio.gather() allows the program to initiate all requests and then wait for them to complete.
Suppose your code is stored in a file named main.py. Run it via python main.py and you will obtain:
The result shows that the URLs are fetched concurrently rather than one after the other, thanks to asynchronous concurrency.
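Note that asyncio.gather() in the script above launches every request at once. On a high-traffic site, you may want to cap how many requests are in flight at any moment. Below is a minimal sketch of how you could wrap the fetch() coroutine above with an asyncio.Semaphore; the limit of 3 concurrent requests is an arbitrary example value:

import asyncio
import aiohttp

# A sketch, assuming fetch(session, url) is defined as in the script above.
async def bounded_fetch(semaphore, session, url):
    """Waits for a free slot before fetching, so concurrency never exceeds the limit."""
    async with semaphore:
        return await fetch(session, url)

async def main(urls):
    # Allow at most 3 requests in flight at once (example value)
    semaphore = asyncio.Semaphore(3)
    async with aiohttp.ClientSession() as session:
        tasks = [bounded_fetch(semaphore, session, url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

This keeps the speed benefits of concurrency while staying well below the rate limits discussed earlier.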
Maintain Persistent Connections
Another way to increase your Python scrapers’ efficiency is to reuse the underlying TCP connection for multiple requests to the same host. This avoids the overhead of establishing a new connection for each request.
This technique is most useful when scraping many pages from the same site, and you can implement it with requests.Session() from the Requests library.
Before writing the code, install Requests:
pip install requests
The following is a script that manages persistent connections with Requests:
import requests
import time


def fetch_urls_with_session(urls):
    """Fetches URLs using a persistent session."""
    # Create a session object
    with requests.Session() as session:
        # Set default headers for the session
        session.headers.update({"User-Agent": "MyOptimizedScraper/1.0"})

        results = {}
        for url in urls:
            try:
                # Add a small delay
                time.sleep(0.5)
                response = session.get(url, timeout=10)
                response.raise_for_status()  # Check for HTTP errors
                print(f"Fetched {url} - Status: {response.status_code}")
                results[url] = len(response.content)
                # Process content
            except requests.exceptions.RequestException as e:
                print(f"Error fetching {url}: {e}")
                results[url] = None
        return results


if __name__ == "__main__":
    target_urls = [
        "https://httpbin.org/html",
        "https://httpbin.org/get",         # Same domain, connection can be reused
        "https://httpbin.org/headers",     # Same domain
        "https://httpbin.org/status/404",  # Same domain, but will raise an error
    ]

    start_time = time.time()
    fetch_results = fetch_urls_with_session(target_urls)
    end_time = time.time()

    print("\n--- Fetching Complete ---")
    print(fetch_results)
    print(f"\nTotal time using Session: {end_time - start_time:.2f} seconds")
This snippet does the following:
Session() creates a session object that persists parameters across requests. So, if you make multiple requests to the same host (as you do with the target_urls list), the underlying TCP connection is reused, saving the time needed for the TCP handshake on subsequent requests.
headers.update() sets a default User-Agent header for all requests made through the session object. This identifies your script to the web server.
session.get() makes an HTTP GET request to the current url using the session object.
So, when making multiple requests to the same host (httpbin.org in this case) within the with block, the underlying TCP connection is reused, saving time on SSL handshakes and connection setup for subsequent requests.
Run the script with python main.py to see the results.
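If you need finer control over connection reuse, Requests also lets you tune the connection pool through urllib3's HTTPAdapter. The following is a minimal sketch; the pool sizes are arbitrary example values:

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Keep up to 10 host pools and up to 20 connections per pool (example values)
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=20)
session.mount("https://", adapter)
session.mount("http://", adapter)

# All requests made through this session now share the tuned connection pool
response = session.get("https://httpbin.org/get", timeout=10)
print(response.status_code)

Larger pools help when you fetch many pages from the same few hosts in parallel, since connections do not have to be torn down and re-opened between requests.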
Retry Management
Another way to improve Python scripts, particularly for high-traffic websites, is by implementing retries with backoff. You can implement automatic retries for transient network errors or specific HTTP status codes (for example, 5xx errors). Use exponential backoff (increasing delays between retries) to avoid overloading the server.
A library you can use for this purpose is tenacity:
pip install tenacity
An automated retry of failed requests can be implemented as follows:
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import time


# Configure the retry strategy
@retry(
    stop=stop_after_attempt(3),  # Stop after 3 attempts (1 initial + 2 retries)
    wait=wait_exponential(multiplier=1, min=2, max=10),  # Wait between 2s and 10s between retries
    retry=retry_if_exception_type(
        (requests.exceptions.Timeout, requests.exceptions.ConnectionError, requests.exceptions.HTTPError)
    ),  # Retry only on specific errors
    reraise=True,  # Reraise the exception if all retries fail
)
def fetch_with_retry(url, session):
    """Fetches a URL using a session, with retry logic."""
    print(f"Attempting to fetch: {url} at {time.strftime('%X')}")
    # Add a small delay
    time.sleep(0.5)
    response = session.get(url, timeout=5)  # Shorter timeout for testing retries
    # Raise HTTPError for bad responses
    response.raise_for_status()
    print(f"Successfully fetched {url}")
    return response.text


if __name__ == "__main__":
    # Use a session for efficiency
    with requests.Session() as session:
        # This URL always returns 503, which makes it good for testing retries
        flaky_url = "https://httpbin.org/status/503"  # Service Unavailable - good candidate for retry
        stable_url = "https://httpbin.org/get"

        print("--- Fetching stable URL ---")
        try:
            content = fetch_with_retry(stable_url, session)
            print(f"Stable URL content length: {len(content)}")
        except Exception as e:
            print(f"Failed to fetch stable URL after retries: {e}")

        print("\n--- Fetching flaky URL ---")
        try:
            content = fetch_with_retry(flaky_url, session)
            # This line won't be reached because the URL always returns 503
            print(f"Flaky URL content length: {len(content)}")
        except requests.exceptions.RequestException as e:  # Catch the final exception when reraise=True
            print(f"Failed to fetch flaky URL after all retries: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
In this snippet:
The @retry decorator intercepts calls to the fetch_with_retry() function and adds the retry behavior defined by its arguments. In particular, retry_if_exception_type() limits retries to the listed exception types, while stop_after_attempt() caps the total number of attempts (3 in this case).
wait_exponential() manages the exponential backoff: the delay between attempts grows exponentially (scaled by multiplier=1) and is clamped between min=2 and max=10 seconds.
The fetch_with_retry() function fetches a URL using a session object through session.get().
flaky_url = "https://httpbin.org/status/503" defines a URL designed to return an HTTP 503 error. This triggers an HTTPError and causes the retries.
So, this code attempts to fetch the target URL up to 3 times if HTTP errors occur, waiting exponentially between attempts. If you run the script with python main.py, you will obtain:
As expected, the result shows that https://httpbin.org/status/503 kept raising an error. The request was attempted three times in total, then the script gave up and re-raised the exception.
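When a high-traffic site answers with a 429 or 503, it sometimes includes a Retry-After header telling you how long to wait before trying again. Here is a minimal sketch of how you could honor that hint before retrying; it only handles the case where the header is expressed in seconds:

import time
import requests

def wait_if_throttled(response):
    """Sleeps for the time suggested by the server, if any, and reports whether it slept."""
    if response.status_code in (429, 503):
        retry_after = response.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():  # The header can also be an HTTP date, ignored here
            time.sleep(int(retry_after))
            return True
    return False

# Example usage inside a retry loop:
# response = session.get(url, timeout=10)
# if wait_if_throttled(response):
#     response = session.get(url, timeout=10)  # One more attempt after the suggested pause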
Anti-Blocking Techniques: User-agent Rotation
Using different user-agent strings makes your scraper look less like a single, repetitive bot. This lowers the chances of triggering anti-bot systems.
A way to implement this solution is by cycling through a list of realistic browser user-agent strings, like so:
import requests
import random
import time

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
]


def fetch_with_random_ua(url, session):
    """Fetches a URL using a random User-Agent from the list."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    print(f"Fetching {url} with UA: {headers['User-Agent']}")
    try:
        # Add a small delay
        time.sleep(0.5)
        response = session.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        print(f"Fetched {url} - Status: {response.status_code}")
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None


if __name__ == "__main__":
    target_urls = [
        "https://httpbin.org/user-agent",  # This endpoint echoes the UA
        "https://httpbin.org/user-agent",
        "https://httpbin.org/user-agent",
    ]

    with requests.Session() as session:  # Combine with a session for efficiency
        for target_url in target_urls:
            fetch_with_random_ua(target_url, session)
            print("-" * 20)
In this snippet:
USER_AGENTS contains a list of common browser user-agent strings.
Before each request, random.choice() selects one user-agent at random from the list, ensuring the rotation of the user agents in the list.
This technique helps mimic traffic from different browsers/users, reducing the chance of being blocked based on a repetitive user-agent signature. After running python main.py, you will obtain the following result:
As you can see, the result shows that the target URLs received requests with different user agents.
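To make the rotation more convincing, you can also pair each user-agent with a browser-like set of common headers, since a Chrome user-agent with no Accept-Language header can still look suspicious. Here is a minimal sketch that reuses the USER_AGENTS list from above; the header values are illustrative examples, not a canonical fingerprint:

import random

# Assumption: USER_AGENTS is the list defined in the previous script
def build_headers():
    """Returns a browser-like header set with a randomly chosen user-agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate",
    }

# Example usage:
# response = session.get(url, headers=build_headers(), timeout=10)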
Anti-Blocking Techniques: Proxy Rotation
Websites, especially high-traffic ones, monitor the number of requests coming from individual IP addresses. If too many requests arrive from the same IP in a short period, the site may throttle the connection or block the IP.
A way to overcome this is by implementing proxy rotation: a solution that routes requests through a pool of different proxy servers. This way, each request appears to originate from a unique IP address. This technique mimics traffic from multiple distinct users, making it harder for websites to identify and block the scraping activity based solely on IP address.
A simple way to implement proxy rotation in Python is as follows:
import random
import requests

proxies_list = [
    "http://PROXY_1:PORT_X",
    "http://PROXY_2:PORT_Y",
    "http://PROXY_3:PORT_Z",
    # Add more proxies as needed
]


def get_random_proxy():
    proxy = random.choice(proxies_list)
    return {
        "http": proxy,
        "https": proxy,
    }


for i in range(3):
    proxy = get_random_proxy()
    response = requests.get("https://httpbin.org/ip", proxies=proxy)
    print(response.text)
The above code:
Stores a pool of proxies as a list in proxies_list.
Selects a random proxy URL from the list with random.choice(). This is the simplest way to ensure proxy rotation.
Below is the expected response:
{
"origin": "PROXY_3:PORT_K"
}
{
"origin": "PROXY_1:PORT_N"
}
{
"origin": "PROXY_2:PORT_P"
}
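In practice, some proxies in the pool will be slow or dead, so you will also want basic failure handling. The following is a minimal sketch, still using the placeholder proxies_list from above, that retries a request through a different proxy when one fails:

import random
import requests

def fetch_through_proxies(url, max_attempts=3):
    """Tries up to max_attempts different proxies before giving up."""
    for attempt in range(max_attempts):
        proxy = random.choice(proxies_list)  # proxies_list as defined above
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10,
            )
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Proxy {proxy} failed ({e}), trying another one...")
    return None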
Why Scrapy Can Be The Optimal Solution For Scraping High-traffic Websites With Python
As you have learned, there are several challenges to overcome when scraping high-traffic websites. So, you might be wondering whether there is a way to implement all of these solutions, and more, in one place.
In this case, Scrapy can be considered your best choice. Scrapy is a comprehensive framework built from the ground up specifically for large-scale, efficient web crawling and scraping. Unlike individual libraries like requests and BeautifulSoup, which require you to implement many features manually, Scrapy provides an integrated architecture that directly addresses the core challenges of scraping high-traffic websites.
Here is why Scrapy excels in these scenarios:
Asynchronous networking core: Scrapy is built on top of Twisted, a mature, event-driven networking engine. This means Scrapy handles network requests asynchronously by default, with no extra implementation work on your side.
Built-in concurrency and delay management: Scrapy provides fine-grained control over concurrency and request delays. Settings like CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP, and DOWNLOAD_DELAY allow precise tuning. The AutoThrottle extension can also dynamically adjust delays based on server load, helping you scrape as fast as possible without getting blocked. These options are set in the settings.py file that Scrapy automatically creates when you start a new project (a sketch of such a file follows this list).
Robust middleware architecture: Scrapy has a middleware system for customizing request and response handling. Custom middleware can, for example, manage retries, redirects, proxy rotation, user-agent rotation, and more.
Efficient resource management: Scrapy manages memory efficiently, handling request queues, duplicate filtering (so URLs aren't scraped multiple times unnecessarily, via the DupeFilter class), and data processing pipelines. It is designed for large crawls that would otherwise consume excessive RAM.
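As mentioned above, these options are plain entries in your project's settings.py. Here is a minimal sketch of what such a configuration could look like; the values are illustrative examples, not recommendations for any specific site:

# settings.py (excerpt) - illustrative values only

# Global and per-domain concurrency limits
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Base delay between requests to the same site, in seconds
DOWNLOAD_DELAY = 0.5

# Let AutoThrottle adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10

# Retry transient errors such as 5xx responses
RETRY_ENABLED = True
RETRY_TIMES = 2

# Respect robots.txt
ROBOTSTXT_OBEY = True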
Conclusion
Scraping high-traffic websites poses significant challenges. In this article, you learned that overcoming them requires optimizing your Python scripts with a range of techniques.
While these techniques enhance standalone Python scrapers, Scrapy provides many of them out of the box, or makes them easy to implement, thanks to its architecture. For this reason, you can consider Scrapy an optimal choice for scraping high-traffic websites in Python.
Happy scraping to you all! Oh, and remember to always respect the robots.txt file.