Optimizing Python Scripts for High-Traffic Websites
Tools and libraries for improving your scraper's requests per second.
Scraping high-traffic websites often presents several challenges. Because site owners want their websites to remain available to users at all times, they deliberately make web scraping difficult to perform.
In this article, you will learn the challenges associated with web scraping high-traffic websites and how to optimize your Python scripts in this case. Here is what this article covers:
Challenges of scraping high-traffic websites
Strategies for optimizing Python scripts when scraping high-traffic websites
Why Scrapy can be the optimal solution
Let’s dive into it!
Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.
Challenges of Scraping High-Traffic Websites
Scraping data from high-traffic websites means extracting information from websites that receive many visitors and requests in a short amount of time. Common examples are major e-commerce sites, news portals, social media platforms, and search engines. These sites often use sophisticated measures to prevent or hinder scraping, making the process technically challenging. For this reason, scraping these sites requires advanced techniques, robust infrastructure, and careful handling to avoid detection and blocking.
Common challenges you can face while scraping high-traffic websites are:
Overloaded servers: High-traffic sites are built to handle legitimate user load, so aggressive scraping (many requests from a single source) is more likely to trigger anti-bot systems than to overload the servers themselves. In most cases, the unresponsiveness a scraper encounters is not genuine server overload: it is the site's defense mechanisms detecting the scraping activity and deliberately slowing down, or temporarily blocking, the scraper's IP address.
Rate limits: This is a primary technical defense. Websites limit the number of requests allowed from a single IP address within a given time frame to prevent abuse and overloading the servers. Exceeding these limits is a common reason scrapers get blocked.
Sophisticated bot detection systems: To ensure the availability of their websites, owners invest in bot detection systems like CAPTCHAs, IP address analysis, browser fingerprinting, behavioural analysis, and more. Those anti-detection systems make web scraping difficult, aiming to ensure maximum availability for legitimate users.
Honeypot traps: Honeypots were initially built as a defence against cyber attacks, but they cannot distinguish between a cyber attack and a web scraping request, and some types can identify scrapers before they fall deep into the trap. In the context of web scraping, honeypots are links or page elements placed where only bots will find them. Accessing one of these traps immediately flags the scraper's IP for blocking.
Geographical restrictions: The content available on a website can differ based on the visitor's location. Scraping specific regional data requires using proxies located in those geographic areas.
Thanks to the gold partners of the month: Smartproxy, IPRoyal, Oxylabs, Massive, Rayobyte, Scrapeless, SOAX, ScraperAPI, and Syphoon. They’re offering great deals to the community. Have a look yourself.
Python Scrapers Performance Optimization Strategies
Now that you know the common challenges of scraping high-traffic websites, this section walks you through practical strategies to optimize your Python scrapers’ performance.
Requirements
To replicate the scripts in this section, you must have Python 3.10.1 or higher installed on your machine.
Manage Concurrency With asyncio and aiohttp
Web scraping is often I/O-bound. This means you have to wait for network responses before completing a task. Concurrency allows your script to do other work while waiting, speeding up the process. To do so, you can use the following libraries:
asyncio: Often the most effective approach for I/O-bound tasks.
aiohttp: Used for making asynchronous HTTP requests.
Before proceeding with the script, install aiohttp:
pip install aiohttp
A Python script that manages asynchronous requests can be as follows:
import asyncio
import aiohttp
import time


async def fetch(session, url):
    """Asynchronously fetches a single URL."""
    print(f"Starting fetch for: {url}")
    try:
        # Add a small delay
        await asyncio.sleep(0.5)
        async with session.get(url, timeout=10) as response:
            response.raise_for_status()  # Raise an exception for bad status codes
            content = await response.text()
            print(f"Finished fetch for: {url}, Length: {len(content)}")
            # Process content here
            return url, len(content)
    except aiohttp.ClientError as e:
        print(f"Client error fetching {url}: {e}")
        return url, None
    except asyncio.TimeoutError:
        print(f"Timeout fetching {url}")
        return url, None
    except Exception as e:
        print(f"Other error fetching {url}: {e}")
        return url, None


async def main(urls):
    """Main function to coordinate concurrent fetches."""
    # Create a single session to reuse connections
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)  # Gather results

        print("\n--- Fetching Complete ---")
        for result in results:
            if isinstance(result, Exception):
                print(f"A task failed: {result}")
            elif result and result[1] is not None:
                print(f"Successfully fetched {result[0]} - Size: {result[1]}")
            else:
                print(f"Failed or no content for URL: {result[0] if result else 'Unknown'}")


if __name__ == "__main__":
    target_urls = [
        "https://httpbin.org/delay/1",  # Simulates a slow response
        "https://httpbin.org/delay/2",
        "https://httpbin.org/html",
        "https://httpbin.org/status/404",  # Example of a bad status
        "https://httpbin.org/delay/0.5",  # Simulates a fast response
    ]

    start_time = time.time()
    asyncio.run(main(target_urls))
    end_time = time.time()
    print(f"\nTotal time using asyncio: {end_time - start_time:.2f} seconds")
Here is an explanation of the code:
async def declares an asynchronous function, called a coroutine, which can be paused and resumed using await.
The fetch() function fetches the content of a single URL. It does so through session.get(), which uses the aiohttp.ClientSession() object (created in the main() function) to make an asynchronous GET request to the URL.
The main() function manages the concurrent fetching of all URLs provided in the urls list. In particular, tasks = [fetch(session, url) for url in urls] creates a list of coroutines. It doesn't run the fetch coroutines yet; it just prepares them, one fetch task for each URL in the input list. Then, asyncio.gather() is the key function for running multiple coroutines concurrently.
target_urls defines the list of URLs to be fetched. Note that it includes URLs designed to take different amounts of time (/delay/...) and one that returns a 404 error (/status/404), to simulate different cases.
So, instead of waiting for each URL to download completely before starting the next, asyncio.gather() allows the program to initiate all requests and then wait for them to complete.
Suppose your code is stored in a file named main.py. Run it via python main.py and you will obtain:
The result shows that the URLs are fetched concurrently rather than one after the other, thanks to asynchronous concurrency.
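Note that asyncio.gather() in the script above launches every request at once. On a high-traffic site, you may want to cap how many requests are in flight at any moment. Below is a minimal sketch of how you could wrap the fetch() coroutine above with an asyncio.Semaphore; the limit of 3 concurrent requests is an arbitrary example value:

import asyncio
import aiohttp

# A sketch, assuming fetch(session, url) is defined as in the script above.
async def bounded_fetch(semaphore, session, url):
    """Waits for a free slot before fetching, so concurrency never exceeds the limit."""
    async with semaphore:
        return await fetch(session, url)

async def main(urls):
    # Allow at most 3 requests in flight at once (example value)
    semaphore = asyncio.Semaphore(3)
    async with aiohttp.ClientSession() as session:
        tasks = [bounded_fetch(semaphore, session, url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

This keeps the speed benefits of concurrency while staying well below the rate limits discussed earlier.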
Maintain Persistent Connections
Another way to increase your Python scrapers’ efficiency is to reuse the underlying TCP connection for multiple requests to the same host. This avoids the overhead of establishing a new connection for each request.
This technique is most useful when scraping many pages from the same site, and you can implement it with requests.Session() from the Requests library.
Before writing the code, install Requests:
pip install requests
The following is a script that manages persistent connections with Requests:
import requests
import time


def fetch_urls_with_session(urls):
    """Fetches URLs using a persistent session."""
    # Create a session object
    with requests.Session() as session:
        # Set default headers for the session
        session.headers.update({"User-Agent": "MyOptimizedScraper/1.0"})

        results = {}
        for url in urls:
            try:
                # Add a small delay
                time.sleep(0.5)
                response = session.get(url, timeout=10)
                response.raise_for_status()  # Check for HTTP errors
                print(f"Fetched {url} - Status: {response.status_code}")
                results[url] = len(response.content)
                # Process content
            except requests.exceptions.RequestException as e:
                print(f"Error fetching {url}: {e}")
                results[url] = None
        return results


if __name__ == "__main__":
    target_urls = [
        "https://httpbin.org/html",
        "https://httpbin.org/get",         # Same domain, connection can be reused
        "https://httpbin.org/headers",     # Same domain
        "https://httpbin.org/status/404",  # Same domain, but will raise an error
    ]

    start_time = time.time()
    fetch_results = fetch_urls_with_session(target_urls)
    end_time = time.time()

    print("\n--- Fetching Complete ---")
    print(fetch_results)
    print(f"\nTotal time using Session: {end_time - start_time:.2f} seconds")
This snippet does the following:
Session() creates a session object that persists parameters across requests. So, if you make multiple requests to the same host (as you do with the target_urls list), the underlying TCP connection is reused, saving the time needed for the TCP handshake on subsequent requests.
headers.update() sets a default User-Agent header for all requests made through the session object. This identifies your script to the web server.
session.get() makes an HTTP GET request to the current url using the session object.
So, when making multiple requests to the same host (httpbin.org in this case) within the with block, the underlying TCP connection is reused, saving time on SSL handshakes and connection setup for subsequent requests.
Run the script with python main.py to see the results.
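If you need finer control over connection reuse, Requests also lets you tune the connection pool through urllib3's HTTPAdapter. The following is a minimal sketch; the pool sizes are arbitrary example values:

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Keep up to 10 host pools and up to 20 connections per pool (example values)
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=20)
session.mount("https://", adapter)
session.mount("http://", adapter)

# All requests made through this session now share the tuned connection pool
response = session.get("https://httpbin.org/get", timeout=10)
print(response.status_code)

Larger pools help when you fetch many pages from the same few hosts in parallel, since connections do not have to be torn down and re-opened between requests.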
Retry Management
Another way to improve Python scripts, particularly for high-traffic websites, is by implementing retries with backoff. You can implement automatic retries for transient network errors or specific HTTP status codes (for example, 5xx errors). Use exponential backoff (increasing delays between retries) to avoid overloading the server.
A library you can use for this purpose is tenacity:
pip install tenacity
An automated retry of failed requests can be implemented as follows:
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import time


# Configure the retry strategy
@retry(
    stop=stop_after_attempt(3),  # Stop after 3 attempts (1 initial + 2 retries)
    wait=wait_exponential(multiplier=1, min=2, max=10),  # Wait between 2s and 10s between retries
    retry=retry_if_exception_type(
        (requests.exceptions.Timeout, requests.exceptions.ConnectionError, requests.exceptions.HTTPError)
    ),  # Retry only on specific errors
    reraise=True,  # Reraise the exception if all retries fail
)
def fetch_with_retry(url, session):
    """Fetches a URL using a session, with retry logic."""
    print(f"Attempting to fetch: {url} at {time.strftime('%X')}")
    # Add a small delay
    time.sleep(0.5)
    response = session.get(url, timeout=5)  # Shorter timeout for testing retries
    # Raise HTTPError for bad responses
    response.raise_for_status()
    print(f"Successfully fetched {url}")
    return response.text


if __name__ == "__main__":
    # Use a session for efficiency
    with requests.Session() as session:
        # This URL always returns 503, which makes it good for testing retries
        flaky_url = "https://httpbin.org/status/503"  # Service Unavailable - good candidate for retry
        stable_url = "https://httpbin.org/get"

        print("--- Fetching stable URL ---")
        try:
            content = fetch_with_retry(stable_url, session)
            print(f"Stable URL content length: {len(content)}")
        except Exception as e:
            print(f"Failed to fetch stable URL after retries: {e}")

        print("\n--- Fetching flaky URL ---")
        try:
            content = fetch_with_retry(flaky_url, session)
            # This line won't be reached because the URL always returns 503
            print(f"Flaky URL content length: {len(content)}")
        except requests.exceptions.RequestException as e:  # Catch the final exception when reraise=True
            print(f"Failed to fetch flaky URL after all retries: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
In this snippet:
The @retry decorator intercepts calls to the fetch_with_retry() function and adds the retry behavior defined by its arguments. In particular, retry_if_exception_type() limits retries to the listed exception types, while stop_after_attempt() caps the total number of attempts (3 in this case).
wait_exponential() manages the exponential backoff: the delay between attempts grows exponentially (scaled by multiplier=1) and is clamped between min=2 and max=10 seconds.
The fetch_with_retry() function fetches a URL using a session object through session.get().
flaky_url = "https://httpbin.org/status/503" defines a URL designed to return an HTTP 503 error. This triggers an HTTPError and causes the retries.
So, this code attempts to fetch the target URL up to 3 times if HTTP errors occur, waiting exponentially between attempts. If you run the script with python main.py, you will obtain:
As expected, the result shows that https://httpbin.org/status/503 kept raising an error. The request was attempted three times in total, then the script gave up and re-raised the exception.
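When a high-traffic site answers with a 429 or 503, it sometimes includes a Retry-After header telling you how long to wait before trying again. Here is a minimal sketch of how you could honor that hint before retrying; it only handles the case where the header is expressed in seconds:

import time
import requests

def wait_if_throttled(response):
    """Sleeps for the time suggested by the server, if any, and reports whether it slept."""
    if response.status_code in (429, 503):
        retry_after = response.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():  # The header can also be an HTTP date, ignored here
            time.sleep(int(retry_after))
            return True
    return False

# Example usage inside a retry loop:
# response = session.get(url, timeout=10)
# if wait_if_throttled(response):
#     response = session.get(url, timeout=10)  # One more attempt after the suggested pause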
Anti-Blocking Techniques: User-agent Rotation
Using different user-agent strings makes your scraper look less like a single, repetitive bot. This lowers the chances of triggering anti-bot systems.
A way to implement this solution is by cycling through a list of realistic browser user-agent strings, like so:
import requests
import random
import time

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
]


def fetch_with_random_ua(url, session):
    """Fetches a URL using a random User-Agent from the list."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    print(f"Fetching {url} with UA: {headers['User-Agent']}")
    try:
        # Add a small delay
        time.sleep(0.5)
        response = session.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        print(f"Fetched {url} - Status: {response.status_code}")
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None


if __name__ == "__main__":
    target_urls = [
        "https://httpbin.org/user-agent",  # This endpoint echoes the UA
        "https://httpbin.org/user-agent",
        "https://httpbin.org/user-agent",
    ]

    with requests.Session() as session:  # Combine with a session for efficiency
        for target_url in target_urls:
            fetch_with_random_ua(target_url, session)
            print("-" * 20)
In this snippet:
USER_AGENTS contains a list of common browser user-agent strings.
Before each request, random.choice() selects one user-agent at random from the list, ensuring the rotation of the user agents in the list.
This technique helps mimic traffic from different browsers/users, reducing the chance of being blocked based on a repetitive user-agent signature. After running python main.py, you will obtain the following result:
As you can see, the result shows that the target URLs received requests with different user agents.
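To make the rotation more convincing, you can also pair each user-agent with a browser-like set of common headers, since a Chrome user-agent with no Accept-Language header can still look suspicious. Here is a minimal sketch that reuses the USER_AGENTS list from above; the header values are illustrative examples, not a canonical fingerprint:

import random

# Assumption: USER_AGENTS is the list defined in the previous script
def build_headers():
    """Returns a browser-like header set with a randomly chosen user-agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate",
    }

# Example usage:
# response = session.get(url, headers=build_headers(), timeout=10)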
Anti-Blocking Techniques: Proxy Rotation
Websites, especially high-traffic ones, monitor the number of requests coming from individual IP addresses. If too many requests arrive from the same IP in a short period, the site may throttle the connection or block the IP.
A way to overcome this is by implementing proxy rotation: a solution that routes requests through a pool of different proxy servers. This way, each request appears to originate from a unique IP address. This technique mimics traffic from multiple distinct users, making it harder for websites to identify and block the scraping activity based solely on IP address.
A simple way to implement proxy rotation in Python is as follows:
import random
import requests

proxies_list = [
    "http://PROXY_1:PORT_X",
    "http://PROXY_2:PORT_Y",
    "http://PROXY_3:PORT_Z",
    # Add more proxies as needed
]


def get_random_proxy():
    proxy = random.choice(proxies_list)
    return {
        "http": proxy,
        "https": proxy,
    }


for i in range(3):
    proxy = get_random_proxy()
    response = requests.get("https://httpbin.org/ip", proxies=proxy)
    print(response.text)
The above code:
Stores a pool of proxies as a list in proxies_list.
Selects a random proxy URL from the list with random.choice(). This is the simplest way to ensure proxy rotation.
Below is the expected response:
{
"origin": "PROXY_3:PORT_K"
}
{
"origin": "PROXY_1:PORT_N"
}
{
"origin": "PROXY_2:PORT_P"
}
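In practice, some proxies in the pool will be slow or dead, so you will also want basic failure handling. The following is a minimal sketch, still using the placeholder proxies_list from above, that retries a request through a different proxy when one fails:

import random
import requests

def fetch_through_proxies(url, max_attempts=3):
    """Tries up to max_attempts different proxies before giving up."""
    for attempt in range(max_attempts):
        proxy = random.choice(proxies_list)  # proxies_list as defined above
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10,
            )
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Proxy {proxy} failed ({e}), trying another one...")
    return None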
Why Scrapy Can Be The Optimal Solution For Scraping High-traffic Websites With Python
As you have learned, there are several challenges to overcome when scraping high-traffic websites. So, you might be wondering whether there is a way to implement all of these solutions, and more, in one place.
In this case, Scrapy can be considered your best choice. Scrapy is a comprehensive framework built from the ground up specifically for large-scale, efficient web crawling and scraping. Unlike individual libraries like requests and BeautifulSoup, which require you to implement many features manually, Scrapy provides an integrated architecture that directly addresses the core challenges of scraping high-traffic websites.
Here is why Scrapy excels in these scenarios:
Asynchronous networking core: Scrapy is built on top of Twisted, a mature, event-driven networking engine. This means Scrapy handles network requests asynchronously by default, with no extra implementation work on your side.
Built-in concurrency and delay management: Scrapy provides fine-grained control over concurrency and request delays. Settings like CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP, and DOWNLOAD_DELAY allow precise tuning. The AutoThrottle extension can also dynamically adjust delays based on server load, helping you scrape as fast as possible without getting blocked. These options are set in the settings.py file that Scrapy automatically creates when you start a new project (a sketch of such a file follows this list).
Robust middleware architecture: Scrapy has a middleware system for customizing request and response handling. Custom middleware can, for example, manage retries, redirects, proxy rotation, user-agent rotation, and more.
Efficient resource management: Scrapy manages memory efficiently, handling request queues, duplicate filtering (so URLs aren't scraped multiple times unnecessarily, via the DupeFilter class), and data processing pipelines. It is designed for large crawls that would otherwise consume excessive RAM.
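As mentioned above, these options are plain entries in your project's settings.py. Here is a minimal sketch of what such a configuration could look like; the values are illustrative examples, not recommendations for any specific site:

# settings.py (excerpt) - illustrative values only

# Global and per-domain concurrency limits
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Base delay between requests to the same site, in seconds
DOWNLOAD_DELAY = 0.5

# Let AutoThrottle adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10

# Retry transient errors such as 5xx responses
RETRY_ENABLED = True
RETRY_TIMES = 2

# Respect robots.txt
ROBOTSTXT_OBEY = True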
Conclusion
Scraping high-traffic websites poses significant challenges. In this article, you learned that overcoming them requires optimizing your Python scripts with a range of techniques.
While these techniques enhance standalone Python scrapers, Scrapy provides many of them out of the box, or makes them easy to implement, thanks to its architecture. For this reason, you can consider Scrapy an optimal choice for scraping high-traffic websites in Python.
Happy scraping to you all! Oh, and remember to always respect the robots.txt file.