How Reverse Proxies Route and Protect Web Traffic
A web scraper's guide to understanding the placement of anti-bot systems.
In web scraping, “proxy” is probably one of the most used words. In this context, it is used interchangeably with “proxy server,” but the more precise name is “forward proxy”.
The qualifier “forward” differentiates these proxies from reverse proxies, a topic that is discussed far less often in web scraping.
In this article, you’ll read about what reverse proxies are, how they are used, and how they differ from forward proxies.
Ready? Let’s dive in!
What Is a Reverse Proxy?
A reverse proxy is a server that sits in front of one or more web servers, intercepting all incoming requests from clients.
With this configuration, the reverse proxy appears as the actual server to the client. This means that a reverse proxy server acts like a middleman. It communicates with clients so that they never interact directly with the web servers.
Reverse proxy servers improve security, shielding web servers from direct exposure to the Internet. They also work as load balancers, as they can split requests among multiple servers.
How Reverse Proxies Work
To appreciate their benefits, you have to understand the journey a user's request takes when a reverse proxy manages the traffic:
The client initiates a request: It all begins when a user's client—a web browser or a mobile application—sends a request to access a web application's domain. This request travels across the internet and arrives first at the reverse proxy, which serves as the application's public-facing front door.
The reverse proxy evaluates the request: On receiving the request, the reverse proxy inspects its content. It analyzes information like the requested URL path, headers, and cookies to determine the correct course of action. Based on its pre-configured rules, it makes a decision: can it fulfill this request using its cached data, or must it forward the request deeper into the internal network?
The reverse proxy forwards the request: If the proxy can’t serve the request from its cache, it forwards the request to an appropriate backend server. How it picks that server depends on the architecture and on the goal you need to achieve (for example, balancing load across several servers).
The backend server processes the request: The selected backend server receives the request from the proxy and performs the necessary work. This could involve running business logic, querying a database, or calling other internal services.
The proxy caches and delivers the final response: The reverse proxy receives the response from the backend server. If its caching rules permit, it stores a copy of this response for a set duration. Then, the proxy delivers the response to the original client, completing the process.
When the process is completed, the user receives the requested data without any visibility into the backend. The communication between the proxy, the web servers, and any other connected services is completely hidden from the client.
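The five steps above can be sketched as a toy, in-process model. This is only an illustration under stated assumptions, not how any particular reverse proxy is implemented: the class `TinyReverseProxy`, the `catalog_backend` function, the `/catalog` path, and the 30-second cache lifetime are all hypothetical, and backends are plain Python functions instead of real servers.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Backend = Callable[[str], str]  # takes a URL path, returns a response body


class TinyReverseProxy:
    """Toy model of the request journey; no real networking involved."""

    def __init__(self, backends: Dict[str, Backend], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.backends = backends          # path prefix -> backend function
        self.ttl_seconds = ttl_seconds    # hypothetical cache lifetime
        self.clock = clock
        self.cache: Dict[str, Tuple[float, str]] = {}  # path -> (expiry, body)

    def _select_backend(self, path: str) -> Optional[Backend]:
        # Choose the backend whose prefix is the longest match for the path.
        matches = [prefix for prefix in self.backends if path.startswith(prefix)]
        if not matches:
            return None
        return self.backends[max(matches, key=len)]

    def handle(self, path: str) -> str:
        now = self.clock()
        # Step 2: evaluate the request and answer from the cache if still fresh.
        cached = self.cache.get(path)
        if cached is not None and cached[0] > now:
            return cached[1]
        # Step 3: otherwise pick a backend to forward the request to.
        backend = self._select_backend(path)
        if backend is None:
            return "404: no backend configured for this path"
        # Step 4: the backend does the actual work.
        body = backend(path)
        # Step 5: cache the response, then deliver it to the client.
        self.cache[path] = (now + self.ttl_seconds, body)
        return body


backend_calls = []


def catalog_backend(path: str) -> str:
    backend_calls.append(path)  # record every request that reaches the backend
    return f"catalog data for {path}"


proxy = TinyReverseProxy({"/catalog": catalog_backend})
first = proxy.handle("/catalog/shoes")   # forwarded to the backend
second = proxy.handle("/catalog/shoes")  # answered from the proxy's cache
```

In this sketch the second call never reaches `catalog_backend`, which is the effect the caching step is meant to have. The client only ever talks to `proxy.handle`, mirroring how the backend stays hidden.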
Pros and Cons of Using Reverse Proxies
Below is a list of the benefits of reverse proxies:
Protection from cyber attacks: Reverse proxies intercept requests coming from clients. This makes them a first line of defense against any malicious requests, blocking them before they can reach the servers. Also, the client sees the proxy’s IP, not the web service’s. This makes it harder for malicious actors to perform direct-to-IP attacks like DDoS (Distributed Denial of Service).
SSL/TLS encryption: Encrypting and decrypting SSL/TLS communications for each client can be computationally expensive for servers. Reverse proxies can be configured to decrypt all incoming requests and encrypt all outgoing responses, taking that work off the backend servers.
Caching: Reverse proxies can cache content. Repeated requests for the same resource can then be answered directly by the proxy, which shortens response times and reduces the load on the backend.
Load balancing: Websites’ backends can be distributed across different servers for several reasons, for example, because they receive high traffic. Reverse proxies can distribute the incoming traffic among servers to prevent any single one from becoming overloaded.
Facilitating modern deployment strategies: Reverse proxies can provide granular control over traffic flow. For example, they can be configured to perform a "canary release" by routing a small, controlled percentage of live traffic to a new version, while the majority of users continue to use the stable, existing version. This allows teams to monitor the new code's performance before a full rollout, and the same traffic splitting can support A/B testing.
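To make the load balancing and canary release items more concrete, here is a minimal sketch of the two routing decisions. The server names (`app-1`, `app-2`, `app-3`), the 5% canary share, and the helper names `round_robin` and `pick_version` are assumptions chosen for illustration. Real reverse proxies offer several balancing strategies, and round-robin is just one of the simplest.

```python
import itertools
import random
from typing import Iterator, List


def round_robin(servers: List[str]) -> Iterator[str]:
    """Yield servers in a repeating order so each one gets an equal share."""
    return itertools.cycle(servers)


def pick_version(canary_share: float, rng: random.Random) -> str:
    """Send roughly `canary_share` of the requests to the new version."""
    return "canary" if rng.random() < canary_share else "stable"


# Load balancing: four consecutive requests spread over three servers.
pool = round_robin(["app-1", "app-2", "app-3"])
first_four = [next(pool) for _ in range(4)]

# Canary release: about 5% of 10,000 simulated requests hit the new version.
rng = random.Random(0)  # seeded only to make the simulation repeatable
picks = [pick_version(0.05, rng) for _ in range(10_000)]
canary_fraction = picks.count("canary") / len(picks)
```

Because the canary decision is random per request, the observed fraction only approximates the configured share. A production setup would often pin a given user to one version (for example, by cookie) so the experience stays consistent.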
Reverse proxies also come with downsides. The common ones are:
Single point of failure: Reverse proxies route clients’ requests to the servers, acting as middlemen. If the proxy fails, everything behind it becomes unreachable, and requests fail even when the backend servers are healthy.
Traffic interception: A reverse proxy that terminates SSL/TLS handles decrypted traffic. If it is compromised, attackers can read all of that traffic in plaintext. This gives them access to sensitive data like passwords and session tokens for every backend service the proxy protects.
Data caching risks: Improperly configured caching can cause the proxy to store private, user-specific content. This creates the risk of serving one user's sensitive data or session cookies to another.
Difficulties in setup: Many reverse proxy systems require technical know-how and skills to set them up with existing systems.
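The data caching risk listed above usually comes down to the proxy storing a response that was meant for one user only. Below is a deliberately conservative sketch of a check a proxy could run before caching. The function name `is_safe_to_cache` and the exact policy are assumptions for illustration. Real caches follow the more nuanced HTTP caching rules, and this version simply refuses anything that looks user-specific.

```python
from typing import Mapping


def is_safe_to_cache(request_headers: Mapping[str, str],
                     response_headers: Mapping[str, str]) -> bool:
    """Return False for responses that look private or user-specific."""
    # Header names are case-insensitive, so normalize them first.
    req = {name.lower(): value for name, value in request_headers.items()}
    resp = {name.lower(): value for name, value in response_headers.items()}

    # Split "Cache-Control: private, max-age=60" into {"private", "max-age"}.
    directives = {
        part.strip().split("=", 1)[0].lower()
        for part in resp.get("cache-control", "").split(",")
        if part.strip()
    }

    if directives & {"private", "no-store"}:
        return False  # the origin explicitly forbids shared caching
    if "set-cookie" in resp:
        return False  # would hand one user's session cookie to another
    if "authorization" in req and "public" not in directives:
        return False  # authenticated responses stay out unless marked public
    return True


public_page = is_safe_to_cache({}, {"Cache-Control": "public, max-age=60"})
login_reply = is_safe_to_cache({}, {"Set-Cookie": "sid=abc123"})
```

Here `public_page` is accepted and `login_reply` is rejected. Erring on the side of not caching costs some performance but avoids leaking one user's data to another.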
Forward Proxies vs. Reverse Proxies: A Comparison
Proxies always act as middlemen between parties. What changes is where they are positioned:
Forward proxy: The proxy sits on the client's side. Requests pass through it before they reach the Internet.
Reverse proxy: The proxy sits on the server's side. Requests reach it after crossing the Internet, before they get to the web servers.
Below is a schema for visual representation:
Let’s discuss the differences between these two types of proxies:
Configuration actor:
Forward: The client-side administrator or the end-user.
Reverse: The server-side administrator, DevOps, or backend engineer.
Configuration target:
Forward: The configuration is applied to the client machine or application (a web browser, an OS-level setting, or an application's SDK). The client must be explicitly told to use the proxy.
Reverse: The configuration is applied to the proxy server itself and the DNS records for the public-facing service. The client is completely unaware of its existence.
Core logic:
Forward: The proxy is configured with rules about outbound traffic. It can also require authentication to identify the user making the request.
Reverse: The proxy is configured with rules about inbound traffic. It maps public hostnames and paths to internal backend servers.
SSL/TLS Handling:
Forward: A forward proxy that needs to inspect HTTPS traffic performs a "man-in-the-middle" action. The client must trust a special root certificate installed by the network administrator. The proxy decrypts, inspects, and then re-encrypts the traffic to the destination.
Reverse: A reverse proxy typically performs SSL/TLS Termination. The client establishes a secure HTTPS connection with the reverse proxy. The proxy decrypts the traffic and then forwards it to the backend servers.
Categorization:
Forward: There are different "types" of forward proxies—residential, mobile, datacenter. They are all categorized by the nature of their outbound IP address.
Reverse: Reverse proxies sit at a stable, public-facing address, so the nature of their IP is not what sets them apart. Instead, you categorize them by their function, such as load balancer, API gateway, or Web Application Firewall (WAF).
Common uses:
Forward: You use a forward proxy when you want to control, monitor, or mask traffic originating from a client or a group of clients. For example, for anonymity and geo-unblocking, or for large-scale web scraping.
Reverse: You use a reverse proxy when you want to protect, manage, and scale your backend services. For example, as a load balancer, security shield, or API gateway.
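The difference in configuration target can be shown side by side. In the sketch below, the forward proxy is configured on the client using the standard library's `urllib.request.ProxyHandler`, while the reverse proxy side is modeled as a routing table that maps public hostnames and paths to internal servers. The proxy URL, the hostname `shop.example`, the internal addresses, and the helper `resolve_upstream` are all hypothetical, and no network request is made.

```python
import urllib.request
from typing import Dict, Optional, Tuple

# Forward proxy: the CLIENT is explicitly told which proxy to use.
FORWARD_PROXY_URL = "http://forward-proxy.example:8080"  # hypothetical address
proxy_handler = urllib.request.ProxyHandler(
    {"http": FORWARD_PROXY_URL, "https": FORWARD_PROXY_URL}
)
client = urllib.request.build_opener(proxy_handler)
# client.open("https://example.com") would now travel through the forward proxy.

# Reverse proxy: the SERVER side maps public (host, path prefix) to a backend.
ROUTES: Dict[Tuple[str, str], str] = {
    ("shop.example", "/api"): "http://10.0.0.11:8000",  # hypothetical backends
    ("shop.example", "/"): "http://10.0.0.12:8080",
}


def resolve_upstream(host: str, path: str) -> Optional[str]:
    """Return the internal server for a public host and path, if any."""
    candidates = [
        (prefix, upstream)
        for (route_host, prefix), upstream in ROUTES.items()
        if route_host == host and path.startswith(prefix)
    ]
    if not candidates:
        return None
    # The most specific (longest) prefix wins.
    return max(candidates, key=lambda item: len(item[0]))[1]


api_target = resolve_upstream("shop.example", "/api/items")
page_target = resolve_upstream("shop.example", "/cart")
```

Notice that nothing in the reverse proxy half touches the client: a browser requesting `shop.example` needs no special setup, which is why the client stays unaware of the proxy.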
Forward and Reverse Proxies in Web Scraping
The distinction between forward and reverse proxies can become clearer when viewed through the lens of web scraping. In this scenario:
Forward proxies represent the tools for data extraction.
Reverse proxies represent the protective architecture.
From the scraper's perspective, the forward proxy is an indispensable tool to scrape data from webpages. You use a pool of forward proxies to orchestrate your scraping strategy. The goal is to distribute your requests across different IP addresses, making your traffic appear as if it originates from thousands of unique and legitimate users.
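As a rough illustration of spreading requests across a pool, the sketch below assigns each target URL to the next proxy in a rotating list. The proxy addresses, the target URLs, and the helper `assign_proxies` are hypothetical. The `{"http": ..., "https": ...}` mapping is the shape that `urllib.request.ProxyHandler` accepts, and no request is actually sent here.

```python
import itertools
from typing import Dict, List, Sequence, Tuple

# Hypothetical pool; a real one would come from your proxy provider.
PROXY_POOL = [
    "http://proxy-1.example:8000",
    "http://proxy-2.example:8000",
    "http://proxy-3.example:8000",
]


def assign_proxies(urls: Sequence[str],
                   pool: Sequence[str]) -> List[Tuple[str, Dict[str, str]]]:
    """Pair each URL with the next proxy in the pool, wrapping around."""
    if not pool:
        raise ValueError("the proxy pool must not be empty")
    rotation = itertools.cycle(pool)
    plan = []
    for url in urls:
        proxy = next(rotation)
        # Same proxy for both schemes, so the whole request uses one exit IP.
        plan.append((url, {"http": proxy, "https": proxy}))
    return plan


targets = [f"https://shop.example/products?page={n}" for n in range(1, 5)]
plan = assign_proxies(targets, PROXY_POOL)
```

Simple rotation like this is only a starting point. It spreads the request count per IP, but it does nothing about headers, timing, or other signals the target may inspect.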
From the site administrator's perspective, the reverse proxy is the primary shield: it acts as the gatekeeper for the entire backend infrastructure. You configure it to identify and neutralize threats before they can reach the servers. This involves enforcing rate limits to stop aggressive bots, deploying a Web Application Firewall (WAF) to analyze traffic for suspicious patterns, and hiding the true IPs of your web servers.
Reverse Proxies in Web Scraping Scenarios
From a web scraping perspective, website owners can use reverse proxies in different situations:
Implementing rate limiting: Scrapers that send too many requests get banned because an administrator has set a rate limit. The administrator configures the reverse proxy to track the number of requests coming from a single IP address within a specific time window. If an IP exceeds this limit, the proxy automatically blocks it, temporarily or permanently. E-commerce websites are a typical case.
Deploying a WAF: A web application firewall (WAF) can be considered a form of reverse proxy, as it sits between the client and the web server to protect the latter. Among other signals, WAFs can analyze the User-Agent header to detect patterns associated with known malicious bots or attack tools. Based on the detected patterns, WAFs can block requests from specific user agents or subject them to additional security checks like CAPTCHAs.
Complete protection: While you can build your own reverse proxy, commercially available CDNs (like Cloudflare) and security solutions function as reverse proxies and can go a step further. They can act as a WAF, provide DDoS protection, and do a lot more without you writing a line of code.
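The first two scenarios above can be sketched in a few lines. The example below shows a per-IP sliding-window rate limiter and a naive User-Agent check. The class `SlidingWindowLimiter`, the limit of 3 requests per 10 seconds, the marker list, and the documentation-range IP address are assumptions for illustration. Commercial WAFs combine many more signals than a single header.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `max_requests` per IP within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self.clock()
        timestamps = self.hits[ip]
        # Forget requests that have fallen out of the time window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # over the limit: block this request
        timestamps.append(now)
        return True


# Hypothetical markers; real rule sets are far larger and updated often.
SUSPICIOUS_UA_MARKERS = ("python-requests", "curl", "scrapy")


def looks_like_bot(user_agent: str) -> bool:
    """Flag empty User-Agents and ones containing a known tool name."""
    lowered = user_agent.lower()
    return not lowered or any(marker in lowered for marker in SUSPICIOUS_UA_MARKERS)


fake_now = [0.0]  # controllable clock so the example does not need to sleep
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10,
                               clock=lambda: fake_now[0])
decisions = [limiter.allow("203.0.113.7") for _ in range(4)]
fake_now[0] = 10.0  # the window has passed
allowed_again = limiter.allow("203.0.113.7")
```

In this run the fourth request inside the window is refused, and the same IP is accepted again once the window has elapsed. This is the temporary-block behavior scrapers typically run into.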
Conclusion
In this article, you’ve gone through the definition and some applications of reverse proxies. In web scraping scenarios, reverse proxies are usually where the defenses that block your scrapers are set up.
So, let us know: are you using reverse proxies? And if yes, what are you building with them?