The Stealth Stack: A Guide to Preventing Data Leaks in Web Scraping Infrastructure
A four-layer defense strategy for making your web scraping infrastructure indistinguishable from real users
When you hear about “data leaks,” you probably think about cybersecurity, breached databases, and personal information stolen with malicious intent. But what if I told you that your web scraper is leaking data? In the specific context of web scraping, no one is stealing your data. Rather, it means your scraper is revealing its automated nature through a set of signals.
In particular, your scrapers leak information at four distinct layer levels. Modern anti-bot systems, in fact, fingerprint your browser, analyze your TLS handshake, trace your network infrastructure, and track your behavioral patterns. And a single inconsistency across these layers triggers permanent blocking.
This means your scrapers aren’t competing only against rate limits anymore. Today, they are competing against machine learning models trained on billions of legitimate requests, and any deviation from the expected pattern is a signal. So, if you want to scrape at scale, your infrastructure must be indistinguishable from a real user’s browser, network stack, and behavior.
This article guides you through a systematic approach: First, understanding where leaks occur, then learning how anti-bot systems detect them, and finally building a layered defense that makes your scraper invisible.
Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.
Identifying the Leaks: Where Your Scraper Exposes Itself
Before fixing anything, you need to understand the complete attack surface. Modern anti-bot systems analyze your scraper at four distinct layers, and a leak at any layer can expose you.
Layer 1: The Browser Level
Headless browsers are loud by default. Launch a Puppeteer instance and check the navigator.webdriver flag: it returns true, and that’s a signal every major anti-bot system checks in the first 100ms of page load.
But this obvious flag is just the beginning. Anti-bot systems probe deeper:
Error messages and stack traces: They differ between headless and headed modes. The execution context leaves fingerprints in error objects.
Window dimensions: Properties like window.outerWidth and window.outerHeight reveal a headless operation because headless mode doesn’t render a visible window frame.
Canvas rendering: They can produce pixel-level differences. Software rendering (headless) creates different anti-aliasing and color values than GPU-accelerated rendering (headed). Color channels can differ by 1-2 units per pixel.
WebGL shader timing: This can vary a lot, depending on the underlying technology. GPU-accelerated browsers complete WebGL operations in microseconds. Software-rendered headless browsers take milliseconds.
Font rendering: Headless environments often lack the full system font stack. This creates detectable layout differences when JavaScript measures text dimensions.
Performance benchmarks: When run, they can reveal software rendering. For example, there are websites that run JavaScript stress tests, creating thousands of DOM elements, calculating layouts, and triggering reflows. In such scenarios, real browsers with GPU acceleration show consistent performance. Headless browsers, instead, show different timing patterns.
The window.chrome object behaves differently: Real Chrome populates this object with specific properties for extension management and runtime APIs. Headless Chrome, instead, either lacks this object or provides an incomplete implementation.
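As an illustration, the checks above can be sketched as a small probe catalog. The probe expressions and the `audit` helper below are hypothetical and simplified; real anti-bot scripts run far more checks, heavily obfuscated:

```python
# Hypothetical catalog of browser-level probes, modeled on the signals above.
# Each value is a JS expression that evaluates to true in a suspicious browser.
HEADLESS_PROBES = {
    "webdriver_flag": "navigator.webdriver === true",
    "no_window_frame": "window.outerWidth === 0 || window.outerHeight === 0",
    "missing_chrome_object": "typeof window.chrome === 'undefined'",
    "empty_plugin_array": "navigator.plugins.length === 0",
}

def audit(evaluate):
    """Run each probe through a page.evaluate-style callable; return the hits."""
    return sorted(name for name, js in HEADLESS_PROBES.items() if evaluate(js))

# Example: a stock headless browser typically trips the webdriver probe.
fake_results = {"navigator.webdriver === true": True}
hits = audit(lambda js: fake_results.get(js, False))
print(hits)  # ['webdriver_flag']
```

The point is not the individual probes but the aggregation: one hit is enough to raise suspicion, and several hits together are conclusive.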
Layer 2: The Network Level
Your SSL/TLS handshake identifies you before you send any application data. When your scraper connects over HTTPS, it sends a TLS Client Hello message containing supported encryption methods, protocol versions, and extensions. All in a specific order.
Here’s what makes this dangerous:
Every browser and HTTP library has a unique TLS pattern: Real browsers send their TLS parameters in a specific sequence that matches their version and underlying platform. Python’s standard HTTP libraries send a completely different pattern. So do Node.js, Go, and any other programming language you use for coding your scrapers.
Anti-bot systems fingerprint your TLS handshake: They capture these patterns and convert them into a fingerprint, commonly called a JA3 hash. They maintain databases of known fingerprints for every major browser and HTTP library.
Mismatches between User-Agent and TLS fingerprint are instant red flags: When you claim to be Chrome in your User-Agent header but your TLS handshake matches Python’s urllib library, that inconsistency triggers blocking.
Detection happens before you send any application data: The first TCP connection already identifies you as automated traffic.
HTTP/2 fingerprinting adds another layer: Beyond TLS, the order and priority of HTTP/2 frames, settings, and window updates create additional fingerprints. Your HTTP library’s frame ordering must match your claimed browser identity.
For your scraping needs, having a reliable proxy provider like Decodo, with high-reputation IPs, on your side improves the chances of success.
Layer 3: The Infrastructure Level
Your proxy configuration can expose your real infrastructure through network-level leaks via the following main mechanisms:
DNS leaks: They happen when your browser resolves domain names using your local DNS server instead of routing through the proxy. Your scraper might send requests through a Miami residential proxy, but if DNS queries go through your AWS datacenter in Virginia, the target site knows your real location.
WebRTC leaks: WebRTC is a browser API designed for peer-to-peer communication. Even with a proxy configured, WebRTC will attempt to discover your real local IP and public IP through STUN servers, completely bypassing your proxy.
IP reputation: Not all IPs are created equal. Cloudflare and similar services maintain databases of every AWS, Google Cloud, and Azure IP range. Requests from known cloud providers receive instant higher suspicion scores before any other analysis happens.
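A first line of defense is auditing your own exit IPs before a target does. The ranges below are a tiny illustrative sample; the full lists are published by each cloud provider and change over time, so refresh them from the providers' feeds:

```python
import ipaddress

# Illustrative sample of datacenter ranges; real provider lists are much
# larger and should be loaded from the published AWS/GCP/Azure IP feeds.
DATACENTER_RANGES = [
    ipaddress.ip_network("3.0.0.0/9"),     # sample AWS block
    ipaddress.ip_network("34.64.0.0/10"),  # sample Google Cloud block
]

def is_datacenter_ip(ip: str) -> bool:
    """True if the exit IP falls inside a known datacenter range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)

# A proxy exit in one of these ranges starts every request with a penalty.
print(is_datacenter_ip("3.5.5.5"))   # True: inside the sample AWS block
print(is_datacenter_ip("8.8.8.8"))   # False: not in the sample list
```

Running this check against your proxy pool's exit IPs tells you whether you're paying for residential IPs but actually routing through datacenter ranges.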
Layer 4: The Behavioral Level
Even if your browser, network, and infrastructure are perfectly disguised, your behavior patterns can still expose you:
Timing patterns: Requesting data at fixed and precise intervals creates a perfect periodicity. No human browses with mathematical precision.
Mouse and scroll behavior: Real humans accelerate and decelerate smoothly. Instant jumps from point A to point B are mechanically impossible.
Session state: Stateless scrapers that never accumulate cookies or maintain persistent sessions across days look like fresh bots on every run.
Interaction sequences: The time between page load and first click, between mouse-over and click, or the pattern of how you scroll through content. They all follow detectable human patterns.
Understanding the Detection: How Anti-Bot Systems Catch You
Now that you know where leaks occur, let’s understand how anti-bot systems actually detect them.
Fingerprint Consistency Checks
Anti-bot systems cross-reference your claimed identity with actual behavior. If your User-Agent says “Chrome 120 on Windows 10,” they verify that your JavaScript features, WebGL capabilities, canvas rendering, and TLS handshake all match Chrome 120 on Windows 10.
A single mismatch anywhere flags the entire request. You can’t be Chrome in your User-Agent, Firefox in your TLS handshake, and headless Chrome in your canvas fingerprint. Anti-bot systems create composite fingerprints combining dozens of properties, then compare them against databases of known legitimate and bot patterns.
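As a toy illustration of this cross-referencing (the field names are invented for the example; real systems combine dozens of properties):

```python
def fingerprint_consistent(profile: dict) -> bool:
    """A composite fingerprint passes only if every layer tells the same story."""
    layers = ("user_agent_browser", "tls_ja3_browser", "canvas_browser")
    claims = {profile.get(layer) for layer in layers}
    return len(claims) == 1 and None not in claims

# Chrome everywhere: consistent.
print(fingerprint_consistent({
    "user_agent_browser": "chrome120",
    "tls_ja3_browser": "chrome120",
    "canvas_browser": "chrome120",
}))  # True

# Chrome User-Agent over a Python TLS stack: flagged.
print(fingerprint_consistent({
    "user_agent_browser": "chrome120",
    "tls_ja3_browser": "python-urllib",
    "canvas_browser": "chrome120",
}))  # False
```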
Machine Learning Pattern Recognition
Modern anti-bot systems use ML models trained on billions of requests. They learn what “normal” looks like for each type of visitor. This means that consumer browsers from residential IPs have different behavioral patterns than datacenter scrapers.
For ML models, statistical anomalies trigger investigation. Perfect timing intervals, impossible mouse movements, or timing patterns that don’t match human variance distributions are scored as anomalous. These models adapt continuously, so when new stealth techniques emerge, the models retrain on that data. This means that what works today might fail tomorrow.
Progressive Trust Scoring
Anti-bot systems don’t just block or allow requests: they also score them. Requests with lower trust scores receive degraded service: slower response times, rate limits, or CAPTCHA challenges before outright blocking.
Also, scores accumulate across sessions. If you leak information across multiple visits, the system builds a profile associating your various identities. In other words, one leak can poison future requests, and even fixing the leak might not restore trust if your IP or fingerprint is already marked.
Building the Defense: A Layered Approach to Stealth
Building a defense against data leaks in web scraping requires addressing each layer systematically. Your stealth stack must work from the inside out: browser → network → infrastructure → behavior. Each layer must remain consistent with your claimed identity.
Defense Layer 1: Hardening the Browser
The goal at this layer is to make the browser fingerprint indistinguishable from a real user’s browser and ensure every property is consistent with your claimed identity.
Step 1: Mask Automation Signals
Start with stealth libraries that patch the most common detection vectors:
For Puppeteer: Use puppeteer-extra-plugin-stealth to automatically override navigator.webdriver, DevTools Protocol signatures, and plugin arrays.
For Selenium: Use undetected-chromedriver, which patches automation signals and uses real Chrome binaries instead of ChromeDriver.
For Playwright: Leverage native evasion features that handle many detection vectors out of the box.
Additionally, disable automation flags at launch. For example, in Playwright:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-dev-shm-usage',
            '--no-sandbox'
        ]
    )

But remember: Stealth libraries handle the most common 20-30 leak vectors but miss advanced fingerprinting techniques. They’re your foundation, not your complete solution.
Step 2: Spoof Hardware Signatures
Cloud server canvas and WebGL fingerprints are obvious red flags. AWS, GCP, and Azure rendering signatures are well-known to anti-bot systems.
You have two approaches for your defense here:
Add consistent noise: Inject deterministic noise into canvas operations so the fingerprint remains stable across sessions but doesn’t match your server’s real hardware. Override canvas methods to modify pixel data slightly before it’s read back. Keep noise minimal: just enough to mask the real hardware signature without appearing obviously manipulated.
Emulate common consumer hardware: Spoof WebGL parameters to mimic common consumer GPUs. Override vendor and renderer strings returned by WebGL APIs to match your chosen hardware profile. Use existing libraries designed for canvas fingerprint defense or implement your own parameter overrides.
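A minimal sketch of the “consistent noise” approach, written as a Playwright init script. The XOR of the lowest red bit is an illustrative, deterministic perturbation; a production implementation would seed the noise per profile and also cover getImageData, toBlob, and WebGL readbacks:

```python
# JS injected before any page script runs: it perturbs canvas pixel data
# so readbacks no longer match the server's real rendering, while staying
# deterministic (the same canvas always yields the same fingerprint).
CANVAS_NOISE_JS = """
(() => {
  const origToDataURL = HTMLCanvasElement.prototype.toDataURL;
  HTMLCanvasElement.prototype.toDataURL = function (...args) {
    const ctx = this.getContext('2d');
    if (ctx && this.width > 0 && this.height > 0) {
      const img = ctx.getImageData(0, 0, this.width, this.height);
      for (let i = 0; i < img.data.length; i += 4) {
        img.data[i] ^= 1;  // flip the lowest red bit of every pixel
      }
      ctx.putImageData(img, 0, 0);
    }
    return origToDataURL.apply(this, args);
  };
})();
"""

# Usage with the sync Playwright API, before any navigation:
# page.add_init_script(CANVAS_NOISE_JS)
```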
Step 3: Ensure Version Consistency
This is where most scrapers fail, even with stealth libraries. Your User-Agent string must match your actual browser engine behavior precisely. Consider the following rules of thumb:
Use real browser binaries instead of spoofing: Tools like Playwright can launch actual Chrome, ensuring perfect consistency between claimed version and actual behavior.
If you must spoof, maintain complete version profiles: Track which JavaScript features, WebGL capabilities, and API behaviors correspond to each browser version. Every property must align.
Never mix components from different versions: If you claim Chrome 120 on Windows 10, every single API, from JavaScript features to WebGL renderers, must behave exactly like Chrome 120 on Windows 10.
Defense Layer 2: Hardening the Network Stack
Your goal at this layer is to make your TLS handshake and HTTP traffic indistinguishable from the browser you’re claiming to be.
Step 4: Match TLS Fingerprints to Your Browser Identity
Standard HTTP libraries can’t mimic browser TLS fingerprints because they use different SSL/TLS implementations. The solution requires specialized libraries that replicate browser behavior at the protocol level:
For Python: Use curl_cffi or similar wrappers. These libraries use libcurl compiled with BoringSSL, which is the same SSL library Chrome uses. This creates identical JA3 fingerprints to real browsers.
For Node.js: Use cycletls or equivalent libraries that allow you to specify exact JA3 fingerprint strings matching real browsers.
Critical requirement: Your TLS fingerprint must match your User-Agent. Chrome 120’s JA3 fingerprint is different from Firefox 115’s fingerprint. The browser identity must be consistent across all layers.
Step 5: Match HTTP/2 Fingerprints
Beyond TLS, HTTP/2 frame ordering creates additional fingerprints. Libraries like curl_cffi handle this automatically when you specify a browser to impersonate, but verify that:
Settings frames match your target browser.
Window update sequences align.
Priority headers follow the correct pattern.
In Python, you can do so with the following code:
from curl_cffi import requests

response = requests.get(
    'https://tls.peet.ws/api/all',
    impersonate='chrome120'
)
print(response.json()['http2']['sent_frames'])
Defense Layer 3: Hardening Infrastructure
Your goal at this layer is to ensure your network traffic originates from legitimate-looking IPs and doesn’t leak your real location or identity.
Step 6: Choose the Right Proxy Type
IP reputation is the first filter that anti-bot systems check. This means that your proxy choice determines your baseline trust score. Consider the following guidelines:
Datacenter IPs = instant red flag: Requests from AWS, Google Cloud, and Azure IP ranges receive instant higher suspicion scores.
Residential proxies = highest legitimacy: These IPs come from real ISP connections, so they look legitimate because they are legitimate consumer connections.
Mobile proxies = premium legitimacy: These IPs originate from cellular networks (4G/5G) and receive the highest trust scores. Mobile IPs rotate naturally as devices move between cell towers, making them appear even more organic than static residential connections.
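These guidelines can be encoded as a simple routing policy: spend the cheap datacenter IPs on soft targets and reserve residential and mobile pools for protected ones. The endpoints below are placeholders for whatever your provider gives you:

```python
# Hypothetical proxy pools; the endpoints are placeholders, not real services.
PROXY_POOLS = {
    "datacenter": "http://user:pass@dc.proxy.example:8000",
    "residential": "http://user:pass@res.proxy.example:8000",
    "mobile": "http://user:pass@mob.proxy.example:8000",
}

def pick_proxy(protection_level: str) -> str:
    """Route each target to the cheapest pool that can still pass its defenses."""
    pool_by_level = {
        "low": "datacenter",      # unprotected sites: cheap IPs are fine
        "medium": "residential",  # standard anti-bot: ISP-grade legitimacy
        "high": "mobile",         # strictest targets: CGNAT mobile IPs
    }
    return PROXY_POOLS[pool_by_level[protection_level]]

print(pick_proxy("high"))  # http://user:pass@mob.proxy.example:8000
```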
Step 7: Prevent DNS Leaks
Force all DNS resolution through your proxy tunnel. For SOCKS5 proxies, use the SOCKS5h protocol variant, which forces DNS resolution on the remote proxy server instead of locally.
For example, in Python, write the following:
import requests  # requires: pip install requests[socks]

proxies = {
    'http': 'socks5h://proxy.example.com:1080',
    'https': 'socks5h://proxy.example.com:1080'
}
response = requests.get('https://example.com', proxies=proxies)
For browser automation, configure DNS-over-HTTPS to prevent local DNS leakage. The following is an example that applies to Playwright:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        args=[
            '--dns-over-https-server=https://cloudflare-dns.com/dns-query'
        ]
    )
Step 8: Disable WebRTC Completely
WebRTC will expose your real IP unless you completely disable it in browser automation. For example, in Playwright, you can do so as follows:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Remove WebRTC entirely before any page script runs
    page.add_init_script("""
        delete window.RTCPeerConnection;
        delete window.RTCSessionDescription;
        delete window.RTCIceCandidate;
        delete navigator.mediaDevices;
    """)
When you’ve done this, verify it’s actually disabled before deploying your scraper. Visit browserleaks.com/webrtc with your scraper: you should see “WebRTC is not supported by your browser,” or only your proxy IP should be visible. Never your real IP.
Defense Layer 4: Mimicking Human Behavior
Your goal at this layer is to make your interaction patterns indistinguishable from those of real human users.
Step 9: Add Timing Jitter and Randomization
Humans are inconsistent. Perfect patterns are robotic. The solution here is not just to add randomness: you also need to match the statistical distribution of real human behavior. To do so, consider the following example in Python:
import random
import time

import numpy as np

# Wrong: fixed interval
time.sleep(5)  # Always 5 seconds - DETECTABLE

# Wrong: uniform randomness still doesn't match human patterns
time.sleep(random.uniform(3, 7))

# Correct: log-normal distribution (matches real human reaction times)
delay = np.random.lognormal(mean=1.5, sigma=0.5)
time.sleep(delay)
For improving randomization, model different action types with appropriate distributions. Use the following rules of thumb:
Clicks: 0.3-2 seconds (short delays)
Reading: 5-45 seconds (high variance)
Scrolling: 1-8 seconds (irregular intervals)
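These rules of thumb can be wired into a single sampler. The log-normal parameters below are illustrative starting points; calibrate them against recordings of real sessions:

```python
import random

# (mu, sigma) for random.lognormvariate, plus clamps matching the rules
# of thumb above; the parameters are illustrative starting points only.
DELAY_PROFILES = {
    "click": (-0.5, 0.4, 0.3, 2.0),   # short delays
    "read": (2.7, 0.6, 5.0, 45.0),    # high variance
    "scroll": (1.0, 0.5, 1.0, 8.0),   # irregular intervals
}

def human_delay(action: str) -> float:
    """Sample a per-action delay in seconds from a log-normal distribution."""
    mu, sigma, lo, hi = DELAY_PROFILES[action]
    return min(max(random.lognormvariate(mu, sigma), lo), hi)

# Usage: time.sleep(human_delay("click")) before clicking, and so on.
```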
Step 10: Implement Realistic Mouse and Scroll Behavior
High-security sites like banking, ticketing, and heavily protected e-commerce websites track interaction patterns in real time. To avoid leaking your automated nature on such websites, you have to simulate realistic mouse movements and scrolling in your automated scripts.
For mouse movements, you can:
Use Bezier curves to create natural arcing movements between points.
Add slight randomness to destination coordinates.
Include hover delays before clicking.
Vary the number of intermediate steps based on distance.
The following is an example you can try in Python:
import numpy as np

def bezier_curve(start, end, control_points, num_steps=20):
    """Generate points along a cubic Bezier curve for natural mouse movement"""
    points = []
    for t in np.linspace(0, 1, num_steps):
        x = ((1 - t) ** 3 * start[0]
             + 3 * (1 - t) ** 2 * t * control_points[0][0]
             + 3 * (1 - t) * t ** 2 * control_points[1][0]
             + t ** 3 * end[0])
        y = ((1 - t) ** 3 * start[1]
             + 3 * (1 - t) ** 2 * t * control_points[0][1]
             + 3 * (1 - t) * t ** 2 * control_points[1][1]
             + t ** 3 * end[1])
        points.append((x, y))
    return points

async def human_like_click(page, selector, current_pos=(0, 0)):
    # Playwright (async API) doesn't expose the current mouse position,
    # so track it yourself and pass it in; it starts at (0, 0) on a fresh page
    element = await page.query_selector(selector)
    box = await element.bounding_box()
    # Add slight randomness to the destination
    target_x = box['x'] + box['width'] / 2 + np.random.normal(0, 2)
    target_y = box['y'] + box['height'] / 2 + np.random.normal(0, 2)
    control_points = [
        (current_pos[0] + np.random.uniform(-50, 50),
         current_pos[1] + np.random.uniform(-50, 50)),
        (target_x + np.random.uniform(-20, 20),
         target_y + np.random.uniform(-20, 20))
    ]
    # Move the mouse along the curve
    for x, y in bezier_curve(current_pos, (target_x, target_y), control_points):
        await page.mouse.move(x, y)
        await page.wait_for_timeout(np.random.uniform(5, 15))
    # Hover briefly before clicking
    await page.wait_for_timeout(np.random.uniform(100, 300))
    await page.mouse.click(target_x, target_y)
    return (target_x, target_y)  # pass this into the next call
For scrolling, you can:
Pause between scroll actions for variable amounts of time (simulating reading).
Scroll in chunks of varying size, not uniform pixels.
Occasionally scroll backwards (humans re-read).
Don’t scroll in perfect increments or at constant speeds.
Use the following Python code to try such scrolling behaviour:
import numpy as np

async def human_like_scroll(page, total_distance):
    """Scroll with human-like patterns (Playwright async API)"""
    scrolled = 0
    while scrolled < total_distance:
        # Vary chunk size
        chunk = np.random.randint(100, 400)
        await page.mouse.wheel(0, chunk)
        scrolled += chunk
        # Pause to simulate reading
        pause = np.random.lognormal(mean=1.2, sigma=0.8)
        await page.wait_for_timeout(pause * 1000)
        # Occasionally scroll backwards (humans re-read)
        if np.random.random() < 0.15:
            await page.mouse.wheel(0, -np.random.randint(50, 150))
            await page.wait_for_timeout(np.random.uniform(500, 1500))
Step 11: Maintain Persistent Session State
Stateless scrapers look like stateless bots. Real browsers, instead, accumulate state over time because:
Cookies persist across requests and sessions.
LocalStorage accumulates tracking data over time.
Session IDs remain stable across days or weeks.
To mimic real browser states, you can use the following Python code:
import pickle
import requests

# Save cookies to disk after each session
session = requests.Session()
# ... perform scraping ...
with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)

# Before the next scraping session
with open('cookies.pkl', 'rb') as f:
    session.cookies.update(pickle.load(f))
In case you use a browser automation tool:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    # Save browser storage state
    context = browser.new_context()
    # ... perform scraping ...
    context.storage_state(path='state.json')
    # Reload in the next session
    context = browser.new_context(storage_state='state.json')
As a final note, consider keeping sessions alive for weeks to allow third-party tracking cookies to build up. Long-lived sessions with accumulated tracking data appear more legitimate than constantly refreshed clean states.
Conclusion
In this article, you learned that preventing your scraper from leaking detection signals requires several defensive measures, as no single technique makes you invisible. Anti-bot systems analyze multiple signals simultaneously, and any inconsistency across layers triggers detection and blocks your scrapers.
Also, detection methods evolve. So, what works today might fail tomorrow. This means you should also monitor the defenses you implemented and test new ones.
Now, let us know: How do you prevent data leaks in your scrapers? Did we miss some technique?