<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Web Scraping Club]]></title><description><![CDATA[News, solutions and interviews about web scraping.
In this substack you will find weekly content about:
- Web Scraping techniques
- Interviews with key people in the industry
- Anti bot infos and counter measures
- Real world examples and code]]></description><link>https://substack.thewebscraping.club</link><image><url>https://substackcdn.com/image/fetch/$s_!gJt2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1e343ec9-7946-4440-8c00-57209a1d99a1_1024x1024.png</url><title>The Web Scraping Club</title><link>https://substack.thewebscraping.club</link></image><generator>Substack</generator><lastBuildDate>Wed, 15 Apr 2026 06:56:06 GMT</lastBuildDate><atom:link href="https://substack.thewebscraping.club/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Web Scraping Club SRL]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[pier@thewebscraping.club]]></webMaster><itunes:owner><itunes:email><![CDATA[pier@thewebscraping.club]]></itunes:email><itunes:name><![CDATA[Pierluigi Vinciguerra]]></itunes:name></itunes:owner><itunes:author><![CDATA[Pierluigi Vinciguerra]]></itunes:author><googleplay:owner><![CDATA[pier@thewebscraping.club]]></googleplay:owner><googleplay:email><![CDATA[pier@thewebscraping.club]]></googleplay:email><googleplay:author><![CDATA[Pierluigi Vinciguerra]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Stealth Stack: A Guide to Preventing Data Leaks in Web Scraping Infrastructure]]></title><description><![CDATA[A four-layer defense strategy for making your web scraping infrastructure indistinguishable from real users]]></description><link>https://substack.thewebscraping.club/p/the-stealth-stack-web-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/the-stealth-stack-web-scraping</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 12 Apr 2026 03:00:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ef273b12-ade2-4ba6-a14a-701876041775_1456x1048.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>When you hear about &#8220;data leaks&#8221;, I&#8217;m sure you think about cybersecurity, databases, and personal information exposed through malicious intent. But what if I told you your web scraper is leaking data? In the context of web scraping, no one is stealing your data. Rather, your scraper is revealing its automated nature through a set of signals. </p><p>In particular, your scrapers leak information at four distinct layers. Modern anti-bot systems fingerprint your browser, analyze your TLS handshake, trace your network infrastructure, and track your behavioral patterns. A single inconsistency across these layers can be enough to get you blocked.</p><p>This means your scrapers aren&#8217;t competing only against rate limits anymore. Today, they are competing against <a href="https://substack.thewebscraping.club/p/machine-learning-for-detecting-bots">machine learning models trained on billions of legitimate requests</a>, and any deviation from the expected pattern is a signal. 
So, if you want to scrape at scale, your infrastructure must be indistinguishable from a real user&#8217;s browser, network stack, and behavior.</p><p>This article guides you through a systematic approach: First, understanding where leaks occur, then learning how anti-bot systems detect them, and finally building a layered defense that makes your scraper invisible.</p><p></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" 
height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>Identifying the Leaks: Where Your Scraper Exposes Itself</strong></h2><p>Before fixing anything, you need to understand the complete attack surface. 
Modern anti-bot systems analyze your scraper at four distinct layers, and a leak at any layer can expose you.</p><h3><strong>Layer 1: The Browser Level</strong></h3><p>Headless browsers are loud by default. Launch a <a href="https://pptr.dev/">Puppeteer</a> instance and check the  <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver">navigator.webdriver</a> </em>flag. It surely returns <em>true</em>, and that&#8217;s a signal every major anti-bot system checks in the first 100ms of page load.</p><p>But this obvious flag is just the beginning. Anti-bot systems probe deeper:</p><ul><li><p><strong>Error messages and stack traces</strong>: They differ between headless and headed modes. The execution context leaves fingerprints in error objects.</p></li><li><p><strong>Window dimensions</strong>: Properties like <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/outerWidth#:~:text=outerWidth%20read%2Donly%20property%20returns,and%20window%20resizing%20borders%2Fhandles.">window.outerWidth</a></em> and <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/outerHeight">window.outerHeight</a></em> reveal a headless operation because headless mode doesn&#8217;t render a visible window frame.</p></li><li><p><strong>Canvas rendering</strong>: They can produce pixel-level differences. Software rendering (headless) creates different anti-aliasing and color values than GPU-accelerated rendering (headed). Color channels can differ by 1-2 units per pixel.</p></li><li><p><strong><a href="https://developer.mozilla.org/en-US/docs/Web/API/WebGLShader">WebGL shader</a> timing</strong>: This can vary a lot, depending on the underlying technology. GPU-accelerated browsers complete WebGL operations in microseconds. Software-rendered headless browsers take milliseconds.</p></li><li><p><strong>Font rendering</strong>: Headless environments often lack the full system font stack. 
This creates detectable layout differences when JavaScript measures text dimensions.</p></li><li><p><strong>Performance benchmarks</strong>: Stress tests can reveal software rendering. Some websites run JavaScript benchmarks that create thousands of DOM elements, calculate layouts, and trigger reflows. Real browsers with GPU acceleration show consistent performance, while headless browsers show different timing patterns.</p></li><li><p><strong>The </strong><em><strong><a href="https://developer.chrome.com/docs/extensions/reference/api/windows">window.chrome</a></strong></em><strong> object behaves differently</strong>: Real Chrome populates this object with specific properties for extension management and runtime APIs. Headless Chrome, instead, either lacks this object or provides an incomplete implementation.</p><p></p></li></ul><h3><strong>Layer 2: The Network Level</strong></h3><p>Your SSL/TLS handshake identifies you before you send any application data. When your scraper connects over HTTPS, it sends a TLS Client Hello message containing supported encryption methods (cipher suites), protocol versions, and extensions, all in a specific order.</p><p>Here&#8217;s what makes this dangerous:</p><ul><li><p><strong>Every browser and HTTP library has a unique TLS pattern:</strong> Real browsers send their TLS parameters in a specific sequence that matches their version and underlying platform. Python&#8217;s standard HTTP libraries send a completely different pattern. So do Node.js, Go, and the HTTP stack of any other language you use to write your scrapers.</p></li><li><p><strong>Anti-bot systems fingerprint your TLS handshake:</strong> They capture these patterns and convert them into a fingerprint, commonly called a <a href="https://github.com/salesforce/ja3">JA3 hash</a>. 
They maintain databases of known fingerprints for every major browser and HTTP library.</p></li><li><p><strong>Mismatches between User-Agent and TLS fingerprint are instant red flags:</strong> When you claim to be Chrome in your User-Agent header but your TLS handshake matches Python&#8217;s urllib library, that inconsistency triggers blocking.</p></li><li><p><strong>Detection happens before you send any application data:</strong> The TLS handshake on your very first connection already identifies you as automated traffic.</p></li><li><p><strong>HTTP/2 fingerprinting adds another layer:</strong> Beyond TLS, the order and priority of HTTP/2 frames, settings, and window updates create additional fingerprints. Your HTTP library&#8217;s frame ordering must match your claimed browser identity.</p></li></ul><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo </strong>with high reputation IPs<strong>,</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3><strong>Layer 3: The Infrastructure Level</strong></h3><p>Your proxy configuration can expose your real infrastructure through network-level leaks via the following main mechanisms:</p><ul><li><p><strong>DNS leaks:</strong> They happen when your browser resolves domain names using your local DNS server instead of routing through the proxy. Your scraper might send requests through a Miami residential proxy, but if DNS queries go through your AWS datacenter in Virginia, the target site knows your real location.</p></li><li><p><strong>WebRTC leaks:</strong> <a href="https://webrtc.org/">WebRTC </a>is a browser API designed for peer-to-peer communication. 
Even with a proxy configured, WebRTC will attempt to discover your real local IP and public IP through STUN servers, completely bypassing your proxy.</p></li><li><p><strong>IP reputation:</strong> Not all IPs are created equal. Cloudflare and similar services maintain databases of every AWS, Google Cloud, and Azure IP range. Requests from known cloud providers receive instant higher suspicion scores before any other analysis happens.</p></li></ul><h3><strong>Layer 4: The Behavioral Level</strong></h3><p>Even if your browser, network, and infrastructure are perfectly disguised, your behavior patterns can still expose you:</p><ul><li><p><strong>Timing patterns:</strong> Requesting data at fixed and precise intervals creates a perfect periodicity. No human browses with mathematical precision.</p></li><li><p><strong>Mouse and scroll behavior:</strong> Real humans accelerate and decelerate smoothly. Instant jumps from point A to point B are mechanically impossible.</p></li><li><p><strong>Session state:</strong> Stateless scrapers that never accumulate cookies or maintain persistent sessions across days look like fresh bots on every run.</p></li><li><p><strong>Interaction sequences:</strong> The time between page load and first click, between mouse-over and click, or the pattern of how you scroll through content. They all follow detectable human patterns.</p></li></ul><h2><strong>Understanding the Detection: How Anti-Bot Systems Catch You</strong></h2><p>Now that you know where leaks occur, let&#8217;s understand how anti-bot systems actually detect them.</p><h3><strong>Fingerprint Consistency Checks</strong></h3><p>Anti-bot systems cross-reference your claimed identity with actual behavior. 
If your <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent">User-Agent</a> says &#8220;Chrome 120 on Windows 10,&#8221; they verify that your JavaScript features, WebGL capabilities, canvas rendering, and TLS handshake all match Chrome 120 on Windows 10.</p><p>A single mismatch anywhere flags the entire request. You can&#8217;t be Chrome in your User-Agent, Firefox in your TLS handshake, and headless Chrome in your canvas fingerprint. <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">Anti-bot systems create composite fingerprints combining dozens of properties</a>, then compare them against databases of known legitimate and bot patterns.</p><h3><strong>Machine Learning Pattern Recognition</strong></h3><p>Modern anti-bot systems use ML models trained on billions of requests. They learn what &#8220;normal&#8221; looks like for each type of visitor. For example, consumer browsers on residential IPs behave differently from datacenter scrapers.</p><p>For ML models, statistical anomalies trigger investigation. Perfect timing intervals, impossible mouse movements, or timing patterns that don&#8217;t match human variance distributions are scored as anomalous. These models adapt continuously, so when new stealth techniques emerge, the models retrain on that data. This means that what works today might fail tomorrow.</p><h3><strong>Progressive Trust Scoring</strong></h3><p>Anti-bot systems don&#8217;t just block or allow requests. They also score them. Lower trust scores receive degraded service: slower response times, rate limits, or <a href="https://substack.thewebscraping.club/p/are-captchas-still-a-thing">CAPTCHA challenges</a> before blocking.</p><p>Also, scores accumulate across sessions. If you leak information across multiple visits, the system builds a profile associating your various identities. 
In other words, one leak can poison future requests, and even fixing the leak might not restore trust if your IP or fingerprint is already marked.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2><strong>Building the Defense: A Layered Approach to Stealth</strong></h2><p>Building a defense from data leaks in web scraping requires addressing each layer systematically. Your stealth stack must work from the inside out: browser &#8594; network &#8594; infrastructure &#8594; behavior. Each layer must remain consistent with your claimed identity.</p><h3><strong>Defense Layer 1: Hardening the Browser</strong></h3><p>The goal at this layer is to make the browser fingerprint indistinguishable from a real user&#8217;s browser and ensure every property is consistent with your claimed identity.</p><p><strong>Step 1: Mask Automation Signals</strong></p><p>Start with stealth libraries that patch the most common detection vectors:</p><ul><li><p><strong>For Puppeteer:</strong> Use <em><a href="https://www.npmjs.com/package/puppeteer-extra-plugin-stealth">puppeteer-extra-plu</a>gin-stealth</em> to automatically override <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver">navigator.webdriver</a></em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver">,</a> DevTools Protocol signatures, and plugin arrays.</p></li><li><p><strong>For <a href="https://www.selenium.dev/">Selenium</a>:</strong> Use <em><a href="https://pypi.org/project/undetected-chromedriver/">undetected-chromedriver</a>,</em> which patches automation 
signals in the ChromeDriver binary and drives a regular Chrome installation.</p></li><li><p><strong>For Playwright:</strong> Playwright has no built-in stealth mode, so combine launch flags with a community stealth plugin to cover the most common detection vectors.</p></li></ul><p>Additionally, disable automation flags at launch. For example, in Playwright:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
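            '--disable-blink-features=AutomationControlled',  # stops navigator.webdriver reporting true
            '--disable-dev-shm-usage',  # container stability, not a stealth flag
            '--no-sandbox'  # only for containers that need it; it disables Chromium's sandbox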
        ]
    )</code></code></pre><p>But remember: Stealth libraries handle the most common 20-30 leak vectors but miss advanced fingerprinting techniques. They&#8217;re your foundation, not your complete solution.</p><p><strong>Step 2: Spoof Hardware Signatures</strong></p><p>Cloud server canvas and WebGL fingerprints are obvious red flags. AWS, GCP, and Azure rendering signatures are well-known to anti-bot systems.</p><p>You have two approaches for your defense here:</p><ul><li><p><strong>Add consistent noise:</strong> Inject deterministic noise into canvas operations so the fingerprint remains stable across sessions but doesn&#8217;t match your server&#8217;s real hardware. Override canvas methods to modify pixel data slightly before it&#8217;s read back. Keep noise minimal: just enough to mask the real hardware signature without appearing obviously manipulated.</p></li><li><p><strong>Emulate common consumer hardware:</strong> Spoof WebGL parameters to mimic common consumer GPUs. Override vendor and renderer strings returned by WebGL APIs to match your chosen hardware profile. Use existing libraries designed for canvas fingerprint defense or implement your own parameter overrides.</p></li></ul><p><strong>Step 3: Ensure Version Consistency</strong></p><p>This is where most scrapers fail, even with stealth libraries. Your User-Agent string must match your actual browser engine behavior precisely. Consider the following rules of thumb:</p><ul><li><p><strong>Use real browser binaries instead of spoofing:</strong> Tools like Playwright can launch actual Chrome, ensuring perfect consistency between claimed version and actual behavior.</p></li><li><p><strong>If you must spoof, maintain complete version profiles:</strong> Track which JavaScript features, WebGL capabilities, and API behaviors correspond to each browser version. 
Every property must align.</p></li><li><p><strong>Never mix components from different versions:</strong> If you claim Chrome 120 on Windows 10, every single API, from JavaScript features to WebGL renderers, must behave exactly like Chrome 120 on Windows 10.</p></li></ul><h3><strong>Defense Layer 2: Hardening the Network Stack</strong></h3><p>Your goal at this layer is to make your TLS handshake and HTTP traffic indistinguishable from the browser you&#8217;re claiming to be.</p><p><strong>Step 4: Match TLS Fingerprints to Your Browser Identity</strong></p><p>Standard HTTP libraries can&#8217;t mimic browser TLS fingerprints because they use different SSL/TLS implementations. The solution requires specialized libraries that replicate browser behavior at the protocol level:</p><ul><li><p><strong>For Python:</strong> Use <em><a href="https://curl-cffi.readthedocs.io/en/latest/">curl_cffi</a></em> or similar wrappers. These libraries use <em><a href="https://curl.se/libcurl/">libcurl</a></em> compiled with <em><a href="https://github.com/google/boringssl">BoringSSL</a></em>, which is the same SSL library Chrome uses. This creates identical JA3 fingerprints to real browsers.</p></li><li><p><strong>For Node.js:</strong> Use <em><a href="https://www.npmjs.com/package/cycletls">cycletls</a></em> or equivalent libraries that allow you to specify exact JA3 fingerprint strings matching real browsers.</p></li></ul><p><strong>Critical requirement:</strong> Your TLS fingerprint must match your User-Agent. Chrome 120&#8217;s JA3 fingerprint is different from Firefox 115&#8217;s fingerprint. The browser identity must be consistent across all layers.</p><p><strong>Step 5: Match HTTP/2 Fingerprints</strong></p><p>Beyond TLS, HTTP/2 frame ordering creates additional fingerprints. 
Libraries like <em>curl_cffi</em> handle this automatically when you specify a browser to impersonate, but verify that:</p><ul><li><p>Settings frames match your target browser.</p></li><li><p>Window update sequences align.</p></li><li><p>Priority headers follow the correct pattern.</p></li></ul><p>In Python, you can do so with the following code:</p><pre><code><code>from curl_cffi import requests

# Ask a fingerprint echo service what it saw for this connection
response = requests.get(
    'https://tls.peet.ws/api/all',
    impersonate='chrome120'
)
# Inspect the HTTP/2 frames (SETTINGS, WINDOW_UPDATE, HEADERS) that were sent
print(response.json()['http2']['sent_frames'])
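
# A possible follow-up, sketched under assumptions: compare the report above
# with one you captured once from a real browser. The key names used here
# ('user_agent', 'http2', 'akamai_fingerprint') are assumed from the echo
# service's JSON layout and may change. pick() and fingerprint_mismatches()
# are hypothetical helpers, not part of curl_cffi. The raw JA3 hash is left
# out on purpose: recent Chrome versions shuffle TLS extension order, so it
# can differ between connections.
FIELDS = {
    'user_agent': ('user_agent',),
    'http2_fingerprint': ('http2', 'akamai_fingerprint'),
}


def pick(report, path):
    """Walk a nested dict along path, returning None if any key is missing."""
    node = report
    for key in path:
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node


def fingerprint_mismatches(scraper_report, browser_report, fields=FIELDS):
    """Return the sorted field names where the two reports disagree."""
    return sorted(
        name for name, path in fields.items()
        if pick(scraper_report, path) != pick(browser_report, path)
    )

# Usage: fingerprint_mismatches(response.json(), saved_browser_report)
# An empty list means the compared fields match the real-browser capture.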
</code></code></pre><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3><strong>Defense Layer 3: Hardening Infrastructure</strong></h3><p>Your goal at this layer is to ensure your network traffic originates from legitimate-looking IPs and doesn&#8217;t leak your real location or identity.</p><p><strong>Step 6: Choose the Right Proxy Type</strong></p><p>IP reputation is the first filter that anti-bot systems check. This means that your <a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies">proxy choice determines your baseline trust score</a>. Consider the following guidelines:</p><ul><li><p><strong>Datacenter IPs = instant red flag:</strong> Requests from AWS, Google Cloud, and Azure IP ranges start with a higher suspicion score. </p></li><li><p><strong>Residential proxies = high legitimacy:</strong> These IPs come from real ISP connections, so they look like consumer connections because they are consumer connections.</p></li><li><p><strong>Mobile proxies = premium legitimacy</strong>: These IPs originate from cellular networks (4G/5G) and receive the highest trust scores. Mobile IPs rotate naturally as devices move between cell towers, making them appear even more organic than static residential connections.</p></li></ul><p><strong>Step 7: Prevent DNS Leaks</strong></p><p>Force all DNS resolution through your proxy tunnel. 
For SOCKS5 proxies, use the SOCKS5h protocol variant, which forces DNS resolution on the remote proxy server instead of locally.</p><p>For example, in Python, write the following:</p><pre><code><code>import requests

proxies = {
    'http': 'socks5h://proxy.example.com:1080',
    'https': 'socks5h://proxy.example.com:1080'
}

# SOCKS proxies need the optional extra: pip install "requests[socks]"
response = requests.get('https://example.com', proxies=proxies)
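
# A small guard, sketched as an assumption about how you store proxy settings:
# flag proxy URLs whose scheme would resolve hostnames on the local machine.
# With requests, 'socks5' and 'socks4' resolve DNS locally, while 'socks5h',
# 'socks4a' and plain HTTP(S) proxies hand the hostname to the proxy.
# find_local_dns_proxies() is a hypothetical helper name.
from urllib.parse import urlsplit

REMOTE_DNS_SCHEMES = {'socks5h', 'socks4a', 'http', 'https'}


def find_local_dns_proxies(proxy_map):
    """Return the proxy URLs in a requests-style mapping that would leak DNS."""
    return [
        url for url in proxy_map.values()
        if urlsplit(url).scheme.lower() not in REMOTE_DNS_SCHEMES
    ]

# Usage: call find_local_dns_proxies(proxies) before sending traffic and
# refuse to start if the returned list is not empty.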
</code></code></pre><p>For browser automation, configure DNS-over-HTTPS to prevent local DNS leakage. The following is an example that applies to Playwright:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        args=[
            '--dns-over-https-server=https://cloudflare-dns.com/dns-query'
        ]
    )
</code></code></pre><p><strong>Step 8: Disable WebRTC Completely</strong></p><p>WebRTC will expose your real IP unless you completely disable it in browser automation. For example, in Playwright, you can do so as follows:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    
    # Remove WebRTC entirely
    page.add_init_script("""
        delete window.RTCPeerConnection;
        delete window.RTCSessionDescription;
        delete window.RTCIceCandidate;
        delete navigator.mediaDevices;
    """)
</code></code></pre><p>Once you&#8217;ve done this, verify that WebRTC is actually disabled before deploying your scraper. Visit <a href="http://browserleaks.com/webrtc">browserleaks.com/webrtc</a> with your scraper. You should see &#8220;WebRTC is not supported by your browser&#8221;, or only your proxy IP should be visible, never your real IP.</p><h3><strong>Defense Layer 4: Mimicking Human Behavior</strong></h3><p>Your goal at this layer is to make your interaction patterns indistinguishable from those of real human users.</p><p><strong>Step 9: Add Timing Jitter and Randomization</strong></p><p>Humans are inconsistent. Perfect patterns are robotic. The solution here is not just to add randomness: you also need to match the statistical distribution of real human behavior. To do so, consider the following example in Python:</p><pre><code><code>import numpy as np
import random
import time

# Wrong example (do not use this)

# Fixed interval
time.sleep(5)  # Always 5 seconds - DETECTABLE

# Random uniform
time.sleep(random.uniform(3, 7))  # Still doesn't match human patterns

# ------------

# Correct example (use this!)

# Log-normal distribution (matches real human reaction times)
delay = np.random.lognormal(mean=1.5, sigma=0.5)
time.sleep(delay)
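
# Hedged extension (illustrative, not from the original): sample a delay per
# action type from a log-normal distribution and clamp it to a plausible
# range. `human_delay` is a hypothetical helper; the medians, sigmas, and
# ranges are assumptions to tune against real traffic.
import math
import random

ACTION_PROFILES = {
    # action: (median_seconds, sigma, min_seconds, max_seconds)
    "click": (0.7, 0.5, 0.3, 2.0),
    "scroll": (2.5, 0.6, 1.0, 8.0),
    "read": (12.0, 0.7, 5.0, 45.0),
}

def human_delay(action, rng=None):
    """Return a clamped log-normal delay in seconds for the given action."""
    rng = rng or random.Random()
    median, sigma, low, high = ACTION_PROFILES[action]
    # The median of a log-normal is exp(mu), so mu = ln(median)
    sample = rng.lognormvariate(math.log(median), sigma)
    return min(max(sample, low), high)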
</code></code></pre><p>To improve randomization, model different action types with appropriate distributions. Use the following rules of thumb:</p><ul><li><p>Clicks: 0.3-2 seconds (short delays)</p></li><li><p>Reading: 5-45 seconds (high variance)</p></li><li><p>Scrolling: 1-8 seconds (irregular intervals)</p></li></ul><p><strong>Step 10: Implement Realistic Mouse and Scroll Behavior</strong></p><p>High-security sites like banking, ticketing, and heavily protected e-commerce websites track interaction patterns in real time. To avoid giving your automation away on such websites, you have to simulate realistic mouse movements and scrolling in your automated scripts.</p><p>For mouse movements, you can:</p><ul><li><p>Use Bezier curves to create natural arcing movements between points.</p></li><li><p>Add slight randomness to destination coordinates.</p></li><li><p>Include hover delays before clicking.</p></li><li><p>Vary the number of intermediate steps based on distance.</p></li></ul><p>The following is an example you can try in Python:</p><pre><code><code>import numpy as np
from playwright.async_api import async_playwright  # the helpers below are coroutines, so use the async API

def bezier_curve(start, end, control_points, num_steps=20):
    """Generate points along a Bezier curve for natural mouse movement"""
    t = np.linspace(0, 1, num_steps)
    points = []
    
    # Simplified cubic Bezier
    for t_val in t:
        x = ((1-t_val)**3 * start[0] +
             3*(1-t_val)**2*t_val * control_points[0][0] +
             3*(1-t_val)*t_val**2 * control_points[1][0] +
             t_val**3 * end[0])
        y = ((1-t_val)**3 * start[1] +
             3*(1-t_val)**2*t_val * control_points[0][1] +
             3*(1-t_val)*t_val**2 * control_points[1][1] +
             t_val**3 * end[1])
        points.append((x, y))
    
    return points

async def human_like_click(page, selector, start=(0, 0)):
    # `start` is the last known cursor position. Playwright does not expose
    # the current pointer location, so the caller has to keep track of it.
    element = await page.query_selector(selector)
    box = await element.bounding_box()
    
    # Add slight randomness to destination
    target_x = box['x'] + box['width']/2 + np.random.normal(0, 2)
    target_y = box['y'] + box['height']/2 + np.random.normal(0, 2)
    
    # Move mouse along curve
    current_pos = {'x': start[0], 'y': start[1]}
    control_points = [
        (current_pos['x'] + np.random.uniform(-50, 50), 
         current_pos['y'] + np.random.uniform(-50, 50)),
        (target_x + np.random.uniform(-20, 20), 
         target_y + np.random.uniform(-20, 20))
    ]
    
    points = bezier_curve(
        (current_pos['x'], current_pos['y']), 
        (target_x, target_y), 
        control_points
    )
    
    for x, y in points:
        await page.mouse.move(x, y)
        await page.wait_for_timeout(np.random.uniform(5, 15))
    
    # Hover briefly before clicking
    await page.wait_for_timeout(np.random.uniform(100, 300))
    await page.mouse.click(target_x, target_y)
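
# Hedged addition (not in the original): the advice to vary the number of
# intermediate steps with distance can be sketched as a helper whose result
# is passed as num_steps to bezier_curve(). `steps_for_distance` is a
# hypothetical name and the constants are illustrative assumptions.
import math

def steps_for_distance(start, end, pixels_per_step=25, min_steps=8, max_steps=60):
    """Scale the step count with the straight-line distance, within bounds."""
    distance = math.hypot(end[0] - start[0], end[1] - start[1])
    return int(min(max(distance / pixels_per_step, min_steps), max_steps))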
</code></code></pre><p>For scrolling, you can:</p><ul><li><p>Pause between scroll actions for variable amounts of time (simulating reading).</p></li><li><p>Scroll in chunks of varying size, not uniform pixels.</p></li><li><p>Occasionally scroll backwards (humans re-read).</p></li><li><p>Avoid perfect increments and constant speeds.</p></li></ul><p>Use the following Python code (it reuses the numpy import from the previous example) to try such scrolling behavior:</p><pre><code><code>async def human_like_scroll(page, total_distance):
    """Scroll with human-like patterns"""
    scrolled = 0
    
    while scrolled &lt; total_distance:
        # Vary chunk size
        chunk = np.random.randint(100, 400)
        
        await page.mouse.wheel(0, chunk)
        scrolled += chunk
        
        # Pause to simulate reading
        pause = np.random.lognormal(mean=1.2, sigma=0.8)
        await page.wait_for_timeout(pause * 1000)
        
        # Occasionally scroll backwards (humans re-read)
        if np.random.random() &lt; 0.15:
            back = np.random.randint(50, 150)
            await page.mouse.wheel(0, -back)
            scrolled -= back  # keep the running total accurate
            await page.wait_for_timeout(np.random.uniform(500, 1500))
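
# Hedged sketch (an addition, not the author's code): plan the wheel deltas
# up front so the pattern can be inspected or unit-tested without a browser.
# `plan_scroll` and its defaults are hypothetical. Positive deltas scroll
# down, negative deltas scroll back up, and the deltas sum to total_distance.
import random

def plan_scroll(total_distance, back_probability=0.15, rng=None):
    rng = rng or random.Random()
    deltas, position = [], 0
    while position != total_distance:
        # Never overshoot the requested distance
        chunk = min(rng.randint(100, 400), total_distance - position)
        deltas.append(chunk)
        position += chunk
        go_back = rng.choices([True, False], [back_probability, 1 - back_probability])[0]
        if go_back and position != total_distance:
            back = min(rng.randint(50, 150), position)
            deltas.append(-back)
            position -= back
    return deltas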
</code></code></pre><p><strong>Step 11: Maintain Persistent Session State</strong></p><p>Stateless scrapers look like bots. Real browsers, instead, accumulate state over time:</p><ul><li><p>Cookies persist across requests and sessions.</p></li><li><p>LocalStorage accumulates tracking data over time.</p></li><li><p>Session IDs remain stable across days or weeks.</p></li></ul><p>To mimic real browser states, you can use the following Python code:</p><pre><code><code>import pickle
import requests

# Save cookies to disk after each session
session = requests.Session()

# ... perform scraping ...

with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)

# Before next scraping session
with open('cookies.pkl', 'rb') as f:
    session.cookies.update(pickle.load(f))
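
# Hedged addition (not in the original): on the very first run cookies.pkl
# does not exist yet, so the load step above would raise FileNotFoundError.
# A minimal guard, with `load_cookies_if_present` as a hypothetical helper:
import os
import pickle

def load_cookies_if_present(jar, path="cookies.pkl"):
    """Update jar from path and return True, or return False if no file."""
    if not os.path.exists(path):
        return False
    # Only unpickle files your own scraper wrote: pickle can run code on load
    with open(path, "rb") as f:
        jar.update(pickle.load(f))
    return True

# Usage: load_cookies_if_present(session.cookies)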
</code></code></pre><p>In case you use a browser automation tool:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    
    # Save browser storage state
    context = browser.new_context()
    # ... perform scraping ...
    context.storage_state(path='state.json')
    
    # Reload in next session
    context = browser.new_context(storage_state='state.json')
</code></code></pre><p>As a final note, consider keeping sessions alive for weeks to allow third-party tracking cookies to build up. Long-lived sessions with accumulated tracking data appear more legitimate than constantly refreshed clean states.</p><div><hr></div><h2><strong>Conclusion</strong></h2><p>In this article, you learned that, if you don&#8217;t want your data to be leaked while scraping, you have to take several defensive measures, as no single technique makes you invisible. Anti-bot systems analyze multiple signals simultaneously, and any inconsistency across layers triggers detection and blocks your scrapers.</p><p>Also, detection methods evolve. So, what works today might fail tomorrow. This means you should also monitor the defenses you implemented and test new ones.</p><p>Now, let us know: How do you prevent data leaks in your scrapers? Did we miss some technique?</p>]]></content:encoded></item><item><title><![CDATA[rayobrowse: A Hands-On Look at the Stealth Browser From Rayobyte]]></title><description><![CDATA[Looking for a Camoufox alternative? 
Here&#8217;s an interesting stealth browser worth checking out!]]></description><link>https://substack.thewebscraping.club/p/rayobrowse-browser-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/rayobrowse-browser-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 05 Apr 2026 03:00:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/442d19ad-ddc9-4b14-afda-71c81a91ffc4_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The open&#8209;source nature of Camoufox is what made the project so popular and appealing. Unfortunately, that same openness is also what allowed anti&#8209;bot giants to study it closely and eventually crack down on it.</p><p>Rayobyte, the proxy and web scraping solutions provider, has taken a different approach. They recently released <em>rayobrowse</em>, a closed&#8209;source yet Docker&#8209;based, self&#8209;hostable stealth browser built for local browser automation and web scraping.</p><p>In this post, I&#8217;ll take a deep look at this solution and walk you through everything you need to know about it. 
By the end, you&#8217;ll understand what rayobrowse is, how its stealth browser approach works, how to set it up, and whether it&#8217;s actually worth paying attention to.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>An Introduction to rayobrowse</h2><p>Let me introduce you to the world of rayobrowse, helping you understand what it is and what makes this project special.</p><h3>What is rayobrowse?</h3><p><a href="https://github.com/rayobyte-data/rayobrowse">rayobrowse</a> is a self-hosted, Chromium-based stealth browser engineered for web scraping, AI agents, and automation workflows. It&#8217;s available as a Docker image, with optional support via a Python SDK (<em><a href="https://pypi.org/project/rayobrowse">rayobrowse</a></em> on PyPI) for simplified connection. The project is developed and maintained by Rayobyte.</p><p>The stealth browser runs inside Docker and is available via the <a href="https://substack.thewebscraping.club/p/webdriver-vs-cdp-vs-bidi">Chrome DevTools Protocol (CDP)</a>. That means tools like Playwright, Puppeteer, and Selenium (or any other tool that speaks CDP) can natively connect to it for automation purposes.</p><p>What makes it noteworthy is its approach to <a href="https://substack.thewebscraping.club/p/what-is-device-fingerprint">device fingerprinting</a>. User agents, screen size, WebGL, fonts, timezone, and other signals are tuned so each session looks like a real browser. 
That way, it helps your automation avoid detection on protected websites.</p><h3>Core Principles Driving the Solution</h3><p>These are the core principles and goals behind the project:</p><ol><li><p>It should run on Linux server environments without GPUs or a GUI/desktop interface.</p></li><li><p>It should patch Chromium at the C++ level, rather than at higher layers like CDP, which are easier for anti-bot systems to detect.</p></li><li><p>It should work with Playwright, a common framework in <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browsing automation stacks</a>.</p></li><li><p>It should support both headful mode (via <a href="https://www.x.org/archive/X11R7.7/doc/man/man1/Xvfb.1.xhtml">Xvfb</a>) and headless mode.</p></li><li><p>It should emulate fingerprints from real-world devices across different regions.</p></li><li><p>It should be self-hostable, so you can run it locally without relying on cloud infrastructure.</p></li><li><p>It should be free to test and use for certain user segments.</p></li><li><p>It should reliably bypass major anti-bot systems and scraping targets, including complex ecommerce and SERP platforms.</p></li></ol><p><strong>Note</strong>: If you&#8217;re not familiar with Xvfb, that&#8217;s an in&#8209;memory display server for Unix-like systems that implements the X11 display protocol without requiring a physical display or input devices. In simpler terms, it allows GUI applications to run in headless environments. 
rayobrowse relies on it to launch headful browser sessions even on servers without a graphical interface (which is beneficial, as headful sessions are harder to detect than purely headless ones).</p><h2>Main Features for Stealth Browsing and More</h2><p>Here is a list of the most relevant rayobrowse features:</p><ul><li><p><strong>Fingerprint spoofing</strong>: Each browser session comes with a realistic device fingerprint drawn from a database of thousands of profiles. Signals include user agent, OS metadata, screen resolution, fonts, WebGL, hardware concurrency, and timezone.</p></li><li><p><strong>Human&#8209;like mouse movement</strong>: Optional human&#8209;style cursor behavior (inspired by <a href="https://github.com/riflosnake/HumanCursor">HumanCursor</a>) makes automation appear more natural. When using standard Playwright actions like <em>page.click()</em> or <em>page.mouse.move()</em>, the library applies realistic curves and timing.</p></li><li><p><strong>Proxy integration</strong>: Traffic can be routed through any HTTP proxy, including authenticated and rotating proxies.</p></li><li><p><strong>Headless and headful support</strong>: rayobrowse supports both execution modes, even on GUI-less Linux servers.</p></li><li><p><strong>Live session viewer</strong>: A built&#8209;in noVNC interface (available at http://localhost:6080) lets you watch browser sessions in real time directly from your web browser. 
This is particularly useful for debugging scraping flows and visually verifying fingerprint behavior.</p></li><li><p><strong>Official integrations</strong>:<strong> </strong>The browser integrates with common automation frameworks, namely Playwright, Puppeteer, Selenium, and Scrapy (via <em><a href="https://substack.thewebscraping.club/p/basic-scrapy-configuration">scrapy-playwright</a></em>), as well as emerging <a href="https://substack.thewebscraping.club/p/my-first-week-with-openclaw">AI&#8209;driven tools such as OpenClaw</a>. As of this writing, additional integrations (e.g., Firecrawl and LangChain) are planned.</p></li><li><p><strong>Remote/Cloud mode</strong>: rayobrowse can run as a <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#remote--cloud-mode-beta">remote browser service</a>. Your server requests new browser instances through a REST API, and workers connect directly to the returned CDP WebSocket endpoint. This is still a beta feature.</p></li><li><p><strong>API&#8209;driven browser management</strong>:<strong> </strong>The daemon exposes REST endpoints for creating, listing, and deleting browser sessions, allowing you to orchestrate multiple browsers across a distributed scraping infrastructure.</p></li></ul><h2>Technical Details About the Project</h2><p>Now that you know what the project is and the features it provides, you&#8217;re ready to dive into the technical aspects.</p><h3>How rayobrowse Works</h3><p>At a high level, rayobrowse follows these steps:</p><ol><li><p><strong>Chromium patching</strong>:<strong> </strong>The project tracks upstream Chromium releases and applies a focused set of patches (relying on an <a href="https://github.com/brave/brave-core/blob/master/tools/cr/plaster.py">approach similar to Brave&#8217;s &#8220;plaster&#8221; model</a>). 
These patches normalize exposed browser APIs, reduce fingerprint entropy leaks, improve automation compatibility, and preserve native Chromium behavior whenever possible.</p></li><li><p><strong>Fingerprint assignment</strong>: When a browser session starts, rayobrowse assigns a realistic device fingerprint.</p></li><li><p><strong>Automation integration</strong>: Browser automation libraries connect to rayobrowse through the native CDP.</p></li></ol><h3>Architecture</h3><p>Architecturally, rayobrowse follows a clean separation between the browser runtime and the automation code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vdVO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vdVO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 424w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 848w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png" width="1456" height="697" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;rayobrowse&#8217;s architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="rayobrowse&#8217;s architecture" title="rayobrowse&#8217;s architecture" srcset="https://substackcdn.com/image/fetch/$s_!vdVO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 424w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 848w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">rayobrowse&#8217;s architecture</figcaption></figure></div><p>In particular, the system runs as a Docker container that bundles three core components:</p><ol><li><p>A daemon server that manages browser sessions.</p></li><li><p>A browser manager that downloads and retrieves the correct version of Chromium, a fingerprint engine that injects realistic device profiles, and a stealth browser layer containing a custom Chromium build with stealth patches.</p></li><li><p>A <a href="https://github.com/novnc/noVNC">noVNC viewer</a>, which lets you watch browser sessions in real time. 
This is useful for debugging and demos.</p></li></ol><p>As you can see, the automation scripts don&#8217;t run inside the container. Instead, they run on the host machine and connect to the browser remotely through the Chrome DevTools Protocol.</p><p>When a new session starts, rayobrowse assigns a realistic fingerprint from a large database of actual devices, containing thousands of permutations collected from websites Rayobyte owns.</p><h3>Requirements</h3><p>The rayobrowse project is designed to run on Linux servers without GPUs (which is a common deployment environment).</p><p>These are the prerequisites:</p><ul><li><p>Docker, as the browser runs entirely inside a container.</p></li><li><p>~2GB of available RAM, as each browser instance uses ~300MB.</p></li></ul><p>The main benefit of this Docker-based approach is that you don&#8217;t need to install Chromium locally, configure fonts, or set up Xvfb manually. All of those dependencies live inside the container, which keeps the host machine clean and the setup portable and reproducible.</p><p>It also makes the project well-suited for self-hosted environments without exposing its internal Chromium patching logic, making it much harder for anti-bot solution providers to reverse engineer how it works.</p><p>In terms of compatibility, rayobrowse works on Linux, Windows (native or WSL2), and macOS. The supported architectures are <em>x86_64 (amd64)</em> and <em>ARM64</em> (Apple Silicon and AWS Graviton). 
Still, you don&#8217;t have to worry about the architecture, as Docker automatically pulls the correct image for the host machine.</p><p><strong>Optional</strong>: If you plan to use the stealth browser through the Python SDK, an additional requirement is Python 3.10+.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>How to Access rayobrowse</h2><p>There are two main ways you can access rayobrowse:</p><ol><li><p>The <em>/connect</em> endpoint.</p></li><li><p>The built-in Python SDK.</p></li></ol><h3>Method #1: Use the /connect Endpoint</h3><p>The first rayobrowse usage method involves connecting directly to the <em>/connect</em> endpoint. This allows any CDP&#8209;compatible tool (including Selenium, Playwright, and Puppeteer) to open a browser session simply by pointing to a WebSocket URL like <em>ws://localhost:9222/connect</em>.</p><p>For instance, take a look at the Playwright connection example below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Connect to rayobrowse via CDP
    browser = p.chromium.connect_over_cdp("ws://localhost:9222/connect")
    page = browser.new_context().new_page()

    # Automation logic...

    browser.close()</code></pre></div><p>Keep in mind that the WebSocket browser connection URL can be customized using query parameters, as follows:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ws://localhost:9222/connect?headless=false&amp;os=android&amp;proxy=http://user:pass@host:port</code></pre></div><p>This URL creates a rayobrowse Chromium browser session in headful mode, using Android-based fingerprints, while routing all requests through the proxy <em><a href="http://user:pass@host:port">http://user:pass@host:port</a></em>.</p><p>Explore all <em>/connect</em> query parameters <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#using-connect-simplest">in the docs</a>.</p><h3>Method #2: Use the Python SDK</h3><p>You can also interact with rayobrowse through the built-in Python SDK. This exposes a <em><a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#api-reference">create_browser()</a></em> function that returns a CDP WebSocket URL for a newly created browser instance. From there, connect using Playwright or another automation framework, as shown below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from rayobrowse import create_browser
from playwright.sync_api import sync_playwright

# Create a rayobrowse browser in headful mode with a Windows-based
# fingerprint, and get back its CDP WebSocket URL
ws_url = create_browser(headless=False, target_os="windows")

with sync_playwright() as p:
    # Connect to rayobrowse with the configured URL via CDP
    browser = p.chromium.connect_over_cdp(ws_url)
    page = browser.contexts[0].pages[0]
 
    # Automation logic...

    browser.close()</code></pre></div><p>This approach gives you more control over the browser lifecycle, but it also involves more configuration and setup.</p><p>For more examples (e.g., proxy integration and multi-browser management), <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#using-the-python-sdk">check out the docs</a>.</p><h2>Get Started with rayobrowse: Step-by-Step Guide</h2><p>In this guided section, I&#8217;ll show you how to build a simple Playwright script that connects to rayobrowse.</p><p>For the sake of simplicity, I&#8217;ll assume you already have:</p><ul><li><p>A Unix-like system (Linux, macOS, or Windows via WSL).</p></li><li><p>Docker installed and running on your machine.</p></li><li><p>Git installed locally.</p></li><li><p>A Python environment set up <a href="https://substack.thewebscraping.club/p/scraping-vs-playwright-web-scraping">with Playwright installed</a>.</p></li></ul><p>Follow the instructions below!</p><h3>Step #1: Clone the rayobrowse Repository</h3><p>The first step is to clone the rayobrowse repository to your machine:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">git clone https://github.com/rayobyte-data/rayobrowse</code></pre></div><p>Then, enter the project folder with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">cd rayobrowse</code></pre></div><p>The cloned folder already contains everything you need to get started, including:</p><ul><li><p><em>docker-compose.yml</em>: For running the browser container.</p></li><li><p><em>requirements.txt</em>: For installing the Python SDK.</p></li></ul><h3>Step #2: Set Up the Environment</h3><p>rayobrowse requires a 
.env file that contains the configuration needed to run the browser daemon. For a full list of available environment variables and what they enable, <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#environment-variables">explore the official documentation</a>.</p><p>Start by creating a <em>.env</em> file as a copy of the <em>.env.example</em> file coming with the repository:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">cp .env.example .env</code></pre></div><p>Then open the <em>.env</em> file and make sure it contains:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">STEALTH_BROWSER_ACCEPT_TERMS=true</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zjWr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zjWr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 848w, 
https://substackcdn.com/image/fetch/$s_!zjWr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png" width="1456" height="865" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Setting the STEALTH_BROWSER_ACCEPT_TERMS env&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Setting the STEALTH_BROWSER_ACCEPT_TERMS env" title="Setting the STEALTH_BROWSER_ACCEPT_TERMS env" srcset="https://substackcdn.com/image/fetch/$s_!zjWr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 848w, 
https://substackcdn.com/image/fetch/$s_!zjWr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Setting the STEALTH_BROWSER_ACCEPT_TERMS env</figcaption></figure></div><p>This confirms that you accept the 
project&#8217;s <a href="https://github.com/rayobyte-data/rayobrowse/blob/main/LICENSE">LICENSE</a>. Without that setting, the daemon will refuse to create browser sessions.</p><h3>Step #3: Start the Docker Container</h3><p>Launch the rayobrowse Docker container:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">docker compose up -d</code></pre></div><p>Docker will automatically pull the appropriate image for your system architecture (<em>x86_64</em> or <em>ARM64</em>). Then, it&#8217;ll start the container, as explained earlier.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FB1x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FB1x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 424w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 848w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png" width="1456" height="522" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output of the &#8220;docker compose up -d&#8221; command&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output of the &#8220;docker compose up -d&#8221; command" title="The output of the &#8220;docker compose up -d&#8221; command" srcset="https://substackcdn.com/image/fetch/$s_!FB1x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 424w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 848w, 
https://substackcdn.com/image/fetch/$s_!FB1x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1272w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The output of the &#8220;docker compose up -d&#8221; command</figcaption></figure></div><h3>Step #4: Connect via CDP 
and Apply the Automation Logic</h3><p>You can now connect to the running rayobrowse instance through the <em>/connect</em> endpoint using any CDP-compatible client. In this example, I&#8217;ll use Playwright with Python:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from playwright.sync_api import sync_playwright
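
# Optional, hypothetical helper (not part of the original script): a minimal sketch
# of scraping logic for the sample page. The CSS selectors are assumptions about the
# quotes.toscrape.com markup. Call extract_quotes(page) after page.goto() below.
def extract_quotes(page):
    quotes = []
    # Each quote on the page is assumed to live in its own div.quote container
    for quote in page.locator("div.quote").all():
        quotes.append({
            "text": quote.locator("span.text").inner_text(),
            "author": quote.locator("small.author").inner_text(),
        })
    return quotes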

with sync_playwright() as p:
    # Connect to the rayobrowse browser through the CDP WebSocket endpoint
    browser = p.chromium.connect_over_cdp(
        "ws://localhost:9222/connect?headless=false&amp;os=windows"
    )

    # Create a new browser context and page
    page = browser.new_context().new_page()

    # Navigate to the target (sample) page
    page.goto("https://quotes.toscrape.com/")

    # Print the page title to verify the session is working
    print(page.title())  # Output: "Quotes to Scrape"

    # Add your scraping logic here...

    # Close the browser session
    browser.close()</code></pre></div><p>At this point, write your scraping or automation logic, which will run inside the stealth Chromium browser provided by rayobrowse.</p><p>For debugging, you can watch the browser session live through noVNC at <em><a href="http://localhost:6080/vnc.html">http://localhost:6080/vnc.html</a></em>. While the script is running, you should see a headful Chromium session opening and navigating to the target page specified in the script:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v1V8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v1V8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png" width="1456" height="865" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Monitoring the target browser session at http://localhost:6080/vnc.html&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Monitoring the target browser session at http://localhost:6080/vnc.html" title="Monitoring the target browser session at http://localhost:6080/vnc.html" srcset="https://substackcdn.com/image/fetch/$s_!v1V8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Monitoring the target browser session at http://localhost:6080/vnc.html</figcaption></figure></div><p>As you can see, the server creates a headful Chromium session (due to the <em>headless=false</em> query parameter) and navigates it to the page requested by the script.</p><p><strong>Optional</strong>: If you want more control over the browser 
lifecycle, install the Python SDK with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">pip install -r requirements.txt</code></pre></div><p>Take a look at the <a href="https://github.com/rayobyte-data/rayobrowse/tree/main/examples">official examples in the repository</a> for more guidance.</p><div><hr></div><h3>Pricing and Limitations</h3><p>This is how the rayobrowse pricing model works:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bDvq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bDvq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 424w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 848w, 
https://substackcdn.com/image/fetch/$s_!bDvq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png" width="1456" height="1065" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1065,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202193,&quot;alt&quot;:&quot;rayobrowse&#8217;s pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/190103610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="rayobrowse&#8217;s pricing model" title="rayobrowse&#8217;s pricing model" srcset="https://substackcdn.com/image/fetch/$s_!bDvq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 424w, 
https://substackcdn.com/image/fetch/$s_!bDvq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 848w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">rayobrowse&#8217;s pricing model</figcaption></figure></div><p>What matters most to us developers is that rayobrowse can be run for free via self&#8209;hosting. In practice, the only real cost comes from proxies, which are necessary for scaling scraping workloads and avoiding IP bans (something that&#8217;s standard in most production scraping setups).</p><p>The main thing to keep in mind is that rayobrowse is still in beta. Rayobyte already uses it to scrape millions of pages per day, but results can vary depending on the target site and configuration.</p><p>Fingerprint coverage is currently strongest for Windows and Android, while macOS and Linux profiles are less mature. In addition, Canvas and WebGL fingerprinting are still evolving, which means some websites may detect the current implementation.</p><h2>Benchmarks and Final Comment</h2><p>To put rayobrowse to the test, I ran a simple script against a single page for each of the most popular anti&#8209;bot detection systems. 
These are the results I obtained:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lZAd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lZAd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 424w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 848w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1272w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png" width="1456" height="369" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:369,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80344,&quot;alt&quot;:&quot;Playwright vs rayobrowse: Benchmark comparison table&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/190103610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Playwright vs rayobrowse: Benchmark comparison table" title="Playwright vs rayobrowse: Benchmark comparison table" srcset="https://substackcdn.com/image/fetch/$s_!lZAd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 424w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 848w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1272w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Playwright vs rayobrowse: Benchmark comparison table</figcaption></figure></div><p><strong>Note:</strong> These tests were performed on my local machine using my ISP&#8217;s IP address.</p><p>As you can see, in this simple experiment rayobrowse achieved a 100% success rate, while Playwright failed consistently in headless mode and even struggled in some headful scenarios.</p><p>This suggests that the project is definitely worth keeping an eye on, especially thanks to its self&#8209;hosted nature.</p><p><em>To be honest, and this is just my personal opinion as an expert who works in this field, I don&#8217;t usually get very excited about 
projects like this. In my experience, many libraries of this type either get cracked down on or simply don&#8217;t receive the long&#8209;term support they deserve. In this case, however, things are a bit different. The project is closed&#8209;source and backed by a well&#8209;known company in the industry, which makes the expectations for its future understandably much higher!</em></p><p>Here, I covered what the project is about, what it offers, how it works, and how to use it. As always, remember to use rayobrowse only for legal and <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical web scraping</a>. Until next time!</p><div><hr></div><h2>FAQ</h2><h3>Why is rayobrowse based on Chromium and not Chrome?</h3><p>rayobrowse is based on Chromium simply because Chrome is closed-source. In addition, tests performed on difficult websites show no meaningful difference in detection rates between Chrome and Chromium. 
Using Chromium also avoids false positives and reflects the broader ecosystem of Chromium-based browsers like Brave, Edge, and Samsung Internet.</p><h3>Is rayobrowse open source?</h3><p>No. rayobrowse is closed-source, to make it harder for anti-bot companies to reverse-engineer it. Similar projects, like <a href="https://github.com/daijro/camoufox">Camoufox</a>, were quickly studied and countered once their code became public. Rayobyte decided to keep the project closed-source to help maintain its effectiveness and reliability over the long term.</p><h3>Can everyone use rayobrowse?</h3><p>No, not all companies can use rayobrowse. Its license prohibits organizations listed in <a href="https://cdn.sb.rayobyte.com/list-of-prohibited-companies.txt">Rayobyte&#8217;s restricted list</a> from using the software. For everyone else, the project is free to download and run locally.</p><h3>Does rayobrowse support proxy integration?</h3><p>Yes, rayobrowse fully supports proxy integration. You can route traffic through any HTTP proxy using the <em>proxy</em> query parameter on the <em>/connect</em> endpoint or via the <em>proxy</em> option exposed by the <em>create_browser()</em> function in the Python SDK. 
The proxy support includes authentication and rotating proxies.</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #101: Building an Internal Knowledge Base for Your Scraping Team]]></title><description><![CDATA[Every scraping team that survives long enough develops the same disease.]]></description><link>https://substack.thewebscraping.club/p/building-knowledge-base-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/building-knowledge-base-scraping</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 02 Apr 2026 19:17:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3dba6c6a-f027-4c60-ad27-2c2378c217c6_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every scraping team that survives long enough develops the same disease. Someone figures out how to bypass Cloudflare&#8217;s latest challenge, writes it up in Notion, and moves on. Three months later, a teammate runs into the same problem, spends two days reinventing the solution, and documents it in a Google Doc. Meanwhile, the original Notion page has become outdated because Cloudflare changed its challenge flow, and nobody updated it.</p><p>We have seen this pattern in every scraping operation we have worked with. The knowledge exists. It is just scattered across wikis, Slack threads, internal repos, and people&#8217;s heads. The real problem is not documentation; it is retrieval. People write things down. 
They just cannot find them when it matters.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>In <a href="https://substack.thewebscraping.club/p/ingest-web-data-rag-llm">THE LAB #77</a>, we explored the concept of RAG (Retrieval-Augmented Generation) applied to scraped data and showed how to build a basic knowledge assistant using FAISS. That was a proof of concept. This time we are going deeper. We are showing the production system we actually built and use daily, and we are explaining the reasoning behind each design choice: why markdown, how embeddings work, which chunking strategy actually performs better, and what role auto-tagging plays in retrieval.</p><p>After reading this article, we hope you will understand the mechanics well enough to build the same system for your team.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try 
Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>What we are building and why</h2><p>At TWSC, we have published around 300 articles over the past four years. Tutorials, reverse-engineering deep dives, tool comparisons, anti-bot analysis. When we sit down to write a new article, we need to remember what we have already covered, find previous work to link to, and check whether a technique we are about to describe was already explained in a past issue. Doing this by memory or by searching Substack&#8217;s archive stops working after the first hundred articles. </p><p>We also follow what the broader community publishes. Projects like <a href="https://crawl4ai.dev">Crawl4AI</a>, which appeared on Hacker News, show that the need to ingest web content into structured, LLM-ready knowledge bases is shared across the industry. The tools for crawling and extracting content keep getting better, but the retrieval side, finding the right piece of information in a growing archive, still requires a purpose-built system.</p><p>So we built one. Here is what the complete pipeline looks like:<br></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;631f6d4b-586d-4ef5-ba12-640a3cb186b0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Sources                                  Processing              Storage &amp; Retrieval
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;                                &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;              &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
Substack articles                   &#9472;&#9472;&#9488;
                                      &#9500;&#9472;&#9472;&gt; HTML-to-Markdown &#9472;&#9472;&gt; Frontmatter + Tagging &#9472;&#9472;&gt; Markdown files
Hacker News and other sources       &#9472;&#9472;&#9496;

Markdown files &#9472;&#9472;&gt; Chunker &#9472;&#9472;&gt; Embedder (e5-large-v2) &#9472;&#9472;&gt; PostgreSQL + pgvector

Search query &#9472;&#9472;&gt; Query embedding &#9472;&#9472;&gt; Cosine similarity search &#9472;&#9472;&gt; Ranked results</code></pre></div><p>Three stages, each independent and replaceable. You scrape content from your sources. You process and embed it. You search it. </p><p>If your team writes in Confluence instead of Substack, you swap the scraper. If you prefer Qdrant over pgvector, you swap the vector store. The architecture remains the same.<br><br>And here&#8217;s the hardware used for most of the steps, from embedding to the storage and retrieval: my DGX Spark.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yhsw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg" width="566" height="511.689557855127" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:961,&quot;width&quot;:1063,&quot;resizeWidth&quot;:566,&quot;bytes&quot;:181504,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192358785?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40edcbd4-e6c6-4172-bf4c-ee62da325b0f_1280x1707.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Yes, I know, it is probably overkill.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary 
button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The tools</h2><p><strong>Playwright</strong> handles browser-based scraping for our own Substack articles. Substack serves content dynamically and requires authentication for premium posts, so a plain HTTP client is not an option.</p><p><strong>Algolia API</strong> (via Hacker News) provides structured search over HN stories. No scraping needed: HN exposes its full search index through public endpoints.</p><p><strong><a href="https://scrapegraphai.com/">ScrapegraphAI</a> and <a href="https://www.firecrawl.dev/">Firecrawl</a></strong> convert external article URLs into clean markdown. ScrapegraphAI is the primary extractor, Firecrawl is the fallback.</p><p><strong>sentence-transformers</strong> with the <code>intfloat/e5-large-v2</code> model generates 1024-dimensional embeddings. We will explain why we chose this model later in the article.</p><p><strong>PostgreSQL with pgvector</strong> stores embeddings and handles similarity search. 
We chose it over dedicated vector databases because we already need PostgreSQL for metadata, and pgvector with HNSW indexing handles our scale without adding infrastructure.</p><p><strong>Docker Compose</strong> ties everything together as three containers: PostgreSQL, the API server, and the indexer.</p><p>You can find the code in our <a href="https://github.com/TheWebScrapingClub/thelab">GitHub repository</a>, reserved for paying users, inside the folder <strong><a href="https://github.com/TheWebScrapingClub/thelab">101.KNOWLEDGE_BASE</a></strong>.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Why markdown as the universal format</h2><p>The first design choice we had to make was what format our knowledge base would store. We had content from Substack (HTML), Hacker News links (various formats), and potentially Confluence, Google Docs, or Slack in the future. We needed a common representation.</p><p>We chose markdown for three reasons.</p><p><strong>First</strong>, markdown preserves document structure without carrying rendering noise. An HTML page contains navigation bars, ad slots, JavaScript, CSS classes, and layout dividers. None of that is content. When you convert to markdown, you keep headings, paragraphs, code blocks, links, and lists. Everything the embedding model needs, nothing it would choke on.</p><p><strong>Second</strong>, markdown is readable by humans and machines alike. When something goes wrong in the pipeline, you can open a markdown file and immediately see what the system is working with. 
Try doing that with a serialized HTML DOM or a JSON blob from an API response.</p><p><strong>Third</strong>, YAML frontmatter is a natural fit for markdown and gives us a structured metadata header without mixing it into the content. Each file gets fields such as <code>id</code>, <code>type</code>, <code>title</code>, <code>publish_date</code>, <code>topics</code>, and <code>visibility</code>. This metadata drives filtering at search time and never enters the embedding model. The separation is important: embeddings capture meaning, frontmatter captures facts.</p><p>There are two paths to get content into markdown. You can build your own converter using open-source libraries, or you can use commercial services that handle extraction and conversion for you. In this article, we deliberately show both approaches. For our own Substack articles, we built a converter from scratch with BeautifulSoup and markdownify. It costs nothing, we control every detail, and it works because we know the source HTML structure intimately. For external content discovered on Hacker News, we use commercial services like ScrapegraphAI and Firecrawl instead, because every URL leads to a different site with a different HTML structure. Building custom converters for thousands of unknown domains would be impractical. The trade-off is clear: when you control the source, build your own; when you are scraping the open web, commercial extraction services save an enormous amount of development time.</p><p>Our Substack HTML-to-markdown converter is deliberately simple. It strips scripts, styles, buttons, forms, navigation, and footers, then converts the remaining HTML:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;aa028f4f-e1d2-412f-88bc-29153974e70e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import re

from bs4 import BeautifulSoup
from markdownify import markdownify


def html_to_markdown(html: str) -&gt; str:
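    soup = BeautifulSoup(html, "lxml")
    # Remove non-content elements together with everything inside them
    for tag in soup.find_all(["script", "style", "button", "form", "nav", "footer"]):
        tag.decompose()

    md = markdownify(
        str(soup),
        heading_style="ATX",  # "#"-style headings
        bullets="-",
        # These tags were already removed above, so this is only a safeguard
        strip=["script", "style", "button", "form", "nav"],
    )
    # Collapse runs of four or more newlines down to three
    md = re.sub(r"\n{4,}", "\n\n\n", md)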
    return md.strip()</code></pre></div><p>The final output for each document looks like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;95188632-dd66-4b1e-a5fe-167c1807dcdc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">---
id: a1b2c3d4e5f6...
type: twsc_article
title: "THE LAB #94: Using Cookies and Session Persistence"
slug: the-lab-94-using-cookies-and-session
canonical_url: https://substack.thewebscraping.club/p/the-lab-94-using-cookies-and-session
publish_date: 2025-11-15
visibility: premium
topics:
  - browser-automation
  - cloudflare
  - scraping-infra
---

[article body in markdown]</code></pre></div><h2>Scraping your own content</h2><p>The first source we built was a scraper for our own Substack articles. The pattern applies to any CMS: discover URLs, authenticate if needed, extract content, convert to markdown with frontmatter.</p><h3>URL discovery and authentication</h3><p>Most publishing platforms expose a sitemap. We fetch it, filter for article URLs (Substack uses <code>/p/</code> in the path), and track the <code>lastmod</code> date to detect changes:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;cf6dd5c0-8f99-4466-bc88-5bfe8f8b109a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen


def fetch_sitemap(sitemap_url: str) -&gt; list[dict]:
    req = Request(sitemap_url)
    req.add_header("User-Agent", "Mozilla/5.0 ...")
    with urlopen(req) as response:
        content = response.read()

    root = ET.fromstring(content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    articles = []
    for url_elem in root.findall("sm:url", ns):
        loc = url_elem.find("sm:loc", ns)
        lastmod = url_elem.find("sm:lastmod", ns)
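        if loc is not None and loc.text and "/p/" in loc.text:
            # lastmod is optional in sitemaps, so guard against a missing element
            modified = lastmod.text if lastmod is not None and lastmod.text else ""
            articles.append({"url": loc.text.strip(), "lastmod": modified})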
    return articles</code></pre></div><p>Substack gates premium content behind authentication. We handle this with a persistent Playwright browser context that stores cookies across runs. On the first run you log in manually; after that, the saved session keeps you authenticated. For cron jobs, we verify the session by loading a known premium article and checking if the full content appears.</p><p>We try multiple CSS selectors for extraction because Substack has changed its DOM structure over time. The extracted HTML goes through the markdown converter we showed earlier.</p><h2>Ingesting external sources: Hacker News</h2>
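<p>As noted in the tools section, HN exposes its search index through public Algolia endpoints, so no scraping is needed for discovery. Below is a minimal sketch of that lookup, assuming the public <code>hn.algolia.com/api/v1/search</code> endpoint with its <code>query</code>, <code>tags</code>, and <code>hitsPerPage</code> parameters. The function names and the output fields are illustrative assumptions, not the code from the repository.</p>

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public HN search endpoint served by Algolia
HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search"

# Assumption: all helper names below are hypothetical, chosen for illustration.


def build_search_url(query, hits_per_page=20):
    """Build a search URL restricted to stories (no comments or polls)."""
    params = {"query": query, "tags": "story", "hitsPerPage": hits_per_page}
    return HN_SEARCH_URL + "?" + urlencode(params)


def parse_hits(payload):
    """Keep only the stories that link to an external URL."""
    stories = []
    for hit in payload.get("hits", []):
        url = hit.get("url")
        if not url:
            # Text-only posts (for example Ask HN) have no article to extract
            continue
        stories.append(
            {
                "hn_id": hit.get("objectID"),
                "title": hit.get("title") or "",
                "url": url,
                "points": hit.get("points") or 0,
            }
        )
    return stories


def search_stories(query, hits_per_page=20):
    """Query the API and return the filtered stories (needs network access)."""
    req = Request(build_search_url(query, hits_per_page))
    req.add_header("User-Agent", "kb-indexer-example")
    with urlopen(req, timeout=30) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return parse_hits(payload)
```

<p>Calling <code>search_stories("web scraping")</code> would return a list of dictionaries whose <code>url</code> values can then be handed to a markdown extractor. Keeping URL building and response parsing in separate functions makes both easy to test without touching the network.</p>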
      <p>
          <a href="https://substack.thewebscraping.club/p/building-knowledge-base-scraping">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Data Scraping for Market Research: A Developers Guide]]></title><description><![CDATA[Build scrapers that deliver real market intelligence, not just raw data dumps]]></description><link>https://substack.thewebscraping.club/p/data-scraping-market-research</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/data-scraping-market-research</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 29 Mar 2026 20:38:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e95388da-deb3-4a33-9e90-438b2658fddd_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Market research has always been about answering a simple question: &#8220;<em>What&#8217;s happening in the market, and how do I use that to make better decisions?&#8221;</em></p><p>The traditional way to answer that question involved surveys, focus groups, and expensive reports from firms that charge you a fortune for data that&#8217;s already a few months old by the time you read it. 
Today, the data you need is sitting on public web pages: You just need to collect it.</p><p>In this article, we&#8217;ll discuss how to scrape data for market research, what sources actually matter, how to build a pipeline that doesn&#8217;t fall apart after a week, and where the legal lines are.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What &#8220;Market Research&#8221; Actually Means for Web Scraping Professionals</h2><p>Market research needs to answer three questions:</p><ul><li><p>&#8220;<em>What are our competitors doing?</em>&#8221;</p></li><li><p>&#8220;<em>What are our customers saying?</em>&#8221;</p></li><li><p>&#8220;<em>How is the market moving?</em>&#8221;</p></li></ul><p>That&#8217;s it. Everything else is a variation of those three. And if you think about it, the web gives you access to all three, if you know where to look.</p><p>In practice, scraped market intelligence sits on three pillars:</p><ul><li><p><strong>Competitive data</strong>: Pricing, product catalogs, feature changes, hiring signals. This is the &#8220;what are they doing?&#8221; pillar.</p></li><li><p><strong>Customer sentiment</strong>: Reviews, forum discussions, social media posts. This is the &#8220;what are people saying?&#8221; pillar.</p></li><li><p><strong>Market signals</strong>: Job postings, regulatory filings, trend volumes, new product launches. This is the &#8220;where is the market going?&#8221; pillar.</p></li></ul><p>Now, why scraping instead of traditional research? Because scraping is real-time, it&#8217;s continuous, and it doesn&#8217;t depend on people filling out forms. A survey tells you what 500 people said last month. A scraper tells you what thousands of customers are saying right now, every single day, without anyone having to opt in.</p><p>That&#8217;s the competitive advantage. 
And it&#8217;s a big one.</p><div><hr></div><blockquote><p><em>For your scraping activity, you need IPs with good reputation. For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">that&#8217;s sharing with TWSC readers this offer</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" 
height="87.84" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a 
href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><h2>Where to Scrape: Sources That Actually Matter</h2><p>Not all sources are worth your time. You could scrape the entire Internet and still end up with nothing useful if you&#8217;re not targeting the right places. Below is a list of high-value targets for market research and what you can extract from each:</p><ul><li><p><strong>Competitor websites</strong>: Pricing pages, product pages, feature matrices, changelogs, and blog posts. This is your primary source for understanding what competitors are offering and how they position themselves. Pricing pages, in particular, are gold. They change more often than you&#8217;d think, and tracking those changes over time tells you a lot about a competitor&#8217;s strategy.</p></li><li><p><strong>Review platforms</strong> <strong>(G2, Trustpilot, Amazon, Yelp)</strong>: Customer pain points, feature requests, sentiment shifts. Reviews are unfiltered customer feedback. Nobody writes a G2 review because they were asked nicely in a survey. They write it because they feel strongly about something, and that&#8217;s exactly the kind of signal you want.</p></li><li><p><strong>Job boards</strong> <strong>(LinkedIn, Indeed)</strong>: Hiring patterns reveal where a company is investing. If a competitor suddenly posts 20 machine learning engineer roles, that tells you something no press release will. Job postings are one of the most underrated market research signals out there.</p></li><li><p><strong>Social media and forums (Reddit, X, niche communities)</strong>: Unfiltered opinions, emerging trends, early complaints about products. 
Reddit threads and niche forums are where people say what they actually think, not what they&#8217;d say in a focus group.</p></li><li><p><strong>Government and public data portals</strong>: SEC filings, patent databases, import/export records. These are slower-moving signals, but they&#8217;re authoritative. A patent filing can tell you what a competitor is building 18 months before it ships.</p></li></ul><p>Here&#8217;s the key question to ask yourself before adding a source to your scraper: <em>&#8220;Does this data answer a specific research question, or am I just hoarding?&#8221;</em> If you can&#8217;t tie a source to a concrete insight, skip it. You&#8217;ll save yourself storage costs, maintenance headaches, and potential legal issues.</p><h2>Building the Pipeline: From Raw HTML to Market Intelligence</h2><p>A market research scraper is not a one-off script you run from your terminal. It&#8217;s a pipeline. And pipelines need structure. If you treat it like a quick script, you&#8217;ll end up with a mess of cron jobs, inconsistent data formats, and no idea whether your data is fresh or stale. So, build it properly from the start.</p><p>A scraping pipeline for market intelligence should have four stages:</p><ol><li><p><strong>Collection</strong>: Fetch the pages, extract the fields you need, throw the rest away. Don&#8217;t store raw HTML &#8220;just in case&#8221; (you&#8217;ll learn why in the legal section of this article).</p></li><li><p><strong>Storage</strong>: Store facts and metadata (source URL, timestamp, extracted fields). Use a structure that makes deduplication and versioning easy. 
In practice, this means designing your schema around a composite key (for example: <em>source </em>+ <em>entity ID</em> + <em>scraped timestamp</em>) so you can track how a data point changes over time without overwriting previous records.</p></li><li><p><strong>Transformation</strong>: Normalize the data across sources, deduplicate records, and enrich with additional context (geocoding, industry classification, entity linking).</p></li><li><p><strong>Analysis</strong>: Turn rows into insights. This is where the actual market research happens. And to be clear: &#8220;Analysis&#8221; doesn&#8217;t mean opening a CSV and scrolling through it. The goal is to turn your pipeline&#8217;s output into dashboards, scheduled reports, or Slack alerts that reach the people who make decisions. If the data sits in a database and nobody looks at it, the whole pipeline is wasted effort.</p></li></ol><h3>Scheduling Matters More Than You Think</h3><p>Different data types have different freshness requirements. Getting this wrong means either wasting resources or working with stale data. The main ideas to consider when engineering the triggering times are the following:</p><ul><li><p><strong>Price tracking</strong>: Daily or hourly, depending on the market. Consider that e-commerce prices can change multiple times a day. SaaS pricing pages, instead, change less often. But when they do, it&#8217;s significant.</p></li><li><p><strong>Review monitoring</strong>: Monitoring reviews daily is usually enough. Reviews don&#8217;t appear in real-time, and sentiment trends are measured in weeks, not minutes.</p></li><li><p><strong>Job postings</strong>: A weekly schedule works for trend analysis of the job market. Remember that you&#8217;re looking for patterns, not individual listings.</p></li><li><p><strong>Social media</strong>: This depends on your use case. If you&#8217;re tracking a product launch or a PR crisis, you might need near-real-time. 
For general trend analysis, daily or even weekly batches work fine.</p></li></ul><h3>Tools That Work Well for Market Research Scraping</h3><p>You don&#8217;t need to reinvent the wheel. The software industry already provides you with the best tools for your market research scraping pipeline. Here&#8217;s a solid stack for a market research pipeline:</p><ul><li><p><strong><a href="https://www.scrapy.org/">Scrapy</a></strong> for structured crawling. <a href="https://substack.thewebscraping.club/p/scrapy-ten-years-of-scraping-framework">Scrapy&#8217;s architecture is designed for exactly this kind of work</a>: You define spiders per source, plug in middleware for proxy rotation and retry logic, and use item pipelines to clean and store data as it flows through. For market research specifically, Scrapy&#8217;s built-in feed exports let you dump results straight to JSON, CSV, or even S3 without writing custom I/O code. And if you need to coordinate multiple spiders (say, one per competitor), Scrapy&#8217;s project structure keeps things organized as your source list grows.</p></li><li><p><strong><a href="https://playwright.dev/">Playwright</a></strong> or <strong><a href="https://pptr.dev/">Puppeteer</a></strong> for JS-heavy pages. The key difference from Scrapy is that <a href="https://substack.thewebscraping.club/p/handling-infinite-scrolling-python-js">you&#8217;re running a real browser, which means you can handle dynamic content, infinite scroll</a>, and client-side rendering. The trade-off is resource cost: Each browser instance eats memory and CPU, so you don&#8217;t want to use this for targets that serve static HTML.</p></li><li><p><strong>A</strong> <strong>task queue</strong> for scheduling and orchestration. This is what turns a collection of scrapers into an actual pipeline. 
Instead of running scripts manually or relying on cron jobs, a task queue lets you schedule scrapes per source at different intervals, retry failed jobs automatically, and <a href="https://substack.thewebscraping.club/p/python-async-for-faster-scraping">control concurrency so you&#8217;re not overwhelming a target site with parallel requests</a>. It also gives you visibility: You can see what&#8217;s queued, what&#8217;s running, what failed, and why.</p></li><li><p><strong><a href="https://www.postgresql.org/">PostgreSQL</a></strong> for structured market data that needs querying and versioning. Relational databases shine here because market research data is inherently relational: Competitors have products, products have prices, prices change over time.</p></li></ul><p>The point is this: Pick tools that let you build a maintainable system, not just a working script. Every tool in this stack solves a specific problem, and none of them requires you to build infrastructure from scratch. The best market research pipeline is the one that&#8217;s boring to operate, because boring means reliable.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Scaling Without Getting Blocked</h2><p>If you&#8217;re scraping one competitor once a week, you don&#8217;t need this section. If you&#8217;re tracking 50 competitors daily across thousands of pages, you do.</p><p>Here&#8217;s the reality: The moment you start scraping at scale, you become visible. And sites don&#8217;t like bots, even polite ones. So you need to be smart about how you scale. 
Consider the following rules of thumb to avoid getting blocked:</p><ul><li><p><strong>Proxy rotation</strong>: <a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies">Residential proxies for sensitive targets (sites with aggressive anti-bot systems), datacenter proxies for everything else</a>. Rotate per request or per session, depending on the site&#8217;s detection mechanisms. The key is to not send thousands of requests from the same IP in an hour.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/rate-limit-scraping-exponential-backoff">Rate limiting and backoff</a></strong>: Be a good citizen. If you hammer a site with concurrent requests, you&#8217;ll get blocked, and you&#8217;ll deserve it. Implement exponential backoff on failures, and set reasonable delays between requests. A 2-3 second delay between requests is a good starting point for most sites.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">Fingerprint management</a></strong>: Headers, TLS fingerprint, and browser-level signals matter on sites with serious anti-bot systems. Make sure your request headers look consistent and realistic.</p></li><li><p><strong>CAPTCHAs</strong>: <a href="https://substack.thewebscraping.club/p/are-captchas-still-a-thing">If you&#8217;re hitting CAPTCHAs regularly, your approach is too aggressive</a>. Fix the root cause (rate, fingerprint, proxy quality) before reaching for solver services. 
CAPTCHA solvers are a band-aid, not a solution.</p></li></ul><p>The general principle is simple: Scrape at a pace that doesn&#8217;t degrade the target site&#8217;s performance.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Turning Scraped Data into Actual Market Insights</h2><p>Let&#8217;s be clear about something: Raw scraped data is not market research. It&#8217;s just data. A CSV with 50,000 rows of competitor prices is not an insight. A chart showing that competitor X has dropped their enterprise tier price by 15% over three months: That&#8217;s an insight.</p><p>Here&#8217;s where the value gets created:</p><ul><li><p><strong>Price tracking and competitive benchmarking</strong>: Track changes over time, visualize trends, and set alerts for significant moves. The goal is not to know what a competitor charges today. It&#8217;s to understand their pricing trajectory. Are they moving upmarket? Are they running more frequent discounts? Are they simplifying their tier structure? This is where predictive <a href="https://substack.thewebscraping.club/p/predictive-analytics-web-scraped-data">analytics meets scraped data with the goal of predicting future moves</a> from your competitors.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/sentiment-analysis-amazon-reviews">Sentiment analysis on reviews</a></strong><a href="https://substack.thewebscraping.club/p/sentiment-analysis-amazon-reviews">: Use NLP to extract themes from customer reviews</a>. 
This is powerful for product teams who want to understand what customers love and hate about competitors. But remember: You&#8217;re analyzing the data internally, not republishing the reviews.</p></li><li><p><strong>Hiring signal analysis</strong>: Aggregate job postings by role type, department, and location. A competitor suddenly posting 15 ML engineer roles tells you they&#8217;re investing in AI. A wave of sales hiring in EMEA tells you they&#8217;re expanding geographically. This is a signal that&#8217;s almost impossible to get from any other source.</p></li><li><p><strong>Trend detection</strong>: Time-series analysis on product launches, feature changes, pricing moves, or social media mentions. <a href="https://substack.thewebscraping.club/p/scraping-data-anomaly-detection">The goal is to spot patterns or anomalies</a> before they become obvious. If three competitors all add the same feature within two months, that&#8217;s a market trend, not a coincidence.</p></li></ul><p>Overall, the <a href="https://substack.thewebscraping.club/p/building-a-scraper-dashboard-streamlit">output of your scraping pipeline should be dashboards</a>, reports, or automated alerts, not a database dump that someone has to manually dig through. If the insights don&#8217;t reach decision-makers in a usable format, the whole pipeline is wasted effort.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Legal and Ethical Considerations: Don&#8217;t Skip This Section</h2><p>I know, I know. You&#8217;re a developer, not a lawyer. But here&#8217;s a thing I&#8217;m sure you know: Most legal problems in scraping are self-inflicted. They happen because someone scraped &#8220;everything on the page,&#8221; stored it &#8220;for later,&#8221; and only then asked: <em>&#8220;Wait, can we actually use this?&#8221;</em></p><p>As discussed in detail in &#8220;<a href="https://substack.thewebscraping.club/p/avoid-copyright-violations-scraping">How to Avoid Copyright Violations While Scraping</a>&#8221;, let&#8217;s briefly go through the key legal and <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical principles of web scraping</a>:</p><ul><li><p><strong>Scrape facts, not expression</strong>: Copyright protects expression, not facts. Prices, SKUs, dates, availability, and job titles are facts. No one owns the fact that a SaaS product costs $49/month. On the other hand, product descriptions, review text, and blog posts are creative expression.</p></li><li><p><strong>Don&#8217;t store raw pages by default</strong>: Storing the HTML of entire pages means creating copies of copyrighted content. Instead, parse in-memory, extract only the fields you need, and discard the rest. 
If you need to debug, store a small sample with short retention.</p></li><li><p><strong>Respect </strong><em><strong>robots.txt</strong></em>: <a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">The </a><em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">robots.txt</a></em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications"> file is not the law, but ignoring it is evidence of bad faith if things go sideways</a>. In disputes, it can be used to show that you knew you were unwelcome and kept going anyway.</p></li><li><p><strong>Terms of Service matter</strong>: If the ToS explicitly forbids scraping and you scrape anyway, you may have a breach-of-contract problem. This is often easier for the site owner to prove than copyright infringement, because the argument is straightforward: you agreed to a contract, then you violated it.</p></li><li><p><strong>Don&#8217;t scrape behind a login</strong>: Once you log in, you&#8217;ve affirmatively agreed to a contract. Breaking that contract to scrape is a fast track to legal trouble. If your plan requires authenticated access, treat it as a licensing problem, not an engineering challenge.</p></li><li><p><strong>GDPR/CCPA</strong>: If you&#8217;re scraping anything that could be personal data (usernames, reviewer names, profile information), you need to know which privacy laws apply. This is especially relevant for review scraping and social media monitoring.</p></li></ul><p>Here&#8217;s the mental model that works: A price comparison tool that shows prices and links back to the source? Generally safe. A product catalog that copies descriptions, images, and reviews so users never need to visit the original site? 
That&#8217;s where you get into trouble, even if you never display the results publicly and only use them for internal analysis.</p><h2>Keeping Your Scrapers Alive: Monitoring and Maintenance</h2><p>Scrapers in production break for several reasons. Sites change layouts, add anti-bot measures, restructure their URLs, or just go down for maintenance. If you don&#8217;t monitor your scrapers, your data goes stale silently, and you won&#8217;t know until someone asks why the pricing dashboard hasn&#8217;t updated in three weeks.</p><p>Here&#8217;s a breakdown of what you need:</p><ul><li><p><strong>Dead selector detection</strong>: Alert when a CSS selector or XPath returns empty across multiple consecutive runs. A selector that worked yesterday and returns nothing today means the site changed its HTML structure. The key phrase here is &#8220;multiple consecutive runs&#8221;. A single empty result could be a transient issue, so consider not triggering alerts on the first failure. Instead, set a threshold, like three consecutive empty results, before flagging it. When it does fire, you need to inspect the current page structure and update your selectors. Alternatively, try to <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">go beyond the DOM using AI and LLMs</a> to make your extraction more resilient to layout changes in the first place.</p></li><li><p><strong>HTTP status monitoring</strong>: A spike in 403s means you&#8217;re getting blocked. A spike in 429s means you&#8217;re hitting rate limits. A spike in 404s means URLs have changed. Each of these requires a different response. For 403s, check your proxy pool and rotation logic: You might need fresher IPs or a lower request rate. For 429s, back off and increase your delays between requests; the site is telling you exactly what the problem is. 
For 404s, the target has likely restructured its URL patterns, which means you need to update your URL generation logic, not just retry the same broken links. Log these status codes per source and per run so you can spot trends early. A gradual increase in 403s over a week is a warning sign that your current setup is losing effectiveness, even if individual runs still return some data.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/ensuring-data-quality-in-web-scraping">Data quality checks</a></strong>: Row counts, null rates, value distributions. If your price tracker suddenly shows all prices as $0 or your review scraper returns empty text fields, you want to know immediately. Build quality checks into your pipeline as a post-scrape validation step, not as something you run manually. Compare each run&#8217;s output against baseline expectations: If you normally get 200 rows from a source and today you got 12, something is wrong, even if those 12 rows look fine individually.</p></li><li><p><strong>Automated tests against fixture HTML</strong>: Save sample HTML pages from your targets and write tests against them. When a test fails, you know the site has changed before your production scraper breaks. Treat your scrapers like production code, because they are. In practice, this means saving a snapshot of a relevant section in the target page as a local HTML file. Then, write unit tests that run your extraction logic against that fixture and assert expected outputs. Store these fixtures in version control alongside your scraper code. When a site changes and your production scraper breaks, update the fixture with the new HTML. This gives you a repeatable workflow for handling site changes instead of scrambling every time something breaks.</p></li></ul><p>The goal is simple: You should know when something breaks before your stakeholders do. 
A Slack alert that says &#8220;Competitor X pricing scraper returned 0 results&#8221; is infinitely better than a product manager asking why the dashboard is empty.</p><h2>Conclusion</h2><p>In this article, you learned that market research scraping is about building a reliable pipeline that collects the right facts, transforms them into insights, and doesn&#8217;t get you in legal trouble.</p><p>The competitive advantage of scraping for market research is in what you do with the data. Anyone can code a scraper. But building a system that delivers reliable, actionable market intelligence week after week? That&#8217;s where the real value is!</p><p>So, let us know: Are you using web scraping for market research? What sources have you found most valuable? How did you structure your scraping pipeline? Let&#8217;s discuss in the comments!</p>]]></content:encoded></item><item><title><![CDATA[Two stealth browsers just dropped. Also, your proxy provider might be overcharging you.]]></title><description><![CDATA[Use the new TWSC tools to discover proxy prices and news in the web scraping industry]]></description><link>https://substack.thewebscraping.club/p/two-stealth-browsers-proxy-prices</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/two-stealth-browsers-proxy-prices</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Wed, 25 Mar 2026 15:39:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8b292fc2-af4f-4f3e-8e94-0204b7fd08bb_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few things landed on my desk this week that I did not want to wait until the next issue to share. 
So here is a quick bonus edition: an update to a tool I have been building, and two projects from the scraping world that caught my attention.</p><div><hr></div><h2><strong>The Proxy Price Benchmark is now updated weekly</strong></h2><p><br>If you haven't checked it yet, the <a href="https://proxyprice.thewebscraping.club/">Proxy Price Benchmark</a> is the tool I built to answer a simple but important question: how much should you actually be paying for your proxies?<br><br>Every week, I (or better, my fleet of agents) update the pricing data directly from the vendors, so you always have a reliable reference to compare offers or negotiate with your current provider.<br><br>This week, we added two new vendors: <strong>Dataimpulse</strong> and <strong>AnyIP</strong>, bringing the total number of monitored providers to 27.<br><br><a href="https://proxyprice.thewebscraping.club/">Check the latest prices</a><br><br>If you use proxies at scale and would find API access to this data useful, I am considering a paid API plan. If you are interested, join the waitlist and tell me about your use case. I want to understand demand before I build it.<br></p><div><hr></div><h2><br><strong>This week on Scraping News: stealth browsers are getting serious</strong></h2><p><br>The <a href="https://news.thewebscraping.club/">Scraping News feed</a> has been tracking an interesting trend this week: two new stealth browser projects worth watching.<br><br><strong><a href="https://owlbrowser.net/">Owl Browser</a></strong> is a purpose-built browser engine for automation at scale. Not a Playwright wrapper but a full engine built on Chromium (CEF) with a custom C99 HTTP server, 256 parallel contexts, and sub-12ms cold start. Self-hosted, Docker-ready, with Python and TypeScript SDKs. 
If you are running high-volume scraping and hitting the limits of standard headless setups, this is worth a closer look.<br><br><strong><a href="https://github.com/rayobyte-data/rayobrowse">Rayobrowse</a></strong> is Rayobyte's open-source stealth Chromium browser, released from their production scraping infrastructure. It handles fingerprint randomization at the browser level (user agent, WebGL, fonts, screen resolution, timezone) and connects via CDP, so it works with Playwright, Puppeteer, Selenium, or any custom script. Runs on headless Linux with no GPU required.<br><br>Both address the same problem from different angles: standard headless Chromium is detected, and the solution is now moving from patch-level evasion to full browser-level stealth. We will be covering both in depth on TWSC soon.<br><br><a href="https://news.thewebscraping.club/">See all the latest news on Scraping News</a><br></p><div><hr></div><p>Keep in mind that both the Proxy Price Benchmark tools and Scraping News are in an early version; feel free to suggest improvements and bug fixes.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[WebSocket Bot Detection Techniques and How to Bypass Them]]></title><description><![CDATA[You may already know generic anti-bot techniques, but what about WebSocket-specific ones? Let&#8217;s find out!]]></description><link>https://substack.thewebscraping.club/p/websocket-bot-detection-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/websocket-bot-detection-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 22 Mar 2026 09:30:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/48bfc637-7402-4dd9-b7ab-d007f6fa773d_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Websites and web applications are becoming more complex than ever, with live data powering features that deliver fast insights. If you&#8217;re wondering which technology makes those live updates possible, the answer is WebSockets.</p><p>You might think that, in a web scraping scenario, the solution is simply to connect directly to the WebSocket channels. Sure, that&#8217;s possible, but there are a few obstacles along the way. 
The main ones are WebSocket anti-bot techniques and bot detection measures.</p><p>In this post, I&#8217;ll walk through the most common ones, explain how they work, and share proven tips and tricks to help you avoid them.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KA5B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KA5B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KA5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png" width="560" height="315.38461538461536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:560,&quot;bytes&quot;:1650775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656767?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KA5B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><h2>A Quick Intro to WebSockets</h2><p>Before diving into WebSocket bot detection, let me first provide some context about WebSocket as a protocol and its role in web scraping.</p><h3>What Is the WebSocket Protocol?</h3><p><a href="https://websocket.org/guides/websocket-protocol/">WebSocket</a>, often abbreviated as <em>WS</em>, is a web protocol standardized in <a href="https://datatracker.ietf.org/doc/html/rfc6455">RFC 6455</a> that enables full-duplex, bidirectional communication between clients and servers over a single, persistent TCP connection.</p><p>Unlike HTTP, which is stateless and request-driven, WebSockets establish a long-lived connection through an initial HTTP handshake. 
After the handshake, both client and server can send messages independently, with data transmitted in frames that can be text, binary, or control frames (ping, pong, close).</p><p>WebSockets support fragmentation, masking, and optional compression via extensions like <em>permessage-deflate</em>, while newer mechanisms for running WebSockets over HTTP/2 (RFC 8441) and <a href="https://substack.thewebscraping.club/p/faster-web-scraping-with-http3">HTTP/3</a> (RFC 9220) allow multiplexing, reduced latency, and better proxy traversal.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WRwD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WRwD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 424w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 848w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1272w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WRwD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png" width="1456" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;HTTP vs WebSocket&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="HTTP vs WebSocket" title="HTTP vs WebSocket" srcset="https://substackcdn.com/image/fetch/$s_!WRwD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 424w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 848w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1272w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">HTTP vs WebSocket</figcaption></figure></div><div><hr></div><blockquote><p><em><br>For your ethical scraping activity, you need IPs with good reputation. 
For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">that&#8217;s sharing with TWSC readers this offer</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" height="87.84" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC 
for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><blockquote><div><hr></div></blockquote><h3>Why and When Web Pages Use WebSockets</h3><p>The WebSocket protocol opens the door to live, bidirectional web communication. Unlike HTTP&#8217;s request-response model, it lets servers and clients exchange data continuously over a single, persistent connection.</p><p>In general, WebSockets are essential for any application where low latency and frequent updates are required. Common use cases include:</p><ul><li><p><strong>Live streaming</strong>: YouTube Live, TikTok LIVE, Kick, Twitch, and similar platforms.</p></li><li><p><strong>Chat applications</strong>: Slack, Discord, and other messaging services.</p></li><li><p><strong>Collaboration tools</strong>: Google Docs, Figma, and online whiteboards.</p></li><li><p><strong>Gaming and multiplayer experiences</strong>: Browser-based MMO games, turn-based games, and PvP games.</p></li><li><p><strong>Financial data feeds</strong>: Stock tickers, cryptocurrency price updates, and trading dashboards.</p></li><li><p><strong>IoT and telemetry</strong>: Sensor updates, home automation, and device monitoring.</p></li><li><p><strong>Notifications and alerts</strong>: Push updates for social networks, dashboards, or monitoring systems.</p></li></ul><p>In short, WebSocket comes into play wherever instant, continuous communication is necessary (and standard HTTP polling would be too slow or resource-intensive).</p><h3>Main Challenges of Scraping Data from WebSockets</h3><p>Connecting to a WebSocket server for collecting data isn&#8217;t as straightforward as <a href="https://substack.thewebscraping.club/p/apis-in-web-scraping">spoofing API requests for web scraping</a>. 
In particular, the main challenges of scraping data straight from WebSockets include:</p><ul><li><p><strong>Finding the right client implementation</strong>: You must use a WebSocket client (and there are far fewer of them than HTTP clients) that supports the correct protocol version and whatever the server negotiates, such as compression extensions or subprotocols.</p></li><li><p><strong>Limited documentation and examples</strong>: WebSocket scraping is less common than API scraping, so there are fewer guides, tools, and community resources available.</p></li><li><p><strong>Proxy integration complexity</strong>: Not all clients support proxy integrations, making IP rotation a challenge.</p></li><li><p><strong>No request&#8211;response model</strong>: You can&#8217;t simply send a request and receive a response, as with API scraping. Instead, you must send the right messages and then listen to a continuous stream of events.</p></li><li><p><strong>Real-time data handling</strong>: You need a system to collect, process, and store messages in real time, often dealing with high-frequency updates.</p></li></ul><h2>Main WebSocket Anti-Bot Techniques and Solutions</h2><p>Now you&#8217;re ready to discover the most important WebSocket-specific bot detection techniques, along with practical tips to avoid and bypass them. The idea here is to target a WebSocket server from an automated script, relying on a WS client in Python, Node.js, or another programming language of your choice.</p><h3>WebSocket Handshake Issues</h3><p>The <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers#client_handshake_request">WebSocket handshake</a> is the transition phase in which an HTTP connection is upgraded to a persistent WebSocket connection. 
During this step, both the client and the server negotiate the connection parameters, and either side can abort the process if the conditions aren&#8217;t acceptable.</p><p>Because the handshake is where the protocol upgrade happens, it&#8217;s also a pivotal security and bot-detection point. The server must carefully validate everything the client requests. Otherwise, protocol misuse or security issues may occur.</p><p>In detail, during the handshake, a WebSocket client must send a valid HTTP/1.1 GET request with specific headers, for example:</p><pre><code>GET /live-data HTTP/1.1
Host: example.com:9000
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: JKjeFfYU8mti9re0prPQrw==
Sec-WebSocket-Protocol: chat, superchat
Sec-WebSocket-Version: 13</code></pre><p>In practice, browsers also include additional headers such as <em>Origin</em>, <em>User-Agent</em>, <em>Referer, Cookie</em>, as well as authentication headers (e.g., <em>Authorization</em>). While these HTTP headers aren&#8217;t strictly required by the WebSocket specification, they are extremely valuable for <a href="https://substack.thewebscraping.club/p/browser-fingerprinting-test-online">fingerprinting and bot detection</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6chN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6chN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 424w, https://substackcdn.com/image/fetch/$s_!6chN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 848w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1272w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6chN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png" width="1456" height="1239" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1239,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note all extra HTTP headers&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note all extra HTTP headers" title="Note all extra HTTP headers" srcset="https://substackcdn.com/image/fetch/$s_!6chN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 424w, https://substackcdn.com/image/fetch/$s_!6chN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 848w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1272w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Note all extra HTTP headers</figcaption></figure></div><p>The server should respond with <em>400 Bad Request</em> and immediately close the connection if it encounters:</p><ul><li><p>A required header that is missing or malformed.</p></li><li><p>An invalid <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Sec-WebSocket-Key">Sec-WebSocket-Key</a></em>.</p></li></ul><p>If the requested WebSocket version is unsupported, the server should instead abort the handshake with an error such as <em>426 Upgrade Required</em> and return a <em>Sec-WebSocket-Version</em> header listing the versions it supports (most modern servers only accept <a 
href="https://datatracker.ietf.org/doc/html/rfc6455#section-1.2">version </a><em><a href="https://datatracker.ietf.org/doc/html/rfc6455#section-1.2">13</a></em>).</p><p>In practice, repeated handshake failures or non-browser-like handshake patterns are often treated as a bot indicator. Those may result in blocking, particularly after repeated handshake attempts from the same IP or when fingerprinting enables identification even across IP changes.</p><p><strong>&#128204; Tips</strong>:</p><ul><li><p><strong>Always send a valid </strong><em><strong>Origin</strong></em><strong> header</strong>: All major browsers include it, and many servers automatically reject WebSocket requests without one.</p></li><li><p><strong>Replicate real browser handshakes as closely as possible</strong>: Inspect the WebSocket request made by a real browser and match all headers (e.g., <em>User-Agent </em>and similar extra headers).</p></li><li><p><strong>Avoid excessive handshake attempts from the same machine</strong>: Too many connection attempts in a short time window are a common bot signal.</p></li><li><p><strong>Use IP rotation carefully</strong>: Rotation can help avoid rate-based blocks, but it doesn&#8217;t protect against fingerprint-based detection if the handshake remains identical.</p></li></ul><h3>Honeypot WebSocket Events and Channels</h3><p>If you&#8217;re familiar with <a href="https://substack.thewebscraping.club/p/scraping-high-frequency-python">common anti-bot techniques</a>, you&#8217;ve probably heard of honeypots. A honeypot is a decoy mechanism designed to attract bots by exposing fake or hidden resources, allowing systems to detect automated behavior when those resources are accessed or interacted with (e.g., invisible links or fake pages created to study bots).</p><p>In the context of WebSockets, honeypot events are a possible anti-bot technique to detect automated clients. 
With this approach, the server deliberately sends fake, misleading, or non-actionable events over the WebSocket connection. Similarly, the server might expose channels that aren&#8217;t meant to be accessed by regular clients.</p><p>Automated scraping bots can give themselves away by reacting to these honeypots, for example by:</p><ul><li><p>Processing incoming data that is fake or intentionally invalid.</p></li><li><p>Requesting access to or subscribing to channels they aren&#8217;t supposed to use.</p></li></ul><p><strong>&#128204; Tips</strong>:</p><ul><li><p><strong>Study real browser behavior carefully</strong>: Inspect WebSocket traffic in your browser&#8217;s DevTools (&#8220;Network&#8221; &#8594; &#8220;Socket&#8221;) and observe which server messages actually trigger data flow or UI updates.</p></li><li><p><strong>Avoid assuming every message is meaningful</strong>: Remember that reacting to every event can lead to detection.</p></li></ul><h3>Connection Lifecycle Anomalies and Patterns</h3><p>Since WebSocket connections are stateful (unlike stateless HTTP requests), servers can detect bots by analyzing connection behavior over time. Scraping bots tend to prioritize speed over realistic user behavior, which can produce identifiable patterns.</p><p>Common bot-like indicators include:</p><ul><li><p><strong>Very short-lived connections</strong>: Opening and closing sockets rapidly to collect data.</p></li><li><p><strong>Immediate reconnections after closure</strong>: Reconnecting instantly without human-like delays.</p></li><li><p><strong>High connection churn per IP</strong>: Multiple connections from the same IP within a short period.</p></li><li><p><strong>Missing browser events</strong>: Typical browser WebSocket clients close sockets properly (e.g., by sending a close frame), whereas bots often skip this step.</p></li><li><p><strong>Unnatural latency patterns</strong>: Servers use ping frames as heartbeats to check responsiveness. 
Real users on home Wi-Fi or mobile networks exhibit variable latency (jitter), while automated scripts deployed in data centers generally show extremely stable, low-latency responses.</p></li></ul><p><strong>&#128204; Tips</strong>:</p><ul><li><p><strong>Introduce some randomness</strong>: Add realistic, randomized delays between connections and reconnections.</p></li><li><p><strong>Replicate intended behavior</strong>: Close connections the way a browser would, completing the closing handshake rather than abruptly dropping the socket.</p></li><li><p><strong>Add latency variation</strong>: Slightly vary the timing of the frames you send and of your responses to mimic real-world network jitter.</p></li><li><p><strong>Rotate connection IPs</strong>: Use proxies to <a href="https://substack.thewebscraping.club/p/how-many-ip-needed-scraping">distribute WebSocket connections across multiple IPs</a>.</p></li></ul><h3>WebSocket Binary Data Transmission</h3><p>WebSocket servers sometimes choose to send binary data instead of plain text or JSON. The main technical reasons for this are:</p><ul><li><p><strong>Reduced bandwidth</strong>: Binary messages omit field names and whitespace, making packets smaller than JSON strings and supporting high-frequency updates.</p></li><li><p><strong>Faster parsing</strong>: Binary data can be read as typed arrays or fixed-size fields, avoiding JSON parsing overhead.</p></li><li><p><strong>Custom protocols</strong>: Web apps can define their own compact binary format for predictable, high-frequency data.</p></li><li><p><strong>Efficient number storage</strong>: Numeric values can be stored in as little as 1&#8211;4 bytes rather than as multi-character strings, saving space.</p></li></ul><p>For instance, TikTok LIVE pages use WebSockets to stream updates (e.g., chat messages, view counters, and other statistics) in binary format:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!GVMj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GVMj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 424w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 848w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1272w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GVMj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png" width="1456" height="862" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:862,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the binary message sent from the 
server&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the binary message sent from the server" title="Note the binary message sent from the server" srcset="https://substackcdn.com/image/fetch/$s_!GVMj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 424w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 848w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1272w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 
17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the binary message sent from the server</figcaption></figure></div><p>Sure, binary data can be converted to text. So, you may think that&#8217;s not a problem&#8230;</p><p>Well, keep in mind that many web applications that send binary data also apply some form of compression or encryption. This adds significant complexity!</p><p>Reverse-engineering these systems is technically possible by inspecting the browser&#8217;s WebSocket client code, analyzing handshake headers for compression hints (e.g., <em>Sec-WebSocket-Extensions</em>), or trying common compression methods by trial and error. Still, that&#8217;s time-consuming and error-prone. Plus, encryption keys, salts, or other details can easily change with each deployment.</p><p><strong>&#128204; Tips</strong>:</p><p>This time, the only piece of advice I have is to look for alternative data sources. 
Many WebSocket-based pages, including TikTok LIVE, use regular HTTP APIs to retrieve initial data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZW_3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZW_3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 424w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 848w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1272w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png" width="1456" height="779" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:779,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the RESTful HTTP request made by the client during rendering&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the RESTful HTTP request made by the client during rendering" title="Note the RESTful HTTP request made by the client during rendering" srcset="https://substackcdn.com/image/fetch/$s_!ZW_3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 424w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 848w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1272w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Note the RESTful HTTP request made by the client during rendering</figcaption></figure></div><p><strong>Note</strong>: Why aren&#8217;t these APIs called server-side when the HTML page is generated? 
In the case of live data, it&#8217;s more reliable to fetch it on the client, because even a single second of latency could result in outdated or inconsistent information.</p><p>Thus, polling those RESTful APIs instead of consuming the WebSocket data streams lets you retrieve the information of interest without dealing with binary encoding, compression, or encryption challenges.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>WebSocket-Based Bot Detection Measures</h2><p>WebSocket connections start with an HTTP handshake, so they are subject to <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">many anti-bot techniques commonly used for HTTP requests</a>. At the same time, because WebSocket connections are stateful and persistent, anti-bot solutions like WAFs (Web Application Firewalls) can leverage them to detect automated behavior even more effectively.</p><p>As a result, WebSocket-based anti-bot measures are not only relevant when connecting directly to WS servers, but also when interacting with web pages through browser automation tools like Playwright and Selenium. That&#8217;s why you must know them!</p><h3>Advanced TLS Fingerprinting</h3><p>Traditional HTTP fingerprinting checks headers and TLS details. WebSockets extend this by combining the TLS handshake with WebSocket-specific framing, which is much harder to spoof. 
Signals include <a href="https://developers.cloudflare.com/bots/additional-configurations/ja3-ja4-fingerprint/">JA3/JA4 fingerprints</a>, unusual cipher suite ordering, frame fragmentation patterns, and incorrect masking behavior.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Continuous Device Fingerprinting</h3><p>HTTP allows basic fingerprinting on a per-request basis, but it can&#8217;t verify whether the client&#8217;s environment remains consistent. The stateful nature of WebSockets enables servers to continuously <a href="https://substack.thewebscraping.club/p/what-is-device-fingerprint">validate device fingerprints</a> over time. For example, servers can request Canvas/WebGL renders, available fonts, and other browser characteristics repeatedly. Any inconsistency can lead to an immediate block.</p><h3>Real-Time User Behavior Monitoring</h3><p>WebSockets allow live streaming of mouse, keyboard, and scrolling events back to the server. This enables a much deeper level of user behavior analysis compared to static HTTP requests.</p><p>After all, most <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browser automation scripts</a> produce perfectly straight mouse movements or instantaneous clicks, while human interactions naturally include slight jitter, variable speed, and reaction delays. 
These differences make automated clients easier to detect when behavior is constantly monitored over a WebSocket connection.</p><div><hr></div><h2>Conclusion</h2><p>Here, I introduced the WebSocket protocol and explained why and when it comes in handy. Specifically, you learned that it powers live data updates on web applications. Want to access that data? Well, it&#8217;s not as straightforward as you might think due to WebSocket anti-bot techniques.</p><p>In this post, I explored the most relevant WS bot detection methods, along with useful advice for bypassing them successfully. You also saw how WebSocket&#8217;s stateful, continuous data streaming can be used by WAFs and other advanced anti-bot systems for enhanced detection.</p><p>I hope you found this helpful and informative. If you have any questions or comments, drop them below. 
Until next time!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #100: Hybrid Scraping - One Browser Login, Thousands of HTTP Requests]]></title><description><![CDATA[Building a pipeline that uses Camoufox for authentication and curl_cffi for extraction on Akamai-protected targets.]]></description><link>https://substack.thewebscraping.club/p/hybrid-scraping-camoufox-curl-cffi</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/hybrid-scraping-camoufox-curl-cffi</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 19 Mar 2026 22:07:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e52f7e3-270c-41cc-ba33-7bbbfb446247_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Browser-based scraping tools have become the default answer when a website deploys anti-bot protection. When a target runs Akamai, Cloudflare, or Datadome, the natural reflex is to reach for Playwright, Puppeteer, or one of their stealth variants like Camoufox or Pydoll. And it works. A real browser renders JavaScript, solves challenges, and presents a legitimate fingerprint. The success rate is high.</p><p>But a browser does everything the hard way. It downloads the full page, parses HTML, executes JavaScript, renders the DOM, loads images, fonts, and stylesheets. For each request, it allocates hundreds of megabytes of RAM and takes seconds to complete what an HTTP client could do in milliseconds. When a pipeline needs to scrape ten pages, this overhead is irrelevant. 
When it needs to scrape ten thousand pages, the browser becomes the bottleneck.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Consider a concrete scenario: we need to monitor the wishlist of an e-commerce account, pulling product data, stock levels, and price changes every hour across hundreds of items. 
Running Camoufox for every single API call would mean spinning up a full browser instance, navigating to each page, waiting for JavaScript to execute, extracting the data, and closing. For a hundred items, that is minutes of execution time and gigabytes of memory. The same API calls through an HTTP client would complete in seconds using a fraction of the resources.</p><p>As we measured in <a href="https://substack.thewebscraping.club/p/scraping-nike-with-open-source">THE LAB #96</a>, HTTP clients with TLS impersonation can be 27x faster than browsers on the same target. The difference is not marginal. It is the difference between a pipeline that runs on a single machine and one that requires a cluster.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><p>The problem is that these two approaches are usually treated as mutually exclusive. Either you use a browser for everything, accepting the overhead, or you try an HTTP client and hope the anti-bot system does not block it. But many websites only need a browser at the gate: for the login, the initial challenge, or the session establishment. Everything after that is plain API calls.</p><p>If we can use a browser to earn a valid session and then hand it off to an HTTP client, we get the reliability of browser automation where it matters and the speed of HTTP everywhere else. That is the pattern we want to build. But the handoff is not as simple as copying a few cookies, and the traps along the way are worth understanding before building a pipeline around this idea.</p><h2>The hybrid pattern</h2><p>The idea is simple in principle. Many websites require a browser only at the gate: the login flow, the initial anti-bot challenge, or the session establishment. 
Once that gate is passed, subsequent requests are plain API calls or page fetches that do not require JavaScript execution. If we can extract the session state from the browser and replay it through an HTTP client, we skip the browser for 99% of the work.</p><p>The session state, in practice, means cookies. An authentication flow sets session cookies that the server trusts for subsequent requests. If we transfer those cookies from the browser to an HTTP client, the server should treat the HTTP client as the same authenticated user.</p><p>But cookies alone are often not enough. Modern anti-bot systems like Akamai do not just check whether you have the right cookies. They also check whether the client presenting those cookies looks like the same client that earned them. </p><p>This is where TLS fingerprinting enters the picture: if the browser that logged in was Firefox, but the HTTP client that reuses the cookies presents a Python TLS fingerprint, the server may reject the request or simply drop the connection without responding.</p><p>So the real challenge is not just transferring cookies. It is maintaining continuity across two different execution models: the browser and the HTTP client must look like the same entity to the server.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Tool landscape</h2><p>For this experiment, we used two tools.</p><p><a href="https://github.com/daijro/camoufox">Camoufox</a> is a custom Firefox build designed for stealth. 
It spoofs fingerprints (WebGL, canvas, audio, navigator properties), patches headless detection vectors, and uses Playwright&#8217;s Juggler protocol for automation. We covered it extensively in <a href="https://substack.thewebscraping.club/p/scraping-datadome-camoufox">THE LAB #65: Scraping Datadome-protected websites with Camoufox</a>. Its role here is limited to one thing: logging in.</p><p><a href="https://github.com/yifeikong/curl_cffi">curl_cffi</a> is a Python binding for curl-impersonate, a modified version of curl that mimics the TLS and HTTP/2 fingerprints of real browsers. It supports impersonating Chrome and Firefox at specific versions, which means it can present a TLS fingerprint consistent with the browser that established the session. Unlike a browser, it uses negligible resources per request and can process thousands of pages per minute.</p><p>The key property that makes this pairing work is that Camoufox is Firefox-based and curl_cffi can impersonate Firefox&#8217;s TLS fingerprint. The server sees a consistent Firefox identity across both steps.</p><p><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">100.HYBRID_SCRAPING</a>.</strong></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>The target: Net-a-Porter</h2><p>We chose <a href="https://www.net-a-porter.com">Net-a-Porter</a> as our target. 
It is a luxury e-commerce platform protected by Akamai Bot Manager, with authenticated features (wishlists, account details) exposed through internal JSON APIs. This gives us a clean test case: the login requires a real browser (Akamai blocks automation tools at the login endpoint), but the authenticated API calls are plain HTTP requests that return structured JSON.</p><p><em><strong>Please keep in mind that this is an experiment for study purposes, and we are not encouraging you to scrape Net-a-Porter or any other website, especially the parts behind a login.</strong></em></p><p>Before diving into code, we need to understand what we&#8217;re dealing with. Net-a-Porter&#8217;s architecture has three layers relevant to us:</p><p><strong>Akamai Bot Manager</strong> sits in front of everything. It sets a cluster of tracking cookies (<code>_abck</code>, <code>bm_sz</code>, <code>bm_s</code>, <code>ak_bmsc</code>, and others) that are generated through JavaScript execution on the client side. These cookies serve as evidence that a real browser visited the page. Without them, API calls either fail or hang indefinitely.</p><p><strong>The login API</strong> at <code>/api/nap/wcs/resources/store/nap_il/loginidentity/v2</code> accepts a JSON payload with email and password. On success, it returns a 201 status with an <code>Ubertoken</code> in the response body. This token is the key to all authenticated endpoints.</p><p><strong>Authenticated API endpoints</strong> like the wishlist API at <code>/api/nap/wcs/resources/store/nap_il/wishlist/v2/{id}</code> require both the session cookies and the <code>Ubertoken</code> passed as an <code>x-ubertoken</code> header. They return clean JSON with product details, stock levels, and metadata.</p><h2>The experiment: what worked and what did not</h2><p>We did not arrive at the final solution directly. The investigation path itself reveals the constraints of session handoff, so it is worth walking through each attempt.</p>
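<p>As a reference point for those attempts, the handoff described above boils down to two pieces of state: the cookies and the token. What follows is a minimal sketch, not the code from our repository. It assumes the cookies arrive as the list of dictionaries that Playwright&#8217;s <code>context.cookies()</code> returns and that the token has already been read from the login response. The helper names (<code>cookies_for_domain</code>, <code>build_api_headers</code>, <code>fetch_wishlist</code>), the host in <code>BASE_URL</code>, and the <code>firefox</code> impersonation target are illustrative assumptions.</p>

```python
"""Sketch of a browser-to-HTTP session handoff (illustrative assumptions only)."""

# The path is quoted from the text above; the host is an assumption.
BASE_URL = "https://www.net-a-porter.com"
WISHLIST_PATH = "/api/nap/wcs/resources/store/nap_il/wishlist/v2/{id}"


def cookies_for_domain(browser_cookies, domain_suffix):
    """Flatten Playwright-style cookie dicts into a name-to-value mapping.

    Only cookies set for domain_suffix or one of its subdomains are kept,
    so cookies from unrelated hosts are not replayed to the target.
    """
    jar = {}
    for cookie in browser_cookies:
        domain = cookie.get("domain", "").lstrip(".")
        if domain == domain_suffix or domain.endswith("." + domain_suffix):
            jar[cookie["name"]] = cookie["value"]
    return jar


def build_api_headers(ubertoken, user_agent):
    """Headers for authenticated calls: the token travels as x-ubertoken."""
    return {
        "x-ubertoken": ubertoken,
        "User-Agent": user_agent,  # reuse the UA the browser actually sent
        "Accept": "application/json",
    }


def fetch_wishlist(wishlist_id, cookies, headers, impersonate="firefox"):
    """Replay the session through curl_cffi (third-party, imported lazily).

    The impersonate value is an assumption: use whichever Firefox target
    your installed curl_cffi version documents.
    """
    from curl_cffi import requests  # pip install curl_cffi

    url = BASE_URL + WISHLIST_PATH.format(id=wishlist_id)
    response = requests.get(
        url,
        cookies=cookies,
        headers=headers,
        impersonate=impersonate,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

<p>In use, <code>browser_cookies</code> would come from the browser context right after login, and the User-Agent should be the one the browser actually sent, so that the headers and the TLS fingerprint tell the same story. Filtering by domain is a design choice of this sketch: it keeps the replayed cookie set limited to what the target itself issued.</p>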
      <p>
          <a href="https://substack.thewebscraping.club/p/hybrid-scraping-camoufox-curl-cffi">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Stop Getting Blocked: Upgrade Your Scraping Infrastructure with Dolphin{anty}]]></title><description><![CDATA[My review of Dolphin{anty}. Weighing the pros, cons, and unique capabilities of this anti-detect browser.]]></description><link>https://substack.thewebscraping.club/p/dolphin-anty-product-review</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/dolphin-anty-product-review</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 15 Mar 2026 15:33:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d9486151-c072-4ffa-b126-fe482a216e7e_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The web scraping industry has evolved very fast in recent years. The fact that rotating proxies is no longer enough to guarantee success is a clear sign of how advanced anti-bot systems have become.</p><p>Many tools have emerged to address browser fingerprinting, which is one of the <a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies">primary reasons for blocks even when using high-quality residential proxies</a>. Because companies need stable, scalable data collection, anti-detect solutions have become essential in the current state of the industry.</p><p>In this article, you&#8217;ll discover Dolphin{anty}, a powerful anti-detect browser that lets you orchestrate hundreds of unique, isolated browser profiles. 
You&#8217;ll learn its strengths, why you should consider it for your scraping or multi-accounting projects, and how it works with a practical guide.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What is an Anti-detect Browser?</h2><p>An anti-detect browser is a specialized web browsing tool designed <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">to mask a user&#8217;s digital fingerprint</a>, allowing them to appear as a distinct, unique visitor to websites and tracking systems. Standard browsers like Chrome or Firefox expose a user&#8217;s hardware and software data. An anti-detect browser, instead, enables users to customize and spoof these parameters for every session.</p><p>In web scraping, professionals use this technology to bypass anti-bot measures that rely on browser fingerprinting to identify and block automated traffic. Anti-detect browsers can also be used in &#8220;multi-accounting&#8221; strategies. You can use them to create isolated browser profiles, each with its own unique fingerprint, cookies, and proxy IP. A common use case is a single user managing hundreds of social media, e-commerce, or ad accounts simultaneously without triggering security flags that would normally link the accounts together and lead to mass bans.</p><div><hr></div><blockquote><p><em>A successful data pipeline depends not only on the right tool, but also on the right IP address. 
Proxy providers like <strong>Decodo</strong> help you achieve your scraping goals.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>What is Dolphin{anty} and Why Consider It for Your Web Scraping Projects?</h2><p><a href="https://dolphin-anty.com/">Dolphin{anty}</a> is an anti-detect browser that allows you to manage hundreds of unique, isolated browser profiles for web scraping and multi-accounting. You can use it via its desktop application or programmatically, as it provides a flexible API for deep integration with your scripts.</p><p>The best part of using it is that you can orchestrate large-scale scraping operations without worrying about browser fingerprinting. Forget about immediate IP bans, CAPTCHAs triggered by suspicious metadata, or complex cookie management. Dolphin{anty} handles the masking of your digital identity for you. Also, thanks to its <a href="https://dolphin-anty.com/blog/en/dolphin-anty-has-become-even-more-effective-a-significant-update-to-the-scenarios-capabilities/">built-in &#8220;Scenarios&#8221; builder and synchronizer</a>, it can automatically replicate human-like actions across multiple profiles simultaneously. So, say goodbye to manual warm-up routines and the fear of losing accounts to anti-fraud systems.</p><p>The top reasons why you should consider it for your projects are the following:</p><ul><li><p><strong>Advanced anti-detect capabilities:</strong> If you&#8217;ve been scraping for a while, you know that standard headless browsers often leak metadata that triggers anti-bot defenses. Dolphin{anty} addresses this by providing real, unique digital fingerprints for every profile. 
It mimics user behavior at a granular level, allowing you to bypass sophisticated detection systems without the constant headache of being blocked.</p></li><li><p><strong>Mass profile management:</strong> Managing a few accounts is easy, but scaling to hundreds or thousands is a different beast. Dolphin{anty} is built for scale. It allows you to orchestrate hundreds of isolated browser profiles from a single interface. Whether you are managing a massive farm of accounts for data collection or need to segment your scraping tasks, the tool provides the infrastructure to keep everything organized and efficient.</p></li><li><p><strong>Flexible API integration:</strong> For those who prefer code, Dolphin{anty} offers a robust API that integrates deeply with your existing Python or Node.js pipelines. This allows you to automate profile creation, launch browsers programmatically, and integrate the anti-detect capabilities directly into your custom scraping infrastructure.</p></li></ul><div><hr></div><blockquote><p><em>For your ethical scraping activity, you need IPs with a good reputation. 
For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">which is sharing this offer with TWSC readers</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" height="87.84" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC 
for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><h2>Dolphin{anty}&#8217;s Main Features</h2><p>Dolphin{anty} is packed with features designed to make multi-accounting and scraping easier. The main features you should know about are the following:</p><ul><li><p><strong>Real fingerprint generation:</strong> The core of Dolphin{anty} is its ability to provide genuine device fingerprints. Instead of just blocking trackers, it creates a unique digital identity for every profile you run. In practice, it manages over 20 parameters, from WebRTC to Canvas, so your scrapers closely resemble real users on real devices.</p></li><li><p><strong>Built-in automation:</strong> You don&#8217;t always need to be a coding wizard to automate tasks. Dolphin{anty} offers a &#8220;Scenarios&#8221; builder that lets you create automated workflows visually. Whether it&#8217;s warming up accounts or parsing data, you can set these scripts to run automatically. And for those who prefer code, the flexible API allows you to integrate these profiles directly into your existing scripts.</p></li><li><p><strong>Profile synchronizer:</strong> This is a game-changer if you need to perform the same action across multiple accounts. The Synchronizer allows you to perform an action in a &#8220;master&#8221; profile, and the tool automatically repeats that exact action across all other selected profiles in real time. This saves you a massive amount of time on routine interactions.</p></li><li><p><strong>Team collaboration:</strong> If you work in a team, you know that sharing browser sessions and cookies can be a nightmare. Dolphin{anty} simplifies this by allowing you to transfer profiles, cookies, and proxies to colleagues in just a few clicks. 
You can also manage permissions, ensuring that team members have access only to the functionality they need.</p></li><li><p><strong>Smart profile management:</strong> When you are dealing with hundreds of profiles, organization is key. The tool provides a highly intuitive interface where you can use tags, statuses, and notes to sort and find your profiles instantly. It&#8217;s built to help you navigate a large farm of accounts without getting lost in the chaos.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2><strong>Hands-on Dolphin{anty}: Step-by-step Scraping Tutorial</strong></h2><p>In this section, you will see how easy and fast it is to use Dolphin{anty}. Get ready for the tutorial!</p><h3>Setting Up Dolphin{anty}</h3><p>First of all, you need an account. After <a href="https://dolphin-anty.com/panel/#/auth/registration">creating a new account on Dolphin{anty}</a>, you will be asked to download the software. 
As you can see from the image below, it supports all the major Operating Systems:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qICh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qICh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 424w, https://substackcdn.com/image/fetch/$s_!qICh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 848w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1272w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qICh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png" width="1456" height="605" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:158491,&quot;alt&quot;:&quot;Dolphin Anty supports all major operating systems by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Dolphin Anty supports all major operating systems by Federico Trotta" title="Dolphin Anty supports all major operating systems by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!qICh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 424w, https://substackcdn.com/image/fetch/$s_!qICh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 848w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1272w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Dolphin{anty} supports all major operating systems</figcaption></figure></div><p>Below is how Dolphin{anty}&#8217;s interface appears after you have installed it on your machine:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kn4O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kn4O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 424w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 848w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1272w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png" width="1456" height="762" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:762,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:185539,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kn4O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 424w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 848w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1272w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Dolphin{anty}&#8217;s first interface</figcaption></figure></div><p>Good. Everything is set up. Time to create new profiles!</p><div><hr></div><h3>Create New Profiles</h3><p>Before using Dolphin{anty}, you have to create a new profile. 
To do so, click on <strong>CREATE PROFILE</strong> and fill in the fields:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xU9q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xU9q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 424w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 848w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1272w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xU9q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png" width="1152" height="871" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:871,&quot;width&quot;:1152,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xU9q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 424w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 848w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1272w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Creating new profiles in Dolphin{anty}</figcaption></figure></div><p>Profiles are the core of Dolphin{anty}. This is where, for example, you can change the browser fingerprint used by your anti-detect strategy. To do so, you only need to click on <strong>NEW FINGERPRINT</strong>, and the tool will change all the fingerprint data for you. 
And if the standard fingerprinting is not sufficient, you can manage advanced configurations:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rafe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rafe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 424w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 848w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1272w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rafe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png" width="1166" height="873" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:873,&quot;width&quot;:1166,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184964,&quot;alt&quot;:&quot;Changing fingerprint configuration in Dolphin Anty by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Changing fingerprint configuration in Dolphin Anty by Federico Trotta" title="Changing fingerprint configuration in Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Rafe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 424w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 848w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1272w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Changing fingerprint configuration in Dolphin{anty}</figcaption></figure></div><p>Also, if your use case involves a specific social media platform such as Facebook, you can set Facebook&#8217;s URL as the starting page and enter the credentials for the account you need to manage:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!ONwa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ONwa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 424w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 848w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1272w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ONwa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png" width="1173" height="872" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:872,&quot;width&quot;:1173,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187856,&quot;alt&quot;:&quot;How to set up your social media profile&#8217;s login with Dolphin Anty by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to set up your social media profile&#8217;s login with Dolphin Anty by Federico Trotta" title="How to set up your social media profile&#8217;s login with Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!ONwa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 424w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 848w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1272w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">How to set up your social media profile&#8217;s login with Dolphin{anty}</figcaption></figure></div><p>When everything is set up, click on <strong>SAVE</strong>, and your profile is complete! You are now ready to use Dolphin{anty} via the UI or via code.</p><div><hr></div><h3>Use Dolphin{anty} Via The UI</h3><p>The power of anti-detect browsers lies in letting you create different profiles and then use one browser instance with all of them. So, after you have created the profiles, click on <strong>START</strong> to launch the instances:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!38xz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!38xz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 424w, https://substackcdn.com/image/fetch/$s_!38xz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 848w, https://substackcdn.com/image/fetch/$s_!38xz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1272w, 
https://substackcdn.com/image/fetch/$s_!38xz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!38xz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png" width="1456" height="285" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:285,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124225,&quot;alt&quot;:&quot;How to launch instances with Dolphin Anty by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to launch instances with Dolphin Anty by Federico Trotta" title="How to launch instances with Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!38xz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 424w, https://substackcdn.com/image/fetch/$s_!38xz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 848w, 
https://substackcdn.com/image/fetch/$s_!38xz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1272w, https://substackcdn.com/image/fetch/$s_!38xz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">How to launch instances with Dolphin{anty}</figcaption></figure></div><p>Dolphin{anty} will launch a new browser instance, allowing you to manage as many profiles as you have created and activated. Below is the expected result:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_xVL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_xVL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 424w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 848w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_xVL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_xVL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png" width="1058" height="916" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:916,&quot;width&quot;:1058,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:218925,&quot;alt&quot;:&quot;Launching an instance with two different profiles with Dolphin Anty by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Launching an instance with two different profiles with Dolphin Anty by Federico Trotta" title="Launching an instance with two different profiles with Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!_xVL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 424w, 
https://substackcdn.com/image/fetch/$s_!_xVL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 848w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1272w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Launching an instance with two different profiles with Dolphin{anty}</figcaption></figure></div><p>That&#8217;s it for using Dolphin{anty} via the UI!</p><h3>Use Dolphin{anty} Via Code</h3><p>Before using Dolphin{anty} via code, you have to create an API key. To do so, navigate to the <strong><a href="https://dolphin-anty.com/panel/#/api">API</a></strong><a href="https://dolphin-anty.com/panel/#/api"> panel in the web app</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oWZ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oWZ3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 424w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 848w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1272w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png" width="1456" height="581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224126,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oWZ3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 424w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 848w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1272w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Creating an API key in Dolphin{anty} </figcaption></figure></div><p>Now you can connect to a profile through a port generated at startup and automate the browser using tools like <a href="https://substack.thewebscraping.club/p/improving-performance-puppeteer-scraping">Puppeteer</a>, <a href="https://substack.thewebscraping.club/p/web-scraping-from-0-to-hero-our-first">Playwright</a>, <a href="https://substack.thewebscraping.club/p/selenium-tutorial-course">Selenium</a>, and others.</p><p>Basic automation you can do includes the following:</p><ol><li><p>Start a profile 
via API with DevTools Protocol enabled.</p></li><li><p>Connect to the profile&#8217;s port using a browser tool.</p></li><li><p>Run your own automation script through the open connection.</p></li></ol><p>Dolphin{anty} allows you maximum flexibility, so you can use your favourite programming language. For example, below is how you can write an authorization script:</p><pre><code><code>import requests
api_url = "http://localhost:3001/v1.0/auth/login-with-token"
token = "your-api-key"
request_data = {"token": token}
headers = {"Content-Type": "application/json"}

response = requests.post(api_url, json=request_data, headers=headers)
if response.status_code == 200:
&#9;print("OK", response.json())
else:
&#9;print("Error", response.status_code)</code></code></pre><p>If the response is successful, you will receive a message like the following:</p><pre><code><code>{"success": true}</code></code></pre><p>Discover how to use <a href="https://help.dolphin-anty.com/en/collections/4645237-api">Dolphin{anty} via API by reading the documentation</a>!</p><h2>Pros and Cons of Dolphin{anty}</h2><p>Like any tool, Dolphin{anty} has its strengths and weaknesses. Here is a breakdown of what you need to know before deciding if it fits your stack.</p><p>&#128077; <strong>Pros:</strong></p><ul><li><p><strong>Top-tier fingerprinting:</strong> The ability to generate real, unique fingerprints for every profile is its biggest selling point. It goes beyond simple user-agents, making your scrapers look genuinely human.</p></li><li><p><strong>Built-in automation tools:</strong> The &#8220;Scenarios&#8221; builder and the Synchronizer are massive time-savers. You can automate routine warm-up tasks or replicate actions across dozens of profiles without writing a single line of code.</p></li><li><p><strong>Team-centric design:</strong> If you work with a team, the ability to transfer profiles and share them instantly is invaluable. It removes the friction of sharing session data manually via files or text.</p></li></ul><p>&#128078; <strong>Cons:</strong></p><ul><li><p><strong>REST API complexity:</strong> This is a significant friction point for developers. Unlike other solutions that offer native SDK wrappers, Dolphin{anty} relies only on REST API calls for automation. This adds &#8220;boilerplate&#8221; complexity compared to simply importing a library.</p></li><li><p><strong>Resource-intensive:</strong> Running multiple browser profiles with full fingerprinting requires significant system resources. 
You will need a powerful machine if you plan to run dozens of concurrent sessions locally.</p></li></ul><h2>Conclusion</h2><p>In this article, you discovered Dolphin{anty}, a flexible anti-detect browser that can be used both via UI and via code. As you&#8217;ve learned, it comes packed with interesting features that can speed up your processes. In particular, we found that the &#8220;Scenarios&#8221; feature is the one that actually makes it stand out.</p><p>So, let&#8217;s discuss in the comments: Were you already using Dolphin{anty} before reading this article? What&#8217;s your experience with it?</p>]]></content:encoded></item><item><title><![CDATA[The DMCA Was Built to Stop DVD Piracy. Google Wants to Use It Against Scrapers]]></title><description><![CDATA[How a 12-page complaint is trying to turn every CAPTCHA into a federal copyright perimeter]]></description><link>https://substack.thewebscraping.club/p/google-vs-serpapi-web-scraping-case</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/google-vs-serpapi-web-scraping-case</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Sun, 08 Mar 2026 17:52:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/97060676-c153-4ea6-a3c9-7e70cd1f3c22_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On December 19, 2025, Google filed a lawsuit against SerpApi in the Northern District of California. The case number is 25-10826, and the complaint is 12 pages long. Twelve pages that could reshape how the entire scraping industry operates.</p><p>We are not talking about a cease-and-desist letter or a Terms of Service dispute. Google did not send SerpApi any communication before filing the lawsuit. No cease-and-desist, no attempt to resolve their concerns directly. 
SerpApi told us this was highly unusual, and that had Google reached out, they might have learned that their claims lack merit.</p><p>Google is invoking the Digital Millennium Copyright Act, specifically Section 1201, the anti-circumvention provision. The same statute originally designed to prevent people from cracking DVD encryption is now being pointed at a SERP scraping API.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>We reached out to both Google and SerpApi for comment on this case. Google did not respond. SerpApi did, and we will include their statements throughout this article where relevant.</p><p>Let us break down what happened, why it matters, and what it could mean for anyone who scrapes the web for a living.</p><h2>The Facts</h2><p>Google&#8217;s complaint tells a straightforward story. SerpApi, founded in 2017 by Julien Khaleghy, operates a paid API that sends automated queries to Google Search and returns the results as structured JSON. Google estimates that SerpApi sends hundreds of millions of artificial search requests per day, and that this volume has increased by as much as 25,000% over the past two years.</p><p>In January 2025, Google deployed a technological protection measure called SearchGuard. SearchGuard works by sending JavaScript challenges to incoming search queries. For regular browser users, the challenge is invisible: the browser runs the JavaScript, sends back the expected response, and the search results load normally. For automated systems, the challenge is a wall. Bots that cannot execute JavaScript or that fail behavioral checks get blocked.</p><p>According to Google&#8217;s complaint, SerpApi&#8217;s response to SearchGuard was to build circumvention mechanisms. 
The complaint alleges that SerpApi creates &#8220;fake browsers using a multitude of IP addresses that Google sees as normal users,&#8221; misrepresents device and location information when solving challenges, and syndicates authorization tokens from legitimate requests to unauthorized machines around the world. Google also alleges that SerpApi uses automated means to bypass CAPTCHAs that SearchGuard deploys as a secondary verification layer. SerpApi disputes these factual allegations.</p><p>The complaint cites SerpApi&#8217;s own blog posts, where the company reportedly described SearchGuard as making &#8220;web scraping more difficult&#8221; but claimed to be &#8220;fortunate to be minimally impacted&#8221; because its services had &#8220;already pre-solved Google&#8217;s JavaScript challenge.&#8221;</p><h2>The Legal Theory</h2><p>This is where it gets interesting for the scraping industry, because Google chose not to sue under the Computer Fraud and Abuse Act (CFAA). That would have been the traditional route. Instead, Google went with the DMCA.</p><p>The context matters. The CFAA path has been significantly narrowed by the hiQ Labs v. LinkedIn case. In that landmark decision, the Ninth Circuit held that scraping publicly available data does not violate the CFAA, and warned against allowing companies to create &#8220;information monopolies.&#8221; The Supreme Court vacated and remanded the case under its Van Buren ruling, but on remand, the Ninth Circuit reaffirmed its original position.</p><p>After hiQ, the CFAA is a much weaker weapon against scraping of publicly visible content. Google needed a different legal framework. Section 1201 of the DMCA provides one.</p><p>Section 1201 has two relevant provisions. The first, Section 1201(a)(1)(A), prohibits the act of circumventing a technological measure that effectively controls access to a copyrighted work. The second, Section 1201(a)(2), prohibits trafficking in technology designed to circumvent such measures. 
Google&#8217;s complaint invokes both.</p><p>The argument chain goes like this: Google&#8217;s search results contain copyrighted content, specifically images in Knowledge Panels licensed from third parties, merchant-supplied product images in Google Shopping, and licensed content from Google Maps. SearchGuard is a technological measure that controls access to these search results pages (and therefore to the copyrighted works within them). SerpApi circumvents SearchGuard. Therefore, SerpApi violates Section 1201.</p><p>Each act of circumvention carries statutory damages of between $200 and $2,500. Google alleges billions of individual circumventions. Do the math, and the potential damages exceed what SerpApi could ever pay. Google itself notes in the complaint that SerpApi &#8220;reportedly earns a few million dollars in annual revenue, but already faces liability that is orders of magnitude higher and growing.&#8221;</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>SerpApi&#8217;s Position</h2><p>When we reached out to SerpApi, they were clear about their stance. On the fundamental legality of what they do, SerpApi told us: &#8220;<em>We embrace the term &#8216;scraping,&#8217; and we practice it legally and transparently. SerpApi accesses publicly visible search results, the same ones available to any browser, and delivers clean, structured JSON back to our customers. 
We&#8217;ve operated this way since 2017, serving developers, researchers, and businesses who need reliable access to public information at scale.&#8221;</em></p><p>On the legal boundaries of automated access to search results, their position is equally direct: &#8220;<em>The law on this is clear, and we&#8217;re prepared to defend that position in court. Scraping is legal, and we stand behind our products and customers. Our API replicates real-time searches with no login, no bypass of any paywall, and no access to anything that isn&#8217;t already available to anyone with a browser. U.S. courts have upheld this repeatedly; hiQ Labs v. LinkedIn is a key precedent. The data Google surfaces lives on the open web. Google didn&#8217;t create it.</em>&#8221;</p><p>In February 2026, <a href="https://serpapi.com/blog/google-v-serpapi-motion-to-dismiss-why-were-in-the-right/">SerpApi filed a motion to dismiss</a>. Their arguments include the assertion that the DMCA is a copyright protection statute, not a website protection statute, and that Google is improperly trying to use it to control access to public portions of its website. They also argue that mimicking browser behavior to access publicly available pages is not the same as cracking encryption or disabling authentication, and that any ambiguity in the definition of "circumvention" must be given its narrowest reasonable reading, citing the "First Amendment interest in maintaining accessibility of the Internet as an open forum."</p><p>SerpApi also pointed out what they see as an absurdity in Google&#8217;s theory. If statutory damages were calculated at scale, the total &#8220;would exceed U.S. GDP.&#8221; Congress, they argue, never intended Section 1201 to be used this way.</p><p>On the DMCA claim specifically, SerpApi told us: &#8220;<em>The DMCA&#8217;s anti-circumvention provision was designed to protect copyrighted works, full stop. Google is not protecting access to copyrighted works. 
Google is improperly attempting to use the DMCA to limit access to the public portions of its website. We believe that the law is on our side.</em>&#8221;</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>The Hypocrisy Argument</h2><p>SerpApi is not shy about making this point. <a href="https://serpapi.com/blog/google-v-serpapi-threatening-access-to-public-data/">In a blog post about the lawsuit</a>, they argue that Google&#8217;s case threatens access to public data on the open internet and this resonates widely in the scraping community. As they told us: &#8220;<em>Google indexed the web without anyone&#8217;s permission. That&#8217;s how search works. Now it&#8217;s trying to pull up the ladder behind it, prohibiting the practices that it used, and still uses today, to build its business empire. That&#8217;s why SerpApi is standing up to Google. Not just to protect our business, but to protect legal competition and open access to public information on the internet.</em>&#8221;</p><p>Google Search operates by crawling, indexing, and presenting content from billions of websites. Many of those website owners never explicitly consented to being indexed. Google&#8217;s position has always been that robots.txt provides the mechanism for opting out, and that the default state of the open web is crawlable. Now Google is arguing that its own search results should be exempt from the same logic.</p><p>The irony is not lost on legal commentators either. 
<a href="https://abovethelaw.com/2025/12/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google">Above the Law</a> described the case as Google &#8220;<em>pulling up the ladder after climbing it.</em>&#8221; <a href="https://blog.ericgoldman.org/archives/2026/01/relitigating-hiq-labs-and-scraping-through-the-lens-of-the-dmca-1201-anti-circumvention-guest-blog-post.htm">Eric Goldman&#8217;s blog published an extensive guest analysis</a> arguing that Google&#8217;s DMCA strategy represents an attempt to relitigate hiQ Labs through a different statutory framework.</p><h2>Why This Matters Beyond SerpApi</h2><p>If Google&#8217;s legal theory prevails, the implications extend far beyond one API company. The core question is whether deploying an anti-bot system on a publicly accessible website is enough to invoke federal copyright law against anyone who bypasses it.</p><p>Think about what that means in practice. Every CAPTCHA, every JavaScript challenge, every behavioral analysis system deployed on a public website could potentially become a &#8220;technological protection measure&#8221; under Section 1201. Any scraper that solves a CAPTCHA, executes JavaScript to render a page, or rotates IP addresses to avoid detection could be committing a federal offense.</p><p>This is not hypothetical. The legal theory applies to any website that hosts copyrighted content (which is almost all of them) and deploys some form of bot detection (which is increasingly all of them).</p><p>Eric Goldman&#8217;s blog highlighted this exact concern. 
<a href="https://blog.ericgoldman.org/archives/2026/01/relitigating-hiq-labs-and-scraping-through-the-lens-of-the-dmca-1201-anti-circumvention-guest-blog-post.htm">The guest analysis by Kieran McCarthy</a> warns that accepting Google&#8217;s theory would allow any website deploying anti-bot technology to invoke federal law against circumvention, &#8220;transforming speed bumps and CAPTCHAs into federally enforceable copyright perimeters.&#8221;</p><p>The <a href="https://www.eff.org/">Electronic Frontier Foundation</a> has also weighed in. Staff attorney Tori Noble stated that &#8220;the right to scrape publicly available information keeps the Internet free and open,&#8221; cautioning that overly broad DMCA interpretations undermine innovation and research.</p><p>SerpApi made a similar point when we asked about the impact on consumers: &#8220;<em>Scraping-powered services benefit all kinds of consumers who use the web every day. Scraping helps to maintain the free and open flow of information across the internet, ultimately encouraging things like price transparency, competition, and informed decision-making, all to benefit consumers. Expanding the DMCA as Google has suggested would only benefit the largest tech incumbents and hinder transparency and healthy competition.</em>&#8221;</p><div><hr></div>
<h2>The Emerging Legal Pattern</h2><p>Google&#8217;s lawsuit does not exist in isolation. In October 2025, <a href="https://copyrightalliance.org/wp-content/uploads/2025/10/Reddit-v.-SerpApi.pdf">Reddit filed a 41-page complaint</a> against SerpApi, Perplexity AI, Oxylabs, and AWMProxy in the Southern District of New York. The complaint is far more aggressive than Google&#8217;s, both in tone and in scope: six legal counts including three separate DMCA claims, unfair competition, unjust enrichment, and civil conspiracy.</p><p>Reddit&#8217;s framing is vivid. It describes the defendants as &#8220;similar to would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead.&#8221; AWMProxy is characterized as &#8220;a former Russian botnet.&#8221; Perplexity is compared to &#8220;a North Korean hacker.&#8221; The language is clearly designed to make scrapers look like criminals.</p><p>The underlying theory is similar to Google&#8217;s. Reddit has signed licensing deals with both <a href="https://blog.google/inside-google/company-announcements/expanded-reddit-partnership/">Google</a> and <a href="https://openai.com/index/openai-and-reddit-partnership/">OpenAI</a> to grant them programmatic access to its data. Companies that want Reddit content at scale are expected to pay for it. But when scrapers circumvent SearchGuard to harvest Google&#8217;s search results, they also harvest Reddit content without paying a cent. 
According to data Reddit obtained through a subpoena to Google, the three scraping defendants accessed almost three billion Google SERPs containing Reddit content in just two weeks during July 2025. SerpApi alone accounted for over 1.8 billion of those page accesses. Like Google, Reddit did not send SerpApi any communication before filing suit. SerpApi disputes these figures and the other factual allegations in Reddit&#8217;s complaint, and has filed a motion to dismiss in that case as well.</p><p>Reddit also produced a piece of evidence that reads like a detective novel. It created a hidden &#8220;test post&#8221; that could only be crawled by Google&#8217;s search engine and was not otherwise accessible anywhere on the internet. Within hours, the contents of that post appeared in Perplexity&#8217;s &#8220;answer engine.&#8221; The only way Perplexity could have obtained that content was through scraping Google&#8217;s search results. Reddit calls this technique the equivalent of &#8220;marked bills&#8221; in a bank robbery investigation.</p><p>The Reddit complaint also reveals a detail that connects directly to our industry: after Reddit sent a cease-and-desist letter to Perplexity in May 2024, Perplexity&#8217;s citations to Reddit content did not decrease. They increased forty-fold.</p><p>And in December 2025, in Ziff Davis v. OpenAI, a federal judge in the Southern District of New York ruled that robots.txt files do not &#8220;effectively control access&#8221; under Section 1201. Judge Sidney Stein compared robots.txt to a &#8220;keep off the grass&#8221; sign that &#8220;relies on readers to decide to comply rather than enforcing any kind of access control itself.&#8221; The ruling is important because it sets a baseline: passive, voluntary measures are not enough to trigger DMCA protection.</p><p>But SearchGuard is not robots.txt. 
It is an active system that executes JavaScript, performs behavioral analysis, deploys CAPTCHAs, and makes real-time decisions about whether to grant access. Whether this kind of system meets the &#8220;effectively controls access&#8221; standard is the open legal question. The answer will likely set the direction for the entire industry.</p><p><a href="https://blog.ericgoldman.org/archives/2026/01/relitigating-hiq-labs-and-scraping-through-the-lens-of-the-dmca-1201-anti-circumvention-guest-blog-post.htm">Legal commentators</a> have identified what they call the &#8220;DMCA 1201 scraping strategy&#8221;: platforms deploy technological protection measures specifically to create legal standing under Section 1201, then sue when those measures are circumvented. The sequence is intentional. Deploy, document, sue. Whether courts view this as legitimate copyright protection or as <a href="https://abovethelaw.com/2025/12/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google/">strategic rent-seeking</a> will determine the outcome.</p><p>There is also a relevant doctrinal debate. The Lexmark case in the Sixth Circuit introduced the &#8220;front door/back door&#8221; argument: if a house&#8217;s front door is unlocked, putting a lock on the back door does not mean the house is &#8220;access-controlled.&#8221; Applied here: if anyone with a regular browser can access Google Search results, does deploying SearchGuard against automated systems meaningfully &#8220;control access&#8221; to the copyrighted works within those results?</p><h2>The AI Angle</h2><p>There is one more layer worth noting. <a href="https://searchengineland.com/openai-chatgpt-serpapi-google-search-results-461226">As Search Engine Land reported</a>, OpenAI used SerpApi to scrape Google Search results for ChatGPT responses on current events, after Google declined to provide direct access to its search index. 
SerpApi listed OpenAI as a customer on its website as recently as May 2024 before removing the listing. Other reported customers include Meta, Apple, and Perplexity.</p><p>This context matters because Google already has a massive structural advantage in the AI race when it comes to fresh web data. <a href="https://finance.yahoo.com/news/google-huge-edge-over-openai-110102636.html">Cloudflare CEO Matthew Prince put numbers on it</a>: &#8220;For every one page that OpenAI sees, Google is seeing 3.2 pages.&#8221; Against Microsoft, the ratio is 4.8 to 1. The reason is simple. Publishers cannot block Googlebot without disappearing from search results. So Google gets access to the web at a scale that no competitor can match, and it can use that data not just for search but also for training and running its AI products.</p><p>In this context, suing companies that make it easier for competitors to scrape Google&#8217;s search results is not just about protecting copyrighted images in Knowledge Panels. It is also an act of defense of a competitive advantage. If OpenAI or any other AI company can get structured search data through SerpApi, they partially close the gap that Google&#8217;s crawler monopoly creates. Shutting down that channel through litigation serves Google&#8217;s position in the AI race, even if the complaint is framed purely in terms of copyright protection.</p><h2>What Happens Next</h2><p>The case is still in its early stages. <a href="https://ppc.land/serpapi-files-motion-to-dismiss-googles-dmca-scraping-lawsuit/">SerpApi filed its motion to dismiss</a> on February 20, 2026. 
<a href="https://www.courtlistener.com/docket/72059948/google-llc-v-serpapi-llc/">According to the court docket</a>, the initial case management conference before Judge Yvonne Gonzalez Rogers is scheduled for March 30, 2026, and a hearing on the motion to dismiss is set for May 19, 2026.</p><p>If the motion to dismiss fails and the case proceeds to discovery and trial, it will force courts to answer questions that have been left open since hiQ. Is a JavaScript challenge a &#8220;technological protection measure&#8221; under the DMCA? Can anti-bot systems on publicly accessible websites invoke federal anti-circumvention law? Does the DMCA protect the act of accessing a public webpage, or only the copyrighted works behind genuine access controls like encryption and authentication?</p><p>For the scraping industry, the stakes are high. A ruling in Google&#8217;s favor would give any website with copyrighted content and a bot-detection system a federal cause of action against scrapers. A ruling in SerpApi&#8217;s favor would confirm that the DMCA was not designed to protect public webpages from automated access, regardless of the technical measures deployed.</p><p>We will follow the case closely. Whatever happens, the days of operating in a legal gray area are coming to an end. The courts will have to draw a line, and that line will define the rules for the next decade of web scraping.</p><p>*<em>Disclaimer: We are not lawyers. This article represents our analysis of publicly available court filings and legal commentary. 
Consult legal counsel for advice specific to your situation.</em>*</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #99: HTTP Caching for Web Scraping]]></title><description><![CDATA[How Conditional Requests Can Cut Your Proxy Bill, using HTTP caching.]]></description><link>https://substack.thewebscraping.club/p/http-caching-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/http-caching-scraping</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 05 Mar 2026 15:18:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5c39bf0e-6c50-4c30-bb29-fe68b7b616d5_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the biggest cost drivers in recurring scraping operations is fetching pages daily, or even several times a day, especially when proxies are involved, only to discover that they have not changed since the last run. <br>In price monitoring applications this is fairly common: if you are monitoring prices every hour across 50,000 product pages, most of them very probably still show the same price they showed an hour ago. You are paying your proxy provider for bandwidth that carries identical data, over and over.</p><p>The scraping industry is well aware of this problem. A <a href="https://scrapeops.io/blog/scraping-shock/">recent analysis by ScrapeOps</a> found that even though proxy prices have dropped by 67% over the past five years, the cost per successful payload has actually increased by 133%, mostly because anti-bot defenses now require heavier infrastructure. 
When each request costs more, wasting them on unchanged pages hurts even more.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Several approaches try to solve this. Tools like <a href="http://changedetection.io">changedetection.io</a> monitor pages for visual or structural changes and alert you when something is different. 
On the more technical side, <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Altay Akkus&quot;,&quot;id&quot;:272178059,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13f918be-3a3b-4cc1-b442-cd912cb5efbe_144x144.png&quot;,&quot;uuid&quot;:&quot;dcac9616-c6b5-4389-950b-847aa67e589d&quot;}" data-component-name="MentionToDOM">Altay Akkus</span> <a href="https://altayakkus.substack.com/p/partial-content-web-crawling-using">recently explored</a> using SimHash as a client-side fingerprint to determine whether a document has changed since the last crawl, without downloading the full body. These are valid strategies, but they all share one trait: they require you to build and maintain the change detection logic yourself.</p><p>What you might not know is that the HTTP protocol already has a native mechanism for this, and it has been part of the spec since 1999. It is called conditional requests, and it lets the server itself tell your scraper &#8220;nothing has changed&#8221; by responding with a 304 status and zero bytes of body. No diffing, no hashing, no client-side state management beyond storing a single header value.</p><p>We have written about proxy cost optimization before in articles like <a href="https://substack.thewebscraping.club/p/optimizing-proxy-costs">Optimizing Proxy Usage for Large-Scale Scraping</a> and <a href="https://substack.thewebscraping.club/p/analyzing-cost-web-scraping">Analyzing the Cost of a Web Scraping Project</a>, but we have never covered this technique. In this article, we will test it against real e-commerce sites and measure exactly how much bandwidth and money it can save.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try 
Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>How HTTP caching works (the short version)</h2><p>When a web server responds to a request, it can include headers that describe the freshness and identity of the content. Two of these headers are relevant for our purposes.</p><p><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/ETag">The first is </a><code>ETag</code><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/ETag">, short for Entity Tag</a>. It is a string that uniquely identifies a specific version of a resource. Think of it as a fingerprint of the page content. When the content changes, the ETag changes.</p><p><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Last-Modified">The second is </a><code>Last-Modified</code>, a timestamp indicating when the resource was last updated.</p><p>These two headers enable what HTTP calls conditional requests. The idea is simple. After your first request, you store the ETag (or the Last-Modified value) returned by the server. On the next request to the same URL, you send it back using the <code>If-None-Match</code> header (for ETags) or <code>If-Modified-Since</code> (for timestamps). The server compares your stored value with the current one. If they match, the server responds with status 304 Not Modified and an empty body. If they do not match, you get a regular 200 response with the fresh content.</p><p>A 304 response contains zero bytes of body. For a proxy billed per GB, that is a request that costs almost nothing in bandwidth.</p><h2>The tools we used</h2><p>The HTTP caching technique itself is protocol-level and works with any HTTP client that allows setting custom headers. 
You could implement it with Python&#8217;s <code>requests</code>, <code>httpx</code>, or even raw <code>curl</code>.</p><p>For this article, we used <a href="https://github.com/lexiforest/curl_cffi">curl_cffi</a>, a Python HTTP client built on top of curl-impersonate. Its main strength for our purposes is TLS fingerprinting: it can impersonate the TLS handshake of real browsers (Chrome, Firefox, Safari), which prevents e-commerce sites from blocking the request before we even get to test caching behavior. Without TLS fingerprinting, some of the e-commerce targets we wanted to test would have returned 403 immediately, making it impossible to evaluate their caching support.</p><p>Later in the article, we&#8217;ll see whether the same approach works with Scrapy.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The audit methodology</h2><p>Before attempting conditional requests, we need to check whether a target supports them. We wrote a simple audit function that makes two requests to any URL.</p><p>The first request is a standard GET. We capture the <code>ETag</code>, <code>Last-Modified</code>, and <code>Cache-Control</code> headers from the response, along with the response body size.</p><p>If an ETag or Last-Modified header is present, we make a second request with the corresponding conditional header (<code>If-None-Match</code> or <code>If-Modified-Since</code>). 
If the server responds with 304, the site supports conditional requests and we measure the bandwidth saving.<br></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;38438c17-6240-4fb7-b820-bcab5f5bf7d7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import time
from curl_cffi import requests


def audit_caching(url: str) -&gt; dict:
    """Fetch url once, then replay its validators to check for a 304 reply."""
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }

    # First request: a plain GET that impersonates Chrome's TLS fingerprint.
    resp = requests.get(url, headers=headers, impersonate="chrome", timeout=30)

    # Header names are case-insensitive, so normalize them before the lookups.
    resp_headers = {k.lower(): v for k, v in resp.headers.items()}
    etag = resp_headers.get("etag")
    last_modified = resp_headers.get("last-modified")
    cache_control = resp_headers.get("cache-control")
    response_size = len(resp.content)

    result = {
        "url": url,
        "status": resp.status_code,
        "etag": etag,
        "last_modified": last_modified,
        "cache_control": cache_control,
        "response_size_bytes": response_size,
        "supports_304": False,
    }

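    # Second request: only possible if the server exposed at least one validator.
    if etag or last_modified:
        time.sleep(2)  # brief pause between the two requests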

        cond_headers = dict(headers)
        # Both validators may be sent; a compliant server gives If-None-Match precedence.
        if etag:
            cond_headers["If-None-Match"] = etag
        if last_modified:
            cond_headers["If-Modified-Since"] = last_modified

        cond_resp = requests.get(
            url, headers=cond_headers, impersonate="chrome", timeout=30
        )

        # A 304 status here means the server honored the conditional request.
        result["conditional_status"] = cond_resp.status_code
        result["conditional_size_bytes"] = len(cond_resp.content)
        result["supports_304"] = cond_resp.status_code == 304

    return result</code></pre></div><p><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">99.CONDITIONAL_SCRAPING</a>.</strong></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Shopify stores: full conditional request support</h2><p>We focused our testing on Shopify stores because, while working on various scraping projects, we came across several Shopify-hosted sites that had this caching system enabled. Shopify powers hundreds of thousands of online stores and is one of the most common scraping targets in e-commerce, so the finding felt worth investigating systematically. The results were clear: Shopify stores with the native page cache enabled support conditional requests out of the box.</p><p>Allbirds, Kylie Cosmetics, and Brooklinen all returned 304 responses consistently. Here is what we measured on Allbirds:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;df447db3-cd85-4495-a2ce-1e3336e6b09e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">URL: https://www.allbirds.com/products/mens-tree-runners.json
Status: 200
Response size: 7,961 bytes

Caching headers:
  ETag: "page_cache:11044168:ProductDetailsController:de822deb7906aa6f9932541f4fe3dae9"
  Last-Modified: not present
  Cache-Control: not present

Conditional request support:
  304 Not Modified: YES
  Conditional response size: 0 bytes
  Bandwidth saving: 100.0%</code></pre></div><p>The saving is 100% because the 304 response body contains exactly zero bytes. The only cost is the request/response headers, which are a few hundred bytes.</p><p>This behavior was consistent across three types of Shopify endpoints. The Product HTML page is the standard storefront URL that a browser would load (e.g. <code>/products/mens-tree-runners</code>), which includes the full rendered page with images, reviews, and theme assets. The Product JSON endpoint is the same URL with <code>.json</code> appended (e.g. <code>/products/mens-tree-runners.json</code>), which returns only the structured product data: variants, prices, inventory, and metadata. The Catalog JSON endpoint (<code>/products.json</code>) returns the first page of the store&#8217;s entire product catalog in a single response.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sMAm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sMAm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 424w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 848w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sMAm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sMAm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png" width="914" height="159" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:159,&quot;width&quot;:914,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39412,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/189924926?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sMAm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 424w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 848w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 
1272w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We ran repeated conditional requests on each endpoint and confirmed that all returned 304 consistently. The ETag stayed stable as long as the product data did not change.</p>
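<p>To make the flow concrete, here is a minimal sketch of how a scraper could store ETags and replay them on later runs. It is an illustration under our own assumptions, not the code from the repository: <code>ConditionalFetcher</code>, <code>transport</code>, and <code>fake_transport</code> are hypothetical names, the validator store is a plain in-memory dictionary, and the transport is injected so the example runs without network access. In a real run, the transport would wrap an HTTP client such as curl_cffi and the store would be persisted between runs.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ConditionalFetcher:
    """Replays stored ETags so unchanged pages come back as 304 (sketch)."""

    # transport(url, headers) must return (status, response_headers, body_bytes).
    # Hypothetical interface, injected so the sketch needs no network access.
    transport: Callable
    etags: dict = field(default_factory=dict)   # url mapped to last seen ETag
    bodies: dict = field(default_factory=dict)  # url mapped to last body
    bytes_downloaded: int = 0                   # body bytes only, not headers

    def fetch(self, url):
        headers = {}
        if url in self.etags:
            # Send the stored validator back to the server.
            headers["If-None-Match"] = self.etags[url]

        status, resp_headers, body = self.transport(url, headers)
        self.bytes_downloaded += len(body)

        if status == 304:
            # Nothing changed: reuse the copy we already hold.
            return self.bodies[url], False
        if status != 200:
            raise RuntimeError(f"unexpected status {status} for {url}")

        # Fresh content: remember the body and the new validator, if any.
        lowered = {k.lower(): v for k, v in resp_headers.items()}
        self.bodies[url] = body
        if "etag" in lowered:
            self.etags[url] = lowered["etag"]
        else:
            self.etags.pop(url, None)
        return body, True


def fake_transport(url, headers):
    """Stand-in server (hypothetical) that serves one fixed ETag."""
    current_etag = '"v1"'
    if headers.get("If-None-Match") == current_etag:
        return 304, {"ETag": current_etag}, b""
    return 200, {"ETag": current_etag}, b'{"price": "98.00"}'


fetcher = ConditionalFetcher(transport=fake_transport)
first_body, first_changed = fetcher.fetch("https://example.com/products/a.json")
second_body, second_changed = fetcher.fetch("https://example.com/products/a.json")</code></pre></div><p>With the stand-in transport, the first call downloads the body and reports a change, while the second call sends <code>If-None-Match</code>, receives a 304, and returns the stored copy without adding to <code>bytes_downloaded</code>. The counter tracks body bytes only; request and response headers still travel on every call.</p>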
      <p>
          <a href="https://substack.thewebscraping.club/p/http-caching-scraping">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Kadoa: Simplify Your Scraping Workflows with Automation and AI]]></title><description><![CDATA[My review of Kadoa: An AI-powered tool that lets you create scraping workflows in minutes]]></description><link>https://substack.thewebscraping.club/p/kadoa-review-ai-powered-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/kadoa-review-ai-powered-scraping</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 01 Mar 2026 12:34:31 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ad9731e7-7825-4d82-afea-27d4bd727905_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The web scraping industry has evolved very quickly in recent years. The fact that <a href="https://substack.thewebscraping.club/p/the-evolving-career-scrapers">web scraping professionals needed to pivot their careers from scripts to agents</a> is only one sign of how resilient this industry is. The industry has changed not only because of AI, which is relatively recent, but also because of developments in infrastructure, bot detection, and more.</p><p>A wealth of tools and libraries for the main programming languages has driven web scraping to significant growth, and the demand companies have for data is what sustains that growth.</p><p>In this article, I&#8217;ll talk about Kadoa: A tool that lets you create resilient scraping workflows in minutes. 
I&#8217;ll show you its strengths, why you should consider it, and how it works, with a practical guide.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What is Kadoa?</h2><p><a href="https://www.kadoa.com/">Kadoa</a> is a web scraping tool that automatically and programmatically extracts web data at scale. 
You can use it either via the UI or via code, since it offers SDKs and REST APIs.</p><p>The best part is that you can just paste the target URL and the tool retrieves the data for you. Forget about anti-bot measures, fingerprinting issues, or proxy management: Kadoa handles all of that for you. Also, thanks to its AI engine, it can automatically recognize the structure of the data you want to scrape from a target website. So you can also say goodbye to <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">CSS selectors and any other strategy you use to go beyond the DOM using LLMs</a>.</p><div><hr></div><blockquote><p><em>Using the right tool is just the first step toward a successful data extraction pipeline. Having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>Why Consider Kadoa for Your Web Scraping Projects?</h2><p>The top reasons to consider Kadoa are the following:</p><ul><li><p><strong>Scrape via workflows</strong>: Kadoa&#8217;s UI is built to help you set up scraping workflows step by step. Insert your target URL(s), define the data schema (or let AI do the work for you), and choose whether to scrape all the available pages or to stay on a single page and watch the agent work for you.</p></li><li><p><strong>Write code only if you need it</strong>: In addition to the UI, Kadoa provides Python and JavaScript SDKs and a wide set of REST APIs you can call. This allows you to create workflows via the UI and then manage and call them via code if you need to.</p></li><li><p><strong>Integrated data quality management</strong>: Before you start scraping your target data, Kadoa allows you to manage data quality. 
In practice, you can set your own data quality rules or manage the rules that its AI agent provides.</p></li><li><p><strong>Easy proxy management</strong>: If you&#8217;ve been scraping for a while, you know that your chances of successfully scraping most of the content you need without proxies are low. Using proxies is not a big issue if you are used to them and already have a favourite provider. Still, Kadoa simplifies proxy management: it provides a list of countries you can choose from and, under the hood, manages everything needed to integrate proxies into your workflow.</p></li><li><p><strong>Scheduling feature</strong>: There are cases where you need to scrape the same target data from time to time. Or you may want to be notified when data on a target page has changed. Kadoa provides both features. You can schedule your workflow to run at precise time intervals, and you can choose among different notifications, one of which alerts you when data has changed.</p></li></ul><h2>Kadoa&#8217;s Main Features</h2><p>Below is a list of Kadoa&#8217;s top features to help you better understand its potential:</p><ul><li><p><strong>Simple and intuitive UI</strong>: Kadoa&#8217;s UI allows you to create workflows in minutes. Every scraping workflow is subdivided into steps, each presented on its own screen. In a few minutes, you can define your preferred setup, insert the target page(s), and leave it scraping for you.</p></li><li><p><strong>Chrome extension</strong>: In addition to the UI, <a href="https://www.kadoa.com/chrome-extension">Kadoa provides you with a Chrome extension</a>. 
If you are a Chrome user, this feature allows you to define everything you need directly on the target page, then trigger the workflow to let Kadoa&#8217;s agent start scraping.</p></li><li><p><strong>Code integrations</strong>: If you are a developer, or if you simply need to invoke your workflows via code, Kadoa offers two options. It provides <a href="https://github.com/kadoa-org/kadoa-sdks">Python and JavaScript SDKs</a> in an open-source repository, so that you can use custom code to invoke your scrapers. If you prefer <a href="https://docs.kadoa.com/api-reference/introduction">REST APIs, Kadoa provides you with several endpoints</a>.</p></li><li><p><strong>Scraping suitable for structured or unstructured data</strong>: One of the difficult aspects of manually scraping websites is working out how to grab unstructured data. This is a typical use case where you could <a href="https://substack.thewebscraping.club/p/detect-pattern-scraped-data-with-ai">use AI to detect patterns in data in your scraping projects</a>. The good news is that you don&#8217;t need to come up with imaginative solutions: Kadoa automatically retrieves unstructured data for you thanks to its AI engine.</p></li><li><p><strong>Data schema definition</strong>: The tool lets you define recurring data structures. This can be helpful when you retrieve similar data from different websites: in such cases, if you let the AI engine define the data structure automatically, you could lose consistency across similar data.</p></li><li><p><strong>Proxy and anti-detection features</strong>: Forget about anti-bot measures and proxy management. Kadoa manages anti-bot solutions under the hood. 
It also provides a predefined list of locations you can choose from, and it automatically sets proxies that match your choice.</p></li><li><p><strong>Error handling</strong>: Kadoa provides advanced error handling. Common cases are when the target site goes offline, is under maintenance, or encounters a technical issue. When this happens, Kadoa detects the problem, notifies you, and automatically retries the data extraction. If recovery still fails, its support team is notified and investigates.</p></li><li><p><strong>Integration capabilities</strong>: The software integrates with several third-party tools. An interesting one is the <a href="https://n8n.io/integrations/kadoa/">integration between n8n and Kadoa</a>, which lets you take your scraping automation workflows a step further.</p></li><li><p><strong>Pricing model and usage graphs</strong>: Kadoa offers a <a href="https://www.kadoa.com/pricing">free tier option</a> that includes 500 credits. Its pricing model is based on credit consumption, and a dedicated UI section shows a graph of your consumption.</p></li><li><p><strong>Extensive docs</strong>: <a href="https://docs.kadoa.com/docs/introduction">Kadoa has extensive documentation</a> that covers both UI and API usage.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Hands-on Kadoa: Step-by-step Scraping Tutorial</h3><p>In this section, I&#8217;ll show you how to use Kadoa on an actual scraping task via the UI. 
The workflow will retrieve <a href="https://finance.yahoo.com/quote/INTC/history/?period1=1737538396&amp;period2=1769074385&amp;frequency=1wk">Intel&#8217;s historical price from Yahoo Finance</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XhPg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XhPg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 424w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 848w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1272w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XhPg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png" width="1106" height="686" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1106,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138487,&quot;alt&quot;:&quot;Intel historical stock price data, image from their website taken by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Intel historical stock price data, image from their website taken by Federico Trotta" title="Intel historical stock price data, image from their website taken by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!XhPg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 424w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 848w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XhPg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Intel historical stock price data</figcaption></figure></div><p>In this scraping workflow, I will:</p><ul><li><p>Set the target web page.</p></li><li><p>Define the data schema.</p></li><li><p>Set scheduling options and notifications.</p></li><li><p>Retrieve the actual data.</p></li></ul><p>Before starting the actual workflow, log in to Kadoa. 
Below is the first access page you will see:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!taKn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!taKn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 424w, https://substackcdn.com/image/fetch/$s_!taKn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 848w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1272w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!taKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png" width="1456" height="705" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:705,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:166980,&quot;alt&quot;:&quot;Kadoa's first access page by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa's first access page by Federico Trotta" title="Kadoa's first access page by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!taKn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 424w, https://substackcdn.com/image/fetch/$s_!taKn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 848w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1272w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kadoa's first access page</figcaption></figure></div><p>Perfect! You are now ready to create your first scraping workflow with Kadoa.</p><h3>Step #1: Create a New Workflow</h3><p>From the main page, click on <strong>Add workflow</strong> to create a new one and paste the target URL. The <strong>Proxy location</strong> box allows you to select the country where the proxies are located; leave it on <strong>AUTO</strong> to let the tool manage it automatically. 
Click on <strong>Continue</strong> to proceed with the next step:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nd3l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nd3l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 424w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 848w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1272w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png" width="1456" height="713" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:713,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240236,&quot;alt&quot;:&quot;A new workflow in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A new workflow in Kadoa by Federico Trotta" title="A new workflow in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Nd3l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 424w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 848w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1272w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A new workflow in Kadoa</figcaption></figure></div><p>Note that the <strong>Enter one or more URLs</strong> box is where you insert the target page. If you have more than one target page, you can insert all of them there.</p><p>Alright, you&#8217;ve created a new workflow in Kadoa. 
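</p><p>If you need to paste several tickers or date ranges into that box, you can generate the target URLs instead of copying them by hand. Below is a minimal Python sketch, assuming that the <code>period1</code> and <code>period2</code> query parameters of the Yahoo Finance link used in this tutorial are Unix timestamps in seconds (this is what their values look like, but it is my assumption). The <code>build_history_url</code> helper is hypothetical and written only for this illustration; it is not part of Kadoa or of any Yahoo Finance API.</p>

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Base of the history page used in this tutorial; the ticker is filled in later.
BASE_URL = "https://finance.yahoo.com/quote/{ticker}/history/"


def build_history_url(ticker: str, start: datetime, end: datetime,
                      frequency: str = "1wk") -> str:
    """Hypothetical helper: build a history-page URL for a ticker and date range.

    Assumption: period1/period2 are Unix timestamps in seconds.
    """
    query = urlencode({
        "period1": int(start.timestamp()),
        "period2": int(end.timestamp()),
        "frequency": frequency,
    })
    return BASE_URL.format(ticker=ticker) + "?" + query


# The two timestamps below are the ones found in the tutorial's target URL.
start = datetime.fromtimestamp(1737538396, tz=timezone.utc)
end = datetime.fromtimestamp(1769074385, tz=timezone.utc)

url = build_history_url("INTC", start, end)
print(url)
```

<p>With the two timestamps taken from the tutorial link, the helper rebuilds the same target URL. Changing the ticker or the dates gives you a list of URLs to paste into the workflow.</p><p>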
Let&#8217;s proceed with the next step and customize it!</p><h3>Step #2: Define the Data Schema</h3><p>As the next step, define the data schema:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ib01!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ib01!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 424w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 848w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1272w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ib01!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png" width="1456" height="681" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8febc847-1387-489a-815c-cb8fa342897e_1840x861.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:681,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:200560,&quot;alt&quot;:&quot;Define the data schema in a Kadoa workflow by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Define the data schema in a Kadoa workflow by Federico Trotta" title="Define the data schema in a Kadoa workflow by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Ib01!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 424w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 848w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1272w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Define the data schema in a Kadoa workflow</figcaption></figure></div><p>If you want to insert the schema manually, Kadoa already provides you with some predefined schemas. For this tutorial, I&#8217;ve chosen to let AI do the job. So I selected <strong>AI Suggest Fields</strong>.</p><p>The system, then, asks you how you want to navigate the data. 
For this example, I chose to scrape only the current page of the target site, but you can choose among three different navigation options:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2REz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2REz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 424w, https://substackcdn.com/image/fetch/$s_!2REz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 848w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1272w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2REz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png" width="1456" height="785" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:318930,&quot;alt&quot;:&quot;Scraping data on a single page in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scraping data on a single page in Kadoa by Federico Trotta" title="Scraping data on a single page in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!2REz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 424w, https://substackcdn.com/image/fetch/$s_!2REz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 848w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1272w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Scraping data on a single page in Kadoa</figcaption></figure></div><p>After clicking on <strong>Continue</strong>, the agent will start doing its job:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!05pv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!05pv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 424w, https://substackcdn.com/image/fetch/$s_!05pv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 848w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1272w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!05pv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png" width="1456" height="702" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:702,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183128,&quot;alt&quot;:&quot;Kadoa&#8217;s AI agent working by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s AI agent working by Federico Trotta" title="Kadoa&#8217;s AI agent working by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!05pv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 424w, https://substackcdn.com/image/fetch/$s_!05pv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 848w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1272w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 
6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kadoa&#8217;s AI agent working</figcaption></figure></div><h3>Step #3: Review Extracted Fields and Schedule the Workflow</h3><p>Because I let AI work, the agent automatically tries to extract the data from the target page. 
But before proceeding, Kadoa asks for your review:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uYWt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uYWt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 424w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 848w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 1272w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uYWt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png" width="1456" height="783" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:783,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:305054,&quot;alt&quot;:&quot;The proposed extraction data schema by Kadoa&#8217;s AI agent by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The proposed extraction data schema by Kadoa&#8217;s AI agent by Federico Trotta" title="The proposed extraction data schema by Kadoa&#8217;s AI agent by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!uYWt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 424w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 848w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 1272w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The proposed extraction data schema by Kadoa&#8217;s AI agent</figcaption></figure></div><p>As you can see from the previous image, the agent has correctly detected the data to extract from the target page. 
The tool also shows a screenshot of the data it will extract, so you can verify the fields visually.</p><p>In the next step, define the scheduling:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qxMP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qxMP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 424w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 848w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1272w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qxMP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png" width="1456" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:260383,&quot;alt&quot;:&quot;Scheduling workflows in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scheduling workflows in Kadoa by Federico Trotta" title="Scheduling workflows in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!qxMP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 424w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 848w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1272w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Scheduling workflows in Kadoa</figcaption></figure></div><p>For the sake of this example, I decided to run the workflow only once. 
But, as you can see, you can choose among several scheduling options.</p><h3>Step #4: Set Notifications and Final Details</h3><p>As the next step, define the way you want to be notified:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CuRd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CuRd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 424w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 848w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1272w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CuRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png" width="1456" height="772" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:772,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:288124,&quot;alt&quot;:&quot;Setting up your Kadoa&#8217;s workflow latest details by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Setting up your Kadoa&#8217;s workflow latest details by Federico Trotta" title="Setting up your Kadoa&#8217;s workflow latest details by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!CuRd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 424w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 848w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1272w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Setting up notifications in Kadoa</figcaption></figure></div><p>In this case, I decided to be notified via email if the workflow fails. 
You can add different notification channels by clicking on <strong>Add channel</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qp2I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qp2I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 424w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 848w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1272w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png" width="856" height="471" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e10a8400-3f90-4793-832d-01537d8ef16b_856x471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:471,&quot;width&quot;:856,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45084,&quot;alt&quot;:&quot;Adding notification channels in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Adding notification channels in Kadoa by Federico Trotta" title="Adding notification channels in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Qp2I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 424w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 848w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1272w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Adding notification channels in Kadoa</figcaption></figure></div><p>Next, define the final details of your scraping workflow:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zCk9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!zCk9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 424w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 848w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1272w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zCk9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png" width="1456" height="674" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:674,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:261303,&quot;alt&quot;:&quot;Define your workflow&#8217;s latest details by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Define your workflow&#8217;s latest details by Federico Trotta" title="Define your workflow&#8217;s latest details by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!zCk9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 424w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 848w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1272w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Define your workflow&#8217;s final details</figcaption></figure></div><p>Before the actual scraping starts, the system asks you to either approve the sample data it proposes or review the data quality rules:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hpCi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hpCi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 424w, 
https://substackcdn.com/image/fetch/$s_!hpCi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 848w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1272w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hpCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png" width="1456" height="678" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:229183,&quot;alt&quot;:&quot;Decide whether to review rules or not by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Decide whether to review rules or not by Federico Trotta" title="Decide whether to review rules or not by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!hpCi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 424w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 848w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1272w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Decide whether to review the data quality rules</figcaption></figure></div><p>When you click <strong>Review rules</strong>, the tool suggests automated data quality rules. Select the ones you think will improve the quality of the scraping results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1tPf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1tPf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 424w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 848w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1272w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1tPf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png" width="1456" height="685" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:328419,&quot;alt&quot;:&quot;Reviewing data quality rules in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reviewing data quality rules in Kadoa by Federico Trotta" title="Reviewing data quality rules in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!1tPf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 424w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 848w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1tPf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Reviewing data quality rules in Kadoa</figcaption></figure></div><p>When you are done reviewing the quality rules, click <strong>Approve</strong>. 
The actual scraping workflow is then launched and queued:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cov2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cov2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 424w, https://substackcdn.com/image/fetch/$s_!cov2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 848w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1272w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cov2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png" width="1456" height="624" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161278,&quot;alt&quot;:&quot;New Kadoa&#8217;s workflow queued by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="New Kadoa&#8217;s workflow queued by Federico Trotta" title="New Kadoa&#8217;s workflow queued by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!cov2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 424w, https://substackcdn.com/image/fetch/$s_!cov2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 848w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1272w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">New Kadoa workflow queued</figcaption></figure></div><p>Et voil&#224;! 
You have launched your first scraping workflow with Kadoa.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Download Data, See Logs and Statistics in Kadoa</h3><p>The <strong>workflow</strong> section reports all the workflows you created, their status, and the token consumption for each scraper:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GQl0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GQl0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 424w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 848w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GQl0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GQl0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png" width="1456" height="631" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:631,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176411,&quot;alt&quot;:&quot;Kadoa&#8217;s workflows summary and statistics by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s workflows summary and statistics by Federico Trotta" title="Kadoa&#8217;s workflows summary and statistics by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!GQl0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 424w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 848w, 
https://substackcdn.com/image/fetch/$s_!GQl0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1272w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Kadoa&#8217;s workflows summary and statistics</figcaption></figure></div><p>Click a workflow to see 
the data it retrieved and choose the format in which to download it:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QgG1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QgG1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 424w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 848w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1272w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QgG1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png" width="1456" height="724" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:724,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233419,&quot;alt&quot;:&quot;Visualizing and retrieving scraped data in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Visualizing and retrieving scraped data in Kadoa by Federico Trotta" title="Visualizing and retrieving scraped data in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!QgG1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 424w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 848w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1272w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Visualizing and retrieving scraped data in Kadoa</figcaption></figure></div><p>The <strong>Activity log</strong> page keeps a detailed log of every action performed on your workflows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UduM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UduM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 424w, https://substackcdn.com/image/fetch/$s_!UduM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 848w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1272w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UduM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png" width="1456" height="686" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178535,&quot;alt&quot;:&quot;Kadoa&#8217;s logs page by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s logs page by Federico Trotta" title="Kadoa&#8217;s logs page by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!UduM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 424w, https://substackcdn.com/image/fetch/$s_!UduM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 848w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1272w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kadoa&#8217;s logs page</figcaption></figure></div><p>The <strong>Usage</strong> page charts trends in active workflows and in the number of rows extracted per workflow, and shows the total tokens remaining on your plan:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!snXw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!snXw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 424w, 
https://substackcdn.com/image/fetch/$s_!snXw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 848w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1272w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!snXw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png" width="1456" height="843" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:843,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:139493,&quot;alt&quot;:&quot;Kadoa&#8217;s tokens usage page by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s tokens usage page by Federico Trotta" title="Kadoa&#8217;s tokens usage page by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!snXw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 424w, https://substackcdn.com/image/fetch/$s_!snXw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 848w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1272w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kadoa&#8217;s token usage page</figcaption></figure></div><div><hr></div><h2>Manage Kadoa&#8217;s Workflows via APIs</h2><p>As mentioned earlier, <a href="https://docs.kadoa.com/api-reference/introduction">Kadoa exposes several REST API endpoints</a>. The APIs also let you perform actions that are not tied to workflows you have already created. 
For example, you can start <a href="https://docs.kadoa.com/api-reference/crawling/start-crawling-session">crawling sessions</a> and <a href="https://docs.kadoa.com/api-reference/schemas/create-schema">create data schemas</a>.</p><p>Before using the API, get your API Key under the <strong>Settings</strong> page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pwMH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pwMH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 424w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 848w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1272w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pwMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png" width="1456" height="374" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108703,&quot;alt&quot;:&quot;Get your Kadoa API key by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Get your Kadoa API key by Federico Trotta" title="Get your Kadoa API key by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!pwMH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 424w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 848w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1272w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Get your Kadoa API key</figcaption></figure></div><p>To manage existing workflows, whether created via the UI or the API, you need the workflow&#8217;s ID, which you can copy from the UI:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b0VW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!b0VW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 424w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 848w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1272w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b0VW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png" width="1456" height="205" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:205,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66109,&quot;alt&quot;:&quot;Get a workflow ID by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Get a workflow ID by Federico Trotta" title="Get a workflow ID by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!b0VW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 424w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 848w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1272w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Get a workflow ID</figcaption></figure></div><p>You can then act on the workflow by calling the REST endpoints. For example, you can <a href="https://docs.kadoa.com/api-reference/workflows/schedule-a-workflow">schedule a workflow</a> to run later:</p><pre><code><code>curl --request PUT \
  --url 'https://api.kadoa.com/v4/workflows/{workflowId}/schedule' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: &lt;api-key&gt;' \
  --data '
{
  "date": "2025-02-07T10:00:00.000Z"
}
'</code></code></pre><p>Replace the following placeholders:</p><ul><li><p><em>workflowId</em>: the ID of the workflow you want to schedule.</p></li><li><p><em>&lt;api-key&gt;</em>: your Kadoa API key.</p></li><li><p><em>date</em>: the date and time at which the workflow should start scraping, as an ISO 8601 timestamp in UTC.</p></li></ul><h2>Kadoa: Final Comments</h2><p>After analyzing and testing the tool, here is what I see as its main advantages and disadvantages:</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Ready for AI integration. You can download the scraped data or integrate it into your AI projects directly via API.</p></li><li><p>Suits different kinds of users, as it provides APIs, SDKs, and a UI.</p></li><li><p>Supports structured output formats, including JSON.</p></li><li><p>Offers virtually unlimited scalability, both in infrastructure management and in the number of URLs to scrape.</p></li><li><p>Focuses on data quality before scraping, not after.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>Currently, it supports only 5 proxy locations.</p></li><li><p>You can&#8217;t scrape all the websites you&#8217;d like:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nPP4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nPP4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 424w, 
https://substackcdn.com/image/fetch/$s_!nPP4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 848w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1272w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nPP4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png" width="1226" height="616" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:1226,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93999,&quot;alt&quot;:&quot;Unsupported scraping URL in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Unsupported scraping URL in Kadoa by Federico Trotta" title="Unsupported scraping URL in Kadoa by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!nPP4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 424w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 848w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1272w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Unsupported scraping URL in Kadoa</figcaption></figure></div><h2>Conclusion</h2><p>In this article, I&#8217;ve presented Kadoa, an AI-powered scraping tool that helps you simplify your scraping projects. As you&#8217;ve seen, it is a ready-to-use tool that lets you create scraping workflows in minutes via the UI, and it can also be driven from code.</p><p>Let us know in the comments: Did you already know about this tool? Have you tested it?</p>]]></content:encoded></item><item><title><![CDATA[Why LLM-Ready Scrapers Return Content in Markdown: A Deep Dive]]></title><description><![CDATA[Why do all AI-ready scraping solutions produce Markdown results? Let&#8217;s find out!]]></description><link>https://substack.thewebscraping.club/p/why-scraping-return-markdown-llm-ai</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/why-scraping-return-markdown-llm-ai</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 22 Feb 2026 21:35:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3da96938-1add-42c9-a64d-3888021f9eba_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What do FireCrawl&#8217;s <em>/scrape</em> endpoint, Bright Data&#8217;s Web Unlocker API, Crawl4AI, and a whole bunch of other AI-ready scraping libraries and products have in common? They all give you the option to return scraped content from web pages in Markdown (and sometimes, it&#8217;s even the default behavior!). 
Ever wondered why?</p><p>In this article, I&#8217;ll break down the main reasons so you can understand why LLM-ready scrapers work this way, and how you could even build a simple one yourself!</p><h2>A Brief Reminder About Web Scraping Tools for AI</h2><p>With the <a href="https://substack.thewebscraping.club/p/how-ai-is-changing-the-web-scraping">recent rise of AI</a>, some web scraping solutions have specialized in returning content optimized for LLM ingestion.</p><p>That means the content returned by <a href="https://substack.thewebscraping.club/p/web-scraping-ai-tools-landscape">AI-ready web unlockers or open-source scraping libraries</a> isn&#8217;t just plain HTML. Instead, you often get an optimized Markdown version of the page. (Sometimes it&#8217;s even parsed JSON, but that&#8217;s a different story I won&#8217;t cover here.)</p><p>The Markdown content is then ready to be processed by an LLM as part of an AI agent, an AI workflow or pipeline, a multi-agent system, or a similar system. 
In some cases, <a href="https://substack.thewebscraping.club/p/build-an-ai-agent-for-scraping-papers">these web scraping tools are even accessed autonomously by AI agents</a>, which decide when to use them to retrieve web content based on the task at hand.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Why AI-Ready Web Scrapers Choose Markdown in the First Place</h2><p>Let me introduce you to Markdown, the language spoken by LLMs.</p><p><strong>Note</strong>: I assume you already know what Markdown is, but if not, <a href="https://en.wikipedia.org/wiki/Markdown">read its Wikipedia page</a> (as it gives a quick overview with everything you need to know about its syntax).</p><h3>A Bit of Context About Data Formats in LLMs</h3><p>Most LLMs can handle pretty much any text-based format you throw at them, whether it&#8217;s plain text, HTML, JSON, CSV, XML, or others. <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">Some even have vision capabilities</a> and can understand images or other multimodal content.</p><p>Still, under the hood, most LLMs actually &#8220;speak&#8221; Markdown. That&#8217;s how they handle code blocks, tables, and other structured content, if you&#8217;ve ever wondered&#8230;</p><p>I&#8217;m sure I&#8217;m not the only one who has received a response from ChatGPT or Gemini in pure Markdown, even if I didn&#8217;t ask for it. 
Or sometimes, you can even catch the LLM responding in Markdown, and the page renders it in real time:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e_7Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e_7Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 424w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 848w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 1272w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e_7Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png" width="1010" height="563" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:563,&quot;width&quot;:1010,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the `` characters returned by the LLM&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the `` characters returned by the LLM" title="Note the `` characters returned by the LLM" srcset="https://substackcdn.com/image/fetch/$s_!e_7Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 424w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 848w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 1272w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the ``` characters returned by the LLM</figcaption></figure></div><p>Note that the LLM is writing the &#8220;```&#8221; characters that Markdown uses to delimit fenced code blocks.</p><h3>Why Markdown Is a Perfect Data Format for LLMs (and AI-Ready Scrapers Use It Too)</h3><p>So, cool, LLMs love Markdown and use it behind the scenes. But why? Well, because Markdown is a versatile format that hits all the sweet spots LLMs care about:</p><ul><li><p><strong>Structured content</strong>: Markdown gives you hierarchy and organization out of the box (H1, H2, H3, lists, images, code blocks, etc.), making it easy for LLMs to parse and understand the structure of your content.</p></li><li><p><strong>Concise and LLM-friendly</strong>: Compared to raw HTML or JSON, Markdown is much more concise. 
Less markup overhead means fewer tokens consumed, which in turn lowers the risk of hallucinations, truncations, or context overflows.</p></li><li><p><strong>De facto standard</strong>: While there&#8217;s no single, universally adopted Markdown standard, <a href="https://github.github.com/gfm/">GitHub Flavored Markdown</a> has become a widely adopted baseline, so many tools and scrapers default to it.</p></li><li><p><strong>Rich content support</strong>: Markdown supports images, links, tables, and code snippets (and in some cases, such as with MDX/MarkdownX, even raw HTML or embedded React components), making it flexible for a wide range of content types.</p></li><li><p><strong>Alignment with training data</strong>: LLMs are trained on <a href="https://commoncrawl.org/">massive datasets like Common Crawl</a>, where a huge portion of high-quality technical documentation (READMEs, wikis, Stack Overflow posts, etc.) is written in Markdown. This means most AI models don&#8217;t just &#8220;understand&#8221; Markdown. Instead, they learned to reason through its structure during training, giving them a natural intuition for the format.</p></li></ul><p>Long story short, that&#8217;s why most web scraping solutions built for AI integrations return content in Markdown (or at least give you the option).</p><p>By converting a scraped HTML page directly into Markdown, AI-ready scrapers help the underlying LLM (whether it&#8217;s part of a machine learning pipeline, AI agent, <a href="https://substack.thewebscraping.club/p/web-scraping-assistant-gpt">RAG workflow</a>, plugin, or other application) process the content efficiently while also saving on token usage.</p><h3>Markdown vs HTML</h3><p>Still not convinced? 
Take a look at an HTML-to-Markdown conversion of a <a href="https://www.espn.com/tennis/story/_/id/45732583/jannik-sinner-defeats-carlos-alcaraz-rematch-win-wimbledon-2025-men-singles-title">sports news page from ESPN</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ac9N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ac9N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 424w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 848w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 1272w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ac9N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png" width="1456" height="671" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:671,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;HTML vs Markdown representation of the same news article&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="HTML vs Markdown representation of the same news article" title="HTML vs Markdown representation of the same news article" srcset="https://substackcdn.com/image/fetch/$s_!Ac9N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 424w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 848w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 1272w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">HTML vs Markdown representation of the same news article</figcaption></figure></div><p>As you can see, the original HTML page contains 125.88 KB of content. After converting to Markdown, it drops to 35.84 KB. That&#8217;s a <strong>~72% reduction</strong> in size (the Markdown is only about 28% of the original) just from a simple data format conversion, without any significant loss of actual content!</p><p>If we look at token usage, the difference is just as striking. 
The original HTML page translates to 40,125 tokens:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dVDt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dVDt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 424w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 848w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 1272w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dVDt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png" width="1456" height="707" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Tokens for the HTML page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Tokens for the HTML page" title="Tokens for the HTML page" srcset="https://substackcdn.com/image/fetch/$s_!dVDt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 424w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 848w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 1272w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokens for the HTML page</figcaption></figure></div><p>Meanwhile, the Markdown version corresponds to only 11,175 tokens:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2LXh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2LXh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 424w, 
https://substackcdn.com/image/fetch/$s_!2LXh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 848w, https://substackcdn.com/image/fetch/$s_!2LXh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 1272w, https://substackcdn.com/image/fetch/$s_!2LXh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2LXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png" width="1456" height="703" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:703,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Tokens for the Markdown-converted page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Tokens for the Markdown-converted page" title="Tokens for the Markdown-converted page" srcset="https://substackcdn.com/image/fetch/$s_!2LXh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 424w, 
https://substackcdn.com/image/fetch/$s_!2LXh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 848w, https://substackcdn.com/image/fetch/$s_!2LXh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 1272w, https://substackcdn.com/image/fetch/$s_!2LXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokens for the Markdown-converted page</figcaption></figure></div><p>Again, that&#8217;s roughly a <strong>3.6&#215; reduction in token usage</strong> (which usually translates directly into cost savings, since most AI providers bill by the number of tokens processed).</p><p>For a more direct comparison between data formats, explore the <a href="https://www.kaggle.com/code/brightdataml/benchmarking-ai-on-different-data-formats">AI data format comparison</a> research piece (which I wrote in collaboration with Bright Data on its Kaggle account).</p><h3>But Raw Markdown Alone Isn&#8217;t Enough&#8230;</h3><p>Now, you might be thinking: <em>&#8220;Okay, I&#8217;ll just convert HTML pages to Markdown using one of the many HTML-to-Markdown libraries out there, and I&#8217;m done.&#8221;</em> Well&#8230; not quite.</p><p>The problem is that a direct HTML-to-Markdown conversion isn&#8217;t enough on its own. Below are the main reasons why, and how to address each one.</p><h3>1. Non-Content HTML Tags Get Treated as Content</h3><p>HTML pages are full of blocks that are required for rendering but are useless for understanding the page&#8217;s actual content. Think <em>&lt;script&gt;</em>, <em>&lt;style&gt;</em>, inline JSON configs, and similar elements.</p><p>The catch is that those HTML blocks contain plain text. 
Thus, conversion libraries (rightfully) treat them like any other text node and include them in the Markdown output:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u51a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u51a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 424w, https://substackcdn.com/image/fetch/$s_!u51a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 848w, https://substackcdn.com/image/fetch/$s_!u51a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!u51a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u51a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png" width="1456" height="625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:625,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Notice how the <script> tags are converted&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Notice how the <script> tags are converted" title="Notice how the <script> tags are converted" srcset="https://substackcdn.com/image/fetch/$s_!u51a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 424w, https://substackcdn.com/image/fetch/$s_!u51a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 848w, https://substackcdn.com/image/fetch/$s_!u51a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!u51a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Notice how the &lt;script&gt; tags are converted</figcaption></figure></div><p>This clearly pollutes the output with noise the LLM doesn&#8217;t need, while also greatly increasing token usage (as <em>&lt;script&gt;</em> and <em>&lt;style&gt;</em> blocks can be surprisingly long!).</p><p><strong>&#127919; Solution</strong>: Use an HTML parser (or, in simpler cases, well-scoped regexes) to remove <em>&lt;script&gt;</em>, <em>&lt;style&gt;</em>, and similar rendering-only HTML elements, contents included, before converting the page to Markdown.</p><p>For instance, simply removing <em>&lt;script&gt;</em> and <em>&lt;style&gt;</em> tags from the input HTML produces an impressive reduction, from 1.93 MB down to 68.37 KB, which also translates into huge token savings!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!RvpI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RvpI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 424w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 848w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RvpI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png" width="1456" height="634" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:634,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the size of the new 
output&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the size of the new output" title="Note the size of the new output" srcset="https://substackcdn.com/image/fetch/$s_!RvpI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 424w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 848w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 
15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the size of the new output</figcaption></figure></div><h3>2. Ads and Promotional Content</h3><p>Ads, sponsored sections, and &#8220;recommended for you&#8221; blocks might have nothing to do with the main content of the page. Leaving them in the converted Markdown can confuse the LLM or skew its understanding of what the page is really about.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kFvv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kFvv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 424w, https://substackcdn.com/image/fetch/$s_!kFvv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 848w, 
https://substackcdn.com/image/fetch/$s_!kFvv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!kFvv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kFvv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png" width="1456" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The rendered <iframe> ad is leaking content into the output Markdown&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The rendered <iframe> ad is leaking content into the output Markdown" title="The rendered <iframe> ad is leaking content into the output Markdown" srcset="https://substackcdn.com/image/fetch/$s_!kFvv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 424w, 
https://substackcdn.com/image/fetch/$s_!kFvv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 848w, https://substackcdn.com/image/fetch/$s_!kFvv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!kFvv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The rendered &lt;iframe&gt; ad is leaking content into the output Markdown</figcaption></figure></div><p><strong>&#127919; Solution</strong>: Use proxies that support ad-blocking when retrieving the HTML page, <a href="https://adguard.com/en/blog/adguard-for-linux-nightly.html?utm_source=reddit">enable OS-based ad blockers</a> on your deployment server, or apply rules to remove ads after fetching the HTML and before converting it to Markdown.</p><h3>3. Navigation, Headers, and Footers</h3><p>Menus, breadcrumbs, and footer links are all technically &#8220;content,&#8221; but could be semantically irrelevant for your use case (particularly if you&#8217;re not interested in links for crawling or further exploration).</p><p>If those elements aren&#8217;t removed or downweighted, they increase token usage. Plus, the LLM may overemphasize them or mistake them for part of the main content.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DwXc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DwXc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 424w, https://substackcdn.com/image/fetch/$s_!DwXc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 848w, 
https://substackcdn.com/image/fetch/$s_!DwXc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 1272w, https://substackcdn.com/image/fetch/$s_!DwXc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DwXc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png" width="1215" height="1157" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1157,&quot;width&quot;:1215,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The conversion of the <header> element results in a list of URLs appearing in the target Markdown&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The conversion of the <header> element results in a list of URLs appearing in the target Markdown" title="The conversion of the <header> element results in a list of URLs appearing in the target Markdown" srcset="https://substackcdn.com/image/fetch/$s_!DwXc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 424w, 
https://substackcdn.com/image/fetch/$s_!DwXc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 848w, https://substackcdn.com/image/fetch/$s_!DwXc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 1272w, https://substackcdn.com/image/fetch/$s_!DwXc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The conversion of the &lt;header&gt; element results in a list of URLs appearing in the target Markdown</figcaption></figure></div><p><strong>&#127919; Solution</strong>: Remove tags like <em>&lt;header&gt;</em> and <em>&lt;footer&gt;</em>, or design your HTML-to-Markdown system to accept only specific CSS selectors for the blocks you want to include in the conversion process (<a href="https://docs.crawl4ai.com/core/content-selection/">just like Crawl4AI does</a>).</p><h3>4. Repeated and Boilerplate Content</h3><p>Things like &#8220;Sign up,&#8221; &#8220;Log in,&#8221; newsletter popups, cookie banners, or legal disclaimers (like GDPR notices) appear on almost every page of a site. Including them wastes tokens and adds repetition, which can degrade reasoning quality and increase the risk of hallucinations.</p><p><strong>&#127919; Solution</strong>: This is a tricky problem, as there&#8217;s no easy way to identify and remove all of these elements automatically. I know for a fact that some industry leaders have trained small LLMs specifically for this task, letting them process the remaining HTML (after earlier cleaning steps) to filter out all irrelevant content.</p><h3>How to Convert a Web Page from HTML to LLM-Optimized Markdown</h3><p>I was recently asked by a client to analyze specific web pages from competitors&#8217; websites. These included structured pages with hidden elements that required basic interactions (like dropdowns). Plus, some information was spread across badge images, links, etc.</p><p>Now, if you&#8217;re trying to get high-level insights from web pages, your first idea might be to just copy all the text on a page (CTRL+A, then CTRL+C) and paste it into ChatGPT (or a similar AI solution) along with the right prompt. 
That&#8217;s far from optimal, because you lose structure, links, image URLs, and other important context.</p><p>Instead, I wrote a simple Python script that:</p><ol><li><p>Reads HTML from an <em>index.html</em> file.</p></li><li><p>Parses it with Beautiful Soup and removes <em>&lt;script&gt;</em>, <em>&lt;style&gt;</em>, <em>&lt;header&gt;</em>, and <em>&lt;footer&gt;</em> nodes.</p></li><li><p>Keeps only the contents of the <em>&lt;body&gt;</em> tag, restricting the input to the part of the page you&#8217;re typically interested in.</p></li><li><p>Converts the remaining HTML to Markdown using <em><a href="https://github.com/matthewwithanm/python-markdownify">markdownify</a></em>.</p></li><li><p>Writes the output to an <em>output.md</em> file.</p></li></ol><p>Let me show you this script!</p><h3>HTML to LLM-Optimized Markdown Script</h3><p>Here&#8217;s the simple script for converting HTML files to LLM-ready Markdown outputs:</p><pre><code># pip install beautifulsoup4 lxml markdownify

from bs4 import BeautifulSoup
from markdownify import markdownify as md

def html_to_markdown(html_input_path: str, output_markdown_path: str):
    # Load the input HTML from a file
    with open(html_input_path, "r", encoding="utf-8") as f:
        html = f.read()

    # Parse the HTML (using lxml for high performance)
    soup = BeautifulSoup(html, "lxml")

    # Remove the undesired tags
    for tag in soup(["script", "style", "header", "footer"]):
        tag.decompose()

    # Keep only the &lt;body&gt; content (if present)
    body_html = soup.body.decode_contents() if soup.body else str(soup)

    # Convert the HTML to Markdown
    markdown = md(
        body_html,
        bs4_options="lxml" # Set the underlying HTML parser
    )

    # Write the Markdown output to disk
    with open(output_markdown_path, "w", encoding="utf-8") as f:
        f.write(markdown)
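

# Optional helper: a minimal sketch, NOT part of the original script.
# It illustrates the "apply rules to remove ads" idea by dropping every
# element matched by a list of CSS selectors before the Markdown conversion.
# The function name and the default selectors are hypothetical; tune the
# selectors for each target site.
# Hypothetical usage: html = strip_by_selectors(html, ["iframe", "nav"])
def strip_by_selectors(html: str, selectors=("iframe", "nav", "aside")):
    soup = BeautifulSoup(html, "lxml")
    # select() accepts a comma-separated CSS selector group
    for node in soup.select(", ".join(selectors)):
        node.decompose()
    return str(soup)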


if __name__ == "__main__":
    html_to_markdown("index.html", "output.md")</code></pre><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>How to Use the Script</h3><p>The script above still involves a few manual steps, but it greatly improves the transformation of a web page into content that&#8217;s ready to be sent to any LLM.</p><p>First, <a href="https://substack.thewebscraping.club/p/anycrawl-testing-the-llm-ready-web">load the target page</a> in your browser (ideally with an ad blocker enabled):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ruqB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ruqB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 424w, https://substackcdn.com/image/fetch/$s_!ruqB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 848w, https://substackcdn.com/image/fetch/$s_!ruqB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 
1272w, https://substackcdn.com/image/fetch/$s_!ruqB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ruqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target page" title="The target page" srcset="https://substackcdn.com/image/fetch/$s_!ruqB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 424w, https://substackcdn.com/image/fetch/$s_!ruqB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 848w, https://substackcdn.com/image/fetch/$s_!ruqB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ruqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The target page</figcaption></figure></div><p>Once the page has fully rendered, right-click and select the &#8220;Inspect&#8221; entry. 
Locate the <em>&lt;html&gt;</em> tag, then use the &#8220;Copy &gt; Copy outerHTML&#8221; option to get the complete HTML of the rendered page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CTKj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CTKj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 424w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 848w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 1272w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CTKj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png" width="1456" height="789" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:789,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Copying the rendered HTML of the target page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Copying the rendered HTML of the target page" title="Copying the rendered HTML of the target page" srcset="https://substackcdn.com/image/fetch/$s_!CTKj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 424w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 848w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 1272w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Copying the rendered HTML of the target page</figcaption></figure></div><p><strong>Note</strong>: Copying the rendered HTML is better than copying the HTML from the &#8220;View page source&#8221; option. 
The latter misses all dynamic content (basically, anything that requires JavaScript execution and rendering in the browser won&#8217;t appear in the raw page source).</p><p>Next, in your project folder, paste the HTML into a file named <em>index.html</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0mEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0mEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0mEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png" width="1456" height="865" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The index.html file in the project folder&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The index.html file in the project folder" title="The index.html file in the project folder" srcset="https://substackcdn.com/image/fetch/$s_!0mEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The index.html file in the project folder</figcaption></figure></div><p>Run the Python script, and it&#8217;ll generate an <em>output.md</em> file:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hGQJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hGQJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 424w, 
https://substackcdn.com/image/fetch/$s_!hGQJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!hGQJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!hGQJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hGQJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png" width="1456" height="865" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The resulting output.md file generated by the script&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The resulting output.md file generated by the script" title="The resulting output.md file generated by the script" srcset="https://substackcdn.com/image/fetch/$s_!hGQJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 424w, 
https://substackcdn.com/image/fetch/$s_!hGQJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!hGQJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!hGQJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The resulting output.md file generated by the script</figcaption></figure></div><p>You can now pass the resulting Markdown to an LLM for processing.</p><p>Compared to a traditional HTML-to-Markdown approach, the tweaks in this process save tons of tokens. In particular, the output produced by this method is just 41.2 KB (compared to 611.68 KB for the original HTML), which corresponds to <strong>11,006 tokens</strong>.</p><p>If you applied a basic HTML-to-Markdown conversion, you&#8217;d end up with a 430.21 KB Markdown file, resulting in <strong>154,191 tokens</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Z-C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Z-C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 424w, https://substackcdn.com/image/fetch/$s_!4Z-C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 848w, https://substackcdn.com/image/fetch/$s_!4Z-C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4Z-C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Z-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png" width="1456" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Traditional HTML-to-Markdown conversion approach&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Traditional HTML-to-Markdown conversion approach" title="Traditional HTML-to-Markdown conversion approach" srcset="https://substackcdn.com/image/fetch/$s_!4Z-C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 424w, https://substackcdn.com/image/fetch/$s_!4Z-C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 848w, https://substackcdn.com/image/fetch/$s_!4Z-C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4Z-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Traditional HTML-to-Markdown conversion approach</figcaption></figure></div><p>In other words, these basic tricks lead to <strong>over 14&#215; token savings</strong>. Not bad!</p><p>Et voil&#224;! 
Simple, manual, but highly effective.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Next Step</h3><p>The script I just shared achieves its goal, but it&#8217;s deliberately simple. I presented it only to show that a basic HTML-to-Markdown conversion is suboptimal.</p><p>For a more sophisticated result, you could integrate a similar process into your LLM-ready scraper, including CLI options for more control over which tags to remove, which nodes to select, and other conversion settings.</p><p>As a project idea, you could even turn this approach into a browser extension that converts rendered web pages in a user&#8217;s browser into LLM-ready Markdown output files. If you go this route, make sure to follow <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical web scraping practices</a>.</p><h2>Is Markdown Always the Right Choice?</h2><p>This is the final question to ask after all this discussion. Now, you might be thinking: <em>&#8220;Okay, there are no good reasons to stick with plain HTML when passing web pages to an LLM.&#8221;</em></p><p>That&#8217;s not true: there are situations where access to the raw HTML makes a difference. For instance, the HTML may contain semantic attributes or metadata. 
This information would be lost during HTML-to-Markdown conversion.</p><p>Specifically, these are some scenarios where sticking to HTML for LLM ingestion is beneficial:</p><ul><li><p><em><a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Global_attributes/data-*">data-*</a></em> attributes that store product IDs, prices, etc.</p></li><li><p><a href="https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA">ARIA attributes</a> that convey accessibility or structural information.</p></li><li><p><em>class</em> or other HTML attributes that reveal context beyond the visible content of a node.</p></li><li><p>HTML comments that contain useful information about the page.</p></li></ul><p>For example, consider this HTML node on an Amazon page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1MJt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1MJt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 424w, https://substackcdn.com/image/fetch/$s_!1MJt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 848w, https://substackcdn.com/image/fetch/$s_!1MJt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1MJt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1MJt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the hidden element on this Amazon product page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the hidden element on this Amazon product page" title="Note the hidden element on this Amazon product page" srcset="https://substackcdn.com/image/fetch/$s_!1MJt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 424w, https://substackcdn.com/image/fetch/$s_!1MJt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 848w, https://substackcdn.com/image/fetch/$s_!1MJt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1MJt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the hidden element on this Amazon product page</figcaption></figure></div><p>This is just an empty <em>&lt;span&gt;</em>, but its <em>data-state</em> attribute contains information-rich JSON data that you would lose during the Markdown conversion (as the node doesn&#8217;t contain text).</p><p>Another common example is visual elements, 
which often carry semantic information not captured by visible text. For instance, based on the image below, you might think the rating is 5/5, but the aria-label attribute reveals it&#8217;s actually 4.3/5:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Uznb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Uznb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 424w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 848w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Uznb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png" width="1456" height="694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the aria-label attribute&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the aria-label attribute" title="Note the aria-label attribute" srcset="https://substackcdn.com/image/fetch/$s_!Uznb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 424w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 848w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the aria-label attribute</figcaption></figure></div><p>In short, and as is almost always the case in IT, converting to Markdown isn&#8217;t a one-size-fits-all solution. 
Therefore, it&#8217;s no surprise that most web scraping solutions built for direct AI integrations also offer the option to return raw HTML.</p><p>That said, based on my experience in the field and everything highlighted here, I highly recommend sticking to Markdown when feeding web pages to LLMs for processing or data parsing, as the benefits far outweigh the downsides in the vast majority of use cases.</p><div><hr></div><h2>Conclusion</h2><p>In this post, I&#8217;ve outlined why Markdown is the language of LLMs and, consequently, the preferred output format for all web scraping tools that integrate directly into AI systems like workflows, pipelines, agents, and so on.</p><p>The reasons are intuitive: Markdown is concise and strips unnecessary markup (reducing token usage) while preserving structure, images, links, lists, tables, and more.</p><p>As highlighted, plain HTML-to-Markdown conversion isn&#8217;t always optimal, and you need to apply some extra tricks to get the best results.</p><p>If you have any questions or comments, drop them below. 
Until next time!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #98: Scraping Google Search Results in 2026: Device, Location, and Identity]]></title><description><![CDATA[Google does not have one set of results. It has millions. The hard part is knowing which one you are looking at.]]></description><link>https://substack.thewebscraping.club/p/scraping-serp-google-search</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/scraping-serp-google-search</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 19 Feb 2026 06:00:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/97f57072-a449-4a11-beab-aad59c0ec80c_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you use a search engine, you have probably noticed that results are not static. The same query returns different results depending on where you are, what device you use, and whether you are logged into a Google account. <br>For SERP scraping, this adds several layers of complexity. 
For most scraping targets, you send a request and get the page; for search engines, you send a request and get <em>a version</em> of the page, shaped by signals you may not even be aware of.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>This makes SERP scraping fundamentally different from conventional web scraping. The data you collect is only as reliable as your control over these variables. 
Scrape from a datacenter IP in Virginia with a desktop Chrome fingerprint while logged out, and you will get one set of results. Scrape the same query from a mobile device in Milan while logged into a Google account, and you will get something entirely different. Both are &#8220;correct&#8221; Google results. Neither tells the full story.</p><p>In this edition of The Lab, we wanted to understand how much these variables actually change the output and, more importantly, how to control them reliably.</p><h2>Google does not want you scraping its results</h2><p>Before we get into the technical setup, we need to acknowledge something that changed the landscape significantly in early 2025.<br><br>Starting in January 2025, Google began rolling out SearchGuard, a technical protection measure designed to make scraping search results harder.</p><p>SearchGuard works by sending JavaScript challenges in response to search queries originating from unrecognized sources, <a href="https://substack.thewebscraping.club/p/google-hiding-serp-results-javascripts">as we covered on these pages when it started</a>. When a query arrives, Google&#8217;s system transmits JavaScript code that requires the browser to compute and return a &#8220;solve&#8221;: a set of specific details about the browser environment and the user generating the request. For human users, the challenge is solved transparently in the browser. For automated systems, it is a wall.</p><p>This change in strategy put pressure on all &#8220;SEO tools&#8221; and on the operators who needed to scrape Google search results, suddenly increasing their day-to-day operational costs. 
</p><div><hr></div><blockquote><p><em>Need public web data, not scraper headaches?</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vag1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vag1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 424w, https://substackcdn.com/image/fetch/$s_!vag1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 848w, https://substackcdn.com/image/fetch/$s_!vag1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!vag1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vag1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png" width="398" height="143.78296703296704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:526,&quot;width&quot;:1456,&quot;resizeWidth&quot;:398,&quot;bytes&quot;:243844,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187899616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vag1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 424w, https://substackcdn.com/image/fetch/$s_!vag1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 848w, https://substackcdn.com/image/fetch/$s_!vag1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!vag1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>SerpApi turns search results into predictable JSON with built-in scale, location options, and speed. All with no maintenance. 
</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://serpapi.com/?utm_source=thewebscrapingclub&quot;,&quot;text&quot;:&quot;Try for free&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://serpapi.com/?utm_source=thewebscrapingclub"><span>Try for free</span></a></p></blockquote><div><hr></div><p>This change in strategy, and especially its timing, prompted professionals to raise questions that will likely never receive an answer. Does this have something to do with the AI race? Is this a way to make it harder for other AI companies to rely on Google searches for their answers?<br>We&#8217;ll probably never know the answers, but they&#8217;re legitimate questions: SERP scraping is as old as Google search, so why bother stopping bots in 2025 and not years earlier?<br>However, this is today&#8217;s reality, and we need to adapt to it. Let&#8217;s examine the specifics of SERP scraping on Google (<em>as always, we&#8217;re showing this for educational purposes; be aware of current copyright and scraping laws</em>).</p><h2>What shapes a Google SERP response</h2><p>To scrape Google Search reliably, we need to model the system we are interacting with. Google personalizes search results along several axes, each of which produces measurably different output.</p><p><strong>Geographic location</strong> is one of the most impactful variables. Google determines your location through your IP address and, when available, browser geolocation permissions. A query for &#8220;pizza restaurant&#8221; from a New York IP returns local results for Manhattan. The same query from a Milan IP returns pizzerias in Milan. 
This extends beyond local searches: news results, shopping results, and even organic ranking order shift based on geography.</p><p>We&#8217;ll see in the test section of this article that making Google believe we are in a different location is less trivial than it sounds, since not every proxy type works equally well.</p><p><strong>Device type</strong> determines the structure and content of the SERP itself. Mobile and desktop results are not just different layouts of the same data. Google serves genuinely different content. Mobile SERPs prioritize featured snippets, location-based answers, and nearby points of interest. Desktop SERPs give more space to organic links and Knowledge Panels. Some results appear exclusively on mobile or exclusively on desktop. For anyone collecting SERP data for analysis, this distinction is not cosmetic. It is structural.</p><p><strong>Login state</strong> introduces personalization based on your Google account history. When you are logged in, Google uses your search history, location history, and account preferences to tailor results. When logged out, you get a more &#8220;generic&#8221; version of the results for your location and device. The difference can be subtle for generic queries and dramatic for anything Google considers personal.</p><p><strong>Keywords</strong>, of course, are the main driver of change. Beyond returning different results for different keywords, Google also changes the layout of the page. If you search for &#8220;trousers&#8221;, you&#8217;ll see more shopping results and product data, while a search for &#8220;aspirin&#8221; returns a more traditional layout.</p><p>These four variables interact. A logged-in mobile user in Tokyo sees a fundamentally different page than a logged-out desktop user in London, even for the same query. 
Controlling all four simultaneously is what makes SERP scraping an infrastructure problem, not just a coding problem.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><p></p><h2>Tools: an anti-detect browser and Playwright</h2><p>Given the variables we need to control (device type and login state, specifically), and the fact that we are not building a massive scraping operation here, the best setup we can use is Playwright paired with an anti-detect browser.</p><p>We need a real browser, not just an HTTP request library like <code>requests</code> or <code>httpx</code>, because Google&#8217;s SearchGuard validates the browser environment through JavaScript challenges. A raw HTTP client has no JavaScript engine, no DOM, no window object. It cannot compute the &#8220;solve&#8221; that SearchGuard requires. The request simply fails or returns a challenge page. To pass these checks, we need something that executes JavaScript and exposes a complete browser environment.</p><p>But a standard browser is not enough either. Regular Chrome or Firefox, even when automated with Playwright or Selenium, carries detectable signals: the <code>navigator.webdriver</code> flag, predictable fingerprint values, and missing or inconsistent browser properties. Google&#8217;s systems can identify these inconsistencies and treat the session as automated.</p><p>That&#8217;s why we&#8217;re pairing Playwright with an anti-detect browser, which is a modified browser engine that spoofs the properties websites use for fingerprinting:
navigator properties, screen resolution, WebGL parameters, canvas behavior, AudioContext values, font lists, language headers, and device type. Instead of presenting the same default fingerprint every time, an anti-detect browser generates a consistent, realistic identity that looks like a genuine user on a specific device and operating system.</p><p>The critical feature for our use case is <strong>persistent profiles</strong>. An anti-detect browser manages browser profiles that survive across sessions. Each profile stores its fingerprint configuration, cookies, local storage, proxy, and device settings. When we start a profile, it resumes exactly where it left off. This means we can log into a Google account through one profile, close the browser, and reopen it days later with the session still active. Without persistent profiles, we would need to authenticate on every run, which is both impractical and a red flag for Google&#8217;s security systems.</p><p>For this article, we use <a href="https://kameleo.io/">Kameleo</a> as our anti-detect browser. It runs as a local service (Kameleo.CLI) exposing a REST API on port 5050, controllable via a Python client. It supports Chromium-based profiles (Chroma) for Chrome and mobile device emulation, and Firefox-based profiles (Junglefox). 
Each profile is an isolated browser session with its own fingerprint, proxy, and cookies.<br></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Setting up the infrastructure: deploying Kameleo on AWS</h2><p>Our Kameleo instance runs on a Windows EC2 in the US. This means that without a proxy, all traffic exits via a US-based AWS IP address. We will use this setup to demonstrate the difference between the instance&#8217;s own IP and a proxy claiming to be somewhere else. I&#8217;m sure you&#8217;ll be surprised by what we&#8217;ll find later.</p><h3>Installing Kameleo on AWS</h3><p>We installed Kameleo on a Windows EC2 instance using the standard graphical installer, no rocket science here. Once Kameleo is running on the AWS machine, it exposes its API on port 5050. Our Python scripts run locally and connect to the remote Kameleo instance over the network.</p><p>The architecture is straightforward: Kameleo manages browser profiles and runs the actual browsers on the AWS instance. Our local machine sends API commands (create profile, start browser, stop browser) and connects to the browser via WebSocket for Playwright automation. The AWS instance needs port 5050 open in its security group for this to work.</p><p>Every script in this article follows the same initialization pattern. We read the remote IP from an environment variable:</p><pre><code>from kameleo.local_api_client import KameleoLocalApiClient
import os

# Address of the remote Kameleo instance; KAMELEO_IP must be set in the environment
kameleo_ip = os.getenv('KAMELEO_IP')
# The Kameleo API listens on port 5050 unless KAMELEO_PORT overrides it
kameleo_port = os.getenv('KAMELEO_PORT', '5050')

client = KameleoLocalApiClient(endpoint=f'http://{kameleo_ip}:{kameleo_port}')</code></pre><h2>Test 1: setting the right location</h2><p>As we said, one of the keys to extracting SERP data is setting the location we&#8217;d like to know more about. Our Kameleo installation is on an AWS US machine, so by default we expect to get SERP data for the US. But what if we want to change location?</p><p>We run the same query, &#8220;weather&#8221;, three times from the same AWS instance in the US. First, without any proxy, so the traffic exits from the instance&#8217;s own IP. Then, through a residential proxy geolocated in Italy. Finally, through a datacenter proxy also claiming to be in Italy. For each run, we first visit whatismyipaddress.com to verify the exit IP, then navigate to Google, type the query in the search bar with randomized keystroke delays, and capture the results.<br><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">98.SERP-DATA</a>. </strong>If you&#8217;re one of them but cannot access the repository, <a href="https://twsc-private-form.lovable.app/">please fill out this form</a>.</p><p>In the file <strong>test_location_comparison.py,</strong> we&#8217;ll see how Google responds when we use different types of proxies.</p>
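<p>The randomized keystroke delays used when typing the query can be sketched as follows. This is a minimal illustration under our own assumptions, not the code from the repository: the helper name <code>keystroke_delays</code> and the 80 to 250 ms range are hypothetical choices.</p>

```python
import random


def keystroke_delays(text, min_ms=80, max_ms=250, seed=None):
    """Return one randomized delay (in milliseconds) per character of `text`.

    Hypothetical helper: the name and the default range are assumptions,
    not values taken from the article's repository.
    """
    rng = random.Random(seed)  # a fixed seed makes a run reproducible for debugging
    return [rng.uniform(min_ms, max_ms) for _ in text]


# Pair each character of the query with the pause that would follow it.
query = "weather"
for char, delay in zip(query, keystroke_delays(query, seed=42)):
    print(f"{char!r}: wait {delay:.0f} ms")
```

<p>In a browser automation script, each delay would be applied between single-character key presses, so the typing rhythm never repeats exactly from one run to the next.</p>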
      <p>
          <a href="https://substack.thewebscraping.club/p/scraping-serp-google-search">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Avoid Copyright Violations While Scraping ]]></title><description><![CDATA[Discover how copyright violations can occur in web scraping and how to avoid them]]></description><link>https://substack.thewebscraping.club/p/avoid-copyright-violations-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/avoid-copyright-violations-scraping</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 15 Feb 2026 04:00:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5eee30d1-21ea-4010-a335-5d9e31803bfd_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At its core, web scraping is based on a simple process: you retrieve data from a target website with the goal of doing something meaningful with it. Regardless of your experience in the industry, this process should immediately make you ask yourself: &#8220;<em>I&#8217;m retrieving and using someone else&#8217;s data, so am I violating copyright or something while scraping?</em>&#8221;</p><p>In this article, we&#8217;ll discuss what a copyright violation is in the context of web scraping, when it occurs, and how to avoid it.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What is Copyright Violation in the Context of Scraping?</h2><p>Generally speaking, a copyright violation occurs when you reproduce, display, distribute, or create derivative works from someone else&#8217;s creative work without their permission (or without a valid legal exception). In the context of web scraping, the &#8220;creative work&#8221; involves (but is not limited to) the following:</p><ul><li><p>Articles.</p></li><li><p>Images.</p></li><li><p>Audio and video.</p></li><li><p>Code (under particular conditions).</p></li></ul><p>In other words, if you scrape and reproduce an article (even a small part of it)  on your website without the author&#8217;s permission, you can be infringing copyright. Whether it is actually infringement depends on context (how much content you copied, how you used it, and which jurisdiction applies), but &#8220;a small part of a whole article&#8221; is not a safe harbor.</p><p>So here&#8217;s the thing to bear in mind: Just because some content is accessible on the Internet, it doesn&#8217;t mean you can take it. 
Public accessibility does not transfer ownership or the right to reproduce the content. This is why <a href="https://substack.thewebscraping.club/i/179653589/best-practice-5-mind-the-data-you-scrape-and-the-goal">minding the data you scrape is one of the best practices for ethical scraping.</a></p><div><hr></div><blockquote><p><em>For your ethical scraping activity, you need IPs with a good reputation. For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">which is sharing this offer with TWSC readers</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" height="87.84" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, 
https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><h2>How Can Copyright Violations Occur While Scraping?</h2><p>To avoid copyright infringement, you should know the common cases to watch out for. Below is a list of situations where copyright can be violated while scraping data from websites:</p><ul><li><p><strong>Copying content</strong>: Technically speaking, scraping is copying. When you download a webpage&#8217;s HTML to your disk, you&#8217;ve made a copy. If that HTML contains creative expression, you have created a copy of copyrighted material. That does not automatically mean you are infringing, but this is the exact action copyright law regulates. And if you store, reuse, or republish that expression without permission (or a solid exception), you&#8217;re in infringement territory. Note that courts don&#8217;t need the copied content to be 1:1 identical. For them, &#8220;substantial similarity&#8221; can be enough.</p></li><li><p><strong>Copying images and media</strong>: Images are typically strongly protected. Scraping image URLs and hotlinking can still be risky, even if you credit the source URLs when republishing the images. 
And, of course, downloading and rehosting is even more direct copying.</p></li><li><p><strong>Copying &#8220;creative fields&#8221; that look like &#8220;data&#8221;</strong>: Product descriptions, editorial blurbs, &#8220;about&#8221; sections, hotel/restaurant descriptions, FAQs, and similar content are often copyrighted text. While editorial blurbs are obviously copyrighted content, the other cases are less clear-cut. The question to always ask is whether the text qualifies as &#8220;creative work&#8221;. A product description can be creative work when it contains original language, structure, or marketing copy. But not every description is protected. For example, a purely functional description may have weak or no copyright protection, depending on the jurisdiction and the originality of the content itself.</p></li><li><p><strong>Scraping for training LLMs</strong>: Scraping web pages to get data for training LLMs is surely part of <a href="https://substack.thewebscraping.club/i/173603764/the-future-ai-llms-and-the-next-frontier">the evolving career of web scraping professionals</a>. However, scraping data to train Large Language Models can trigger reproduction/derivative-work arguments in court. This is still an evolving legal area, so you should not assume &#8220;transformative&#8221; automatically saves you from legal troubles, especially at scale. The dispute between <a href="https://techcrunch.com/2025/11/03/studio-ghibli-and-other-japanese-publishers-want-openai-to-stop-training-on-their-work/">Studio Ghibli and OpenAI over copyright violations in LLM training</a> is one of the open cases, but keep in mind: allegations, investigations, and lawsuits are not the same thing as a final court ruling.</p></li></ul><h3>How to Avoid Copyright Violations While Scraping</h3><p>Having legal issues is probably the worst nightmare for professional scrapers. So, how can you be sure you are not violating copyright while scraping? 
Below is a list of guidelines to take into consideration:</p><ul><li><p><strong>Scrape facts, not expression</strong>: Copyright protects expression, not facts. Scraping the price of a stock, the temperature in London, or a flight arrival time doesn&#8217;t infringe any copyright because these are facts. No one owns the fact that today it is 20 degrees in London. On the other hand, scraping a journalist&#8217;s analysis of why the price of a stock moved in a certain direction, or a photographer&#8217;s image of London, means copying creative expression.</p></li><li><p><strong>Transform, don&#8217;t replicate</strong>: When repurposing content (on your website or anywhere else), transform it. This is a general rule of thumb, but if you are in the US, one of your best defenses is &#8220;Fair Use&#8221;, and that defense leans heavily on your use being transformative. For example, scraping Amazon reviews and posting them on your own e-commerce site is replicating, not transforming. Even summarizing reviews may not be considered transformative in some cases, and even when it is, it&#8217;s not a guaranteed shield.</p></li><li><p><strong>Don&#8217;t store raw pages by default</strong>: As said before, storing the HTML of entire pages means creating copies. To avoid this, you have two options:</p><ul><li><p>Parse in memory.</p></li><li><p>Extract only the necessary content, not whole pages.</p></li></ul></li><li><p><strong>Treat images as a separate &#8220;danger zone&#8221;</strong>: Throughout the Internet era, images have been the type of content behind most copyright issues. 
The safest options are:</p><ul><li><p>Using the website&#8217;s official APIs when scraping images, if available.</p></li><li><p>Scraping images released under a Creative Commons license and complying with its terms.</p></li><li><p>Requesting and obtaining a direct license from the owner.</p></li></ul></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Standardized Processes to Stay Safe</h3><p>So far so good, but let&#8217;s be honest: when you are caught up in your daily tasks, it&#8217;s easy to lose your bearings. To avoid that, the best thing to do is to create standardized (and documented) processes and procedures so that you always operate within guardrails. This section provides you with a couple of ideas you can implement as standardized processes to be sure you don&#8217;t violate any copyright while scraping.</p><h3>Procedure #1 to Avoid Copyright Violations While Scraping: Develop a Copyright Risk Check</h3><p>Most copyright problems in scraping are self-inflicted. This happens because developers often scrape &#8220;everything on the page,&#8221; save it &#8220;for later,&#8221; and only then ask: &#8220;<em>Wait, can we ship this?</em>&#8221;</p><p>Before you add a field (or a selector) to your scraper, ask yourself the following questions:</p><ul><li><p><strong>&#8220;Is this a fact, or is this someone&#8217;s writing?&#8221;</strong>: Prices, dates, SKUs, addresses, and opening hours are facts. A paragraph of an article is someone&#8217;s writing. 
Remember to treat those differently.</p></li><li><p><strong>&#8220;If I publish this, would it compete with the source?&#8221;:</strong> If your application lets users consume the content without clicking the original, you&#8217;re not &#8220;aggregating.&#8221; You&#8217;re substituting.</p></li><li><p><strong>&#8220;Am I copying just what I need, or am I copying the entire page?&#8221;</strong>: If the answer to this question is: &#8220;<em>We only store it for debugging</em>&#8221;, then you are building a copy.</p></li><li><p><strong>&#8220;How much am I taking?&#8221;:</strong> A single excerpt is one thing. Thousands of excerpts across a site start looking like a dataset designed to recreate the whole content.</p></li><li><p><strong>&#8220;What am I going to do with it later?&#8221;:</strong> Internal analysis is one risk profile. A public API that returns the scraped text is a completely different risk profile.</p></li><li><p><strong>&#8220;Is my plan defensible if someone sends a legal notice?&#8221;:</strong> If your only defense is &#8220;<em>but the content is publicly available</em>&#8221;, you don&#8217;t have a defense. 
As said before, public availability is not the same as ownership.</p></li></ul><p>If your answers to these questions feel shaky, the fix is usually boring: don&#8217;t collect it, collect less, or get permission.</p><h3>Procedure #2 to Avoid Copyright Violations While Scraping: Build Your Scraper So It&#8217;s Hard to Do Something Dumb</h3><p>If you want to stay out of trouble, don&#8217;t rely on &#8220;policy.&#8221; Rely on defaults and standards.</p><p>Here&#8217;s what I mean: The safest scraper is the one that can&#8217;t casually vacuum up article bodies, image files, and review text unless you deliberately build it that way.</p><p>Below is a process that works safely:</p><ol><li><p>Fetch the page.</p></li><li><p>Extract only what you came for.</p></li><li><p>Store facts + metadata (source URL, timestamp).</p></li><li><p>Throw the rest away.</p></li></ol><p>When you really do need to keep anything close to &#8220;content&#8221; (i.e., media), treat it as a special case: short retention, locked-down access, and a reason written down somewhere. Not &#8220;<em>maybe we&#8217;ll need it later</em>&#8221;: You must have a valid reason.</p><p>If you want a mental model, you can think of it like so: You&#8217;re not building a web scraper. You&#8217;re building a pipeline. 
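</p><p>As a rough illustration of that pipeline, the sketch below covers steps 2 to 4 in Python using only the standard library. It is an example built on assumptions, not a reference implementation: the <code>data-sku</code>, <code>data-price</code>, and <code>data-availability</code> attribute names, the <code>FactRecord</code> type, and the <code>extract_facts</code> helper are all hypothetical, and step 1 (fetching the page) is left out, so the HTML arrives as a string. The point is the default: the record type has no field for raw HTML, so the page cannot be stored by accident.</p>

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from html.parser import HTMLParser

# Hypothetical allowlist: the only attributes this scraper is allowed to keep.
ALLOWED_FIELDS = {"data-sku", "data-price", "data-availability"}


@dataclass(frozen=True)
class FactRecord:
    """Facts plus provenance metadata; there is no field for raw HTML."""
    source_url: str
    fetched_at: str
    facts: dict


class FactExtractor(HTMLParser):
    """Collects allowlisted attribute values and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.facts = {}

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ALLOWED_FIELDS and value is not None:
                # Keep the first value seen for each allowlisted field.
                self.facts.setdefault(name.removeprefix("data-"), value)


def extract_facts(html, source_url):
    """Extract what we came for, attach metadata, and drop the rest."""
    parser = FactExtractor()
    parser.feed(html)
    parser.close()
    return FactRecord(
        source_url=source_url,
        fetched_at=datetime.now(timezone.utc).isoformat(),
        facts=parser.facts,
    )
```

<p>Descriptions, images, and review text on the page never reach the record, because nothing in the allowlist matches them. Keeping anything more would require a deliberate code change, which is the kind of default this procedure is about.</p><p>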
And pipelines need guardrails.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Examples: What &#8220;Safe-ish&#8221; Looks Like vs What Can Surely Get You in Trouble</h3><p>Let&#8217;s be practical now and see some examples of what is generally safe and what is not. Of course, the following examples are not court outcomes. They&#8217;re the kind of setups that tend to be boring (safe-ish) or spicy (trouble-ish):</p><ul><li><p><strong>Price tracker (safe-ish)</strong>: You scrape SKU + price + availability + timestamp and show a price history chart. You don&#8217;t copy product descriptions or images. This is the classic &#8220;facts + original output&#8221; use case.</p></li><li><p><strong>Product catalog clone (risky)</strong>: You scrape titles, descriptions, bullet points, images, and reviews, then you show them on your site. That&#8217;s not &#8220;data.&#8221; That&#8217;s content. You&#8217;re rebuilding their user experience.</p></li><li><p><strong>News aggregation (high risk)</strong>: If you store headlines + links and add your own tags/filters, you&#8217;re closer to indexing. If you store full articles and users can read all the content as is without leaving your site, then you run a high risk of ending up in court.</p></li><li><p><strong>Review analytics (mixed)</strong>: Using reviews internally to compute &#8220;top complaints this month&#8221; is one thing. 
Republishing reviews precisely as they are is another.</p></li><li><p><strong>Business directory (often safer, until you start copying the fluff)</strong>: Name, address, phone, opening hours: These are usually factual. &#8220;About us&#8221; sections and photos, on the other hand, are where you cross over into copyrighted expression.</p></li></ul><p>So notice the pattern: The moment your product starts looking like a substitute for the source, your legal risk goes up fast.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>The Traps That Have Nothing to Do with Copyright (But Still Hurt)</h2><p>Copyright is only one way scraping can go wrong. Plenty of scraping disputes are won on issues that are simpler to prove than infringement. Below are the big ones you should treat as &#8220;no trespassing&#8221; signs:</p><ul><li><p><strong>Circumvention (DMCA Section 1201):</strong> If the site uses a login wall, CAPTCHA, paywall, anti-bot challenges, or IP blocking to stop you, and you write code to bypass those measures, you are potentially violating anti-circumvention laws. 
This is not &#8220;copyright infringement&#8221; in the traditional sense, but the practical takeaway is simple: If you have to defeat a technical barrier to get the data, you&#8217;re walking into high-risk territory fast.</p></li><li><p><strong>Disregarding </strong><em><strong>robots.txt</strong></em><strong>:</strong> The <em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">robots.txt</a></em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications"> isn&#8217;t the law, but ignoring it has its implications</a>. In disputes, it can be used as evidence that you knew you were unwelcome and kept going anyway. It can also be relevant to arguments about authorization and &#8220;bad faith,&#8221; even if it doesn&#8217;t create copyright liability by itself.</p></li><li><p><strong>Terms of service (contract risk):</strong> If the ToS explicitly forbids scraping (and most do), and you scrape anyway, you may be liable for breach of contract. This is often easier for the content owner to win than a copyright claim because the argument is straightforward: You agreed (explicitly or implicitly) to a contract, then you violated the agreement.</p></li><li><p><strong>Do not scrape behind a login:</strong> Once you log in, you have affirmatively agreed to a contract. Breaking that contract to scrape is a fast track to a lawsuit. If your plan requires authenticated access, treat it as a licensing/permission problem, not an engineering challenge.</p></li></ul><h2>Conclusion</h2><p>In this article, we&#8217;ve discussed how copyright infringement can occur while scraping and how to avoid it. As said, it&#8217;s not always easy to understand when you are actually infringing copyright, because it depends on the governing laws, which are often local. 
Still, the main ideas proposed can help you be conservative and stay pretty safe while scraping web pages.</p><p>So, tell us: Did you find these practices useful? Do you apply other frameworks to make sure you&#8217;re not infringing copyright? Let us know in the comments!</p>]]></content:encoded></item><item><title><![CDATA[Google vs IPIDEA: Anatomy of a Residential Proxy Takedown]]></title><description><![CDATA[Google Took Down 16 Million Proxy IPs. Here is Why It Will Not Be Enough.]]></description><link>https://substack.thewebscraping.club/p/google-vs-ipidea-takedown</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/google-vs-ipidea-takedown</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Sun, 08 Feb 2026 20:21:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d1ec9374-4ca3-4d8d-9afb-783dcabe3e9b_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On these pages, we have written several times about how proxy networks work and how they source their IPs. I&#8217;m revealing no secret by saying that, in some cases, companies act in a gray area. </p><p>Think about it: how would you convince several million people to share their spare internet connection with companies when they don&#8217;t know how it will be used? <br>Not an easy task, and some companies took shortcuts, as the IPIDEA case shows.<br>Let&#8217;s look at it in detail. More than the takedown itself, we&#8217;ll use it as an opportunity to look under the hood: how do residential proxy networks acquire millions of IP addresses, what keeps them running, and why are they so difficult to shut down permanently? 
The IPIDEA case provides unusually detailed answers to all these questions.</p><h2>What happens when Big Tech goes after the infrastructure that powers both scrapers and threat actors</h2><p>On January 28, 2026, <a href="https://cloud.google.com/blog/topics/threat-intelligence/disrupting-largest-residential-proxy-network">Google Threat Intelligence Group (GTIG) announced what they called the disruption of &#8220;one of the largest residential proxy networks in the world.&#8221;</a> The target was IPIDEA, a name that most people outside the proxy industry had never heard. Yet according to Google&#8217;s analysis, IPIDEA&#8217;s infrastructure was being used by over 550 distinct threat groups in a single week, including state-sponsored actors from China, North Korea, Iran, and Russia.</p><p>This is not just a story about a takedown. It is a detailed look at how residential proxy networks actually work, how they acquire millions of IP addresses, and why disrupting them is harder than it sounds.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What Google Actually Did</h2><p>The disruption involved three coordinated actions:</p><p>First, Google took legal action to seize the domains used to control enrolled devices and route traffic through them. Without these command-and-control (C2) domains, the SDK code running on millions of devices loses the ability to receive instructions and proxy traffic.</p><p>Second, GTIG shared technical intelligence about IPIDEA&#8217;s SDKs with platform providers, law enforcement, and research firms. The goal was to trigger ecosystem-wide enforcement, getting these SDKs flagged and removed across multiple app stores and platforms.</p><p>Third, Google updated Play Protect to automatically warn users and remove applications known to contain IPIDEA SDKs. This blocks the network&#8217;s ability to recruit new devices on certified Android devices.</p><p>Google claims these actions reduced IPIDEA&#8217;s available device pool by millions. Whether that number holds up over time is a different question, and we will get to that.</p><div><hr></div><blockquote><p><em>Not all residential proxy networks operate in gray zones. 
Decodo built theirs on user consent, ISO 27001 certification, and co-founded the Ethical Web Data Collection Initiative to prove the model works.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo 
Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>The Anatomy of a Residential Proxy Network</h2><p>To understand why this matters, we need to understand what a residential proxy network actually is and how it differs from datacenter proxies.</p><p>A datacenter proxy routes your traffic through IP addresses belonging to cloud providers or hosting companies. These IPs are easy to identify and block because they belong to known ASNs (Autonomous System Numbers) associated with commercial hosting.</p><p>A residential proxy routes traffic through IP addresses assigned by consumer ISPs to regular households. When you connect through a residential proxy, your request appears to come from someone&#8217;s home internet connection in Omaha, Tokyo, or Milan. This makes detection and blocking significantly harder because the traffic looks indistinguishable from a regular consumer browsing the web.</p><p>The challenge for proxy providers is obvious: they need access to millions of consumer devices to build a usable network. These devices need to be online, geographically distributed, and willing (or unwilling) to forward traffic.</p><p>There is an important nuance here. While Google&#8217;s report focuses on residential proxies, the same SDK installation on a mobile phone yields two distinct proxy classes. When the phone is connected to home WiFi, traffic exits through a residential IP assigned by the home ISP. When that same phone disconnects from WiFi and switches to 5G or LTE, traffic now exits through a mobile carrier IP. The device has not changed. The SDK has not changed. But the proxy class has shifted from residential to mobile.</p><p>This matters because mobile proxies are typically sold at a premium, sometimes 2-3x the price of residential proxies. 
Mobile carrier IPs are considered even harder to block than residential IPs because they are shared across thousands of legitimate mobile users through carrier-grade NAT. A single SDK deployment on mobile devices effectively generates inventory for two separate product lines.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><p></p><h3>How Residential Proxy Providers Acquire IP Addresses</h3><p>The IPIDEA takedown revealed the specific mechanisms used to build and maintain a large-scale residential proxy network. These methods fall into several categories, ranging from semi-legitimate to clearly deceptive.</p><h3>SDK Integration</h3><p>The primary method is embedding Software Development Kits into legitimate applications. IPIDEA operated multiple SDK brands: PacketSDK, HexSDK, CastarSDK, and EarnSDK. These SDKs are marketed to app developers as monetization tools. The pitch is simple: integrate our SDK, and we will pay you based on downloads or active users.</p><p>Once embedded, the SDK turns the device into an exit node for the proxy network. The device will accept incoming connections from the proxy infrastructure and forward requests to target websites. The app continues to function normally. The user has no obvious indication that their device is being used as a proxy.</p><p>Google&#8217;s analysis found that many applications containing these SDKs did not disclose this functionality to users. 
The SDK was hidden, not mentioned in the terms of service, and ran silently in the background.</p><h3>Trojanized Applications</h3><p>Beyond SDK integration, IPIDEA directly operated or controlled VPN applications that served as trojan horses. Galleon VPN, Radish VPN, and Aman VPN all provided genuine VPN functionality while simultaneously enrolling devices into the proxy network.</p><p>The logic is effective: users who install VPN applications expect their traffic to be routed through external servers. They are primed to accept unusual network behavior. The proxy functionality hides inside this expected behavior.</p><p>Google identified over 600 Android applications across multiple download sources with code connecting to IPIDEA&#8217;s C2 infrastructure. On Windows, they found 3,075 unique executables making DNS requests to IPIDEA&#8217;s Tier One domains, including applications masquerading as OneDriveSync and Windows Update.</p><h3>Pre-Infected Devices</h3><p>Researchers have documented cases of uncertified Android devices shipping with residential proxy payloads already installed. Set-top boxes, TV boxes, and other IoT devices from off-brand manufacturers have been found with hidden proxy software baked into the firmware.</p><p>This method bypasses the need for user installation entirely. The device arrives compromised.</p><h2>The Technical Architecture</h2><p>Google&#8217;s reverse engineering of the SDK code revealed a two-tier command-and-control system.</p><p>When an infected device starts up, it contacts a Tier One server. The device sends diagnostic information including OS version, device identifier, and a key parameter that appears to be used for affiliate tracking (determining which app developer gets paid for the enrollment). The Tier One server responds with timing configuration and a list of Tier Two server IP addresses.</p><p>The device then periodically polls a Tier Two server, checking for proxy tasks. 
When a task arrives, it contains a target FQDN (like www.google.com:443) and a connection ID. The device establishes a connection to the target, receives data payloads from the Tier Two server, and forwards them unmodified to the destination.</p><p>Google found approximately 7,400 Tier Two servers at the time of their analysis, hosted globally including in the United States. The number fluctuated daily, suggesting a demand-based scaling system.</p><p>The infrastructure analysis revealed something important: despite different brand names (PacketSDK, HexSDK, CastarSDK, EarnSDK), all the SDKs connected to the same pool of Tier Two servers. The brands were marketing fronts for a single unified network.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>The Brand Proliferation Problem</h3><p>This brings us to one of the most interesting findings. IPIDEA did not operate under a single name. Google identified at least 13 ostensibly independent proxy and VPN brands controlled by the same actors:</p><ul><li><p>360 Proxy</p></li><li><p>922 Proxy</p></li><li><p>ABC Proxy</p></li><li><p>Cherry Proxy</p></li><li><p>Door VPN</p></li><li><p>Galleon VPN</p></li><li><p>IP 2 World</p></li><li><p>Ipidea</p></li><li><p>Luna Proxy</p></li><li><p>PIA S5 Proxy</p></li><li><p>PY Proxy</p></li><li><p>Radish VPN</p></li><li><p>Tab Proxy</p></li></ul><p>These brands operated separate websites, separate marketing, and separate pricing. 
A customer comparing &#8220;922 Proxy&#8221; to &#8220;Luna Proxy&#8221; would have no obvious indication that they were buying access to the same underlying network.</p><p>This is not unique to IPIDEA. Industry analysis suggests there are only about 7 truly unique residential proxy networks globally, despite hundreds of brands competing in the market. The rest are resellers, white-labels, or, like IPIDEA, multiple storefronts for the same infrastructure.</p><h2>Why Residential Proxies Are Hard to Block</h2><p>The fundamental problem for defenders is that residential proxy traffic looks legitimate. When a request arrives from a Comcast IP in Chicago, there is no technical marker indicating whether it comes from an actual Comcast customer browsing normally or from a proxy network routing traffic through that customer&#8217;s compromised device.</p><p>The proxy networks exploit this by design. The value proposition they sell is precisely this difficulty of detection.</p><p><a href="https://deviceandbrowserinfo.com/learning_zone/articles/inside-ipidea-residential-proxy-network">Security researcher Antoine Vastel published concrete data that illustrates the scale of this problem</a>. By actively testing proxy endpoints, he verified more than 16 million unique IP addresses that were functional and associated with the IPIDEA network during the 30 days preceding the takedown. The breakdown by brand shows the relative sizes within the IPIDEA ecosystem: PY Proxy (PyProxy) accounted for 13.4 million IPs, PIA S5 Proxy for 2.2 million, and Luna Proxy for 549,000.</p><p>These are not theoretical numbers from marketing materials. These are IP addresses through which Vastel routed traffic and confirmed as working proxy endpoints. And here is the critical insight from his analysis: even with 16 million identified proxy IPs, defenders cannot simply block them.</p><p>The reason is that residential exit nodes mix traffic from automated tools and legitimate human users on the same IP. 
The device owner browses the web normally, while the SDK, in the background, forwards proxy traffic over the same connection. Blocking these IPs based on proxy activity would inevitably block real users who happen to share an IP or whose IP was previously used as an exit node.</p><p>Vastel&#8217;s recommendation is telling: use these IoCs for risk scoring, behavioral enrichment, and incident investigation, but not for direct blocking. The data is context, not a verdict. This fundamental asymmetry makes residential proxies valuable to attackers and frustrating for defenders.</p><p>His research also confirmed another pattern: IP addresses frequently appear across multiple proxy ecosystems simultaneously. IPIDEA did not rely exclusively on its own residential pool. Requests were routed through or resold from other networks. The same IP might be accessible through IPIDEA, a competitor, and a reseller all at once. This interconnection means that even identifying an IP as &#8220;IPIDEA-linked&#8221; does not tell you the full story of how it is being used.</p><p>Traditional IP reputation systems struggle with this complexity. Blocking known bad IPs works for datacenters where the IP assignments are stable, and the ASNs are identifiable. Residential IP addresses rotate frequently as ISPs reassign addresses, and blocking residential IP ranges blocks legitimate users.</p><h2>Google&#8217;s Approach: Attacking the Infrastructure</h2><p>Rather than trying to block individual IP addresses, Google attacked the control infrastructure. By taking down the Tier One C2 domains, they severed the connection between infected devices and the proxy operators. Without C2 connectivity, the SDK code on millions of devices becomes inert.</p><p>This approach has precedent. It is the same strategy used against botnets: identify the command-and-control infrastructure and take it down. 
The infected devices remain infected, but they can no longer receive instructions.</p><p>Google also partnered with Cloudflare to disrupt IPIDEA&#8217;s domain resolution, adding another layer of infrastructure disruption beyond the legal domain seizures.</p><h2>Will It Work? The Persistence Problem</h2><p>Here is where we need to be realistic about the limitations of this approach.</p><p>The takedown disrupted IPIDEA&#8217;s current infrastructure. The domains are gone. The C2 servers are unreachable. Millions of devices are no longer participating in the proxy network. But &#8220;no longer participating&#8221; is not the same as &#8220;cleaned up.&#8221;</p><p>The fundamental problem is that infected devices remain infected. The SDK code is still installed on millions of phones, tablets, TV boxes, and computers worldwide. Google can take down domains. Google can update Play Protect to block new installations. What Google cannot do is reach into millions of devices and uninstall the malicious code that is already there.</p><p>For the SDK to be removed, one of these things needs to happen: the user manually uninstalls the app containing it, the device gets factory reset, or the device gets replaced. None of these happens at scale. Most users are unaware that the SDK exists. The apps that contain it often provide real functionality, games, utilities, and VPNs that users want to keep. There is no mechanism to notify millions of people across dozens of countries that their flashlight app is secretly a proxy node.</p><p>This creates an asymmetry that favors the attackers. Google invested significant legal, technical, and coordination resources to take down IPIDEA&#8217;s infrastructure. The IPIDEA operators (or anyone who acquires their codebase) can spin up new C2 domains, update their DNS configuration, and potentially reactivate a substantial portion of the dormant network. 
The SDK code often includes fallback mechanisms and update capabilities precisely for this scenario.</p><p>The brand proliferation we discussed earlier is part of this resilience. IPIDEA operated 13+ brands. If some domains get seized, others may survive. If the entire IPIDEA operation is compromised, the operators can rebrand entirely and inherit a pre-installed base of millions of devices waiting for new instructions.</p><p>Google acknowledged this reality in its announcement, noting that &#8220;this industry appears to be rapidly expanding&#8221; and that &#8220;there are significant overlaps across providers.&#8221; The reseller and partnership agreements that connect different proxy brands mean that disruption propagates unpredictably through the ecosystem, but so does recovery.</p><p>This is not a battle that can be won definitively. It is a cost-imposition strategy. The goal is to make operating these networks so expensive and risky that some operators exit the market or shift to more legitimate practices. But as long as the infected device base exists, the infrastructure can be rebuilt. The realistic outcome is degradation, not eradication.</p><div><hr></div>
<h2>The Economics Behind Residential Proxy Networks</h2><p>Understanding why these networks exist requires understanding the economics.</p><p>Residential proxy bandwidth sells for $4-8 per gigabyte or more to end customers. The cost to acquire that bandwidth through SDK partnerships is measured in cents per gigabyte. The gross margins appear enormous.</p><p>But running a proxy operation is not just bandwidth arbitrage. The real costs include:</p><p><strong>Engineering and Infrastructure</strong>: Maintaining thousands of C2 servers globally, building rotation logic, and handling the unreliable nature of consumer devices going online and offline unpredictably.</p><p><strong>SDK Distribution</strong>: Paying app developers for integration, maintaining relationships with publishers, and navigating app store policies that increasingly scrutinize monetization SDKs. The silver lining is that mobile app SDKs generate dual inventory: residential IPs when devices are on WiFi, mobile IPs when they switch to cellular. This allows providers to sell the same underlying device pool across two product categories at different price points.</p><p><strong>Customer Acquisition</strong>: Finding buyers for proxy services is expensive. The market is niche, competition is intense, and customers are price-sensitive.</p><p><strong>Legal and Compliance</strong>: Or in IPIDEA&#8217;s case, the lack thereof. Operating in legal gray zones creates ongoing risk.
The Google takedown demonstrates what happens when that risk materializes.</p><p>Industry estimates suggest customer acquisition costs consume 40-60% of revenue even at scale. Once those costs are counted, the apparent margin shrinks, and most residential proxy providers operate on thin profits despite the high sticker prices.</p><p>This economic pressure explains the proliferation of brands. Running multiple storefronts for the same underlying network lets operators segment the market, test different price points, and spread legal risk across multiple corporate entities.</p><h2>The Bigger Picture</h2><p>The IPIDEA takedown is part of a larger pattern. Google previously took action against the BadBox 2.0 botnet, which shared infrastructure with IPIDEA. Law enforcement agencies worldwide are paying greater attention to residential proxy networks, recognizing the role this infrastructure plays in facilitating activities ranging from credential stuffing to espionage.</p><p>The residential proxy industry has operated partly in a legal gray zone for years. The Google action, particularly the legal component, establishes a clearer precedent that enrolling devices without consent and facilitating malicious activity create meaningful legal exposure.</p><p>This does not mean residential proxies are going away. The demand exists, and where demand exists, supply follows. However, the industry may be compelled toward more transparent practices: clearer consent mechanisms, improved disclosure in applications that embed SDKs, and more careful vetting of customers who purchase proxy access.</p><p>For those of us who work with web scraping, the takedown is a reminder that the infrastructure we rely on has a supply chain.
Understanding the supply chain, including its technical architecture, its business model, and its vulnerabilities, helps us make better decisions about which providers to trust and how to build resilient scraping systems.</p><p>The IPIDEA network may be rebuilt, rebranded, or replaced by competitors. However, the detailed technical analysis Google published provides the entire industry with greater visibility into how these networks operate. That visibility, more than the takedown itself, may be the most lasting impact.</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #97: My first week with OpenClaw]]></title><description><![CDATA[160,000 Stars in Two Months: What OpenClaw Means for Scrapers]]></description><link>https://substack.thewebscraping.club/p/my-first-week-with-openclaw</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/my-first-week-with-openclaw</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 05 Feb 2026 15:25:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5a3676f5-c43a-43eb-b4fd-266e2426505f_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On November 24, 2025, a new open-source project appeared on GitHub. Two months later, it has over 160,000 stars. The project is OpenClaw, and it describes itself as &#8220;the AI that actually does things.&#8221; And if you are interested in tech/AI and you don&#8217;t live under a rock, you&#8217;ve probably heard about it, since it gained huge popularity in the past weeks. 
<a href="https://newsletter.pragmaticengineer.com/p/the-creator-of-clawd-i-ship-code">There is even a podcast episode with its creator on the <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;The Pragmatic Engineer&quot;,&quot;id&quot;:458709,&quot;type&quot;:&quot;pub&quot;,&quot;url&quot;:&quot;https://open.substack.com/pub/pragmaticengineer&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5ecbf7ac-260b-423b-8493-26783bf01f06_600x600.png&quot;,&quot;uuid&quot;:&quot;0119e766-249a-4df6-bdb3-010c1eff28ec&quot;}" data-component-name="MentionToDOM">The Pragmatic Engineer</span> newsletter</a>, which I highly recommend.</p><p>Of course, I could not resist trying it, and over the past ten days I&#8217;ve been playing with OpenClaw. What follows is what I learned, what surprised me, and why I think it can also be interesting for people doing web scraping, even if that is not its core use.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w,
https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What OpenClaw actually is</h2><p><a href="https://openclaw.ai/">OpenClaw</a> is an AI assistant that runs locally on your machine. It can control your computer: read and write files, execute shell commands, manage your calendar, send emails, and browse the web. You interact with it through messaging apps you already use. Telegram, WhatsApp, Discord, Signal, Slack, iMessage. You message it like you would message a coworker, and it does things on your behalf.</p><p>The key difference from cloud-based assistants is that everything runs on your infrastructure. The agent lives on your computer, sees what you allow it to see, and acts within the boundaries you define. It supports multiple AI backends: Anthropic Claude, OpenAI models, or local models if you prefer to keep everything offline.</p><p>The tagline on the repository says it well: &#8220;A smart model with eyes and hands at a desk with keyboard and mouse.&#8221;</p><p>What makes OpenClaw more than a simple automation wrapper is the breadth of its integration layer. Out of the box, it connects to over 50 services and tools. Gmail for email, Obsidian for notes, GitHub for repositories, Spotify for music, Philips Hue for smart home control. Each integration is a capability the agent can invoke when relevant. 
You do not need to specify which tool to use. You describe what you want, and the agent figures out which integration applies.</p><p>The architecture includes a skill system that deserves attention. Skills are modular capabilities that the agent can learn, create, and modify. Users have reported that OpenClaw has written its own extensions and updated its own prompts autonomously. This is not science fiction. The agent has file system access and can modify its own configuration. Whether this is exciting or terrifying depends on your perspective.</p><p>There is also a memory layer. OpenClaw remembers context across sessions. It learns your preferences, your common requests, and your workflow patterns. Over time, it becomes more useful because it accumulates knowledge about how you work. This persistence is stored locally, which matters for privacy, but it also means the agent builds an increasingly detailed model of your behavior.</p><p>Background execution is another differentiator. OpenClaw can run cron jobs, scheduled reminders, and background tasks. You can tell it to check something every hour or to notify you when a condition is met. It is not just reactive. It can be proactive, monitoring and acting without constant prompting.</p><p>Finally, it works in group chats. You can add your OpenClaw bot to a Telegram group, and it will participate in conversations, responding when mentioned or when configured to do so. This opens possibilities for shared assistants across teams, though it also multiplies the security considerations.</p><div><hr></div><blockquote><p><em>If you don&#8217;t use LLMs for scraping, you need IPs with good reputation. 
For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">that&#8217;s sharing with TWSC readers this offer</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" height="87.84" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC 
for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><h2>Setting it up</h2><p>Installation is straightforward. A one-liner gets you started:</p><pre><code>curl -fsSL https://openclaw.ai/install.sh | bash</code></pre><p>Or via npm:</p><pre><code>npm i -g openclaw
openclaw onboard</code></pre><p>I went with the npm route. The onboarding process walks you through connecting your AI provider and setting up your first communication channel.</p><p>For Telegram, you create a bot through BotFather, grab the token, and configure it in OpenClaw. The CLI guides you through the pairing process. When someone messages your bot for the first time, you approve them with a simple command:</p><pre><code>openclaw pairing approve telegram &lt;CODE&gt;</code></pre><p>After that, you can message your bot, and it responds as your personal assistant.</p><p>The browser extension required a few extra steps. The official documentation says it works with Chrome only, but since Brave is Chromium-based, I decided to try it anyway. It worked. You install the extension with:</p><pre><code>openclaw browser extension install</code></pre><p>Then load it as an unpacked extension in Brave by enabling Developer Mode and pointing to the path returned by <code>openclaw browser extension path</code>. The extension lets you attach specific tabs to OpenClaw&#8217;s control. Only attached tabs can be controlled, which is a reasonable security measure.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Living with it</h3><p>I configured OpenClaw to use Claude as its backend. The experience has been surprisingly natural. I message it on Telegram with requests, and it executes them. Check my calendar for conflicts. Find a file I worked on last week. Send a reminder at 3 pm.</p><p>What makes it different from simply asking ChatGPT is that it actually does things.
It does not tell me how to check my calendar. It checks my calendar and tells me what it found.<br><br>I also connected it to my TWSC accounting system, which has some APIs. In a few steps, OpenClaw built a small app that lets me use Telegram to check invoice status, revenues, expenses, and more.</p><p>The browser extension adds another dimension. I can ask it to navigate to a website, read specific content, fill out forms, or extract information. It operates within a real browser session, with real cookies, logged into my real accounts if I choose to attach those tabs. <br>This is what I used to automate posting a note on Substack, which doesn&#8217;t have an API. I just prompted the desired message, and the LLM (Claude 4.5) understood where on the Substack website notes are posted and how to create one.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t8vV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t8vV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png 424w, https://substackcdn.com/image/fetch/$s_!t8vV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png 848w, https://substackcdn.com/image/fetch/$s_!t8vV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png 1272w, 
https://substackcdn.com/image/fetch/$s_!t8vV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t8vV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png" width="648" height="173" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:173,&quot;width&quot;:648,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23027,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/186961073?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t8vV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png 424w, https://substackcdn.com/image/fetch/$s_!t8vV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png 848w, https://substackcdn.com/image/fetch/$s_!t8vV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png 
1272w, https://substackcdn.com/image/fetch/$s_!t8vV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fdc5ff4-81e8-42a8-826f-fbd62a3f3fc3_648x173.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div><hr></div><h3>The elephant in the room</h3><p>I need to address something that should be obvious by now. Using OpenClaw means giving an AI agent extensive access to your computer. It can see your browser sessions, your files, your credentials. It can execute commands.</p><p>Consider what this means in practice. When you attach a browser tab, the agent can read every cookie, every session token, every piece of data on that page. If you attach a tab where you are logged into your email, the agent can read your email. If you attach a tab with your banking session, the agent theoretically has access to that session. The same applies to files. If you give OpenClaw access to your file system, it can read your SSH keys, your environment files with API credentials, your password manager exports if you have any lying around.</p><p>The agent can also execute shell commands. This means it can install software, modify system configuration, create network connections, and run arbitrary code. OpenClaw does have a sandbox mode that restricts some of these capabilities, but the default configuration is permissive because that is what makes it useful.</p><p>With my current setup using Claude as the backend, every interaction passes through Anthropic&#8217;s servers. The model sees what I ask, sees the context from my computer, and processes it on their infrastructure. When I ask OpenClaw to read a balance sheet, the request goes to Anthropic&#8217;s API, is processed, and returns. I trust Anthropic&#8217;s privacy practices, but this is still a significant amount of sensitive data leaving my machine.</p><p>There is also the question of prompt injection and model behavior. 
What happens if you navigate to a malicious page that contains instructions designed to manipulate the agent? Modern LLMs are susceptible to prompt injection attacks where content on a webpage could potentially influence the agent&#8217;s behavior. This is not theoretical. It is an active area of security research, and there are no perfect defenses yet.</p><p>The risk profile changes depending on your configuration. Using a cloud model like Claude or GPT-4 means your data flows through external servers, but those models are also more capable and more likely to handle edge cases correctly. Using a local model keeps everything on your machine, but local models may make mistakes that a more capable model would avoid.</p><p>This is why my next step is to migrate to a local model. OpenClaw supports running with local LLMs, which means the entire pipeline can stay on my hardware. The tradeoff is capability. Local models are not yet at the level of Claude or GPT-4 for complex reasoning tasks. But for an assistant that executes relatively simple commands, they might be good enough. And critically, my cookies, my session tokens, my file contents never leave my network.</p><p>Practical recommendations if you decide to use OpenClaw: use a dedicated browser profile for attached tabs, separate from your personal browsing. Do not attach tabs with active banking or financial sessions. Be cautious about which directories you give the agent access to. Consider running in sandbox mode until you understand the tool&#8217;s behavior. And seriously consider the local model option if you plan to use this for anything sensitive.</p><p>If you are going to run an AI agent with this level of access, running it locally is the only configuration that makes full sense from a security perspective. 
The convenience of cloud models is real, but so is the exposure.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Why this matters for web scraping</h2>
      <p>
          <a href="https://substack.thewebscraping.club/p/my-first-week-with-openclaw">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[WebDriver vs Chrome DevTools Protocol (CDP) vs WebDriver BiDi: How We Control Browsers]]></title><description><![CDATA[Do you know how browser automation libraries actually manage to control browsers? Let&#8217;s find out!]]></description><link>https://substack.thewebscraping.club/p/webdriver-vs-cdp-vs-bidi</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/webdriver-vs-cdp-vs-bidi</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 01 Feb 2026 20:07:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f3990ed3-2f27-4035-9f16-1768624098ea_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You may have built dozens of scraping or automation scripts using Selenium, Playwright, or Puppeteer. But have you ever stopped to wonder how these libraries actually control the underlying browser instances?</p><p>It&#8217;s not magic, but the result of a few very specific mechanisms&#8212;namely, browser automation protocols. The most important ones are:</p><ul><li><p>WebDriver</p></li><li><p>Chrome DevTools Protocol (CDP)</p></li><li><p>WebDriver BiDi</p></li></ul><p>In this post, I&#8217;ll break down each of them to cover what they are, how they work at a low level, and how they enable programmatic browser control. 
You&#8217;ll also discover where browser automation is headed next!</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p><div><hr></div></blockquote><h2>Everything You Need to Know About WebDriver</h2><p>Let me start this WebDriver vs Chrome DevTools Protocol (CDP) vs WebDriver BiDi piece by focusing first on the protocol at the top of the list: WebDriver!</p><h3>What It Is</h3><p>WebDriver is a <a href="https://www.w3.org/standards/types/#x5-1-recommendation">W3C Recommendation</a> that standardizes a remote control interface for automating and inspecting &#8220;<em>user agents</em>&#8221; (in other words, web browsers). In practical terms, it lets external programs interact with a browser through a language- and platform-agnostic protocol.</p><p>In detail, WebDriver exposes a concise, object-oriented API for cross-browser control. This makes it a reliable and realistic foundation for end-to-end testing and automation, whether the browser runs locally or on a remote machine.</p><p><strong>&#128214; Further reading</strong>:</p><ul><li><p><em><a href="https://www.w3.org/TR/webdriver1/">WebDriver W3C Recommendation</a></em></p></li><li><p><em><a href="https://github.com/w3c/webdriver">WebDriver Standard GitHub repository</a></em></p></li><li><p><em><a href="https://developer.mozilla.org/en-US/docs/Web/WebDriver">WebDriver documentation on MDN</a></em></p></li></ul><h3>How It Works</h3><p>The W3C WebDriver protocol isn&#8217;t tied to any specific programming language or framework. 
Because of that, browser automation client libraries built on top of it, regardless of the language they&#8217;re written in, are essentially thin wrappers. Their main job is to translate application-level API calls into WebDriver-compliant commands and then handle the results produced by the browser.</p><p>In more detail, there are three main components to keep in mind:</p><ul><li><p><strong>The WebDriver-based browser automation client library</strong> (e.g., Selenium) that exposes a developer-friendly, application-level API.</p></li><li><p><strong>A browser-specific driver server</strong> (e.g., <a href="https://developer.chrome.com/docs/chromedriver">ChromeDriver</a>, <a href="https://github.com/mozilla/geckodriver">geckodriver</a>). This is usually a standalone executable that runs on your machine or in a remote environment. It understands the WebDriver protocol and maps incoming commands to the browser&#8217;s native automation interfaces. Depending on the automation tool you&#8217;re using, this driver may be downloaded and managed automatically, or you may have to install and version-match it yourself.</p></li><li><p><strong>The browser application</strong> itself (Chrome, Firefox, Edge, Safari, etc.), which ultimately performs the actions.</p></li></ul><p>At runtime, every WebDriver interaction follows a strict client&#8211;server communication model defined by the W3C specification. 
Thus, when a script built with a browser automation client issues a command (e.g., <em>element.click()</em>), the following happens under the hood:</p><ol><li><p>The client serializes the command into a standardized request (targeting a specific endpoint and including a well-defined JSON payload) as defined by the W3C WebDriver specification.</p></li><li><p>That request is sent over HTTP to a browser-specific WebDriver server endpoint exposed by the driver.</p></li><li><p>The driver server receives the request, interprets the WebDriver protocol command, and maps it to the appropriate native browser automation call.</p></li><li><p>The browser executes the action as a real user would.</p></li><li><p>The browser returns the execution result (e.g., status, errors, or requested data) back to the driver using browser-internal communication mechanisms.</p></li><li><p>The driver wraps that result into a WebDriver-compliant response, including a standardized status code and JSON payload.</p></li><li><p>The response is sent back over HTTP to the client, which deserializes it into application-level objects.</p></li></ol><p>One of the biggest advantages of modern WebDriver servers speaking the W3C WebDriver protocol directly is predictability. Because the communication language is standardized, different browser drivers implement the same semantics, resulting in more consistent behavior across browsers and environments.</p><p>This architecture also explains why WebDriver requires a browser-specific driver server in the first place. The protocol itself is browser-agnostic, while the driver is responsible for bridging that standardized protocol to each browser&#8217;s internal automation APIs.</p><h3>Scraping and Automation Libraries Built on Top of It</h3><p>Out of the libraries built on top of the WebDriver mechanism, <a href="https://www.selenium.dev/">Selenium</a> is by far the most popular and widely adopted. 
Other interesting libraries include <a href="https://webdriver.io/">WebDriverIO</a> and <a href="https://nightwatchjs.org/">Nightwatch.js</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TvE5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TvE5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png 424w, https://substackcdn.com/image/fetch/$s_!TvE5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png 848w, https://substackcdn.com/image/fetch/$s_!TvE5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png 1272w, https://substackcdn.com/image/fetch/$s_!TvE5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TvE5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png" width="1456" height="435" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113580,&quot;alt&quot;:&quot;Libraries supporting the WebDriver protocol&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/181593018?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Libraries supporting the WebDriver protocol" title="Libraries supporting the WebDriver protocol" srcset="https://substackcdn.com/image/fetch/$s_!TvE5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png 424w, https://substackcdn.com/image/fetch/$s_!TvE5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png 848w, https://substackcdn.com/image/fetch/$s_!TvE5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png 1272w, https://substackcdn.com/image/fetch/$s_!TvE5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1159ffd-0b02-4dd6-adde-dc7b95841c40_1920x574.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Libraries supporting the WebDriver protocol</figcaption></figure></div><p>The main challenge with this type of library is configuring the correct browser driver server. Luckily, as of <a href="https://www.selenium.dev/blog/2023/selenium-4-10-0-released/">Selenium 4.10</a> (released in June 2023), you no longer need to manually download the driver that matches your browser type and version. 
Selenium now automatically detects, downloads, and configures the appropriate driver, ensuring that your tests or scraping scripts run smoothly out of the box.</p><h3>Extra: What About Selenium 3&#8217;s JSON Wire Protocol? A Bit of History&#8230;</h3><p>Before the W3C standardized the WebDriver protocol in 2018, older versions of Selenium (especially before 3.8) relied on a non-standard <a href="https://www.selenium.dev/documentation/legacy/json_wire_protocol/">JSON Wire Protocol</a>.</p><p>In this architecture, the client library serialized commands into JSON, but there was no unified specification. Thus, each browser driver (developed by different teams) had to implement its own logic to map those instructions to the browser&#8217;s native automation APIs.</p><p>That created a kind of &#8220;dialect problem,&#8221; where the same command could behave slightly differently or have different timing across Chrome, Firefox, Internet Explorer, and Safari. These inconsistencies were a major source of latency and flaky behavior.</p><p>Selenium 4 resolved that by adopting the W3C WebDriver protocol as its standard, eliminating the intermediary translation layer and ensuring consistent, predictable automation across browsers.</p><div><hr></div><h2>Chrome DevTools Protocol (CDP) Explained</h2><p>Time to dive into the second protocol under analysis: the Chrome DevTools Protocol (also known simply as CDP).</p><h3>What It Is</h3><p>The Chrome DevTools Protocol (CDP) is a low-level, JSON-based protocol that lets you inspect, debug, and instrument web pages in Chromium-based browsers, such as Chrome, Edge, Brave, and Opera.</p><p>It provides programmatic access to browser internals, enabling control over the DOM, network requests, and performance metrics. 
The protocol is designed by Google and is commonly used for automation, testing, and scraping.</p><p><strong>&#128214; Further reading</strong>:</p><ul><li><p><em><a href="https://chromedevtools.github.io/devtools-protocol/">Chrome DevTools Protocol Docs</a></em></p></li><li><p><em><a href="https://github.com/ChromeDevTools/devtools-protocol">Chrome DevTools Protocol GitHub repository</a></em></p></li></ul><h3>How It Works</h3><p>CDP is organized into domains. Domains represent functional areas in the browser and handle specific tasks, such as DOM manipulation, network interception, console logging, performance profiling, and device or network emulation.</p><p>Each domain exposes commands and events:</p><ul><li><p>Commands are JSON requests sent to the browser.</p></li><li><p>Events are JSON messages that the browser pushes to the client.</p></li></ul><p>Commands and events are exchanged over a <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSocket">WebSocket</a> connection, which supports fast, bidirectional communication. The browser&#8217;s HTTP endpoints (such as <em>/json/version</em>) mainly serve to discover targets and their WebSocket URLs.</p><p>When it comes to browser control via CDP, there are only two elements at play:</p><ol><li><p><strong>The CDP-based browser automation client</strong>: A library (e.g., Playwright) that communicates with the browser using JSON commands over the Chrome DevTools Protocol.</p></li><li><p><strong>A Chromium-based browser</strong>: Exposes a CDP endpoint (conventionally on <em>ws://localhost:9222</em> for local instances started with the <em>--remote-debugging-port=9222</em> flag). For remote browsers, libraries like Playwright can control them using a CDP URL, which typically starts with <em>wss://</em> (WebSocket over TLS).</p></li></ol><p>Now, here&#8217;s what happens under the hood when you call a high-level API like <em>page.screenshot()</em> in a CDP-based library:</p><ol><li><p>The client library establishes a WebSocket session with the browser&#8217;s CDP endpoint. 
This creates a bidirectional communication channel between the automation library and the browser.</p></li><li><p>The client sends a JSON command targeting a specific domain and method, with a message ID and the required parameters (e.g., <em>{&quot;id&quot;:1,&quot;method&quot;:&quot;Page.captureScreenshot&quot;,&quot;params&quot;:{&quot;format&quot;:&quot;jpeg&quot;}}</em>).</p></li><li><p>The browser receives the JSON request and maps it to the corresponding native browser operation, such as capturing a screenshot.</p></li><li><p>The browser executes the action as if a real user or system process had triggered it.</p></li><li><p>The browser sends a JSON response back to the client over the WebSocket, matched to the request by its ID and containing execution results, metrics, or errors.</p></li><li><p>The automation client parses the response and converts it into usable objects or data structures for scripts or tests.</p></li></ol><p><strong>Note</strong>: Asynchronous events, like network requests or DOM mutations, are sent by the browser over the same WebSocket channel and can be subscribed to.</p><p>Unlike the W3C WebDriver protocol, which standardizes client-server browser automation across all major browsers, the Chrome DevTools Protocol (as the name suggests) is Chromium-specific.</p><h3>Scraping and Automation Libraries Built on Top of It</h3><p>Many libraries and frameworks leverage the Chrome DevTools Protocol to provide higher-level browser automation capabilities for testing, monitoring, or scraping, such as:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kllY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!kllY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png 424w, https://substackcdn.com/image/fetch/$s_!kllY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png 848w, https://substackcdn.com/image/fetch/$s_!kllY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png 1272w, https://substackcdn.com/image/fetch/$s_!kllY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kllY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png" width="1456" height="669" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5ad7d90-5238-4d22-b665-872da3099667_1920x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:669,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:175998,&quot;alt&quot;:&quot;Libraries using 
CDP&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/181593018?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Libraries using CDP" title="Libraries using CDP" srcset="https://substackcdn.com/image/fetch/$s_!kllY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png 424w, https://substackcdn.com/image/fetch/$s_!kllY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png 848w, https://substackcdn.com/image/fetch/$s_!kllY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png 1272w, https://substackcdn.com/image/fetch/$s_!kllY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5ad7d90-5238-4d22-b665-872da3099667_1920x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 
7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Libraries using CDP</figcaption></figure></div><p><strong>Note</strong>: Until the WebDriver BiDi protocol is fully supported, Selenium 4 also <a href="https://www.selenium.dev/documentation/webdriver/bidi/cdp/">gives access to CDP features where applicable</a>, in addition to the standard W3C WebDriver protocol.</p><p>Learn more about Pydoll in <a href="https://substack.thewebscraping.club/p/pydoll-webdriver-scraping">my previous article for The Web Scraping Club!</a></p><h3>Extra: What About the Firefox Remote Debug Protocol?</h3><p>CDP works specifically with Chromium-based browsers, but how can Playwright (and other automation libraries) control Firefox? The answer is the <a href="https://firefox-source-docs.mozilla.org/devtools/backend/protocol.html">Firefox Remote Debug Protocol</a>.</p><p>Similar to CDP, the Mozilla protocol allows a debugger or automation client to connect to Gecko-based browsers. 
In particular, it provides a unified view of the DOM, CSS rules, and other client-side web technologies.</p><p>At one point, Firefox even offered limited CDP support, but this <a href="https://fxdx.dev/cdp-retirement-in-firefox/">was deprecated in Firefox Nightly 141</a> to jump on the WebDriver BiDi train.</p><p><em>(And for Safari? The CDP-equivalent protocol is the WebKit Debug Protocol!)</em></p><h2>Understanding WebDriver BiDi: The What, Why, How, and When</h2><p>WebDriver relies on a strict request-response model over HTTP. Commands are synchronous and unidirectional: the client sends a request to the browser, waits for a response, and then sends the next command.</p><p>That approach works well for standard UI testing, ensuring actions like clicks, typing, and navigation occur in the correct order. Still, it limits asynchronous interactions (such as monitoring network requests, console logs, or DOM changes) because the browser cannot push updates on its own, so the client must continuously poll for them.</p><p>To overcome those limitations, the Selenium team, together with major browser vendors, is developing the WebDriver BiDi (Bidirectional) Protocol. As of this writing, it is a <a href="https://www.w3.org/standards/types/#WD">W3C Working Draft (WD)</a>, and it is designed to provide real-time, cross-browser automation.</p><p>WebDriver BiDi introduces bidirectional communication via WebSockets, allowing the browser to push events directly to the client as they occur. This enables streaming of logs, network activity, JavaScript exceptions, and other runtime events without the overhead of repeated HTTP requests, resulting in faster, more responsive, and richer automation.</p><p>Basically, BiDi combines the strengths of traditional WebDriver and the Chrome DevTools Protocol (CDP). While CDP offers low-level control over Chromium-based browsers, it isn&#8217;t standardized across browsers. 
BiDi fills that gap by providing true cross-browser support, while also opening the door to features that were previously limited to Chromium&#8212;such as network interception, performance monitoring, and console logging&#8212;in a consistent way across Firefox and Safari.</p><p><strong>&#128214; Further reading</strong>:</p><ul><li><p><em><a href="https://www.w3.org/TR/webdriver-bidi/">WebDriver BiDi W3C Working Draft</a></em></p></li><li><p><em><a href="https://github.com/w3c/webdriver-bidi">WebDriver BiDi GitHub page</a></em></p></li><li><p><em><a href="https://www.selenium.dev/documentation/webdriver/bidi/">Selenium&#8217;s BiDirectional functionality documentation page</a></em></p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>WebDriver vs CDP vs BiDi: Final Comparison</h2><p>Compare the most important protocols for browser control in the summary table below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GvgN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GvgN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png 424w, 
https://substackcdn.com/image/fetch/$s_!GvgN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png 848w, https://substackcdn.com/image/fetch/$s_!GvgN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png 1272w, https://substackcdn.com/image/fetch/$s_!GvgN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GvgN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png" width="1456" height="435" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:140040,&quot;alt&quot;:&quot;WebDriver vs CDP vs BiDi&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/181593018?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="WebDriver vs CDP vs BiDi" title="WebDriver vs CDP vs BiDi" 
srcset="https://substackcdn.com/image/fetch/$s_!GvgN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png 424w, https://substackcdn.com/image/fetch/$s_!GvgN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png 848w, https://substackcdn.com/image/fetch/$s_!GvgN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png 1272w, https://substackcdn.com/image/fetch/$s_!GvgN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F281ee305-57f5-4130-aa9e-51f844538ed1_1920x573.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">WebDriver vs CDP vs BiDi</figcaption></figure></div><p><strong>WebDriver (Classic)</strong></p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>W3C standard</p></li><li><p>Stable and predictable model</p></li><li><p>Cross-browser consistency</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>Unidirectional and synchronous, with no real-time event streaming</p></li><li><p>Limited low-level browser access</p></li><li><p>Polling-based architecture, which can be slower for advanced use cases</p></li><li><p>Requires external browser-specific driver binaries (e.g., ChromeDriver, GeckoDriver) that must be downloaded, managed, and kept in sync with browser versions</p></li></ul><p><strong>CDP / Firefox Remote Debug Protocol / WebKit Debug Protocol</strong></p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Fully bidirectional, with real-time event streaming via WebSockets</p></li><li><p>Deep, low-level browser control (network interception, performance, DOM internals, etc.)</p></li><li><p>Excellent for debugging, scraping, and advanced automation</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>Not standardized, relying on engine-specific APIs and semantics</p></li><li><p>Not cross-browser, as each protocol targets a single engine (Chromium-only, Gecko-only, or WebKit-only)</p></li><li><p>API instability, where protocol changes may require frequent updates to client libraries</p></li></ul><p><strong>WebDriver BiDi</strong></p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>W3C-backed, cross-browser standard for modern browser automation</p></li><li><p>Bidirectional communication with 
real-time events (network activity, logs, JavaScript errors, etc.)</p></li><li><p>Combines WebDriver&#8217;s stability with CDP-like advanced capabilities</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>Still an evolving draft</p></li><li><p>Ecosystem adoption is ongoing</p></li><li><p>Requires managing a browser driver server</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Final Comment: Is WebDriver BiDi the Future?</h2><p>Yes, it appears so! But some explanation is needed&#8230;</p><p><strong>Note</strong>: Since WebDriver BiDi is the next-generation iteration of the WebDriver protocol, the request-response model used by Selenium is now referred to as &#8220;WebDriver Classic.&#8221;</p><p>The Selenium team is actively transitioning from WebDriver Classic to WebDriver BiDi and gradually replacing its CDP support for cross-browser automation, all while maintaining backward compatibility with existing tests. Similarly, <a href="https://www.cypress.io/blog/announcing-cypress-support-for-firefox-over-webdriver-bidi">Cypress has already adopted BiDi for Firefox automation</a>. In Puppeteer, when launching Firefox, <a href="https://pptr.dev/webdriver-bidi">WebDriver BiDi is enabled by default</a>. 
Other major browser automation tools like <a href="https://github.com/microsoft/playwright/issues/37277">Playwright are also exploring BiDi support</a>.</p><p>So, if WebDriver Classic will eventually be replaced by BiDi, what about CDP?</p><p><strong>WebDriver BiDi doesn&#8217;t aim to replace CDP!</strong></p><p>Chrome DevTools Protocol remains optimized for low-level, Chromium-specific debugging and browser control. In contrast, BiDi is a modern, cross-browser standard focused on test automation. That&#8217;s why Puppeteer still uses CDP when launching Chrome (as CDP features aren&#8217;t yet fully supported by BiDi).</p><p>BiDi&#8217;s goal is to standardize automation across browsers, not to replace engine-specific debugging tools like CDP, Firefox Remote Debug Protocol, or WebKit Web Inspector.</p><p>For the foreseeable future, WebDriver Classic will gradually be phased out. Chromium browsers will continue to support CDP for low-level debugging, while BiDi complements it by providing standardized, real-time, cross-browser automation and testing capabilities.</p><div><hr></div>
<h2>Conclusion</h2><p>In this post, I&#8217;ve outlined the three main protocols used by browser automation libraries for testing and web scraping to programmatically control browser instances.</p><p>As you&#8217;ve seen, Playwright and Selenium follow different approaches, relying on distinct sets of protocols. However, WebDriver and CDP are ultimately complementary, each serving its own purpose in the automation ecosystem.</p><p>Feel free to share your thoughts or questions in the comments. Until next time!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #96: Scraping Nike.com with 5 open source tools]]></title><description><![CDATA[Match your tool to the protection, not the brand]]></description><link>https://substack.thewebscraping.club/p/scraping-nike-with-open-source</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/scraping-nike-with-open-source</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 29 Jan 2026 21:11:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/90126f2b-9893-4a93-adf9-f108b23ea197_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Nike.com is one of the most scraped e-commerce targets on the web. Competitors track pricing, researchers analyze catalog changes, and aggregators build product databases. 
Then there is the sneaker resale market, which alone generates billions in annual revenue; much of that ecosystem relies on scraped data: release dates, stock levels, and price fluctuations across regions.<br>So it&#8217;s understandable that this website is both popular among scraping professionals and protected by anti-bot measures.</p><p>For this reason, we tested five open-source scraping tools on 1,000 Nike product URLs to measure success rate, speed, and reliability. The results challenge a common assumption in the scraping community: that modern e-commerce sites require browser automation to reliably extract data.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Nike.com system model</h2><p>Before testing, we need to understand what Nike.com actually serves and where potential blocking might occur.</p><p>Nike.com is protected by both Akamai Bot Manager and Kasada, but the two systems guard different parts of the site. Akamai handles the public-facing catalog, including product pages and search results. Kasada protects authenticated flows, such as login and checkout. This layered approach makes sense from Nike&#8217;s perspective: catalog data is semi-public anyway (they want customers to browse), while account actions carry real business risk.</p><p>We focus exclusively on public catalog data in this article. Scraping behind authentication raises legal and ethical concerns we prefer not to encourage. For the product catalog, we only need to bypass Akamai. We found no trace of Kasada challenges or fingerprinting scripts on product pages during our tests.</p><p>Nike product pages are server-side rendered. When you request a product URL, the server returns complete HTML with product data embedded in the DOM. This differs from many modern e-commerce sites, which use client-side rendering: the initial HTML is a shell and product data loads via JavaScript API calls. 
The practical consequence for scrapers: JavaScript execution is not required to extract product information, because the data is already in the first response.</p><p>So we can use plain HTTP requests, but with a caveat: modern WAFs inspect TLS handshake characteristics (cipher suites, extensions, ordering) to identify non-browser clients. This is where a tool like Python&#8217;s <code>requests</code> library fails immediately: its TLS fingerprint looks nothing like Chrome or Firefox.</p><p>HTTP/2 fingerprinting adds another layer: header order, pseudo-header placement, and SETTINGS frames can reveal automation tools. Even if your TLS handshake passes, sending headers in the wrong order or with unusual HTTP/2 settings can trigger detection.</p><div><hr></div><blockquote><p><em>First of all, you need IPs with good reputations for scraping. For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">which is sharing this offer with TWSC readers</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, 
https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" height="87.84" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><p>On top of that, IP reputation plays a role. Datacenter IPs receive more scrutiny than residential ones because most legitimate users browse from home or mobile networks. </p><p>For this test, we focused on TLS and HTTP/2 fingerprinting. We scraped from a residential IP, which neutralized IP reputation as a variable. We did not interact with the page, so behavioral signals were not applicable for HTTP clients. We observed no JavaScript challenges on Nike product pages during testing. This last point is crucial: Nike could deploy Akamai&#8217;s JavaScript challenge on product pages, but they have chosen not to. Whether this is a deliberate trade-off (challenges slow down real users) or an oversight, we cannot say. 
But it opens the door to HTTP-based scraping.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q-5-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q-5-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!q-5-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!q-5-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!q-5-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q-5-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg" width="1360" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240409,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/186195972?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q-5-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!q-5-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!q-5-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!q-5-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda34d51d-ddf3-4cf9-9143-9f16acb6d6f7_1360x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>The Tool Landscape</h3><p>The five tools we tested fall into two categories: browser automation and HTTP clients with fingerprint emulation.</p><p>On the browser side, <strong><a href="https://github.com/pydoll/pydoll">Pydoll</a></strong> is an async Python library built on Chrome DevTools Protocol. 
It controls Chromium without WebDriver, avoiding the <code>navigator.webdriver</code> flag.</p><p><strong><a href="https://github.com/daijro/camoufox">Camoufox</a></strong> takes a different approach: it is a custom Firefox build that spoofs fingerprints (WebGL, canvas, audio, navigator) and patches headless detection vectors.</p><p><strong><a href="https://github.com/D4Vinci/Scrapling">Scrapling</a></strong> sits somewhere in between, offering multiple fetcher types from simple HTTP to full browser automation via Playwright. <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide">Its </a><code>StealthyFetcher</code> <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide">wraps Chromium with anti-detection features, while the basic </a><code>Fetcher</code> <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide">uses plain HTTP requests with TLS impersonation</a>.</p><p>On the HTTP client side, <strong><a href="https://github.com/0x676e67/rnet">Rnet</a></strong> is a Rust-based Python client that emulates browser TLS and HTTP/2 fingerprints (JA3, JA4, Akamai). It supports Chrome, Firefox, Safari, Edge, and OkHttp profiles.</p><p><strong><a href="https://github.com/michele0303/undetected-httpx">Undetected-httpx</a></strong> is built on <code>curl_cffi</code> and provides browser-identical TLS fingerprints without running a browser.</p><p>The fundamental difference: browser tools execute JavaScript and render pages, while HTTP clients make requests with browser-like signatures but cannot handle JS-dependent content. This distinction matters because it determines both what you can scrape and how fast you can do it. A browser spins up an entire rendering engine, consumes hundreds of megabytes of RAM per instance, and waits for network events, DOM parsing, and JavaScript execution. An HTTP client sends a request and receives bytes. 
The performance gap is enormous when the extra capability is not needed.</p><div><hr></div><blockquote><p><em>In case you&#8217;re still struggling with browser automation you can try out rayobrowse - a self-hosted Chromium stealth browser from Rayobyte. <a href="https://rayobyte.com/blog/custom-chromium-stealth-browser-web-scraping/?utm_source=twsc&amp;utm_medium=email&amp;utm_campaign=nike">Have a look at it here</a>.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QvTB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QvTB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png 424w, https://substackcdn.com/image/fetch/$s_!QvTB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png 848w, https://substackcdn.com/image/fetch/$s_!QvTB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png 1272w, https://substackcdn.com/image/fetch/$s_!QvTB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QvTB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png" 
width="310" height="57.91208791208791" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:272,&quot;width&quot;:1456,&quot;resizeWidth&quot;:310,&quot;bytes&quot;:1565999,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/174777782?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!QvTB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png 424w, https://substackcdn.com/image/fetch/$s_!QvTB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png 848w, https://substackcdn.com/image/fetch/$s_!QvTB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png 1272w, https://substackcdn.com/image/fetch/$s_!QvTB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce5df05a-d350-42d4-8ac3-30bc2df8dda4_3639x680.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>&#128176; - <a href="https://rayobyte.com/">Rayobyte is offering an exclusive 55% discount with the code 
WSC55 in all of their static datacenter &amp; ISP proxies, </a></strong><a href="https://rayobyte.com/">only to web scraping club visitors.<br>You can also claim a 30% discount on residential proxies by emailing </a><strong><a href="mailto:sales@rayobyte.com">sales@rayobyte.com</a></strong>.</p></blockquote><div><hr></div><h3>Test Setup</h3><p>We extracted 1000 product URLs from Nike&#8217;s sitemap (Austria EN locale). The sitemap is publicly accessible and provides a clean list of product URLs without requiring crawling. Each tool scraped the same URL set sequentially, with no delays between requests. This aggressive pacing represents a worst-case scenario for detection: real scrapers would typically add delays to reduce load and avoid rate limits.</p><p><strong>Extraction logic</strong>: All tools used identical HTML parsing. We targeted stable <code>data-testid</code> attributes, which Nike uses for internal testing. These selectors are more reliable than class names, which often change with CSS updates:</p><pre><code># Title
soup.select_one('h1[data-testid="product_title"]')

# Price
soup.select_one('[data-testid="currentPrice-container"]')

# Color
soup.select_one('[data-testid="product-description-color-description"]')

# Style code (SKU)
soup.select_one('[data-testid="product-description-style-color"]')</code></pre><p>This approach keeps the comparison fair: differences in results reflect fetching capability, not parsing logic.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><p><strong>Tool configurations:</strong></p><p>- <strong>Pydoll</strong>: Headless Chromium, 3-second wait after page load, new browser instance per URL</p><p>- <strong>Camoufox</strong>: Headless Firefox, <code>networkidle</code> wait, single browser session with new pages</p><p>- <strong>Scrapling</strong>: <code>Fetcher.get()</code> with <code>impersonate='chrome'</code> and <code>stealthy_headers=True</code></p><p>- <strong>Rnet</strong>: <code>BlockingClient</code> with <code>Impersonate.Chrome137</code>, 30-second timeout</p><p>- <strong>Undetected-httpx</strong>: <code>httpx.Client</code> with browser-like headers, 30-second timeout</p><p>The browser tools opened new page contexts for each URL. The HTTP clients reused the same session.</p><p>As always, the full code can be found in <a href="https://github.com/TheWebScrapingClub/thelab">The Lab private GitHub repository, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">96.NIKE</a></strong><a href="https://github.com/TheWebScrapingClub/thelab">, available only to paid subscribers of TWSC.</a></p>
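<p>To make the test setup more concrete, here is a minimal sketch of a harness that runs one tool over the shared URL list sequentially, with no delays, and sends every response through the same parser. It is an illustration under our own assumptions, not the code from the repository: <code>run_benchmark</code>, <code>ToolResult</code>, <code>fetch</code> and <code>parse</code> are hypothetical names.</p>

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToolResult:
    """Outcome of running one tool over the shared URL list."""
    name: str
    ok: int = 0
    failed: int = 0
    seconds: float = 0.0
    rows: list = field(default_factory=list)

    def success_rate(self):
        total = self.ok + self.failed
        return self.ok / total if total else 0.0


def run_benchmark(name, fetch, parse, urls):
    """Fetch every URL sequentially with no delay, then apply the shared parser.

    fetch(url) returns HTML text or raises an exception.
    parse(html) returns a dict, or None when the page holds no product data
    (for example a block page).
    """
    result = ToolResult(name=name)
    start = time.perf_counter()
    for url in urls:
        try:
            row = parse(fetch(url))
        except Exception:
            row = None  # network errors and blocks both count as failures
        if row:
            result.ok += 1
            result.rows.append(row)
        else:
            result.failed += 1
    result.seconds = time.perf_counter() - start
    return result
```

<p>In a setup like this, each tool would supply its own <code>fetch</code> callable, while <code>parse</code> would wrap the <code>data-testid</code> selectors shown earlier, so only the fetching layer varies between runs.</p>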
      <p>
          <a href="https://substack.thewebscraping.club/p/scraping-nike-with-open-source">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[ A preview of the Zyte 2026 Web Scraping Industry report]]></title><description><![CDATA[Where the industry is headed according to Zyte]]></description><link>https://substack.thewebscraping.club/p/a-preview-of-the-zyte-2026-web-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/a-preview-of-the-zyte-2026-web-scraping</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Sun, 25 Jan 2026 22:16:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/da4cf4e1-2879-44e9-b314-ae503bb4614e_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The start of the year is the perfect time for New Year&#8217;s resolutions and web scraping industry reports. Thanks to Zyte, we had the opportunity to read in advance their view on the industry, and we&#8217;re gonna share with you some key elements from it. <a href="https://www.zyte.com/whitepaper-ebook/2026-web-scraping-industry-report/">If you&#8217;d like to read it in full, you can find it here</a>.</p><p>The document, titled &#8220;The age of fast-forward web data&#8221;, identifies six trends reshaping the industry in 2026. We went through the report and pulled out what matters for anyone working in data extraction.</p><h2>The Market Has Exploded</h2><p>The report opens with a number: the web scraping market reached $1.03 billion in 2025, with projections pointing to $2 billion by 2030 (some estimates double that figure). The majority of mid-to-large enterprises now use web scraping for competitive intelligence, and most e-commerce companies monitor competitor prices using scraped data.</p><p>Web scraping, in other words, is no longer a fringe practice. 
It has become critical economic infrastructure.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Trend 1: Full-Stack APIs Replace Separate Components</h2><p>The first trend concerns the end of standalone proxies. 
Zyte reports that the market now counts over 250 proxy vendors, with price wars that have eroded margins and turned proxies into commodities. The problem is that websites have evolved their defenses well beyond simple IP blocking: TLS fingerprinting, behavioral analysis, canvas fingerprinting, and JavaScript traps. Some systems claim 99.9% accuracy in distinguishing humans from bots through behavioral biometrics alone.</p><p>The market response, according to Zyte, is migration toward APIs that handle the entire stack transparently: proxy rotation, browser automation, unblocking, parsing, and retry logic. The cited figure: request volume through the Zyte API grew 130% year-over-year in 2025.</p><p>The trend is real, though it needs context. For those operating at large scale with specific control requirements, direct component management remains relevant. What we find more interesting is the underlying shift: defense complexity has crossed the threshold of manual manageability for most use cases. Now more than before, it&#8217;s a build-vs-buy choice: you can always try (and probably should, at least to improve your skills in web scraping) to build in-house solutions, but the game is becoming so hard that the market is moving toward all-in-one APIs.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Trend 2: AI Enters the Web Scraping Toolchain</h2><p>The second trend describes AI integration across every link in the chain. 
The report cites a Technavio projection: the AI-based web scraping market will reach $3.16 billion by 2029, growing at 39.4% annually.</p><p>The concrete applications listed in the report cover the entire cycle: auto-classification of content for schema-specific extraction, LLM-powered extraction for unstructured data, automatic identification of selectors and field mappings, change detection, crawler code generation, browser interaction via natural language, data cleaning, anomaly detection, and real-time unblocking strategies.</p><p>The key distinction Zyte proposes: LLM extraction for low-volume projects with volatile sites (higher cost per request, but flexibility compensates), code generation for high-volume mission-critical projects (generated code can be tested, versioned, and costs less at scale).</p><p>The report also mentions computer-use models for multi-step navigation (forms, filters, gated screens). We think this is a rapidly evolving area worth watching closely, and we have talked about this trend in several other articles on these pages. LLMs are not a silver bullet for HTML parsing, but their use certainly improves the productivity of data acquisition teams, both when they need to write scrapers and when they need to check data quality.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Trend 3: The Era of Autonomous Pipelines</h2><p>The third trend is the most ambitious: end-to-end automation through agents. 
Zyte cites a Deloitte study showing 30% of organizations exploring agentic approaches, 38% piloting them, but only 11% with production deployments. The gap, according to Zyte, will narrow in 2026.</p><p>The proposed vision: a team specifies an outcome (dataset with schema, coverage targets, freshness, failure tolerance), an agent explores the site, discovers necessary actions, chooses the most efficient method, and when the site changes, the agent diagnoses the breakage, regenerates code, re-validates outputs, and escalates only when confidence drops below a threshold.</p><p>In practice, the report describes a multi-agent system: API discovery agents, schema-first extraction agents, self-healing testing agents, vision-based computer-use agents, DOM-native browser agents, and coding agents. Each agent handles one specific job, and an orchestrator supervises.</p><p>The vision is compelling on paper, but the report itself admits that production adoption is still limited. Zyte&#8217;s practical advice is telling: apply agents selectively. For stable, straightforward sources, a conventional setup remains more cost-effective and we could not agree more.</p><div><hr></div><h2>Trend 4: The Arms Race Accelerates</h2><p>The fourth trend is perhaps the most concrete. Anti-bot systems now reconfigure continuously, driven by ML models that adapt in minutes. The report cites Proxyway: &#8220;Two days of unblocking efforts used to give two weeks of access... now it&#8217;s the other way around.&#8221;</p><p>Zyte reports observing a major bot management vendor deploy over 25 version changes in 10 months, often releasing updates multiple times per week. Cloudflare, according to the report, has a near-real-time system that adapts its detection strategy every few minutes.</p><p>The factors amplifying the mismatch: ML-driven detection with polymorphic JavaScript, WASM obfuscation, RASP, passive fingerprinting at scale; detection mechanisms monitoring timing patterns, network-level anomalies, device fingerprint consistency, pointer curves, scroll variance; growing AI bot traffic volume pushing sites to respond with automatic tuning.</p><p>Zyte&#8217;s conclusion: manual access strategies are no longer sustainable at scale. Only automated, self-adjusting pipelines survive. We can also say that web scraping is becoming more expensive, and there should be a smarter way to do it. Our idea at <a href="https://www.databoutique.com/">Databoutique.com</a> is to share scraping costs across multiple data buyers, and this can be a way to do so. 
</p><h2>Trend 5: The Web Fragments Into Access Lanes</h2><p>The fifth trend describes a trifurcation of the web from a bot access perspective.</p><p>The hostile web: sites deploying aggressive honeypot traps, AI-targeted challenge flows, and sophisticated fingerprinting. Cloudflare has deployed AI Labyrinth, traps specifically designed for AI crawlers, claiming to have blocked 416 billion AI bot requests in six months.</p><p>The negotiated web: publishers adopting licensing, attestation, pay-per-crawl, and paywalls. Standards like ai.txt, llms.txt, and Really Simple Licensing (RSL) attempt to make permissions machine-readable. Adweek reports that 2026 will see LLM deals shift from one-time payments to usage-based revenue shares.</p><p>The invited web: sites exposing machine-first interfaces for approved actors. Shopify, Google, Visa, Stripe, and OpenAI either support Model Context Protocol (MCP) or have launched proprietary protocols like Agentic Commerce Protocol (ACP), Universal Commerce Protocol, and Trusted Agent Protocols.</p><p>The key point: identity becomes a first-class citizen. Initiatives like &#8220;Know Your Agent&#8221; are gaining traction. Verified or attested bots receive preferential routing, while unverifiable bots face increasing friction.</p><p>This chapter describes attempts to make the web sustainable for content publishers via protocols that could reward them, but there is still a long way to go.</p><h2>Trend 6: Regulatory Compliance Arrives</h2><p>The sixth trend concerns the evolving legal landscape. 
Two relevant dates: California AB 2013 took effect January 1, 2026; the EU AI Act takes effect August 2, 2026.</p><p>California AB 2013 requires developers of publicly available generative AI systems to publish detailed documentation: data sources, dataset size, data types, whether copyrighted material is included, whether datasets were purchased or licensed, whether personal information is included, and data processing methods used.</p><p>The EU AI Act imposes transparency and other obligations based on risk to users. General-purpose AI providers must publish training dataset summaries and respect copyright holder opt-outs. Penalties: up to 35 million euros or 7% of global annual turnover.</p><p>On the litigation front, the report cites Bartz v. Anthropic (2025): training on legally obtained works is defensible, training on pirated content is not. In Kadrey v. Meta, market harm was a decisive factor.</p><p>Compliance, Zyte concludes, becomes an operational requirement. Enterprises will not adopt AI systems without evidence of lawful data sourcing. Provenance tracking becomes necessary for audits, investors, and enterprise customers.</p><h2>Conclusions</h2><p>We&#8217;re living in an unprecedented era: we have tools in our hands that are improving the efficiency of our work, and we&#8217;re still scratching the surface of what they can do. But LLM training and agents have brought scraping operations to a new level, making anti-bot software more important than ever and raising the bar for the scraping industry itself, which is responding with more advanced tools and APIs. 
For sure, we&#8217;ll have interesting times ahead.</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #95: Bypassing Cloudflare in 2026]]></title><description><![CDATA[Testing Open Source Browser Automation Tools Against Real Targets]]></description><link>https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 22 Jan 2026 21:44:08 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/06570dd4-4551-433b-a075-9f2af10d280b_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this first article of The Lab series of 2026, we&#8217;ll see how to bypass the most common anti-bot measure on the market: Cloudflare. Like every anti-bot defense, it keeps evolving, forcing scraping tools to keep pace, and what worked in 2025 may fail in 2026. This means that for scraping professionals, operations become more expensive. In fact, choosing the wrong tool means wasted development time and blocked pipelines while you search for a technique that works, and, as we&#8217;ll find out, there&#8217;s no silver bullet. 
We tested the most common open-source browser automation tools against two Cloudflare-protected production sites to identify what actually works.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p><div><hr></div></blockquote><h2>Tool Landscape</h2><p>We evaluated three browser automation tools that claim to bypass Cloudflare in 2026.</p><p><a href="https://github.com/daijro/camoufox">Camoufox</a> is a custom Firefox build with fingerprint rotation and stealth patches. It uses Playwright&#8217;s Juggler protocol and focuses on avoiding detection through realistic Firefox fingerprints and non-default configurations.</p><p><a href="https://github.com/autoscrape-labs/pydoll">Pydoll</a> uses Chrome DevTools Protocol (CDP) for async Chromium automation. It avoids WebDriver entirely, emphasizing human-like interactions and behavioral anti-detection.</p><p><a href="https://github.com/ultrafunkamsterdam/undetected-chromedriver">undetected-chromedriver</a> provides a patched Selenium ChromeDriver. It modifies startup behavior and WebDriver fingerprints, serving as a drop-in replacement for standard Selenium workflows.</p><p><a href="https://github.com/TheWebScrapingClub/thelab">The code used for this test can be found in The Lab GitHub repository, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">95.CLOUDFLARE-2026</a></strong><a href="https://github.com/TheWebScrapingClub/thelab">, available only to paid subscribers of TWSC.</a></p><h2>System Model: Cloudflare&#8217;s Detection Layers</h2><p>Cloudflare operates as a multi-layered defense system. 
Understanding which layer blocks your requests determines which tool characteristics matter.</p><p><strong>Layer 1</strong> involves TLS and network fingerprinting. Cloudflare analyzes TLS handshakes and HTTP/2 frame ordering, inspecting cipher suites, negotiation order, and HTTP/2 header sequences to identify non-browser clients. Browser automation tools using real Chrome or Firefox inherit legitimate <a href="https://blog.cloudflare.com/ja4-signals/">TLS fingerprints</a>, typically passing this layer.</p><p><strong>Layer 2</strong> <a href="https://developers.cloudflare.com/bots/reference/javascript-detections/">runs JavaScript Detections (JSD)</a>. Cloudflare&#8217;s JavaScript Detections engine executes lightweight JavaScript on HTML page requests to identify headless browsers and automated clients. The detection runs via an invisible code snippet that analyzes browser environment properties without visible challenges. When verification succeeds, Cloudflare issues a <code>cf_clearance</code> cookie, and the <code>cf.bot_management.js_detection.passed</code> field evaluates to <code>true</code> in rule expressions. Failed verifications block the request or trigger additional challenges.</p><p><strong>Layer 3</strong> <a href="https://www.zenrows.com/blog/cloudflare-js-challenge-bypass">presents the JavaScript Challenge (IUAM)</a>, the visible &#8220;Checking your browser&#8221; interstitial. This executes complex JavaScript to validate browser capabilities and measure execution timing. The challenge tests Canvas and WebGL rendering, navigator property consistency, and execution patterns.</p><p><strong>Layer 4</strong> implements behavioral analysis. Cloudflare monitors mouse movements, scrolling patterns, click timing, and request sequencing to identify bot-like behavior. These signals feed into the bot scoring system.</p><p><strong>Layer 5</strong> <a href="https://developers.cloudflare.com/bots/concepts/bot-detection-engines/">applies machine learning bot scoring</a>. 
Cloudflare&#8217;s ML engine assigns Bot Scores (1-99) using supervised learning to distinguish human and bot traffic. This accounts for the majority of detections and incorporates signals from all other layers plus IP reputation.</p><p>For this test, we focused on whether tools can pass the JavaScript detection and challenge layers (Layers 2-3). We assumed clean residential IPs and did not evaluate IP reputation impact.</p><h2>Target Selection and URL Collection</h2><p>We needed real production targets with active Cloudflare protection, and we chose two very well-known websites.</p><p><strong><a href="https://www.harrods.com">Harrods.com</a></strong> operates as a high-value e-commerce site where Cloudflare protects product pages and navigation. The sitemap is accessible at <em>https://harrods.com/sitemap.xml</em>.</p><p><strong><a href="https://www.indeed.com">Indeed.com</a></strong> is a job board with aggressive protection. When we attempted sitemap access, we received 403 Forbidden, confirming active Cloudflare filtering even for automated sitemap requests.</p><h2>URL Extraction Process</h2><p>For Harrods.com, standard sitemap parsing works:</p><pre><code>def extract_urls_from_domain(domain: str, limit: int = 1000) -&gt; List[str]:
    sitemap_url = f'https://{domain}/sitemap.xml'
    xml_content = fetch_sitemap(sitemap_url)
    all_urls = set()  # unique URLs collected across all sitemaps

    if '&lt;sitemapindex' in xml_content:
        sitemap_urls = parse_sitemap_index(xml_content)
        for sm_url in sitemap_urls:
            sm_content = fetch_sitemap(sm_url)
            urls = parse_sitemap(sm_content)
            all_urls.update(urls)
    else:
        urls = parse_sitemap(xml_content)
        all_urls.update(urls)

    return list(all_urls)[:limit]
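
# Hypothetical helper sketch (not from the original repo): one way the
# parse_* functions called above could work for standard sitemap XML.
import xml.etree.ElementTree as ET

SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def _locs(xml_content, parent_tag):
    # Return the loc text found under every parent_tag element.
    root = ET.fromstring(xml_content)
    nodes = (n.find(SITEMAP_NS + 'loc') for n in root.iter(SITEMAP_NS + parent_tag))
    return [n.text.strip() for n in nodes if n is not None and n.text]

def parse_sitemap_index(xml_content):
    return _locs(xml_content, 'sitemap')  # child sitemap URLs

def parse_sitemap(xml_content):
    return set(_locs(xml_content, 'url'))  # page URLs, deduplicated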
</code></pre><p>For Indeed.com, sitemap access returned 403, forcing us to generate URLs based on known patterns:</p><pre><code>def generate_indeed_urls(limit: int = 1000) -&gt; List[str]:
    urls = []
    job_queries = ['software-engineer', 'data-scientist', 'product-manager', ...]
    locations = ['New-York-NY', 'Los-Angeles-CA', 'Chicago-IL', ...]

    for query in job_queries:
        for location in locations:
            urls.append(f'https://www.indeed.com/jobs?q={query}&amp;l={location}')
            for start in [10, 20, 30]:
                urls.append(f'https://www.indeed.com/jobs?q={query}&amp;l={location}&amp;start={start}')

    return urls[:limit]
</code></pre><p>The 403 on sitemap access itself demonstrates Cloudflare&#8217;s filtering. Even basic reconnaissance triggers blocks.</p><h2>Test Framework Design</h2><p>We designed our test framework to validate successful page loads by checking for site-specific content elements rather than generic Cloudflare blocking indicators.</p><h3>Content-Based Validation</h3><p>We discovered early that generic Cloudflare detection (searching for &#8220;checking your browser&#8221; or &#8220;just a moment&#8221; strings) produces false positives. Tools can retrieve partial pages or incomplete JavaScript renders that contain neither Cloudflare challenges nor the actual target content.</p><p>Our validation approach checks for critical page elements that must be present in a successfully loaded page:</p><p><strong>Indeed.com validation:</strong></p><pre><code>has_search_button = (
    'yosegi-InlineWhatWhere-primaryButton' in html and
    '&lt;button' in html and
    '&gt;Search&lt;/span&gt;' in html
)

result = {
    'success': has_search_button,
    'status_code': 200 if has_search_button else (403 if html else 0),
    'content_length': len(html),
}</code></pre><p>The search button element appears only when the full job listing page renders. Its absence indicates either Cloudflare blocking or incomplete JavaScript execution.</p><p><strong>Harrods.com validation:</strong></p><pre><code>has_product_name = (
    'data-test-id="pdp-product-name"' in html and
    '&lt;h1' in html
)

result = {
    'success': has_product_name and status_code == 200,
    'status_code': status_code,
    'content_length': len(html),
}
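
# Hypothetical sketch (names are assumptions, not from the original code):
# label each fetch so Cloudflare interstitials and partial renders are
# reported separately instead of both counting as a generic failure.
CHALLENGE_MARKERS = ('checking your browser', 'just a moment')

def classify_page(html, required_markers):
    # 'ok' only when every site-specific marker is present in the HTML.
    if html and all(m in html for m in required_markers):
        return 'ok'
    lowered = (html or '').lower()
    if any(m in lowered for m in CHALLENGE_MARKERS):
        return 'challenge'
    return 'incomplete' if html else 'empty'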
</code></pre><p>Product pages must contain the product name h1 element. Pages without this element are either editorial content, error pages, or Cloudflare blocks.</p><h3>Metrics Collected</h3><p>For each request, we recorded a success flag (based on content validation), HTTP status code (or inferred status), content length in bytes, final URL after redirects, and error messages when exceptions occurred.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Tool Implementation</h2><h3>Camoufox Setup</h3><p>Camoufox requires downloading the custom Firefox binary:</p><pre><code>pip install -U "camoufox[geoip]"
python -m camoufox fetch</code></pre><p>We configured it for headless operation with fingerprint rotation:</p><pre><code>from camoufox.sync_api import Camoufox

from typing import Dict  # needed for the return annotation

def scrape_with_camoufox(url: str) -&gt; Dict:
    with Camoufox(
        headless=True,
        humanize=True,
        os=['macos', 'windows'],
        geoip=False,
    ) as browser:
        page = browser.new_page()
        response = page.goto(url, timeout=30000, wait_until='domcontentloaded')
        page.wait_for_timeout(2000)
        html = page.content()

        return {
            'html': html,
            'status_code': response.status if response else 0,
            'url': page.url,
        }
</code></pre><p>We set <code>headless=True</code> for server execution. The <code>humanize=True</code> option enables human-like cursor movement to reduce behavioral signals. The <code>os=['macos', 'windows']</code> parameter rotates OS fingerprints between requests. We disabled <code>geoip</code> due to a lack of proxy infrastructure for this test.</p><p>The <code>wait_for_timeout(2000)</code> accounts for JavaScript challenges that execute after page load.</p><h3>Undetected-chromedriver Setup</h3><p>Standard pip installation works:</p><p><code>pip install undetected-chromedriver</code></p><p>Configuration uses the new headless mode to reduce detection surface:</p><pre><code>import undetected_chromedriver as uc

import time  # used for the post-load sleep below
from typing import Dict

def scrape_with_undetected_chromedriver(url: str) -&gt; Dict:
    options = uc.ChromeOptions()
    options.add_argument('--headless=new')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--no-sandbox')

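    driver = uc.Chrome(options=options, version_main=None)
    try:
        driver.get(url)
        time.sleep(2)  # give JavaScript challenges time to finish
        html = driver.page_source
        final_url = driver.current_url
    finally:
        driver.quit()  # always release the browser, even on errors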

    return {
        'html': html,
        'status_code': 200,  # Selenium doesn't expose status
        'url': final_url,
    }</code></pre><p>The <code>--headless=new</code> flag uses Chrome&#8217;s updated headless mode with reduced fingerprint differences. The <code>--disable-blink-features=AutomationControlled</code> option removes the <code>navigator.webdriver</code> flag. The <code>version_main=None</code> parameter auto-detects the installed Chrome version.</p><p>The 2-second sleep allows JavaScript challenges to complete.</p><h3>Pydoll: Chromium CDP-Based Automation</h3><p>Pydoll uses Chrome DevTools Protocol (CDP) for browser control. We found that the API differs from the documentation, requiring experimentation to identify working methods.</p><p>Working configuration:</p><pre><code>from pydoll.browser.chromium import Chrome
from pydoll.browser.options import ChromiumOptions

options = ChromiumOptions()
options.add_argument('--headless=new')
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--no-sandbox')

# Run with: asyncio.run(scrape_with_pydoll(url))
async def scrape_with_pydoll(url: str) -&gt; dict:
    async with Chrome(options=options) as browser:
        tab = await browser.start()
        await tab.go_to(url)
        html = await tab.page_source  # Use page_source, not tab.html()
        current_url = await tab.current_url
        return {'html': html, 'url': current_url}</code></pre><p>We encountered API inconsistencies during testing. Documentation suggests <code>from pydoll import Browser</code>, but the <code>Browser</code> class doesn&#8217;t exist. The page HTML comes from <code>tab.page_source</code>, not <code>tab.html()</code>. URL retrieval uses <code>tab.current_url</code> as an attribute, not a method call.</p><p>These inconsistencies required us to test against the installed version (2.15.1) to identify working patterns.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Test Results: Indeed.com with Camoufox</h2><p>When we tested Camoufox against Indeed.com, we observed behavioral patterns specific to how Cloudflare&#8217;s Turnstile operates on this target.</p>
      <p>
          <a href="https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>