<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Web Scraping Club]]></title><description><![CDATA[News, solutions and interviews about web scraping.
In this substack you will find weekly content about:
- Web Scraping techniques
- Interviews with key people in the industry
- Anti-bot info and countermeasures
- Real world examples and code]]></description><link>https://substack.thewebscraping.club</link><image><url>https://substackcdn.com/image/fetch/$s_!gJt2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1e343ec9-7946-4440-8c00-57209a1d99a1_1024x1024.png</url><title>The Web Scraping Club</title><link>https://substack.thewebscraping.club</link></image><generator>Substack</generator><lastBuildDate>Tue, 05 May 2026 09:55:28 GMT</lastBuildDate><atom:link href="https://substack.thewebscraping.club/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Web Scraping Club SRL]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[pier@thewebscraping.club]]></webMaster><itunes:owner><itunes:email><![CDATA[pier@thewebscraping.club]]></itunes:email><itunes:name><![CDATA[Pierluigi Vinciguerra]]></itunes:name></itunes:owner><itunes:author><![CDATA[Pierluigi Vinciguerra]]></itunes:author><googleplay:owner><![CDATA[pier@thewebscraping.club]]></googleplay:owner><googleplay:email><![CDATA[pier@thewebscraping.club]]></googleplay:email><googleplay:author><![CDATA[Pierluigi Vinciguerra]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Cloudflare Crawl Endpoint: Everything You Need to Know]]></title><description><![CDATA[Is the Cloudflare /crawl endpoint a real game-changer?]]></description><link>https://substack.thewebscraping.club/p/cloudflare-crawl-endpoint-for-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/cloudflare-crawl-endpoint-for-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 03 May 2026 20:24:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/898316de-e54e-4a62-8089-2ad66bc363b8_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cloudflare just shook the Web by announcing its first 
API for crawling entire websites. It&#8217;s built for RAG systems and website monitoring, but can it really be used for real-world web scraping scenarios?</p><p>In this article, you&#8217;ll find out. I&#8217;ll walk you through a complete guided example of how to use it, and break down its limitations (spoiler: they&#8217;re serious).</p><h2>An Introduction to the Cloudflare Crawl Endpoint</h2><p>Before exploring the technical aspects behind the Cloudflare <em>/crawl</em> endpoint and seeing it in action, let me first give you some context!</p><h3>What Is the Cloudflare <em>/crawl</em> Endpoint?</h3><p>The <em><a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/">/crawl</a></em> endpoint is a new addition to <a href="https://developers.cloudflare.com/fundamentals/api/">Cloudflare&#8217;s REST APIs</a>. Its goal is to crawl an entire website (or just a portion of it) starting from a single URL.</p><p><strong>Note</strong>: The <em>/crawl</em> endpoint is currently in beta and was introduced on March 10, 2026, <a href="https://developers.cloudflare.com/changelog/post/2026-03-10-br-crawl-endpoint/">as highlighted in the Cloudflare changelog</a>.</p><p>Under the hood, it automatically discovers and visits new pages, <a href="https://developers.cloudflare.com/browser-rendering/">rendering them in a headless browser</a>. It returns the discovered content as HTML, Markdown, or structured JSON, making it ideal for RAG pipelines, monitoring, or dataset creation.</p><p>As I&#8217;ll explain later, it respects <em>robots.txt</em> and <em>doesn&#8217;t</em> bypass bot protection or CAPTCHAs. 
Thus, it&#8217;s designed as a <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">compliant approach to web crawling!</a></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" loading="lazy" fetchpriority="high"></picture></div></a></figure></div>
<p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>How It Works at a High Level</strong></h2><p>At a high level, the <em>/crawl</em> endpoint involves a two-step flow:</p><ol><li><p>You kick off an asynchronous crawl job, passing a starting URL. 
Cloudflare returns a job ID.</p></li><li><p>You use that job ID to periodically check the job&#8217;s status or fetch results as they become available, following typical <a href="https://en.wikipedia.org/wiki/Polling_(computer_science)">polling behavior</a>.</p></li></ol><p><strong>Important</strong>: A crawl job can run for <em>up to seven days!</em> Results remain available for 14 days after completion, after which the job data is deleted.</p><p>Behind the scenes, the crawler expands outward from the starting URL. By default, it processes URLs in this order:</p><ol><li><p>The initial page.</p></li><li><p>Sitemap URLs.</p></li><li><p>Links discovered within pages.</p></li></ol><p>You can change this behavior depending on whether you want to prioritize sitemaps, page links, or both.</p><h3>Supported Use Cases</h3><p>Cloudflare officially promotes just two use cases for the <em>/crawl</em> API:</p><ul><li><p>Creating knowledge bases or training AI systems (like <a href="https://substack.thewebscraping.club/p/ingest-web-data-rag-llm">RAG applications</a>) using up-to-date web content.</p></li><li><p>Collecting and analyzing content across multiple pages <a href="https://substack.thewebscraping.club/p/build-an-ai-agent-for-scraping-papers">for research</a>, summarization, or monitoring purposes.</p></li></ul><h3>Pricing</h3><p>Unlike most other web crawling or discovery APIs on the market, Cloudflare&#8217;s <em>/crawl</em> API doesn&#8217;t charge by the number of pages. 
Instead, costs are based on resource usage, which depends on whether you enable the headless browser rendering feature.</p><p>If headless rendering is active, pricing follows the <a href="https://developers.cloudflare.com/browser-rendering/pricing/">Browser Rendering model</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vrIj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vrIj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 424w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 848w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1272w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png" width="1456" height="238" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:238,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48862,&quot;alt&quot;:&quot;The Browser Rendering pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Browser Rendering pricing model" title="The Browser Rendering pricing model" srcset="https://substackcdn.com/image/fetch/$s_!vrIj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 424w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 848w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1272w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Browser Rendering pricing 
model</figcaption></figure></div><p>If rendering isn&#8217;t active, pricing follows the <a href="https://developers.cloudflare.com/workers/platform/pricing/">Workers model</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HVsl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HVsl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 424w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 848w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1272w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png" width="1456" height="356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66389,&quot;alt&quot;:&quot;The Workers pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Workers pricing model" title="The Workers pricing model" srcset="https://substackcdn.com/image/fetch/$s_!HVsl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 424w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 848w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1272w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Workers pricing model</figcaption></figure></div><p><em>Yeah, I 
know&#8230; It&#8217;s honestly a bit confusing, and it&#8217;s almost impossible to predict the exact cost of a crawling task. The good news? You can test it for free!</em></p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>Cloudflare Crawl Endpoints: Technical Analysis</h2><p>Now that you know what the Cloudflare <em>/crawl</em> endpoint is and what it brings to the table, it&#8217;s time to look at how it works, along with its strengths and limitations.</p><h3><strong>Endpoint Presentation</strong></h3><p>The Cloudflare Crawl API is built around two main endpoints. Both share the same base URL:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl</code></pre></div><p>Where <em>&lt;ACCOUNT_ID&gt;</em> is your <a href="https://developers.cloudflare.com/fundamentals/account/find-account-and-zone-ids/#copy-your-account-id">Cloudflare account ID</a>.</p><h4>1. Initiate the Crawl Job (POST)</h4><p>To start a new crawl, send a POST request with the target URL (and optional parameters like depth, rendering mode, etc.), as shown below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -X POST 'https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl' \
  -H 'Authorization: Bearer &lt;CLOUDFLARE_API_TOKEN&gt;' \
  -H 'Content-Type: application/json' \
  -d '{ "url": "https://example.com" }'</code></pre></div><p>Keep in mind that the endpoint supports several parameters, allowing you to greatly customize the crawling behavior, output format (JSON, HTML, or Markdown), rendering options, caching, and more. Check out the <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#optional-parameters">full list of supported body parameters for all available options</a>.</p><p>Cloudflare immediately returns a job ID that you&#8217;ll use to track or retrieve results. A possible response looks like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{
  "success": true,
  "result": "9f1c2d3a-4b5e-6f7a-8c9d-0e1f2a3b4c5d"
}</code></pre></div><p>The UUID in the <em>result</em> field is the Crawl job ID you&#8217;ll use to poll for updates.</p><h4>2. Request Crawl Results (GET)</h4><p>Once the crawl is running, make a GET request with the job ID to check the status or fetch results:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -X GET 'https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl/&lt;JOB_ID&gt;' \
  -H 'Authorization: Bearer &lt;CLOUDFLARE_API_TOKEN&gt;'</code></pre></div><p>Here, the <em>&lt;JOB_ID&gt;</em> placeholder is the UUID returned earlier in the <em>result</em> field.</p><p>The response either reports the job&#8217;s progress in a <em>status</em> field, like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">{
  "result": {
    "id": "3e7a1c92-b4d8-4f6e-9a21-6c0d5b8e2f14",
    "status": "running"
    // ...
  }
}</code></pre></div><p>The possible <em>status</em> values are <em>running</em>, <em>completed</em>, <em>errored</em>, or one of several cancellation states (<em>cancelled_due_to_timeout</em>, <em>cancelled_due_to_limits</em>, <em>cancelled_by_user</em>).</p><p>Or, once the job is completed, the same request returns the full results in the <em>records</em> field:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">{
  "result": {
    "id": "3e7a1c92-b4d8-4f6e-9a21-6c0d5b8e2f14",
    "status": "completed",
    "browserSecondsUsed": 98.3,
    "total": 12,
    "finished": 12,
    "records": [
      {
        "url": "https://example.com/",
        "status": "completed",
        "markdown": "# Example Domain\nThis domain is for use in illustrative examples...",
        "metadata": {
          "status": 200,
          "title": "Example Domain",
          "url": "https://example.com/"
        }
      },
      {
        "url": "https://example.com/about",
        "status": "completed",
        "markdown": "## About\nLearn more about this example site...",
        "metadata": {
          "status": 200,
          "title": "About - Example Domain",
          "url": "https://example.com/about"
        }
      }
      // additional entries omitted for brevity...
    ],
    "cursor": 10
  },
  "success": true
}</code></pre></div><p>Note that the response will vary based on the specified query parameters. For example, you can filter by specific statuses, limit the number of results, and <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#polling-for-completion">navigate through them using a pagination system</a>.</p><h3>Features</h3><p>Below are the most relevant capabilities of the Cloudflare Crawl API:</p><ul><li><p><strong>Asynchronous crawl jobs</strong>: Trigger crawl jobs and poll for results as they become ready, enabling non-blocking, large-scale crawling workflows.</p></li><li><p><strong>Automatic URL discovery</strong>: Finds pages from the starting URL, sitemaps, and in-page links, and lets you configure which of these sources to use.</p></li><li><p><strong>Flexible output formats</strong>: Returns HTML, Markdown, or structured JSON. JSON output relies on <a href="https://developers.cloudflare.com/workers-ai/features/json-mode/">Workers AI for schema-driven data extraction</a>.</p></li><li><p><strong>Headless browser rendering</strong>: Execute JavaScript with <em>render: true</em>, or perform fast static HTML fetches with <em>render: false</em>.</p></li><li><p><strong>Fine-grained crawl control</strong>: Configure <em>limit</em>, <em>depth</em>, and URL inclusion/exclusion with the <em>includePatterns</em>/<em>excludePatterns</em> fields.</p></li><li><p><strong>Incremental and cache-aware crawling</strong>: Use the <em>modifiedSince</em> and <em>maxAge</em> parameters to avoid re-fetching unchanged content, optimizing performance and cost.</p></li><li><p><strong>Advanced filtering and pagination</strong>: Retrieve results using <em>limit</em>, <em>cursor</em>, and <em>status</em> filters to handle large datasets efficiently.</p></li><li><p><strong>Authentication and custom headers</strong>: Supports HTTP auth, cookies, and custom headers for crawling protected or API-driven 
content.</p></li><li><p><strong>Dynamic content handling</strong>: Wait for JS-rendered content using <em>gotoOptions</em> and <em>waitForSelector</em>, ideal for SPAs and interactive pages.</p></li><li><p><strong>Resource skipping for performance</strong>: Optionally block images, media, fonts, or stylesheets to speed up crawling and reduce unnecessary bandwidth usage.</p></li></ul><h3>Limitations</h3><p>Cloudflare&#8217;s <em>/crawl</em> API also comes with several important limitations, such as:</p><ul><li><p><strong>Respects bot protection</strong>: The crawler can&#8217;t <a href="https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026">bypass CAPTCHAs (including Turnstile challenges) or Cloudflare bot protections</a>. As a rule of thumb, sites protected via Cloudflare Bot Management or other WAFs tend to block crawling tasks entirely, limiting automated access and leading to incomplete datasets.</p></li><li><p><strong>Fixed User-Agent</strong>: The <em>/crawl</em> endpoint sets a non-customizable <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent">User-Agent</a> </em>value<em> </em>(<em>CloudflareBrowserRenderingCrawler/1.0</em>). 
You can&#8217;t change it, which may cause sites to block requests or serve different content based on the <em>User-Agent</em>.</p></li><li><p><strong>Content Signals enforcement</strong>: If a site disallows AI usage via <a href="https://contentsignals.org/">Cloudflare Content Signals</a>, crawl requests for those purposes are rejected with a <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/400">400 Bad Request</a></em> error. Even if the site allows other uses, attempts to crawl disallowed categories will fail, limiting AI-specific data collection.</p></li><li><p><strong>Rate limiting and crawl pacing</strong>: Sites with strict rate limits can slow down crawling. The crawler respects the robots.txt <em>Crawl-delay </em>directive and implements backoff. Large crawls may need to be split into smaller jobs to avoid throttling or skipped URLs.</p></li><li><p><strong>Browser usage limits and job cancellation</strong>: Accounts on Workers free plans are capped at 10 minutes of browser time per day. Exceeding this limit results in a <em>cancelled_due_to_limits</em> status. To avoid that, you can upgrade to a paid plan.</p></li></ul><h2>How to Use the Cloudflare Crawl Endpoint: Step-by-Step Tutorial</h2><p>In this guided section, I&#8217;ll show you how to use the Cloudflare Crawl Endpoint to crawl a website in Python. The target site will be the &#8220;<a href="https://quotes.toscrape.com/">Quotes to Scrape</a>&#8221; sandbox. 
The goal here is to demonstrate how to use the API, rather than actually collecting relevant data.</p><p>Follow the instructions below!</p><h3>Prerequisites</h3><p>To follow this tutorial section, make sure you have:</p><ul><li><p>Your <a href="https://developers.cloudflare.com/fundamentals/account/find-account-and-zone-ids/#copy-your-account-id">Cloudflare account ID</a> at hand.</p></li><li><p>A <a href="https://developers.cloudflare.com/fundamentals/api/get-started/create-token/">Cloudflare API token</a> with the &#8220;Browser Rendering - Edit&#8221; permission.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nJvY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nJvY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 424w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 848w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png" width="1456" height="781" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission" title="Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission" srcset="https://substackcdn.com/image/fetch/$s_!nJvY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 424w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 848w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission</figcaption></figure></div><p>For the sake of simplicity and to keep this tutorial concise, I&#8217;ll assume you already have a Python project set up with <em><a href="https://substack.thewebscraping.club/p/python-http-request-explained">requests</a></em> 
installed. That said, you can use any programming language and any HTTP client, because the high-level logic remains the same.</p><h3>Step #1: Set Up the Configurations</h3><p>Start by importing the required libraries and reading the necessary secrets (your Cloudflare API token and account ID). Use these secrets to prepare the Cloudflare Crawl base URL and headers. Also, specify the starting target URL as a constant.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import requests
import time
import json
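import os  # used only by the optional helper below


# Optional sketch, not part of the original script: a hypothetical helper that
# reads a secret from an environment variable instead of hardcoding it.
# The function name and the variable names you pass to it are assumptions.
# Example: CLOUDFLARE_API_TOKEN = read_secret("CLOUDFLARE_API_TOKEN")
def read_secret(env_name):
    value = os.environ.get(env_name)
    if not value:
        # Fail early with a clear message rather than sending unauthenticated calls
        raise RuntimeError(f"Missing required environment variable: {env_name}")
    return value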

# The Cloudflare secrets required for authenticating Crawl API calls
CLOUDFLARE_API_TOKEN = "&lt;YOUR_CLOUDFLARE_API_TOKEN&gt;"
CLOUDFLARE_ACCOUNT_ID = "&lt;YOUR_CLOUDFLARE_ACCOUNT_ID&gt;"

# The base URL used for all Crawl API calls
CLOUDFLARE_CRAWL_BASE_URL = f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl"

# Common headers shared by all API calls
HEADERS = {
    "Authorization": f"Bearer {CLOUDFLARE_API_TOKEN}",
    "Content-Type": "application/json"
}
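

# Optional sketch, not part of the original script: a hypothetical helper that
# builds the crawl request body from arguments instead of hardcoding it.
# The defaults mirror the values used in Step #2; extra_fields is an assumed
# pass-through for any other option your use case needs.
def build_crawl_payload(start_url, limit=20, depth=2, formats=("markdown",), render=True, **extra_fields):
    payload = {
        "url": start_url,
        "limit": limit,
        "depth": depth,
        "formats": list(formats),
        "render": render,
    }
    # Add any additional options supplied by the caller
    payload.update(extra_fields)
    return payload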

# The URL to crawl
START_URL = "http://quotes.toscrape.com/"</code></pre></div><p><strong>Tip</strong>: In a production script, read the Cloudflare API token and account ID from environment variables rather than hardcoding them.</p><h3>Step #2: Trigger the Crawling Job</h3><p>Define a <em>start_crawl()</em> function to send a POST request to Cloudflare&#8217;s Crawl API:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def start_crawl(start_url):
    # Customize according to your needs
    payload = {
        "url": start_url,
        "limit": 20,
        "depth": 2,
        "formats": ["markdown"],
        "render": True
    }

    response = requests.post(CLOUDFLARE_CRAWL_BASE_URL, headers=HEADERS, json=payload)
    response.raise_for_status()

    job_id = response.json()["result"]
    return job_id</code></pre></div><p>This creates a new crawling job for the target URL and returns a <em>job_id</em> that identifies this specific crawl.</p><p><strong>Tip</strong>: In a production-level script, make the <em>payload</em> object configurable via function input arguments for greater flexibility and reusability.</p><h3>Step #3: Poll the Job Status</h3><p>Next, add a <em>wait_for_completion()</em> function to repeatedly check the job status every few seconds until the crawl finishes or times out:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def wait_for_completion(job_id, max_attempts=60, delay=5):
    status_url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}?limit=1"

    for attempt in range(max_attempts):
        res = requests.get(status_url, headers=HEADERS)
        res.raise_for_status()

        status = res.json()["result"]["status"]
        print(f"Attempt {attempt+1}: {status}")

        if status != "running":
            return status

        time.sleep(delay)

    raise Exception("Timeout waiting for crawl job")</code></pre></div><p>This sends GET requests to the Cloudflare <em>/crawl</em> endpoint for the job and returns as soon as the status is no longer <em>running</em>, so you only fetch the crawled records after processing has finished. If the job is still running after <em>max_attempts</em> checks, the function raises an exception.</p><p><strong>Tip</strong>: The <em>limit=1</em> query parameter is recommended to restrict the number of retrieved records, keeping the response lightweight. After all, at this stage, you&#8217;re only interested in checking the job status, not in retrieving the actual output data.</p><h3>Step #4: Get the Crawled Content Pages</h3><p>Build a <em>fetch_records()</em> function to collect all crawled pages:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def fetch_records(job_id):
    print("Fetching results...")

    all_records = []
    cursor = None

    while True:
        url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}"
        # Getting 10 records at a time
        params = {
            "limit": 10
        }

        if cursor:
            params["cursor"] = cursor

        response = requests.get(url, headers=HEADERS, params=params)
        response.raise_for_status()

        result = response.json()["result"]
        records = result.get("records", [])

        all_records.extend(records)
        print(f"+{len(records)} records (Total: {len(all_records)})")

        cursor = result.get("cursor")
        if not cursor:
            break

    return all_records</code></pre></div><p>This handles pagination using a <em>cursor</em>, accessing records in batches (<em>10</em> per request) until all results are returned.</p><h3>Step #5: Put It All Together</h3><p>Finally, in the <em>main()</em> function, orchestrate the workflow:</p><ol><li><p>Start the crawl</p></li><li><p>Wait for completion</p></li><li><p>Fetch all results</p></li></ol><p>Then, you can export the crawled records to a local JSON file for further use, store the retrieved data in a database, process it there, etc.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def main():
    job_id = start_crawl(START_URL)
    status = wait_for_completion(job_id)

    if status != "completed":
        raise Exception(f"Crawl failed: {status}")

    print("Crawl completed!")

    records = fetch_records(job_id)

    # Export the crawled pages to an output JSON file
    with open("records.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
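

# Optional sketch, not part of the original script: a hypothetical helper for a
# quick sanity check of the crawled records. The "status" and "url" keys are
# assumed field names; only "markdown" is confirmed by the output in this article.
from collections import Counter


def summarize_records(records):
    # Count how many records ended up in each status
    status_counts = Counter(record.get("status", "unknown") for record in records)
    # Collect the URLs of the records that actually carry Markdown content
    urls_with_markdown = [record.get("url") for record in records if record.get("markdown")]
    return status_counts, urls_with_markdown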

if __name__ == "__main__":
    main()</code></pre></div><h3>Step #6: Complete Code</h3><p>This is what your Python script for interacting with the Cloudflare Crawl API will look like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># pip install requests

import requests
import time
import json

# The Cloudflare secrets required for authenticating Crawl API calls
CLOUDFLARE_API_TOKEN = "&lt;YOUR_CLOUDFLARE_API_TOKEN&gt;"
CLOUDFLARE_ACCOUNT_ID = "&lt;YOUR_CLOUDFLARE_ACCOUNT_ID&gt;"

# The base URL used for all Crawl API calls
CLOUDFLARE_CRAWL_BASE_URL = f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl"

# Common headers shared by all API calls
HEADERS = {
    "Authorization": f"Bearer {CLOUDFLARE_API_TOKEN}",
    "Content-Type": "application/json"
}

# The URL to crawl
START_URL = "http://quotes.toscrape.com/"

def start_crawl(start_url):
    """
    Triggers the Cloudflare Crawl API job
    """

    # Customize according to your needs
    payload = {
        "url": start_url,
        "limit": 20,
        "depth": 2,
        "formats": ["markdown"],
        "render": True
    }

    response = requests.post(CLOUDFLARE_CRAWL_BASE_URL, headers=HEADERS, json=payload)
    response.raise_for_status()

    job_id = response.json()["result"]
    return job_id

def wait_for_completion(job_id, max_attempts=60, delay=5):
    """
    Waits for the crawling task to complete
    """

    status_url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}?limit=1"

    for attempt in range(max_attempts):
        res = requests.get(status_url, headers=HEADERS)
        res.raise_for_status()

        status = res.json()["result"]["status"]
        print(f"Attempt {attempt+1}: {status}")

        if status != "running":
            return status

        time.sleep(delay)

    raise Exception("Timeout waiting for crawl job")

def fetch_records(job_id):
    """
    Collects all records from the paginated results
    """

    print("Fetching results...")

    all_records = []
    cursor = None

    while True:
        url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}"
        # Getting 10 records at a time
        params = {
            "limit": 10
        }

        if cursor:
            params["cursor"] = cursor

        response = requests.get(url, headers=HEADERS, params=params)
        response.raise_for_status()

        result = response.json()["result"]
        records = result.get("records", [])

        all_records.extend(records)
        print(f"+{len(records)} records (Total: {len(all_records)})")

        cursor = result.get("cursor")
        if not cursor:
            break

    return all_records

def main():
    job_id = start_crawl(START_URL)
    status = wait_for_completion(job_id)

    if status != "completed":
        raise Exception(f"Crawl failed: {status}")

    print("Crawl completed!")

    records = fetch_records(job_id)

    # Export the crawled pages to an output JSON file
    with open("records.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    main()</code></pre></div><h3>Step #7: Test the Script</h3><p>Launch the script, and it&#8217;ll produce an output like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XDal!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XDal!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 424w, https://substackcdn.com/image/fetch/$s_!XDal!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 848w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1272w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png" width="1175" height="412" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1175,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output produced by the script in the terminal&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output produced by the script in the terminal" title="The output produced by the script in the terminal" srcset="https://substackcdn.com/image/fetch/$s_!XDal!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 424w, https://substackcdn.com/image/fetch/$s_!XDal!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 848w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1272w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The output produced by the script in the terminal</figcaption></figure></div><p>The polling mechanism required 5 attempts (~25 seconds), and the API discovered and retrieved 22 pages.</p><p>A <em>records.json</em> file will appear in your project directory. 
Open it, and you&#8217;ll see:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZwCj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 424w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 848w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png" width="1456" height="1071" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1071,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 424w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 848w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The output produced by the script</figcaption></figure></div><p>Notice how the &#8220;Quotes to Scrape&#8221; entries contain a <em>markdown</em> field with the Markdown version of the page. External links, such as Zyte&#8217;s homepage and Goodreads.com, are skipped instead, since <em>includeExternalLinks</em> is set to <em>false</em> by default. In other words, the Cloudflare Crawl API doesn&#8217;t automatically fetch pages from domains other than that of the starting URL.</p><p>Et voil&#224;! 
Implementation complete.</p><h3>Benchmark Against Protected Websites</h3><p>Cool! The Cloudflare Crawl endpoint works like a charm and is easy to use. However, I was particularly concerned about its documented limitations and wanted to verify whether they actually hold up in practice&#8230;</p><p>So, I ran tests against several well-known sites protected by common WAF and anti-bot solutions (from different providers). Here&#8217;s a summary of the results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!chL4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!chL4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 424w, https://substackcdn.com/image/fetch/$s_!chL4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 848w, 
https://substackcdn.com/image/fetch/$s_!chL4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1272w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png" width="1456" height="605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111887,&quot;alt&quot;:&quot;Cloudflare Crawl API vs anti-bot solutions&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cloudflare Crawl API vs anti-bot solutions" title="Cloudflare Crawl API vs anti-bot solutions" srcset="https://substackcdn.com/image/fetch/$s_!chL4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 424w, 
https://substackcdn.com/image/fetch/$s_!chL4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 848w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1272w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cloudflare Crawl API vs anti-bot solutions</figcaption></figure></div><p>As you can tell, the limitations are very real. The results are quite discouraging:<strong> the Cloudflare Crawl API failed against all anti-bot&#8211;protected websites I tested.</strong></p><p>So, is this solution reliable for web scraping? When (and how) should you actually use it? Let me break that down in a final comment!</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Final Comment</h2><p>In this article, I introduced you to one of the newest tools in Cloudflare&#8217;s growing ecosystem: the Crawl API! This endpoint is designed to help you crawl entire websites using distributed crawling tasks running on Cloudflare&#8217;s infrastructure.</p><p>Sure, the crawling mechanism works and is easy to launch, control, and implement. With just a few lines of code, you can get started. 
Still, several concerns should be raised:</p><ol><li><p><strong>Opaque pricing</strong>: Costs are tied to resource usage rather than the number of pages crawled, making them harder to predict.</p></li><li><p><strong>Fixed </strong><em><strong>User-Agent</strong></em>: The API doesn&#8217;t allow <em>User-Agent</em> customization, meaning even basic server-side checks can block it.</p></li><li><p><strong>Limited effectiveness on protected sites</strong>: By design, the API has a very low success rate against anti-bot&#8211;protected websites (unless you specify in Cloudflare Bot Protection settings that you <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#robotstxt-and-bot-protection">allow it against your site</a>).</p></li><li><p><strong>Rate limiting constraints</strong>: It strictly respects <em>robots.txt</em> directives and crawl delays, which can significantly slow or limit large crawls.</p></li></ol><p>In simple terms, if you want to use it for general-purpose, large-scale web crawling, I wouldn&#8217;t recommend it. The market offers more effective solutions that can actually bypass anti-bot protections. Plus, remember that around <em><a href="https://www.securitymagazine.com/articles/101188-65-of-websites-arent-protected-from-bots">35% of all websites</a></em> are estimated to be protected against bots (i.e., you won&#8217;t be able to crawl them with this API).</p><p>Yet, if you know the target site is not protected, budget isn&#8217;t a concern, and you want to remain (<em>overly?</em>) ethical and compliant, the Cloudflare Crawl API can be an option.</p><p>I hope this breakdown helps you better understand this new solution and make an informed decision on whether to adopt it. Lastly, remember that the Cloudflare Crawl API is still in beta, so things may change soon. Just <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/">keep an eye on the docs for updates</a>. 
Until next time!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #103: Bypassing DataDome-Protected Websites in the Agentic Era]]></title><description><![CDATA[Fifteen browser configurations, one tough anti-bot, and only a couple made it to the cart]]></description><link>https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 30 Apr 2026 21:34:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e5dad0e-b094-41c0-942c-c76f3783b289_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This year every web infrastructure company seems to be shipping a browser. Not a regular browser, but one designed to be driven by an AI agent and to look human while doing it. We wanted to know if any of those browsers actually work against a serious anti-bot, so we picked a hard target, leroymerlin.fr behind DataDome, and tested more than a dozen different setups on the same four-step task: open the homepage, search for a product, open the first result, add it to the cart.<br></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>The short answer is that a couple of tools finished the task, but only one did so with any consistency. The story behind why is worth telling, because it explains what is happening at the intersection of AI agents and web data right now. We ran a similar exercise <a href="https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026">against Cloudflare earlier this year</a>, and the conclusion is broadly the same: each anti-bot needs its own answer, and the answer changes every quarter.</p><h2>From workflows to agents, and why that changes the data problem</h2><p>Most code shipped under the AI banner is not really agentic. It is workflow code with an LLM dropped into a slot: generate a summary here, classify a record there, draft an email at the end. The control flow is hard-coded, and the model is one component among many.</p><p>An agent is quite different. The model decides the next action, observes the outcome, and decides again. The control flow lives inside the loop, not outside it. 
The agent has goals rather than scripts, and it picks tools and steps based on what it sees. That is what makes the engineering interesting, that is what makes it hard, and that is what sometimes makes it unreliable.</p><p>It also forces a different relationship with data. An agent that only sees its training corpus is stuck in the past. To make decisions worth anything, it has to read prices that change daily, stocks that move minute by minute, listings that did not exist last week. Some of that data sits behind APIs. Most of it does not. The web is still the largest and most current dataset in the world, and most of it is reachable only through a browser. So if we want our agents to act on real information, we have to give them a way to browse: opening a page, reading it, clicking a link, typing into a search bar, following a result, filling a form, all on sites that were never built for machines.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:69.35779816513761,&quot;width&quot;:630,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><p>This is the constraint that produced the wave of &#8220;agentic browser&#8221; launches we have seen over the last twelve months. Y Combinator alone has backed a long string of them. 
<a href="https://www.hyperbrowser.ai/">Hyperbrowser</a> (S21) was an early entry: scalable cloud browser infrastructure with built-in CAPTCHA solving, proxy management, and now a multi-agent playground. The newer cohort followed the agent wave more directly: <a href="https://www.browseros.com/">BrowserOS</a> (S24) is an open-source agentic browser that runs the agent locally on the user&#8217;s machine; <a href="https://browser-use.com/">Browser Use</a> (W25) offers an open-source agent loop on top of Playwright, plus a cloud version. <a href="https://www.skyvern.com/">Skyvern</a> is a self-hostable browser agent that uses an LLM and computer vision instead of fixed selectors.  Outside the YC pipeline, <a href="https://lightpanda.io/">Lightpanda</a> is doing something different again, a headless browser engine written from scratch in Zig and aimed squarely at agents and crawlers (claiming roughly 9x faster execution and 16x lower memory than Chrome). It fits the &#8220;browser built for machines&#8221; line of thought we covered in <a href="https://substack.thewebscraping.club/p/rethinking-the-web-browser">Rethinking the web browser</a> earlier this year. <a href="https://www.browserbase.com/">Browserbase</a> ships a managed browser plus Stagehand for natural-language automation. And the big AI labs are now in the same space: OpenAI shipped Operator and the ChatGPT Atlas browser, Anthropic shipped Computer Use, Perplexity launched Comet. Each project attacks the same problem from a slightly different angle, but the goal is identical: a browser an agent can drive without immediately tripping every detection mechanism on the other side.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h2>The same problem scrapers have been chasing for a decade</h2><p>For anyone who has worked in web data, none of this is new. The fight over whether a request looks human or automated has been going on as long as commercial scraping has existed. The product names have changed, but the purpose has not.</p><p>What has changed is who is selling the bypass. The companies that have spent years selling residential proxies and unblockers noticed quickly that the agentic boom is good for their business. They already have the IP networks, the fingerprint research, the bypass code, and the cat-and-mouse experience. They know what TLS handshake Chrome sent in October 2025 and what it sent in October 2024. Pivoting all of that into a managed browser is a smaller leap than building one from scratch. <a href="https://brightdata.com">Bright Data</a>, <a href="https://oxylabs.io">Oxylabs</a>, <a href="https://rayobyte.com">Rayobyte</a>, and <a href="https://www.zenrows.com">ZenRows</a> have all added a managed browser product alongside their proxies. </p><p>The other side of the line is moving in the opposite direction. Bot traffic has grown faster than human traffic for years, and the operators of large public sites care more about it than ever. <a href="https://datadome.co">DataDome</a>, <a href="https://www.cloudflare.com/products/bot-management/">Cloudflare Bot Management</a>, <a href="https://www.akamai.com/products/bot-manager">Akamai Bot Manager</a>, <a href="https://www.humansecurity.com">HUMAN</a>, <a href="https://www.kasada.io">Kasada</a>: every one of them ships updates that target the exact tools we just listed. Fingerprint checks get stricter. Behavioral models get more sensitive. 
The JavaScript challenge changes shape every few weeks. There is no silver bullet, and there is no tool, browser, proxy, or service that bypasses every anti-bot on every site at all times. Anyone who claims otherwise is selling something that worked last quarter and might still work this week. The useful question is what works on a given target, today, at what cost.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Picking a hard target</h2><p>To answer that question concretely, we needed a target where the anti-bot was good and the signal was clean. We picked leroymerlin.fr, the French DIY retailer. Leroy Merlin runs DataDome standalone, with no other anti-bot layer on top, so attribution is straightforward. It also runs one of the more verbose DataDome configurations we have come across: response headers expose <code>x-datadome-riskscore</code>, <code>x-datadome-protection</code>, <code>x-datadome-cid</code>, and <code>x-datadome-endpointid</code>. Most DataDome-protected sites only show us the outcome. Here we see the score the engine assigns at every request, which is rare and very useful when comparing tools side by side.</p><p>The task we picked is small but realistic. From the homepage, the agent has to type &#8220;ampoule B22 led blanc&#8221; into the search bar, click the first product result, and add the product to the cart. Four steps. We dropped the login step on purpose: leroymerlin.fr requires an OTP to sign in, and we did not want OTP friction to confound an anti-bot test.</p><p>A run is a pass if the agent reaches the cart confirmation. 
Otherwise we record where it stopped and what DataDome said about it. Each tool runs ten times back to back, and we aggregate the results. Tools that support an external proxy use the same residential pools: Bright Data residential FR for the Bright Data runs and <a href="https://geonode.com">Geonode</a> residential FR for the Geonode runs. Tools that ship their own proxy use it. We used two different providers to diversify the IP addresses and make sure that blocks were not a matter of IP reputation.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>The contestants</h2><p>As you&#8217;ve seen above, the browser landscape is quite crowded and we could not cover all the tools. We picked four open-source projects and seven commercial products. Let&#8217;s start with the open-source ones.</p><p><a href="https://camoufox.com">Camoufox</a> is the stealth Firefox fork most people in the scraping world have already met (we <a href="https://substack.thewebscraping.club/p/open-source-python-libraries-scraping">introduced it</a> on TWSC back in September 2024). It rotates real-world fingerprints, patches the obvious automation tells, and ships a Playwright-compatible API. We pair it with both Bright Data and Geonode residential proxies in France. </p><p><a href="https://github.com/autoscrape-labs/pydoll">Pydoll</a> takes a different route: it drives Chromium directly over CDP without WebDriver, with built-in humanized cursor movement and typing. 
Importantly, Pydoll implements an explicit <code>Fetch.authRequired</code> handler, which lets it authenticate proxies that require Basic auth. </p><p><a href="https://scrapling.readthedocs.io">Scrapling</a> is a higher-level Python library. We use it in two modes. <code>DynamicFetcher</code> launches vanilla Playwright Chromium driven by Scrapling&#8217;s session manager. <code>StealthyFetcher</code> does the same, but under the hood uses <a href="https://github.com/Kaliiiiiiiiii-Vinyzu/patchright">patchright</a>, a stealth-patched Playwright fork. Each gets its own row in the comparison. </p><p><a href="https://github.com/rayobyte-data/rayobrowse">RayoBrowse</a> is the self-hosted stealth Chromium fork from Rayobyte, distributed as a Docker container that exposes a CDP endpoint on port 9222. Here we hit a wall worth flagging: for some reason RayoBrowse could not use the Bright Data residential proxy in our setup. Every navigation through that proxy failed instantly, even though the same credentials worked fine through <code>curl</code> from inside the same container. The same RayoBrowse setup worked fine with Geonode. We did not isolate the root cause, so we report RayoBrowse on Geonode only.</p><p>The commercial side is more crowded. </p><p><a href="https://browser-use.com/">Browser Use</a> exists in two flavors, and we tested both. The cloud version is the managed Browser Use, with its own residential proxy, its own stealth fingerprinting, and a fixed set of supported models; we drove it once in raw CDP mode (we steer it ourselves with Playwright) and once in agent mode (we hand the LLM the task in natural language and let it plan the steps). </p><p><a href="https://www.browserbase.com/">Browserbase</a> is a managed Chromium with optional residential proxies, Cloudflare Web Bot Auth verification, and the Stagehand agent SDK. We discovered during the test that the free tier excludes proxies entirely; without one, the session egresses from a US datacenter. 
We left this configuration in the test because it is what a free user would experience. </p><p><a href="https://www.browserless.io">Browserless</a> is a managed browser-as-a-service whose anti-bot story is a stealth path (<code>/chromium/stealth</code>) plus optional residential proxies for paid plans. The free plan caps sessions at 60 seconds, which is tight for a four-step flow. We tested it with the built-in residential proxy targeting France, and tried to test it with our external proxies via the <code>externalProxyServer</code> parameter; the external mode failed at connection time on every run, in the same Chromium-side authentication way that broke RayoBrowse, so we drop those configurations from the comparison. </p><p><a href="https://zenrows.com/">ZenRows</a> Scraping Browser is a managed Chromium with a built-in residential proxy network and built-in CAPTCHA solving; we connect via the WSS endpoint with <code>proxy_country=fr</code> to get a French exit point. </p><p><a href="https://brightdata.com/">Bright Data Browser API</a> sits at the other end of the same product category: a managed Chromium with built-in residential rotation and CAPTCHA solving, on a dedicated Browser API zone we configured on their dashboard.</p><p>As always, the code can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved to paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">103.BROWSERS</a>.</strong></p><h2>What we had to fix before the numbers made sense</h2>
      <p>
          <a href="https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Stop Paying for Bandwidth: How to Leverage IPv6 Subnets for Infinite Proxy Rotation]]></title><description><![CDATA[Escape metered residential proxy billing. Discover how to build a self-hosted, rotating proxy gateway using IPv6 /64 subnets to drastically cut your web scraping costs at scale.]]></description><link>https://substack.thewebscraping.club/p/use-ipv6-scraping-nyxproxy</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/use-ipv6-scraping-nyxproxy</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 26 Apr 2026 20:30:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/21b6b18a-a1f6-4511-aec6-c5fc9ba435cd_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p style="text-align: justify;">When your data extraction pipelines scale from a few thousand requests a day to thousands of requests per second, the bottleneck becomes network egress and IP reputation. Modern web architectures are defended by sophisticated Web Application Firewalls (WAFs) that deploy strict rate limiting, fingerprinting, and behavioral analysis.</p><p style="text-align: justify;">This means that if you route all your traffic through a single egress IP, you will be rate-limited in seconds and blacklisted in minutes. To survive at scale, you need to distribute your requests across a massive pool of IP addresses.</p><p style="text-align: justify;">Traditionally, the web scraping industry has solved this issue thanks to commercial proxy providers. However, this is not the only approach. This article responds to the following question: &#8220;<em>Is there a way to scrape at scale without burning budget on proxies</em>?&#8221;</p><p style="text-align: justify;">The answer is yes. But let&#8217;s be clear from the beginning: This approach is not a universal silver bullet. 
Let&#8217;s see how it works, how to build it, and what its limitations are.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>The Typical Solution for Scraping at Scale: Proxy Provider Services</h2><p style="text-align: justify;">Let&#8217;s start this discussion with the typical choice for scraping at scale. 
IP bans and rate limits are the #1 operational problem in scraping, especially at scale. The typical solution is to route traffic through proxy servers, for a simple reason: <a href="https://substack.thewebscraping.club/i/164246773/what-are-proxies-and-why-are-they-used">proxies act as intermediaries between your scrapers and the Internet</a>, which helps keep your scrapers from getting banned. To do so, companies buy proxy IPs from proxy providers. The two most common categories, each with its own flaws, are the following:</p><ul><li><p style="text-align: justify;"><strong>Datacenter proxies:</strong> These are cheap and fast, but their ASNs (Autonomous System Numbers) are heavily scrutinized. WAFs maintain databases of known datacenter CIDR (Classless Inter-Domain Routing) blocks, so hitting a target with a static list of 100 datacenter proxies usually results in those IPs being flagged and blocked within hours.</p></li><li><p style="text-align: justify;"><strong>Residential proxies:</strong> These route traffic through actual consumer devices. They have highly trusted IP reputations, making them excellent for bypassing anti-bot systems. However, they are priced by bandwidth, so they become very expensive when scraping at scale.</p></li></ul><p style="text-align: justify;">The main limitation of this approach is cost. So, what if you need to scrape at scale but don&#8217;t have the budget to do so?</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>An Alternative Approach: Scraping at Scale With Dedicated Infrastructure</h2><p style="text-align: justify;">To escape metered billing, you can move egress back to dedicated infrastructure. But before presenting the solution, let&#8217;s briefly look at what happens at the infrastructure level when you buy and use proxies.</p><h3>Buying Proxies Means Delegating Your Infrastructure</h3><p style="text-align: justify;">When you buy proxies from providers, you are delegating 100% of your infrastructure. When your scrapers make requests, under the hood they connect to the provider&#8217;s gateway, which is a massive load balancer controlled entirely by the provider itself.</p><p style="text-align: justify;">Let&#8217;s consider the case of residential proxies, for simplicity. Behind the gateway is a peer-to-peer (P2P) network of millions of consumer devices that the provider has acquired bandwidth from. When your request hits the gateway, <strong>their proprietary routing algorithm decides which consumer device in which country will act as your final exit node</strong>.</p><p style="text-align: justify;">The moment you route traffic through their gateway is the moment you delegate 100% of your scraping infrastructure.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h3>NyxProxy: The Infrastructural Solution</h3><p style="text-align: justify;"><a href="https://github.com/phanes-io/nyxproxy-oss?tab=readme-ov-file">NyxProxy</a> is a self-hosted HTTP/SOCKS5 proxy server that exploits a well-known IPv6 networking trick: When a cloud provider gives you a <em>/64</em> subnet, you get roughly 18.4 <em>quintillion</em> IPv6 addresses to use.</p><p style="text-align: justify;">Let&#8217;s explain the number and the trick behind it. An IPv6 address looks like this:</p><pre><code><code>2a05:f480:1800:25db:0000:0000:0000:0001</code></code></pre><p style="text-align: justify;">IPv6 addresses are 128 bits long, which gives <em>2^128</em> possible addresses. The number is so large that the designers&#8217; reasoning was, in essence: &#8220;<em>We can afford to give every organization a massive block and never worry about running out</em>&#8221;.</p><p style="text-align: justify;">Now, here is the trick. An IPv6 address is split into two halves, 64 bits each:</p><pre><code><code>2a05:f480:1800:25db : 0000:0000:0000:0001
|___________________|   |_________________|
   Network prefix            Host part
   (your subnet)          (you control this)</code></code></pre><p style="text-align: justify;">The <em>/64</em> notation means: the first 64 bits identify the network, the last 64 bits are yours to assign however you want. The last 64 bits can be any value from <em>0000:0000:0000:0000</em> to <em>ffff:ffff:ffff:ffff</em>: That&#8217;s <em>2^64</em>, or about 18.4 quintillion combinations. All valid addresses, all routable to your server.</p><p style="text-align: justify;">Thanks to this trick, NyxProxy can assign a pool of those addresses to your network interface at startup, then rotate your outgoing traffic across them. This means each outgoing request can leave from a different source IP. The tool handles pool management, background rotation, and NDP proxying via <em>ndppd</em>, and it exposes a monitoring endpoint.</p><p style="text-align: justify;">The critical piece is the NDP proxying. When your server uses a random address like <em>2a05:f480:1800:25db:a3f1:9922:beef:1234</em> as a source IP, the upstream router needs to know <em>your server is responsible for that address</em>. Otherwise, the response packets have nowhere to go.</p><p style="text-align: justify;">IPv6 uses NDP (Neighbor Discovery Protocol) for this. The router sends an NDP query: <em>&#8220;who has 2a05:f480:1800:25db:a3f1:9922:beef:1234?&#8221;</em> and your server must answer.</p><p style="text-align: justify;"><em><a href="https://github.com/DanielAdolfsson/ndppd">ndppd</a></em> (NDP Proxy Daemon) runs on your server and answers those queries automatically for your entire /64 subnet, essentially saying <em>&#8220;yes, all of those addresses are mine&#8221;</em>. 
Without it, your packets go out, but responses never come back.</p><p style="text-align: justify;">Below is a summary schema of how this whole process works:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;ac241add-8e8d-40d0-a7df-518bccfc20bc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Provider gives you:  2a05:f480:1800:25db::/64
                     &#8595;
Your server can use: 2a05:f480:1800:25db:[anything]
                     &#8595;
NyxProxy assigns 200 random IPs to your interface
                     &#8595;
Each outgoing request binds to a different one
                     &#8595;
Target sees 200 different source IPs
                     &#8595;
ndppd makes sure responses route back correctly</code></pre></div><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>How To Use NyxProxy</h2><p>Let&#8217;s now see how to use NyxProxy with a practical implementation.</p><h3>Environment Setup &amp; Prerequisites</h3><p style="text-align: justify;">To replicate this tutorial for deploying NyxProxy and utilizing it in your scraping scripts, you must have the following system and hardware requirements:</p><ul><li><p style="text-align: justify;"><strong>Hardware</strong>: A Virtual Private Server (VPS) or bare-metal server with at least 512 MB of RAM and 100 MB of disk space. Supported architectures are <em>amd64</em> or <em>arm64</em>.</p></li><li><p style="text-align: justify;"><strong>Subnet</strong>: A cloud provider that natively delegates a full IPv6 <em>/64</em> subnet to your network interface. 
Note that not all the VPS providers are supported: Check out the <a href="https://github.com/phanes-io/nyxproxy-oss?tab=readme-ov-file#network-requirements">NyxProxy documentation to learn more about supported VPSs</a>.</p></li><li><p style="text-align: justify;"><strong>Operating system</strong>: A modern Linux distribution, specifically Ubuntu or Debian, to ensure compatibility with the automated setup scripts and <em>sysctl</em> kernel modifications.</p></li><li><p style="text-align: justify;"><strong>Python</strong>: <a href="https://www.python.org/downloads/">Python 3.7 or higher</a> installed on your local machine to run the scraping scripts.</p></li></ul><p style="text-align: justify;">To get your server ready to run the proxy daemon, you need to verify your IPv6 setup and gain root access. Ensure you are logged into your VPS via SSH as the <em>root</em> user, or have <em>sudo</em> privileges.</p><p style="text-align: justify;">First, verify that your server has a globally routable IPv6 <em>/64</em> subnet assigned to it. You can check this by running the following command in your server&#8217;s terminal:</p><pre><code><code>ip -6 addr show | grep "scope global"</code></code></pre><p>If done correctly, you should see an output similar to the following:</p><pre><code><code>inet6 2a05:f480:1800:25db::1/64 scope global</code></code></pre><p>If you do not see a <em>/64</em> subnet, you will not be able to rotate IPs, and you must review your cloud provider&#8217;s network settings.</p><p>Next, prepare your local development environment. Suppose you call the main folder of your Python project <em>nyxproxy_scraper/</em>. At the end of this step, the folder will have the following structure:</p><pre><code><code>nyxproxy_scraper/
    &#9500;&#9472;&#9472; main.py
    &#9492;&#9472;&#9472; venv/</code></code></pre><p>Where:</p><ul><li><p><em>main.py</em> is the Python file that will store your proxy request logic.</p></li><li><p><em>venv/</em> contains the standard Python virtual environment.</p></li></ul><p>You can create the <em>venv/</em> <a href="https://docs.python.org/3/library/venv.html">virtual environment</a> directory like so:</p><pre><code><code>python -m venv venv</code></code></pre><p>To activate it, on Windows, run:</p><pre><code><code>venv\Scripts\activate</code></code></pre><p>Equivalently, on macOS and Linux, execute:</p><pre><code><code>source venv/bin/activate</code></code></pre><p>As a final prerequisite, install the <a href="https://requests.readthedocs.io/en/latest/">Requests library</a> in your activated virtual environment so your Python script can make HTTP calls:</p><pre><code><code>pip install requests</code></code></pre><p>Well done! You are now ready to install and use NyxProxy.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3><strong>Installing and Configuring NyxProxy</strong></h3><p style="text-align: justify;">NyxProxy provides a quick setup script that handles the infrastructural heavy lifting. 
It auto-detects your network interface, installs <em>ndppd</em>, tweaks the Linux kernel parameters via <em>sysctl</em> to allow non-local binding, and downloads the compiled Go binary.</p><p style="text-align: justify;">You can launch it with the following single command:</p><pre><code><code>wget https://raw.githubusercontent.com/jannik-schroeder/nyxproxy-oss/main/scripts/quick-setup.sh &amp;&amp; chmod +x quick-setup.sh &amp;&amp; sudo ./quick-setup.sh</code></code></pre><p style="text-align: justify;">During the setup, you will be prompted to configure your proxy credentials and set your rotation rules. Behind the scenes, the script generates a <em>config.yaml</em> file. Let&#8217;s look at the crucial subset of that configuration:</p><pre><code><code>network:
  rotate_ipv6: true
  ipv6_subnet: "2a05:f480:1800:25db::/64"

  # The rotation mechanics:
  ipv6_pool_size: 200
  ipv6_max_usage: 100
  ipv6_max_age: 30</code></code></pre><p style="text-align: justify;">Below is an explanation of what these three parameters mean for your scraping pipeline:</p><ul><li><p style="text-align: justify;"><em>ipv6_pool_size</em>: NyxProxy keeps 200 mathematically unique IPs &#8220;hot&#8221; and bound to your network interface at any given time. This keeps proxy startup times under 100ms while maintaining IP diversity.</p></li><li><p style="text-align: justify;"><em>ipv6_max_usage</em>: After a specific IP has been utilized for 100 requests, it is considered &#8220;burned.&#8221; NyxProxy destroys the route and spins up a fresh address to dynamically replace it.</p></li><li><p style="text-align: justify;"><em>ipv6_max_age</em>: If an IP hasn&#8217;t hit 100 requests but has been alive for 30 minutes, it gets forcefully rotated out. This prevents time-based algorithmic tracking by the target WAF.</p></li></ul><p style="text-align: justify;">Once the daemon is running as a systemd service, your VPS is officially acting as a rotating proxy gateway. When NyxProxy receives a scraper request, the underlying Go binary takes over. It picks one of the 200 IPv6 addresses currently in its in-memory pool and binds to that specific address to establish the outbound connection.</p><p>When the service starts, the expected output is as follows:</p><pre><code><code>IPv6 rotation mode: IP Pool with dynamic rotation
  Interface: enp1s0
  Subnet: 2a05:f480:1800:25db::/64
  Pool size: 200 IPs
  Rotation: Every 100 uses or 30m0s
  Initializing IP pool...
  Progress: 50/200 IPs added
  Progress: 100/200 IPs added
  Progress: 150/200 IPs added
  Progress: 200/200 IPs added
  IP pool ready with 200 addresses
  Background IP rotation started

Starting https proxy on 0.0.0.0:8080 (Protocol: IPv6)</code></code></pre><div><hr></div><h3><strong>Testing the Proxy Logic</strong></h3><p style="text-align: justify;">At this point, NyxProxy has done its job. To verify it works correctly, you can use the following Python script that hits <em><a href="https://www.ipify.org/">api6.ipify.org</a></em>, which is an API that simply bounces back the IP address it sees:</p><pre><code><code>import requests

# Point this to your VPS IP and the credentials you set during setup
proxies = {
    'http': 'http://admin:password@your-vps-ip:8080',
    'https': 'http://admin:password@your-vps-ip:8080'
}

# Test 5 consecutive scraping requests
for i in range(5):
    # A timeout keeps the test from hanging if the proxy is unreachable
    response = requests.get('https://api6.ipify.org', proxies=proxies, timeout=10)
    print(f"Request {i+1}: Target sees IP -&gt; {response.text}")
</code></code></pre><p style="text-align: justify;">(NOTE: If you are already familiar with ipify.org, note that the &#8220;api6&#8221; prefix can be used for IPv6 requests only.)</p><p>The result should be similar to the following:</p><pre><code><code>Request 1: Target sees IP -&gt; 2a05:f480:1800:25db:1a2b:3c4d:5e6f:7890
Request 2: Target sees IP -&gt; 2a05:f480:1800:25db:9988:7766:5544:3322
Request 3: Target sees IP -&gt; 2a05:f480:1800:25db:aaaa:bbbb:cccc:dddd
Request 4: Target sees IP -&gt; 2a05:f480:1800:25db:1122:3344:5566:7788
Request 5: Target sees IP -&gt; 2a05:f480:1800:25db:dead:beef:cafe:babe</code></code></pre><p style="text-align: justify;">This shows that each request in the test left through a different, globally routable IPv6 address generated from your subnet block. To a target that only tracks individual addresses, these look like distinct clients.</p><p style="text-align: justify;">Perfect! You have successfully built a self-healing, infinitely rotating proxy pool without handing over your budget for metered residential bandwidth.</p><h2>The Illusion of Infinity: Critical Limitations of IPv6 Subnet Rotation</h2><p style="text-align: justify;">At this point, you may think you have found a solution to all of your budgeting problems for scraping at scale. But before you tear down your commercial proxy infrastructure, you must understand that a $5/month VPS and an open-source rotation daemon are not a silver bullet. If it were that simple, the commercial proxy industry would not exist.</p><p>This architecture has the following main limitations:</p><ul><li><p style="text-align: justify;"><strong>The IPv4 compatibility wall:</strong> This entire architecture is built on one absolute prerequisite: Your target endpoint must support IPv6. If you are scraping legacy enterprise systems or platforms that haven&#8217;t migrated to dual-stack networking, this setup is useless. You cannot route an IPv6 packet to an IPv4-only server.</p></li><li><p style="text-align: justify;"><strong>Subnet-level bans (</strong><em><strong>/64</strong></em><strong> prefix blocking):</strong> Enterprise WAFs are fully aware of IPv6 prefix delegation standards. They know that hosting providers typically allocate a <em>/64</em> subnet to a single client. 
If their heuristics detect automated traffic patterns (such as high concurrency, missing <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">browser fingerprints</a>, or anomalous TLS handshakes) originating from <em>2a05:f480...:1a2b</em>, they can ban the entire <em>/64</em> CIDR block. Once your <em>/64</em> prefix is banned, all 18 quintillion of your &#8220;infinite&#8221; IPs are simultaneously dead. To recover, you must destroy the VPS and provision a new one in a different IP range.</p></li><li><p style="text-align: justify;"><strong>ASN reputation:</strong> No matter how many IPs you rotate, your traffic still originates from a datacenter Autonomous System Number (ASN). Many anti-bot systems assign a baseline trust score to every ASN, and traffic originating from a datacenter ASN typically starts with a much lower trust score than traffic from a residential ASN. For highly restrictive targets, any request from a datacenter IP may be met with an unpassable CAPTCHA or a hard <em>403 Forbidden</em>, regardless of whether it&#8217;s IPv4 or IPv6.</p></li><li><p style="text-align: justify;"><em>nf_conntrack</em><strong> and hardware exhaustion:</strong> You cannot push enterprise-grade throughput on a $5, 1-vCPU server without consequence. Rotating thousands of IPv6 addresses requires the Linux kernel to aggressively maintain the <em><a href="https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.txt">nf_conntrack</a></em> table and the NDP proxy table. At high concurrency, the overhead of establishing, tracking, and tearing down thousands of TCP sockets across rotating addresses can exhaust the memory or CPU of a low-tier VPS. 
The kernel will begin dropping packets, latency will spike to unusable levels, and your scrapers will start failing with connection errors.</p></li></ul><h2>Conclusion</h2><p style="text-align: justify;">In this article, you learned how to leverage your hosting provider&#8217;s IPv6 <em>/64</em> subnet to build an infinitely rotating proxy pool with NyxProxy, escaping the metered billing of residential proxy networks.</p><p style="text-align: justify;">The competitive advantage of engineering your own proxy infrastructure lies in your unit economics and architectural control. However, you also learned that this solution is not a silver bullet for every scraping scenario: It comes with trade-offs and constraints.</p><p style="text-align: justify;">So, let us know: Have you already experimented with self-hosted IPv6 rotation for your scraping pipelines? What targets did it work best for? Let&#8217;s discuss in the comments!</p>]]></content:encoded></item><item><title><![CDATA[The Trick to Scrape Next.js Websites in Seconds]]></title><description><![CDATA[Scraping data from the most widely used full-stack framework in the world with just 3 lines of code!]]></description><link>https://substack.thewebscraping.club/p/scrape-nextjs-websites</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/scrape-nextjs-websites</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 19 Apr 2026 19:18:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/17ee7337-9a3d-445a-a255-2895a6ed8235_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Next.js is one of the most widely adopted full-stack JavaScript frameworks on the planet. If you&#8217;ve ever built or deployed a web app, you definitely know it, or at least you&#8217;ve heard of it.</p><p>Behind the scenes, it relies on hydration to make server-rendered pages interactive. 
And here&#8217;s the interesting part: the same mechanism that makes Next.js fast and popular also exposes a significant amount of structured data in the HTML sent by the server. From a scraping perspective, that&#8217;s a huge opportunity!</p><p>In this post, I&#8217;ll show you a simple trick to scrape data from virtually any Next.js website. Follow along as I break down how it works and how you can apply it yourself.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Next.js in Numbers</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1F7B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1F7B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 424w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 848w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1272w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png" width="1456" height="1065" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1065,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Next.js&#8217; GitHub star growth&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Next.js&#8217; GitHub star growth" title="Next.js&#8217; GitHub star growth" srcset="https://substackcdn.com/image/fetch/$s_!1F7B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 424w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 848w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1272w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Next.js&#8217; GitHub star growth</figcaption></figure></div><p>Next.js needs no introduction, but it&#8217;s worth giving some context to truly understand how popular it is (<em>and therefore how useful the trick I&#8217;m about to present for Next.js web scraping can be</em>):</p><ul><li><p>According to the <a href="https://survey.stackoverflow.co/2025/">2025 Stack Overflow Developer Survey</a>, 20.8% of respondents used Next.js extensively over the past year.</p></li><li><p>Next.js is the 14th largest repository on GitHub, with <a href="https://github.com/vercel/next.js">over 138k stars</a> (and still growing!).</p></li><li><p><a 
href="https://w3techs.com/technologies/overview/javascript_library">According to W3Techs</a>, Next.js has a 2.9% market share among JavaScript libraries.</p></li><li><p>Major brands such as <a href="https://nextjs.org/showcase">Nike, Stripe, and Notion have chosen this full-stack framework</a> to build their official websites.</p></li></ul><h2>Before Getting Started: A Bit of Context on Hydration</h2><p>I know you probably just want the trick&#8230; Still, let me take a minute to explain why it works in the first place, why it&#8217;s even possible, and what kind of data you&#8217;ll actually retrieve with it!</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><h3>What Is Hydration?</h3><p><a href="https://en.wikipedia.org/wiki/Hydration_(web_development)">Hydration</a> is the process that makes a server-rendered page interactive in the browser.</p><p>Frameworks like Next.js, Remix, Nuxt, and SvelteKit employ this mechanism to combine the performance benefits of <a href="https://nextjs.org/docs/pages/building-your-application/rendering/server-side-rendering">server-side rendering (SSR)</a> with the interactivity of client-side applications.</p><p>The idea is that the server first sends fully rendered static HTML to the browser. 
Then hydration takes over.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jt2I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 424w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 848w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1272w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png" width="1227" height="633" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:633,&quot;width&quot;:1227,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)" title="The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)" srcset="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 424w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 848w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)</figcaption></figure></div><p>The browser downloads the JavaScript bundle, and the frontend framework reconstructs the component tree in memory, attaches event 
listeners, and links that virtual tree to the existing DOM instead of re-rendering it from scratch. The result is a fully interactive application built on top of server-rendered HTML.</p><h3>How Does the Hydration Mechanism Work?</h3><p>It&#8217;s now clear that in Next.js and similar frameworks, hydration is the process where a static, server-rendered HTML page &#8220;comes to life&#8221; and becomes fully interactive in the browser. But what&#8217;s actually happening under the hood?</p><p>At a high level, hydration is a 3-step process:</p><ol><li><p>The server generates and sends a fully rendered HTML snapshot. The user immediately sees the content (great for <a href="https://web.dev/articles/fcp">First Contentful Paint</a>). At this point, though, the page is just static HTML. Buttons, forms, and other interactive elements are visible, but they don&#8217;t work yet because no JavaScript is attached.</p></li><li><p>The client&#8217;s browser downloads the JavaScript bundle (which includes React and your frontend application code) and executes it.</p></li><li><p>React rebuilds the component tree in memory and attaches event listeners to the existing DOM nodes. Instead of discarding the HTML and re-rendering everything from scratch, React &#8220;hydrates&#8221; the existing markup, meaning it reuses it and wires it up with state and interactivity.</p></li></ol><p>Once hydration completes, the page behaves like a normal single-page application: it responds to clicks, manages state, and updates dynamically.</p><p>And here&#8217;s an important detail: if the browser doesn&#8217;t support JavaScript (or it fails to load), the user still sees the server-rendered HTML. It won&#8217;t be interactive, but the core content is there. 
That&#8217;s great for SEO and perceived performance!</p><h3>Why It Matters for Scraping Next.js (and Other Full-Stack Frameworks&#8230;)</h3><p>The key insight you need to understand is simple: <strong>hydration requires data</strong>, and that data must be embedded somewhere in the HTML sent by the server!</p><p>In Next.js, when the server renders a page, it doesn&#8217;t only send markup. It also serializes the data required to rebuild the React component tree on the client. That serialized payload is embedded directly into the page&#8217;s HTML.</p><p>That&#8217;s exactly why hydration matters for scraping. Instead of parsing the DOM or simulating user interactions through <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browser automation</a>, you can extract the structured data that React itself uses to hydrate the page.</p><p>In many cases, hydration data is cleaner and easier to parse than the rendered HTML. It can also contain more information than what&#8217;s visibly displayed on the page, including hidden and interesting metadata.</p><p>Keep in mind that this principle applies not only to Next.js! All other full-stack frameworks that rely on hydration, such as Remix, Nuxt, Angular Universal, and SvelteKit, tend to dehydrate state on the server and rehydrate it on the client.</p><p>So remember this simple rule. If a framework hydrates, it must serialize data. 
And if it serializes data into the HTML, you can scrape it.</p><h2>How to Scrape Next.js Websites: 2 Approaches</h2><p>How you scrape a Next.js site by targeting its hydration data depends on how that data is embedded in the HTML generated on the server side.</p><p>I won&#8217;t go too deep into framework internals here (if you&#8217;re a Next.js dev, you already know things shift depending on whether you&#8217;re using the <em><a href="https://nextjs.org/docs/app/getting-started">App Router</a></em> or the <em><a href="https://nextjs.org/docs/pages/getting-started">Pages Router</a></em>), but there are essentially two scenarios you&#8217;ll run into.</p><p>In this section, I&#8217;ll walk through both of them and show you exactly how I retrieve data from each!</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Approach #1: Target the __NEXT_DATA__ Script</h3><p>As a target, I&#8217;ll use a <a href="https://www.nike.com/t/air-jordan-5-retro-wolf-grey-mens-shoes-0M9kM1yX/DD0587-002">Nike product page</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uJcE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!uJcE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 424w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 848w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png" width="1456" height="785" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target Nike page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target Nike page" title="The target Nike page" 
srcset="https://substackcdn.com/image/fetch/$s_!uJcE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 424w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 848w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The target Nike page</figcaption></figure></div><p>That&#8217;s actually a great example because Nike.com is even showcased on the Next.js homepage as a real-world site built with the framework.</p><p>Now, right-click on the page and select the &#8220;Inspect&#8221; option in your browser to open the DevTools. Scroll through the DOM and get familiar with the page structure. If the Next.js site is using the <em>Pages Router</em>, you&#8217;ll notice a <em>&lt;script&gt;</em> tag with the id <em>__NEXT_DATA__</em> containing a large JSON blob:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rV1e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rV1e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 848w, 
https://substackcdn.com/image/fetch/$s_!rV1e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png" width="1456" height="1260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1260,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the JSON data inside the #__NEXT_DATA__ element&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the JSON data inside the #__NEXT_DATA__ element" title="Note the JSON data inside the #__NEXT_DATA__ element" srcset="https://substackcdn.com/image/fetch/$s_!rV1e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 848w, 
https://substackcdn.com/image/fetch/$s_!rV1e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the JSON data inside the #__NEXT_DATA__ element</figcaption></figure></div><p>That JSON data is precisely the 
hydration data I was referring to earlier.</p><p>When a site uses the Pages Router approach in Next.js, the server embeds all the page data directly into that <em>&lt;script&gt;</em> tag. From a scraping perspective, that&#8217;s gold, as the data is already structured and ready to be captured.</p><p>Here&#8217;s a simple JavaScript snippet to extract it:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const hydrationScript = document.querySelector("#__NEXT_DATA__")
const hydrationData = JSON.parse(hydrationScript.innerHTML)
console.log(hydrationData)</code></pre></div><p>What&#8217;s happening here is straightforward. The JS script:</p><ul><li><p>Selects the <em>&lt;script&gt;</em> element with <em>id</em> <em>__NEXT_DATA__</em>.</p></li><li><p>Reads its inner HTML (which is a JSON string).</p></li><li><p>Parses it into a JavaScript object.</p></li><li><p>Logs it to the console.</p></li></ul><p>Run this directly in the DevTools Console, and you&#8217;ll immediately see the result:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2AK7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2AK7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 848w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png" width="1456" height="1260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dc28944-8842-4605-be59-b746fef469db_1746x1511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1260,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2AK7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 848w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the structured hydration data</figcaption></figure></div><p>What&#8217;s interesting is how much structured data you get right away. This includes product details, images, metadata, and more. 
Everything is neatly organized, and it only took three lines of code!</p><p>If you want to store the JSON hydration object, just right-click the object in the Console and select the &#8220;Copy object&#8221; option:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m1uv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m1uv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 424w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 848w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1272w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png" width="1456" height="483" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Selecting the &#8220;Copy object&#8221; option&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Selecting the &#8220;Copy object&#8221; option" title="Selecting the &#8220;Copy object&#8221; option" srcset="https://substackcdn.com/image/fetch/$s_!m1uv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 424w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 848w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1272w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Selecting the &#8220;Copy object&#8221; option</figcaption></figure></div><p>From there, you can paste it wherever you need (e.g., into a local <em>.json</em> file, a MongoDB collection, etc.).</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Approach #2: Target the self.__next_f.push() Elements</h3><p>The second, more complex approach applies to Next.js pages built with the <em>App Router</em>.</p><p>Even though the <em>App 
Router</em> has been the recommended direction for a while, in my experience, it&#8217;s still not as widely adopted as the <em>Pages Router</em>. And honestly, that&#8217;s a bit of a gift for us, as scraping hydration data in <em>App Router</em> sites is definitely more complex!</p><p>As a reference, let&#8217;s look at the &#8220;<a href="https://openai.com/business/">Business Overview</a>&#8221; page on the OpenAI website, which is built with the Next.js <em>App Router</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CAEI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CAEI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 424w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 848w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1272w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png" width="1456" height="709" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:709,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target page" title="The target page" srcset="https://substackcdn.com/image/fetch/$s_!CAEI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 424w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 848w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1272w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The target page</figcaption></figure></div><p>Just like before, open DevTools and inspect the page. 
This time, focus on the <em>&lt;script&gt;</em> tags inside the <em>&lt;body&gt;</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LTkB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LTkB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png" width="1456" height="1182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the hydration script elements&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the hydration script elements" title="Note the hydration script elements" srcset="https://substackcdn.com/image/fetch/$s_!LTkB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the hydration script elements</figcaption></figure></div><p>You&#8217;ll notice several scripts containing content like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">self.__next_f.push(&lt;some_data&gt;)</code></pre></div><p>That &#8220;<em>&lt;some_data&gt;</em>&#8221; is serialized using the <a href="https://tonyalicea.dev/blog/understanding-react-server-components/">React Flight protocol for React Server Components (RSC)</a>. I won&#8217;t go too deep into the internals here (it&#8217;s a dense topic!), but what matters is that <strong>deserializing that data is </strong><em><strong>not</strong></em><strong> straightforward!</strong></p><p>React Flight isn&#8217;t plain JSON. 
It mixes control records (<em>HL</em>, <em>I</em>, <em>J</em>, etc.), module references, streaming boundaries, and serialized model fragments into a transport format that React incrementally resolves at runtime.</p><p>You might think: &#8220;Why not just reuse the frontend deserialization library?&#8221; In practice, that doesn&#8217;t work well because:</p><ul><li><p>The client decoder (<em><a href="https://www.npmjs.com/package/react-server-dom-webpack">react-server-dom-webpack</a></em>) expects a full React runtime.</p></li><li><p>It relies on module maps and webpack IDs generated at build time.</p></li><li><p>It resolves component references against the exact bundle that produced the stream.</p></li><li><p>It assumes streaming semantics and internal React wiring.</p></li></ul><p>Outside that exact environment, you don&#8217;t have the module graph, build manifest, or hydration context. So even if you import the decoder, you can&#8217;t reconstruct the component tree the way the browser does.</p><p>Recent security issues in React Flight payload deserialization highlight just how sensitive and complex this layer is. For more details, refer to:</p><ul><li><p><em><a href="https://nextjs.org/blog/CVE-2025-66478">Security Advisory: CVE-2025-66478</a></em></p></li><li><p><em><a href="https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components">Critical Security Vulnerability in React Server Components</a></em></p></li></ul><p>So, instead of fighting the protocol, I&#8217;d keep things simple and extract the raw, unparsed React Flight strings. 
Achieve that with the JS script below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const nextFlightScripts = [...document.querySelectorAll("script")]
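  // Keep only the inline scripts that reference the React Flight global
  .filter(script =&gt; script.textContent.includes("self.__next_f"))
  // Collect the raw, unparsed source of each matching script
  .map(script =&gt; script.textContent.trim())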
console.log(nextFlightScripts)</code></pre></div><p>This selects all <em>&lt;script&gt;</em> elements containing &#8220;self.__next_f&#8221; and builds an array of their raw contents.</p><p>Run it in the Console, and you&#8217;ll get an array of React Flight chunks:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LBAG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LBAG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png" width="1456" height="1182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the React Flight strings&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the React Flight strings" title="Note the React Flight strings" srcset="https://substackcdn.com/image/fetch/$s_!LBAG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the React Flight strings</figcaption></figure></div><p>From there, the simplest way to extract structured data is often to copy the array, feed it to an AI, and ask it to reconstruct a parsed JSON representation of the meaningful payload sections:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!08ee!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!08ee!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 424w, https://substackcdn.com/image/fetch/$s_!08ee!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 848w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1272w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png" width="1456" height="973" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:973,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the parsed version of the source data produced by Gemini&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the parsed version of the source data produced by Gemini" title="Note the parsed version of the source 
data produced by Gemini" srcset="https://substackcdn.com/image/fetch/$s_!08ee!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 424w, https://substackcdn.com/image/fetch/$s_!08ee!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 848w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1272w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the parsed version of the source data produced by Gemini</figcaption></figure></div><p>Is this more complicated than the <em>__NEXT_DATA__</em> trick? Absolutely! Yet, it&#8217;s still a powerful way to access a large amount of page data with just a few lines of code.</p><h2>Final Script to Quickly Access Data From Next.js Sites</h2><p>If you combine the two approaches, you can build a production-ready script for brute-force hydration data scraping in Next.js:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// Pages Router approach (__NEXT_DATA__)
const hydrationScript = document.querySelector("#__NEXT_DATA__")
let nextData = null
if (hydrationScript) {
  try {
    nextData = JSON.parse(hydrationScript.textContent)
    console.log("__NEXT_DATA__ found:")
    console.log(nextData)
  } catch (err) {
    console.warn("Failed to parse __NEXT_DATA__:", err)
  }
} else {
  console.log("No __NEXT_DATA__ script found.")
}

// App Router approach (self.__next_f)
const nextFlightScripts = [...document.querySelectorAll("script")]
  .map(script =&gt; script.textContent.trim())
  .filter(content =&gt; content.includes("self.__next_f.push"))

if (nextFlightScripts.length &gt; 0) {
  console.log("React Flight scripts found:")
  console.log(nextFlightScripts)
} else {
  console.log("No React Flight scripts found.")
}</code></pre></div><p>To test it, just open the Console in DevTools, paste the script, and run it.</p><p><strong>Important</strong>: The <em>&lt;script&gt;</em> elements containing hydration data aren&#8217;t loaded dynamically via client-side rendering. They&#8217;re embedded directly in the HTML generated by the server.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Km-I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Km-I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png" width="1456" height="865" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Km-I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the #__NEXT_DATA__ element in the page source</figcaption></figure></div><p>That means you can:</p><ol><li><p>Fetch the target Next.js-powered page with an HTTP client.</p></li><li><p>Parse the HTML using an HTML parsing library like Beautiful Soup or Cheerio.</p></li><li><p>Apply a similar version of the JavaScript script above, but adapt it to the API provided by your HTML parser.</p></li></ol><p>In other words, this trick for scraping Next.js doesn&#8217;t only work in the browser DevTools. 
It also works perfectly in regular scraping scripts!</p><h2>Pros and Cons of This Approach to Next.js Scraping</h2><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Simple and effective, requiring only a few lines of code.</p></li><li><p>Works on all Next.js websites (and, more generally, on most sites that rely on hydration).</p></li><li><p>Can let you access more data than what is actually displayed on the page.</p></li><li><p>No need for browser automation, waiting for client-side rendering, or simulating user interactions.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>You may only get partial data, meaning you might still need to complement it with a more traditional scraping approach.</p></li><li><p>React Flight data is difficult to parse and may require custom logic or even <a href="https://substack.thewebscraping.club/p/llms-ai-web-scraping">AI-assisted parsing</a>.</p></li></ul><div><hr></div><h2>Conclusion</h2><p>Here, I&#8217;ve shared <a href="https://brightdata.com/blog/how-tos/web-scraping-with-next-js">a trick I personally documented years ago</a>, and that still works to this day. 
It allows you to quickly scrape data from virtually any Next.js site by targeting the hydration data embedded in the HTML document generated by the server and sent to the client for rendering.</p><p>As you&#8217;ve seen, with just a few lines of JavaScript, you can extract hydration data from virtually any Next.js-powered page. What you get back is clean, or at least nearly clean, data that you can process directly in your data pipelines.</p><p>Instead of fighting the frontend, this Next.js web scraping approach helps you leverage the data the framework itself needs to function!</p><p>I hope you found this useful and insightful. If you have questions or thoughts, feel free to share them in the comments below!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #102: How Fast Can You Call Polymarket's APIs?]]></title><description><![CDATA[Three languages, four locations, 1,000 requests. The biggest speed gain has nothing to do with code.]]></description><link>https://substack.thewebscraping.club/p/how-to-get-data-from-polymarket-fast</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-get-data-from-polymarket-fast</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 16 Apr 2026 14:08:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/dd002f1e-6fe6-4cde-8c7d-1fdaa94d11d3_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There is a platform whose name has been bouncing around the news for over a year now. A new military action? Someone predicted it on Polymarket. An event that moves the price of oil or shakes a currency? Someone else, or maybe the same person, placed a bet a few hours before and walked away with a pile of money. Every time a headline breaks, Polymarket seems to have already priced it in, or worse, someone appears to have known in advance. 
<br>Even here on Substack, you can share the predictions coming from the platform.<br></p><div class="polymarket-embed" data-attrs="{&quot;eventSlug&quot;:&quot;claude-5-released-by&quot;,&quot;marketSlug&quot;:&quot;&quot;,&quot;profileName&quot;:&quot;&quot;,&quot;belowTheFold&quot;:false,&quot;fullEmbedUrl&quot;:&quot;https://substack.com/embed/polymarket/claude-5-released-by&quot;,&quot;isGraphMode&quot;:false}" data-component-name="PolymarketToDOM"></div><p><br>But what is Polymarket, and how does it work?<br></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What Polymarket is, and why it keeps making headlines</h2><p>Polymarket is a prediction market, a platform where you buy and sell shares tied to the outcome of real-world events. If the event happens, your share pays $1. If it doesn&#8217;t, it pays $0. The trading price at any moment reflects what the market collectively believes the probability of that outcome is. You can bet on elections, geopolitics, sports, crypto prices, and increasingly anything else with a verifiable resolution. </p><p>It is the largest prediction market by volume, built on the Polygon blockchain. Its main competitor, Kalshi, operates as a CFTC-regulated exchange in the US. Both are attracting billions in volume, and Wall Street firms are now building dedicated trading desks around them.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><p>The platform handled <a href="https://www.ccn.com/news/crypto/polymarket-7-5-billion-2025-prediction-markets/">at least $7.5 billion in volume during 2025</a> (a conservative figure, since <a href="https://www.paradigm.xyz/2025/12/polymarket-volume-is-being-double-counted">Polymarket volume is commonly double-counted</a> due to how OrderFilled events are summed) <a href="https://www.trmlabs.com/resources/blog/how-prediction-markets-scaled-to-usd-21b-in-monthly-volume-in-2026">and set a single-day record of $425 million in February 2026</a> when Iran-related markets resolved simultaneously. Those are not toy numbers. And with that kind of money flowing through, the headlines have followed.</p><p>In January 2026, a newly created Polymarket account invested $30,000 <a href="https://www.npr.org/2026/01/05/nx-s1-5667232/polymarket-maduro-bet-insider-trading">and walked away with $436,759</a> after correctly betting on Maduro&#8217;s removal from power. The account was created less than a week before the U.S. military operation, and the bulk of bids were placed hours before Trump&#8217;s announcement. In a separate case, <a href="https://www.haaretz.com/israel-news/israel-security/2026-03-28/ty-article/.premium/court-clears-air-force-officer-charged-with-leaking-iran-strike-for-online-bets/0000019d-2f2e-d868-a1bd-7fef78860000">an Israeli Air Force reservist was indicted for leaking classified details</a> about a strike on Iran to guide Polymarket bets, netting roughly $244,000. 
<a href="https://www.cnn.com/2026/03/24/politics/iran-war-bets-prediction-markets">A different trader has made nearly $1 million since 2024</a> from dozens of well-timed bets correctly predicting U.S. and Israeli military actions against Iran, winning 93% of five-figure wagers. <a href="https://www.cnbc.com/2026/04/15/kalshi-and-polymarket-congress-regulation-washington-influence.html">These incidents triggered at least eight prediction market bills in Congress</a> since January 2026, and federal prosecutors in Manhattan are <a href="https://www.cnn.com/2026/03/30/politics/prediction-markets-justice-department">actively exploring whether certain prediction market bets violate insider trading laws</a>.</p><p>But insider trading is not the only way people make money on Polymarket. There is a quieter, more interesting story happening in parallel.</p><p></p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><p><br></p><h2>The efficiency gap</h2><p><a href="https://www.princeton.edu/~ceps/workingpapers/91malkiel.pdf">The Efficient Market Hypothesis</a>, formalized by Eugene Fama in the 1960s, states that asset prices reflect all available information, making it impossible to consistently beat the market. In traditional equity markets, this largely holds because massive institutional capital from hedge funds, pension funds, and proprietary trading firms constantly hunts for and eliminates mispricings. 
The S&amp;P 500 trades roughly $500 billion daily. Any pricing error gets corrected in milliseconds by algorithms running in colocated data centers.</p><p>Polymarket&#8217;s individual markets often have only tens of thousands of dollars in liquidity. The ratio of &#8220;smart money&#8221; to &#8220;total market cap&#8221; is fundamentally different from equity markets, and that is why edges persist longer than they would on Wall Street. <a href="https://arxiv.org/abs/2508.03474">A 2025 study by IMDEA Networks Institute</a> documented $40 million in arbitrage profits extracted from Polymarket alone between April 2024 and April 2025, analyzing 86 million bets. <a href="https://www.financemagnates.com/trending/prediction-markets-are-turning-into-a-bot-playground/">Arbitrage opportunities on the platform last an average of 4 seconds, with 73% of profits captured by bots</a> executing in under 100 milliseconds.</p><p>The institutional side is catching up. <a href="https://www.financemagnates.com/fintech/wall-street-quants-move-into-prediction-markets-to-hunt-for-arbitrage-not-to-bet/">DRW is hiring dedicated prediction market traders</a> at a $200,000 base salary. Susquehanna International Group became the first official market maker on Kalshi (a competing platform). Jump Trading is building specialized desks. But the market is not there yet. 
Liquidity is too thin for these firms to deploy serious capital without moving prices, leaving room for smaller, faster actors.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Speed as edge: what people are building</h2><p>I&#8217;ve been studying Polymarket for some months now, and I&#8217;ve probably ended up in a bubble on Instagram and other social media. I&#8217;m seeing a growing number of traders share systems designed to exploit exactly this kind of market inefficiency. The approaches vary, but the pattern is the same: faster information, faster execution, profit.</p><p>One notable case involves a trader who claims to use computer vision models to process live football match video feeds. His system watches the match in real time, detects events (goals, red cards, penalties) through frame analysis, and places bets on prediction markets seconds before the event registers on official data feeds and bookmaker odds adjust. He claims an 8-second advantage over other traders (unfortunately, I can no longer find the Instagram post about it). Whether or not that specific claim holds up, the idea is nothing new: courtsiders have been doing this in tennis for years, <a href="https://fivethirtyeight.com/features/inside-the-shadowy-world-of-high-speed-tennis-betting">attending live matches and transmitting scores</a> faster than official data feeds reach bookmakers. In 2016, tennis umpires from Kazakhstan, Turkey, and Ukraine were banned for deliberately delaying score updates for courtside accomplices.</p><p>The same principle applies at a larger scale. 
<a href="https://www.bloomberg.com/news/features/2018-05-03/the-gambler-who-cracked-the-horse-racing-code">Bill Benter built a multinomial logit model</a> with over 120 variables per horse for Hong Kong racing and extracted over $1 billion between 1987 and 2001. <a href="https://www.racingpost.com/news/britain/high-court-case-alleges-tony-blooms-betting-empire-makes-600m-a-year-so-what-do-we-know-about-his-starlizard-syndicate-aNlkE7t8daxQ/">Tony Bloom&#8217;s Starlizard syndicate employs 160 people </a>to model Asian handicap football markets and reportedly generates 600 million GBP per year. <a href="https://www.financemagnates.com/trending/prediction-markets-are-turning-into-a-bot-playground/">On Polymarket itself, 14 of the top 20 most profitable wallets are bots</a>. <a href="https://www.coindesk.com/markets/2026/02/21/how-ai-is-helping-retail-traders-exploit-prediction-market-glitches-to-make-easy-money">One bot turned $313 into $414,000</a> in a single month, exploiting temporal arbitrage in 15-minute crypto markets.</p><p>All of these systems share two requirements: data and speed. They need real-time access to market prices, order books, and event outcomes, and they need to act on that data faster than everyone else. 
All of this is possible because Polymarket provides a full set of APIs that can be used to operate programmatically on the platform.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Polymarket&#8217;s API architecture</h2><p>Polymarket exposes <a href="https://docs.polymarket.com/api-reference/introduction">three distinct APIs</a>, each serving a different purpose. Understanding which one to use and when is the first step toward building anything that trades or monitors this market.</p><h3>Gamma API: market discovery</h3><p>The Gamma API is the browsing layer. It returns human-readable market data: questions, descriptions, outcome prices, volume, liquidity, event metadata. No authentication required.</p><p><strong>Base URL</strong>: https://gamma-api.polymarket.com</p><p>Key endpoints:</p><p>- <code>GET /markets</code> returns a paginated list of markets with filtering options (limit, offset, closed, tag_id)</p><p>- <code>GET /markets/{id} </code>returns a single market by ID or slug</p><p>- <code>GET /events</code> and <code>GET /events/{id}</code> return event-level data (events group related markets)</p><p>- <code>GET /search?query=... 
</code>performs keyword search across markets and events</p><p>A single call to /markets?limit=1&amp;closed=false returns something like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;7e25266f-af9b-4b4c-b40f-660d9c8e031f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">{
  "id": "540816",
  "question": "Russia-Ukraine Ceasefire before GTA VI?",
  "conditionId": "0x9c1a953fe92c8357f1b646ba25d983aa83e90c525992db14fb726fa895cb5763",
  "outcomes": "[\"Yes\", \"No\"]",
  "outcomePrices": "[\"0.545\", \"0.455\"]",
  "volume": "1516211.89",
  "liquidity": "62104.61",
  "clobTokenIds": "[\"850149715...\", \"252731249...\"]"
}</code></pre></div><p>The <code>clobTokenIds</code> field is the bridge to the trading layer. Each outcome (Yes/No) gets its own token ID, which is what you pass to the CLOB API to get real-time prices and order book data. Note that in the sample above this field, like <code>outcomes</code> and <code>outcomePrices</code>, is itself a JSON-encoded string, so it has to be parsed a second time before use.</p><p>The Gamma API is rate-limited at roughly 60 requests per minute. It is useful for discovery and metadata, not for real-time price monitoring.</p><h3>CLOB API: the order book</h3><p>The CLOB (Central Limit Order Book) API is where trading happens. It has both public and authenticated endpoints.</p><p><strong>Base URL</strong>: https://clob.polymarket.com</p><p><strong>Public endpoints (no authentication):</strong></p><p>- <code>GET /price?token_id=X&amp;side=BUY|SELL</code> returns the current best price</p><p>- <code>GET /midpoint?token_id=X</code> returns the midpoint between best bid and ask</p><p>- <code>GET /spread?token_id=X</code> returns the current spread</p><p>- <code>GET /book?token_id=X</code> returns the full order book with all bids and asks</p><p>- <code>GET /last-trade-price?token_id=X</code> returns the last executed trade price</p><p>- <code>GET /tick-size?token_id=X</code> returns the minimum price increment</p><p>A call to <code>/midpoint</code> returns a minimal payload:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;5b93dac3-b924-4bde-9cca-74db20b575d9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">{"mid": "0.545"}</code></pre></div><p>The <code>/book</code> endpoint returns the full depth:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;7ba6f05f-de54-4ad7-ba8b-50d1d9f4eddc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">{
  "market": "0x9c1a95...",
  "asset_id": "850149715...",
  "bids": [
    {"price": "0.54", "size": "15234.50"},
    {"price": "0.53", "size": "8920.00"}
  ],
  "asks": [
    {"price": "0.55", "size": "12100.00"},
    {"price": "0.56", "size": "6500.00"}
  ]
}</code></pre></div><p>These public endpoints are what matters for price monitoring. They are lightweight, return small payloads, and have no authentication overhead.</p><p><strong>Authenticated endpoints</strong> require a <a href="https://docs.polymarket.com/developers/CLOB/authentication">two-level authentication system</a>:</p><p><strong>Level 1 (L1)</strong> uses EIP-712 wallet signatures. You sign a structured message proving you control a specific Ethereum wallet address. This is a one-time operation that generates API credentials:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;a46d1352-1fcd-4eb2-98dd-5baea0815327&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">POST /auth/api-key
Headers: POLY_ADDRESS, POLY_SIGNATURE, POLY_TIMESTAMP, POLY_NONCE
Returns: { apiKey, secret, passphrase }</code></pre></div><p><strong>Level 2 (L2)</strong> uses HMAC-SHA256 signing on every request. Every authenticated call requires five headers: <code>POLY_ADDRESS</code>, <code>POLY_SIGNATURE</code> (computed HMAC of the request), <code>POLY_TIMESTAMP</code>, <code>POLY_API_KEY</code>, and <code>POLY_PASSPHRASE</code>.</p><p>Even with L2 auth, placing an order requires the user to sign the order payload locally with their private key. That makes three cryptographic operations in total: key derivation (once), request signing (per call), and order signing (per order).</p><p>The authenticated endpoints include:</p><p>- <code>POST /order</code> places a single order</p><p>- <code>POST /orders</code> places a batch of orders</p><p>- <code>DELETE /order</code> cancels an order</p><p><strong>WebSocket feeds</strong> provide real-time streaming at <code>wss://ws-subscriptions-clob.polymarket.com/ws/</code> for order book updates, price changes, and user-specific events.</p><h3>Data API: analytics</h3><p>The Data API at <code>https://data-api.polymarket.com</code> provides analytics-oriented data: user positions, trade history, leaderboards, and holder information. It is less documented and less stable than the other two. Some endpoints returned 404 or empty responses during our testing. It is useful for research, but not reliable for production.</p><h2>The speed game: calling the APIs as fast as possible</h2><p>If arbitrage opportunities on Polymarket last 4 seconds on average, and 73% of profits go to bots executing in under 100 milliseconds, then the speed at which you can read prices and place orders is a direct competitive advantage. 
We set up a benchmark to answer two questions: where should you run your code, and which language and HTTP strategy gets you there fastest?</p><p>We chose the <code>/midpoint</code> endpoint for the benchmark because it requires no authentication, returns the smallest possible payload, and isolates HTTP client performance from payload parsing. Each benchmark runs 1,000 requests in two modes: sequential (one request at a time, measuring per-request latency) and concurrent (50 simultaneous workers, measuring throughput).<br><br>As always, the code can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">102.POLYMARKET</a>.</strong></p>
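<p>To make the two modes concrete, here is a minimal Python sketch of the idea (an illustration, not the benchmark code from the repository). It assumes the public <code>/midpoint</code> endpoint and the CLOB base URL quoted earlier; the function names, the thread pool, and the placeholder token ID are choices made for this example only.</p>

```python
import json
import statistics
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

CLOB_BASE = "https://clob.polymarket.com"  # base URL quoted in the article


def fetch_midpoint(token_id, timeout=5.0):
    """Call the public /midpoint endpoint and return the parsed JSON body."""
    query = urllib.parse.urlencode({"token_id": token_id})
    url = f"{CLOB_BASE}/midpoint?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def timed_call(fetch):
    """Return the wall-clock duration of one fetch() call, in milliseconds."""
    start = time.perf_counter()
    fetch()
    return (time.perf_counter() - start) * 1000.0


def run_sequential(fetch, n_requests):
    """Sequential mode: one request at a time, returns per-request latencies."""
    return [timed_call(fetch) for _ in range(n_requests)]


def run_concurrent(fetch, n_requests, workers):
    """Concurrent mode: fixed worker pool, returns (latencies, requests/sec)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda _: timed_call(fetch), range(n_requests)))
    elapsed = time.perf_counter() - start
    return latencies, n_requests / elapsed


def summarize(latencies):
    """Median and (nearest-rank) 95th-percentile latency in milliseconds."""
    ordered = sorted(latencies)
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {"median_ms": statistics.median(ordered), "p95_ms": ordered[p95_index]}


# Example usage (hypothetical token ID; take a real one from clobTokenIds):
#   call = lambda: fetch_midpoint("REPLACE_WITH_A_REAL_TOKEN_ID")
#   print(summarize(run_sequential(call, 20)))
#   latencies, rps = run_concurrent(call, 20, workers=5)
#   print(summarize(latencies), rps)
```

<p>Keep in mind that <code>urllib</code> opens a new connection for every call, so each sample includes the TCP and TLS handshake. A client that reuses connections would likely report lower numbers, which is exactly the kind of difference a benchmark like this is meant to expose. Start with small request counts to stay within the API rate limits.</p>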
      <p>
          <a href="https://substack.thewebscraping.club/p/how-to-get-data-from-polymarket-fast">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Stealth Stack: A Guide to Preventing Data Leaks in Web Scraping Infrastructure]]></title><description><![CDATA[A four-layer defense strategy for making your web scraping infrastructure indistinguishable from real users]]></description><link>https://substack.thewebscraping.club/p/the-stealth-stack-web-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/the-stealth-stack-web-scraping</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 12 Apr 2026 03:00:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ef273b12-ade2-4ba6-a14a-701876041775_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you hear about &#8220;data leaks&#8221;, I&#8217;m sure you think about cybersecurity, databases, and personal information lost due to malicious intent. But what if I told you your web scraper is leaking data? In the specific context of web scraping, no one is stealing your data. Rather, it means that your scraper is revealing its automated nature through a set of signals.</p><p>In particular, your scrapers leak information at four distinct layers. Modern anti-bot systems fingerprint your browser, analyze your TLS handshake, trace your network infrastructure, and track your behavioral patterns. A single inconsistency across these layers can be enough to get you blocked.</p><p>This means your scrapers aren&#8217;t competing only against rate limits anymore. Today, they are competing against <a href="https://substack.thewebscraping.club/p/machine-learning-for-detecting-bots">machine learning models trained on billions of legitimate requests</a>, and any deviation from the expected pattern is a signal. 
So, if you want to scrape at scale, your infrastructure must be indistinguishable from a real user&#8217;s browser, network stack, and behavior.</p><p>This article guides you through a systematic approach: First, understanding where leaks occur, then learning how anti-bot systems detect them, and finally building a layered defense that makes your scraper invisible.</p><p></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" 
height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>Identifying the Leaks: Where Your Scraper Exposes Itself</strong></h2><p>Before fixing anything, you need to understand the complete attack surface. 
Modern anti-bot systems analyze your scraper at four distinct layers, and a leak at any layer can expose you.</p><h3><strong>Layer 1: The Browser Level</strong></h3><p>Headless browsers are loud by default. Launch a <a href="https://pptr.dev/">Puppeteer</a> instance and check the <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver">navigator.webdriver</a></em> flag. By default it returns <em>true</em>, and that&#8217;s a signal every major anti-bot system checks in the first 100ms of page load.</p><p>But this obvious flag is just the beginning. Anti-bot systems probe deeper:</p><ul><li><p><strong>Error messages and stack traces</strong>: They differ between headless and headed modes. The execution context leaves fingerprints in error objects.</p></li><li><p><strong>Window dimensions</strong>: Properties like <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/outerWidth">window.outerWidth</a></em> and <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/outerHeight">window.outerHeight</a></em> reveal headless operation because headless mode doesn&#8217;t render a visible window frame.</p></li><li><p><strong>Canvas rendering</strong>: It can produce pixel-level differences. Software rendering (headless) creates different anti-aliasing and color values than GPU-accelerated rendering (headed). Color channels can differ by 1-2 units per pixel.</p></li><li><p><strong><a href="https://developer.mozilla.org/en-US/docs/Web/API/WebGLShader">WebGL shader</a> timing</strong>: This can vary a lot, depending on the underlying technology. GPU-accelerated browsers complete WebGL operations in microseconds. Software-rendered headless browsers take milliseconds.</p></li><li><p><strong>Font rendering</strong>: Headless environments often lack the full system font stack. 
This creates detectable layout differences when JavaScript measures text dimensions.</p></li><li><p><strong>Performance benchmarks</strong>: They can reveal software rendering. For example, some websites run JavaScript stress tests, creating thousands of DOM elements, calculating layouts, and triggering reflows. In such scenarios, real browsers with GPU acceleration show consistent performance. Headless browsers, instead, show different timing patterns.</p></li><li><p><strong>The </strong><em><strong><a href="https://developer.chrome.com/docs/extensions/reference/api/windows">window.chrome</a></strong></em><strong> object behaves differently</strong>: Real Chrome populates this object with specific properties for extension management and runtime APIs. Headless Chrome, instead, either lacks this object or provides an incomplete implementation.</p></li></ul><h3><strong>Layer 2: The Network Level</strong></h3><p>Your SSL/TLS handshake identifies you before you send any application data. When your scraper connects over HTTPS, it sends a TLS Client Hello message containing supported encryption methods, protocol versions, and extensions, all in a specific order.</p><p>Here&#8217;s what makes this dangerous:</p><ul><li><p><strong>Every browser and HTTP library has a unique TLS pattern:</strong> Real browsers send their TLS parameters in a specific sequence that matches their version and underlying platform. Python&#8217;s standard HTTP libraries send a completely different pattern. So do Node.js, Go, and any other language you use to write your scrapers.</p></li><li><p><strong>Anti-bot systems fingerprint your TLS handshake:</strong> They capture these patterns and convert them into a fingerprint, commonly called a <a href="https://github.com/salesforce/ja3">JA3 hash</a>. 
They maintain databases of known fingerprints for every major browser and HTTP library.</p></li><li><p><strong>Mismatches between User-Agent and TLS fingerprint are instant red flags:</strong> When you claim to be Chrome in your User-Agent header but your TLS handshake matches Python&#8217;s urllib library, that inconsistency triggers blocking.</p></li><li><p><strong>Detection happens before you send any application data:</strong> The TLS handshake on your very first connection already identifies you as automated traffic.</p></li><li><p><strong>HTTP/2 fingerprinting adds another layer:</strong> Beyond TLS, the order and priority of HTTP/2 frames, settings, and window updates create additional fingerprints. Your HTTP library&#8217;s frame ordering must match your claimed browser identity.</p></li></ul><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong>, with high-reputation IPs, on your side improves your chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3><strong>Layer 3: The Infrastructure Level</strong></h3><p>Your proxy configuration can expose your real infrastructure through network-level leaks via the following main mechanisms:</p><ul><li><p><strong>DNS leaks:</strong> They happen when your browser resolves domain names using your local DNS server instead of routing through the proxy. Your scraper might send requests through a Miami residential proxy, but if DNS queries go through your AWS datacenter in Virginia, the target site knows your real location.</p></li><li><p><strong>WebRTC leaks:</strong> <a href="https://webrtc.org/">WebRTC </a>is a browser API designed for peer-to-peer communication. 
Even with a proxy configured, WebRTC will attempt to discover your real local IP and public IP through STUN servers, completely bypassing your proxy.</p></li><li><p><strong>IP reputation:</strong> Not all IPs are created equal. Cloudflare and similar services maintain databases of every AWS, Google Cloud, and Azure IP range. Requests from known cloud providers receive instant higher suspicion scores before any other analysis happens.</p></li></ul><h3><strong>Layer 4: The Behavioral Level</strong></h3><p>Even if your browser, network, and infrastructure are perfectly disguised, your behavior patterns can still expose you:</p><ul><li><p><strong>Timing patterns:</strong> Requesting data at fixed and precise intervals creates a perfect periodicity. No human browses with mathematical precision.</p></li><li><p><strong>Mouse and scroll behavior:</strong> Real humans accelerate and decelerate smoothly. Instant jumps from point A to point B are mechanically impossible.</p></li><li><p><strong>Session state:</strong> Stateless scrapers that never accumulate cookies or maintain persistent sessions across days look like fresh bots on every run.</p></li><li><p><strong>Interaction sequences:</strong> The time between page load and first click, between mouse-over and click, or the pattern of how you scroll through content. They all follow detectable human patterns.</p></li></ul><h2><strong>Understanding the Detection: How Anti-Bot Systems Catch You</strong></h2><p>Now that you know where leaks occur, let&#8217;s understand how anti-bot systems actually detect them.</p><h3><strong>Fingerprint Consistency Checks</strong></h3><p>Anti-bot systems cross-reference your claimed identity with actual behavior. 
If your <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent">User-Agent</a> says &#8220;Chrome 120 on Windows 10,&#8221; they verify that your JavaScript features, WebGL capabilities, canvas rendering, and TLS handshake all match Chrome 120 on Windows 10.</p><p>A single mismatch anywhere flags the entire request. You can&#8217;t be Chrome in your User-Agent, Firefox in your TLS handshake, and headless Chrome in your canvas fingerprint. <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">Anti-bot systems create composite fingerprints combining dozens of properties</a>, then compare them against databases of known legitimate and bot patterns.</p><h3><strong>Machine Learning Pattern Recognition</strong></h3><p>Modern anti-bot systems use ML models trained on billions of requests. They learn what &#8220;normal&#8221; looks like for each type of visitor. For example, consumer browsers on residential IPs show different behavioral patterns than scrapers running in datacenters.</p><p>For ML models, statistical anomalies trigger investigation. Perfect timing intervals, impossible mouse movements, or timing patterns that don&#8217;t match human variance distributions are scored as anomalous. These models adapt continuously, so when new stealth techniques emerge, the models retrain on that data. This means that what works today might fail tomorrow.</p><h3><strong>Progressive Trust Scoring</strong></h3><p>Anti-bot systems don&#8217;t just block or allow requests. They also score them. Requests with lower trust scores receive degraded service: slower response times, rate limits, or <a href="https://substack.thewebscraping.club/p/are-captchas-still-a-thing">CAPTCHA challenges</a> before an outright block.</p><p>Also, scores accumulate across sessions. If you leak information across multiple visits, the system builds a profile associating your various identities. 
In other words, one leak can poison future requests, and even fixing the leak might not restore trust if your IP or fingerprint is already marked.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2><strong>Building the Defense: A Layered Approach to Stealth</strong></h2><p>Building a defense against data leaks in web scraping requires addressing each layer systematically. Your stealth stack must work from the inside out: browser &#8594; network &#8594; infrastructure &#8594; behavior. Each layer must remain consistent with your claimed identity.</p><h3><strong>Defense Layer 1: Hardening the Browser</strong></h3><p>The goal at this layer is to make the browser fingerprint indistinguishable from a real user&#8217;s browser and ensure every property is consistent with your claimed identity.</p><p><strong>Step 1: Mask Automation Signals</strong></p><p>Start with stealth libraries that patch the most common detection vectors:</p><ul><li><p><strong>For Puppeteer:</strong> Use <em><a href="https://www.npmjs.com/package/puppeteer-extra-plugin-stealth">puppeteer-extra-plugin-stealth</a></em> to automatically override <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver">navigator.webdriver</a></em>, DevTools Protocol signatures, and plugin arrays.</p></li><li><p><strong>For <a href="https://www.selenium.dev/">Selenium</a>:</strong> Use <em><a href="https://pypi.org/project/undetected-chromedriver/">undetected-chromedriver</a></em>, which patches the ChromeDriver binary to remove common automation 
signals and drives a regular Chrome installation.</p></li><li><p><strong>For Playwright:</strong> Playwright does not ship a dedicated stealth mode, so pair it with a community stealth plugin and launch flags like the ones shown below.</p></li></ul><p>Additionally, disable automation flags at launch. For example, in Playwright:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
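            # Turns off the Blink feature that sets navigator.webdriver to true
            '--disable-blink-features=AutomationControlled',
            # Not stealth-related: these two flags only help Chromium run inside containers
            '--disable-dev-shm-usage',
            '--no-sandbox'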
        ]
    )</code></code></pre><p>But remember: Stealth libraries handle the most common 20-30 leak vectors but miss advanced fingerprinting techniques. They&#8217;re your foundation, not your complete solution.</p><p><strong>Step 2: Spoof Hardware Signatures</strong></p><p>Cloud server canvas and WebGL fingerprints are obvious red flags. AWS, GCP, and Azure rendering signatures are well-known to anti-bot systems.</p><p>You have two approaches for your defense here:</p><ul><li><p><strong>Add consistent noise:</strong> Inject deterministic noise into canvas operations so the fingerprint remains stable across sessions but doesn&#8217;t match your server&#8217;s real hardware. Override canvas methods to modify pixel data slightly before it&#8217;s read back. Keep noise minimal: just enough to mask the real hardware signature without appearing obviously manipulated.</p></li><li><p><strong>Emulate common consumer hardware:</strong> Spoof WebGL parameters to mimic common consumer GPUs. Override vendor and renderer strings returned by WebGL APIs to match your chosen hardware profile. Use existing libraries designed for canvas fingerprint defense or implement your own parameter overrides.</p></li></ul><p><strong>Step 3: Ensure Version Consistency</strong></p><p>This is where most scrapers fail, even with stealth libraries. Your User-Agent string must match your actual browser engine behavior precisely. Consider the following rules of thumb:</p><ul><li><p><strong>Use real browser binaries instead of spoofing:</strong> Tools like Playwright can launch actual Chrome, ensuring perfect consistency between claimed version and actual behavior.</p></li><li><p><strong>If you must spoof, maintain complete version profiles:</strong> Track which JavaScript features, WebGL capabilities, and API behaviors correspond to each browser version. 
Every property must align.</p></li><li><p><strong>Never mix components from different versions:</strong> If you claim Chrome 120 on Windows 10, every single API, from JavaScript features to WebGL renderers, must behave exactly like Chrome 120 on Windows 10.</p></li></ul><h3><strong>Defense Layer 2: Hardening the Network Stack</strong></h3><p>Your goal at this layer is to make your TLS handshake and HTTP traffic indistinguishable from the browser you&#8217;re claiming to be.</p><p><strong>Step 4: Match TLS Fingerprints to Your Browser Identity</strong></p><p>Standard HTTP libraries can&#8217;t mimic browser TLS fingerprints because they use different SSL/TLS implementations. The solution requires specialized libraries that replicate browser behavior at the protocol level:</p><ul><li><p><strong>For Python:</strong> Use <em><a href="https://curl-cffi.readthedocs.io/en/latest/">curl_cffi</a></em> or similar wrappers. These libraries use <em><a href="https://curl.se/libcurl/">libcurl</a></em> compiled with <em><a href="https://github.com/google/boringssl">BoringSSL</a></em>, which is the same SSL library Chrome uses. This creates identical JA3 fingerprints to real browsers.</p></li><li><p><strong>For Node.js:</strong> Use <em><a href="https://www.npmjs.com/package/cycletls">cycletls</a></em> or equivalent libraries that allow you to specify exact JA3 fingerprint strings matching real browsers.</p></li></ul><p><strong>Critical requirement:</strong> Your TLS fingerprint must match your User-Agent. Chrome 120&#8217;s JA3 fingerprint is different from Firefox 115&#8217;s fingerprint. The browser identity must be consistent across all layers.</p><p><strong>Step 5: Match HTTP/2 Fingerprints</strong></p><p>Beyond TLS, HTTP/2 frame ordering creates additional fingerprints. 
Libraries like <em>curl_cffi</em> handle this automatically when you specify a browser to impersonate, but verify that:</p><ul><li><p>Settings frames match your target browser.</p></li><li><p>Window update sequences align.</p></li><li><p>Priority headers follow the correct pattern.</p></li></ul><p>In Python, you can do so with the following code:</p><pre><code><code>from curl_cffi import requests

# tls.peet.ws echoes back the TLS and HTTP/2 details it observed
response = requests.get(
    'https://tls.peet.ws/api/all',
    impersonate='chrome120'
)
print(response.json()['http2']['sent_frames'])
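
# --- Optional follow-up (a hedged sketch, not part of curl_cffi) ---
# The printout above is easier to act on when it is compared against a capture
# taken from the real browser you impersonate. `fingerprint_mismatches` is a
# hypothetical helper, and the key paths in FINGERPRINT_PATHS are assumptions
# about the echo service's JSON layout, so adjust them to what you receive.
# JA4 is used instead of JA3 because recent Chrome versions randomize the
# order of their TLS extensions, which changes a raw JA3 hash per connection.
FINGERPRINT_PATHS = (
    ('tls', 'ja4'),
    ('http2', 'akamai_fingerprint_hash'),
    ('user_agent',),
)


def _dig(payload, path):
    """Follow a tuple of keys into nested dicts, returning None if a key is absent."""
    node = payload
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def fingerprint_mismatches(observed, reference, paths=FINGERPRINT_PATHS):
    """Return the dotted key paths whose values differ between two captures."""
    return [
        '.'.join(path)
        for path in paths
        if _dig(observed, path) != _dig(reference, path)
    ]


# Usage (reference_capture is the same endpoint's JSON saved from a real browser):
# print(fingerprint_mismatches(response.json(), reference_capture))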
</code></code></pre><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3><strong>Defense Layer 3: Hardening Infrastructure</strong></h3><p>Your goal at this layer is to ensure your network traffic originates from legitimate-looking IPs and doesn&#8217;t leak your real location or identity.</p><p><strong>Step 6: Choose the Right Proxy Type</strong></p><p>IP reputation is the first filter that anti-bot systems check. This means that your <a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies">proxy choice determines your baseline trust score</a>. Consider the following guidelines:</p><ul><li><p><strong>Datacenter IPs = instant red flag:</strong> Requests from AWS, Google Cloud, and Azure IP ranges immediately receive higher suspicion scores.</p></li><li><p><strong>Residential proxies = high legitimacy:</strong> These IPs come from real ISP connections, so they look legitimate because they are legitimate consumer connections.</p></li><li><p><strong>Mobile proxies = premium legitimacy</strong>: These IPs originate from cellular networks (4G/5G) and receive the highest trust scores. Mobile IPs rotate naturally as devices move between cell towers, making them appear even more organic than static residential connections.</p></li></ul><p><strong>Step 7: Prevent DNS Leaks</strong></p><p>Force all DNS resolution through your proxy tunnel. 
For SOCKS5 proxies, use the SOCKS5h protocol variant, which forces DNS resolution on the remote proxy server instead of locally.</p><p>For example, in Python, write the following:</p><pre><code><code>import requests

proxies = {
    'http': 'socks5h://proxy.example.com:1080',
    'https': 'socks5h://proxy.example.com:1080'
}

# SOCKS support in requests needs the extra dependency: pip install "requests[socks]"
response = requests.get('https://example.com', proxies=proxies)
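
# --- Optional guard (a hedged sketch; `resolves_dns_remotely` is a hypothetical helper) ---
# A plain 'socks5://' URL still resolves hostnames on your own machine, so a
# one-letter typo in the scheme reintroduces the leak. Checking the mapping
# before the first request makes that mistake fail loudly.
from urllib.parse import urlsplit


def resolves_dns_remotely(proxy_map):
    """Return True only if every proxy URL uses the socks5h scheme."""
    return bool(proxy_map) and all(
        urlsplit(url).scheme == 'socks5h' for url in proxy_map.values()
    )


# Usage: assert resolves_dns_remotely(proxies), 'DNS would be resolved locally'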
</code></code></pre><p>For browser automation, configure DNS-over-HTTPS to prevent local DNS leakage. The following is an example that applies to Playwright:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        args=[
            '--dns-over-https-server=https://cloudflare-dns.com/dns-query'
        ]
    )
</code></code></pre><p><strong>Step 8: Disable WebRTC Completely</strong></p><p>WebRTC will expose your real IP unless you completely disable it in browser automation. For example, in Playwright, you can do so as follows:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    
    # Remove WebRTC entirely (register the script before navigating anywhere)
    page.add_init_script("""
        delete window.RTCPeerConnection;
        delete window.RTCSessionDescription;
        delete window.RTCIceCandidate;
        delete navigator.mediaDevices;
    """)
</code></code></pre><p>When you&#8217;ve done this, verify it&#8217;s actually disabled before deploying your scraper. Visit <a href="http://browserleaks.com/webrtc">browserleaks.com/webrtc</a> with your scraper. You should see &#8220;WebRTC is not supported by your browser&#8221; or, at most, your proxy IP. Your real IP must never appear.</p><h3><strong>Defense Layer 4: Mimicking Human Behavior</strong></h3><p>Your goal at this layer is to make your interaction patterns indistinguishable from those of real human users.</p><p><strong>Step 9: Add Timing Jitter and Randomization</strong></p><p>Humans are inconsistent. Perfect patterns are robotic. The solution here is not to just add randomness. You also need to match the statistical distribution of real human behavior. To do so, consider the following example in Python:</p><pre><code><code>import random
import time

import numpy as np

# Wrong example (do not use this)

# Fixed interval
time.sleep(5)  # Always 5 seconds - DETECTABLE

# Random uniform
time.sleep(random.uniform(3, 7))  # Still doesn't match human patterns

# ------------

# Correct example (use this!)

# Log-normal distribution (matches real human reaction times)
delay = np.random.lognormal(mean=1.5, sigma=0.5)
time.sleep(delay)
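
# --- Optional extension (a hedged sketch using only the standard library) ---
# Different actions deserve different delay ranges. `human_delay` is a
# hypothetical helper: it centers a log-normal draw on the geometric middle of
# each range and clamps the result to that range. The ranges are rules of
# thumb, and the sigma choice (range width / 4 in log space) is an assumption,
# not a fit to measured human data.
import math
import random

DELAY_RANGES = {
    'click': (0.3, 2.0),
    'read': (5.0, 45.0),
    'scroll': (1.0, 8.0),
}


def human_delay(action, rng=random):
    """Draw a delay in seconds for the given action type."""
    low, high = DELAY_RANGES[action]
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / 4
    return min(high, max(low, rng.lognormvariate(mu, sigma)))


# Usage: time.sleep(human_delay('read'))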
</code></code></pre><p>To improve randomization, model different action types with appropriate distributions. Use the following rules of thumb:</p><ul><li><p>Clicks: 0.3-2 seconds (short delays)</p></li><li><p>Reading: 5-45 seconds (high variance)</p></li><li><p>Scrolling: 1-8 seconds (irregular intervals)</p></li></ul><p><strong>Step 10: Implement Realistic Mouse and Scroll Behavior</strong></p><p>High-security sites like banking, ticketing, and heavily protected e-commerce websites track interaction patterns in real time. To avoid giving yourself away on such websites, you have to script realistic mouse movements and scrolling for your automated scripts.</p><p>For mouse movements, you can:</p><ul><li><p>Use Bezier curves to create natural arcing movements between points.</p></li><li><p>Add slight randomness to destination coordinates.</p></li><li><p>Include hover delays before clicking.</p></li><li><p>Vary the number of intermediate steps based on distance.</p></li></ul><p>The following is an example you can try in Python:</p><pre><code><code>import numpy as np
from playwright.async_api import async_playwright  # the helpers below are coroutines

def bezier_curve(start, end, control_points, num_steps=20):
    """Generate points along a Bezier curve for natural mouse movement"""
    t = np.linspace(0, 1, num_steps)
    points = []
    
    # Simplified cubic Bezier
    for t_val in t:
        x = ((1-t_val)**3 * start[0] +
             3*(1-t_val)**2*t_val * control_points[0][0] +
             3*(1-t_val)*t_val**2 * control_points[1][0] +
             t_val**3 * end[0])
        y = ((1-t_val)**3 * start[1] +
             3*(1-t_val)**2*t_val * control_points[0][1] +
             3*(1-t_val)*t_val**2 * control_points[1][1] +
             t_val**3 * end[1])
        points.append((x, y))
    
    return points

async def human_like_click(page, selector, start=(0.0, 0.0)):
    """Move along a Bezier curve from `start` to the element, then click.

    Playwright does not expose the current pointer position, so the caller
    passes in the last known position and keeps the one returned here.
    """
    element = await page.query_selector(selector)
    box = await element.bounding_box()
    
    # Add slight randomness to destination
    target_x = box['x'] + box['width']/2 + np.random.normal(0, 2)
    target_y = box['y'] + box['height']/2 + np.random.normal(0, 2)
    
    # Random control points bend the path into a natural arc
    control_points = [
        (start[0] + np.random.uniform(-50, 50), 
         start[1] + np.random.uniform(-50, 50)),
        (target_x + np.random.uniform(-20, 20), 
         target_y + np.random.uniform(-20, 20))
    ]
    
    # Move mouse along curve
    points = bezier_curve(start, (target_x, target_y), control_points)
    
    for x, y in points:
        await page.mouse.move(x, y)
        await page.wait_for_timeout(np.random.uniform(5, 15))
    
    # Hover briefly before clicking
    await page.wait_for_timeout(np.random.uniform(100, 300))
    await page.mouse.click(target_x, target_y)
    return target_x, target_y
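

# --- Optional refinement (a hedged sketch; the constants are assumptions) ---
# A fixed num_steps makes short hops slow and long sweeps jerky. This
# hypothetical helper scales the step count with the distance travelled, and
# its result can be passed to bezier_curve(..., num_steps=...).
import math


def steps_for_distance(start, end, pixels_per_step=12, min_steps=8, max_steps=60):
    """Pick a step count proportional to the straight-line distance."""
    distance = math.hypot(end[0] - start[0], end[1] - start[1])
    return max(min_steps, min(max_steps, round(distance / pixels_per_step)))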
</code></code></pre><p>For scrolling, you can:</p><ul><li><p>Pause between scroll actions for variable amounts of time (simulating reading).</p></li><li><p>Scroll in chunks of varying size, not uniform pixels.</p></li><li><p>Occasionally scroll backwards (humans re-read).</p></li><li><p>Avoid perfect increments and constant speeds.</p></li></ul><p>Use the following Python code to try such scrolling behavior:</p><pre><code><code>import numpy as np


async def human_like_scroll(page, total_distance):
    """Scroll with human-like patterns"""
    scrolled = 0
    
    while scrolled &lt; total_distance:
        # Vary chunk size
        chunk = int(np.random.randint(100, 400))  # cast to a plain int before handing it to Playwright
        
        await page.mouse.wheel(0, chunk)
        scrolled += chunk
        
        # Pause to simulate reading
        pause = np.random.lognormal(mean=1.2, sigma=0.8)
        await page.wait_for_timeout(pause * 1000)
        
        # Occasionally scroll backwards (humans re-read)
        if np.random.random() &lt; 0.15:
            back = int(np.random.randint(50, 150))
            await page.mouse.wheel(0, -back)
            scrolled -= back  # keep the running total accurate
            await page.wait_for_timeout(np.random.uniform(500, 1500))
</code></code></pre><p><strong>Step 11: Maintain Persistent Session State</strong></p><p>Stateless scrapers look like bots. Real browsers, instead, accumulate state over time:</p><ul><li><p>Cookies persist across requests and sessions.</p></li><li><p>LocalStorage accumulates tracking data over time.</p></li><li><p>Session IDs remain stable across days or weeks.</p></li></ul><p>To mimic real browser states, you can use the following Python code:</p><pre><code><code>import pickle
import requests

# Save cookies to disk after each session
session = requests.Session()

# ... perform scraping ...

with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)

# Before next scraping session
with open('cookies.pkl', 'rb') as f:
    session.cookies.update(pickle.load(f))
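
# --- Optional alternative (a hedged sketch using only the standard library) ---
# Pickle files are opaque and unsafe to load if anyone else can write to them.
# The hypothetical helpers below store the same cookies as JSON instead. They
# accept any http.cookiejar.CookieJar, which is the base class of
# session.cookies in requests, and keep only the fields listed in the code.
import json
import os
from http.cookiejar import Cookie, CookieJar


def save_cookies(jar, path):
    """Write each cookie's core fields to a JSON file."""
    rows = [
        {'name': c.name, 'value': c.value, 'domain': c.domain,
         'path': c.path, 'secure': c.secure, 'expires': c.expires}
        for c in jar
    ]
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(rows, f)


def load_cookies(jar, path):
    """Add cookies from a JSON file to the jar; a missing file means a first run."""
    if not os.path.exists(path):
        return jar
    with open(path, encoding='utf-8') as f:
        rows = json.load(f)
    for row in rows:
        # Positional order follows http.cookiejar.Cookie: version, name, value,
        # port, port_specified, domain, domain_specified, domain_initial_dot,
        # path, path_specified, secure, expires, discard, comment, comment_url, rest
        jar.set_cookie(Cookie(
            0, row['name'], row['value'], None, False,
            row['domain'], bool(row['domain']), row['domain'].startswith('.'),
            row['path'], True, row['secure'], row['expires'],
            row['expires'] is None, None, None, {},
        ))
    return jar


# Usage: load_cookies(session.cookies, 'cookies.json') before scraping,
#        save_cookies(session.cookies, 'cookies.json') afterwards.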
</code></code></pre><p>In case you use a browser automation tool:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    
    # Save browser storage state
    context = browser.new_context()
    # ... perform scraping ...
    context.storage_state(path='state.json')
    
    # Reload in next session
    context = browser.new_context(storage_state='state.json')
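

# --- Optional guard (a hedged sketch; `existing_state` is a hypothetical helper) ---
# On the very first run there is no state file yet, and pointing new_context at
# a missing path raises an error. Passing None instead starts a clean context.
import os


def existing_state(path):
    """Return the path if a non-empty state file exists there, else None."""
    if os.path.isfile(path) and os.path.getsize(path) != 0:
        return path
    return None


# Usage: context = browser.new_context(storage_state=existing_state('state.json'))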
</code></code></pre><p>As a final note, consider keeping sessions alive for weeks to allow third-party tracking cookies to build up. Long-lived sessions with accumulated tracking data appear more legitimate than constantly refreshed clean states.</p><div><hr></div><h2><strong>Conclusion</strong></h2><p>In this article, you learned that, if you don&#8217;t want your data to be leaked while scraping, you have to take several defensive measures, as no single technique makes you invisible. Anti-bot systems analyze multiple signals simultaneously, and any inconsistency across layers triggers detection and blocks your scrapers.</p><p>Also, detection methods evolve. So, what works today might fail tomorrow. This means you should also monitor the defenses you implemented and test new ones.</p><p>Now, let us know: How do you prevent data leaks in your scrapers? Did we miss some technique?</p>]]></content:encoded></item><item><title><![CDATA[rayobrowse: A Hands-On Look at the Stealth Browser From Rayobyte]]></title><description><![CDATA[Looking for a Camoufox alternative? 
Here&#8217;s an interesting stealth browser worth checking out!]]></description><link>https://substack.thewebscraping.club/p/rayobrowse-browser-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/rayobrowse-browser-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 05 Apr 2026 03:00:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/442d19ad-ddc9-4b14-afda-71c81a91ffc4_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The open&#8209;source nature of Camoufox is what made the project so popular and appealing. Unfortunately, that same openness is also what allowed anti&#8209;bot giants to study it closely and eventually crack down on it.</p><p>Rayobyte, the proxy and web scraping solutions provider, has taken a different approach. They recently released <em>rayobrowse</em>, a closed&#8209;source yet Docker&#8209;based, self&#8209;hostable stealth browser built for local browser automation and web scraping.</p><p>In this post, I&#8217;ll take a deep look at this solution and walk you through everything you need to know about it. 
By the end, you&#8217;ll understand what rayobrowse is, how its stealth browser approach works, how to set it up, and whether it&#8217;s actually worth paying attention to.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>An Introduction to rayobrowse</h2><p>Let me introduce you to rayobrowse: what it is and what makes the project special.</p><h3>What is rayobrowse?</h3><p><a href="https://github.com/rayobyte-data/rayobrowse">rayobrowse</a> is a self-hosted, Chromium-based stealth browser engineered for web scraping, AI agents, and automation workflows. It ships as a Docker image, with an optional Python SDK (<em><a href="https://pypi.org/project/rayobrowse">rayobrowse</a></em> on PyPI) that simplifies connecting to it. The project is developed and maintained by Rayobyte.</p><p>The stealth browser runs inside Docker and is exposed via the <a href="https://substack.thewebscraping.club/p/webdriver-vs-cdp-vs-bidi">Chrome DevTools Protocol (CDP)</a>. That means tools like Playwright, Puppeteer, and Selenium (or any other tool that speaks CDP) can connect to it natively for automation.</p><p>What makes it noteworthy is its approach to <a href="https://substack.thewebscraping.club/p/what-is-device-fingerprint">device fingerprinting</a>. The user agent, screen size, WebGL, fonts, timezone, and other signals are tuned so that each session looks like a real browser. 
That way, it helps your automation avoid detection on protected websites.</p><h3>Core Principles Driving the Solution</h3><p>These are the core principles and goals behind the project:</p><ol><li><p>It should run on Linux server environments without GPUs or a GUI/desktop interface.</p></li><li><p>It should patch Chromium at the C++ level, rather than at higher layers such as CDP, where patches are easier for anti-bot systems to detect.</p></li><li><p>It should work with Playwright, a common framework in <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browser automation stacks</a>.</p></li><li><p>It should support both headful mode (via <a href="https://www.x.org/archive/X11R7.7/doc/man/man1/Xvfb.1.xhtml">Xvfb</a>) and headless mode.</p></li><li><p>It should emulate fingerprints from real-world devices across different regions.</p></li><li><p>It should be self-hostable, so you can run it locally without relying on cloud infrastructure.</p></li><li><p>It should be free to test and use for certain user segments.</p></li><li><p>It should reliably bypass major anti-bot systems, including those protecting complex ecommerce and SERP platforms.</p></li></ol><p><strong>Note</strong>: If you&#8217;re not familiar with Xvfb, it&#8217;s an in&#8209;memory display server for Unix-like systems that implements the X11 display protocol without requiring a physical display or input devices. In simpler terms, it allows GUI applications to run in headless environments. 
rayobrowse relies on it to launch headful browser sessions even on servers without a graphical interface. This matters because headful sessions are harder to detect than purely headless ones.</p><h2>Main Features for Stealth Browsing and More</h2><p>Here is a list of the most relevant rayobrowse features:</p><ul><li><p><strong>Fingerprint spoofing</strong>: Each browser session comes with a realistic device fingerprint drawn from a database of thousands of real-world profiles. Signals include user agent, OS metadata, screen resolution, fonts, WebGL, hardware concurrency, and timezone.</p></li><li><p><strong>Human&#8209;like mouse movement</strong>: Optional human&#8209;style cursor behavior (inspired by <a href="https://github.com/riflosnake/HumanCursor">HumanCursor</a>) makes automation appear more natural. When using standard Playwright actions like <em>page.click()</em> or <em>page.mouse.move()</em>, the library applies realistic curves and timing.</p></li><li><p><strong>Proxy integration</strong>: Traffic can be routed through any HTTP proxy, including authenticated and rotating proxies.</p></li><li><p><strong>Headless and headful support</strong>: rayobrowse supports both execution modes, even on GUI-less Linux servers.</p></li><li><p><strong>Live session viewer</strong>: A built&#8209;in noVNC interface (available at http://localhost:6080) lets you watch browser sessions in real time directly from your web browser. 
This is particularly useful for debugging scraping flows and visually verifying fingerprint behavior.</p></li><li><p><strong>Official integrations</strong>:<strong> </strong>The browser integrates with common automation frameworks, namely Playwright, Puppeteer, Selenium, and Scrapy (via <em><a href="https://substack.thewebscraping.club/p/basic-scrapy-configuration">scrapy-playwright</a></em>), as well as emerging <a href="https://substack.thewebscraping.club/p/my-first-week-with-openclaw">AI&#8209;driven tools such as OpenClaw</a>. As of this writing, additional integrations (e.g., Firecrawl and LangChain) are planned.</p></li><li><p><strong>Remote/Cloud mode</strong>: rayobrowse can run as a <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#remote--cloud-mode-beta">remote browser service</a>. Your server requests new browser instances through a REST API, and workers connect directly to the returned CDP WebSocket endpoint. This is still a beta feature.</p></li><li><p><strong>API&#8209;driven browser management</strong>:<strong> </strong>The daemon exposes REST endpoints for creating, listing, and deleting browser sessions, allowing you to orchestrate multiple browsers across a distributed scraping infrastructure.</p></li></ul><h2>Technical Details About the Project</h2><p>Now that you know what the project is and the features it provides, you&#8217;re ready to dive into the technical aspects.</p><h3>How rayobrowse Works</h3><p>At a high level, rayobrowse follows these steps:</p><ol><li><p><strong>Chromium patching</strong>:<strong> </strong>The project tracks upstream Chromium releases and applies a focused set of patches (relying on an <a href="https://github.com/brave/brave-core/blob/master/tools/cr/plaster.py">approach similar to Brave&#8217;s &#8220;plaster&#8221; model</a>). 
These patches normalize exposed browser APIs, reduce fingerprint entropy leaks, improve automation compatibility, and preserve native Chromium behavior whenever possible.</p></li><li><p><strong>Fingerprint assignment</strong>: When a browser session starts, rayobrowse assigns a realistic device fingerprint.</p></li><li><p><strong>Automation integration</strong>: Browser automation libraries connect to rayobrowse through the native CDP.</p></li></ol><h3>Architecture</h3><p>Architecturally, rayobrowse follows a clean separation between the browser runtime and the automation code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vdVO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vdVO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 424w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 848w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png" width="1456" height="697" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;rayobrowse&#8217;s architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="rayobrowse&#8217;s architecture" title="rayobrowse&#8217;s architecture" srcset="https://substackcdn.com/image/fetch/$s_!vdVO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 424w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 848w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">rayobrowse&#8217;s architecture</figcaption></figure></div><p>In particular, the system runs as a Docker container that bundles three core components:</p><ol><li><p>A daemon server that manages browser sessions.</p></li><li><p>A browser manager that downloads and retrieves the correct version of Chromium, a fingerprint engine that injects realistic device profiles, and a stealth browser layer containing a custom Chromium build with stealth patches.</p></li><li><p>A <a href="https://github.com/novnc/noVNC">noVNC viewer</a>, which lets you watch browser sessions in real time. 
This is useful for debugging and demos.</p></li></ol><p>As you can see, the automation scripts don&#8217;t run inside the container. Instead, they run on the host machine and connect to the browser remotely through the Chrome DevTools Protocol.</p><p>When a new session starts, rayobrowse assigns a realistic fingerprint from a large database of real devices, containing thousands of permutations collected from websites Rayobyte owns.</p><h3>Requirements</h3><p>The rayobrowse project is designed to run on Linux servers without GPUs, a common deployment environment.</p><p>These are the prerequisites:</p><ul><li><p>Docker, as the browser runs entirely inside a container.</p></li><li><p>~2GB of available RAM, as each browser instance uses ~300MB.</p></li></ul><p>The main benefit of this Docker-based approach is that you don&#8217;t need to install Chromium locally, configure fonts, or set up Xvfb manually. All of those dependencies live inside the container, which keeps the host machine clean and the setup portable and reproducible.</p><p>Shipping the browser as a container also lets you self-host the project without exposing its internal Chromium patching logic, which makes it much harder for anti-bot solution providers to reverse engineer how it works.</p><p>In terms of compatibility, rayobrowse works on Linux, Windows (native or WSL2), and macOS. The supported architectures are <em>x86_64 (amd64)</em> and <em>ARM64</em> (Apple Silicon and AWS Graviton). 
In practice, you don&#8217;t have to worry about the architecture, as Docker automatically pulls the correct image for the host machine.</p><p><strong>Optional</strong>: If you plan to use the stealth browser through the Python SDK, you also need Python 3.10+.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>How to Access rayobrowse</h2><p>There are two main ways you can access rayobrowse:</p><ol><li><p>The <em>/connect</em> endpoint.</p></li><li><p>The built-in Python SDK.</p></li></ol><h3>Method #1: Use the /connect Endpoint</h3><p>The first method is to connect directly to the <em>/connect</em> endpoint. This allows any CDP&#8209;compatible tool (including Selenium, Playwright, and Puppeteer) to open a browser session simply by pointing it at a WebSocket URL like <em>ws://localhost:9222/connect</em>.</p><p>For instance, take a look at the Playwright connection example below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Connect to rayobrowse via CDP
    browser = p.chromium.connect_over_cdp("ws://localhost:9222/connect")
    page = browser.new_context().new_page()

    # Automation logic...

    browser.close()</code></pre></div><p>Keep in mind that the WebSocket browser connection URL can be customized using query parameters, as follows:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ws://localhost:9222/connect?headless=false&amp;os=android&amp;proxy=http://user:pass@host:port</code></pre></div><p>This URL creates a rayobrowse Chromium browser session in headful mode, using Android-based fingerprints, while routing all requests through the proxy <em><a href="http://user:pass@host:port">http://user:pass@host:port</a></em>.</p><p>Explore all <em>/connect</em> query parameters <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#using-connect-simplest">in the docs</a>.</p><h3>Method #2: Use the Python SDK</h3><p>You can also interact with rayobrowse through the built-in Python SDK. This exposes a <em><a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#api-reference">create_browser()</a></em> function that returns a CDP WebSocket URL for a newly created browser instance. From there, connect using Playwright or another automation framework, as shown below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from rayobrowse import create_browser
from playwright.sync_api import sync_playwright

# Configure the rayobrowse connection to run in headful mode 
# while simulating a Windows-based fingerprint
ws_url = create_browser(headless=False, target_os="windows")

with sync_playwright() as p:
    # Connect to rayobrowse with the configured URL via CDP
    browser = p.chromium.connect_over_cdp(ws_url)
    page = browser.contexts[0].pages[0]
 
    # Automation logic...

    browser.close()</code></pre></div><p>This approach gives you more control over the browser lifecycle, but it also involves more configuration and setup.</p><p>For more examples (e.g., proxy integration and multi-browser management), <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#using-the-python-sdk">check out the docs</a>.</p><h2>Get Started with rayobrowse: Step-by-Step Guide</h2><p>In this section, I&#8217;ll show you how to build a simple Playwright script that connects to rayobrowse.</p><p>For the sake of simplicity, I&#8217;ll assume you already have:</p><ul><li><p>A Unix-like system (Linux, macOS, or Windows via WSL).</p></li><li><p>Docker installed and running on your machine.</p></li><li><p>Git installed locally.</p></li><li><p>A Python environment set up <a href="https://substack.thewebscraping.club/p/scraping-vs-playwright-web-scraping">with Playwright installed</a>.</p></li></ul><p>Follow the instructions below!</p><h3>Step #1: Clone the rayobrowse Repository</h3><p>The first step is to clone the rayobrowse repository to your machine:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">git clone https://github.com/rayobyte-data/rayobrowse</code></pre></div><p>Then, enter the project folder with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">cd rayobrowse</code></pre></div><p>The cloned folder already contains everything you need to get started, including:</p><ul><li><p><em>docker-compose.yml</em>: For running the browser container.</p></li><li><p><em>requirements.txt</em>: For installing the Python SDK.</p></li></ul><h3>Step #2: Set Up the Environment</h3><p>rayobrowse requires a 
.env file that contains the configuration needed to run the browser daemon. For a full list of available environment variables and what they enable, <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#environment-variables">explore the official documentation</a>.</p><p>Start by creating a <em>.env</em> file as a copy of the <em>.env.example</em> file coming with the repository:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">cp .env.example .env</code></pre></div><p>Then open the <em>.env</em> file and make sure it contains:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">STEALTH_BROWSER_ACCEPT_TERMS=true</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zjWr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zjWr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 848w, 
https://substackcdn.com/image/fetch/$s_!zjWr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png" width="1456" height="865" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Setting the STEALTH_BROWSER_ACCEPT_TERMS env&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Setting the STEALTH_BROWSER_ACCEPT_TERMS env" title="Setting the STEALTH_BROWSER_ACCEPT_TERMS env" srcset="https://substackcdn.com/image/fetch/$s_!zjWr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 848w, 
https://substackcdn.com/image/fetch/$s_!zjWr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Setting the STEALTH_BROWSER_ACCEPT_TERMS env</figcaption></figure></div><p>This confirms that you accept the 
project&#8217;s <a href="https://github.com/rayobyte-data/rayobrowse/blob/main/LICENSE">LICENSE</a>. Without that setting, the daemon will refuse to create browser sessions.</p><h3>Step #3: Start the Docker Container</h3><p>Launch the rayobrowse Docker container:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">docker compose up -d</code></pre></div><p>Docker will automatically pull the appropriate image for your system architecture (<em>x86_64</em> or <em>ARM64</em>). Then, it&#8217;ll start the container, as explained earlier.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FB1x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FB1x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 424w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 848w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png" width="1456" height="522" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output of the &#8220;docker compose up -d&#8221; command&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output of the &#8220;docker compose up -d&#8221; command" title="The output of the &#8220;docker compose up -d&#8221; command" srcset="https://substackcdn.com/image/fetch/$s_!FB1x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 424w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 848w, 
https://substackcdn.com/image/fetch/$s_!FB1x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1272w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The output of the &#8220;docker compose up -d&#8221; command</figcaption></figure></div><h3>Step #4: Connect via CDP 
and Apply the Automation Logic</h3><p>You can now connect to the running rayobrowse instance through the <em>/connect</em> endpoint using any CDP-compatible client. In this example, I&#8217;ll use Playwright with Python:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Connect to the rayobrowse browser through the CDP WebSocket endpoint
    browser = p.chromium.connect_over_cdp(
        "ws://localhost:9222/connect?headless=false&amp;os=windows"
    )

    # Create a new browser context and page
    page = browser.new_context().new_page()

    # Navigate to the target (sample) page
    page.goto("https://quotes.toscrape.com/")

    # Print the page title to verify the session is working
    print(page.title())  # Output: "Quotes to Scrape"

    # Add your scraping logic here...
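    # Illustrative sketch (not from the original article): assuming the usual
    # quotes.toscrape.com markup, where each quote is a ".quote" element with
    # ".text" and ".author" children, print every quote on the current page
    for quote in page.locator(".quote").all():
        text = quote.locator(".text").inner_text()
        author = quote.locator(".author").inner_text()
        print(f"{author}: {text}")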

    # Close the browser session
    browser.close()</code></pre></div><p>At this point, write your scraping or automation logic, which will run inside the stealth Chromium browser provided by rayobrowse.</p><p>For debugging, you can watch the browser session live through noVNC at <em><a href="http://localhost:6080/vnc.html">http://localhost:6080/vnc.html</a></em>. While the script is running, you should see a headful Chromium session opening and navigating to the target page specified in the script:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v1V8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v1V8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png" width="1456" height="865" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Monitoring the target browser session at http://localhost:6080/vnc.html&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Monitoring the target browser session at http://localhost:6080/vnc.html" title="Monitoring the target browser session at http://localhost:6080/vnc.html" srcset="https://substackcdn.com/image/fetch/$s_!v1V8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Monitoring the target browser session at http://localhost:6080/vnc.html</figcaption></figure></div><p>As you can see, the server launches a headful Chromium session (because of the <em>headless=false</em> query parameter) and loads the page requested by the script.</p><p><strong>Optional</strong>: If you want more control over the browser 
lifecycle, install the Python SDK with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">pip install -r requirements.txt</code></pre></div><p>Take a look at the <a href="https://github.com/rayobyte-data/rayobrowse/tree/main/examples">official examples in the repository</a> for more guidance.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Pricing and Limitations</h3><p>This is how the rayobrowse pricing model works:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bDvq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bDvq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 424w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 848w, 
https://substackcdn.com/image/fetch/$s_!bDvq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png" width="1456" height="1065" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1065,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202193,&quot;alt&quot;:&quot;rayobrowse&#8217;s pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/190103610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="rayobrowse&#8217;s pricing model" title="rayobrowse&#8217;s pricing model" srcset="https://substackcdn.com/image/fetch/$s_!bDvq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 424w, 
https://substackcdn.com/image/fetch/$s_!bDvq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 848w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">rayobrowse&#8217;s pricing model</figcaption></figure></div>
<p>What matters most for us developers is that you can run rayobrowse for free via self&#8209;hosting. In practice, the only real cost comes from proxies, which you need to scale scraping workloads and avoid IP bans (a standard requirement in most production scraping setups).</p><p>The main thing to keep in mind is that rayobrowse is still in beta. Rayobyte already uses it to scrape millions of pages per day, but results can vary depending on the target site and configuration.</p><p>Fingerprint coverage is currently strongest for Windows and Android, while macOS and Linux profiles are less mature. In addition, Canvas and WebGL fingerprinting are still evolving, which means some websites may detect the current implementation.</p><h2>Benchmarks and Final Comment</h2><p>To put rayobrowse to the test, I ran a simple script against a single page protected by each of the most popular anti&#8209;bot detection systems. 
These are the results I obtained:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lZAd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lZAd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 424w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 848w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1272w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png" width="1456" height="369" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:369,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80344,&quot;alt&quot;:&quot;Playwright vs rayobrowse: Benchmark comparison table&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/190103610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Playwright vs rayobrowse: Benchmark comparison table" title="Playwright vs rayobrowse: Benchmark comparison table" srcset="https://substackcdn.com/image/fetch/$s_!lZAd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 424w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 848w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1272w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Playwright vs rayobrowse: Benchmark comparison table</figcaption></figure></div>
<p><strong>Note:</strong> These tests were performed on my local machine using my ISP&#8217;s IP address.</p><p>As you can see, in this simple experiment rayobrowse achieved a 100% success rate, while Playwright failed consistently in headless mode and even struggled in some headful scenarios.</p><p>This suggests that the project is definitely worth keeping an eye on, especially thanks to its self&#8209;hosted nature.</p><p><em>To be honest, and this is just my personal opinion as an expert who works in this field, I don&#8217;t usually get very excited about 
projects like this&#8230; In my experience, many libraries of this type either get cracked down on or simply don&#8217;t receive the long&#8209;term support they deserve. In this case, however, things are a bit different. The project is closed&#8209;source and backed by a well&#8209;known company in the industry, which understandably raises expectations for its future!</em></p><p>Here, I covered what the project is about, what it offers, how it works, and how to use it. As always, remember to use rayobrowse only for legal and <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical web scraping</a>. Until next time!</p><div><hr></div><h2>FAQ</h2><h3>Why is rayobrowse based on Chromium and not Chrome?</h3><p>rayobrowse is based on Chromium simply because Chrome is closed-source. Plus, tests performed on difficult websites show no meaningful difference in detection rates between Chrome and Chromium. 
Using Chromium also avoids false positives and reflects the broader ecosystem of Chromium-based browsers like Brave, Edge, and Samsung Internet.</p><h3>Is rayobrowse open source?</h3><p>No. rayobrowse is closed-source to prevent anti-bot companies from reverse-engineering it. Similar projects, like <a href="https://github.com/daijro/camoufox">Camoufox</a>, were quickly studied and countered once their code became public. Rayobyte decided to keep the project closed-source to help maintain its effectiveness and reliability over the long term.</p><h3>Can everyone use rayobrowse?</h3><p>No, not all companies can use rayobrowse. Its license prohibits organizations listed in <a href="https://cdn.sb.rayobyte.com/list-of-prohibited-companies.txt">Rayobyte&#8217;s restricted list</a> from using the software. For everyone else, the project is free to download and run locally.</p><h3>Does rayobrowse support proxy integration?</h3><p>Yes, rayobrowse fully supports proxy integration. You can route traffic through any HTTP proxy using the <em>proxy</em> query parameter on the <em>/connect</em> endpoint or via the <em>proxy</em> option exposed by the <em>create_browser()</em> function from the Python SDK. 
The proxy support includes authentication and rotating proxies.</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #101: Building an Internal Knowledge Base for Your Scraping Team]]></title><description><![CDATA[Every scraping team that survives long enough develops the same disease.]]></description><link>https://substack.thewebscraping.club/p/building-knowledge-base-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/building-knowledge-base-scraping</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 02 Apr 2026 19:17:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3dba6c6a-f027-4c60-ad27-2c2378c217c6_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every scraping team that survives long enough develops the same disease. Someone figures out how to bypass Cloudflare&#8217;s latest challenge, writes it up in Notion, and moves on. Three months later, a teammate runs into the same problem, spends two days reinventing the solution, and documents it in a Google Doc. Meanwhile, the original Notion page has become outdated because Cloudflare changed its challenge flow, and nobody updated it.</p><p>We have seen this pattern in every scraping operation we have worked with. The knowledge exists. It is just scattered across wikis, Slack threads, internal repos, and people&#8217;s heads. The real problem is not documentation; it is retrieval. People write things down. 
They just cannot find them when it matters.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>In <a href="https://substack.thewebscraping.club/p/ingest-web-data-rag-llm">THE LAB #77</a>, we explored the concept of RAG (Retrieval-Augmented Generation) applied to scraped data and showed how to build a basic knowledge assistant using FAISS. That was a proof of concept. This time we are going deeper. We are showing the production system we actually built and use daily, and we are explaining the reasoning behind each design choice: why markdown, how embeddings work, which chunking strategy actually performs better, and what role auto-tagging plays in retrieval.</p><p>After reading this article, we hope you will understand the mechanics well enough to build the same system for your team.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try 
Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>What we are building and why</h2><p>At TWSC, we have published around 300 articles over the past four years. Tutorials, reverse-engineering deep dives, tool comparisons, anti-bot analysis. When we sit down to write a new article, we need to remember what we have already covered, find previous work to link to, and check whether a technique we are about to describe was already explained in a past issue. Doing this by memory or by searching Substack&#8217;s archive stops working after the first hundred articles. </p><p>We also follow what the broader community publishes. Projects like <a href="https://crawl4ai.dev">Crawl4AI</a>, which appeared on Hacker News, show that the need to ingest web content into structured, LLM-ready knowledge bases is shared across the industry. The tools for crawling and extracting content keep getting better, but the retrieval side, finding the right piece of information in a growing archive, still requires a purpose-built system.</p><p>So we built one. Here is what the complete pipeline looks like:<br></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;631f6d4b-586d-4ef5-ba12-640a3cb186b0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Sources                                  Processing              Storage &amp; Retrieval
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;                                &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;              &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
Substack articles                   &#9472;&#9472;&#9488;
                                      &#9500;&#9472;&#9472;&gt; HTML-to-Markdown &#9472;&#9472;&gt; Frontmatter + Tagging &#9472;&#9472;&gt; Markdown files
Hacker News and other sources       &#9472;&#9472;&#9496;

Markdown files &#9472;&#9472;&gt; Chunker &#9472;&#9472;&gt; Embedder (e5-large-v2) &#9472;&#9472;&gt; PostgreSQL + pgvector

Search query &#9472;&#9472;&gt; Query embedding &#9472;&#9472;&gt; Cosine similarity search &#9472;&#9472;&gt; Ranked results</code></pre></div><p>Three stages, each independent and replaceable. You scrape content from your sources. You process and embed it. You search it. </p><p>If your team writes in Confluence instead of Substack, you swap the scraper. If you prefer Qdrant over pgvector, you swap the vector store. The architecture remains the same.<br><br>And here&#8217;s the hardware used for most of the steps, from embedding to the storage and retrieval: my DGX Spark.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yhsw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg" width="566" height="511.689557855127" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:961,&quot;width&quot;:1063,&quot;resizeWidth&quot;:566,&quot;bytes&quot;:181504,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192358785?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40edcbd4-e6c6-4172-bf4c-ee62da325b0f_1280x1707.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Yes, I know, it is probably overkill.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary 
button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The tools</h2><p><strong>Playwright</strong> handles browser-based scraping for our own Substack articles. Substack serves content dynamically and requires authentication for premium posts, so a plain HTTP client is not an option.</p><p><strong>Algolia API</strong> (via Hacker News) provides structured search over HN stories. No scraping needed: HN exposes its full search index through public endpoints.</p><p><strong><a href="https://scrapegraphai.com/">ScrapegraphAI</a> and <a href="https://www.firecrawl.dev/">Firecrawl</a></strong> convert external article URLs into clean markdown. ScrapegraphAI is the primary extractor, Firecrawl is the fallback.</p><p><strong>sentence-transformers</strong> with the <code>intfloat/e5-large-v2</code> model generates 1024-dimensional embeddings. We will explain why we chose this model later in the article.</p><p><strong>PostgreSQL with pgvector</strong> stores embeddings and handles similarity search. 
We chose it over dedicated vector databases because we already need PostgreSQL for metadata, and pgvector with HNSW indexing handles our scale without adding infrastructure.</p><p><strong>Docker Compose</strong> ties everything together as three containers: PostgreSQL, the API server, and the indexer.</p><p><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">101.KNOWLEDGE_BASE</a>.</strong></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Why markdown as the universal format</h2><p>The first design choice we had to make was what format our knowledge base would store. We had content from Substack (HTML), Hacker News links (various formats), and potentially Confluence, Google Docs, or Slack in the future. We needed a common representation.</p><p>We chose markdown for three reasons.</p><p><strong>First</strong>, markdown preserves document structure without carrying rendering noise. An HTML page contains navigation bars, ad slots, JavaScript, CSS classes, and layout dividers. None of that is content. When you convert to markdown, you keep headings, paragraphs, code blocks, links, and lists: everything the embedding model needs, nothing it would choke on.</p><p><strong>Second</strong>, markdown is readable by humans and machines alike. When something goes wrong in the pipeline, you can open a markdown file and immediately see what the system is working with. 
Try doing that with a serialized HTML DOM or a JSON blob from an API response.</p><p><strong>Third</strong>, YAML frontmatter is a natural fit for markdown and gives us a structured metadata header without mixing it into the content. Each file gets <code>id</code>, <code>type</code>, <code>title</code>, <code>publish_date</code>, <code>topics</code>, and <code>visibility</code> fields. This metadata drives filtering at search time and never enters the embedding model. The separation is important: embeddings capture meaning, frontmatter captures facts.</p><p>There are two paths to get content into markdown. You can build your own converter using open-source libraries, or you can use commercial services that handle extraction and conversion for you. In this article we show both approaches deliberately. For our own Substack articles, we built a converter from scratch with BeautifulSoup and markdownify. It costs nothing, we control every detail, and it works because we know the source HTML structure intimately. For external content discovered on Hacker News, we use commercial services like ScrapegraphAI and Firecrawl instead, because every URL leads to a different site with a different HTML structure. Building custom converters for thousands of unknown domains would be impractical. The trade-off is clear: when you control the source, build your own; when you are scraping the open web, commercial extraction services save an enormous amount of development time.</p><p>Our Substack HTML-to-markdown converter is deliberately simple. It strips scripts, styles, buttons, forms, navigation, and footers, then converts the remaining HTML:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;aa028f4f-e1d2-412f-88bc-29153974e70e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import re

from bs4 import BeautifulSoup
from markdownify import markdownify


def html_to_markdown(html: str) -&gt; str:
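    # Parse with lxml (the lxml package must be installed) and drop
    # non-content elements together with everything inside them.
    soup = BeautifulSoup(html, "lxml")
    for tag in soup.find_all(["script", "style", "button", "form", "nav", "footer"]):
        tag.decompose()

    md = markdownify(
        str(soup),
        heading_style="ATX",  # "# Heading" style instead of underlined headings
        bullets="-",  # use "-" at every list nesting level
        # Safety net only: these tags were already removed above.
        strip=["script", "style", "button", "form", "nav"],
    )
    # Collapse runs of four or more newlines down to three.
    md = re.sub(r"\n{4,}", "\n\n\n", md)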
    return md.strip()</code></pre></div><p>The final output for each document looks like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;95188632-dd66-4b1e-a5fe-167c1807dcdc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">---
id: a1b2c3d4e5f6...
type: twsc_article
title: "THE LAB #94: Using Cookies and Session Persistence"
slug: the-lab-94-using-cookies-and-session
canonical_url: https://substack.thewebscraping.club/p/the-lab-94-using-cookies-and-session
publish_date: 2025-11-15
visibility: premium
topics:
  - browser-automation
  - cloudflare
  - scraping-infra
---

[article body in markdown]</code></pre></div><h2>Scraping your own content</h2><p>The first source we built was a scraper for our own Substack articles. The pattern applies to any CMS: discover URLs, authenticate if needed, extract content, convert to markdown with frontmatter.</p><h3>URL discovery and authentication</h3><p>Most publishing platforms expose a sitemap. We fetch it, filter for article URLs (Substack uses <code>/p/</code> in the path), and track the <code>lastmod</code> date to detect changes:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;cf6dd5c0-8f99-4466-bc88-5bfe8f8b109a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen


def fetch_sitemap(sitemap_url: str) -&gt; list[dict]:
    req = Request(sitemap_url)
    req.add_header("User-Agent", "Mozilla/5.0 ...")
    with urlopen(req) as response:
        content = response.read()

    root = ET.fromstring(content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    articles = []
    for url_elem in root.findall("sm:url", ns):
        loc = url_elem.find("sm:loc", ns)
        lastmod = url_elem.find("sm:lastmod", ns)
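        # Skip entries without a usable URL; lastmod is optional in sitemaps.
        if loc is not None and loc.text and "/p/" in loc.text:
            lastmod_text = lastmod.text if lastmod is not None and lastmod.text else ""
            articles.append({"url": loc.text.strip(), "lastmod": lastmod_text.strip()})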
    return articles</code></pre></div><p>Substack gates premium content behind authentication. We handle this with a persistent Playwright browser context that stores cookies across runs. On the first run you log in manually; after that, the saved session keeps you authenticated. For cron jobs, we verify the session by loading a known premium article and checking if the full content appears.</p><p>We try multiple CSS selectors for extraction because Substack has changed its DOM structure over time. The extracted HTML goes through the markdown converter we showed earlier.</p><h2>Ingesting external sources: Hacker News</h2>
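<p>A minimal sketch of what querying the public HN Search API (the Algolia-backed endpoint mentioned in the tools section) could look like, assuming only the Python standard library. The helper names (<code>build_search_url</code>, <code>extract_candidates</code>, <code>search_hn</code>) and the <code>min_points</code> threshold are illustrative assumptions, not necessarily the code in our repository. The URLs it returns are what would then be handed to an extraction service for conversion to markdown.</p>

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public HN Search API, backed by Algolia. No API key is needed.
HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search"


def build_search_url(query: str, hits_per_page: int = 20) -> str:
    """Build a search URL restricted to stories (no comments)."""
    params = {"query": query, "tags": "story", "hitsPerPage": hits_per_page}
    return f"{HN_SEARCH_URL}?{urlencode(params)}"


def extract_candidates(payload: dict, min_points: int = 50) -> list[dict]:
    """Keep stories that link to an external URL and reach min_points.

    min_points is a hypothetical quality filter, not a value from the article.
    """
    candidates = []
    for hit in payload.get("hits", []):
        url = hit.get("url")  # missing or None for text-only posts such as Ask HN
        points = hit.get("points") or 0
        if url and points >= min_points:
            candidates.append(
                {
                    "hn_id": hit.get("objectID"),
                    "title": hit.get("title"),
                    "url": url,
                    "points": points,
                    "created_at": hit.get("created_at"),
                }
            )
    return candidates


def search_hn(query: str) -> list[dict]:
    """Run one search against the live API and return filtered candidates."""
    with urlopen(build_search_url(query), timeout=30) as response:
        payload = json.load(response)
    return extract_candidates(payload)
```

<p>Keeping the URL building and the filtering separate from the network call means the selection logic can be checked against a saved JSON response, without hitting the API on every test run.</p>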
      <p>
          <a href="https://substack.thewebscraping.club/p/building-knowledge-base-scraping">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Data Scraping for Market Research: A Developers Guide]]></title><description><![CDATA[Build scrapers that deliver real market intelligence, not just raw data dumps]]></description><link>https://substack.thewebscraping.club/p/data-scraping-market-research</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/data-scraping-market-research</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 29 Mar 2026 20:38:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e95388da-deb3-4a33-9e90-438b2658fddd_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Market research has always been about answering a simple question: &#8220;<em>What&#8217;s happening in the market, and how do I use that to make better decisions?&#8221;</em></p><p>The traditional way to answer that question involved surveys, focus groups, and expensive reports from firms that charge you a fortune for data that&#8217;s already a few months old by the time you read it. 
Today, the data you need is sitting on public web pages: You just need to collect it.</p><p>In this article, we&#8217;ll discuss how to scrape data for market research, what sources actually matter, how to build a pipeline that doesn&#8217;t fall apart after a week, and where the legal lines are.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What &#8220;Market Research&#8221; Actually Means for Web Scraping Professionals</h2><p>Market research needs to answer three questions:</p><ul><li><p>&#8220;<em>What are our competitors doing?</em>&#8221;</p></li><li><p>&#8220;<em>What are our customers saying?</em>&#8221;</p></li><li><p>&#8220;<em>How is the market moving?</em>&#8221;</p></li></ul><p>That&#8217;s it. Everything else is a variation of those three. And the web gives you access to all three, if you know where to look.</p><p>In practice, scraped market intelligence sits on three pillars:</p><ul><li><p><strong>Competitive data</strong>: Pricing, product catalogs, feature changes, hiring signals. This is the &#8220;what are they doing?&#8221; pillar.</p></li><li><p><strong>Customer sentiment</strong>: Reviews, forum discussions, social media posts. This is the &#8220;what are people saying?&#8221; pillar.</p></li><li><p><strong>Market signals</strong>: Job postings, regulatory filings, trend volumes, new product launches. This is the &#8220;where is the market going?&#8221; pillar.</p></li></ul><p>Now, why scraping instead of traditional research? Because scraping is real-time, it&#8217;s continuous, and it doesn&#8217;t depend on people filling out forms. A survey tells you what 500 people said last month. A scraper tells you what thousands of customers are saying right now, every single day, without anyone having to opt in.</p><p>That&#8217;s the competitive advantage. 
And it&#8217;s a big one.</p><div><hr></div><blockquote><p><em>For your scraping activity, you need IPs with good reputation. For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">that&#8217;s sharing with TWSC readers this offer</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" 
height="87.84" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a 
href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><h2>Where to Scrape: Sources That Actually Matter</h2><p>Not all sources are worth your time. You could scrape the entire Internet and still end up with nothing useful if you&#8217;re not targeting the right places. Below is a list of high-value targets for market research and what you can extract from each:</p><ul><li><p><strong>Competitor websites</strong>: Pricing pages, product pages, feature matrices, changelogs, and blog posts. This is your primary source for understanding what competitors are offering and how they position themselves. Pricing pages, in particular, are gold. They change more often than you&#8217;d think, and tracking those changes over time tells you a lot about a competitor&#8217;s strategy.</p></li><li><p><strong>Review platforms (G2, Trustpilot, Amazon, Yelp)</strong>: Customer pain points, feature requests, sentiment shifts. Reviews are unfiltered customer feedback. Nobody writes a G2 review because they were asked nicely in a survey. They write it because they feel strongly about something, and that&#8217;s exactly the kind of signal you want.</p></li><li><p><strong>Job boards (LinkedIn, Indeed)</strong>: Hiring patterns reveal where a company is investing. If a competitor suddenly posts 20 machine learning engineer roles, that tells you something no press release will. Job postings are one of the most underrated market research signals out there.</p></li><li><p><strong>Social media and forums (Reddit, X, niche communities)</strong>: Unfiltered opinions, emerging trends, early complaints about products. 
Reddit threads and niche forums are where people say what they actually think, not what they&#8217;d say in a focus group.</p></li><li><p><strong>Government and public data portals</strong>: SEC filings, patent databases, import/export records. These are slower-moving signals, but they&#8217;re authoritative. A patent filing can tell you what a competitor is building 18 months before it ships.</p></li></ul><p>Here&#8217;s the key question to ask yourself before adding a source to your scraper: <em>&#8220;Does this data answer a specific research question, or am I just hoarding?&#8221;</em> If you can&#8217;t tie a source to a concrete insight, skip it. You&#8217;ll save yourself storage costs, maintenance headaches, and potential legal issues.</p><h2>Building the Pipeline: From Raw HTML to Market Intelligence</h2><p>A market research scraper is not a one-off script you run from your terminal. It&#8217;s a pipeline. And pipelines need structure. If you treat it like a quick script, you&#8217;ll end up with a mess of cron jobs, inconsistent data formats, and no idea whether your data is fresh or stale. So, build it properly from the start.</p><p>A scraping pipeline for market intelligence should have four stages:</p><ol><li><p><strong>Collection</strong>: Fetch the pages, extract the fields you need, throw the rest away. Don&#8217;t store raw HTML &#8220;just in case&#8221; (you&#8217;ll learn why in the legal section of this article).</p></li><li><p><strong>Storage</strong>: Store facts and metadata (source URL, timestamp, extracted fields). Use a structure that makes deduplication and versioning easy. 
In practice, this means designing your schema around a composite key (for example: <em>source</em> + <em>entity ID</em> + <em>scraped timestamp</em>) so you can track how a data point changes over time without overwriting previous records.</p></li><li><p><strong>Transformation</strong>: Normalize the data across sources, deduplicate records, and enrich with additional context (geocoding, industry classification, entity linking).</p></li><li><p><strong>Analysis</strong>: Turn rows into insights. This is where the actual market research happens. And to be clear: &#8220;Analysis&#8221; doesn&#8217;t mean opening a CSV and scrolling through it. The goal is to turn your pipeline&#8217;s output into dashboards, scheduled reports, or Slack alerts that reach the people who make decisions. If the data sits in a database and nobody looks at it, the whole pipeline is wasted effort.</p></li></ol><h3>Scheduling Matters More Than You Think</h3><p>Different data types have different freshness requirements. Getting this wrong means either wasting resources or working with stale data. Here are reasonable starting points for how often to run each kind of scrape:</p><ul><li><p><strong>Price tracking</strong>: Daily or hourly, depending on the market. E-commerce prices can change multiple times a day. SaaS pricing pages change less often, but when they do, it&#8217;s significant.</p></li><li><p><strong>Review monitoring</strong>: Monitoring reviews daily is usually enough. Reviews don&#8217;t appear in real time, and sentiment trends are measured in weeks, not minutes.</p></li><li><p><strong>Job postings</strong>: A weekly schedule works for trend analysis of the job market. Remember that you&#8217;re looking for patterns, not individual listings.</p></li><li><p><strong>Social media</strong>: This depends on your use case. If you&#8217;re tracking a product launch or a PR crisis, you might need near-real-time collection. 
For general trend analysis, daily or even weekly batches work fine.</p></li></ul><h3>Tools That Work Well for Market Research Scraping</h3><p>You don&#8217;t need to reinvent the wheel. The software industry already provides you with the best tools for your market research scraping pipeline. Here&#8217;s a solid stack for a market research pipeline:</p><ul><li><p><strong><a href="https://www.scrapy.org/">Scrapy</a></strong> for structured crawling. <a href="https://substack.thewebscraping.club/p/scrapy-ten-years-of-scraping-framework">Scrapy&#8217;s architecture is designed for exactly this kind of work</a>: You define spiders per source, plug in middleware for proxy rotation and retry logic, and use item pipelines to clean and store data as it flows through. For market research specifically, Scrapy&#8217;s built-in feed exports let you dump results straight to JSON, CSV, or even S3 without writing custom I/O code. And if you need to coordinate multiple spiders (say, one per competitor), Scrapy&#8217;s project structure keeps things organized as your source list grows.</p></li><li><p><strong><a href="https://playwright.dev/">Playwright</a></strong> or <strong><a href="https://pptr.dev/">Puppeteer</a></strong> for JS-heavy pages. The key difference from Scrapy is that <a href="https://substack.thewebscraping.club/p/handling-infinite-scrolling-python-js">you&#8217;re running a real browser, which means you can handle dynamic content, infinite scroll</a>, and client-side rendering. The trade-off is resource cost: Each browser instance eats memory and CPU, so you don&#8217;t want to use this for targets that serve static HTML.</p></li><li><p><strong>A</strong> <strong>task queue</strong> for scheduling and orchestration. This is what turns a collection of scrapers into an actual pipeline. 
Instead of running scripts manually or relying on cron jobs, a task queue lets you schedule scrapes per source at different intervals, retry failed jobs automatically, and <a href="https://substack.thewebscraping.club/p/python-async-for-faster-scraping">control concurrency so you&#8217;re not overwhelming a target site with parallel requests</a>. It also gives you visibility: you can see what&#8217;s queued, what&#8217;s running, what failed, and why.</p></li><li><p><strong><a href="https://www.postgresql.org/">PostgreSQL</a></strong> for structured market data that needs querying and versioning. Relational databases shine here because market research data is inherently relational: competitors have products, products have prices, prices change over time.</p></li></ul><p>The point is this: Pick tools that let you build a maintainable system, not just a working script. Every tool in this stack solves a specific problem, and none of them requires you to build infrastructure from scratch. The best market research pipeline is the one that&#8217;s boring to operate, because boring means reliable.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Scaling Without Getting Blocked</h2><p>If you&#8217;re scraping one competitor once a week, you don&#8217;t need this section. If you&#8217;re tracking 50 competitors daily across thousands of pages, you do.</p><p>Here&#8217;s the reality: The moment you start scraping at scale, you become visible. And sites don&#8217;t like bots, even polite ones. So you need to be smart about how you scale. 
Consider the following rules of thumb to avoid getting blocked:</p><ul><li><p><strong>Proxy rotation</strong>: <a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies">Residential proxies for sensitive targets (sites with aggressive anti-bot systems), datacenter proxies for everything else</a>. Rotate per request or per session, depending on the site&#8217;s detection mechanisms. The key is to not send thousands of requests from the same IP in an hour.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/rate-limit-scraping-exponential-backoff">Rate limiting and backoff</a></strong>: Be a good citizen. If you hammer a site with concurrent requests, you&#8217;ll get blocked, and you&#8217;ll deserve it. Implement exponential backoff on failures, and set reasonable delays between requests. A 2-3 second delay between requests is a good starting point for most sites.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">Fingerprint management</a></strong>: Headers, TLS fingerprint, and browser-level signals matter on sites with serious anti-bot systems. Make sure your request headers look consistent and realistic.</p></li><li><p><strong>CAPTCHAs</strong>: <a href="https://substack.thewebscraping.club/p/are-captchas-still-a-thing">If you&#8217;re hitting CAPTCHAs regularly, your approach is too aggressive</a>. Fix the root cause (rate, fingerprint, proxy quality) before reaching for solver services. 
CAPTCHA solvers are a band-aid, not a solution.</p></li></ul><p>The general principle is simple: Scrape at a pace that doesn&#8217;t degrade the target site&#8217;s performance.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Turning Scraped Data into Actual Market Insights</h2><p>Let&#8217;s be clear about something: Raw scraped data is not market research. It&#8217;s just data. A CSV with 50,000 rows of competitor prices is not an insight. A chart showing that competitor X has dropped its enterprise tier price by 15% over three months is an insight.</p><p>Here&#8217;s where the value gets created:</p><ul><li><p><strong>Price tracking and competitive benchmarking</strong>: Track changes over time, visualize trends, and set alerts for significant moves. The goal is not to know what a competitor charges today. It&#8217;s to understand their pricing trajectory. Are they moving upmarket? Are they running more frequent discounts? Are they simplifying their tier structure? This is where predictive <a href="https://substack.thewebscraping.club/p/predictive-analytics-web-scraped-data">analytics meets scraped data, with the goal of anticipating future moves</a> from your competitors.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/sentiment-analysis-amazon-reviews">Sentiment analysis on reviews</a></strong><a href="https://substack.thewebscraping.club/p/sentiment-analysis-amazon-reviews">: Use NLP to extract themes from customer reviews</a>. 
This is powerful for product teams who want to understand what customers love and hate about competitors. But remember: You&#8217;re analyzing the data internally, not republishing the reviews.</p></li><li><p><strong>Hiring signal analysis</strong>: Aggregate job postings by role type, department, and location. A competitor suddenly posting 15 ML engineer roles tells you they&#8217;re investing in AI. A wave of sales hiring in EMEA tells you they&#8217;re expanding geographically. This is a signal that&#8217;s almost impossible to get from any other source.</p></li><li><p><strong>Trend detection</strong>: Time-series analysis on product launches, feature changes, pricing moves, or social media mentions. <a href="https://substack.thewebscraping.club/p/scraping-data-anomaly-detection">The goal is to spot patterns or anomalies</a> before they become obvious. If three competitors all add the same feature within two months, that&#8217;s a market trend, not a coincidence.</p></li></ul><p>Overall, the <a href="https://substack.thewebscraping.club/p/building-a-scraper-dashboard-streamlit">output of your scraping pipeline should be dashboards</a>, reports, or automated alerts, not a database dump that someone has to manually dig through. If the insights don&#8217;t reach decision-makers in a usable format, the whole pipeline is wasted effort.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Legal and Ethical Considerations: Don&#8217;t Skip This Section</h2><p>I know, I know. You&#8217;re a developer, not a lawyer. But here&#8217;s the thing: Most legal problems in scraping are self-inflicted. They happen because someone scraped &#8220;everything on the page,&#8221; stored it &#8220;for later,&#8221; and only then asked: <em>&#8220;Wait, can we actually use this?&#8221;</em></p><p>These points are covered in detail in &#8220;<a href="https://substack.thewebscraping.club/p/avoid-copyright-violations-scraping">How to Avoid Copyright Violations While Scraping</a>&#8221;, so let&#8217;s briefly go through the key legal and <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical principles of web scraping</a>:</p><ul><li><p><strong>Scrape facts, not expression</strong>: Copyright protects expression, not facts. Prices, SKUs, dates, availability, and job titles are facts. No one owns the fact that a SaaS product costs $49/month. On the other hand, product descriptions, review text, and blog posts are creative expression.</p></li><li><p><strong>Don&#8217;t store raw pages by default</strong>: Storing the HTML of entire pages means creating copies of copyrighted content. Instead, parse in memory, extract only the fields you need, and discard the rest. 
If you need to debug, store a small sample with short retention.</p></li><li><p><strong>Respect </strong><em><strong>robots.txt</strong></em>: <a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">The </a><em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">robots.txt</a></em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications"> file is not the law, but ignoring it is evidence of bad faith if things go sideways</a>. In disputes, it can be used to show that you knew you were unwelcome and kept going anyway.</p></li><li><p><strong>Terms of Service matter</strong>: If the ToS explicitly forbids scraping and you scrape anyway, you may have a breach-of-contract problem. This is often easier for the site owner to prove than copyright infringement, because the argument is straightforward: you agreed to a contract, then you violated it.</p></li><li><p><strong>Don&#8217;t scrape behind a login</strong>: Once you log in, you&#8217;ve affirmatively agreed to a contract. Breaking that contract to scrape is a fast track to legal trouble. If your plan requires authenticated access, treat it as a licensing problem, not an engineering challenge.</p></li><li><p><strong>GDPR/CCPA</strong>: If you&#8217;re scraping anything that could be personal data (usernames, reviewer names, profile information), you need to know which privacy laws apply. This is especially relevant for review scraping and social media monitoring.</p></li></ul><p>Here&#8217;s the mental model that works: A price comparison tool that shows prices and links back to the source? Generally safe. A product catalog that copies descriptions, images, and reviews so users never need to visit the original site? 
That&#8217;s where you get into trouble, even if you never display the results publicly and only use them for internal analysis.</p><h2>Keeping Your Scrapers Alive: Monitoring and Maintenance</h2><p>Scrapers in production break for several reasons. Sites change layouts, add anti-bot measures, restructure their URLs, or just go down for maintenance. If you don&#8217;t monitor your scrapers, your data goes stale silently, and you won&#8217;t know until someone asks why the pricing dashboard hasn&#8217;t updated in three weeks.</p><p>Here&#8217;s a breakdown of what you need:</p><ul><li><p><strong>Dead selector detection</strong>: Alert when a CSS selector or XPath returns empty across multiple consecutive runs. A selector that worked yesterday and keeps returning nothing today usually means the site changed its HTML structure. The key phrase here is &#8220;multiple consecutive runs&#8221;. A single empty result could be a transient issue, so don&#8217;t trigger alerts on the first failure. Instead, set a threshold, like three consecutive empty results, before flagging it. When the alert does fire, you need to inspect the current page structure and update your selectors. Alternatively, try to <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">go beyond the DOM using AI and LLMs</a> to make your extraction more resilient to layout changes in the first place.</p></li><li><p><strong>HTTP status monitoring</strong>: A spike in 403s means you&#8217;re getting blocked. A spike in 429s means you&#8217;re hitting rate limits. A spike in 404s means URLs have changed. Each of these requires a different response. For 403s, check your proxy pool and rotation logic: You might need fresher IPs or a lower request rate. For 429s, back off and increase your delays between requests; the site is telling you exactly what the problem is. 
For 404s, the target has likely restructured its URL patterns, which means you need to update your URL generation logic, not just retry the same broken links. Log these status codes per source and per run so you can spot trends early. A gradual increase in 403s over a week is a warning sign that your current setup is losing effectiveness, even if individual runs still return some data.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/ensuring-data-quality-in-web-scraping">Data quality checks</a></strong>: Row counts, null rates, value distributions. If your price tracker suddenly shows all prices as $0 or your review scraper returns empty text fields, you want to know immediately. Build quality checks into your pipeline as a post-scrape validation step, not as something you run manually. Compare each run&#8217;s output against baseline expectations: If you normally get 200 rows from a source and today you got 12, something is wrong, even if those 12 rows look fine individually.</p></li><li><p><strong>Automated tests against fixture HTML</strong>: Save sample HTML pages from your targets and write tests against them. Because a saved fixture never changes, a failing test points to a regression in your extraction code rather than a change on the site, so you catch it before it reaches production. Treat your scrapers like production code, because they are. In practice, this means saving a snapshot of a relevant section of the target page as a local HTML file. Then, write unit tests that run your extraction logic against that fixture and assert expected outputs. Store these fixtures in version control alongside your scraper code. When a site changes and your production scraper breaks, update the fixture with the new HTML and fix your selectors until the tests pass again. This gives you a repeatable workflow for handling site changes instead of scrambling every time something breaks.</p></li></ul><p>The goal is simple: You should know when something breaks before your stakeholders do. 
A Slack alert that says &#8220;Competitor X pricing scraper returned 0 results&#8221; is infinitely better than a product manager asking why the dashboard is empty.</p><h2>Conclusion</h2><p>In this article, you learned that market research scraping is about building a reliable pipeline that collects the right facts, transforms them into insights, and doesn&#8217;t get you in legal trouble.</p><p>The competitive advantage of scraping for market research is in what you do with the data. Anyone can code a scraper. But building a system that delivers reliable, actionable market intelligence week after week? That&#8217;s where the real value is!</p><p>So, let us know: Are you using web scraping for market research? What sources have you found most valuable? How did you structure your scraping pipeline? Let&#8217;s discuss in the comments!</p>]]></content:encoded></item><item><title><![CDATA[Two stealth browsers just dropped. Also, your proxy provider might be overcharging you.]]></title><description><![CDATA[Use the new TWSC tools to discover proxy prices and news in the web scraping industry]]></description><link>https://substack.thewebscraping.club/p/two-stealth-browsers-proxy-prices</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/two-stealth-browsers-proxy-prices</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Wed, 25 Mar 2026 15:39:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8b292fc2-af4f-4f3e-8e94-0204b7fd08bb_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few things landed on my desk this week that I did not want to wait until the next issue to share. 
So here is a quick bonus edition: an update to a tool I have been building, and two projects from the scraping world that caught my attention.</p><div><hr></div><h2><strong>The Proxy Price Benchmark is now updated weekly</strong></h2><p><br>If you haven't checked it out yet, the <a href="https://proxyprice.thewebscraping.club/">Proxy Price Benchmark</a> is the tool I built to answer a simple but important question: how much should you actually be paying for your proxies?<br><br>Every week, I (or rather, my fleet of agents) update the pricing data directly from the vendors, so you always have a reliable reference to compare offers or negotiate with your current provider.<br><br>This week, we added two new vendors: <strong>Dataimpulse</strong> and <strong>AnyIP</strong>, bringing the total number of monitored providers to 27.<br><br><a href="https://proxyprice.thewebscraping.club/">Check the latest prices</a><br><br>I am considering a paid API plan for readers who use proxies at scale and would find API access to this data useful. If you are interested, join the waitlist and tell me about your use case. I want to understand demand before I build it.<br></p><div><hr></div><h2><br><strong>This week on Scraping News: stealth browsers are getting serious</strong></h2><p><br>The <a href="https://news.thewebscraping.club/">Scraping News feed</a> has been tracking an interesting trend this week: two new stealth browser projects worth watching.<br><br><strong><a href="https://owlbrowser.net/">Owl Browser</a></strong> is a purpose-built browser engine for automation at scale. Not a Playwright wrapper but a full engine built on Chromium (CEF) with a custom C99 HTTP server, 256 parallel contexts, and sub-12ms cold start. Self-hosted, Docker-ready, with Python and TypeScript SDKs. 
If you are running high-volume scraping and hitting the limits of standard headless setups, this is worth a closer look.<br><br><strong><a href="https://github.com/rayobyte-data/rayobrowse">Rayobrowse</a></strong> is Rayobyte's open-source stealth Chromium browser, released from their production scraping infrastructure. It handles fingerprint randomization at the browser level (user agent, WebGL, fonts, screen resolution, timezone) and connects via CDP, so it works with Playwright, Puppeteer, Selenium, or any custom script. Runs on headless Linux with no GPU required.<br><br>Both address the same problem from different angles: standard headless Chromium is easily detected, and the solution is now moving from patch-level evasion to full browser-level stealth. We will be covering both in depth on TWSC soon.<br><br><a href="https://news.thewebscraping.club/">See all the latest news on Scraping News</a><br></p><div><hr></div><p>Keep in mind that both the Proxy Price Benchmark tool and Scraping News are early versions; feel free to suggest improvements and report bugs.</p>]]></content:encoded></item><item><title><![CDATA[WebSocket Bot Detection Techniques and How to Bypass Them]]></title><description><![CDATA[You may already know generic anti-bot techniques, but what about WebSocket-specific ones? Let&#8217;s find out!]]></description><link>https://substack.thewebscraping.club/p/websocket-bot-detection-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/websocket-bot-detection-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 22 Mar 2026 09:30:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/48bfc637-7402-4dd9-b7ab-d007f6fa773d_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Websites and web applications are becoming more complex than ever, with live data powering features that deliver fast insights. If you&#8217;re wondering which technology makes those live updates possible, the answer is WebSockets.</p><p>You might think that, in a web scraping scenario, the solution is simply to connect directly to the WebSocket channels. Sure, that&#8217;s possible, but there are a few obstacles along the way. 
The main ones are WebSocket anti-bot techniques and bot detection measures.</p><p>In this post, I&#8217;ll walk through the most common ones, explain how they work, and share proven tips and tricks to help you avoid them.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KA5B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KA5B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KA5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png" width="560" height="315.38461538461536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:560,&quot;bytes&quot;:1650775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656767?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KA5B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><h2>A Quick Intro to WebSockets</h2><p>Before diving into WebSocket bot detection, let me first provide some context about WebSocket as a protocol and its role in web scraping.</p><h3>What Is the WebSocket Protocol?</h3><p><a href="https://websocket.org/guides/websocket-protocol/">WebSocket</a>, often abbreviated as <em>WS</em>, is a web protocol standardized in <a href="https://datatracker.ietf.org/doc/html/rfc6455">RFC 6455</a> that enables full-duplex, bidirectional communication between clients and servers over a single, persistent TCP connection.</p><p>Unlike HTTP, which is stateless and request-driven, WebSockets establish a long-lived connection through an initial HTTP handshake. 
After the handshake, both client and server can send messages independently, with data transmitted in frames that can be text, binary, or control frames (ping, pong, close).</p><p>WebSockets support fragmentation, masking, and optional compression via extensions like permessage-deflate, while newer HTTP/2 and <a href="https://substack.thewebscraping.club/p/faster-web-scraping-with-http3">HTTP/3 mechanisms</a> allow multiplexing, reduced latency, and better proxy traversal.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WRwD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WRwD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 424w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 848w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1272w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WRwD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png" width="1456" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;HTTP vs WebSocket&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="HTTP vs WebSocket" title="HTTP vs WebSocket" srcset="https://substackcdn.com/image/fetch/$s_!WRwD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 424w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 848w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1272w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">HTTP vs WebSocket</figcaption></figure></div><div><hr></div><blockquote><p><em><br>For your ethical scraping activity, you need IPs with good reputation. 
For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">that&#8217;s sharing with TWSC readers this offer</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" height="87.84" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC 
for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><blockquote><div><hr></div></blockquote><h3>Why and When Web Pages Use WebSockets</h3><p>The WebSocket protocol opens the door to live, bidirectional web communication. Unlike HTTP&#8217;s request-response model, it lets servers and clients exchange data continuously over a single, persistent connection.</p><p>In general, WebSockets are essential for any application where low latency and frequent updates are required. Common use cases include:</p><ul><li><p><strong>Live streaming</strong>: YouTube Live, TikTok LIVE, Kick, Twitch, and similar platforms.</p></li><li><p><strong>Chat applications</strong>: Slack, Discord, and other messaging services.</p></li><li><p><strong>Collaboration tools</strong>: Google Docs, Figma, and online whiteboards.</p></li><li><p><strong>Gaming and multiplayer experiences</strong>: Browser-based MMO games, turn-based games, and PvP games.</p></li><li><p><strong>Financial data feeds</strong>: Stock tickers, cryptocurrency price updates, and trading dashboards.</p></li><li><p><strong>IoT and telemetry</strong>: Sensor updates, home automation, and device monitoring.</p></li><li><p><strong>Notifications and alerts</strong>: Push updates for social networks, dashboards, or monitoring systems.</p></li></ul><p>In short, WebSocket comes into play wherever instant, continuous communication is necessary (and standard HTTP polling would be too slow or resource-intensive).</p><h3>Main Challenges of Scraping Data from WebSockets</h3><p>Connecting to a WebSocket server for collecting data isn&#8217;t as straightforward as <a href="https://substack.thewebscraping.club/p/apis-in-web-scraping">spoofing API requests for web scraping</a>. 
In particular, the main challenges of scraping data straight from WebSockets include:</p><ul><li><p><strong>Finding the right client implementation</strong>: You must use a WebSocket client (and there are far fewer of them than HTTP clients&#8230;) that supports the correct protocol version, plus any negotiated extensions (such as compression) and subprotocols.</p></li><li><p><strong>Limited documentation and examples</strong>: WebSocket scraping is less common than API scraping, so there are fewer guides, tools, and community resources available.</p></li><li><p><strong>Proxy integration complexity</strong>: Not all clients support proxy integrations, making IP rotation a challenge.</p></li><li><p><strong>No request&#8211;response model</strong>: You can&#8217;t simply send a request and receive a response, as with API scraping. Instead, you must send the right messages and then listen to a continuous stream of events.</p></li><li><p><strong>Real-time data handling</strong>: You need a system that collects, processes, and stores messages in real time, often while handling high-frequency updates.</p></li></ul><h2>Main WebSocket Anti-Bot Techniques and Solutions</h2><p>Now you&#8217;re ready to discover the most important WebSocket-specific bot detection techniques, along with practical tips to avoid and bypass them. The idea here is to target a WebSocket server from an automated script, relying on a WS client in Python, Node.js, or another programming language of your choice.</p><h3>WebSocket Handshake Issues</h3><p>The <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers#client_handshake_request">WebSocket handshake</a> is a transition phase in which an HTTP connection is upgraded to a persistent WebSocket connection. 
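</p><p>To make the upgrade step more concrete, here&#8217;s a minimal Python sketch, using only the standard library, of two handshake details defined in RFC 6455: how a client can generate a well-formed <em>Sec-WebSocket-Key</em>, and how the <em>Sec-WebSocket-Accept</em> value returned by a compliant server is derived from it. The function names are hypothetical and chosen for illustration, and in practice your WebSocket client library takes care of this for you:</p><pre><code>import base64
import hashlib
import os

# Fixed GUID that RFC 6455 defines for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def make_websocket_key():
    # Hypothetical helper: a valid Sec-WebSocket-Key is
    # 16 random bytes, base64-encoded (24 characters).
    return base64.b64encode(os.urandom(16)).decode("ascii")


def expected_accept(key):
    # Hypothetical helper: the server answers with
    # base64(SHA-1(key + GUID)) in Sec-WebSocket-Accept.
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")</code></pre><p>With the sample key from RFC 6455, <em>dGhlIHNhbXBsZSBub25jZQ==</em>, <em>expected_accept()</em> returns <em>s3pPLMBiTxaQ9kYGzzhZRbK+xOo=</em>. Comparing the computed value with the one in the server&#8217;s response is how a client confirms that the upgrade was actually accepted.</p><p>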
During this step, both the client and the server negotiate the connection parameters, and either side can abort the process if the conditions aren&#8217;t acceptable.</p><p>Because the handshake is where the protocol upgrade happens, it&#8217;s also a pivotal security and bot-detection point. The server must carefully validate everything the client requests. Otherwise, protocol misuse or security issues may occur.</p><p>In detail, during the handshake, a WebSocket client must send a valid HTTP/1.1 GET request with specific headers, for example:</p><pre><code>GET /live-data HTTP/1.1
Host: example.com:9000
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: JKjeFfYU8mti9re0prPQrw==
Sec-WebSocket-Protocol: chat, superchat
Sec-WebSocket-Version: 13</code></pre><p>In practice, browsers also include additional headers such as <em>Origin</em>, <em>User-Agent</em>, <em>Referer</em>, and <em>Cookie</em>, as well as authentication headers (e.g., <em>Authorization</em>). While these HTTP headers aren&#8217;t strictly required by the WebSocket specification, they are extremely valuable for <a href="https://substack.thewebscraping.club/p/browser-fingerprinting-test-online">fingerprinting and bot detection</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6chN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6chN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 424w, https://substackcdn.com/image/fetch/$s_!6chN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 848w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1272w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6chN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png" width="1456" height="1239" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1239,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note all extra HTTP headers&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note all extra HTTP headers" title="Note all extra HTTP headers" srcset="https://substackcdn.com/image/fetch/$s_!6chN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 424w, https://substackcdn.com/image/fetch/$s_!6chN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 848w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1272w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note all extra HTTP headers</figcaption></figure></div><p>Now, the server should respond with an error such as <em>400 Bad Request</em> and immediately close the connection if it encounters:</p><ul><li><p>A missing or malformed required header.</p></li><li><p>An invalid <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Sec-WebSocket-Key">Sec-WebSocket-Key</a></em>.</p></li><li><p>An unsupported WebSocket version.</p></li></ul><p>In the unsupported-version case, the server should also return a <em>Sec-WebSocket-Version</em> header listing the versions it supports (most modern servers only accept <a 
href="https://datatracker.ietf.org/doc/html/rfc6455#section-1.2">version </a><em><a href="https://datatracker.ietf.org/doc/html/rfc6455#section-1.2">13</a></em>).</p><p>In practice, repeated handshake failures and non-browser-like handshake patterns are often treated as bot indicators. They may result in blocking, particularly after repeated attempts from the same IP or when fingerprinting lets the server recognize the client even across IP changes.</p><p><strong>&#128204; Tips</strong>:</p><ul><li><p><strong>Always send a valid </strong><em><strong>Origin</strong></em><strong> header</strong>: All major browsers include it, and many servers automatically reject WebSocket requests without one.</p></li><li><p><strong>Replicate real browser handshakes as closely as possible</strong>: Inspect the WebSocket request made by a real browser and match all headers (e.g., <em>User-Agent</em> and similar extra headers).</p></li><li><p><strong>Avoid excessive handshake attempts from the same machine</strong>: Too many connection attempts in a short time window are a common bot signal.</p></li><li><p><strong>Use IP rotation carefully</strong>: Rotation can help avoid rate-based blocks, but it doesn&#8217;t protect against fingerprint-based detection if the handshake remains identical.</p></li></ul><h3>Honeypot WebSocket Events and Channels</h3><p>If you&#8217;re familiar with <a href="https://substack.thewebscraping.club/p/scraping-high-frequency-python">common anti-bot techniques</a>, you&#8217;ve probably heard of honeypots. A honeypot is a decoy mechanism designed to attract bots by exposing fake or hidden resources, allowing systems to detect automated behavior when those resources are accessed or interacted with (e.g., invisible links or fake pages created to study bots).</p><p>In the context of WebSockets, honeypot events are a possible anti-bot technique to detect automated clients. 
With this approach, the server deliberately sends fake, misleading, or non-actionable events over the WebSocket connection. Similarly, the server might expose channels that aren&#8217;t meant to be accessed by regular clients.</p><p>A real browser client ignores these decoys, whereas automated scraping bots may give themselves away by:</p><ul><li><p>Acting on incoming data that is fake or intentionally invalid.</p></li><li><p>Requesting access to or subscribing to channels they aren&#8217;t supposed to use.</p></li></ul><p><strong>&#128204; Tips</strong>:</p><ul><li><p><strong>Study real browser behavior carefully</strong>: Inspect WebSocket traffic in your browser&#8217;s DevTools (&#8220;Network&#8221; &#8594; &#8220;Socket&#8221;) and observe which server messages actually trigger data flow or UI updates.</p></li><li><p><strong>Avoid assuming every message is meaningful</strong>: Remember that reacting to every event can lead to detection.</p></li></ul><h3>Connection Lifecycle Anomalies and Patterns</h3><p>Since WebSocket channels are stateful (unlike stateless HTTP requests), servers can detect bots by analyzing connection behavior over time. Scraping bots tend to prioritize speed over realistic user behavior, which can produce identifiable patterns.</p><p>In this regard, common bot-like indicators include:</p><ul><li><p><strong>Very short-lived connections</strong>: Opening and closing sockets rapidly to collect data.</p></li><li><p><strong>Immediate reconnections after closure</strong>: Reconnecting instantly without human-like delays.</p></li><li><p><strong>High connection churn per IP</strong>: Multiple connections from the same IP within a short period.</p></li><li><p><strong>Missing browser events</strong>: Typical browser WebSocket clients trigger events like proper socket closure, whereas bots often skip them.</p></li><li><p><strong>Unnatural latency patterns</strong>: Servers use ping frames as heartbeats to check responsiveness. 
Real users on home Wi-Fi or mobile networks exhibit variable latency (jitter), while automated scripts deployed in data centers generally show extremely stable, low-latency responses.</p></li></ul><p><strong>&#128204; Tips</strong>:</p><ul><li><p><strong>Introduce some randomness</strong>: Add realistic, randomized delays between connections and reconnections.</p></li><li><p><strong>Replicate intended behavior</strong>: Close connections the way a browser would, with a proper closing handshake instead of simply dropping the socket.</p></li><li><p><strong>Add latency variation</strong>: Vary the timing of the frames you send and of your replies to mimic real-world network jitter.</p></li><li><p><strong>Rotate connection IPs</strong>: Use proxies to <a href="https://substack.thewebscraping.club/p/how-many-ip-needed-scraping">distribute WebSocket connections across multiple IPs</a>.</p></li></ul><h3>WebSocket Binary Data Transmission</h3><p>WebSocket servers sometimes choose to send binary data instead of plain text or JSON. 
The main technical reasons for this are:</p><ul><li><p><strong>Reduced bandwidth</strong>: Binary messages can omit field names and whitespace, making packets smaller than JSON strings and supporting high-frequency updates.</p></li><li><p><strong>Faster parsing</strong>: Binary data can be read as typed arrays or fixed-size fields, avoiding JSON parsing overhead.</p></li><li><p><strong>Custom protocols</strong>: Web apps can define their own compact binary format for predictable, high-frequency data.</p></li><li><p><strong>Efficient number storage</strong>: Numeric values can be stored in 1&#8211;4 bytes rather than as multi-character strings, saving space.</p></li></ul><p>For instance, TikTok LIVE pages use WebSockets to stream updates (e.g., chat messages, view counters, and other statistics) in binary format:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!GVMj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GVMj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 424w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 848w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1272w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GVMj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png" width="1456" height="862" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:862,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the binary message sent from the 
server&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the binary message sent from the server" title="Note the binary message sent from the server" srcset="https://substackcdn.com/image/fetch/$s_!GVMj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 424w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 848w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1272w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 
17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the binary message sent from the server</figcaption></figure></div><p>Sure, binary data can be converted to text. So, you may think that&#8217;s not a problem&#8230;</p><p>Well, keep in mind that many web applications that send binary data also apply some form of compression or encryption. This adds significant complexity!</p><p>Reverse-engineering these systems is technically possible by inspecting the browser&#8217;s WebSocket client code, analyzing request headers for compression hints, or experimenting with common compression methods. Still, that&#8217;s time-consuming and error-prone. Plus, encryption keys, salts, or other details can easily change with each deployment.</p><p><strong>&#128204; Tips</strong>:</p><p>This time, the only piece of advice I have is to look for alternative data sources. 
Many WebSocket-based pages, including TikTok LIVE, use regular HTTP APIs to retrieve initial data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZW_3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZW_3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 424w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 848w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1272w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png" width="1456" height="779" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:779,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the RESTful HTTP request made by the client during rendering&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the RESTful HTTP request made by the client during rendering" title="Note the RESTful HTTP request made by the client during rendering" srcset="https://substackcdn.com/image/fetch/$s_!ZW_3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 424w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 848w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1272w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the RESTful HTTP request made by the client during rendering</figcaption></figure></div><p><strong>Note</strong>: Why aren&#8217;t these APIs called server-side when the HTML page is generated? 
In the case of live data, it&#8217;s more reliable to fetch it on the client, because even a single second of latency could result in outdated or inconsistent information.</p><p>Thus, polling those RESTful APIs instead of consuming the WebSocket data streams lets you retrieve the information of interest without dealing with binary encoding, compression, or encryption challenges.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>WebSocket-Based Bot Detection Measures</h2><p>A WebSocket connection starts as an HTTP request, so it inherits <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">many anti-bot techniques commonly used for HTTP requests</a>. At the same time, because WebSocket connections are stateful and persistent, anti-bot solutions like WAFs (Web Application Firewalls) can leverage them to detect automated behavior even more effectively.</p><p>As a result, WebSocket-based anti-bot measures are not only relevant when connecting directly to WS servers, but also when interacting with web pages through browser automation tools like Playwright and Selenium. That&#8217;s why you must know them!</p><h3>Advanced TLS Fingerprinting</h3><p>Traditional HTTP fingerprinting checks headers and TLS details. WebSockets extend this by combining the TLS handshake with WebSocket-specific framing, which is much harder to spoof. 
Signals include <a href="https://developers.cloudflare.com/bots/additional-configurations/ja3-ja4-fingerprint/">JA3/JA4 fingerprints</a>, unusual cipher suite ordering, frame fragmentation patterns, and incorrect masking behavior.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Continuous Device Fingerprinting</h3><p>HTTP allows basic fingerprinting on a per-request basis, but it can&#8217;t verify whether the client&#8217;s environment remains consistent. The stateful nature of WebSockets enables servers to continuously <a href="https://substack.thewebscraping.club/p/what-is-device-fingerprint">validate device fingerprints</a> over time. For example, servers can request Canvas/WebGL renders, available fonts, and other browser characteristics repeatedly. Any inconsistency can lead to an immediate block.</p><h3>Real-Time User Behavior Monitoring</h3><p>WebSockets allow live streaming of mouse, keyboard, and scrolling events back to the server. This enables a much deeper level of user behavior analysis compared to static HTTP requests.</p><p>After all, most <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browser automation scripts</a> produce perfectly straight mouse movements or instantaneous clicks, while human interactions naturally include slight jitter, variable speed, and reaction delays. 
These differences make automated clients easier to detect when behavior is constantly monitored over a WebSocket connection.</p><div><hr></div><h2>Conclusion</h2><p>Here, I introduced the WebSocket protocol and explained why and when it comes in handy. Specifically, you learned that it powers live data updates on web applications. Want to access that data? Well, it&#8217;s not as straightforward as you might think due to WebSocket anti-bot techniques.</p><p>In this post, I explored the most relevant WS bot detection methods, along with useful advice for bypassing them successfully. You also saw how WebSocket&#8217;s stateful, continuous data streaming can be used by WAFs and other advanced anti-bot systems for enhanced detection.</p><p>I hope you found this helpful and informative. If you have any questions or comments, drop them below. 
Until next time!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #100: Hybrid Scraping - One Browser Login, Thousands of HTTP Requests]]></title><description><![CDATA[Building a pipeline that uses Camoufox for authentication and curl_cffi for extraction on Akamai-protected targets.]]></description><link>https://substack.thewebscraping.club/p/hybrid-scraping-camoufox-curl-cffi</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/hybrid-scraping-camoufox-curl-cffi</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 19 Mar 2026 22:07:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e52f7e3-270c-41cc-ba33-7bbbfb446247_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Browser-based scraping tools have become the default answer when a website deploys anti-bot protection. When a target runs Akamai, Cloudflare, or Datadome, the natural reflex is to reach for Playwright, Puppeteer, or one of their stealth variants like Camoufox or Pydoll. And it works. A real browser renders JavaScript, solves challenges, and presents a legitimate fingerprint. The success rate is high.</p><p>But a browser does everything the hard way. It downloads the full page, parses HTML, executes JavaScript, renders the DOM, loads images, fonts, and stylesheets. For each request, it allocates hundreds of megabytes of RAM and takes seconds to complete what an HTTP client could do in milliseconds. When a pipeline needs to scrape ten pages, this overhead is irrelevant. 
When it needs to scrape ten thousand pages, the browser becomes the bottleneck.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Consider a concrete scenario: we need to monitor the wishlist of an e-commerce account, pulling product data, stock levels, and price changes every hour across hundreds of items. 
Running Camoufox for every single API call would mean spinning up a full browser instance, navigating to each page, waiting for JavaScript to execute, extracting the data, and closing. For a hundred items, that is minutes of execution time and gigabytes of memory. The same API calls through an HTTP client would complete in seconds using a fraction of the resources.</p><p>As we measured in <a href="https://substack.thewebscraping.club/p/scraping-nike-with-open-source">THE LAB #96</a>, HTTP clients with TLS impersonation can be 27x faster than browsers on the same target. The difference is not marginal. It is the difference between a pipeline that runs on a single machine and one that requires a cluster.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><p>The problem is that these two approaches are usually treated as mutually exclusive. Either you use a browser for everything, accepting the overhead, or you try an HTTP client and hope the anti-bot system does not block it. But many websites only need a browser at the gate: for the login, the initial challenge, or the session establishment. Everything after that is plain API calls.</p><p>If we can use a browser to earn a valid session and then hand it off to an HTTP client, we get the reliability of browser automation where it matters and the speed of HTTP everywhere else. That is the pattern we want to build. But the handoff is not as simple as copying a few cookies, and the traps along the way are worth understanding before building a pipeline around this idea.</p><h2>The hybrid pattern</h2><p>The idea is simple in principle. Many websites require a browser only at the gate: the login flow, the initial anti-bot challenge, or the session establishment. 
Once that gate is passed, subsequent requests are plain API calls or page fetches that do not require JavaScript execution. If we can extract the session state from the browser and replay it through an HTTP client, we skip the browser for 99% of the work.</p><p>The session state, in practice, means cookies. An authentication flow sets session cookies that the server trusts for subsequent requests. If we transfer those cookies from the browser to an HTTP client, the server should treat the HTTP client as the same authenticated user.</p><p>But cookies alone are often not enough. Modern anti-bot systems like Akamai do not just check whether you have the right cookies. They also check whether the client presenting those cookies looks like the same client that earned them. </p><p>This is where TLS fingerprinting enters the picture: if the browser that logged in was Firefox, but the HTTP client that reuses the cookies presents a Python TLS fingerprint, the server may reject the request or simply drop the connection without responding.</p><p>So the real challenge is not just transferring cookies. It is maintaining continuity across two different execution models: the browser and the HTTP client must look like the same entity to the server.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Tool landscape</h2><p>For this experiment, we used two tools.</p><p><a href="https://github.com/daijro/camoufox">Camoufox</a> is a custom Firefox build designed for stealth. 
It spoofs fingerprints (WebGL, canvas, audio, navigator properties), patches headless detection vectors, and uses Playwright&#8217;s Juggler protocol for automation. We covered it extensively in <a href="https://substack.thewebscraping.club/p/scraping-datadome-camoufox">THE LAB #65: Scraping Datadome-protected websites with Camoufox</a>. Its role here is limited to one thing: logging in.</p><p><a href="https://github.com/yifeikong/curl_cffi">curl_cffi</a> is a Python binding for curl-impersonate, a modified version of curl that mimics the TLS and HTTP/2 fingerprints of real browsers. It supports impersonating Chrome and Firefox at specific versions, which means it can present a TLS fingerprint that matches the browser that established the session. Unlike a browser, it uses negligible resources per request and can process thousands of pages per minute.</p><p>The key property that makes this pairing work is that Camoufox is Firefox-based, and curl_cffi can impersonate Firefox&#8217;s TLS fingerprint. The server sees a consistent Firefox identity across both steps.</p><p><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved for paying users, in the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">100.HYBRID_SCRAPING</a>.</strong></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>The target: Net-a-Porter</h2><p>We chose <a href="https://www.net-a-porter.com">Net-a-Porter</a> as our target. 
It is a luxury e-commerce platform protected by Akamai Bot Manager, with authenticated features (wishlists, account details) exposed through internal JSON APIs. This gives us a clean test case: the login requires a real browser (Akamai blocks automation tools at the login endpoint), but the authenticated API calls are plain HTTP requests that return structured JSON.</p><p><em><strong>Please keep in mind that this is an experiment for study purposes, and we are not encouraging you to scrape Net-a-Porter or any other website, especially the parts behind a login.</strong></em></p><p>Before diving into code, we need to understand what we&#8217;re dealing with. Net-a-Porter&#8217;s architecture has three layers relevant to us:</p><p><strong>Akamai Bot Manager</strong> sits in front of everything. It sets a cluster of tracking cookies (<code>_abck</code>, <code>bm_sz</code>, <code>bm_s</code>, <code>ak_bmsc</code>, and others) that are generated through JavaScript execution on the client side. These cookies signal that a real browser visited the page. Without them, API calls either fail or hang indefinitely.</p><p><strong>The login API</strong> at <code>/api/nap/wcs/resources/store/nap_il/loginidentity/v2</code> accepts a JSON payload with email and password. On success, it returns a 201 status with an <code>Ubertoken</code> in the response body. This token is the key to all authenticated endpoints.</p><p><strong>Authenticated API endpoints</strong> like the wishlist API at <code>/api/nap/wcs/resources/store/nap_il/wishlist/v2/{id}</code> require both the session cookies and the <code>Ubertoken</code> passed as an <code>x-ubertoken</code> header. They return clean JSON with product details, stock levels, and metadata.</p><h2>The experiment: what worked and what did not</h2><p>We did not arrive at the final solution directly. The investigation path itself reveals the constraints of session handoff, so it is worth walking through each attempt.</p>
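<p>As a starting point for those attempts, the cookie-and-token part of the handoff can be sketched in a few lines of Python. This is a minimal illustration, not the code from the repository: the helper names <code>cookie_applies</code> and <code>build_handoff_headers</code> are hypothetical, and the input is assumed to be the list of cookie dictionaries that Playwright returns from <code>BrowserContext.cookies()</code>. It only prepares the <code>Cookie</code> and <code>x-ubertoken</code> headers described above. Sending them through a client that also matches the browser&#8217;s TLS fingerprint is a separate concern that this sketch does not cover.</p>

```python
# Sketch only: helper names are hypothetical, not taken from the article's repository.
from urllib.parse import urlsplit


def cookie_applies(cookie, host):
    """Return True if the cookie's domain covers the given host."""
    domain = cookie.get("domain", "").lstrip(".").lower()
    host = host.lower()
    return bool(domain) and (host == domain or host.endswith("." + domain))


def build_handoff_headers(browser_cookies, ubertoken, url):
    """Build the headers an HTTP client would replay after the browser login.

    browser_cookies is assumed to be a list of dicts with at least the keys
    "name", "value" and "domain" (the shape Playwright uses for cookies).
    """
    host = urlsplit(url).hostname or ""
    # Keep only cookies whose domain matches the target host, in original order.
    pairs = [
        f"{c['name']}={c['value']}"
        for c in browser_cookies
        if cookie_applies(c, host)
    ]
    return {
        "Cookie": "; ".join(pairs),
        "x-ubertoken": ubertoken,  # header name as described in the article
        "Accept": "application/json",
    }
```

<p>Keeping the header construction in a pure function makes it easy to inspect which cookies are handed over before any request is sent. In a real pipeline, the returned dictionary would be passed to the HTTP client. Whether those headers alone are accepted depends on the target and on the other signals the anti-bot system checks.</p>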
      <p>
          <a href="https://substack.thewebscraping.club/p/hybrid-scraping-camoufox-curl-cffi">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Stop Getting Blocked: Upgrade Your Scraping Infrastructure with Dolphin{anty}]]></title><description><![CDATA[My review of Dolphin{anty}. Weighing the pros, cons, and unique capabilities of this anti-detect browser.]]></description><link>https://substack.thewebscraping.club/p/dolphin-anty-product-review</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/dolphin-anty-product-review</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 15 Mar 2026 15:33:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d9486151-c072-4ffa-b126-fe482a216e7e_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The web scraping industry has evolved rapidly in recent years. The fact that rotating proxies alone no longer guarantees success is a clear sign of how advanced anti-bot systems have become.</p><p>Many tools have emerged to address browser fingerprinting, which is one of the <a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies">primary reasons for blocks even when using high-quality residential proxies</a>. For companies that need stable, scalable data collection, anti-detect solutions have become essential in the current state of the industry.</p><p>In this article, you&#8217;ll discover Dolphin{anty}, a powerful anti-detect browser that lets you orchestrate hundreds of unique, isolated browser profiles. 
You&#8217;ll learn its strengths, why you should consider it for your scraping or multi-accounting projects, and how it works with a practical guide.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What is an Anti-detect Browser?</h2><p>An anti-detect browser is a specialized web browsing tool designed <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">to mask a user&#8217;s digital fingerprint</a>, allowing them to appear as a distinct, unique visitor to websites and tracking systems. Standard browsers like Chrome or Firefox expose a user&#8217;s hardware and software data. An anti-detect browser, instead, enables users to customize and spoof these parameters for every session.</p><p>In web scraping, professionals use this technology to bypass anti-bot measures that rely on browser fingerprinting to identify and block automated traffic. Anti-detect browsers can also be used in &#8220;multi-accounting&#8221; strategies. You can use them to create isolated browser profiles, each with its own unique fingerprint, cookies, and proxy IP. A common use case is a single user managing hundreds of social media, e-commerce, or ad accounts simultaneously without triggering security flags that would normally link the accounts together and lead to mass bans.</p><div><hr></div><blockquote><p><em>A successful data pipeline depends not only on the right tool but also on the right IP address. 
Proxy providers like <strong>Decodo</strong> help you achieve your scraping goals.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>What is Dolphin{anty} and Why Consider It for Your Web Scraping Projects?</h2><p><a href="https://dolphin-anty.com/">Dolphin{anty}</a> is an anti-detect browser that lets you manage hundreds of unique, isolated browser profiles for web scraping and multi-accounting. You can use it through its desktop application or programmatically, since it provides a flexible API for deep integration with your scripts.</p><p>The best part of using it is that you can orchestrate large-scale scraping operations without worrying about browser fingerprinting. Forget about immediate IP bans, CAPTCHAs triggered by suspicious metadata, or complex cookie management. Dolphin{anty} takes care of masking your digital identity for you. Also, thanks to its <a href="https://dolphin-anty.com/blog/en/dolphin-anty-has-become-even-more-effective-a-significant-update-to-the-scenarios-capabilities/">built-in &#8220;Scenarios&#8221; builder and synchronizer</a>, it can automatically replicate human-like actions across multiple profiles simultaneously. So, say goodbye to manual warm-up routines and the fear of losing accounts to anti-fraud systems.</p><p>The main reasons to consider it for your projects are the following:</p><ul><li><p><strong>Advanced anti-detect capabilities:</strong> If you&#8217;ve been scraping for a while, you know that standard headless browsers often leak metadata that triggers anti-bot defenses. Dolphin{anty} addresses this by providing a real, unique digital fingerprint for every profile. 
It mimics user behavior at a granular level, helping you get past sophisticated detection systems without the constant headache of being blocked.</p></li><li><p><strong>Mass profile management:</strong> Managing a few accounts is easy, but scaling to hundreds or thousands is a different beast. Dolphin{anty} is built for scale. It allows you to orchestrate hundreds of isolated browser profiles from a single interface. Whether you are managing a massive farm of accounts for data collection or need to segment your scraping tasks, the tool provides the infrastructure to keep everything organized and efficient.</p></li><li><p><strong>Flexible API integration:</strong> For those who prefer code, Dolphin{anty} offers an API that integrates with your existing Python or Node.js pipelines. This allows you to automate profile creation, launch browsers programmatically, and integrate the anti-detect capabilities directly into your custom scraping infrastructure.</p></li></ul><div><hr></div><blockquote><p><em>For ethical scraping, you need IPs with a good reputation. 
For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">which is sharing this offer with TWSC readers</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" height="87.84" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC 
for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><h2>Dolphin{anty}&#8217;s Main Features</h2><p>Dolphin{anty} is packed with features designed to make multi-accounting and scraping easier. The main ones you should know about are the following:</p><ul><li><p><strong>Real fingerprint generation:</strong> The core of Dolphin{anty} is its ability to provide genuine device fingerprints. Instead of just blocking trackers, it creates a unique digital identity for every profile you run. In practice, it manages over 20 parameters, from WebRTC to Canvas, so your scrapers look like real users on real devices.</p></li><li><p><strong>Built-in automation:</strong> You don&#8217;t always need to be a coding wizard to automate tasks. Dolphin{anty} offers a &#8220;Scenarios&#8221; builder that lets you create automated workflows visually. Whether it&#8217;s warming up accounts or parsing data, you can set these scenarios to run automatically. And for those who prefer code, the flexible API allows you to integrate these profiles directly into your existing scripts.</p></li><li><p><strong>Profile synchronizer:</strong> This is a game-changer if you need to perform the same action across multiple accounts. The Synchronizer lets you perform an action in a &#8220;master&#8221; profile, and the tool automatically repeats that exact action across all other selected profiles in real time. This saves you a lot of time on routine interactions.</p></li><li><p><strong>Team collaboration:</strong> If you work in a team, you know that sharing browser sessions and cookies can be a nightmare. Dolphin{anty} simplifies this by allowing you to transfer profiles, cookies, and proxies to colleagues in just a few clicks. 
You can also manage permissions, so that team members only have access to the functionality they need.</p></li><li><p><strong>Smart profile management:</strong> When you are dealing with hundreds of profiles, organization is key. The tool provides an interface where you can use tags, statuses, and notes to sort and find your profiles quickly. It&#8217;s built to help you navigate a large farm of accounts without getting lost in the chaos.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2><strong>Hands-on Dolphin{anty}: Step-by-step Scraping Tutorial</strong></h2><p>In this section, you will see how easy and fast it is to use Dolphin{anty}. Get ready for the tutorial!</p><h3>Setting Up Dolphin{anty}</h3><p>First of all, you need an account. After you <a href="https://dolphin-anty.com/panel/#/auth/registration">create a new account on Dolphin{anty}</a>, the system will ask you to download the software. 
As you can see from the image below, it supports all the major Operating Systems:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qICh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qICh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 424w, https://substackcdn.com/image/fetch/$s_!qICh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 848w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1272w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qICh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png" width="1456" height="605" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:158491,&quot;alt&quot;:&quot;Dolphin Anty supports all major operating systems by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Dolphin Anty supports all major operating systems by Federico Trotta" title="Dolphin Anty supports all major operating systems by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!qICh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 424w, https://substackcdn.com/image/fetch/$s_!qICh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 848w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1272w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Dolphin{anty} supports all major operating systems</figcaption></figure></div><p>Below is how Dolphin{anty}&#8217;s interface appears after you install it on your machine:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kn4O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kn4O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 424w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 848w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1272w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png" width="1456" height="762" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:762,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:185539,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kn4O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 424w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 848w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1272w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Dolphin{anty}&#8217;s first interface</figcaption></figure></div><p>Good. Everything is set up. Time to create new profiles!</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Create New Profiles</h3><p>Before using Dolphin{anty}, you have to create a new profile. 
To do so, click on <strong>CREATE PROFILE</strong> and fill in the fields:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xU9q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xU9q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 424w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 848w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1272w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xU9q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png" width="1152" height="871" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:871,&quot;width&quot;:1152,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xU9q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 424w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 848w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1272w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Creating new profiles in Dolphin{anty}</figcaption></figure></div><p>Profiles are the core of Dolphin{anty}. This is where, for example, you can change the fingerprint for your anti-detect strategies. To do so, you only need to click on <strong>NEW FINGERPRINT</strong>, and the tool will regenerate all the fingerprint data for you. 
And if the standard fingerprinting is not sufficient, you can manage advanced configurations:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rafe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rafe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 424w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 848w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1272w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rafe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png" width="1166" height="873" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:873,&quot;width&quot;:1166,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184964,&quot;alt&quot;:&quot;Changing fingerprint configuration in Dolphin Anty by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Changing fingerprint configuration in Dolphin Anty by Federico Trotta" title="Changing fingerprint configuration in Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Rafe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 424w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 848w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1272w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Changing fingerprint configuration in Dolphin{anty}</figcaption></figure></div><p>Also, if your use case involves a specific social media platform like Facebook, you can set Facebook&#8217;s URL as the starting page and add the credentials to log in to the account you need to manage:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!ONwa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ONwa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 424w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 848w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1272w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ONwa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png" width="1173" height="872" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:872,&quot;width&quot;:1173,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187856,&quot;alt&quot;:&quot;How to set up your social media profile&#8217;s login with Dolphin Anty by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to set up your social media profile&#8217;s login with Dolphin Anty by Federico Trotta" title="How to set up your social media profile&#8217;s login with Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!ONwa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 424w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 848w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1272w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">How to set up your social media profile&#8217;s login with Dolphin{anty}</figcaption></figure></div><p>When everything is set up, click on <strong>SAVE</strong>, and your profile is complete! You are now ready to use Dolphin{anty} via UI or code.</p><div><hr></div>
<h3>Use Dolphin{anty} Via The UI</h3><p>The power of anti-detect browsers lies in letting you create multiple profiles and run them all from a single application. So, after you have created your profiles, click on <strong>START</strong> to launch the instances:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!38xz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!38xz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 424w, https://substackcdn.com/image/fetch/$s_!38xz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 848w, https://substackcdn.com/image/fetch/$s_!38xz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1272w, 
https://substackcdn.com/image/fetch/$s_!38xz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!38xz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png" width="1456" height="285" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:285,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124225,&quot;alt&quot;:&quot;How to launch instances with Dolphin Anty by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to launch instances with Dolphin Anty by Federico Trotta" title="How to launch instances with Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!38xz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 424w, https://substackcdn.com/image/fetch/$s_!38xz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 848w, 
https://substackcdn.com/image/fetch/$s_!38xz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1272w, https://substackcdn.com/image/fetch/$s_!38xz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">How to launch instances with Dolphin{anty}</figcaption></figure></div><p>Dolphin{anty} will launch a new browser instance, allowing you to manage as many profiles as you have created and activated. Below is the expected result:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_xVL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_xVL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 424w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 848w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_xVL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_xVL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png" width="1058" height="916" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:916,&quot;width&quot;:1058,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:218925,&quot;alt&quot;:&quot;Launching an instance with two different profiles with Dolphin Anty by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Launching an instance with two different profiles with Dolphin Anty by Federico Trotta" title="Launching an instance with two different profiles with Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!_xVL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 424w, 
https://substackcdn.com/image/fetch/$s_!_xVL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 848w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1272w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Launching an instance with two different profiles with Dolphin{anty} </figcaption></figure></div><p>That&#8217;s it for using Dolphin{anty}  via UI!</p><h3>Use Dolphin{anty} Via Code</h3><p>Before using Dolphin{anty}  via code, you have to create an API key. To do so, navigate through the <strong><a href="https://dolphin-anty.com/panel/#/api">API</a></strong><a href="https://dolphin-anty.com/panel/#/api"> panel in the web app</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oWZ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oWZ3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 424w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 848w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1272w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png" width="1456" height="581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224126,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oWZ3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 424w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 848w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1272w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Creating an API key in Dolphin{anty}</figcaption></figure></div><p>Now you can connect to a profile through a port generated at startup and automate the browser using tools like <a href="https://substack.thewebscraping.club/p/improving-performance-puppeteer-scraping">Puppeteer</a>, <a href="https://substack.thewebscraping.club/p/web-scraping-from-0-to-hero-our-first">Playwright</a>, <a href="https://substack.thewebscraping.club/p/selenium-tutorial-course">Selenium</a>, and others.</p><p>A basic automation workflow consists of the following steps:</p><ol><li><p>Start a profile 
via API with the DevTools Protocol enabled.</p></li><li><p>Connect to the profile&#8217;s port using a browser automation tool.</p></li><li><p>Run your own automation script through the open connection.</p></li></ol><p>Dolphin{anty} gives you maximum flexibility: since automation relies on REST API calls, you can use your favourite programming language. For example, below is a Python authorization script that sends your API key to the local API:</p><pre><code><code>import requests
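
# Local API endpoint (assumes the Dolphin{anty} app is running on this machine)
api_url = "http://localhost:3001/v1.0/auth/login-with-token"
token = "your-api-key"  # replace with the API key created in the web app
request_data = {"token": token}
headers = {"Content-Type": "application/json"}

# Send the token to the local API to authorize this session
response = requests.post(api_url, json=request_data, headers=headers)
if response.status_code == 200:
&#9;print("OK", response.json())
else: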
&#9;print("Error", response.status_code)</code></code></pre><p>If the request is successful, you will receive a response like the following:</p><pre><code><code>{"success": true}</code></code></pre><p>Discover how to use <a href="https://help.dolphin-anty.com/en/collections/4645237-api">Dolphin{anty} via API by reading the documentation</a>!</p><h2>Pros and Cons of Dolphin{anty}</h2><p>Like any tool, Dolphin{anty} has its strengths and weaknesses. Here is a breakdown of what you need to know before deciding if it fits your stack.</p><p>&#128077; <strong>Pros:</strong></p><ul><li><p><strong>Top-tier fingerprinting:</strong> The ability to generate real, unique fingerprints for every profile is its biggest selling point. It goes beyond simple user-agents, making your scrapers look genuinely human.</p></li><li><p><strong>Built-in automation tools:</strong> The &#8220;Scenarios&#8221; builder and the Synchronizer are massive time-savers. You can automate routine warm-up tasks or replicate actions across dozens of profiles without writing a single line of code.</p></li><li><p><strong>Team-centric design:</strong> If you work with a team, the ability to transfer profiles and share them instantly is invaluable. It removes the friction of sharing session data manually via files or text.</p></li></ul><p>&#128078; <strong>Cons:</strong></p><ul><li><p><strong>REST API complexity:</strong> This is a significant friction point for developers. Unlike other solutions that offer native SDK wrappers, Dolphin{anty} relies only on REST API calls for automation. This adds &#8220;boilerplate&#8221; complexity compared to simply importing a library.</p></li><li><p><strong>Resource intensive:</strong> Running multiple browser profiles with full fingerprinting requires significant system resources. 
You will need a powerful machine if you plan to run dozens of concurrent sessions locally.</p></li></ul><h2>Conclusion</h2><p>In this article, you discovered Dolphin{anty}, a flexible anti-detect browser that can be used both via UI and via code. As you&#8217;ve learned, it comes packed with interesting features that can speed up your processes. In particular, we found that the &#8220;Scenarios&#8221; feature is the one that actually makes it stand out.</p><p>So, let&#8217;s discuss in the comments: Were you already using Dolphin{anty} before reading this article? What&#8217;s your experience with it?</p>]]></content:encoded></item><item><title><![CDATA[The DMCA Was Built to Stop DVD Piracy. Google Wants to Use It Against Scrapers]]></title><description><![CDATA[How a 12-page complaint is trying to turn every CAPTCHA into a federal copyright perimeter]]></description><link>https://substack.thewebscraping.club/p/google-vs-serpapi-web-scraping-case</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/google-vs-serpapi-web-scraping-case</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Sun, 08 Mar 2026 17:52:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/97060676-c153-4ea6-a3c9-7e70cd1f3c22_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On December 19, 2025, Google filed a lawsuit against SerpApi in the Northern District of California. The case number is 25-10826, and the complaint is 12 pages long. Twelve pages that could reshape how the entire scraping industry operates.</p><p>We are not talking about a cease-and-desist letter or a Terms of Service dispute. Google did not send SerpApi any communication before filing the lawsuit. No cease-and-desist, no attempt to resolve their concerns directly. 
SerpApi told us this was highly unusual, and that had Google reached out, they might have learned that their claims lack merit.</p><p>Google is invoking the Digital Millennium Copyright Act, specifically Section 1201, the anti-circumvention provision. The same statute originally designed to prevent people from cracking DVD encryption is now being pointed at a SERP scraping API.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>We reached out to both Google and SerpApi for comment on this case. Google did not respond. SerpApi did, and we will include their statements throughout this article where relevant.</p><p>Let us break down what happened, why it matters, and what it could mean for anyone who scrapes the web for a living.</p><h3>The Facts</h3><p>Google&#8217;s complaint tells a straightforward story. SerpApi, founded in 2017 by Julien Khaleghy, operates a paid API that sends automated queries to Google Search and returns the results as structured JSON. Google estimates that SerpApi sends hundreds of millions of artificial search requests per day, and that this volume has increased by as much as 25,000% over the past two years.</p><p>In January 2025, Google deployed a technological protection measure called SearchGuard. SearchGuard works by sending JavaScript challenges to incoming search queries. For regular browser users, the challenge is invisible: the browser runs the JavaScript, sends back the expected response, and the search results load normally. For automated systems, the challenge is a wall. Bots that cannot execute JavaScript or that fail behavioral checks get blocked.</p><p>According to Google&#8217;s complaint, SerpApi&#8217;s response to SearchGuard was to build circumvention mechanisms. 
The complaint alleges that SerpApi creates &#8220;fake browsers using a multitude of IP addresses that Google sees as normal users,&#8221; misrepresents device and location information when solving challenges, and syndicates authorization tokens from legitimate requests to unauthorized machines around the world. Google also alleges that SerpApi uses automated means to bypass CAPTCHAs that SearchGuard deploys as a secondary verification layer. SerpApi disputes these factual allegations.</p><p>The complaint cites SerpApi&#8217;s own blog posts, where the company reportedly described SearchGuard as making &#8220;web scraping more difficult&#8221; but claimed to be &#8220;fortunate to be minimally impacted&#8221; because its services had &#8220;already pre-solved Google&#8217;s JavaScript challenge.&#8221;</p><h2>The Legal Theory</h2><p>This is where it gets interesting for the scraping industry, because Google chose not to sue under the Computer Fraud and Abuse Act (CFAA). That would have been the traditional route. Instead, Google went with the DMCA.</p><p>The context matters. The CFAA path has been significantly narrowed by the hiQ Labs v. LinkedIn case. In that landmark decision, the Ninth Circuit held that scraping publicly available data does not violate the CFAA, and warned against allowing companies to create &#8220;information monopolies.&#8221; The Supreme Court vacated and remanded the case under its Van Buren ruling, but on remand, the Ninth Circuit reaffirmed its original position.</p><p>After hiQ, the CFAA is a much weaker weapon against scraping of publicly visible content. Google needed a different legal framework. Section 1201 of the DMCA provides one.</p><p>Section 1201 has two relevant provisions. The first, Section 1201(a)(1)(A), prohibits the act of circumventing a technological measure that effectively controls access to a copyrighted work. The second, Section 1201(a)(2), prohibits trafficking in technology designed to circumvent such measures. 
Google&#8217;s complaint invokes both.</p><p>The argument chain goes like this: Google&#8217;s search results contain copyrighted content, specifically images in Knowledge Panels licensed from third parties, merchant-supplied product images in Google Shopping, and licensed content from Google Maps. SearchGuard is a technological measure that controls access to these search results pages (and therefore to the copyrighted works within them). SerpApi circumvents SearchGuard. Therefore, SerpApi violates Section 1201.</p><p>Each act of circumvention carries statutory damages of between $200 and $2,500. Google alleges billions of individual circumventions. Do the math, and the potential damages exceed what SerpApi could ever pay. Google itself notes in the complaint that SerpApi &#8220;reportedly earns a few million dollars in annual revenue, but already faces liability that is orders of magnitude higher and growing.&#8221;</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>SerpApi&#8217;s Position</h2><p>When we reached out to SerpApi, they were clear about their stance. On the fundamental legality of what they do, SerpApi told us: &#8220;<em>We embrace the term &#8216;scraping,&#8217; and we practice it legally and transparently. SerpApi accesses publicly visible search results, the same ones available to any browser, and delivers clean, structured JSON back to our customers. 
We&#8217;ve operated this way since 2017, serving developers, researchers, and businesses who need reliable access to public information at scale.&#8221;</em></p><p>On the legal boundaries of automated access to search results, their position is equally direct: &#8220;<em>The law on this is clear, and we&#8217;re prepared to defend that position in court. Scraping is legal, and we stand behind our products and customers. Our API replicates real-time searches with no login, no bypass of any paywall, and no access to anything that isn&#8217;t already available to anyone with a browser. U.S. courts have upheld this repeatedly; hiQ Labs v. LinkedIn is a key precedent. The data Google surfaces lives on the open web. Google didn&#8217;t create it.</em>&#8221;</p><p>In February 2026, <a href="https://serpapi.com/blog/google-v-serpapi-motion-to-dismiss-why-were-in-the-right/">SerpApi filed a motion to dismiss</a>. Their arguments include the assertion that the DMCA is a copyright protection statute, not a website protection statute, and that Google is improperly trying to use it to control access to public portions of its website. They also argue that mimicking browser behavior to access publicly available pages is not the same as cracking encryption or disabling authentication, and that any ambiguity in the definition of "circumvention" must be given its narrowest reasonable reading, citing the "First Amendment interest in maintaining accessibility of the Internet as an open forum."</p><p>SerpApi also pointed out what they see as an absurdity in Google&#8217;s theory. If statutory damages were calculated at scale, the total &#8220;would exceed U.S. GDP.&#8221; Congress, they argue, never intended Section 1201 to be used this way.</p><p>On the DMCA claim specifically, SerpApi told us: &#8220;<em>The DMCA&#8217;s anti-circumvention provision was designed to protect copyrighted works, full stop. Google is not protecting access to copyrighted works. 
Google is improperly attempting to use the DMCA to limit access to the public portions of its website. We believe that the law is on our side.</em>&#8221;</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>The Hypocrisy Argument</h2><p>SerpApi is not shy about making this point. <a href="https://serpapi.com/blog/google-v-serpapi-threatening-access-to-public-data/">In a blog post about the lawsuit</a>, they argue that Google&#8217;s case threatens access to public data on the open internet and this resonates widely in the scraping community. As they told us: &#8220;<em>Google indexed the web without anyone&#8217;s permission. That&#8217;s how search works. Now it&#8217;s trying to pull up the ladder behind it, prohibiting the practices that it used, and still uses today, to build its business empire. That&#8217;s why SerpApi is standing up to Google. Not just to protect our business, but to protect legal competition and open access to public information on the internet.</em>&#8221;</p><p>Google Search operates by crawling, indexing, and presenting content from billions of websites. Many of those website owners never explicitly consented to being indexed. Google&#8217;s position has always been that robots.txt provides the mechanism for opting out, and that the default state of the open web is crawlable. Now Google is arguing that its own search results should be exempt from the same logic.</p><p>The irony is not lost on legal commentators either. 
<a href="https://abovethelaw.com/2025/12/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google">Above the Law</a> described the case as Google &#8220;<em>pulling up the ladder after climbing it.</em>&#8221; <a href="https://blog.ericgoldman.org/archives/2026/01/relitigating-hiq-labs-and-scraping-through-the-lens-of-the-dmca-1201-anti-circumvention-guest-blog-post.htm">Eric Goldman&#8217;s blog published an extensive guest analysis</a> arguing that Google&#8217;s DMCA strategy represents an attempt to relitigate hiQ Labs through a different statutory framework.</p><h2>Why This Matters Beyond SerpApi</h2><p>If Google&#8217;s legal theory prevails, the implications extend far beyond one API company. The core question is whether deploying an anti-bot system on a publicly accessible website is enough to invoke federal copyright law against anyone who bypasses it.</p><p>Think about what that means in practice. Every CAPTCHA, every JavaScript challenge, every behavioral analysis system deployed on a public website could potentially become a &#8220;technological protection measure&#8221; under Section 1201. Any scraper that solves a CAPTCHA, executes JavaScript to render a page, or rotates IP addresses to avoid detection could be committing a federal offense.</p><p>This is not hypothetical. The legal theory applies to any website that hosts copyrighted content (which is almost all of them) and deploys some form of bot detection (which is increasingly all of them).</p><p>Eric Goldman&#8217;s blog highlighted this exact concern. 
<a href="https://blog.ericgoldman.org/archives/2026/01/relitigating-hiq-labs-and-scraping-through-the-lens-of-the-dmca-1201-anti-circumvention-guest-blog-post.htm">The guest analysis by Kieran McCarthy</a> warns that accepting Google&#8217;s theory would allow any website deploying anti-bot technology to invoke federal law against circumvention, &#8220;transforming speed bumps and CAPTCHAs into federally enforceable copyright perimeters.&#8221;</p><p>The <a href="https://www.eff.org/">Electronic Frontier Foundation</a> has also weighed in. Staff attorney Tori Noble stated that &#8220;the right to scrape publicly available information keeps the Internet free and open,&#8221; cautioning that overly broad DMCA interpretations undermine innovation and research.</p><p>SerpApi made a similar point when we asked about the impact on consumers: &#8220;<em>Scraping-powered services benefit all kinds of consumers who use the web every day. Scraping helps to maintain the free and open flow of information across the internet, ultimately encouraging things like price transparency, competition, and informed decision-making, all to benefit consumers. Expanding the DMCA as Google has suggested would only benefit the largest tech incumbents and hinder transparency and healthy competition.</em>&#8221;</p><div><hr></div><h2>The Emerging Legal Pattern</h2><p>Google&#8217;s lawsuit does not exist in isolation. In October 2025, <a href="https://copyrightalliance.org/wp-content/uploads/2025/10/Reddit-v.-SerpApi.pdf">Reddit filed a 41-page complaint</a> against SerpApi, Perplexity AI, Oxylabs, and AWMProxy in the Southern District of New York. The complaint is far more aggressive than Google&#8217;s, both in tone and in scope: six legal counts including three separate DMCA claims, unfair competition, unjust enrichment, and civil conspiracy.</p><p>Reddit&#8217;s framing is vivid. It describes the defendants as &#8220;similar to would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead.&#8221; AWMProxy is characterized as &#8220;a former Russian botnet.&#8221; Perplexity is compared to &#8220;a North Korean hacker.&#8221; The language is clearly designed to make scrapers look like criminals.</p><p>The underlying theory is similar to Google&#8217;s. Reddit has signed licensing deals with both <a href="https://blog.google/inside-google/company-announcements/expanded-reddit-partnership/">Google</a> and <a href="https://openai.com/index/openai-and-reddit-partnership/">OpenAI</a> to grant them programmatic access to its data. Companies that want Reddit content at scale are expected to pay for it. But when scrapers circumvent SearchGuard to harvest Google&#8217;s search results, they also harvest Reddit content without paying a cent. 
According to data Reddit obtained through a subpoena to Google, the three scraping defendants accessed almost three billion Google SERPs containing Reddit content in just two weeks during July 2025. SerpApi alone accounted for over 1.8 billion of those page accesses. Like Google, Reddit did not send SerpApi any communication before filing suit. SerpApi disputes these figures and the other factual allegations in Reddit&#8217;s complaint, and has filed a motion to dismiss in that case as well.</p><p>Reddit also produced a piece of evidence that reads like a detective novel. It created a hidden &#8220;test post&#8221; that could only be crawled by Google&#8217;s search engine and was not otherwise accessible anywhere on the internet. Within hours, the contents of that post appeared in Perplexity&#8217;s &#8220;answer engine.&#8221; The only way Perplexity could have obtained that content was through scraping Google&#8217;s search results. Reddit calls this technique the equivalent of &#8220;marked bills&#8221; in a bank robbery investigation.</p><p>The Reddit complaint also reveals a detail that connects directly to our industry: after Reddit sent a cease-and-desist letter to Perplexity in May 2024, Perplexity&#8217;s citations to Reddit content did not decrease. They increased forty-fold.</p><p>And in December 2025, in Ziff Davis v. OpenAI, a federal judge in the Southern District of New York ruled that robots.txt files do not &#8220;effectively control access&#8221; under Section 1201. Judge Sidney Stein compared robots.txt to a &#8220;keep off the grass&#8221; sign that &#8220;relies on readers to decide to comply rather than enforcing any kind of access control itself.&#8221; The ruling is important because it sets a baseline: passive, voluntary measures are not enough to trigger DMCA protection.</p><p>But SearchGuard is not robots.txt. 
It is an active system that executes JavaScript, performs behavioral analysis, deploys CAPTCHAs, and makes real-time decisions about whether to grant access. Whether this kind of system meets the &#8220;effectively controls access&#8221; standard is the open legal question. The answer will likely set the direction for the entire industry.</p><p><a href="https://blog.ericgoldman.org/archives/2026/01/relitigating-hiq-labs-and-scraping-through-the-lens-of-the-dmca-1201-anti-circumvention-guest-blog-post.htm">Legal commentators</a> have identified what they call the &#8220;DMCA 1201 scraping strategy&#8221;: platforms deploy technological protection measures specifically to create legal standing under Section 1201, then sue when those measures are circumvented. The sequence is intentional. Deploy, document, sue. Whether courts view this as legitimate copyright protection or as <a href="https://abovethelaw.com/2025/12/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google/">strategic rent-seeking</a> will determine the outcome.</p><p>There is also a relevant doctrinal debate. The Lexmark case in the Sixth Circuit introduced the &#8220;front door/back door&#8221; argument: if a house&#8217;s front door is unlocked, putting a lock on the back door does not mean the house is &#8220;access-controlled.&#8221; Applied here: if anyone with a regular browser can access Google Search results, does deploying SearchGuard against automated systems meaningfully &#8220;control access&#8221; to the copyrighted works within those results?</p><h2>The AI Angle</h2><p>There is one more layer worth noting. <a href="https://searchengineland.com/openai-chatgpt-serpapi-google-search-results-461226">As Search Engine Land reported</a>, OpenAI used SerpApi to scrape Google Search results for ChatGPT responses on current events, after Google declined to provide direct access to its search index. 
SerpApi listed OpenAI as a customer on its website as recently as May 2024 before removing the listing. Other reported customers include Meta, Apple, and Perplexity.</p><p>This context matters because Google already has a massive structural advantage in the AI race when it comes to fresh web data. <a href="https://finance.yahoo.com/news/google-huge-edge-over-openai-110102636.html">Cloudflare CEO Matthew Prince put numbers on it</a>: &#8220;For every one page that OpenAI sees, Google is seeing 3.2 pages.&#8221; Against Microsoft, the ratio is 4.8 to 1. The reason is simple. Publishers cannot block Googlebot without disappearing from search results. So Google gets access to the web at a scale that no competitor can match, and it can use that data not just for search but also for training and running its AI products.</p><p>In this context, suing companies that make it easier for competitors to scrape Google&#8217;s search results is not just about protecting copyrighted images in Knowledge Panels. It is also an act of defense of a competitive advantage. If OpenAI or any other AI company can get structured search data through SerpApi, they partially close the gap that Google&#8217;s crawler monopoly creates. Shutting down that channel through litigation serves Google&#8217;s position in the AI race, even if the complaint is framed purely in terms of copyright protection.</p><h2>What Happens Next</h2><p>The case is still in its early stages. <a href="https://ppc.land/serpapi-files-motion-to-dismiss-googles-dmca-scraping-lawsuit/">SerpApi filed its motion to dismiss</a> on February 20, 2026. 
<a href="https://www.courtlistener.com/docket/72059948/google-llc-v-serpapi-llc/">According to the court docket</a>, the initial case management conference before Judge Yvonne Gonzalez Rogers is scheduled for March 30, 2026, and a hearing on the motion to dismiss is set for May 19, 2026.</p><p>If the motion to dismiss fails and the case proceeds to discovery and trial, it will force courts to answer questions that have been left open since hiQ. Is a JavaScript challenge a &#8220;technological protection measure&#8221; under the DMCA? Can anti-bot systems on publicly accessible websites invoke federal anti-circumvention law? Does the DMCA protect the act of accessing a public webpage, or only the copyrighted works behind genuine access controls like encryption and authentication?</p><p>For the scraping industry, the stakes are high. A ruling in Google&#8217;s favor would give any website with copyrighted content and a bot-detection system a federal cause of action against scrapers. A ruling in SerpApi&#8217;s favor would confirm that the DMCA was not designed to protect public webpages from automated access, regardless of the technical measures deployed.</p><p>We will follow the case closely. Whatever happens, the days of operating in a legal gray area are coming to an end. The courts will have to draw a line, and that line will define the rules for the next decade of web scraping.</p><p>*<em>Disclaimer: We are not lawyers. This article represents our analysis of publicly available court filings and legal commentary. 
Consult legal counsel for advice specific to your situation.</em>*</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #99: HTTP Caching for Web Scraping]]></title><description><![CDATA[How Conditional Requests Can Cut Your Proxy Bill, using HTTP caching.]]></description><link>https://substack.thewebscraping.club/p/http-caching-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/http-caching-scraping</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 05 Mar 2026 15:18:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5c39bf0e-6c50-4c30-bb29-fe68b7b616d5_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the biggest cost drivers in recurring scraping operations is fetching pages once or several times a day, especially through proxies, only to discover that they have not changed since the last run. <br>In price monitoring applications this is fairly common: if you are monitoring prices every hour across 50,000 product pages, it&#8217;s highly probable that most of them still show the same price they showed an hour ago. You are paying your proxy provider for bandwidth that carries identical data, over and over.</p><p>The scraping industry is well aware of this problem. A <a href="https://scrapeops.io/blog/scraping-shock/">recent analysis by ScrapeOps</a> found that even though proxy prices have dropped by 67% over the past five years, the cost per successful payload has actually increased by 133%, mostly because anti-bot defenses now require heavier infrastructure. 
When each request costs more, wasting them on unchanged pages hurts even more.</p><blockquote><div><hr></div></blockquote><p>Several approaches try to solve this. Tools like <a href="http://changedetection.io">changedetection.io</a> monitor pages for visual or structural changes and alert you when something is different. 
On the more technical side, <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Altay Akkus&quot;,&quot;id&quot;:272178059,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13f918be-3a3b-4cc1-b442-cd912cb5efbe_144x144.png&quot;,&quot;uuid&quot;:&quot;dcac9616-c6b5-4389-950b-847aa67e589d&quot;}" data-component-name="MentionToDOM">Altay Akkus</span> <a href="https://altayakkus.substack.com/p/partial-content-web-crawling-using">recently explored</a> using SimHash as a client-side fingerprint to determine whether a document has changed since the last crawl, without downloading the full body. These are valid strategies, but they all share one trait: they require you to build and maintain the change detection logic yourself.</p><p>What you might not know is that HTTP already has a native mechanism for this, and it has been part of the spec since 1999. The mechanism is called conditional requests, and it lets the server itself tell your scraper &#8220;nothing has changed&#8221; by responding with a 304 status and zero bytes of body. No diffing, no hashing, no client-side state management beyond storing a single header value.</p><p>We have written about proxy cost optimization before in articles like <a href="https://substack.thewebscraping.club/p/optimizing-proxy-costs">Optimizing Proxy Usage for Large-Scale Scraping</a> and <a href="https://substack.thewebscraping.club/p/analyzing-cost-web-scraping">Analyzing the Cost of a Web Scraping Project</a>, but we have never covered this technique. In this article, we will test it against real e-commerce sites and measure exactly how much bandwidth and money it can save.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try 
Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>How HTTP caching works (the short version)</h2><p>When a web server responds to a request, it can include headers that describe the freshness and identity of the content. Two of these headers are relevant for our purposes.</p><p><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/ETag">The first is </a><code>ETag</code><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/ETag">, short for Entity Tag</a>. It is a string that uniquely identifies a specific version of a resource. Think of it as a fingerprint of the page content. When the content changes, the ETag changes.</p><p><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Last-Modified">The second is </a><code>Last-Modified</code>, a timestamp indicating when the resource was last updated.</p><p>These two headers enable what HTTP calls conditional requests. The idea is simple. After your first request, you store the ETag (or the Last-Modified value) returned by the server. On the next request to the same URL, you send it back using the <code>If-None-Match</code> header (for ETags) or <code>If-Modified-Since</code> (for timestamps). The server compares your stored value with the current one. If the resource has not changed, the server responds with status 304 Not Modified and an empty body. Otherwise, you get a regular 200 response with the fresh content.</p><p>A 304 response contains zero bytes of body. For a proxy billed per GB, that is a request that costs almost nothing in bandwidth.</p><h2>The tools we used</h2><p>The HTTP caching technique itself is protocol-level and works with any HTTP client that allows setting custom headers. 
You could implement it with Python&#8217;s <code>requests</code>, <code>httpx</code>, or even raw <code>curl</code>.</p><p>For this article, we used <a href="https://github.com/lexiforest/curl_cffi">curl_cffi</a>, a Python HTTP client built on top of curl-impersonate. Its main strength for our purposes is TLS fingerprint impersonation: it can mimic the TLS handshake of real browsers (Chrome, Firefox, Safari), which helps keep e-commerce sites from blocking the request before we even get to test caching behavior. Without it, some of the e-commerce targets we wanted to test would have returned 403 immediately, making it impossible to evaluate their caching support.</p><p>Later in the article, we&#8217;ll see whether the same approach works with Scrapy.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The audit methodology</h2><p>Before attempting conditional requests, we need to check whether a target supports them. We wrote a simple audit function that makes two requests to any URL.</p><p>The first request is a standard GET. We capture the <code>ETag</code>, <code>Last-Modified</code>, and <code>Cache-Control</code> headers from the response, along with the response body size.</p><p>If an ETag or Last-Modified header is present, we make a second request with the corresponding conditional header (<code>If-None-Match</code> or <code>If-Modified-Since</code>). 
If the server responds with 304, the site supports conditional requests and we measure the bandwidth saving.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;38438c17-6240-4fb7-b820-bcab5f5bf7d7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import time
from curl_cffi import requests


def audit_caching(url: str) -&gt; dict:
    """Fetch the URL once, then repeat the request conditionally to check for 304 support."""
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }

    resp = requests.get(url, headers=headers, impersonate="chrome", timeout=30)

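    # Normalize header names to lowercase so lookups are case-insensitive.
    resp_headers = {k.lower(): v for k, v in resp.headers.items()}
    etag = resp_headers.get("etag")
    last_modified = resp_headers.get("last-modified")
    cache_control = resp_headers.get("cache-control")
    # Body length as returned by the client (after any automatic decompression).
    response_size = len(resp.content)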

    result = {
        "url": url,
        "status": resp.status_code,
        "etag": etag,
        "last_modified": last_modified,
        "cache_control": cache_control,
        "response_size_bytes": response_size,
        "supports_304": False,
    }

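    # Only probe for 304 support when the server exposed a validator.
    if etag or last_modified:
        # Short pause so the two requests are not sent back to back.
        time.sleep(2)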

        cond_headers = dict(headers)
        if etag:
            cond_headers["If-None-Match"] = etag
        if last_modified:
            cond_headers["If-Modified-Since"] = last_modified

        cond_resp = requests.get(
            url, headers=cond_headers, impersonate="chrome", timeout=30
        )

        result["conditional_status"] = cond_resp.status_code
        result["conditional_size_bytes"] = len(cond_resp.content)
        result["supports_304"] = cond_resp.status_code == 304

    return result</code></pre></div><p><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">99.CONDITIONAL_SCRAPING</a>.</strong></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Shopify stores: full conditional request support</h2><p>We focused our testing on Shopify stores because, while working on various scraping projects, we came across several Shopify-hosted sites that had this caching system enabled. Shopify powers hundreds of thousands of online stores and is one of the most common scraping targets in e-commerce, so the finding felt worth investigating systematically. The results were clear: Shopify stores with the native page cache enabled support conditional requests out of the box.</p><p>Allbirds, Kylie Cosmetics, and Brooklinen all returned 304 responses consistently. Here is what we measured on Allbirds:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;df447db3-cd85-4495-a2ce-1e3336e6b09e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">URL: https://www.allbirds.com/products/mens-tree-runners.json
Status: 200
Response size: 7,961 bytes

Caching headers:
  ETag: "page_cache:11044168:ProductDetailsController:de822deb7906aa6f9932541f4fe3dae9"
  Last-Modified: not present
  Cache-Control: not present

Conditional request support:
  304 Not Modified: YES
  Conditional response size: 0 bytes
  Bandwidth saving: 100.0%</code></pre></div><p>The body saving is 100% because the 304 response body contains exactly zero bytes. The only remaining cost is the request and response headers, which are a few hundred bytes.</p><p>This behavior was consistent across three types of Shopify endpoints. The Product HTML page is the standard storefront URL that a browser would load (e.g. <code>/products/mens-tree-runners</code>), which includes the full rendered page with images, reviews, and theme assets. The Product JSON endpoint is the same URL with <code>.json</code> appended (e.g. <code>/products/mens-tree-runners.json</code>), which returns only the structured product data: variants, prices, inventory, and metadata. The Catalog JSON endpoint (<code>/products.json</code>) returns the first page of the store&#8217;s entire product catalog in a single response.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sMAm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sMAm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 424w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 848w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sMAm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sMAm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png" width="914" height="159" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:159,&quot;width&quot;:914,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39412,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/189924926?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sMAm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 424w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 848w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 
1272w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We ran repeated conditional requests on each endpoint and confirmed that all returned 304 consistently. The ETag stayed stable as long as the product data did not change.</p>
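<p>To make the reuse step concrete, here is a minimal sketch of how a scraper could store the ETag from a first response and replay it on later requests. The names <code>ConditionalFetcher</code>, <code>CachedPage</code>, and <code>fake_store</code> are hypothetical and not part of the audit code above, and the stand-in server only mimics the 304 behavior described in this section so the example runs without network access.</p>

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# (status_code, response_headers, body), as returned by the fetch callable.
FetchResult = Tuple[int, Dict[str, str], bytes]


@dataclass
class CachedPage:
    etag: Optional[str]
    body: bytes


@dataclass
class ConditionalFetcher:
    """Hypothetical helper: replays stored ETags and reuses cached bodies on 304."""

    fetch: Callable[[str, Dict[str, str]], FetchResult]
    cache: Dict[str, CachedPage] = field(default_factory=dict)
    body_bytes_saved: int = 0

    def get(self, url: str) -> bytes:
        cached = self.cache.get(url)
        headers: Dict[str, str] = {}
        if cached and cached.etag:
            headers["If-None-Match"] = cached.etag

        status, resp_headers, body = self.fetch(url, headers)

        if status == 304 and cached:
            # Nothing changed: reuse the stored body instead of downloading it again.
            self.body_bytes_saved += len(cached.body)
            return cached.body

        # New or changed content: remember the body and its validator.
        etag = {k.lower(): v for k, v in resp_headers.items()}.get("etag")
        self.cache[url] = CachedPage(etag=etag, body=body)
        return body


def fake_store(url: str, headers: Dict[str, str]) -> FetchResult:
    """Stand-in for a server with conditional request support (assumption for the demo)."""
    current_etag = '"page_cache:demo"'
    if headers.get("If-None-Match") == current_etag:
        return 304, {"ETag": current_etag}, b""
    return 200, {"ETag": current_etag}, b'{"product": {"title": "Demo"}}'


fetcher = ConditionalFetcher(fetch=fake_store)
first = fetcher.get("https://example.com/products/demo.json")   # full download
second = fetcher.get("https://example.com/products/demo.json")  # answered with 304
print(first == second, fetcher.body_bytes_saved)
```

<p>With curl_cffi, the <code>fetch</code> callable could wrap <code>requests.get(url, headers=headers, impersonate="chrome", timeout=30)</code> and return the response&#8217;s <code>status_code</code>, <code>headers</code>, and <code>content</code>. The counter tracks body bytes only; request and response headers are still transferred on every call.</p>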
      <p>
          <a href="https://substack.thewebscraping.club/p/http-caching-scraping">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Kadoa: Simplify Your Scraping Workflows with Automation and AI]]></title><description><![CDATA[My review of Kadoa: An AI-powered tool that lets you create scraping workflows in minutes]]></description><link>https://substack.thewebscraping.club/p/kadoa-review-ai-powered-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/kadoa-review-ai-powered-scraping</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 01 Mar 2026 12:34:31 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ad9731e7-7825-4d82-afea-27d4bd727905_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The web scraping industry has evolved quickly in recent years. The fact that <a href="https://substack.thewebscraping.club/p/the-evolving-career-scrapers">web scraping professionals needed to pivot their careers from scripts to agents</a> is only one sign of how resilient this industry is. The industry has changed not only because of AI, which is relatively recent, but also because of developments in infrastructure, bot detection, and more.</p><p>A wide range of tools and libraries for the main programming languages has driven web scraping to significant growth, and companies&#8217; need for data is what sustains that growth.</p><p>In this article, I&#8217;ll talk about Kadoa, a tool that lets you create resilient scraping workflows in minutes. 
I&#8217;ll show you its strengths, why you should consider it, and how it works, with a practical guide.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What is Kadoa?</h2><p><a href="https://www.kadoa.com/">Kadoa</a> is a web scraping tool that automatically and programmatically extracts web data at scale. 
You can use it either via the UI or via code, as it has SDKs and provides you with REST APIs.</p><p>The best part of using it is that you can just paste the target URL and the tool retrieves the data for you. Forget about anti-bot measures, fingerprinting issues, or proxy management: Kadoa handles all of that for you. Also, thanks to its AI engine, it can automatically recognize the structure of the data you want to scrape from a target website. So you can also say goodbye to <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">CSS selectors and any other strategy you use to go beyond the DOM using LLMs</a>.</p><div><hr></div><blockquote><p><em>Using the right tool is just the first step toward a successful data extraction pipeline. Having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>Why Consider Kadoa for Your Web Scraping Projects?</h2><p>The top reasons why you should consider Kadoa are the following:</p><ul><li><p><strong>Scrape via workflows</strong>: Kadoa&#8217;s UI is built to help you set up scraping workflows step by step. Insert your target URL(s), define the data schema (or let AI do the work for you), and choose whether to scrape all the available pages or to stay on a single page and watch the agent work for you.</p></li><li><p><strong>Write code only if you need it</strong>: Besides the UI, Kadoa provides you with Python and JavaScript SDKs and a wide set of REST APIs you can call. This lets you create workflows via the UI and manage and call them via code if you need to.</p></li><li><p><strong>Integrated data quality management</strong>: Before you start scraping your target data, Kadoa allows you to manage data quality. 
In practice, you can set your own data quality rules or manage the rules that its AI agent provides.</p></li><li><p><strong>Easy proxy management</strong>: If you&#8217;ve been scraping for a while, you know that your chances of successfully scraping most of the content you need without proxies are low. Using proxies is not a big issue if you are used to it and already have a favourite provider. However, Kadoa simplifies proxy management. It provides you with a list of countries you can choose from and, under the hood, it manages everything that&#8217;s needed to integrate proxies in your workflow.</p></li><li><p><strong>Scheduling feature</strong>: There are cases where you need to scrape the same target data from time to time. Or perhaps you&#8217;d like to be notified when data in a target page has changed. Kadoa provides both features. You can schedule your workflow to scrape at precise time intervals. You can also choose among different notifications, one of which is getting notified when data has changed.</p></li></ul><h2>Kadoa&#8217;s Main Features</h2><p>Below is a list of Kadoa&#8217;s top features to help you better understand its potential:</p><ul><li><p><strong>Simple and intuitive UI</strong>: Kadoa&#8217;s UI is simple and intuitive. It allows you to create workflows in minutes. Every scraping workflow is subdivided into steps, and Kadoa guides you through them on separate screens. In a matter of a few minutes, you can define your preferred setup, insert the target page(s), and let it scrape for you.</p></li><li><p><strong>Chrome extension</strong>: Besides the UI, <a href="https://www.kadoa.com/chrome-extension">Kadoa provides you with a Chrome extension</a>. 
If you are a Chrome user, this feature allows you to define everything you need directly on the target page, then trigger the workflow to let Kadoa&#8217;s agent start scraping.</p></li><li><p><strong>Code integrations</strong>: If you are a developer or if you simply need to invoke your workflows via code, Kadoa offers you two possibilities. It provides you with <a href="https://github.com/kadoa-org/kadoa-sdks">Python and JavaScript SDKs</a> in an open-source repository, so that you can use custom code to invoke your scrapers. Also, if you like to use code but prefer <a href="https://docs.kadoa.com/api-reference/introduction">REST APIs, Kadoa provides you with several endpoints</a>.</p></li><li><p><strong>Scraping suitable for structured or unstructured data</strong>: One of the difficult aspects you may encounter when manually scraping websites is defining how to grab unstructured data. This is one of the typical use cases where you could <a href="https://substack.thewebscraping.club/p/detect-pattern-scraped-data-with-ai">use AI to detect patterns in data in your scraping projects</a>. The good news is that you don&#8217;t need to come up with imaginative solutions. Kadoa automatically retrieves unstructured data for you thanks to its AI engine.</p></li><li><p><strong>Data schema definition</strong>: The tool provides you with a feature that allows you to define reusable data structures. This can be helpful when you retrieve similar data from different websites. In such cases, if you let the AI engine define the data structure automatically, you could lose consistency across similar data.</p></li><li><p><strong>Proxy and anti-detection features</strong>: Forget about anti-bot measures and proxy management. Kadoa manages anti-bot solutions under the hood. 
It also provides you with a predefined list of locations you can choose from, and it will automatically select matching proxies.</p></li><li><p><strong>Error handling</strong>: It provides you with advanced error handling. Common cases are when the target site goes offline, is under maintenance, or encounters a technical issue. When this happens, Kadoa detects the problem, notifies you, and automatically retries the data extraction. If recovery still fails, its support team is notified and investigates.</p></li><li><p><strong>Integration capabilities</strong>: The software allows you to integrate with several third parties. A notable one is the <a href="https://n8n.io/integrations/kadoa/">integration between n8n and Kadoa</a>, which lets you take your scraping automation workflow a step further.</p></li><li><p><strong>Pricing model and usage graphs</strong>: Kadoa offers a <a href="https://www.kadoa.com/pricing">free tier option</a> with 500 credits to use. Its pricing model is based on credit consumption, and it provides you with a UI section where you can see a graph of your consumption.</p></li><li><p><strong>Extensive docs</strong>: <a href="https://docs.kadoa.com/docs/introduction">Kadoa has extensive documentation</a> that covers both UI and API usage.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Hands-on Kadoa: Step-by-step Scraping Tutorial</h3><p>In this section, I&#8217;ll show you how to use Kadoa on an actual scraping task via the UI. 
The workflow will retrieve <a href="https://finance.yahoo.com/quote/INTC/history/?period1=1737538396&amp;period2=1769074385&amp;frequency=1wk">Intel&#8217;s historical price from Yahoo Finance</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XhPg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XhPg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 424w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 848w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1272w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XhPg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png" width="1106" height="686" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1106,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138487,&quot;alt&quot;:&quot;Intel historical stock price data, image from their website taken by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Intel historical stock price data, image from their website taken by Federico Trotta" title="Intel historical stock price data, image from their website taken by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!XhPg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 424w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 848w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XhPg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Intel historical stock price data</figcaption></figure></div><p>In this scraping workflow, I will:</p><ul><li><p>Set the target web page.</p></li><li><p>Define the data schema.</p></li><li><p>Set scheduling options and notifications.</p></li><li><p>Retrieve the actual data.</p></li></ul><p>Before starting the actual workflow, log in to Kadoa. 
Below is the first access page you will see:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!taKn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!taKn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 424w, https://substackcdn.com/image/fetch/$s_!taKn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 848w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1272w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!taKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png" width="1456" height="705" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:705,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:166980,&quot;alt&quot;:&quot;Kadoa's first access page by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa's first access page by Federico Trotta" title="Kadoa's first access page by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!taKn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 424w, https://substackcdn.com/image/fetch/$s_!taKn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 848w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1272w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kadoa's first access page</figcaption></figure></div><p>Perfect! You are now ready to create your first scraping workflow with Kadoa.</p><h3>Step #1: Create a New Workflow</h3><p>From the main page, click on <strong>Add workflow</strong> to create a new one and paste the target URL. The <strong>Proxy location</strong> box lets you select the country where the proxies are located; leave it set to <strong>AUTO</strong> to let the tool manage it automatically. 
Click on <strong>Continue</strong> to proceed with the next step:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nd3l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nd3l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 424w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 848w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1272w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png" width="1456" height="713" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:713,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240236,&quot;alt&quot;:&quot;A new workflow in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A new workflow in Kadoa by Federico Trotta" title="A new workflow in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Nd3l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 424w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 848w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1272w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A new workflow in Kadoa</figcaption></figure></div><p>Note that the <strong>Enter one or more URLs</strong> box is where you insert the target page. If you have more than one target page, you can list all of them there.</p><p>Alright, you created a new workflow in Kadoa. 
Let&#8217;s proceed with the next step and customize it!</p><h3>Step #2: Define the Data Schema</h3><p>As the next step, define the data schema:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ib01!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ib01!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 424w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 848w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1272w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ib01!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png" width="1456" height="681" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8febc847-1387-489a-815c-cb8fa342897e_1840x861.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:681,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:200560,&quot;alt&quot;:&quot;Define the data schema in a Kadoa workflow by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Define the data schema in a Kadoa workflow by Federico Trotta" title="Define the data schema in a Kadoa workflow by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Ib01!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 424w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 848w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1272w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Define the data schema in a Kadoa workflow</figcaption></figure></div><p>If you prefer to define the schema manually, Kadoa provides some predefined schemas to start from. For this tutorial, I let AI do the job by selecting <strong>AI Suggest Fields</strong>.</p><p>The system then asks how you want to navigate the data. 
For this example, I decided to scrape only the current page, but you can choose among three different options:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2REz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2REz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 424w, https://substackcdn.com/image/fetch/$s_!2REz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 848w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1272w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2REz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png" width="1456" height="785" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:318930,&quot;alt&quot;:&quot;Scraping data on a single page in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scraping data on a single page in Kadoa by Federico Trotta" title="Scraping data on a single page in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!2REz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 424w, https://substackcdn.com/image/fetch/$s_!2REz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 848w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1272w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Scraping data on a single page in Kadoa</figcaption></figure></div><p>After clicking on <strong>Continue</strong>, the agent will start doing its job:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!05pv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!05pv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 424w, https://substackcdn.com/image/fetch/$s_!05pv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 848w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1272w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!05pv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png" width="1456" height="702" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:702,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183128,&quot;alt&quot;:&quot;Kadoa&#8217;s AI agent working by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s AI agent working by Federico Trotta" title="Kadoa&#8217;s AI agent working by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!05pv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 424w, https://substackcdn.com/image/fetch/$s_!05pv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 848w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1272w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 
6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kadoa&#8217;s AI agent working</figcaption></figure></div><h3>Step #3: Review Extracted Fields and Schedule the Workflow</h3><p>Because I let the AI suggest the fields, the agent automatically tries to extract the data from the target page. 
But before proceeding, Kadoa asks for your review:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uYWt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uYWt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 424w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 848w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 1272w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uYWt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png" width="1456" height="783" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:783,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:305054,&quot;alt&quot;:&quot;The proposed extraction data schema by Kadoa&#8217;s AI agent by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The proposed extraction data schema by Kadoa&#8217;s AI agent by Federico Trotta" title="The proposed extraction data schema by Kadoa&#8217;s AI agent by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!uYWt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 424w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 848w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 1272w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The proposed extraction data schema by Kadoa&#8217;s AI agent</figcaption></figure></div><p>As you can see from the previous image, the agent has correctly detected the data to extract from the target page. 
The tool also shows a screenshot of the data it will extract, so you can see exactly what will be captured.</p><p>In the next step, define the schedule:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qxMP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qxMP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 424w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 848w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1272w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qxMP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png" width="1456" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:260383,&quot;alt&quot;:&quot;Scheduling workflows in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scheduling workflows in Kadoa by Federico Trotta" title="Scheduling workflows in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!qxMP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 424w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 848w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1272w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Scheduling workflows in Kadoa</figcaption></figure></div><p>For the sake of this example, I decided to run the workflow only once. 
But, as you can see, you can choose among several scheduling options.</p><h3>Step #4: Set Notifications and Final Details</h3><p>As the next step, define the way you want to be notified:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CuRd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CuRd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 424w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 848w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1272w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CuRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png" width="1456" height="772" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:772,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:288124,&quot;alt&quot;:&quot;Setting up your Kadoa&#8217;s workflow latest details by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Setting up your Kadoa&#8217;s workflow latest details by Federico Trotta" title="Setting up your Kadoa&#8217;s workflow latest details by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!CuRd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 424w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 848w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1272w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Setting up notifications in Kadoa</figcaption></figure></div><p>In this case, I decided to be notified via email if the workflow fails. 
You can add different notification channels by clicking on <strong>Add channel</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qp2I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qp2I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 424w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 848w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1272w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png" width="856" height="471" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e10a8400-3f90-4793-832d-01537d8ef16b_856x471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:471,&quot;width&quot;:856,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45084,&quot;alt&quot;:&quot;Adding notification channels in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Adding notification channels in Kadoa by Federico Trotta" title="Adding notification channels in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Qp2I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 424w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 848w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1272w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Adding notification channels in Kadoa</figcaption></figure></div><p>Next, define the final details of your scraping workflow:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zCk9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!zCk9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 424w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 848w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1272w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zCk9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png" width="1456" height="674" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:674,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:261303,&quot;alt&quot;:&quot;Define your workflow&#8217;s latest details by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Define your workflow&#8217;s latest details by Federico Trotta" title="Define your workflow&#8217;s latest details by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!zCk9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 424w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 848w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1272w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Defining your workflow&#8217;s final details</figcaption></figure></div><p>Before the actual scraping starts, Kadoa asks you to either approve the sample data it proposes or review the data quality rules:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hpCi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hpCi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 424w, 
https://substackcdn.com/image/fetch/$s_!hpCi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 848w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1272w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hpCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png" width="1456" height="678" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:229183,&quot;alt&quot;:&quot;Decide whether to review rules or not by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Decide whether to review rules or not by Federico Trotta" title="Decide whether to review rules or not by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!hpCi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 424w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 848w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1272w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Decide whether to review data quality rules or not</figcaption></figure></div><p>If you click <strong>Review rules</strong>, the tool suggests automatically generated data quality rules. Select the ones you think will improve the quality of the scraping result:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1tPf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1tPf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 424w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 848w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1272w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1tPf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png" width="1456" height="685" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:328419,&quot;alt&quot;:&quot;Reviewing data quality rules in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reviewing data quality rules in Kadoa by Federico Trotta" title="Reviewing data quality rules in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!1tPf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 424w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 848w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1tPf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Reviewing data quality rules in Kadoa</figcaption></figure></div><p>When you are done reviewing quality rules, click on <strong>Approve</strong>. 
The actual scraping workflow is then launched and placed in the queue:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cov2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cov2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 424w, https://substackcdn.com/image/fetch/$s_!cov2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 848w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1272w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cov2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png" width="1456" height="624" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161278,&quot;alt&quot;:&quot;New Kadoa&#8217;s workflow queued by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="New Kadoa&#8217;s workflow queued by Federico Trotta" title="New Kadoa&#8217;s workflow queued by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!cov2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 424w, https://substackcdn.com/image/fetch/$s_!cov2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 848w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1272w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">New Kadoa&#8217;s workflow queued</figcaption></figure></div><p>Et voil&#224;! 
You have launched your first scraping workflow with Kadoa.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Download Data, See Logs and Statistics in Kadoa</h3><p>The <strong>workflow</strong> section reports all the workflows you created, their status, and the token consumption for each scraper:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GQl0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GQl0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 424w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 848w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GQl0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GQl0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png" width="1456" height="631" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:631,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176411,&quot;alt&quot;:&quot;Kadoa&#8217;s workflows summary and statistics by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s workflows summary and statistics by Federico Trotta" title="Kadoa&#8217;s workflows summary and statistics by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!GQl0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 424w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 848w, 
https://substackcdn.com/image/fetch/$s_!GQl0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1272w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kadoa&#8217;s workflows summary and statistics</figcaption></figure></div><p>By clicking on one workflow, you can see 
the data it retrieved and choose the format in which to download it:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QgG1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QgG1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 424w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 848w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1272w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QgG1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png" width="1456" height="724" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:724,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233419,&quot;alt&quot;:&quot;Visualizing and retrieving scraped data in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Visualizing and retrieving scraped data in Kadoa by Federico Trotta" title="Visualizing and retrieving scraped data in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!QgG1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 424w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 848w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1272w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Visualizing and retrieving scraped data in Kadoa</figcaption></figure></div><p>The <strong>Activity log</strong> page reports detailed logs of every action that occurred on your workflows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UduM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UduM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 424w, https://substackcdn.com/image/fetch/$s_!UduM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 848w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1272w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UduM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png" width="1456" height="686" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178535,&quot;alt&quot;:&quot;Kadoa&#8217;s logs page by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s logs page by Federico Trotta" title="Kadoa&#8217;s logs page by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!UduM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 424w, https://substackcdn.com/image/fetch/$s_!UduM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 848w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1272w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 
9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kadoa&#8217;s logs page</figcaption></figure></div><p>The <strong>Usage</strong> page shows charts of the trend in active workflows and the number of rows extracted per workflow, as well as the total tokens remaining on your plan:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!snXw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!snXw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 424w, 
https://substackcdn.com/image/fetch/$s_!snXw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 848w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1272w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!snXw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png" width="1456" height="843" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:843,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:139493,&quot;alt&quot;:&quot;Kadoa&#8217;s tokens usage page by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s tokens usage page by Federico Trotta" title="Kadoa&#8217;s tokens usage page by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!snXw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 424w, https://substackcdn.com/image/fetch/$s_!snXw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 848w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1272w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kadoa&#8217;s tokens usage page</figcaption></figure></div><div><hr></div><h2>Manage Kadoa&#8217;s Workflows via APIs</h2><p>As introduced before, <a href="https://docs.kadoa.com/api-reference/introduction">Kadoa provides several REST API endpoints</a>. The APIs also let you perform actions that are not strictly tied to workflows you have already created. 
For example, you can start <a href="https://docs.kadoa.com/api-reference/crawling/start-crawling-session">crawling sessions</a> and <a href="https://docs.kadoa.com/api-reference/schemas/create-schema">create data schemas</a>.</p><p>Before using the API, get your API Key under the <strong>Settings</strong> page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pwMH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pwMH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 424w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 848w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1272w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pwMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png" width="1456" height="374" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108703,&quot;alt&quot;:&quot;Get your Kadoa API key by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Get your Kadoa API key by Federico Trotta" title="Get your Kadoa API key by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!pwMH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 424w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 848w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1272w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Get your Kadoa API key</figcaption></figure></div><p>To manage existing workflows, whether they were created via the UI or the API, you need the specific workflow&#8217;s ID, which you can retrieve from the UI.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b0VW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!b0VW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 424w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 848w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1272w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b0VW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png" width="1456" height="205" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:205,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66109,&quot;alt&quot;:&quot;Get a workflow ID by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Get a workflow ID by Federico Trotta" title="Get a workflow ID by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!b0VW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 424w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 848w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1272w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Get a workflow ID</figcaption></figure></div><p>Then you can perform several actions by calling the REST endpoints. For example, you can <a href="https://docs.kadoa.com/api-reference/workflows/schedule-a-workflow">schedule a particular workflow</a> for later:</p><pre><code><code>curl --request PUT \
  --url 'https://api.kadoa.com/v4/workflows/{workflowId}/schedule' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: &lt;api-key&gt;' \
  --data '
{
  "date": "2025-02-07T10:00:00.000Z"
}
'</code></code></pre><p>In this request, replace the following:</p><ul><li><p><em>{workflowId}</em>: The ID of the workflow you want to schedule.</p></li><li><p><em>&lt;api-key&gt;</em>: Your Kadoa API key.</p></li><li><p><em>date</em>: The date and time at which you want the workflow to start the scraping task, in ISO format and in UTC.</p></li></ul><h2>Kadoa: Final Comments</h2><p>After analyzing and testing the tool, here is what I consider its main advantages and disadvantages:</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Ready for AI integration. You can download the scraped data or integrate it into your AI projects directly via API.</p></li><li><p>Suits different kinds of users, as it provides APIs, SDKs, and a UI.</p></li><li><p>Supports structured output formats, including JSON.</p></li><li><p>Offers virtually unlimited scalability, both in infrastructure management and in the number of URLs to scrape.</p></li><li><p>Focuses on data quality before scraping, not afterward.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>It currently supports only 5 proxy locations.</p></li><li><p>You can&#8217;t scrape every website you&#8217;d like, as some URLs are unsupported:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nPP4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nPP4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 424w, 
https://substackcdn.com/image/fetch/$s_!nPP4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 848w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1272w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nPP4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png" width="1226" height="616" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:1226,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93999,&quot;alt&quot;:&quot;Unsupported scraping URL in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Unsupported scraping URL in Kadoa by Federico Trotta" title="Unsupported scraping URL in Kadoa by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!nPP4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 424w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 848w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1272w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Unsupported scraping URL in Kadoa</figcaption></figure></div><h2>Conclusion</h2><p>In this article, I&#8217;ve presented Kadoa, an AI-powered scraping tool that helps you simplify your scraping projects. As you&#8217;ve seen, it&#8217;s a ready-to-use tool that creates scraping workflows in minutes via the UI and can also be driven from code.</p><p>Let us know in the comments: Had you heard of this tool before? Have you already tested it?</p>]]></content:encoded></item><item><title><![CDATA[Why LLM-Ready Scrapers Return Content in Markdown: A Deep Dive]]></title><description><![CDATA[Why do all AI-ready scraping solutions produce Markdown results? Let&#8217;s find out!]]></description><link>https://substack.thewebscraping.club/p/why-scraping-return-markdown-llm-ai</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/why-scraping-return-markdown-llm-ai</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 22 Feb 2026 21:35:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3da96938-1add-42c9-a64d-3888021f9eba_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What do Firecrawl&#8217;s <em>/scrape</em> endpoint, Bright Data&#8217;s Web Unlocker API, Crawl4AI, and a whole bunch of other AI-ready scraping libraries and products have in common? They all give you the option to return scraped content from web pages in Markdown (and sometimes, it&#8217;s even the default behavior!). 
Ever wondered why?</p><p>In this article, I&#8217;ll break down the main reasons so you can understand why LLM-ready scrapers work this way, and how you could even build a simple one yourself!</p><h2>A Brief Reminder About Web Scraping Tools for AI</h2><p>With the <a href="https://substack.thewebscraping.club/p/how-ai-is-changing-the-web-scraping">recent rise of AI</a>, some web scraping solutions have specialized in returning content optimized for LLM ingestion.</p><p>That means the content returned by <a href="https://substack.thewebscraping.club/p/web-scraping-ai-tools-landscape">AI-ready web unlockers or open-source scraping libraries</a> isn&#8217;t just plain HTML. Instead, you often get an optimized Markdown version of the page. (Sometimes it&#8217;s even parsed JSON, but that&#8217;s a different story I won&#8217;t cover here.)</p><p>The Markdown content is then ready to be processed by an LLM as part of an AI agent, an AI workflow or pipeline, a multi-agent system, or a similar setup. 
In some cases, <a href="https://substack.thewebscraping.club/p/build-an-ai-agent-for-scraping-papers">these web scraping tools are even accessed autonomously by AI agents</a>, which decide when to use them to retrieve web content based on the task at hand.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Why AI-Ready Web Scrapers Choose Markdown in the First Place</h2><p>Let me introduce you to Markdown, the language spoken by LLMs.</p><p><strong>Note</strong>: I assume you already know what Markdown is, but if not, <a href="https://en.wikipedia.org/wiki/Markdown">read its Wikipedia page</a> (as it gives a quick overview with everything you need to know about its syntax).</p><h3>A Bit of Context About Data Formats in LLMs</h3><p>Most LLMs can handle pretty much any text-based format you throw at them, whether it&#8217;s plain text, HTML, JSON, CSV, XML, or others. <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">Some even have vision capabilities</a> and can understand images or other multimodal content.</p><p>Still, under the hood, most LLMs actually &#8220;speak&#8221; Markdown. That&#8217;s how they handle code blocks, tables, and other structured content, if you&#8217;ve ever wondered&#8230;</p><p>I&#8217;m sure I&#8217;m not the only one who has received a response from ChatGPT or Gemini in pure Markdown, even if I didn&#8217;t ask for it. 
Or sometimes, you can even catch the LLM responding in Markdown, and the page renders it in real time:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e_7Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e_7Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 424w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 848w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 1272w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e_7Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png" width="1010" height="563" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:563,&quot;width&quot;:1010,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the `` characters returned by the LLM&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the `` characters returned by the LLM" title="Note the `` characters returned by the LLM" srcset="https://substackcdn.com/image/fetch/$s_!e_7Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 424w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 848w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 1272w, https://substackcdn.com/image/fetch/$s_!e_7Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4bc03-723f-4cfc-9dda-641787704551_1010x563.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the triple-backtick characters returned by the LLM</figcaption></figure></div><p>Note that the LLM is writing the &#8220;```&#8221; characters used in Markdown to delimit code blocks.</p><h3>Why Markdown Is a Perfect Data Format for LLMs (and AI-Ready Scrapers Use It Too)</h3><p>So, cool, LLMs love Markdown and use it behind the scenes. But why? Well, because Markdown is a versatile format that hits all the sweet spots LLMs care about:</p><ul><li><p><strong>Structured content</strong>: Markdown gives you hierarchy and organization out of the box (H1, H2, H3, lists, images, code blocks, etc.), making it easy for LLMs to parse and understand the structure of your content.</p></li><li><p><strong>Concise and LLM-friendly</strong>: Compared to raw HTML or JSON, Markdown is much more concise. 
Less unnecessary markup or structure means fewer tokens consumed, which also reduces the risk of hallucinations, truncations, or context overflows.</p></li><li><p><strong>De facto standard</strong>: While there&#8217;s no single formal Markdown standard, <a href="https://github.github.com/gfm/">GitHub Flavored Markdown</a> has become the widely adopted baseline, so most tools and scrapers default to it.</p></li><li><p><strong>Rich content support</strong>: Markdown supports images, links, tables, code snippets (and in some cases, such as with MDX/MarkdownX, even raw HTML or embedded React components), making it flexible for a wide range of content types.</p></li><li><p><strong>Alignment with training data</strong>: LLMs are trained on <a href="https://commoncrawl.org/">massive datasets like Common Crawl</a>, where a huge portion of high-quality technical documentation (READMEs, wikis, Stack Overflow posts, etc.) is written in Markdown. This means most AI models don&#8217;t just &#8220;understand&#8221; Markdown. They learned to reason through its structure during training, which gives them a natural intuition for the format.</p></li></ul><p>Long story short, that&#8217;s why most web scraping solutions built for AI integrations return content in Markdown (or at least give you the option).</p><p>By converting a scraped HTML page directly into Markdown, AI-ready scrapers help the underlying LLM (whether it&#8217;s part of a machine learning pipeline, AI agent, <a href="https://substack.thewebscraping.club/p/web-scraping-assistant-gpt">RAG workflow</a>, plugin, or other application) process the content efficiently and effectively while also saving on token usage.</p><h3>Markdown vs HTML</h3><p>Still not convinced? 
Take a look at an HTML-to-Markdown conversion of a <a href="https://www.espn.com/tennis/story/_/id/45732583/jannik-sinner-defeats-carlos-alcaraz-rematch-win-wimbledon-2025-men-singles-title">sports news page from ESPN</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ac9N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ac9N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 424w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 848w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 1272w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ac9N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png" width="1456" height="671" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:671,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;HTML vs Markdown representation of the same news article&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="HTML vs Markdown representation of the same news article" title="HTML vs Markdown representation of the same news article" srcset="https://substackcdn.com/image/fetch/$s_!Ac9N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 424w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 848w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 1272w, https://substackcdn.com/image/fetch/$s_!Ac9N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bbff4cc-6c8f-448b-b6de-1ede5679fc5e_3009x1386.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">HTML vs Markdown representation of the same news article</figcaption></figure></div><p>As you can see, the original HTML page contains 125.88 KB of content. After converting to Markdown, it drops to 35.84 KB. That&#8217;s a <strong>~72% reduction</strong> in size (the Markdown output is about 28% of the original) just from a simple data format conversion, without any significant loss of actual content!</p><p>If we look at token usage, the difference can appear even more striking. 
The original HTML page translates to 40,125 tokens:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dVDt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dVDt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 424w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 848w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 1272w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dVDt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png" width="1456" height="707" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Tokens for the HTML page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Tokens for the HTML page" title="Tokens for the HTML page" srcset="https://substackcdn.com/image/fetch/$s_!dVDt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 424w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 848w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 1272w, https://substackcdn.com/image/fetch/$s_!dVDt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc4b574-b221-486e-a8a6-e903332bf745_3022x1468.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokens for the HTML page</figcaption></figure></div><p>Meanwhile, the Markdown version corresponds to only 11,175 tokens:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2LXh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2LXh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 424w, 
https://substackcdn.com/image/fetch/$s_!2LXh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 848w, https://substackcdn.com/image/fetch/$s_!2LXh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 1272w, https://substackcdn.com/image/fetch/$s_!2LXh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2LXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png" width="1456" height="703" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:703,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Tokens for the Markdown-converted page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Tokens for the Markdown-converted page" title="Tokens for the Markdown-converted page" srcset="https://substackcdn.com/image/fetch/$s_!2LXh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 424w, 
https://substackcdn.com/image/fetch/$s_!2LXh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 848w, https://substackcdn.com/image/fetch/$s_!2LXh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 1272w, https://substackcdn.com/image/fetch/$s_!2LXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F049f3230-d1e3-4097-a359-071b3240e65a_3027x1462.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tokens for the Markdown-converted page</figcaption></figure></div><p>Again, that&#8217;s roughly a <strong>3.6&#215; reduction in token usage</strong> (which usually translates directly into cost savings, since most AI providers bill by the number of tokens processed).</p><p>For a more direct comparison between data formats, explore the <a href="https://www.kaggle.com/code/brightdataml/benchmarking-ai-on-different-data-formats">AI data format comparison</a> research piece (which I wrote in collaboration with Bright Data on its Kaggle account).</p><h3>But Raw Markdown Alone Isn&#8217;t Enough&#8230;</h3><p>Now, you might be thinking: <em>&#8220;Okay, I&#8217;ll just convert HTML pages to Markdown using one of the many HTML-to-Markdown libraries out there, and I&#8217;m done.&#8221;</em> Well&#8230; not quite.</p><p>A direct HTML-to-Markdown conversion isn&#8217;t enough. Below are the main reasons why, and how to address each one.</p><h3>1. Non-Content HTML Tags Get Treated as Content</h3><p>HTML pages are full of blocks that are required for rendering but are useless for understanding the page itself. Think <em>&lt;script&gt;</em>, <em>&lt;style&gt;</em>, inline JSON configs, and similar elements.</p><p>After all, those HTML blocks contain plain text. 
Thus, conversion libraries (rightfully) treat them like any other text node and include them in the Markdown output:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u51a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u51a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 424w, https://substackcdn.com/image/fetch/$s_!u51a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 848w, https://substackcdn.com/image/fetch/$s_!u51a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!u51a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u51a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png" width="1456" height="625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:625,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Notice how the <script> tags are converted&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Notice how the <script> tags are converted" title="Notice how the <script> tags are converted" srcset="https://substackcdn.com/image/fetch/$s_!u51a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 424w, https://substackcdn.com/image/fetch/$s_!u51a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 848w, https://substackcdn.com/image/fetch/$s_!u51a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!u51a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f2857f-e2ed-4d1f-a0e6-e52b90dc7aca_2954x1268.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Notice how the &lt;script&gt; tags are converted</figcaption></figure></div><p>This pollutes the output with noise the LLM doesn&#8217;t need and greatly increases token usage, as <em>&lt;script&gt;</em> and <em>&lt;style&gt;</em> blocks can be surprisingly long.</p><p><strong>&#127919; Solution</strong>: Use an HTML parser (or, in simpler cases, well-scoped regexes) to remove <em>&lt;script&gt;</em>, <em>&lt;style&gt;</em>, and similar rendering-only HTML elements before converting the page to Markdown.</p><p>For instance, simply removing <em>&lt;script&gt;</em> and <em>&lt;style&gt;</em> tags from the input HTML yields an impressive reduction, from 1.93 MB down to 68.37 KB, which also translates into huge token savings!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!RvpI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RvpI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 424w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 848w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RvpI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png" width="1456" height="634" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:634,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the size of the new 
output&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the size of the new output" title="Note the size of the new output" srcset="https://substackcdn.com/image/fetch/$s_!RvpI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 424w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 848w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!RvpI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b306d5a-6de0-450d-ba8b-1b866420ed3a_2957x1288.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 
15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the size of the new output</figcaption></figure></div><h3>2. Ads and Promotional Content</h3><p>Ads, sponsored sections, and &#8220;recommended for you&#8221; blocks might have nothing to do with the main content of the page. Leaving them in the converted Markdown can confuse the LLM or skew its understanding of what the page is really about.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kFvv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kFvv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 424w, https://substackcdn.com/image/fetch/$s_!kFvv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 848w, 
https://substackcdn.com/image/fetch/$s_!kFvv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!kFvv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kFvv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png" width="1456" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The rendered <iframe> ad is leaking content into the output Markdown&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The rendered <iframe> ad is leaking content into the output Markdown" title="The rendered <iframe> ad is leaking content into the output Markdown" srcset="https://substackcdn.com/image/fetch/$s_!kFvv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 424w, 
https://substackcdn.com/image/fetch/$s_!kFvv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 848w, https://substackcdn.com/image/fetch/$s_!kFvv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!kFvv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0417bea-79ab-4b17-af4e-70e1b790b6a0_2950x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The rendered &lt;iframe&gt; ad is leaking content into the output Markdown</figcaption></figure></div><p><strong>&#127919; Solution</strong>: Use proxies that support ad-blocking when retrieving the HTML page, <a href="https://adguard.com/en/blog/adguard-for-linux-nightly.html?utm_source=reddit">enable OS-based ad blockers</a> on your deployment server, or apply rules to remove ads after fetching the HTML and before converting it to Markdown.</p><h3>3. Navigation, Headers, and Footers</h3><p>Menus, breadcrumbs, and footer links are all technically &#8220;content,&#8221; but could be semantically irrelevant for your use case (particularly if you&#8217;re not interested in links for crawling or further exploration).</p><p>If those elements aren&#8217;t removed or downweighted, they increase token usage. Plus, the LLM may overemphasize them or mistake them for part of the main content.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DwXc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DwXc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 424w, https://substackcdn.com/image/fetch/$s_!DwXc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 848w, 
https://substackcdn.com/image/fetch/$s_!DwXc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 1272w, https://substackcdn.com/image/fetch/$s_!DwXc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DwXc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png" width="1215" height="1157" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1157,&quot;width&quot;:1215,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The conversion of the <header> element results in a list of URLs appearing in the target Markdown&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The conversion of the <header> element results in a list of URLs appearing in the target Markdown" title="The conversion of the <header> element results in a list of URLs appearing in the target Markdown" srcset="https://substackcdn.com/image/fetch/$s_!DwXc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 424w, 
https://substackcdn.com/image/fetch/$s_!DwXc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 848w, https://substackcdn.com/image/fetch/$s_!DwXc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 1272w, https://substackcdn.com/image/fetch/$s_!DwXc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab1576b-2217-4b70-9fc9-e92592d9f922_1215x1157.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The conversion of the &lt;header&gt; element results in a list of URLs appearing in the target Markdown</figcaption></figure></div><p><strong>&#127919; Solution</strong>: Conveniently remove tags like <em>&lt;header&gt;</em> and <em>&lt;footer&gt;</em>, or design your HTML-to-Markdown system to accept only specific CSS selectors for the blocks you want to include in the conversion process (<a href="https://docs.crawl4ai.com/core/content-selection/">just like Crawl4AI does</a>).</p><h3>4. Repeated and Boilerplate Text and Content</h3><p>Things like &#8220;Sign up,&#8221; &#8220;Log in,&#8221; newsletter popups, cookie banners, or legal disclaimers (like GDPR notices) appear on almost every page of a site. Including them wastes tokens and adds repetition, which can degrade reasoning quality and increase the risk of hallucinations.</p><p><strong>&#127919; Solution</strong>: This is a tricky problem, as there&#8217;s no easy way to identify and remove all of these elements automatically. I know for a fact that some industry leaders have trained small LLMs specifically for this task, letting them process the remaining HTML (after earlier cleaning steps) to filter out all irrelevant content.</p><h3>How to Convert a Web Page from HTML to LLM-Optimized Markdown</h3><p>I was recently asked by a client to analyze specific web pages from competitors&#8217; websites. These included structured pages with hidden elements that required basic interactions (like dropdowns). Plus, some information was spread across badge images, links, etc.</p><p>Now, if you&#8217;re trying to get high-level insights from web pages, your first idea might be to just copy all the text on a page (CTRL+A + CTRL+C) and paste it into ChatGPT (or a similar AI solution), analyzing it with the right prompt. 
That&#8217;s far from optimal, because you lose structure, links, image URLs, and other important context.</p><p>Instead, I wrote a simple Python script that:</p><ol><li><p>Reads HTML from an <em>index.html</em> file.</p></li><li><p>Parses it with Beautiful Soup and removes the <em>&lt;script&gt;</em>, <em>&lt;style&gt;</em>, <em>&lt;header&gt;</em>, and <em>&lt;footer&gt;</em> nodes.</p></li><li><p>Keeps only the contents of the <em>&lt;body&gt;</em> tag, restricting the input to what you&#8217;re typically interested in.</p></li><li><p>Converts the result to Markdown using <em><a href="https://github.com/matthewwithanm/python-markdownify">markdownify</a></em>.</p></li><li><p>Writes the output to an <em>output.md</em> file.</p></li></ol><p>Let me show you this script!</p><h3>HTML to LLM-Optimized Markdown Script</h3><p>Here&#8217;s the simple script for converting HTML files to LLM-ready Markdown outputs:</p><pre><code># pip install beautifulsoup4 lxml markdownify

from bs4 import BeautifulSoup
from markdownify import markdownify as md

def html_to_markdown(html_input_path: str, output_markdown_path: str):
    # Load the input HTML from a file
    with open(html_input_path, "r", encoding="utf-8") as f:
        html = f.read()

    # Parse the HTML (using lxml for high performance)
    soup = BeautifulSoup(html, "lxml")

    # Remove the undesired tags
    for tag in soup(["script", "style", "header", "footer"]):
        tag.decompose()

    # Keep only the &lt;body&gt; content (if present)
    body_html = soup.body.decode_contents() if soup.body else str(soup)

    # Convert the HTML to Markdown
    markdown = md(
        body_html,
        bs4_options="lxml" # Set the underlying HTML parser
    )

    # Write the Markdown output to disk
    with open(output_markdown_path, "w", encoding="utf-8") as f:
        f.write(markdown)

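# ---------------------------------------------------------------------------
# Optional sketch, not part of the original script: a selector-based variant.
# Rather than only stripping tags, it keeps just the nodes matching the CSS
# selectors you pass in. The function name, its parameters, and the use of the
# built-in "html.parser" are illustrative assumptions; adapt them as needed.
# ---------------------------------------------------------------------------
from bs4 import BeautifulSoup  # repeated so the helper can be copied on its own
from markdownify import markdownify as md


def html_to_markdown_selected(html: str, selectors: list[str]):
    # Parse the HTML ("lxml" works here too, as in the function above)
    soup = BeautifulSoup(html, "html.parser")

    # Remove the same undesired tags before selecting
    for tag in soup(["script", "style", "header", "footer"]):
        tag.decompose()

    # Find every node matching at least one selector
    nodes = soup.select(", ".join(selectors))

    # Skip matches nested inside another match to avoid duplicated output
    matched_ids = {id(node) for node in nodes}
    top_level = [
        node for node in nodes
        if not any(id(parent) in matched_ids for parent in node.parents)
    ]

    # Convert each selected block to Markdown and join the results
    return "\n\n".join(md(str(node)).strip() for node in top_level)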

if __name__ == "__main__":
    html_to_markdown("index.html", "output.md")</code></pre><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>How to Use the Script</h3><p>The script above still involves a few manual steps, but it greatly improves the transformation of a web page into content that&#8217;s ready to be sent to any LLM.</p><p>First, <a href="https://substack.thewebscraping.club/p/anycrawl-testing-the-llm-ready-web">load the target page</a> in your browser (ideally with an ad blocker enabled):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ruqB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ruqB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 424w, https://substackcdn.com/image/fetch/$s_!ruqB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 848w, https://substackcdn.com/image/fetch/$s_!ruqB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 
1272w, https://substackcdn.com/image/fetch/$s_!ruqB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ruqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target page" title="The target page" srcset="https://substackcdn.com/image/fetch/$s_!ruqB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 424w, https://substackcdn.com/image/fetch/$s_!ruqB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 848w, https://substackcdn.com/image/fetch/$s_!ruqB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ruqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa07616-22ef-4fce-a426-42f281547f69_3045x1604.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The target page</figcaption></figure></div><p>Once the page has fully rendered, right-click and select the &#8220;Inspect&#8221; entry. 
Locate the <em>&lt;html&gt;</em> tag, then use the &#8220;Copy &gt; Copy outerHTML&#8221; option to get the complete HTML of the rendered page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CTKj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CTKj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 424w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 848w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 1272w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CTKj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png" width="1456" height="789" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:789,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Copying the rendered HTML of the target page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Copying the rendered HTML of the target page" title="Copying the rendered HTML of the target page" srcset="https://substackcdn.com/image/fetch/$s_!CTKj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 424w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 848w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 1272w, https://substackcdn.com/image/fetch/$s_!CTKj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f9f63b-2487-44e8-9527-d3e958e38631_2249x1219.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Copying the rendered HTML of the target page</figcaption></figure></div><p><strong>Note</strong>: Copying the rendered HTML is better than copying the HTML from the &#8220;View page source&#8221; option. 
The latter misses all dynamic content (basically, anything that requires JavaScript execution and rendering in the browser won&#8217;t appear in the raw page source).</p><p>Next, in your project folder, paste the HTML into a file named <em>index.html</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0mEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0mEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0mEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png" width="1456" height="865" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The index.html file in the project folder&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The index.html file in the project folder" title="The index.html file in the project folder" srcset="https://substackcdn.com/image/fetch/$s_!0mEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!0mEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe229e379-8f2b-41cd-9d3c-7be1b7da1803_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The index.html file in the project folder</figcaption></figure></div><p>Run the Python script, and it&#8217;ll generate an <em>output.md</em> file:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hGQJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hGQJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 424w, 
https://substackcdn.com/image/fetch/$s_!hGQJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!hGQJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!hGQJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hGQJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png" width="1456" height="865" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The resulting output.md file generated by the script&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The resulting output.md file generated by the script" title="The resulting output.md file generated by the script" srcset="https://substackcdn.com/image/fetch/$s_!hGQJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 424w, 
https://substackcdn.com/image/fetch/$s_!hGQJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!hGQJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!hGQJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89696d0e-a896-4950-8d1e-b421ad365221_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The resulting output.md file generated by the script</figcaption></figure></div><p>You can now pass the resulting Markdown to an LLM for processing.</p><p>Compared to a traditional HTML-to-Markdown approach, the tweaks in this process save tons of tokens. In particular, the output produced by this method is just 41.2 KB (compared to 611.68 KB for the original HTML), which corresponds to <strong>11,006 tokens</strong>.</p><p>If you applied a basic HTML-to-Markdown conversion, you&#8217;d end up with a 430.21 KB Markdown file, resulting in <strong>154,191 tokens</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Z-C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Z-C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 424w, https://substackcdn.com/image/fetch/$s_!4Z-C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 848w, https://substackcdn.com/image/fetch/$s_!4Z-C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4Z-C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Z-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png" width="1456" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Traditional HTML-to-Markdown conversion approach&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Traditional HTML-to-Markdown conversion approach" title="Traditional HTML-to-Markdown conversion approach" srcset="https://substackcdn.com/image/fetch/$s_!4Z-C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 424w, https://substackcdn.com/image/fetch/$s_!4Z-C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 848w, https://substackcdn.com/image/fetch/$s_!4Z-C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4Z-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14ca52df-cddb-4cc6-944d-0f13073acf9b_2949x1275.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Traditional HTML-to-Markdown conversion approach</figcaption></figure></div><p>In other words, these basic tricks lead to <strong>over 14&#215; token savings</strong>. Not bad!</p><p>Et voil&#224;! 
Simple, manual, but highly effective.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Next Step</h3><p>The script I just shared achieves its goal, but it&#8217;s deliberately basic. I presented it simply to show that a plain HTML-to-Markdown conversion is far from optimal.</p><p>For a more sophisticated result, you could integrate a similar process into your LLM-ready scraper, including CLI options for more control over which tags to remove, which nodes to select, and other conversion settings.</p><p>As a project idea, you could even turn this approach into a browser extension that converts rendered web pages in a user&#8217;s browser into LLM-ready Markdown output files. Of course, if you go this route, make sure to follow <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical web scraping practices</a>.</p><h2>Is Markdown Always the Right Choice?</h2><p>This is the final question to ask after all this discussion. Now, you might be thinking: <em>&#8220;Okay, there are no good reasons to stick with plain HTML when passing web pages to an LLM.&#8221;</em></p><p>That&#8217;s not true, since there are situations where access to the raw HTML makes a difference. Think of cases where the HTML contains semantic attributes or metadata. 
This information would be lost during HTML-to-Markdown conversion.</p><p>Specifically, these are some scenarios where sticking to HTML for LLM ingestion is beneficial:</p><ul><li><p><em><a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Global_attributes/data-*">data-*</a></em> attributes storing product IDs, prices, etc.</p></li><li><p><a href="https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA">ARIA attributes</a> that convey accessibility or structural information.</p></li><li><p><em>class</em> or other HTML attributes that reveal context beyond the visible content on a node.</p></li><li><p>HTML comments that contain useful information about the page.</p></li></ul><p>For example, consider this HTML node on an Amazon page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1MJt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1MJt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 424w, https://substackcdn.com/image/fetch/$s_!1MJt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 848w, https://substackcdn.com/image/fetch/$s_!1MJt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1MJt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1MJt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the hidden element on this Amazon product page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the hidden element on this Amazon product page" title="Note the hidden element on this Amazon product page" srcset="https://substackcdn.com/image/fetch/$s_!1MJt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 424w, https://substackcdn.com/image/fetch/$s_!1MJt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 848w, https://substackcdn.com/image/fetch/$s_!1MJt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1MJt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9ef9b0b-f54b-48a0-b5c2-4917912076ef_2434x1366.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the hidden element on this Amazon product page</figcaption></figure></div><p>This is just an empty <em>&lt;span&gt;</em>, but its <em>data-state</em> attribute contains information-rich JSON data that you would lose during the Markdown conversion (as the node doesn&#8217;t contain text).</p><p>Another common example is visual elements, 
which often carry semantic information not captured by visible text. For instance, based on the image below, you might think the rating is 5/5, but the aria-label attribute reveals it&#8217;s actually 4.3/5:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Uznb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Uznb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 424w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 848w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Uznb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png" width="1456" height="694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the aria-label attribute&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the aria-label attribute" title="Note the aria-label attribute" srcset="https://substackcdn.com/image/fetch/$s_!Uznb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 424w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 848w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!Uznb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf4a8c3-aac4-4571-bc5b-5bbf4270eed7_2325x1108.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Note the aria-label attribute</figcaption></figure></div><p>In short&#8212;let&#8217;s be honest, as it always happens in IT&#8212;converting to Markdown isn&#8217;t a one-size-fits-all solution. 
Therefore, it&#8217;s no surprise that most web scraping solutions built for direct AI integrations also offer the option to return raw HTML.</p><p>That said, based on my experience in the field and everything highlighted here, I highly recommend sticking to Markdown when feeding web pages to LLMs for processing or data parsing, as the benefits far outweigh the downsides in the vast majority of use cases.</p><div><hr></div><h2>Conclusion</h2><p>In this post, I&#8217;ve outlined why Markdown is the language of LLMs and, consequently, the preferred output format for all web scraping tools that integrate directly into AI systems like workflows, pipelines, agents, and so on.</p><p>The reasons are intuitive: Markdown is concise and strips unnecessary markup (reducing token usage) while preserving structure, images, links, lists, tables, and more.</p><p>As highlighted, plain HTML-to-Markdown conversion isn&#8217;t always optimal, and you need to apply some extra tricks to get the best results.</p><p>If you have any questions or comments, drop them below. 
Until next time!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #98: Scraping Google Search Results in 2026: Device, Location, and Identity]]></title><description><![CDATA[Google does not have one set of results. It has millions. The hard part is knowing which one you are looking at.]]></description><link>https://substack.thewebscraping.club/p/scraping-serp-google-search</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/scraping-serp-google-search</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 19 Feb 2026 06:00:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/97f57072-a449-4a11-beab-aad59c0ec80c_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you use a search engine, you have probably noticed that results are not static. The same query returns different results depending on where you are, what device you use, and whether you are logged into a Google account. <br>When it comes to SERP scraping, this adds several layers of complexity. 
With most scraping targets, you send a request and get the page. With search engines, you send a request and get <em>a version</em> of the page, shaped by signals you may not even be aware of.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>This makes SERP scraping fundamentally different from conventional web scraping. The data you collect is only as reliable as your control over these variables. 
Scrape from a datacenter IP in Virginia with a desktop Chrome fingerprint while logged out, and you will get one set of results. Scrape the same query from a mobile device in Milan while logged into a Google account, and you will get something entirely different. Both are &#8220;correct&#8221; Google results. Neither tells the full story.</p><p>In this article of The Lab, we wanted to understand how much these variables actually change the output, and more importantly, how to control them reliably. </p><h2>Google does not want you scraping its results</h2><p>Before we get into the technical setup, we need to acknowledge something that changed the landscape significantly in early 2025.<br><br>Starting in January 2025, Google began releasing SearchGuard, a technical protection measure designed to make scraping search results harder. </p><p>SearchGuard works by sending JavaScript challenges to search queries originating from unrecognized sources, <a href="https://substack.thewebscraping.club/p/google-hiding-serp-results-javascripts">as we covered on these pages when it started</a>. When a query arrives, Google&#8217;s system transmits JavaScript code that requires the browser to compute and return a &#8220;solve&#8221;, a set of specific information about the browser environment and the user generating the request. For human users, the solution happens transparently in the browser. For automated systems, it is a wall.</p><p>This change in strategy put pressure on all &#8220;SEO tools&#8221; and the operators that needed to scrape Google search results, suddenly increasing their day-to-day operational costs. 
</p><div><hr></div><blockquote><p><em>Need public web data, not scraper headaches?</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vag1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vag1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 424w, https://substackcdn.com/image/fetch/$s_!vag1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 848w, https://substackcdn.com/image/fetch/$s_!vag1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!vag1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vag1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png" width="398" height="143.78296703296704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:526,&quot;width&quot;:1456,&quot;resizeWidth&quot;:398,&quot;bytes&quot;:243844,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187899616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vag1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 424w, https://substackcdn.com/image/fetch/$s_!vag1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 848w, https://substackcdn.com/image/fetch/$s_!vag1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!vag1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132bdbd3-9f2b-4f91-9e6f-6d30074151be_2958x1068.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>SerpApi turns search results into predictable JSON with built-in scale, location options, and speed. All with no maintenance. 
</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://serpapi.com/?utm_source=thewebscrapingclub&quot;,&quot;text&quot;:&quot;Try for free&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://serpapi.com/?utm_source=thewebscrapingclub"><span>Try for free</span></a></p></blockquote><div><hr></div><p>This change in strategy, and especially its timing, prompted professionals to raise questions that will likely never receive an answer. Does this have something to do with the AI race? Is this a way to make it harder for other AI companies to rely on Google searches for their answers?<br>We&#8217;ll probably never know the answers, but they&#8217;re legitimate questions: SERP scraping is as old as Google search, so why bother stopping bots in 2025 and not years earlier?<br>However, this is today&#8217;s reality, and we need to adapt to it. Let&#8217;s examine the specifics of SERP scraping on Google (<em>as always, we&#8217;re showing this for educational purposes; be aware of current copyright and scraping laws</em>).</p><h2>What shapes a Google SERP response</h2><p>To scrape Google Search reliably, we need to model the system we are interacting with. Google personalizes search results along several axes, each of which produces measurably different output.</p><p><strong>Geographic location</strong> is one of the most impactful variables. Google determines your location through your IP address and, when available, browser geolocation permissions. A query for &#8220;pizza restaurant&#8221; from a New York IP returns local results for Manhattan. The same query from a Milan IP returns pizzerias in Milan. 
This extends beyond local searches: news results, shopping results, and even organic ranking order shift based on geography.</p><p>We&#8217;ll see in the test part of this article that changing location and mimicking another one is less trivial than expected, since not every proxy type works as expected.</p><p><strong>Device type</strong> determines the structure and content of the SERP page itself. Mobile and desktop results are not just different layouts of the same data. Google serves genuinely different content. Mobile SERPs prioritize featured snippets, location-based answers, and nearby points of interest. Desktop SERPs give more space to organic links and Knowledge Panels. Some results appear exclusively on mobile or exclusively on desktop. For anyone collecting SERP data for analysis, this distinction is not cosmetic. It is structural.</p><p><strong>Login state</strong> introduces personalization based on your Google account history. When you are logged in, Google uses your search history, location history, and account preferences to tailor results. When logged out, you get a more &#8220;generic&#8221; version of the results for your location and device. The difference can be subtle for generic queries and dramatic for anything Google considers personal.</p><p><strong>Keywords,&nbsp;</strong>of course, are the main driver of change. But in addition to returning different results for different keywords, the answer layout also varies accordingly. If you look for &#8220;trousers&#8221;, you&#8217;ll see more shopping results and product data, while if you&#8217;re looking for &#8220;aspirin&#8221;, you&#8217;ll see a more traditional layout.</p><p>These four variables interact. A logged-in mobile user in Tokyo sees a fundamentally different page than a logged-out desktop user in London, even for the same query. 
Controlling all four simultaneously is what makes SERP scraping an infrastructure problem, not just a coding problem.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Tools: an anti-detect browser and Playwright</h2><p>Given the variables we need to control (device type and login state, specifically), and the fact that we are not building a massive scraping operation here, the best setup we can use is Playwright paired with an anti-detect browser.</p><p>We need a real browser, not just an HTTP request library like <code>requests</code> or <code>httpx</code>, because Google&#8217;s SearchGuard validates the browser environment through JavaScript challenges. A raw HTTP client has no JavaScript engine, no DOM, no window object. It cannot compute the &#8220;solve&#8221; that SearchGuard requires. The request simply fails or returns a challenge page. To pass these checks, we need something that renders JavaScript and exposes a complete browser environment.</p><p>But a standard browser is not enough either. Regular Chrome or Firefox, when automated with Playwright or Selenium, carries detectable signals: the <code>navigator.webdriver</code> flag, predictable fingerprint values, and missing or inconsistent browser properties. Google&#8217;s systems can identify these inconsistencies and treat the session as automated.</p><p>That&#8217;s why we&#8217;re pairing Playwright with an anti-detect browser: a modified browser engine that spoofs the properties websites use for fingerprinting, 
such as navigator properties, screen resolution, WebGL parameters, canvas behavior, AudioContext values, font lists, language headers, and device type. Instead of presenting the same default fingerprint every time, an anti-detect browser generates a consistent, realistic identity that looks like a genuine user on a specific device and operating system.</p><p>The critical feature for our use case is <strong>persistent profiles</strong>. An anti-detect browser manages browser profiles that survive across sessions. Each profile stores its fingerprint configuration, cookies, local storage, proxy, and device settings. When we start a profile, it resumes exactly where it left off. This means we can log into a Google account through one profile, close the browser, and reopen it days later with the session still active. Without persistent profiles, we would need to authenticate on every run, which is both impractical and a red flag for Google&#8217;s security systems.</p><p>For this article, we use <a href="https://kameleo.io/">Kameleo</a> as our anti-detect browser. It runs as a local service (Kameleo.CLI) exposing a REST API on port 5050, controllable via a Python client. It supports Chromium-based profiles (Chroma) for Chrome and mobile device emulation, and Firefox-based profiles (Junglefox). 
Each profile is an isolated browser session with its own fingerprint, proxy, and cookies.<br></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Setting up the infrastructure: deploying Kameleo on AWS</h2><p>Our Kameleo instance runs on a Windows EC2 in the US. This means that without a proxy, all traffic exits via a US-based AWS IP address. We will use this setup to demonstrate the difference between the instance&#8217;s own IP and a proxy claiming to be somewhere else. I&#8217;m sure you&#8217;ll be surprised by what we&#8217;ll find later.</p><h3>Installing Kameleo on AWS</h3><p>We installed Kameleo on a Windows EC2 instance using the standard graphical installer, no rocket science here. Once Kameleo is running on the AWS machine, it exposes its API on port 5050. Our Python scripts run locally and connect to the remote Kameleo instance over the network.</p><p>The architecture is straightforward: Kameleo manages browser profiles and runs the actual browsers on the AWS instance. Our local machine sends API commands (create profile, start browser, stop browser) and connects to the browser via WebSocket for Playwright automation. The AWS instance needs port 5050 open in its security group for this to work.</p><p>Every script in this article follows the same initialization pattern. We read the remote IP from an environment variable:</p><pre><code>from kameleo.local_api_client import KameleoLocalApiClient
import os

# Address of the remote Kameleo instance; KAMELEO_IP must be set in the environment
kameleo_ip = os.getenv('KAMELEO_IP')
# Kameleo listens on port 5050 unless KAMELEO_PORT overrides it
kameleo_port = os.getenv('KAMELEO_PORT', '5050')

client = KameleoLocalApiClient(endpoint=f'http://{kameleo_ip}:{kameleo_port}')</code></pre><h2>Test 1: setting the right location</h2><p>As we said, one of the keys to extracting SERP data is setting the location we&#8217;d like to know more about. Our Kameleo installation is on an AWS US machine, so we expect to get SERP data from there. But what if we want to change location?</p><p>We run the same query, &#8220;weather&#8221;, three times from the same AWS instance in the US. First, without any proxy, so the traffic exits from the instance&#8217;s own IP. Then, through a residential proxy geolocated in Italy. Finally, through a datacenter proxy also claiming to be in Italy. For each run, we first visit whatismyipaddress.com to verify the exit IP, then navigate to Google, type the query in the search bar with randomized keystroke delays, and capture the results.<br><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">98.SERP-DATA</a>. </strong>If you&#8217;re one of them but cannot access the repository, <a href="https://twsc-private-form.lovable.app/">please fill out this form</a>.</p><p>In the file <strong>test_location_comparison.py</strong>, we&#8217;ll see how Google responds when we use different types of proxies.</p>
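<p>The &#8220;randomized keystroke delays&#8221; step can be sketched as follows. This is a minimal illustration, not the code from our repository: the helper names and the 80 to 250 ms range are assumptions, and <code>page</code> is assumed to be a Playwright sync <code>Page</code> already connected to the browser.</p><pre><code>import random


def keystroke_delays(text, min_ms=80, max_ms=250, seed=None):
    """Return one randomized delay (in milliseconds) per character of text."""
    rng = random.Random(seed)  # a fixed seed makes a run reproducible
    return [rng.randint(min_ms, max_ms) for _ in text]


def type_like_a_human(page, selector, text, min_ms=80, max_ms=250, seed=None):
    """Type text into selector one key at a time, pausing between keys.

    page is assumed to expose the Playwright sync Page methods used below.
    """
    page.click(selector)  # focus the input before typing
    delays = keystroke_delays(text, min_ms, max_ms, seed)
    for char, delay in zip(text, delays):
        page.keyboard.type(char)
        page.wait_for_timeout(delay)  # pause before the next keystroke
    page.keyboard.press("Enter")  # submit the query
</code></pre><p>With a connected page, a call such as <code>type_like_a_human(page, 'textarea[name="q"]', 'weather')</code> would type the query one key at a time and submit it. The selector is a guess on our side and may not match every Google layout. Passing a <code>seed</code> makes a run reproducible while debugging.</p>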
      <p>
          <a href="https://substack.thewebscraping.club/p/scraping-serp-google-search">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Avoid Copyright Violations While Scraping ]]></title><description><![CDATA[Discover how copyright violations can occur in web scraping and how to avoid them]]></description><link>https://substack.thewebscraping.club/p/avoid-copyright-violations-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/avoid-copyright-violations-scraping</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 15 Feb 2026 04:00:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5eee30d1-21ea-4010-a335-5d9e31803bfd_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At its core, web scraping is based on a simple process: you retrieve data from a target website with the goal of doing something meaningful with it. Regardless of your experience in the industry, this process should immediately make you ask yourself: &#8220;<em>I&#8217;m retrieving and using someone else&#8217;s data, so am I violating copyright while scraping?</em>&#8221;</p><p>In this article, we&#8217;ll discuss what copyright violation means in the context of web scraping, when it occurs, and how to avoid it.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What is Copyright Violation in the Context of Scraping?</h2><p>Generally speaking, a copyright violation occurs when you reproduce, display, distribute, or create derivative works from someone else&#8217;s creative work without their permission (or without a valid legal exception). In the context of web scraping, the &#8220;creative work&#8221; involves (but is not limited to) the following:</p><ul><li><p>Articles.</p></li><li><p>Images.</p></li><li><p>Audio and video.</p></li><li><p>Code (under particular conditions).</p></li></ul><p>In other words, if you scrape and reproduce an article (even a small part of it)  on your website without the author&#8217;s permission, you can be infringing copyright. Whether it is actually infringement depends on context (how much content you copied, how you used it, and which jurisdiction applies), but &#8220;a small part of a whole article&#8221; is not a safe harbor.</p><p>So here&#8217;s the thing to bear in mind: Just because some content is accessible on the Internet, it doesn&#8217;t mean you can take it. 
Even though some content is publicly accessible, ownership and reproducibility are not. This is why <a href="https://substack.thewebscraping.club/i/179653589/best-practice-5-mind-the-data-you-scrape-and-the-goal">minding the data you scrape is one of the best practices for ethical scraping.</a></p><div><hr></div><blockquote><p><em>For your ethical scraping activity, you need IPs with good reputation. For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">that&#8217;s sharing with TWSC readers this offer</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" height="87.84" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, 
https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><h2>How Can Copyright Violations Occur While Scraping?</h2><p>To avoid copyright infringements, you should know the common cases to take care of. Below is a list of common situations where copyright can be violated while scraping data from websites:</p><ul><li><p><strong>Copying content</strong>: Technically speaking, scraping is copying. When you download a webpage&#8217;s HTML to your disk, you&#8217;ve made a copy. If that HTML contains creative expression, you have created a copy of copyrighted material. That does not automatically mean you are infringing, but this is the exact action copyright law regulates. And if you store, reuse, or republish that expression without permission (or a solid exception), you&#8217;re in infringement territory. Note that courts don&#8217;t need the copied content to be 1:1 identical. For them, &#8220;substantial similarity&#8221; can be enough.</p></li><li><p><strong>Copying images and media</strong>: Images are typically strongly protected. Scraping image URLs and hotlinking can still be risky, even if you report the source URLs while republishing the images. 
And, of course, downloading and rehosting is even more direct copying.</p></li><li><p><strong>Copying &#8220;creative fields&#8221; that look like &#8220;data&#8221;</strong>: Product descriptions, editorial blurbs, &#8220;about&#8221; sections, hotel/restaurant descriptions, FAQs, and similar content are often copyrighted text. Editorial blurbs are obviously copyrighted content, but the other cases are less clear-cut. The question to ask is always whether the text qualifies as &#8220;creative work&#8221;. A product description can be creative work when it contains original language, structure, or marketing copy. But not every description is protected. For example, a purely functional description may have weak or no copyright protection, depending on the jurisdiction and the originality of the content itself.</p></li><li><p><strong>Scraping for training LLMs</strong>: Scraping web pages to get data for training LLMs is surely part of <a href="https://substack.thewebscraping.club/i/173603764/the-future-ai-llms-and-the-next-frontier">the evolving career of web scraping professionals</a>. However, scraping data to train Large Language Models can trigger reproduction/derivative-work arguments in courts. This is still an evolving legal area, so you should not assume &#8220;transformative&#8221; automatically saves you from legal troubles, especially at scale. The dispute between <a href="https://techcrunch.com/2025/11/03/studio-ghibli-and-other-japanese-publishers-want-openai-to-stop-training-on-their-work/">Studio Ghibli and OpenAI over copyright violations in LLM training</a> is one of the open cases, but keep in mind: allegations, investigations, and lawsuits are not the same thing as a final court ruling.</p></li></ul><h3>How to Avoid Copyright Violations While Scraping</h3><p>Legal trouble is probably the worst nightmare for professional scrapers. So, how can you be sure you are not violating copyright while scraping? 
Below is a list of guidelines to take into consideration:</p><ul><li><p><strong>Scrape facts, not expression</strong>: Copyright protects expression, not facts. Scraping the price of a stock, the temperature in London, or a flight arrival time doesn&#8217;t infringe any copyright because these are facts. No one owns the fact that today it is 20 degrees in London. On the other hand, a journalist&#8217;s analysis of why the price of a stock moved in a certain direction, or a photographer&#8217;s image of London, is creative expression, and scraping it means copying protected work.</p></li><li><p><strong>Transform, don&#8217;t replicate</strong>: When repurposing content (on your website or anywhere else), transform it. This is a general rule of thumb, and if you are in the US, it also underpins one of your best defenses: &#8220;Fair Use&#8221;. To claim it, your use must be transformative. For example, scraping Amazon reviews and posting them on your own e-commerce site is replicating, not transforming. Even summarizing reviews may not count as transformative in some cases, and even when it does, it&#8217;s not a guaranteed shield.</p></li><li><p><strong>Don&#8217;t store raw pages by default</strong>: As said before, storing the HTML of entire pages means creating copies. To avoid this, you can follow two paths:</p><ul><li><p>Parse in-memory.</p></li><li><p>Extract only the necessary content, not whole pages.</p></li></ul></li><li><p><strong>Treat images as a separate &#8220;danger zone&#8221;</strong>: Images have been the source of most copyright disputes throughout the Internet era. 
The safest options are:</p><ul><li><p>Using the website&#8217;s official APIs when scraping images, if available.</p></li><li><p>Scraping images released under a Creative Commons license and complying with its terms.</p></li><li><p>Asking for and obtaining a direct license from the owner.</p></li></ul></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Standardized Processes to Stay Safe</h3><p>So far so good, but let&#8217;s be honest: When you are caught up in your daily tasks, it&#8217;s easy to lose your bearings. To avoid that, the best thing to do is to create standardized (and documented) processes and procedures so that you always operate within guardrails. This section provides you with a couple of ideas you can implement as standardized processes to be sure you don&#8217;t violate any copyright while scraping.</p><h3>Procedure #1 to Avoid Copyright Violations While Scraping: Develop a Copyright Risk Check</h3><p>Most copyright problems in scraping are self-inflicted. This happens because developers often scrape &#8220;everything on the page,&#8221; save it &#8220;for later,&#8221; and only then ask: &#8220;<em>Wait, can we ship this?</em>&#8221;.</p><p>Before you add a field (or a selector) to your scraper, ask yourself the following questions:</p><ul><li><p><strong>&#8220;Is this a fact, or is this someone&#8217;s writing?&#8221;</strong>: Prices, dates, SKUs, addresses, and opening hours are facts. A paragraph of an article is someone&#8217;s writing. 
Remember to treat those differently.</p></li><li><p><strong>&#8220;If I publish this, would it compete with the source?&#8221;:</strong> If your application lets users consume the content without clicking the original, you&#8217;re not &#8220;aggregating.&#8221; You&#8217;re substituting.</p></li><li><p><strong>&#8220;Am I copying just what I need, or am I copying the entire page?&#8221;</strong>: If the answer to this question is: &#8220;<em>We only store it for debugging</em>&#8221;, then you are building a copy.</p></li><li><p><strong>&#8220;How much am I taking?&#8221;:</strong> A single excerpt is one thing. Thousands of excerpts across a site start looking like a dataset designed to recreate the whole content.</p></li><li><p><strong>&#8220;What am I going to do with it later?&#8221;:</strong> Internal analysis is one risk profile. A public API that returns the scraped text is a completely different risk profile.</p></li><li><p><strong>&#8220;Is my plan defensible if someone sends a legal notice?&#8221;:</strong> If your only defense is &#8220;<em>but the content is publicly available</em>&#8221;, you don&#8217;t have a defense. 
As said before, public availability is different from ownership.</p></li></ul><p>If answers to these questions feel shaky, the fix is usually boring: don&#8217;t collect it, collect less, or get permission.</p><h3>Procedure #2 to Avoid Copyright Violations While Scraping: Build Your Scraper So It&#8217;s Hard to Do Something Dumb</h3><p>If you want to stay out of trouble, don&#8217;t rely on &#8220;policy.&#8221; Rely on defaults and standards.</p><p>Here&#8217;s what I mean: The safest scraper is the one that can&#8217;t casually vacuum up article bodies, image files, and review text unless you deliberately build it that way.</p><p>Below is a process that works safely:</p><ol><li><p>Fetch the page.</p></li><li><p>Extract only what you came for.</p></li><li><p>Store facts + metadata (source URL, timestamp).</p></li><li><p>Throw the rest away.</p></li></ol><p>When you really do need to keep anything close to &#8220;content&#8221; (i.e., media), treat it as a special case: short retention, locked-down access, and a reason written down somewhere. Not &#8220;<em>maybe we&#8217;ll need it later</em>&#8221;: You must have a valid reason.</p><p>If you want a mental model, you can think of it like so: You&#8217;re not building a web scraper. You&#8217;re building a pipeline. 
And pipelines need guardrails.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Examples: What &#8220;Safe-ish&#8221; Looks Like vs What Can Surely Get You in Trouble</h3><p>Let&#8217;s be practical now and see some examples of what is generally safe and what is not. Of course, the following examples are not court outcomes. They&#8217;re the kind of setups that tend to be boring (safe-ish) or spicy (trouble-ish):</p><ul><li><p><strong>Price tracker (safe-ish)</strong>: You scrape SKU + price + availability + timestamp and show a price history chart. You don&#8217;t copy product descriptions or images. This is the classic &#8220;facts + original output&#8221; use case.</p></li><li><p><strong>Product catalog clone (risky)</strong>: You scrape titles, descriptions, bullet points, images, and reviews, then you show them on your site. That&#8217;s not &#8220;data.&#8221; That&#8217;s content. You&#8217;re rebuilding their user experience.</p></li><li><p><strong>News aggregation (high risk)</strong>: If you store headlines + links and add your own tags/filters, you&#8217;re closer to indexing. If you store full articles and users can read all the content as is without leaving your site, then you run a high risk of ending up in court.</p></li><li><p><strong>Review analytics (mixed)</strong>: Using reviews internally to compute &#8220;top complaints this month&#8221; is one thing. 
Republishing reviews precisely as they are is another.</p></li><li><p><strong>Business directory (often safer, until you start copying the fluff)</strong>: Name, address, phone, opening hours: These are usually factual. &#8220;About us&#8221; sections and photos, on the other hand, are where you cross over into copyrighted expression.</p></li></ul><p>So notice the pattern: The moment your product starts looking like a substitute for the source, your legal risk goes up fast.</p><div><hr></div><h2>The Traps That Have Nothing to Do with Copyright (But Still Hurt)</h2><p>Copyright is only one way scraping can go wrong. Plenty of scraping disputes are won on issues that are simpler to prove than infringement. Below are the big ones you should treat as &#8220;no trespassing&#8221; signs:</p><ul><li><p><strong>Circumvention (DMCA Section 1201):</strong> If the site uses a login wall, CAPTCHA, paywall, anti-bot challenges, or IP blocking to stop you, and you write code to bypass those measures, you are potentially violating anti-circumvention laws. 
This is not &#8220;copyright infringement&#8221; in the traditional sense, but the practical takeaway is simple: If you have to defeat a technical barrier to get the data, you&#8217;re walking into high-risk territory fast.</p></li><li><p><strong>Disregarding </strong><em><strong>robots.txt</strong></em><strong>:</strong> The <em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">robots.txt</a></em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications"> isn&#8217;t the law, but ignoring it has its implications</a>. In disputes, it can be used as evidence that you knew you were unwelcome and kept going anyway. It can also be relevant to arguments about authorization and &#8220;bad faith,&#8221; even if it doesn&#8217;t create copyright liability by itself.</p></li><li><p><strong>Terms of service (contract risk):</strong> If the ToS explicitly forbids scraping (and most do), and you scrape anyway, you may be liable for breach of contract. This is often easier for the content owner to win than a copyright claim because the argument is straightforward: You agreed (explicitly or implicitly) to a contract, then you violated the agreement.</p></li><li><p><strong>Do not scrape behind a login:</strong> Once you log in, you have affirmatively agreed to a contract. Breaking that contract to scrape is a fast track to a lawsuit. If your plan requires authenticated access, treat it as a licensing/permission problem, not an engineering challenge.</p></li></ul><h2>Conclusion</h2><p>In this article, we&#8217;ve discussed how copyright infringements can occur while scraping and how to avoid them. As said, it&#8217;s not always easy to understand when you are actually infringing copyright, as it depends on the governing laws, which are often local. 
Still, the main ideas proposed can help you be conservative and stay pretty safe while scraping web pages.</p><p>So, did you find these practices useful? Do you apply other frameworks to make sure you&#8217;re not infringing copyright? Let us know in the comments!</p>]]></content:encoded></item><item><title><![CDATA[Google vs IPIDEA: Anatomy of a Residential Proxy Takedown]]></title><description><![CDATA[Google Took Down 16 Million Proxy IPs. Here is Why It Will Not Be Enough.]]></description><link>https://substack.thewebscraping.club/p/google-vs-ipidea-takedown</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/google-vs-ipidea-takedown</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Sun, 08 Feb 2026 20:21:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d1ec9374-4ca3-4d8d-9afb-783dcabe3e9b_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On these pages, we have written several times about how proxy networks work and how they source their IPs. It&#8217;s no secret that, in some cases, companies operate in a gray area.</p><p>Think about it: how would you convince several million people to share their spare internet connection with companies when they have no idea how it will be used?<br>Not an easy task, and some companies took shortcuts, as the IPIDEA case shows.<br>Let&#8217;s look at it in detail. More than the takedown itself, we&#8217;ll use it as an opportunity to look under the hood: how do residential proxy networks acquire millions of IP addresses, what keeps them running, and why are they so difficult to shut down permanently? 
The IPIDEA case provides unusually detailed answers to all these questions.</p><h2>What happens when Big Tech goes after the infrastructure that powers both scrapers and threat actors</h2><p>On January 28, 2026, <a href="https://cloud.google.com/blog/topics/threat-intelligence/disrupting-largest-residential-proxy-network">Google Threat Intelligence Group (GTIG) announced what they called the disruption of &#8220;one of the largest residential proxy networks in the world.&#8221;</a> The target was IPIDEA, a name that most people outside the proxy industry had never heard. Yet according to Google&#8217;s analysis, IPIDEA&#8217;s infrastructure was being used by over 550 distinct threat groups in a single week, including state-sponsored actors from China, North Korea, Iran, and Russia.</p><p>This is not just a story about a takedown. It is a detailed look at how residential proxy networks actually work, how they acquire millions of IP addresses, and why disrupting them is harder than it sounds.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What Google Actually Did</h2><p>The disruption involved three coordinated actions:</p><p>First, Google took legal action to seize the domains used to control enrolled devices and route traffic through them. Without these command-and-control (C2) domains, the SDK code running on millions of devices loses the ability to receive instructions and proxy traffic.</p><p>Second, GTIG shared technical intelligence about IPIDEA&#8217;s SDKs with platform providers, law enforcement, and research firms. The goal was to trigger ecosystem-wide enforcement, getting these SDKs flagged and removed across multiple app stores and platforms.</p><p>Third, Google updated Play Protect to automatically warn users and remove applications known to contain IPIDEA SDKs. This blocks the network&#8217;s ability to recruit new devices on certified Android devices.</p><p>Google claims these actions reduced IPIDEA&#8217;s available device pool by millions. Whether that number holds up over time is a different question, and we will get to that.</p><div><hr></div><blockquote><p><em>Not all residential proxy networks operate in gray zones. 
Decodo built theirs on user consent, ISO 27001 certification, and co-founded the Ethical Web Data Collection Initiative to prove the model works.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo 
Now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>The Anatomy of a Residential Proxy Network</h2><p>To understand why this matters, we need to understand what a residential proxy network actually is and how it differs from datacenter proxies.</p><p>A datacenter proxy routes your traffic through IP addresses belonging to cloud providers or hosting companies. These IPs are easy to identify and block because they belong to known ASNs (Autonomous System Numbers) associated with commercial hosting.</p><p>A residential proxy routes traffic through IP addresses assigned by consumer ISPs to regular households. When you connect through a residential proxy, your request appears to come from someone&#8217;s home internet connection in Omaha, Tokyo, or Milan. This makes detection and blocking significantly harder because the traffic looks indistinguishable from a regular consumer browsing the web.</p><p>The challenge for proxy providers is obvious: they need access to millions of consumer devices to build a usable network. These devices need to be online, geographically distributed, and willing (or unwilling) to forward traffic.</p><p>There is an important nuance here. While Google&#8217;s report focuses on residential proxies, the same SDK installation on a mobile phone yields two distinct proxy classes. When the phone is connected to home WiFi, traffic exits through a residential IP assigned by the home ISP. When that same phone disconnects from WiFi and switches to 5G or LTE, traffic now exits through a mobile carrier IP. The device has not changed. The SDK has not changed. But the proxy class has shifted from residential to mobile.</p><p>This matters because mobile proxies are typically sold at a premium, sometimes 2-3x the price of residential proxies. 
Mobile carrier IPs are considered even harder to block than residential IPs because they are shared across thousands of legitimate mobile users through carrier-grade NAT. A single SDK deployment on mobile devices effectively generates inventory for two separate product lines.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>How Residential Proxy Providers Acquire IP Addresses</h2><p>The IPIDEA takedown revealed the specific mechanisms used to build and maintain a large-scale residential proxy network. These methods fall into several categories, ranging from semi-legitimate to clearly deceptive.</p><h3>SDK Integration</h3><p>The primary method is embedding Software Development Kits into legitimate applications. IPIDEA operated multiple SDK brands: PacketSDK, HexSDK, CastarSDK, and EarnSDK. These SDKs are marketed to app developers as monetization tools. The pitch is simple: integrate our SDK, and we will pay you based on downloads or active users.</p><p>Once embedded, the SDK turns the device into an exit node for the proxy network. The device will accept incoming connections from the proxy infrastructure and forward requests to target websites. The app continues to function normally. The user has no obvious indication that their device is being used as a proxy.</p><p>Google&#8217;s analysis found that many applications containing these SDKs did not disclose this functionality to users. 
The SDK was hidden, not mentioned in the terms of service, and ran silently in the background.</p><h3>Trojanized Applications</h3><p>Beyond SDK integration, IPIDEA directly operated or controlled VPN applications that served as trojan horses. Galleon VPN, Radish VPN, and Aman VPN all provided genuine VPN functionality while simultaneously enrolling devices into the proxy network.</p><p>The logic is effective: users who install VPN applications expect their traffic to be routed through external servers. They are primed to accept unusual network behavior. The proxy functionality hides inside this expected behavior.</p><p>Google identified over 600 Android applications across multiple download sources with code connecting to IPIDEA&#8217;s C2 infrastructure. On Windows, they found 3,075 unique executables making DNS requests to IPIDEA&#8217;s Tier One domains, including applications masquerading as OneDriveSync and Windows Update.</p><h3>Pre-Infected Devices</h3><p>Researchers have documented cases of uncertified Android devices shipping with residential proxy payloads already installed. Set-top boxes, TV boxes, and other IoT devices from off-brand manufacturers have been found with hidden proxy software baked into the firmware.</p><p>This method bypasses the need for user installation entirely. The device arrives compromised.</p><h2>The Technical Architecture</h2><p>Google&#8217;s reverse engineering of the SDK code revealed a two-tier command-and-control system.</p><p>When an infected device starts up, it contacts a Tier One server. The device sends diagnostic information including OS version, device identifier, and a key parameter that appears to be used for affiliate tracking (determining which app developer gets paid for the enrollment). The Tier One server responds with timing configuration and a list of Tier Two server IP addresses.</p><p>The device then periodically polls a Tier Two server, checking for proxy tasks. 
When a task arrives, it contains a target FQDN (like www.google.com:443) and a connection ID. The device establishes a connection to the target, receives data payloads from the Tier Two server, and forwards them unmodified to the destination.</p><p>Google found approximately 7,400 Tier Two servers at the time of their analysis, hosted globally including in the United States. The number fluctuated daily, suggesting a demand-based scaling system.</p><p>The infrastructure analysis revealed something important: despite different brand names (PacketSDK, HexSDK, CastarSDK, EarnSDK), all the SDKs connected to the same pool of Tier Two servers. The brands were marketing fronts for a single unified network.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>The Brand Proliferation Problem</h3><p>This brings us to one of the most interesting findings. IPIDEA did not operate under a single name. Google identified at least 13 ostensibly independent proxy and VPN brands controlled by the same actors:</p><ul><li><p>360 Proxy</p></li><li><p>922 Proxy</p></li><li><p>ABC Proxy</p></li><li><p>Cherry Proxy</p></li><li><p>Door VPN</p></li><li><p>Galleon VPN</p></li><li><p>IP 2 World</p></li><li><p>Ipidea</p></li><li><p>Luna Proxy</p></li><li><p>PIA S5 Proxy</p></li><li><p>PY Proxy</p></li><li><p>Radish VPN</p></li><li><p>Tab Proxy</p></li></ul><p>These brands operated separate websites, separate marketing, and separate pricing. 
A customer comparing &#8220;922 Proxy&#8221; to &#8220;Luna Proxy&#8221; would have no obvious indication that they were buying access to the same underlying network.</p><p>This is not unique to IPIDEA. Industry analysis suggests there are only about 7 truly unique residential proxy networks globally, despite hundreds of brands competing in the market. The rest are resellers, white-labels, or, like IPIDEA, multiple storefronts for the same infrastructure.</p><h2>Why Residential Proxies Are Hard to Block</h2><p>The fundamental problem for defenders is that residential proxy traffic looks legitimate. When a request arrives from a Comcast IP in Chicago, there is no technical marker indicating whether it comes from an actual Comcast customer browsing normally or from a proxy network routing traffic through that customer&#8217;s compromised device.</p><p>The proxy networks exploit this by design. The value proposition they sell is precisely this difficulty of detection.</p><p><a href="https://deviceandbrowserinfo.com/learning_zone/articles/inside-ipidea-residential-proxy-network">Security researcher Antoine Vastel published concrete data that illustrates the scale of this problem</a>. By actively testing proxy endpoints, he verified more than 16 million unique IP addresses that were functional and associated with the IPIDEA network during the 30 days preceding the takedown. The breakdown by brand shows the relative sizes within the IPIDEA ecosystem: PY Proxy (PyProxy) accounted for 13.4 million IPs, PIA S5 Proxy for 2.2 million, and Luna Proxy for 549,000.</p><p>These are not theoretical numbers from marketing materials. These are IP addresses through which Vastel routed traffic and confirmed as working proxy endpoints. And here is the critical insight from his analysis: even with 16 million identified proxy IPs, defenders cannot simply block them.</p><p>The reason is that residential exit nodes mix traffic from automated tools and legitimate human users on the same IP. 
The device owner browses the web normally, while the SDK, in the background, forwards proxy traffic over the same connection. Blocking these IPs based on proxy activity would inevitably block real users who happen to share an IP or whose IP was previously used as an exit node.</p><p>Vastel&#8217;s recommendation is telling: use these IoCs for risk scoring, behavioral enrichment, and incident investigation, but not for direct blocking. The data is context, not a verdict. This fundamental asymmetry makes residential proxies valuable to attackers and frustrating for defenders.</p><p>His research also confirmed another pattern: IP addresses frequently appear across multiple proxy ecosystems simultaneously. IPIDEA did not rely exclusively on its own residential pool. Requests were routed through or resold from other networks. The same IP might be accessible through IPIDEA, a competitor, and a reseller all at once. This interconnection means that even identifying an IP as &#8220;IPIDEA-linked&#8221; does not tell you the full story of how it is being used.</p><p>Traditional IP reputation systems struggle with this complexity. Blocking known bad IPs works for datacenters where the IP assignments are stable, and the ASNs are identifiable. Residential IP addresses rotate frequently as ISPs reassign addresses, and blocking residential IP ranges blocks legitimate users.</p><h2>Google&#8217;s Approach: Attacking the Infrastructure</h2><p>Rather than trying to block individual IP addresses, Google attacked the control infrastructure. By taking down the Tier One C2 domains, they severed the connection between infected devices and the proxy operators. Without C2 connectivity, the SDK code on millions of devices becomes inert.</p><p>This approach has precedent. It is the same strategy used against botnets: identify the command-and-control infrastructure and take it down. 
The infected devices remain infected, but they can no longer receive instructions.</p><p>Google also partnered with Cloudflare to disrupt IPIDEA&#8217;s domain resolution, adding another layer of infrastructure disruption beyond the legal domain seizures.</p><h2>Will It Work? The Persistence Problem</h2><p>Here is where we need to be realistic about the limitations of this approach.</p><p>The takedown disrupted IPIDEA&#8217;s current infrastructure. The domains are gone. The C2 servers are unreachable. Millions of devices are no longer participating in the proxy network. But &#8220;no longer participating&#8221; is not the same as &#8220;cleaned up.&#8221;</p><p>The fundamental problem is that infected devices remain infected. The SDK code is still installed on millions of phones, tablets, TV boxes, and computers worldwide. Google can take down domains. Google can update Play Protect to flag and remove known apps on certified Android devices. What Google cannot do is reach into the millions of uncertified devices, TV boxes, and Windows machines outside Play Protect&#8217;s reach and uninstall the malicious code that is already there.</p><p>For the SDK to be removed, one of these things needs to happen: the user manually uninstalls the app containing it, the device gets factory reset, or the device gets replaced. None of these happens at scale. Most users are unaware that the SDK exists. The apps that contain it often provide real functionality (games, utilities, VPNs) that users want to keep. There is no mechanism to notify millions of people across dozens of countries that their flashlight app is secretly a proxy node.</p><p>This creates an asymmetry that favors the attackers. Google invested significant legal, technical, and coordination resources to take down IPIDEA&#8217;s infrastructure. The IPIDEA operators (or anyone who acquires their codebase) can spin up new C2 domains, update their DNS configuration, and potentially reactivate a substantial portion of the dormant network. 
The SDK code often includes fallback mechanisms and update capabilities precisely for this scenario.</p><p>The brand proliferation we discussed earlier is part of this resilience. IPIDEA operated 13+ brands. If some domains get seized, others may survive. If the entire IPIDEA operation is compromised, the operators can rebrand entirely and inherit a pre-installed base of millions of devices waiting for new instructions.</p><p>Google acknowledged this reality in its announcement, noting that &#8220;this industry appears to be rapidly expanding&#8221; and that &#8220;there are significant overlaps across providers.&#8221; The reseller and partnership agreements that connect different proxy brands mean that disruption propagates unpredictably through the ecosystem, but so does recovery.</p><p>This is not a battle that can be won definitively. It is a cost-imposition strategy. The goal is to make operating these networks so expensive and risky that some operators exit the market or shift to more legitimate practices. But as long as the infected device base exists, the infrastructure can be rebuilt. The realistic outcome is degradation, not eradication.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>The Economics Behind Residential Proxy Networks</h2><p>Understanding why these networks exist requires understanding the economics.</p><p>Residential proxy bandwidth sells for $4-8 per gigabyte or more to end customers. The cost to acquire that bandwidth through SDK partnerships is measured in cents per gigabyte. The gross margins appear enormous.</p><p>But running a proxy operation is not just bandwidth arbitrage. The real costs include:</p><p><strong>Engineering and Infrastructure</strong>: Maintaining thousands of C2 servers globally, building rotation logic, and handling the unreliable nature of consumer devices going online and offline unpredictably.</p><p><strong>SDK Distribution</strong>: Paying app developers for integration, maintaining relationships with publishers, and navigating app store policies that increasingly scrutinize monetization SDKs. The silver lining is that mobile app SDKs generate dual inventory: residential IPs when devices are on WiFi, mobile IPs when they switch to cellular. This allows providers to sell the same underlying device pool across two product categories at different price points.</p><p><strong>Customer Acquisition</strong>: Finding buyers for proxy services is expensive. The market is niche, competition is intense, and customers are price-sensitive.</p><p><strong>Legal and Compliance</strong>: Or in IPIDEA&#8217;s case, the lack thereof. Operating in legal gray zones creates ongoing risk. 
The Google takedown demonstrates what happens when that risk materializes.</p><p>Industry estimates suggest customer acquisition costs consume 40-60% of revenue even at scale. Add the other costs listed above, and the enormous-looking gross margin compresses quickly: most residential proxy providers operate on thin actual profits despite the high sticker prices.</p><p>This economic pressure explains the proliferation of brands. Running multiple storefronts for the same underlying network lets operators segment the market, test different price points, and spread legal risk across multiple corporate entities.</p><h2>The Bigger Picture</h2><p>The IPIDEA takedown is part of a larger pattern. Google previously took action against the BadBox2.0 botnet, which shared infrastructure with IPIDEA. Law enforcement agencies worldwide are paying greater attention to residential proxy networks, recognizing the role this infrastructure plays in facilitating activities ranging from credential stuffing to espionage.</p><p>The residential proxy industry has operated partly in a legal gray zone for years. The Google action, particularly the legal component, establishes a clearer precedent that enrolling devices without consent and facilitating malicious activity creates meaningful legal exposure.</p><p>This does not mean residential proxies are going away. The demand exists, and where demand exists, supply follows. However, the industry may be compelled toward more transparent practices: clearer consent mechanisms, improved disclosure in applications that embed SDKs, and more careful vetting of customers who purchase proxy access.</p><p>For those of us who work with web scraping, the takedown is a reminder that the infrastructure we rely on has a supply chain. 
Understanding the supply chain, including its technical architecture, its business model, and its vulnerabilities, helps us make better decisions about which providers to trust and how to build resilient scraping systems.</p><p>The IPIDEA network may be rebuilt, rebranded, or replaced by competitors. However, the detailed technical analysis Google published provides the entire industry with greater visibility into how these networks operate. That visibility, more than the takedown itself, may be the most lasting impact.</p>]]></content:encoded></item></channel></rss>