<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Web Scraping Club]]></title><description><![CDATA[News, solutions and interviews about web scraping.
In this substack you will find weekly content about:
- Web Scraping techniques
- Interviews with key people in the industry
- Anti-bot info and countermeasures
- Real world examples and code]]></description><link>https://substack.thewebscraping.club</link><image><url>https://substackcdn.com/image/fetch/$s_!gJt2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1e343ec9-7946-4440-8c00-57209a1d99a1_1024x1024.png</url><title>The Web Scraping Club</title><link>https://substack.thewebscraping.club</link></image><generator>Substack</generator><lastBuildDate>Mon, 15 Jun 2026 00:27:51 GMT</lastBuildDate><atom:link href="https://substack.thewebscraping.club/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Web Scraping Club SRL]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[pier@thewebscraping.club]]></webMaster><itunes:owner><itunes:email><![CDATA[pier@thewebscraping.club]]></itunes:email><itunes:name><![CDATA[Pierluigi Vinciguerra]]></itunes:name></itunes:owner><itunes:author><![CDATA[Pierluigi Vinciguerra]]></itunes:author><googleplay:owner><![CDATA[pier@thewebscraping.club]]></googleplay:owner><googleplay:email><![CDATA[pier@thewebscraping.club]]></googleplay:email><googleplay:author><![CDATA[Pierluigi Vinciguerra]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Kameleo Docker: Exploring the Docker-Based Anti-Detect Browser]]></title><description><![CDATA[Kameleo is finally available on Linux. How? 
Via Docker!]]></description><link>https://substack.thewebscraping.club/p/kameleo-docker-exploring-the-docker</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/kameleo-docker-exploring-the-docker</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 14 Jun 2026 03:01:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9d36ab11-f536-4b56-829a-540e3ba41ad8_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Docker-based stealth browsers are quickly becoming a new standard for automation and scraping infrastructures. The main reason is that they can be integrated directly into CI pipelines or your own fleet of scalable stealth browsers in the cloud.</p><p>Kameleo Docker brings Kameleo&#8217;s anti-detect browser capabilities into a containerized setup, enabling production-ready automation with real fingerprinting and multi-profile isolation on Linux servers.</p><p>In this post, I&#8217;ll take a deep look at this solution and walk you through everything you need to know about it. 
By the end, you&#8217;ll understand what Kameleo Docker is, how its stealth browser approach works, how to set it up, and whether it&#8217;s actually worth trying.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your scraping needs.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><div><hr></div><h2>An Introduction to Kameleo Docker</h2><p>Dig into the world of Kameleo Docker!</p><h3>What Is Kameleo Docker?</h3><p><a href="https://kameleo.io/">Kameleo</a> is an anti-detect browser engineered to make browser sessions look like real user devices. Instead of exposing a generic automation fingerprint, it creates realistic browser identities by spoofing hardware, browser, and environment signals such as WebGL, Canvas, fonts, screen resolution, and geolocation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AV8o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AV8o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 424w, https://substackcdn.com/image/fetch/$s_!AV8o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 
848w, https://substackcdn.com/image/fetch/$s_!AV8o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 1272w, https://substackcdn.com/image/fetch/$s_!AV8o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AV8o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png" width="1456" height="577" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:577,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Kameleo Docker&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kameleo Docker" title="Kameleo Docker" srcset="https://substackcdn.com/image/fetch/$s_!AV8o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 424w, https://substackcdn.com/image/fetch/$s_!AV8o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 848w, 
https://substackcdn.com/image/fetch/$s_!AV8o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 1272w, https://substackcdn.com/image/fetch/$s_!AV8o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kameleo Docker</figcaption></figure></div><p>Kameleo Docker brings <a 
href="https://substack.thewebscraping.club/p/kameleo-anti-detect-browser">that same stealth stack</a> into a self-hosted, containerized deployment model. Rather than relying on a desktop app, you can run Kameleo inside Docker on Linux or Windows servers, CI pipelines, VPSs, or Kubernetes environments.</p><p>Playwright and Puppeteer can connect to the browsers in the container via CDP, and Selenium is supported too, meaning you can keep using your existing browser automation logic with minimal changes.</p><p>Further reading:</p><ul><li><p><em><a href="https://kameleo.io/docker">Kameleo Docker product page</a></em></p></li><li><p><em><a href="https://developer.kameleo.io/integrations/docker/">Kameleo Docker docs</a></em></p></li><li><p><em><a href="https://kameleo.io/blog/kameleo-on-linux-via-docker-what-we-built-what-broke-whats-next">Kameleo on Linux via Docker: What We Built, What Broke, What&#8217;s Next</a></em></p></li></ul><h3>Why Kameleo Docker Exists</h3><p>Cloud servers, Kubernetes clusters, CI/CD pipelines, and VPS environments are overwhelmingly Linux-native, making Linux compatibility a practical requirement for automation and scraping teams.</p><p>For years, this created a problem for Kameleo users, as the product supported only Windows and macOS. Because of that, teams wanting to run stealth browsers in production often had to rely on fragile workarounds.</p><p>Some deployed Windows virtual machines alongside Linux scraping stacks in AWS. Others used <a href="https://some-natalie.dev/blog/ssh-x11-forwarding/">X11-over-SSH</a> tunnels to remotely access browsers running on servers. 
These setups were difficult to maintain, resource-intensive, and far from ideal for scalable automation.</p><p>As explained in the <a href="https://kameleo.io/blog/kameleo-on-linux-via-docker-what-we-built-what-broke-whats-next">product announcement blog post</a>, customers started asking for a version of Kameleo that could run directly where their automation already lived.</p><p>Barnabas Szenasi, founder and lead engineer at Kameleo, put it this way when I met him at Prague Crawl 2026:</p><blockquote><p><em>&#8220;We could see from customer messages that a significant slice of our automation-first users were running Linux cloud servers and simply couldn&#8217;t use Kameleo at all... At Prague Crawl 2025, Tamas [Kameleo&#8217;s CEO] and I heard the same story from industry peers around the world: scraping pipelines were getting harder, and the need to run real browser environments instead of faking HTTP requests was growing fast.&#8221;</em></p></blockquote><h3>The Philosophy Behind the Project</h3><p>From the beginning, Kameleo&#8217;s philosophy has been simple: masking quality matters more than shipping quickly.</p><p>After all, <a href="https://substack.thewebscraping.club/p/anti-detect-browser-royal-rumble-comments">an anti-detect browser</a> is only useful if it can convincingly behave like a real device. That&#8217;s why Kameleo relies on fingerprints sourced from real-world device traffic and continuously tested against modern anti-bot systems.</p><p>That same quality-first mindset also shaped the Docker project. According to Szenasi, Linux support took longer than expected because the goal was never just to make Kameleo run in a container.</p><p>The objective was to reach the same masking quality users already expected on Windows and macOS. 
Shipping a functional but lower-fidelity Linux version would have compromised the product&#8217;s core standard.</p><blockquote><div><hr></div></blockquote><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo 
Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>Technical Architecture Behind Kameleo Docker</h2><p>Now that you know why the project exists, let me explain how it works and the engineering behind it.</p><h3>How Kameleo Docker Works</h3><p>Kameleo Docker runs inside either:</p><ul><li><p>a Linux-based container (Ubuntu 22.04), or</p></li><li><p>a Windows-based container (Windows Server Core 2022).</p></li></ul><p>When you pull the image, Docker downloads the correct variant based on your container configuration (Linux containers are the default in most environments, including <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a>). Regardless of the underlying platform, Kameleo exposes the same <a href="https://developer.kameleo.io/reference/api-reference/">Local API</a> for browser and profile management as the <a href="https://kameleo.io/downloads">desktop app</a>.</p><p>The process begins when you create a browser profile through the Local API. If you aren&#8217;t familiar with that concept, <a href="https://developer.kameleo.io/concepts/profiles/">Kameleo profiles</a> are reusable browser environments that bundle a complete browser fingerprint together with persistent browser state, such as cookies, browsing history, local storage, and bookmarks.</p><p>Profiles can also include user-defined settings like proxies, browser extensions, and startup preferences. 
Each profile is tied to a specific browser kernel and can be started, stopped, imported, or exported as needed.</p><h3>Architecture Overview</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sgEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sgEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 424w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 848w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sgEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png" width="1456" height="1019" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1019,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sgEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 424w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 848w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kameleo Docker&#8217;s architecture</figcaption></figure></div><p>Kameleo Docker separates browser execution from automation logic. The container hosts the stealth browsers, fingerprinting systems, and orchestration layer, while your automation scripts run independently on your machine, server, or orchestration platform.</p><p>At the center of the architecture is the Local API, exposed on port <em>5050</em>. This API handles profile creation, fingerprint selection, browser startup, lifecycle management, and more.</p><p>Behind the API sit <a href="https://developer.kameleo.io/concepts/kernels/">Kameleo&#8217;s browser kernels</a>:</p><ul><li><p><strong>Chroma</strong>: A Chromium-based engine.</p></li><li><p><strong>Junglefox</strong>: A Firefox-based engine.</p></li></ul><p>These kernels are modified with engine-level masking patches and connected to Kameleo&#8217;s continuously updated fingerprint database (more on this later). 
Since they are exposed through the same API, you can switch between them without making any code changes.</p><p>When a profile starts, Kameleo launches a browser session with the configured fingerprint and settings. Playwright and Puppeteer can then connect to the running browser through a WebSocket endpoint via <a href="https://chromedevtools.github.io/devtools-protocol/">Chrome DevTools Protocol (CDP)</a>.</p><p>In other words, your automation script stays outside the Docker container. The browser behaves as if it were running locally, while execution, fingerprint masking, and browser management happen entirely inside the Docker container.</p><p>Persistent storage is handled through Docker volumes. Profile data, downloaded browser kernels, and runtime state are stored outside the container, allowing environments to be recreated without losing configuration or repeatedly downloading browser components. This makes deployments easier to scale, recover, and reproduce.</p><div><hr></div><h2>Core Features of Kameleo Docker</h2><p>Time to explore the main features and capabilities provided by Kameleo Docker. For more information, <a href="https://developer.kameleo.io/integrations/docker/">read the official documentation</a>.</p><h3>Real Device Fingerprint Masking</h3><p>Kameleo fingerprints are derived from real-world device traffic, not synthetic templates. 
Each profile represents a coherent combination of OS, browser, hardware signals, and behavioral characteristics.</p><p>All fingerprint surfaces are kept internally consistent, including Canvas, WebGL, audio context, screen resolution, and fonts.</p><p><strong>Note</strong>: TLS fingerprint spoofing isn&#8217;t required, as Kameleo matches the browser kernel version precisely. The TLS stack remains the original, unmodified implementation shipped with the corresponding browser release.</p><p>The goal of the project isn&#8217;t to spoof everything, but to maintain realism across signals. Overriding too many surfaces increases the risk of inconsistencies, which detection systems can flag. For example, running a macOS fingerprint on a Windows host forces heavy compensation across system-level signals.</p><h3>Proxy Integration and Geo Consistency</h3><p>Each Kameleo profile can be assigned a dedicated proxy (including a rotating proxy), allowing IP-level isolation between browser identities.</p><p>Mismatches between IP geography and browser signals (language, timezone, WebRTC, and system locale) are a common detection vector. To address that, Kameleo provides <a href="https://help.kameleo.io/article/74-recommended-settings">automatic geo-location matching</a> to align the browser&#8217;s settings with the geographic location of the selected proxy IP address.</p><h3>Multi-Profile Isolation</h3><p>Kameleo Docker is built around strict profile isolation, as each browser profile runs as a fully independent environment. This separation opens the door to safe multi-accounting (referred to as &#8220;account management&#8221; in Kameleo terminology). Thanks to this feature, you can operate multiple identities simultaneously without cross-contamination of session data or signals.</p><h3>Linux-Specific Docker Image Features</h3><p>Compared to the Windows-based container, the Linux version of Kameleo Docker includes several additional features. 
These include:</p><ul><li><p><strong><a href="https://developer.kameleo.io/integrations/docker/#vnc-viewer-only-in-linux-based-container">Built-in VNC viewer</a></strong>: Allows you to monitor and interact with live browser sessions. This is especially useful for debugging automation, validating fingerprints, or troubleshooting rendering issues. You can access it through a browser on port <em>8080</em> or via native VNC clients such as RealVNC or TigerVNC on port <em>5900</em>. For security reasons, it&#8217;s disabled by default.</p></li><li><p><strong>Browser-Based Kameleo GUI</strong>: A lightweight browser-based GUI on port <em>80</em> (reach it at <em>http://localhost:80</em>). Unlike the desktop app, it offers reduced functionality and is primarily intended for quick inspection, basic profile management, and monitoring.</p></li><li><p><strong>Optional GPU acceleration</strong>: The Linux container <a href="https://developer.kameleo.io/integrations/docker/#gpu-support-on-linux">supports optional GPU acceleration</a> for graphics-heavy workloads such as WebGL or canvas-intensive websites. Intel/AMD GPUs can be mounted through <em>/dev/dri</em>, while NVIDIA GPUs are supported through the NVIDIA Container Toolkit. 
When no GPU is available, Kameleo falls back to software rendering.</p></li></ul><h2>Getting Started With Kameleo Docker: Step-by-Step Guide</h2><p>In this guided section, I&#8217;ll show you how to set up Kameleo Docker and use it for browser automation against the <a href="https://www.scrapingcourse.com/javascript-rendering">Scraping Course&#8217;s &#8220;JavaScript Rendering&#8221; page</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k9B9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k9B9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 424w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 848w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 1272w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k9B9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png" 
width="1456" height="785" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target Scraping Course &#8220;JavaScript Rendering&#8221; page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target Scraping Course &#8220;JavaScript Rendering&#8221; page" title="The target Scraping Course &#8220;JavaScript Rendering&#8221; page" srcset="https://substackcdn.com/image/fetch/$s_!k9B9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 424w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 848w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 1272w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The target Scraping Course &#8220;JavaScript Rendering&#8221; page</figcaption></figure></div><p>This is a sandbox environment for web scraping that simulates a real-world, JavaScript-rendered ecommerce page. 
It makes for a great testing target to validate the setup and see how Kameleo Docker behaves in a realistic automation scenario.</p><h3>Requirements and Prerequisites</h3><p>To get started with Kameleo Docker, make sure you have:</p><ul><li><p><a href="https://www.docker.com/get-started/">Docker installed and running locally</a>.</p></li><li><p>A <a href="https://help.kameleo.io/article/31-registering-a-kameleo-account">Kameleo account</a> with valid credentials.</p></li></ul><p>For more details on supported operating systems and memory requirements, <a href="https://developer.kameleo.io/integrations/docker/#prerequisites">refer to the official documentation</a>.</p><p>Since this tutorial uses Kameleo Docker with Playwright in Python, I&#8217;ll assume you already have a Python environment set up with <a href="https://playwright.dev/python/docs/library#installation">Playwright and its dependencies installed</a>.</p><p>To follow along with this tutorial section, I also recommend that you have:</p><ul><li><p><a href="https://docker-curriculum.com/">Basic Docker experience</a> (running containers, mounting volumes, and using compose files).</p></li><li><p>Familiarity with <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browser automation using Playwright</a>.</p></li></ul><div><hr></div><h3>Step #1: Create a Kameleo Account</h3><p>If you haven&#8217;t already, start by <a href="https://login.kameleo.io/Account/Register">creating a Kameleo account</a>. 
Fill out the sign-up form and enter the required information. Once registration is complete, <a href="https://kameleo.io/pricing">a </a><em><a href="https://kameleo.io/pricing">Free</a></em><a href="https://kameleo.io/pricing"> plan</a> will already be activated:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OVWu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OVWu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 424w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 848w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 1272w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OVWu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png" width="1456" height="787" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OVWu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 424w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 848w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 1272w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Free plan</figcaption></figure></div><p>Note that the <em>Free</em> plan is enough to use Kameleo Docker.</p><p><strong>Important</strong>: Kameleo credentials are required for the container to authenticate successfully, download browser kernels, and start correctly.</p><h3>Step #2: Start the Docker Container</h3><p>With your account ready, the next step is to pull and run <a href="https://hub.docker.com/r/kameleo/kameleo-app">the Kameleo Docker image</a>.</p><p>Remember that Kameleo ships as a multi-platform Docker image, supporting both Linux-based and Windows-based containers. To download and start the Linux container version of Kameleo, run:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">docker run --platform linux/amd64 \
    --shm-size=2g \
    -p 5050:5050 \
    -e EMAIL='&lt;YOUR_KAMELEO_EMAIL&gt;' \
    -e PASSWORD='&lt;YOUR_KAMELEO_PASSWORD&gt;' \
    -v kameleo-data:/data \
    kameleo/kameleo-app:latest</code></pre></div><p><strong>Note 1</strong>: If you run this command in PowerShell, replace each trailing &#8220;\&#8221; with a backtick &#8220;`&#8221; to continue the command across lines.</p><p><strong>Note 2</strong>: To expose the web GUI included in the Linux container of Kameleo Docker, add the <em>-p 80:80</em> argument to your <em>docker run</em> command.</p><p>Here&#8217;s what matters in the above command:</p><ul><li><p><em>--platform linux/amd64</em> ensures Docker pulls the Linux-based image variant.</p></li><li><p><em>--shm-size=2g</em> is required for stable browser execution (Docker&#8217;s default shared memory size of 64MB is too small for a browser).</p></li><li><p><em>-p 5050:5050</em> publishes the Kameleo Local API on port <em>5050</em> of the host.</p></li><li><p><em>-v kameleo-data:/data</em> creates a named volume that persists browser kernels and profiles across restarts.</p></li><li><p><em>EMAIL</em> and <em>PASSWORD</em> authenticate your Kameleo account and enable kernel downloads.</p></li></ul><p>This is the output you should get:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lH45!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lH45!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 424w, https://substackcdn.com/image/fetch/$s_!lH45!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 848w, 
https://substackcdn.com/image/fetch/$s_!lH45!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 1272w, https://substackcdn.com/image/fetch/$s_!lH45!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lH45!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png" width="1456" height="322" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Retrieving the Kameleo Docker image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Retrieving the Kameleo Docker image" title="Retrieving the Kameleo Docker image" srcset="https://substackcdn.com/image/fetch/$s_!lH45!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 424w, https://substackcdn.com/image/fetch/$s_!lH45!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 848w, 
https://substackcdn.com/image/fetch/$s_!lH45!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 1272w, https://substackcdn.com/image/fetch/$s_!lH45!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Retrieving the Kameleo Docker image</figcaption></figure></div><p>The Kameleo Docker image should now be downloaded and launched on your system. Cool!</p><h3>Step #3: Verify the Service</h3><p>Once you run the image, Kameleo Docker will:</p><ol><li><p>Start the Local API on port <em>5050</em>.</p></li><li><p>Authenticate using your credentials.</p></li><li><p>Download required browser kernels (first run only).</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pYxi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pYxi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 424w, https://substackcdn.com/image/fetch/$s_!pYxi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 848w, 
https://substackcdn.com/image/fetch/$s_!pYxi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 1272w, https://substackcdn.com/image/fetch/$s_!pYxi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pYxi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png" width="1456" height="82" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:82,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Kameleo Docker image startup logs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Kameleo Docker image startup logs" title="The Kameleo Docker image startup logs" srcset="https://substackcdn.com/image/fetch/$s_!pYxi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 424w, https://substackcdn.com/image/fetch/$s_!pYxi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 848w, 
https://substackcdn.com/image/fetch/$s_!pYxi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 1272w, https://substackcdn.com/image/fetch/$s_!pYxi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Kameleo Docker image startup logs</figcaption></figure></div><p>To confirm everything is running correctly, visit the following URL in your browser:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">http://localhost:5050/swagger</code></pre></div><p>You should see the Swagger UI for the Kameleo Local API:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oAMx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oAMx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 424w, https://substackcdn.com/image/fetch/$s_!oAMx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 848w, 
https://substackcdn.com/image/fetch/$s_!oAMx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!oAMx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oAMx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png" width="1456" height="782" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37e97028-9717-472e-850f-dcced084d674_3024x1624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:782,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Swagger UI for the Kameleo Local API at &#8220;http://localhost:5050/swagger&#8221;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Swagger UI for the Kameleo Local API at &#8220;http://localhost:5050/swagger&#8221;" title="The Swagger UI for the Kameleo Local API at &#8220;http://localhost:5050/swagger&#8221;" srcset="https://substackcdn.com/image/fetch/$s_!oAMx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 424w, 
https://substackcdn.com/image/fetch/$s_!oAMx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 848w, https://substackcdn.com/image/fetch/$s_!oAMx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!oAMx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Swagger UI for the Kameleo Local API at &#8220;http://localhost:5050/swagger&#8221;</figcaption></figure></div><p>You now have a local Kameleo instance ready for automation with Playwright, Puppeteer, or Selenium. Great!</p><h3>Step #4: Download the SDK and Create Your First Profile</h3><p>Now that Kameleo Docker is running, you can interact with it through the APIs exposed at <em>http://localhost:5050</em>. The next step is to proceed with the <a href="https://developer.kameleo.io/getting-started/quickstart/">usual Kameleo setup</a> by creating a profile.</p><p>Assuming you already have a Python environment with Playwright installed, start by <a href="https://pypi.org/project/kameleo.local-api-client/">installing the Kameleo SDK</a>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">pip install kameleo-local-api-client</code></pre></div><p>Then, in your Playwright script, initialize the Kameleo API client and generate a browser profile:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from kameleo.local_api_client import KameleoLocalApiClient
from kameleo.local_api_client.models import CreateProfileRequest

# Connect the Kameleo API client to the local Kameleo Docker API
client = KameleoLocalApiClient(endpoint="http://localhost:5050")
# Search for a real-world fingerprint and create a Kameleo profile based on it
fps = client.fingerprint.search_fingerprints(
    device_type="desktop",
    os_family="windows",
    browser_product="chrome",
    browser_version="&gt;145",
)
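# Defensive check (an addition, not part of the original snippet): stop early
# with a clear error if the search returned no fingerprints, instead of
# failing with an IndexError on fps[0] below
if not fps:
    raise RuntimeError("No fingerprint matched the search filters")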
profile = client.profile.create_profile(
    CreateProfileRequest(fingerprint_id=fps[0].id, name="twsc demo")
)</code></pre></div><p>The above snippet connects the Kameleo API client from the SDK to the local Kameleo Docker APIs. It retrieves a <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">realistic browser fingerprint</a> from the database and creates a persistent browser profile called &#8220;twsc demo&#8221; based on it. In this case, the fingerprint profile is for a desktop Chrome browser (version &gt;145) running on Windows.</p><p>Run the script above. If you started the Linux container of Kameleo Docker while mapping port <em>80</em> for the web GUI, then you&#8217;ll be able to see the &#8220;twsc demo&#8221; profile at <em>http://localhost:80</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7VeS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7VeS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 424w, https://substackcdn.com/image/fetch/$s_!7VeS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 848w, https://substackcdn.com/image/fetch/$s_!7VeS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7VeS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7VeS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png" width="1456" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the &#8220;twsc demo&#8221; profile&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the &#8220;twsc demo&#8221; profile" title="Note the &#8220;twsc demo&#8221; profile" srcset="https://substackcdn.com/image/fetch/$s_!7VeS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 424w, https://substackcdn.com/image/fetch/$s_!7VeS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 848w, https://substackcdn.com/image/fetch/$s_!7VeS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7VeS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the &#8220;twsc demo&#8221; profile</figcaption></figure></div><h3>Step #5: Connect With Playwright</h3><p>You can now connect Playwright to the Kameleo profile created above <a href="https://substack.thewebscraping.club/p/webdriver-vs-cdp-vs-bidi">via CDP</a> using the following WebSocket URL:</p><div class="highlighted_code_block" 
data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ws://localhost:5050/playwright/&lt;KAMELEO_PROFILE_ID&gt;</code></pre></div><p>To do so, pass that URL to <a href="https://playwright.dev/python/docs/api/class-browsertype#browser-type-connect-over-cdp">Playwright&#8217;s </a><em><a href="https://playwright.dev/python/docs/api/class-browsertype#browser-type-connect-over-cdp">connect_over_cdp()</a></em> method:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from playwright.sync_api import sync_playwright

# Kameleo profile creation...

# Connect Playwright to the Kameleo instance based on the configured profile
ws_endpoint = f"ws://localhost:5050/playwright/{profile.id}"
with sync_playwright() as playwright:
    browser = playwright.chromium.connect_over_cdp(endpoint_url=ws_endpoint, timeout=90_000)

    # Regular Playwright automation logic...</code></pre></div><p>Wonderful! Playwright is attached to the browser session managed by Kameleo Docker. You can now automate it using standard Playwright APIs as if it were a regular local Chromium instance.</p><h3>Step #6: Implement the Automation Logic</h3><p>To achieve the scraping goal, begin by inspecting the page to study its DOM structure:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zk-5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zk-5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 424w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 848w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 1272w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zk-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png" 
width="1456" height="871" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:871,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Inspecting a product on the page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Inspecting a product on the page" title="Inspecting a product on the page" srcset="https://substackcdn.com/image/fetch/$s_!zk-5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 424w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 848w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 1272w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Inspecting a product on the page</figcaption></figure></div><p>Then, apply the following Playwright logic (connected to a Kameleo profile) to automate scraping on the JavaScript-rendered page:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># Connect Playwright to the Kameleo instance based on the configured profile
ws_endpoint = f"ws://localhost:5050/playwright/{profile.id}"
with sync_playwright() as playwright:
    browser = playwright.chromium.connect_over_cdp(endpoint_url=ws_endpoint, timeout=90_000)

    # Open a new page
    context = browser.contexts[0]
    page = context.new_page()

    # Visit target site
    page.goto("https://www.scrapingcourse.com/javascript-rendering")

    # Where to store the scraped data
    products = []

    # Wait for products to render
    page.wait_for_selector(".product-item")

    # Locate all product items
    product_elements = page.locator(".product-item")

    for product_element in product_elements.all():
        # Extract the product data
        name = product_element.locator(".product-name").inner_text()
        price = product_element.locator(".product-price").inner_text()
        image = product_element.locator("img.product-image").get_attribute("src")
        link = product_element.locator("a.product-link").get_attribute("href")

        # Populate a product object with the scraped data
        product = {
            "name": name,
            "price": price,
            "image": image,
            "url": link
        }
        # Append it to the products list
        products.append(product)</code></pre></div><p>The above snippet instructs the controlled browser to visit the target page, waits for product elements to load, then iterates through each product DOM node to extract structured fields (name, price, image, URL) and stores them in a Python list for downstream processing.</p><p>The Kameleo-powered automation script is almost complete. Only one step remains!</p><h3>Step #7: Stop the Kameleo Profile</h3><p>Normally, in a Playwright scenario, you would need to call <em>browser.close()</em> to terminate the browser session and release its resources.</p><p>In Kameleo, that&#8217;s not required. Instead, you only need to call:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">client.profile.stop_profile(profile_id=profile.id)</code></pre></div><p>The above line of code sends a close command to the browser via CDP. Once the browser actually stops, the Kameleo profile is terminated, too. This ensures that all resources associated with both the browser and the running profile are properly released.</p><h3>Step #8: Run the Script</h3><p>The final Playwright automation script, connecting via CDP to the stealth browser instance exposed by Kameleo Docker, will contain:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># pip install playwright kameleo-local-api-client

from kameleo.local_api_client import KameleoLocalApiClient
from kameleo.local_api_client.models import CreateProfileRequest
from playwright.sync_api import sync_playwright

# Connect the Kameleo API client to the local Kameleo Docker API
client = KameleoLocalApiClient(endpoint="http://localhost:5050")
# Search for a real-world fingerprint and create a Kameleo profile based on it
fps = client.fingerprint.search_fingerprints(
    device_type="desktop",
    os_family="windows",
    browser_product="chrome",
    browser_version="&gt;145",
)
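# Defensive check (an addition, not part of the original script): stop early
# with a clear error if the search returned no fingerprints, instead of
# failing with an IndexError on fps[0] below
if not fps:
    raise RuntimeError("No fingerprint matched the search filters")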
profile = client.profile.create_profile(
    CreateProfileRequest(fingerprint_id=fps[0].id, name="twsc demo")
)

# Connect Playwright to the Kameleo instance based on the configured profile
ws_endpoint = f"ws://localhost:5050/playwright/{profile.id}"
with sync_playwright() as playwright:
    browser = playwright.chromium.connect_over_cdp(endpoint_url=ws_endpoint, timeout=90_000)

    # Open a new page
    context = browser.contexts[0]
    page = context.new_page()

    # Visit target site
    page.goto("https://www.scrapingcourse.com/javascript-rendering")

    # Where to store the scraped data
    products = []

    # Wait for products to render
    page.wait_for_selector(".product-item")

    # Locate all product items
    product_elements = page.locator(".product-item")

    for product_element in product_elements.all():
        # Extract the product data
        name = product_element.locator(".product-name").inner_text()
        price = product_element.locator(".product-price").inner_text()
        image = product_element.locator("img.product-image").get_attribute("src")
        link = product_element.locator("a.product-link").get_attribute("href")

        # Populate a product object with the scraped data
        product = {
            "name": name,
            "price": price,
            "image": image,
            "url": link
        }
        # Append it to the products list
        products.append(product)

    # Print the scraped products
    for product in products:
        print(product)

# Stop the Kameleo profile
client.profile.stop_profile(profile_id=profile.id)</code></pre></div><p>Execute the script, and you should see output similar to this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tH8f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tH8f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 424w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 848w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 1272w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tH8f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png" width="1456" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output produced by the automation script&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output produced by the automation script" title="The output produced by the automation script" srcset="https://substackcdn.com/image/fetch/$s_!tH8f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 424w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 848w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 1272w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The output produced by the automation script</figcaption></figure></div><p>Notice how the script successfully scraped product data from the JavaScript-rendered page.</p><p>Once execution completes, open the Kameleo web GUI, and you&#8217;ll notice that the &#8220;twsc demo&#8221; profile is now marked as &#8220;TERMINATED&#8221;:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zZCP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!zZCP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 424w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 848w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 1272w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zZCP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png" width="1456" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the updated status of the Kameleo profile&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the updated status of the Kameleo profile" title="Note the updated status of the Kameleo profile" 
srcset="https://substackcdn.com/image/fetch/$s_!zZCP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 424w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 848w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 1272w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the updated status of the Kameleo profile</figcaption></figure></div><p>That doesn&#8217;t mean the profile is gone forever. Quite the opposite!</p><p>Kameleo profiles are reusable, meaning you can retrieve and start them again later to continue the browsing session with the same fingerprint and persisted state. I&#8217;ll cover exactly how to do that in a dedicated FAQ.</p><p>Mission complete! You just learned how to use Kameleo Docker for Playwright automation. Using very similar logic, you can drive Kameleo Docker from Puppeteer, Selenium, or any other CDP-compatible tool, in both Python and JavaScript.</p><h3>Pricing Model</h3><p>Kameleo Docker is included across all plans at no additional cost. You get the same core limits (concurrent browsers, number of profiles, and browser usage time) regardless of whether you run the desktop app or the containerized version. <a href="https://kameleo.io/pricing">Take a look at the official pricing page</a> for more information.</p><h2>Anti-Bot Performance Benchmarks</h2><p>To test Kameleo Docker, I ran a simple script against one page protected by each of the major anti-bot detection systems. 
The results are shown below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kcyp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kcyp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 424w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 848w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 1272w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kcyp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png" width="1456" height="461" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:461,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vanilla Playwright (headless) vs vanilla Playwright (headful) vs Kameleo Docker&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vanilla Playwright (headless) vs vanilla Playwright (headful) vs Kameleo Docker" title="Vanilla Playwright (headless) vs vanilla Playwright (headful) vs Kameleo Docker" srcset="https://substackcdn.com/image/fetch/$s_!kcyp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 424w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 848w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 1272w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Vanilla Playwright (headless) vs vanilla Playwright (headful) vs Kameleo Docker</figcaption></figure></div><p><strong>Note</strong>: All tests were performed locally using my ISP&#8217;s residential IP address.</p><p>As shown above, in this basic experiment, Kameleo Docker achieved a 100% success rate. 
In contrast, Playwright consistently failed in headless mode and, in some cases, also struggled in headful mode.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Final Thoughts on Kameleo Docker</h3><p><em>What stood out when I met Barnabas at <a href="https://www.praguecrawl.com/">Prague Crawl 2026</a> (see you next year &#128521;) was the clear passion the team has for the project, along with their focus on quality and continuous improvement.</em></p><p><em>At the same time, when testing new products, especially technical and complex ones like Kameleo Docker, you usually stumble across bugs or unexpected behavior. I can confidently say that this wasn&#8217;t the case at all here. Everything ran smoothly from the beginning, and I didn&#8217;t encounter any issues.</em></p><p><em>On top of that, the benchmark results are promising, and I didn&#8217;t notice any significant performance lag. Thus, my honest takeaway is simple: if you&#8217;re looking for a production-ready, containerized stealth browser, or you&#8217;re simply passionate about automation and scraping, consider giving Kameleo Docker a try!</em></p><p>In this article, I covered what the project is about, what it offers, how it works, and how to use it. As always, remember to use Kameleo Docker only for legal and <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical web scraping and automation</a>. 
Until next time!</p><div><hr></div><h2>FAQ</h2><h3>Is Kameleo Docker different from the desktop app?</h3><p>Kameleo Docker differs from the desktop app mainly in deployment. Instead of a local GUI, it runs as a containerized service on both Linux and Windows servers. Feature parity is largely preserved, including profiles, fingerprinting, and browser engines.</p><h3>Can I reuse already created profiles in Kameleo Docker?</h3><p>Yes! Profiles can be reused by retrieving the full list of profiles, filtering by name (or ID, if you know it), and then starting the desired profile. For example, to reuse the &#8220;twsc demo&#8221; profile, write:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from kameleo.local_api_client import KameleoLocalApiClient
from kameleo.local_api_client.models import ProfileLifetimeState

# Create a client for the Kameleo Local API
client = KameleoLocalApiClient(endpoint="http://localhost:5050")

# Fetch all available profiles
profiles = client.profile.list_profiles()

# Find the profile with the specific name
target_name = "twsc demo"
profile = next((p for p in profiles if p.name == target_name), None)

# Check if the profile was found
if profile:
    # Start the existing profile if it isn't already running
    if profile.status.lifetime_state != ProfileLifetimeState.RUNNING:
        client.profile.start_profile(profile.id)</code></pre></div><h3>Can I use Kameleo Docker in my CI/CD?</h3><p>Kameleo Docker fits naturally into CI/CD pipelines by running as a disposable, reproducible container in build or test stages. You can spin up browsers on demand, run automated flows, and tear them down after execution. Configuration is <a href="https://developer.kameleo.io/integrations/docker/#example-with-docker-compose">typically handled via Docker Compose</a>.</p><h3>Does Kameleo Docker support proxy integration?</h3><p>Yes! HTTP, HTTPS, and SOCKS proxies can be configured at profile creation time, <a href="https://developer.kameleo.io/tutorials/using-proxy-servers/">as explained in the documentation</a>.</p><h3>Can Kameleo Docker scale to thousands of browsers?</h3><p>Kameleo Docker supports horizontal scaling through standard orchestration tools. You can run multiple containers across clusters using Kubernetes or <a href="https://developer.kameleo.io/integrations/docker/#aws-ecs-support">AWS ECS</a>, each managing independent browser instances.</p><h3>How does Firefox automation work in Kameleo Docker?</h3><p>Kameleo Docker supports Firefox-based automation through the Junglefox engine. Because Playwright cannot connect directly to Firefox-based sessions, Kameleo provides a <em><a href="https://developer.kameleo.io/integrations/docker/#using-junglefox-playwright-pw-bridge">pw-bridge</a></em><a href="https://developer.kameleo.io/integrations/docker/#using-junglefox-playwright-pw-bridge"> helper</a> that acts as a compatibility layer. This component translates Playwright connections into the correct browser session, allowing standard automation scripts to run unchanged while still using Firefox-based fingerprint profiles.</p><div><hr></div><p><em>Did you like this article? 
Share it with someone who might find it useful and get a discount on paid plans.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/kameleo-docker-exploring-the-docker?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.thewebscraping.club/p/kameleo-docker-exploring-the-docker?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><p></p>]]></content:encoded></item><item><title><![CDATA[THE LAB #107: Reversing Shopee's native crypto with Ghidra]]></title><description><![CDATA[Shopee hides its crypto in a native library. We read it in Ghidra and rebuild it in Python, byte for byte.]]></description><link>https://substack.thewebscraping.club/p/reversing-shopee-app-ghidra</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/reversing-shopee-app-ghidra</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 11 Jun 2026 22:19:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/988c1c72-086f-45da-89ea-a3074ccdcc0c_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Shopee is one of the largest marketplaces in Southeast Asia, and like most big apps, its mobile API is a better scraping target than its website. The app talks to the backend in plain JSON over HTTPS, the endpoints are stable, and the anti-bot layer is usually lighter than the one guarding the web frontend. 
We covered the easy version of this in <a href="https://substack.thewebscraping.club/p/the-lab-12-reverse-engineering-mobile">The Lab #12</a>, where Charles and JADX were enough to read an Android app&#8217;s traffic and replay it.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Shopee does not hand you the easy version. 
Capture a request, replay it, and the backend answers with HTTP 418 and a security error code. Every API call carries a set of anti-fraud headers, and the code that builds them is not in the Java you can read with JADX. It sits in native code, in compiled <code>.so</code> libraries, which is exactly where traffic interception and a Java decompiler stop being useful. Open the app in JADX and the signing method is there in name only, declared <code>native</code>, with its body on the far side in ARM machine code.</p><p>This is a two-part investigation into how Shopee signs its API requests and how you reproduce that signing yourself. The strategy is the one that works on most hardened apps. You locate the native security libraries, open them in a disassembler, and turn what they do back into something you control. When a library is readable crypto, you reimplement it in Python and sign offline, at any volume, with no app in the loop. When it is a bytecode virtual machine you cannot practically rewrite, you keep the app running and drive its own signer as an oracle. We chose this route because it is the one that scales. An offline signer, or an oracle you call, runs inside your scraper on a server. A rooted phone you have to babysit does not.</p><p>This first part is the foundation. We take one of Shopee&#8217;s native libraries, <code>libshopeeaegis.so</code>, reverse it end to end with Ghidra and rebuild it in Python. Reading the decompiled code identifies every operation as textbook crypto, and our rebuild reproduces those algorithms byte for byte. It is the readable case, the kind you win cleanly, and the clearest worked example of the method. The second part takes on the harder library, the one that computes the per-request signature, and gets past it with the oracle approach.</p><p>What you take from each part depends on your goal. If Shopee is your target, the payoff is the full picture of its request signing across both parts. 
If you scrape other apps, the method matters more than the marketplace. Most apps that protect their API at all push the work into a native library, and a large share of those are plain, readable crypto you can reproduce. We work it out on Aegis here, and it is the same move on the next app you open.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://visit.decodo.com/WyQ3mA" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>The tools</h2><p>We use four tools, each doing one job.</p><p><a href="https://github.com/androguard/androguard">androguard</a> is a Python library for static APK analysis. We use it for fast recon. It lists the native libraries an app ships and finds which classes declare <code>native</code> methods. It does not give you readable source. It gives you structure you can script.</p><p><a href="https://github.com/skylot/jadx">JADX</a> decompiles Dalvik bytecode back to Java. It is how you read the managed side of the app and find the exact class and method that crosses into native code. JADX stops at the <code>native</code> keyword, which is the handoff point to the next tool.</p><p><a href="https://github.com/NationalSecurityAgency/ghidra">Ghidra</a> is the NSA&#8217;s open source reverse engineering framework. It disassembles a <code>.so</code> and decompiles it to pseudo C. It is the only tool here that can read native code, and it is the one this article leans on.</p><p><a href="https://github.com/frida/frida">Frida</a> injects a JavaScript engine into a running process so you can hook and call functions live. We use it to run the app under instrumentation and confirm our static reading against what the app actually does.</p><p>JADX and androguard read the managed code. Ghidra reads the native code. Frida watches the code run. 
The native library is the one piece only Ghidra can open, so the work centers there.</p><h2>Modeling the app&#8217;s defenses</h2><p>Before opening anything, it helps to name the layers, because Shopee has several and only one is our target here.</p><p>The managed layer is the Java and Kotlin code. It builds requests, attaches headers, and calls into native methods. JADX reads it.</p><p>The native layer is a set of <code>.so</code> libraries the app loads. Pull the arm64 split out of the APK with androguard and the security-relevant ones stand out by name. They are <code>libshopeeaegis.so</code>, <code>libshpssdk.so</code>, and <code>libBkeBizSecurity.so</code>, plus <code>libjnihook.so</code> and <code>libshook.so</code>. The last two are a hooking framework and an anti-hook layer, which means the app actively watches for instrumentation. That matters for Frida later.</p><p>The request-signing layer sits on top. Two okhttp interceptors, <code>com.shopee.app.network.antifraud.b</code> and <code>.d</code> (they call themselves <code>SecurityNewSapInterceptor</code> and <code>SecurityNewSapPostInterceptor</code>), attach the anti-fraud headers <code>af-ac-enc-sz-token</code> and <code>x-sap-ri</code> to API requests. The values they attach come from <code>libshpssdk.so</code>, the Shopee Security SDK.</p><p>We target one layer, <code>libshopeeaegis.so</code>, a general-purpose crypto library the app calls for specific operations. The request signer in <code>libshpssdk.so</code> stays out of scope. It is a bytecode virtual machine, a harder problem that we handle separately, and reproducing the <code>af-ac</code> headers is not the promise here. The promise is that you can take <code>libshopeeaegis.so</code>, understand every operation it performs, and reproduce it byte for byte in Python.</p><p>One detail decides whether that promise holds. <code>libshopeeaegis.so</code> loads only when the app needs it, so it is not present at idle. 
We watched the process maps over a minute of normal browsing and the library never appeared. The crypto we are about to reverse is a toolbox the app reaches for in certain flows, not the thing running on every request.</p><div><hr></div><blockquote><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://byteful.com/?promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://byteful.com/?promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><h2>Getting the library and finding the door</h2><p>We pulled Shopee PH 3.75.24 (<code>com.shopee.ph</code>) as an XAPK and unzipped it. The native libraries are not in <code>base.apk</code>. For a split bundle they live in <code>config.arm64_v8a.apk</code>. Listing the <code>.so</code> files with androguard and <code>unzip -l</code> shows that <code>libshopeeaegis.so</code> is a small one at 280 KB, which is a good sign. Small means little room for a heavy obfuscator.</p><p>androguard answers the first question, which library to open. It does not answer the second, how the app calls it. For that we go to JADX and find the class on the Java side. The library registers its native methods against <code>com.shopee.sz.reinforce.Aegis</code>. The class exposes an overloaded method <code>fire</code>, declared <code>native</code>. 
Two of the overloads matter:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;820ec8fb-46fc-4f0c-9553-0dc70f676be0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">native byte[] fire(int mode, byte[] data)
native byte[] fire(int mode, byte[] data, byte[] key)</code></pre></div><p>This is the door. The first argument is an integer mode, followed by one or two byte arrays. The return value is a byte array. JADX cannot show what <code>fire</code> does, because the body is in the <code>.so</code>. So we open the <code>.so</code>.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Reading the library with Ghidra</h2><p>We ran Ghidra 12.1.2 in headless mode. It runs without a GUI, it scripts cleanly, and it repeats exactly. The workflow is documented in our Ghidra tool skill, which you can use in Claude Code, as I did for this test. In short, you import the <code>.so</code>, let auto analysis run, then run a script that decompiles functions to a file.</p><pre><code><code>support/analyzeHeadless /tmp/proj aegis \
  -import config.arm64_v8a/lib/arm64-v8a/libshopeeaegis.so \
  -scriptPath ./scripts \
  -postScript DecompileExport.java out.c \
  -overwrite</code></code></pre><p>Auto analysis finished in nine seconds and the decompiler produced 674 functions with zero failures. That number alone tells you this is not a packed or virtualized binary. A protected library fights the decompiler; this one did not.</p><p>The first useful function is <code>JNI_OnLoad</code>, which every JNI library runs at load time. Read its pseudo C and it looks up the class <code>com/shopee/sz/reinforce/Aegis</code> and calls <code>RegisterNatives</code> with two methods. That confirms the door from the Java side and tells us the native functions are registered dynamically rather than exported under <code>Java_*</code> names. Dynamic registration is a mild form of hiding, and it is exactly what Ghidra&#8217;s JNI handling and a <code>RegisterNatives</code> trace are for.</p><p>The C++ symbols survived. That is the break that makes this library readable. The class is <code>Aegis</code>, with methods named <code>missileFire</code>, <code>missileCount</code>, <code>prism</code>, <code>snowDon</code>, <code>tugWar</code>, and <code>parse</code>. There is a second class, <code>TeslaModel</code>, with <code>model_3</code>, <code>model_a</code>, <code>model_b</code>, <code>model_c</code>, <code>model_e</code>, <code>model_s</code>, <code>model_x</code>, <code>model_y</code>, and <code>getNuremberg</code>. The names are deliberately silly, a Tesla and military theme, but they are real symbols, and the structure is intact.</p><p>Follow the call chain from the registered native function. The dispatcher is <code>Aegis::prism</code>, a plain <code>switch</code> on the mode integer:</p><pre><code><code>switch(param_1) {
case 0:  model_3(...)               // one input
case 1:  model_x(key, input, ...)   // keyed
case 2:  model_x(...); model_3(...) // keyed, then case 0
case 3:  model_y(...); model_3(...)
case 4:  model_e(...)
case 5:  model_a(...)
case 6:  model_b(...)
case 7:  model_s(...); model_3(...)
case 8:  model_c(...); model_3(...)
}</code></code></pre><p>One native call selects one of nine operations by an integer, and some operations are a keyed primitive followed by <code>model_3</code>. To name each operation we read two things, the output size and the primitive body.</p><p>The output size comes from <code>TeslaModel::getNuremberg(mode, len)</code>, which <code>missileFire</code> calls to size the output buffer before doing the work. It returns 16 for mode 4, 32 for mode 5, 64 for mode 6, and 20 for mode 8. Those are the digest sizes of MD5, SHA-256, SHA-512, and SHA-1. For mode 0 it returns the Base64 expansion of the input length. The size function alone half-names the table.</p><p>The bodies confirm the rest, and here the silly names get helpful, because the renamed primitives kept their original suffixes. <code>model3_autopilot</code> is a textbook Base64 encoder. It reads three bytes, writes four, and pads with <code>0x3d</code>, which is the <code>=</code> character. <code>modelx_autopilot_cbc</code> is AES in CBC mode, recognizable because it XORs each 16 byte block with the previous ciphertext block before the round function. The hash contexts are renamed with a <code>phantom</code> and <code>F</code> theme but keep the gnulib <code>_init_ctx</code> / <code>_process_bytes</code> / <code>_finish_ctx</code> shape. <code>phantom1</code> is SHA-1, <code>phantom256</code> is SHA-256, <code>InitF22</code> is SHA-512. And <code>phantom1</code> as called by <code>model_c</code> is the HMAC form. It XORs the key with <code>0x36</code> for the inner pad and <code>0x5c</code> for the outer pad over a 64 byte block, which is the HMAC construction.</p><p>Two of the keyed modes turned out not to be ciphers at all. <code>model_s</code> calls <code>phantom256</code> with a key and a message and returns 32 bytes, so it is HMAC-SHA256. <code>model_c</code> calls <code>phantom1</code> the same way and returns 20 bytes, so it is HMAC-SHA1. Reading the bodies kept us honest here. 
From the signatures alone we had guessed AES.</p><p>That gives the full table.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x9os!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x9os!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 424w, https://substackcdn.com/image/fetch/$s_!x9os!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 848w, https://substackcdn.com/image/fetch/$s_!x9os!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 1272w, https://substackcdn.com/image/fetch/$s_!x9os!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x9os!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png" width="719" height="403" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:403,&quot;width&quot;:719,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42068,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/201664746?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x9os!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 424w, https://substackcdn.com/image/fetch/$s_!x9os!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 848w, https://substackcdn.com/image/fetch/$s_!x9os!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 1272w, https://substackcdn.com/image/fetch/$s_!x9os!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><code>model_x</code> pads with PKCS7 to a 16-byte boundary, then runs AES-CBC. The key length sets the variant, and a 16-byte key gives AES-128. <code>model_y</code> does the same but writes the IV in front of the ciphertext, the standard prepend-the-IV pattern, before the Base64 in mode 3.</p><p>One value is not in the file. The CBC IV is a fixed 16-byte constant the library keeps at a <code>.bss</code> address. <code>.bss</code> stores no bytes in the file; the loader zero-fills it and the code writes to it at runtime, so the IV is set when the library initializes and you cannot read it statically. For the hash, HMAC, and Base64 modes that does not matter, because their output is fully determined by the input and key. 
For the three AES modes, it means byte-identical output needs the real IV, which you can read from the live process once the library has loaded.</p><p>As always, the code for the Python reimplementation we&#8217;re showing now can be found&nbsp;<a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder&nbsp;</a><strong><a href="https://github.com/TheWebScrapingClub/thelab">107.</a></strong><a href="https://github.com/TheWebScrapingClub/thelab">SHOPEE-GHIDRA</a><strong><a href="https://github.com/TheWebScrapingClub/thelab">.</a></strong></p>
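<p>To make the table concrete, here is a minimal Python sketch of the modes whose output depends only on the input and key. It uses only the standard library and is an illustration under stated assumptions, not the repository code: the name <code>fire_sketch</code>, its argument order, and the helper <code>pkcs7_pad</code> are hypothetical. The three AES modes are left out because they need an AES implementation and the runtime IV. Modes 7 and 8 apply Base64 after the HMAC, following the <code>switch</code> listing; since the size function reports 20 bytes for mode 8, it is worth checking that last step against the live library.</p>

```python
import base64
import hashlib
import hmac


def pkcs7_pad(data: bytes, block: int = 16) -> bytes:
    """PKCS7 padding as described for model_x (hypothetical helper name)."""
    n = block - len(data) % block  # always 1..block, a full block if aligned
    return data + bytes([n]) * n


def fire_sketch(mode: int, data: bytes, key: bytes = b"") -> bytes:
    """Hypothetical stand-in for the native call, non-AES modes only."""
    if mode == 0:  # model_3: Base64
        return base64.b64encode(data)
    if mode == 4:  # model_e: 16-byte digest, MD5
        return hashlib.md5(data).digest()
    if mode == 5:  # model_a: 32-byte digest, SHA-256
        return hashlib.sha256(data).digest()
    if mode == 6:  # model_b: 64-byte digest, SHA-512
        return hashlib.sha512(data).digest()
    if mode == 7:  # model_s then model_3: HMAC-SHA256, then Base64 (assumed)
        return base64.b64encode(hmac.new(key, data, hashlib.sha256).digest())
    if mode == 8:  # model_c then model_3: HMAC-SHA1, then Base64 (assumed)
        return base64.b64encode(hmac.new(key, data, hashlib.sha1).digest())
    # Modes 1, 2, 3 are AES-CBC and need the IV read from the live process.
    raise NotImplementedError("AES modes need the runtime IV")


if __name__ == "__main__":
    print(fire_sketch(0, b"abc"))
    print(fire_sketch(5, b"abc").hex())
    print(fire_sketch(8, b"message", b"secret"))
```

<p>Comparing these outputs with what the app produces for the same input and key is a quick way to confirm or reject each row of the table before touching the AES modes.</p>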
      <p>
          <a href="https://substack.thewebscraping.club/p/reversing-shopee-app-ghidra">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Give Claude Real-Time Web Access With the Decodo MCP]]></title><description><![CDATA[Learn how to connect Claude to the web with zero integration code]]></description><link>https://substack.thewebscraping.club/p/claude-decodo-mcp-how-to</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/claude-decodo-mcp-how-to</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 07 Jun 2026 19:37:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0ca9845f-9623-47e0-9aa6-3e40da60ddb6_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve spent serious time scraping, you know the real work isn&#8217;t parsing HTML: it&#8217;s surviving Cloudflare, rotating proxies, handling CAPTCHAs, and pretending to be a human long enough to get the data you need.</p><p>Now, <a href="https://substack.thewebscraping.club/p/the-evolving-career-scrapers">since the rise of AI and LLMs, every office job has changed, with no exceptions for scraping professionals</a>. For us, this means that surviving anti-bots and CAPTCHAs is only the first challenge when building scraping pipelines. The second is integrating scraping services and capabilities into AI pipelines and AI agents. The reason for this shift is simple: your boss or your clients don&#8217;t just want the data anymore. They want it in real time, structured, and often with some insights attached.</p><p>Until not long ago, these pipelines required a huge amount of custom code (and time!). Luckily for us, most of that struggle ended when MCP was released.</p><p>In this article, you&#8217;ll learn what MCP is, how the Decodo MCP server works, and how to integrate it with Claude Desktop. 
You&#8217;ll also learn how to use it with two hands-on examples.</p><p>Let&#8217;s get into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h3>What is MCP?</h3><p><a href="https://modelcontextprotocol.io/docs/getting-started/intro">The Model Context Protocol (MCP)</a> is an open standard introduced by Anthropic that defines how AI models connect to external tools, data sources, and services. Before MCP, every integration between an AI model and an external system had to be built from scratch. MCP replaces that per-integration work with a single, shared protocol.</p><p>Basically, MCP acts as a common language for models to connect with external tools, files, and systems. For example: do you need an AI assistant to pull a file from Google Drive, query a company database, and trigger an action in an internal app? That&#8217;s exactly the kind of job MCP is built to handle!</p><p>The practical upside is composability. Developers can mix and match several MCP servers in a single AI application without writing custom integration code. More than a year after its introduction, MCP has become the standard way to integrate different services and applications into a single AI application.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3>What is Decodo MCP?</h3><p>The <a href="https://visit.decodo.com/9VzKKe">Decodo MCP Server</a> is a web scraping layer for AI agents. It connects MCP-compatible clients to Decodo&#8217;s Web Scraping API, enabling:</p><ul><li><p><strong>Web scraping for LLMs and AI agents without managing infrastructure:</strong> It scrapes websites, including JavaScript-heavy pages, and returns real-time data without you handling proxy rotation, CAPTCHA solving, or anti-bot systems. It is specifically built for RAG pipelines, AI research agents, and automation flows.</p></li><li><p><strong>Structured outputs for LLM workflows</strong>: It returns the scraped data as Markdown (LLM-ready), JSON (for structured pipelines), or screenshots (for visual context).</p></li></ul><p>As of now, the Decodo MCP server exposes the following tools:</p><ul><li><p><em>scrape_as_markdown</em>: Scrapes the target URL given in the prompt. 
Returns results in Markdown.</p></li><li><p><em>screenshot</em>: Captures a screenshot of any webpage and returns it as a PNG image.</p></li><li><p><em>google_search</em>: Scrapes Google Search for a given query, and returns parsed results.</p></li><li><p><em>google_ads</em>: Scrapes Google Ads search results.</p></li><li><p><em>google_lens</em>: Scrapes Google Lens image search results.</p></li><li><p><em>google_ai_mode</em>: Scrapes Google AI Mode (Search with AI) results.</p></li><li><p><em>google_travel_hotels</em>: Scrapes Google Travel Hotels search results.</p></li><li><p><em>amazon_search</em>: Scrapes Amazon Search for a given query, and returns parsed results.</p></li><li><p><em>amazon_product</em>: Scrapes a given Amazon Product page.</p></li><li><p><em>amazon_pricing</em>: Scrapes Amazon Product pricing information.</p></li><li><p><em>amazon_sellers</em>: Scrapes Amazon Seller information.</p></li><li><p><em>amazon_bestsellers</em>: Scrapes Amazon Bestsellers list.</p></li><li><p><em>walmart_search</em>: Scrapes Walmart Search for a given query, and returns parsed results.</p></li><li><p><em>walmart_product</em>: Scrapes Walmart Product page.</p></li><li><p><em>target_search</em>: Scrapes Target Search for a given query, and returns parsed results.</p></li><li><p><em>target_product</em>: Scrapes Target Product page.</p></li><li><p><em>tiktok_post</em>: Scrapes a TikTok post URL.</p></li><li><p><em>tiktok_shop_search</em>: Scrapes TikTok Shop Search for a given query, and returns parsed results.</p></li><li><p><em>tiktok_shop_product</em>: Scrapes TikTok Shop Product page.</p></li><li><p><em>tiktok_shop_url</em>: Scrapes TikTok Shop page by URL.</p></li><li><p><em>youtube_metadata</em>: Scrapes YouTube video metadata.</p></li><li><p><em>youtube_channel</em>: Scrapes YouTube channel videos.</p></li><li><p><em>youtube_subtitles</em>: Scrapes YouTube video subtitles.</p></li><li><p><em>youtube_search</em>: Search YouTube 
videos.</p></li><li><p><em>reddit_post</em>: Scrapes a specific Reddit post.</p></li><li><p><em>reddit_subreddit</em>: Scrapes Reddit subreddit results.</p></li><li><p><em>reddit_user</em>: Scrapes a Reddit user profile and their posts and comments.</p></li><li><p><em>bing_search</em>: Scrapes Bing Search results.</p></li><li><p><em>chatgpt</em>: Searches and interacts with ChatGPT for AI-powered responses and conversations.</p></li><li><p><em>perplexity</em>: Searches and interacts with Perplexity for AI-powered responses and conversations.</p></li></ul><p><strong>NOTE</strong>: Decodo is currently onboarding its MCP server onto various platforms and marketplaces. At the time of writing, it can be found on the <a href="https://registry.modelcontextprotocol.io/?q=decodo">Official MCP registry</a>, <a href="https://www.pulsemcp.com/servers/decodo">Pulse MCP</a>, <a href="https://glama.ai/mcp/servers?query=decodo">Glama AI</a>, <a href="https://mcp.so/explore?q=decodo">mcp.so</a>, and <a href="https://mcpmarket.com/server/decodo">mcpmarket.com</a>.</p><h2>How To Integrate The Decodo MCP With Claude</h2><p>MCP servers exist to be integrated with LLMs. The <a href="https://github.com/Decodo/mcp-server">Decodo MCP server</a> can be integrated with several LLM-based tools, like Claude and Cursor. In this section, you will learn how to integrate it with Claude.</p><h3>Requirements</h3><p>To use the Decodo MCP, your system must satisfy the following requirements:</p><ul><li><p><strong>Claude Desktop</strong>: To integrate the Decodo MCP server with Claude, you need <a href="https://claude.com/download">Claude Desktop</a>. If you don&#8217;t have it installed yet, you can download it from the Claude website.</p></li><li><p><strong>Decodo account:</strong> Create an account at <a href="https://dashboard.decodo.com/">dashboard.decodo.com</a>. 
With a free one, you have up to 2K free requests.</p></li><li><p><strong>Scraping token</strong>: Get the basic authentication token. To get it, click on <strong>Web Scraping API</strong> &gt; <strong>API playground:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0R6j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0R6j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 424w, https://substackcdn.com/image/fetch/$s_!0R6j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 848w, https://substackcdn.com/image/fetch/$s_!0R6j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 1272w, https://substackcdn.com/image/fetch/$s_!0R6j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0R6j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png" width="1456" height="506" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:170154,&quot;alt&quot;:&quot;How to get the basic authentication token in the Decodo dashboard by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to get the basic authentication token in the Decodo dashboard by Federico Trotta" title="How to get the basic authentication token in the Decodo dashboard by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!0R6j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 424w, https://substackcdn.com/image/fetch/$s_!0R6j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 848w, https://substackcdn.com/image/fetch/$s_!0R6j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0R6j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">How to get the basic authentication token in the Decodo dashboard</figcaption></figure></div></li></ul><p>Good. 
Your system is set to connect Claude Desktop with the Decodo MCP server.</p><h3>Connect The Decodo MCP Server to Claude</h3><p>To connect the Decodo MCP server to Claude, open Claude Desktop and click on <strong>Settings</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YotT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YotT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 424w, https://substackcdn.com/image/fetch/$s_!YotT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 848w, https://substackcdn.com/image/fetch/$s_!YotT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 1272w, https://substackcdn.com/image/fetch/$s_!YotT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YotT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png" width="1456" height="705" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:705,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121992,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YotT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 424w, https://substackcdn.com/image/fetch/$s_!YotT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 848w, https://substackcdn.com/image/fetch/$s_!YotT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 1272w, https://substackcdn.com/image/fetch/$s_!YotT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Go to settings in Claude</figcaption></figure></div><p>Then, click on <strong>Developer</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zS13!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zS13!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 424w, 
https://substackcdn.com/image/fetch/$s_!zS13!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 848w, https://substackcdn.com/image/fetch/$s_!zS13!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 1272w, https://substackcdn.com/image/fetch/$s_!zS13!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zS13!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png" width="1282" height="481" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:481,&quot;width&quot;:1282,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37560,&quot;alt&quot;:&quot;Go to the Developer section in Claude&#8217;s settings by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Go to the Developer section in Claude&#8217;s settings by Federico Trotta" title="Go to the Developer section in Claude&#8217;s settings by 
Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!zS13!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 424w, https://substackcdn.com/image/fetch/$s_!zS13!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 848w, https://substackcdn.com/image/fetch/$s_!zS13!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 1272w, https://substackcdn.com/image/fetch/$s_!zS13!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Go to the Developer section in Claude&#8217;s settings</figcaption></figure></div><p>In the <strong>Developer</strong> section, click on <strong>Edit Config</strong>. Claude Desktop opens the folder where it stores its configuration files on your local machine. Open the <em>claude_desktop_config.json</em> file and add the following <em>mcpServers</em> entry inside the file&#8217;s top-level JSON object (the outermost curly braces):</p><pre><code><code>"mcpServers": {
    "Decodo MCP Server": {
      "command": "npx",
      "args": [
        "-y",
        "@decodo/mcp-server"
      ],
      "env": {
        "SCRAPER_API_TOKEN": "&lt;your-decodo-mcp-api-key&gt;"
      }
    }
  } </code></code></pre><p>Replace <em>&lt;your-decodo-mcp-api-key&gt;</em> with the basic authentication token you retrieved earlier from the Decodo dashboard, then save the file.</p><p>Quit Claude Desktop for the changes to take effect. Closing the window is not enough. You have to quit the application itself:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vI5X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vI5X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 424w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 848w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 1272w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vI5X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png" width="311" height="196" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:196,&quot;width&quot;:311,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68945,&quot;alt&quot;:&quot;How to quit Claude by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to quit Claude by Federico Trotta" title="How to quit Claude by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!vI5X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 424w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 848w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 1272w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">How to quit 
Claude</figcaption></figure></div><p>Reopen Claude Desktop and go back to <strong>Settings</strong> &gt; <strong>Developer</strong>. You&#8217;ll see the MCP server up and running:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KBhB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KBhB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 424w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 848w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 1272w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KBhB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png" width="1251" height="455" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:455,&quot;width&quot;:1251,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54828,&quot;alt&quot;:&quot;The Decodo MCP server is running by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Decodo MCP server is running by Federico Trotta" title="The Decodo MCP server is running by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!KBhB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 424w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 848w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 1272w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">The Decodo MCP server is running</figcaption></figure></div><p>To be sure everything works fine, you can test it with a prompt similar to the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IWAq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!IWAq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 424w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 848w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 1272w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IWAq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png" width="757" height="300" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:757,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63054,&quot;alt&quot;:&quot;A prompt to test if Claude can connect to the Decodo MCP server by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A prompt to test if Claude can connect to the Decodo MCP server by Federico Trotta" title="A prompt to test if Claude can connect to the Decodo MCP server by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!IWAq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 424w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 848w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 1272w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">A prompt to test if Claude can connect to the Decodo MCP server</figcaption></figure></div><p>Alright, you have successfully integrated the Decodo MCP server with Claude Desktop. 
Now it's time to test it!</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Running the Decodo MCP With Claude: Hands-On Examples</h2><p>In this section, you will learn how to use the Decodo MCP server through two examples:</p><ul><li><p>A basic example taken from the documentation.</p></li><li><p>A more advanced example where you&#8217;ll ask Claude to retrieve data from Amazon and return the result in a JSON file.</p></li></ul><p>Let&#8217;s get started!</p><h3>Getting Started: Run a Google Search in Seconds</h3><p>As a first, simple example, you can test the <em>google_search</em> tool. 
The idea is simple: you prompt the model with a query and, under the hood, the MCP server runs the Google search and returns the results.</p><p>To try it out, you can use the exact example from the Decodo MCP documentation, which searches for shoes on Google and reports the top positions:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7PUU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7PUU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 424w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 848w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 1272w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7PUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png" width="787" height="391" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:391,&quot;width&quot;:787,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32789,&quot;alt&quot;:&quot;Allow Claude to use the MCP tools by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Allow Claude to use the MCP tools by Federico Trotta" title="Allow Claude to use the MCP tools by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!7PUU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 424w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 848w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 1272w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Allow Claude to use the MCP tools</figcaption></figure></div><p>As you can see, Claude automatically tries to use the MCP server, loading its tool. 
Before running the tool, Claude Desktop asks whether you want to allow the Google search tool from the Decodo MCP server always or just for this call.</p><p>When Claude has completed its job, the result is the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2oSz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2oSz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 424w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 848w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 1272w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2oSz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png" width="552" height="762" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:762,&quot;width&quot;:552,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:97772,&quot;alt&quot;:&quot;Claude&#8217;s results by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude&#8217;s results by Federico Trotta" title="Claude&#8217;s results by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!2oSz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 424w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 848w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 1272w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Claude&#8217;s results</figcaption></figure></div><p>If you want to verify that the model hasn&#8217;t hallucinated, you can search for &#8220;shoes&#8221; on Google:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UYZP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!UYZP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 424w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 848w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 1272w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UYZP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png" width="1456" height="481" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:481,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177541,&quot;alt&quot;:&quot;The results on Google by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The results on Google by Federico Trotta" title="The results on Google by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!UYZP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 424w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 848w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 1272w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 
2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The results on Google</figcaption></figure></div><p>Alright! You have completed your first usage example.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Leveling Up: Extracting JSON-structured Data From Amazon</h3><p>Having results listed in a chat is useful for a quick overview of web research, but the real power of MCP servers is that the LLM can apply the capabilities of the underlying tools to specific tasks.</p><p>For example, you can ask the model to retrieve some data and report the results as JSON. 
This solution provides you with data that can be further used in the second part of your pipeline&#8212;for example, <a href="https://substack.thewebscraping.club/p/analyzing-scraped-data-pandas-matplotlib">for analyzing your scraped data</a>.</p><p>For this purpose, you can use the following prompt:</p><pre><code><code>Get Amazon bestsellers in electronics, extract the main info and return it in JSON format</code></code></pre><p>The image below shows Amazon&#8217;s best sellers in electronics, at the time of writing this article:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a_uB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a_uB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 424w, https://substackcdn.com/image/fetch/$s_!a_uB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 848w, https://substackcdn.com/image/fetch/$s_!a_uB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 1272w, https://substackcdn.com/image/fetch/$s_!a_uB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!a_uB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png" width="1352" height="886" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:886,&quot;width&quot;:1352,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:383040,&quot;alt&quot;:&quot;Amazon&#8217;s best sellers in electronics by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Amazon&#8217;s best sellers in electronics by Federico Trotta" title="Amazon&#8217;s best sellers in electronics by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!a_uB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 424w, https://substackcdn.com/image/fetch/$s_!a_uB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 848w, https://substackcdn.com/image/fetch/$s_!a_uB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 1272w, 
https://substackcdn.com/image/fetch/$s_!a_uB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Amazon&#8217;s best sellers in electronics</figcaption></figure></div><p>Under the hood, Claude will trigger the <em>amazon_bestsellers</em> tool and will search the data in electronics. 
The chat result is the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XVSQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XVSQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 424w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 848w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 1272w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XVSQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png" width="590" height="797" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:797,&quot;width&quot;:590,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108993,&quot;alt&quot;:&quot;Claude&#8217;s results via the chat by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude&#8217;s results via the chat by Federico Trotta" title="Claude&#8217;s results via the chat by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!XVSQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 424w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 848w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 1272w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Claude&#8217;s results via the chat</figcaption></figure></div><p>Below is the (partial) JSON returned by Claude:</p><pre><code><code>{
  "metadata": {
    "category": "Electronics",
    "source": "Amazon Best Sellers",
    "url": "https://www.amazon.com/Best-Sellers/zgbs/electronics/",
    "scraped_at": "2026-05-18",
    "total_items": 50,
    "currency": "USD"
  },
  "bestsellers": [
    {
      "rank": 1,
      "asin": "B08JHCVHTY",
      "title": "Blink Plus Plan with Monthly Auto-Renewal",
      "price": 11.99,
      "rating": 4.4,
      "ratings_count": 275779,
      "image_url": "https://images-na.ssl-images-amazon.com/images/I/31YHGbJsldL._AC_UL300_SR300,200_.png",
      "url": "https://www.amazon.com/Blink-Plus-Plan-monthly-auto-renewal/dp/B08JHCVHTY"
    },
    {
      "rank": 2,
      "asin": "B0DCH8VDXF",
      "title": "Apple EarPods Headphones with USB-C Plug",
      "price": 19,
      "rating": 4.6,
      "ratings_count": 13500,
      "image_url": "https://images-na.ssl-images-amazon.com/images/I/513OSdW4elL._AC_UL300_SR300,200_.jpg",
      "url": "https://www.amazon.com/Apple-EarPods-Headphones-Built-Control/dp/B0DCH8VDXF"
    },
    
    &lt;Omitted for brevity&gt;

    }
  ]
}</code></code></pre><p>As you can see, the first two items match the first two products in the screenshot of Amazon&#8217;s best sellers page above, which confirms that the model hasn&#8217;t hallucinated them.</p><p>Well done! You learned how to use the Decodo MCP server with Claude.</p><div><hr></div><h2>Conclusion</h2><p>In this article, you learned what MCP is and why it has become the standard for connecting AI models to external tools and services. You also learned what the Decodo MCP server is, how to integrate it with Claude Desktop, and how to use it in practice.</p><p>So, let us know: what kind of scraping workflows are you planning to build with the Decodo MCP?</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #106: Is Camoufox still effective, and do the forks help?]]></title><description><![CDATA[The project moved to CloverLabs and the fork tree keeps growing. 
We read the code and ran four builds against DataDome to see what still works.]]></description><link>https://substack.thewebscraping.club/p/is-camoufox-still-effective-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/is-camoufox-still-effective-scraping</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 04 Jun 2026 16:21:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e103d057-1faa-43a1-9346-cb8fdd5383b8_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Camoufox has been our default anti-detect browser for more than a year. We said so in <a href="https://substack.thewebscraping.club/p/how-to-bypass-cloudflare-turnstile">THE LAB #73: How to Bypass Cloudflare in 2025</a>, and again when we put it on the level of a commercial product in the Kasada article. Lately, that confidence has started to decline. In hallway conversations at PragueCrawl, more than one person told us the same thing we had started to feel. 
Camoufox does not pass the harder targets the way it used to.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. Their set of solutions covers all your scraping needs.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Part of that is the cat-and-mouse game every stealth tool plays. Part of it is specific to open source. 
When the entire fingerprint-spoofing codebase is public, anti-bot vendors can read it line by line and build the exact counter-signal. We made that argument in the rayobrowse review. The openness that made Camoufox popular is the same openness that let the anti-bot giants study it and catch up.</p><p>Two things changed in 2026 that make this worth a fresh look. First, the project moved. The repository at <a href="https://github.com/daijro/camoufox">github.com/daijro/camoufox</a> now carries a note at the top of its README:</p><div class="callout-block" data-callout="true"><p>Browser development is active at github.com/CloverLabsAI/camoufox and github.com/VulpineOS/VulpineOS. This repo is being used to merge checkpoint releases and should be used as the source of truth.</p></div><p>Clover Labs is a Toronto venture studio building AI agents, listed among the project sponsors. The alpha features (per-context fingerprints, hardware spoofing) now ship first in their <code>cloverlabs-camoufox</code> package, and daijro&#8217;s repo has become the checkpoint mirror. The project is not abandoned; its main maintainers have changed.</p><p>Second, that public repo has more than 750 forks. Open source means that when one person stops, others can pick up the work, layer new features on top, and keep the chase going in parallel. So the real question is not only &#8220;is Camoufox still effective&#8221; but also &#8220;has anyone in the fork tree built something better&#8221;. That is what we set out to find out in this article.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p>
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>The forks we actually tested</h2><p>We pulled the fork list from the GitHub API and sorted it by recent pushes. Most of it is noise. Many forks share the exact <code>pushed_at</code> timestamp of the parent, which is the signature of mirror bots that never wrote a line of their own. Once you count how many commits each fork is ahead of <code>daijro:main</code> and read what those commits do, the field collapses to a handful. Many of the survivors only touch CI or rebrand the binary. Three of them touch the anti-detect surface for real.</p><p>Official Camoufox (<a href="https://github.com/daijro/camoufox">github.com/daijro/camoufox</a>) is the baseline. A custom Firefox build with a fingerprint database and stealth patches, driven through Playwright&#8217;s Juggler protocol. We covered how it hides Playwright&#8217;s own traces in <a href="https://substack.thewebscraping.club/p/scraping-datadome-camoufox">THE LAB #65</a>, so we will not repeat that here.</p><p><strong>camoufox-reverse</strong> (<a href="https://github.com/WhiteNightShadow/camoufox-reverse">github.com/WhiteNightShadow/camoufox-reverse</a>) goes the other way. Instead of hiding harder, it adds a PropertyTracer at the SpiderMonkey engine layer that records which DOM properties a page reads. It is an instrument for watching the detector work, not a better scraper. That makes it the most useful tool in the set for understanding what we are up against.</p><p><strong>LeooNic/camoufox</strong> (<a href="https://github.com/LeooNic/camoufox">github.com/LeooNic/camoufox</a>) is the most ambitious on paper. 
Its commits add content-aware canvas noise that claims to defeat a 2025 academic pixel-recovery attack, a sigma-lognormal humanized mouse engine, and RDPBrowser, an automation path that drives Firefox over the Remote Debugging Protocol instead of Juggler.</p><p><strong>JWriter20/camoufox</strong> (<a href="https://github.com/JWriter20/camoufox">github.com/JWriter20/camoufox</a>) is the pragmatic one. Targeted stealth fixes, the headline being a closed WebRTC IP leak under a proxy on Firefox 146 (daijro issue #538), plus a real pytest suite, which none of the others ship.</p><p>Let&#8217;s start by using camoufox-reverse to discover something more about DataDome installed on Leboncoin.fr.</p><h2>What DataDome reads, watched from inside the engine</h2><p>Before testing who passes, we wanted to see what the detector looks at. We have explained the three detection layers before: behavioral, browser, and HTTP, in <a href="https://substack.thewebscraping.club/p/change-ciphers-scrapy">THE LAB #6</a>. camoufox-reverse lets us watch the browser layer from below the JavaScript, which is a view we have never had in these pages.</p><p>The PropertyTracer is documented to be enabled via a config flag. We drove the macOS arm64 build directly with Playwright, set the trace config through the <code>CAMOU_CONFIG</code> environment variable, and pointed it at a DataDome-protected page. Our target throughout this article is leboncoin.fr, the French classifieds site, because it runs only DataDome. That isolates the signal we care about, with no second anti-bot muddying the result.</p><p>The full probe is in code/camoufox_fork_analysis/trace_datadome.py. 
The core of it sets the trace and lets DataDome&#8217;s script run:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;f65ef838-3049-4931-9b72-149f05b52fe0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">config = {
    "propertyTrace": {
        "enabled": True,
        "logDir": str(LOG_DIR),
        "objects": [],            # empty = trace all covered getters
        "maxEventsPerSession": 200000,
    }
}
env = os.environ.copy()
env["CAMOU_CONFIG"] = json.dumps(config)
env["MOZ_DISABLE_CONTENT_SANDBOX"] = "1"  # required on macOS for the tracer</code></pre></div><div><hr></div><blockquote><p>When sites get tough, skip the heavy lifting. Get clean, structured CSV datasets,  ready for Excel, BI or your apps</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KpSw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KpSw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 424w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 848w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1272w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png" width="592" height="149.84467881112175" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:1043,&quot;resizeWidth&quot;:592,&quot;bytes&quot;:81723,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!KpSw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 424w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 848w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1272w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databoutique.com/buy-data-list&quot;,&quot;text&quot;:&quot;Find your 
dataset&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.databoutique.com/buy-data-list"><span>Find your dataset</span></a></p></blockquote><div><hr></div><p>The tracer writes one JSON line per getter access, each shaped like <code>{"o": "navigator", "p": "hardwareConcurrency", ...}</code>. Loading the leboncoin homepage produced 140 engine-level reads across 30 distinct properties. Aggregated by object and property, the access pattern contains information useful for fingerprint creation:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;3700752d-7e2b-415d-8254-13f7525b67ee&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"> COUNT  PROPERTY
    14  window.outerWidth
    13  window.devicePixelRatio
    13  window.outerHeight
    13  navigator.plugins.indexedGetter
     9  navigator.hardwareConcurrency
     7  canvas.toDataURL
     6  window.innerWidth
     6  screen.rect
     4  navigator.platform
     4  navigator.userAgent
     4  webgl.getParameter
     4  canvas2d.getImageData
     3  navigator.maxTouchPoints
     2  offscreenCanvas.getContext</code></pre></div><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><p>This is happening entirely below the JavaScript layer. From the page&#8217;s point of view, nothing was instrumented, because the recording lives in the C++ getter, not in a JavaScript proxy. DataDome reads the screen geometry, the navigator core, the plugin and mime enumeration, and then it reaches for the canvas and WebGL. Both <code>canvas.toDataURL</code> and <code>canvas2d.getImageData</code> are in the list, alongside <code>webgl.getParameter</code> and <code>offscreenCanvas.getContext</code>.</p><p>That last detail is what connects this experiment to the rest of the article. The canvas readback is exactly the surface LeooNic&#8217;s content-aware noise patch sets out to protect, and the WebRTC and screen reads are where the other forks claim improvements. We now know the detector touches it all.</p><p>The homepage is the light version. The pages that hold the data are watched far more closely, and the tracer shows it. We pointed the same probe at a car listing (a leboncoin <code>/ad/voitures/</code> URL). Those pages block direct connections, so this run went through a residential proxy, which is the setup we explain in the next section. The listing loaded its real content (the page title came back as &#8220;Alfa romeo Tonale 1.5 Ibrida 175ch Veloce TCT&#8221;), so we were tracing a passing ad page, not a challenge screen. 
The read pattern is a different animal: 584 engine-level reads across 35 properties, against 140 across 30 on the homepage.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;7042a22e-a626-4f27-837f-b2d2b2ac3cda&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"> COUNT  PROPERTY (ad page)
   220  document.cookie.get
    47  window.innerWidth
    30  window.innerHeight
    26  navigator.plugins.indexedGetter
    26  screen.rect
    25  sessionStorage.setItem
    22  sessionStorage.getItem
    16  document.cookie.set
    12  performance.timing
     8  window.scrollY
     7  canvas.toDataURL
     6  webgl.getParameter
     4  canvas2d.getImageData
     3  navigator.globalPrivacyControl
     1  mediaDevices.enumerateDevices</code></pre></div><p>The cookie reads jump from one on the homepage to 220 on the ad page. Session storage, which the homepage barely touched, is read and written dozens of times. New surfaces appear that the homepage never queried: <code>window.scrollY</code> for behavior, <code>navigator.globalPrivacyControl</code>, and <code>mediaDevices.enumerateDevices</code>. The canvas and WebGL reads are still there. This is the same DataDome, running a heavier script on the page that matters. It is the concrete reason a clean browser passes on the homepage but not on the listings. It also tells you where to spend your effort. The protection you have to beat lives on the content pages, not the landing page.</p><h2>Setting up a fair comparison</h2><p>The shared virtual environment we used already had <code>camoufox</code> 0.4.11 installed, which fetches the official Firefox 135 build. We ran on an Apple M2 Max, so we pulled the macOS arm64 binaries for each build, signed them ad hoc (the cross-compiled bundles need it), and pointed the same launcher at each one with <code>executable_path</code>.</p><p>Two version details matter for fairness. JWriter20&#8217;s WebRTC fix targets a regression introduced in Firefox 146, so we did not compare it against the cached 135 build. We pulled the official <code>v146-hardware</code> build (Firefox 146.0.1) as the baseline and JWriter20&#8217;s own 146.0.1 build as the patched version. Same Firefox, two builds. camoufox-reverse only ships at 135, which is fine because we used it only as a tracer, not as a contender.</p><p>Every test drives the binaries the way a real user would, through the camoufox launcher with <code>proxy</code> and <code>geoip</code> set, so the fingerprint database, the locale coherence, and the stealth patches are all active. 
The one exception is the WebRTC probe, explained below, where the page we run matters.</p><h2>The WebRTC leak that JWriter20 actually fixes</h2><p>JWriter20&#8217;s headline fix is a closed WebRTC IP leak under a proxy. We checked it on the official 146 build against the JWriter20 146 build, same launcher, same Bright Data proxy, <code>geoip=True</code>. The probe gathers ICE candidates from a STUN server and reports any IP that escapes (webrtc_leak_test.py).</p><p>A quick detour on what those candidates are, because the whole leak lives in them. WebRTC connects two peers directly, and to do that each side has to advertise every network address it could be reached on. Each address it offers is an ICE candidate. A candidate is an IP, a port, a protocol, and a type, and it reaches JavaScript as a string like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;eb21871e-0761-45d0-95b7-2236487833b7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">candidate:842163049 1 udp 1677729535 203.0.113.25 54321 typ srflx raddr 192.168.1.45 rport 54321</code></pre></div><p>Two types matter here. A <code>host</code> candidate is an address of a local network interface, so it carries your LAN IP. A <code>srflx</code> (server-reflexive) candidate is the public address a STUN server reports back when the browser asks which IP it appears to come from, so it carries your real WAN IP. A page gathers all of this with no permission. It opens an <code>RTCPeerConnection</code> pointed at a STUN server, calls <code>setLocalDescription</code>, and reads each candidate as it arrives. The key is that STUN runs over UDP, and an HTTP proxy only tunnels TCP. 
The STUN request leaves from the real interface, the proxy never sees it, and the <code>srflx</code> candidate comes back with the real WAN IP even though every HTTP request went through the proxy.</p><p>The first version of our probe ran the RTCPeerConnection on <code>about:blank</code> and showed both builds leaking the real IP. That was our mistake, not a result. Camoufox&#8217;s content-level injection is not active on <code>about:blank</code>, so we were measuring an unprotected page. Moving the probe onto a real https origin changed everything:</p><pre><code><code>official-146   HTTP exit IP (proxy): 189.173.138.17
               ICE candidates: 1  [srflx] ips=['203.0.113.25']   &lt;- real WAN IP leaks

jwriter20-146  HTTP exit IP (proxy): 93.44.185.102
               ICE candidates: 0                                &lt;- nothing leaks</code></code></pre><p>Our real WAN IP is 203.0.113.25. The official build, behind a working proxy, still hands it to any page through the WebRTC reflexive candidate. The proxy exit IP rotates on each run, so the constant 203.0.113.25 in the candidate is unmistakably the real address, not the proxy. </p><p>The fix is real, and it is baked into the binary. </p><p>We unzipped both <code>camoufox.cfg</code> files to confirm. The official build sets only <code>media.peerconnection.ice.no_host</code>. JWriter20 adds <code>default_address_only</code>, <code>proxy_only_if_behind_proxy</code>, <code>proxy_only_if_pbmode</code>, and <code>obfuscate_host_addresses</code>. Behind a proxy that cannot carry UDP, those preferences make WebRTC gather no candidates at all, so there is nothing to leak. Reproduced across two runs.</p><p>If WebRTC leaks were your problem, JWriter20 solves them. Hold that thought, because it does not end where you would expect.</p><h2>The canvas patch that we could read but not run</h2><p>To see why this patch exists, you have to understand the small-arms race it sits within. The PropertyTracer run above caught DataDome calling <code>toDataURL</code> and <code>getImageData</code>. Those two calls are how a canvas fingerprint is taken. A script draws the same text and shapes into an off-screen canvas on every machine, reads the pixels back, and hashes them. The drawing commands are identical everywhere. The pixels are not, because the final image depends on your GPU, your graphics driver, and how your system rasterizes fonts. That hash is stable for your device and different from the next one, which is most of what a tracker wants.</p><p>The standard way to hide is to add noise. Camoufox, Brave, Firefox&#8217;s resist-fingerprinting mode, and a long tail of extensions all nudge a few pixels so the hash will not stay constant across sites. 
The weakness is in how that noise is generated. If it is a fixed per-session perturbation that depends only on a seed and the pixel position, it can be undone. A 2025 paper at The Web Conference, <a href="https://dl.acm.org/doi/abs/10.1145/3696410.3714713">Breaking the Shield</a> by Hoang Dai Nguyen and Phani Vadrevu, showed exactly that against eighteen extensions and five browsers. Their Pixel-Recovery attack paints a second canvas filled with a known solid color and reads it back. Because it knows what every pixel should have been, it solves for the perturbation and subtracts it from the real fingerprint canvas. Reload ten times, and the recovered fingerprint stays constant while the noised one keeps changing. That is the proof the noise was reversible all along.</p><p>Two changes defeat the attack, and the same paper points at both. Leave the flat regions alone, so a detector that paints a solid block and reads it back finds no tampering to measure. And make each perturbation depend on the pixel content rather than its position, so there is no single value to solve for and subtract. The second idea is what Brave&#8217;s Farbling does, deriving its noise from the canvas content so two different canvases are altered differently, and it is the one defense the Pixel-Recovery attack could not reverse.</p><p>LeooNic&#8217;s patch implements both moves, and it is the most interesting code in the whole fork tree. The rewritten <code>ApplyCanvasNoise</code> skips flat regions and only perturbs edges, and the comments name the attack directly:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;279f1acd-8f68-4c56-ab69-a1a65dd58a57&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">// Content-aware + content-dependent canvas noise.
//   - Tier 1 known-pixel checks (DataDome, Castle): flat regions are skipped
//     because flat_score &lt; FLAT_THRESHOLD. fillRect(R,G,B) is undisturbed.
//   - WWW'25 Pixel-Recovery Attack (Nguyen &amp; Vadrevu): noise depends on the
//     pixel content AND its 4 neighbors, not just (seed, index).</code></pre></div><p><code>FLAT_THRESHOLD</code> is the cutoff that decides what counts as a flat region. The edge pixels that survive it get a content-dependent nudge of plus or minus one, small enough to stay invisible but enough to move the hash. The logic is sound on paper. We wanted to confirm it at runtime.</p><p>We built a probe that draws two solid blocks with one sharp boundary and counts perturbed pixels in the flat interior versus the edge (canvas_fingerprint_test.py). </p><p>First, we learned that canvas noise is off by default in every current build, which lines up with the CloverLabs &#8220;Disable Canvas Noise&#8221; commit. The noise only runs when <code>canvas:seed</code> is non-zero. With the seed forced on the official 146 build, the original algorithm shows its tell:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;8f66b3b1-7dde-41de-9999-4a526d196d19&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">official-146 (original noise)
  interior flat pixels perturbed : 9105 / 18240   (~50%)
  boundary edge pixels perturbed : 98 / 192
  max edge delta (per channel)   : 1
  hash varies across sessions    : True</code></pre></div><p>The stock algorithm perturbs roughly half of all pixels, flat fills included (9,105 of the 18,240 interior pixels in our run). That is precisely the behavior a known-pixel check catches, and precisely what LeooNic set out to fix. So the official baseline is the &#8220;before&#8221; picture, captured at runtime.</p><p>The &#8220;after&#8221; picture is the one we were not able to collect. LeooNic ships only a Windows binary, so we ran it on a Windows cloud box. It would not launch under Playwright at all. Every attempt, headless or headful, with the stock launcher or LeooNic&#8217;s own 0.5.0 launcher installed from source, ended the same way:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;d224820d-d322-4344-a4a1-bb64fce1952a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">console.error: "Warning: unrecognized command line flag" "-juggler-pipe"
Remote Settings startup changesets bundle could not be extracted (JSON.parse...)
JavaScript error: AsyncShutdown.sys.mjs, line 587: uncaught exception: undefined
&lt;process did exit: exitCode=0&gt;</code></pre></div><p>The official 135 build launches and drives fine on the same box, so the machine and Playwright are healthy. The Firefox 149 build that LeooNic publishes aborts at startup before Juggler attaches.</p><p>This is not specific to our environment. LeooNic&#8217;s own issue #1 is titled &#8220;fix: port patches and build system to Firefox 149.0&#8221;, an open work in progress, and the daijro repository carries issues #620 and #572 about Juggler failing to initialize in constrained environments. The build does run through LeooNic&#8217;s native RDP path, which is the whole point of their RDPBrowser, but even there we could not activate the canvas seed. The global config is ignored, and the per-context <code>setCanvasSeed</code> function, which the build exposes only at document start, was never present on the page when driven over RDP.</p><p>So we report LeooNic honestly. The content-aware algorithm is real and well-reasoned in source, and the original algorithm&#8217;s weakness is confirmed at runtime on the official build. The published Firefox 149 binary is not something you can pick up and drive with the standard stack today. For a reader choosing a fork, that is the practical signal. The innovation lives in the code, not yet in a usable artifact you can run.</p><h2>The block-rate test, and the result we did not expect</h2><p>Given that, the only head-to-head we could run was JWriter20&#8217;s fork against the official build. As always, the code used for testing can be found&nbsp;<a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder&nbsp;</a><strong><a href="https://github.com/TheWebScrapingClub/thelab">106.</a></strong><a href="https://github.com/TheWebScrapingClub/thelab">CAMOUFOX</a><strong>.</strong></p>
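<p>As a rough illustration of the flat-versus-edge count reported in the canvas section above, the sketch below compares a reference image made of two solid blocks against the pixels read back from the browser. It is not the <code>canvas_fingerprint_test.py</code> probe itself: the function names, the two-column edge band, and the list-of-tuples pixel format are assumptions made for this example, and in a real probe the observed pixels would come from <code>getImageData</code> in the page.</p>

```python
def make_two_block_image(width, height, boundary_x,
                         left=(255, 0, 0), right=(0, 0, 255)):
    """Reference image: two solid blocks meeting at one sharp vertical boundary."""
    return [
        [left if boundary_x > x else right for x in range(width)]
        for _ in range(height)
    ]


def count_perturbed(expected, observed, boundary_x):
    """Count pixels that differ from the reference, split by region.

    The two columns touching the boundary count as "edge" (an assumption
    of this sketch); every other column is flat "interior".
    """
    edge_columns = {boundary_x - 1, boundary_x}
    stats = {
        "interior_total": 0, "interior_perturbed": 0,
        "edge_total": 0, "edge_perturbed": 0,
        "max_edge_delta": 0,
    }
    for row_expected, row_observed in zip(expected, observed):
        for x, (want, got) in enumerate(zip(row_expected, row_observed)):
            # Largest per-channel difference for this pixel.
            delta = max(abs(a - b) for a, b in zip(want, got))
            region = "edge" if x in edge_columns else "interior"
            stats[region + "_total"] += 1
            if delta > 0:
                stats[region + "_perturbed"] += 1
                if region == "edge":
                    stats["max_edge_delta"] = max(stats["max_edge_delta"], delta)
    return stats


# Hypothetical readback: one nudged edge pixel, untouched flat fills.
reference = make_two_block_image(width=8, height=4, boundary_x=4)
readback = [list(row) for row in reference]
readback[0][4] = (0, 0, 254)
print(count_perturbed(reference, readback, boundary_x=4))
```

<p>Under these assumptions, a content-aware scheme of the kind LeooNic&#8217;s comments describe should leave <code>interior_perturbed</code> at zero, while the stock pattern measured on the official build would push it toward half of <code>interior_total</code>.</p>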
      <p>
          <a href="https://substack.thewebscraping.club/p/is-camoufox-still-effective-scraping">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[5 mistakes that are driving up your scraping costs - Insights from DataImpulse]]></title><description><![CDATA[An in-depth look at the real factors driving up web scraping costs and how smarter proxy usage and system design can reduce expenses by up to 60%.]]></description><link>https://substack.thewebscraping.club/p/5-mistakes-that-are-driving-up-your</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/5-mistakes-that-are-driving-up-your</guid><dc:creator><![CDATA[Olia Liudko]]></dc:creator><pubDate>Tue, 02 Jun 2026 20:02:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Jfy8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is a guest post written by the DataImpulse team, tackling the problem behind the costs of scraping. For an independent benchmark of proxy prices, <a href="https://proxyprice.thewebscraping.club/">visit our Proxy Pricing Benchmark tool</a>.</em></p><div><hr></div><p>Every product has its own price, usually formed by a simple and predictable formula. Proxy services follow a similar pricing logic. At first glance, multiplying the proxy price by the amount of bandwidth should give the final cost, but in practice many other factors must be taken into account. Bandwidth is not a clean, one-to-one reflection of useful work. Users pay not just for the data they successfully collect but also for everything that happens around it: request failures, blocks, timeouts, and suboptimal routing decisions.</p><p style="text-align: justify;">Much of the traffic in scraping systems is consumed without usable output, and that&#8217;s where the gap between expected and actual costs starts to widen. 
To understand where the budget really goes, we need to research it deeper.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jfy8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jfy8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jfy8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1773700,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/200176130?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jfy8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<h2><strong>The math behind scraping costs</strong></h2><p style="text-align: justify;">Bandwidth is measurable, and proxy providers typically price it transparently. However, it is not the key driver of cost; it is the result of a much more complex process. Each unit of bandwidth stands for a series of events rather than a single successful request. While some requests yield data immediately, many others fail, get blocked, time out, or require further attempts. Consequently, the actual cost structure is influenced more by system behavior than by the volume of traffic.</p><p style="text-align: justify;">The real cost equals the total number of request cycles necessary to extract data multiplied by the cost of executing each request cycle. 
Put simply, it is defined by how many complete attempts the system must make before it gets one usable response. The key detail is that only successful requests generate value. Each request cycle may include the initial request, retries after timeouts, proxy rotation, session management, and other overhead. All of them consume resources. Because of this, the real price increases not only when proxy prices rise but also when the system becomes less efficient. The same dataset can cost considerably more to extract when the scraping pipeline is inefficient.</p><p style="text-align: justify;">The cost amplification effect, that is, the cumulative increase in resource usage caused by repeated request cycles, is a common issue in web scraping systems. One failed request can trigger multiple follow-up attempts. These extra cycles accumulate and raise the cost of getting one successful data point. For this reason, it&#8217;s more accurate to evaluate scraping by cost per successful request. The fewer attempts needed, the lower the real cost.</p><h2><strong>Mistake #1 - Using the wrong proxy type</strong></h2><p style="text-align: justify;">This mistake is not only widespread but also expensive. Not all targets need the same level of stability and anonymity, yet many systems apply one universal strategy. This results in either high costs or poor performance.</p><p style="text-align: justify;">Each proxy type has its own balance of speed, cost, and detection resistance. The system will underperform if that balance doesn&#8217;t align with the target website&#8217;s behavior. For example, mobile IPs are more expensive by default, but they are also highly trusted and harder to block, so it&#8217;s logical to use them for challenging targets. When they are used on low-protection websites, they increase costs without improving the results. 
The approach that works is matching the proxy type to the task.</p><ul><li><p style="text-align: justify;">Residential proxies are widely used in web scraping because they route traffic via IPs assigned by ISPs to real devices. Because the traffic looks like it comes from a real user, these proxies carry strong trust signals. Many proxy users notice better success rates when they switch to residential IPs.</p></li></ul><ul><li><p style="text-align: justify;">Mobile proxies direct traffic via carrier networks using IP addresses from mobile service providers. Since these IPs are shared among many real users, the traffic appears very authentic and is significantly harder for systems that rely on fingerprinting to identify.</p></li></ul><ul><li><p style="text-align: justify;">Datacenter proxies run on cloud or server-based infrastructure and use IP ranges that aren&#8217;t tied to real ISPs. Their biggest advantage is speed. They are well suited to heavy automation and data collection tasks.</p></li></ul><h2><strong>Mistake #2 - The retry loop problem</strong></h2><p style="text-align: justify;">The goal of retry logic is to improve success rates by giving failed requests another chance to return a valid response. This approach works when responses are consistent and failures are occasional, not systematic. On targets with rate limits or unstable responses, repeated retries under the same conditions tend to fail in the same way. Not all failures are the same. If you get a timeout, it&#8217;s worth retrying, but if it&#8217;s a 403 error or a block, other actions make more sense. For example, you can rotate proxies or fix headers. </p><p style="text-align: justify;">Retries can turn into a loop where the system keeps sending more requests without getting better results. Instead of retrying everything the same way, treat different errors differently, and adjust your behavior based on the response. 
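</p><p style="text-align: justify;">As a minimal sketch of this idea, the Rust snippet below maps each kind of failure to a different next step. The <em>Failure</em> and <em>NextStep</em> types, the <em>next_step</em> function, and the specific status-code mapping (403 rotates the proxy, 429 backs off exponentially, 5xx retries as-is) are illustrative assumptions rather than a prescribed policy, so tune them to how your target actually responds.</p>

```rust
/// A failed attempt, reduced to what the retry policy needs to know.
enum Failure {
    /// No response arrived in time.
    Timeout,
    /// The server answered with a non-success HTTP status code.
    Status(u16),
}

/// What the scraper should do next.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// Send the same request again through the same proxy.
    RetrySame,
    /// Switch to another proxy (and review headers) before retrying.
    RotateProxy,
    /// Wait this many seconds before retrying.
    BackOff(u64),
    /// Stop retrying this request.
    GiveUp,
}

/// Decide the next step from the failure type.
/// `attempt` is the number of attempts already made for this request.
/// The status-code mapping below is an assumption for illustration.
fn next_step(failure: Failure, attempt: u32, max_attempts: u32) -> NextStep {
    // Cap the total attempts so one URL cannot loop forever.
    if attempt >= max_attempts {
        return NextStep::GiveUp;
    }
    match failure {
        // Timeouts are often transient: retrying as-is is reasonable.
        Failure::Timeout => NextStep::RetrySame,
        // 403 usually signals a block: the same conditions will fail again.
        Failure::Status(403) => NextStep::RotateProxy,
        // 429 signals rate limiting: slow down, doubling the wait each time.
        Failure::Status(429) => NextStep::BackOff(2u64.saturating_pow(attempt)),
        // Server-side errors may clear up on their own.
        Failure::Status(500..=599) => NextStep::RetrySame,
        // Anything else (e.g. 404) will not improve with retries.
        Failure::Status(_) => NextStep::GiveUp,
    }
}
```

<p style="text-align: justify;">Here, <em>next_step(Failure::Status(403), 1, 3)</em> returns <em>NextStep::RotateProxy</em>, while any failure once the attempt cap is reached returns <em>NextStep::GiveUp</em>, which limits how much a single URL can cost. 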
For example, you can rotate proxies after receiving certain status codes, or stop retrying altogether once a request is clearly blocked.</p><h2><strong>Mistake #3 - Misconfigured proxy rotation</strong></h2><p style="text-align: justify;">Rotating proxies aggressively is not a solution. Changing IPs too often makes traffic look unnatural and can raise suspicion. On the flip side, not rotating enough concentrates traffic on a few IPs, which makes rate limits and blocks more likely. Thus, there must be a balance. Some websites tolerate frequent IP changes, while others expect a more stable session. Treating all targets the same way is not appropriate. It&#8217;s better to adjust rotation based on the context. </p><p style="text-align: justify;">If the server expects consistent user behavior over time, sticky sessions help maintain session consistency without breaking the flow. For bulk data extraction, where there is no need to preserve session context, you can rotate proxies more frequently. You can also refine your rotation by using signal-based triggers instead of fixed rules: when the system detects specific conditions, such as a sudden drop in success rates or a spike in block status codes, adjust the rotation accordingly. </p><h2><strong>Mistake #4 - Ignoring caching and duplicate requests</strong></h2><p style="text-align: justify;">A notable portion of scraping traffic is often spent retrieving data that has already been collected. This occurs when pipelines lack deduplication or clear definitions of data freshness, which leads to repeated requests for identical resources. These requests consume bandwidth and proxy capacity without providing new information.</p><p style="text-align: justify;">To address this, implement a caching layer and deduplication logic. Responses can be cached with a time-to-live (TTL) interval that matches how often the data is updated. Request fingerprints can be used to identify duplicates before requests are sent. 
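</p><p style="text-align: justify;">A request fingerprint and a TTL freshness check can be sketched in a few lines of Rust. The <em>fingerprint</em> and <em>is_fresh</em> functions below are hypothetical helpers written for illustration, and the normalization rules (uppercase the method, drop the fragment, sort the query parameters) are assumptions about what counts as "the same request" for a given target.</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Build a fingerprint for a request so duplicates can be detected before sending.
/// Query parameters are sorted so that `?a=1&b=2` and `?b=2&a=1` match.
/// Note: DefaultHasher output is not guaranteed to be stable across Rust versions,
/// so this suits in-process deduplication, not fingerprints persisted long term.
fn fingerprint(method: &str, url: &str) -> u64 {
    // Drop the fragment: it is never sent to the server.
    let without_fragment = url.split('#').next().unwrap_or(url);
    // Separate the base URL from the query string, if there is one.
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    // Sort the parameters so their order does not change the fingerprint.
    let mut params: Vec<&str> = query.split('&').filter(|p| !p.is_empty()).collect();
    params.sort_unstable();

    let mut hasher = DefaultHasher::new();
    method.to_ascii_uppercase().hash(&mut hasher);
    base.hash(&mut hasher);
    params.hash(&mut hasher);
    hasher.finish()
}

/// Return true while a cached response is still within its TTL.
/// All arguments are in seconds; timestamps are seconds since any fixed epoch.
fn is_fresh(fetched_at_secs: u64, now_secs: u64, ttl_secs: u64) -> bool {
    now_secs.saturating_sub(fetched_at_secs) < ttl_secs
}
```

<p style="text-align: justify;">In a pipeline, the fingerprints of requests already sent would be kept in a set, and a new request is skipped when its fingerprint is present and the cached response is still fresh. 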
For structured data, storing IDs or hashes of processed items allows the system to skip previously captured content. </p><h2><strong>Mistake #5 - No cost-aware proxy routing</strong></h2><p style="text-align: justify;">Many scraping systems process all requests through a single proxy type. This approach simplifies implementation, but it often leads to inefficiency. Different endpoints often have distinct requirements, and a universal strategy may result in unnecessary costs.</p><p style="text-align: justify;">For instance, using high-trust proxies for simple endpoints is needlessly expensive, whereas using lower-cost proxies for protected pages may result in blocks and extra retries. Without routing logic that adapts to these variables, systems tend to either overpay or underperform.</p><p style="text-align: justify;">An alternative is to implement cost-aware routing, which matches the proxy type to the difficulty of the task. This involves using more economical options for low-risk requests and escalating to higher-trust proxies only when necessary. By monitoring metrics such as status codes, latency, and success rates, the system can determine when to switch proxy pools. For example, a blocked request can be retried using a higher-trust proxy rather than repeating the request under the same conditions.</p><p style="text-align: justify;">This approach creates a more structured pipeline that balances cost and performance by allocating resources based on the specific requirements of each request.</p><h2><strong>Understanding the real price of proxies</strong></h2><p style="text-align: justify;">While &#8220;price per GB&#8221; is often cited as a standard industry metric, experienced engineers know that it fails to capture the true economics of data scraping. In practice, failed requests consume bandwidth and incur costs despite yielding no usable data. 
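</p><p style="text-align: justify;">To put rough numbers on this, here is a small Rust sketch of cost per attempt versus cost per successful request. The <em>Workload</em> struct and its field names are hypothetical, and the model assumes every attempt, failed or not, consumes the same average amount of traffic, which is a simplification (blocked responses are often smaller than full pages).</p>

```rust
/// Inputs for a rough cost estimate. All values are illustrative assumptions.
struct Workload {
    /// Proxy price in USD per GB of traffic.
    price_per_gb: f64,
    /// Average traffic consumed by one attempt, in MB.
    avg_attempt_mb: f64,
    /// Fraction of attempts that return usable data; must be greater than 0.
    success_rate: f64,
}

/// Cost of a single attempt, whether it succeeds or fails (1 GB = 1,000 MB).
fn cost_per_attempt(w: &Workload) -> f64 {
    w.price_per_gb * w.avg_attempt_mb / 1000.0
}

/// Cost of one successful request: each success also pays for the failed
/// attempts around it, so the per-attempt cost is divided by the success rate.
fn cost_per_success(w: &Workload) -> f64 {
    cost_per_attempt(w) / w.success_rate
}

/// Share of the total spend that goes to attempts yielding no data.
fn wasted_share(w: &Workload) -> f64 {
    1.0 - w.success_rate
}
```

<p style="text-align: justify;">With an assumed price of $1 per GB, 1 MB per attempt, and a 50% success rate, each attempt costs $0.001 but each successful request costs $0.002, and half of the spend buys no data. 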
These unsuccessful attempts represent a negative return on investment.</p><p style="text-align: justify;">Furthermore, automated retries add another hidden expense. We have to look beyond the basic per-GB rate and adopt the &#8220;Cost Per Successful Request&#8221; (CPSR). This metric provides a more accurate reflection of true operational expenses.</p><p style="text-align: justify;">To calculate the cost of each valid data retrieval, use the following formula:</p><p style="text-align: justify;"><strong>CPSR = (price per GB / 1,000) * (1 / Success Rate)</strong></p><p style="text-align: justify;">In this equation, the &#8220;success rate&#8221; is the share of requests that return an HTTP 200 OK status along with the intended data, expressed as a fraction (for example, 0.8 for 80%). Dividing the per-GB price by 1,000 gives the price per MB, so as written the formula implicitly assumes roughly 1 MB of traffic per request; scale it by your actual average request size. Organizations can make better financial decisions if they start evaluating proxy services through the lens of CPSR.</p><p style="text-align: justify;"><a href="https://dataimpulse.com/residential-proxies/">DataImpulse is a reliable provider</a> of residential, mobile, and datacenter proxies with non-expiring traffic and a pay-as-you-go model, meaning purchased traffic remains available until it is used. This vendor offers more than 90 million IPs in 195 countries. Teams usually choose DataImpulse for web scraping, ad verification, market research, SERP monitoring, and website testing. </p><h3 style="text-align: justify;"><strong>Why is DataImpulse cheaper than other vendors? </strong></h3><p style="text-align: justify;">The pricing structure is based on the proxy sourcing method. Many providers purchase traffic rights from ISPs and resell them, which adds a markup. DataImpulse sources IP addresses directly through its own application and SDKs, bypassing intermediaries to avoid extra costs. 
This operational model complies with all legal standards.</p><h2><strong>How to reduce your scraping costs by 30-60%</strong></h2><p style="text-align: justify;">Cost efficiency in data collection is primarily achieved by minimizing inefficient requests and increasing the success rate of each attempt. </p><ol><li><p style="text-align: justify;">To optimize expenses, match proxy types to the specific requirements of the task. Using cost-effective proxies for straightforward targets while reserving higher-trust proxies for more challenging endpoints can reduce unnecessary spending.</p></li><li><p style="text-align: justify;">Refining retry logic is also important. Failures should be handled according to their specific status codes. Avoiding repeat requests under identical conditions prevents wasted resources.</p></li><li><p style="text-align: justify;">Proxy rotation should be managed strategically rather than randomly. Implementing sticky sessions and rotating based on indicators such as blocks or elevated failure rates can improve both stability and overall success rates.</p></li><li><p style="text-align: justify;">Incorporating caching and deduplication techniques helps manage traffic. By avoiding redundant requests for data that has not changed, it is possible to decrease total request volume. </p></li><li><p style="text-align: justify;">Implement a cost-aware proxy routing strategy. Prioritize lower-cost alternatives, escalating to premium options only when strictly necessary. This directs infrastructure spending toward the areas of greatest impact.</p></li><li><p style="text-align: justify;">Lastly, pay attention to how your scraper interacts with websites. When a browser loads a page, it also pulls images, scripts, videos, and even fonts, which generates a lot of traffic. Use plain HTTP requests for structured data, and reserve browser-based scraping for cases where JS rendering is necessary. 
</p></li></ol><p style="text-align: justify;">These optimizations don&#8217;t require a comprehensive system overhaul. Incremental improvements in request efficiency can yield significant cost reductions.</p><h2><strong>Start measuring your current CPSR baseline</strong></h2><p style="text-align: justify;">At first sight, scraping costs look like a simple equation between proxy price and bandwidth. But the real drivers of cost lie deeper. As we&#8217;ve seen, unnecessary retries and poor rotation strategies contribute to a growing gap between expected and actual costs. A system with low success rates will always consume more resources. </p><p style="text-align: justify;">The important shift is moving away from thinking in terms of raw pricing and toward thinking in terms of efficiency. It doesn&#8217;t always require major steps; simple adjustments are key. Better proxy selection, cost-aware routing, caching, and improved retry logic are among them. Based on actual proxy usage data from DataImpulse, we&#8217;ve seen that even small optimizations can noticeably reduce total costs. Every request should add value, so spending must be thoughtful and deliberate. Audit your scraping pipeline against these 5 mistakes today. </p>]]></content:encoded></item><item><title><![CDATA[Why and How to Build a Web Scraper with Rust in 2026]]></title><description><![CDATA[Is Rust the future of web scraping? 
Let&#8217;s find out!]]></description><link>https://substack.thewebscraping.club/p/how-to-build-a-web-scraper-rust</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-build-a-web-scraper-rust</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 31 May 2026 15:27:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/159b826b-b10b-4781-9efd-ddc651b7f874_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What do popular developer technologies like ZeroClaw, IronClaw, Codex CLI, and many others have in common, besides thousands of GitHub stars, tons of downloads, and growing communities? They are all developed in Rust!</p><p>Rust is becoming increasingly popular thanks to its advantages in performance, stability, and security. But what about using it for web scraping?</p><p>In this post, I&#8217;ll show you what Rust brings to the table for web scraping, why it makes sense (and when it doesn&#8217;t), and how to build a web scraper in Rust.</p><h2>Main Characteristics of Rust: Quick Overview</h2><p>Rust stands out because it combines performance, safety, and control in a way few programming languages do. According to the <a href="https://survey.stackoverflow.co/2025/technology">2025 Stack Overflow Developer Survey</a>, 14.8% of respondents reported using Rust that year, making it the 14th most popular option.</p><p>Personally, what I find most compelling about Rust is its <a href="https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html">memory safety model</a>. Thanks to <a href="https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html">ownership and borrowing</a>, it avoids entire classes of bugs like use-after-free errors or data races. 
All of that, without needing a garbage collector!</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p><div><hr></div></blockquote><p>Here&#8217;s what Rust looks like in its simplest form:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">fn main() {
    println!("Hello, world!");
}</code></pre></div><p>Even in this minimal example, you can see Rust&#8217;s explicit structure and compile-time guarantees.</p><p><strong>Remember</strong>: In Rust, <em>println!</em> is not a function. It&#8217;s a macro. The <em>!</em> tells the compiler: &#8220;this is a macro invocation, not a normal function call.&#8221;</p><p>Performance is another big win. Rust is compiled and extremely fast, making it ideal for high-throughput parsing on heavy HTML pages or large volumes of pages (e.g., in an <a href="https://substack.thewebscraping.club/p/offline-web-scraping">offline web scraping scenario</a>). Concurrency is also first-class, helping you manage thousands of requests in parallel without the usual headaches.</p><p>On the flip side, Rust has a steeper learning curve. If you&#8217;re coming from Python or JavaScript, getting used to the syntax and strict compiler won&#8217;t be trivial. In my experience, the first steps can feel a bit unforgiving&#8230;</p><blockquote><div><hr></div></blockquote><h2>Why AI Has Made Rust a Solid Choice for Web Scraping</h2><p>AI is changing the nature of software development, including web development. And, as you may already have noticed, not always in a &#8220;lighter&#8221; direction. Humans struggle to deal with long scripts and source code files, but machines don&#8217;t!</p><p>Thus, AI tends to produce very long and complex HTML with a lot of elements embedded in the same page. On top of that, AI makes it trivial to generate large amounts of content, which further increases HTML size. In addition, <a href="https://developer.mozilla.org/en-US/curriculum/core/semantic-html/">semantic HTML</a> is more verbose than traditional HTML.</p><p>As a result, modern web pages are getting bigger and more complex. From a scraping perspective, this translates into slower and more resource-intensive parsing. 
What used to be lightweight DOM trees are now dense, deeply nested structures that require more CPU and memory to process.</p><p>This is exactly where Rust starts to make sense&#8230;</p><p>Sure, it may not be the easiest programming language, but Rust&#8217;s performance makes it compelling (and in some cases even necessary). Its low-level control and zero-cost abstractions allow Rust HTML parsers to process large documents in fractions of a second, even under high concurrency.</p><p><a href="https://medium.com/@jgfriedman99/html-parsing-benchmarks-2170417e8c06">Independent benchmarks</a> show Rust HTML parsers ranking among the fastest available. In particular, libraries like <em><a href="https://github.com/y21/tl">tl</a></em> stand out for their exceptional speed and low overhead.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><h2>Best Rust Web Scraping Libraries</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NgZ2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!NgZ2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 424w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 848w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NgZ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The best Rust web scraping libraries&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The best Rust web scraping libraries" title="The best Rust web scraping libraries" 
srcset="https://substackcdn.com/image/fetch/$s_!NgZ2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 424w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 848w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">The best Rust web scraping libraries</figcaption></figure></div><h2>How to Build a Scraper in Rust: A Step-by-Step Guide</h2><p>In this section, I&#8217;ll guide you through the process of building a web scraper in Rust. The target web page will be <a href="https://books.toscrape.com/">Books to Scrape&#8217;s homepage</a>. This is a static page, which is the ideal scenario for high-speed HTML parsing in Rust.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CIY7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CIY7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 424w, https://substackcdn.com/image/fetch/$s_!CIY7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 848w, https://substackcdn.com/image/fetch/$s_!CIY7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CIY7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CIY7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png" width="1456" height="786" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:786,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CIY7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 424w, https://substackcdn.com/image/fetch/$s_!CIY7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 848w, https://substackcdn.com/image/fetch/$s_!CIY7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CIY7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Books to Scrape homepage</figcaption></figure></div><p>The end goal is to scrape all the book information and export it to a CSV file. 
Follow the instructions below!</p><h3>Prerequisites</h3><p>Make sure you have:</p><ul><li><p><a href="https://rust-lang.org/tools/install/">Rust installed locally</a> (the article refers to Rust 1.95.0).</p></li><li><p>Some basic familiarity with <a href="https://doc.rust-lang.org/book/">Rust syntax and constructs</a>.</p></li></ul><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Qrb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Qrb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png" width="479" height="239.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:479,&quot;bytes&quot;:911444,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196394917?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8Qrb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Trusted by teams running ad verification, web scraping, SERP tracking, and market research. 
Ethically sourced proxies, globally accessible, and fairly priced.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataimpulse.com/&quot;,&quot;text&quot;:&quot;Get Started With DataImpulse&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataimpulse.com/"><span>Get Started With DataImpulse</span></a></p></blockquote><div><hr></div><h3>Step #1: Set Up a Rust Scraping Project</h3><p>Create a new Rust project for web scraping with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">cargo new books_rust_scraper</code></pre></div><p>This will generate a new project called <em>books_rust_scraper</em> containing a basic &#8220;Hello, world!&#8221; program. Move into the project folder:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">cd books_rust_scraper</code></pre></div><p>After the first build, which generates the <em>target/</em> folder and the <em>Cargo.lock</em> file, the project will have the following structure:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">
books_rust_scraper/
&#9500;&#9472;&#9472; src/
&#9474;   &#9492;&#9472;&#9472; main.rs
&#9500;&#9472;&#9472; target/
&#9500;&#9472;&#9472; .gitignore
&#9500;&#9472;&#9472; Cargo.toml
&#9492;&#9472;&#9472; Cargo.lock</code></pre></div><p>Focus on the <em>src/main.rs</em> file:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oopm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oopm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 424w, https://substackcdn.com/image/fetch/$s_!oopm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 848w, https://substackcdn.com/image/fetch/$s_!oopm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 1272w, https://substackcdn.com/image/fetch/$s_!oopm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oopm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png" width="1364" height="386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:1364,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The src/main.rs file&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The src/main.rs file" title="The src/main.rs file" srcset="https://substackcdn.com/image/fetch/$s_!oopm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 424w, https://substackcdn.com/image/fetch/$s_!oopm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 848w, https://substackcdn.com/image/fetch/$s_!oopm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 1272w, https://substackcdn.com/image/fetch/$s_!oopm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">The src/main.rs file</figcaption></figure></div><p>This is the entry point of your application and currently contains a simple &#8220;Hello, world!&#8221; example. Test your Rust application with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">cargo run</code></pre></div><p>The command executes the <em>src/main.rs </em>file, so the result will be:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Hello, world!</code></pre></div><p>In that file, you&#8217;ll implement your Rust web scraping logic. 
Great!</p><h3>Step #2: Install Required Dependencies</h3><p>Run these commands to install the crates (Rust libraries) needed to build a Rust web scraper:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">cargo add tokio --features full
cargo add reqwest
cargo add scraper
cargo add csv</code></pre></div><p>These are the core dependencies:</p><ul><li><p><em><a href="https://docs.rs/tokio/latest/tokio/">tokio</a></em>: Enables asynchronous execution.</p></li><li><p><em><a href="https://docs.rs/reqwest/latest/reqwest/">reqwest</a></em>: To send HTTP requests to retrieve HTML pages.</p></li><li><p><em><a href="https://docs.rs/scraper/latest/scraper/">scraper</a></em>: To parse HTML and extract data using CSS selectors.</p></li><li><p><em><a href="https://docs.rs/csv/latest/csv/">csv</a></em>: To export the scraped data to a CSV file.</p></li></ul><p>After running the commands above, your <em>Cargo.toml</em> file should look similar to this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;toml&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-toml">[package]
name = "books_rust_scraper"
version = "0.1.0"
edition = "2024"

[dependencies]
csv = "1.4.0"
reqwest = "0.13.3"
scraper = "0.26.0"
tokio = { version = "1.52.1", features = ["full"] }</code></pre></div><p>Nice! You now have all the dependencies in place to start building your Rust scraper.</p><h3>Step #3: Retrieve the Target Page</h3><p>Use <em>reqwest</em> to fetch the target page with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">use std::error::Error;
use reqwest::Client;

#[tokio::main]
async fn main() -&gt; Result&lt;(), Box&lt;dyn Error&gt;&gt; {
    // Initialize the HTTP client
    let client = Client::builder()
        .build()?;

    // Retrieve the target page
    let url = "https://books.toscrape.com/";
    let response = client
        .get(url)
        .send()
        .await?;

    // Extract the HTML content from the response
    let html = response.text().await?;
  
    // Parsing logic...

    // Data export logic...

    Ok(())
}</code></pre></div><p>This snippet creates a <em>reqwest</em> HTTP client, uses it to send an asynchronous GET request to the target URL on the Tokio runtime, and reads the response body into the <em>html</em> string, ready for parsing and data extraction.</p><p>If you print <em>html</em>, you&#8217;ll see the HTML of the page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G1Ll!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G1Ll!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 424w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 848w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 1272w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G1Ll!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png" width="1456" height="547" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The HTML of the target page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The HTML of the target page" title="The HTML of the target page" srcset="https://substackcdn.com/image/fetch/$s_!G1Ll!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 424w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 848w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 1272w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">The HTML of the target page</figcaption></figure></div><p>Excellent! Get ready to apply the Rust data parsing logic.</p><h3>Step #4: Implement the Parsing Logic</h3><p>Before implementing the web scraping logic in Rust, study the DOM of the target page. 
Inspect a book HTML element in the browser:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zd3v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zd3v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 424w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 848w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 1272w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zd3v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png" width="1456" height="811" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/baf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Inspecting a book HTML element&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Inspecting a book HTML element" title="Inspecting a book HTML element" srcset="https://substackcdn.com/image/fetch/$s_!Zd3v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 424w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 848w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 1272w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Inspecting a book HTML element</figcaption></figure></div><p>From this structure, notice how you can select all books using the <em>article.product_pod</em> CSS selector. For each book element, you can retrieve:</p><ul><li><p>The title and URL from <em>h3 a</em>.</p></li><li><p>The image URL from <em>.image_container img</em>.</p></li><li><p>The price from <em>.price_color</em>.</p></li><li><p>The rating from <em>p.star-rating</em>.</p></li><li><p>The stock status from <em>.instock.availability</em>.</p></li></ul><p>First, <a href="https://doc.rust-lang.org/book/ch05-01-defining-structs.html">define a struct</a> to store that data:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">#[derive(Debug)]
struct Book {
    url: String,
    image_url: String,
    title: String,
    price: String,
    rating: String,
    in_stock: bool,
}</code></pre></div><p>Next, define the <em>parse_books()</em> function that extracts and structures the data:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">use scraper::{Html, Selector};
// ...

// ...
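// Added sketch, not part of the original tutorial: two optional helpers for
// turning the scraped strings into typed values. The names `rating_to_stars`
// and `parse_price` are hypothetical, and the expected inputs (a lowercase
// rating word such as "three", a price prefixed by a currency symbol) are
// assumptions based on how the target page renders ratings and prices.
fn rating_to_stars(rating: &amp;str) -&gt; Option&lt;u8&gt; {
    // Map the word taken from the `star-rating` class to a number of stars
    match rating {
        "one" =&gt; Some(1),
        "two" =&gt; Some(2),
        "three" =&gt; Some(3),
        "four" =&gt; Some(4),
        "five" =&gt; Some(5),
        _ =&gt; None,
    }
}

fn parse_price(price: &amp;str) -&gt; Option&lt;f64&gt; {
    // Drop any leading non-digit characters (the currency symbol), then parse
    price
        .trim()
        .trim_start_matches(|c: char| !c.is_ascii_digit())
        .parse::&lt;f64&gt;()
        .ok()
}
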
fn parse_books(html: &amp;str) -&gt; Result&lt;Vec&lt;Book&gt;, Box&lt;dyn Error&gt;&gt; {
    // Parse the HTML content
    let document = Html::parse_document(html);

    // Define CSS selectors for the HTML elements of interest
    let book_selector = Selector::parse("article.product_pod")?;
    let title_selector = Selector::parse("h3 a")?;
    let image_selector = Selector::parse(".image_container img")?;
    let price_selector = Selector::parse(".price_color")?;
    let rating_selector = Selector::parse("p.star-rating")?;
    let stock_selector = Selector::parse(".instock.availability")?;

    // Where to store the scraped data
    let mut books = Vec::new();

    // Iterate over each book element and extract the relevant data
    for book_el in document.select(&amp;book_selector) {
        // Apply the parsing logic
        let Some(title_el) = book_el.select(&amp;title_selector).next() else {
            // Skip entries without a title link instead of panicking
            continue;
        };

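        // On the homepage, hrefs already include the "catalogue/" segment,
        // so resolve them against the site root to avoid duplicating it
        let relative_url = title_el.value().attr("href").unwrap_or("");
        let url = format!(
            "https://books.toscrape.com/{}",
            relative_url.trim_start_matches('/')
        );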

        let image_url = book_el
            .select(&amp;image_selector)
            .next()
            .and_then(|img| img.value().attr("src"))
            .unwrap_or("")
            .to_string();

        let image_url = format!(
            "https://books.toscrape.com/{}",
            image_url.trim_start_matches('/')
        );

        let title = title_el
            .value()
            .attr("title")
            .unwrap_or("")
            .to_string();

        let price = book_el
            .select(&amp;price_selector)
            .next()
            .map(|e| e.text().collect::&lt;String&gt;())
            .unwrap_or_default();

        let rating = book_el
            .select(&amp;rating_selector)
            .next()
            .and_then(|e| e.value().attr("class"))
            .unwrap_or("no rating")
            .replace("star-rating", "")
            .trim()
            .to_lowercase();

        let in_stock = book_el
            .select(&amp;stock_selector)
            .next()
            .map(|e| {
                let text = e.text().collect::&lt;String&gt;();
                // The element text is padded with whitespace, so trim it before comparing
                text.trim().to_lowercase() == "in stock"
            })
            .unwrap_or(false);

        // Collect the scraped book data
        books.push(Book {
            title,
            price,
            rating,
            in_stock,
            image_url,
            url,
        });
    }

    Ok(books)
}</code></pre></div><p>This function parses raw HTML into structured data using the <em>scraper</em> crate. <em>Html::parse_document()</em> creates a DOM-like representation of the page, while <em>Selector::parse()</em> defines CSS selectors for targeting elements.</p><p><em>document.select(&amp;book_selector)</em> iterates over each book container. Inside each element, <em>.select()</em> extracts nested elements, while <em>.value().attr()</em> retrieves attributes such as links and titles. The <em>.text()</em> method collects visible text content.</p><p>Finally, all extracted values are assembled into a <em>Book</em> struct, and each instance is stored in a vector for later export or processing.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Step #5: Export the Scraped Data</h3><p>Right now, the scraped data is returned by the <em>parse_books()</em> function as a vector of <em>Book</em> structs. Next, add a function that uses the <em>csv</em> crate to export that data into a CSV file:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">use csv::Writer;
// ...

//...
fn write_csv(books: &amp;[Book], file_path: &amp;str) -&gt; Result&lt;(), Box&lt;dyn std::error::Error&gt;&gt; {
    let mut wtr = Writer::from_path(file_path)?;

    // Write the header row
    wtr.write_record(&amp;[
        "url",
        "image_url",
        "title",
        "price",
        "rating",
        "in_stock",
    ])?;

    for book in books {
        wtr.write_record(&amp;[
            &amp;book.url,
            &amp;book.image_url,
            &amp;book.title,
            &amp;book.price,
            &amp;book.rating,
            &amp;book.in_stock.to_string(),
        ])?;
    }

    wtr.flush()?;
    Ok(())
}</code></pre></div><h3>Step #6: Put It All Together</h3><p>This is the final code of your Rust web scraper:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">// src/main.rs

use std::error::Error;
use reqwest::Client;
use scraper::{Html, Selector};
use csv::Writer;

#[derive(Debug)]
struct Book {
    url: String,
    image_url: String,
    title: String,
    price: String,
    rating: String,
    in_stock: bool,
}

#[tokio::main]
async fn main() -&gt; Result&lt;(), Box&lt;dyn Error&gt;&gt; {
    // Initialize the HTTP client
    let client = Client::builder()
        .build()?;

    // Retrieve the target page
    let url = "https://books.toscrape.com/";
    let response = client
        .get(url)
        .send()
        .await?;

    // Extract the HTML content from the response
    let html = response.text().await?;

    // Parse the books data from the HTML
    let books = parse_books(&amp;html)?;

    // Export the scraped data to a CSV file
    write_csv(&amp;books, "books.csv")?;

    Ok(())
}

fn parse_books(html: &amp;str) -&gt; Result&lt;Vec&lt;Book&gt;, Box&lt;dyn Error&gt;&gt; {
    // Parse the HTML content
    let document = Html::parse_document(html);

    // Define CSS selectors for the HTML elements of interest
    let book_selector = Selector::parse("article.product_pod")?;
    let title_selector = Selector::parse("h3 a")?;
    let image_selector = Selector::parse(".image_container img")?;
    let price_selector = Selector::parse(".price_color")?;
    let rating_selector = Selector::parse("p.star-rating")?;
    let stock_selector = Selector::parse(".instock.availability")?;

    // Where to store the scraped data
    let mut books = Vec::new();

    // Iterate over each book element and extract the relevant data
    for book_el in document.select(&amp;book_selector) {
        // Apply the parsing logic
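        // Skip malformed entries instead of panicking on a missing title link
        let Some(title_el) = book_el.select(&amp;title_selector).next() else {
            continue;
        };

        // Normalize the href: strip a leading "catalogue/" if present, since the prefix is re-added below
        let relative_url = title_el
            .value()
            .attr("href")
            .unwrap_or("")
            .trim_start_matches("catalogue/");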
        let url = format!(
            "https://books.toscrape.com/catalogue/{}",
            relative_url
        );

        let image_url = book_el
            .select(&amp;image_selector)
            .next()
            .and_then(|img| img.value().attr("src"))
            .unwrap_or("")
            .to_string();

        let image_url = format!(
            "https://books.toscrape.com/{}",
            image_url.trim_start_matches('/')
        );

        let title = title_el
            .value()
            .attr("title")
            .unwrap_or("")
            .to_string();

        let price = book_el
            .select(&amp;price_selector)
            .next()
            .map(|e| e.text().collect::&lt;String&gt;())
            .unwrap_or_default();

        let rating = book_el
            .select(&amp;rating_selector)
            .next()
            .and_then(|e| e.value().attr("class"))
            .unwrap_or("no rating")
            .replace("star-rating", "")
            .trim()
            .to_lowercase();

        let in_stock = book_el
            .select(&amp;stock_selector)
            .next()
            .map(|e| {
                let text = e.text().collect::&lt;String&gt;();
                // The element text is padded with whitespace, so trim it before comparing
                text.trim().to_lowercase() == "in stock"
            })
            .unwrap_or(false);

        // Collect the scraped book data
        books.push(Book {
            title,
            price,
            rating,
            in_stock,
            image_url,
            url,
        });
    }

    Ok(books)
}

fn write_csv(books: &amp;[Book], file_path: &amp;str) -&gt; Result&lt;(), Box&lt;dyn std::error::Error&gt;&gt; {
    let mut wtr = Writer::from_path(file_path)?;

    // Write the header row
    wtr.write_record(&amp;[
        "url",
        "image_url",
        "title",
        "price",
        "rating",
        "in_stock",
    ])?;

    for book in books {
        wtr.write_record(&amp;[
            &amp;book.url,
            &amp;book.image_url,
            &amp;book.title,
            &amp;book.price,
            &amp;book.rating,
            &amp;book.in_stock.to_string(),
        ])?;
    }

    wtr.flush()?;
    Ok(())
}</code></pre></div><p>Note how all previously defined functions are now called inside <em>main()</em>. Et voila! In just around 150 lines of code, you&#8217;ve built an efficient web scraper in Rust.</p><p>Run your scraper with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">cargo run</code></pre></div><p>After execution, a <em>books.csv</em> file will be created in your project folder. Open it, and you will see:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ou7T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ou7T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 424w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 848w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ou7T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png" width="1456" height="578" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output books.csv file&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output books.csv file" title="The output books.csv file" srcset="https://substackcdn.com/image/fetch/$s_!Ou7T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 424w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 848w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">The output books.csv file</figcaption></figure></div><p>This matches exactly the data shown on the target website, but now in a structured format. Mission complete!</p><h2>Browser Automation in Rust: Does It Make Sense?</h2><p>First of all, it&#8217;s worth noting that the ecosystem for browser automation in Rust is quite small compared to JavaScript or Python. Also, most libraries aren&#8217;t official, but rather community-backed ports like Playwright Rust or the Selenium bindings.</p><p>Now, from a technical standpoint, browser automation happens inside the browser itself. So, Chrome, Chromium, or Firefox do most of the heavy lifting. 
What you define through the library&#8217;s API simply orchestrates operations like clicking, waiting for elements, and extracting data. These commands are then translated into browser actions via <a href="https://substack.thewebscraping.club/p/webdriver-vs-cdp-vs-bidi">WebDriver, CDP, or WebDriver BiDi</a>.</p><p>Because of that, using a systems-level language like Rust can be more of a burden than an advantage. The main strength of Rust (i.e., raw performance) doesn&#8217;t really matter here, since the controlled browser instances are the actual bottleneck, not your automation code.</p><p>That means we lose Rust&#8217;s biggest advantage while still paying its costs. On top of that, Rust&#8217;s strict compiler and steeper learning curve can slow down development speed.</p><p>To be honest, I see Rust as excellent for the parsing and data processing layer, but I wouldn&#8217;t recommend it for browser automation&#8230;</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Rust for Web Scraping: Final Comment</h3><p>If I had to summarize my experience with Rust for web scraping, I&#8217;d say this: <em>it really shines when you&#8217;re parsing large HTML pages at scale or handling a high number of parsing tasks in parallel.</em></p><p>In those scenarios, the combination of performance, memory safety, and concurrency makes a real difference. 
That said, I wouldn&#8217;t recommend Rust for everyday scraping tasks&#8230; The entry barrier is just too high, the learning curve too steep, and the ecosystem around scraping too small.</p><p>On top of that, finding experienced Rust developers specifically focused on web scraping, or even just <a href="https://substack.thewebscraping.club/p/the-evolving-career-scrapers">translating those skills into job opportunities</a>, can be way more challenging than in more mainstream stacks.</p><p>So my take is pretty simple: consider Rust when performance and scale <em>truly</em> matter. For everything else, prefer Python or JavaScript.</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Can efficiently handle thousands of requests in parallel.</p></li><li><p>Rust HTML parsers are extremely fast.</p></li><li><p>Strict compiler checks and static guarantees lead to stable scraping pipelines.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>Slower development and prototyping speed.</p></li><li><p>Smaller ecosystem of scraping libraries compared to Python or JavaScript.</p></li><li><p>Not a practical choice for browser automation.</p></li></ul>
<div><hr></div><h2>Conclusion</h2><p>Here, I&#8217;ve guided you through the world of web scraping in Rust. In a world dominated by AI slop, security flaws, and neglected best practices, this programming language is gaining traction thanks to its focus on efficiency and strict compile-time checks.</p><p>As you&#8217;ve seen, Rust is excellent for CPU-intensive or memory-intensive tasks like HTML parsing and data processing. Still, it might not be ideal for browser automation or quick prototyping. You also learned how to go from zero to scraped data in CSV format by building a Rust scraper.</p><p>I hope you found this helpful and insightful. If you have any questions, feel free to share them in the comments below!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #105: If LLMs Can Bypass CAPTCHAs, Are CAPTCHA Solver Services Cooked?]]></title><description><![CDATA[Bypassing hCaptcha in the AI era]]></description><link>https://substack.thewebscraping.club/p/bypassing-hcaptcha-llm</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/bypassing-hcaptcha-llm</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Fri, 29 May 2026 13:13:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/87ac6a36-66d4-4153-8a63-d6ee184d4f29_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Think back to 2023.
The story about large language models was that they would automate most work involving reading or writing, and a good slice of the work that means looking at a screen the way you do. We bought that story too. We even wrote <a href="https://substack.thewebscraping.club/p/are-captchas-still-a-thing">Are CAPTCHAs still a thing?</a> that August, reporting on the ETH Zurich paper that claimed AI bots beat humans at reCAPTCHA v2 image challenges by roughly 15%. CAPTCHAs were on the long list of things LLMs were supposed to make irrelevant on the road to general agency.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What the 2023 LLM hype promised, and what 2026 actually shipped</h2><p>The product category that took that promise most literally is the agentic browser. A real Chromium running under an LLM that reads the page, decides what to do, and clicks. Browserbase, Hyperbrowser, Skyvern, Browser Use, BrowserOS, Owl Browser, and a long tail of proxy companies rebranding their scraping browsers as &#8220;AI-powered&#8221;. The pitch in 2024 and 2025 never changed. You hand the agent a task in plain language, and the model handles the rest, including whatever defensive challenge the site throws back.</p><p>It is 2026 now, and that prediction has not played out. CAPTCHAs are still in your pipeline. So we went looking for the answer. Can an LLM actually solve a production-grade CAPTCHA like hCaptcha? We read the code where we could, checked the default configs and the docs, and probed the public surfaces of the solver services these products lean on. The picture is consistent, and it is not the one the marketing sells.</p><p>Every major agentic browser ships a CAPTCHA-solving bullet point on its site. Open the code of the open-source agents, though, and you find a different story. Almost none of them actually use an LLM to solve a CAPTCHA. 
They either refuse to try, or they try and fail.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try 
Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><p>One thing to get straight before we open any code. The two families of CAPTCHA behave very differently, and the hype lands on them unevenly. The invisible ones, reCAPTCHA v3 and Cloudflare Turnstile, score your session in the background and rarely show a puzzle. A stealth-first browser on clean residential proxies usually walks past them without anyone seeing a challenge, which we covered for Turnstile in <a href="https://substack.thewebscraping.club/p/cloudflare-turnstile-what-is-that">Cloudflare Turnstile: what is that and how it works?</a> and <a href="https://substack.thewebscraping.club/p/how-to-bypass-cloudflare-turnstile">THE LAB #73: How to Bypass Cloudflare in 2025</a>. The visible image challenges, hCaptcha and reCAPTCHA v2, actually demand an answer. We went deep on reCAPTCHA v2 in <a href="https://substack.thewebscraping.club/p/bypassing-recaptchas-with-open-source">Bypassing reCAPTCHAs With Open Source and Commercial Tools - Part 2</a>. The one that matters in 2026 is hCaptcha, and that is where this article lives, because it is where most scrapers break.</p><p>So here is the question that drives everything below. When an agentic browser says it &#8220;solves&#8221; an hCaptcha, what does its code actually do?</p><h2>Tool landscape</h2><h3>What&#8217;s the commercial offer today</h3><p>There are four families of strategy in the commercial set. In none of them is the LLM in the agent layer doing the CAPTCHA work.</p><p>The first family is stealth-first. Residential proxies, fingerprint shaping, and request patterning lower the bot score so the visible challenge never triggers. The CAPTCHA is not solved. It is prevented from appearing. 
That gives you the cleanest legal posture in the set, because no automated solving is happening. ZenRows is the example.</p><p>The second family is the opposite of stealth. Instead of hiding that the traffic is automated, the vendor declares it openly and relies on a business arrangement with the CAPTCHA providers to be let through. Browserbase is explicit about this. Its <a href="https://docs.browserbase.com/features/stealth-mode">Stealth Mode documentation</a> says &#8220;through Browserbase&#8217;s partnerships with CAPTCHA providers, Browserbase can resolve challenges automatically so your sessions continue without interruption&#8221;, with solving &#8220;enabled by default for all sessions&#8221;. This is the verified-bot path, the same idea behind Cloudflare&#8217;s Web Bot Auth. A declared, allowlisted agent rather than a disguised one. No model solves a puzzle on either side. The provider recognizes the partner and waves it through. For the common challenge types, according to their documentation, this works without a fight.</p><p>The third family pairs a proprietary solver with documented third-party integrations. The vendor ships its own solving for some challenge types. For the rest, it documents how to wire in an external solver service. The external solver watches for the challenge and returns the response token through its extension or REST API. The agent then submits. Hyperbrowser and Skyvern Cloud both present this shape, a native or closed-source component plus a documented third-party path. 
Hyperbrowser advertises &#8220;Native Cloudflare Turnstile &amp; CAPTCHA Solving&#8221; with &#8220;No external plugins&#8221; in its <a href="https://tech.hyperbrowser.ai/scraping-infrastructure-native-turnstile-captcha-solving">post on native CAPTCHA solving</a>, then points to an external solver for the challenge types the native one does not cover.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a 
href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><p>The fourth family runs an in-house solver alongside the agent. Could be a vision LLM, could be classical computer vision, could be something else. No third party is in the critical path. The vendor owns the whole stack. Bright Data, Oxylabs, and Owl Browser sit here. 
<a href="https://owlbrowser.net/">Owl Browser</a> puts hard numbers on the claim: &#8220;detect and automatically solve reCAPTCHA v2 (1.2s), hCaptcha (0.8s), Turnstile (0.3s), and image CAPTCHAs.&#8221; Bright Data sells &#8220;AI-based unlocking logic&#8221; that handles &#8220;CAPTCHA solving, fingerprinting, retries, best headers, location and more&#8221; on its <a href="https://brightdata.com/products/web-unlocker/captcha-solver">Web Unlocker page</a>.</p><p>The table below maps each vendor to the mechanism its public documentation surfaces, with the source page that proves it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!klTW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!klTW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 424w, https://substackcdn.com/image/fetch/$s_!klTW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 848w, https://substackcdn.com/image/fetch/$s_!klTW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 1272w, https://substackcdn.com/image/fetch/$s_!klTW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!klTW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png" width="654" height="839" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:839,&quot;width&quot;:654,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164510,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/199732185?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!klTW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 424w, https://substackcdn.com/image/fetch/$s_!klTW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 848w, https://substackcdn.com/image/fetch/$s_!klTW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 1272w, https://substackcdn.com/image/fetch/$s_!klTW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>The open-source agents we can read</h3><p>For the proprietary CAPTCHA-solving strategies we cannot see how AI is used. 
For the open-source ones we can. So we opened their code and read exactly how they handle CAPTCHAs. Three projects matter here.</p><p><a href="https://github.com/browser-use/browser-use">browser-use</a> is the most popular open-source LLM agent framework for browser automation. MIT-licensed, vendor-agnostic on the LLM side. The repo contains no LLM-driven CAPTCHA-solving logic. The one CAPTCHA-related file it ships, <a href="https://raw.githubusercontent.com/browser-use/browser-use/main/browser_use/browser/watchdogs/captcha_watchdog.py">captcha_watchdog.py</a>, does not solve anything either. It waits for a solver running in the BrowserUse cloud proxy and blocks the agent loop until that solver reports back. Run the library locally with your own model and no cloud proxy, and the watchdog has nothing to wait for. That makes browser-use the cleanest test of the claim that a local LLM agent solves the CAPTCHA by reading the page.</p><p><a href="https://github.com/Skyvern-AI/skyvern">Skyvern OSS</a> is the open core of the Skyvern Cloud product. AGPL-3.0, focused on form-filling and structured workflows, written in Python.</p><p><a href="https://github.com/browseros-ai/BrowserOS">BrowserOS</a> is a YC S24 open-source agent-driven browser. AGPL-3.0, 11k stars on GitHub, active development. It pairs a Chromium fork with an integrated agent runtime.</p><h2>Modeling hCaptcha and reading what the open-source agents actually do</h2><p>Before testing, it helps to model the target. hCaptcha embeds a widget on the host page through a script served from <code>hcaptcha.com</code>. </p><p>The widget renders inside an iframe whose origin is <code>hcaptcha.com</code>, cross-origin to the host page. When the user clicks the &#8220;I am human&#8221; checkbox, the widget decides whether to issue a challenge. If it does, a second iframe opens with the puzzle dialog. 
The puzzle layout varies between runs (3x3 grid, 4x3 grid, area-select with a single click on an image, bounding-box, multiple-choice prompt). When the puzzle is solved, the widget writes a response token to a hidden <code>textarea[name="h-captcha-response"]</code> on the host page. The host form reads that textarea on submit and posts the token along with the rest of the data. The whole solve interaction happens inside a frame the host page cannot script. The same-origin policy boundary blocks it.</p><p>That last detail decides what works and what does not. An agent driving the host page through Playwright or CDP has full control over the outer page. Inside the <code>hcaptcha.com</code> frame, its control is limited. A browser extension runs with cross-origin privileges. It can both observe and click inside the widget frame. That asymmetry explains most of what follows.</p><p>We started by reading three open-source agent repos to see what each one had decided to do at this boundary.</p><h3>Skyvern OSS bails to the human</h3><p>Skyvern&#8217;s README is candid about the scope of the OSS release: &#8220;All of the core logic powering Skyvern is available in this open source repository licensed under the AGPL-3.0 License, with the exception of anti-bot measures available in our managed cloud offering.&#8221; That single sentence puts the CAPTCHA section of every Skyvern Cloud feature page outside the repo you can read.</p><p>The repo confirms the framing. In <a href="https://github.com/Skyvern-AI/skyvern/blob/main/skyvern/forge/agent_functions.py#L800-L804">agent_functions.py, lines 800-804</a>, the <code>auto_solve_captchas</code> helper returns <code>False</code> unconditionally. In <a href="https://github.com/Skyvern-AI/skyvern/blob/main/skyvern/webeye/actions/handler.py#L1149-L1162">handler.py, lines 1149-1162</a>, <code>handle_solve_captcha_action</code> does exactly one thing of substance.</p><pre><code><code>async def handle_solve_captcha_action(
    action: actions.SolveCaptchaAction,
    page: Page,
    scraped_page: ScrapedPage,
    task: Task,
    step: Step,
) -&gt; list[ActionResult]:
    LOG.warning(
        "Please solve the captcha on the page, you have 30 seconds",
        action=action,
    )
    await asyncio.sleep(30)
    return [ActionSuccess()]</code></code></pre><p>Thirty seconds of <code>asyncio.sleep</code> and a log message asking a human to handle it. Then a success result, regardless of what the human actually did. The script-generation path in <a href="https://github.com/Skyvern-AI/skyvern/blob/main/skyvern/core/script_generations/skyvern_page.py#L1252-L1254">skyvern_page.py, lines 1252-1254</a> is even more direct. <code>solve_captcha</code> raises <code>NotImplementedError</code>. A Skyvern user opened <a href="https://github.com/Skyvern-AI/skyvern/issues/1117">issue #1117</a> asking how CAPTCHA solving was meant to work in OSS. A maintainer answered plainly: &#8220;We haven&#8217;t open sourced anything related to our captcha solver / anti-bot measures. We don&#8217;t want people abusing these things, so they must remain closed source unfortunately.&#8221;</p><p>The Skyvern OSS code does not pretend to solve CAPTCHAs. Whatever Skyvern Cloud does on top of this, the OSS release hands the problem to a human and moves on.</p><p>As always, the code that we will use to bypass hCaptcha can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">105.</a></strong><a href="https://github.com/TheWebScrapingClub/thelab">HCAPTCHA</a><strong>.</strong></p>
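<p>The handler shown earlier sleeps for a fixed 30 seconds and then reports success whatever happened. A stricter human-handoff step would poll for the response token and report failure when none appears. The block below is a minimal sketch of that idea, not Skyvern&#8217;s code. <code>wait_for_captcha_token</code> and <code>read_token</code> are hypothetical names. <code>read_token</code> stands for any async callable that returns the current value of <code>textarea[name="h-captcha-response"]</code> on the host page, for example a thin wrapper around a Playwright call, which is an assumption about your setup.</p>

```python
import asyncio


async def wait_for_captcha_token(read_token, timeout_s=30.0, poll_s=0.5):
    """Poll until the hCaptcha response token appears, or give up.

    read_token is a hypothetical async callable supplied by the caller.
    It should return the current value of the hidden textarea named
    h-captcha-response (a string, possibly empty, or None).
    Returns the stripped token, or None when the time budget runs out.
    """
    # Turn the time budget into a fixed number of polls (at least one).
    attempts = max(1, int(timeout_s / poll_s))
    for _ in range(attempts):
        value = await read_token()
        token = (value or "").strip()
        if token:
            # A human or an external solver filled the textarea.
            return token
        await asyncio.sleep(poll_s)
    # Nobody solved the challenge in time: the caller should report failure.
    return None


if __name__ == "__main__":
    # Fake reader for illustration: empty twice, then a made-up token.
    fake_values = iter(["", "", "example-token"])

    async def fake_read_token():
        return next(fake_values)

    result = asyncio.run(
        wait_for_captcha_token(fake_read_token, timeout_s=2.0, poll_s=0.1)
    )
    print("token found" if result else "timed out")
```

<p>With this shape, the caller can map a <code>None</code> result to a failed action instead of a success, so the agent loop knows the challenge is still standing. The polling interval and timeout are illustrative defaults, not values taken from any of the projects discussed here.</p>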
      <p>
          <a href="https://substack.thewebscraping.club/p/bypassing-hcaptcha-llm">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Scrape Open-Source Datasets Ethically]]></title><description><![CDATA[How to collect open data responsibly, without breaking rules or burning bridges]]></description><link>https://substack.thewebscraping.club/p/how-to-scrape-open-source-datasets</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-scrape-open-source-datasets</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 24 May 2026 19:58:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1ef53778-bd4a-4fd8-9911-912fc9f8ea67_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you need to scrape data from the web, &#8220;open data&#8221; and &#8220;open-source datasets&#8221; sound like a green light. No paywall, no login, no restrictions: just data sitting there, ready to be collected. It is a reasonable assumption, right?</p><p>Well, not so fast.</p><p>Open data does not automatically mean free to use, free to redistribute, or free from privacy obligations. And scraping it without thinking through the implications can land you in legal trouble, get your IP banned from a public infrastructure that was never designed to handle aggressive crawlers, or cause you to expose people&#8217;s personal information.</p><p>In this article, we will go through a complete picture of the &#8220;open data&#8221; world: what the problem actually is, how to approach it correctly, and how to implement responsible open data scrapers in Python. 
</p><p>Let&#8217;s dive into it!</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank <strong>NetNut</strong>, the platinum partner of the month. 
Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>What &#8220;Open Data&#8221; Actually Means Legally, Ethically, and Practically</strong></h2><p>&#8220;Open&#8221; is one of the most overloaded words in the data world. Depending on the license, the jurisdiction, and the type of data involved, the same publicly accessible dataset can be freely redistributable, commercially restricted, privacy-sensitive, or legally off-limits entirely. </p><p>So, before anything else, let&#8217;s establish what you are actually dealing with.</p><h3>What &#8220;Open-Source Dataset&#8221; Actually Means (and What It Doesn&#8217;t)</h3><p>Where a dataset sits on the licensing spectrum determines everything: whether you can redistribute it, whether you can use it commercially, and whether collecting it at all exposes you to liability. Here is how the spectrum breaks down:</p><ul><li><p><strong>CC0</strong> (Creative Commons Zero): Essentially, it is a public domain dedication. The author waives all rights. You can scrape it, redistribute it, use it commercially, and modify it.</p></li><li><p><strong>CC-BY</strong> (Creative Commons Attribution): It requires you to credit the original source. This means you must clearly state where the data came from, who created it, and link back to the original when you publish or redistribute it. 
This is the most permissive license after CC0, and it is generally easy to comply with.</p></li><li><p><strong>CC-BY-SA</strong> (Share-Alike): This carries the same attribution requirement as CC-BY, but adds a condition: any derivative work you publish must carry the same license. In practice, this means you cannot fold a CC-BY-SA dataset into a proprietary product and lock it down.</p></li><li><p><strong>CC-BY-NC</strong> (Non-Commercial): It also requires attribution, but restricts commercial use entirely. You can use the data for research, journalism, or personal projects, but the moment money is involved, you need a separate agreement with the data owner.</p></li><li><p><strong>ODbL</strong> (Open Database License), used by OpenStreetMap: It requires both attribution and share-alike, specifically for databases. It is worth noting that ODbL distinguishes between the database itself and the contents. Basically, you can use individual facts freely, but any public use of the database as a whole must comply with the license terms.</p></li></ul><p>And then there is the grey zone, which is where most scraping engineers actually operate: data that is publicly accessible but carries no explicit license. Common cases are government portals, academic repositories, open court records, and municipal datasets. This is a huge portion of what people call &#8220;open data&#8221;. And here is the thing that matters for scraping professionals: <strong>no license does not mean free to use</strong>. In most jurisdictions, the absence of a license means the default copyright law applies. 
Which means the creator reserves all rights.</p><p>So before you write a single line of scraper code, the first question is not <em>&#8220;Can I access this?&#8221;</em> but <em>&#8220;Under what terms am I allowed to use what I access?&#8221;</em></p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" 
width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo 
Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3>Where the Ethical (and Legal) Risks Hide</h3><p>Once you have cleared the license question, there are still several risk areas that are easy to overlook:</p><ul><li><p><strong>License violations</strong>: This is the most obvious one. If a dataset requires attribution and you redistribute it without crediting the source, you are in breach. If it has a non-commercial clause and you use it in a commercial product, it&#8217;s the same story. These are the kinds of things that generate cease-and-desist letters.</p></li><li><p><strong>PII embedded in &#8220;open&#8221; datasets</strong>: This is a subtler and arguably more dangerous problem than license violations. Consider open court records: they are public by design, but they contain names, addresses, and sometimes sensitive personal details. Census microdata, even when anonymized at the aggregate level, can contain individual-level records. Similarly, GitHub commit history is public, but it contains email addresses, which are personal data. So, the fact that data was made public by someone else does not strip it of its privacy implications when you collect, aggregate, and store it.</p></li><li><p><strong>Jurisdictional complexity</strong>: A dataset hosted on a European government portal can carry GDPR obligations even if you are scraping it from the United States. The GDPR&#8217;s reach depends largely on where the data subjects are located, not where the scraper is running. If you are collecting data about EU residents, you should assume you are in GDPR territory regardless of your own geography.</p></li><li><p><strong>The aggregation problem</strong>: This is probably one of the most underappreciated risks in the scraping industry. 
Individually, a dataset of names, a dataset of addresses, and a dataset of employment records might each be harmless and openly licensed. But combine them, and you have created a detailed profile of real people. This is exactly the kind of profiling that privacy regulations were designed to prevent.</p></li></ul><h3>The Infrastructure Problem: Open Data Portals Are Not Built for Scrapers</h3><p>Many scraping engineers come to open data with habits built on commercial targets. That experience can be misleading, because the infrastructure behind open data portals is completely different.</p><p><a href="https://substack.thewebscraping.club/p/sentiment-analysis-product-reviews">When you scrape a large e-commerce website</a> or a <a href="https://substack.thewebscraping.club/p/scraping-linkedin-public-data">major social media platform</a>, you are hitting servers that are engineered to handle millions of requests per day, backed by CDNs, load balancers, and dedicated anti-bot teams. In other words, they can take a (hard) hit.</p><p>A municipal open data portal, a university&#8217;s research repository, or a small NGO&#8217;s dataset hosting is an entirely different story: these typically run on modest, underfunded infrastructure. A scraper that would barely register as noise on Amazon&#8217;s servers could genuinely degrade performance for a public data portal serving thousands of researchers.</p><p>This is why scraping open data portals aggressively is arguably more unethical than doing the same to a commercial target. You are not fighting a corporation&#8217;s anti-bot system. You are potentially taking down a public resource that other people depend on.</p><h3>A Four-Step Framework for Scraping Open Datasets Without Breaking Rules or Infrastructure</h3><p>Every risk outlined above has a straightforward mitigation, but only if you apply it at the right point in your workflow. 
The mistake most scraping engineers make is treating these as afterthoughts: checking the license after the scraper is already built, thinking about PII after the data is already stored. Let&#8217;s discuss a framework that inverts this.</p><h3>License-First Workflow: Read Before You Scrape</h3><p>The fix for the license problem is simple in principle, even if it requires discipline in practice: make license verification the first step of your workflow.</p><p>Most well-maintained open data portals provide license information in one of these three places: a <code>LICENSE</code> file in the dataset&#8217;s root directory, a metadata field in the dataset&#8217;s API response, or the dataset&#8217;s documentation page. Here is a quick reference for what the licenses described above mean for your use case:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AbdL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AbdL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 424w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 848w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1272w, 
https://substackcdn.com/image/fetch/$s_!AbdL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AbdL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png" width="1021" height="376" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:1021,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48171,&quot;alt&quot;:&quot;Summary table for data licenses by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196400924?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Summary table for data licenses by Federico Trotta" title="Summary table for data licenses by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!AbdL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 424w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 848w, 
https://substackcdn.com/image/fetch/$s_!AbdL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1272w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Summary table for data licenses</figcaption></figure></div><p>When there is no license, the safe default is not to 
scrape and redistribute without seeking explicit permission from the dataset owner. A short email asking for clarification is a sign of professionalism.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Prefer APIs and Bulk Downloads Over Scraping</h3><p>This is a rule that experienced scraping engineers sometimes forget because they are so used to reaching for their scraper toolkit: always check for an official API or bulk download endpoint before writing a scraper.</p><p>Most serious open data portals expose REST APIs or provide direct bulk download links. Using these is better in every dimension: it is faster, more reliable, more respectful of the server, and often gives you cleaner, structured data than you would get from parsing HTML.</p><p>Your workflow should be:</p><ol><li><p>Check the portal&#8217;s documentation for an API.</p></li><li><p>Check for a <code>Sitemap</code> or structured data endpoint (as discussed in our <a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">article on robots.txt and its implications</a>).</p></li><li><p>Check for bulk download links (CSV, JSON, Parquet).</p></li><li><p>Only fall back to HTML scraping if none of the above exist.</p></li></ol><p>Scraping should be your last resort, not your first instinct.</p><h3>Responsible Scraping Behavior for Open Infrastructure</h3><p>When scraping is genuinely the only option, the rules of polite scraping apply. 
But in the case of open data portals, you should apply a higher standard than you would on a commercial target.</p><p>As covered in &#8220;<a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">best practices for ethical web scraping</a>&#8221;, respecting rate limits, introducing delays between requests, and using a descriptive User-Agent are baseline requirements. Because open data portals run on weaker infrastructure, the following additional rules are worth adopting:</p><ul><li><p><strong>Respect </strong><em><strong>Crawl-delay</strong></em><strong> strictly</strong>: Even if major crawlers ignore it, that directive is a useful signal of how much load an underfunded server can handle.</p></li><li><p><strong>Cache responses locally</strong>: If you need to re-run your scraper for testing or debugging, you should not be hitting the server again. Cache what you have already fetched.</p></li><li><p><strong>Scrape during off-peak hours</strong>: For public portals serving researchers and government users, off-peak typically means nights and weekends in the portal&#8217;s local timezone.</p></li><li><p><strong>Scrape only what you need</strong>: This sounds obvious, but it&#8217;s easy to over-collect data &#8220;just in case&#8221;. On open portals, every unnecessary request is a cost imposed on a public resource running on underfunded infrastructure.</p></li></ul><h3>Handling PII in Open Datasets</h3><p>PII stands for Personally Identifiable Information: any data that can be used, alone or in combination with other data, to identify a specific individual. Think names, email addresses, and phone numbers, but also subtler things like IP addresses or device IDs.</p><p>The reality is that most well-maintained open data portals put datasets through a review process before publication, and raw PII in open datasets is not as common as you might think. 
The most common cases where PII can slip through are quite specific: older government datasets published before modern privacy review processes, improperly anonymized academic research deposits, or crowdsourced datasets where contributors included personal details voluntarily.</p><p>Outside those cases, the real risk for most scraping engineers sits at the aggregation level. A dataset of names, a dataset of ZIP codes, and a dataset of employment records might each be perfectly clean and openly licensed in isolation. But combine them, and you have built a detailed profile of real individuals. This is something that privacy regulations like the GDPR and CPRA were specifically designed to prevent. And once you collect, store, and process that combined data, you become responsible for it, regardless of where it originally came from.</p><p>The key principle remains the usual one: identify and handle PII at collection time. Here is a classification you can use to audit the fields that are likely to contain PII:</p><ul><li><p><strong>Direct identifiers</strong>: names, email addresses, phone numbers, national ID numbers, passport numbers, and social security numbers. These are the clearest cases, as they point to a specific individual on their own, without needing to be combined with anything else. If you see these fields in a dataset, there is no ambiguity: you are dealing with PII.</p></li><li><p><strong>Quasi-identifiers</strong>: dates of birth, ZIP codes, job titles, gender, ethnicity, and salary ranges. None of these identifies a person on its own, but they become dangerous in combination. 
A classic example is the combination of just three fields (date of birth, gender, and ZIP code): a widely cited estimate based on US census data found this enough to uniquely identify roughly 87% of the US population.</p></li><li><p><strong>Sensitive categories under GDPR</strong>: health and medical data, political opinions, religious or philosophical beliefs, biometric data, genetic data, sexual orientation, and trade union membership. This is a legally distinct class that carries stricter obligations regardless of context. In practice, you cannot process this data based on legitimate interest alone. You need explicit consent or another specific legal basis, and the bar is significantly higher than for ordinary PII.</p></li></ul><p>For each PII field, decide upfront: do you need it? If not, drop it at collection time. If you do need it, apply pseudonymization (replacing the identifier with a token that can only be linked back to the person using separately held information) or anonymization (irreversible removal or generalization) before storage.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Python Implementation: Putting the Full Responsible Scraping Pipeline Into Code</h2><p>Principles are only useful if they translate into implementation. 
Below are two concrete components you can adapt for your own pipelines:</p><ul><li><p>Checking a dataset&#8217;s license before downloading anything, using CKAN&#8217;s metadata API, with a practical fallback strategy for portals that don&#8217;t run CKAN.</p></li><li><p>Running PII detection at collection time, using field-level schema classification, with an honest discussion of where that approach falls short.</p></li></ul><p>Note that the examples below omit an API-first fetch pattern and a polite scraper skeleton, even though both are covered in the framework section above. Those are problems with well-known, straightforward solutions that every scraping engineer should already know. The following sections focus instead on lesser-known techniques that you can adapt to your own pipelines.</p><h3>Checking a Dataset&#8217;s License Programmatically</h3><p>Many open data portals are built on <a href="https://ckan.org/">CKAN</a>, an open-source data management system used by governments and enterprises. CKAN exposes a REST API that includes license metadata, which makes programmatic license checking straightforward.</p><p>Here is how to query a CKAN-based portal and extract license information before proceeding:</p><pre><code><code>import requests

def check_dataset_license(portal_base_url: str, dataset_id: str) -&gt; dict:
    """
    Queries a CKAN portal API to retrieve license information
    for a given dataset before any scraping begins.
    """
    api_url = f"{portal_base_url}/api/3/action/package_show"
    params = {"id": dataset_id}

    response = requests.get(api_url, params=params, timeout=10)
    response.raise_for_status()

    data = response.json()
    result = data.get("result", {})

    license_info = {
        "dataset_name": result.get("title", "Unknown"),
        "license_id": result.get("license_id", "Not specified"),
        "license_title": result.get("license_title", "Not specified"),
        "license_url": result.get("license_url", "Not specified"),
    }

    return license_info
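
# A minimal sketch (not part of CKAN's API): gate the scrape on an allowlist
# of license IDs. ALLOWED_LICENSE_IDS and license_allows_use are hypothetical
# names, and the IDs listed are examples only; verify the exact identifiers
# your target portal returns before relying on them.
ALLOWED_LICENSE_IDS = {"uk-ogl", "cc-by", "cc-zero", "odc-pddl"}

def license_allows_use(license_info, allowed_ids=ALLOWED_LICENSE_IDS):
    """
    Returns True only when the dataset declares a license ID that is on the
    allowlist. A missing or unrecognized license is treated as "do not
    proceed", matching the safe default described earlier.
    """
    license_id = str(license_info.get("license_id") or "").strip().lower()
    return license_id in allowed_ids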

# Example: querying the UK government's open data portal
portal = "https://data.gov.uk"
dataset = "road-accidents-safety-data"

license_info = check_dataset_license(portal, dataset)

print(f"Dataset: {license_info['dataset_name']}")
print(f"License: {license_info['license_title']}")
print(f"License ID: {license_info['license_id']}")
print(f"License URL: {license_info['license_url']}")</code></code></pre><p>This outputs the following:</p><pre><code><code>Dataset: Road Safety Data
License: UK Open Government Licence (OGL)
License ID: uk-ogl
License URL: https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/</code></code></pre><p>With this information in hand, you can make an informed decision before a single byte of dataset content is downloaded. Specifically, you can directly check the <a href="https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/">government license page</a>. The image below partially shows the license page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hLlt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hLlt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 424w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 848w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1272w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hLlt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png" width="1211" height="690" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:690,&quot;width&quot;:1211,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135747,&quot;alt&quot;:&quot;The license page of the National Archive of the UK Government by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196400924?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The license page of the National Archive of the UK Government by Federico Trotta" title="The license page of the National Archive of the UK Government by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!hLlt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 424w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 848w, 
https://substackcdn.com/image/fetch/$s_!hLlt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1272w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The license page of the National Archive of the UK Government</figcaption></figure></div><p>But what if the portal you 
need to scrape doesn&#8217;t run CKAN? Not all open data portals do: <a href="https://dev.socrata.com/">Socrata</a> (used by many US city and state governments), <a href="https://getdkan.org/">DKAN</a>, and custom-built portals each have different or no metadata APIs. In those cases, your fallback options are the following:</p><ul><li><p>Check for a <em>LICENSE</em> or <em>METADATA</em> file in the dataset&#8217;s root directory or bulk download package. Many portals include one.</p></li><li><p>Look for a <em>&lt;link rel="license"&gt;</em> tag in the dataset&#8217;s HTML page, which some portals emit as structured metadata.</p></li><li><p>Check the portal&#8217;s documentation or &#8220;About&#8221; page, where license terms are often stated globally for all datasets.</p></li></ul><p>If none of the above yields a clear answer, treat the license as unknown and do not redistribute without seeking explicit written permission from the dataset owner. A short email asking for clarification is a professional move.</p><div><hr></div><h3>PII Detection at Scrape Time</h3><p>Here, the approach depends heavily on what you actually know about the data you need to scrape. You will encounter two situations in practice, each calling for a different strategy:</p><ul><li><p><strong>You know the schema</strong>: If you are retrieving structured data, field-level detection is the right approach. You know which fields are likely to carry PII, so you can target them directly. This is faster, more precise, and produces far fewer false positives than running a general NER model over free text.</p></li><li><p><strong>You have no schema</strong>: For unstructured data, NER-based detection is a reasonable starting point, but go in with realistic expectations. A common solution is using <a href="https://spacy.io/models/en">spaCy&#8217;s </a><em><a href="https://spacy.io/models/en">en_core_web_sm</a></em>, which is a small model trained on news text, so don&#8217;t expect it to work miracles. Another approach, which can give much better results, is <a href="https://substack.thewebscraping.club/p/how-to-use-llms-data-extraction-unstructured-text">using LLMs to give a structure to unstructured text</a>.</p></li></ul><p>For the structured case, here is a field-level PII detection pipeline:</p><pre><code><code>import re
import hashlib
from dataclasses import dataclass
from typing import Any

# Fields that are unambiguously PII on their own
DIRECT_IDENTIFIER_FIELDS = {
    "name", "full_name", "first_name", "last_name",
    "email", "email_address",
    "phone", "phone_number", "mobile",
    "ssn", "national_id", "passport_number",
    "ip_address", "device_id"
}

# Fields that are not PII alone but dangerous in combination
QUASI_IDENTIFIER_FIELDS = {
    "date_of_birth", "dob", "birth_date",
    "zip_code", "postcode", "zip",
    "gender", "sex",
    "job_title", "occupation",
    "salary", "income",
    "ethnicity", "race"
}

# Regex patterns for validating suspected PII values at the content level
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
PHONE_PATTERN = re.compile(r"\b(\+?\d[\d\s\-().]{7,}\d)\b")

@dataclass
class FieldAudit:
    field_name: str
    classification: str    # "direct", "quasi", or "clean"
    original_value: Any
    processed_value: Any   # pseudonymized, generalized, or original
    action_taken: str       # "pseudonymized", "generalized", "dropped", "kept"

def pseudonymize(value: Any) -&gt; str:
    """
    Replaces a PII value with a consistent, deterministic token.
    Using a hash means the same value always produces the same token,
    which preserves referential integrity across records (e.g., you can
    still count unique users without knowing who they are).
    In production, use an HMAC with a secret key instead of plain SHA-256.
    """
    return hashlib.sha256(str(value).encode()).hexdigest()[:16]
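
# A hedged sketch of the HMAC variant mentioned in the docstring above.
# pseudonymize_keyed and its secret_key parameter are illustrative names;
# how you store and rotate the key is left to your own secrets management.
import hmac

def pseudonymize_keyed(value, secret_key: bytes, length: int = 16):
    """
    Same idea as pseudonymize(), but keyed: without secret_key, nobody can
    rebuild the tokens by hashing a list of guessed names or emails.
    """
    digest = hmac.new(secret_key, str(value).encode(), "sha256")
    return digest.hexdigest()[:length]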

def generalize_date(value: str) -&gt; str:
    """
    Reduces a full date of birth to a birth year only.
    A simple but effective generalization for quasi-identifiers.
    """
    # Looks for a four-digit year (19xx or 20xx) anywhere in the value, so it
    # works for YYYY-MM-DD, DD/MM/YYYY, and MM/DD/YYYY alike
    match = re.search(r"\b(19|20)\d{2}\b", str(value))
    return match.group(0) if match else "UNKNOWN_YEAR"

def audit_record(record: dict) -&gt; tuple[dict, list[FieldAudit]]:
    """
    Processes a single structured record field by field.
    Returns a cleaned record and a full audit trail of what was done to each field.

    Strategy:
    - Direct identifiers: pseudonymize (preserve referential integrity)
    - Quasi-identifiers: generalize where possible, pseudonymize otherwise
    - Everything else: pass through unchanged
    """
    clean_record = {}
    audit_trail = []

    for field_name, value in record.items():
        normalized = field_name.lower().strip()

        if normalized in DIRECT_IDENTIFIER_FIELDS:
            processed = pseudonymize(value)
            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="direct",
                original_value=value,
                processed_value=processed,
                action_taken="pseudonymized"
            ))
            clean_record[field_name] = processed

        elif normalized in QUASI_IDENTIFIER_FIELDS:
            # Apply field-specific generalization where we can
            if normalized in {"date_of_birth", "dob", "birth_date"}:
                processed = generalize_date(value)
                action = "generalized"
            else:
                # For other quasi-identifiers, pseudonymize as a safe default
                processed = pseudonymize(value)
                action = "pseudonymized"

            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="quasi",
                original_value=value,
                processed_value=processed,
                action_taken=action
            ))
            clean_record[field_name] = processed

        else:
            # Field is not in either PII list &#8212; pass through, but still
            # run a regex check on string values as a safety net
            if isinstance(value, str):
                if EMAIL_PATTERN.search(value) or PHONE_PATTERN.search(value):
                    # Unexpected PII in a non-PII field: flag it and pseudonymize
                    processed = pseudonymize(value)
                    audit_trail.append(FieldAudit(
                        field_name=field_name,
                        classification="direct",
                        original_value=value,
                        processed_value=processed,
                        action_taken="pseudonymized (unexpected PII in non-PII field)"
                    ))
                    clean_record[field_name] = processed
                    continue

            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="clean",
                original_value=value,
                processed_value=value,
                action_taken="kept"
            ))
            clean_record[field_name] = value

    return clean_record, audit_trail

def process_records(records: list[dict]) -&gt; list[dict]:
    """
    Runs field-level PII detection and handling across a list of records.
    Prints an audit summary for any record where PII was found.
    """
    clean_records = []

    for i, record in enumerate(records):
        clean_record, audit_trail = audit_record(record)
        pii_fields = [a for a in audit_trail if a.classification != "clean"]

        if pii_fields:
            print(f"Record {i}: PII detected and handled in {len(pii_fields)} field(s):")
            for audit in pii_fields:
                print(f"  [{audit.classification.upper()}] {audit.field_name} "
                      f"&#8594; {audit.action_taken}")

        clean_records.append(clean_record)

    return clean_records
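
# A rough sketch of a k-anonymity style check on the cleaned output. The
# function name, its parameters, and the default k are assumptions made for
# illustration; this is not a complete re-identification risk assessment.
from collections import Counter

def find_rare_combinations(rows, quasi_fields, k=5):
    """
    Counts how many rows share each combination of quasi-identifier values
    and returns the combinations shared by fewer than k rows. Those rows are
    the easiest to single out and deserve further generalization.
    """
    counts = Counter(
        tuple(row.get(name) for name in quasi_fields) for row in rows
    )
    # range(1, k) covers group sizes 1 .. k-1, i.e. "fewer than k"
    return {combo: n for combo, n in counts.items() if n in range(1, k)}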

# Example: a batch of records from a scraped open dataset
records = [
    {
        "record_id": "A001",
        "name": "Jane Doe",
        "date_of_birth": "1985-03-22",
        "zip_code": "SW1A 1AA",
        "incident_type": "Road accident",
        "severity": "Slight"
    },
    {
        "record_id": "A002",
        "name": "John Smith",
        "date_of_birth": "1973-11-04",
        "zip_code": "EC1A 1BB",
        "incident_type": "Road accident",
        "severity": "Serious",
        # An email that slipped into a free-text notes field
        "notes": "Witness contact: witness@example.com"
    }
]

clean = process_records(records)</code></code></pre><p>The output is as follows:</p><pre><code><code>Record 0: PII detected and handled in 3 field(s):
  [DIRECT] name &#8594; pseudonymized
  [QUASI] date_of_birth &#8594; generalized
  [QUASI] zip_code &#8594; pseudonymized
Record 1: PII detected and handled in 4 field(s):
  [DIRECT] name &#8594; pseudonymized
  [QUASI] date_of_birth &#8594; generalized
  [QUASI] zip_code &#8594; pseudonymized
  [DIRECT] notes &#8594; pseudonymized (unexpected PII in non-PII field)</code></code></pre><p>A few things worth calling out in this implementation:</p><ul><li><p><strong>Pseudonymization preserves referential integrity:</strong> Because the same input always produces the same hash token, you can still count unique individuals, join records, or track entities across datasets, without storing the raw PII. In production, replace the plain SHA-256 with an HMAC keyed on a secret. Names, dates, and postcodes are easy to guess, so anyone who knows the hashing algorithm could otherwise hash likely values and match them against your tokens.</p></li><li><p><strong>The regex safety net on non-PII fields:</strong> This catches the common real-world case where PII slips into a free-text or notes field that your schema classification didn&#8217;t anticipate. It is not foolproof, but it catches the obvious cases.</p></li><li><p><strong>The audit trail is intentional:</strong> Every field-level decision is logged. If you are ever asked to demonstrate that your collection process handled PII responsibly, you have a record of exactly what was done to each field in each record. Keep in mind that each <code>FieldAudit</code> entry also holds the <code>original_value</code>, so the trail itself contains raw PII and should be stored and access-controlled accordingly.</p></li></ul><h2>Conclusion</h2><p>Open data is a shared resource, and how you interact with it says something about you as a professional. In this article, you learned what &#8220;open&#8221; means in the context of data scraping and how you should treat open data if you want to be an ethical scraper.</p><p>So, let us know: Did we miss something? What&#8217;s your approach to handling open datasets in your scraping projects? Let&#8217;s discuss in the comments.</p>]]></content:encoded></item><item><title><![CDATA[Using Web Scraping in Finance to Discover Investment Insights]]></title><description><![CDATA[Tired of guessing? 
Use web scraping to make data-backed financial decisions!]]></description><link>https://substack.thewebscraping.club/p/web-scraping-in-finance</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/web-scraping-in-finance</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 17 May 2026 16:03:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8d7b98ff-dc95-41cf-bc83-5cfa5241ed96_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve ever invested, you know how challenging it can be (even if you don&#8217;t <em>YOLO</em> all your money into a single stock, lol). Thankfully, things get a lot easier when you build data-powered processes to guide your decision-making.</p><p>No wonder nearly half a trillion dollars are spent every year by financial firms on technology. Now, you probably don&#8217;t have that kind of money in the first place (and if you do, you don&#8217;t need to invest much anyway), but you might still want to collect financial data for personal use, research, academic projects, backtesting, or even just for selling it to industry giants.</p><p>No matter what you want to do with scraped financial data, there are a few pivotal tips to understand before embarking on this journey, which is exactly what I will explain here!</p><p>In this blog post, I will show why web scraping and finance are a match made in heaven and cover everything you need to know about retrieving both historical and real-time financial data from the web.</p><h2>Web Scraping + Finance: A Happy Marriage</h2><p>Before diving into web scraping for finance, let me explain why this is such a powerful approach and the advantages you can gain from it.</p><h3>Finance Runs on (Web) Data</h3><p>If there&#8217;s one thing that&#8217;s become clear over the past decade, it&#8217;s this: <a href="https://www.acceldata.io/blog/the-critical-role-of-data-in-finance">finance runs on 
data!</a></p><p>Financial institutions process massive volumes of market, customer, and transactional data every single day. In finance, data powers everything, from investment strategies to risk management. And the stakes are high, as <a href="https://www.gartner.com/smarterwithgartner/how-to-improve-your-data-quality">bad data alone costs organizations an average of $12.9 million per year</a>!</p><p>Data drives real-time decision-making, predictive modeling, and scenario planning. Finance teams feed that data into pipelines built around statistical analysis, machine learning, and <a href="https://substack.thewebscraping.club/p/detect-pattern-scraped-data-with-ai">AI to identify patterns</a>, forecast market movements, and manage uncertainty in increasingly complex environments.</p><p>Now, here&#8217;s the central question that we web scraping enthusiasts all care about: <em>where does most of that data actually come from?</em> A big portion of it comes from the web (not that surprising, huh?).</p><p>I&#8217;m talking about news sites, financial portals, company pages, official exchange websites, regulatory filings, institutional reports, and more. The web is essentially the largest and most dynamic data source available for financial purposes.</p><p>That&#8217;s exactly why web scraping in finance isn&#8217;t just useful. 
It&#8217;s foundational!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Benefits of a Data-Driven Approach in Finance</h2><p>Keep in mind that it&#8217;s not just big corporations or financial firms that benefit from data. Even individual retail investors can leverage financial data scraping to gain an edge. In particular, the main advantages include:</p><ul><li><p><strong>Informed decisions</strong>: Access to accurate historical data supports smarter investment decisions, while real-time data enables sounder trading choices.</p></li><li><p><strong>Market trend insights</strong>: Spot patterns and emerging trends before the wider market does.</p></li><li><p><strong>Risk management</strong>: Identify potential risks early and adjust strategies proactively.</p></li><li><p><strong>Portfolio optimization</strong>: <a href="https://substack.thewebscraping.club/p/llm-fine-tuning-for-scraping">Fine-tune asset allocation</a> based on backtesting and up-to-date market and company data.</p></li><li><p><strong>Efficiency and speed</strong>: Automate data collection, reducing time spent on manual research.</p></li></ul><p>I mean, financial firms wouldn&#8217;t be <a href="https://www.forrester.com/blogs/us-financial-services-tech-spending-hits-495-billion/">spending around $495 billion a year</a> (yeah, you read that right!) 
on technology (mostly built around collecting, processing, and leveraging data) if it didn&#8217;t give them a real edge!</p><h3>Getting vs Selling Financial Web Data: High-Level Overview</h3><p>There&#8217;s no doubt that financial firms invest billions into data. But what about you, as a web scraping expert, <em>how can you leverage financial data for potential gain?</em> There are two high-level approaches:</p><ol><li><p><strong>For yourself or your company</strong>: Build custom web scraping pipelines to gather data from multiple sources. Use it to <a href="https://substack.thewebscraping.club/p/the-evolving-career-scrapers">feed investment models, AI agents</a>, trading algorithms, or analytics pipelines. This is usually highly tailored to your strategies, risk appetite, or operational goals.</p></li><li><p><strong>To sell to financial services</strong>: Collect, aggregate, and potentially enrich data from various sources to sell. You can offer broad datasets for many clients or fully customized solutions for a specific customer&#8217;s needs.</p></li></ol><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo 
Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>How to Approach Financial Data Scraping: Historical vs Real-Time</h2><p>When it comes to finance, the web is packed with countless data fields and categories (e.g., news, stock prices, filings, and analyst reports). It&#8217;s a huge industry, and almost anything can be scraped!</p><p>At a high level, though, the key distinction for web scraping is simple: the financial data you want to collect is either historical or real-time. That&#8217;s what actually makes a difference in the approach to data scraping.</p><p>In the following sections, I&#8217;ll dive deeper into each of the two categories of financial data. I&#8217;ll cover which fields are most interesting to scrape, where to find them, and how to collect them efficiently and effectively.</p><p>For now, let&#8217;s start with a brief introduction to historical and real-time financial web data scraping!</p><h3>Historical Financial Web Data</h3><p>This includes all past financial data collected from the web, from historical stock prices to inflation rates and archived news. 
It&#8217;s used for analysis supporting long-term investment decisions.</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Enables backtesting of investment and trading strategies.</p></li><li><p>Easier to scrape, as it isn&#8217;t time-sensitive.</p></li><li><p>Data itself is stable and doesn&#8217;t change over time&#8230;</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>&#8230;but the web pages displaying it (e.g., in tables and static charts) can still change, breaking your static parsing logic.</p></li><li><p>Misses recent market shifts or breaking events.</p></li><li><p>Data completeness varies across websites, often requiring aggregation from multiple sources.</p></li></ul><h3>Real-Time Financial Web Data</h3><p>This includes live financial data extracted from the web, such as stock prices, market news, order books, etc. It&#8217;s employed for trading and short-term investment decisions.</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Enables fast, data-driven trading decisions.</p></li><li><p>Captures live market movements and breaking news.</p></li><li><p>Can be passed to AI agents and pipelines directly, as it tends to require minimal preprocessing.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>Harder to scrape reliably due to latency constraints and <a href="https://substack.thewebscraping.club/p/rate-limit-scraping-exponential-backoff">rate limits</a>.</p></li><li><p>Requires robust infrastructure for real-time ingestion and analysis, as every second counts.</p></li><li><p>Data storage can grow rapidly because new data arrives continuously.</p></li></ul><h3>Mastering Historical Financial Data Scraping</h3><p>As promised, let me guide you through the world of scraping historical financial data from the web.</p><h3>Main Types of Historical Financial Web Data</h3><p>The most important types of historical financial data you can retrieve from websites are:</p><ul><li><p><strong>Historical stock and commodity 
prices</strong>: Open, high, low, close (OHLC) prices and trading volumes for stocks, ETFs, indices, and commodities, used for <a href="https://substack.thewebscraping.club/p/predictive-analytics-web-scraped-data">time-series analysis, modeling, and predictions</a>.</p></li><li><p><strong>Summary info and infographics</strong>: Stock profiles, key metrics, and past indicators (e.g., P/E, EPS, moving averages), presented in dashboards or visual charts for quick insights.</p></li><li><p><strong>Macroeconomic indicators</strong>: Inflation, GDP, interest rates, unemployment, CPI, and PCE data, essential for understanding economic cycles and long-term market behavior.</p></li><li><p><strong>Financial statements</strong>: Company filings (income statements, balance sheets, cash flow), utilized for fundamental analysis and valuation models.</p></li><li><p><strong>News data</strong>: Archived headlines and press releases analyzed via NLP to correlate past market movements with specific events and sentiment shifts.</p></li><li><p><strong>ESG scores and sustainability reports</strong>: Historical environmental, social, and governance metrics employed to assess how &#8220;green&#8221; or ethical a company has been over time.</p></li><li><p><strong>Alternative data</strong>: Non-traditional datasets like web traffic, social media, satellite imagery (e.g., new headquarters or production plants), or credit card data for early performance signals.</p></li></ul><h3>Most Popular Targets</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!grNv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!grNv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 424w, https://substackcdn.com/image/fetch/$s_!grNv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 848w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!grNv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png" width="1456" height="1184" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:228005,&quot;alt&quot;:&quot;Popular historical financial data scraping 
sources&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192509947?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Popular historical financial data scraping sources" title="Popular historical financial data scraping sources" srcset="https://substackcdn.com/image/fetch/$s_!grNv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 424w, https://substackcdn.com/image/fetch/$s_!grNv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 848w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Popular historical financial data scraping sources</figcaption></figure></div><p>Also, if you&#8217;re interested in how to scrape historical data from the Wayback Machine, <a href="https://substack.thewebscraping.club/p/scraping-wayback-machine">read my previous guide for this newsletter!</a></p><h3>Scraping Techniques</h3><p>Typical examples of historical financial data include lists of open, high, low, and close prices for a given stock:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S5x0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!S5x0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 424w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 848w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1272w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S5x0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png" width="1456" height="750" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;NVDA historical stock data (Source: Yahoo Finance)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="NVDA historical stock data (Source: Yahoo Finance)" title="NVDA historical stock data (Source: Yahoo Finance)" 
srcset="https://substackcdn.com/image/fetch/$s_!S5x0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 424w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 848w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1272w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">NVDA historical stock data (Source: Yahoo Finance)</figcaption></figure></div><p>Or, as another example, the historical returns of a specific index (e.g., the S&amp;P 500) over time:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XcNT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XcNT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 424w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 848w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1272w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XcNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png" width="1456" height="979" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:979,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)" title="100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)" srcset="https://substackcdn.com/image/fetch/$s_!XcNT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 424w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 848w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XcNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)</figcaption></figure></div><p>These cases fall into the category of table-based data scraping, one of the most common web scraping scenarios. You&#8217;re probably already familiar with it, so there&#8217;s no need to go too deep here. 
Scraping older news and media can be slightly more challenging due to the unstructured nature of the target data, but it&#8217;s still a simple task.</p><p>At a high level, the process for getting historical financial data via web scraping follows a standard workflow:</p><ol><li><p>Visit the target web page, either via an HTTP client or a browser automation tool.</p></li><li><p>Parse the page using an HTML parser, either directly or after rendering in a controlled browser.</p></li><li><p>Select the HTML elements of interest and extract the data.</p></li><li><p>Store the scraped data in your desired format (e.g., XLS, CSV, JSON) or in a database.</p></li></ol><p>The main challenges involve generic anti-scraping mechanisms, such as CAPTCHAs, WAFs, and IP bans, as well as <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">browser</a>, TLS, and <a href="https://substack.thewebscraping.club/p/what-is-device-fingerprint">device fingerprinting</a>.</p><h3>Best Practices</h3><p>Based on my experience with financial web scraping, especially when focusing on historical data, these are the tips you should apply:</p><ul><li><p><strong>Normalize and validate data</strong>: Standardize formats (dates, currencies, units) and validate across sources to catch inconsistencies early.</p></li><li><p><strong>Be cautious with AI parsing</strong>: Avoid <a href="https://substack.thewebscraping.club/p/llms-ai-web-scraping">using AI to automatically parse structured data</a> (tables, metrics, structured fields). It can introduce subtle errors and hallucinations, so prefer deterministic parsing. Reserve AI mainly for extracting information from unstructured text, such as news.</p></li><li><p><strong>Store raw HTML snapshots</strong>: Always keep the original page HTML. 
It lets you <a href="https://substack.thewebscraping.club/p/offline-web-scraping">re-parse data later and extract new signals without re-scraping</a>.</p></li><li><p><strong>Avoid single-source bias</strong>: When scraping news or market analysis pieces, pull data from multiple sources to reduce bias and improve reliability.</p></li><li><p><strong>Handle pagination properly</strong>: Many sites split historical data across pages or date ranges. Make sure your scraper traverses all of them.</p></li><li><p><strong>Respect rate limits and retry failed requests</strong>: Even for historical data, implement retries and throttling to avoid blocks and incomplete datasets.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Understanding Real-Time Financial Data Scraping</h2><p>This is where things get a bit more interesting. 
Let me introduce you to real-time financial scraping!</p><h3>Main Types of Real-Time Financial Web Data</h3><p>The most relevant types of real-time financial web data are:</p><ul><li><p><strong>Live price tickers</strong>: Continuously updated &#8220;last trade&#8221; prices and bid/ask spreads for stocks, crypto, and forex, used to detect breakouts and short-term trading opportunities.</p></li><li><p><strong>Order book and market depth</strong>: Incoming buy/sell orders, liquidity levels, and spreads, fundamental for execution strategies and high-frequency trading.</p></li><li><p><strong>Breaking news</strong>: Immediate updates and announcements that trigger sentiment models as soon as key figures (CEOs, central banks, governments) release information.</p></li><li><p><strong>Corporate event triggers</strong>: Monitoring press releases or SEC feeds for earnings surprises, M&amp;A rumors, or sudden executive changes.</p></li><li><p><strong>Social media signals</strong>: <a href="https://substack.thewebscraping.club/p/how-to-scrape-reddit-with-scrapy">Tracking ticker mentions on platforms like Reddit</a> or X to detect retail-driven momentum, hype cycles, or panic selling in near real time.</p></li><li><p><strong>Institutional &#8220;whale&#8221; activity</strong>: Observing large trades or major wallet movements (especially in crypto) to identify where significant capital is flowing.</p></li><li><p><strong>Alternative digital signals</strong>: Web traffic spikes, app store ranking changes, or &#8220;out of stock&#8221; alerts on retail sites as proxies for real-world demand.</p></li></ul><p>As you can tell, this category is more varied than historical financial data, including social media tracking and other less conventional practices. 
Thus, the sources to monitor for live financial web scraping can be less standardized and intuitive.</p><h3>Most Popular Targets</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UIN0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UIN0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 424w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 848w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UIN0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png" width="1456" height="1487" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/adedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1487,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:289145,&quot;alt&quot;:&quot;Popular live financial data scraping sources&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192509947?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Popular live financial data scraping sources" title="Popular live financial data scraping sources" srcset="https://substackcdn.com/image/fetch/$s_!UIN0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 424w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 848w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Popular live financial data scraping sources</figcaption></figure></div><h3>Scraping Techniques</h3><p>Imagine applying a traditional scraping pattern to real-time financial data. You send a request to a target site, extract a stock price, and repeat the operation every few seconds or even milliseconds.</p><p>The problem is latency. By the time the server responds, the page is rendered or parsed, the target data field is collected, and stored or sent to your pipeline, that piece of data is already outdated.</p><p>On top of that, this approach requires a crazy number of requests in a very short time. 
That increases the risk of triggering rate limiting or even IP bans. You might think proxies solve that through IP rotation, but most proxy networks introduce additional latency, often anywhere from 2 to 5 seconds per request. In real-time scenarios, that delay is simply not acceptable!</p><p>Even if you <a href="https://substack.thewebscraping.club/p/choosing-proxy-provider-scraping">switch to faster or dedicated proxies</a>, you may end up with a smaller IP pool, which increases the likelihood of those IPs getting blocked.</p><p>A more advanced idea is to rely on browser automation and keep a page open, capturing updates as they happen. This is smarter, but still problematic. Long-lived sessions with little or no user interaction are highly suspicious and can easily trigger anti-bot systems. Plus, browser automation at scale tends to be flaky and isn&#8217;t really reliable for persistent connections.</p><p>Long story short, scraping real-time financial data this way quickly turns into a losing game.</p><p>The solution? Stop targeting the data presentation layer in HTML and instead go directly to the data source!</p><h4>API/WebSocket Scraping as the Solution</h4><p>Web pages showing real-time financial data aren&#8217;t doing anything magical. Behind the scenes, they either poll APIs at regular intervals or (more commonly) <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSocket">maintain a persistent connection via WebSockets</a> to receive continuous updates. 
The page simply renders that incoming data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q9lT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q9lT!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 424w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 848w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1272w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif" width="1080" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the live price 
update&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the live price update" title="Note the live price update" srcset="https://substackcdn.com/image/fetch/$s_!Q9lT!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 424w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 848w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1272w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 
15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the live price update</figcaption></figure></div><p>As a result, a much better approach is to intercept and replicate those data flows. You can do this through <a href="https://substack.thewebscraping.club/p/apis-in-web-scraping">AJAX/API request inspection</a> or WebSocket sniffing. Open the browser developer tools, go to the &#8220;Network&#8221; tab, and check where the data is coming from.</p><p>If it&#8217;s an API call, you&#8217;ll see it under the &#8220;Fetch/XHR&#8221; tab:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d22T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d22T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 424w, 
https://substackcdn.com/image/fetch/$s_!d22T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 848w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1272w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d22T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png" width="1456" height="1180" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1180,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the API used by Yahoo Finance to determine whether the market is open in real time&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the API used by Yahoo Finance to determine whether the market is open in real time" title="Note the API used by Yahoo Finance to determine whether the market is open in real time" 
srcset="https://substackcdn.com/image/fetch/$s_!d22T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 424w, https://substackcdn.com/image/fetch/$s_!d22T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 848w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1272w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the API used by Yahoo Finance to determine whether the market is open in real time</figcaption></figure></div><p>If it&#8217;s a WebSocket, you&#8217;ll find it under the &#8220;Socket&#8221; tab:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!quUU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!quUU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 424w, https://substackcdn.com/image/fetch/$s_!quUU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 848w, https://substackcdn.com/image/fetch/$s_!quUU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1272w, https://substackcdn.com/image/fetch/$s_!quUU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!quUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png" width="1456" height="778" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/171d48c6-665a-4753-86bb-c30793609101_3059x1634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:778,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page" title="Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page" srcset="https://substackcdn.com/image/fetch/$s_!quUU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 424w, https://substackcdn.com/image/fetch/$s_!quUU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 848w, https://substackcdn.com/image/fetch/$s_!quUU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1272w, 
https://substackcdn.com/image/fetch/$s_!quUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page</figcaption></figure></div><p>Once identified, replicate those API calls or connect directly to the WebSocket in your scraping script. 
This gives you access to near real-time financial data in a structured format (typically JSON) without the overhead of parsing HTML.</p><p>Of course, that&#8217;s not trivial. <a href="https://substack.thewebscraping.club/p/websocket-bot-detection-scraping">WebSockets require proper anti-bot bypass</a>, and APIs may still enforce rate limits, tracking, and TLS fingerprinting protections. However, this approach is generally faster, more reliable, and much easier to maintain than scraping rendered pages!</p><h4>And What About Live News or Social Media Scraping?</h4><p>When it comes to news, if available, it makes sense to connect to public RSS feeds exposed by websites to monitor updates. This allows you to trigger scraping only when new and relevant content is published, instead of constantly polling pages unnecessarily.</p><p>Otherwise, you can build a polling mechanism that periodically checks news sites, social media platforms, and similar sources to capture fresh data. In these cases, you usually can&#8217;t rely on techniques like API or WebSocket scraping, as that&#8217;s not how those platforms fetch data.</p><p>Instead, you need a solid and robust infrastructure built around speed and efficiency: fast connections, high-quality proxies, optimized parsing, and lightweight requests. 
The goal is to minimize latency while maintaining reliability at scale.</p><h3>Best Practices</h3><p>Scraping real-time financial data is a demanding art, but it becomes easier with the following best practices:</p><ul><li><p><strong>Prefer APIs and WebSockets over HTML parsing</strong>: Whenever possible, save time by extracting data directly from the underlying APIs or WebSocket streams utilized by web pages instead of scraping data from rendered pages.</p></li><li><p><strong>Choose clean, structured sources</strong>: Prioritize endpoints that return well-formatted JSON to minimize preprocessing and reduce latency.</p></li><li><p><strong>Stream data into pipelines immediately</strong>: Send incoming data directly to processing pipelines for real-time insights, while storing it in parallel for later analysis.</p></li><li><p><strong>Use specialized AI for sentiment analysis</strong>: Prefer AI/ML models tuned for finance/social media, as Reddit and X content often include slang, memes, and non-standard language.</p></li><li><p><strong>Optimize browser automation</strong>: Configure Playwright, Selenium, or similar browser automation tools to block images, stylesheets, and fonts. 
This reduces bandwidth usage and significantly speeds up rendering time.</p></li><li><p><strong>Design for low latency</strong>: Optimize your stack (async requests, streaming ingestion, fast JSON parsers) to minimize delays, as even milliseconds matter.</p></li><li><p><strong>Prefer high-quality premium proxies</strong>: Count on <a href="https://substack.thewebscraping.club/p/how-many-ip-needed-scraping">proxy providers with a proven track record of fast, stable connections</a> to minimize latency and avoid disruptions.</p></li><li><p><strong>Time-synchronize everything</strong>: Append timestamps to all scraped data to enable time-series analysis and accurately reconstruct events.</p></li><li><p><strong>Build fault-tolerant systems:</strong> Expect disconnections (especially with WebSockets) and issues, so add reconnection logic and configure fallback data sources.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Top 5 Open-Source Financial Web Scraping Libraries</h3><p>Below is a selected set of interesting, fully open-source libraries, packages, and projects for simplified financial web scraping:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XR35!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!XR35!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 424w, https://substackcdn.com/image/fetch/$s_!XR35!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 848w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1272w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XR35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png" width="1456" height="1136" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1136,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:267258,&quot;alt&quot;:&quot;Top open-source financial web scraping 
libraries&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192509947?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Top open-source financial web scraping libraries" title="Top open-source financial web scraping libraries" srcset="https://substackcdn.com/image/fetch/$s_!XR35!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 424w, https://substackcdn.com/image/fetch/$s_!XR35!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 848w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1272w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Top open-source financial web scraping libraries</figcaption></figure></div>
<div><hr></div><h2>Conclusion</h2><p>Here, I went down the rabbit hole of financial web scraping: the task of collecting finance-related data from the Internet. This is one of the main use cases of corporate web scraping, powering enterprise data pipelines for decision-making and market analysis.</p><p>As you&#8217;ve seen, the main difference in approach comes down to whether you&#8217;re targeting historical or real-time data. The first follows standard web scraping practices you&#8217;re likely already familiar with. The second is trickier and requires more advanced techniques.</p><p>I hope you found this helpful and insightful. If you have questions, feel free to share them in the comments below!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #104: Bypassing AWS WAF on IMDB with Scrapling ]]></title><description><![CDATA[A hands-on test of tools for TLS spoofing and Scrapling]]></description><link>https://substack.thewebscraping.club/p/bypassing-aws-waf-with-scrapling</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/bypassing-aws-waf-with-scrapling</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 14 May 2026 22:23:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e0fccfe7-d622-4fe8-a6d2-d99c1a73a9d9_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AWS WAF is the protection we run into most often on Amazon&#8217;s public properties.
It also sits in front of a long tail of third-party sites whose operators built on AWS and clicked the WAF checkbox. We wrote about it two years ago in <a href="https://substack.thewebscraping.club/p/bypassing-aws-waf-scraping">The Lab #53: Bypassing AWS WAF</a>, but the target back then, Traveloka, used DataDome on top of AWS WAF, so our analysis had to account for both systems at once.</p><p>This time, we wanted AWS WAF on its own, protecting a target with no other anti-bot layer, and we wanted to see what changes when the 2024 Scrapy-Playwright stack is replaced with the 2026 toolbox.</p><p>The target we picked is <a href="https://www.imdb.com">imdb.com</a>. It is an Amazon subsidiary, runs a standard AWS WAF deployment, and Wappalyzer detects no other anti-bot solution on the website. That makes IMDB a good test case for this article.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote>
<p>Today we&#8217;ll test three Python HTTP clients with strong TLS fingerprint impersonation: <code>curl_cffi</code>, the newer <code>httpx-curl-cffi</code>, and Rust-backed <code>rnet</code>. Each one is designed to produce a TLS handshake that looks like real Chrome&#8217;s. Is that enough to scrape an AWS WAF target without spinning up a browser? And if not, what is the smallest browser step that gets us past the gate so the rest of the work can run on a cheap HTTP client?</p><h2>The tools we used</h2><p>Four libraries are in scope. Three are HTTP-only, one runs a real browser.</p><p><strong><a href="https://github.com/yifeikong/curl_cffi">curl_cffi</a></strong> is a Python binding for <code>curl-impersonate</code>, a patched build of curl. It exposes a requests-like API and ships impersonation profiles for recent Chrome, Firefox, and Safari builds. Because it works at the TLS layer, its JA3 and JA4 fingerprints match the impersonated browser, along with HTTP/2 settings and header order.
We tested with <code>chrome142</code>, the latest Chrome profile in version 0.14.0.</p><p><strong><a href="https://github.com/vgavro/httpx-curl-cffi">httpx-curl-cffi</a></strong> is a transport for <code>httpx</code> that delegates the actual HTTP work to <code>curl_cffi</code>. It does not add new fingerprinting capability, but it lets you keep the <code>httpx</code> programming model: sync <code>Client</code>, async <code>AsyncClient</code>, event hooks, and the same response object you get from the rest of an <code>httpx</code>-based codebase. We tested with the Chrome profile and <code>default_headers=True</code>.</p><p><strong><a href="https://github.com/0x676e67/rnet">rnet</a></strong> is a Rust HTTP client with Python bindings. It implements its own impersonation stack rather than wrapping <code>curl-impersonate</code>. The <code>rnet.Impersonate</code> enum exposes a wide range of Chrome, Firefox, Safari, Edge, Opera, and OkHttp profiles. We tested with <code>Chrome137</code>.</p><p><strong><a href="https://github.com/D4Vinci/Scrapling">Scrapling</a></strong> is the only browser-driven tool in the set. Our <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide">Scrapling: A Complete Hands-On Guide</a> goes through the library in depth, with Cloudflare as the test target. Its <code>StealthyFetcher</code> drives a stealth-patched Chromium that runs JavaScript and applies fingerprint countermeasures. The library README only advertises Cloudflare Turnstile, but the same machinery handles AWS WAF&#8217;s challenge too.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h2>How AWS WAF protects IMDB</h2><p>A quick intro to the system helps interpret the results that follow. AWS WAF is not a dedicated anti-bot platform like DataDome or Kasada. It is a general-purpose web application firewall with a bot-control module that operators can enable per rule. When the bot-control rule is in challenge mode, AWS WAF inserts a single JavaScript gate at the start of a session.</p><p>A request without a valid token cookie gets back <code>HTTP 202</code> with <code>x-amzn-waf-action: challenge</code> and a short HTML body. The body defines <code>window.gokuProps</code> with three base64 blobs (<code>key</code>, <code>iv</code>, <code>context</code>), includes a <code>&lt;script src&gt;</code> pointing to a customer-specific URL on <code>*.token.awswaf.com</code>, and carries a small inline script that calls <code>AwsWafIntegration.saveReferrer()</code>, <code>AwsWafIntegration.checkForceRefresh()</code>, and <code>AwsWafIntegration.getToken()</code>. The remote <code>challenge.js</code> tests the browser environment and posts a validation payload back to AWS. On success, the response sets the token via <code>Set-Cookie: aws-waf-token=...</code>. The inline script then reloads the page, and the second request, now carrying the token, gets the real content.</p><p>This works very differently from systems that score every request. Once the token is in our cookie jar, AWS WAF lets us through with no further behavioral checks beyond IP reputation and rate limits. 
<br>What we want to find out in this article is whether we can bypass AWS WAF with &#8220;convincing&#8221; requests, meaning a proper TLS fingerprint and set of headers, or whether we need a JS rendering engine.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Test setup</h2><p>As always, the code can be found in <a href="https://github.com/TheWebScrapingClub/thelab">our GitHub repository reserved for paying users</a>, inside the folder <strong><a href="https://github.com/TheWebScrapingClub/thelab">104.IMDB</a></strong>. If you&#8217;re not able to access the repository, <strong><a href="https://twsc-private-form.lovable.app/">please use this form to request access</a></strong>.</p><p>The library versions we pinned at the time of writing are <code>curl_cffi==0.14.0</code>, <code>httpx==0.28.1</code>, <code>httpx-curl-cffi==0.1.5</code>, <code>rnet==2.4.2</code>, <code>scrapling==0.4.7</code>. The Python version is 3.11.</p><p>Each HTTP test issues a <code>GET</code> against two URLs: the IMDB home page <code>https://www.imdb.com/</code> and a title page <code>https://www.imdb.com/title/tt0111161/</code>. We use two URLs to confirm the challenge fires the same way on both, not only on one entry point. We do not follow redirects (<code>follow_redirects=False</code>) because the AWS WAF response is a 202 with content rather than a redirect, and we want to see it raw. 
</p><p>We capture the status code, the HTTP version, the full response headers, any cookies, the body length, and the first 600 characters of the body, and we save everything to JSON under <code>aws_waf_imdb/responses/</code> for later inspection.</p><p>The baseline probe in <a href="../code/aws_waf_imdb/probe_plain.py">probe_plain.py</a> uses an unmodified <code>httpx.Client(http2=True)</code> with a generic Chrome User-Agent header and the standard <code>Accept</code> headers. This is the control: no TLS impersonation, no fingerprint trickery, just a normal Python HTTP client.</p>
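<p>To make the baseline concrete, here is a minimal sketch of what such a control probe could look like. It is not the <code>probe_plain.py</code> from the repository: the helper names (<code>is_waf_challenge</code>, <code>build_record</code>, <code>probe</code>), the header values, and the record keys are assumptions made for illustration. It only relies on what is described above: the two test URLs, <code>httpx.Client(http2=True)</code> with <code>follow_redirects=False</code>, and a challenge recognizable by <code>HTTP 202</code> plus the <code>x-amzn-waf-action: challenge</code> header or the <code>window.gokuProps</code> marker in the body.</p>

```python
from typing import Mapping

# The two URLs named in the test setup.
URLS = ["https://www.imdb.com/", "https://www.imdb.com/title/tt0111161/"]

# Illustrative header values (assumption): not the ones used in the repository.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def is_waf_challenge(status_code: int, headers: Mapping[str, str], body: str) -> bool:
    """Heuristic: HTTP 202 plus the challenge header or the gokuProps marker."""
    lowered = {key.lower(): value for key, value in headers.items()}
    action = lowered.get("x-amzn-waf-action", "").lower()
    return status_code == 202 and (action == "challenge" or "window.gokuProps" in body)


def build_record(url, status_code, http_version, headers, cookies, body) -> dict:
    """Collect the fields listed in the test setup into a JSON-friendly dict."""
    return {
        "url": url,
        "status_code": status_code,
        "http_version": http_version,
        "headers": dict(headers),
        "cookies": dict(cookies),
        "body_length": len(body),  # length in characters of the decoded body
        "body_head": body[:600],   # first 600 characters, as in the article
        "waf_challenge": is_waf_challenge(status_code, headers, body),
    }


def probe(url: str) -> dict:
    """Send one plain GET with no TLS impersonation (performs a real request)."""
    import httpx  # third-party; HTTP/2 needs the optional extra: httpx[http2]

    with httpx.Client(http2=True, follow_redirects=False, headers=HEADERS) as client:
        response = client.get(url)
        return build_record(
            url,
            response.status_code,
            response.http_version,
            response.headers,
            response.cookies,
            response.text,
        )
```

<p>Calling <code>probe(URLS[0])</code> performs a real request, and the returned dictionary can be written to disk with <code>json.dump</code>. Keeping the challenge check in a pure function means the same classification can be reused on responses from any of the other clients, as long as you pass in the status code, headers, and body.</p>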
      <p>
          <a href="https://substack.thewebscraping.club/p/bypassing-aws-waf-with-scrapling">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Use LLMs to Enhance Data Extraction From Unstructured Text]]></title><description><![CDATA[How combining LLMs with schema validation solves the extraction problem that NLP never could]]></description><link>https://substack.thewebscraping.club/p/how-to-use-llms-data-extraction-unstructured-text</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-use-llms-data-extraction-unstructured-text</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 10 May 2026 19:06:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9c72c499-8522-4fcd-b662-e37cf857c78a_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#127465;&#127466; Before starting this article, let me remind you that on Friday the 15th, there will be the first TWSC meetup in Munich. For more details and to confirm your attendance, go to <a href="https://www.meetup.com/the-web-scraping-club/events/314567280/">the event page</a>  &#127465;&#127466; </em></p><div><hr></div><p>The web contains an extraordinary volume of information, the majority of which is in textual form. Blogs, forums, and newsletters alone generate millions of words of domain-specific knowledge every week. And they&#8217;re not the only sources of text on the web.</p><p>When you want to get insights from that kind of data, successfully extracting it from the web is only half of the battle, even now that <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">LLMs can use vision to scrape complex visual layouts</a>. The second part of the challenge is structuring this data to get it ready for analytics. Why? Because when you point a scraper at a news article, you get back a wall of text. But you cannot query it. You cannot aggregate it. 
You cannot feed it reliably into a machine learning pipeline or a database without significant preprocessing.</p><p>This article addresses the preprocessing problem of unstructured text when you scrape it from the web. It traces the evolution of solutions from classical NLP to large language models, identifies where each approach breaks down, and proposes a practical architectural solution.</p><p>Let&#8217;s get into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What &#8220;Unstructured&#8221; Really Means in Practice</h2><p>Unstructured text refers to content that carries no machine-readable schema. The information exists in the data you retrieved from the web, but it has no field boundaries, no consistent labels, and no guaranteed position for any given fact.</p><p>The following diagram illustrates the difference between unstructured text and structured (machine-readable) data:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FNRZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FNRZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 424w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 848w, 
https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1272w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png" width="1037" height="456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:1037,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58633,&quot;alt&quot;:&quot;The difference between unstructured and machine-readable text by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The difference between unstructured and machine-readable text by Federico Trotta" title="The difference between unstructured and machine-readable text by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!FNRZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 424w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 848w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1272w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The difference between unstructured and machine-readable text</figcaption></figure></div><p>Let&#8217;s consider three concrete scraping targets to illustrate what this costs you in practice.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3>News Articles: When Signals Are Buried in Noise</h3><p>Suppose you scraped a Reuters article about an ECB rate decision. The text you get back from the scraper could look like this:</p><pre><code><code>European Central Bank decides on rates.
Listen to this article. 2 min audio. 
You might also like: Eurozone inflation hits 3-year low. 
Christine Lagarde announced Thursday a 25 basis point reduction, bringing the main 
refinancing rate to 3.40%. 
SPONSORED: Track macro events with Bloomberg Terminal. 
The decision was widely anticipated after last month's CPI print. Share this article. 
4 comments. John M. writes: this was priced in already</code></code></pre><p>Your raw text contains the article body, a teaser for a related story, a sponsored insertion, and reader comments. The fact you want is buried in there: the ECB cut its main refinancing rate to 3.40% on a specific date. But your extractor gets the full content.</p><p>A wall of text like this, which in practice is usually far bigger, is useless for analytics purposes without preprocessing.</p><h3><strong>Financial Newsletters: When &#8220;Just Under Two Percent&#8221; Breaks Your Aggregation</strong></h3><p>Suppose you scrape a financial newsletter to extract an updated macroeconomic forecast. You need to capture a specific fact, something like &#8220;Goldman Sachs revised its 2026 US GDP growth forecast down to 1.8%&#8221;. Your scraper captures the entire page, much as it would for an article. As in the previous example, the resulting raw text mixes the core facts with boilerplate and unrelated news:</p><pre><code><code>Market Daily Newsletter. November 12.
Jan Hatzius (Goldman Sachs) and his team were out with a note early Tuesday.
SPONSORED: Get 50% off your trading fees today. 
They see tariffs shaving roughly 0.7 points off the baseline. Meanwhile, 
European markets rallied on ECB news. 
Read our full coverage of the Eurozone here.
The revised number now sits just under two percent for the full year.
Subscribe for premium insights.</code></code></pre><p>The text distributes the target fact across the entire document. Also, the wording &#8220;just under two percent&#8221; is approximate: recognizing that it refers to the number you were searching for, 1.8%, requires numerical understanding.</p><p>Now, imagine generalizing this: you scrape hundreds of financial news articles and newsletters and try to aggregate the numbers they report. Getting insights would be nearly impossible. Why? Because some sources will state the exact figure you want (growth forecast down to 1.8%), while others will use different phrasing to describe the trend (&#8220;an expected growth under 2 percent&#8221;, &#8220;a slightly shrinking trend&#8221;, etc.).</p><p>Without a way to impose a structure on such data, you can&#8217;t get any insights from it.</p><h3>Job Posting Offers: They Are Always Messier Than They Look</h3><p>Suppose you want to scrape job offers to get an idea of what the market pays on average for a specific position, given the same expected technical skills and the same day-to-day activities. Job offers can contain the following ambiguities:</p><ul><li><p>A sentence might read &#8220;3+ years of experience with Python&#8221;. This establishes a floor but no ceiling. Alternatively, the text might read &#8220;Senior-level candidates only&#8221;. This uses qualitative seniority as a proxy for an exact number of years.</p></li><li><p>Salary breaks in a different direction. One posting can say <em>&#8220;$120,000 - $145,000 base&#8221;</em>. Another can say <em>&#8220;competitive compensation commensurate with experience&#8221;</em>. A third could say <em>&#8220;&#8364;100,000&#8221;</em>, which you need to convert to dollars to make an actual comparison.</p></li><li><p>Employment type can introduce further ambiguity and difficulties. 
<em>&#8220;Full-time&#8221;</em>, <em>&#8220;FTE&#8221;</em>, <em>&#8220;permanent&#8221;</em>, and <em>&#8220;direct hire&#8221;</em> basically mean the same thing but are written differently. Also, the text might specify the role is &#8220;Hybrid&#8221;, which means multiple different things across companies. It could mean two days in the office. It could mean occasional travel with headquarters-optional rules.</p></li></ul><div><hr></div><blockquote><p>When sites get tough, skip the heavy lifting. Get clean, structured CSV datasets,  ready for Excel, BI or your apps</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KpSw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KpSw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 424w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 848w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1272w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png" width="592" height="149.84467881112175" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:1043,&quot;resizeWidth&quot;:592,&quot;bytes&quot;:81723,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KpSw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 424w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 848w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databoutique.com/buy-data-list&quot;,&quot;text&quot;:&quot;Find your dataset&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databoutique.com/buy-data-list"><span>Find your dataset</span></a></p></blockquote><div><hr></div><h2>How Classical NLP Tried to Solve This (and Where It Stopped)</h2><p>Before large language models became available, the standard answer to this problem was Natural Language Processing. The classical NLP toolkit gave developers a set of tools that could, with enough effort, extract meaningful structure from text using different, but often interconnected, processes like the following:</p><ul><li><p><strong>Named Entity Recognition (NER)</strong>: NER is a process used in <a href="https://substack.thewebscraping.club/p/using-nlp-scraped-data">NLP to extract entities from text corpora</a>. Specifically, it labels spans of text as persons, organizations, locations, or dates. An NLP model trained on news corpora, for example, is able to scan an article and tag &#8220;Jane Doe&#8221; as a person and &#8220;Washington D.C.&#8221; as a geopolitical entity.</p></li><li><p><strong>Part-of-speech tagging</strong>: Labels each word with its grammatical role, such as noun, verb, or adjective. 
This enables the downstream logic to focus on the right parts of a sentence.</p></li><li><p><strong>Dependency parsing:</strong> Maps grammatical relationships between words, helping to extract which subject performed which action on which object.</p></li><li><p><strong>Relation extraction:</strong> Identifies when two co-occurring entities have a specific relationship. For example, a person who was affiliated with an organization, or an event that occurred in a specific location.</p></li></ul><p>Libraries like <a href="https://spacy.io/">spaCy</a>, <a href="https://nlp.stanford.edu/">Stanford NLP</a>, and <a href="https://www.nltk.org/">NLTK</a> made these processes widely accessible. But they work well only for well-defined, narrow tasks on consistent text domains. The problems and limitations of this solution appear quickly at the edges:</p><ul><li><p><strong>Domain shift breaks everything:</strong> An NER model trained on news articles often performs poorly on scientific abstracts. A model tuned for English financial text fails on multilingual content. In other words, every new domain requires retraining, re-labeling, and re-evaluation. These processes are very costly, both in terms of money and time.</p></li><li><p><strong>Context is invisible:</strong> Classical NLP models operate at the token and sentence level. They have no built-in mechanism for understanding that &#8220;Apple&#8221; in a technology article refers to a corporation, while &#8220;apple&#8221; in a nutrition blog refers to a fruit. Disambiguation requires hand-crafted rules or separate classification layers bolted on top (which, again, is costly).</p></li></ul><p>Before NLP, you could basically only use regex (with all the difficulties associated with manually cleaning data, standardizing it, and&#8230;using regex!). 
So, NLP was a genuinely big step forward: it made large-scale text analysis possible in ways that pure pattern matching never could (on a related note, you can also <a href="https://substack.thewebscraping.club/p/detect-pattern-scraped-data-with-ai">find patterns in scraped data using AI</a>). But it still required substantial domain expertise and constant maintenance, and it produced results that were narrow, fragile, and difficult to generalize.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The Modern Solution: LLMs as Universal Structure Extractors</h2><p>Large language models fundamentally changed the extraction problem. On the technology side, a classical NLP model learns statistical patterns for a narrow task on a specific kind of text. An LLM, instead, is trained on far broader data and can interpret language in context. This distinction matters enormously because it opens the door to the following:</p><ul><li><p><strong>Context disambiguation that works out of the box:</strong> Feed an LLM a paragraph from a technology article containing the word &#8220;Apple&#8221;, and it will typically identify it as a company. Feed it a paragraph from a recipe blog, and it will typically identify it as a fruit. No separate disambiguation layer is needed. The model resolves ambiguity much as a human reader does: by reading the surrounding context.</p></li><li><p><strong>Semantic equivalence that is understood, not computed:</strong> An LLM recognizes that &#8220;$40,&#8221; &#8220;forty dollars,&#8221; &#8220;40 USD,&#8221; and &#8220;forty bucks&#8221; all express the same value. 
You don&#8217;t need to instruct it to understand that.</p></li><li><p><strong>Implicit information that becomes accessible:</strong> A sentence like &#8220;the study, conducted over three months at a Boston hospital, found no significant effect&#8221; contains a location, a duration, and a finding. An LLM can extract all three without requiring the text to follow any particular structure.</p></li><li><p><strong>Domain generalization that requires no retraining:</strong> The same LLM that extracts entities from political news articles can extract findings from scientific abstracts, event mentions from cultural journalism, and source attributions from investigative reporting. You just need to change the prompt, not the model.</p></li></ul><p>The practical workflow becomes straightforward:</p><ul><li><p>You scrape unstructured text from the web.</p></li><li><p>You pass the content to an LLM with a prompt that describes what you want to extract.</p></li><li><p>The model returns a response.</p></li><li><p>You use that response downstream.</p></li></ul><p>This process works. But using LLMs alone introduces a different class of problems:</p><ul><li><p><strong>Output format is not guaranteed:</strong> Ask an LLM to return a price, and it might return <em>$40</em> in one run, <em>40 dollars</em> in another, and <em>40 USD</em> in a third. The model understands the value when it retrieves it from scraped content. But it does not guarantee how it expresses that value unless you explicitly constrain it.</p></li><li><p><strong>Required fields can go missing:</strong> If the article you extracted the content from does not mention a publication date, the model might omit the field, return <em>null</em>, return <em>&#8220;not mentioned&#8221;</em>, or, far worse, invent a plausible date. 
Each behavior is different, and none of them is predictable without enforcement.</p></li><li><p><strong>Hallucination is a real risk:</strong> When the model is uncertain, it can still generate a plausible-sounding answer instead of signaling doubt. For extraction tasks, that means it can invent entity names, fabricate statistics, or fill in missing information with confident-sounding fiction. Without validation, these errors pass into your data, creating issues at the analytics level.</p></li></ul><p>Taken together, these problems also limit scalability, because nothing guarantees consistency. A pipeline processing 10,000 articles requires every output to follow the same schema. But a model that returns slightly different structures across runs cannot feed a database reliably without significant error handling.</p><p>In other words, LLMs provide you with the understanding that NLP lacked. But they do not, on their own, provide the structural guarantees that production pipelines require.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>How to Get Semantic Power and Structural Guarantees at the Same Time: A Practical Approach</h3><p>One possible solution to the unpredictability of LLM outputs is to separate the two concerns that these models conflate: semantic understanding and structure enforcement.</p><p>To do so, you can:</p><ul><li><p>Use the LLM for what it does well: reading text, resolving ambiguity, extracting meaning, and normalizing inconsistent expressions.</p></li><li><p>Use dedicated libraries to define schemas, enforce types, validate outputs, and 
reject malformed data before it enters your pipeline.</p></li></ul><p>Below is how this solution works, at a high level:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zCax!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zCax!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 424w, https://substackcdn.com/image/fetch/$s_!zCax!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 848w, https://substackcdn.com/image/fetch/$s_!zCax!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1272w, https://substackcdn.com/image/fetch/$s_!zCax!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zCax!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png" width="998" height="376" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2508960-4a9d-44cb-9877-0df6263956b9_998x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:998,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52459,&quot;alt&quot;:&quot;The high-level process of creating machine-readable content from unstructured text by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The high-level process of creating machine-readable content from unstructured text by Federico Trotta" title="The high-level process of creating machine-readable content from unstructured text by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!zCax!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 424w, https://substackcdn.com/image/fetch/$s_!zCax!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 848w, https://substackcdn.com/image/fetch/$s_!zCax!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zCax!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The high-level process of creating machine-readable content from unstructured text</figcaption></figure></div><p>Let&#8217;s see how to implement this process and how the two approaches differ in practice.</p><div><hr></div><h3>The Baseline Approach: A Direct LLM Call (and What It Gives You)</h3><p>Consider the following content that can come from scraping a news article:</p><pre><code><code>Fed Signals Caution as Inflation Data Disappoints

By Sarah M. Connelly | April 14, 2026 | Economics &amp; Policy

The Federal Reserve signaled on Monday that it remains in no rush to cut interest rates,
after fresh inflation data showed consumer prices rose more than expected last month.
Jerome Powell, speaking at a conference in Washington, said the central bank needs
"greater confidence" that inflation is moving sustainably toward its two-percent target
before reducing borrowing costs.

The Consumer Price Index climbed 3.5 percent in March, up from 3.2 percent in February,
according to the Labor Department. Economists polled by Reuters had forecast a reading
of 3.3 percent. Core CPI, which strips out food and energy, rose 3.8 percent year-over-year.

Markets reacted sharply. The S&amp;P 500 fell nearly one point five percent by midday,
while the yield on the 10-year Treasury note jumped to four point six percent.
Goldman Sachs revised its forecast, pushing back its expected first rate cut from June
to September. JPMorgan analysts said two cuts in 2026 now look "optimistic."

Powell emphasized that the Fed is not considering rate hikes at this stage, but stressed
that the path back to two percent inflation "may take longer than previously thought."
The next Fed meeting is scheduled for the first week of May.</code></code></pre><p>To pass this text directly to a GPT model and ask for specific fields, you can use the following code:</p><pre><code><code>import os
import json
from openai import OpenAI

# Scraped content
SCRAPED_TEXT = """
Fed Signals Caution as Inflation Data Disappoints

By Sarah M. Connelly | April 14, 2026 | Economics &amp; Policy

The Federal Reserve signaled on Monday that it remains in no rush to cut interest rates,
after fresh inflation data showed consumer prices rose more than expected last month.
Jerome Powell, speaking at a conference in Washington, said the central bank needs
"greater confidence" that inflation is moving sustainably toward its two-percent target
before reducing borrowing costs.

The Consumer Price Index climbed 3.5 percent in March, up from 3.2 percent in February,
according to the Labor Department. Economists polled by Reuters had forecast a reading
of 3.3 percent. Core CPI, which strips out food and energy, rose 3.8 percent year-over-year.

Markets reacted sharply. The S&amp;P 500 fell nearly one point five percent by midday,
while the yield on the 10-year Treasury note jumped to four point six percent.
Goldman Sachs revised its forecast, pushing back its expected first rate cut from June
to September. JPMorgan analysts said two cuts in 2026 now look "optimistic."

Powell emphasized that the Fed is not considering rate hikes at this stage, but stressed
that the path back to two percent inflation "may take longer than previously thought."
The next Fed meeting is scheduled for the first week of May.
"""

# Define LLM client
raw_client = OpenAI(api_key=os.environ.get("YOUR_OPENAI_API_KEY"))

# Define prompt for the LLM
raw_prompt = """
Extract the following information from the article below and return it as JSON:
- title
- author
- publication_date
- mentioned_organizations
- cpi_march_value
- key_claim
- market_sentiment

Article:
""" + SCRAPED_TEXT

# Get response from LLM
raw_response = raw_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": raw_prompt}]
)

raw_output = raw_response.choices[0].message.content
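
# Optional addition (not in the original script): raw_output is plain text, so the
# caller still has to parse it. parse_raw_json is a hypothetical helper that pulls
# the outermost JSON object out of the reply (the model may wrap it in a Markdown
# fence or in extra prose) and returns None when no valid JSON is found.
import json

def parse_raw_json(text):
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

# Usage: parsed = parse_raw_json(raw_output)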

# Print results
print(raw_output)</code></code></pre><p>Because LLM output can vary between runs, your result may differ, but it will look similar to the following:</p><pre><code><code>{
  "title": "Fed Signals Caution as Inflation Data Disappoints",
  "author": "Sarah M. Connelly",
  "publication_date": "April 14, 2026",
  "mentioned_organizations": [
    "Federal Reserve",
    "Labor Department",
    "Reuters",
    "Goldman Sachs",
    "JPMorgan"
  ],
  "cpi_march_value": "3.5 percent",
  "key_claim": "The Federal Reserve is in no rush to cut interest rates and needs greater confidence that inflation is moving sustainably toward its two-percent target before reducing borrowing costs.",
  "market_sentiment": "Negative"
}</code></code></pre><p>At first sight, this looks good. The prompt asked the GPT model to return JSON with specific fields, and the model did so. But two major problems affect the next steps when analyzing this data:</p><ul><li><p>The publication date is reported as &#8220;April 14, 2026&#8221;. This is not in ISO 8601 format, so any parser that expects <em>YYYY-MM-DD</em> will reject it.</p></li><li><p>The CPI is reported as &#8220;3.5 percent&#8221;, which is a string rather than a float. You need a numeric value if you want to analyze the data further without intermediate conversion steps.</p></li></ul><p>So, the LLM was able to give structure to unstructured text after being specifically prompted to do so, but it failed to provide the data in the right format. To get that, you have to give the model more specific guidance.</p><h3>What Changes When You Define the Schema</h3><p>To get stronger guarantees on the output format, you can use the following code:</p><pre><code><code>import os
import json
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Optional, Literal

# Scraped content
SCRAPED_TEXT = """
Fed Signals Caution as Inflation Data Disappoints

By Sarah M. Connelly | April 14, 2026 | Economics &amp; Policy

The Federal Reserve signaled on Monday that it remains in no rush to cut interest rates,
after fresh inflation data showed consumer prices rose more than expected last month.
Jerome Powell, speaking at a conference in Washington, said the central bank needs
"greater confidence" that inflation is moving sustainably toward its two-percent target
before reducing borrowing costs.

The Consumer Price Index climbed 3.5 percent in March, up from 3.2 percent in February,
according to the Labor Department. Economists polled by Reuters had forecast a reading
of 3.3 percent. Core CPI, which strips out food and energy, rose 3.8 percent year-over-year.

Markets reacted sharply. The S&amp;P 500 fell nearly one point five percent by midday,
while the yield on the 10-year Treasury note jumped to four point six percent.
Goldman Sachs revised its forecast, pushing back its expected first rate cut from June
to September. JPMorgan analysts said two cuts in 2026 now look "optimistic."

Powell emphasized that the Fed is not considering rate hikes at this stage, but stressed
that the path back to two percent inflation "may take longer than previously thought."
The next Fed meeting is scheduled for the first week of May.
"""

# Validation schema
class ArticleExtraction(BaseModel):
    title: str = Field(description="The article's headline")
    author: Optional[str] = Field(description="Full name of the author if explicitly mentioned")
    publication_date: Optional[str] = Field(description="Publication date in ISO 8601 format (YYYY-MM-DD)")
    mentioned_organizations: list[str] = Field(description="All organizations referenced in the article")
    cpi_march_value: Optional[float] = Field(description="CPI value as a float (e.g. 3.5)")
    key_claim: str = Field(description="The central argument or finding of the article in one sentence")
    market_sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Overall market sentiment expressed in the article"
    )

structured_client = instructor.from_openai(OpenAI(api_key=os.environ.get("YOUR_OPENAI_API_KEY")))

extraction = structured_client.chat.completions.create(
    model="gpt-4o",
    response_model=ArticleExtraction,
    messages=[
        {
            "role": "user",
            "content": f"Extract structured information from the following article:\n\n{SCRAPED_TEXT}"
        }
    ]
)

print(extraction.model_dump_json(indent=2))
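
# Optional addition (not in the original script): publication_date is typed as a
# string, so the ISO 8601 format is requested by the field description but not
# enforced by Pydantic. is_iso_date is a hypothetical helper for checking it.
from datetime import datetime

def is_iso_date(value):
    """Return True if value is None or a zero-padded YYYY-MM-DD date string."""
    if value is None:
        return True
    if len(value) != 10:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

# Usage: is_iso_date(extraction.publication_date)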

print("\n" + "=" * 60)
print("INSPECTION OUTPUT")
print("=" * 60)
for field, value in extraction.model_dump().items():
    print(f"  {field}: {repr(value)}  &#8594;  type: {type(value).__name__}")</code></code></pre><p>The above code leverages two fundamental libraries:</p><ul><li><p><strong><a href="https://pydantic.dev/">Pydantic</a></strong>: This is a Python data validation library. You define a schema as a Python class, declare the fields and their types, and Pydantic enforces that any data you put into that class matches what you declared.</p></li><li><p><strong><a href="https://python.useinstructor.com/">Instructor</a></strong>: This is the bridge between Pydantic and the LLM. The core problem it solves is that LLM APIs return text, but Pydantic validates Python objects. So, something has to sit in the middle, take the LLM&#8217;s response, parse it into the structure your Pydantic model expects, and retry the call if the output doesn&#8217;t validate. That&#8217;s what Instructor does. Without Instructor, you would have to manually prompt the model to return JSON, parse that JSON yourself, handle malformed responses, write retry logic, and coerce types by hand.</p></li></ul><p>By using these two libraries, the <em>ArticleExtraction</em> class does the following:</p><ul><li><p><strong>Type enforcement:</strong> Defines <em>cpi_march_value</em> as a float. This ensures the validated result holds an actual number instead of a string (3.5 instead of "3.5 percent", as in the previous example).</p></li><li><p><strong>Formatting and vocabulary control:</strong> The <em>Literal</em> type on <em>market_sentiment</em> restricts the LLM&#8217;s output to <em>"positive"</em>, <em>"negative"</em>, or <em>"neutral"</em>. The model cannot invent new categories. The description for <em>publication_date</em>, in turn, explicitly asks for the ISO 8601 format. Since that field is typed as a string, this is an instruction to the model rather than a validated constraint.</p></li><li><p><strong>Built-in prompting:</strong> The <em>Field(description="...")</em> parameters serve a dual purpose. First, they document the code for developers. 
Second, under the hood, the Instructor library passes these descriptions to the LLM as part of the schema, where they act as targeted instructions. This helps the model understand <em>exactly</em> what &#8220;key claim&#8221; or &#8220;publication date&#8221; means in this context.</p></li><li><p><strong>Graceful omissions:</strong> Wrapping fields like <em>author</em> in <em>Optional[...]</em> gives the model permission to safely return a null value if the information isn&#8217;t present in the scraped text. This removes the pressure to invent a value, which lowers the risk of hallucinations.</p></li></ul><p>The JSON output is as follows:</p><pre><code><code>{
  "title": "Fed Signals Caution as Inflation Data Disappoints",
  "author": "Sarah M. Connelly",
  "publication_date": "2026-04-14",
  "mentioned_organizations": [
    "Federal Reserve",
    "Labor Department",
    "Reuters",
    "Goldman Sachs",
    "JPMorgan"
  ],
  "cpi_march_value": 3.5,
  "key_claim": "The Federal Reserve remains cautious about cutting interest rates because inflation has not yet shown sufficient progress toward its two-percent target.",
  "market_sentiment": "negative"
}</code></code></pre><p>As you can see, now the CPI is a float, and the publication date is in ISO 8601.</p><p>The inspection output is the following:</p><pre><code><code>============================================================
INSPECTION OUTPUT
============================================================
  title: 'Fed Signals Caution as Inflation Data Disappoints'  &#8594;  type: str
  author: 'Sarah M. Connelly'  &#8594;  type: str
  publication_date: '2026-04-14'  &#8594;  type: str
  mentioned_organizations: ['Federal Reserve', 'Labor Department', 'Reuters', 'Goldman Sachs', 'JPMorgan']  &#8594;  type: list
  cpi_march_value: 3.5  &#8594;  type: float
  key_claim: 'The Federal Reserve remains cautious about cutting interest rates because inflation has not yet shown sufficient progress toward its two-percent target.'  &#8594;  type: str
  market_sentiment: 'negative'  &#8594;  type: str</code></code></pre><p>This inspection output lets you confirm at a glance that the data types are correct.</p><h2>Conclusion</h2><p>In this article, you learned what unstructured text actually costs a data pipeline. You saw how classical NLP made structured extraction possible but fragile, and how LLMs removed the domain constraints that NLP never solved. You also learned why LLMs alone are not enough and saw a practical solution to provide &#8220;guardrails&#8221; for LLMs so that their output follows a defined schema.</p><p>So, let us know: how are you managing unstructured text after you&#8217;ve scraped it?</p>]]></content:encoded></item><item><title><![CDATA[Cloudflare Crawl Endpoint: Everything You Need to Know]]></title><description><![CDATA[Is the Cloudflare /crawl endpoint a real game-changer?]]></description><link>https://substack.thewebscraping.club/p/cloudflare-crawl-endpoint-for-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/cloudflare-crawl-endpoint-for-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 03 May 2026 20:24:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/898316de-e54e-4a62-8089-2ad66bc363b8_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cloudflare just shook the Web by announcing its first API for crawling entire websites. It&#8217;s built for RAG systems and website monitoring, but can it really be used for real-world web scraping scenarios?</p><p>In this article, you&#8217;ll find out that and more. 
I&#8217;ll walk you through a complete guided example of how to use it, and break down its (Spoiler: undoubtedly serious) limitations.</p><h2>An Introduction to the Cloudflare Crawl Endpoint</h2><p>Before exploring the technical aspects behind the Cloudflare <em>/crawl</em> endpoint and seeing it in action, let me first give you some context!</p><h3>What Is the Cloudflare <em>/crawl</em> Endpoint?</h3><p>The <em><a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/">/crawl</a></em> endpoint is a new addition to <a href="https://developers.cloudflare.com/fundamentals/api/">Cloudflare&#8217;s REST APIs</a>. Its goal is to crawl an entire website (or just a portion of it) starting from a single URL.</p><p><strong>Note</strong>: The Crawl endpoint is currently in beta and was introduced on March 10, 2026, <a href="https://developers.cloudflare.com/changelog/post/2026-03-10-br-crawl-endpoint/">as highlighted in the Cloudflare changelog</a>.</p><p>Under the hood, it automatically discovers and visits new pages, <a href="https://developers.cloudflare.com/browser-rendering/">rendering them in a headless browser</a>. It returns the discovered content as HTML, Markdown, or structured JSON, making it ideal for RAG pipelines, monitoring, or dataset creation.</p><p>As I&#8217;ll dive into later, it respects <em>robots.txt</em> and <em>doesn&#8217;t</em> bypass bot protection or captchas. 
Thus, it&#8217;s designed as a <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">compliant approach to web crawling!</a></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" loading="lazy" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>How It Works at a High Level</strong></h2><p>At a high level, the <em>/crawl</em> endpoint involves a two-step flow:</p><ol><li><p>You kick off an asynchronous crawl job, passing a starting URL. 
Cloudflare returns a job ID.</p></li><li><p>You use that job ID to periodically check the job&#8217;s status or fetch results as they become available, following typical <a href="https://en.wikipedia.org/wiki/Polling_(computer_science)">polling behavior</a>.</p></li></ol><p><strong>Important</strong>: A crawl job can run for <em>up to seven days!</em> Results remain available for 14 days after completion, after which the job data is deleted.</p><p>Behind the scenes, the crawler expands outward from the starting URL. By default, the API discovers URLs in this order:</p><ol><li><p>The initial page.</p></li><li><p>Sitemap URLs.</p></li><li><p>Links discovered within pages.</p></li></ol><p>Still, you can tweak that order depending on whether you want to prioritize sitemaps, page links, or both.</p><h3>Supported Use Cases</h3><p>The officially promoted use cases for the Cloudflare <em>/crawl</em> API are just two:</p><ul><li><p>Creating knowledge bases or training AI systems (like <a href="https://substack.thewebscraping.club/p/ingest-web-data-rag-llm">RAG applications</a>) using up-to-date web content.</p></li><li><p>Collecting and analyzing content across multiple pages <a href="https://substack.thewebscraping.club/p/build-an-ai-agent-for-scraping-papers">for research</a>, summarization, or monitoring purposes.</p></li></ul><h3>Pricing</h3><p>Unlike most other web crawling or discovery APIs on the market, Cloudflare&#8217;s <em>/crawl</em> API doesn&#8217;t charge by the number of pages. 
Instead, costs are based on resource usage, which depends on whether you enable the headless browser rendering feature.</p><p>If headless rendering is active, pricing follows the <a href="https://developers.cloudflare.com/browser-rendering/pricing/">Browser Rendering model</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vrIj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vrIj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 424w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 848w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1272w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png" width="1456" height="238" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:238,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48862,&quot;alt&quot;:&quot;The Browser Rendering pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Browser Rendering pricing model" title="The Browser Rendering pricing model" srcset="https://substackcdn.com/image/fetch/$s_!vrIj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 424w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 848w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1272w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Browser Rendering pricing 
model</figcaption></figure></div><p>If rendering isn&#8217;t active, pricing follows the <a href="https://developers.cloudflare.com/workers/platform/pricing/">Workers model</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HVsl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HVsl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 424w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 848w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1272w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png" width="1456" height="356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66389,&quot;alt&quot;:&quot;The Workers pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Workers pricing model" title="The Workers pricing model" srcset="https://substackcdn.com/image/fetch/$s_!HVsl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 424w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 848w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1272w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Workers pricing model</figcaption></figure></div><p><em>Yeah, I 
know&#8230; It&#8217;s honestly a bit confusing, and it&#8217;s almost impossible to predict the exact cost of a crawling task. The good news? You can test it for free!</em></p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>Cloudflare Crawl Endpoints: Technical Analysis</h2><p>Now that you know what Cloudflare is and what it brings to the table, it&#8217;s time to look at how the Crawl API works, along with its strengths and limitations.</p><h3><strong>Endpoint Presentation</strong></h3><p>The Cloudflare Crawl API is built around two main endpoints. Both share the same base URL:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl</code></pre></div><p>Here, <em>&lt;ACCOUNT_ID&gt;</em> is your <a href="https://developers.cloudflare.com/fundamentals/account/find-account-and-zone-ids/#copy-your-account-id">Cloudflare account ID</a>.</p><h4>1. Initiate the Crawl Job (POST)</h4><p>To start a new crawl, send a POST request with the target URL (and optional parameters such as depth and rendering mode), as shown below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -X POST 'https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl' \
  -H 'Authorization: Bearer &lt;CLOUDFLARE_API_TOKEN&gt;' \
  -H 'Content-Type: application/json' \
  -d '{ "url": "https://example.com" }'</code></pre></div><p>Keep in mind that the endpoint supports several parameters, allowing you to greatly customize the crawling behavior, output format (JSON, HTML, or Markdown), rendering options, caching, and more. Check out the <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#optional-parameters">full list of supported body parameters for all available options</a>.</p><p>Cloudflare immediately returns a job ID that you&#8217;ll use to track or retrieve results. A possible response looks like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{
  "success": true,
  "result": "9f1c2d3a-4b5e-6f7a-8c9d-0e1f2a3b4c5d"
}</code></pre></div><p>The UUID in the <em>result</em> field is the Crawl job ID you&#8217;ll use to poll for updates.</p><h4>2. Request Crawl Results (GET)</h4><p>Once the crawl is running, make a GET request with the job ID to check the status or fetch results:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -X GET 'https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl/&lt;JOB_ID&gt;' \
  -H 'Authorization: Bearer &lt;CLOUDFLARE_API_TOKEN&gt;'</code></pre></div><p>Here, the <em>&lt;JOB_ID&gt;</em> placeholder is the UUID returned earlier in the <em>result</em> field.</p><p>While the job is still in progress, the response includes a <em>status</em> field like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">{
  "result": {
    "id": "3e7a1c92-b4d8-4f6e-9a21-6c0d5b8e2f14",
    "status": "running"
    // ...
  }
}</code></pre></div><p>The possible <em>status</em> values are: <em>running</em>, <em>completed</em>, <em>errored</em>, or one of several cancellation states (<em>cancelled_due_to_timeout</em>, <em>cancelled_due_to_limits</em>, <em>cancelled_by_user</em>).</p><p>Alternatively, once the job is completed, the same GET request returns the full results in the <em>records</em> field:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">{
  "result": {
    "id": "3e7a1c92-b4d8-4f6e-9a21-6c0d5b8e2f14",
    "status": "completed",
    "browserSecondsUsed": 98.3,
    "total": 12,
    "finished": 12,
    "records": [
      {
        "url": "https://example.com/",
        "status": "completed",
        "markdown": "# Example Domain\nThis domain is for use in illustrative examples...",
        "metadata": {
          "status": 200,
          "title": "Example Domain",
          "url": "https://example.com/"
        }
      },
      {
        "url": "https://example.com/about",
        "status": "completed",
        "markdown": "## About\nLearn more about this example site...",
        "metadata": {
          "status": 200,
          "title": "About - Example Domain",
          "url": "https://example.com/about"
        }
      }
      // additional entries omitted for brevity...
    ],
    "cursor": 10
  },
  "success": true
}</code></pre></div><p>Note that the response will vary based on the specified query parameters. For example, you can filter by specific statuses, limit the number of results, and <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#polling-for-completion">navigate through them using a pagination system</a>.</p><h3>Features</h3><p>Below are the most relevant capabilities provided by the Cloudflare Crawl API:</p><ul><li><p><strong>Asynchronous crawl jobs</strong>: Trigger crawling jobs and poll for results when they are ready, enabling non-blocking, large-scale crawling workflows.</p></li><li><p><strong>Automatic URL discovery</strong>: Finds pages from the starting URL, sitemaps, and in-page links, with configurable source control.</p></li><li><p><strong>Flexible output formats</strong>: Returns HTML, Markdown, or structured JSON. JSON leverages <a href="https://developers.cloudflare.com/workers-ai/features/json-mode/">Workers AI for schema-driven data extraction</a>.</p></li><li><p><strong>Headless browser rendering</strong>: Render pages with JavaScript execution via <em>render: true</em>, or perform fast static HTML fetches with <em>render: false</em>.</p></li><li><p><strong>Fine-grained crawl control</strong>: Configure <em>limit</em>, <em>depth</em>, and URL inclusion/exclusion with the <em>includePatterns</em>/<em>excludePatterns</em> fields.</p></li><li><p><strong>Incremental and cache-aware crawling</strong>: Use the <em>modifiedSince</em> and <em>maxAge</em> parameters to avoid re-fetching unchanged content, optimizing performance and cost.</p></li><li><p><strong>Advanced filtering and pagination</strong>: Retrieve results using <em>limit</em>, <em>cursor</em>, and <em>status</em> filters to handle large datasets efficiently.</p></li><li><p><strong>Authentication and custom headers</strong>: Supports HTTP auth, cookies, and custom headers for crawling protected or API-driven 
content.</p></li><li><p><strong>Dynamic content handling</strong>: Wait for JS-rendered content using <em>gotoOptions</em> and <em>waitForSelector</em>, ideal for SPAs and interactive pages.</p></li><li><p><strong>Resource skipping for performance</strong>: Optionally block images, media, fonts, or stylesheets to speed up crawling and reduce unnecessary bandwidth usage.</p></li></ul><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, 
https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h3>Limitations</h3><p>Cloudflare&#8217;s <em>/crawl</em> API also comes with several important limitations, such as:</p><ul><li><p><strong>Respects bot protection</strong>: The crawler can&#8217;t <a href="https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026">bypass CAPTCHAs (including Turnstile challenges) or Cloudflare bot protections</a>. As a rule of thumb, sites protected via Cloudflare Bot Management or other WAFs tend to block crawling tasks entirely, limiting automated access and leading to incomplete datasets.</p></li><li><p><strong>Fixed User-Agent</strong>: The <em>/crawl</em> endpoint sets a non-customizable <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent">User-Agent</a> </em>value<em> </em>(<em>CloudflareBrowserRenderingCrawler/1.0</em>). 
You can&#8217;t change it, which may cause sites to block requests or serve different content based on the <em>User-Agent</em>.</p></li><li><p><strong>Content Signals enforcement</strong>: If a site disallows AI usage via <a href="https://contentsignals.org/">Cloudflare Content Signals</a>, crawl requests for those purposes are rejected with a <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/400">400 Bad Request</a></em> error. Even if the site allows other uses, attempts to crawl disallowed categories will fail, limiting AI-specific data collection.</p></li><li><p><strong>Rate limiting and crawl pacing</strong>: Sites with strict rate limits can slow down crawling. The crawler respects the robots.txt <em>Crawl-delay </em>directive and implements backoff. Large crawls may need to be split into smaller jobs to avoid throttling or skipped URLs.</p></li><li><p><strong>Browser usage limits and job cancellation</strong>: Accounts on Workers free plans are capped at 10 minutes of browser time per day. Exceeding this limit results in a <em>cancelled_due_to_limits</em> status. To avoid that, you can upgrade to a paid plan.</p></li></ul><h2>How to Use the Cloudflare Crawl Endpoint: Step-by-Step Tutorial</h2><p>In this guided section, I&#8217;ll show you how to use the Cloudflare Crawl Endpoint to crawl a website in Python. The target site will be the &#8220;<a href="https://quotes.toscrape.com/">Quotes to Scrape</a>&#8221; sandbox. 
The goal here is to demonstrate how to use the API, rather than actually collecting relevant data.</p><p>Follow the instructions below!</p><h3>Prerequisites</h3><p>To follow this tutorial section, make sure you have:</p><ul><li><p>Your <a href="https://developers.cloudflare.com/fundamentals/account/find-account-and-zone-ids/#copy-your-account-id">Cloudflare account ID</a> at hand.</p></li><li><p>A <a href="https://developers.cloudflare.com/fundamentals/api/get-started/create-token/">Cloudflare API token</a> with the &#8220;Browser Rendering - Edit&#8221; permission.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nJvY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nJvY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 424w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 848w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png" width="1456" height="781" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission" title="Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission" srcset="https://substackcdn.com/image/fetch/$s_!nJvY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 424w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 848w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission</figcaption></figure></div><p>For the sake of simplicity and to keep this tutorial concise, I&#8217;ll assume you already have a Python project set up with <em><a href="https://substack.thewebscraping.club/p/python-http-request-explained">requests</a></em> 
installed. That said, you can use any programming language and any HTTP client, because the high-level logic remains the same.</p><h3>Step #1: Set Up the Configurations</h3><p>Start by importing the required libraries and reading the necessary secrets (your Cloudflare API token and account ID). Use these secrets to prepare the Cloudflare Crawl base URL and headers. Also, specify the starting target URL as a constant.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import requests
import time
import json

# The Cloudflare secrets required for authenticating Crawl API calls
CLOUDFLARE_API_TOKEN = "&lt;YOUR_CLOUDFLARE_API_TOKEN&gt;"
CLOUDFLARE_ACCOUNT_ID = "&lt;YOUR_CLOUDFLARE_ACCOUNT_ID&gt;"
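
# Optional sketch (an assumption, not part of the original script): if the
# CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID environment variables are set,
# prefer them over the hardcoded placeholders. These variable names are
# hypothetical; use whatever names your environment defines.
import os

CLOUDFLARE_API_TOKEN = os.environ.get("CLOUDFLARE_API_TOKEN", "&lt;YOUR_CLOUDFLARE_API_TOKEN&gt;")
CLOUDFLARE_ACCOUNT_ID = os.environ.get("CLOUDFLARE_ACCOUNT_ID", "&lt;YOUR_CLOUDFLARE_ACCOUNT_ID&gt;")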

# The base URL used for all Crawl API calls
CLOUDFLARE_CRAWL_BASE_URL = f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl"

# Common headers shared by all API calls
HEADERS = {
    "Authorization": f"Bearer {CLOUDFLARE_API_TOKEN}",
    "Content-Type": "application/json"
}

# The URL to crawl
START_URL = "http://quotes.toscrape.com/"</code></pre></div><p><strong>Tip</strong>: In a production script, read the Cloudflare API token and account ID from environment variables rather than hardcoding them.</p><h3>Step #2: Trigger the Crawling Job</h3><p>Define a <em>start_crawl()</em> function to send a POST request to Cloudflare&#8217;s Crawl API:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def start_crawl(start_url):
    # Customize according to your needs
    payload = {
        "url": start_url,
        "limit": 20,
        "depth": 2,
        "formats": ["markdown"],
        "render": True
    }

    response = requests.post(CLOUDFLARE_CRAWL_BASE_URL, headers=HEADERS, json=payload)
    response.raise_for_status()

    job_id = response.json()["result"]
    return job_id</code></pre></div><p>This creates a new crawling job for the target URL and returns the <em>job_id</em> that identifies this specific crawl.</p><p><strong>Tip</strong>: In a production-level script, make the <em>payload</em> object configurable via function arguments for greater flexibility and reusability.</p><h3>Step #3: Poll the Job Status</h3><p>Next, add a <em>wait_for_completion()</em> function to repeatedly check the job status every few seconds until the crawl finishes or the maximum number of attempts is reached:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def wait_for_completion(job_id, max_attempts=60, delay=5):
    status_url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}?limit=1"

    for attempt in range(max_attempts):
        res = requests.get(status_url, headers=HEADERS)
        res.raise_for_status()

        status = res.json()["result"]["status"]
        print(f"Attempt {attempt+1}: {status}")

        if status != "running":
            return status

        time.sleep(delay)

    raise Exception("Timeout waiting for crawl job")</code></pre></div><p>This function sends GET requests to the Cloudflare <em>/crawl</em> endpoint until the job leaves the <em>running</em> state, so the crawl has finished processing before you fetch the crawled records.</p><p><strong>Tip</strong>: The <em>limit=1</em> query parameter is recommended to restrict the number of retrieved records, keeping the response lightweight. After all, at this stage, you&#8217;re only interested in checking the job status, not in retrieving the actual output data.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Step #4: Get the Crawled Pages</h3><p>Build a <em>fetch_records()</em> function to collect all crawled pages:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def fetch_records(job_id):
    print("Fetching results...")

    all_records = []
    cursor = None

    while True:
        url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}"
        # Getting 10 records at a time
        params = {
            "limit": 10
        }

        if cursor:
            params["cursor"] = cursor

        response = requests.get(url, headers=HEADERS, params=params)
        response.raise_for_status()

        result = response.json()["result"]
        records = result.get("records", [])

        all_records.extend(records)
        print(f"+{len(records)} records (Total: {len(all_records)})")

        cursor = result.get("cursor")
        if not cursor:
            break

    return all_records</code></pre></div><p>This handles pagination using a <em>cursor</em>, accessing records in batches (<em>10</em> per request) until all results are returned.</p><h3>Step #5: Put It All Together</h3><p>Finally, in the <em>main()</em> function, orchestrate the workflow:</p><ol><li><p>Start the crawl</p></li><li><p>Wait for completion</p></li><li><p>Fetch all results</p></li></ol><p>Then, you can export the crawled records to a local JSON file for further use, store the retrieved data in a database, process it there, etc.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def main():
    job_id = start_crawl(START_URL)
    status = wait_for_completion(job_id)

    if status != "completed":
        raise Exception(f"Crawl failed: {status}")

    print("Crawl completed!")

    records = fetch_records(job_id)

    # Export the crawled pages to an output JSON file
    with open("records.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
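
# Optional sketch (not part of the original script): instead of a JSON file,
# the records could be stored in a local SQLite database. The helper name, the
# "pages" table, the "records.db" default path, and the "url", "status", and
# "markdown" record keys are assumptions made for illustration; check them
# against the records your crawl actually returns.
import sqlite3

def save_records_to_sqlite(records, db_path="records.db"):
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS pages "
                "(url TEXT PRIMARY KEY, status TEXT, markdown TEXT)"
            )
            # Missing keys are stored as NULL rather than raising an error
            conn.executemany(
                "INSERT OR REPLACE INTO pages (url, status, markdown) VALUES (?, ?, ?)",
                [(r.get("url"), r.get("status"), r.get("markdown")) for r in records],
            )
    finally:
        conn.close()
    return len(records)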

if __name__ == "__main__":
    main()</code></pre></div><h3>Step #6: Complete Code</h3><p>This is what your Python script for interacting with the Cloudflare Crawl API will look like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># pip install requests

import requests
import time
import json

# The Cloudflare secrets required for authenticating Crawl API calls
CLOUDFLARE_API_TOKEN = "&lt;YOUR_CLOUDFLARE_API_TOKEN&gt;"
CLOUDFLARE_ACCOUNT_ID = "&lt;YOUR_CLOUDFLARE_ACCOUNT_ID&gt;"

# The base URL used for all Crawl API calls
CLOUDFLARE_CRAWL_BASE_URL = f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl"

# Common headers shared by all API calls
HEADERS = {
    "Authorization": f"Bearer {CLOUDFLARE_API_TOKEN}",
    "Content-Type": "application/json"
}

# The URL to crawl
START_URL = "http://quotes.toscrape.com/"

def start_crawl(start_url):
    """
    Triggers the Cloudflare Crawl API job
    """

    # Customize according to your needs
    payload = {
        "url": start_url,
        "limit": 20,
        "depth": 2,
        "formats": ["markdown"],
        "render": True
    }

    response = requests.post(CLOUDFLARE_CRAWL_BASE_URL, headers=HEADERS, json=payload)
    response.raise_for_status()

    job_id = response.json()["result"]
    return job_id

def wait_for_completion(job_id, max_attempts=60, delay=5):
    """
    Waits for the crawling task to complete
    """

    status_url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}?limit=1"

    for attempt in range(max_attempts):
        res = requests.get(status_url, headers=HEADERS)
        res.raise_for_status()

        status = res.json()["result"]["status"]
        print(f"Attempt {attempt+1}: {status}")

        if status != "running":
            return status

        time.sleep(delay)

    raise Exception("Timeout waiting for crawl job")

def fetch_records(job_id):
    """
    Collects all records from the paginated results
    """

    print("Fetching results...")

    all_records = []
    cursor = None

    while True:
        url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}"
        # Getting 10 records at a time
        params = {
            "limit": 10
        }

        if cursor:
            params["cursor"] = cursor

        response = requests.get(url, headers=HEADERS, params=params)
        response.raise_for_status()

        result = response.json()["result"]
        records = result.get("records", [])

        all_records.extend(records)
        print(f"+{len(records)} records (Total: {len(all_records)})")

        # A missing cursor means there are no more pages of results
        cursor = result.get("cursor")
        if not cursor:
            break

    return all_records

def main():
    job_id = start_crawl(START_URL)
    status = wait_for_completion(job_id)

    if status != "completed":
        raise Exception(f"Crawl failed: {status}")

    print("Crawl completed!")

    records = fetch_records(job_id)

    # Export the crawled pages to an output JSON file
    with open("records.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
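
# Optional sketch (not part of the original script): tally the exported records
# by outcome to see how many pages were actually retrieved versus skipped.
# The helper name, the "status" record key, and the "unknown" fallback label are
# assumptions made for illustration; check them against your own records.json.
from collections import Counter

def count_by_status(records):
    # Records without a "status" key are grouped under "unknown"
    return Counter(r.get("status", "unknown") for r in records)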

if __name__ == "__main__":
    main()</code></pre></div><h3>Step #7: Test the Script</h3><p>Launch the script, and it&#8217;ll produce an output like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XDal!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XDal!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 424w, https://substackcdn.com/image/fetch/$s_!XDal!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 848w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1272w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png" width="1175" height="412" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1175,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output produced by the script in the terminal&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output produced by the script in the terminal" title="The output produced by the script in the terminal" srcset="https://substackcdn.com/image/fetch/$s_!XDal!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 424w, https://substackcdn.com/image/fetch/$s_!XDal!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 848w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1272w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" 
fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The output produced by the script in the terminal</figcaption></figure></div><p>The polling mechanism required 5 attempts (~25 seconds), and the API discovered and retrieved 22 pages.</p><p>A <em>records.json</em> file will appear in your project directory. 
Open it, and you&#8217;ll see:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZwCj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 424w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 848w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png" width="1456" height="1071" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1071,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 424w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 848w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The output produced by the script</figcaption></figure></div><p>Notice how the &#8220;Quotes to Scrape&#8221; entries contain a <em>markdown</em> field with the Markdown version of the page. By contrast, external links such as Zyte&#8217;s homepage and Goodreads.com are skipped, since <em>includeExternalLinks</em> is set to <em>false</em> by default. In other words, the Cloudflare Crawl API doesn&#8217;t automatically fetch pages from domains other than that of the start URL.</p><p>Et voil&#224;! 
Implementation complete.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Benchmark Against Protected Websites</h3><p>Cool! The Cloudflare Crawl endpoint works like a charm and is easy to use. However, I was particularly concerned about its documented limitations and wanted to verify whether they actually hold up in practice&#8230;</p><p>So, I ran tests against several well-known sites protected by common WAF and anti-bot solutions (from different providers). Here&#8217;s a summary of the results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!chL4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!chL4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 424w, https://substackcdn.com/image/fetch/$s_!chL4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 848w, 
https://substackcdn.com/image/fetch/$s_!chL4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1272w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png" width="1456" height="605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111887,&quot;alt&quot;:&quot;Cloudflare Crawl API vs anti-bot solutions&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cloudflare Crawl API vs anti-bot solutions" title="Cloudflare Crawl API vs anti-bot solutions" srcset="https://substackcdn.com/image/fetch/$s_!chL4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 424w, 
https://substackcdn.com/image/fetch/$s_!chL4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 848w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1272w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cloudflare Crawl API vs anti-bot solutions</figcaption></figure></div><p>As you can tell, the limitations are very real. The results are quite discouraging:<strong> the Cloudflare Crawl API failed against all anti-bot&#8211;protected websites I tested.</strong></p><p>So, is this solution reliable for web scraping? When (and how) should you actually use it? Let me break that down in a final comment!</p><div><hr></div><h2>Final Comment</h2><p>In this article, I introduced you to one of the newest tools in Cloudflare&#8217;s growing ecosystem: the Crawl API! This endpoint is designed to help you crawl entire websites using distributed crawling tasks running on Cloudflare&#8217;s infrastructure.</p><p>Sure, the crawling mechanism works and is easy to launch, control, and implement. With just a few lines of code, you can get started. 
Still, several concerns should be raised:</p><ol><li><p><strong>Opaque pricing</strong>: Costs are tied to resource usage rather than the number of pages crawled, making them harder to predict.</p></li><li><p><strong>Fixed </strong><em><strong>User-Agent</strong></em>: The API doesn&#8217;t allow <em>User-Agent</em> customization, meaning even basic server-side checks can block it.</p></li><li><p><strong>Limited effectiveness on protected sites</strong>: By design, the API has a very low success rate against anti-bot&#8211;protected websites (unless you configure your Cloudflare Bot Protection settings to <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#robotstxt-and-bot-protection">allow it on your own site</a>).</p></li><li><p><strong>Rate limiting constraints</strong>: It strictly respects <em>robots.txt</em> directives and crawl delays, which can significantly slow or limit large crawls.</p></li></ol><p>In simple terms, if you want to use it for general-purpose, large-scale web crawling, I wouldn&#8217;t recommend it. The market offers more effective solutions that can actually bypass anti-bot protections. Plus, remember that around <em><a href="https://www.securitymagazine.com/articles/101188-65-of-websites-arent-protected-from-bots">35% of websites</a></em> are estimated to be protected against bots (i.e., you won&#8217;t be able to crawl them with this API).</p><p>Yet, if you know the target site is not protected, budget isn&#8217;t a concern, and you want to remain (<em>overly?</em>) ethical and compliant, the Cloudflare Crawl API can be an option.</p><p>I hope this breakdown helps you better understand this new solution and make an informed decision on whether to adopt it. Lastly, remember that the Cloudflare Crawl API is still in beta, so things may change soon. Just <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/">keep an eye on the docs for updates</a>. 
Until next time!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #103: Bypassing DataDome-Protected Websites in the Agentic Era]]></title><description><![CDATA[Fifteen browser configurations, one tough anti-bot, and only a couple made it to the cart]]></description><link>https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 30 Apr 2026 21:34:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e5dad0e-b094-41c0-942c-c76f3783b289_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This year, every web infrastructure company seems to be shipping a browser. Not a regular browser, though, but one designed to be driven by an AI agent and to look human while doing it. We wanted to know whether any of those browsers actually works against a serious anti-bot, so we picked a hard target, leroymerlin.fr behind DataDome, and tested more than a dozen different setups on the same four-step task: open the homepage, search for a product, open the first result, add it to the cart.<br></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>The short answer is that a couple of tools finished the task, but only one did so with any consistency. The reason why is worth telling, because it explains what is happening at the intersection of AI agents and web data right now. We ran a similar exercise <a href="https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026">against Cloudflare earlier this year</a>, and the conclusion is broadly the same: each anti-bot needs its own answer, and the answer changes every quarter.</p><h2>From workflows to agents, and why that changes the data problem</h2><p>Most code shipped under the AI banner is not really agentic. It is workflow code with an LLM dropped into a slot: generate a summary here, classify a record there, draft an email at the end. The control flow is hard-coded, and the model is one component among many.</p><p>The definition of an agent is quite different. The model decides the next action, observes the outcome, and decides again. The control flow lives inside the loop, not outside it. 
The agent has goals rather than scripts, and it picks tools and steps based on what it sees. That is what makes the engineering interesting, that is what makes it hard, and that is what sometimes makes it unreliable.</p><p>It also forces a different relationship with data. An agent that only sees its training corpus is stuck in the past. To make decisions worth anything, it has to read prices that change daily, stocks that move minute by minute, listings that did not exist last week. Some of that data sits behind APIs. Most of it does not. The web is still the largest and most current dataset in the world, and most of it is reachable only through a browser. So if we want our agents to act on real information, we have to give them a way to browse: opening a page, reading it, clicking a link, typing into a search bar, following a result, filling a form, all on sites that were never built for machines.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:69.35779816513761,&quot;width&quot;:630,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><p>This is the constraint that produced the wave of &#8220;agentic browser&#8221; launches we have seen over the last twelve months. Y Combinator alone has backed a long string of them. 
<a href="https://www.hyperbrowser.ai/">Hyperbrowser</a> (S21) was an early entry: scalable cloud browser infrastructure with built-in CAPTCHA solving, proxy management, and now a multi-agent playground. The newer cohort followed the agent wave more directly: <a href="https://www.browseros.com/">BrowserOS</a> (S24) is an open-source agentic browser that runs the agent locally on the user&#8217;s machine; <a href="https://browser-use.com/">Browser Use</a> (W25) offers an open-source agent loop on top of Playwright, plus a cloud version. <a href="https://www.skyvern.com/">Skyvern</a> is a self-hostable browser agent that uses an LLM and computer vision instead of fixed selectors.  Outside the YC pipeline, <a href="https://lightpanda.io/">Lightpanda</a> is doing something different again, a headless browser engine written from scratch in Zig and aimed squarely at agents and crawlers (claiming roughly 9x faster execution and 16x lower memory than Chrome). It fits the &#8220;browser built for machines&#8221; line of thought we covered in <a href="https://substack.thewebscraping.club/p/rethinking-the-web-browser">Rethinking the web browser</a> earlier this year. <a href="https://www.browserbase.com/">Browserbase</a> ships a managed browser plus Stagehand for natural-language automation. And the big AI labs are now in the same space: OpenAI shipped Operator and the ChatGPT Atlas browser, Anthropic shipped Computer Use, Perplexity launched Comet. Each project attacks the same problem from a slightly different angle, but the goal is identical: a browser an agent can drive without immediately tripping every detection mechanism on the other side.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h2>The same problem scrapers have been chasing for a decade</h2><p>For anyone who has worked in web data, none of this is new. The fight over whether a request looks human or automated has been going on as long as commercial scraping has existed. The product names have changed, but the purpose has not.</p><p>What has changed is who is selling the bypass. The companies that have spent years selling residential proxies and unblockers noticed quickly that the agentic boom is good for their business. They already have the IP networks, the fingerprint research, the bypass code, and the cat-and-mouse experience. They know what TLS handshake Chrome sends in October 2025 and what it sent in October 2024. Pivoting all of that into a managed browser is a smaller leap than building one from scratch. <a href="https://brightdata.com">Bright Data</a>, <a href="https://oxylabs.io">Oxylabs</a>, <a href="https://rayobyte.com">Rayobyte</a>, and <a href="https://www.zenrows.com">ZenRows</a> have all added a managed browser product alongside their proxies. </p><p>The other side of the line is moving in the opposite direction. Bot traffic has grown faster than human traffic for years, and the operators of large public sites care more about it than ever. <a href="https://datadome.co">DataDome</a>, <a href="https://www.cloudflare.com/products/bot-management/">Cloudflare Bot Management</a>, <a href="https://www.akamai.com/products/bot-manager">Akamai Bot Manager</a>, <a href="https://www.humansecurity.com">HUMAN</a>, <a href="https://www.kasada.io">Kasada</a>: every one of them ships updates that target the exact tools we just listed. Fingerprint checks get stricter. Behavioral models get more sensitive. 
The JavaScript challenge changes shape every few weeks. There is no silver bullet, and there is no tool, browser, proxy, or service that bypasses every anti-bot on every site at all times. Anyone who claims otherwise is selling something that worked last quarter and might still work this week. The useful question is what works on a given target, today, at what cost.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Picking a hard target</h2><p>To answer that question concretely, we needed a target where the anti-bot was good and the signal was clean. We picked leroymerlin.fr, the French DIY retailer. Leroy Merlin runs DataDome standalone, with no other anti-bot layer on top, so attribution is straightforward. It also runs one of the more verbose DataDome configurations we have come across: response headers expose <code>x-datadome-riskscore</code>, <code>x-datadome-protection</code>, <code>x-datadome-cid</code>, and <code>x-datadome-endpointid</code>. Most DataDome-protected sites only show us the outcome. Here we see the score the engine assigns at every request, which is rare and very useful when comparing tools side by side.</p><p>The task we picked is small but realistic. From the homepage, the agent has to type &#8220;ampoule B22 led blanc&#8221; into the search bar, click the first product result, and add the product to the cart. Four steps. We dropped the login step on purpose: leroymerlin.fr requires an OTP to sign in, and we did not want OTP friction to confound an anti-bot test.</p><p>A run is a pass if the agent reaches the cart confirmation. 
Otherwise we record where it stopped and what DataDome said about it. Each tool runs ten times back to back, and we aggregate the results. Tools that support an external proxy use the same residential pools: Bright Data residential FR for the Bright Data runs and <a href="https://geonode.com">Geonode</a> residential FR for the Geonode runs. Tools that ship their own proxy use it. We used two different providers to diversify the IP addresses and to be sure that blocks were not a matter of IP reputation.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>The contestants</h2><p>As you&#8217;ve seen above, the browser landscape is quite crowded and we could not cover all the tools. We picked four open-source projects and seven commercial products. Let&#8217;s start with the open-source ones.</p><p><a href="https://camoufox.com">Camoufox</a> is the stealth Firefox fork most people in the scraping world have already met (we <a href="https://substack.thewebscraping.club/p/open-source-python-libraries-scraping">introduced it</a> on TWSC back in September 2024). It rotates real-world fingerprints, patches the obvious automation tells, and ships a Playwright-compatible API. We pair it with both Bright Data and Geonode residential proxies in France. </p><p><a href="https://github.com/autoscrape-labs/pydoll">Pydoll</a> takes a different route: it drives Chromium directly over CDP without WebDriver, with built-in humanized cursor movement and typing. 
Importantly, Pydoll implements an explicit <code>Fetch.authRequired</code> handler, which lets it authenticate proxies that require Basic auth. </p><p><a href="https://scrapling.readthedocs.io">Scrapling</a> is a higher-level Python library. We use it in two modes. <code>DynamicFetcher</code> launches vanilla Playwright Chromium driven by Scrapling&#8217;s session manager. <code>StealthyFetcher</code> does the same, but under the hood uses an improved and customized version of <a href="https://github.com/Kaliiiiiiiiii-Vinyzu/patchright">patchright</a>, a stealth-patched Playwright fork. Each gets its own row in the comparison. </p><p><a href="https://github.com/rayobyte-data/rayobrowse">RayoBrowse</a> is the self-hosted stealth Chromium fork from Rayobyte, distributed as a Docker container that exposes a CDP endpoint on port 9222. Here we hit a wall worth flagging: for some reason RayoBrowse could not use the Bright Data residential proxy in our setup. Every navigation through that proxy failed instantly, even though the same credentials worked fine through <code>curl</code> from inside the same container. The same RayoBrowse setup worked fine with Geonode. We did not isolate the root cause, so we report RayoBrowse on Geonode only.</p><p>The commercial side is more crowded. </p><p><a href="https://browser-use.com/">Browser Use</a> exists in two flavors, and we tested both. The cloud version is the managed Browser Use, with its own residential proxy, its own stealth fingerprinting, and a fixed set of supported models; we drove it once in raw CDP mode (we steer it ourselves with Playwright) and once in agent mode (we hand the LLM the task in natural language and let it plan the steps). </p><p><a href="https://www.browserbase.com/">Browserbase</a> is a managed Chromium with optional residential proxies, Cloudflare Web Bot Auth verification, and the Stagehand agent SDK. 
We discovered during the test that the free tier excludes proxies entirely; without one, the session egresses from a US datacenter. We left this configuration in the test because it is what a free user would experience. </p><p><a href="https://www.browserless.io">Browserless</a> is a managed browser-as-a-service whose anti-bot story is a stealth path (<code>/chromium/stealth</code>) plus optional residential proxies for paid plans. The free plan caps sessions at 60 seconds, which is tight for a four-step flow. We tested it with the built-in residential proxy targeting France, and tried to test it with our external proxies via the <code>externalProxyServer</code> parameter; the external mode failed at connection time on every run, with the same Chromium-side authentication failure that broke RayoBrowse, so we dropped those configurations from the comparison. </p><p><a href="https://zenrows.com/">ZenRows</a> Scraping Browser is a managed Chromium with a built-in residential proxy network and built-in CAPTCHA solving; we connect via the WSS endpoint with <code>proxy_country=fr</code> to get a French exit point. </p><p><a href="https://brightdata.com/">Bright Data Browser API</a> sits at the other end of the same product category: a managed Chromium with built-in residential rotation and CAPTCHA solving, on a dedicated Browser API zone we configured on their dashboard.</p><p>As always, the code can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">103.BROWSERS</a>.</strong></p><h2>What we had to fix before the numbers made sense</h2>
      <p>
          <a href="https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Stop Paying for Bandwidth: How to Leverage IPv6 Subnets for Infinite Proxy Rotation]]></title><description><![CDATA[Escape metered residential proxy billing. Discover how to build a self-hosted, rotating proxy gateway using IPv6 /64 subnets to drastically cut your web scraping costs at scale.]]></description><link>https://substack.thewebscraping.club/p/use-ipv6-scraping-nyxproxy</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/use-ipv6-scraping-nyxproxy</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 26 Apr 2026 20:30:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/21b6b18a-a1f6-4511-aec6-c5fc9ba435cd_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p style="text-align: justify;">When your data extraction pipelines scale from a few thousand requests a day to thousands of requests per second, the bottleneck becomes network egress and IP reputation. Modern web architectures are defended by sophisticated Web Application Firewalls (WAFs) that deploy strict rate limiting, fingerprinting, and behavioral analysis.</p><p style="text-align: justify;">This means that if you route all your traffic through a single egress IP, you will be rate-limited in seconds and blacklisted in minutes. To survive at scale, you need to distribute your requests across a massive pool of IP addresses.</p><p style="text-align: justify;">Traditionally, the web scraping industry has solved this problem by relying on commercial proxy providers. However, this is not the only approach. This article answers the following question: &#8220;<em>Is there a way to scrape at scale without burning budget on proxies</em>?&#8221;</p><p style="text-align: justify;">The answer is yes. But let&#8217;s be clear from the beginning: this approach is not a universal silver bullet. 
Let&#8217;s see how it works, how to build it, and what its limitations are.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>The Typical Solution for Scraping at Scale: Proxy Provider Services</h2><p style="text-align: justify;">Let&#8217;s start this discussion with the typical choice for scraping at scale. 
IP bans and rate limits are the #1 operational problem in scraping, especially at scale. The typical solution is to route traffic through proxy servers, for a simple reason: <a href="https://substack.thewebscraping.club/i/164246773/what-are-proxies-and-why-are-they-used">proxies act as intermediaries between your scrapers and the Internet</a>, which helps keep your scrapers from getting banned. To do so, companies buy proxy IPs from proxy providers. The two most common categories, each with its own flaws, are the following:</p><ul><li><p style="text-align: justify;"><strong>Datacenter proxies:</strong> These are cheap and fast, but their ASNs (Autonomous System Numbers) are heavily scrutinized. WAFs maintain databases of known datacenter CIDR (Classless Inter-Domain Routing) blocks, so hitting a target with a static list of 100 datacenter proxies usually results in those IPs being flagged and blocked within hours.</p></li><li><p style="text-align: justify;"><strong>Residential proxies:</strong> These route traffic through actual consumer devices. They have highly trusted IP reputations, making them excellent for bypassing anti-bot systems. However, they are priced by bandwidth, which makes them very expensive when scraping at scale.</p></li></ul><p style="text-align: justify;">The main limitation of this approach is its cost. So, what if you need to scrape at scale but don&#8217;t have the budget for it?</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>An Alternative Approach: Scraping at Scale With Dedicated Infrastructure</h2><p style="text-align: justify;">To escape metered billing, you can move egress back to dedicated infrastructure. But before presenting the solution, let&#8217;s briefly look at what happens at the infrastructure level when you buy and use proxies.</p><h3>Buying Proxies Means Delegating Your Infrastructure</h3><p style="text-align: justify;">When you buy proxies from providers, you are delegating 100% of your infrastructure. When your scrapers make requests, they connect under the hood to a gateway, which is a massive load balancer controlled entirely by the provider.</p><p style="text-align: justify;">Let&#8217;s consider the case of residential proxies, for simplicity. Behind the gateway is a peer-to-peer (P2P) network of millions of consumer devices from which the provider has acquired bandwidth. When your request hits the gateway, <strong>their proprietary routing algorithm decides which consumer device in which country will act as your final exit node</strong>.</p><p style="text-align: justify;">The moment you route traffic through their gateway is the moment you delegate 100% of your scraping infrastructure.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h3>NyxProxy: The Infrastructural Solution</h3><p style="text-align: justify;"><a href="https://github.com/phanes-io/nyxproxy-oss?tab=readme-ov-file">NyxProxy</a> is a self-hosted HTTP/SOCKS5 proxy server that exploits a well-known IPv6 networking trick: When a cloud provider gives you a <em>/64</em> subnet, you control roughly 18.4 <em>quintillion</em> IPv6 addresses.</p><p style="text-align: justify;">Let&#8217;s unpack that number and the trick behind it. An IPv6 address looks like this:</p><pre><code><code>2a05:f480:1800:25db:0000:0000:0000:0001</code></code></pre><p style="text-align: justify;">IPv6 addresses are 128 bits long, which gives <em>2^128</em> possible addresses. The number is so large that the designers essentially said: <em>&#8220;We can afford to give every organization a massive block and never worry about running out.&#8221;</em></p><p style="text-align: justify;">Now, here is the trick. An IPv6 address is split into two halves, 64 bits each:</p><pre><code><code>2a05:f480:1800:25db : 0000:0000:0000:0001
|___________________|   |_________________|
   Network prefix            Host part
   (your subnet)          (you control this)</code></code></pre><p style="text-align: justify;">The <em>/64</em> notation means that the first 64 bits identify the network, while the last 64 bits are yours to assign however you want. The last 64 bits can be any value from <em>0000:0000:0000:0000</em> to <em>ffff:ffff:ffff:ffff</em>. That&#8217;s <em>2^64</em>, or about 18.4 quintillion combinations, all valid addresses and all routable to your server.</p><p style="text-align: justify;">Thanks to this trick, NyxProxy can assign a pool of those addresses to your network interface at startup, then rotate your outgoing traffic across them. This means each request can leave from a different source IP. The tool handles pool management, background rotation, and NDP proxying via <em>ndppd</em>, and it exposes a monitoring endpoint.</p><p style="text-align: justify;">The key piece is the NDP proxying. When your server uses a random address like <em>2a05:f480:1800:25db:a3f1:9922:beef:1234</em> as a source IP, the upstream router needs to know that <em>your server is responsible for that address</em>. Otherwise, the response packets have nowhere to go.</p><p style="text-align: justify;">IPv6 uses NDP (Neighbor Discovery Protocol) for this. The router sends an NDP query, <em>&#8220;who has 2a05:f480:1800:25db:a3f1:9922:beef:1234?&#8221;</em>, and your server must answer.</p><p style="text-align: justify;"><em><a href="https://github.com/DanielAdolfsson/ndppd">ndppd</a></em> (NDP Proxy Daemon) runs on your server and answers those queries automatically for your entire /64 subnet, essentially saying <em>&#8220;yes, all of those addresses are mine&#8221;</em>. 
Without it, your packets go out, but responses never come back.</p><p style="text-align: justify;">Below is a summary schema of how this whole process works:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;ac241add-8e8d-40d0-a7df-518bccfc20bc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Provider gives you:  2a05:f480:1800:25db::/64
                     &#8595;
Your server can use: 2a05:f480:1800:25db:[anything]
                     &#8595;
NyxProxy assigns 200 random IPs to your interface
                     &#8595;
Each outgoing request binds to a different one
                     &#8595;
Target sees 200 different source IPs
                     &#8595;
ndppd makes sure responses route back correctly</code></pre></div><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>How To Use NyxProxy</h2><p>Let&#8217;s now see how to use NyxProxy with a practical implementation.</p><h3>Environment Setup &amp; Prerequisites</h3><p style="text-align: justify;">To follow this tutorial, deploy NyxProxy, and use it in your scraping scripts, you need to meet the following system and hardware requirements:</p><ul><li><p style="text-align: justify;"><strong>Hardware</strong>: A Virtual Private Server (VPS) or bare-metal server with at least 512 MB of RAM and 100 MB of disk space. Supported architectures are <em>amd64</em> and <em>arm64</em>.</p></li><li><p style="text-align: justify;"><strong>Subnet</strong>: A cloud provider that natively delegates a full IPv6 <em>/64</em> subnet to your network interface. 
Note that not all VPS providers are supported. Check out the <a href="https://github.com/phanes-io/nyxproxy-oss?tab=readme-ov-file#network-requirements">NyxProxy documentation to learn more about supported providers</a>.</p></li><li><p style="text-align: justify;"><strong>Operating system</strong>: A modern Linux distribution, specifically Ubuntu or Debian, to ensure compatibility with the automated setup scripts and <em>sysctl</em> kernel modifications.</p></li><li><p style="text-align: justify;"><strong>Python</strong>: <a href="https://www.python.org/downloads/">Python 3.7 or higher</a> installed on your local machine to run the scraping scripts.</p></li></ul><p style="text-align: justify;">To get your server ready to run the proxy daemon, you need to verify your IPv6 setup and have root access. Ensure you are logged into your VPS via SSH as the <em>root</em> user, or have <em>sudo</em> privileges.</p><p style="text-align: justify;">First, verify that your server has a globally routable IPv6 <em>/64</em> subnet assigned to it. You can check this by running the following command in your server&#8217;s terminal:</p><pre><code><code>ip -6 addr show | grep "scope global"</code></code></pre><p>If the subnet is configured correctly, you should see output similar to the following:</p><pre><code><code>inet6 2a05:f480:1800:25db::1/64 scope global</code></code></pre><p>If you do not see a <em>/64</em> subnet, you will not be able to rotate IPs, and you must review your cloud provider&#8217;s network settings.</p><p>Next, prepare your local development environment. Suppose you call the main folder of your Python project <em>nyxproxy_scraper/</em>. At the end of this step, the folder will have the following structure:</p><pre><code><code>nyxproxy_scraper/
    &#9500;&#9472;&#9472; main.py
    &#9492;&#9472;&#9472; venv/</code></code></pre><p>Where:</p><ul><li><p><em>main.py</em> is the Python file that will store your proxy request logic.</p></li><li><p><em>venv/</em> contains the standard Python virtual environment.</p></li></ul><p>You can create the <em>venv/</em> <a href="https://docs.python.org/3/library/venv.html">virtual environment</a> directory like so:</p><pre><code><code>python -m venv venv</code></code></pre><p>To activate it, on Windows, run:</p><pre><code><code>venv\Scripts\activate</code></code></pre><p>Equivalently, on macOS and Linux, execute:</p><pre><code><code>source venv/bin/activate</code></code></pre><p>As a final prerequisite, install the <a href="https://requests.readthedocs.io/en/latest/">Requests library</a> in your activated virtual environment so your Python script can make HTTP calls:</p><pre><code><code>pip install requests</code></code></pre><p>Well done! You are now ready to test and use NyxProxy.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3><strong>Installing and Configuring NyxProxy</strong></h3><p style="text-align: justify;">NyxProxy provides a quick setup script that handles the infrastructural heavy lifting. 
It auto-detects your network interface, installs <em>ndppd</em>, tweaks the Linux kernel parameters via <em>sysctl</em> to allow non-local binding, and downloads the compiled Go binary.</p><p style="text-align: justify;">You can launch it with the following single command:</p><pre><code><code>wget https://raw.githubusercontent.com/jannik-schroeder/nyxproxy-oss/main/scripts/quick-setup.sh &amp;&amp; chmod +x quick-setup.sh &amp;&amp; sudo ./quick-setup.sh</code></code></pre><p style="text-align: justify;">During the setup, you will be prompted to configure your proxy credentials and set your rotation rules. Behind the scenes, the script generates a <em>config.yaml</em> file. Let&#8217;s look at the crucial subset of that configuration:</p><pre><code><code>network:
  rotate_ipv6: true
  ipv6_subnet: "2a05:f480:1800:25db::/64"

  # The rotation mechanics:
  ipv6_pool_size: 200
  ipv6_max_usage: 100
  ipv6_max_age: 30</code></code></pre><p style="text-align: justify;">Below is an explanation of what these three parameters mean for your scraping pipeline:</p><ul><li><p style="text-align: justify;"><em>ipv6_pool_size</em>: NyxProxy keeps 200 unique IPs &#8220;hot&#8221; and bound to your network interface at any given time. This keeps proxy startup times under 100 ms while maintaining IP diversity.</p></li><li><p style="text-align: justify;"><em>ipv6_max_usage</em>: After a specific IP has been used for 100 requests, it is considered &#8220;burned.&#8221; NyxProxy destroys the route and spins up a fresh address to replace it.</p></li><li><p style="text-align: justify;"><em>ipv6_max_age</em>: If an IP hasn&#8217;t hit 100 requests but has been alive for 30 minutes, it gets forcefully rotated out. This helps limit time-based tracking by the target WAF.</p></li></ul><p style="text-align: justify;">Once the daemon is running as a systemd service, your VPS is officially acting as a rotating proxy gateway. When NyxProxy receives a scraper request, the underlying Go binary takes over. It picks one of the 200 rotating IPv6 addresses from its in-memory pool and binds to that specific address to establish the outbound connection.</p><p>On startup, the expected log output is as follows:</p><pre><code><code>IPv6 rotation mode: IP Pool with dynamic rotation
  Interface: enp1s0
  Subnet: 2a05:f480:1800:25db::/64
  Pool size: 200 IPs
  Rotation: Every 100 uses or 30m0s
  Initializing IP pool...
  Progress: 50/200 IPs added
  Progress: 100/200 IPs added
  Progress: 150/200 IPs added
  Progress: 200/200 IPs added
  IP pool ready with 200 addresses
  Background IP rotation started

Starting https proxy on 0.0.0.0:8080 (Protocol: IPv6)</code></code></pre><div><hr></div><h3><strong>Testing the Proxy Logic</strong></h3><p style="text-align: justify;">At this point, NyxProxy has done its job. To verify it works correctly, you can use the following Python script that hits <em><a href="https://www.ipify.org/">api6.ipify.org</a></em>, which is an API that simply bounces back the IP address it sees:</p><pre><code><code>import requests

# Point this to your VPS IP and the credentials you set during setup
proxies = {
    'http': 'http://admin:password@your-vps-ip:8080',
    'https': 'http://admin:password@your-vps-ip:8080'
}

# Test 5 consecutive scraping requests
for i in range(5):
    response = requests.get('https://api6.ipify.org', proxies=proxies, timeout=10)
    print(f"Request {i+1}: Target sees IP -&gt; {response.text}")
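
# Optional sanity check (an illustrative sketch, not part of NyxProxy or its docs):
# confirm that an address echoed back by the API belongs to your delegated /64.
# EXAMPLE_SUBNET is the sample prefix used in this article (replace it with yours),
# and in_example_subnet is a hypothetical helper name.
import ipaddress

EXAMPLE_SUBNET = ipaddress.ip_network("2a05:f480:1800:25db::/64")

def in_example_subnet(address):
    """Return True if the textual IP address falls inside EXAMPLE_SUBNET."""
    return ipaddress.ip_address(address.strip()) in EXAMPLE_SUBNET

# Inside the loop above, you could call in_example_subnet(response.text)
print(in_example_subnet("2a05:f480:1800:25db:1a2b:3c4d:5e6f:7890"))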
</code></code></pre><p style="text-align: justify;">(NOTE: If you are already familiar with ipify.org, note that the &#8220;api6&#8221; prefix can be used for IPv6 requests only.)</p><p>The result should be similar to the following:</p><pre><code><code>Request 1: Target sees IP -&gt; 2a05:f480:1800:25db:1a2b:3c4d:5e6f:7890
Request 2: Target sees IP -&gt; 2a05:f480:1800:25db:9988:7766:5544:3322
Request 3: Target sees IP -&gt; 2a05:f480:1800:25db:aaaa:bbbb:cccc:dddd
Request 4: Target sees IP -&gt; 2a05:f480:1800:25db:1122:3344:5566:7788
Request 5: Target sees IP -&gt; 2a05:f480:1800:25db:dead:beef:cafe:babe</code></code></pre><p style="text-align: justify;">This shows that each HTTP request uses a different, globally routable IPv6 address generated from your subnet block. To a target server that only compares full addresses, these look like distinct clients.</p><p style="text-align: justify;">Perfect! You have successfully built a self-healing, infinitely rotating proxy pool without handing over your budget for metered residential bandwidth.</p><h2>The Illusion of Infinity: Critical Limitations of IPv6 Subnet Rotation</h2><p style="text-align: justify;">At this point, you may think you have found a solution to all of your budgeting problems for scraping at scale. But before you tear down your commercial proxy infrastructure, you must understand that a $5/month VPS and an open-source rotation daemon are not a universal silver bullet. If it were that simple, the commercial proxy industry would not exist.</p><p>This architecture has the following main limitations:</p><ul><li><p style="text-align: justify;"><strong>The IPv4 compatibility wall:</strong> This entire architecture is built on one absolute prerequisite: Your target endpoint must support IPv6. If you are scraping legacy enterprise systems or platforms that haven&#8217;t migrated to dual-stack networking, this setup is useless. You cannot route an IPv6 packet to an IPv4-only server.</p></li><li><p style="text-align: justify;"><strong>Subnet-level bans (</strong><em><strong>/64</strong></em><strong> prefix blocking):</strong> Enterprise WAFs are fully aware of IPv6 prefix delegation standards. They know that hosting providers allocate a <em>/64</em> subnet to a single client. 
If their heuristics detect highly concurrent behavioral patterns (like missing <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">browser fingerprints</a> or anomalous TLS handshakes) originating from <em>2a05:f480...:1a2b</em>, they will ban the entire <em>/64</em> CIDR block. Once your <em>/64</em> prefix is banned, all 18 quintillion of your &#8220;infinite&#8221; IPs are simultaneously dead. To recover, you must physically destroy the VPS and provision a new one in a different IP range.</p></li><li><p style="text-align: justify;"><strong>ASN reputation:</strong> No matter how many IPs you rotate, your traffic still originates from a Datacenter Autonomous System Number (ASN). Target firewalls assign a baseline trust score to every ASN. Traffic originating from a Datacenter ASN always starts with a highly degraded trust score compared to a Residential ASN. For highly restrictive targets, any request from a datacenter IP is instantly met with an unpassable CAPTCHA or a hard <em>403 Forbidden</em>, regardless of whether it&#8217;s IPv4 or IPv6.</p></li><li><p style="text-align: justify;"><em>nf_conntrack</em><strong> and hardware exhaustion:</strong> You cannot push enterprise-grade throughput on a $5, 1-vCPU server without consequence. Rotating thousands of IPv6 addresses requires the Linux kernel to aggressively maintain the <em><a href="https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.txt">nf_conntrack</a></em> table and the NDP proxy table. At high concurrencies, the overhead of establishing, tracking, and tearing down thousands of TCP sockets across rotating interfaces will exhaust the memory or CPU of a low-tier VPS. 
The kernel will begin dropping packets natively, your latency will spike to useless levels, and your scrapers will be greeted with errors.</p></li></ul><h2>Conclusion</h2><p style="text-align: justify;">In this article, you learned how to leverage your hosting provider&#8217;s IPv6 <em>/64</em> subnets to build an infinitely rotating proxy pool with NyxProxy, escaping the metered billing of residential proxy networks.</p><p style="text-align: justify;">The competitive advantage of engineering your own proxy infrastructure is in your unit economics and architectural control. However, you also learned that this solution is not a universal silver bullet for every scraping scenario: It comes with trade-offs and constraints.</p><p style="text-align: justify;">So, let us know: Have you already experimented with bare-metal IPv6 rotation for your scraping pipelines? What targets did it work best for? Let&#8217;s discuss in the comments!</p>]]></content:encoded></item><item><title><![CDATA[The Trick to Scrape Next.js Websites in Seconds]]></title><description><![CDATA[Scraping data from the most widely used full-stack framework in the world with just 3 lines of code!]]></description><link>https://substack.thewebscraping.club/p/scrape-nextjs-websites</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/scrape-nextjs-websites</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 19 Apr 2026 19:18:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/17ee7337-9a3d-445a-a255-2895a6ed8235_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Next.js is one of the most widely adopted full-stack JavaScript frameworks on the planet. If you&#8217;ve ever built or deployed a web app, you definitely know it&#8212;or at least you&#8217;ve heard of it.</p><p>Behind the scenes, it relies on hydration to make server-rendered pages interactive. 
And here&#8217;s the interesting part: the same mechanism that makes Next.js fast and popular also exposes a significant amount of structured data in the HTML sent by the server. From a scraping perspective, that&#8217;s a huge opportunity!</p><p>In this post, I&#8217;ll show you a simple trick to scrape data from virtually any Next.js website. Follow along as I break down how it works and how you can apply it yourself.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Next.js in Numbers</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1F7B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1F7B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 424w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 848w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1272w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png" width="1456" height="1065" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1065,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Next.js&#8217; GitHub star growth&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Next.js&#8217; GitHub star growth" title="Next.js&#8217; GitHub star growth" srcset="https://substackcdn.com/image/fetch/$s_!1F7B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 424w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 848w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1272w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1456w" sizes="100vw"></picture>
</div></a><figcaption class="image-caption">Next.js&#8217; GitHub star growth</figcaption></figure></div><p>Next.js needs no introduction, but it&#8217;s worth giving some context to truly understand how popular it is (<em>and therefore how useful the trick I&#8217;m about to present for Next.js web scraping can be</em>):</p><ul><li><p>According to the <a href="https://survey.stackoverflow.co/2025/">2025 Stack Overflow Developer Survey</a>, 20.8% of respondents used Next.js extensively over the past year.</p></li><li><p>Next.js is the 14th largest repository on GitHub, with <a href="https://github.com/vercel/next.js">over 138k stars</a> (and still growing!).</p></li><li><p><a 
href="https://w3techs.com/technologies/overview/javascript_library">According to W3Techs</a>, Next.js has a 2.9% market share among JavaScript libraries.</p></li><li><p>Major brands such as <a href="https://nextjs.org/showcase">Nike, Stripe, and Notion have chosen this full-stack framework</a> to build their official websites.</p></li></ul><h2>Before Getting Started: A Bit of Context on Hydration</h2><p>I know you probably just want the trick&#8230; Still, let me take a minute to explain why it works in the first place, why it&#8217;s even possible, and what kind of data you&#8217;ll actually retrieve with it!</p>
<h3>What Is Hydration?</h3><p><a href="https://en.wikipedia.org/wiki/Hydration_(web_development)">Hydration</a> is the process that makes a server-rendered page interactive in the browser.</p><p>Frameworks like Next.js, Remix, Nuxt, and SvelteKit employ this mechanism to combine the performance benefits of <a href="https://nextjs.org/docs/pages/building-your-application/rendering/server-side-rendering">server-side rendering (SSR)</a> with the interactivity of client-side applications.</p><p>The idea is that the server first sends fully rendered static HTML to the browser. 
Hydration comes next.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jt2I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 424w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 848w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1272w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png" width="1227" height="633" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:633,&quot;width&quot;:1227,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)" title="The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)" srcset="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 424w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 848w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)</figcaption></figure></div><p>The browser downloads the JavaScript bundle, and the frontend framework reconstructs the component tree in memory, attaches event 
listeners, and links that virtual tree to the existing DOM instead of re-rendering it from scratch. The result is a fully interactive application built on top of server-rendered HTML.</p><h3>How Does the Hydration Mechanism Work?</h3><p>It&#8217;s now clear that in Next.js and similar frameworks, hydration is the process where a static, server-rendered HTML page &#8220;comes to life&#8221; and becomes fully interactive in the browser. But what&#8217;s actually happening under the hood?</p><p>At a high level, hydration is a 3-step process:</p><ol><li><p>The server generates and sends a fully rendered HTML snapshot. The user immediately sees the content (great for <a href="https://web.dev/articles/fcp">First Contentful Paint</a>). At this point, though, the page is just static HTML. Buttons, forms, and other interactive elements are visible, but they don&#8217;t work yet because no JavaScript is attached.</p></li><li><p>The client&#8217;s browser downloads the JavaScript bundle (which includes React and your frontend application code) and executes it.</p></li><li><p>React rebuilds the component tree in memory and attaches event listeners to the existing DOM nodes. Instead of discarding the HTML and re-rendering everything from scratch, React &#8220;hydrates&#8221; the existing markup, meaning it reuses it and wires it up with state and interactivity.</p></li></ol><p>Once hydration completes, the page behaves like a normal single-page application: it responds to clicks, manages state, and updates dynamically.</p><p>And here&#8217;s an important detail: if the browser doesn&#8217;t support JavaScript (or it fails to load), the user still sees the server-rendered HTML. It won&#8217;t be interactive, but the core content is there. 
That&#8217;s great for SEO and perceived performance!</p><h3>Why It Matters for Scraping Next.js (and Other Full-Stack Frameworks&#8230;)</h3><p>The key insight you need to understand is simple: <strong>hydration requires data</strong>, and that data must be embedded somewhere in the HTML sent by the server!</p><p>In Next.js, when the server renders a page, it doesn&#8217;t only send markup. It also serializes the data required to rebuild the React component tree on the client. That serialized payload is embedded directly into the page&#8217;s HTML.</p><p>That&#8217;s exactly why hydration matters for scraping. Instead of parsing the DOM or simulating user interactions through <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browser automation</a>, you can extract the structured data that React itself uses to hydrate the page.</p><p>In many cases, hydration data is cleaner and easier to parse than the rendered HTML. It can also contain more information than what&#8217;s visibly displayed on the page, including hidden and interesting metadata.</p><p>Keep in mind that this principle applies not only to Next.js! All other full-stack frameworks that rely on hydration, such as Remix, Nuxt, Angular Universal, and SvelteKit, tend to dehydrate state on the server and rehydrate it on the client.</p><p>So remember this simple rule. If a framework hydrates, it must serialize data. 
And if it serializes data into the HTML, you can scrape it.</p><h2>How to Scrape Next.js Websites: 2 Approaches</h2><p>How you scrape Next.js hydration data depends on how that data is embedded in the server-generated HTML.</p><p>I won&#8217;t go too deep into framework internals here (if you&#8217;re a Next.js dev, you already know things shift depending on whether you&#8217;re using the <em><a href="https://nextjs.org/docs/app/getting-started">App Router</a></em> or the <em><a href="https://nextjs.org/docs/pages/getting-started">Pages Router</a></em>), but there are essentially two scenarios you&#8217;ll run into.</p><p>In this section, I&#8217;ll walk through both of them and show you exactly how I retrieve data from each!</p><h3>Approach #1: Target the __NEXT_DATA__ Script</h3><p>As the target, I&#8217;ll use a <a href="https://www.nike.com/t/air-jordan-5-retro-wolf-grey-mens-shoes-0M9kM1yX/DD0587-002">Nike product page</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uJcE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!uJcE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 424w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 848w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png" width="1456" height="785" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target Nike page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target Nike page" title="The target Nike page" 
srcset="https://substackcdn.com/image/fetch/$s_!uJcE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 424w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 848w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">The target Nike page</figcaption></figure></div><p>That&#8217;s actually a great example because Nike.com is even showcased on the Next.js homepage as a real-world site built with the framework.</p><p>Now, right-click on the page and select the &#8220;Inspect&#8221; option in your browser to open the DevTools. Scroll through the DOM and get familiar with the page structure. If the Next.js site is using the <em>Pages Router</em>, you&#8217;ll notice a <em>&lt;script&gt;</em> tag with the id <em>__NEXT_DATA__</em> containing a large JSON blob:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rV1e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rV1e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 848w, 
https://substackcdn.com/image/fetch/$s_!rV1e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png" width="1456" height="1260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1260,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the JSON data inside the #__NEXT_DATA__ element&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the JSON data inside the #__NEXT_DATA__ element" title="Note the JSON data inside the #__NEXT_DATA__ element" srcset="https://substackcdn.com/image/fetch/$s_!rV1e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 848w, 
https://substackcdn.com/image/fetch/$s_!rV1e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the JSON data inside the #__NEXT_DATA__ element</figcaption></figure></div><p>That JSON data is precisely the 
hydration data I was referring to earlier.</p><p>When a site uses the Pages Router approach in Next.js, the server embeds all the page data directly into that <em>&lt;script&gt;</em> tag. From a scraping perspective, that&#8217;s gold, as the data is already structured and ready to be captured.</p><p>Here&#8217;s a simple JavaScript snippet to extract it:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const hydrationScript = document.querySelector("#__NEXT_DATA__")
const hydrationData = JSON.parse(hydrationScript.innerHTML)
console.log(hydrationData)</code></pre></div><p>What&#8217;s happening here is straightforward. The JS script:</p><ul><li><p>Selects the <em>&lt;script&gt;</em> element with <em>id</em> <em>__NEXT_DATA__</em>.</p></li><li><p>Reads its inner HTML (which is a JSON string).</p></li><li><p>Parses it into a JavaScript object.</p></li><li><p>Logs it to the console.</p></li></ul><p>Run this directly in the DevTools Console, and you&#8217;ll immediately see the result:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2AK7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2AK7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 848w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png" width="1456" height="1260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dc28944-8842-4605-be59-b746fef469db_1746x1511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1260,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2AK7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 848w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Note the structured hydration data</figcaption></figure></div><p>What&#8217;s interesting is how much structured data you get right away. This includes product details, images, metadata, and more. 
Everything is neatly organized, and it only took three lines of code!</p><p>If you want to store the JSON hydration object, just right-click the object in the Console and select the &#8220;Copy object&#8221; option:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m1uv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m1uv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 424w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 848w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1272w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png" width="1456" height="483" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Selecting the &#8220;Copy object&#8221; option&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Selecting the &#8220;Copy object&#8221; option" title="Selecting the &#8220;Copy object&#8221; option" srcset="https://substackcdn.com/image/fetch/$s_!m1uv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 424w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 848w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1272w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Selecting the &#8220;Copy object&#8221; option</figcaption></figure></div><p>From there, you can paste it wherever you need (e.g., into a local <em>.json</em> file, a MongoDB collection, etc.).</p><div><hr></div><h3>Approach #2: Target the self.__next_f.push() Elements</h3><p>Another, more complex approach to scraping Next.js involves pages built with the <em>App Router</em>.</p><p>Even though the <em>App 
Router</em> has been the recommended direction for a while, in my experience, it&#8217;s still not as widely adopted as the <em>Pages Router</em>. And honestly, that&#8217;s a bit of a gift for us (as scraping hydration data in <em>App Router</em> sites is definitely more complex!)</p><p>As a reference, let&#8217;s look at the &#8220;<a href="https://openai.com/business/">Business Overview</a>&#8221; page on the OpenAI website, which is built with Next.js <em>App Router</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CAEI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CAEI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 424w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 848w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1272w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png" width="1456" height="709" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:709,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target page" title="The target page" srcset="https://substackcdn.com/image/fetch/$s_!CAEI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 424w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 848w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1272w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The target page</figcaption></figure></div><p>Just like before, open DevTools and inspect the page. 
This time, focus on the <em>&lt;script&gt;</em> tags inside the <em>&lt;body&gt;</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LTkB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LTkB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png" width="1456" height="1182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the hydration script elements&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the hydration script elements" title="Note the hydration script elements" srcset="https://substackcdn.com/image/fetch/$s_!LTkB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the hydration script elements</figcaption></figure></div><p>You&#8217;ll notice several scripts containing content like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">self.__next_f.push(&lt;some_data&gt;)</code></pre></div><p>That &#8220;<em>&lt;some_data&gt;</em>&#8221; is serialized using the <a href="https://tonyalicea.dev/blog/understanding-react-server-components/">React Flight protocol for React Server Components (RSC)</a>. I won&#8217;t go too deep into the internals here (it&#8217;s a dense topic!), but what matters is that <strong>deserializing that data is </strong><em><strong>not</strong></em><strong> straightforward!</strong></p><p>React Flight isn&#8217;t plain JSON. 
It mixes control records (<em>HL</em>, <em>I</em>, <em>J</em>, etc.), module references, streaming boundaries, and serialized model fragments into a transport format that React incrementally resolves at runtime.</p><p>You might think: &#8220;Why not just reuse the frontend deserialization library?&#8221; In practice, that doesn&#8217;t work well because:</p><ul><li><p>The client decoder (<em><a href="https://www.npmjs.com/package/react-server-dom-webpack">react-server-dom-webpack</a></em>) expects a full React runtime.</p></li><li><p>It relies on module maps and webpack IDs generated at build time.</p></li><li><p>It resolves component references against the exact bundle that produced the stream.</p></li><li><p>It assumes streaming semantics and internal React wiring.</p></li></ul><p>Basically, outside that exact environment, you don&#8217;t have the module graph, build manifest, or hydration context. So even if you import the decoder, you can&#8217;t reconstruct the component tree the way the browser does.</p><p>There have been recent security issues in the React Flight payload deserialization system, highlighting just how sensitive and complex this layer is. For more details, refer to:</p><ul><li><p><em><a href="https://nextjs.org/blog/CVE-2025-66478">Security Advisory: CVE-2025-66478</a></em></p></li><li><p><em><a href="https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components">Critical Security Vulnerability in React Server Components</a></em></p></li></ul><p>So, instead of fighting the protocol, I&#8217;d keep things simple in this case and extract the raw, unparsed React Flight strings. You can do that with the JavaScript snippet below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const nextFlightScripts = [...document.querySelectorAll("script")]
  .filter(script =&gt; script.textContent.includes("self.__next_f"))
  .map(script =&gt; script.textContent.trim())
console.log(nextFlightScripts)</code></pre></div><p>This selects all <em>&lt;script&gt;</em> elements containing &#8220;self.__next_f&#8221; and builds an array of their raw contents.</p><p>Run it in the Console, and you&#8217;ll get an array of React Flight chunks:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LBAG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LBAG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png" width="1456" height="1182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the React Flight strings&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the React Flight strings" title="Note the React Flight strings" srcset="https://substackcdn.com/image/fetch/$s_!LBAG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the React Flight strings</figcaption></figure></div><p>From there, the simplest way to extract structured data is often to copy the array, feed it to an AI, and ask it to reconstruct a parsed JSON representation of the meaningful payload sections:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!08ee!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!08ee!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 424w, https://substackcdn.com/image/fetch/$s_!08ee!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 848w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1272w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png" width="1456" height="973" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:973,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the parsed version of the source data produced by Gemini&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the parsed version of the source data produced by Gemini" title="Note the parsed version of the source 
data produced by Gemini" srcset="https://substackcdn.com/image/fetch/$s_!08ee!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 424w, https://substackcdn.com/image/fetch/$s_!08ee!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 848w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1272w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the parsed version of the source data produced by Gemini</figcaption></figure></div><p>Is this more complicated than the <em>__NEXT_DATA__</em> trick? Absolutely! Yet, it&#8217;s still a powerful way to access a large amount of page data with just a few lines of code.</p><h2>Final Script to Quickly Access Data From Next.js Sites</h2><p>If you combine the two approaches, you get a single script for brute-force hydration data scraping on Next.js sites:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// Pages Router approach (__NEXT_DATA__)
const hydrationScript = document.querySelector("#__NEXT_DATA__")
let nextData = null
if (hydrationScript) {
  try {
    nextData = JSON.parse(hydrationScript.textContent)
    console.log("__NEXT_DATA__ found:")
    console.log(nextData)
  } catch (err) {
    console.warn("Failed to parse __NEXT_DATA__:", err)
  }
} else {
  console.log("No __NEXT_DATA__ script found.")
}

// App Router approach (self.__next_f)
const nextFlightScripts = [...document.querySelectorAll("script")]
  .map(script =&gt; script.textContent.trim())
  .filter(content =&gt; content.includes("self.__next_f.push"))
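
// Optional sketch (an addition, not part of the original script). The helper
// name extractFlightPayload is hypothetical. It assumes each script has the
// shape self.__next_f.push([type, "payload"]) and that the pushed array literal
// is valid JSON, which may not hold for every Next.js version. It returns the
// still-encoded React Flight string, or null when that assumption fails.
function extractFlightPayload(scriptContent) {
  // Take everything between the first "(" and the last ")" of the push call
  const start = scriptContent.indexOf("(")
  const end = scriptContent.lastIndexOf(")")
  if (start === -1 || end === -1) return null
  try {
    const chunk = JSON.parse(scriptContent.slice(start + 1, end))
    if (!Array.isArray(chunk)) return null
    return typeof chunk[1] === "string" ? chunk[1] : null
  } catch (err) {
    // Not JSON-compatible: leave the raw script text as the fallback
    return null
  }
}
// Possible usage: console.log(nextFlightScripts.map(extractFlightPayload))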

if (nextFlightScripts.length &gt; 0) {
  console.log("React Flight scripts found:")
  console.log(nextFlightScripts)
} else {
  console.log("No React Flight scripts found.")
}</code></pre></div><p>To test it, just open the Console in DevTools, paste the script, and run it.</p><p><strong>Important</strong>: The <em>&lt;script&gt;</em> elements containing hydration data aren&#8217;t loaded dynamically via client-side rendering. They&#8217;re embedded directly in the HTML generated by the server.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Km-I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Km-I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png" width="1456" height="865" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Km-I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the #__NEXT_DATA__ element in the page source</figcaption></figure></div><p>That means you can:</p><ol><li><p>Fetch the target Next.js-powered page with an HTTP client.</p></li><li><p>Parse the HTML using an HTML parsing library like Beautiful Soup or Cheerio.</p></li><li><p>Apply a similar version of the JavaScript script above, but adapt it to the API provided by your HTML parser.</p></li></ol><p>In other words, this trick for scraping Next.js doesn&#8217;t only work in the browser DevTools. 
It also works perfectly in regular scraping scripts!</p><h2>Pros and Cons of This Approach to Next.js Scraping</h2><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Simple and effective, requiring only a few lines of code.</p></li><li><p>Works on all Next.js websites (and, more generally, on most sites that rely on hydration).</p></li><li><p>Can let you access more data than what is actually displayed on the page.</p></li><li><p>No need for browser automation, waiting for client-side rendering, or simulating user interactions.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>You may only get partial data, meaning you might still need to complement it with a more traditional scraping approach.</p></li><li><p>React Flight data is difficult to parse and may require custom logic or even <a href="https://substack.thewebscraping.club/p/llms-ai-web-scraping">AI-assisted parsing</a>.</p></li></ul><div><hr></div><h2>Conclusion</h2><p>Here, I&#8217;ve shared <a href="https://brightdata.com/blog/how-tos/web-scraping-with-next-js">a trick I personally documented years ago</a>, and that still works to this day. 
It allows you to quickly scrape data from virtually any Next.js site by targeting the hydration data embedded in the HTML document generated by the server and sent to the client for rendering.</p><p>As you&#8217;ve seen, with just a few lines of JavaScript, you can extract hydration data from any Next.js-powered page. What you get back is clean, or at least almost clean, data that you can process directly in your data pipelines.</p><p>Instead of fighting the frontend, this Next.js web scraping approach helps you leverage the data the framework itself needs to function!</p><p>I hope you found this useful and insightful. If you have questions or thoughts, feel free to share them in the comments below!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #102: How Fast Can You Call Polymarket's APIs?]]></title><description><![CDATA[Three languages, four locations, 1,000 requests. The biggest speed gain has nothing to do with code.]]></description><link>https://substack.thewebscraping.club/p/how-to-get-data-from-polymarket-fast</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-get-data-from-polymarket-fast</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 16 Apr 2026 14:08:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/dd002f1e-6fe6-4cde-8c7d-1fdaa94d11d3_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There is a platform name that has been bouncing around the news for over a year now. A new military action? Someone predicted it on Polymarket. An event that moves the price of oil or shakes a currency? Someone else, or maybe the same person, placed a bet a few hours before and walked away with a pile of money. Every time a headline breaks, Polymarket seems to have already priced it in, or worse, someone appears to have known in advance. 
<br>Even here on Substack, you can share the predictions coming from the platform.<br></p><div class="polymarket-embed" data-attrs="{&quot;eventSlug&quot;:&quot;claude-5-released-by&quot;,&quot;marketSlug&quot;:&quot;&quot;,&quot;profileName&quot;:&quot;&quot;,&quot;belowTheFold&quot;:false,&quot;fullEmbedUrl&quot;:&quot;https://substack.com/embed/polymarket/claude-5-released-by&quot;,&quot;isGraphMode&quot;:false}" data-component-name="PolymarketToDOM"></div><p><br>But what is Polymarket, and how does it work?<br></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What Polymarket is, and why it keeps making headlines</h2><p>Polymarket is a prediction market, a platform where you buy and sell shares tied to the outcome of real-world events. If the event happens, your share pays $1. If it doesn&#8217;t, it pays $0. The trading price at any moment reflects what the market collectively believes the probability of that outcome is. You can bet on elections, geopolitics, sports, crypto prices, and increasingly anything else with a verifiable resolution. </p><p>It is the largest prediction market by volume, built on the Polygon blockchain. Its main competitor, Kalshi, operates as a CFTC-regulated exchange in the US. Both are attracting billions in volume, and Wall Street firms are now building dedicated trading desks around them.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><p>The platform handled <a href="https://www.ccn.com/news/crypto/polymarket-7-5-billion-2025-prediction-markets/">at least $7.5 billion in volume during 2025</a> (a conservative figure, since <a href="https://www.paradigm.xyz/2025/12/polymarket-volume-is-being-double-counted">Polymarket volume is commonly double-counted</a> due to how OrderFilled events are summed) <a href="https://www.trmlabs.com/resources/blog/how-prediction-markets-scaled-to-usd-21b-in-monthly-volume-in-2026">and set a single-day record of $425 million in February 2026</a> when Iran-related markets resolved simultaneously. Those are not toy numbers. And with that kind of money flowing through, the headlines have followed.</p><p>In January 2026, a newly created Polymarket account invested $30,000 <a href="https://www.npr.org/2026/01/05/nx-s1-5667232/polymarket-maduro-bet-insider-trading">and walked away with $436,759</a> after correctly betting on Maduro&#8217;s removal from power. The account was created less than a week before the U.S. military operation, and the bulk of bids were placed hours before Trump&#8217;s announcement. In a separate case, <a href="https://www.haaretz.com/israel-news/israel-security/2026-03-28/ty-article/.premium/court-clears-air-force-officer-charged-with-leaking-iran-strike-for-online-bets/0000019d-2f2e-d868-a1bd-7fef78860000">an Israeli Air Force reservist was indicted for leaking classified details</a> about a strike on Iran to guide Polymarket bets, netting roughly $244,000. 
<a href="https://www.cnn.com/2026/03/24/politics/iran-war-bets-prediction-markets">A different trader has made nearly $1 million since 2024</a> from dozens of well-timed bets correctly predicting U.S. and Israeli military actions against Iran, winning 93% of five-figure wagers. <a href="https://www.cnbc.com/2026/04/15/kalshi-and-polymarket-congress-regulation-washington-influence.html">These incidents triggered at least eight prediction market bills in Congress</a> since January 2026, and federal prosecutors in Manhattan are <a href="https://www.cnn.com/2026/03/30/politics/prediction-markets-justice-department">actively exploring whether certain prediction market bets violate insider trading laws</a>.</p><p>But insider trading is not the only way people make money on Polymarket. There is a quieter, more interesting story happening in parallel.</p><p></p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><p><br></p><h2>The efficiency gap</h2><p><a href="https://www.princeton.edu/~ceps/workingpapers/91malkiel.pdf">The Efficient Market Hypothesis</a>, formalized by Eugene Fama in the 1960s, states that asset prices reflect all available information, making it impossible to consistently beat the market. In traditional equity markets, this largely holds because massive institutional capital from hedge funds, pension funds, and proprietary trading firms constantly hunts for and eliminates mispricings. 
The S&amp;P 500 trades roughly $500 billion daily. Any pricing error gets corrected in milliseconds by algorithms running in colocated data centers.</p><p>Polymarket&#8217;s individual markets often have only tens of thousands of dollars in liquidity. The ratio of &#8220;smart money&#8221; to &#8220;total market cap&#8221; is fundamentally different from equity markets, and that is why edges persist longer than they would on Wall Street. <a href="https://arxiv.org/abs/2508.03474">A 2025 study by IMDEA Networks Institute</a> documented $40 million in arbitrage profits extracted from Polymarket alone between April 2024 and April 2025, analyzing 86 million bets. <a href="https://www.financemagnates.com/trending/prediction-markets-are-turning-into-a-bot-playground/">Arbitrage opportunities on the platform last an average of 4 seconds, with 73% of profits captured by bots</a> executing in under 100 milliseconds.</p><p>The institutional side is catching up. <a href="https://www.financemagnates.com/fintech/wall-street-quants-move-into-prediction-markets-to-hunt-for-arbitrage-not-to-bet/">DRW is hiring dedicated prediction market traders</a> at a $200,000 base salary. Susquehanna International Group became the first official market maker on Kalshi (a competing platform). Jump Trading is building specialized desks. But the market is not there yet. 
Liquidity is too thin for these firms to deploy serious capital without moving prices, leaving room for smaller, faster actors.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Speed as edge: what people are building</h2><p>I&#8217;ve been studying Polymarket for some months now, and I&#8217;ve probably ended up in a bubble on Instagram and other social media. I&#8217;m seeing a growing number of traders share systems designed to exploit exactly this kind of market inefficiency. The approaches vary, but the pattern is the same: faster information, faster execution, profit.</p><p>One notable case involves a trader who claims to use computer vision models processing live football match video feeds. His system watches the match in real time, detects events (goals, red cards, penalties) through frame analysis, and places bets on prediction markets seconds before the event registers on official data feeds and bookmaker odds adjust. He claims an 8-second advantage over other traders (unfortunately, I cannot find the post on Instagram about it anymore). Whether that specific claim holds up or not, this is nothing new: courtsiders have been doing this in tennis for years, <a href="https://fivethirtyeight.com/features/inside-the-shadowy-world-of-high-speed-tennis-betting">attending live matches and transmitting scores</a> faster than official data feeds reach bookmakers. In 2016, tennis umpires from Kazakhstan, Turkey, and Ukraine were banned for deliberately delaying score updates for courtside accomplices.</p><p>The same principle applies at a larger scale. 
<a href="https://www.bloomberg.com/news/features/2018-05-03/the-gambler-who-cracked-the-horse-racing-code">Bill Benter built a multinomial logit model</a> with over 120 variables per horse for Hong Kong racing and extracted over $1 billion between 1987 and 2001. <a href="https://www.racingpost.com/news/britain/high-court-case-alleges-tony-blooms-betting-empire-makes-600m-a-year-so-what-do-we-know-about-his-starlizard-syndicate-aNlkE7t8daxQ/">Tony Bloom&#8217;s Starlizard syndicate employs 160 people</a> to model Asian handicap football markets and reportedly generates 600 million GBP per year. <a href="https://www.financemagnates.com/trending/prediction-markets-are-turning-into-a-bot-playground/">On Polymarket itself, 14 of the top 20 most profitable wallets are bots</a>. <a href="https://www.coindesk.com/markets/2026/02/21/how-ai-is-helping-retail-traders-exploit-prediction-market-glitches-to-make-easy-money">One bot turned $313 into $414,000</a> in a single month, exploiting temporal arbitrage in 15-minute crypto markets.</p><p>All of these systems share two requirements: data and speed. They need real-time access to market prices, order books, and event outcomes, and they need to act on that data faster than everyone else. 
All of this is possible because Polymarket provides a full set of APIs that can be used to operate programmatically on the platform.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Polymarket&#8217;s API architecture</h2><p>Polymarket exposes <a href="https://docs.polymarket.com/api-reference/introduction">three distinct APIs</a>, each serving a different purpose. Understanding which one to use and when is the first step toward building anything that trades or monitors this market.</p><h3>Gamma API: market discovery</h3><p>The Gamma API is the browsing layer. It returns human-readable market data: questions, descriptions, outcome prices, volume, liquidity, event metadata. No authentication required.</p><p><strong>Base URL</strong>: https://gamma-api.polymarket.com</p><p>Key endpoints:</p><p>- <code>GET /markets</code> returns a paginated list of markets with filtering options (limit, offset, closed, tag_id)</p><p>- <code>GET /markets/{id}</code> returns a single market by ID or slug</p><p>- <code>GET /events</code> and <code>GET /events/{id}</code> return event-level data (events group related markets)</p><p>- <code>GET /search?query=...</code> performs keyword search across markets and events</p><p>A single call to <code>/markets?limit=1&amp;closed=false</code> returns something like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;7e25266f-af9b-4b4c-b40f-660d9c8e031f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">{
  "id": "540816",
  "question": "Russia-Ukraine Ceasefire before GTA VI?",
  "conditionId": "0x9c1a953fe92c8357f1b646ba25d983aa83e90c525992db14fb726fa895cb5763",
  "outcomes": "[\"Yes\", \"No\"]",
  "outcomePrices": "[\"0.545\", \"0.455\"]",
  "volume": "1516211.89",
  "liquidity": "62104.61",
  "clobTokenIds": "[\"850149715...\", \"252731249...\"]"
}</code></pre></div><p>The <code>clobTokenIds</code> field is the bridge to the trading layer. Each outcome (Yes/No) gets its own token ID, which is what you pass to the CLOB API to get real-time prices and order book data. Note that <code>outcomes</code>, <code>outcomePrices</code>, and <code>clobTokenIds</code> arrive as JSON-encoded strings, so they need a second decoding step.</p><p>The Gamma API is rate-limited at roughly 60 requests per minute. It is useful for discovery and metadata, not for real-time price monitoring.</p><h3>CLOB API: the order book</h3><p>The CLOB (Central Limit Order Book) API is where trading happens. It has both public and authenticated endpoints.</p><p><strong>Base URL</strong>: https://clob.polymarket.com</p><p><strong>Public endpoints (no authentication):</strong></p><p>- <code>GET /price?token_id=X&amp;side=BUY|SELL</code> returns the current best price</p><p>- <code>GET /midpoint?token_id=X</code> returns the midpoint between best bid and ask</p><p>- <code>GET /spread?token_id=X</code> returns the current spread</p><p>- <code>GET /book?token_id=X</code> returns the full order book with all bids and asks</p><p>- <code>GET /last-trade-price?token_id=X</code> returns the last executed trade price</p><p>- <code>GET /tick-size?token_id=X</code> returns the minimum price increment</p><p>A call to <code>/midpoint</code> returns a minimal payload:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;5b93dac3-b924-4bde-9cca-74db20b575d9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">{"mid": "0.545"}</code></pre></div><p>The <code>/book</code> endpoint returns the full depth:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;7ba6f05f-de54-4ad7-ba8b-50d1d9f4eddc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">{
  "market": "0x9c1a95...",
  "asset_id": "850149715...",
  "bids": [
    {"price": "0.54", "size": "15234.50"},
    {"price": "0.53", "size": "8920.00"}
  ],
  "asks": [
    {"price": "0.55", "size": "12100.00"},
    {"price": "0.56", "size": "6500.00"}
  ]
}</code></pre></div><p>These public endpoints are what matter for price monitoring. They are lightweight, return small payloads, and have no authentication overhead.</p><p><strong>Authenticated endpoints</strong> use a <a href="https://docs.polymarket.com/developers/CLOB/authentication">two-level authentication system</a>:</p><p><strong>Level 1 (L1)</strong> uses EIP-712 wallet signatures. You sign a structured message proving you control a specific Ethereum wallet address. This is a one-time operation that generates API credentials:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;a46d1352-1fcd-4eb2-98dd-5baea0815327&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">POST /auth/api-key
Headers: POLY_ADDRESS, POLY_SIGNATURE, POLY_TIMESTAMP, POLY_NONCE
Returns: { apiKey, secret, passphrase }</code></pre></div><p><strong>Level 2 (L2)</strong> uses HMAC-SHA256 signing on every request. Every authenticated call requires five headers: <code>POLY_ADDRESS</code>, <code>POLY_SIGNATURE</code> (computed HMAC of the request), <code>POLY_TIMESTAMP</code>, <code>POLY_API_KEY</code>, and <code>POLY_PASSPHRASE</code>.</p><p>Even with L2 auth, placing an order requires the user to sign the order payload locally with their private key. That makes three cryptographic operations in total: key derivation (once), request signing (per call), and order signing (per order).</p><p>The authenticated endpoints are:</p><p>- <code>POST /order</code> places a single order</p><p>- <code>POST /orders</code> places a batch of orders</p><p>- <code>DELETE /order</code> cancels an order</p><p><strong>WebSocket feeds</strong> provide real-time streaming at <code>wss://ws-subscriptions-clob.polymarket.com/ws/</code> for order book updates, price changes, and user-specific events.</p><h3>Data API: analytics</h3><p>The Data API at <code>https://data-api.polymarket.com</code> provides analytics-oriented data: user positions, trade history, leaderboards, and holder information. It is less documented and less stable than the other two. Some endpoints returned 404 or empty responses during our testing. It is useful for research, but not reliable for production.</p><h2>The speed game: calling the APIs as fast as possible</h2><p>If arbitrage opportunities on Polymarket last 4 seconds on average, and 73% of profits go to bots executing in under 100 milliseconds, then the speed at which you can read prices and place orders is a direct competitive advantage. 
We set up a benchmark to answer two questions: where should you run your code, and which language and HTTP strategy get you there fastest?</p><p>We chose the <code>/midpoint</code> endpoint for the benchmark because it requires no authentication, returns the smallest possible payload, and isolates HTTP client performance from payload parsing. Each benchmark runs 1,000 requests in two modes: sequential (one request at a time, measuring per-request latency) and concurrent (50 simultaneous workers, measuring throughput).</p><p>As always, the code can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">102.POLYMARKET</a>.</strong></p>
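<p>As a rough illustration of how the Gamma and CLOB payloads described above fit together, here is a minimal Python sketch. It only parses data offline: the helper names <code>outcome_token_map</code> and <code>top_of_book</code> are ours (hypothetical, not part of any Polymarket SDK), and the sample values are copied from the example responses in this article. Treat it as a sketch under those assumptions, not as the benchmark code from the repository.</p>
<pre><code>
```python
import json
from decimal import Decimal


def outcome_token_map(market: dict) -> dict:
    """Map each outcome label to its CLOB token ID.

    Gamma returns "outcomes" and "clobTokenIds" as JSON-encoded strings
    (see the sample payload above), so each needs its own json.loads call.
    """
    outcomes = json.loads(market["outcomes"])
    token_ids = json.loads(market["clobTokenIds"])
    return dict(zip(outcomes, token_ids))


def top_of_book(book: dict) -> dict:
    """Return best bid, best ask, spread, and midpoint from a /book payload.

    The levels are not assumed to be sorted: the best bid is the highest
    bid price and the best ask is the lowest ask price.
    """
    best_bid = max(Decimal(level["price"]) for level in book["bids"])
    best_ask = min(Decimal(level["price"]) for level in book["asks"])
    return {
        "best_bid": best_bid,
        "best_ask": best_ask,
        "spread": best_ask - best_bid,
        "mid": (best_bid + best_ask) / 2,
    }


# Sample values copied from the payloads shown earlier in this article.
sample_market = {
    "outcomes": "[\"Yes\", \"No\"]",
    "clobTokenIds": "[\"850149715...\", \"252731249...\"]",
}
sample_book = {
    "bids": [{"price": "0.54", "size": "15234.50"}, {"price": "0.53", "size": "8920.00"}],
    "asks": [{"price": "0.55", "size": "12100.00"}, {"price": "0.56", "size": "6500.00"}],
}

tokens = outcome_token_map(sample_market)
summary = top_of_book(sample_book)
print(tokens["Yes"], summary["mid"], summary["spread"])
```
</code></pre>
<p>In a live monitor you would fetch the market from <code>/markets</code>, pass one of the resulting token IDs to <code>/book?token_id=X</code>, and feed the response to <code>top_of_book</code>. Prices are kept as <code>Decimal</code> values so that 0.54 and 0.55 average to exactly 0.545, with no floating-point drift.</p>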
      <p>
          <a href="https://substack.thewebscraping.club/p/how-to-get-data-from-polymarket-fast">
              Read more
          </a>
      </p>
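<p>Coming back to the L2 authentication described above, the per-request signing step can be sketched in a few lines of Python. The five header names come from this article; everything else is an assumption made for illustration: the function name <code>build_l2_headers</code> is hypothetical, and the layout of the signed message (timestamp + method + path + body) and the URL-safe base64 encoding of the secret should be checked against the official authentication docs before you rely on them.</p>
<pre><code>
```python
import base64
import hashlib
import hmac


def build_l2_headers(address, api_key, secret, passphrase,
                     method, path, body="", timestamp=0):
    """Build the five L2 headers for one authenticated CLOB request.

    ASSUMPTION: the signed message is timestamp + method + path + body and
    the secret is URL-safe base64. Confirm both against the official docs.
    """
    message = f"{timestamp}{method}{path}{body}".encode("utf-8")
    key = base64.urlsafe_b64decode(secret)
    digest = hmac.new(key, message, hashlib.sha256).digest()
    signature = base64.urlsafe_b64encode(digest).decode("utf-8")
    return {
        "POLY_ADDRESS": address,
        "POLY_SIGNATURE": signature,
        "POLY_TIMESTAMP": str(timestamp),
        "POLY_API_KEY": api_key,
        "POLY_PASSPHRASE": passphrase,
    }


# Dummy credentials for illustration only; never hard-code real ones.
demo_secret = base64.urlsafe_b64encode(b"not-a-real-secret").decode("utf-8")
headers = build_l2_headers(
    address="0x0000000000000000000000000000000000000000",
    api_key="demo-key",
    secret=demo_secret,
    passphrase="demo-passphrase",
    method="DELETE",
    path="/order",
    timestamp=1700000000,
)
print(sorted(headers))
```
</code></pre>
<p>Because the signature covers the timestamp, method, path, and body, it has to be recomputed for every call. That is the per-request cost referred to above, and it comes on top of the separate order signature made with your private key.</p>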
   ]]></content:encoded></item><item><title><![CDATA[The Stealth Stack: A Guide to Preventing Data Leaks in Web Scraping Infrastructure]]></title><description><![CDATA[A four-layer defense strategy for making your web scraping infrastructure indistinguishable from real users]]></description><link>https://substack.thewebscraping.club/p/the-stealth-stack-web-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/the-stealth-stack-web-scraping</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 12 Apr 2026 03:00:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ef273b12-ade2-4ba6-a14a-701876041775_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you hear about &#8220;data leaks&#8221;, you probably think of cybersecurity, databases, and personal information lost to malicious actors. But what if I told you your web scraper is leaking data too? In the context of web scraping, no one is stealing your data. Rather, your scraper is revealing its automated nature through a set of signals.</p><p>In particular, your scrapers leak information at four distinct layers. Modern anti-bot systems fingerprint your browser, analyze your TLS handshake, trace your network infrastructure, and track your behavioral patterns. A single inconsistency across these layers can be enough to get you blocked.</p><p>This means your scrapers aren&#8217;t competing only against rate limits anymore. Today, they are competing against <a href="https://substack.thewebscraping.club/p/machine-learning-for-detecting-bots">machine learning models trained on billions of legitimate requests</a>, and any deviation from the expected pattern is a signal. 
So, if you want to scrape at scale, your infrastructure must be indistinguishable from a real user&#8217;s browser, network stack, and behavior.</p><p>This article guides you through a systematic approach: First, understanding where leaks occur, then learning how anti-bot systems detect them, and finally building a layered defense that makes your scraper invisible.</p><p></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" 
height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>Identifying the Leaks: Where Your Scraper Exposes Itself</strong></h2><p>Before fixing anything, you need to understand the complete attack surface. 
Modern anti-bot systems analyze your scraper at four distinct layers, and a leak at any layer can expose you.</p><h3><strong>Layer 1: The Browser Level</strong></h3><p>Headless browsers are loud by default. Launch a <a href="https://pptr.dev/">Puppeteer</a> instance and check the <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver">navigator.webdriver</a></em> flag. With default settings it returns <em>true</em>, and that&#8217;s a signal every major anti-bot system checks early in the page load.</p><p>But this obvious flag is just the beginning. Anti-bot systems probe deeper:</p><ul><li><p><strong>Error messages and stack traces</strong>: They differ between headless and headed modes. The execution context leaves fingerprints in error objects.</p></li><li><p><strong>Window dimensions</strong>: Properties like <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/outerWidth#:~:text=outerWidth%20read%2Donly%20property%20returns,and%20window%20resizing%20borders%2Fhandles.">window.outerWidth</a></em> and <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/outerHeight">window.outerHeight</a></em> can reveal headless operation because headless mode doesn&#8217;t render a visible window frame.</p></li><li><p><strong>Canvas rendering</strong>: Headless and headed browsers can produce pixel-level differences. Software rendering (headless) creates different anti-aliasing and color values than GPU-accelerated rendering (headed). Color channels can differ by 1-2 units per pixel.</p></li><li><p><strong><a href="https://developer.mozilla.org/en-US/docs/Web/API/WebGLShader">WebGL shader</a> timing</strong>: Timing varies widely with the underlying hardware. GPU-accelerated browsers can complete WebGL operations in microseconds, while software-rendered headless browsers can take milliseconds.</p></li><li><p><strong>Font rendering</strong>: Headless environments often lack the full system font stack. 
This creates detectable layout differences when JavaScript measures text dimensions.</p></li><li><p><strong>Performance benchmarks</strong>: Stress tests can reveal software rendering. For example, some websites run JavaScript stress tests that create thousands of DOM elements, calculate layouts, and trigger reflows. In such scenarios, real browsers with GPU acceleration show consistent performance. Headless browsers, instead, show different timing patterns.</p></li><li><p><strong>The </strong><em><strong><a href="https://developer.chrome.com/docs/extensions/reference/api/windows">window.chrome</a></strong></em><strong> object behaves differently</strong>: Real Chrome populates this object with specific properties for extension management and runtime APIs. Headless Chrome, instead, either lacks this object or provides an incomplete implementation.</p></li></ul><h3><strong>Layer 2: The Network Level</strong></h3><p>Your SSL/TLS handshake identifies you before you send any application data. When your scraper connects over HTTPS, it sends a TLS Client Hello message containing supported encryption methods, protocol versions, and extensions, all in a specific order.</p><p>Here&#8217;s what makes this dangerous:</p><ul><li><p><strong>Every browser and HTTP library has a distinctive TLS pattern:</strong> Real browsers send their TLS parameters in a specific sequence that matches their version and underlying platform. Python&#8217;s standard HTTP libraries send a completely different pattern. So do Node.js, Go, and any other runtime you use for coding your scrapers.</p></li><li><p><strong>Anti-bot systems fingerprint your TLS handshake:</strong> They capture these patterns and convert them into a fingerprint, commonly called a <a href="https://github.com/salesforce/ja3">JA3 hash</a>. 
They maintain databases of known fingerprints for every major browser and HTTP library.</p></li><li><p><strong>Mismatches between User-Agent and TLS fingerprint are instant red flags:</strong> When you claim to be Chrome in your User-Agent header but your TLS handshake matches Python&#8217;s urllib library, that inconsistency can get you blocked.</p></li><li><p><strong>Detection happens before you send any application data:</strong> The TLS handshake completes before the first HTTP request is sent, so it can already identify you as automated traffic.</p></li><li><p><strong>HTTP/2 fingerprinting adds another layer:</strong> Beyond TLS, the order and priority of HTTP/2 frames, settings, and window updates create additional fingerprints. Your HTTP library&#8217;s frame ordering must match your claimed browser identity.</p></li></ul><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong>, with high-reputation IPs, on your side improves your chances of success.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3><strong>Layer 3: The Infrastructure Level</strong></h3><p>Your proxy configuration can expose your real infrastructure through network-level leaks via the following main mechanisms:</p><ul><li><p><strong>DNS leaks:</strong> They happen when your browser resolves domain names using your local DNS server instead of routing through the proxy. Your scraper might send requests through a Miami residential proxy, but if DNS queries go through your AWS datacenter in Virginia, the target site knows your real location.</p></li><li><p><strong>WebRTC leaks:</strong> <a href="https://webrtc.org/">WebRTC </a>is a browser API designed for peer-to-peer communication. 
Even with a proxy configured, WebRTC can attempt to discover your real local IP and public IP through STUN servers, completely bypassing your proxy.</p></li><li><p><strong>IP reputation:</strong> Not all IPs are created equal. Cloudflare and similar services maintain databases of AWS, Google Cloud, and Azure IP ranges. Requests from known cloud providers immediately receive higher suspicion scores, before any other analysis happens.</p></li></ul><h3><strong>Layer 4: The Behavioral Level</strong></h3><p>Even if your browser, network, and infrastructure are perfectly disguised, your behavior patterns can still expose you:</p><ul><li><p><strong>Timing patterns:</strong> Requesting data at fixed, precise intervals creates perfect periodicity. No human browses with mathematical precision.</p></li><li><p><strong>Mouse and scroll behavior:</strong> Real humans accelerate and decelerate smoothly. Instant jumps from point A to point B are physically impossible for a human hand.</p></li><li><p><strong>Session state:</strong> Stateless scrapers that never accumulate cookies or maintain persistent sessions across days look like fresh bots on every run.</p></li><li><p><strong>Interaction sequences:</strong> The time between page load and first click, the time between mouse-over and click, and the way you scroll through content all follow detectable human patterns.</p></li></ul><h2><strong>Understanding the Detection: How Anti-Bot Systems Catch You</strong></h2><p>Now that you know where leaks occur, let&#8217;s understand how anti-bot systems actually detect them.</p><h3><strong>Fingerprint Consistency Checks</strong></h3><p>Anti-bot systems cross-reference your claimed identity with actual behavior. 
If your <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent">User-Agent</a> says &#8220;Chrome 120 on Windows 10,&#8221; they verify that your JavaScript features, WebGL capabilities, canvas rendering, and TLS handshake all match Chrome 120 on Windows 10.</p><p>A single mismatch anywhere can flag the entire request. You can&#8217;t be Chrome in your User-Agent, Firefox in your TLS handshake, and headless Chrome in your canvas fingerprint. <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">Anti-bot systems create composite fingerprints combining dozens of properties</a>, then compare them against databases of known legitimate and bot patterns.</p><h3><strong>Machine Learning Pattern Recognition</strong></h3><p>Modern anti-bot systems use ML models trained on billions of requests. They learn what &#8220;normal&#8221; looks like for each type of visitor. For example, consumer browsers on residential IPs show different behavioral patterns than datacenter scrapers.</p><p>For these models, statistical anomalies are the trigger. Perfect timing intervals, impossible mouse movements, or interaction timings that don&#8217;t match human variance distributions are scored as anomalous. These models adapt continuously, so when new stealth techniques emerge, the models retrain on that data. This means that what works today might fail tomorrow.</p><h3><strong>Progressive Trust Scoring</strong></h3><p>Anti-bot systems don&#8217;t just block or allow requests; they also score them. Lower trust scores receive degraded service: slower response times, rate limits, or <a href="https://substack.thewebscraping.club/p/are-captchas-still-a-thing">CAPTCHA challenges</a> before an outright block.</p><p>Also, scores accumulate across sessions. If you leak information across multiple visits, the system builds a profile associating your various identities. 
In other words, one leak can poison future requests, and even fixing the leak might not restore trust if your IP or fingerprint is already marked.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2><strong>Building the Defense: A Layered Approach to Stealth</strong></h2><p>Building a defense against data leaks in web scraping requires addressing each layer systematically. Your stealth stack must work from the inside out: browser &#8594; network &#8594; infrastructure &#8594; behavior. Each layer must remain consistent with your claimed identity.</p><h3><strong>Defense Layer 1: Hardening the Browser</strong></h3><p>The goal at this layer is to make the browser fingerprint indistinguishable from a real user&#8217;s browser and ensure every property is consistent with your claimed identity.</p><p><strong>Step 1: Mask Automation Signals</strong></p><p>Start with stealth libraries that patch the most common detection vectors:</p><ul><li><p><strong>For Puppeteer:</strong> Use <em><a href="https://www.npmjs.com/package/puppeteer-extra-plugin-stealth">puppeteer-extra-plugin-stealth</a></em> to automatically override <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver">navigator.webdriver</a></em>, DevTools Protocol signatures, and plugin arrays.</p></li><li><p><strong>For <a href="https://www.selenium.dev/">Selenium</a>:</strong> Use <em><a href="https://pypi.org/project/undetected-chromedriver/">undetected-chromedriver</a></em>, which patches the ChromeDriver binary to remove common automation signals.</p></li><li><p><strong>For Playwright:</strong> Playwright has no dedicated stealth mode, so pair launch flags like the ones shown next with a community plugin such as <em>playwright-stealth</em>.</p></li></ul><p>Additionally, disable automation flags at launch. For example, in Playwright:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
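            # Stops Chromium from advertising automation via navigator.webdriver
            '--disable-blink-features=AutomationControlled',
            # Container-friendly flags; they do not change the fingerprint
            '--disable-dev-shm-usage',
            '--no-sandbox'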
        ]
    )</code></code></pre><p>But remember: Stealth libraries handle the most common 20-30 leak vectors but miss advanced fingerprinting techniques. They&#8217;re your foundation, not your complete solution.</p><p><strong>Step 2: Spoof Hardware Signatures</strong></p><p>Cloud server canvas and WebGL fingerprints are obvious red flags. AWS, GCP, and Azure rendering signatures are well-known to anti-bot systems.</p><p>You have two approaches for your defense here:</p><ul><li><p><strong>Add consistent noise:</strong> Inject deterministic noise into canvas operations so the fingerprint remains stable across sessions but doesn&#8217;t match your server&#8217;s real hardware. Override canvas methods to modify pixel data slightly before it&#8217;s read back. Keep noise minimal: just enough to mask the real hardware signature without appearing obviously manipulated.</p></li><li><p><strong>Emulate common consumer hardware:</strong> Spoof WebGL parameters to mimic common consumer GPUs. Override vendor and renderer strings returned by WebGL APIs to match your chosen hardware profile. Use existing libraries designed for canvas fingerprint defense or implement your own parameter overrides.</p></li></ul><p><strong>Step 3: Ensure Version Consistency</strong></p><p>This is where most scrapers fail, even with stealth libraries. Your User-Agent string must match your actual browser engine behavior precisely. Consider the following rules of thumb:</p><ul><li><p><strong>Use real browser binaries instead of spoofing:</strong> Tools like Playwright can launch actual Chrome, ensuring perfect consistency between claimed version and actual behavior.</p></li><li><p><strong>If you must spoof, maintain complete version profiles:</strong> Track which JavaScript features, WebGL capabilities, and API behaviors correspond to each browser version. 
Every property must align.</p></li><li><p><strong>Never mix components from different versions:</strong> If you claim Chrome 120 on Windows 10, every single API, from JavaScript features to WebGL renderers, must behave exactly like Chrome 120 on Windows 10.</p></li></ul><h3><strong>Defense Layer 2: Hardening the Network Stack</strong></h3><p>Your goal at this layer is to make your TLS handshake and HTTP traffic indistinguishable from the browser you&#8217;re claiming to be.</p><p><strong>Step 4: Match TLS Fingerprints to Your Browser Identity</strong></p><p>Standard HTTP libraries can&#8217;t mimic browser TLS fingerprints because they use different SSL/TLS implementations. The solution requires specialized libraries that replicate browser behavior at the protocol level:</p><ul><li><p><strong>For Python:</strong> Use <em><a href="https://curl-cffi.readthedocs.io/en/latest/">curl_cffi</a></em> or similar wrappers. These libraries use <em><a href="https://curl.se/libcurl/">libcurl</a></em> compiled with <em><a href="https://github.com/google/boringssl">BoringSSL</a></em>, which is the same SSL library Chrome uses. This creates identical JA3 fingerprints to real browsers.</p></li><li><p><strong>For Node.js:</strong> Use <em><a href="https://www.npmjs.com/package/cycletls">cycletls</a></em> or equivalent libraries that allow you to specify exact JA3 fingerprint strings matching real browsers.</p></li></ul><p><strong>Critical requirement:</strong> Your TLS fingerprint must match your User-Agent. Chrome 120&#8217;s JA3 fingerprint is different from Firefox 115&#8217;s fingerprint. The browser identity must be consistent across all layers.</p><p><strong>Step 5: Match HTTP/2 Fingerprints</strong></p><p>Beyond TLS, HTTP/2 frame ordering creates additional fingerprints. 
Libraries like <em>curl_cffi</em> handle this automatically when you specify a browser to impersonate, but verify that:</p><ul><li><p>Settings frames match your target browser.</p></li><li><p>Window update sequences align.</p></li><li><p>Priority headers follow the correct pattern.</p></li></ul><p>In Python, you can do so with the following code:</p><pre><code><code>from curl_cffi import requests

response = requests.get(
    'https://tls.peet.ws/api/all',
    impersonate='chrome120'
)
print(response.json()['http2']['sent_frames'])
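
# Optional sketch (not part of the original example): a stdlib-only sanity check
# that the browser named in the User-Agent matches the impersonation target.
# Assumptions: targets are written like 'chrome120', as in the call above, and
# check_identity is a hypothetical helper name, not a curl_cffi API.
import re

def check_identity(user_agent, impersonate):
    """Return True when the UA's Chrome major version matches the target."""
    ua_match = re.search(r'Chrome/(\d+)', user_agent)
    target_match = re.fullmatch(r'chrome(\d+)', impersonate)
    if ua_match is None or target_match is None:
        return False
    return ua_match.group(1) == target_match.group(1)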
</code></code></pre><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3><strong>Defense Layer 3: Hardening Infrastructure</strong></h3><p>Your goal at this layer is to ensure your network traffic originates from legitimate-looking IPs and doesn&#8217;t leak your real location or identity.</p><p><strong>Step 6: Choose the Right Proxy Type</strong></p><p>IP reputation is the first filter that anti-bot systems check. This means that your<a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies"> proxy choice determines your baseline trust score</a>. Consider the following guidelines:</p><ul><li><p><strong>Datacenter IPs = instant red flag:</strong> Requests from AWS, Google Cloud, and Azure IP ranges receive instant higher suspicion scores. </p></li><li><p><strong>Residential proxies = highest legitimacy:</strong> These IPs come from real ISP connections, so they look legitimate because they are legitimate consumer connections.</p></li><li><p><strong>Mobile proxies = premium legitimacy</strong>: These IPs originate from cellular networks (4G/5G) and receive the highest trust scores. Mobile IPs rotate naturally as devices move between cell towers, making them appear even more organic than static residential connections.</p></li></ul><p><strong>Step 7: Prevent DNS Leaks</strong></p><p>Force all DNS resolution through your proxy tunnel. 
For SOCKS5 proxies, use the SOCKS5h protocol variant, which forces DNS resolution on the remote proxy server instead of locally.</p><p>For example, in Python, write the following:</p><pre><code><code>import requests

proxies = {
    'http': 'socks5h://proxy.example.com:1080',
    'https': 'socks5h://proxy.example.com:1080'
}

response = requests.get('https://example.com', proxies=proxies)
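
# Optional sketch (not part of the original example): fail fast when a proxy map
# is not configured for remote DNS. A plain socks5:// URL resolves hostnames
# locally, which is the leak this step is meant to prevent.
# uses_remote_dns is a hypothetical helper name.
from urllib.parse import urlsplit

def uses_remote_dns(proxy_map):
    """Return True only if every proxy URL uses the socks5h scheme."""
    schemes = {urlsplit(url).scheme for url in proxy_map.values()}
    return schemes == {'socks5h'}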
</code></code></pre><p>For browser automation, configure DNS-over-HTTPS to prevent local DNS leakage. The following is an example that applies to Playwright:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        args=[
            '--dns-over-https-server=https://cloudflare-dns.com/dns-query'
        ]
    )
</code></code></pre><p><strong>Step 8: Disable WebRTC Completely</strong></p><p>WebRTC will expose your real IP unless you completely disable it in browser automation. For example, in Playwright, you can do so as follows:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    
    # Remove WebRTC entirely
    page.add_init_script("""
        delete window.RTCPeerConnection;
        delete window.RTCSessionDescription;
        delete window.RTCIceCandidate;
        delete navigator.mediaDevices;
    """)
</code></code></pre><p>When you&#8217;ve done this, verify it&#8217;s actually disabled before deploying your scraper. Visit <a href="http://browserleaks.com/webrtc">browserleaks.com/webrtc</a> with your scraper. You should see &#8220;WebRTC is not supported by your browser&#8221;, or only your proxy IP should be visible, never your real IP.</p><h3><strong>Defense Layer 4: Mimicking Human Behavior</strong></h3><p>Your goal at this layer is to make your interaction patterns indistinguishable from those of real human users.</p><p><strong>Step 9: Add Timing Jitter and Randomization</strong></p><p>Humans are inconsistent. Perfect patterns are robotic. The solution here is not just to add randomness. You also need to match the statistical distribution of real human behavior. To do so, consider the following example in Python:</p><pre><code><code>import numpy as np
import random
import time

# Wrong example (do not use this)

# Fixed interval
time.sleep(5)  # Always 5 seconds - DETECTABLE

# Random uniform
time.sleep(random.uniform(3, 7))  # Still doesn't match human patterns

# ------------

# Correct example (use this!)

# Log-normal distribution (matches real human reaction times)
delay = np.random.lognormal(mean=1.5, sigma=0.5)
time.sleep(delay)
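
# Optional sketch (not part of the original example): one delay helper per action
# type, using only the standard library. Assumptions: the (mu, sigma) pairs and the
# clamp ranges are illustrative guesses, not measured values, and human_delay is a
# hypothetical helper name.
import random

ACTION_PROFILES = {
    # action: (mu, sigma, min_seconds, max_seconds)
    'click': (-0.4, 0.5, 0.3, 2.0),
    'read': (2.7, 0.6, 5.0, 45.0),
    'scroll': (1.0, 0.6, 1.0, 8.0),
}

def human_delay(action, rng=random):
    """Draw a log-normal delay for the action, clamped to its range."""
    mu, sigma, low, high = ACTION_PROFILES[action]
    return min(max(rng.lognormvariate(mu, sigma), low), high)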
</code></code></pre><p>To improve randomization, model each action type with its own distribution. Use the following rules of thumb:</p><ul><li><p>Clicks: 0.3-2 seconds (short delays)</p></li><li><p>Reading: 5-45 seconds (high variance)</p></li><li><p>Scrolling: 1-8 seconds (irregular intervals)</p></li></ul><p><strong>Step 10: Implement Realistic Mouse and Scroll Behavior</strong></p><p>High-security sites like banking, ticketing, and heavily protected e-commerce websites track interaction patterns in real time. To avoid giving your scraper away on such websites, you have to model mouse movements and scrolling in your automated scripts.</p><p>For mouse movements, you can:</p><ul><li><p>Use Bezier curves to create natural arcing movements between points.</p></li><li><p>Add slight randomness to destination coordinates.</p></li><li><p>Include hover delays before clicking.</p></li><li><p>Vary the number of intermediate steps based on distance.</p></li></ul><p>The following is an example you can try in Python:</p><pre><code><code>import numpy as np
from playwright.async_api import async_playwright  # the helpers below are coroutines

def bezier_curve(start, end, control_points, num_steps=20):
    """Generate points along a Bezier curve for natural mouse movement"""
    t = np.linspace(0, 1, num_steps)
    points = []
    
    # Simplified cubic Bezier
    for t_val in t:
        x = ((1-t_val)**3 * start[0] +
             3*(1-t_val)**2*t_val * control_points[0][0] +
             3*(1-t_val)*t_val**2 * control_points[1][0] +
             t_val**3 * end[0])
        y = ((1-t_val)**3 * start[1] +
             3*(1-t_val)**2*t_val * control_points[0][1] +
             3*(1-t_val)*t_val**2 * control_points[1][1] +
             t_val**3 * end[1])
        points.append((x, y))
    
    return points

async def human_like_click(page, selector, start_pos=(0, 0)):
    # Playwright has no API to read the cursor position, so the caller
    # passes the last known position (typically (0, 0) on a fresh page)
    element = await page.query_selector(selector)
    box = await element.bounding_box()
    
    # Add slight randomness to destination
    target_x = box['x'] + box['width']/2 + np.random.normal(0, 2)
    target_y = box['y'] + box['height']/2 + np.random.normal(0, 2)
    
    # Move mouse along curve
    control_points = [
        (start_pos[0] + np.random.uniform(-50, 50), 
         start_pos[1] + np.random.uniform(-50, 50)),
        (target_x + np.random.uniform(-20, 20), 
         target_y + np.random.uniform(-20, 20))
    ]
    
    points = bezier_curve(
        start_pos, 
        (target_x, target_y), 
        control_points
    )
    
    for x, y in points:
        await page.mouse.move(x, y)
        await page.wait_for_timeout(np.random.uniform(5, 15))
    
    # Hover briefly before clicking
    await page.wait_for_timeout(np.random.uniform(100, 300))
    await page.mouse.click(target_x, target_y)
</code></code></pre><p>For scrolling, you can:</p><ul><li><p>Pause between scroll actions for variable amounts of time (simulating reading).</p></li><li><p>Scroll in chunks of varying size, not uniform pixels.</p></li><li><p>Occasionally scroll backwards (humans re-read).</p></li><li><p>Avoid scrolling in perfect increments or at constant speeds.</p></li></ul><p>Use the following Python code to try such scrolling behavior:</p><pre><code><code>import numpy as np

async def human_like_scroll(page, total_distance):
    """Scroll with human-like patterns"""
    scrolled = 0
    
    while scrolled &lt; total_distance:
        # Vary chunk size
        chunk = np.random.randint(100, 400)
        
        await page.mouse.wheel(0, chunk)
        scrolled += chunk
        
        # Pause to simulate reading
        pause = np.random.lognormal(mean=1.2, sigma=0.8)
        await page.wait_for_timeout(pause * 1000)
        
        # Occasionally scroll backwards (humans re-read)
        if np.random.random() &lt; 0.15:
            back = np.random.randint(50, 150)
            await page.mouse.wheel(0, -back)
            scrolled -= back  # keep the counter in sync with the real position
            await page.wait_for_timeout(np.random.uniform(500, 1500))
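
# Optional sketch (not part of the original example): plan the wheel deltas up front
# so the pattern can be inspected without a browser. plan_scroll is a hypothetical
# helper name; it tracks net progress, so backward scrolls are subtracted.
import random

def plan_scroll(total_distance, rng=random):
    """Return wheel deltas (pixels) whose net sum reaches total_distance."""
    steps, progress = [], 0
    while max(total_distance - progress, 0):  # loop while distance remains
        chunk = rng.randint(100, 400)         # vary chunk size
        steps.append(chunk)
        progress += chunk
        # Roughly 15% of the time, scroll back a little (humans re-read)
        if rng.choices([True, False], weights=[15, 85])[0]:
            back = rng.randint(50, 150)
            steps.append(-back)
            progress -= back
    return steps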
</code></code></pre><p><strong>Step 11: Maintain Persistent Session State</strong></p><p>Stateless scrapers look like stateless bots. Real browsers, instead, accumulate state over time because:</p><ul><li><p>Cookies persist across requests and sessions.</p></li><li><p>LocalStorage accumulates tracking data over time.</p></li><li><p>Session IDs remain stable across days or weeks.</p></li></ul><p>To mimic real browser states, you can use the following Python code:</p><pre><code><code>import pickle
import requests

# Save cookies to disk after each session
session = requests.Session()

# ... perform scraping ...

with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)

# Before next scraping session
with open('cookies.pkl', 'rb') as f:
    session.cookies.update(pickle.load(f))
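
# Optional sketch (not part of the original example): JSON instead of pickle, with a
# safe default on the first run. Assumptions: cookies are passed as a plain
# name-to-value dict (requests.utils.dict_from_cookiejar can build one, at the cost
# of dropping domain and path details); both helper names are hypothetical.
import json
from pathlib import Path

def save_cookie_dict(path, cookies):
    """Write a name-to-value cookie dict to disk as JSON."""
    Path(path).write_text(json.dumps(cookies))

def load_cookie_dict(path):
    """Return the saved cookie dict, or an empty dict if no file exists yet."""
    file = Path(path)
    return json.loads(file.read_text()) if file.exists() else {}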
</code></code></pre><p>In case you use a browser automation tool:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    
    # Save browser storage state
    context = browser.new_context()
    # ... perform scraping ...
    context.storage_state(path='state.json')
    
    # Reload in next session
    context = browser.new_context(storage_state='state.json')
</code></code></pre><p>As a final note, consider keeping sessions alive for weeks to allow third-party tracking cookies to build up. Long-lived sessions with accumulated tracking data appear more legitimate than constantly refreshed clean states.</p><div><hr></div><h2><strong>Conclusion</strong></h2><p>In this article, you learned that, if you don&#8217;t want your data to be leaked while scraping, you have to take several defensive measures, as no single technique makes you invisible. Anti-bot systems analyze multiple signals simultaneously, and any inconsistency across layers triggers detection and blocks your scrapers.</p><p>Also, detection methods evolve. So, what works today might fail tomorrow. This means you should also monitor the defenses you implemented and test new ones.</p><p>Now, let us know: How do you prevent data leaks in your scrapers? Did we miss some technique?</p>]]></content:encoded></item><item><title><![CDATA[rayobrowse: A Hands-On Look at the Stealth Browser From Rayobyte]]></title><description><![CDATA[Looking for a Camoufox alternative? 
Here&#8217;s an interesting stealth browser worth checking out!]]></description><link>https://substack.thewebscraping.club/p/rayobrowse-browser-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/rayobrowse-browser-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 05 Apr 2026 03:00:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/442d19ad-ddc9-4b14-afda-71c81a91ffc4_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The open&#8209;source nature of Camoufox is what made the project so popular and appealing. Unfortunately, that same openness is also what allowed anti&#8209;bot giants to study it closely and eventually crack down on it.</p><p>Rayobyte, the proxy and web scraping solutions provider, has taken a different approach. They recently released <em>rayobrowse</em>, a closed&#8209;source yet Docker&#8209;based, self&#8209;hostable stealth browser built for local browser automation and web scraping.</p><p>In this post, I&#8217;ll take a deep look at this solution and walk you through everything you need to know about it. 
By the end, you&#8217;ll understand what rayobrowse is, how its stealth browser approach works, how to set it up, and whether it&#8217;s actually worth paying attention to.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>An Introduction to rayobrowse</h2><p>Let me introduce you to the world of rayobrowse, helping you understand what it is and what makes this project special.</p><h3>What is rayobrowse?</h3><p><a href="https://github.com/rayobyte-data/rayobrowse">rayobrowse</a> is a self-hosted, Chromium-based stealth browser engineered for web scraping, AI agents, and automation workflows. It&#8217;s available as a Docker image, with optional support via a Python SDK (<em><a href="https://pypi.org/project/rayobrowse">rayobrowse</a></em> on PyPI) for simplified connection. The project is developed and maintained by Rayobyte.</p><p>The stealth browser runs inside Docker and is available via the <a href="https://substack.thewebscraping.club/p/webdriver-vs-cdp-vs-bidi">Chrome DevTools Protocol (CDP)</a>. That means tools like Playwright, Puppeteer, and Selenium (or any other tool that speaks CDP) can natively connect to it for automation purposes.</p><p>What makes it noteworthy is its approach to <a href="https://substack.thewebscraping.club/p/what-is-device-fingerprint">device fingerprinting</a>. User agents, screen size, WebGL, fonts, timezone, and other signals are tuned so each session looks like a real browser. 
That way, it helps your automation avoid detection on protected websites.</p><h3>Core Principles Driving the Solution</h3><p>These are the core principles and goals behind the project:</p><ol><li><p>It should run on Linux server environments without GPUs or a GUI/desktop interface.</p></li><li><p>It should patch Chromium at the C++ level, rather than at higher layers like CDP, which are easier for anti-bot systems to detect.</p></li><li><p>It should work with Playwright, a common framework in <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browsing automation stacks</a>.</p></li><li><p>It should support both headful mode (via <a href="https://www.x.org/archive/X11R7.7/doc/man/man1/Xvfb.1.xhtml">Xvfb</a>) and headless mode.</p></li><li><p>It should emulate fingerprints from real-world devices across different regions.</p></li><li><p>It should be self-hostable, so you can run it locally without relying on cloud infrastructure.</p></li><li><p>It should be free to test and use for certain user segments.</p></li><li><p>It should reliably bypass major anti-bot systems and scraping targets, including complex ecommerce and SERP platforms.</p></li></ol><p><strong>Note</strong>: If you&#8217;re not familiar with Xvfb, that&#8217;s an in&#8209;memory display server for Unix-like systems that implements the X11 display protocol without requiring a physical display or input devices. In simpler terms, it allows GUI applications to run in headless environments. 
rayobrowse relies on it to launch headful browser sessions even on servers without a graphical interface (that&#8217;s beneficial as headful sessions are harder to detect than purely headless ones).</p><h2>Main Features for Stealth Browsing and More</h2><p>Here is a list of the most relevant rayobrowse features:</p><ul><li><p><strong>Fingerprint spoofing</strong>: Each browser session comes with a realistic device fingerprint drawn from a database of thousands of real-world profiles. Signals include user agent, OS metadata, screen resolution, fonts, WebGL, hardware concurrency, and timezone.</p></li><li><p><strong>Human&#8209;like mouse movement</strong>: Optional human&#8209;style cursor behavior (inspired by <a href="https://github.com/riflosnake/HumanCursor">HumanCursor</a>) makes automation appear more natural. When using standard Playwright actions like <em>page.click()</em> or <em>page.mouse.move()</em>, the library applies realistic curves and timing.</p></li><li><p><strong>Proxy integration</strong>: Traffic can be routed through any HTTP proxy, including authenticated and rotating proxies.</p></li><li><p><strong>Headless and headful support</strong>: rayobrowse supports both execution modes, even on GUI-less Linux servers.</p></li><li><p><strong>Live session viewer</strong>: A built&#8209;in noVNC interface (available at http://localhost:6080) lets you watch browser sessions in real time directly from the browser. 
This is particularly useful for debugging scraping flows and visually verifying fingerprint behavior.</p></li><li><p><strong>Official integrations</strong>:<strong> </strong>The browser integrates with common automation frameworks, namely Playwright, Puppeteer, Selenium, and Scrapy (via <em><a href="https://substack.thewebscraping.club/p/basic-scrapy-configuration">scrapy-playwright</a></em>), as well as emerging <a href="https://substack.thewebscraping.club/p/my-first-week-with-openclaw">AI&#8209;driven tools such as OpenClaw</a>. As of this writing, additional integrations (e.g., Firecrawl and LangChain) are planned.</p></li><li><p><strong>Remote/Cloud mode</strong>: rayobrowse can run as a <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#remote--cloud-mode-beta">remote browser service</a>. Your server requests new browser instances through a REST API, and workers connect directly to the returned CDP WebSocket endpoint. This is still a beta feature.</p></li><li><p><strong>API&#8209;driven browser management</strong>:<strong> </strong>The daemon exposes REST endpoints for creating, listing, and deleting browser sessions, allowing you to orchestrate multiple browsers across a distributed scraping infrastructure.</p></li></ul><h2>Technical Details About the Project</h2><p>Now that you know what the project is and the features it provides, you&#8217;re ready to dive into the technical aspects.</p><h3>How rayobrowse Works</h3><p>At a high level, rayobrowse follows these steps:</p><ol><li><p><strong>Chromium patching</strong>:<strong> </strong>The project tracks upstream Chromium releases and applies a focused set of patches (relying on an <a href="https://github.com/brave/brave-core/blob/master/tools/cr/plaster.py">approach similar to Brave&#8217;s &#8220;plaster&#8221; model</a>). 
These patches normalize exposed browser APIs, reduce fingerprint entropy leaks, improve automation compatibility, and preserve native Chromium behavior whenever possible.</p></li><li><p><strong>Fingerprint assignment</strong>: When a browser session starts, rayobrowse assigns a realistic device fingerprint.</p></li><li><p><strong>Automation integration</strong>: Browser automation libraries connect to rayobrowse through the native CDP.</p></li></ol><h3>Architecture</h3><p>Architecturally, rayobrowse follows a clean separation between the browser runtime and the automation code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vdVO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vdVO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 424w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 848w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png" width="1456" height="697" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;rayobrowse&#8217;s architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="rayobrowse&#8217;s architecture" title="rayobrowse&#8217;s architecture" srcset="https://substackcdn.com/image/fetch/$s_!vdVO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 424w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 848w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">rayobrowse&#8217;s architecture</figcaption></figure></div><p>In particular, the system runs as a Docker container that bundles three core components:</p><ol><li><p>A daemon server that manages browser sessions.</p></li><li><p>A browser manager that downloads and retrieves the correct version of Chromium, a fingerprint engine that injects realistic device profiles, and a stealth browser layer containing a custom Chromium build with stealth patches.</p></li><li><p>A <a href="https://github.com/novnc/noVNC">noVNC viewer</a>, which lets you watch browser sessions in real time. 
This is useful for debugging and demos.</p></li></ol><p>As you can see, the automation scripts don&#8217;t run inside the container. Instead, they run on the host machine and connect to the browser remotely through the Chrome DevTools Protocol.</p><p>When a new session starts, rayobrowse assigns a real-user-looking fingerprint from a large database of actual devices, containing thousands of permutations collected from websites Rayobyte owns.</p><h3>Requirements</h3><p>The rayobrowse project is designed to run on Linux servers without GPUs (which is a common deployment environment).</p><p>These are the required prerequisites:</p><ul><li><p>Docker, as the browser runs entirely inside a container.</p></li><li><p>~2GB of available RAM, as each browser instance uses ~300MB.</p></li></ul><p>The main benefit of this Docker-based approach is that you don&#8217;t need to install Chromium locally, configure fonts, or set up Xvfb manually. All of those dependencies live inside the container, which keeps the host machine clean, portable, and reproducible.</p><p>It also makes the project well-suited for self-hosted environments without exposing its internal Chromium patching logic, making it much harder for anti-bot solution providers to reverse engineer how it works.</p><p>In terms of compatibility, rayobrowse works on Linux, Windows (native or WSL2), and macOS. The supported architectures are <em>x86_64 (amd64)</em> and <em>ARM64</em> (Apple Silicon and AWS Graviton). 
Still, you don&#8217;t have to worry about the architecture, as Docker automatically pulls the correct image for the host machine.</p><p><strong>Optional</strong>: If you plan to use the stealth browser through the Python SDK, an additional requirement is Python 3.10+.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>How to Access rayobrowse</h2><p>There are two main ways you can access rayobrowse:</p><ol><li><p>The <em>/connect</em> endpoint.</p></li><li><p>The built-in Python SDK.</p></li></ol><h3>Method #1: Use the /connect Endpoint</h3><p>The first rayobrowse usage method involves connecting directly to the <em>/connect</em> endpoint. This allows any CDP&#8209;compatible tool (including Selenium, Playwright, and Puppeteer) to open a browser session simply by pointing to a WebSocket URL like <em>ws://localhost:9222/connect</em>.</p><p>For instance, take a look at the Playwright connection example below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Connect to rayobrowse via CDP
    browser = p.chromium.connect_over_cdp("ws://localhost:9222/connect")
    page = browser.new_context().new_page()

    # Automation logic...

    browser.close()</code></pre></div><p>Keep in mind that the WebSocket browser connection URL can be customized using query parameters, as follows:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ws://localhost:9222/connect?headless=false&amp;os=android&amp;proxy=http://user:pass@host:port</code></pre></div><p>This URL creates a rayobrowse Chromium browser session in headful mode, using Android-based fingerprints, while routing all requests through the proxy <em><a href="http://user:pass@host:port">http://user:pass@host:port</a></em>.</p><p>Explore all <em>/connect</em> query parameters <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#using-connect-simplest">in the docs</a>.</p><h3>Method #2: Use the Python SDK</h3><p>You can also interact with rayobrowse through the built-in Python SDK. This exposes a <em><a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#api-reference">create_browser()</a></em> function that returns a CDP WebSocket URL for a newly created browser instance. From there, connect using Playwright or another automation framework, as shown below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from rayobrowse import create_browser
from playwright.sync_api import sync_playwright

# Configure the rayobrowse connection to run in headful mode 
# while simulating a Windows-based fingerprint
ws_url = create_browser(headless=False, target_os="windows")

with sync_playwright() as p:
    # Connect to rayobrowse with the configured URL via CDP
    browser = p.chromium.connect_over_cdp(ws_url)
    page = browser.contexts[0].pages[0]
 
    # Automation logic...

    browser.close()</code></pre></div><p>This approach gives you more control over the browser lifecycle, but it also involves more configuration and setup.</p><p>For more examples (e.g., proxy integration and multi-browser management), <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#using-the-python-sdk">check out the docs</a>.</p><h2>Get Started with rayobrowse: Step-by-Step Guide</h2><p>In this guided section, I&#8217;ll show you how to build a simple Playwright script that connects to rayobrowse.</p><p>For the sake of simplicity, I&#8217;ll assume you already have:</p><ul><li><p>A Unix-like system (Linux, macOS, or Windows via WSL).</p></li><li><p>Docker installed and running on your machine.</p></li><li><p>Git installed locally.</p></li><li><p>A Python environment set up <a href="https://substack.thewebscraping.club/p/scraping-vs-playwright-web-scraping">with Playwright installed</a>.</p></li></ul><p>Follow the instructions below!</p><h3>Step #1: Clone the rayobrowse Repository</h3><p>The first step is to clone the rayobrowse repository to your machine:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">git clone https://github.com/rayobyte-data/rayobrowse</code></pre></div><p>Then, enter the project folder with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">cd rayobrowse</code></pre></div><p>The cloned folder already contains everything you need to get started, including:</p><ul><li><p><em>docker-compose.yml</em>: For running the browser container.</p></li><li><p><em>requirements.txt</em>: For installing the Python SDK.</p></li></ul><h3>Step #2: Set Up the Environment</h3><p>rayobrowse requires a 
.env file that contains the configuration needed to run the browser daemon. For a full list of available environment variables and what they enable, <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#environment-variables">explore the official documentation</a>.</p><p>Start by creating a <em>.env</em> file as a copy of the <em>.env.example</em> file coming with the repository:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">cp .env.example .env</code></pre></div><p>Then open the <em>.env</em> file and make sure it contains:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">STEALTH_BROWSER_ACCEPT_TERMS=true</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zjWr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zjWr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 848w, 
https://substackcdn.com/image/fetch/$s_!zjWr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png" width="1456" height="865" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Setting the STEALTH_BROWSER_ACCEPT_TERMS env&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Setting the STEALTH_BROWSER_ACCEPT_TERMS env" title="Setting the STEALTH_BROWSER_ACCEPT_TERMS env" srcset="https://substackcdn.com/image/fetch/$s_!zjWr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 848w, 
https://substackcdn.com/image/fetch/$s_!zjWr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Setting the STEALTH_BROWSER_ACCEPT_TERMS env</figcaption></figure></div><p>This confirms that you accept the 
project&#8217;s <a href="https://github.com/rayobyte-data/rayobrowse/blob/main/LICENSE">LICENSE</a>. Without that setting, the daemon will refuse to create browser sessions.</p><h3>Step #3: Start the Docker Container</h3><p>Launch the rayobrowse Docker container:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">docker compose up -d</code></pre></div><p>Docker will automatically pull the appropriate image for your system architecture (<em>x86_64</em> or <em>ARM64</em>). Then, it&#8217;ll start the container, as explained earlier.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FB1x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FB1x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 424w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 848w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png" width="1456" height="522" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output of the &#8220;docker compose up -d&#8221; command&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output of the &#8220;docker compose up -d&#8221; command" title="The output of the &#8220;docker compose up -d&#8221; command" srcset="https://substackcdn.com/image/fetch/$s_!FB1x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 424w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 848w, 
https://substackcdn.com/image/fetch/$s_!FB1x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1272w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The output of the &#8220;docker compose up -d&#8221; command</figcaption></figure></div><h3>Step #4: Connect via CDP 
and Apply the Automation Logic</h3><p>You can now connect to the running rayobrowse instance through the <em>/connect</em> endpoint using any CDP-compatible client. In this example, I&#8217;ll use Playwright with Python:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Connect to the rayobrowse browser through the CDP WebSocket endpoint
    browser = p.chromium.connect_over_cdp(
        "ws://localhost:9222/connect?headless=false&amp;os=windows"
    )

    # Create a new browser context and page
    page = browser.new_context().new_page()

    # Navigate to the target (sample) page
    page.goto("https://quotes.toscrape.com/")

    # Print the page title to verify the session is working
    print(page.title())  # Output: Quotes to Scrape

    # Add your scraping logic here...
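    # For example, a minimal sketch that is not part of the rayobrowse docs:
    # it assumes quotes.toscrape.com still wraps each quote in a ".quote"
    # element with ".text" and ".author" children
    for quote in page.locator(".quote").all():
        text = quote.locator(".text").inner_text()
        author = quote.locator(".author").inner_text()
        print(f"{author}: {text}")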

    # Close the browser session
    browser.close()</code></pre></div><p>At this point, write your scraping or automation logic, which will run inside the stealth Chromium browser provided by rayobrowse.</p><p>For debugging, you can watch the browser session live through noVNC at <em><a href="http://localhost:6080/vnc.html">http://localhost:6080/vnc.html</a></em>. While the script is running, you should see a headful Chromium session opening and navigating to the target page specified in the script:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v1V8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v1V8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png" width="1456" height="865" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Monitoring the target browser session at http://localhost:6080/vnc.html&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Monitoring the target browser session at http://localhost:6080/vnc.html" title="Monitoring the target browser session at http://localhost:6080/vnc.html" srcset="https://substackcdn.com/image/fetch/$s_!v1V8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Monitoring the target browser session at http://localhost:6080/vnc.html</figcaption></figure></div><p>As you can tell, the server creates a headful Chromium session (due to the <em>headless=false</em> query parameter) and connects it to the page requested by the script.</p><p><strong>Optional</strong>: If you want more control over the browser 
lifecycle, install the Python SDK with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">pip install -r requirements.txt</code></pre></div><p>Take a look at the <a href="https://github.com/rayobyte-data/rayobrowse/tree/main/examples">official examples in the repository</a> for more guidance.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Pricing and Limitations</h3><p>This is how the rayobrowse pricing model works:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bDvq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bDvq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 424w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 848w, 
https://substackcdn.com/image/fetch/$s_!bDvq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png" width="1456" height="1065" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1065,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202193,&quot;alt&quot;:&quot;rayobrowse&#8217;s pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/190103610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="rayobrowse&#8217;s pricing model" title="rayobrowse&#8217;s pricing model" srcset="https://substackcdn.com/image/fetch/$s_!bDvq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 424w, 
https://substackcdn.com/image/fetch/$s_!bDvq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 848w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">rayobrowse&#8217;s pricing model</figcaption></figure></div>
<p>What matters most for us developers is that you can run rayobrowse for free via self&#8209;hosting. In practice, the only real cost comes from proxies, which are necessary for scaling scraping workloads and avoiding IP bans (something that&#8217;s standard in most production scraping setups).</p><p>The main thing to keep in mind is that rayobrowse is still in beta. Rayobyte already uses it to scrape millions of pages per day, but results can vary depending on the target site and configuration.</p><p>Fingerprint coverage is currently strongest for Windows and Android, while macOS and Linux profiles are less mature. In addition, Canvas and WebGL fingerprinting are still evolving, which means some websites may detect the current implementation.</p><h2>Benchmarks and Final Comment</h2><p>To put rayobrowse to the test, I ran a simple script against a single page for each of the most popular anti&#8209;bot detection systems. 
These are the results I obtained:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lZAd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lZAd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 424w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 848w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1272w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png" width="1456" height="369" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:369,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80344,&quot;alt&quot;:&quot;Playwright vs rayobrowse: Benchmark comparison table&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/190103610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Playwright vs rayobrowse: Benchmark comparison table" title="Playwright vs rayobrowse: Benchmark comparison table" srcset="https://substackcdn.com/image/fetch/$s_!lZAd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 424w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 848w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1272w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Playwright vs rayobrowse: Benchmark comparison table</figcaption></figure></div><p><strong>Note:</strong> These tests were performed on my local machine using my ISP&#8217;s IP address.</p><p>As you can see, in this simple experiment rayobrowse achieved a 100% success rate, while Playwright failed consistently in headless mode and even struggled in some headful scenarios.</p><p>This suggests that the project is definitely worth keeping an eye on, especially thanks to its self&#8209;hosted nature.</p><p><em>To be honest, and this is just my personal opinion as an expert who works in this field, I don&#8217;t usually get very excited about 
projects like this. In my experience, many libraries of this type either get cracked down on or simply don&#8217;t receive the long&#8209;term support they deserve. In this case, however, things are a bit different. The project is closed&#8209;source and backed by a well&#8209;known company in the industry, which makes the expectations for its future understandably much higher!</em></p><p>Here, I covered what the project is about, what it offers, how it works, and how to use it. As always, remember to use rayobrowse only for legal and <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical web scraping</a>. Until next time!</p><div><hr></div><h2>FAQ</h2><h3>Why is rayobrowse based on Chromium and not Chrome?</h3><p>rayobrowse is based on Chromium simply because Chrome is closed-source. Plus, tests performed on difficult websites show no meaningful difference in detection rates between Chrome and Chromium. 
Using Chromium also avoids false positives and reflects the broader ecosystem of Chromium-based browsers like Brave, Edge, and Samsung Internet.</p><h3>Is rayobrowse open source?</h3><p>No. rayobrowse is closed-source, to prevent anti-bot companies from reverse-engineering it. Similar projects, like <a href="https://github.com/daijro/camoufox">Camoufox</a>, were quickly studied and countered once their code became public. Rayobyte decided to keep the project closed-source to help maintain its effectiveness and reliability over the long term.</p><h3>Can everyone use rayobrowse?</h3><p>No, not all companies can use rayobrowse. Its license prohibits organizations listed in <a href="https://cdn.sb.rayobyte.com/list-of-prohibited-companies.txt">Rayobyte&#8217;s restricted list</a> from using the software. For everyone else, the project is free to download and run locally.</p><h3>Does rayobrowse support proxy integration?</h3><p>Yes, rayobrowse fully supports proxy integration. You can route traffic through any HTTP proxy using the <em>proxy</em> query parameter on the <em>/connect</em> endpoint or via the <em>proxy</em> option exposed by the <em>create_browser()</em> function from the Python SDK. 
The proxy support includes authentication and rotating proxies.</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #101: Building an Internal Knowledge Base for Your Scraping Team]]></title><description><![CDATA[Every scraping team that survives long enough develops the same disease.]]></description><link>https://substack.thewebscraping.club/p/building-knowledge-base-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/building-knowledge-base-scraping</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 02 Apr 2026 19:17:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3dba6c6a-f027-4c60-ad27-2c2378c217c6_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every scraping team that survives long enough develops the same disease. Someone figures out how to bypass Cloudflare&#8217;s latest challenge, writes it up in Notion, and moves on. Three months later, a teammate runs into the same problem, spends two days reinventing the solution, and documents it in a Google Doc. Meanwhile, the original Notion page has become outdated because Cloudflare changed its challenge flow, and nobody updated it.</p><p>We have seen this pattern in every scraping operation we have worked with. The knowledge exists. It is just scattered across wikis, Slack threads, internal repos, and people&#8217;s heads. The real problem is not documentation; it is retrieval. People write things down. 
They just cannot find them when it matters.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>In <a href="https://substack.thewebscraping.club/p/ingest-web-data-rag-llm">THE LAB #77</a>, we explored the concept of RAG (Retrieval-Augmented Generation) applied to scraped data and showed how to build a basic knowledge assistant using FAISS. That was a proof of concept. This time we are going deeper. We are showing the production system we actually built and use daily, and we are explaining the reasoning behind each design choice: why markdown, how embeddings work, which chunking strategy actually performs better, and what role auto-tagging plays in retrieval.</p><p>After reading this article, we hope you will understand the mechanics well enough to build the same system for your team.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try 
Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>What we are building and why</h2><p>At TWSC, we have published around 300 articles over the past four years. Tutorials, reverse-engineering deep dives, tool comparisons, anti-bot analysis. When we sit down to write a new article, we need to remember what we have already covered, find previous work to link to, and check whether a technique we are about to describe was already explained in a past issue. Doing this by memory or by searching Substack&#8217;s archive stops working after the first hundred articles. </p><p>We also follow what the broader community publishes. Projects like <a href="https://crawl4ai.dev">Crawl4AI</a>, which appeared on Hacker News, show that the need to ingest web content into structured, LLM-ready knowledge bases is shared across the industry. The tools for crawling and extracting content keep getting better, but the retrieval side, finding the right piece of information in a growing archive, still requires a purpose-built system.</p><p>So we built one. Here is what the complete pipeline looks like:<br></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;631f6d4b-586d-4ef5-ba12-640a3cb186b0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Sources                                  Processing              Storage &amp; Retrieval
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;                                &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;              &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
Substack articles                   &#9472;&#9472;&#9488;
                                      &#9500;&#9472;&#9472;&gt; HTML-to-Markdown &#9472;&#9472;&gt; Frontmatter + Tagging &#9472;&#9472;&gt; Markdown files
Hacker News and other sources       &#9472;&#9472;&#9496;

Markdown files &#9472;&#9472;&gt; Chunker &#9472;&#9472;&gt; Embedder (e5-large-v2) &#9472;&#9472;&gt; PostgreSQL + pgvector

Search query &#9472;&#9472;&gt; Query embedding &#9472;&#9472;&gt; Cosine similarity search &#9472;&#9472;&gt; Ranked results</code></pre></div><p>Three stages, each independent and replaceable. You scrape content from your sources. You process and embed it. You search it. </p><p>If your team writes in Confluence instead of Substack, you swap the scraper. If you prefer Qdrant over pgvector, you swap the vector store. The architecture remains the same.<br><br>And here&#8217;s the hardware used for most of the steps, from embedding to the storage and retrieval: my DGX Spark.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yhsw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg" width="566" height="511.689557855127" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:961,&quot;width&quot;:1063,&quot;resizeWidth&quot;:566,&quot;bytes&quot;:181504,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192358785?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40edcbd4-e6c6-4172-bf4c-ee62da325b0f_1280x1707.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Yes, I know, it is probably overkill.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary 
button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The tools</h2><p><strong>Playwright</strong> handles browser-based scraping for our own Substack articles. Substack serves content dynamically and requires authentication for premium posts, so a plain HTTP client is not an option.</p><p><strong>The Hacker News Search API</strong> (powered by Algolia) provides structured search over HN stories. No scraping needed: HN exposes its full search index through public endpoints.</p><p><strong><a href="https://scrapegraphai.com/">ScrapegraphAI</a> and <a href="https://www.firecrawl.dev/">Firecrawl</a></strong> convert external article URLs into clean markdown. ScrapegraphAI is the primary extractor; Firecrawl is the fallback.</p><p><strong>sentence-transformers</strong> with the <code>intfloat/e5-large-v2</code> model generates 1024-dimensional embeddings. We will explain why we chose this model later in the article.</p><p><strong>PostgreSQL with pgvector</strong> stores embeddings and handles similarity search. 
We chose it over dedicated vector databases because we already need PostgreSQL for metadata, and pgvector with HNSW indexing handles our scale without adding infrastructure.</p><p><strong>Docker Compose</strong> ties everything together as three containers: PostgreSQL, the API server, and the indexer.</p><p><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved to paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">101.KNOWLEDGE_BASE</a>.</strong></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Why markdown as the universal format</h2><p>The first design choice we had to make was what format our knowledge base would store. We had content from Substack (HTML), Hacker News links (various formats), and potentially Confluence, Google Docs, or Slack in the future. We needed a common representation.</p><p>We chose markdown for three reasons.</p><p><strong>First</strong>, markdown preserves document structure without carrying rendering noise. An HTML page contains navigation bars, ad slots, JavaScript, CSS classes, and layout dividers. None of that is content. When you convert to markdown, you keep headings, paragraphs, code blocks, links, and lists. Everything the embedding model needs, nothing it would choke on.</p><p><strong>Second</strong>, markdown is readable by humans and machines alike. When something goes wrong in the pipeline, you can open a markdown file and immediately see what the system is working with. 
Try doing that with a serialized HTML DOM or a JSON blob from an API response.</p><p><strong>Third</strong>, YAML frontmatter is a natural fit for markdown and gives us a structured metadata header without mixing it into the content. Each file gets fields such as <code>id</code>, <code>type</code>, <code>title</code>, <code>publish_date</code>, <code>topics</code>, and <code>visibility</code>. This metadata drives filtering at search time and never enters the embedding model. The separation is important: embeddings capture meaning, frontmatter captures facts.</p><p>There are two paths to get content into markdown. You can build your own converter using open-source libraries, or you can use commercial services that handle extraction and conversion for you. In this article we show both approaches deliberately. For our own Substack articles, we built a converter from scratch with BeautifulSoup and markdownify. It costs nothing, we control every detail, and it works because we know the source HTML structure intimately. For external content discovered on Hacker News, we use commercial services like ScrapegraphAI and Firecrawl instead, because every URL leads to a different site with a different HTML structure. Building custom converters for thousands of unknown domains would be impractical. The trade-off is clear: when you control the source, build your own; when you are scraping the open web, commercial extraction services save an enormous amount of development time.</p><p>Our Substack HTML-to-markdown converter is deliberately simple. It strips scripts, styles, buttons, forms, navigation, and footers, then converts the remaining HTML:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;aa028f4f-e1d2-412f-88bc-29153974e70e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import re

from bs4 import BeautifulSoup  # the "lxml" parser below also needs the lxml package
from markdownify import markdownify


def html_to_markdown(html: str) -&gt; str:
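    soup = BeautifulSoup(html, "lxml")
    # Remove non-content elements together with everything inside them
    for tag in soup.find_all(["script", "style", "button", "form", "nav", "footer"]):
        tag.decompose()

    md = markdownify(
        str(soup),
        heading_style="ATX",  # "# Heading" style rather than underlined headings
        bullets="-",
        strip=["script", "style", "button", "form", "nav"],
    )
    # Collapse runs of four or more newlines down to three (at most two blank lines)
    md = re.sub(r"\n{4,}", "\n\n\n", md)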
    return md.strip()</code></pre></div><p>The final output for each document looks like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;95188632-dd66-4b1e-a5fe-167c1807dcdc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">---
id: a1b2c3d4e5f6...
type: twsc_article
title: "THE LAB #94: Using Cookies and Session Persistence"
slug: the-lab-94-using-cookies-and-session
canonical_url: https://substack.thewebscraping.club/p/the-lab-94-using-cookies-and-session
publish_date: 2025-11-15
visibility: premium
topics:
  - browser-automation
  - cloudflare
  - scraping-infra
---

[article body in markdown]</code></pre></div><h2>Scraping your own content</h2><p>The first source we built was a scraper for our own Substack articles. The pattern applies to any CMS: discover URLs, authenticate if needed, extract content, convert to markdown with frontmatter.</p><h3>URL discovery and authentication</h3><p>Most publishing platforms expose a sitemap. We fetch it, filter for article URLs (Substack uses <code>/p/</code> in the path), and track the <code>lastmod</code> date to detect changes:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;cf6dd5c0-8f99-4466-bc88-5bfe8f8b109a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen


def fetch_sitemap(sitemap_url: str) -&gt; list[dict]:
    req = Request(sitemap_url)
    req.add_header("User-Agent", "Mozilla/5.0 ...")
    with urlopen(req) as response:
        content = response.read()

    root = ET.fromstring(content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    articles = []
    for url_elem in root.findall("sm:url", ns):
        loc = url_elem.find("sm:loc", ns)
        lastmod = url_elem.find("sm:lastmod", ns)
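        if loc is not None and loc.text and "/p/" in loc.text:
            # lastmod is optional in the sitemap protocol, so guard against a missing element.
            lastmod_text = (lastmod.text or "") if lastmod is not None else ""
            articles.append({"url": loc.text.strip(), "lastmod": lastmod_text})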
    return articles</code></pre></div><p>Substack gates premium content behind authentication. We handle this with a persistent Playwright browser context that stores cookies across runs. On the first run you log in manually; after that, the saved session keeps you authenticated. For cron jobs, we verify the session by loading a known premium article and checking if the full content appears.</p><p>We try multiple CSS selectors for extraction because Substack has changed its DOM structure over time. The extracted HTML goes through the markdown converter we showed earlier.</p><h2>Ingesting external sources: Hacker News</h2>
      <p>
          <a href="https://substack.thewebscraping.club/p/building-knowledge-base-scraping">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Data Scraping for Market Research: A Developers Guide]]></title><description><![CDATA[Build scrapers that deliver real market intelligence, not just raw data dumps]]></description><link>https://substack.thewebscraping.club/p/data-scraping-market-research</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/data-scraping-market-research</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 29 Mar 2026 20:38:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e95388da-deb3-4a33-9e90-438b2658fddd_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Market research has always been about answering a simple question: &#8220;<em>What&#8217;s happening in the market, and how do I use that to make better decisions?&#8221;</em></p><p>The traditional way to answer that question involved surveys, focus groups, and expensive reports from firms that charge you a fortune for data that&#8217;s already a few months old by the time you read it. 
Today, the data you need is sitting on public web pages: You just need to collect it.</p><p>In this article, we&#8217;ll discuss how to scrape data for market research, what sources actually matter, how to build a pipeline that doesn&#8217;t fall apart after a week, and where the legal lines are.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What &#8220;Market Research&#8221; Actually Means for Web Scraping Professionals</h2><p>Market research needs to answer three questions:</p><ul><li><p>&#8220;<em>What are our competitors doing?</em>&#8221;</p></li><li><p>&#8220;<em>What are our customers saying?</em>&#8221;</p></li><li><p>&#8220;<em>How is the market moving?</em>&#8221;</p></li></ul><p>That&#8217;s it. Everything else is a variation of those three. And if you think about it, the web gives you access to all three, if you know where to look.</p><p>In practice, scraped market intelligence sits on three pillars:</p><ul><li><p><strong>Competitive data</strong>: Pricing, product catalogs, feature changes, hiring signals. This is the &#8220;what are they doing?&#8221; pillar.</p></li><li><p><strong>Customer sentiment</strong>: Reviews, forum discussions, social media posts. This is the &#8220;what are people saying?&#8221; pillar.</p></li><li><p><strong>Market signals</strong>: Job postings, regulatory filings, trend volumes, new product launches. This is the &#8220;where is the market going?&#8221; pillar.</p></li></ul><p>Now, why scraping instead of traditional research? Because scraping is real-time, it&#8217;s continuous, and it doesn&#8217;t depend on people filling out forms. A survey tells you what 500 people said last month. A scraper tells you what thousands of customers are saying right now, every single day, without anyone having to opt in.</p><p>That&#8217;s the competitive advantage. 
And it&#8217;s a big one.</p><div><hr></div><blockquote><p><em>For your scraping activity, you need IPs with good reputation. For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">that&#8217;s sharing with TWSC readers this offer</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" 
height="87.84" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a 
href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><h2>Where to Scrape: Sources That Actually Matter</h2><p>Not all sources are worth your time. You could scrape the entire Internet and still end up with nothing useful if you&#8217;re not targeting the right places. Below is a list of high-value targets for market research and what you can extract from each:</p><ul><li><p><strong>Competitor websites</strong>: Pricing pages, product pages, feature matrices, changelogs, and blog posts. This is your primary source for understanding what competitors are offering and how they position themselves. Pricing pages, in particular, are gold. They change more often than you&#8217;d think, and tracking those changes over time tells you a lot about a competitor&#8217;s strategy.</p></li><li><p><strong>Review platforms (G2, Trustpilot, Amazon, Yelp)</strong>: Customer pain points, feature requests, sentiment shifts. Reviews are unfiltered customer feedback. Nobody writes a G2 review because they were asked nicely in a survey. They write it because they feel strongly about something, and that&#8217;s exactly the kind of signal you want.</p></li><li><p><strong>Job boards (LinkedIn, Indeed)</strong>: Hiring patterns reveal where a company is investing. If a competitor suddenly posts 20 machine learning engineer roles, that tells you something no press release will. Job postings are one of the most underrated market research signals out there.</p></li><li><p><strong>Social media and forums (Reddit, X, niche communities)</strong>: Unfiltered opinions, emerging trends, early complaints about products. 
Reddit threads and niche forums are where people say what they actually think, not what they&#8217;d say in a focus group.</p></li><li><p><strong>Government and public data portals</strong>: SEC filings, patent databases, import/export records. These are slower-moving signals, but they&#8217;re authoritative. A patent filing can tell you what a competitor is building 18 months before it ships.</p></li></ul><p>Here&#8217;s the key question to ask yourself before adding a source to your scraper: <em>&#8220;Does this data answer a specific research question, or am I just hoarding?&#8221;</em> If you can&#8217;t tie a source to a concrete insight, skip it. You&#8217;ll save yourself storage costs, maintenance headaches, and potential legal issues.</p><h2>Building the Pipeline: From Raw HTML to Market Intelligence</h2><p>A market research scraper is not a one-off script you run from your terminal. It&#8217;s a pipeline. And pipelines need structure. If you treat it like a quick script, you&#8217;ll end up with a mess of cron jobs, inconsistent data formats, and no idea whether your data is fresh or stale. So, build it properly from the start.</p><p>A scraping pipeline for market intelligence should have four stages:</p><ol><li><p><strong>Collection</strong>: Fetch the pages, extract the fields you need, throw the rest away. Don&#8217;t store raw HTML &#8220;just in case&#8221; (you&#8217;ll learn why in the legal section of this article).</p></li><li><p><strong>Storage</strong>: Store facts and metadata (source URL, timestamp, extracted fields). Use a structure that makes deduplication and versioning easy. 
In practice, this means designing your schema around a composite key (for example: <em>source</em> + <em>entity ID</em> + <em>scraped timestamp</em>) so you can track how a data point changes over time without overwriting previous records.</p></li><li><p><strong>Transformation</strong>: Normalize the data across sources, deduplicate records, and enrich with additional context (geocoding, industry classification, entity linking).</p></li><li><p><strong>Analysis</strong>: Turn rows into insights. This is where the actual market research happens. And to be clear: &#8220;Analysis&#8221; doesn&#8217;t mean opening a CSV and scrolling through it. The goal is to turn your pipeline&#8217;s output into dashboards, scheduled reports, or Slack alerts that reach the people who make decisions. If the data sits in a database and nobody looks at it, the whole pipeline is wasted effort.</p></li></ol><h3>Scheduling Matters More Than You Think</h3><p>Different data types have different freshness requirements. Getting this wrong means either wasting resources or working with stale data. Here are rules of thumb for how often to run each type of scrape:</p><ul><li><p><strong>Price tracking</strong>: Daily or hourly, depending on the market. E-commerce prices can change multiple times a day. SaaS pricing pages change less often, but when they do, it&#8217;s significant.</p></li><li><p><strong>Review monitoring</strong>: Monitoring reviews daily is usually enough. Reviews don&#8217;t appear in real time, and sentiment trends are measured in weeks, not minutes.</p></li><li><p><strong>Job postings</strong>: A weekly schedule works for trend analysis of the job market. Remember that you&#8217;re looking for patterns, not individual listings.</p></li><li><p><strong>Social media</strong>: This depends on your use case. If you&#8217;re tracking a product launch or a PR crisis, you might need near-real-time collection. 
For general trend analysis, daily or even weekly batches work fine.</p></li></ul><h3>Tools That Work Well for Market Research Scraping</h3><p>You don&#8217;t need to reinvent the wheel. Mature, well-supported tools already cover every stage. Here&#8217;s a solid stack for a market research pipeline:</p><ul><li><p><strong><a href="https://www.scrapy.org/">Scrapy</a></strong> for structured crawling. <a href="https://substack.thewebscraping.club/p/scrapy-ten-years-of-scraping-framework">Scrapy&#8217;s architecture is designed for exactly this kind of work</a>: You define spiders per source, plug in middleware for proxy rotation and retry logic, and use item pipelines to clean and store data as it flows through. For market research specifically, Scrapy&#8217;s built-in feed exports let you dump results straight to JSON, CSV, or even S3 without writing custom I/O code. And if you need to coordinate multiple spiders (say, one per competitor), Scrapy&#8217;s project structure keeps things organized as your source list grows.</p></li><li><p><strong><a href="https://playwright.dev/">Playwright</a></strong> or <strong><a href="https://pptr.dev/">Puppeteer</a></strong> for JS-heavy pages. The key difference from Scrapy is that <a href="https://substack.thewebscraping.club/p/handling-infinite-scrolling-python-js">you&#8217;re running a real browser, which means you can handle dynamic content, infinite scroll</a>, and client-side rendering. The trade-off is resource cost: Each browser instance eats memory and CPU, so you don&#8217;t want to use this for targets that serve static HTML.</p></li><li><p><strong>A task queue</strong> for scheduling and orchestration. This is what turns a collection of scrapers into an actual pipeline. 
Instead of running scripts manually or relying on cron jobs, a task queue lets you schedule scrapes per source at different intervals, retry failed jobs automatically, and <a href="https://substack.thewebscraping.club/p/python-async-for-faster-scraping">control concurrency so you&#8217;re not overwhelming a target site with parallel requests</a>. It also gives you visibility: you can see what&#8217;s queued, what&#8217;s running, what failed, and why.</p></li><li><p><strong><a href="https://www.postgresql.org/">PostgreSQL</a></strong> for structured market data that needs querying and versioning. Relational databases shine here because market research data is inherently relational: competitors have products, products have prices, prices change over time.</p></li></ul><p>The point is this: Pick tools that let you build a maintainable system, not just a working script. Every tool in this stack solves a specific problem, and none of them requires you to build infrastructure from scratch. The best market research pipeline is the one that&#8217;s boring to operate, because boring means reliable.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Scaling Without Getting Blocked</h2><p>If you&#8217;re scraping one competitor once a week, you don&#8217;t need this section. If you&#8217;re tracking 50 competitors daily across thousands of pages, you do.</p><p>Here&#8217;s the reality: The moment you start scraping at scale, you become visible. And sites don&#8217;t like bots, even polite ones. So you need to be smart about how you scale. 
Consider the following rules of thumb to avoid getting blocked:</p><ul><li><p><strong>Proxy rotation</strong>: <a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies">Residential proxies for sensitive targets (sites with aggressive anti-bot systems), datacenter proxies for everything else</a>. Rotate per request or per session, depending on the site&#8217;s detection mechanisms. The key is to not send thousands of requests from the same IP in an hour.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/rate-limit-scraping-exponential-backoff">Rate limiting and backoff</a></strong>: Be a good citizen. If you hammer a site with concurrent requests, you&#8217;ll get blocked, and you&#8217;ll deserve it. Implement exponential backoff on failures, and set reasonable delays between requests. A 2-3 second delay between requests is a good starting point for most sites.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">Fingerprint management</a></strong>: Headers, TLS fingerprint, and browser-level signals matter on sites with serious anti-bot systems. Make sure your request headers look consistent and realistic.</p></li><li><p><strong>CAPTCHAs</strong>: <a href="https://substack.thewebscraping.club/p/are-captchas-still-a-thing">If you&#8217;re hitting CAPTCHAs regularly, your approach is too aggressive</a>. Fix the root cause (rate, fingerprint, proxy quality) before reaching for solver services. 
CAPTCHA solvers are a band-aid, not a solution.</p></li></ul><p>The general principle is simple: Scrape at a pace that doesn&#8217;t degrade the target site&#8217;s performance.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Turning Scraped Data into Actual Market Insights</h2><p>Let&#8217;s be clear about something: Raw scraped data is not market research. It&#8217;s just data. A CSV with 50,000 rows of competitor prices is not an insight. A chart showing that competitor X has dropped their enterprise tier price by 15% over three months: That&#8217;s an insight.</p><p>Here&#8217;s where the value gets created:</p><ul><li><p><strong>Price tracking and competitive benchmarking</strong>: Track changes over time, visualize trends, and set alerts for significant moves. The goal is not to know what a competitor charges today. It&#8217;s to understand their pricing trajectory. Are they moving upmarket? Are they running more frequent discounts? Are they simplifying their tier structure? This is where <a href="https://substack.thewebscraping.club/p/predictive-analytics-web-scraped-data">predictive analytics meets scraped data</a>, with the goal of anticipating your competitors&#8217; next moves.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/sentiment-analysis-amazon-reviews">Sentiment analysis on reviews</a></strong><a href="https://substack.thewebscraping.club/p/sentiment-analysis-amazon-reviews">: Use NLP to extract themes from customer reviews</a>. 
This is powerful for product teams who want to understand what customers love and hate about competitors. But remember: You&#8217;re analyzing the data internally, not republishing the reviews.</p></li><li><p><strong>Hiring signal analysis</strong>: Aggregate job postings by role type, department, and location. A competitor suddenly posting 15 ML engineer roles tells you they&#8217;re investing in AI. A wave of sales hiring in EMEA tells you they&#8217;re expanding geographically. This is a signal that&#8217;s almost impossible to get from any other source.</p></li><li><p><strong>Trend detection</strong>: Time-series analysis on product launches, feature changes, pricing moves, or social media mentions. <a href="https://substack.thewebscraping.club/p/scraping-data-anomaly-detection">The goal is to spot patterns or anomalies</a> before they become obvious. If three competitors all add the same feature within two months, that&#8217;s a market trend, not a coincidence.</p></li></ul><p>Overall, the <a href="https://substack.thewebscraping.club/p/building-a-scraper-dashboard-streamlit">output of your scraping pipeline should be dashboards</a>, reports, or automated alerts, not a database dump that someone has to manually dig through. If the insights don&#8217;t reach decision-makers in a usable format, the whole pipeline is wasted effort.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Legal and Ethical Considerations: Don&#8217;t Skip This Section</h2><p>I know, I know. You&#8217;re a developer, not a lawyer. But here&#8217;s something I&#8217;m sure you know: Most legal problems in scraping are self-inflicted. They happen because someone scraped &#8220;everything on the page,&#8221; stored it &#8220;for later,&#8221; and only then asked: <em>&#8220;Wait, can we actually use this?&#8221;</em></p><p>We covered this in detail in &#8220;<a href="https://substack.thewebscraping.club/p/avoid-copyright-violations-scraping">How to Avoid Copyright Violations While Scraping</a>&#8221;, so let&#8217;s briefly go through the key legal and <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical principles of web scraping</a>:</p><ul><li><p><strong>Scrape facts, not expression</strong>: Copyright protects expression, not facts. Prices, SKUs, dates, availability, and job titles are facts. No one owns the fact that a SaaS product costs $49/month. On the other hand, product descriptions, review text, and blog posts are creative expression.</p></li><li><p><strong>Don&#8217;t store raw pages by default</strong>: Storing the HTML of entire pages means creating copies of copyrighted content. Instead, parse in memory, extract only the fields you need, and discard the rest. 
If you need to debug, store a small sample with short retention.</p></li><li><p><strong>Respect </strong><em><strong>robots.txt</strong></em>: <a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">The </a><em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">robots.txt</a></em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications"> file is not the law, but ignoring it is evidence of bad faith if things go sideways</a>. In disputes, it can be used to show that you knew you were unwelcome and kept going anyway.</p></li><li><p><strong>Terms of Service matter</strong>: If the ToS explicitly forbids scraping and you scrape anyway, you may have a breach-of-contract problem. This is often easier for the site owner to prove than copyright infringement, because the argument is straightforward: you agreed to a contract, then you violated it.</p></li><li><p><strong>Don&#8217;t scrape behind a login</strong>: Once you log in, you&#8217;ve affirmatively agreed to a contract. Breaking that contract to scrape is a fast track to legal trouble. If your plan requires authenticated access, treat it as a licensing problem, not an engineering challenge.</p></li><li><p><strong>GDPR/CCPA</strong>: If you&#8217;re scraping anything that could be personal data (usernames, reviewer names, profile information), you need to know which privacy laws apply. This is especially relevant for review scraping and social media monitoring.</p></li></ul><p>Here&#8217;s the mental model that works: A price comparison tool that shows prices and links back to the source? Generally safe. A product catalog that copies descriptions, images, and reviews so users never need to visit the original site? 
That&#8217;s where you get into trouble, even if you never display the results publicly and only use them for internal analysis.</p><h2>Keeping Your Scrapers Alive: Monitoring and Maintenance</h2><p>Scrapers in production break for several reasons. Sites change layouts, add anti-bot measures, restructure their URLs, or just go down for maintenance. If you don&#8217;t monitor your scrapers, your data goes stale silently, and you won&#8217;t know until someone asks why the pricing dashboard hasn&#8217;t updated in three weeks.</p><p>Here&#8217;s a breakdown of what you need:</p><ul><li><p><strong>Dead selector detection</strong>: Alert when a CSS selector or XPath returns no matches across multiple consecutive runs. A selector that worked yesterday and returns nothing today usually means the site changed its HTML structure. The key phrase here is &#8220;multiple consecutive runs&#8221;. A single empty result could be a transient issue, so avoid alerting on the first failure. Instead, set a threshold, like three consecutive empty results, before flagging the selector. When the alert does fire, inspect the current page structure and update your selectors. Alternatively, try to <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">go beyond the DOM using AI and LLMs</a> to make your extraction more resilient to layout changes in the first place.</p></li><li><p><strong>HTTP status monitoring</strong>: A spike in 403s usually means you&#8217;re getting blocked. A spike in 429s means you&#8217;re hitting rate limits. A spike in 404s often means URLs have changed. Each of these requires a different response. For 403s, check your proxy pool and rotation logic: You might need fresher IPs or a lower request rate. For 429s, back off and increase your delays between requests; the site is telling you exactly what the problem is. 
For 404s, the target has likely restructured its URL patterns, which means you need to update your URL generation logic, not just retry the same broken links. Log these status codes per source and per run so you can spot trends early. A gradual increase in 403s over a week is a warning sign that your current setup is losing effectiveness, even if individual runs still return some data.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/ensuring-data-quality-in-web-scraping">Data quality checks</a></strong>: Track row counts, null rates, and value distributions. If your price tracker suddenly shows all prices as $0 or your review scraper returns empty text fields, you want to know immediately. Build quality checks into your pipeline as a post-scrape validation step, not as something you run manually. Compare each run&#8217;s output against baseline expectations: If you normally get 200 rows from a source and today you got 12, something is wrong, even if those 12 rows look fine individually.</p></li><li><p><strong>Automated tests against fixture HTML</strong>: Save sample HTML pages from your targets and write tests against them. Tests against saved fixtures catch regressions in your extraction logic before they reach production, and re-running them against a freshly captured fixture tells you whether the site has changed. Treat your scrapers like production code, because they are. In practice, this means saving a snapshot of the relevant section of the target page as a local HTML file. Then, write unit tests that run your extraction logic against that fixture and assert expected outputs. Store these fixtures in version control alongside your scraper code. When a site changes and your production scraper breaks, update the fixture with the new HTML, then fix the extraction logic until the tests pass again. This gives you a repeatable workflow for handling site changes instead of scrambling every time something breaks.</p></li></ul><p>The goal is simple: You should know when something breaks before your stakeholders do. 
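</p><p>To make the dead-selector threshold from the list above concrete, here&#8217;s a minimal Python sketch. It assumes a threshold of three consecutive empty runs, and the <code>DeadSelectorMonitor</code> class, the <code>span.price</code> selector, and the sample runs are hypothetical names made up for illustration, not part of any library:</p><pre><code>from collections import defaultdict

# Assumption: three consecutive empty runs, as suggested in the list above.
EMPTY_RUN_THRESHOLD = 3


class DeadSelectorMonitor:
    """Hypothetical helper: counts consecutive empty results per selector."""

    def __init__(self, threshold=EMPTY_RUN_THRESHOLD):
        self.threshold = threshold
        self.empty_streaks = defaultdict(int)

    def record(self, selector, matches):
        """Record one run's matches; return True when an alert should fire."""
        if matches:
            # Any non-empty result resets the streak (the failure was transient).
            self.empty_streaks[selector] = 0
            return False
        self.empty_streaks[selector] += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.empty_streaks[selector] == self.threshold


monitor = DeadSelectorMonitor()
runs = [["$49"], [], [], []]  # one good run, then three empty ones
alerts = [monitor.record("span.price", matches) for matches in runs]
print(alerts)  # prints [False, False, False, True]
</code></pre><p>In a real pipeline, you would likely persist the streak counts between runs and route the alert to whatever channel your team actually watches. 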
A Slack alert that says &#8220;Competitor X pricing scraper returned 0 results&#8221; is infinitely better than a product manager asking why the dashboard is empty.</p><h2>Conclusion</h2><p>In this article, you learned that market research scraping is about building a reliable pipeline that collects the right facts, transforms them into insights, and doesn&#8217;t get you in legal trouble.</p><p>The competitive advantage of scraping for market research is in what you do with the data. Anyone can code a scraper. But building a system that delivers reliable, actionable market intelligence week after week? That&#8217;s where the real value is!</p><p>So, let us know: Are you using web scraping for market research? What sources have you found most valuable? How did you structure your scraping pipeline? Let&#8217;s discuss in the comments!</p>]]></content:encoded></item></channel></rss>