<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Web Scraping Club]]></title><description><![CDATA[News, solutions and interviews about web scraping.
In this substack you will find weekly content about:
- Web Scraping techniques
- Interviews with key people in the industry
- Anti-bot information and countermeasures
- Real world examples and code]]></description><link>https://substack.thewebscraping.club</link><image><url>https://substackcdn.com/image/fetch/$s_!gJt2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1e343ec9-7946-4440-8c00-57209a1d99a1_1024x1024.png</url><title>The Web Scraping Club</title><link>https://substack.thewebscraping.club</link></image><generator>Substack</generator><lastBuildDate>Sun, 05 Jul 2026 06:27:52 GMT</lastBuildDate><atom:link href="https://substack.thewebscraping.club/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Web Scraping Club SRL]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[pier@thewebscraping.club]]></webMaster><itunes:owner><itunes:email><![CDATA[pier@thewebscraping.club]]></itunes:email><itunes:name><![CDATA[Pierluigi Vinciguerra]]></itunes:name></itunes:owner><itunes:author><![CDATA[Pierluigi Vinciguerra]]></itunes:author><googleplay:owner><![CDATA[pier@thewebscraping.club]]></googleplay:owner><googleplay:email><![CDATA[pier@thewebscraping.club]]></googleplay:email><googleplay:author><![CDATA[Pierluigi Vinciguerra]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[UDP, Proxies, and the WebRTC Leak]]></title><description><![CDATA[The proxy carries the UDP. The browser refuses to hand WebRTC over. 
We tested every combination to find where your IP actually leaks.]]></description><link>https://substack.thewebscraping.club/p/udp-proxies-and-the-webrtc-leak</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/udp-proxies-and-the-webrtc-leak</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 02 Jul 2026 03:00:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e7e6b5b3-fee9-4165-abe4-ec0cda3458fc_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>More and more of the web runs on UDP now. HTTP/3 runs over QUIC, QUIC runs over UDP, and a good chunk of the sites we scrape already answer this way. Proxy providers noticed, and some of them started adding UDP support, since that&#8217;s the only way a UDP protocol can pass through a proxy at all. We wanted to understand what this actually gives us as scrapers, because UDP reaches a browser in two very different ways, and a proxy treats them differently.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://visit.decodo.com/WyQ3mA" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="555" height="312.1875" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:555,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of 
success. </em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p><div><hr></div></blockquote><p>The first is the UDP we want, HTTP/3 for our requests. The second is the UDP we never asked for, the WebRTC traffic that a browser generates on its own, which can leak around the proxy and expose our real IP. But supporting UDP and actually getting a client to route its UDP through the proxy are two different problems, and the result changes a lot depending on the client. An HTTP client, a browser doing HTTP/3, and a browser doing WebRTC all behave differently, so we tested all three, end to end, to see where a UDP proxy does the work for us and where we still have to step in on the client side.</p><h2>A quick recap of HTTP/2 and HTTP/3</h2><p>We already covered the request side of HTTP/3 in <a href="https://substack.thewebscraping.club/p/faster-web-scraping-with-http3">Faster Web Scraping with HTTP/3</a>, so let&#8217;s just recap what matters for proxies. HTTP/1.1 and HTTP/2 run on TCP. HTTP/3 drops TCP completely and runs on QUIC, which is built on UDP and merges the transport and TLS handshakes into one. The part that matters for us is the transport: a proxy built to relay TCP has nothing to grab onto in a QUIC connection, since there&#8217;s no TCP stream to relay.</p><p>Adoption is real, but it&#8217;s easy to overstate. If we look at traffic, HTTP/3 is around a fifth of all requests, while HTTP/2 is still the majority at almost half, and this split has stayed roughly the same through 2026 (<a href="https://radar.cloudflare.com/adoption-and-usage">Cloudflare Radar</a>). 
If we look at site support instead, the number goes up: around 38% of sites advertise HTTP/3 via Alt-Svc (<a href="https://w3techs.com/technologies/details/ce-http3">W3Techs</a>). Italy, where our test machine sits, is one of the higher HTTP/3 markets at about 30%. So HTTP/3 is a serious, well-supported protocol, not some niche thing, but it hasn&#8217;t replaced HTTP/2 yet, and most real traffic is still on TCP.</p><div><hr></div><blockquote><p><em>HTTP/3 and QUIC run on UDP, so the proxy layer has to carry it too. <strong>anyIP</strong> gives compatible clients production-ready SOCKS5 UDP ASSOCIATE across its residential and mobile network.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://anyip.io/proxies/udp-proxy" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f6L2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png 424w, https://substackcdn.com/image/fetch/$s_!f6L2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png 848w, https://substackcdn.com/image/fetch/$s_!f6L2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png 1272w, https://substackcdn.com/image/fetch/$s_!f6L2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!f6L2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png" width="556" height="278" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:556,&quot;bytes&quot;:1075200,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://anyip.io/proxies/udp-proxy&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/203043903?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f6L2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png 424w, https://substackcdn.com/image/fetch/$s_!f6L2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png 848w, https://substackcdn.com/image/fetch/$s_!f6L2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png 1272w, 
https://substackcdn.com/image/fetch/$s_!f6L2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8be5a9-3a31-49b3-9912-82c5964632b8_4585x2293.png 1456w" sizes="100vw"></picture></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://anyip.io/proxies/udp-proxy?utm_source=thewebscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=gold_2026_1&amp;utm_content=banner_udp&quot;,&quot;text&quot;:&quot;Explore anyIP UDP Proxies&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button 
primary" href="https://anyip.io/proxies/udp-proxy?utm_source=thewebscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=gold_2026_1&amp;utm_content=banner_udp"><span>Explore anyIP UDP Proxies</span></a></p></blockquote><div><hr></div><p>The sales pitch follows naturally from here. If real browsers increasingly speak HTTP/3 and our scraper can&#8217;t, that difference becomes a signal, so we&#8217;d want to speak HTTP/3 too, and to do that through a proxy, the proxy has to carry UDP. </p><p>This is exactly what some providers are now selling. They offer UDP support, usually as a beta over SOCKS5, and they pitch it as a stealth feature: the idea is that most residential traffic is QUIC these days, so a client stuck on TCP stands out. There are two parts of this claim worth checking before relying on it, the stealth side and the tooling side, so we tested both, using a provider that carries UDP and one that doesn&#8217;t.</p><h2>UDP support and the stealth question</h2><p>Switching to HTTP/3 doesn&#8217;t remove protocol fingerprinting, it just moves it somewhere else. Detection happens at three layers, and moving from TCP to QUIC affects each one in a different way.</p><p>The TCP/IP stack fingerprint, meaning the SYN options, window sizes and TTL that give away the OS, really does disappear, because there&#8217;s no TCP handshake anymore. That&#8217;s the only layer that goes away, though. The TLS fingerprint stays exactly as it was, because QUIC still does a TLS 1.3 handshake, just wrapped inside QUIC instead of running over TCP. The ClientHello with its cipher suites and extensions is still there, and JA4 was designed to fingerprint the QUIC variant on purpose (<a href="https://scrapfly.io/web-scraping-tools/http3-quic-fingerprint">Scrapfly QUIC/HTTP3 fingerprint</a>). The HTTP/2 fingerprint, the SETTINGS frames, window updates and pseudo-header order, doesn&#8217;t get erased either. 
It just gets replaced by an HTTP/3 fingerprint built from QUIC transport parameters.</p><p>And those QUIC transport parameters are basically the same thing we were trying to get away from. Scrapfly says it clearly: they &#8220;work like HTTP/2 SETTINGS frames but at the transport layer.&#8221; Values like <code>max_udp_payload_size</code> and <code>initial_max_data</code>, and the order they come in, are different between Chrome&#8217;s quiche, Firefox, Go&#8217;s quic-go and ngtcp2, and that difference is itself a fingerprint (<a href="https://scrapfly.io/blog/posts/http2-http3-fingerprinting-guide">Scrapfly HTTP/2 and HTTP/3 fingerprinting</a>). QUIC even adds new signals that have no TCP equivalent at all, like 0-RTT behavior and connection migration support.</p><p>So the layer that actually gets scrapers blocked today is TLS and HTTP/2 fingerprinting, which Cloudflare, Akamai and DataDome score together as JA4 and JA4H. That&#8217;s a TCP-side problem, and it&#8217;s one we already solve with a client that imitates a browser&#8217;s handshake. HTTP/3 fingerprinting is growing but still immature. Scrapfly is pretty direct about it: it &#8220;adds signals on top of&#8221; HTTP/2 instead of replacing it, real Chrome on HTTP/2 is still accepted on the vast majority of sites, and the scraper-on-HTTP/2 versus user-on-HTTP/3 gap is only &#8220;starting to become a signal on Cloudflare&#8217;s premium tier.&#8221; There&#8217;s also a catch worth keeping in mind: a scraper that speaks HTTP/3 through a non-browser QUIC stack shows a QUIC fingerprint that matches no real browser, so it can stand out just as easily as it blends in. In other words, a UDP proxy is what gets us onto HTTP/3 in the first place, but the stealth benefit only pays off if the client presents a QUIC fingerprint that a real browser would. 
The proxy just carries the traffic; making it look right is still our job.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The tooling has not caught up yet</h2><p>The other half of the pitch is the tooling. The most common client for browserless scraping is <code>curl_cffi</code>, so the real question is whether it can actually do HTTP/3 in the setup a scraper normally runs in. We tested it (<code>run_curl_cffi_http3.py</code>), and you can find the code, as always, <a href="https://github.com/TheWebScrapingClub/thelab">on our GitHub repository, available to our paid readers</a>, under the folder <strong>109.UDP</strong>.</p><p>The library ships compiled with a QUIC stack, <code>ngtcp2</code> and <code>nghttp3</code>, and on a plain direct request it does reach HTTP/3. It even reaches HTTP/3 with a full Chrome fingerprint if we use the low-level interface, so impersonation and HTTP/3 aren&#8217;t really in conflict, even if the first error makes it look that way.</p><pre><code><code>high-level requests API:
  plain, no impersonation, V3ONLY        -&gt; HTTP/3
  Chrome impersonation, V3ONLY           -&gt; ERROR: QUIC needs at least TLS version 1.3
  Chrome impersonation, HTTP proxy, V3   -&gt; ERROR: HTTP/3 is not supported over an HTTP proxy
  Chrome impersonation, SOCKS5, V3       -&gt; ERROR: HTTP/3 is not supported over a SOCKS proxy
low-level Curl API (same impersonation):
  impersonate chrome + V3ONLY            -&gt; HTTP/3</code></code></pre><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><p>The TLS-version error in the high-level API is just a conflict inside the <code>requests</code> wrapper, not a real limitation: the low-level call combines the Chrome fingerprint with HTTP/3 and gets there fine. The lines that really matter are the two proxy ones. curl refuses HTTP/3 over a proxy completely, both HTTP and SOCKS, and it refuses before it even connects, so this is curl saying no, not the proxy. It happens even with a proxy that could carry the UDP. And since a production scraper almost always runs through a proxy, in practice <code>curl_cffi</code> ends up speaking HTTP/2 over TCP, and a UDP-capable proxy doesn&#8217;t change that at all.</p><p>So this is the honest picture on the request side. A proxy that carries UDP does its job for a client that tunnels its own UDP, but popular HTTP clients such as <code>curl_cffi</code> refuse HTTP/3 over a proxy right now, so the benefit is stuck waiting for the tooling to catch up. The proxy capability is arriving before the clients that could use it, which is good to know before expecting HTTP/3 to flow through a scraper just because the proxy supports UDP.</p><h2>The WebRTC leak, where UDP really bites</h2>
      <p>
          <a href="https://substack.thewebscraping.club/p/udp-proxies-and-the-webrtc-leak">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Public Sector Meets Web Scraping: From Scraped Data to Public Value]]></title><description><![CDATA[Using web scraping to fuel data pipelines that provide value to citizens and policymakers]]></description><link>https://substack.thewebscraping.club/p/public-sector-web-scraping-real-estate</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/public-sector-web-scraping-real-estate</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 28 Jun 2026 03:01:13 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0847fa1a-718c-4532-843c-254ba0748dcb_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There&#8217;s no doubt that web scraping supports a wide range of use cases, from price monitoring to data collection for AI model training. Most of these applications focus on generating value for businesses or individuals by supporting decision-making or automating workflows.</p><p>However, web scraping can also serve as a foundation for building the data infrastructure needed to better understand national markets and societal conditions.</p><p>In this blog post, I&#8217;ll show real-world examples of how web scraping is already used in the public sector, and how institutions could leverage it to improve policies and deliver higher-quality services to citizens.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p><div><hr></div></blockquote><h2>Why Web Scraping Matters for Public Institutions</h2><p>Public institutions have access to statistical datasets from official sources. These are certainly relevant, but they tend to provide a picture of the market only after events have already happened, sometimes months or even a year later.</p><p>That delay creates a challenge. After all, institutions need timely information to understand citizens&#8217; current needs and respond effectively. Another issue is that relevant data is often scattered across many sources, such as different websites, web portals, and marketplaces.</p><p>This is where web scraping can help! By collecting large volumes of publicly available data from multiple sites, institutions can complement and enrich official statistics.</p><p>Data collection is only the first step, though. 
The real value comes from building a complete data pipeline that includes cleaning, deduplication, aggregation, geolocation, statistical modeling, and interpretation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HzZ6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HzZ6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png 424w, https://substackcdn.com/image/fetch/$s_!HzZ6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png 848w, https://substackcdn.com/image/fetch/$s_!HzZ6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png 1272w, https://substackcdn.com/image/fetch/$s_!HzZ6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HzZ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png" width="1456" height="723" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:723,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Bayernheimerov&#225; Kl&#225;ra during her speech at Prague Crawl 2026&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Bayernheimerov&#225; Kl&#225;ra during her speech at Prague Crawl 2026" title="Bayernheimerov&#225; Kl&#225;ra during her speech at Prague Crawl 2026" srcset="https://substackcdn.com/image/fetch/$s_!HzZ6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png 424w, https://substackcdn.com/image/fetch/$s_!HzZ6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png 848w, https://substackcdn.com/image/fetch/$s_!HzZ6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png 1272w, https://substackcdn.com/image/fetch/$s_!HzZ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8619ffba-30f3-4287-97af-17b9f368bc1a_2912x1446.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Bayernheimerov&#225; Kl&#225;ra during her speech at Prague Crawl 2026</figcaption></figure></div><p>At Prague Crawl 2026, <a href="https://cz.linkedin.com/in/kl%C3%A1ra-bayernheimerov%C3%A1-53bbaa405">Bayernheimerov&#225; Kl&#225;ra</a> presented a compelling example of this approach. 
Her talk showed how the <em>Czech Ministry of Finance</em> relies on rental listing data collected by an external provider from several real estate portals and processes it through an end-to-end pipeline based on a dedicated statistical methodology.</p><p>As I&#8217;ll present shortly, the result is a reliable source of insights that supports housing policy decisions and affordable housing programs across the country, while also providing a valuable service to citizens.</p><p>What&#8217;s important to understand is that the goal isn&#8217;t simply to collect raw data, but to power services such as interactive maps, calculators, and dashboards that are useful to both public institutions and citizens.</p><h2>Case Study: Czech Rental Market Intelligence System</h2><p>To see an actual application of web scraping in the public sector, I&#8217;ll now present the initiative developed by the Czech Ministry of Finance for rental price analysis across Czechia.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://visit.decodo.com/WyQ3mA" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="521" height="293.0625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:521,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3>The Motivation Behind the Project</h3><p>Just like in <a href="https://www.europarl.europa.eu/topics/en/article/20241014STO24542/housing-crisis-why-prices-are-rising-and-what-the-eu-is-doing-about-it">many other European countries</a>, housing affordability has become a major social and economic challenge in the Czech Republic. Housing supply has struggled to keep up with demand, causing both property prices and rents to rise steadily across the country.</p><p>To address this issue, the Czech Ministry of Finance needed detailed and up-to-date information on local rental markets. The initiative was launched specifically to fill that gap, creating a web data analysis process that gives a continuously updated view of rental prices across the entire country.</p><p>The objective is to provide a reliable foundation for housing policy, affordable housing programs, and public-facing tools that help citizens better understand local rental markets.</p><h3>Web Data Sourcing</h3><p>The data for this initiative has been collected using <a href="https://apify.com/">Apify</a>, one of the largest marketplaces of ready-made tools for web scraping, automation, and AI.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QcM-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QcM-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png 424w, https://substackcdn.com/image/fetch/$s_!QcM-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png 848w, https://substackcdn.com/image/fetch/$s_!QcM-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png 1272w, https://substackcdn.com/image/fetch/$s_!QcM-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QcM-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png" width="1456" height="771" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:771,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Apify&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Apify" title="Apify" 
srcset="https://substackcdn.com/image/fetch/$s_!QcM-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png 424w, https://substackcdn.com/image/fetch/$s_!QcM-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png 848w, https://substackcdn.com/image/fetch/$s_!QcM-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png 1272w, https://substackcdn.com/image/fetch/$s_!QcM-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853bdd95-9cdc-487b-8abd-7b26009400da_2998x1587.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Apify</figcaption></figure></div><p>If you aren&#8217;t familiar with Apify, it provides over <a href="https://apify.com/store">41,000 ready-made serverless cloud programs</a> (called <em>Actors</em>) to automate a wide range of tasks, including web scraping across thousands of different domains.</p><p>These Actors are developed and maintained by the community (and in some cases by Apify itself) and run on Apify&#8217;s cloud infrastructure. You can use them directly through Apify Console via a no-code interface, or call them programmatically via API. They can also be integrated into workflow tools such as n8n, Make, and Zapier, or exposed to AI agents via MCP.</p><p>Two main reasons influenced the choice of Apify:</p><ol><li><p>It&#8217;s a Czech-based company.</p></li><li><p>It enables the Ministry of Finance to gather large-scale, up-to-date information on rental listings without maintaining scraping infrastructure in-house.</p></li></ol><p>In general, a solution like Apify lowers the barrier for public-sector teams, which may have limited engineering capacity.</p><h3>Downstream Data Pipeline</h3><p>At a high level, this is the <a href="https://mf.gov.cz/cs/rozpoctova-politika/podpora-projektoveho-rizeni/cenova-mapa/metodika">downstream data pipeline</a> implemented by the Czech Ministry of Finance:</p><ol><li><p><strong>Data ingestion</strong>: Collects rental listings via Apify-based web scraping from multiple Czech real estate portals.</p></li><li><p><strong>Data cleaning</strong>: Sorts data and removes duplicates, incorrect entries, and inconsistencies to ensure it is up to date, accurate, and free from deviations or 
distortions, following the procedure defined in <a href="https://www.cai.cz/OA/pdf/P456_2024_EN.pdf">Decree No. 456/2024 Coll</a>.</p></li><li><p><strong>Geolocation and aggregation</strong>: Standardizes addresses, assigns cadastral units, and enriches listings with attributes such as size, amenities, and building type.</p></li><li><p><strong>Statistical modelling</strong>: Uses hedonic regression with spatial and temporal weighting to estimate underlying rental price levels.</p></li><li><p><strong>Output generation</strong>: Presents results as interactive price maps and calculators.</p></li></ol><p><strong>Note</strong>: The pipeline also includes data update and recycling via a rolling window approach. This process adds new listings while retiring outdated observations to keep the dataset current.</p><p>To make rental prices comparable across different locations, the methodology is based on a <em>reference apartment</em>: a standardized apartment profile with predefined characteristics such as size, type, and furnishing level.</p><p>This approach reduces differences caused by individual property features and enables more consistent comparisons of rental price levels across municipalities and cadastral areas.</p><h3>Produced Output and Tools</h3><p>The Czech Rental Housing Price Map consists of two main solutions:</p><ul><li><p><strong><a href="https://mf.gov.cz/cs/rozpoctova-politika/podpora-projektoveho-rizeni/cenova-mapa/cenova-mapa-infografika">Interactive rent price map</a></strong>: Provides estimated rental prices at the level of municipalities and cadastral areas across the Czech Republic.</p></li><li><p><strong><a href="https://mf.gov.cz/cs/rozpoctova-politika/podpora-projektoveho-rizeni/cenova-mapa/kalkulacka-trzniho-najemneho">Market rent calculator</a></strong>: A practical tool that calculates a statistical estimate of rental price levels based on a standardized apartment profile and selected property 
characteristics.</p></li></ul><p><strong>Important</strong>: The two solutions aren&#8217;t intended to determine the market rent of a specific apartment. Instead, they return statistical estimates based on a standardized reference apartment. So, they&#8217;re intended primarily for market monitoring, regional comparisons, and housing policy analysis.</p><h4>Interactive rent price map</h4><p>The <em>interactive rent price map</em> shows minimum, maximum, and median rents for each area based on market listings, adjusted per square meter for a standard unfurnished reference apartment. It covers four size categories from 1+kk/1+1 to 4+kk/4+1.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P1Z4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P1Z4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png 424w, https://substackcdn.com/image/fetch/$s_!P1Z4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png 848w, https://substackcdn.com/image/fetch/$s_!P1Z4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!P1Z4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P1Z4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png" width="1456" height="823" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:823,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Czech interactive rent price map (as displayed on a Google-translated page in English)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Czech interactive rent price map (as displayed on a Google-translated page in English)" title="The Czech interactive rent price map (as displayed on a Google-translated page in English)" srcset="https://substackcdn.com/image/fetch/$s_!P1Z4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png 424w, https://substackcdn.com/image/fetch/$s_!P1Z4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png 848w, https://substackcdn.com/image/fetch/$s_!P1Z4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png 1272w, 
https://substackcdn.com/image/fetch/$s_!P1Z4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a7993d4-d09a-4e08-ad7d-57faf945ccec_2364x1336.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Czech interactive rent price map (as displayed on a Google-translated page in English)</figcaption></figure></div><p>Note that the map is interactive, and you can zoom in and out to explore individual cadastral areas. 
It&#8217;s also updated four times per year to keep the data current.</p><h4>Market rent calculator</h4><p>The <em>market rent calculator</em> is available as a form on the Czech government website. It works as follows:</p><ol><li><p>Select the territorial unit where you want to estimate the rent.</p></li><li><p>Choose the size category of the apartment.</p></li><li><p>Enter the floor area of the apartment (if you don&#8217;t know it, the calculator will automatically use a default value for the selected location).</p></li><li><p>Indicate whether the apartment is in a new building.</p></li><li><p>Optionally specify whether the building uses non-standard construction materials (e.g., other than brick or panel).</p></li><li><p>Optionally select additional features such as a terrace, furniture, or an assigned parking/garage space.</p></li><li><p>Click &#8220;Calculate rent&#8221; to obtain an estimate of the monthly market rent (CZK) and the corresponding price level, which represents the final estimated rent for the selected apartment profile.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NWAb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa824916b-9219-4efe-ba55-1e66699d364b_794x800.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NWAb!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa824916b-9219-4efe-ba55-1e66699d364b_794x800.gif 424w, https://substackcdn.com/image/fetch/$s_!NWAb!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa824916b-9219-4efe-ba55-1e66699d364b_794x800.gif 848w, 
https://substackcdn.com/image/fetch/$s_!NWAb!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa824916b-9219-4efe-ba55-1e66699d364b_794x800.gif 1272w, https://substackcdn.com/image/fetch/$s_!NWAb!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa824916b-9219-4efe-ba55-1e66699d364b_794x800.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NWAb!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa824916b-9219-4efe-ba55-1e66699d364b_794x800.gif" width="794" height="800" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a824916b-9219-4efe-ba55-1e66699d364b_794x800.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:794,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Result of a market rent calculator submission (as displayed on a Google-translated page in English)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Result of a market rent calculator submission (as displayed on a Google-translated page in English)" title="Result of a market rent calculator submission (as displayed on a Google-translated page in English)" srcset="https://substackcdn.com/image/fetch/$s_!NWAb!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa824916b-9219-4efe-ba55-1e66699d364b_794x800.gif 424w, 
https://substackcdn.com/image/fetch/$s_!NWAb!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa824916b-9219-4efe-ba55-1e66699d364b_794x800.gif 848w, https://substackcdn.com/image/fetch/$s_!NWAb!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa824916b-9219-4efe-ba55-1e66699d364b_794x800.gif 1272w, https://substackcdn.com/image/fetch/$s_!NWAb!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa824916b-9219-4efe-ba55-1e66699d364b_794x800.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" 
y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Result of a market rent calculator submission (as displayed on a Google-translated page in English)</figcaption></figure></div><h3>Impact</h3><p>The project provides a systematic and regularly updated overview of market rental prices across the entire Czech Republic, supporting consistent monitoring of housing market developments over time and across regions.</p><p>Beyond market observation, the outputs serve as a key evidence base for housing policy design and housing programmes within public administration. For example, the State Investment Support Fund (SFPI) relies on these results when working with its affordable rental housing schemes.</p><h2>Web Scraping in the Public Sector: Extending the Model</h2><p>The project carried out by the Czech Ministry of Finance is just one example of a much broader pattern: <em>how web scraping can be used as a foundation for modern public-sector data systems.</em></p><p>In particular, the same pipeline logic can be applied across many other domains. For example, in labor markets, web scraping can be used to collect job postings, salary ranges, and skill requirements from recruitment platforms. This can help governments identify regional skill shortages and design more targeted education or reskilling programmes.</p><p>Similarly, in consumer price monitoring, scraping can track grocery prices, housing costs, and essential goods across regions, enabling better inflation tracking and cost-of-living analysis. Other potential public domains that can benefit from this include energy consumption and transport accessibility.</p><p>When combined, those datasets become even more powerful. For instance, housing data, job market data, and grocery price data could be combined into a broader &#8220;liveability&#8221; index. 
That would help citizens and policymakers assess the overall affordability of living in a given region.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Building a Job Market Choropleth for Czech Districts</h3><p>In this section, I&#8217;ll guide you through building a job market choropleth across Czech districts. The idea is to show how web scraping, combined with a complete data pipeline, can be applied to scenarios beyond the previous example.</p><p><strong>Important</strong>: The project below isn&#8217;t related to the initiative from the Czech Ministry of Finance and has been created purely for illustrative purposes.</p><h4>Prerequisites</h4><p>To follow this tutorial section, make sure you have:</p><ul><li><p>An <a href="https://console.apify.com/sign-up">Apify account</a> (a free plan is sufficient).</p></li><li><p><a href="https://www.python.org/downloads/">Python 3.11+</a> installed locally.</p></li></ul><p>I&#8217;ll also assume you have a Python project set up locally with the following libraries installed:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">pip install pandas geopandas geopy matplotlib</code></pre></div><p>These are the project&#8217;s dependencies, used as follows:</p><ul><li><p><em>pandas</em>: Load, clean, and manipulate the scraped job posting data before geospatial analysis.</p></li><li><p><em>geopandas</em>: Handle geographic data, perform 
spatial joins, and create district-level maps.</p></li><li><p><em>geopy</em>: Convert job location addresses into geographic coordinates through geocoding.</p></li><li><p><em>matplotlib</em>: Visualize the results by generating a choropleth map showing job density by district.</p></li></ul><h4>Step #1: Access the Indeed Jobs Scraper Apify Actor</h4><p>Just like in the Czech Ministry of Finance example, you can use Apify to collect the source data. This saves you from building and maintaining a complete job scraping pipeline from scratch.</p><p>For this project, we&#8217;ll use Indeed as the data source. <a href="https://cz.indeed.com/">Indeed has a Czech version</a>, and most listings include the full office address where employees are expected to work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!475m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!475m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png 424w, https://substackcdn.com/image/fetch/$s_!475m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png 848w, https://substackcdn.com/image/fetch/$s_!475m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png 1272w, 
https://substackcdn.com/image/fetch/$s_!475m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!475m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png" width="1456" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Czech version of Indeed&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Czech version of Indeed" title="The Czech version of Indeed" srcset="https://substackcdn.com/image/fetch/$s_!475m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png 424w, https://substackcdn.com/image/fetch/$s_!475m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png 848w, https://substackcdn.com/image/fetch/$s_!475m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png 1272w, 
https://substackcdn.com/image/fetch/$s_!475m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5558d37-c909-4466-8ff1-e002a876f299_3024x1596.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Czech version of Indeed</figcaption></figure></div><p>Specifically, I recommend using the <em><a href="https://apify.com/borderline/indeed-scraper">Indeed Jobs Scraper [PPR]</a></em> Actor. 
This automates the extraction of job titles, salaries, locations, company information, and job descriptions from Indeed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d8Ku!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d8Ku!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png 424w, https://substackcdn.com/image/fetch/$s_!d8Ku!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png 848w, https://substackcdn.com/image/fetch/$s_!d8Ku!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png 1272w, https://substackcdn.com/image/fetch/$s_!d8Ku!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d8Ku!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png" width="1456" height="788" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:788,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Indeed Jobs Scraper [PPR] Actor&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Indeed Jobs Scraper [PPR] Actor" title="The Indeed Jobs Scraper [PPR] Actor" srcset="https://substackcdn.com/image/fetch/$s_!d8Ku!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png 424w, https://substackcdn.com/image/fetch/$s_!d8Ku!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png 848w, https://substackcdn.com/image/fetch/$s_!d8Ku!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png 1272w, https://substackcdn.com/image/fetch/$s_!d8Ku!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23e4310-03be-4e81-a51a-26a0c8b0c225_3028x1638.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Indeed Jobs Scraper [PPR] Actor</figcaption></figure></div><p>To get started, <a href="https://console.apify.com/">log in to your Apify account</a> and select the &#8220;Apify Store&#8221; option from the left-hand navigation menu. 
Then, search for &#8220;Indeed Jobs Scraper [PPR]&#8221;:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RkuG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RkuG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png 424w, https://substackcdn.com/image/fetch/$s_!RkuG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png 848w, https://substackcdn.com/image/fetch/$s_!RkuG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!RkuG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RkuG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png" width="1456" height="781" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Selecting the Indeed Jobs Scraper [PPR] Actor&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Selecting the Indeed Jobs Scraper [PPR] Actor" title="Selecting the Indeed Jobs Scraper [PPR] Actor" srcset="https://substackcdn.com/image/fetch/$s_!RkuG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png 424w, https://substackcdn.com/image/fetch/$s_!RkuG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png 848w, https://substackcdn.com/image/fetch/$s_!RkuG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!RkuG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc15ea73-ddd2-4de7-a70b-79c5badec120_3026x1624.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Selecting the Indeed Jobs Scraper [PPR] Actor</figcaption></figure></div><p>Click the Actor card to open its page.</p><h4>Step #2: Run the Job Scraping Task</h4><p>On the Actor page, you&#8217;ll find an input form that lets you configure the scraper before running it in the cloud. 
Select &#8220;Czech Republic&#8221; as the &#8220;Country&#8221; and enter &#8220;Software engineer&#8221; in the &#8220;Query&#8221; field:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iAmT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iAmT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png 424w, https://substackcdn.com/image/fetch/$s_!iAmT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png 848w, https://substackcdn.com/image/fetch/$s_!iAmT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!iAmT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iAmT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png" width="1456" height="784" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:784,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Configuring the Actor run&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Configuring the Actor run" title="Configuring the Actor run" srcset="https://substackcdn.com/image/fetch/$s_!iAmT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png 424w, https://substackcdn.com/image/fetch/$s_!iAmT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png 848w, https://substackcdn.com/image/fetch/$s_!iAmT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!iAmT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe61da07b-e6ea-4ec6-a416-0d52b448a1e9_3024x1628.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Configuring the Actor run</figcaption></figure></div><p><strong>Note</strong>: In a real-world scenario, configure the Actor&#8217;s inputs according to your needs.</p><p>Next, toggle the option to avoid duplicates so that the Actor takes care of data deduplication for you:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bEGj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bEGj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png 
424w, https://substackcdn.com/image/fetch/$s_!bEGj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png 848w, https://substackcdn.com/image/fetch/$s_!bEGj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png 1272w, https://substackcdn.com/image/fetch/$s_!bEGj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bEGj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png" width="1456" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Running the Actor&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Running the Actor" title="Running the Actor" srcset="https://substackcdn.com/image/fetch/$s_!bEGj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png 424w, 
https://substackcdn.com/image/fetch/$s_!bEGj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png 848w, https://substackcdn.com/image/fetch/$s_!bEGj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png 1272w, https://substackcdn.com/image/fetch/$s_!bEGj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f001863-b7c3-4dac-b63d-9db197be7265_1628x612.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Running the Actor</figcaption></figure></div><p>Click the &#8220;Save &amp; Start&#8221; button to launch the scraping task. The Actor will start running directly in Apify Console. As the scraper progresses, you&#8217;ll see the extracted job postings appear in real time. Be patient, as the process may take a few minutes depending on the number of matching listings.</p><h4>Step #3: Explore the Output and Export the Results</h4><p>Once the run completes, you&#8217;ll be able to explore the scraped dataset directly in Apify Console. As you can see, the Actor returns structured Indeed job data:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1diH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1diH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png 424w, https://substackcdn.com/image/fetch/$s_!1diH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png 848w, https://substackcdn.com/image/fetch/$s_!1diH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1diH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1diH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png" width="1456" height="780" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:780,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The scraped data in a structured format&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The scraped data in a structured format" title="The scraped data in a structured format" srcset="https://substackcdn.com/image/fetch/$s_!1diH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png 424w, https://substackcdn.com/image/fetch/$s_!1diH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png 848w, https://substackcdn.com/image/fetch/$s_!1diH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1diH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229bf5de-d82b-49cb-8f6f-191223a0caa1_3072x1646.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The scraped data in a structured format</figcaption></figure></div><p>In this case, the Actor retrieved 63 job postings. That may seem like a small number, but keep in mind that the scraper targets only jobs published during the last 14 days. 
This helps ensure that the dataset reflects the current state of the job market.</p><p>Next, switch to the JSON view and select &#8220;All fields&#8221;. You&#8217;ll notice that each job posting includes a <em>location</em> object containing address information for the position:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!phqa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!phqa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png 424w, https://substackcdn.com/image/fetch/$s_!phqa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png 848w, https://substackcdn.com/image/fetch/$s_!phqa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!phqa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!phqa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png" width="1456" height="811" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the &#8220;location&#8221; field on each job object in the scraped JSON dataset&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the &#8220;location&#8221; field on each job object in the scraped JSON dataset" title="Note the &#8220;location&#8221; field on each job object in the scraped JSON dataset" srcset="https://substackcdn.com/image/fetch/$s_!phqa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png 424w, https://substackcdn.com/image/fetch/$s_!phqa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png 848w, https://substackcdn.com/image/fetch/$s_!phqa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!phqa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4c1481-fc0f-4dc1-9f09-6ae9bdba8709_2584x1440.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption">Note the &#8220;location&#8221; field on each job object in the scraped JSON dataset</figcaption></figure></div><p>That location data is exactly what you need to visualize job openings on a map and analyze their geographic distribution across Czech districts.</p><p>Finally, open the &#8220;Storage&#8221; tab, select the &#8220;JSON&#8221; export format, and click &#8220;Download&#8221; to export the scraped dataset:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q-Mi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q-Mi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png 424w, https://substackcdn.com/image/fetch/$s_!Q-Mi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png 848w, https://substackcdn.com/image/fetch/$s_!Q-Mi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png 1272w, https://substackcdn.com/image/fetch/$s_!Q-Mi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q-Mi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Downloading the dataset&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Downloading the dataset" title="Downloading the 
dataset" srcset="https://substackcdn.com/image/fetch/$s_!Q-Mi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png 424w, https://substackcdn.com/image/fetch/$s_!Q-Mi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png 848w, https://substackcdn.com/image/fetch/$s_!Q-Mi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png 1272w, https://substackcdn.com/image/fetch/$s_!Q-Mi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7ee34b-f39c-4fd2-9b0d-81592b4287fb_2566x1444.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption">Downloading the dataset</figcaption></figure></div><p>A file with a name similar to <em>dataset_indeed-scraper_2026-06-18_11-54-00-555.json</em> will be downloaded. Rename it to <em>jobs.json</em> and place it in your Python project&#8217;s root directory.</p><h3>Step #4: Download the Required Czech GeoJSON Data</h3><p>To visualize the distribution of job openings across Czech districts, you&#8217;ll need a <a href="https://geojson.org/">GeoJSON dataset</a> containing the geographic boundaries of those districts.</p><p>Generally, that type of data is open and publicly available. For example, one possible source is the <em><a href="https://github.com/siwekm/czech-geojson">siwekm/czech-geojson</a></em> repository, which provides GeoJSON files for various Czech administrative divisions:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2hV0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2hV0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png 424w, 
https://substackcdn.com/image/fetch/$s_!2hV0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png 848w, https://substackcdn.com/image/fetch/$s_!2hV0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png 1272w, https://substackcdn.com/image/fetch/$s_!2hV0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2hV0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png" width="1456" height="772" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:772,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Downloading the &#8220;okresy.json&#8221; file from the siwekm/czech-geojson repository&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Downloading the &#8220;okresy.json&#8221; file from the siwekm/czech-geojson repository" title="Downloading the &#8220;okresy.json&#8221; file from the siwekm/czech-geojson repository" 
srcset="https://substackcdn.com/image/fetch/$s_!2hV0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png 424w, https://substackcdn.com/image/fetch/$s_!2hV0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png 848w, https://substackcdn.com/image/fetch/$s_!2hV0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png 1272w, https://substackcdn.com/image/fetch/$s_!2hV0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad6e02-74aa-4b7e-80fa-672dcd82682c_3036x1610.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption">Downloading the &#8220;okresy.json&#8221; file from the siwekm/czech-geojson repository</figcaption></figure></div><p>Download the <em>okresy.json</em> file, which contains the geographic boundaries of Czech districts (<em>okresy</em> in Czech).</p><p><strong>Note</strong>: Although the <em><a href="https://en.wikipedia.org/wiki/Districts_of_the_Czech_Republic">okresy</a></em><a href="https://en.wikipedia.org/wiki/Districts_of_the_Czech_Republic"> administrative system was officially abolished in 2003</a>, these districts are still used for statistical analysis. This makes them a good choice for visualizing the distribution of job opportunities across the country.</p><p>Once downloaded, add the <em>okresy.json</em> file to your Python project. At this point, your project structure should look similar to this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">&#9500;&#9472;&#9472; jobs.json
&#9500;&#9472;&#9472; okresy.json
&#9492;&#9472;&#9472; main.py</code></pre></div><h3>Step #5: Visualize the Job Data on a Map</h3><p>In your Python file, define the complete pipeline with the following code:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># pip install pandas geopandas geopy matplotlib

import json
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

# Load the scraped Indeed job openings
with open("jobs.json", "r", encoding="utf-8") as f:
    jobs = json.load(f)

# Collect one address per job posting
addresses = []
for job in jobs:
    # "location" may be missing or null, so fall back to an empty dict
    loc = job.get("location") or {}

    # Prefer fullAddress, fallback to formattedAddressLong or formattedAddressShort
    address = (
        loc.get("fullAddress")
        or loc.get("formattedAddressLong")
        or loc.get("formattedAddressShort")
    )

    # Skip postings with no address or with a remote-only placeholder
    if address and address.strip().lower() not in ["home office", "remote"]:
        addresses.append(address)

# Duplicate addresses are kept so that each entry still counts as one job posting

# Initialize geocoder for Czech job locations
geolocator = Nominatim(user_agent="cz_job_map")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Geocode all job addresses into latitude/longitude points
rows = []
for address in addresses:
    loc = geocode(f"{address}, Czech Republic")

    # Keep only successfully geocoded results
    if loc:
        rows.append({
            "address": address,
            "lat": loc.latitude,
            "lon": loc.longitude
        })

# Convert geocoded results into a DataFrame
df = pd.DataFrame(rows)

# Convert DataFrame into GeoDataFrame with point geometries
points = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.lon, df.lat),
    crs="EPSG:4326"
)

# Load Czech administrative districts (okresy) from a local GeoJSON file
okresy = gpd.read_file("okresy.json")

# Align coordinate reference systems between datasets
if okresy.crs != points.crs:
    okresy = okresy.to_crs(points.crs)

# Assign each job point to its corresponding district
joined = gpd.sjoin(
    points,
    okresy,
    predicate="within",
    how="left"
)

# Count number of jobs per district
counts = (
    joined["id"]
    .value_counts()
    .reset_index()
)

counts.columns = ["id", "count"]


# Merge job counts back into district geometries
okresy = okresy.merge(
    counts,
    on="id",
    how="left"
)

# Replace missing values with 0 (districts with no jobs)
okresy["count"] = okresy["count"].fillna(0)

# Plot choropleth map with job counts per district
fig, ax = plt.subplots(figsize=(10, 8))

okresy.plot(
    column="count",
    cmap="Reds",
    edgecolor="black",
    linewidth=0.3,
    legend=True,
    vmin=0,
    ax=ax
)

# Add labels for districts with job activity
top = okresy[okresy["count"] &gt; 0].sort_values("count", ascending=False)

for _, row in top.iterrows():
    centroid = row.geometry.centroid

    ax.text(
        centroid.x,
        centroid.y,
        str(int(row["count"])),
        fontsize=9,
        ha="center",
        va="center",
        # White text on districts above half the maximum count, black elsewhere
        color="white" if row["count"] &gt; okresy["count"].max() * 0.5 else "black",
        fontweight="bold"
    )

ax.set_title(
    "Software engineering job postings on Indeed in the Czech Republic (last 14 days, by district)"
)
ax.axis("off")

plt.show()</code></pre></div><p>Here&#8217;s what the above script does:</p><ol><li><p>Loads scraped job data from <em>jobs.json</em> and extracts valid physical addresses, filtering out remote or missing location entries.</p></li><li><p>Initializes the <em><a href="https://nominatim.org/">Nominatim</a></em><a href="https://nominatim.org/"> geocoder</a> with a rate limiter (at most one request per second) to convert addresses into latitude and longitude without exceeding request limits.</p></li><li><p>Geocodes each address into coordinates and keeps only successful results, storing them in a structured pandas <em>DataFrame</em>.</p></li><li><p>Converts the <em>DataFrame</em> into a <em><a href="https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html">GeoDataFrame</a></em> with point geometries so the job data can be used in spatial analysis workflows.</p></li><li><p>Loads Czech district boundaries from the <em>okresy.json</em> file and ensures both datasets use the same coordinate reference system.</p></li><li><p>Performs a spatial join to assign each job point to a district, then aggregates job counts per district and merges the results back into the district geometries.</p></li><li><p>Visualizes the final dataset as a choropleth map, coloring districts by job count and labeling every district that has at least one posting.</p></li></ol><h3>Step #6: Run the Script</h3><p>Execute the script, and you&#8217;ll get a result like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_TvK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_TvK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png 424w, https://substackcdn.com/image/fetch/$s_!_TvK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png 848w, https://substackcdn.com/image/fetch/$s_!_TvK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!_TvK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_TvK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png" width="1456" height="836" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The map produced by the 
script&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The map produced by the script" title="The map produced by the script" srcset="https://substackcdn.com/image/fetch/$s_!_TvK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png 424w, https://substackcdn.com/image/fetch/$s_!_TvK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png 848w, https://substackcdn.com/image/fetch/$s_!_TvK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!_TvK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931a517a-9779-46bb-92b0-239a1c0172a4_2462x1414.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption">The map produced by the script</figcaption></figure></div><p>Notice how most fresh software engineering openings are concentrated in districts surrounding Prague, Brno, and Ostrava. These are the three largest cities in the Czech Republic, so the result makes sense!</p><p>This was just a simple example, but you can use the same approach to build an interactive map or add further features, such as the average salary for each position by district, along with more advanced analytics.</p><h2>Value and Playbook for Other Governments</h2><p>The approach presented in this blog post adapts to any public dataset with a spatial dimension, where location plays a key role in understanding patterns and inequalities. From it, a reusable, country-independent playbook emerges:</p><ol><li><p>Collect data via web scraping (or from other administrative or public sources).</p></li><li><p>Clean and validate the data.</p></li><li><p>Enrich and aggregate it with contextual and geographic features.</p></li><li><p>Apply statistical or analytical models.</p></li><li><p>Present the results through maps, dashboards, or other interactive tools.</p></li></ol><p>As you can tell, this pipeline isn&#8217;t domain-specific and can be reused across many different policy areas. 
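</p><p>The five steps above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records, the district names, and the function names (<em>collect</em>, <em>clean</em>, <em>aggregate</em>, <em>present</em>) are hypothetical placeholders, and the modelling step is reduced to a simple count per district:</p>

```python
from collections import Counter

# Hypothetical sample records standing in for scraped or administrative data
RAW_RECORDS = [
    {"title": "Backend Engineer", "district": "Praha"},
    {"title": "Data Engineer", "district": " brno "},
    {"title": "QA Engineer", "district": ""},  # missing location
    {"title": "Frontend Engineer", "district": "Praha"},
]


def collect():
    """Step 1: stand-in for a scraper run or a public data source."""
    return list(RAW_RECORDS)


def clean(records):
    """Step 2: drop records without a usable location and normalize the rest."""
    cleaned = []
    for record in records:
        district = record["district"].strip().title()
        if district:
            cleaned.append({**record, "district": district})
    return cleaned


def aggregate(records):
    """Steps 3-4: enrichment and modelling, reduced here to a count per district."""
    return Counter(record["district"] for record in records)


def present(counts):
    """Step 5: stand-in for a map or dashboard; returns plain-text summary lines."""
    return [f"{district}: {count}" for district, count in counts.most_common()]


summary = present(aggregate(clean(collect())))
print("\n".join(summary))
```

<p>Because each stage depends only on the output of the previous one, swapping the data source or the final visualization leaves the rest of the chain untouched.</p><p>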
Finally, it&#8217;s crucial that the adopted methodology remains open and clearly explained to the public. This fosters transparency, reproducibility, and trust in data-driven decision-making.</p><div><hr></div><h2>Conclusion</h2><p>Here, I&#8217;ve shown how the Czech Ministry of Finance uses a web scraping service as part of a full data pipeline to turn rental listings into actionable insights for housing policy and public tools.</p><p>By accessing data collected from multiple real estate portals and processing it through cleaning, geolocation, and statistical modelling, they transform raw web data into interactive maps and rent calculators that support decision-making.</p><p>In this article, I built a similar pipeline for a different use case: a job market choropleth across Czech districts. I started from scraped Indeed listings, geocoded job locations, mapped them to districts, and visualized the results on a map.</p><p>I hope this example was useful and inspiring. If you have questions or ideas, feel free to share them in the comments below.</p>
]]></content:encoded></item><item><title><![CDATA[THE LAB #108: How LLMs and Manus AI are transforming bot detection]]></title><description><![CDATA[AI is making it faster and cheaper to deobfuscate code. This changes how antibot solutions should be built.]]></description><link>https://substack.thewebscraping.club/p/the-lab-108-how-llms-and-manus-ai</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/the-lab-108-how-llms-and-manus-ai</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Fri, 26 Jun 2026 09:33:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8a36775a-3d27-4850-9f9e-39d54e5b8d7e_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On June 12, 2026, the US government ordered Anthropic to cut off public access to its two strongest models, Claude Fable 5 and Claude Mythos 5. 
<a href="https://techcrunch.com/2026/06/12/anthropics-safety-warnings-may-have-just-backfired-the-government-has-pulled-the-plug-on-its-most-powerful-ai/">TechCrunch</a> first reported the order, and <a href="https://time.com/article/2026/06/13/anthropic-fable-mythos-ban-US-security/">Time</a> confirmed the company complied worldwide within a day. Fable 5 had shipped three days earlier as the &#8220;safe to use&#8221;, generally available model. Mythos 5 was the more capable sibling, never opened to everyone. Anthropic&#8217;s own account is that someone found a jailbreak around Fable&#8217;s safety layer, the part built to block answers in cybersecurity and biology, and the government decided the risk of leaving it running was too high. <br>Not everyone agrees with this decision: <a href="https://www.scientificamerican.com/article/us-limits-on-anthropic-fable-ai-could-hurt-cybersecurity/">Scientific American</a> argued the restriction could end up hurting defenders more than attackers. As of today, though, nothing has changed. 
</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>You can read that episode two ways. 
Either the guardrails were the only thing standing between a public model and a stream of attack-ready output, or the shutdown was a tidier story than the truth. We are not going to settle that here. For our corner of the world, the lesson is simpler. A capable model, stripped of guardrails or jailbroken around them, automates the expensive part of an attacker&#8217;s job: deobfuscation, code analysis, and figuring out how an unfamiliar system behaves under pressure. Those tasks have always gated the scraper-versus-vendor arms race because they cost human reverse-engineering hours, and that gate is now moving.</p><p>The cat-and-mouse game we cover every week runs on a hidden economy. A vendor ships an obfuscated sensor. Someone on the scraping side spends days or weeks reading minified JavaScript, renaming variables, and rebuilding the logic until the parameters make sense. The cost of that labor is what keeps most protections standing. If a model can do a real share of it on its own, the math behind the whole arms race changes.</p><h2>Manus AI, and why not Claude</h2><p><a href="https://manus.im/invitation/BRAWNVUL1U09EW?utm_source=invitation&amp;utm_medium=social&amp;utm_campaign=copy_link">Manus</a> is not a chat assistant. It is an agent that pairs a language model with a sandboxed virtual machine, and inside that machine, it runs code, fetches pages, beautifies scripts, and iterates on its own output. Meta acquired it at the end of December 2025 for around two billion dollars, though by June 2026, Chinese regulators were already forcing parts of that deal to unwind. What that architecture buys is verification. Manus does not just describe what a script might do; it runs the script and checks.</p><p>That difference shows up the moment you set guardrails next to it. Ask Claude or ChatGPT to deobfuscate a live anti-bot sensor and explain which signals feed its scoring, and you tend to hit a wall. 
Those products keep tight cybersecurity guardrails, and the Fable shutdown is the clearest sign yet of how hard that line is enforced. We handed the same class of task to Manus, and it got to work. We have looked at agentic browsers and AI scraping assistants before, but those were aids for a human operator who still did the thinking. Here, the model does the reverse engineering.</p><div><hr></div><blockquote><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://byteful.com/?promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://byteful.com/?promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><h2>Mapping Akamai on Net-a-Porter</h2><p>We have written about Akamai more than any other anti-bot system. For the bypass-oriented background, <a href="https://substack.thewebscraping.club/p/the-lab-30-how-to-bypass-akamai-protected">THE LAB #30</a>, <a href="https://substack.thewebscraping.club/p/bypass-akamai-bot-protection">THE LAB #85</a>, and <a href="https://substack.thewebscraping.club/p/hybrid-scraping-camoufox-curl-cffi">THE LAB #100</a> cover the recovery-when-nothing-works case, proxy chaining, and the browser-login-plus-HTTP hybrid. We will not repeat any of that.</p><p>What we wanted this time was not a bypass. It was a map. 
Akamai Bot Manager protects net-a-porter.com, and it defends in stacked layers that each fail differently:</p><ul><li><p>A TLS fingerprint check at the edge, evaluated before a single line of the client JavaScript runs.</p></li><li><p>A <code>sec-cpt</code> cryptographic challenge, whose state lives in the <code>bm_sc</code> cookie.</p></li><li><p>A behavioral sensor that serializes a binary telemetry payload and posts it to a collect endpoint, validating the <code>_abck</code> cookie.</p></li><li><p>A client-side anti-hook layer that checks the browser environment for tampering and reports integrity separately.</p></li></ul><p>The model we asked Manus to build covers the client sensor and the challenge flow, plus whatever it could observe about the TLS gate from the outside. A working end-to-end bypass was out of scope. We are not shipping code with this article, and we did not ask Manus to produce a runnable solver, even though our tests suggest we are really close to a working solution. What we&#8217;ll share here is a summary of the analysis Manus did; <a href="https://github.com/TheWebScrapingClub/thelab">in the code repository</a>, you will find the full details.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>What we handed the agent</h2><p>The task was deliberately cold. We did not give Manus our own Akamai notes, our wiki, or any of the prior Lab articles. We pointed it at net-a-porter.com and asked it to study how Akamai's protection works, deobfuscate the client script, and explain which parameters feed into the sensor. 
Nothing more.</p><p>We chose this target for three reasons. It runs a real, current Akamai deployment on a high-value luxury store. The sensor was observable live, so the agent could fetch and instrument the actual script rather than reason about a stale sample. And we already understand Akamai well enough internally to judge whether the output was accurate or confident nonsense.</p><p>The reason the task went to Manus rather than a chat assistant is the guardrail gap from the previous section. Deobfuscating a production anti-bot sensor and enumerating the signals it scores is exactly the kind of request the heavily guarded models decline. Manus took it, ran the analysis in its VM, and came back with the two documents below.</p><h2>What came back</h2><p>The output was more detailed than we expected from a single task run. We are sharing it in two parts and deliberately not sharing it all.</p><blockquote><p>The two documents below are Manus AI&#8217;s output. We reproduce them as reportage, to show what an agent derived on its own. We have not independently re-run every claim, re-derived every value, or confirmed that each detail still holds against the live site. Treat the specifics as the agent&#8217;s findings, not as TWSC-verified ground truth. The sensor version Manus reports, <code>746702557</code>, was compiled on 2025-05-09, so some details may already have rotated. 
Where the agent surfaced operational specifics that amount to a turnkey recipe (exact module filenames, the full endpoint table, the CSS selector ids, the challenge request names), we have redacted them.</p></blockquote><p>This is a schema of how Akamai Bot Protection works, according to Manus AI:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w8Cc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w8Cc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png 424w, https://substackcdn.com/image/fetch/$s_!w8Cc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png 848w, https://substackcdn.com/image/fetch/$s_!w8Cc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png 1272w, https://substackcdn.com/image/fetch/$s_!w8Cc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w8Cc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png" width="1456" height="1561" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1561,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:436306,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/203671348?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w8Cc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png 424w, https://substackcdn.com/image/fetch/$s_!w8Cc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png 848w, https://substackcdn.com/image/fetch/$s_!w8Cc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png 1272w, https://substackcdn.com/image/fetch/$s_!w8Cc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291bf5-14fa-4123-8412-d0cae6828096_3120x3344.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3>Part one, reproduced in full, the anti-hooking and TLS summary</h3><p>This first document is the safe exhibit. It explains how Akamai detects a tampered browser and how the TLS gate at the edge decides who even gets a challenge. We reproduce it verbatim.</p><div class="callout-block" data-callout="true"><h4>Akamai Bot Manager: Anti-Hooking and TLS Fingerprinting Summary</h4><p><strong>Target:</strong> <code>www.net-a-porter.com</code><br><strong>Sensor Version:</strong> 746702557<br><strong>Author:</strong> Manus AI</p><p>This document isolates and summarizes two of the most critical defensive layers in Akamai Bot Manager (BM): its client-side anti-hooking mechanisms and its edge-level TLS fingerprinting. 
These two systems work in tandem to defeat headless browsers, automated scraping frameworks, and reverse-engineering attempts.</p><h5>1. Anti-Hooking Mechanisms</h5><p>Akamai&#8217;s client-side script goes to extraordinary lengths to ensure the browser environment is pristine. The anti-hooking logic is primarily housed in the dynamically loaded <code>h</code> module.</p><p><strong>1.1 Native Function Verification.</strong> Scraping frameworks often override native browser APIs (such as <code>navigator.webdriver</code> or <code>HTMLCanvasElement.toBlob</code>) to spoof a legitimate environment. Akamai detects these overrides using several techniques. It uses <code>Function.prototype.toString</code> inspection to check whether native functions still return the expected <code>[native code]</code> string, and it detects whether the <code>toString</code> method itself has been hooked or proxied. It uses <code>Object.getOwnPropertyDescriptor</code> to inspect the getters and setters of critical APIs, and a property that should be read-only but carries a custom getter is flagged as tampered. It also checks for JavaScript <code>Proxy</code> objects wrapping standard DOM APIs, a common technique in tools like Puppeteer Stealth.</p><p><strong>1.2 The &#8220;Clean iframe&#8221; technique.</strong> To bypass hooks placed on the main <code>window</code> object by automation tools, the scheduler class dynamically injects a hidden iframe into the DOM:</p><pre><code><code>&lt;iframe height="0" width="0" style="display: none; visibility: hidden;" sc="true" tabindex="-1"&gt;&lt;/iframe&gt;</code></code></pre><p>Once injected, Akamai extracts pristine, unmodified API references directly from <code>iframe.contentWindow</code>. 
This lets the script call the original native functions even if the main <code>window</code> object has been heavily spoofed.</p><p><strong>1.3 The anti-hook verification payload (</strong><code>/ah</code><strong>).</strong> The anti-hook module computes a cryptographic hash representing the integrity of the environment. This check runs as a &#8220;Type 3&#8221; callback in the scheduler, so it executes before the main telemetry payload is serialized. The result is sent to the <code>/ah</code> endpoint. If the environment is clean, the URL parameter <code>ah=0</code> is transmitted. If verification is pending or failed, it transmits <code>ah=1</code>.</p><h5>2. TLS Fingerprinting and Edge Escalation</h5><p>While the anti-hook mechanisms operate in the browser, Akamai&#8217;s most formidable defense occurs at the edge server level via TLS fingerprinting.</p><p><strong>2.1 The TLS barrier.</strong> Before the client-side JavaScript even executes, the Akamai edge server evaluates the incoming connection&#8217;s TLS fingerprint (JA3/JA4 hashes, cipher suites, ALPN extensions). Empirical testing on net-a-porter.com revealed a strict dichotomy. Consumer TLS, for example Safari emulated via <code>curl_cffi</code>, gets a <code>200 OK</code> response carrying the <code>sec-cpt</code> cryptographic challenge. This is a soft block that lets legitimate browsers prove their authenticity. Headless or automation TLS, for example standard Node.js or Puppeteer Chromium, gets an immediate <code>403 Access Denied</code> that bypasses the challenge flow entirely.</p><p><strong>2.2 The escalation trap.</strong> The most critical finding about Akamai&#8217;s architecture is how TLS fingerprinting intersects with the client-side behavioral telemetry behind the <code>_abck</code> cookie. A request with a valid TLS fingerprint receives the <code>sec-cpt</code> challenge, and solving it updates the <code>bm_sc</code> cookie by appending <code>~0~0~0</code>. 
At the same time, the browser must send a rich, roughly 2.2KB binary payload to the <code>/collect</code> endpoint to validate the <code>_abck</code> cookie. If an automated tool solves the <code>sec-cpt</code> challenge but fails to provide a valid behavioral payload, or sends it over a connection with a mismatched TLS fingerprint, the edge server detects the anomaly.</p><p>The result is that even if a real Chromium browser driven by Puppeteer Stealth generates a mathematically perfect 2,199-byte behavioral payload and solves the challenge, Akamai still refuses to validate the <code>_abck</code> cookie, because it detects the Puppeteer TLS fingerprint at the edge. The server then escalates and returns a permanent <code>403 Access Denied</code> for that session.</p><p><strong>2.3 Summary.</strong> Akamai requires perfect synchronization of a trusted network-level TLS fingerprint and pristine browser-level behavioral data. Spoofing the browser environment, which trips the anti-hook defenses, or using a non-standard TLS stack, guarantees session termination.</p></div><h3>Part two, reproduced with redactions, the deep technical analysis</h3><p>The second document goes much further. It walks the bootstrapping sequence, the module system, the fingerprinting surface, and the telemetry transport. This is the part that starts to read like a blueprint, so we keep the architecture and the reasoning and we strip the operational specifics. Redaction markers below are ours, not Manus&#8217;s.</p><p>If you want to see the full analysis, you can find it&nbsp;<a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder&nbsp;</a><strong><a href="https://github.com/TheWebScrapingClub/thelab">108.</a></strong><a href="https://github.com/TheWebScrapingClub/thelab">AKAMAI-MANUS</a><strong><a href="https://github.com/TheWebScrapingClub/thelab">.</a></strong></p>
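<p>As a rough illustration of the native-function verification that Manus describes in section 1.1 above, here is a minimal JavaScript sketch. It is not Akamai&#8217;s code: the helper names <code>looksNative</code> and <code>hasCustomGetter</code> and the <code>spoofed</code> object are our own assumptions, and a production sensor would likely check far more, for instance whether <code>toString</code> itself has been hooked.</p>

```javascript
// Illustrative sketch only; helper names are hypothetical, not taken from the Akamai sensor.
const nativeToString = Function.prototype.toString;

// True when the function's source text still ends with the "[native code]" body.
function looksNative(fn) {
  if (typeof fn !== "function") return false;
  return /\{\s*\[native code\]\s*\}\s*$/.test(nativeToString.call(fn));
}

// True when a property is served by a getter that is not a built-in.
function hasCustomGetter(obj, prop) {
  const desc = Object.getOwnPropertyDescriptor(obj, prop);
  if (!desc || typeof desc.get !== "function") return false;
  return !looksNative(desc.get);
}

// A spoofed object, similar in spirit to what a stealth plugin might install (assumption).
const spoofed = {};
Object.defineProperty(spoofed, "webdriver", { get: () => false });

console.log(looksNative(Math.max));                 // built-in function
console.log(looksNative(() => 1));                  // user-defined function
console.log(hasCustomGetter(spoofed, "webdriver")); // property backed by a custom getter
```

<p>One known limit of this naive version: bound functions also print a <code>[native code]</code> body, so the string check alone cannot tell them apart from real built-ins.</p>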
      <p>
          <a href="https://substack.thewebscraping.club/p/the-lab-108-how-llms-and-manus-ai">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Crawl of Fame - June 2026]]></title><description><![CDATA[What's an ISP proxy, how they get sourced, and how to spawn browsers for cheap]]></description><link>https://substack.thewebscraping.club/p/crawl-of-fame-june</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/crawl-of-fame-june</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Tue, 23 Jun 2026 20:58:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3675f58f-7dd6-4877-9004-b1662fb09f73_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Episode #0 of this new series of posts, where I share with you the most interesting content out there I&#8217;ve read about web scraping and all the things related.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><div><hr></div><h2>IP sourcing in the proxy industry</h2><p>Two articles written by Spur Intelligence caught my attention in the past week.<br>It started with a webinar I attended, where they explained what ISP proxies are and how proxy companies get those IPs.</p><div id="youtube2-MQ1zpnlMMUc" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;MQ1zpnlMMUc&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/MQ1zpnlMMUc?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>The analysis is great, and it helped me better understand how the proxy industry works.<br>They also mentioned how NetNut gets high-quality ISP IP addresses and described it in detail in this article.</p><h3>How Proxy Providers Co-opt Entire Networks</h3><p>Spur reverse-engineers how NetNut sources its ISP proxies through a partner called DiviNetworks, which installs GRE tunnels and policy-based routing directly on the border routers of real ISPs. The result is genuine ISP IP space sold as proxy inventory, with actual subscribers still living on the same addresses. 
The piece is worth reading for the detection details: outbound connections sit in the TCP source port range 40,000 to 40,200, partner routers expose an <code>rtr-&lt;isp&gt;.divinetworks.com</code> passive-DNS pattern, and the Howard University case shows an entire /16 (AS919, around 17,000 exit IPs) co-opted at the network edge. They even quote DiviNetworks&#8217; own figure of $13,208 a month for a US /16.<br><a href="https://spur.us/blog/how-proxy-providers-co-opt-entire-networks">Full article here</a></p><h3>Smart TV Apps and Residential Proxy SDKs</h3><p>A more recent article highlights how proxy companies source residential proxies. It came as no surprise, at least to me, that Smart TVs are used as exit nodes for proxies. Just like with mobile apps, developers can use SDKs from proxy providers to monetize their app installations. In some cases, the consent screen makes it quite clear what&#8217;s happening under the hood. </p><p>All the industries that rely on web-scraped data (AI first and foremost, but not only) need residential proxies, so proxy providers, who are the gatekeepers for these tools, should be very careful to onboard only users with legitimate use cases and to keep fraudsters away. In my career, I&#8217;ve worked with almost every big name out there, and I&#8217;ve found that this is the case. 
As scraping professionals, we should always remember that we&#8217;re guests (sometimes unwanted) on both the target website and the proxy infrastructure, so we should be as respectful as possible and not let greed drive our data collection.<br><a href="https://spur.us/blog/smart-tv-apps-residential-proxy-sdks">Full article here</a></p><h2>Scraping Infrastructure</h2><p>A great article by the engineering team at Browser Use shows how they spawn browsers cheaply and quickly.</p><h3>How We Made Cloud Browsers 3x Cheaper and Faster</h3><p><span>Browser Use walks through rebuilding its cloud so every session is its own Firecracker microVM. The interesting twist is that they run Firecracker on plain EC2 instead of bare metal, accepting nested virtualization to get faster scale-up and lower cost. </span></p><p><span>The numbers are great: $0.02 per browser hour, down from $0.06; VM cold start under 400ms; and create latency of 825ms at p50 across a 10,000-session test. </span></p><p><span>Several engineering details make the difference, from mapping memory in 2MB pages plus a </span><code>userfaultfd</code><span> handler that preloads hot pages (resume-to-ready drops from 9.8s to 3.1s, roughly 91x fewer page-fault stalls) to two-phase vCPU pinning that took a 1,000-browser launch from 17% failed sessions to zero. They also make the case for running fully headless rather than headful, since their low-level Chromium fork pushes block-avoidance to 81% on their own benchmark.</span></p><p><a href="https://browser-use.com/posts/firecracker-browser-infra"><span>Full article here</span></a></p><div><hr></div><h2>Read and share more of this</h2><p>Want to flag and discover more articles like the ones above? <a href="https://news.thewebscraping.club/">Scraping News</a> (still in beta) is where the community surfaces them. 
<strong>Sign up</strong> and submit what you find worth reading.</p><p>Want a concept map of everything covered here, cross-referenced with our own work and other sources worth your time? </p><p><a href="https://github.com/TheWebScrapingClub/scraping-wiki/tree/main">The Web Scraping Club Wiki lives on GitHub</a>, and <a href="https://publish.obsidian.md/twsc-public/Web+Scraping/Wiki/README">as an interactive site</a>. It is plain Markdown, so you can also clone it into your own Obsidian vault and read it locally.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[How to Build Competitive Intelligence Scrapers That Don't Lie to You]]></title><description><![CDATA[Learn how to build scraping pipelines that are still giving you accurate data 12 months from now]]></description><link>https://substack.thewebscraping.club/p/how-to-build-competitive-intelligence-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-build-competitive-intelligence-scraping</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 21 Jun 2026 03:00:42 GMT</pubDate><enclosure 
url="https://substack-post-media.s3.amazonaws.com/public/images/0463db2a-5f58-4cc1-a9de-dd0875d149bb_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve built a competitive intelligence scraper before, you know the feeling&#8230;</p><p>You set it up, it works, and you forget about it. Three months later, someone asks: <em>&#8220;Hey, did our competitor change their pricing last week?&#8221;</em></p><p>You check the pipeline dashboard, and everything looks fine: all green, no errors, last run successful some hours ago. Then you dig deeper and realize your scraper has been returning empty strings for weeks. The competitor did change their pricing, and you don&#8217;t know when this happened or what was on their website before.</p><p>In this article, you&#8217;ll learn how to build competitive scraping pipelines that make failure structurally (almost) impossible. Let&#8217;s get into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Why Competitive Intelligence Scrapers Break</h2><p>Let&#8217;s start by understanding the most common failure modes of competitive scraping pipelines:</p><ul><li><p><strong>Structural drift</strong>: This is probably the most common. A competitor renames a CSS class, and your scraper starts returning empty strings. Your database fills with nulls, and nobody notices for weeks because the pipeline is still running.</p></li><li><p><strong>Soft blocks</strong>: These are nastier. You get valid HTTP responses, but the content is a CAPTCHA page or a bot-detection redirect. Your parser sees valid HTML but finds no data, so it stores nothing or, worse, stores the CAPTCHA page&#8217;s text as if it were real data.</p></li><li><p><strong>Schema rot</strong>: This happens when a competitor evolves their product, but your data model doesn&#8217;t. They add a new pricing tier or split one pricing plan into three. Your scraper extracts what it can and drops the rest. Your competitive analysis is now based on an incomplete picture of their offering.</p></li><li><p><strong>Timezone and locale traps</strong>: This one probably deserves more attention than the others. The same competitor page can return different prices, currencies, or date formats depending on where the request originates. 
If your scraper runs from a US server but your competitor detects it and serves EU pricing, you&#8217;re tracking the wrong numbers.</p></li></ul><p>What makes these failures dangerous is that you caused them yourself, unintentionally. Every defensive pattern you added to prevent crashes is exactly what turns a broken scraper into a quietly running one. No exceptions, no alerts, no red dashboards. Just weeks of nulls that nobody notices.</p><p>The rest of this article is about making that invisible failure visible again. Here&#8217;s a visual overview of the solutions proposed below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!niSq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!niSq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png 424w, https://substackcdn.com/image/fetch/$s_!niSq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png 848w, https://substackcdn.com/image/fetch/$s_!niSq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png 1272w, https://substackcdn.com/image/fetch/$s_!niSq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!niSq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png" width="1456" height="1124" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1909562b-4730-4696-bce2-496f109727e7_2366x1826.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1124,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231842,&quot;alt&quot;:&quot;The schema for creating competitive scraping pipelines that are resilient through time by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196881482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The schema for creating competitive scraping pipelines that are resilient through time by Federico Trotta" title="The schema for creating competitive scraping pipelines that are resilient through time by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!niSq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png 424w, https://substackcdn.com/image/fetch/$s_!niSq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png 848w, 
https://substackcdn.com/image/fetch/$s_!niSq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png 1272w, https://substackcdn.com/image/fetch/$s_!niSq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1909562b-4730-4696-bce2-496f109727e7_2366x1826.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The schema for creating competitive scraping pipelines that are resilient through 
time</figcaption></figure></div><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>The Monitoring Layer: Your First Line of Defense</h2><p>The first thing to do to avoid such failures is to implement a monitoring layer. This is the part that most scraping engineers skip because, you know how things go: the business needed the data yesterday, and you can&#8217;t spend time on monitoring&#8230;</p><p>But if you want your competitive scrapers to be resilient over time, you need to implement a monitoring system. A good one is built on four parts: output validation, structural fingerprinting, data freshness checks, and canary fields. Let&#8217;s discuss them!</p><h3>How to Validate What Your Scraper Extracted</h3><p>Checking HTTP status codes in your scrapers is not enough. You need to assert that the data you extracted is actually meaningful. 
The following example is taken from <a href="https://stripe.com/pricing">Stripe&#8217;s pricing</a> page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p3X2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p3X2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png 424w, https://substackcdn.com/image/fetch/$s_!p3X2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png 848w, https://substackcdn.com/image/fetch/$s_!p3X2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png 1272w, https://substackcdn.com/image/fetch/$s_!p3X2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p3X2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png" width="1162" height="400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1162,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176549,&quot;alt&quot;:&quot;The Stripe&#8217;s pricing page your scraper &#8220;sees&#8221; from the EU by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196881482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Stripe&#8217;s pricing page your scraper &#8220;sees&#8221; from the EU by Federico Trotta" title="The Stripe&#8217;s pricing page your scraper &#8220;sees&#8221; from the EU by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!p3X2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png 424w, https://substackcdn.com/image/fetch/$s_!p3X2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png 848w, https://substackcdn.com/image/fetch/$s_!p3X2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png 1272w, 
https://substackcdn.com/image/fetch/$s_!p3X2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e642a5-caa7-49ba-8ecd-adca16dcf582_1162x400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Stripe&#8217;s pricing page as your scraper &#8220;sees&#8221; it from the EU</figcaption></figure></div><p>Here&#8217;s a practical validator using Pydantic that you can implement:</p><pre><code><code># In an activated virtual environment, run: pip install httpx beautifulsoup4 pydantic

import httpx 
from bs4 import BeautifulSoup
from pydantic import BaseModel, field_validator, ValidationError
from typing import Optional
import re

# Define Pydantic model
class CompetitorPricingData(BaseModel):
    plan_name: str
    monthly_price: float
    currency: str
    features: list[str]

    @field_validator("monthly_price")
    @classmethod
    def price_must_be_positive(cls, v):
        if v &lt;= 0:
            raise ValueError(f"Price must be positive, got {v}")
        return v

    @field_validator("currency")
    @classmethod
    def currency_must_be_valid(cls, v):
        valid_currencies = {"USD", "EUR", "GBP"}
        if v.upper() not in valid_currencies:
            raise ValueError(f"Unexpected currency: {v}")
        return v.upper()

    @field_validator("features")
    @classmethod
    def features_must_not_be_empty(cls, v):
        if len(v) == 0:
            raise ValueError("Features list is empty: possible extraction failure")
        return v

# Define scraper logic
def scrape_stripe_pricing() -&gt; list[dict]:
    url = "https://stripe.com/pricing"
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }

    response = httpx.get(url, headers=headers, follow_redirects=True, timeout=15)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    plans = []

    # Get class name from selectors
    plan_cards = soup.select("div.HeroPricingSubcard")

    for card in plan_cards:
        container = card.select_one(".HeroPricingSubcard__container")
        if not container:
            plans.append({
                "plan_name": "",
                "monthly_price": 0.0,
                "currency": "USD",
                "features": [],
            })
            continue

        # Get all text nodes inside the container
        text_blocks = [t.get_text(strip=True) for t in container.find_all(True) if t.get_text(strip=True)]

        # Pull the first block that looks like a price
        price_text = ""
        for block in text_blocks:
            if re.search(r"\d+[.,]?\d*%?\s*\+?\s*[&#8364;$&#163;]?\d*", block):
                price_text = block
                break

        price_match = re.search(r"\d+\.?\d*", price_text.replace(",", "."))
        monthly_price = float(price_match.group()) if price_match else 0.0

        if "$" in price_text or "&#162;" in price_text:
            currency = "USD"
        elif "&#8364;" in price_text:
            currency = "EUR"
        elif "&#163;" in price_text:
            currency = "GBP"
        else:
            currency = "UNKNOWN"

        # Use all text blocks as features
        features = text_blocks if text_blocks else []

        plans.append({
            "plan_name": "Standard",  # HeroPricingSubcard is the Standard card
            "monthly_price": monthly_price,
            "currency": currency,
            "features": features,
        })

    return plans

# Define validator logic
def send_alert(message: str):
    # Wire this to Slack, PagerDuty, email, etc.
    print(f"[ALERT] {message}")

def validate_scraped_data(raw_data: dict) -&gt; Optional[CompetitorPricingData]:
    try:
        return CompetitorPricingData(**raw_data)
    except ValidationError as e:
        # Don't silently swallow this &#8212; alert immediately
        print(f"[VALIDATION FAILED] {e}")
        send_alert(f"Scraper validation failed: {e}")
        return None

if __name__ == "__main__":
    raw_plans = scrape_stripe_pricing()

    print(f"Found {len(raw_plans)} plan(s) on Stripe's pricing page.\n")

    for raw in raw_plans:
        validated = validate_scraped_data(raw)
        if validated:
            print(f"[OK]   {validated.plan_name} &#8212; {validated.currency} {validated.monthly_price}/mo")
            print(f"       Features: {validated.features[:2]}{'...' if len(validated.features) &gt; 2 else ''}")
        else:
            print(f"[FAIL] Raw data that broke validation: {raw}")</code></code></pre><p>Here&#8217;s what this code does:</p><ul><li><p>Defines a Pydantic model: It uses <a href="https://pydantic.dev/docs/validation/latest/get-started/">Pydantic</a> to define the expected shape and rules of the scraped data. It enforces that the price is positive, the currency is one of <em>USD</em>, <em>EUR</em>, or <em>GBP</em>, and the features list is not empty. Any violation raises a <em>ValidationError</em>.</p></li><li><p><em>scrape_stripe_pricing()</em>: Fetches Stripe&#8217;s pricing page with <em><a href="https://www.python-httpx.org/">httpx</a></em>, parses the HTML with <em><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a></em>, and extracts pricing data using CSS selectors. If a container isn&#8217;t found, it appends a zeroed-out dict that deliberately fails validation downstream (the price is not positive and the features list is empty).</p></li><li><p><em>validate_scraped_data()</em>: Passes each raw scraped dict through the Pydantic model. On failure, it calls <em>send_alert()</em> instead of silently swallowing the error.</p></li><li><p><em>send_alert()</em>: A stub you can wire to your alerting stack (Slack, PagerDuty, email) in a production environment. Throughout this article, it simply prints alerts to the command line for demonstration purposes.</p></li></ul><p>Running the script from an EU location produces the following output:</p><pre><code><code>Found 2 plan(s) on Stripe's pricing page.

[OK]   Standard &#8212; EUR 1.5/mo
       Features: ['1.5% + &#8364;0.25', '1.5% + &#8364;0.25']...
[OK]   Standard &#8212; EUR 2.5/mo
       Features: ['2.5% + &#8364;0.25', '2.5% + &#8364;0.25']...
</code></code></pre><p>If it&#8217;s not clear why this matters, here&#8217;s what your scraper &#8220;sees&#8221; when it runs from another location:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7e3G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7e3G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png 424w, https://substackcdn.com/image/fetch/$s_!7e3G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png 848w, https://substackcdn.com/image/fetch/$s_!7e3G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png 1272w, https://substackcdn.com/image/fetch/$s_!7e3G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7e3G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png" width="1402" height="773" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:773,&quot;width&quot;:1402,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:262298,&quot;alt&quot;:&quot;The Stripe&#8217;s pricing page your scraper &#8220;sees&#8221; from the US by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196881482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Stripe&#8217;s pricing page your scraper &#8220;sees&#8221; from the US by Federico Trotta" title="The Stripe&#8217;s pricing page your scraper &#8220;sees&#8221; from the US by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!7e3G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png 424w, https://substackcdn.com/image/fetch/$s_!7e3G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png 848w, https://substackcdn.com/image/fetch/$s_!7e3G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7e3G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e7f8dfc-202f-4803-83fb-575a19320f1a_1402x773.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Stripe&#8217;s pricing page as your scraper &#8220;sees&#8221; it from the US</figcaption></figure></div><p>The class is the same, but the data (the currency) is completely different! 
This is a common example of why your validation must account for the location your scraper runs from.</p><div><hr></div><h3>Using DOM Fingerprints to Get Ahead of Structural Drift</h3><p>This method is one of the most valuable things you can add to a competitive intelligence pipeline.</p><p>The idea is simple: periodically hash the DOM structure of the pages you&#8217;re scraping and store the resulting hash. When the hash changes, flag it for human review before your scraper breaks. Here is how you can implement it:</p><pre><code><code># In an activated virtual environment run: pip install requests beautifulsoup4

import hashlib
import requests
from bs4 import BeautifulSoup

# Note: send_alert() is the alerting stub defined in the validator example above

# Define parsing logic
def extract_structural_fingerprint(html: str, selector: str) -&gt; str:
    """
    Extract a structural fingerprint from a specific section of the page.
    We hash the tag names and class names, NOT the content.
    This way, price changes don't trigger false alarms &#8212; only structural changes do.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(selector)

    if not container:
        return "CONTAINER_NOT_FOUND"

    # Build a structural signature
    structure = []
    for tag in container.find_all(True):
        classes = sorted(tag.get("class", []))
        structure.append(f"{tag.name}:{','.join(classes)}")

    fingerprint_str = "|".join(structure)
    return hashlib.md5(fingerprint_str.encode()).hexdigest()
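
# Hypothetical helpers (an assumption, not part of the original pipeline): a minimal
# sketch of how the stored fingerprint could be persisted between runs, assuming a
# local JSON file. The file name and the function names are illustrative only.
import json
from pathlib import Path

def load_fingerprint(url, selector, path="fingerprints.json"):
    # Return the stored hash for this url/selector pair, or None on the first run
    store = Path(path)
    data = json.loads(store.read_text()) if store.exists() else {}
    return data.get(f"{url}::{selector}")

def save_fingerprint(url, selector, fingerprint, path="fingerprints.json"):
    # Record the hash so that the next run has something to compare against
    store = Path(path)
    data = json.loads(store.read_text()) if store.exists() else {}
    data[f"{url}::{selector}"] = fingerprint
    store.write_text(json.dumps(data, indent=2))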

# Compare hashes
def check_fingerprint(url: str, selector: str, stored_fingerprint: str) -&gt; bool:
    """
    Returns True if the structure is unchanged, False if it has drifted.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    current_fingerprint = extract_structural_fingerprint(response.text, selector)

    if current_fingerprint == "CONTAINER_NOT_FOUND":
        send_alert(f"[CRITICAL] Selector '{selector}' not found on {url}. Site may have been redesigned.")
        return False

    if current_fingerprint != stored_fingerprint:
        send_alert(
            f"[WARNING] Structural change detected on {url} "
            f"(selector: {selector}). "
            f"Old: {stored_fingerprint[:8]}... New: {current_fingerprint[:8]}..."
        )
        print("\n[WARNING] Structural change detected: hashes differ")
        print(f"  Old: {stored_fingerprint[:8]}...")
        print(f"  New: {current_fingerprint[:8]}...")
        print("  &#8594; Scrape skipped. Human review required before next run.")
        return False

    print("\n[OK] Structure unchanged, safe to scrape.")
    return True</code></code></pre><p>Below is a description of what this snippet does:</p><ul><li><p><em>extract_structural_fingerprint():</em> Parses the HTML, finds the target container via CSS selector, and builds a structural signature by iterating over every tag inside it. In this specific case, it records tag names and sorts class names, but deliberately ignores text content. That signature string is then hashed with MD5 and returned. If the selector finds nothing, it returns <em>&#8220;CONTAINER_NOT_FOUND&#8221;</em> instead of crashing.</p></li><li><p><em>check_fingerprint()</em>: Fetches the live page, computes the current fingerprint, and compares it against the stored one. Two failure cases are handled separately: the selector disappearing entirely (<em>CONTAINER_NOT_FOUND</em>) and the structure changing (hash mismatch). Both call <em>send_alert()</em> and return <em>False</em>: this signals to the caller to skip the scrape entirely.</p></li><li><p><em>hashlib.md5()</em>: Used to compress the structural signature into a short and comparable string. MD5 is not used here for security purposes; it&#8217;s used because it&#8217;s fast and collision-resistant enough for DOM comparison. Any change in tag names or class names produces a completely different hash.</p></li></ul><p>In the case of a drifted fingerprint, the expected result is the following:</p><pre><code><code>[STORED]  Fingerprint: 248363464b2da9b3814ea9a6dc5bd0df
[DRIFTED] Fingerprint: 7532ae7466c1f0fbc2c96471acd86ea7

[WARNING] Structural change detected: hashes differ
  Old: 24836346...
  New: 7532ae74...
  &#8594; Scrape skipped. Human review required before next run.</code></code></pre><p>Note that this approach to DOM changes can be considered &#8220;classical&#8221; because it is fully based on finding a meaningful selector for the target page. The problem is that the DOM could change, but not the &#8220;meaningful&#8221; selector you choose. To improve it, you can target several selectors, or you can add a check on the whole body, but this would add more noise.</p><p>If you prefer trying a more &#8220;modern&#8221; approach that leverages LLMs to check on DOM changes, at The Web Scraping Club, we&#8217;ve already covered this in the article: &#8220;<a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">Beyond the DOM: A Practical Guide to Web Data Extraction with LLMs and GPT Vision</a>&#8221;.</p><h3>Using Canary Fields as a Heartbeat for Your Scraping Pipeline</h3><p>A canary field is a field that you know changes frequently. This could be a &#8220;last updated&#8221; timestamp, a dynamic element, a session token in the page: you name it. You scrape it as a heartbeat, and if it stops changing, your pipeline is probably broken. Below is a snippet that checks on a canary field:</p><pre><code><code>def send_alert(message: str):
    print(f"[ALERT] {message}")
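
# Hypothetical helper (an assumption, not from the original code): a minimal sketch
# of how the caller could measure how long the canary has been unchanged, assuming
# it stores a timezone-aware datetime each time the value changes. The idea is to
# raise the warning only once this figure reaches max_unchanged_hours.
from datetime import datetime, timezone

def hours_since(last_changed_at, now=None):
    # Hours elapsed since the canary value last changed
    now = now or datetime.now(timezone.utc)
    return (now - last_changed_at).total_seconds() / 3600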

def check_canary_field(
    current_value: str,
    stored_value: str,
    field_name: str,
    competitor: str,
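    max_unchanged_hours: int = 48) -&gt; None:
    # This function only compares values: the caller is expected to run it once
    # max_unchanged_hours have passed since stored_value was recorded.
    if current_value == stored_value: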
        send_alert(
            f"[WARNING] Canary field '{field_name}' for {competitor} "
            f"has not changed in over {max_unchanged_hours} hours. "
            f"Possible scraper failure or soft block."
        )
    else:
        print(f"[OK] Canary field '{field_name}' for {competitor} has changed. Pipeline looks healthy.")</code></code></pre><p>When the canary field has not changed, the expected output is the following:</p><pre><code><code>[ALERT] [WARNING] Canary field 'last_updated' for &lt;COMPETITOR NAME&gt; has not changed in over 48 hours. 
Possible scraper failure or soft block.</code></code></pre><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Building a Storage Architecture You Can Actually Trust</h2><p>Many scraping engineers make the same mistake: they store only the processed output and throw away the raw HTML.</p><p>But in the case of competitive intelligence, that&#8217;s an issue that can cause you to work on weekends. Why? Because when a competitor changed their pricing 6 months ago, and you want to reconstruct the exact timeline, you need the raw data. Without it, history is gone forever.</p><p>The solution is a two-layer storage model: a raw and a processed layer. Here&#8217;s how the two relate:</p><pre><code><code>             &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
             &#9474;                     RAW LAYER                       &#9474;
             &#9474;  S3: raw/{competitor}/{page_type}/{date}.html.gz    &#9474;
             &#9474;  What's stored: full HTML, compressed, timestamped  &#9474;
             &#9474;  Why: so you can re-parse history if your logic     &#9474;
             &#9474;  changes or your scraper had a bug                  &#9474;
             &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
                                   &#9474; raw_s3_key (reference)
                                   &#9660;
             &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
             &#9474;                  PROCESSED LAYER                    &#9474;
             &#9474;  SQLite/Postgres: pricing_history table             &#9474;
             &#9474;  What's stored: structured, queryable records       &#9474;
             &#9474;  Why: this is what your dashboards and alerts       &#9474;
             &#9474;  consume                                            &#9474;
             &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></code></pre><p>Basically, every record in the processed layer carries a raw_s3_key that points back to the exact HTML it was extracted from. That reference is what makes the whole architecture auditable.</p><h3>The Raw Layer: Keep Every HTML Response</h3><p>Store the full HTML or JSON response, timestamped, for every successful scrape. S3 is a cheap option to do so, and a year of raw HTML for a handful of competitor pages costs almost nothing. Below is an example you can implement in your pipeline:</p><pre><code><code># In an activated virtual environment run: pip install boto3

import boto3
import gzip
from datetime import datetime, timezone

def store_raw_response(
    html: str,
    competitor: str,
    page_type: str,
    bucket_name: str
) -&gt; str:
    """
    Store the raw HTML response in S3 with a structured key.
    Compress it: HTML compresses extremely well (often 10:1).
    Returns the S3 key for reference in the processed layer.
    """
    s3 = boto3.client("s3")
    # Take the timestamp once so the key and the metadata agree
    scraped_at = datetime.now(timezone.utc)
    timestamp = scraped_at.strftime("%Y/%m/%d/%H%M%S")
    s3_key = f"raw/{competitor}/{page_type}/{timestamp}.html.gz"

    compressed = gzip.compress(html.encode("utf-8"))

    s3.put_object(
        Bucket=bucket_name,
        Key=s3_key,
        Body=compressed,
        ContentEncoding="gzip",
        ContentType="text/html",
        Metadata={
            "competitor": competitor,
            "page_type": page_type,
            "scraped_at": scraped_at.isoformat()
        }
    )

    return s3_key
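
# Hypothetical helper (an assumption, not in the original article): a minimal
# sketch of how the date-partitioned key layout could be turned into one S3
# prefix per day, ready to pass as Prefix= when listing objects for a date range.
from datetime import timedelta

def daily_prefixes(competitor, page_type, start, end):
    # start and end are datetime.date objects; the range is inclusive
    prefixes = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        prefixes.append(f"raw/{competitor}/{page_type}/{day:%Y/%m/%d}/")
    return prefixes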

def retrieve_raw_response(s3_key: str, bucket_name: str) -&gt; str:
    """
    Retrieve and decompress a raw HTML response from S3.
    Useful for re-parsing historical data when your scraper logic changes.
    """
    s3 = boto3.client("s3")
    response = s3.get_object(Bucket=bucket_name, Key=s3_key)
    compressed_data = response["Body"].read()
    return gzip.decompress(compressed_data).decode("utf-8")</code></code></pre><p>Notice the S3 structure: <em>raw/{competitor}/{page_type}/{year}/{month}/{day}/{timestamp}.html.gz</em>. This makes it trivial to list all scrapes for a specific competitor and page type within a date range, which is exactly what you need when reconstructing a pricing history.</p><div><hr></div><h3>The Processed Layer: What Your Dashboards Actually Consume</h3><p>The processed layer is your structured, queryable data: the thing your dashboards and alerts actually read. Every record here has two fields that make the whole architecture work:</p><ul><li><p><em>raw_s3_key</em>, which links back to the exact HTML this record was extracted from.</p></li><li><p><em>fingerprint</em>, which captures the structural state of the page at scrape time.</p></li></ul><p>The first one makes your data auditable. The second one tells you, later, whether the page had already changed when this record was written.</p><p>Below is a Python example to implement it:</p><pre><code><code>import sqlite3
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ProcessedPricingRecord:
    competitor: str
    page_type: str
    plan_name: str
    monthly_price: float
    currency: str
    scraped_at: datetime
    raw_s3_key: str  # The link back to the raw layer
    fingerprint: str  # The structural fingerprint at time of scrape
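
# Hypothetical read-side helper (an assumption, not in the original article):
# a minimal sketch of how the price timeline for one competitor could be read
# back from the pricing_history table created by store_processed_record() below.
def price_timeline(competitor, db_path):
    conn = sqlite3.connect(db_path)
    try:
        # ISO-8601 timestamps stored as text sort chronologically
        rows = conn.execute(
            "SELECT scraped_at, plan_name, monthly_price, currency "
            "FROM pricing_history WHERE competitor = ? ORDER BY scraped_at",
            (competitor,),
        ).fetchall()
    finally:
        conn.close()
    return rows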

def store_processed_record(record: ProcessedPricingRecord, db_path: str) -&gt; None:
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute("""
        CREATE TABLE IF NOT EXISTS pricing_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            competitor TEXT NOT NULL,
            page_type TEXT NOT NULL,
            plan_name TEXT NOT NULL,
            monthly_price REAL NOT NULL,
            currency TEXT NOT NULL,
            scraped_at TEXT NOT NULL,
            raw_s3_key TEXT NOT NULL,
            fingerprint TEXT NOT NULL
        )
    """)

    cursor.execute("""
        INSERT INTO pricing_history 
        (competitor, page_type, plan_name, monthly_price, currency, scraped_at, raw_s3_key, fingerprint)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        record.competitor,
        record.page_type,
        record.plan_name,
        record.monthly_price,
        record.currency,
        record.scraped_at.isoformat(),
        record.raw_s3_key,
        record.fingerprint
    ))

    conn.commit()
    conn.close()</code></code></pre><p>A note on storage: this example uses SQLite, which is fine for a single scraper running on one machine. If you&#8217;re scraping multiple competitors in parallel or need concurrent writes, consider moving to Postgres (the schema remains the same).</p><h3>A Concrete Example of Why the Raw Layer Saves You on Weekends</h3><p>Imagine your competitor runs a Black Friday promotion in November and your scraper captures it. In January, your CEO asks: <em>&#8220;When exactly did they drop prices, and by how much?&#8221;</em></p><p>With only the processed layer, you can answer that if your scraper was working correctly. But what if your scraper had a bug in November that caused it to extract prices incorrectly? With the raw layer, you can go back, fix the extraction logic, and re-parse the November HTML. Without it, that data is gone.</p><p>Here&#8217;s a utility function that makes re-parsing straightforward:</p><pre><code><code>import boto3

def reparse_historical_data(
    competitor: str,
    page_type: str,
    start_date: str,
    end_date: str,
    bucket_name: str,
    new_parser_fn,
    db_path: str
) -&gt; int:
    """
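    Re-parse all raw HTML for a competitor/page_type within a date range.
    Useful when your extraction logic changes and you need to backfill.

    start_date / end_date: inclusive bounds in "YYYY-MM-DD" format.
    new_parser_fn: a callable that takes raw HTML and returns a list of dicts
    holding the remaining ProcessedPricingRecord fields (competitor, page_type,
    plan_name, monthly_price, currency, fingerprint).
    Returns the number of records reprocessed.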
    """
    s3 = boto3.client("s3")
    prefix = f"raw/{competitor}/{page_type}/"

    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

    reprocessed = 0

    for page in pages:
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Filter by date range based on key structure:
            # raw/{competitor}/{page_type}/{year}/{month}/{day}/{time}.html.gz
            parts = key.split("/")
            key_date = "-".join(parts[3:6])
            if start_date &lt;= key_date &lt;= end_date:
                raw_html = retrieve_raw_response(key, bucket_name)
                parsed_records = new_parser_fn(raw_html)
                for record_data in parsed_records:
                    # Store with updated extraction, linked to same raw key.
                    # Keep the original capture time (when the object was written
                    # to S3) rather than "now", so the rebuilt timeline stays accurate.
                    record = ProcessedPricingRecord(
                        **record_data,
                        raw_s3_key=key,
                        scraped_at=obj["LastModified"]
                    )
                    store_processed_record(record, db_path)
                    reprocessed += 1

    return reprocessed</code></code></pre><p>Note that you&#8217;re not re-scraping anything. You&#8217;re just running your new parser against HTML you already have in storage. That&#8217;s the whole point of the raw layer.</p><div><hr></div><h2>Recovering From a Full Scraper Failure Without Losing History</h2><p>Imagine a competitor does a full site redesign, and your scraper is dead. This is not a hypothetical scenario: it happens more often than you might think, and most teams have no protocol for it.</p><p>Here&#8217;s a concrete recovery process.</p><h3>How to Know Your Scraper Is Dead (Not Just Slow)</h3><p>If you&#8217;ve implemented structural fingerprinting as discussed previously, you&#8217;ll know about the redesign before your scraper breaks. The fingerprint check will fail, you&#8217;ll get an alert, and you can investigate before storing bad data.</p><p>If you haven&#8217;t, and you&#8217;re reading this article after the fact, the first sign is usually a spike in validation failures or a sudden drop in scraped records. The following function reads the scrape log and reports daily success, validation-failure, and error counts:</p><pre><code><code>import sqlite3

def detect_scraper_health(db_path: str, competitor: str, lookback_days: int = 7) -&gt; list[dict]:
    """
    Returns a health summary for a competitor's scraper over the last N days.
    Useful for spotting gradual degradation before it becomes a full failure.
    """
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute("""
        SELECT
            DATE(scraped_at) as scrape_date,
            COUNT(*) as total_attempts,
            SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) as successes,
            SUM(CASE WHEN status = 'validation_failed' THEN 1 ELSE 0 END) as validation_failures,
            SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) as errors
        FROM scrape_log
        WHERE competitor = ?
          AND scraped_at &gt;= DATE('now', ?)
        GROUP BY DATE(scraped_at)
        ORDER BY scrape_date DESC
    """, (competitor, f"-{lookback_days} days"))

    rows = cursor.fetchall()
    conn.close()

    return [
        {
            "date": row[0],
            "total": row[1],
            "successes": row[2],
            "validation_failures": row[3],
            "errors": row[4],
            "success_rate": round(row[2] / row[1] * 100, 1) if row[1] &gt; 0 else 0
        }
        for row in rows
    ]
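

# --- Optional helper (illustrative sketch, not part of the original pipeline) ---
# format_health_table() is a hypothetical name. It assumes each row carries the
# keys returned by detect_scraper_health() above and renders the rows as a
# fixed-width text table. The column widths are an arbitrary choice.
def format_health_table(rows):
    """Render per-day health summaries as a fixed-width text table."""
    columns = [
        ("Date", "date", 10),
        ("Total", "total", 8),
        ("OK", "successes", 7),
        ("Val.Fail", "validation_failures", 11),
        ("Errors", "errors", 9),
        ("Rate", "success_rate", 9),
    ]

    def align(text, key, width):
        # The date column is left-aligned, numeric columns are right-aligned
        return text.ljust(width) if key == "date" else text.rjust(width)

    header = "".join(align(title, key, width) for title, key, width in columns)
    lines = [header, "-" * len(header)]
    for row in rows:
        cells = []
        for _title, key, width in columns:
            value = row[key]
            if key == "success_rate":
                value = f"{value:.1f}%"
            cells.append(align(str(value), key, width))
        lines.append("".join(cells))
    return "\n".join(lines)

# Usage (the database path and competitor name are placeholders):
# print(format_health_table(detect_scraper_health("scraper.db", "acme")))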
</code></code></pre><p>Printed as a table, the result looks something like this:</p><pre><code><code>Date         Total     OK   Val.Fail   Errors     Rate
--------------------------------------------------------
2026-01-15      10      0          0       10     0.0%
2026-01-14      10      2          5        3    20.0%
2026-01-13      10      5          4        1    50.0%
2026-01-12      10      8          2        0    80.0%
2026-01-11      10      9          1        0    90.0%
2026-01-10      10     10          0        0   100.0%
2026-01-09      10     10          0        0   100.0%</code></code></pre><p>The degradation pattern is immediately readable. A success rate dropping from 95% to 60% over three days is probably a sign of a redesign in progress. A drop from 95% to 0% in a short time signals a full redesign or a hard block.</p><h3>Diagnosing What Actually Broke Before You Rewrite Anything</h3><p>Not every structural change is a full redesign, and rewriting most of a scraper can cost you days, if not weeks. To avoid unnecessary work, you can write a script that checks what actually changed in the parts of the target website that matter:</p><pre><code><code>import requests
from bs4 import BeautifulSoup

def triage_structural_change(
    url: str,
    selector: str,
    old_fingerprint: str,
    bucket_name: str,
    competitor: str,
    page_type: str) -&gt; dict:
    """
    Diagnose the nature of a structural change.
    Returns a triage report to guide the recovery effort.
    """
    response = requests.get(url, timeout=10)
    current_html = response.text
    soup = BeautifulSoup(current_html, "html.parser")

    report = {
        "url": url,
        "http_status": response.status_code,
        "selector_found": soup.select_one(selector) is not None,
        "old_fingerprint": old_fingerprint,
        "new_fingerprint": extract_structural_fingerprint(current_html, selector),
        "page_title": soup.title.string if soup.title else "N/A",
        "recommendation": None
    }

    if not report["selector_found"]:
        report["recommendation"] = "FULL_REWRITE: primary selector is gone. Site likely redesigned."
    elif report["old_fingerprint"] != report["new_fingerprint"]:
        report["recommendation"] = "PARTIAL_UPDATE: selector exists but structure changed. Review child selectors."
    else:
        report["recommendation"] = "FALSE_ALARM: fingerprint mismatch may be transient. Re-check in 1 hour."

    # Store the current raw HTML for reference during rewrite
    store_raw_response(current_html, competitor, f"{page_type}_triage", bucket_name)

    return report
    
def print_report(report: dict) -&gt; None:
    print(f"  URL:             {report['url']}")
    print(f"  Selector found:  {report['selector_found']}")
    print(f"  Old fingerprint: {report['old_fingerprint'][:8]}...")
    print(f"  New fingerprint: {report['new_fingerprint'][:8]}...")
    print(f"  Page title:      {report['page_title']}")
    print(f"  Recommendation:  {report['recommendation']}")
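
# --- Optional helper (illustrative sketch; triage_many is a hypothetical name) ---
# Runs a triage function over several targets. Each target is assumed to be a
# dict of keyword arguments accepted by triage_fn (for example, the parameters
# of triage_structural_change above). report_fn, if given, is called on each
# report as soon as it is produced.
def triage_many(targets, triage_fn, report_fn=None):
    """Run triage_fn once per target and return the collected reports."""
    reports = []
    for target in targets:
        report = triage_fn(**target)
        if report_fn is not None:
            report_fn(report)
        reports.append(report)
    return reports

# Usage: reports = triage_many(targets, triage_structural_change, print_report)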
</code></code></pre><p>For a full rewrite, the expected output is the following:</p><pre><code><code>  URL:             https://...
  Selector found:  False
  Old fingerprint: 4a7f92bc...
  New fingerprint: CONTAINE...
  Page title:      Pricing page
  Recommendation:  FULL_REWRITE: primary selector is gone. Site likely redesigned.
</code></code></pre><p>In this example, the script checks a single URL and a single selector on the target website, but the same code can be generalized to cover several of each.</p><h3>Estimating How Much of the Gap You Can Recover From Raw Storage</h3><p>Once you&#8217;ve rewritten the scraper, you have a gap in your data. If you&#8217;ve been storing raw HTML, you can partially fill it. The following snippet estimates how much:</p><pre><code><code>import boto3

def estimate_gap_coverage(
    competitor: str,
    page_type: str,
    failure_start_date: str,
    bucket_name: str) -&gt; dict:
    """
    Estimate how much of the gap period can be recovered from raw storage.
    """
    s3 = boto3.client("s3")
    prefix = f"raw/{competitor}/{page_type}/"

    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

    available_keys = []
    for page in pages:
        for obj in page.get("Contents", []):
            key = obj["Key"]
            key_date = "/".join(key.split("/")[3:6])
            if key_date &gt;= failure_start_date.replace("-", "/"):
                available_keys.append(key)

    return {
        "gap_start": failure_start_date,
        "recoverable_snapshots": len(available_keys),
        "oldest_recoverable": available_keys[0] if available_keys else None,
        "newest_recoverable": available_keys[-1] if available_keys else None,
        "recommendation": (
            "Run reparse_historical_data() with your new parser to backfill."
            if available_keys else
            "No raw snapshots available for this period. Gap cannot be recovered."
        )
    }
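
# --- Optional helper (illustrative sketch; find_uncovered_days is a hypothetical name) ---
# Lists the days in a gap that have no raw snapshot at all. It assumes keys
# follow the raw/{competitor}/{page_type}/YYYY/MM/DD/... layout used above and
# that both dates are ISO strings (YYYY-MM-DD).
from datetime import date, timedelta

def find_uncovered_days(keys, gap_start, gap_end):
    """Return the ISO dates between gap_start and gap_end (inclusive) with no snapshot."""
    covered = {"-".join(key.split("/")[3:6]) for key in keys}
    first = date.fromisoformat(gap_start)
    last = date.fromisoformat(gap_end)
    uncovered = []
    for offset in range((last - first).days + 1):
        current = (first + timedelta(days=offset)).isoformat()
        if current not in covered:
            uncovered.append(current)
    return uncovered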

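# Example call (the competitor and bucket names are placeholders)
result = estimate_gap_coverage("&lt;COMPETITOR_NAME&gt;", "pricing", "2026-04-12", "&lt;BUCKET_NAME&gt;")

# Print values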
print(f"Gap start:              {result['gap_start']}")
print(f"Recoverable snapshots:  {result['recoverable_snapshots']}")
print(f"Oldest recoverable:     {result['oldest_recoverable']}")
print(f"Newest recoverable:     {result['newest_recoverable']}")
print(f"Recommendation:         {result['recommendation']}")</code></code></pre><p>The expected result is:</p><pre><code><code>Gap start:              2026-04-12
Recoverable snapshots:  3
Oldest recoverable:     raw/&lt;COMPETITOR_NAME&gt;/pricing/2026/04/12/143022.html.gz
Newest recoverable:     raw/&lt;COMPETITOR_NAME&gt;/pricing/2026/04/14/143022.html.gz
Recommendation:         Run reparse_historical_data() with your new parser to backfill.</code></code></pre><p>In practice, these three steps run in sequence:</p><ul><li><p>When something breaks, you start with <em>detect_scraper_health()</em> to confirm there&#8217;s actually a problem and understand how long it&#8217;s been degrading.</p></li><li><p>Then you run <em>triage_structural_change()</em> to understand <em>what</em> broke. That tells you whether you need a full rewrite or a 10-minute fix.</p></li><li><p>Once you&#8217;ve updated your scraper, you call <em>estimate_gap_coverage()</em> to see how much of the gap period you can recover from raw storage, and then <em>reparse_historical_data()</em> to actually backfill it. If <em>estimate_gap_coverage()</em> comes back with zero recoverable snapshots, that gap is gone, which is exactly why the raw layer exists in the first place.</p></li></ul><h2>Conclusion</h2><p>Building competitive intelligence scrapers is not as easy as it seems, for two reasons:</p><ul><li><p>Developing the scrapers is only half the effort.</p></li><li><p>Keeping the scrapers honest 12 months from now probably takes even more effort than coding them in the first place.</p></li></ul><p>The layers presented in this article can keep your competitive intelligence scrapers resilient over time (with the right adjustments for production environments). So, if you&#8217;re implementing this from scratch, don&#8217;t try to build everything at once. 
Add one piece at a time and verify that it works before you add the next one.</p><p>Then, you just need to <a href="https://substack.thewebscraping.club/p/analyzing-scraped-data-pandas-matplotlib">analyze the data you scraped</a> (maybe directly <a href="https://substack.thewebscraping.club/p/building-a-scraper-dashboard-streamlit">implementing a dashboard in Streamlit</a>), and your colleagues in the business department can have sweet dreams: their competitive analytics are safe over time.</p><p>So, let us know: are your competitive intelligence scrapers time-resilient?</p>]]></content:encoded></item><item><title><![CDATA[Kameleo Docker: Exploring the Docker-Based Anti-Detect Browser]]></title><description><![CDATA[Kameleo is finally available on Linux. How? 
Via Docker!]]></description><link>https://substack.thewebscraping.club/p/kameleo-docker-exploring-the-docker</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/kameleo-docker-exploring-the-docker</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 14 Jun 2026 03:01:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9d36ab11-f536-4b56-829a-540e3ba41ad8_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Docker-based stealth browsers are quickly becoming a new standard for automation and scraping infrastructures. The main reason is that they can be integrated directly into CI pipelines or your own fleet of scalable stealth browsers in the cloud.</p><p>Kameleo Docker brings Kameleo&#8217;s anti-detect browser capabilities into a containerized setup, enabling production-ready automation with real fingerprinting and multi-profile isolation on Linux servers.</p><p>In this post, I&#8217;ll take a deep look at this solution and walk you through everything you need to know about it. 
By the end, you&#8217;ll understand what Kameleo Docker is, how its stealth browser approach works, how to set it up, and whether it&#8217;s actually worth trying.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your scraping needs.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><div><hr></div><h2>An Introduction to Kameleo Docker</h2><p>Dig into the world of Kameleo Docker!</p><h3>What Is Kameleo Docker?</h3><p><a href="https://kameleo.io/">Kameleo</a> is an anti-detect browser engineered to make browser sessions look like real user devices. Instead of exposing a generic automation fingerprint, it creates realistic browser identities by spoofing hardware, browser, and environment signals such as WebGL, Canvas, fonts, screen resolution, and geolocation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AV8o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AV8o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 424w, https://substackcdn.com/image/fetch/$s_!AV8o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 
848w, https://substackcdn.com/image/fetch/$s_!AV8o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 1272w, https://substackcdn.com/image/fetch/$s_!AV8o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AV8o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png" width="1456" height="577" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:577,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Kameleo Docker&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kameleo Docker" title="Kameleo Docker" srcset="https://substackcdn.com/image/fetch/$s_!AV8o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 424w, https://substackcdn.com/image/fetch/$s_!AV8o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 848w, 
https://substackcdn.com/image/fetch/$s_!AV8o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 1272w, https://substackcdn.com/image/fetch/$s_!AV8o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa86beae8-6206-4e8e-bdaa-50d9a1e743bd_3008x1193.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kameleo Docker</figcaption></figure></div><p>Kameleo Docker brings <a 
href="https://substack.thewebscraping.club/p/kameleo-anti-detect-browser">that same stealth stack</a> into a self-hosted, containerized deployment model. Rather than relying on a desktop app, you can run Kameleo inside Docker on Linux or Windows servers, CI pipelines, VPSs, or Kubernetes environments.</p><p>Playwright, Puppeteer, and Selenium can connect to the container via CDP, meaning you can keep using your existing browser automation logic with minimal changes.</p><p>Further reading:</p><ul><li><p><em><a href="https://kameleo.io/docker">Kameleo Docker product page</a></em></p></li><li><p><em><a href="https://developer.kameleo.io/integrations/docker/">Kameleo Docker docs</a></em></p></li><li><p><em><a href="https://kameleo.io/blog/kameleo-on-linux-via-docker-what-we-built-what-broke-whats-next">Kameleo on Linux via Docker: What We Built, What Broke, What&#8217;s Next</a></em></p><p></p></li></ul><h3>Why Kameleo Docker Exists</h3><p>Cloud servers, Kubernetes clusters, CI/CD pipelines, and VPS environments are overwhelmingly Linux-native, making Linux compatibility a practical requirement for automation and scraping teams.</p><p>For years, this created a problem for Kameleo users, as the solution only supported Windows and macOS. Because of that, teams wanting to run stealth browsers in production often had to rely on fragile workarounds.</p><p>Some deployed Windows virtual machines alongside Linux scraping stacks in AWS. Others used <a href="https://some-natalie.dev/blog/ssh-x11-forwarding/">X11-over-SSH</a> tunnels to remotely access browsers running on servers. 
These setups were difficult to maintain, resource-intensive, and far from ideal for scalable automation.</p><p>As explained in the <a href="https://kameleo.io/blog/kameleo-on-linux-via-docker-what-we-built-what-broke-whats-next">product announcement blog post</a>, Kameleo customers started to ask for a version of Kameleo that could run directly where their automation already lived.</p><p>As Barnabas Szenasi, founder and lead engineer at Kameleo, explained when I met him at Prague Crawl 2026:</p><blockquote><p><em>&#8220;We could see from customer messages that a significant slice of our automation-first users were running Linux cloud servers and simply couldn&#8217;t use Kameleo at all... At Prague Crawl 2025, Tamas [Kameleo&#8217;s CEO] and I heard the same story from industry peers around the world: scraping pipelines were getting harder, and the need to run real browser environments instead of faking HTTP requests was growing fast.&#8221;</em></p></blockquote><h3>The Philosophy Behind the Project</h3><p>From the beginning, Kameleo&#8217;s philosophy has been simple: masking quality matters more than shipping quickly.</p><p>After all, <a href="https://substack.thewebscraping.club/p/anti-detect-browser-royal-rumble-comments">an anti-detect browser</a> is only useful if it can convincingly behave like a real device. That&#8217;s why Kameleo relies on fingerprints sourced from real-world device traffic and continuously tested against modern anti-bot systems.</p><p>That same quality-first mindset also shaped the Docker project. According to founder and lead engineer Barnabas Szenasi, Linux support took longer than expected because the goal was never just to make Kameleo run in a container.</p><p>The objective was to reach the same masking quality users already expected on Windows and macOS. 
Shipping a functional but lower-fidelity Linux version would have compromised the product&#8217;s core standard.</p><blockquote><div><hr></div></blockquote><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo 
Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>Technical Architecture Behind Kameleo Docker</h2><p>Now that you know why the project exists, let me explain how it works and the engineering behind it.</p><h3>How Kameleo Docker Works</h3><p>Kameleo Docker runs inside either:</p><ul><li><p>a Linux-based container (Ubuntu 22.04), or</p></li><li><p>a Windows-based container (Windows Server Core 2022).</p></li></ul><p>When you pull the image, Docker downloads the correct variant based on your container configuration (Linux containers are the default in most environments, including <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a>). Regardless of the underlying platform, Kameleo exposes the same <a href="https://developer.kameleo.io/reference/api-reference/">Local API</a> for browser and profile management as the <a href="https://kameleo.io/downloads">desktop app</a>.</p><p>The process begins when you create a browser profile through the Local API. If you aren&#8217;t familiar with that concept, <a href="https://developer.kameleo.io/concepts/profiles/">Kameleo profiles</a> are reusable browser environments that bundle a complete browser fingerprint together with persistent browser state, such as cookies, browsing history, local storage, and bookmarks.</p><p>Profiles can also include user-defined settings like proxies, browser extensions, and startup preferences. 
Each profile is tied to a specific browser kernel and can be started, stopped, imported, or exported as needed.</p><h3>Architecture Overview</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sgEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sgEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 424w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 848w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sgEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png" width="1456" height="1019" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1019,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sgEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 424w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 848w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!sgEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6b1496-60f7-4a12-b06a-218d9f2d35ad_2174x1522.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kameleo Docker&#8217;s architecture</figcaption></figure></div><p>Kameleo Docker separates browser execution from automation logic. The container hosts the stealth browsers, fingerprinting systems, and orchestration layer, while your automation scripts run independently on your machine, server, or orchestration platform.</p><p>At the center of the architecture is the Local API, exposed on port <em>5050</em>. This API handles profile creation, fingerprint selection, browser startup, lifecycle management, and more.</p><p>Behind the API sit <a href="https://developer.kameleo.io/concepts/kernels/">Kameleo&#8217;s browser kernels</a>:</p><ul><li><p><strong>Chroma</strong>: A Chromium-based engine.</p></li><li><p><strong>Junglefox</strong>:<strong> </strong>A Firefox-based engine.</p></li></ul><p>These kernels are modified with engine-level masking patches and connected to Kameleo&#8217;s continuously updated fingerprint database (more on this later). 
Since they are exposed through the same API, you can switch between them without making any code changes.</p><p>When a profile starts, Kameleo launches a browser session with the configured fingerprint and settings. Playwright and Puppeteer can then connect to the running browser through a WebSocket endpoint via <a href="https://chromedevtools.github.io/devtools-protocol/">Chrome DevTools Protocol (CDP)</a>.</p><p>In other words, your automation script stays outside the Docker container. The browser behaves as if it were running locally, while execution, fingerprint masking, and browser management happen entirely inside the Docker container.</p><p>Persistent storage is handled through Docker volumes. Profile data, downloaded browser kernels, and runtime state are stored outside the container, allowing environments to be recreated without losing configuration or repeatedly downloading browser components. This makes deployments easier to scale, recover, and reproduce.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Qrb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Qrb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 848w, 
https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png" width="479" height="239.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:479,&quot;bytes&quot;:911444,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196394917?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8Qrb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 424w, 
https://substackcdn.com/image/fetch/$s_!8Qrb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Trusted by teams running ad verification, web scraping, SERP tracking, and market research. Ethically sourced proxies, globally accessible, and fairly priced.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataimpulse.com/&quot;,&quot;text&quot;:&quot;Get Started With DataImpulse&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://dataimpulse.com/"><span>Get Started With DataImpulse</span></a></p></blockquote><div><hr></div><h2>Core Features of Kameleo Docker</h2><p>Time to explore the main features and capabilities provided by Kameleo Docker. For more information, <a href="https://developer.kameleo.io/integrations/docker/">read the official documentation</a>.</p><h3>Real Device Fingerprint Masking</h3><p>Kameleo fingerprints are derived from real-world device traffic, not synthetic templates. 
Each profile represents a coherent combination of OS, browser, hardware signals, and behavioral characteristics.</p><p>All fingerprint surfaces are kept internally consistent, including Canvas, WebGL, audio context, screen resolution, and fonts.</p><p><strong>Note</strong>: TLS fingerprint spoofing isn&#8217;t required, as Kameleo matches the browser kernel version precisely. The TLS stack remains the original, unmodified implementation shipped with the corresponding browser release.</p><p>The goal of the project isn&#8217;t to spoof everything, but to maintain realism across signals. That&#8217;s because overriding too many surfaces increases inconsistency risk, which detection systems can flag. For example, running a macOS fingerprint on a Windows host forces heavy compensation across system-level signals.</p><h3>Proxy Integration and Geo Consistency</h3><p>Each Kameleo profile can be assigned a dedicated proxy (including a rotating proxy), allowing IP-level isolation between browser identities.</p><p>Mismatches between IP geography and browser signals (language, timezone, WebRTC, and system locale) are a common detection vector. To address that, Kameleo provides <a href="https://help.kameleo.io/article/74-recommended-settings">automatic geo-location matching</a> to align the browser&#8217;s settings with the geographic location of the selected proxy IP address.</p><h3>Multi-Profile Isolation</h3><p>Kameleo Docker is built around strict profile isolation, as each browser profile runs as a fully independent environment. This separation opens the door to safe multi-accounting (referred to as &#8220;account management&#8221; in Kameleo terminology). Thanks to this feature, you can operate multiple identities simultaneously without cross-contamination of session data or signals.</p><h3>Linux-Specific Docker Image Features</h3><p>Compared to the Windows-based container, the Linux version of Kameleo Docker includes several additional features. 
These include:</p><ul><li><p><strong><a href="https://developer.kameleo.io/integrations/docker/#vnc-viewer-only-in-linux-based-container">Built-in VNC viewer</a></strong>: Allows you to monitor and interact with live browser sessions. This is especially useful for debugging automation, validating fingerprints, or troubleshooting rendering issues. You can access it through a browser on port <em>8080</em> or via native VNC clients such as RealVNC or TigerVNC on port <em>5900</em>. For security reasons, the VNC viewer is disabled by default.</p></li><li><p><strong>Browser-based Kameleo GUI</strong>: A lightweight browser-based GUI on port <em>80</em> (reach it at <em>http://localhost:80</em>). Compared to the desktop app, it offers reduced functionality and is primarily intended for quick inspection, basic profile management, and monitoring.</p></li><li><p><strong>Optional GPU acceleration</strong>: The Linux container <a href="https://developer.kameleo.io/integrations/docker/#gpu-support-on-linux">supports optional GPU acceleration</a> for graphics-heavy workloads such as WebGL or canvas-intensive websites. Intel/AMD GPUs can be mounted through <em>/dev/dri</em>, while NVIDIA GPUs are supported through the NVIDIA Container Toolkit. 
When no GPU is available, Kameleo falls back to software rendering.</p></li></ul><h2>Getting Started With Kameleo Docker: Step-by-Step Guide</h2><p>In this guided section, I&#8217;ll show you how to set up Kameleo Docker and use it for browser automation against the <a href="https://www.scrapingcourse.com/javascript-rendering">Scraping Course&#8217;s &#8220;JavaScript Rendering&#8221; page</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k9B9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k9B9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 424w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 848w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 1272w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k9B9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png" 
width="1456" height="785" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target Scraping Course &#8220;JavaScript Rendering&#8221; page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target Scraping Course &#8220;JavaScript Rendering&#8221; page" title="The target Scraping Course &#8220;JavaScript Rendering&#8221; page" srcset="https://substackcdn.com/image/fetch/$s_!k9B9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 424w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 848w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 1272w, https://substackcdn.com/image/fetch/$s_!k9B9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d4a67c-8ade-4fb3-b13d-f7d2f23e0f6d_3021x1629.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The target Scraping Course &#8220;JavaScript Rendering&#8221; page</figcaption></figure></div><p>This is a sandbox environment for web scraping that simulates a real-world, JavaScript-rendered ecommerce page. It makes for a great testing target to validate the setup and see how Kameleo Docker behaves in a realistic automation scenario.</p><h3>Requirements and Prerequisites</h3><p>To get started with Kameleo Docker, make sure you have <a href="https://www.docker.com/get-started/">Docker installed and running locally</a>. 
(A <a href="https://login.kameleo.io/Account/Login">Kameleo account</a> is optional: you only need one if you want to run Kameleo Docker with authentication.)</p><p>For more details on supported operating systems and memory requirements, <a href="https://developer.kameleo.io/integrations/docker/#prerequisites">refer to the official documentation</a>.</p><p>Since I&#8217;ll show how to use Kameleo Docker with Playwright in Python, I&#8217;ll assume you already have a Python environment set up with <a href="https://playwright.dev/python/docs/library#installation">Playwright and its dependencies installed</a>.</p><p>To follow along with this tutorial section, I also recommend that you have:</p><ul><li><p><a href="https://docker-curriculum.com/">Basic Docker experience</a> (running containers, mounting volumes, and using Compose files).</p></li><li><p>Familiarity with <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browser automation using Playwright</a>.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Step #1: Create a Kameleo Account</h3><p>If you haven&#8217;t already, start by <a href="https://login.kameleo.io/Account/Register">creating a Kameleo account</a>. Fill out the sign-up form with the required information. 
Once registration is complete, <a href="https://kameleo.io/pricing">a </a><em><a href="https://kameleo.io/pricing">Free</a></em><a href="https://kameleo.io/pricing"> plan</a> will already be activated:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OVWu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OVWu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 424w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 848w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 1272w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OVWu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png" width="1456" height="787" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OVWu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 424w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 848w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 1272w, https://substackcdn.com/image/fetch/$s_!OVWu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313c7aed-9927-4fb5-8981-886a01234f08_3031x1639.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Free plan</figcaption></figure></div><p>Note that the <em>Free</em> plan is enough to use Kameleo Docker.</p><p><strong>Important</strong>: Kameleo credentials are required for the container to authenticate successfully, download browser kernels, and start correctly.</p><h3>Step #2: Start the Docker Container</h3><p>With your account ready, the next step is to pull and run <a href="https://hub.docker.com/r/kameleo/kameleo-app">the Kameleo Docker image</a>.</p><p>Remember that Kameleo ships as a multi-platform Docker image, supporting both Linux-based and Windows-based containers. To download and start the Linux container version of Kameleo, run:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">docker run --platform linux/amd64 \
    --shm-size=2g \
    -p 5050:5050 \
    -e PAT='&lt;YOUR_KAMELEO_PAT&gt;' \
    -v kameleo-data:/data \
    kameleo/kameleo-app:latest</code></pre></div><p><strong>Note 1</strong>: If you run this command in PowerShell, replace each trailing &#8220;\&#8221; with a backtick &#8220;`&#8221;, which is PowerShell&#8217;s line-continuation character.</p><p><strong>Note 2</strong>: To reach the web GUI included in the Linux container of Kameleo Docker, publish its port by adding the <em>-p 80:80</em> argument to your <em>docker run</em> command.</p><p>Here&#8217;s what matters in the above command:</p><ul><li><p><em>--platform linux/amd64</em> ensures Docker pulls the Linux-based image variant.</p></li><li><p><em>--shm-size=2g</em> is required for stable browser execution (the default Docker shared memory of 64MB is too small for browsers).</p></li><li><p><em>-p 5050:5050</em> publishes the Local API port on the host, so automation scripts running outside the container can reach it.</p></li><li><p><em>-v kameleo-data:/data</em> creates a named volume that persists browser kernels and profiles across restarts.</p></li><li><p><em>&lt;YOUR_KAMELEO_PAT&gt;</em> is your <a href="https://developer.kameleo.io/getting-started/configure/#3-generate-a-personal-access-token-pat">Kameleo Personal Access Token (PAT)</a>, used to authenticate with your Kameleo account. Note that <strong>this argument is optional</strong>. 
You can skip the <em>-e PAT='&lt;YOUR_KAMELEO_PAT&gt;'</em> flag entirely if you want to try out Kameleo Docker without an account.</p></li></ul><p>Here&#8217;s the output you should get:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lH45!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lH45!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 424w, https://substackcdn.com/image/fetch/$s_!lH45!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 848w, https://substackcdn.com/image/fetch/$s_!lH45!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 1272w, https://substackcdn.com/image/fetch/$s_!lH45!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lH45!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png" width="1456" height="322" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Retrieving the Kameleo Docker image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Retrieving the Kameleo Docker image" title="Retrieving the Kameleo Docker image" srcset="https://substackcdn.com/image/fetch/$s_!lH45!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 424w, https://substackcdn.com/image/fetch/$s_!lH45!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 848w, https://substackcdn.com/image/fetch/$s_!lH45!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 1272w, https://substackcdn.com/image/fetch/$s_!lH45!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a533b1-de13-4653-a94d-08137ea045b6_3038x672.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Retrieving the Kameleo Docker image</figcaption></figure></div><p>The Kameleo Docker image should now be downloaded and launched on your system. 
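</p><p>If you would rather keep these flags in a file than retype them, the same settings can be expressed as a Compose service. What follows is a minimal sketch that only mirrors the <em>docker run</em> command from this step, not an official Kameleo configuration: the <em>docker-compose.yml</em> file name and the <em>kameleo</em> service name are my own assumptions.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml"># docker-compose.yml (hypothetical file name; mirrors the docker run flags)
services:
  kameleo:                      # service name is an arbitrary choice
    image: kameleo/kameleo-app:latest
    platform: linux/amd64       # pull the Linux-based image variant
    shm_size: "2g"              # Docker's 64MB default is too small for browsers
    ports:
      - "5050:5050"             # Local API
    environment:
      PAT: "&lt;YOUR_KAMELEO_PAT&gt;"   # optional; drop this block to run without an account
    volumes:
      - kameleo-data:/data      # persist browser kernels and profiles

volumes:
  kameleo-data:</code></pre></div><p>With that file in place, running <em>docker compose up</em> in the same folder should start the container with the same settings.</p><p>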
Cool!</p><h3>Step #3: Verify the Service</h3><p>Once you run the image, Kameleo Docker will:</p><ol><li><p>Start the Local API on port <em>5050</em>.</p></li><li><p>Authenticate using your credentials.</p></li><li><p>Download required browser kernels (first run only).</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pYxi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pYxi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 424w, https://substackcdn.com/image/fetch/$s_!pYxi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 848w, https://substackcdn.com/image/fetch/$s_!pYxi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 1272w, https://substackcdn.com/image/fetch/$s_!pYxi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pYxi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png" width="1456" height="82" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:82,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Kameleo Docker image startup logs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Kameleo Docker image startup logs" title="The Kameleo Docker image startup logs" srcset="https://substackcdn.com/image/fetch/$s_!pYxi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 424w, https://substackcdn.com/image/fetch/$s_!pYxi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 848w, https://substackcdn.com/image/fetch/$s_!pYxi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 1272w, https://substackcdn.com/image/fetch/$s_!pYxi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4c9579e-6a4e-4153-90b7-3b6c696c090b_2526x142.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Kameleo Docker image startup logs</figcaption></figure></div><p>To confirm everything is running correctly, visit the following URL in your browser:</p><div class="highlighted_code_block" 
data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">http://localhost:5050/swagger</code></pre></div><p>You should see the Swagger UI for the Kameleo Local API:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oAMx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oAMx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 424w, https://substackcdn.com/image/fetch/$s_!oAMx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 848w, https://substackcdn.com/image/fetch/$s_!oAMx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!oAMx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oAMx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png" width="1456" height="782" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37e97028-9717-472e-850f-dcced084d674_3024x1624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:782,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Swagger UI for the Kameleo Local API at &#8220;http://localhost:5050/swagger&#8221;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Swagger UI for the Kameleo Local API at &#8220;http://localhost:5050/swagger&#8221;" title="The Swagger UI for the Kameleo Local API at &#8220;http://localhost:5050/swagger&#8221;" srcset="https://substackcdn.com/image/fetch/$s_!oAMx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 424w, https://substackcdn.com/image/fetch/$s_!oAMx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 848w, https://substackcdn.com/image/fetch/$s_!oAMx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!oAMx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e97028-9717-472e-850f-dcced084d674_3024x1624.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Swagger UI for the Kameleo Local API at &#8220;http://localhost:5050/swagger&#8221;</figcaption></figure></div><p>You now have a local Kameleo instance ready for automation with Playwright, Puppeteer, or Selenium. Great!</p><h3>Step #4: Download the SDK and Create Your First Profile</h3><p>Now that Kameleo Docker is running, you can interact with it through the APIs exposed at <em>http://localhost:5050</em>. 
The next step is to proceed with the <a href="https://developer.kameleo.io/getting-started/quickstart/">usual Kameleo setup</a> by creating a profile.</p><p>Assuming you already have a Python environment with Playwright installed, start by <a href="https://pypi.org/project/kameleo.local-api-client/">installing the Kameleo SDK</a>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">pip install kameleo-local-api-client</code></pre></div><p>Then, in your Playwright script, initialize the Kameleo API client and generate a browser profile:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from kameleo.local_api_client import KameleoLocalApiClient
from kameleo.local_api_client.models import CreateProfileRequest
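# Optional helper, not part of the Kameleo SDK (the name is hypothetical).
# The fps[0] lookup further down raises a bare IndexError when the search matches nothing;
# a small wrapper like this one surfaces a clearer message instead.
def first_fingerprint(fingerprints):
    if not fingerprints:
        raise RuntimeError("No fingerprint matched the search filters; try loosening them")
    return fingerprints[0]
# Possible usage: CreateProfileRequest(fingerprint_id=first_fingerprint(fps).id, name="twsc demo")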

# Connect to the Kameleo API client
client = KameleoLocalApiClient(endpoint="http://localhost:5050")
# Search for a real-world fingerprint and create a Kameleo profile based on it
fps = client.fingerprint.search_fingerprints(
    device_type="desktop",
    os_family="windows",
    browser_product="chrome",
    browser_version="&gt;145",
)
profile = client.profile.create_profile(
    CreateProfileRequest(fingerprint_id=fps[0].id, name="twsc demo")
)</code></pre></div><p>The above snippet connects the Kameleo API client from the SDK to the local Kameleo Docker APIs. It retrieves a <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">realistic browser fingerprint</a> from the database and creates a persistent browser profile called &#8220;twsc demo&#8221; based on it. In this case, the fingerprint profile is for a desktop Chrome browser (version &gt;145) running on Windows.</p><p>Run the script above. If you started the Linux container of Kameleo Docker while mapping port <em>80</em> for the web GUI, then you&#8217;ll be able to see the &#8220;twsc demo&#8221; profile at <em>http://localhost:80</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7VeS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7VeS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 424w, https://substackcdn.com/image/fetch/$s_!7VeS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 848w, https://substackcdn.com/image/fetch/$s_!7VeS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7VeS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7VeS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png" width="1456" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the &#8220;twsc demo&#8221; profile&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the &#8220;twsc demo&#8221; profile" title="Note the &#8220;twsc demo&#8221; profile" srcset="https://substackcdn.com/image/fetch/$s_!7VeS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 424w, https://substackcdn.com/image/fetch/$s_!7VeS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 848w, https://substackcdn.com/image/fetch/$s_!7VeS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7VeS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60baf46-86ad-4bd6-bf7f-aad1a06b7a24_3071x1620.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the &#8220;twsc demo&#8221; profile</figcaption></figure></div><h3>Step #5: Connect With Playwright</h3><p>You can now connect Playwright to the Kameleo profile created above <a href="https://substack.thewebscraping.club/p/webdriver-vs-cdp-vs-bidi">via CDP</a> using the following WebSocket URL:</p><div class="highlighted_code_block" 
data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ws://localhost:5050/playwright/&lt;KAMELEO_PROFILE_ID&gt;</code></pre></div><p>Achieve that by using <a href="https://playwright.dev/python/docs/api/class-browsertype#browser-type-connect-over-cdp">Playwright&#8217;s </a><em><a href="https://playwright.dev/python/docs/api/class-browsertype#browser-type-connect-over-cdp">connect_over_cdp()</a></em> method on the target URL:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from playwright.sync_api import sync_playwright

# Kameleo profile creation...

# Connect Playwright to the Kameleo instance based on the configured profile
ws_endpoint = f"ws://localhost:5050/playwright/{profile.id}"
with sync_playwright() as playwright:
    browser = playwright.chromium.connect_over_cdp(endpoint_url=ws_endpoint, timeout=90_000)

    # Regular Playwright automation logic...</code></pre></div><p>Wonderful! Playwright is attached to the browser session managed by Kameleo Docker. You can now automate it using standard Playwright APIs as if it were a regular local Chromium instance.</p><h3>Step #6: Implement the Automation Logic</h3><p>To achieve the scraping goal, begin by inspecting the page to study its DOM structure:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zk-5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zk-5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 424w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 848w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 1272w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zk-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png" 
width="1456" height="871" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:871,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Inspecting a product on the page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Inspecting a product on the page" title="Inspecting a product on the page" srcset="https://substackcdn.com/image/fetch/$s_!zk-5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 424w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 848w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 1272w, https://substackcdn.com/image/fetch/$s_!zk-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26be934e-c39d-4d8b-ade1-f7fef47d51fe_2688x1608.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Inspecting a product on the page</figcaption></figure></div><p>Then, apply the following Playwright logic (connected to a Kameleo profile) to automate scraping on the JavaScript-rendered page:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># Connect Playwright to the Kameleo instance based on the configured profile
ws_endpoint = f"ws://localhost:5050/playwright/{profile.id}"
with sync_playwright() as playwright:
    browser = playwright.chromium.connect_over_cdp(endpoint_url=ws_endpoint, timeout=90_000)

    # Open a new page
    context = browser.contexts[0]
    page = context.new_page()

    # Visit target site
    page.goto("https://www.scrapingcourse.com/javascript-rendering")

    # Where to store the scraped data
    products = []

    # Wait for products to render
    page.wait_for_selector(".product-item")

    # Locate all product items
    product_elements = page.locator(".product-item")

    for i in range(product_elements.count()):
        # Select the nth product
        product_element = product_elements.nth(i)

        # Extract the product data
        name = product_element.locator(".product-name").inner_text()
        price = product_element.locator(".product-price").inner_text()
        image = product_element.locator("img.product-image").get_attribute("src")
        link = product_element.locator("a.product-link").get_attribute("href")

        # Populate a product object with the scraped data
        product = {
            "name": name,
            "price": price,
            "image": image,
            "url": link
        }
        # Append it to the products list
        products.append(product)</code></pre></div><p>The above snippet instructs the controlled browser to visit the target page, waits for product elements to load, then iterates through each product DOM node to extract structured fields (name, price, image, URL) and stores them in a Python list for downstream processing.</p><p>The Kameleo-powered automation script is almost complete. Only one step remains!</p><h3>Step #7: Stop the Kameleo Profile</h3><p>Normally, in a Playwright scenario, you would need to call <em>browser.close()</em> to terminate the browser session and release its resources.</p><p>In Kameleo, that&#8217;s not required. Instead, you only need to call:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">client.profile.stop_profile(profile_id=profile.id)</code></pre></div><p>The above line of code sends a close command to the browser via CDP. Once the browser actually stops, the Kameleo profile is terminated, too. This ensures that all resources associated with both the browser and the running profile are properly released.</p><h3>Step #8: Run the Script</h3><p>The final Playwright automation script, connecting via CDP to the stealth browser instance exposed by Kameleo Docker, will contain:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># pip install playwright kameleo-local-api-client

from kameleo.local_api_client import KameleoLocalApiClient
from kameleo.local_api_client.models import CreateProfileRequest
from playwright.sync_api import sync_playwright
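import csv

# Optional helper (an addition to the tutorial; the function name and CSV layout are assumptions).
# It sketches one way to persist the scraped product dictionaries for downstream processing.
def save_products_csv(products, path):
    fieldnames = ["name", "price", "image", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(products)
# Possible usage once the scraping loop has finished: save_products_csv(products, "products.csv")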

# Connect to the Kameleo API client
client = KameleoLocalApiClient(endpoint="http://localhost:5050")
# Search for a real-world fingerprint and create a Kameleo profile based on it
fps = client.fingerprint.search_fingerprints(
    device_type="desktop",
    os_family="windows",
    browser_product="chrome",
    browser_version="&gt;145",
)
profile = client.profile.create_profile(
    CreateProfileRequest(fingerprint_id=fps[0].id, name="twsc demo")
)

# Connect Playwright to the Kameleo instance based on the configured profile
ws_endpoint = f"ws://localhost:5050/playwright/{profile.id}"
with sync_playwright() as playwright:
    browser = playwright.chromium.connect_over_cdp(endpoint_url=ws_endpoint, timeout=90_000)

    # Open a new page
    context = browser.contexts[0]
    page = context.new_page()

    # Visit target site
    page.goto("https://www.scrapingcourse.com/javascript-rendering")

    # Where to store the scraped data
    products = []

    # Wait for products to render
    page.wait_for_selector(".product-item")

    # Locate all product items
    product_elements = page.locator(".product-item")

    for i in range(product_elements.count()):
        # Select the nth product
        product_element = product_elements.nth(i)

        # Extract the product data
        name = product_element.locator(".product-name").inner_text()
        price = product_element.locator(".product-price").inner_text()
        image = product_element.locator("img.product-image").get_attribute("src")
        link = product_element.locator("a.product-link").get_attribute("href")

        # Populate a product object with the scraped data
        product = {
            "name": name,
            "price": price,
            "image": image,
            "url": link
        }
        # Append it to the products list
        products.append(product)

    # Print the scraped products
    for product in products:
        print(product)

# Stop the Kameleo profile
client.profile.stop_profile(profile_id=profile.id)</code></pre></div><p>Execute the script, and you should see output similar to this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tH8f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tH8f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 424w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 848w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 1272w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tH8f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png" width="1456" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output produced by the automation script&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output produced by the automation script" title="The output produced by the automation script" srcset="https://substackcdn.com/image/fetch/$s_!tH8f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 424w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 848w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 1272w, https://substackcdn.com/image/fetch/$s_!tH8f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c6244e-687f-4002-87e0-c421f94958bc_1692x893.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The output produced by the automation script</figcaption></figure></div><p>Notice how the script successfully scraped product data from the JavaScript-rendered page.</p><p>Once execution completes, open the Kameleo web GUI, and you&#8217;ll notice that the &#8220;twsc demo&#8221; profile is now marked as &#8220;TERMINATED&#8221;:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zZCP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!zZCP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 424w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 848w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 1272w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zZCP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png" width="1456" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the updated status of the Kameleo profile&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the updated status of the Kameleo profile" title="Note the updated status of the Kameleo profile" 
srcset="https://substackcdn.com/image/fetch/$s_!zZCP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 424w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 848w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 1272w, https://substackcdn.com/image/fetch/$s_!zZCP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db45310-8917-4422-8ff1-02ddda4896c3_3057x1613.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the updated status of the Kameleo profile</figcaption></figure></div><p>That doesn&#8217;t mean the profile is gone forever. Quite the opposite!</p><p>Kameleo profiles are reusable, meaning you can retrieve and start them again later to continue the browsing session with the same fingerprint and persisted state. I cover exactly how to do that in the FAQ at the end of this article.</p><p>Mission complete! You just learned how to use Kameleo Docker for Playwright automation. With very similar logic, you can drive it from Puppeteer, Selenium, or any other CDP-compatible tool, in both Python and JavaScript.</p><h3>Pricing Model</h3><p>Kameleo Docker is included across all plans at no additional cost. You get the same core limits (concurrent browsers, number of profiles, and browser usage time) regardless of whether you run the desktop app or the containerized version. For details, <a href="https://kameleo.io/pricing">see the official pricing page</a>.</p><h2>Anti-Bot Performance Benchmarks</h2><p>To test Kameleo Docker, I ran a simple script against one page protected by each of the major anti-bot detection systems. 
The results are shown below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kcyp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kcyp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 424w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 848w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 1272w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kcyp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png" width="1456" height="461" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:461,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vanilla Playwright (headless) vs vanilla Playwright (headful) vs Kameleo Docker&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vanilla Playwright (headless) vs vanilla Playwright (headful) vs Kameleo Docker" title="Vanilla Playwright (headless) vs vanilla Playwright (headful) vs Kameleo Docker" srcset="https://substackcdn.com/image/fetch/$s_!kcyp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 424w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 848w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 1272w, https://substackcdn.com/image/fetch/$s_!kcyp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a58275-3f59-4452-93a9-a8cf18f700b7_1536x486.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Vanilla Playwright (headless) vs vanilla Playwright (headful) vs Kameleo Docker</figcaption></figure></div><p><strong>Note</strong>: All tests were performed locally using my ISP&#8217;s residential IP address.</p><p>As shown above, in this basic experiment, Kameleo Docker achieved a 100% success rate. 
In contrast, Playwright consistently failed in headless mode and, in some cases, also struggled in headful mode.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Final Thoughts on Kameleo Docker</h3><p><em>What stood out when I met Barnabas at <a href="https://www.praguecrawl.com/">Prague Crawl 2026</a> (see you next year &#128521;) was the clear passion the team has for the project, along with their focus on quality and continuous improvement.</em></p><p><em>When testing new products, especially technical and complex ones like Kameleo Docker, you usually stumble across bugs or unexpected behavior. I can confidently say that this wasn&#8217;t the case here. Everything ran smoothly from the beginning, and I didn&#8217;t encounter any issues.</em></p><p><em>On top of that, the benchmark results are promising, and I didn&#8217;t notice any significant performance lag. So my honest takeaway is simple: if you&#8217;re looking for a production-ready, containerized stealth browser, or you&#8217;re simply passionate about automation and scraping, consider giving Kameleo Docker a try!</em></p><p>In this article, I covered what the project is about, what it offers, how it works, and how to use it. As always, remember to use Kameleo Docker only for legal and <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical web scraping and automation</a>. 
Until next time!</p><div><hr></div><h2>FAQ</h2><h3>Is Kameleo Docker different from the desktop app?</h3><p>Kameleo Docker differs from the desktop app mainly in deployment. Instead of a local GUI application, it ships as a containerized service that runs on both Linux and Windows servers. Feature parity is largely preserved, including profiles, fingerprinting, and browser engines.</p><h3>Can I reuse existing profiles in Kameleo Docker?</h3><p>Yes! Profiles can be reused by retrieving the full list of profiles, filtering by name (or ID, if you know it), and then starting the desired profile. For example, to reuse the &#8220;twsc demo&#8221; profile, write:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from kameleo.local_api_client import KameleoLocalApiClient
from kameleo.local_api_client.models import ProfileLifetimeState

# Create a client for the Kameleo Local API (served here on port 5050)
client = KameleoLocalApiClient(endpoint="http://localhost:5050")

# Fetch all available profiles
profiles = client.profile.list_profiles()

# Find the profile with the specific name
target_name = "twsc demo"
profile = next((p for p in profiles if p.name == target_name), None)
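
# Optional sketch, not part of the original script: look the profile up by ID instead.
# known_profile_id is a hypothetical placeholder; set it to an ID saved from an earlier run.
known_profile_id = None
if known_profile_id is not None:
    profile = next((p for p in profiles if str(p.id) == str(known_profile_id)), None)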

# Check if the profile was found
if profile:
    # Start the existing profile if it isn't already running
    if profile.status.lifetime_state != ProfileLifetimeState.RUNNING:
        client.profile.start_profile(profile.id)</code></pre></div><h3>Can I use Kameleo Docker in my CI/CD pipeline?</h3><p>Kameleo Docker fits naturally into CI/CD pipelines by running as a disposable, reproducible container in build or test stages. You can spin up browsers on demand, run automated flows, and tear them down after execution. Configuration is <a href="https://developer.kameleo.io/integrations/docker/#example-with-docker-compose">typically handled via Docker Compose</a>.</p><h3>Does Kameleo Docker support proxy integration?</h3><p>Yes! HTTP, HTTPS, and SOCKS proxies can be configured at profile creation time, <a href="https://developer.kameleo.io/tutorials/using-proxy-servers/">as explained in the documentation</a>.</p><h3>Can Kameleo Docker scale to thousands of browsers?</h3><p>Kameleo Docker supports horizontal scaling through standard orchestration tools. You can run multiple containers across clusters using Kubernetes or <a href="https://developer.kameleo.io/integrations/docker/#aws-ecs-support">AWS ECS</a>, each managing independent browser instances.</p><h3>How does Firefox automation work in Kameleo Docker?</h3><p>Kameleo Docker supports Firefox-based automation through the Junglefox engine. Because Playwright cannot connect directly to Firefox-based sessions, Kameleo provides a <em><a href="https://developer.kameleo.io/integrations/docker/#using-junglefox-playwright-pw-bridge">pw-bridge</a></em><a href="https://developer.kameleo.io/integrations/docker/#using-junglefox-playwright-pw-bridge"> helper</a> that acts as a compatibility layer. This component routes Playwright connections to the correct browser session, allowing standard automation scripts to run unchanged while still using Firefox-based fingerprint profiles.</p><div><hr></div><p><em>Did you like this article? 
Share it with someone who might find it useful and get a discount on paid plans.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/kameleo-docker-exploring-the-docker?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.thewebscraping.club/p/kameleo-docker-exploring-the-docker?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><p></p>]]></content:encoded></item><item><title><![CDATA[THE LAB #107: Reversing Shopee's native crypto with Ghidra]]></title><description><![CDATA[Shopee hides its crypto in a native library. We read it in Ghidra and rebuild it in Python, byte for byte.]]></description><link>https://substack.thewebscraping.club/p/reversing-shopee-app-ghidra</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/reversing-shopee-app-ghidra</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 11 Jun 2026 22:19:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/988c1c72-086f-45da-89ea-a3074ccdcc0c_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Shopee is one of the largest marketplaces in Southeast Asia, and like most big apps, its mobile API is a better scraping target than its website. The app talks to the backend in plain JSON over HTTPS, the endpoints are stable, and the anti-bot layer is usually lighter than the one guarding the web frontend. 
We covered the easy version of this in <a href="https://substack.thewebscraping.club/p/the-lab-12-reverse-engineering-mobile">The Lab #12</a>, where Charles and JADX were enough to read an Android app&#8217;s traffic and replay it.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. Their set of solutions cover all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Shopee does not hand you the easy version. 
Capture a request, replay it, and the backend answers with HTTP 418 and a security error code. Every API call carries a set of anti-fraud headers, and the code that builds them is not in the Java you can read with JADX. It sits in native code, in compiled <code>.so</code> libraries, which is exactly where traffic interception and a Java decompiler stop being useful. Open the app in JADX and the signing method is there in name only, declared <code>native</code>, with its body on the far side in ARM machine code.</p><p>This is a two-part investigation into how Shopee signs its API requests and how you reproduce that signing yourself. The strategy is the one that works on most hardened apps. You locate the native security libraries, open them in a disassembler, and turn what they do back into something you control. When a library is readable crypto, you reimplement it in Python and sign offline, at any volume, with no app in the loop. When it is a bytecode virtual machine you cannot practically rewrite, you keep the app running and drive its own signer as an oracle. We chose this route because it is the one that scales. An offline signer, or an oracle you call, runs inside your scraper on a server. A rooted phone you have to babysit does not.</p><p>This first part is the foundation. We take one of Shopee&#8217;s native libraries, <code>libshopeeaegis.so</code>, reverse it end to end with Ghidra and rebuild it in Python. Reading the decompiled code identifies every operation as textbook crypto, and our rebuild reproduces those algorithms byte for byte. It is the readable case, the kind you win cleanly, and the clearest worked example of the method. The second part takes on the harder library, the one that computes the per-request signature, and gets past it with the oracle approach.</p><p>What you take from each part depends on your goal. If Shopee is your target, the payoff is the full picture of its request signing across both parts. 
If you scrape other apps, the method matters more than the marketplace. Most apps that protect their API at all push the work into a native library, and a large share of those are plain, readable crypto you can reproduce. We work it out on Aegis here, and it is the same move on the next app you open.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://visit.decodo.com/WyQ3mA" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>The tools</h2><p>We use four tools, each doing one job.</p><p><a href="https://github.com/androguard/androguard">androguard</a> is a Python library for static APK analysis. We use it for fast recon. It lists the native libraries an app ships and finds which classes declare <code>native</code> methods. It does not give you readable source. It gives you structure you can script.</p><p><a href="https://github.com/skylot/jadx">JADX</a> decompiles Dalvik bytecode back to Java. It is how you read the managed side of the app and find the exact class and method that crosses into native code. JADX stops at the <code>native</code> keyword, which is the handoff point to the next tool.</p><p><a href="https://github.com/NationalSecurityAgency/ghidra">Ghidra</a> is the NSA&#8217;s open source reverse engineering framework. It disassembles a <code>.so</code> and decompiles it to pseudo C. It is the only tool here that can read native code, and it is the one this article leans on.</p><p><a href="https://github.com/frida/frida">Frida</a> injects a JavaScript engine into a running process so you can hook and call functions live. We use it to run the app under instrumentation and confirm our static reading against what the app actually does.</p><p>JADX and androguard read the managed code. Ghidra reads the native code. Frida watches the code run. 
The native library is the one piece only Ghidra can open, so the work centers there.</p><h2>Modeling the app&#8217;s defenses</h2><p>Before opening anything, it helps to name the layers, because Shopee has several and only one is our target here.</p><p>The managed layer is the Java and Kotlin code. It builds requests, attaches headers, and calls into native methods. JADX reads it.</p><p>The native layer is a set of <code>.so</code> libraries the app loads. Pull the arm64 split out of the APK with androguard and the security-relevant ones stand out by name. They are <code>libshopeeaegis.so</code>, <code>libshpssdk.so</code>, and <code>libBkeBizSecurity.so</code>, plus <code>libjnihook.so</code> and <code>libshook.so</code>. The last two are a hooking framework and an anti-hook layer, which means the app actively watches for instrumentation. That matters for Frida later.</p><p>The request-signing layer sits on top. Two okhttp interceptors, <code>com.shopee.app.network.antifraud.b</code> and <code>.d</code> (they call themselves <code>SecurityNewSapInterceptor</code> and <code>SecurityNewSapPostInterceptor</code>), attach the anti-fraud headers <code>af-ac-enc-sz-token</code> and <code>x-sap-ri</code> to API requests. The values they attach come from <code>libshpssdk.so</code>, the Shopee Security SDK.</p><p>We target one layer, <code>libshopeeaegis.so</code>, a general-purpose crypto library the app calls for specific operations. The request signer in <code>libshpssdk.so</code> stays out of scope. It is a bytecode virtual machine, a harder problem that we handle separately, and reproducing the <code>af-ac</code> headers is not the promise here. The promise is that you can take <code>libshopeeaegis.so</code>, understand every operation it performs, and reproduce it byte for byte in Python.</p><p>One detail decides whether that promise holds. <code>libshopeeaegis.so</code> loads only when the app needs it, so it is not present at idle. 
We watched the process maps over a minute of normal browsing and the library never appeared. The crypto we are about to reverse is a toolbox the app reaches for in certain flows, not the thing running on every request.</p><div><hr></div><blockquote><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://byteful.com/?promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://byteful.com/?promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><h2>Getting the library and finding the door</h2><p>We pulled Shopee PH 3.75.24 (<code>com.shopee.ph</code>) as an XAPK and unzipped it. The native libraries are not in <code>base.apk</code>. For a split bundle they live in <code>config.arm64_v8a.apk</code>. Listing the <code>.so</code> files with androguard and <code>unzip -l</code> shows that <code>libshopeeaegis.so</code> is small, at 280 KB, which is a good sign. Small means little room for a heavy obfuscator.</p><p>androguard answers the first question, which library to open. It does not answer the second, how the app calls it. For that we go to JADX and find the class on the Java side. The library registers its native methods against <code>com.shopee.sz.reinforce.Aegis</code>. The class exposes a method <code>fire</code>, overloaded, declared <code>native</code>. 
Two of the overloads matter:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;820ec8fb-46fc-4f0c-9553-0dc70f676be0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">native byte[] fire(int mode, byte[] data)
native byte[] fire(int mode, byte[] data, byte[] key)</code></pre></div><p>This is the door. The first argument is an integer mode, followed by one or two byte arrays, and the return value is a byte array. JADX cannot show what <code>fire</code> does, because the body is in the <code>.so</code>. So we open the <code>.so</code>.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Reading the library with Ghidra</h2><p>We ran Ghidra 12.1.2 in headless mode. It runs without a GUI, it scripts cleanly, and it repeats exactly. The workflow is documented in our Ghidra tool skill, which you can use in Claude Code as we did for this test. In short, you import the <code>.so</code>, let auto analysis run, then run a script that decompiles functions to a file.</p><pre><code><code>support/analyzeHeadless /tmp/proj aegis \
  -import config.arm64_v8a/lib/arm64-v8a/libshopeeaegis.so \
  -scriptPath ./scripts \
  -postScript DecompileExport.java out.c \
  -overwrite</code></code></pre><p>Auto analysis finished in nine seconds and the decompiler produced 674 functions with zero failures. That number alone tells you this is not a packed or virtualized binary. A protected library fights the decompiler; this one did not.</p><p>The first useful function is <code>JNI_OnLoad</code>, which every JNI library runs at load time. Read its pseudo C and it looks up the class <code>com/shopee/sz/reinforce/Aegis</code> and calls <code>RegisterNatives</code> with two methods. That confirms the door from the Java side and tells us the native functions are registered dynamically rather than exported under <code>Java_*</code> names. Dynamic registration is a mild form of hiding, and it is exactly what Ghidra&#8217;s JNI handling and a <code>RegisterNatives</code> trace are for.</p><p>The C++ symbols survived. That is the break that makes this library readable. The class is <code>Aegis</code>, with methods named <code>missileFire</code>, <code>missileCount</code>, <code>prism</code>, <code>snowDon</code>, <code>tugWar</code>, and <code>parse</code>. There is a second class, <code>TeslaModel</code>, with <code>model_3</code>, <code>model_a</code>, <code>model_b</code>, <code>model_c</code>, <code>model_e</code>, <code>model_s</code>, <code>model_x</code>, <code>model_y</code>, and <code>getNuremberg</code>. The names are deliberately silly, a Tesla and military theme, but they are real symbols, and the structure is intact.</p><p>Follow the call chain from the registered native function. The dispatcher is <code>Aegis::prism</code>, a plain <code>switch</code> on the mode integer:</p><pre><code><code>switch(param_1) {
case 0:  model_3(...)               // one input
case 1:  model_x(key, input, ...)   // keyed
case 2:  model_x(...); model_3(...) // keyed, then case 0
case 3:  model_y(...); model_3(...) // keyed, then case 0
case 4:  model_e(...)               // one input
case 5:  model_a(...)               // one input
case 6:  model_b(...)               // one input
case 7:  model_s(...); model_3(...) // keyed, then case 0
case 8:  model_c(...); model_3(...) // keyed, then case 0
}</code></code></pre><p>One native call selects one of nine operations by an integer, and some operations are a keyed primitive followed by <code>model_3</code>. To name each operation we read two things, the output size and the primitive body.</p><p>The output size comes from <code>TeslaModel::getNuremberg(mode, len)</code>, which <code>missileFire</code> calls to size the output buffer before doing the work. It returns 16 for mode 4, 32 for mode 5, 64 for mode 6, and 20 for mode 8. Those are the digest sizes of MD5, SHA-256, SHA-512, and SHA-1. For mode 0 it returns the Base64 expansion of the input length. The size function alone half-names the table.</p><p>The bodies confirm the rest, and here the silly names get helpful, because the renamed primitives kept their original suffixes. <code>model3_autopilot</code> is a textbook Base64 encoder. It reads three bytes, writes four, and pads with <code>0x3d</code>, which is the <code>=</code> character. <code>modelx_autopilot_cbc</code> is AES in CBC mode, recognizable because it XORs each 16 byte block with the previous ciphertext block before the round function. The hash contexts are renamed with a <code>phantom</code> and <code>F</code> theme but keep the gnulib <code>_init_ctx</code> / <code>_process_bytes</code> / <code>_finish_ctx</code> shape. <code>phantom1</code> is SHA-1, <code>phantom256</code> is SHA-256, <code>InitF22</code> is SHA-512. And <code>phantom1</code> as called by <code>model_c</code> is the HMAC form. It XORs the key with <code>0x36</code> for the inner pad and <code>0x5c</code> for the outer pad over a 64 byte block, which is the HMAC construction.</p><p>Two of the keyed modes turned out not to be ciphers at all. <code>model_s</code> calls <code>phantom256</code> with a key and a message and returns 32 bytes, so it is HMAC-SHA256. <code>model_c</code> calls <code>phantom1</code> the same way and returns 20 bytes, so it is HMAC-SHA1. Reading the bodies kept us honest here. 
From the signatures alone we had guessed AES.</p><p>That gives the full table.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x9os!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x9os!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 424w, https://substackcdn.com/image/fetch/$s_!x9os!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 848w, https://substackcdn.com/image/fetch/$s_!x9os!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 1272w, https://substackcdn.com/image/fetch/$s_!x9os!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x9os!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png" width="719" height="403" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:403,&quot;width&quot;:719,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42068,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/201664746?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x9os!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 424w, https://substackcdn.com/image/fetch/$s_!x9os!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 848w, https://substackcdn.com/image/fetch/$s_!x9os!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 1272w, https://substackcdn.com/image/fetch/$s_!x9os!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e2490a-4994-4d2a-aa8d-9a95b697ca27_719x403.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><code>model_x</code> pads with PKCS7 to a 16 byte boundary, then runs AES-CBC. The key length sets the variant, and a 16 byte key gives AES-128. <code>model_y</code> does the same but writes the IV in front of the ciphertext, the standard prepend-the-IV pattern, before the Base64 in mode 3.</p><p>One value is not in the file. The CBC IV is a fixed 16 byte constant the library keeps at a <code>.bss</code> address. <code>.bss</code> is zero-initialized on disk and filled at runtime, so the IV is set when the library initializes and you cannot read it statically. For the hash, HMAC, and Base64 modes that does not matter, because their output is fully determined by the input and key. 
For the three AES modes it means byte-identical output needs the real IV, which you read from the live process once the library loads.</p><p>As always, the code for the Python reimplementation in this article is available in <a href="https://github.com/TheWebScrapingClub/thelab">our GitHub repository reserved for paying users</a>, inside the folder <strong><a href="https://github.com/TheWebScrapingClub/thelab">107.SHOPEE-GHIDRA</a></strong>.</p>
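<p>The IV-free rows of the table can be checked with the Python standard library alone. What follows is a minimal sketch, not the repository code. It assumes the mode numbering from the dispatcher listing, the standard Base64 alphabet, MD5 for the 16 byte digest in mode 4, and the argument order of the three-argument <code>fire</code> overload. The Python names <code>fire</code> and <code>pkcs7_pad</code> are our own, and the three AES modes are left out because their IV exists only in the live process.</p>

```python
import base64
import hashlib
import hmac


def pkcs7_pad(data: bytes, block: int = 16) -> bytes:
    """PKCS7 padding as described for model_x (hypothetical helper name)."""
    # n is always in 1..block, so aligned input gains one full block.
    n = block - len(data) % block
    return data + bytes([n]) * n


def fire(mode: int, data: bytes, key: bytes = b"") -> bytes:
    """Stand-in for Aegis.fire, covering only the modes that need no IV.

    The mode-to-primitive mapping is an assumption read from the
    dispatcher listing and the digest sizes quoted in the article.
    """
    if mode == 0:  # model_3: Base64 of the input
        return base64.b64encode(data)
    if mode == 4:  # 16 byte digest, assumed to be MD5
        return hashlib.md5(data).digest()
    if mode == 5:  # 32 byte digest: SHA-256
        return hashlib.sha256(data).digest()
    if mode == 6:  # 64 byte digest: SHA-512
        return hashlib.sha512(data).digest()
    if mode == 7:  # model_s then model_3: HMAC-SHA256, then Base64
        return base64.b64encode(hmac.new(key, data, hashlib.sha256).digest())
    if mode == 8:  # model_c then model_3: HMAC-SHA1, then Base64
        return base64.b64encode(hmac.new(key, data, hashlib.sha1).digest())
    # Modes 1, 2 and 3 are AES-CBC and need the runtime IV.
    raise NotImplementedError(f"mode {mode} needs the runtime IV")
```

<p>Calling <code>fire(7, message, key)</code> returns the Base64 text of a 32 byte HMAC-SHA256 tag. If outputs captured from the running app differ, the assumptions listed above are the first things to re-check.</p>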
      <p>
          <a href="https://substack.thewebscraping.club/p/reversing-shopee-app-ghidra">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Give Claude Real-Time Web Access With the Decodo MCP]]></title><description><![CDATA[Learn how to connect Claude to the web with zero integration code]]></description><link>https://substack.thewebscraping.club/p/claude-decodo-mcp-how-to</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/claude-decodo-mcp-how-to</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 07 Jun 2026 19:37:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0ca9845f-9623-47e0-9aa6-3e40da60ddb6_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve spent serious time scraping, you know the real work isn&#8217;t parsing HTML: it&#8217;s surviving Cloudflare, rotating proxies, handling CAPTCHAs, and pretending to be a human long enough to get the data you need.</p><p>Now, <a href="https://substack.thewebscraping.club/p/the-evolving-career-scrapers">since the rise of AI and LLMs, every office job has changed, with no exceptions for scraping professionals</a>. For us, this means that surviving anti-bots and CAPTCHAs is only the first challenge when building scraping pipelines. The second is integrating scraping services and capabilities into AI pipelines and AI agents. The reason for this shift is simple: your boss or your clients don&#8217;t just want the data anymore. They want it in real time, structured, and often enriched with insights.</p><p>Until recently, these pipelines required a huge amount of custom code (and time!). Luckily for us, that struggle eased once MCP was released.</p><p>In this article, you&#8217;ll learn what MCP is, how the Decodo MCP server works, and how to integrate it with Claude Desktop. 
You&#8217;ll also learn how to use it with two hands-on examples.</p><p>Let&#8217;s get into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h3>What is MCP?</h3><p><a href="https://modelcontextprotocol.io/docs/getting-started/intro">The Model Context Protocol (MCP)</a> is an open standard introduced by Anthropic that defines how AI models connect to external tools, data sources, and services. Before MCP, every integration between an AI model and an external system had to be built from scratch. MCP replaces that one-off work with a single, shared protocol.</p><p>In practice, MCP acts as a common language for models to connect with external tools, files, and systems. For example, say you need an AI assistant to pull a file from Google Drive, query a company database, and trigger an action in an internal app. That&#8217;s exactly the kind of job MCP is built to handle!</p><p>The practical upside is composability. Developers can mix and match several MCP servers in a single AI application without writing any integration code. More than a year after its introduction, MCP has become the standard way to integrate different services and applications into a single AI application.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3>What is Decodo MCP?</h3><p>The <a href="https://visit.decodo.com/9VzKKe">Decodo MCP Server</a> is a web scraping layer for AI agents. It connects MCP-compatible clients to Decodo&#8217;s Web Scraping API, enabling:</p><ul><li><p><strong>Web scraping for LLMs and AI agents without managing infrastructure:</strong> It can scrape any website, including JavaScript-heavy pages, and return real-time data without you having to handle proxy rotation, CAPTCHA solving, or anti-bot systems. It is specifically built for RAG pipelines, AI research agents, and automation flows.</p></li><li><p><strong>Structured outputs for LLM workflows</strong>: It lets you retrieve the scraped data as Markdown (LLM-ready), JSON (for structured pipelines), or screenshots (for visual context).</p></li></ul><p>As of now, the Decodo MCP server exposes the following tools:</p><ul><li><p><em>scrape_as_markdown</em>: Scrapes the target URL given in the prompt. 
Returns results in Markdown.</p></li><li><p><em>screenshot</em>: Captures a screenshot of any webpage and returns it as a PNG image.</p></li><li><p><em>google_search</em>: Scrapes Google Search for a given query, and returns parsed results.</p></li><li><p><em>google_ads</em>: Scrapes Google Ads search results.</p></li><li><p><em>google_lens</em>: Scrapes Google Lens image search results.</p></li><li><p><em>google_ai_mode</em>: Scrapes Google AI Mode (Search with AI) results.</p></li><li><p><em>google_travel_hotels</em>: Scrapes Google Travel Hotels search results.</p></li><li><p><em>amazon_search</em>: Scrapes Amazon Search for a given query, and returns parsed results.</p></li><li><p><em>amazon_product</em>: Scrapes a given Amazon Product page.</p></li><li><p><em>amazon_pricing</em>: Scrapes Amazon Product pricing information.</p></li><li><p><em>amazon_sellers</em>: Scrapes Amazon Seller information.</p></li><li><p><em>amazon_bestsellers</em>: Scrapes Amazon Bestsellers list.</p></li><li><p><em>walmart_search</em>: Scrapes Walmart Search for a given query, and returns parsed results.</p></li><li><p><em>walmart_product</em>: Scrapes Walmart Product page.</p></li><li><p><em>target_search</em>: Scrapes Target Search for a given query, and returns parsed results.</p></li><li><p><em>target_product</em>: Scrapes Target Product page.</p></li><li><p><em>tiktok_post</em>: Scrapes a TikTok post URL.</p></li><li><p><em>tiktok_shop_search</em>: Scrapes TikTok Shop Search for a given query, and returns parsed results.</p></li><li><p><em>tiktok_shop_product</em>: Scrapes TikTok Shop Product page.</p></li><li><p><em>tiktok_shop_url</em>: Scrapes TikTok Shop page by URL.</p></li><li><p><em>youtube_metadata</em>: Scrapes YouTube video metadata.</p></li><li><p><em>youtube_channel</em>: Scrapes YouTube channel videos.</p></li><li><p><em>youtube_subtitles</em>: Scrapes YouTube video subtitles.</p></li><li><p><em>youtube_search</em>: Search YouTube 
videos.</p></li><li><p><em>reddit_post</em>: Scrapes a specific Reddit post.</p></li><li><p><em>reddit_subreddit</em>: Scrapes Reddit subreddit results.</p></li><li><p><em>reddit_user</em>: Scrapes a Reddit user profile and their posts and comments.</p></li><li><p><em>bing_search</em>: Scrapes Bing Search results.</p></li><li><p><em>chatgpt</em>: Search and interact with ChatGPT for AI-powered responses and conversations.</p></li><li><p><em>perplexity</em>: Search and interact with Perplexity for AI-powered responses and conversations.</p></li></ul><p><strong>NOTE</strong>: Decodo is currently onboarding its MCP server onto various platforms and marketplaces. At the time of writing, it can be found on the <a href="https://registry.modelcontextprotocol.io/?q=decodo">Official MCP registry</a>, <a href="https://www.pulsemcp.com/servers/decodo">Pulse MCP</a>, <a href="https://glama.ai/mcp/servers?query=decodo">Glama AI</a>, <a href="https://mcp.so/explore?q=decodo">mcp.so</a>, and <a href="https://mcpmarket.com/server/decodo">mcpmarket.com</a>.</p><h2>How To Integrate The Decodo MCP With Claude</h2><p>The whole point of an MCP server is to be plugged into an LLM. The <a href="https://github.com/Decodo/mcp-server">Decodo MCP server</a> can be integrated with several LLM-powered tools, such as Claude and Cursor. In this section, you will learn how to integrate it with Claude.</p><h3>Requirements</h3><p>To use the Decodo MCP, your system must satisfy the following requirements:</p><ul><li><p><strong>Claude Desktop</strong>: To integrate the Decodo MCP server with Claude, you need <a href="https://claude.com/download">Claude Desktop</a>. If you don&#8217;t have it installed yet, you can download it from the Claude website.</p></li><li><p><strong>Decodo account:</strong> Create an account at <a href="https://dashboard.decodo.com/">dashboard.decodo.com</a>. 
With a free one, you have up to 2K free requests.</p></li><li><p><strong>Scraping token</strong>: Get the basic authentication token. To get it, click on <strong>Web Scraping API</strong> &gt; <strong>API playground:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0R6j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0R6j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 424w, https://substackcdn.com/image/fetch/$s_!0R6j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 848w, https://substackcdn.com/image/fetch/$s_!0R6j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 1272w, https://substackcdn.com/image/fetch/$s_!0R6j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0R6j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png" width="1456" height="506" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:170154,&quot;alt&quot;:&quot;How to get the basic authentication token in the Decodo dashboard by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to get the basic authentication token in the Decodo dashboard by Federico Trotta" title="How to get the basic authentication token in the Decodo dashboard by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!0R6j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 424w, https://substackcdn.com/image/fetch/$s_!0R6j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 848w, https://substackcdn.com/image/fetch/$s_!0R6j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0R6j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7918e5fb-52b6-4cea-a8d1-40738a743619_1900x660.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">How to get the basic authentication token in the Decodo dashboard</figcaption></figure></div></li></ul><p>Good. 
Your system is set to connect Claude Desktop with the Decodo MCP server.</p><h3>Connect The Decodo MCP Server to Claude</h3><p>To connect the Decodo MCP server to Claude, open Claude Desktop and click on <strong>Settings</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YotT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YotT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 424w, https://substackcdn.com/image/fetch/$s_!YotT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 848w, https://substackcdn.com/image/fetch/$s_!YotT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 1272w, https://substackcdn.com/image/fetch/$s_!YotT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YotT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png" width="1456" height="705" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:705,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121992,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YotT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 424w, https://substackcdn.com/image/fetch/$s_!YotT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 848w, https://substackcdn.com/image/fetch/$s_!YotT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 1272w, https://substackcdn.com/image/fetch/$s_!YotT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c674aee-c849-49cd-ae9d-b0d6df20ca62_1902x921.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Go to settings in Claude</figcaption></figure></div><p>Then, click on <strong>Developer</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zS13!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zS13!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 424w, 
https://substackcdn.com/image/fetch/$s_!zS13!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 848w, https://substackcdn.com/image/fetch/$s_!zS13!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 1272w, https://substackcdn.com/image/fetch/$s_!zS13!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zS13!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png" width="1282" height="481" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:481,&quot;width&quot;:1282,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37560,&quot;alt&quot;:&quot;Go to the Developer section in Claude&#8217;s settings by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Go to the Developer section in Claude&#8217;s settings by Federico Trotta" title="Go to the Developer section in Claude&#8217;s settings by 
Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!zS13!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 424w, https://substackcdn.com/image/fetch/$s_!zS13!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 848w, https://substackcdn.com/image/fetch/$s_!zS13!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 1272w, https://substackcdn.com/image/fetch/$s_!zS13!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7158ec2-90ea-432a-982d-7dedd950c5f1_1282x481.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Go to the Developer section in Claude&#8217;s settings</figcaption></figure></div><p>After clicking on <strong>Edit Config</strong>, the system will automatically open a folder where Claude stores all the configuration files on your local machine. Open the <em>claude_desktop_config.json</em> file and add the following to it:</p><pre><code><code>"mcpServers": {
    "Decodo MCP Server": {
      "command": "npx",
      "args": [
        "-y",
        "@decodo/mcp-server"
      ],
      "env": {
        "SCRAPER_API_TOKEN": "&lt;your-decodo-mcp-api-key&gt;"
      }
    }
  } </code></code></pre><p>Replace <em>&lt;your-decodo-mcp-api-key&gt;</em> with the basic authentication token you retrieved earlier from the Decodo dashboard. Make sure the <em>mcpServers</em> entry sits inside the file&#8217;s top-level curly braces (add them if the file is empty), so that the file remains valid JSON.</p><p>Save the file, then quit Claude so the changes take effect. Note that closing the Claude Desktop window is not enough. You have to quit the application:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vI5X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vI5X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 424w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 848w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 1272w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vI5X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png" width="311" height="196" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:196,&quot;width&quot;:311,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68945,&quot;alt&quot;:&quot;How to quit Claude by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to quit Claude by Federico Trotta" title="How to quit Claude by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!vI5X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 424w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 848w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 1272w, https://substackcdn.com/image/fetch/$s_!vI5X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0588d8eb-b410-4bc5-baf0-2e2bcbc7254b_311x196.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">How to quit 
Claude</figcaption></figure></div><p>After that, when returning to <strong>Settings</strong> &gt; <strong>Developer,</strong> you&#8217;ll see the MCP server up and running:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KBhB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KBhB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 424w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 848w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 1272w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KBhB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png" width="1251" height="455" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:455,&quot;width&quot;:1251,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54828,&quot;alt&quot;:&quot;The Decodo MCP server is running by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Decodo MCP server is running by Federico Trotta" title="The Decodo MCP server is running by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!KBhB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 424w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 848w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 1272w, https://substackcdn.com/image/fetch/$s_!KBhB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4afc18-c865-4edc-910a-ac5b2200fc1d_1251x455.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Decodo MCP server is running</figcaption></figure></div><p>To be sure everything works fine, you can test it with a prompt similar to the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IWAq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!IWAq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 424w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 848w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 1272w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IWAq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png" width="757" height="300" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:757,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63054,&quot;alt&quot;:&quot;A prompt to test if Claude can connect to the Decodo MCP server by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A prompt to test if Claude can connect to the Decodo MCP server by Federico Trotta" title="A prompt to test if Claude can connect to the Decodo MCP server by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!IWAq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 424w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 848w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 1272w, https://substackcdn.com/image/fetch/$s_!IWAq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97513d07-9370-45cc-af97-d74e4a6cf160_757x300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A prompt to test if Claude can connect to the Decodo MCP server</figcaption></figure></div><p>Alright, you successfully integrated the Decodo MCP server with Claude Desktop. 
Now it's time to test it!</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Running the Decodo MCP With Claude: Hands-On Examples</h2><p>In this section, you will learn how to use the Decodo MCP server through two examples:</p><ul><li><p>A basic example taken from the documentation.</p></li><li><p>A more advanced example where you&#8217;ll ask Claude to retrieve some data from Amazon and return the result in a JSON file.</p></li></ul><p>Let&#8217;s get started!</p><h3>Getting Started: Run a Google Search in Seconds</h3><p>As a first, simple example, you can test the <em>google_search</em> tool. 
The main idea behind this tool is to prompt the model with a query, and, under the hood, the MCP will use its Google search capabilities to return the result.</p><p>To just try things out, you can use the exact example reported in the Decodo MCP documentation to search for shoes on Google, reporting the top positions:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7PUU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7PUU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 424w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 848w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 1272w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7PUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png" width="787" height="391" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:391,&quot;width&quot;:787,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32789,&quot;alt&quot;:&quot;Allow Claude to use the MCP tools by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Allow Claude to use the MCP tools by Federico Trotta" title="Allow Claude to use the MCP tools by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!7PUU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 424w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 848w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 1272w, https://substackcdn.com/image/fetch/$s_!7PUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9118b2-ea07-4838-ac3d-b487512338b4_787x391.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Allow Claude to use the MCP tools</figcaption></figure></div><p>As you can see, Claude automatically tries to use the MCP server, loading its tool. 
Because Claude Desktop asks for confirmation before it runs an MCP tool, you will be prompted to allow the Decodo MCP&#8217;s Google search either always or just for this call.</p><p>When Claude has finished, the result looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2oSz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2oSz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 424w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 848w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 1272w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2oSz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png" width="552" height="762" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:762,&quot;width&quot;:552,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:97772,&quot;alt&quot;:&quot;Claude&#8217;s results by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude&#8217;s results by Federico Trotta" title="Claude&#8217;s results by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!2oSz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 424w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 848w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 1272w, https://substackcdn.com/image/fetch/$s_!2oSz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039511a6-f72c-4ed0-8be7-7305b4025897_552x762.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Claude&#8217;s results</figcaption></figure></div><p>If you want to verify that the model hasn&#8217;t hallucinated, you can search for &#8220;shoes&#8221; on Google:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UYZP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!UYZP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 424w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 848w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 1272w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UYZP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png" width="1456" height="481" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:481,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177541,&quot;alt&quot;:&quot;The results on Google by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The results on Google by Federico Trotta" title="The results on Google by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!UYZP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 424w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 848w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 1272w, https://substackcdn.com/image/fetch/$s_!UYZP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb8e302e-e386-4bf0-b2e9-5b9aa79af791_1812x599.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 
2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The results on Google</figcaption></figure></div><p>Alright! You made it to your first usage example.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Leveling Up: Extracting JSON-structured Data From Amazon</h3><p>Having results listed in a chat can be useful for a quick overview of web research. But the actual power of MCPs is to use the capabilities of the underlying tools for specific tasks that LLMs can solve.</p><p>An example is to ask the model to retrieve some data and report the results in a JSON file. 
This solution provides you with data that can be further used in the second part of your pipeline&#8212;for example, <a href="https://substack.thewebscraping.club/p/analyzing-scraped-data-pandas-matplotlib">for analyzing your scraped data</a>.</p><p>For this purpose, you can use the following prompt:</p><pre><code><code>Get Amazon bestsellers in electronics, extract the main info and return it in JSON format</code></code></pre><p>The image below shows Amazon&#8217;s best sellers in electronics, at the time of writing this article:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a_uB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a_uB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 424w, https://substackcdn.com/image/fetch/$s_!a_uB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 848w, https://substackcdn.com/image/fetch/$s_!a_uB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 1272w, https://substackcdn.com/image/fetch/$s_!a_uB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!a_uB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png" width="1352" height="886" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:886,&quot;width&quot;:1352,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:383040,&quot;alt&quot;:&quot;Amazon&#8217;s best sellers in electronics by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Amazon&#8217;s best sellers in electronics by Federico Trotta" title="Amazon&#8217;s best sellers in electronics by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!a_uB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 424w, https://substackcdn.com/image/fetch/$s_!a_uB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 848w, https://substackcdn.com/image/fetch/$s_!a_uB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 1272w, 
https://substackcdn.com/image/fetch/$s_!a_uB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e837ed-3971-440f-a195-369a8f03fab9_1352x886.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Amazon&#8217;s best sellers in electronics</figcaption></figure></div><p>Under the hood, Claude will trigger the <em>amazon_bestsellers</em> tool and will search the data in electronics. 
The chat result is the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XVSQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XVSQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 424w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 848w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 1272w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XVSQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png" width="590" height="797" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:797,&quot;width&quot;:590,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108993,&quot;alt&quot;:&quot;Claude&#8217;s results via the chat by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/198374051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude&#8217;s results via the chat by Federico Trotta" title="Claude&#8217;s results via the chat by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!XVSQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 424w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 848w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 1272w, https://substackcdn.com/image/fetch/$s_!XVSQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb220c4c7-56f7-4c1f-9879-ef7c28aeaf16_590x797.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Claude&#8217;s results via the chat</figcaption></figure></div><p>Below is the (partial) JSON returned by Claude:</p><pre><code><code>{
  "metadata": {
    "category": "Electronics",
    "source": "Amazon Best Sellers",
    "url": "https://www.amazon.com/Best-Sellers/zgbs/electronics/",
    "scraped_at": "2026-05-18",
    "total_items": 50,
    "currency": "USD"
  },
  "bestsellers": [
    {
      "rank": 1,
      "asin": "B08JHCVHTY",
      "title": "Blink Plus Plan with Monthly Auto-Renewal",
      "price": 11.99,
      "rating": 4.4,
      "ratings_count": 275779,
      "image_url": "https://images-na.ssl-images-amazon.com/images/I/31YHGbJsldL._AC_UL300_SR300,200_.png",
      "url": "https://www.amazon.com/Blink-Plus-Plan-monthly-auto-renewal/dp/B08JHCVHTY"
    },
    {
      "rank": 2,
      "asin": "B0DCH8VDXF",
      "title": "Apple EarPods Headphones with USB-C Plug",
      "price": 19,
      "rating": 4.6,
      "ratings_count": 13500,
      "image_url": "https://images-na.ssl-images-amazon.com/images/I/513OSdW4elL._AC_UL300_SR300,200_.jpg",
      "url": "https://www.amazon.com/Apple-EarPods-Headphones-Built-Control/dp/B0DCH8VDXF"
    },
    
    &lt;Omitted for brevity&gt;

    }
  ]
}</code></code></pre><p>As you can see, the first two items match the first two products in the screenshot of Amazon&#8217;s best sellers page above, a quick check that the model hasn&#8217;t hallucinated the data.</p><p>Well done! You learned how to use the Decodo MCP server with Claude.</p><div><hr></div><h2>Conclusion</h2><p>In this article, you learned what MCP is and why it has become the standard for connecting AI models to external tools and services. You also learned what the Decodo MCP server is, how to integrate it with Claude Desktop, and how to use it in practice.</p><p>So, let us know: what kind of scraping workflows are you planning to build with the Decodo MCP?</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #106: Is Camoufox still effective, and do the forks help?]]></title><description><![CDATA[The project moved to CloverLabs and the fork tree keeps growing. 
We read the code and ran four builds against DataDome to see what still works.]]></description><link>https://substack.thewebscraping.club/p/is-camoufox-still-effective-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/is-camoufox-still-effective-scraping</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 04 Jun 2026 16:21:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e103d057-1faa-43a1-9346-cb8fdd5383b8_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Camoufox has been our default anti-detect browser for more than a year. We said so in <a href="https://substack.thewebscraping.club/p/how-to-bypass-cloudflare-turnstile">THE LAB #73: How to Bypass Cloudflare in 2025</a>, and again when we put it on the level of a commercial product in the Kasada article. Lately, that confidence has started to decline. In hallway conversations at PragueCrawl, more than one person told us the same thing we had started to feel. 
Camoufox does not pass the harder targets the way it used to.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Part of that is the cat-and-mouse game every stealth tool plays. Part of it is specific to open source. 
When the entire fingerprint-spoofing codebase is public, the anti-bot vendors can read it line by line and build the exact counter-signal. We made that argument in the rayobrowse review. The openness that made Camoufox popular is the same openness that let the anti-bot giants study it and catch up.</p><p>Two things changed in 2026 that make this worth a fresh look. First, the project moved. The repository at <a href="https://github.com/daijro/camoufox">github.com/daijro/camoufox</a> now carries a note at the top of its README:</p><div class="callout-block" data-callout="true"><p>Browser development is active at github.com/CloverLabsAI/camoufox and github.com/VulpineOS/VulpineOS. This repo is being used to merge checkpoint releases and should be used as the source of truth.</p></div><p>Clover Labs is a Toronto venture studio building AI agents, listed among the project sponsors. The alpha features (per-context fingerprints, hardware spoofing) now ship first in their <code>cloverlabs-camoufox</code> package, and daijro&#8217;s repo became the checkpoint mirror. The project is not abandoned; its main maintainers changed.</p><p>Second, that public repo has more than 750 forks. Open source means that when one person stops, others can pick up the work, stack new features on top, and keep the chase going in parallel. So the real question is not only &#8220;is Camoufox still effective&#8221; but also &#8220;has anyone in the fork tree built something better&#8221;. That is what we set out to answer in this article.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>The forks we actually tested</h2><p>We pulled the fork list from the GitHub API and sorted it by recent pushes. Most of it is noise. Many forks share the exact <code>pushed_at</code> timestamp of the parent, which is the signature of mirror bots that never wrote a line of their own. Once you count how many commits each fork is ahead of <code>daijro:main</code> and read what those commits do, the field collapses to a handful. Many of the survivors only touch CI or rebrand the binary. Three of them touch the anti-detect surface for real.</p><p><strong>Official Camoufox</strong> (<a href="https://github.com/daijro/camoufox">github.com/daijro/camoufox</a>) is the baseline. It is a custom Firefox build with a fingerprint database and stealth patches, driven through Playwright&#8217;s Juggler protocol. We covered how it hides Playwright&#8217;s own traces in <a href="https://substack.thewebscraping.club/p/scraping-datadome-camoufox">THE LAB #65</a>, so we will not repeat that here.</p><p><strong>camoufox-reverse</strong> (<a href="https://github.com/WhiteNightShadow/camoufox-reverse">github.com/WhiteNightShadow/camoufox-reverse</a>) goes the other way. Instead of hiding harder, it adds a PropertyTracer at the SpiderMonkey engine layer that records which DOM properties a page reads. It is an instrument for watching the detector work, not a better scraper. That makes it the most useful tool in the set for understanding what we are up against.</p><p><strong>LeooNic/camoufox</strong> (<a href="https://github.com/LeooNic/camoufox">github.com/LeooNic/camoufox</a>) is the most ambitious on paper. 
Its commits add content-aware canvas noise that claims to defeat a 2025 academic pixel-recovery attack, a sigma-lognormal humanized mouse engine, and RDPBrowser, an automation path that drives Firefox over the Remote Debugging Protocol instead of Juggler.</p><p><strong>JWriter20/camoufox</strong> (<a href="https://github.com/JWriter20/camoufox">github.com/JWriter20/camoufox</a>) is the pragmatic one. It ships targeted stealth fixes, the headline being a closed WebRTC IP leak under a proxy on Firefox 146 (daijro issue #538), plus a real pytest suite, which none of the others ship.</p><p>Let&#8217;s start by using camoufox-reverse to learn more about how DataDome is deployed on leboncoin.fr.</p><h2>What DataDome reads, watched from inside the engine</h2><p>Before testing who passes, we wanted to see what the detector looks at. We have explained the three detection layers (behavioral, browser, and HTTP) before, in <a href="https://substack.thewebscraping.club/p/change-ciphers-scrapy">THE LAB #6</a>. camoufox-reverse lets us watch the browser layer from below the JavaScript, which is a view we have never had in these pages.</p><p>According to its documentation, the PropertyTracer is enabled via a config flag. We drove the macOS arm64 build directly with Playwright, set the trace config through the <code>CAMOU_CONFIG</code> environment variable, and pointed it at a DataDome-protected page. Our target throughout this article is leboncoin.fr, the French classifieds site, because it runs only DataDome. That isolates the signal we care about, with no second anti-bot muddying the result.</p><p>The full probe is in code/camoufox_fork_analysis/trace_datadome.py. 
The core of it sets the trace and lets DataDome&#8217;s script run:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;f65ef838-3049-4931-9b72-149f05b52fe0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">import json
import os
from pathlib import Path

# Assumption: LOG_DIR is defined elsewhere in the full probe; this path is illustrative.
LOG_DIR = Path("trace_logs")

config = {
    "propertyTrace": {
        "enabled": True,
        "logDir": str(LOG_DIR),
        "objects": [],            # empty = trace all covered getters
        "maxEventsPerSession": 200000,
    }
}
env = os.environ.copy()
env["CAMOU_CONFIG"] = json.dumps(config)
env["MOZ_DISABLE_CONTENT_SANDBOX"] = "1"  # required on macOS for the tracer</code></pre></div><div><hr></div><blockquote><p>When sites get tough, skip the heavy lifting. Get clean, structured CSV datasets,  ready for Excel, BI or your apps</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KpSw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KpSw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 424w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 848w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1272w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png" width="592" height="149.84467881112175" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:1043,&quot;resizeWidth&quot;:592,&quot;bytes&quot;:81723,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!KpSw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 424w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 848w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1272w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databoutique.com/buy-data-list&quot;,&quot;text&quot;:&quot;Find your 
dataset&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.databoutique.com/buy-data-list"><span>Find your dataset</span></a></p></blockquote><div><hr></div><p>The tracer writes one JSON line per getter access, each shaped like <code>{"o": "navigator", "p": "hardwareConcurrency", ...}</code>. Loading the leboncoin homepage produced 140 engine-level reads across 30 distinct properties. Aggregated by object and property, the access pattern shows which surfaces DataDome reads to build its fingerprint:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;3700752d-7e2b-415d-8254-13f7525b67ee&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"> COUNT  PROPERTY
    14  window.outerWidth
    13  window.devicePixelRatio
    13  window.outerHeight
    13  navigator.plugins.indexedGetter
     9  navigator.hardwareConcurrency
     7  canvas.toDataURL
     6  window.innerWidth
     6  screen.rect
     4  navigator.platform
     4  navigator.userAgent
     4  webgl.getParameter
     4  canvas2d.getImageData
     3  navigator.maxTouchPoints
     2  offscreenCanvas.getContext</code></pre></div><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><p>This is happening entirely below the JavaScript layer. From the page&#8217;s point of view, nothing was instrumented, because the recording lives in the C++ getter, not in a JavaScript proxy. DataDome reads the screen geometry, the navigator core, the plugin and mime enumeration, and then it reaches for the canvas and WebGL. Both <code>canvas.toDataURL</code> and <code>canvas2d.getImageData</code> are in the list, alongside <code>webgl.getParameter</code> and <code>offscreenCanvas.getContext</code>.</p><p>That last detail is what connects this experiment to the rest of the article. The canvas readback is exactly the surface LeooNic&#8217;s content-aware noise patch sets out to protect, and the WebRTC and screen reads are where the other forks claim improvements. We now know the detector touches it all.</p><p>The homepage is the light version. The pages that hold the data are watched far more closely, and the tracer shows it. We pointed the same probe at a car listing (a leboncoin <code>/ad/voitures/</code> URL). Those pages block direct connections, so this run went through a residential proxy, which is the setup we explain in the next section. The listing loaded its real content (the page title came back as &#8220;Alfa romeo Tonale 1.5 Ibrida 175ch Veloce TCT&#8221;), so we were tracing a passing ad page, not a challenge screen. 
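</p><p>The per-property counts in this section can be reproduced from the tracer log with a few lines of Python. What follows is a minimal sketch rather than the code of our probe: it assumes only the two keys shown earlier (<code>o</code> and <code>p</code>), the <code>aggregate_trace</code> helper is a hypothetical name, and the sample lines are made up for illustration.</p><pre><code>import json
from collections import Counter


def aggregate_trace(lines):
    """Count getter accesses per "object.property" pair."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        counts[f'{event["o"]}.{event["p"]}'] += 1
    return counts


# Made-up sample lines in the shape described above (assumption, not real log data).
sample = [
    '{"o": "navigator", "p": "hardwareConcurrency"}',
    '{"o": "window", "p": "outerWidth"}',
    '{"o": "window", "p": "outerWidth"}',
]

counts = aggregate_trace(sample)
print(f"{sum(counts.values())} reads across {len(counts)} properties")
for name, n in counts.most_common():
    print(f"{n:6d}  {name}")
</code></pre><p>Against a real log, the same function would take the open file object instead of the in-memory sample.</p><p>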
The read pattern is a different animal: 584 engine-level reads across 35 properties, against 140 across 30 on the homepage.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;7042a22e-a626-4f27-837f-b2d2b2ac3cda&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"> COUNT  PROPERTY (ad page)
   220  document.cookie.get
    47  window.innerWidth
    30  window.innerHeight
    26  navigator.plugins.indexedGetter
    26  screen.rect
    25  sessionStorage.setItem
    22  sessionStorage.getItem
    16  document.cookie.set
    12  performance.timing
     8  window.scrollY
     7  canvas.toDataURL
     6  webgl.getParameter
     4  canvas2d.getImageData
     3  navigator.globalPrivacyControl
      1  mediaDevices.enumerateDevices</code></pre></div><p>The cookie reads jump from one on the homepage to 220 on the ad page. Session storage, which the homepage barely touched, is read and written dozens of times. New surfaces appear that the homepage never queried: <code>window.scrollY</code> for behavior, <code>navigator.globalPrivacyControl</code>, and <code>mediaDevices.enumerateDevices</code>. The canvas and WebGL reads are still there. This is the same DataDome, running a heavier script on the page that matters. That is the concrete reason the homepage lets a clean browser through while the listings do not. It also tells you where to spend your effort. The protection you have to beat lives on the content pages, not the landing page.</p><h2>Setting up a fair comparison</h2><p>The shared virtual environment we used already had <code>camoufox</code> 0.4.11, which fetches the Firefox 135 official build. We ran on an Apple M2 Max, so we pulled the macOS arm64 binaries for each build, signed them ad hoc (the cross-compiled bundles need it), and pointed the same launcher at each one with <code>executable_path</code>.</p><p>Two version details matter for fairness. JWriter20&#8217;s WebRTC fix targets a regression introduced in Firefox 146, so we did not compare it against the 135 cache. We pulled the official <code>v146-hardware</code> build (Firefox 146.0.1) as the baseline and JWriter20&#8217;s own 146.0.1 build as the patched version. Same Firefox, two builds. camoufox-reverse only ships at 135, which is fine because we used it only as a tracer, not as a contender.</p><p>Every test drives the binaries the way a real user would, through the camoufox launcher with <code>proxy</code> and <code>geoip</code> set, so the fingerprint database, the locale coherence, and the stealth patches are all active. 
The one exception is the WebRTC probe, explained below, where the page we run it on matters.</p><h2>The WebRTC leak that JWriter20 actually fixes</h2><p>JWriter20&#8217;s headline fix closes a WebRTC IP leak under a proxy. We checked it on the official 146 build against the JWriter20 146 build, same launcher, same Bright Data proxy, <code>geoip=True</code>. The probe gathers ICE candidates from a STUN server and reports any IP that escapes (webrtc_leak_test.py).</p><p>A quick detour on what those candidates are, because the whole leak lives in them. WebRTC connects two peers directly, and to do that each side has to advertise every network address it could be reached on. Each address it offers is an ICE candidate. A candidate is an IP, a port, a protocol, and a type, and it reaches JavaScript as a string like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;eb21871e-0761-45d0-95b7-2236487833b7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">candidate:842163049 1 udp 1677729535 203.0.113.25 54321 typ srflx raddr 192.168.1.45 rport 54321</code></pre></div><p>Two types matter here. A <code>host</code> candidate is an address of a local network interface, so it carries your LAN IP. A <code>srflx</code> (server-reflexive) candidate is the public address a STUN server reports back when the browser asks which IP it appears to come from, so it carries your real WAN IP. A page gathers all of this with no permission prompt. It opens an <code>RTCPeerConnection</code> pointed at a STUN server, calls <code>setLocalDescription</code>, and reads each candidate as it arrives. The key is that STUN runs over UDP, and an HTTP proxy only tunnels TCP. 
The STUN request leaves from the real interface, the proxy never sees it, and the <code>srflx</code> candidate comes back with the real WAN IP even though every HTTP request went through the proxy.</p><p>The first version of our probe ran the <code>RTCPeerConnection</code> on <code>about:blank</code> and showed both builds leaking the real IP. That was our mistake, not a result. Camoufox&#8217;s content-level injection is not active on <code>about:blank</code>, so we were measuring an unprotected page. Moving the probe onto a real HTTPS origin changed the result:</p><pre><code><code>official-146   HTTP exit IP (proxy): 189.173.138.17
               ICE candidates: 1  [srflx] ips=['203.0.113.25']   &lt;- real WAN IP leaks

jwriter20-146  HTTP exit IP (proxy): 93.44.185.102
               ICE candidates: 0                                &lt;- nothing leaks</code></code></pre><p>Our real WAN IP is 203.0.113.25. The official build, behind a working proxy, still hands it to any page through the WebRTC reflexive candidate. The proxy exit IP rotates on each run, so the constant 203.0.113.25 in the candidate is unmistakably the real address, not the proxy.</p><p>The fix is real, and it is baked into the binary.</p><p>We unzipped both <code>camoufox.cfg</code> files to confirm. The official build sets only <code>media.peerconnection.ice.no_host</code>. JWriter20 adds <code>default_address_only</code>, <code>proxy_only_if_behind_proxy</code>, <code>proxy_only_if_pbmode</code>, and <code>obfuscate_host_addresses</code>. Behind a proxy that cannot carry UDP, those preferences make WebRTC gather no candidates at all, so there is nothing to leak. The result reproduced across two runs.</p><p>If WebRTC leaks are your problem, JWriter20&#8217;s build solves them. Hold that thought, because it does not end where you would expect.</p><h2>The canvas patch that we could read but not run</h2><p>To see why this patch exists, you have to understand the small arms race it sits within. The PropertyTracer run above caught DataDome calling <code>toDataURL</code> and <code>getImageData</code>. Those two calls are how a canvas fingerprint is taken. A script draws the same text and shapes into an off-screen canvas on every machine, reads the pixels back, and hashes them. The drawing commands are identical everywhere. The pixels are not, because the final image depends on your GPU, your graphics driver, and how your system rasterizes fonts. That hash is stable for your device and different from the next one, which is most of what a tracker wants.</p><p>The standard way to hide is to add noise. Camoufox, Brave, Firefox&#8217;s resist-fingerprinting mode, and a long tail of extensions all nudge a few pixels so the hash will not stay constant across sites. 
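</p><p>As a toy illustration of that idea (not Camoufox&#8217;s actual algorithm), the Python sketch below treats a canvas as a flat list of channel values, applies a nudge of at most one that is derived only from a session seed and the pixel position, and hashes the result. The function names, the stand-in pixel data, and the choice to perturb about half of the values are all assumptions made for the example.</p><pre><code>import hashlib


def position_noise(seed, index):
    """Return -1, 0 or +1 from the seed and pixel index only."""
    digest = hashlib.sha256(f"{seed}:{index}".encode()).digest()
    if digest[0] % 2 == 0:
        return 0  # leave about half of the values untouched
    return 1 if digest[1] % 2 == 0 else -1


def add_noise(pixels, seed):
    """Nudge channel values, clamped to the 0..255 range."""
    return [min(255, max(0, value + position_noise(seed, i)))
            for i, value in enumerate(pixels)]


def fingerprint(pixels):
    """Hash the pixel buffer, as a canvas fingerprint script would."""
    return hashlib.sha256(bytes(pixels)).hexdigest()


canvas = [(i * 7) % 200 + 20 for i in range(256)]  # stand-in for rendered pixels
session_a = add_noise(canvas, seed=1)
session_b = add_noise(canvas, seed=2)

print(fingerprint(canvas) == fingerprint(session_a))     # is the hash unchanged by noise?
print(fingerprint(session_a) == fingerprint(session_b))  # is it the same for another seed?

# Offsets measured on a solid-color canvas versus offsets applied to the real one.
flat = [128] * 256
flat_offsets = [noised - 128 for noised in add_noise(flat, seed=1)]
real_offsets = [a - c for a, c in zip(session_a, canvas)]
print(flat_offsets == real_offsets)
</code></pre><p>The last comparison is the one to watch: because <code>position_noise</code> never looks at the pixel value, the offsets measured on a solid-color canvas match the offsets applied to the real one within the same session.</p><p>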
The weakness is in how that noise is generated. If it is a fixed per-session perturbation that depends only on a seed and the pixel position, it can be undone. A 2025 paper at The Web Conference, <a href="https://dl.acm.org/doi/abs/10.1145/3696410.3714713">Breaking the Shield</a> by Hoang Dai Nguyen and Phani Vadrevu, showed exactly that against eighteen extensions and five browsers. Their Pixel-Recovery attack paints a second canvas filled with a known solid color and reads it back. Because it knows what every pixel should have been, it solves for the perturbation and subtracts it from the real fingerprint canvas. Reload ten times, and the recovered fingerprint stays constant while the noised one keeps changing. That is the proof the noise was reversible all along.</p><p>Two changes defeat the attack, and the same paper points at both. Leave the flat regions alone, so a detector that paints a solid block and reads it back finds no tampering to measure. And make each perturbation depend on the pixel content rather than its position, so there is no single value to solve for and subtract. The second idea is what Brave&#8217;s Farbling does, deriving its noise from the canvas content so two different canvases are altered differently, and it is the one defense the Pixel-Recovery attack could not reverse.</p><p>LeooNic&#8217;s patch implements both moves, and it is the most interesting code in the whole fork tree. The rewritten <code>ApplyCanvasNoise</code> skips flat regions and only perturbs edges, and the comments name the attack directly:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;279f1acd-8f68-4c56-ab69-a1a65dd58a57&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">// Content-aware + content-dependent canvas noise.
//   - Tier 1 known-pixel checks (DataDome, Castle): flat regions are skipped
//     because flat_score &lt; FLAT_THRESHOLD. fillRect(R,G,B) is undisturbed.
//   - WWW'25 Pixel-Recovery Attack (Nguyen &amp; Vadrevu): noise depends on the
//     pixel content AND its 4 neighbors, not just (seed, index).</code></pre></div><p><code>FLAT_THRESHOLD</code> is the cutoff that decides what counts as a flat region. The edge pixels that survive it get a content-dependent nudge of plus or minus one, small enough to stay invisible but enough to move the hash. The logic is sound on paper. We wanted to confirm it at runtime.</p><p>We built a probe that draws two solid blocks with one sharp boundary and counts perturbed pixels in the flat interior versus the edge (canvas_fingerprint_test.py). </p><p>First, we learned that canvas noise is off by default in every current build, which lines up with the CloverLabs &#8220;Disable Canvas Noise&#8221; commit. The noise only runs when <code>canvas:seed</code> is non-zero. With the seed forced on the official 146 build, the original algorithm shows its tell:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;8f66b3b1-7dde-41de-9999-4a526d196d19&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">official-146 (original noise)
  interior flat pixels perturbed : 9105 / 18240   (~50%)
  boundary edge pixels perturbed : 98 / 192
  max edge delta (per channel)   : 1
  hash varies across sessions    : True</code></pre></div><p>The stock algorithm perturbs roughly half of all pixels, flat fills included. That is precisely the behavior a known-pixel check catches, and precisely what LeooNic set out to fix. So the official baseline is the &#8220;before&#8221; picture, captured at runtime.</p><p>The &#8220;after&#8221; picture is what we were not able to collect. LeooNic ships only a Windows binary, so we ran it on a Windows cloud box. It would not launch under Playwright at all. Every attempt, headless or headful, with the stock launcher or LeooNic&#8217;s own 0.5.0 launcher installed from source, ended the same way:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;d224820d-d322-4344-a4a1-bb64fce1952a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">console.error: "Warning: unrecognized command line flag" "-juggler-pipe"
Remote Settings startup changesets bundle could not be extracted (JSON.parse...)
JavaScript error: AsyncShutdown.sys.mjs, line 587: uncaught exception: undefined
&lt;process did exit: exitCode=0&gt;</code></pre></div><p>The official 135 build launches and drives fine on the same box, so the machine and Playwright are healthy. The Firefox 149 build that LeooNic publishes aborts at startup before Juggler attaches. </p><p>This is not specific to our environment. LeooNic&#8217;s own issue #1 is titled &#8220;fix: port patches and build system to Firefox 149.0&#8221;, an open work in progress, and daijro carries issues #620 and #572 about Juggler failing to initialize in constrained environments. The build does run through LeooNic&#8217;s native RDP path, which is the whole point of their RDPBrowser, but even there we could not activate the canvas seed. The global config is ignored, and the per-context <code>setCanvasSeed</code> function, which the build exposes only at document start, was never present on the page when driven over RDP.</p><p>So we report LeooNic honestly. The content-aware algorithm is real and well-reasoned in source, and the original algorithm&#8217;s weakness is confirmed at runtime on the official build. The published Firefox 149 binary is not something you can pick up and drive with the standard stack today. For a reader choosing a fork, that is the practical signal. The innovation lives in the code, not yet in a usable artifact you can run.</p><h2>The block-rate test, and the result we did not expect</h2><p>Given that, we could only test the JWriter20 fork against the official build. As always, the code used for testing can be found&nbsp;<a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder&nbsp;</a><strong><a href="https://github.com/TheWebScrapingClub/thelab">106.</a></strong><a href="https://github.com/TheWebScrapingClub/thelab">CAMOUFOX</a><strong>.</strong></p>
      <p>
          <a href="https://substack.thewebscraping.club/p/is-camoufox-still-effective-scraping">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[5 mistakes that are driving up your scraping costs - Insights from DataImpulse]]></title><description><![CDATA[An in-depth look at the real factors driving up web scraping costs and how smarter proxy usage and system design can reduce expenses by up to 60%.]]></description><link>https://substack.thewebscraping.club/p/5-mistakes-that-are-driving-up-your</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/5-mistakes-that-are-driving-up-your</guid><dc:creator><![CDATA[Olia Liudko]]></dc:creator><pubDate>Tue, 02 Jun 2026 20:02:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Jfy8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is a guest post written by the DataImpulse team, tackling the problem behind the costs of scraping. For an independent benchmark of proxy prices, <a href="https://proxyprice.thewebscraping.club/">visit our Proxy Pricing Benchmark tool</a>.</em></p><div><hr></div><p>Every product has a price, usually set by a simple and predictable formula, and proxy services follow a similar logic. At first glance, multiplying the proxy price by the amount of bandwidth should give the final cost, but in practice many other factors come into play. Bandwidth is not a clean, one-to-one reflection of useful work. Users pay not only for the data they successfully collect but also for everything that happens around it: request failures, blocks, timeouts, and suboptimal routing decisions.</p><p style="text-align: justify;">Much of the traffic in scraping systems is consumed without usable output, and that&#8217;s where the gap between expected and actual costs starts to widen. 
To understand where the budget really goes, we need to research it deeper.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jfy8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jfy8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jfy8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1773700,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/200176130?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jfy8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Jfy8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b86e5b-12e3-4a07-8713-cc7d13a29418_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2><strong>What is the math behind the scraping costs</strong></h2><p style="text-align: justify;">Bandwidth is measurable, and proxy providers typically price it transparently. However, it&#8217;s not the key driver of the cost; it&#8217;s the result of a much more complex process. Each unit of bandwidth stands for a series of events, rather than merely one successful request. While some requests yield immediate data, many others can fail, be blocked, time out, or require further attempts. Consequently, the actual cost structure is influenced more by system behavior than by the volume of traffic.</p><p style="text-align: justify;">The real cost equals the total number of request cycles necessary to extract data multiplied by the cost of executing each request cycle. 
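</p><p style="text-align: justify;">As a rough illustration of that relationship, the sketch below estimates the bandwidth cost of one usable response. It assumes every attempt succeeds independently with the same probability, so the expected number of attempts per success is one divided by the success rate. The function name, the 1 MB request size, and the prices are hypothetical values chosen for the example, not measured figures.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def cost_per_success(price_per_gb, mb_per_attempt, success_rate):
    """Estimate what one usable response costs in proxy bandwidth.

    Assumption: attempts succeed independently with probability
    success_rate, so one success takes 1 / success_rate attempts
    on average.
    """
    if success_rate == 0:
        raise ValueError("success_rate must be greater than zero")
    cost_per_attempt = price_per_gb * mb_per_attempt / 1000  # 1 GB = 1,000 MB
    expected_attempts = 1 / success_rate
    return cost_per_attempt * expected_attempts


# Hypothetical inputs: 1.00 per GB and 1 MB transferred per attempt.
healthy = cost_per_success(1.00, 1.0, 0.95)
struggling = cost_per_success(1.00, 1.0, 0.50)
print(f"95% success: {healthy:.5f} per usable response")
print(f"50% success: {struggling:.5f} per usable response")</code></pre></div><p style="text-align: justify;">Under these assumptions, a pipeline at 50% success pays 1.9 times more per usable response than one at 95%, even though the price per GB never changed.</p><p style="text-align: justify;">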
In simple terms, it is defined by how many complete attempts the system must make before it gets one usable response. The key detail is that only successful requests generate value. Each request cycle may include the initial request, retries after timeouts, proxy rotation, session management, and more. All of them consume resources. Because of this, the real price increases not only when proxy prices rise but also when the system becomes less efficient. The same dataset can cost considerably more to extract when the scraping pipeline is inefficient.</p><p style="text-align: justify;">The cost amplification effect, the cumulative increase in resource usage caused by repeated request cycles, is a common issue in web scraping systems. One failed request can trigger multiple follow-up attempts. These extra cycles accumulate and raise the cost of each successful data point. For this reason, it&#8217;s more accurate to evaluate scraping by cost per successful request. The fewer attempts needed, the lower the real cost.</p><h2><strong>Mistake #1 - Using the wrong proxy type</strong></h2><p style="text-align: justify;">This mistake is not just widespread but can also cost you a lot. Not all targets need the same level of stability and anonymity, but numerous systems use a universal strategy. This results in either high costs or poor performance.</p><p style="text-align: justify;">Each proxy type has its own balance of speed, cost, and detection resistance. The system will underperform if that balance doesn&#8217;t align with the target website&#8217;s behavior. For example, mobile IPs are more expensive by default, but they are also highly trusted and harder to block, so it&#8217;s logical to use them for challenging targets. When they are used on low-protection websites, they increase costs without improving the results. 
The approach that works is matching the proxy type to the task.</p><ul><li><p style="text-align: justify;">Residential proxies are widely used in web scraping because they route traffic via IPs assigned by ISPs to real devices. Because the traffic looks like it comes from a real user, these proxies carry strong trust signals. Many proxy users notice better success rates when they switch to residential IPs.</p></li><li><p style="text-align: justify;">Mobile proxies direct traffic via carrier networks using IP addresses from mobile service providers. Since these IPs are shared among many real subscribers, the traffic appears very authentic and is significantly more difficult for systems that rely on IP reputation to flag.</p></li><li><p style="text-align: justify;">Datacenter proxies run on cloud or server-based infrastructure and use IP ranges that aren&#8217;t tied to residential ISPs. Their biggest advantage is speed. They are well suited to heavy automation and data collection tasks.</p></li></ul><h2><strong>Mistake #2 - The retry loop problem</strong></h2><p style="text-align: justify;">The goal of retry logic is to improve success rates by giving failed requests another attempt to return a valid response. This approach works when responses are consistent and failures are occasional, not systematic. On targets with rate limits or unstable responses, repeated retries under the same conditions tend to fail the same way. Not all failures are the same. If you get a timeout, it&#8217;s worth retrying, but if it&#8217;s a 403 error or another kind of block, a different action is needed. For example, you can rotate proxies or fix headers. </p><p style="text-align: justify;">Retries can turn into a loop where the system keeps sending more requests but isn&#8217;t getting better results. Instead of retrying everything the same way, treat different errors differently, and adjust your behavior based on the response. 
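</p><p style="text-align: justify;">One way to picture this is a small decision function that maps each outcome to a different next step. This is only a sketch: the grouping of status codes, the action labels, and the default of three attempts are illustrative assumptions, and a real pipeline would tune them per target.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">TRANSIENT = {408, 500, 502, 503, 504}  # timeouts and server hiccups
THROTTLED = {429}                      # rate limited: slow down first
BLOCKED = {403}                        # likely a block: change identity


def next_action(status, attempt, max_attempts=3):
    """Pick the next step after a response instead of blindly retrying.

    status is an HTTP status code, or None for a network timeout.
    attempt is the 1-based number of the attempt that just finished.
    The returned labels are hypothetical names for this sketch.
    """
    if status is not None and status // 100 == 2:
        return "done"
    if attempt not in range(1, max_attempts):
        return "give_up"  # attempt budget is spent
    if status is None or status in TRANSIENT:
        return "retry_same_proxy"
    if status in THROTTLED:
        return "back_off_then_retry"
    if status in BLOCKED:
        return "rotate_proxy_and_headers"
    return "give_up"  # e.g. 404: repeating the request will not help


for status in (200, None, 429, 403, 404):
    print(status, next_action(status, attempt=1))</code></pre></div><p style="text-align: justify;">The point of the design is that a block never leads to an identical retry: the only paths that repeat a request unchanged are the transient ones, and even those stop once the attempt budget is spent.</p><p style="text-align: justify;">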
For example, you can rotate proxies after a certain status code or stop retrying when a request is blocked.</p><h2><strong>Mistake #3 - Misconfigured proxy rotation</strong></h2><p style="text-align: justify;">Rotating proxies aggressively is not a solution. Changing IPs too often makes traffic look unnatural and can raise suspicion. On the flip side, not rotating enough concentrates too many requests on one IP and invites rate limits or bans. Thus, there must be a balance. Some websites tolerate frequent IP changes, while others expect a more stable session. Treating all targets the same way is not appropriate. It&#8217;s better to adjust rotation based on the context. </p><p style="text-align: justify;">If the server expects consistent user behavior over time, sticky sessions help maintain session consistency and keep the flow intact. For bulk data extraction, where there is no session context to preserve, you can rotate proxies more frequently. You can also refine your rotation with signal-based triggers instead of fixed rules: rotate when the system detects specific conditions, such as a sudden drop in success rates or a spike in block status codes. </p><h2><strong>Mistake #4 - Ignoring caching and duplicate requests</strong></h2><p style="text-align: justify;">A notable portion of scraping traffic is often dedicated to retrieving data that has already been collected. This occurs when pipelines lack deduplication or clear definitions of data freshness, which leads to repeated requests for identical resources. These requests consume bandwidth and proxy capacity without providing new information.</p><p style="text-align: justify;">To address this, implement a caching layer and deduplication logic. Responses can be cached based on a time-to-live (TTL) interval that aligns with the frequency of data updates. Request fingerprints can be used to identify duplicates before requests are sent. 
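</p><p style="text-align: justify;">A minimal sketch of both ideas follows, using only the Python standard library. All names here are hypothetical, the <code>fetch</code> callable stands in for whatever HTTP client the pipeline uses, and the cache is a deliberate simplification of a per-entry TTL: responses are reused only within the current TTL-sized time window.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import hashlib
import time


def request_fingerprint(method, url, body=b""):
    """Hash the parts of a request that determine its response."""
    raw = b"|".join([method.upper().encode(), url.encode(), body])
    return hashlib.sha256(raw).hexdigest()


class WindowedCache:
    """Keep responses for the current TTL-sized time window only.

    Simplification of a per-entry TTL: an entry is reused until the
    clock crosses into the next window of ttl_seconds.
    """

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.window = None
        self.store = {}

    def _roll(self):
        current = int(self.clock() // self.ttl)
        if current != self.window:  # new window: everything is stale
            self.window = current
            self.store = {}

    def get(self, key):
        self._roll()
        return self.store.get(key)

    def put(self, key, response):
        self._roll()
        self.store[key] = response


def fetch_once(cache, method, url, fetch):
    """Call fetch(url) only when no fresh copy is cached."""
    key = request_fingerprint(method, url)
    cached = cache.get(key)
    if cached is not None:
        return cached  # duplicate request: no proxy traffic spent
    response = fetch(url)
    cache.put(key, response)
    return response</code></pre></div><p style="text-align: justify;">Calling <code>fetch_once</code> twice with the same method and URL inside one window triggers a single real request. Choosing <code>ttl_seconds</code> to match how often the target data changes is what keeps the cache from serving stale results.</p><p style="text-align: justify;">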
For structured data, storing IDs or hashes of processed items allows the system to skip previously captured content. </p><h2><strong>Mistake #5 - No cost-aware proxy routing</strong></h2><p style="text-align: justify;">Many scraping systems process all requests through a single proxy type. This approach simplifies implementation, but it often leads to inefficiency. Different endpoints often have distinct requirements, and a universal strategy may result in unnecessary costs.</p><p style="text-align: justify;">For instance, using proxies with a high trust score for simple endpoints can be expensive, whereas lower-cost proxies used for protected pages may result in blocks and extra retries. Without routing logic to adapt to these variables, systems often overpay or underperform.</p><p style="text-align: justify;">An alternative is to implement cost-aware routing, which matches the proxy type to the difficulty of the task. This involves using more economical options for low-risk requests and escalating to higher-trust proxies only when necessary. By monitoring metrics such as status codes, latency, and success rates, the system can determine when to switch proxy pools. For example, a blocked request can be retried using a higher-trust proxy rather than repeating the request under the same conditions.</p><p style="text-align: justify;">This approach creates a more structured pipeline that balances cost and performance by allocating resources based on the specific requirements of each request.</p><h2><strong>Understanding the real price of proxies</strong></h2><p style="text-align: justify;">While &#8220;price per GB&#8221; is often cited as a standard industry metric, experienced engineers understand that it fails to capture the true economic reality of data scraping. In practice, failed requests consume bandwidth and incur costs despite yielding no usable data. 
These unsuccessful attempts represent a negative return on investment.</p><p style="text-align: justify;">Furthermore, automated retries add another hidden expense. We have to look beyond the basic per-GB rate and adopt the &#8220;Cost Per Successful Request&#8221; (CPSR). This metric provides a more accurate reflection of true operational expenses.</p><p style="text-align: justify;">To calculate the cost of each valid data retrieval, use the following formula:</p><p style="text-align: justify;"><strong>CPSR = (price per GB / 1,000) &#215; (1 / success rate)</strong></p><p style="text-align: justify;">In this equation, the &#8220;success rate&#8221; is the fraction of requests (for example, 0.8 for 80%) that return an HTTP 200 OK status along with the intended data. Dividing the price per GB by 1,000 converts it to a price per MB, so the formula as written treats each request as roughly 1 MB of traffic; scale it by your actual average request size. Organizations can make better financial decisions if they start evaluating proxy services through the lens of CPSR.</p><p style="text-align: justify;"><a href="https://dataimpulse.com/residential-proxies/">DataImpulse is a reliable provider</a> of residential, mobile, and datacenter proxies with non-expiring traffic and a pay-as-you-go model, meaning purchased traffic remains available until it is used. This vendor offers more than 90 million IPs in 195 countries. Teams usually choose DataImpulse for web scraping, ad verification, market research, SERP monitoring, and website testing. </p><h3 style="text-align: justify;"><strong>Why is DataImpulse cheaper than other vendors? </strong></h3><p style="text-align: justify;">The pricing structure is based on the proxy sourcing method. Many providers purchase traffic rights from ISPs and resell them, which adds a markup. DataImpulse sources IP addresses directly through its own application and SDKs, bypassing intermediaries to avoid extra costs. 
This operational model complies with all legal standards.</p><h2><strong>How to reduce your scraping costs by 30-60%</strong></h2><p style="text-align: justify;">Cost efficiency in data collection is primarily achieved by minimizing inefficient requests and increasing the success rate of each attempt. </p><ol><li><p style="text-align: justify;">To optimize expenses, match proxy types to the specific requirements of the task. Using cost-effective proxies for straightforward targets while reserving higher-trust proxies for more challenging endpoints can reduce unnecessary spending.</p></li><li><p style="text-align: justify;">Refining retry logic is also important. Failures should be addressed based on their specific status codes. Avoiding repeat requests under identical conditions prevents the waste of resources.</p></li><li><p style="text-align: justify;">Proxy rotation should be managed strategically rather than randomly. Implementing sticky sessions and rotating based on indicators such as blocks or elevated failure rates can improve both stability and overall success rates.</p></li><li><p style="text-align: justify;">Incorporating caching and deduplication techniques helps manage traffic. By avoiding redundant requests for data that has not changed, it is possible to decrease total request volume. </p></li><li><p style="text-align: justify;">Implement a cost-aware proxy routing strategy. Prioritize lower-cost alternatives, escalating to premium options only when strictly necessary. This approach facilitates a more efficient resource allocation model, ensuring that infrastructure investments are directed toward the areas of greatest impact.</p></li><li><p style="text-align: justify;">Lastly, pay attention to how your scraper interacts with websites. When a browser loads a page, it also pulls images, scripts, videos, and even fonts. Thus, lots of traffic is generated. Use HTTP requests for structured data and browser-based scraping when JS rendering is necessary. 
</p></li></ol><p style="text-align: justify;">These optimizations don&#8217;t require a comprehensive system overhaul. Incremental improvements in request efficiency can yield significant cost reductions.</p><h2><strong>Start measuring your current CPSR baseline</strong></h2><p style="text-align: justify;">At first sight, scraping costs look like a simple product of proxy price and bandwidth. But the real drivers of cost lie deeper. As we&#8217;ve seen, unnecessary retries and poor rotation strategies contribute to a growing gap between expected and actual costs. A system with low success rates will always consume more resources. </p><p style="text-align: justify;">The important shift is moving away from thinking in terms of raw pricing and toward thinking in terms of efficiency. It doesn&#8217;t always require major steps; simple adjustments such as better proxy selection, cost-aware routing, caching, and improved retry logic go a long way. Based on real proxy usage data from DataImpulse, we&#8217;ve seen that even small optimizations can noticeably reduce total costs. Every request should add value, so spending must be thoughtful and deliberate. Audit your scraping pipeline against these 5 mistakes today. </p>]]></content:encoded></item><item><title><![CDATA[Why and How to Build a Web Scraper with Rust in 2026]]></title><description><![CDATA[Is Rust the future of web scraping? 
Let&#8217;s find out!]]></description><link>https://substack.thewebscraping.club/p/how-to-build-a-web-scraper-rust</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-build-a-web-scraper-rust</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 31 May 2026 15:27:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/159b826b-b10b-4781-9efd-ddc651b7f874_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What do popular developer technologies like ZeroClaw, IronClaw, Codex CLI, and many others have in common, besides thousands of GitHub stars, tons of downloads, and growing communities? They are all developed in Rust!</p><p>Rust is becoming increasingly popular thanks to its advantages in performance, stability, and security. But what about using it for web scraping?</p><p>In this post, I&#8217;ll show you what Rust brings to the table for web scraping, why it makes sense (and when it doesn&#8217;t), and how to build a web scraper in Rust.</p><h2>Main Characteristics of Rust: Quick Overview</h2><p>Rust stands out because it combines performance, safety, and control in a way few programming languages do. According to the <a href="https://survey.stackoverflow.co/2025/technology">2025 Stack Overflow Developer Survey</a>, 14.8% of respondents reported using Rust that year, making it the 14th most popular option.</p><p>Personally, what I find most compelling about Rust is its <a href="https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html">memory safety model</a>. Thanks to <a href="https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html">ownership and borrowing</a>, it rules out entire classes of bugs, such as use-after-free errors and data races, at compile time. 
All of that, without needing a garbage collector!</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your scraping needs.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p><div><hr></div></blockquote><p>Here&#8217;s what Rust looks like in its simplest form:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">fn main() {
    println!("Hello, world!");
}</code></pre></div><p>Even in this minimal example, you can see Rust&#8217;s explicit structure and compile-time guarantees.</p><p><strong>Remember</strong>: In Rust, <em>println!</em> is not a function. It&#8217;s a macro. The <em>!</em> tells the compiler: &#8220;this is a macro invocation, not a normal function call.&#8221;</p><p>Performance is another big win. Rust is compiled and extremely fast, making it ideal for high-throughput parsing on heavy HTML pages or large volumes of pages (e.g., in an <a href="https://substack.thewebscraping.club/p/offline-web-scraping">offline web scraping scenario</a>). Concurrency is also first-class, helping you manage thousands of requests in parallel without the usual headaches.</p><p>On the flip side, Rust has a steeper learning curve. If you&#8217;re coming from Python or JavaScript, getting used to the syntax and strict compiler won&#8217;t be trivial. In my experience, the first steps can feel a bit unforgiving&#8230;</p><blockquote><div><hr></div></blockquote><h2>Why AI Has Made Rust a Solid Choice for Web Scraping</h2><p>AI is changing the nature of software development, including web development. And, as you may already have noticed, not always in a &#8220;lighter&#8221; direction. Humans struggle to deal with long scripts and source code files, but machines don&#8217;t!</p><p>Thus, AI tends to produce very long and complex HTML with a lot of elements embedded in the same page. On top of that, AI makes it trivial to generate large amounts of content, which further increases HTML size. In addition, <a href="https://developer.mozilla.org/en-US/curriculum/core/semantic-html/">semantic HTML</a> is more verbose than traditional HTML.</p><p>As a result, modern web pages are getting bigger and more complex. From a scraping perspective, this translates into slower and more resource-intensive parsing. 
What used to be lightweight DOM trees are now dense, deeply nested structures that require more CPU and memory to process.</p><p>This is exactly where Rust starts to make sense&#8230;</p><p>Sure, it may not be the easiest programming language, but Rust&#8217;s performance makes it compelling (and in some cases even necessary). Its low-level control and zero-cost abstractions allow Rust HTML parsers to process large documents in fractions of a second, even under high concurrency.</p><p><a href="https://medium.com/@jgfriedman99/html-parsing-benchmarks-2170417e8c06">Independent benchmarks</a> show Rust HTML parsers ranking among the fastest available. In particular, libraries like <em><a href="https://github.com/y21/tl">tl</a></em> stand out for their exceptional speed and low overhead.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><h2>Best Rust Web Scraping Libraries</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NgZ2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!NgZ2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 424w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 848w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NgZ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The best Rust web scraping libraries&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The best Rust web scraping libraries" title="The best Rust web scraping libraries" 
srcset="https://substackcdn.com/image/fetch/$s_!NgZ2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 424w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 848w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!NgZ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91b93f4f-ed7e-405e-a5e2-7b1d06181faf_1920x1234.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The best Rust web scraping libraries</figcaption></figure></div><h2>How to Build a Scraper in Rust: A Step-by-Step Guide</h2><p>In this section, I&#8217;ll guide you through the process of building a web scraper in Rust. The target web page will be <a href="https://books.toscrape.com/">Books to Scrape&#8217;s homepage</a>. This is a static page, which is the ideal scenario for high-speed HTML parsing in Rust.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CIY7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CIY7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 424w, https://substackcdn.com/image/fetch/$s_!CIY7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 848w, https://substackcdn.com/image/fetch/$s_!CIY7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CIY7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CIY7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png" width="1456" height="786" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:786,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CIY7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 424w, https://substackcdn.com/image/fetch/$s_!CIY7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 848w, https://substackcdn.com/image/fetch/$s_!CIY7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CIY7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5aaf8-f5b1-44f2-93e8-72daee5b3cf7_3008x1623.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Books to Scrape homepage</figcaption></figure></div><p>The end goal is to scrape all the book information and export it to a CSV file. 
Follow the instructions below!</p><h3>Prerequisites</h3><p>Make sure you have:</p><ul><li><p><a href="https://rust-lang.org/tools/install/">Rust installed locally</a> (the article refers to Rust 1.95.0).</p></li><li><p>Some basic familiarity with <a href="https://doc.rust-lang.org/book/">Rust syntax and constructs</a>.</p></li></ul><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Qrb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Qrb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png" width="479" height="239.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:479,&quot;bytes&quot;:911444,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196394917?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8Qrb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!8Qrb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fe580f-2f4e-4452-9e3c-0ca00777d9f6_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Trusted by teams running ad verification, web scraping, SERP tracking, and market research. 
Ethically sourced proxies, globally accessible, and fairly priced.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataimpulse.com/&quot;,&quot;text&quot;:&quot;Get Started With DataImpulse&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataimpulse.com/"><span>Get Started With DataImpulse</span></a></p></blockquote><div><hr></div><h3>Step #1: Set Up a Rust Scraping Project</h3><p>Create a new Rust project for web scraping with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">cargo new books_rust_scraper</code></pre></div><p>This will generate a new project called <em>books_rust_scraper</em> containing a basic &#8220;Hello, world!&#8221; program. Move into the project folder:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">cd books_rust_scraper</code></pre></div><p>After the first build, the project will have the following file structure (a freshly generated project does not yet contain <em>target/</em> or <em>Cargo.lock</em>):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">
books_rust_scraper/
&#9500;&#9472;&#9472; src/
&#9474;   &#9492;&#9472;&#9472; main.rs
&#9500;&#9472;&#9472; target/
&#9500;&#9472;&#9472; .gitignore
&#9500;&#9472;&#9472; Cargo.toml
&#9492;&#9472;&#9472; Cargo.lock</code></pre></div><p>Focus on the <em>src/main.rs</em> file:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oopm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oopm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 424w, https://substackcdn.com/image/fetch/$s_!oopm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 848w, https://substackcdn.com/image/fetch/$s_!oopm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 1272w, https://substackcdn.com/image/fetch/$s_!oopm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oopm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png" width="1364" height="386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:1364,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The src/main.rs file&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The src/main.rs file" title="The src/main.rs file" srcset="https://substackcdn.com/image/fetch/$s_!oopm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 424w, https://substackcdn.com/image/fetch/$s_!oopm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 848w, https://substackcdn.com/image/fetch/$s_!oopm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 1272w, https://substackcdn.com/image/fetch/$s_!oopm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75148e5-6f13-44b6-b373-1ecd9477ce31_1364x386.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The src/main.rs file</figcaption></figure></div><p>This is the entry point of your application and currently contains a simple &#8220;Hello, world!&#8221; example. Test your Rust application with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">cargo run</code></pre></div><p>The command compiles the project and runs the resulting binary, whose entry point is <em>src/main.rs</em>, so the output will be:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Hello, world!</code></pre></div><p>In <em>src/main.rs</em>, you&#8217;ll implement your Rust web scraping logic. 
Great!</p><h3>Step #2: Install Required Dependencies</h3><p>Run these commands to install the crates (Rust libraries) needed to build a Rust web scraper:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">cargo add tokio --features full
cargo add reqwest
cargo add scraper
cargo add csv</code></pre></div><p>These are the core dependencies:</p><ul><li><p><em><a href="https://docs.rs/tokio/latest/tokio/">tokio</a></em>: To provide the asynchronous runtime that drives the scraper.</p></li><li><p><em><a href="https://docs.rs/reqwest/latest/reqwest/">reqwest</a></em>: To send HTTP requests and retrieve HTML pages.</p></li><li><p><em><a href="https://docs.rs/scraper/latest/scraper/">scraper</a></em>: To parse HTML and extract data using CSS selectors.</p></li><li><p><em><a href="https://docs.rs/csv/latest/csv/">csv</a></em>: To export the scraped data to a CSV file.</p></li></ul><p>After running the commands above, your <em>Cargo.toml</em> file should look similar to this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;toml&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-toml">[package]
name = "books_rust_scraper"
version = "0.1.0"
edition = "2024"

[dependencies]
csv = "1.4.0"
reqwest = "0.13.3"
scraper = "0.26.0"
tokio = { version = "1.52.1", features = ["full"] }</code></pre></div><p>Nice! You now have all the dependencies in place to start building your Rust scraper.</p><h3>Step #3: Retrieve the Target Page</h3><p>Use <em>reqwest</em> to fetch the target page with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">use std::error::Error;
use reqwest::Client;

#[tokio::main]
async fn main() -&gt; Result&lt;(), Box&lt;dyn Error&gt;&gt; {
    // Initialize the HTTP client
    let client = Client::builder()
        .build()?;

    // Retrieve the target page
    let url = "https://books.toscrape.com/";
    let response = client
        .get(url)
        .send()
        .await?;
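    // Optional addition, not part of the original snippet: treat 4xx/5xx
    // status codes as errors so an error page is never parsed as if it
    // were the book catalog. error_for_status() is provided by reqwest.
    let response = response.error_for_status()?;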

    // Extract the HTML content from the response
    let html = response.text().await?;
  
    // Parsing logic...

    // Data export logic...

    Ok(())
}</code></pre></div><p>This snippet starts the Tokio runtime through <em>#[tokio::main]</em>, builds a <em>reqwest</em> HTTP client, sends an asynchronous GET request to the target URL, and reads the response body into the <em>html</em> string, ready for parsing and data extraction.</p><p>If you print <em>html</em>, you&#8217;ll see the raw HTML of the page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G1Ll!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G1Ll!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 424w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 848w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 1272w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G1Ll!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png" width="1456" height="547" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The HTML of the target page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The HTML of the target page" title="The HTML of the target page" srcset="https://substackcdn.com/image/fetch/$s_!G1Ll!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 424w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 848w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 1272w, https://substackcdn.com/image/fetch/$s_!G1Ll!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a9f95f6-cbe9-4eb9-8bcb-68527d145173_2154x809.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The HTML of the target page</figcaption></figure></div><p>Excellent! Get ready to apply the Rust data parsing logic.</p><h3>Step #4: Implement the Parsing Logic</h3><p>Before implementing the web scraping logic in Rust, study the DOM of the target page. 
Inspect a book HTML element in the browser:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zd3v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zd3v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 424w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 848w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 1272w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zd3v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png" width="1456" height="811" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/baf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Inspecting a book HTML element&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Inspecting a book HTML element" title="Inspecting a book HTML element" srcset="https://substackcdn.com/image/fetch/$s_!Zd3v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 424w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 848w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 1272w, https://substackcdn.com/image/fetch/$s_!Zd3v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf83d45-7adf-4d99-8a1d-283622a7e8e3_2102x1171.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Inspecting a book HTML element</figcaption></figure></div><p>From this structure, notice how you can select all books using the <em>article.product_pod</em> CSS selector. For each book element, you can retrieve:</p><ul><li><p>The title and URL from <em>h3 a</em>.</p></li><li><p>The image URL from <em>.image_container img</em>.</p></li><li><p>The price from <em>.price_color</em>.</p></li><li><p>The rating from <em>p.star-rating</em>.</p></li><li><p>The stock status from <em>.instock.availability</em>.</p></li></ul><p>First, <a href="https://doc.rust-lang.org/book/ch05-01-defining-structs.html">define a struct</a> to store that data:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">#[derive(Debug)]
struct Book {
    url: String,
    image_url: String,
    title: String,
    price: String,
    rating: String,
    in_stock: bool,
}</code></pre></div><p>Next, define the <em>parse_books()</em> function that extracts and structures the data:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">use scraper::{Html, Selector};
// ...

// ...
fn parse_books(html: &amp;str) -&gt; Result&lt;Vec&lt;Book&gt;, Box&lt;dyn Error&gt;&gt; {
    // Parse the HTML content
    let document = Html::parse_document(html);

    // Define CSS selectors for the HTML elements of interest
    let book_selector = Selector::parse("article.product_pod")?;
    let title_selector = Selector::parse("h3 a")?;
    let image_selector = Selector::parse(".image_container img")?;
    let price_selector = Selector::parse(".price_color")?;
    let rating_selector = Selector::parse("p.star-rating")?;
    let stock_selector = Selector::parse(".instock.availability")?;

    // Where to store the scraped data
    let mut books = Vec::new();

    // Iterate over each book element and extract the relevant data
    for book_el in document.select(&amp;book_selector) {
        // Apply the parsing logic
        let Some(title_el) = book_el.select(&amp;title_selector).next() else {
            // Skip book cards without a title link instead of panicking
            continue;
        };

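        // On the home page, each href already starts with "catalogue/",
        // so join it with the site root to avoid duplicating that segment
        let relative_url = title_el.value().attr("href").unwrap_or("");
        let url = format!(
            "https://books.toscrape.com/{}",
            relative_url.trim_start_matches('/')
        );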

        let image_url = book_el
            .select(&amp;image_selector)
            .next()
            .and_then(|img| img.value().attr("src"))
            .unwrap_or("")
            .to_string();

        let image_url = format!(
            "https://books.toscrape.com/{}",
            image_url.trim_start_matches('/')
        );

        let title = title_el
            .value()
            .attr("title")
            .unwrap_or("")
            .to_string();

        let price = book_el
            .select(&amp;price_selector)
            .next()
            .map(|e| e.text().collect::&lt;String&gt;())
            .unwrap_or_default();

        let rating = book_el
            .select(&amp;rating_selector)
            .next()
            .and_then(|e| e.value().attr("class"))
            .unwrap_or("no rating")
            .replace("star-rating", "")
            .trim()
            .to_lowercase();

        let in_stock = book_el
            .select(&amp;stock_selector)
            .next()
            .map(|e| {
                // The element text is padded with newlines and spaces, so trim it first
                let text = e.text().collect::&lt;String&gt;();
                text.trim().eq_ignore_ascii_case("in stock")
            })
            .unwrap_or(false);

        // Collect the scraped book data
        books.push(Book {
            title,
            price,
            rating,
            in_stock,
            image_url,
            url,
        });
    }

    Ok(books)
}</code></pre></div><p>This function parses raw HTML into structured data using the <em>scraper</em> crate. <em>Html::parse_document()</em> creates a DOM-like representation of the page, while <em>Selector::parse()</em> defines CSS selectors for targeting elements.</p><p><em>document.select(&amp;book_selector)</em> iterates over each book container. Inside each element, <em>.select()</em> extracts nested elements, while <em>.value().attr()</em> retrieves attributes such as links and titles. The <em>.text()</em> method collects visible text content.</p><p>Finally, all extracted values are assembled into a <em>Book</em> struct, and each instance is stored in a vector for later export or processing.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Step #5: Export the Scraped Data</h3><p>Right now, the scraped data is returned by the <em>parse_books()</em> function as a vector of <em>Book</em> structs. Next, add a function that uses the <em>csv</em> crate to export that data into a CSV file:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">use csv::Writer;
// ...

// ...
fn write_csv(books: &amp;[Book], file_path: &amp;str) -&gt; Result&lt;(), Box&lt;dyn std::error::Error&gt;&gt; {
    let mut wtr = Writer::from_path(file_path)?;

    // Write the header row
    wtr.write_record(&amp;[
        "url",
        "image_url",
        "title",
        "price",
        "rating",
        "in_stock",
    ])?;

    for book in books {
        wtr.write_record(&amp;[
            &amp;book.url,
            &amp;book.image_url,
            &amp;book.title,
            &amp;book.price,
            &amp;book.rating,
            &amp;book.in_stock.to_string(),
        ])?;
    }

    wtr.flush()?;
    Ok(())
}</code></pre></div><h3>Step #6: Put It All Together</h3><p>This is the final code of your Rust web scraper:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">// src/main.rs

use std::error::Error;
use reqwest::Client;
use scraper::{Html, Selector};
use csv::Writer;

#[derive(Debug)]
struct Book {
    url: String,
    image_url: String,
    title: String,
    price: String,
    rating: String,
    in_stock: bool,
}

#[tokio::main]
async fn main() -&gt; Result&lt;(), Box&lt;dyn Error&gt;&gt; {
    // Initialize the HTTP client
    let client = Client::builder()
        .build()?;

    // Retrieve the target page
    let url = "https://books.toscrape.com/";
    let response = client
        .get(url)
        .send()
        .await?;

    // Extract the HTML content from the response
    let html = response.text().await?;

    // Parse the books data from the HTML
    let books = parse_books(&amp;html)?;

    // Export the scraped data to a CSV file
    write_csv(&amp;books, "books.csv")?;

    Ok(())
}

fn parse_books(html: &amp;str) -&gt; Result&lt;Vec&lt;Book&gt;, Box&lt;dyn Error&gt;&gt; {
    // Parse the HTML content
    let document = Html::parse_document(html);

    // Define CSS selectors for the HTML elements of interest
    let book_selector = Selector::parse("article.product_pod")?;
    let title_selector = Selector::parse("h3 a")?;
    let image_selector = Selector::parse(".image_container img")?;
    let price_selector = Selector::parse(".price_color")?;
    let rating_selector = Selector::parse("p.star-rating")?;
    let stock_selector = Selector::parse(".instock.availability")?;

    // Where to store the scraped data
    let mut books = Vec::new();

    // Iterate over each book element and extract the relevant data
    for book_el in document.select(&amp;book_selector) {
        // Apply the parsing logic
        let Some(title_el) = book_el.select(&amp;title_selector).next() else {
            // Skip book cards without a title link instead of panicking
            continue;
        };

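        // On the home page, each href already starts with "catalogue/",
        // so join it with the site root to avoid duplicating that segment
        let relative_url = title_el.value().attr("href").unwrap_or("");
        let url = format!(
            "https://books.toscrape.com/{}",
            relative_url.trim_start_matches('/')
        );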

        let image_url = book_el
            .select(&amp;image_selector)
            .next()
            .and_then(|img| img.value().attr("src"))
            .unwrap_or("")
            .to_string();

        let image_url = format!(
            "https://books.toscrape.com/{}",
            image_url.trim_start_matches('/')
        );

        let title = title_el
            .value()
            .attr("title")
            .unwrap_or("")
            .to_string();

        let price = book_el
            .select(&amp;price_selector)
            .next()
            .map(|e| e.text().collect::&lt;String&gt;())
            .unwrap_or_default();

        let rating = book_el
            .select(&amp;rating_selector)
            .next()
            .and_then(|e| e.value().attr("class"))
            .unwrap_or("no rating")
            .replace("star-rating", "")
            .trim()
            .to_lowercase();

        let in_stock = book_el
            .select(&amp;stock_selector)
            .next()
            .map(|e| {
                // The element text is padded with newlines and spaces, so trim it first
                let text = e.text().collect::&lt;String&gt;();
                text.trim().eq_ignore_ascii_case("in stock")
            })
            .unwrap_or(false);

        // Collect the scraped book data
        books.push(Book {
            title,
            price,
            rating,
            in_stock,
            image_url,
            url,
        });
    }

    Ok(books)
}

fn write_csv(books: &amp;[Book], file_path: &amp;str) -&gt; Result&lt;(), Box&lt;dyn Error&gt;&gt; {
    let mut wtr = Writer::from_path(file_path)?;

    // Write the header row
    wtr.write_record(&amp;[
        "url",
        "image_url",
        "title",
        "price",
        "rating",
        "in_stock",
    ])?;

    for book in books {
        wtr.write_record(&amp;[
            &amp;book.url,
            &amp;book.image_url,
            &amp;book.title,
            &amp;book.price,
            &amp;book.rating,
            &amp;book.in_stock.to_string(),
        ])?;
    }

    wtr.flush()?;
    Ok(())
}</code></pre></div><p>Note how all previously defined functions are now called inside <em>main()</em>. Et voila! In just around 150 lines of code, you&#8217;ve built an efficient web scraper in Rust.</p><p>Run your scraper with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">cargo run</code></pre></div><p>After execution, a <em>books.csv</em> file will be created in your project folder. Open it, and you will see:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ou7T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ou7T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 424w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 848w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ou7T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png" width="1456" height="578" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output books.csv file&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output books.csv file" title="The output books.csv file" srcset="https://substackcdn.com/image/fetch/$s_!Ou7T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 424w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 848w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!Ou7T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9467bbe4-1ba4-49ab-9e07-5e3dbb2bb6b8_3051x1212.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The output books.csv file</figcaption></figure></div><p>This matches exactly the data shown on the target website, but now in a structured format. Mission complete!</p><h2>Browser Automation in Rust: Does It Make Sense?</h2><p>First of all, it&#8217;s worth noting that the ecosystem for browser automation in Rust is quite small compared to JavaScript or Python. Also, most libraries aren&#8217;t official, but rather community-backed ports like Playwright Rust or the Selenium bindings.</p><p>Now, from a technical standpoint, browser automation happens inside the browser itself. So, Chrome, Chromium, or Firefox do most of the heavy lifting. 
What you define through the library&#8217;s API simply orchestrates operations like clicking, waiting for elements, and extracting data. These commands are then translated into browser actions via <a href="https://substack.thewebscraping.club/p/webdriver-vs-cdp-vs-bidi">WebDriver, CDP, or WebDriver BiDi</a>.</p><p>Because of that, using a systems-level language like Rust can be more of a burden than an advantage. The main strength of Rust (i.e., raw performance) doesn&#8217;t really matter here, since the controlled browser instances are the actual bottleneck, not your automation code.</p><p>That means we lose Rust&#8217;s biggest advantage while still paying its costs. On top of that, Rust&#8217;s strict compiler and steeper learning curve can slow down development speed.</p><p>To be honest, I see Rust as excellent for the parsing and data processing layer, but I wouldn&#8217;t recommend it for browser automation&#8230;</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Rust for Web Scraping: Final Comment</h3><p>If I had to summarize my experience with Rust for web scraping, I&#8217;d say this: <em>it really shines when you&#8217;re parsing large HTML pages at scale or handling a high number of parsing tasks in parallel.</em></p><p>In those scenarios, the combination of performance, memory safety, and concurrency makes a real difference. 
That said, I wouldn&#8217;t recommend Rust for everyday scraping tasks&#8230; The entry barrier is just too high, the learning curve too steep, and the ecosystem around scraping too small.</p><p>On top of that, finding experienced Rust developers specifically focused on web scraping, or even just <a href="https://substack.thewebscraping.club/p/the-evolving-career-scrapers">translating those skills into job opportunities</a>, can be way more challenging than in more mainstream stacks.</p><p>So my take is pretty simple: consider Rust when performance and scale <em>truly</em> matter. For everything else, prefer Python or JavaScript.</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Can efficiently handle thousands of requests in parallel.</p></li><li><p>Rust HTML parsers are extremely fast.</p></li><li><p>Strict compiler checks and static guarantees lead to stable scraping pipelines.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>Slower development and prototyping speed.</p></li><li><p>Smaller ecosystem of scraping libraries compared to Python or JavaScript.</p></li><li><p>Not a practical choice for browser automation.</p></li></ul><div><hr></div><h2>Conclusion</h2><p>Here, I&#8217;ve guided you through the world of web scraping in Rust. In a world dominated by AI slop, security flaws, and neglected best practices, this programming language is gaining traction thanks to its focus on efficiency and strict compile-time checks.</p><p>As you&#8217;ve seen, Rust is excellent for CPU-intensive or memory-intensive tasks like HTML parsing and data processing. Still, it might not be ideal for browser automation or quick prototyping. You also learned how to go from zero to scraped data in CSV format by building a Rust scraper.</p><p>I hope you found this helpful and insightful. If you have any questions, feel free to share them in the comments below!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #105: If LLMs Can Bypass CAPTCHAs, Are CAPTCHA Solver Services Cooked?]]></title><description><![CDATA[Bypassing hCaptcha in the AI era]]></description><link>https://substack.thewebscraping.club/p/bypassing-hcaptcha-llm</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/bypassing-hcaptcha-llm</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Fri, 29 May 2026 13:13:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/87ac6a36-66d4-4153-8a63-d6ee184d4f29_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Think back to 2023. 
The story about large language models was that they would automate most work involving reading or writing, and a good slice of the work that means looking at a screen the way you do. We bought that story too. We even wrote <a href="https://substack.thewebscraping.club/p/are-captchas-still-a-thing">Are CAPTCHAs still a thing?</a> that August, reporting on the ETH Zurich paper that claimed AI bots beat humans at reCAPTCHA v2 image challenges by roughly 15%. CAPTCHAs were on the long list of things LLMs were supposed to make irrelevant on the road to general agency.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What the 2023 LLM hype promised, and what 2026 actually shipped</h2><p>The product category that took that promise most literally is the agentic browser. That means a real Chromium instance driven by an LLM that reads the page, decides what to do, and clicks. The category includes Browserbase, Hyperbrowser, Skyvern, Browser Use, BrowserOS, Owl Browser, and a long tail of proxy companies rebranding their scraping browsers as &#8220;AI-powered&#8221;. The pitch in 2024 and 2025 never changed. You hand the agent a task in plain language, and the model handles the rest, including whatever defensive challenge the site throws back.</p><p>It is 2026 now, and that prediction has not played out. CAPTCHAs are still in your pipeline. So we went looking for an answer to one question: can an LLM actually solve a production-grade CAPTCHA like hCaptcha? We read the code where we could, checked the default configs and the docs, and probed the public surfaces of the solver services these products lean on. The picture is consistent, and it is not the one the marketing sells.</p><p>Every major agentic browser ships a CAPTCHA-solving bullet point on its site. Open the code of the open-source agents, though, and you find a different story. Almost none of them actually use an LLM to solve a CAPTCHA. 
They either refuse to try, or they try and fail.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try 
Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><p>One thing to get straight before we open any code. The two families of CAPTCHA behave very differently, and the hype lands on them unevenly. The invisible ones, reCAPTCHA v3 and Cloudflare Turnstile, score your session in the background and rarely show a puzzle. A stealth-first browser on clean residential proxies usually walks past them without anyone seeing a challenge, which we covered for Turnstile in <a href="https://substack.thewebscraping.club/p/cloudflare-turnstile-what-is-that">Cloudflare Turnstile: what is that and how it works?</a> and <a href="https://substack.thewebscraping.club/p/how-to-bypass-cloudflare-turnstile">THE LAB #73: How to Bypass Cloudflare in 2025</a>. The visible image challenges, hCaptcha and reCAPTCHA v2, actually demand an answer. We went deep on reCAPTCHA v2 in <a href="https://substack.thewebscraping.club/p/bypassing-recaptchas-with-open-source">Bypassing reCAPTCHAs With Open Source and Commercial Tools - Part 2</a>. The one that matters in 2026 is hCaptcha, and that is where this article lives, because it is where most scrapers break.</p><p>So here is the question that drives everything below. When an agentic browser says it &#8220;solves&#8221; an hCaptcha, what does its code actually do?</p><h2>Tool landscape</h2><h3>What&#8217;s the commercial offer today</h3><p>There are four families of strategy in the commercial set. In none of them is the LLM in the agent layer doing the CAPTCHA work.</p><p>The first family is stealth-first. Residential proxies, fingerprint shaping, and request patterning lower the bot score so the visible challenge never triggers. The CAPTCHA is not solved. It is prevented from appearing. 
That gives you the cleanest legal posture in the set, because no automated solving is happening. ZenRows is the example.</p><p>The second family is the opposite of stealth. Instead of hiding that the traffic is automated, the vendor declares it openly and relies on a business arrangement with the CAPTCHA providers to be let through. Browserbase is explicit about this. Its <a href="https://docs.browserbase.com/features/stealth-mode">Stealth Mode documentation</a> says &#8220;through Browserbase&#8217;s partnerships with CAPTCHA providers, Browserbase can resolve challenges automatically so your sessions continue without interruption&#8221;, with solving &#8220;enabled by default for all sessions&#8221;. This is the verified-bot path, the same idea behind Cloudflare&#8217;s Web Bot Auth. A declared, allowlisted agent rather than a disguised one. No model solves a puzzle on either side. The provider recognizes the partner and waves it through. For the common challenge types, according to their documentation, this works without a fight.</p><p>The third family pairs a proprietary solver with documented third-party integrations. The vendor ships its own solving for some challenge types. For the rest, it documents how to wire in an external solver service. The external solver watches for the challenge and returns the response token through its extension or REST API. The agent then submits. Hyperbrowser and Skyvern Cloud both present this shape, a native or closed-source component plus a documented third-party path. 
Hyperbrowser advertises &#8220;Native Cloudflare Turnstile &amp; CAPTCHA Solving&#8221; with &#8220;No external plugins&#8221; in its <a href="https://tech.hyperbrowser.ai/scraping-infrastructure-native-turnstile-captcha-solving">post on native CAPTCHA solving</a>, then points to an external solver for the challenge types the native one does not cover.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a 
href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><p>The fourth family runs an in-house solver alongside the agent. Could be a vision LLM, could be classical computer vision, could be something else. No third party is in the critical path. The vendor owns the whole stack. Bright Data, Oxylabs, and Owl Browser sit here. 
<a href="https://owlbrowser.net/">Owl Browser</a> puts hard numbers on the claim: &#8220;detect and automatically solve reCAPTCHA v2 (1.2s), hCaptcha (0.8s), Turnstile (0.3s), and image CAPTCHAs.&#8221; Bright Data sells &#8220;AI-based unlocking logic&#8221; that handles &#8220;CAPTCHA solving, fingerprinting, retries, best headers, location and more&#8221; on its <a href="https://brightdata.com/products/web-unlocker/captcha-solver">Web Unlocker page</a>.</p><p>The table below maps each vendor to the mechanism its public documentation surfaces, with the source page that proves it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!klTW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!klTW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 424w, https://substackcdn.com/image/fetch/$s_!klTW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 848w, https://substackcdn.com/image/fetch/$s_!klTW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 1272w, https://substackcdn.com/image/fetch/$s_!klTW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!klTW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png" width="654" height="839" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:839,&quot;width&quot;:654,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164510,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/199732185?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!klTW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 424w, https://substackcdn.com/image/fetch/$s_!klTW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 848w, https://substackcdn.com/image/fetch/$s_!klTW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 1272w, https://substackcdn.com/image/fetch/$s_!klTW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e187d27-0a40-45c6-bef3-a0194faa3c1b_654x839.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>The open-source agents we can read</h3><p>For the proprietary CAPTCHA-solving strategies we cannot see how AI is used. 
For the open-source ones we can. So we opened their code and read exactly how they handle CAPTCHAs. Three projects matter here.</p><p><a href="https://github.com/browser-use/browser-use">browser-use</a> is the most popular open-source LLM agent framework for browser automation. MIT-licensed, vendor-agnostic on the LLM side. The repo contains no LLM-driven CAPTCHA-solving logic. The one CAPTCHA-related file it ships, <a href="https://raw.githubusercontent.com/browser-use/browser-use/main/browser_use/browser/watchdogs/captcha_watchdog.py">captcha_watchdog.py</a>, does not solve anything either. It waits for a solver running in the BrowserUse cloud proxy and blocks the agent loop until that solver reports back. Run the library locally with your own model and no cloud proxy, and the watchdog has nothing to wait for. That makes browser-use the cleanest test of the claim that a local LLM agent solves the CAPTCHA by reading the page.</p><p><a href="https://github.com/Skyvern-AI/skyvern">Skyvern OSS</a> is the open core of the Skyvern Cloud product. AGPL-3.0, focused on form-filling and structured workflows, written in Python.</p><p><a href="https://github.com/browseros-ai/BrowserOS">BrowserOS</a> is a YC S24 open-source agent-driven browser. AGPL-3.0, 11k stars on GitHub, active development. It pairs a Chromium fork with an integrated agent runtime.</p><h2>Modeling hCaptcha and reading what the open-source agents actually do</h2><p>Before testing, it helps to model the target. hCaptcha embeds a widget on the host page through a script served from <code>hcaptcha.com</code>. </p><p>The widget renders inside an iframe whose origin is <code>hcaptcha.com</code>, cross-origin to the host page. When the user clicks the &#8220;I am human&#8221; checkbox, the widget decides whether to issue a challenge. If it does, a second iframe opens with the puzzle dialog. 
The puzzle layout varies between runs (3x3 grid, 4x3 grid, area-select with a single click on an image, bounding-box, multiple-choice prompt). When the puzzle is solved, the widget writes a response token to a hidden <code>textarea[name="h-captcha-response"]</code> on the host page. The host form reads that textarea on submit and posts the token along with the rest of the data. The whole solve interaction happens inside a frame the host page cannot script. The same-origin policy boundary blocks it.</p><p>That last detail decides what works and what does not. An agent driving the host page through Playwright or CDP has full control over the outer page. Inside the <code>hcaptcha.com</code> frame, its control is limited. A browser extension runs with cross-origin privileges. It can both observe and click inside the widget frame. That asymmetry explains most of what follows.</p><p>We started by reading three open-source agent repos to see what each one had decided to do at this boundary.</p><h3>Skyvern OSS bails to the human</h3><p>Skyvern&#8217;s README is candid about the scope of the OSS release: &#8220;All of the core logic powering Skyvern is available in this open source repository licensed under the AGPL-3.0 License, with the exception of anti-bot measures available in our managed cloud offering.&#8221; That single sentence puts the CAPTCHA section of every Skyvern Cloud feature page outside the repo you can read.</p><p>The repo confirms the framing. In <a href="https://github.com/Skyvern-AI/skyvern/blob/main/skyvern/forge/agent_functions.py#L800-L804">agent_functions.py, lines 800-804</a>, the <code>auto_solve_captchas</code> helper returns <code>False</code> unconditionally. In <a href="https://github.com/Skyvern-AI/skyvern/blob/main/skyvern/webeye/actions/handler.py#L1149-L1162">handler.py, lines 1149-1162</a>, <code>handle_solve_captcha_action</code> does exactly one thing of substance.</p><pre><code><code>async def handle_solve_captcha_action(
    action: actions.SolveCaptchaAction,
    page: Page,
    scraped_page: ScrapedPage,
    task: Task,
    step: Step,
) -&gt; list[ActionResult]:
    LOG.warning(
        "Please solve the captcha on the page, you have 30 seconds",
        action=action,
    )
    await asyncio.sleep(30)
    return [ActionSuccess()]</code></code></pre><p>Thirty seconds of <code>asyncio.sleep</code> and a log message asking a human to handle it. Then a success result, regardless of what the human actually did. The script-generation path in <a href="https://github.com/Skyvern-AI/skyvern/blob/main/skyvern/core/script_generations/skyvern_page.py#L1252-L1254">skyvern_page.py, lines 1252-1254</a> is even more direct. <code>solve_captcha</code> raises <code>NotImplementedError</code>. A Skyvern user opened <a href="https://github.com/Skyvern-AI/skyvern/issues/1117">issue #1117</a> asking how CAPTCHA solving was meant to work in OSS. A maintainer answered plainly: &#8220;We haven&#8217;t open sourced anything related to our captcha solver / anti-bot measures. We don&#8217;t want people abusing these things, so they must remain closed source unfortunately.&#8221;</p><p>The Skyvern OSS code does not pretend to solve CAPTCHAs. Whatever Skyvern Cloud does on top of this, the OSS release hands the problem to a human and moves on.</p><p>As always, the code we will use to bypass hCaptcha can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder <strong>105.HCAPTCHA</strong></a>.</p>
      <p>
          <a href="https://substack.thewebscraping.club/p/bypassing-hcaptcha-llm">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Scrape Open-Source Datasets Ethically]]></title><description><![CDATA[How to collect open data responsibly, without breaking rules or burning bridges]]></description><link>https://substack.thewebscraping.club/p/how-to-scrape-open-source-datasets</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-scrape-open-source-datasets</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 24 May 2026 19:58:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1ef53778-bd4a-4fd8-9911-912fc9f8ea67_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you need to scrape data from the web, &#8220;open data&#8221; and &#8220;open-source datasets&#8221; sound like a green light. No paywall, no login, no restrictions: just data sitting there, ready to be collected. It is a reasonable assumption, right?</p><p>Well, not so fast.</p><p>Open data does not automatically mean free to use, free to redistribute, or free from privacy obligations. And scraping it without thinking through the implications can land you in legal trouble, get your IP banned from a public infrastructure that was never designed to handle aggressive crawlers, or cause you to expose people&#8217;s personal information.</p><p>In this article, we will go through a complete picture of the &#8220;open data&#8221; world: what the problem actually is, how to approach it correctly, and how to implement responsible open data scrapers in Python. 
</p><p>Let&#8217;s dive into it!</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank <strong>NetNut</strong>, the platinum partner of the month. 
Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>What &#8220;Open Data&#8221; Actually Means Legally, Ethically, and Practically</strong></h2><p>&#8220;Open&#8221; is one of the most overloaded words in the data world. Depending on the license, the jurisdiction, and the type of data involved, the same publicly accessible dataset can be freely redistributable, commercially restricted, privacy-sensitive, or legally off-limits entirely. </p><p>So, before anything else, let&#8217;s establish what you are actually dealing with.</p><h3>What &#8220;Open-Source Dataset&#8221; Actually Means (and What It Doesn&#8217;t)</h3><p>Where a dataset sits on the licensing spectrum determines everything: whether you can redistribute it, whether you can use it commercially, and whether collecting it at all exposes you to liability. Here is how the spectrum breaks down:</p><ul><li><p><strong>CC0</strong> (Creative Commons Zero): Essentially, it is a public domain dedication. The author waives all rights to the extent the law allows. You can scrape it, redistribute it, use it commercially, and modify it.</p></li><li><p><strong>CC-BY</strong> (Creative Commons Attribution): It requires you to credit the original source. This means you must clearly state where the data came from, who created it, and link back to the original when you publish or redistribute it. 
This is the most permissive license after CC0, and it is generally easy to comply with.</p></li><li><p><strong>CC-BY-SA</strong> (Share-Alike): This carries the same attribution requirement as CC-BY, but adds a condition: any derivative work you publish must carry the same license. In practice, this means you cannot fold a CC-BY-SA dataset into a proprietary product and lock it down.</p></li><li><p><strong>CC-BY-NC</strong> (Non-Commercial): It also requires attribution, but restricts commercial use entirely. You can use the data for research, journalism, or personal projects, but the moment money is involved, you need a separate agreement with the data owner.</p></li><li><p><strong>ODbL</strong> (Open Database License), used by OpenStreetMap: It requires both attribution and share-alike, specifically for databases. It is worth noting that ODbL distinguishes between the database itself and the contents. Basically, you can use individual facts freely, but any public use of the database as a whole must comply with the license terms.</p></li></ul><p>And then there is the grey zone, which is where most scraping engineers actually operate: data that is publicly accessible but carries no explicit license. Common cases are government portals, academic repositories, open court records, and municipal datasets. This is a huge portion of what people call &#8220;open data&#8221;. And here is the thing that matters for scraping professionals: <strong>no license does not mean free to use</strong>. In most jurisdictions, the absence of a license means the default copyright law applies. 
Which means the creator reserves all rights.</p><p>So before you write a single line of scraper code, the first question is not <em>&#8220;Can I access this?&#8221;</em> but <em>&#8220;Under what terms am I allowed to use what I access?&#8221;</em></p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" 
width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo 
Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3>Where the Ethical (and Legal) Risks Hide</h3><p>Once you have cleared the license question, there are still several risk areas that are easy to overlook:</p><ul><li><p><strong>License violations</strong>: This is the most obvious one. If a dataset requires attribution and you redistribute it without crediting the source, you are in breach. If it has a non-commercial clause and you use it in a commercial product, it&#8217;s the same story. These are the kinds of things that generate cease-and-desist letters.</p></li><li><p><strong>PII embedded in &#8220;open&#8221; datasets</strong>: This is a subtler and arguably more dangerous problem than a license violation. Consider open court records: they are public by design, but they contain names, addresses, and sometimes sensitive personal details. Census microdata, even when anonymized at the aggregate level, can contain individual-level records. Likewise, GitHub commit history is public, but it contains email addresses, which are personal data. So, the fact that data was made public by someone else does not strip it of its privacy implications when you collect, aggregate, and store it.</p></li><li><p><strong>Jurisdictional complexity</strong>: A dataset hosted on a European government portal can carry GDPR obligations even if you are scraping it from the United States. The GDPR&#8217;s reach depends largely on where the data subjects are located, not on where the scraper is running. If you are collecting data about EU residents, assume you are in GDPR territory regardless of your own geography.</p></li><li><p><strong>The aggregation problem</strong>: This is probably one of the most underappreciated risks in the scraping industry. 
Individually, a dataset of names, a dataset of addresses, and a dataset of employment records might each be harmless and openly licensed. But combine them, and you have created a detailed profile of real people. This kind of profiling is exactly what privacy regulations were designed to prevent.</p></li></ul><h3>The Infrastructure Problem: Open Data Portals Are Not Built for Scrapers</h3><p>Many scraping engineers come to open data with habits built on commercial targets. That experience can be misleading, because the infrastructure behind open data portals is completely different.</p><p><a href="https://substack.thewebscraping.club/p/sentiment-analysis-product-reviews">When you scrape a large e-commerce website</a> or a <a href="https://substack.thewebscraping.club/p/scraping-linkedin-public-data">major social media platform</a>, you are hitting servers that are engineered to handle millions of requests per day, backed by CDNs, load balancers, and dedicated anti-bot teams. In other words, they can take a (hard) hit.</p><p>On the other hand, a municipal open data portal, a university&#8217;s research repository, or a small NGO&#8217;s dataset hosting is an entirely different story. These services typically run on underfunded infrastructure with little capacity to spare. This means that a scraper that would barely register as noise on Amazon&#8217;s servers could genuinely degrade performance for a public data portal serving thousands of researchers.</p><p>This is why scraping open data portals aggressively is arguably less defensible than doing the same to a commercial target. You are not fighting a corporation&#8217;s anti-bot system. You are potentially taking down a public resource that other people depend on.</p><h3><strong>A Four-Step Framework for Scraping Open Datasets Without Breaking Rules or Infrastructure</strong></h3><p>Every risk outlined above has a straightforward mitigation, but only if you apply it at the right point in your workflow. 
The mistake most scraping engineers make is treating these as afterthoughts: checking the license after the scraper is already built, thinking about PII after the data is already stored. Let&#8217;s discuss a framework that inverts this.</p><h3>License-First Workflow: Read Before You Scrape</h3><p>The fix for the license problem is simple in principle, even if it requires discipline in practice: make license verification the first step of your workflow.</p><p>Most well-maintained open data portals provide license information in one of these three places: a <code>LICENSE</code> file in the dataset&#8217;s root directory, a metadata field in the dataset&#8217;s API response, or the dataset&#8217;s documentation page. Here is a quick reference for what the licenses described above mean for your use case:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AbdL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AbdL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 424w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 848w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1272w, 
https://substackcdn.com/image/fetch/$s_!AbdL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AbdL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png" width="1021" height="376" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:1021,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48171,&quot;alt&quot;:&quot;Summary table for data licenses by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196400924?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Summary table for data licenses by Federico Trotta" title="Summary table for data licenses by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!AbdL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 424w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 848w, 
https://substackcdn.com/image/fetch/$s_!AbdL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1272w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Summary table for data licenses</figcaption></figure></div><p>When there is no license, the safe default is not to 
scrape and redistribute without seeking explicit permission from the dataset owner. A short email asking for clarification is a sign of professionalism.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Prefer APIs and Bulk Downloads Over Scraping</h3><p>This is a rule that experienced scraping engineers sometimes forget because they are so used to reaching for their scraper toolkit: always check for an official API or bulk download endpoint before writing a scraper.</p><p>Most serious open data portals expose REST APIs or provide direct bulk download links. Using these is better in every dimension: it is faster, more reliable, more respectful of the server, and often gives you cleaner, structured data than you would get from parsing HTML.</p><p>Your workflow should be:</p><ol><li><p>Check the portal&#8217;s documentation for an API.</p></li><li><p>Check for a <code>Sitemap</code> or structured data endpoint (as discussed in our <a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">article on robots.txt and its implications</a>).</p></li><li><p>Check for bulk download links (CSV, JSON, Parquet).</p></li><li><p>Only fall back to HTML scraping if none of the above exist.</p></li></ol><p>Scraping should be your last resort, not your first instinct.</p><h3>Responsible Scraping Behavior for Open Infrastructure</h3><p>When scraping is genuinely the only option, the rules of polite scraping apply. 
But in the case of open data portals, you should apply a higher standard than you would on a commercial target.</p><p>As covered in &#8220;<a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">best practices for ethical web scraping</a>&#8221;, respecting rate limits, introducing delays between requests, and using a descriptive User-Agent are baseline requirements. Because open data portals run on weaker infrastructure, a few additional rules are worth following:</p><ul><li><p><strong>Respect </strong><em><strong>Crawl-delay</strong></em><strong> strictly</strong>: Even though some major crawlers ignore it, on underfunded infrastructure that directive is a good signal of server capacity.</p></li><li><p><strong>Cache responses locally</strong>: If you need to re-run your scraper for testing or debugging, you should not be hitting the server again. Cache what you have already fetched.</p></li><li><p><strong>Scrape during off-peak hours</strong>: For public portals serving researchers and government users, off-peak typically means nights and weekends in the portal&#8217;s local timezone.</p></li><li><p><strong>Scrape only what you need</strong>: This sounds obvious, but it&#8217;s easy to over-collect data &#8220;just in case&#8221;. For open portals, remember that every unnecessary request is a cost imposed on a public resource running on underfunded infrastructure.</p></li></ul><h3>Handling PII in Open Datasets</h3><p>PII stands for Personally Identifiable Information. This refers to any data that can be used, alone or in combination with other data, to identify a specific individual. Think names, email addresses, phone numbers, but also subtler things like IP addresses or device IDs.</p><p>The reality is that most well-maintained open data portals go through a review process before publication, and raw PII in open datasets is not as common as you might think. 
The most common cases where PII can slip through are quite specific: older government datasets published before modern privacy review processes, improperly anonymized academic research deposits, or crowdsourced datasets where contributors included personal details voluntarily.</p><p>Beyond those cases, the real risk for most scraping engineers sits at the aggregation level. A dataset of names, a dataset of ZIP codes, and a dataset of employment records might each be perfectly clean and openly licensed in isolation. But combine them, and you have built a detailed profile of real individuals. This is something that privacy regulations like the GDPR and CPRA were specifically designed to prevent. And once you collect, store, and process that combined data, you become responsible for it, regardless of where it originally came from.</p><p>The key principle remains the usual one: identify and handle PII at collection time. Here is a classification you can use to audit the fields that are likely to contain PII:</p><ul><li><p><strong>Direct identifiers</strong>: names, email addresses, phone numbers, national ID numbers, passport numbers, and social security numbers. These are the clearest cases as they point to a specific individual on their own, without needing to be combined with anything else. If you see these fields in a dataset, there is no ambiguity: you are dealing with PII.</p></li><li><p><strong>Quasi-identifiers</strong>: dates of birth, ZIP codes, job titles, gender, ethnicity, and salary ranges. None of these identifies a person on its own, but they become dangerous in combination. 
A classic example is the combination of just three fields (date of birth, gender, and ZIP code), which Latanya Sweeney&#8217;s well-known study estimated is enough to uniquely identify about 87% of the US population.</p></li><li><p><strong>Sensitive categories under GDPR</strong>: health and medical data, political opinions, religious or philosophical beliefs, biometric data, genetic data, sexual orientation, and trade union membership. This is a legally distinct class that carries stricter obligations regardless of context. In other words, you cannot process this data based on legitimate interest alone. You need explicit consent or another specific legal basis, and the bar is significantly higher than for ordinary PII.</p></li></ul><p>For each PII field, decide upfront: do you need it? If not, drop it at collection time. If you do need it, apply pseudonymization (replacing the identifier with a token that can only be mapped back using a separately stored key) or anonymization (irreversible removal or generalization) before storage.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Python Implementation: Putting the Full Responsible Scraping Pipeline Into Code</h2><p>Principles are only useful if they translate into implementation. 
Below are two concrete components you can adapt for your own pipelines:</p><ul><li><p>Checking a dataset&#8217;s license before downloading anything, using CKAN&#8217;s metadata API, with a practical fallback strategy for portals that don&#8217;t run CKAN.</p></li><li><p>Running PII detection at collection time, using field-level schema classification, with an honest discussion of where that approach has limits.</p></li></ul><p>Note that the examples below omit an API-first fetch pattern and a polite scraper skeleton, even though both are covered in the framework section above. Those problems have well-known, straightforward solutions that every scraping engineer should already be aware of. The following sections focus instead on lesser-known techniques that you can adapt to your own pipelines.</p><h3>Checking a Dataset&#8217;s License Programmatically</h3><p>Many open data portals are built on <a href="https://ckan.org/">CKAN</a>, an open-source data management system used by governments and enterprises. CKAN exposes a REST API whose <code>package_show</code> action returns a dataset&#8217;s metadata, including its license, which makes programmatic license checking straightforward.</p><p>Here is how to query a CKAN-based portal and extract license information before proceeding:</p><pre><code><code>import requests

def check_dataset_license(portal_base_url: str, dataset_id: str) -&gt; dict:
    """
    Queries a CKAN portal API to retrieve license information
    for a given dataset before any scraping begins.
    """
    api_url = f"{portal_base_url}/api/3/action/package_show"
    params = {"id": dataset_id}

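    # Raise on HTTP errors rather than trying to parse an error payload
    response = requests.get(api_url, params=params, timeout=10)
    response.raise_for_status()

    # CKAN wraps the dataset metadata in a top-level "result" object
    data = response.json()
    result = data.get("result") or {}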

    license_info = {
        # "or" also covers keys that are present but null in the API response
        "dataset_name": result.get("title") or "Unknown",
        "license_id": result.get("license_id") or "Not specified",
        "license_title": result.get("license_title") or "Not specified",
        "license_url": result.get("license_url") or "Not specified",
    }

    return license_info
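
# Added sketch, not part of the original example: turn the metadata into a
# go/no-go decision before downloading anything. ALLOWED_LICENSE_IDS is a
# hypothetical allow-list; license IDs differ between portals, so treat these
# values as assumptions and check them against the portal's own license list.
ALLOWED_LICENSE_IDS = {"cc-zero", "cc-by", "uk-ogl"}

def is_license_cleared(license_info, allowed_ids=ALLOWED_LICENSE_IDS):
    """Returns True only if the license ID is explicitly on the allow-list."""
    license_id = str(license_info.get("license_id") or "").strip().lower()
    # Unknown or missing licenses fail closed: no license is not permission
    return license_id in allowed_ids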

# Example: querying the UK government's open data portal
portal = "https://data.gov.uk"
dataset = "road-accidents-safety-data"

license_info = check_dataset_license(portal, dataset)

print(f"Dataset: {license_info['dataset_name']}")
print(f"License: {license_info['license_title']}")
print(f"License ID: {license_info['license_id']}")
print(f"License URL: {license_info['license_url']}")</code></code></pre><p>Which outputs the following:</p><pre><code><code>Dataset: Road Safety Data
License: UK Open Government Licence (OGL)
License ID: uk-ogl
License URL: https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/</code></code></pre><p>With this information in hand, you can make an informed decision before a single byte of dataset content is downloaded. In this case, you can read the terms directly on the <a href="https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/">Open Government Licence page</a>. The image below shows part of that page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hLlt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hLlt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 424w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 848w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1272w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hLlt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png" width="1211" height="690" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:690,&quot;width&quot;:1211,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135747,&quot;alt&quot;:&quot;The license page of the National Archive of the UK Government by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196400924?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The license page of the National Archive of the UK Government by Federico Trotta" title="The license page of the National Archive of the UK Government by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!hLlt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 424w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 848w, 
https://substackcdn.com/image/fetch/$s_!hLlt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1272w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The license page of the National Archive of the UK Government</figcaption></figure></div><p>But what if the portal you 
need to scrape doesn&#8217;t run CKAN? Not all open data portals do. <a href="https://dev.socrata.com/">Socrata</a> (used by many US city and state governments), <a href="https://getdkan.org/">DKAN</a>, and custom-built portals each expose a different metadata API, or none at all. In those cases, your fallback options are the following:</p><ul><li><p>Check for a <em>LICENSE</em> or <em>METADATA</em> file in the dataset&#8217;s root directory or bulk download package. Many portals include one.</p></li><li><p>Look for a <em>&lt;link rel=&quot;license&quot;&gt;</em> tag in the dataset&#8217;s HTML page, which some portals emit as structured metadata.</p></li><li><p>Check the portal&#8217;s documentation or &#8220;About&#8221; page, where license terms are often stated globally for all datasets.</p></li></ul><p>If none of the above yields a clear answer, treat the license as unknown and do not redistribute the data without explicit written permission from the dataset owner. A short email asking for clarification is a professional move.</p><div><hr></div><h2>PII Detection at Scrape Time</h2><p>The right approach to PII detection depends heavily on what you know about the data you need to scrape. You will encounter two situations in practice, each calling for a different strategy:</p><ul><li><p><strong>You know the schema</strong>: If you are retrieving structured data, field-level detection is the right approach. You know which fields are likely to carry PII, so you can target them directly. This is faster, more precise, and produces far fewer false positives than running a general NER model over free text.</p></li><li><p><strong>You have no schema</strong>: For unstructured data, NER-based detection is a reasonable starting point, but go in with realistic expectations. A common choice is <a href="https://spacy.io/models/en">spaCy&#8217;s </a><em><a href="https://spacy.io/models/en">en_core_web_sm</a></em>, a small general-purpose model trained on written web text such as blogs, news, and comments, so it can miss or mislabel entities in domain-specific data. Another approach, which can give much better results, is <a href="https://substack.thewebscraping.club/p/how-to-use-llms-data-extraction-unstructured-text">using LLMs to give a structure to unstructured text</a>.</p></li></ul><p>For the structured case, here is a field-level PII detection pipeline:</p><pre><code><code>import re
import hashlib
from dataclasses import dataclass
from typing import Any

# Fields that are unambiguously PII on their own
DIRECT_IDENTIFIER_FIELDS = {
    "name", "full_name", "first_name", "last_name",
    "email", "email_address",
    "phone", "phone_number", "mobile",
    "ssn", "national_id", "passport_number",
    "ip_address", "device_id"
}

# Fields that are not PII alone but dangerous in combination
QUASI_IDENTIFIER_FIELDS = {
    "date_of_birth", "dob", "birth_date",
    "zip_code", "postcode", "zip",
    "gender", "sex",
    "job_title", "occupation",
    "salary", "income",
    "ethnicity", "race"
}

# Regex patterns for validating suspected PII values at the content level
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
PHONE_PATTERN = re.compile(r"\b(\+?\d[\d\s\-().]{7,}\d)\b")

@dataclass
class FieldAudit:
    field_name: str
    classification: str    # "direct", "quasi", or "clean"
    original_value: Any
    processed_value: Any   # pseudonymized, generalized, or original
    action_taken: str       # "pseudonymized", "generalized", "dropped", "kept"

def pseudonymize(value: Any) -&gt; str:
    """
    Replaces a PII value with a consistent, one-way token.
    Using a hash means the same value always produces the same token,
    which preserves referential integrity across records (e.g., you can
    still count unique users without knowing who they are).
    In production, use an HMAC with a secret key instead of plain SHA-256.
    """
    return hashlib.sha256(str(value).encode()).hexdigest()[:16]
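
# --- Illustrative sketch, not part of the original pipeline ---
# Keyed variant of pseudonymize(), following the docstring's advice to use an
# HMAC in production. The function name and the way the key is supplied are
# assumptions made for this example; load the key from a secrets manager or an
# environment variable, never from source code.
import hmac

def pseudonymize_keyed(value: object, key: bytes) -> str:
    """Returns a deterministic token that cannot be recomputed without the key."""
    digest = hmac.new(key, str(value).encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# Hypothetical usage, with secret_key loaded elsewhere:
# token = pseudonymize_keyed("jane@example.com", secret_key)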

def generalize_date(value: str) -&gt; str:
    """
    Reduces a full date of birth to a birth year only.
    A simple but effective generalization for quasi-identifiers.
    """
    # Handles common formats: YYYY-MM-DD, DD/MM/YYYY, MM/DD/YYYY
    match = re.search(r"\b(19|20)\d{2}\b", str(value))
    return match.group(0) if match else "UNKNOWN_YEAR"
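
# --- Illustrative sketch, not part of the original pipeline ---
# Postal codes can often be generalized instead of pseudonymized, which keeps a
# coarse geographic signal. The helper name and the truncation rules below are
# assumptions; adapt them to the postal formats present in your dataset.
def generalize_postal_code(value: object) -> str:
    """Reduces a postal code to a coarse area prefix."""
    text = str(value).strip().upper()
    if " " in text:
        # UK-style "SW1A 1AA": keep only the outward code ("SW1A")
        return text.split()[0]
    if len(text) >= 5 and text[:5].isdigit():
        # US-style ZIP or ZIP+4: keep the first three digits
        return text[:3] + "XX"
    return "UNKNOWN_AREA"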

def audit_record(record: dict) -&gt; tuple[dict, list[FieldAudit]]:
    """
    Processes a single structured record field by field.
    Returns a cleaned record and a full audit trail of what was done to each field.

    Strategy:
    - Direct identifiers: pseudonymize (preserve referential integrity)
    - Quasi-identifiers: generalize where possible, pseudonymize otherwise
    - Everything else: pass through unchanged
    """
    clean_record = {}
    audit_trail = []

    for field_name, value in record.items():
        normalized = field_name.lower().strip()

        if normalized in DIRECT_IDENTIFIER_FIELDS:
            processed = pseudonymize(value)
            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="direct",
                original_value=value,
                processed_value=processed,
                action_taken="pseudonymized"
            ))
            clean_record[field_name] = processed

        elif normalized in QUASI_IDENTIFIER_FIELDS:
            # Apply field-specific generalization where we can
            if normalized in {"date_of_birth", "dob", "birth_date"}:
                processed = generalize_date(value)
                action = "generalized"
            else:
                # For other quasi-identifiers, pseudonymize as a safe default
                processed = pseudonymize(value)
                action = "pseudonymized"

            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="quasi",
                original_value=value,
                processed_value=processed,
                action_taken=action
            ))
            clean_record[field_name] = processed

        else:
            # Field is not in either PII list &#8212; pass through, but still
            # run a regex check on string values as a safety net
            if isinstance(value, str):
                if EMAIL_PATTERN.search(value) or PHONE_PATTERN.search(value):
                    # Unexpected PII in a non-PII field: flag it and pseudonymize
                    processed = pseudonymize(value)
                    audit_trail.append(FieldAudit(
                        field_name=field_name,
                        classification="direct",
                        original_value=value,
                        processed_value=processed,
                        action_taken="pseudonymized (unexpected PII in non-PII field)"
                    ))
                    clean_record[field_name] = processed
                    continue

            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="clean",
                original_value=value,
                processed_value=value,
                action_taken="kept"
            ))
            clean_record[field_name] = value

    return clean_record, audit_trail

def process_records(records: list[dict]) -&gt; list[dict]:
    """
    Runs field-level PII detection and handling across a list of records.
    Prints an audit summary for any record where PII was found.
    """
    clean_records = []

    for i, record in enumerate(records):
        clean_record, audit_trail = audit_record(record)
        pii_fields = [a for a in audit_trail if a.classification != "clean"]

        if pii_fields:
            print(f"Record {i}: PII detected and handled in {len(pii_fields)} field(s):")
            for audit in pii_fields:
                print(f"  [{audit.classification.upper()}] {audit.field_name} "
                      f"&#8594; {audit.action_taken}")

        clean_records.append(clean_record)

    return clean_records

# Example: a batch of records from a scraped open dataset
records = [
    {
        "record_id": "A001",
        "name": "Jane Doe",
        "date_of_birth": "1985-03-22",
        "zip_code": "SW1A 1AA",
        "incident_type": "Road accident",
        "severity": "Slight"
    },
    {
        "record_id": "A002",
        "name": "John Smith",
        "date_of_birth": "1973-11-04",
        "zip_code": "EC1A 1BB",
        "incident_type": "Road accident",
        "severity": "Serious",
        # An email that slipped into a free-text notes field
        "notes": "Witness contact: witness@example.com"
    }
]

clean = process_records(records)</code></code></pre><p>The output is the following:</p><pre><code><code>Record 0: PII detected and handled in 3 field(s):
  [DIRECT] name &#8594; pseudonymized
  [QUASI] date_of_birth &#8594; generalized
  [QUASI] zip_code &#8594; pseudonymized
Record 1: PII detected and handled in 4 field(s):
  [DIRECT] name &#8594; pseudonymized
  [QUASI] date_of_birth &#8594; generalized
  [QUASI] zip_code &#8594; pseudonymized
  [DIRECT] notes &#8594; pseudonymized (unexpected PII in non-PII field)</code></code></pre><p>A few things are worth calling out in this implementation:</p><ul><li><p><strong>Pseudonymization preserves referential integrity:</strong> Because the same input always produces the same hash token, you can still count unique individuals, join records, or track entities across datasets without storing the raw PII. In production, replace the plain SHA-256 with an HMAC keyed on a secret, so that someone who knows the hashing algorithm cannot recover the original values by hashing guesses and comparing the tokens.</p></li><li><p><strong>The regex safety net covers non-PII fields:</strong> This catches the common real-world case where PII slips into a free-text or notes field that your schema classification didn&#8217;t anticipate. It is not foolproof, but it catches the obvious cases.</p></li><li><p><strong>The audit trail is intentional:</strong> Every field-level decision is logged. If you are ever asked to demonstrate that your collection process handled PII responsibly, you have a record of exactly what was done to each field in each record.</p></li></ul><h2>Conclusion</h2><p>Open data is a shared resource, and how you interact with it says something about you as a professional. In this article, you learned what &#8220;open&#8221; means in the context of data scraping and how you should treat open data if you want to be an ethical scraper.</p><p>So, let us know: Did we miss something? What&#8217;s your approach to handling open datasets in your scraping projects? Let&#8217;s discuss in the comments.</p>]]></content:encoded></item><item><title><![CDATA[Using Web Scraping in Finance to Discover Investment Insights]]></title><description><![CDATA[Tired of guessing? 
Use web scraping to make data-backed financial decisions!]]></description><link>https://substack.thewebscraping.club/p/web-scraping-in-finance</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/web-scraping-in-finance</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 17 May 2026 16:03:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8d7b98ff-dc95-41cf-bc83-5cfa5241ed96_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve ever invested, you know how challenging it can be (even if you don&#8217;t <em>YOLO</em> all your money into a single stock, lol). Thankfully, things get a lot easier when you build data-powered processes to guide your decision-making.</p><p>No wonder nearly half a trillion dollars are spent every year by financial firms on technology. Now, you probably don&#8217;t have that kind of money in the first place (and if you do, you don&#8217;t need to invest much anyway), but you might still want to collect financial data for personal use, research, academic projects, backtesting, or even just for selling it to industry giants.</p><p>No matter what you want to do with scraped financial data, there are a few pivotal tips to understand before embarking on this journey, which is exactly what I will explain here!</p><p>In this blog post, I will show why web scraping and finance are a match made in heaven and cover everything you need to know about retrieving both historical and real-time financial data from the web.</p><h2>Web Scraping + Finance: A Happy Marriage</h2><p>Before diving into web scraping for finance, let me explain why this is such a powerful approach and the advantages you can gain from it.</p><h3>Finance Runs on (Web) Data</h3><p>If there&#8217;s one thing that&#8217;s become clear over the past decade, it&#8217;s this: <a href="https://www.acceldata.io/blog/the-critical-role-of-data-in-finance">finance runs on 
data!</a></p><p>Financial institutions process massive volumes of market, customer, and transactional data every single day. In finance, data powers everything, from investment strategies to risk management. And the stakes are high, as <a href="https://www.gartner.com/smarterwithgartner/how-to-improve-your-data-quality">poor data quality alone costs organizations an average of $12.9 million per year</a>!</p><p>Data drives real-time decision-making, predictive modeling, and scenario planning. Finance teams feed that data into pipelines built around statistical analysis, machine learning, and <a href="https://substack.thewebscraping.club/p/detect-pattern-scraped-data-with-ai">AI to identify patterns</a>, forecast market movements, and manage uncertainty in increasingly complex environments.</p><p>Now, here&#8217;s the central question that we web scraping enthusiasts are all interested in: <em>where does most of that data actually come from?</em> A big portion of it comes from the web (not that surprising, huh?).</p><p>I&#8217;m talking about news sites, financial portals, company pages, official exchange websites, regulatory filings, institutional reports, and more. The web is essentially the largest and most dynamic data source available for financial purposes.</p><p>That&#8217;s exactly why web scraping in finance isn&#8217;t just useful. 
It&#8217;s foundational!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your scraping needs.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h3>Benefits of a Data-Driven Approach in Finance</h3><p>Keep in mind that it&#8217;s not just big corporations or financial firms that benefit from data. Even individual retail investors can leverage financial data scraping to gain an edge. In particular, the main advantages include:</p><ul><li><p><strong>Informed decisions</strong>: Access to accurate historical data supports smarter investment decisions, while real-time data enables sounder trading choices.</p></li><li><p><strong>Market trend insights</strong>: Spot patterns and emerging trends before the wider market does.</p></li><li><p><strong>Risk management</strong>: Identify potential risks early and adjust strategies proactively.</p></li><li><p><strong>Portfolio optimization</strong>: <a href="https://substack.thewebscraping.club/p/llm-fine-tuning-for-scraping">Fine-tune asset allocation</a> based on backtesting and up-to-date market and company data.</p></li><li><p><strong>Efficiency and speed</strong>: Automate data collection, reducing time spent on manual research.</p></li></ul><p>I mean, financial firms wouldn&#8217;t be <a href="https://www.forrester.com/blogs/us-financial-services-tech-spending-hits-495-billion/">spending about $495 billion a year</a> (yeah, you read that right!) 
on technology (mostly built around collecting, processing, and leveraging data) if it didn&#8217;t give them a real edge!</p><h3>Getting vs Selling Financial Web Data: High-Level Overview</h3><p>There&#8217;s no doubt that financial firms invest billions into data. But what about you, as a web scraping expert, <em>how can you leverage financial data for potential gain?</em> There are two high-level approaches:</p><ol><li><p><strong>For yourself or your company</strong>: Build custom web scraping pipelines to gather data from multiple sources. Use it to <a href="https://substack.thewebscraping.club/p/the-evolving-career-scrapers">feed investment models, AI agents</a>, trading algorithms, or analytics pipelines. This is usually highly tailored to your strategies, risk appetite, or operational goals.</p></li><li><p><strong>To sell to financial services</strong>: Collect, aggregate, and potentially enrich data from various sources to sell. You can offer broad datasets for many clients or fully customized solutions for a specific customer&#8217;s needs.</p></li></ol><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo 
Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>How to Approach Financial Data Scraping: Historical vs Real-Time</h2><p>When it comes to finance, the web is packed with countless data fields and categories (e.g., news, stock prices, filings, analyst reports, and more). It&#8217;s a huge industry, and almost anything can be scraped!</p><p>At a high level, though, the key distinction for web scraping is simple: the financial data you want to collect is either historical or real-time. That distinction is what actually changes how you approach scraping.</p><p>In the following sections, I&#8217;ll dive deeper into each of the two categories of financial data. I&#8217;ll cover which fields are most interesting to scrape, where to find them, and how to collect them efficiently and effectively.</p><p>For now, let&#8217;s start with a brief introduction to historical and real-time financial web data scraping!</p><h3>Historical Financial Web Data</h3><p>This includes all past financial data collected from the web, from historical stock prices to inflation rates and archived news. 
It&#8217;s used for analysis that supports long-term investment decisions.</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Enables backtesting of investment and trading strategies.</p></li><li><p>Easier to scrape, as it isn&#8217;t time-sensitive.</p></li><li><p>Data itself is stable and doesn&#8217;t change over time&#8230;</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>&#8230;but the web pages displaying it (e.g., in tables and static charts) can still change, breaking your static parsing logic.</p></li><li><p>Misses recent market shifts or breaking events.</p></li><li><p>Data completeness varies across websites, often requiring aggregation from multiple sources.</p></li></ul><h3>Real-Time Financial Web Data</h3><p>This includes live financial data extracted from the web, such as stock prices, market news, and order books. It&#8217;s used for trading and short-term investment decisions.</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Enables fast, data-driven trading decisions.</p></li><li><p>Captures live market movements and breaking news.</p></li><li><p>Can be passed to AI agents and pipelines directly, as it tends to require minimal preprocessing.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>Harder to scrape reliably due to latency constraints and <a href="https://substack.thewebscraping.club/p/rate-limit-scraping-exponential-backoff">rate limits</a>.</p></li><li><p>Requires robust infrastructure for real-time ingestion and analysis, as every second counts.</p></li><li><p>Data storage can grow rapidly because new data arrives continuously.</p></li></ul><h2>Mastering Historical Financial Data Scraping</h2><p>As promised, let me guide you through the world of scraping historical financial data from the web.</p><h3>Main Types of Historical Financial Web Data</h3><p>The most important types of historical financial data you can retrieve from websites are:</p><ul><li><p><strong>Historical stock and commodity 
prices</strong>: Open, high, low, close (OHLC) prices and trading volumes for stocks, ETFs, indices, and commodities, used for <a href="https://substack.thewebscraping.club/p/predictive-analytics-web-scraped-data">time-series analysis, modeling, and predictions</a>.</p></li><li><p><strong>Summary info and infographics</strong>: Stock profiles, key metrics, and past indicators (e.g., P/E, EPS, moving averages), presented in dashboards or visual charts for quick insights.</p></li><li><p><strong>Macroeconomic indicators</strong>: Inflation, GDP, interest rates, unemployment, CPI, and PCE data, essential for understanding economic cycles and long-term market behavior.</p></li><li><p><strong>Financial statements</strong>: Company filings (income statements, balance sheets, cash flow), utilized for fundamental analysis and valuation models.</p></li><li><p><strong>News data</strong>: Archived headlines and press releases analyzed via NLP to correlate past market movements with specific events and sentiment shifts.</p></li><li><p><strong>ESG scores and sustainability reports</strong>: Historical environmental, social, and governance metrics employed to assess how &#8220;green&#8221; or ethical a company has been over time.</p></li><li><p><strong>Alternative data</strong>: Non-traditional datasets like web traffic, social media, satellite imagery (e.g., new headquarters or production plants), or credit card data for early performance signals.</p></li></ul><h3>Most Popular Targets</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!grNv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!grNv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 424w, https://substackcdn.com/image/fetch/$s_!grNv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 848w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!grNv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png" width="1456" height="1184" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:228005,&quot;alt&quot;:&quot;Popular historical financial data scraping 
sources&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192509947?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Popular historical financial data scraping sources" title="Popular historical financial data scraping sources" srcset="https://substackcdn.com/image/fetch/$s_!grNv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 424w, https://substackcdn.com/image/fetch/$s_!grNv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 848w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 
4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Popular historical financial data scraping sources</figcaption></figure></div><p>Also, if you&#8217;re interested in how to scrape historical data from the Wayback Machine, <a href="https://substack.thewebscraping.club/p/scraping-wayback-machine">read my previous guide for this newsletter!</a></p><h3>Scraping Techniques</h3><p>Typical examples of historical financial data include lists of open, high, low, and close prices for a given stock:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S5x0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!S5x0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 424w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 848w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1272w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S5x0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png" width="1456" height="750" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;NVDA historical stock data (Source: Yahoo Finance)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="NVDA historical stock data (Source: Yahoo Finance)" title="NVDA historical stock data (Source: Yahoo Finance)" 
srcset="https://substackcdn.com/image/fetch/$s_!S5x0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 424w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 848w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1272w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">NVDA historical stock data (Source: Yahoo Finance)</figcaption></figure></div><p>Another example is the historical returns of a specific index (e.g., the S&amp;P 500) over time:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XcNT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XcNT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 424w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 848w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1272w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XcNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png" width="1456" height="979" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:979,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)" title="100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)" srcset="https://substackcdn.com/image/fetch/$s_!XcNT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 424w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 848w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XcNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)</figcaption></figure></div><p>These cases fall into the category of table-based data scraping, one of the most common web scraping scenarios. You&#8217;re probably already familiar with it, so there&#8217;s no need to go too deep here. 
Scraping older news and media can be slightly more challenging due to the unstructured nature of the target data, but it&#8217;s still a simple task.</p><p>At a high level, the process for getting historical financial data via web scraping follows a standard workflow:</p><ol><li><p>Visit the target web page, either via an HTTP client or a browser automation tool.</p></li><li><p>Parse the page using an HTML parser, either directly or after rendering in a controlled browser.</p></li><li><p>Select the HTML elements of interest and extract the data.</p></li><li><p>Store the scraped data in your desired format (e.g., XLS, CSV, JSON) or in a database.</p></li></ol><p>The main challenges involve generic anti-scraping mechanisms, such as CAPTCHAs, WAFs, and IP bans, as well as <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">browser</a>, TLS, and <a href="https://substack.thewebscraping.club/p/what-is-device-fingerprint">device fingerprinting</a>.</p><h3>Best Practices</h3><p>Based on my experience with financial web scraping, especially when focusing on historical data, these are the tips you should apply:</p><ul><li><p><strong>Normalize and validate data</strong>: Standardize formats (dates, currencies, units) and validate across sources to catch inconsistencies early.</p></li><li><p><strong>Be cautious with AI parsing</strong>: Avoid <a href="https://substack.thewebscraping.club/p/llms-ai-web-scraping">using AI for automatically parsing structured data</a> (tables, metrics, structured fields). It can introduce subtle errors and hallucinations, so prefer deterministic parsing. Use AI mainly for extracting information from unstructured text, such as news.</p></li><li><p><strong>Store raw HTML snapshots</strong>: Always keep the original page HTML. 
It lets you <a href="https://substack.thewebscraping.club/p/offline-web-scraping">re-parse data later and extract new signals without re-scraping</a>.</p></li><li><p><strong>Avoid single-source bias</strong>: When scraping news or market analysis pieces, pull data from multiple sources to reduce bias and improve reliability.</p></li><li><p><strong>Handle pagination properly</strong>: Many sites split historical data across pages or date ranges. Make sure your scraper fully traverses them all.</p></li><li><p><strong>Respect rate limits and retries</strong>: Even for historical data, implement retries and throttling to avoid blocks and incomplete datasets.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Understanding Real-Time Financial Data Scraping</h2><p>This is where things get a bit more interesting. 
Let me introduce you to real-time financial scraping!</p><h3>Main Types of Real-Time Financial Web Data</h3><p>The most relevant types of real-time financial web data are:</p><ul><li><p><strong>Live price tickers</strong>: Continuously updated &#8220;last trade&#8221; prices and bid/ask spreads for stocks, crypto, and forex, used to detect breakouts and short-term trading opportunities.</p></li><li><p><strong>Order book and market depth</strong>: Incoming buy/sell orders, liquidity levels, and spreads, fundamental for execution strategies and high-frequency trading.</p></li><li><p><strong>Breaking news</strong>: Immediate updates and announcements that trigger sentiment models as soon as key figures (CEOs, central banks, governments) release information.</p></li><li><p><strong>Corporate event triggers</strong>: Monitoring press releases or SEC feeds for earnings surprises, M&amp;A rumors, or sudden executive changes.</p></li><li><p><strong>Social media signals</strong>: <a href="https://substack.thewebscraping.club/p/how-to-scrape-reddit-with-scrapy">Tracking ticker mentions on platforms like Reddit</a> or X to detect retail-driven momentum, hype cycles, or panic selling in near real time.</p></li><li><p><strong>Institutional &#8220;whale&#8221; activity</strong>: Observing large trades or major wallet movements (especially in crypto) to identify where significant capital is flowing.</p></li><li><p><strong>Alternative digital signals</strong>: Web traffic spikes, app store ranking changes, or &#8220;out of stock&#8221; alerts on retail sites as proxies for real-world demand.</p></li></ul><p>As you can tell, this category is more varied than historical financial data, including social media tracking and other less conventional practices. 
Thus, the sources to monitor for live financial web scraping can be less standardized and intuitive.</p><h3>Most Popular Targets</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UIN0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UIN0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 424w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 848w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UIN0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png" width="1456" height="1487" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/adedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1487,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:289145,&quot;alt&quot;:&quot;Popular live financial data scraping sources&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192509947?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Popular live financial data scraping sources" title="Popular live financial data scraping sources" srcset="https://substackcdn.com/image/fetch/$s_!UIN0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 424w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 848w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Popular live financial data scraping sources</figcaption></figure></div><h3>Scraping Techniques</h3><p>Imagine applying a traditional scraping pattern to real-time financial data. You send a request to a target site, extract a stock price, and repeat the operation every few seconds or even milliseconds.</p><p>The problem is latency. By the time the server has responded, the page has been rendered or parsed, and the target field has been extracted and stored or sent to your pipeline, that piece of data is already outdated.</p><p>On top of that, this approach requires a crazy number of requests in a very short time. 
That increases the risk of triggering rate limiting or even IP bans. You might think proxies solve that through IP rotation, but most proxy networks introduce additional latency, often 2 to 5 seconds per request. In real-time scenarios, that delay is simply not acceptable!</p><p>Even if you <a href="https://substack.thewebscraping.club/p/choosing-proxy-provider-scraping">switch to faster or dedicated proxies</a>, you may end up with a smaller IP pool, which increases the likelihood of those IPs getting blocked.</p><p>A more advanced idea is to rely on browser automation and keep a page open, capturing updates as they happen. This is smarter, but still problematic. Long-lived sessions with little or no user interaction look highly suspicious and can easily trigger anti-bot systems. Plus, browser automation at scale tends to be flaky and is not reliable enough for persistent connections.</p><p>Long story short, scraping real-time financial data this way quickly turns into a losing game.</p><p>The solution? Stop targeting the data presentation layer in HTML and instead go directly to the data source!</p><h4>API/WebSocket Scraping as the Solution</h4><p>Web pages showing real-time financial data aren&#8217;t doing anything magical. Behind the scenes, they either poll APIs at regular intervals or (more commonly) <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSocket">maintain a persistent connection via WebSockets</a> to receive continuous updates. 
The page simply renders that incoming data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q9lT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q9lT!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 424w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 848w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1272w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif" width="1080" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the live price 
update&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the live price update" title="Note the live price update" srcset="https://substackcdn.com/image/fetch/$s_!Q9lT!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 424w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 848w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1272w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 
15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the live price update</figcaption></figure></div><p>As a result, a much better approach is to intercept and replicate those data flows. You can do this through <a href="https://substack.thewebscraping.club/p/apis-in-web-scraping">AJAX/API request inspection</a> or WebSocket sniffing. Open the browser developer tools, go to the &#8220;Network&#8221; tab, and check where the data is coming from.</p><p>If it&#8217;s an API call, you&#8217;ll see it under the &#8220;Fetch/XHR&#8221; tab:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d22T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d22T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 424w, 
https://substackcdn.com/image/fetch/$s_!d22T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 848w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1272w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d22T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png" width="1456" height="1180" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1180,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the API used by Yahoo Finance to determine whether the market is open in real time&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the API used by Yahoo Finance to determine whether the market is open in real time" title="Note the API used by Yahoo Finance to determine whether the market is open in real time" 
srcset="https://substackcdn.com/image/fetch/$s_!d22T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 424w, https://substackcdn.com/image/fetch/$s_!d22T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 848w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1272w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the API used by Yahoo Finance to determine whether the market is open in real time</figcaption></figure></div><p>If it&#8217;s a WebSocket, you&#8217;ll find it under the &#8220;Socket&#8221; tab:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!quUU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!quUU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 424w, https://substackcdn.com/image/fetch/$s_!quUU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 848w, https://substackcdn.com/image/fetch/$s_!quUU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1272w, https://substackcdn.com/image/fetch/$s_!quUU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!quUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png" width="1456" height="778" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/171d48c6-665a-4753-86bb-c30793609101_3059x1634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:778,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page" title="Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page" srcset="https://substackcdn.com/image/fetch/$s_!quUU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 424w, https://substackcdn.com/image/fetch/$s_!quUU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 848w, https://substackcdn.com/image/fetch/$s_!quUU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1272w, 
https://substackcdn.com/image/fetch/$s_!quUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page</figcaption></figure></div><p>Once identified, replicate those API calls or connect directly to the WebSocket in your scraping script. 
This gives you access to near real-time financial data in a structured format (typically JSON) without the overhead of parsing HTML.</p><p>Of course, that&#8217;s not trivial. <a href="https://substack.thewebscraping.club/p/websocket-bot-detection-scraping">WebSockets require proper anti-bot bypass</a>, and APIs may still enforce rate limits, tracking, and TLS fingerprinting protections. However, this approach is generally faster, more reliable, and much easier to maintain than scraping rendered pages!</p><h4>And What About Live News or Social Media Scraping?</h4><p>When it comes to news, if available, it makes sense to connect to public RSS feeds exposed by websites to monitor updates. This allows you to trigger scraping only when new and relevant content is published, instead of constantly polling pages unnecessarily.</p><p>Otherwise, you can build a polling mechanism that periodically checks news sites, social media platforms, and similar sources to capture fresh data. In these cases, you usually can&#8217;t rely on techniques like API or WebSocket scraping, as that&#8217;s not how those platforms fetch data.</p><p>Instead, you need a solid and robust infrastructure built around speed and efficiency: fast connections, high-quality proxies, optimized parsing, and lightweight requests. 
The goal is to minimize latency while maintaining reliability at scale.</p><h3>Best Practices</h3><p>Scraping real-time financial data is demanding, but it becomes easier with the following best practices:</p><ul><li><p><strong>Prefer APIs and WebSockets over HTML parsing</strong>: Whenever possible, save time by extracting data directly from the underlying APIs or WebSocket streams used by web pages instead of scraping it from rendered pages.</p></li><li><p><strong>Choose clean, structured sources</strong>: Prioritize endpoints that return well-formatted JSON to minimize preprocessing and reduce latency.</p></li><li><p><strong>Stream data into pipelines immediately</strong>: Send incoming data directly to processing pipelines for real-time insights, while storing it in parallel for later analysis.</p></li><li><p><strong>Use specialized AI for sentiment analysis</strong>: Prefer AI/ML models tuned for finance and social media, as Reddit and X content often includes slang, memes, and non-standard language.</p></li><li><p><strong>Optimize browser automation</strong>: Configure Playwright, Selenium, or similar browser automation tools to block images, stylesheets, and fonts. 
This reduces bandwidth usage and significantly speeds up rendering time.</p></li><li><p><strong>Design for low latency</strong>: Optimize your stack (async requests, streaming ingestion, fast JSON parsers) to minimize delays, as even milliseconds matter.</p></li><li><p><strong>Prefer high-quality premium proxies</strong>: Count on <a href="https://substack.thewebscraping.club/p/how-many-ip-needed-scraping">proxy providers with a proven track record of fast, stable connections</a> to minimize latency and avoid disruptions.</p></li><li><p><strong>Time-synchronize everything</strong>: Append timestamps to all scraped data to enable time-series analysis and accurately reconstruct events.</p></li><li><p><strong>Build fault-tolerant systems</strong>: Expect disconnections (especially with WebSockets) and other transient failures, so add reconnection logic and configure fallback data sources.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Top 5 Open-Source Financial Web Scraping Libraries</h3><p>Below is a curated set of fully open-source libraries, packages, and projects that simplify financial web scraping:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XR35!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!XR35!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 424w, https://substackcdn.com/image/fetch/$s_!XR35!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 848w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1272w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XR35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png" width="1456" height="1136" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1136,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:267258,&quot;alt&quot;:&quot;Top open-source financial web scraping 
libraries&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192509947?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Top open-source financial web scraping libraries" title="Top open-source financial web scraping libraries" srcset="https://substackcdn.com/image/fetch/$s_!XR35!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 424w, https://substackcdn.com/image/fetch/$s_!XR35!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 848w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1272w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 
4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Top open-source financial web scraping libraries</figcaption></figure></div><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Conclusion</h2><p>Here, I&#8217;ve gone down the rabbit hole of financial web scraping, the task of collecting finance-related data from the Internet. This is one of the main use cases of corporate web scraping, powering enterprise data pipelines for decision-making and market analysis.</p><p>As you&#8217;ve seen, the main difference in approach comes down to whether you&#8217;re targeting historical or real-time data. The former follows standard web scraping practices you&#8217;re likely already familiar with. The latter is trickier and requires more advanced techniques.</p><p>I hope you found this helpful and insightful. If you have questions, feel free to share them in the comments below!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #104: Bypassing AWS WAF on IMDB with Scrapling]]></title><description><![CDATA[A hands-on test of tools for TLS spoofing and Scrapling]]></description><link>https://substack.thewebscraping.club/p/bypassing-aws-waf-with-scrapling</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/bypassing-aws-waf-with-scrapling</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 14 May 2026 22:23:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e0fccfe7-d622-4fe8-a6d2-d99c1a73a9d9_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AWS WAF is the protection we run into most often on Amazon&#8217;s public properties. 
It also sits in front of a long tail of third-party sites whose operators built on AWS and clicked the WAF checkbox. We wrote about it two years ago in <a href="https://substack.thewebscraping.club/p/bypassing-aws-waf-scraping">The Lab #53: Bypassing AWS WAF</a>, but the target there, Traveloka, used DataDome on top of AWS WAF, so our analysis had to account for both systems at once.</p><p>This time, we wanted AWS WAF on its own, guarding a target with no other protection in front of it, and we wanted to see what changes when the 2024 Scrapy-Playwright stack is replaced with the 2026 toolbox.</p><p>The target we picked is <a href="https://www.imdb.com">imdb.com</a>. It is an Amazon subsidiary that runs a standard AWS WAF deployment, and Wappalyzer detects no other anti-bot solution on the website. That makes IMDB a perfect test case for this article.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. Their set of solutions covers all your scraping needs.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Today we&#8217;ll test three Python HTTP clients with strong TLS fingerprint impersonation: <code>curl_cffi</code>, the newer <code>httpx-curl-cffi</code>, and the Rust-backed <code>rnet</code>. Each one produces a TLS handshake indistinguishable from that of real Chrome. Is that enough to scrape an AWS WAF target without spinning up a browser? And if not, what is the smallest browser step that gets us past the gate so the rest of the work can run on a cheap HTTP client?</p><h2>The tools we used</h2><p>Four libraries are in scope. Three are HTTP-only; one runs a real browser.</p><p><strong><a href="https://github.com/yifeikong/curl_cffi">curl_cffi</a></strong> is a Python binding for <code>curl-impersonate</code>, a patched build of curl. It exposes a requests-like API, ships impersonation profiles for recent Chrome, Firefox, and Safari builds, and works at the TLS layer: JA3 and JA4 fingerprints match the impersonated browser, along with HTTP/2 settings and header order. 
We tested with <code>chrome142</code>, the latest Chrome profile shipped in version 0.14.0.</p><p><strong><a href="https://github.com/vgavro/httpx-curl-cffi">httpx-curl-cffi</a></strong> is a transport for <code>httpx</code> that delegates the actual HTTP work to <code>curl_cffi</code>. It does not add any new fingerprinting capability, but it exposes the <code>httpx</code> programming model: sync <code>Client</code>, async <code>AsyncClient</code>, event hooks, and the same response object you get from the rest of an <code>httpx</code>-based codebase. We tested with the Chrome profile and <code>default_headers=True</code>.</p><p><strong><a href="https://github.com/0x676e67/rnet">rnet</a></strong> is a Rust HTTP client with Python bindings. It implements its own impersonation stack rather than wrapping <code>curl-impersonate</code>. The enum <code>rnet.Impersonate</code> exposes a wide range of Chrome, Firefox, Safari, Edge, Opera, and OkHttp profiles. We tested with <code>Chrome137</code>.</p><p><strong><a href="https://github.com/D4Vinci/Scrapling">Scrapling</a></strong> is the only browser-driven tool in the set. Our <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide">Scrapling: A Complete Hands-On Guide</a> goes through the library in depth, with Cloudflare as the test target. Its <code>StealthyFetcher</code> drives a stealth-patched Chromium that runs JavaScript and applies fingerprint countermeasures. The library README only advertises Cloudflare Turnstile, but the same machinery handles AWS WAF&#8217;s challenge too.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h2>How AWS WAF protects IMDB</h2><p>A quick introduction to the system helps interpret the results that follow. AWS WAF is not a dedicated anti-bot platform like DataDome or Kasada. It is a general-purpose web application firewall with a bot-control module that operators can enable per rule. When the bot-control rule is in challenge mode, AWS WAF inserts a single JavaScript gate at the start of a session.</p><p>A request without a valid cookie receives <code>HTTP 202</code> with <code>x-amzn-waf-action: challenge</code> and a short HTML body. The body contains three things: a <code>window.gokuProps</code> object holding three base64 blobs (<code>key</code>, <code>iv</code>, <code>context</code>), a <code>&lt;script src&gt;</code> pointing to a customer-specific URL on <code>*.token.awswaf.com</code>, and a small inline script that calls <code>AwsWafIntegration.saveReferrer()</code>, <code>AwsWafIntegration.checkForceRefresh()</code>, and <code>AwsWafIntegration.getToken()</code>. The remote <code>challenge.js</code> tests the browser environment and posts a validation payload back to AWS; on success, the response sets <code>Set-Cookie: aws-waf-token=...</code>. The inline script then reloads the page, and the second request, now carrying the token, gets the real content.</p><p>This works very differently from systems that score every request. Once the token is in our cookie jar, AWS WAF lets us through with no further behavioral checks beyond IP reputation and rate limits. 
<br>The question this article sets out to answer is whether we can bypass AWS WAF with &#8220;convincing&#8221; requests, that is, with a proper TLS fingerprint and set of headers, or whether we need a JS rendering engine.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Test setup</h2><p>As always, the code can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">104.IMDB</a>. If you&#8217;re not able to access the repository, <a href="https://twsc-private-form.lovable.app/">please use this form to request access.</a></strong></p><p>The libraries we pinned at the time of writing are <code>curl_cffi==0.14.0</code>, <code>httpx==0.28.1</code>, <code>httpx-curl-cffi==0.1.5</code>, <code>rnet==2.4.2</code>, and <code>scrapling==0.4.7</code>. Python is 3.11.</p><p>Each HTTP test issues a <code>GET</code> against two URLs: the IMDB home page <code>https://www.imdb.com/</code> and a title page <code>https://www.imdb.com/title/tt0111161/</code>. We use two URLs to confirm the challenge fires the same way on both, not only on one entry point. We do not follow redirects (<code>follow_redirects=False</code>) because the AWS WAF response is a 202 with content rather than a redirect, and we want to see it raw. 
</p><p>We capture the status code, HTTP version, the full response headers, any cookies, body length, and the first 600 characters of the body, and we save everything to JSON under <code>aws_waf_imdb/responses/</code> for later inspection.</p><p>The baseline probe in <a href="../code/aws_waf_imdb/probe_plain.py">probe_plain.py</a> uses an unmodified <code>httpx.Client(http2=True)</code> with a generic Chrome User-Agent header and the standard <code>Accept</code> headers. This is the control: no TLS impersonation, no fingerprint trickery, just a normal Python HTTP client.</p>
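<p>To make the capture step concrete, here is a minimal sketch of how a response could be classified and summarized before being written to JSON. It is not the code from the repository: the function names <code>is_waf_challenge</code> and <code>summarize_response</code> and the record keys are hypothetical, and the only signals it relies on are the ones described earlier in this article (a <code>202</code> status, the <code>x-amzn-waf-action: challenge</code> header, and the <code>aws-waf-token</code> cookie). It works on plain values, so it runs without any network access.</p>

```python
import json


def is_waf_challenge(status_code, headers):
    """Return True when a response matches the AWS WAF challenge signature."""
    # Header names are case-insensitive, so normalize them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    action = lowered.get("x-amzn-waf-action", "")
    return status_code == 202 and action.strip().lower() == "challenge"


def summarize_response(url, status_code, http_version, headers, cookies, body):
    """Build a JSON-serializable record with the fields captured per test."""
    return {
        "url": url,
        "status_code": status_code,
        "http_version": http_version,
        "headers": dict(headers),
        "cookies": dict(cookies),
        "body_length": len(body),
        "body_preview": body[:600],  # first 600 characters only
        "waf_challenge": is_waf_challenge(status_code, headers),
        "has_waf_token": "aws-waf-token" in cookies,
    }


# Hypothetical values shaped like the challenge response described above.
record = summarize_response(
    url="https://www.imdb.com/",
    status_code=202,
    http_version="HTTP/2",
    headers={"X-Amzn-Waf-Action": "challenge", "Content-Type": "text/html"},
    cookies={},
    body="x" * 2000,
)
print(json.dumps(record, indent=2))
```

<p>With <code>httpx</code>, the arguments would most likely come from <code>response.status_code</code>, <code>response.http_version</code>, <code>response.headers</code>, <code>response.cookies</code>, and <code>response.text</code>. Keeping the classification separate from the HTTP client means the same check can be reused across every client tested.</p>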
      <p>
          <a href="https://substack.thewebscraping.club/p/bypassing-aws-waf-with-scrapling">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Use LLMs to Enhance Data Extraction From Unstructured Text]]></title><description><![CDATA[How combining LLMs with schema validation solves the extraction problem that NLP never could]]></description><link>https://substack.thewebscraping.club/p/how-to-use-llms-data-extraction-unstructured-text</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-use-llms-data-extraction-unstructured-text</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 10 May 2026 19:06:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9c72c499-8522-4fcd-b662-e37cf857c78a_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#127465;&#127466; Before starting this article, let me remind you that on Friday the 15th, there will be the first TWSC meetup in Munich. For more details and to confirm your attendance, go to <a href="https://www.meetup.com/the-web-scraping-club/events/314567280/">the event page</a>  &#127465;&#127466; </em></p><div><hr></div><p>The web contains an extraordinary volume of information, the majority of which is in textual form. Blogs, forums, and newsletters alone generate millions of words of domain-specific knowledge every week. And they&#8217;re not the only sources of text on the web.</p><p>When you want to get insights from that kind of data, successfully extracting it from the web is only half of the battle, even now that <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">LLMs can use vision to scrape complex visual layouts</a>. The second part of the challenge is structuring this data to get it ready for analytics. Why? Because when you point a scraper at a news article, you get back a wall of text. But you cannot query it. You cannot aggregate it. 
You cannot feed it reliably into a machine learning pipeline or a database without significant preprocessing.</p><p>This article addresses the preprocessing problem of unstructured text when you scrape it from the web. It traces the evolution of solutions from classical NLP to large language models, identifies where each approach breaks down, and proposes a practical architectural solution.</p><p>Let&#8217;s get into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What &#8220;Unstructured&#8221; Really Means in Practice</h2><p>Unstructured text refers to content that carries no machine-readable schema. The information exists in the data you retrieved from the web, but it has no field boundaries, no consistent labels, and no guaranteed position for any given fact.</p><p>The following diagram shows the difference between unstructured text and structured text that follows a machine-readable schema:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FNRZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FNRZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 424w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 848w, 
https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1272w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png" width="1037" height="456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:1037,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58633,&quot;alt&quot;:&quot;The difference between unstructured and machine-readable text by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The difference between unstructured and machine-readable text by Federico Trotta" title="The difference between unstructured and machine-readable text by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!FNRZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 424w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 848w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1272w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">The difference between unstructured and machine-readable text</figcaption></figure></div><p>Let&#8217;s consider three concrete scraping targets to illustrate what this costs you in practice.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3>News Articles: When Signals Are Buried in Noise</h3><p>Suppose you scraped a Reuters article about an ECB rate decision. The text you get back from the scraper could look like this:</p><pre><code><code>European Central Bank decides on rates.
Listen to this article. 2 min audio. 
You might also like: Eurozone inflation hits 3-year low. 
Christine Lagarde announced Thursday a 25 basis point reduction, bringing the main 
refinancing rate to 3.40%. 
SPONSORED: Track macro events with Bloomberg Terminal. 
The decision was widely anticipated after last month's CPI print. Share this article. 
4 comments. John M. writes: this was priced in already</code></code></pre><p>Your raw text contains the article body, a teaser for a related story, a sponsored insertion, and reader comments. The fact you want is buried in there: the ECB cut its main refinancing rate to 3.40% on a specific date. But your extractor gets the full content.</p><p>Such a wall of text, which in practice is usually much bigger than this example, is useless for analytics purposes without preprocessing.</p><h3><strong>Financial Newsletters: When &#8220;Just Under Two Percent&#8221; Breaks Your Aggregation</strong></h3><p>Suppose you scrape a financial newsletter to extract an updated macroeconomic forecast. You need to capture a specific fact, something like &#8220;Goldman Sachs revised its 2026 US GDP growth forecast down to 1.8%&#8221;. Your scraper captures the entire page output, which resembles an article page. As in the previous example, the resulting raw text mixes the core facts with boilerplate and unrelated news:</p><pre><code><code>Market Daily Newsletter. November 12.
Jan Hatzius (Goldman Sachs) and his team were out with a note early Tuesday.
SPONSORED: Get 50% off your trading fees today. 
They see tariffs shaving roughly 0.7 points off the baseline. Meanwhile, 
European markets rallied on ECB news. 
Read our full coverage of the Eurozone here.
The revised number now sits just under two percent for the full year.
Subscribe for premium insights.</code></code></pre><p>The text distributes the target fact across the entire document. Also, the wording &#8220;just under two percent&#8221; requires numerical understanding to recognize that the text refers to the figure you were searching for, that is, 1.8%.</p><p>Now, imagine generalizing this after scraping hundreds of financial news articles and newsletters, with the goal of aggregating the information and summarizing the numbers. Getting insight would be impossible. Why? Because some sources will give you the actual information you want (growth forecast down to 1.8%), while others will use different phrasing to describe the trend (&#8220;an expected growth under 2 percent&#8221;, &#8220;a slightly shrinking trend&#8221;, etc.).</p><p>Without a way to create a structure for such data, you can&#8217;t get any insights from it.</p><h3>Job Posting Offers: They Are Always Messier Than They Look</h3><p>Consider the case where you want to scrape job offers to get an idea of what the market is paying on average for a specific position, given the expected technical skills and the same day-to-day activity. Job offers can have the following ambiguities:</p><ul><li><p>A sentence might read &#8220;3+ years of experience with Python&#8221;. This establishes a floor but no ceiling. Alternatively, the text might read &#8220;Senior-level candidates only&#8221;. This uses qualitative seniority as a proxy for an exact quantitative number.</p></li><li><p>Salary breaks in a different direction. One posting can say <em>&#8220;$120,000 - $145,000 base&#8221;</em>. Another can say <em>&#8220;competitive compensation commensurate with experience&#8221;</em>. A third could say <em>&#8220;&#8364;100,000&#8221;</em>, which you need to convert to dollars to make an actual comparison.</p></li><li><p>Employment type can introduce further ambiguity and difficulties. 
<em>&#8220;Full-time&#8221;</em>, <em>&#8220;FTE&#8221;</em>, <em>&#8220;permanent&#8221;</em>, and <em>&#8220;direct hire&#8221;</em> basically mean the same thing but are written differently. Also, the text might specify the role is &#8220;Hybrid&#8221;, which means multiple different things across companies. It could mean two days in the office. It could mean occasional travel with headquarters-optional rules.</p></li></ul><div><hr></div><blockquote><p>When sites get tough, skip the heavy lifting. Get clean, structured CSV datasets,  ready for Excel, BI or your apps</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KpSw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KpSw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 424w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 848w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1272w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png" width="592" height="149.84467881112175" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:1043,&quot;resizeWidth&quot;:592,&quot;bytes&quot;:81723,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KpSw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 424w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 848w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databoutique.com/buy-data-list&quot;,&quot;text&quot;:&quot;Find your dataset&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databoutique.com/buy-data-list"><span>Find your dataset</span></a></p></blockquote><div><hr></div><h2>How Classical NLP Tried to Solve This (and Where It Stopped)</h2><p>Before large language models were released, the standard answer to this problem was Natural Language Processing. The classical NLP toolkit gave developers a set of tools that could, with enough effort, extract meaningful structure from text using different, but often interconnected, processes like the following:</p><ul><li><p><strong>Named Entity Recognition (NER)</strong>: NER is a process used in <a href="https://substack.thewebscraping.club/p/using-nlp-scraped-data">NLP to extract entities from text corpora</a>. In particular, it can label spans of text as persons, organizations, locations, or dates. An NLP model trained on news corpora, for example, is able to scan an article and tag &#8220;Jane Doe&#8221; as a person and &#8220;Washington D.C.&#8221; as a geopolitical entity.</p></li><li><p><strong>Part-of-speech tagging</strong>: A process in which NLP models identify nouns, verbs, and adjectives. 
This enables the downstream logic to focus on the right parts of a sentence.</p></li><li><p><strong>Dependency parsing:</strong> Maps grammatical relationships between words, helping to extract which subject performed which action on which object.</p></li><li><p><strong>Relation extraction:</strong> Identifies when two co-occurring entities have a specific relationship. For example, a person who was affiliated with an organization, or an event that occurred in a specific location.</p></li></ul><p>Libraries like <a href="https://spacy.io/">spaCy</a>, <a href="https://nlp.stanford.edu/">Stanford NLP</a>, and <a href="https://www.nltk.org/">NLTK</a> made these processes widely accessible. But they work well only for well-defined, narrow tasks on consistent text domains. The limitations of this approach appear quickly at the edges:</p><ul><li><p><strong>Domain shift breaks everything:</strong> An NER model trained on news articles performs poorly on scientific abstracts. A model tuned for English financial text fails on multilingual content. In other words, every new domain requires retraining, re-labeling, and re-evaluation. These processes are very costly, both in terms of money and time.</p></li><li><p><strong>Context is invisible:</strong> Classical NLP models operate at the token and sentence level. They have no mechanism for understanding that &#8220;Apple&#8221; in a technology article refers to a corporation, while &#8220;apple&#8221; in a nutrition blog refers to a fruit. Disambiguation requires hand-crafted rules or separate classification layers bolted on top (which, again, is costly).</p></li></ul><p>Before NLP, you could basically only use regex (with all the difficulties associated with manually cleaning data, standardizing it, and&#8230;using regex!). 
So, NLP was a genuine (big) step forward: it made large-scale text analysis possible in ways that pure pattern matching never could (see also how to <a href="https://substack.thewebscraping.club/p/detect-pattern-scraped-data-with-ai">find patterns in scraped data using AI</a>). But it still required substantial domain expertise and constant maintenance, and it produced results that were narrow, fragile, and difficult to generalize.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The Modern Solution: LLMs as Universal Structure Extractors</h2><p>Large language models fundamentally changed the extraction problem. In terms of the underlying technology, a classical NLP model learns narrow, task-specific statistical patterns in the text. An LLM, instead, is trained at a scale that lets it model meaning and context across domains, so in practice it behaves as if it understands language. This distinction matters enormously because it opens the door to the following:</p><ul><li><p><strong>Context disambiguation that works out of the box:</strong> Feed an LLM a paragraph from a technology article containing the word &#8220;Apple&#8221; and it will correctly identify it as a company. Feed it a paragraph from a recipe blog, and it will correctly identify it as a fruit. No separate disambiguation layer. The model resolves ambiguity the same way a human reader does: by reading the surrounding context.</p></li><li><p><strong>Semantic equivalence that is understood, not computed:</strong> An LLM knows that &#8220;$40,&#8221; &#8220;forty dollars,&#8221; &#8220;40 USD,&#8221; and &#8220;forty bucks&#8221; all express the same value. 
You don&#8217;t need to instruct it to understand that.</p></li><li><p><strong>Implicit information that becomes accessible:</strong> A sentence like &#8220;the study, conducted over three months at a Boston hospital, found no significant effect&#8221; contains a location, a duration, and a finding. An LLM can extract all three without requiring the text to follow any particular structure.</p></li><li><p><strong>Domain generalization that requires no retraining:</strong> The same LLM that extracts entities from political news articles can extract findings from scientific abstracts, event mentions from cultural journalism, and source attributions from investigative reporting. You just need to change the prompt, not the model.</p></li></ul><p>The practical workflow becomes straightforward:</p><ul><li><p>You scrape unstructured text from the web.</p></li><li><p>You pass the content to an LLM with a prompt that describes what you want to extract.</p></li><li><p>The model returns a response.</p></li><li><p>You use that response downstream.</p></li></ul><p>This process works. But using LLMs alone introduces a different class of problems:</p><ul><li><p><strong>Output format is not guaranteed:</strong> Ask an LLM to return a price, and it might return <em>$40</em> in one run, <em>40 dollars</em> in another, and <em>40 USD</em> in a third. The model understands the value when it retrieves it from scraped content. But it does not guarantee how it expresses that value unless you explicitly constrain it.</p></li><li><p><strong>Required fields can go missing:</strong> If the article you extracted the content from does not mention a publication date, the model might omit the field, return <em>null</em>, return <em>"not mentioned"</em>, or invent a plausible date (which is far worse). 
Each behavior is different, and none of them is predictable without enforcement.</p></li><li><p><strong>Hallucination is a real risk:</strong> When the model is uncertain, it tends to generate a plausible answer anyway. For extraction tasks, that means it can invent entity names, fabricate statistics, or fill in missing information with confident-sounding fiction. Without validation, these errors pass into your data, creating issues at the analytics level.</p></li></ul><p>Taken together, these problems become a scalability issue, because nothing guarantees consistency. A pipeline processing 10,000 articles requires every output to follow the same schema. But a model that returns slightly different structures across runs cannot feed a database reliably without significant error handling.</p><p>In other words, LLMs provide you with the understanding that NLP lacked. But they do not, on their own, provide the structural guarantees that production pipelines require.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>How to Get Semantic Power and Structural Guarantees at the Same Time: A Practical Approach</h3><p>One possible solution to the unpredictability of LLM outputs is to separate the two concerns that these models conflate: semantic understanding and structure enforcement.</p><p>To do so, you can:</p><ul><li><p>Use the LLM for what it does well: reading text, resolving ambiguity, extracting meaning, and normalizing inconsistent expressions.</p></li><li><p>Use dedicated libraries to define schemas, enforce types, validate outputs, and 
reject malformed data before it enters your pipeline.</p></li></ul><p>Below is how this solution works, at a high level:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zCax!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zCax!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 424w, https://substackcdn.com/image/fetch/$s_!zCax!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 848w, https://substackcdn.com/image/fetch/$s_!zCax!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1272w, https://substackcdn.com/image/fetch/$s_!zCax!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zCax!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png" width="998" height="376" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2508960-4a9d-44cb-9877-0df6263956b9_998x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:998,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52459,&quot;alt&quot;:&quot;The high-level process of creating machine-readable content from unstructured text by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The high-level process of creating machine-readable content from unstructured text by Federico Trotta" title="The high-level process of creating machine-readable content from unstructured text by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!zCax!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 424w, https://substackcdn.com/image/fetch/$s_!zCax!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 848w, https://substackcdn.com/image/fetch/$s_!zCax!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zCax!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The high-level process of creating machine-readable content from unstructured text</figcaption></figure></div><p>Let&#8217;s see how to implement this process and how the two approaches differ in practice.</p><div><hr></div><h3>The Baseline Approach: A Direct LLM Call (and What It Gives You)</h3><p>Consider the following content that can come from scraping a news article:</p><pre><code><code>Fed Signals Caution as Inflation Data Disappoints

By Sarah M. Connelly | April 14, 2026 | Economics &amp; Policy

The Federal Reserve signaled on Monday that it remains in no rush to cut interest rates,
after fresh inflation data showed consumer prices rose more than expected last month.
Jerome Powell, speaking at a conference in Washington, said the central bank needs
"greater confidence" that inflation is moving sustainably toward its two-percent target
before reducing borrowing costs.

The Consumer Price Index climbed 3.5 percent in March, up from 3.2 percent in February,
according to the Labor Department. Economists polled by Reuters had forecast a reading
of 3.3 percent. Core CPI, which strips out food and energy, rose 3.8 percent year-over-year.

Markets reacted sharply. The S&amp;P 500 fell nearly one point five percent by midday,
while the yield on the 10-year Treasury note jumped to four point six percent.
Goldman Sachs revised its forecast, pushing back its expected first rate cut from June
to September. JPMorgan analysts said two cuts in 2026 now look "optimistic."

Powell emphasized that the Fed is not considering rate hikes at this stage, but stressed
that the path back to two percent inflation "may take longer than previously thought."
The next Fed meeting is scheduled for the first week of May.</code></code></pre><p>To pass it directly to a GPT model and ask for a specific output, you can use the following code:</p><pre><code><code>import os
import json
from openai import OpenAI

# Scraped content
SCRAPED_TEXT = """
Fed Signals Caution as Inflation Data Disappoints

By Sarah M. Connelly | April 14, 2026 | Economics &amp; Policy

The Federal Reserve signaled on Monday that it remains in no rush to cut interest rates,
after fresh inflation data showed consumer prices rose more than expected last month.
Jerome Powell, speaking at a conference in Washington, said the central bank needs
"greater confidence" that inflation is moving sustainably toward its two-percent target
before reducing borrowing costs.

The Consumer Price Index climbed 3.5 percent in March, up from 3.2 percent in February,
according to the Labor Department. Economists polled by Reuters had forecast a reading
of 3.3 percent. Core CPI, which strips out food and energy, rose 3.8 percent year-over-year.

Markets reacted sharply. The S&amp;P 500 fell nearly one point five percent by midday,
while the yield on the 10-year Treasury note jumped to four point six percent.
Goldman Sachs revised its forecast, pushing back its expected first rate cut from June
to September. JPMorgan analysts said two cuts in 2026 now look "optimistic."

Powell emphasized that the Fed is not considering rate hikes at this stage, but stressed
that the path back to two percent inflation "may take longer than previously thought."
The next Fed meeting is scheduled for the first week of May.
"""

# Define LLM client
raw_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Define prompt for the LLM
raw_prompt = """
Extract the following information from the article below and return it as JSON:
- title
- author
- publication_date
- mentioned_organizations
- cpi_march_value
- key_claim
- market_sentiment

Article:
""" + SCRAPED_TEXT

# Get response from LLM
raw_response = raw_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": raw_prompt}]
)

raw_output = raw_response.choices[0].message.content
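
# Hedged sketch, not part of the original script: before trusting the reply,
# try to parse it. Chat models may wrap JSON in Markdown code fences, so the
# hypothetical helper below strips them first and returns None on failure.
def parse_model_json(text):
    fence = "`" * 3  # a Markdown code fence marker
    lines = text.strip().splitlines()
    # Drop an opening fence line (with or without a language tag)
    if lines and lines[0].startswith(fence):
        lines = lines[1:]
    # Drop a closing fence line
    if lines and lines[-1].strip() == fence:
        lines = lines[:-1]
    try:
        return json.loads("\n".join(lines))
    except json.JSONDecodeError:
        # The reply was not valid JSON: the caller decides whether to retry
        return None

# Usage (assumption): parsed = parse_model_json(raw_output)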

# Print results
print(raw_output)</code></code></pre><p>A typical result looks like this (the exact output can vary between runs):</p><pre><code><code>{
  "title": "Fed Signals Caution as Inflation Data Disappoints",
  "author": "Sarah M. Connelly",
  "publication_date": "April 14, 2026",
  "mentioned_organizations": [
    "Federal Reserve",
    "Labor Department",
    "Reuters",
    "Goldman Sachs",
    "JPMorgan"
  ],
  "cpi_march_value": "3.5 percent",
  "key_claim": "The Federal Reserve is in no rush to cut interest rates and needs greater confidence that inflation is moving sustainably toward its two-percent target before reducing borrowing costs.",
  "market_sentiment": "Negative"
}</code></code></pre><p>At first sight, this looks good. The prompt asked the GPT model to return a JSON object with specific fields, and the model did so. But two major problems affect the next steps when analyzing this data:</p><ul><li><p>The publication date is reported as &#8220;April 14, 2026&#8221;. This is not in ISO 8601 format and will break any parser that expects ISO dates.</p></li><li><p>The CPI is reported as &#8220;3.5 percent&#8221;, which is a string, not a number. A float is what you need if you want to analyze the value further without intermediate cleaning steps.</p></li></ul><p>So, the LLM was able to give structure to unstructured text after being specifically prompted to do so. But it failed to provide the data in the right format. To get that, you have to give the model more specific guidance.</p><h3>What Changes When You Define the Schema</h3><p>To get guarantees on the output format, you can use the following code:</p><pre><code><code>import os
import json
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Optional, Literal

# Scraped content
SCRAPED_TEXT = """
Fed Signals Caution as Inflation Data Disappoints

By Sarah M. Connelly | April 14, 2026 | Economics &amp; Policy

The Federal Reserve signaled on Monday that it remains in no rush to cut interest rates,
after fresh inflation data showed consumer prices rose more than expected last month.
Jerome Powell, speaking at a conference in Washington, said the central bank needs
"greater confidence" that inflation is moving sustainably toward its two-percent target
before reducing borrowing costs.

The Consumer Price Index climbed 3.5 percent in March, up from 3.2 percent in February,
according to the Labor Department. Economists polled by Reuters had forecast a reading
of 3.3 percent. Core CPI, which strips out food and energy, rose 3.8 percent year-over-year.

Markets reacted sharply. The S&amp;P 500 fell nearly one point five percent by midday,
while the yield on the 10-year Treasury note jumped to four point six percent.
Goldman Sachs revised its forecast, pushing back its expected first rate cut from June
to September. JPMorgan analysts said two cuts in 2026 now look "optimistic."

Powell emphasized that the Fed is not considering rate hikes at this stage, but stressed
that the path back to two percent inflation "may take longer than previously thought."
The next Fed meeting is scheduled for the first week of May.
"""

# Validation schema
class ArticleExtraction(BaseModel):
    title: str = Field(description="The article's headline")
    author: Optional[str] = Field(description="Full name of the author if explicitly mentioned")
    publication_date: Optional[str] = Field(description="Publication date in ISO 8601 format (YYYY-MM-DD)")
    mentioned_organizations: list[str] = Field(description="All organizations referenced in the article")
    cpi_march_value: Optional[float] = Field(description="CPI value as a float (e.g. 3.5)")
    key_claim: str = Field(description="The central argument or finding of the article in one sentence")
    market_sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Overall market sentiment expressed in the article"
    )

structured_client = instructor.from_openai(OpenAI(api_key=os.environ.get("OPENAI_API_KEY")))

extraction = structured_client.chat.completions.create(
    model="gpt-4o",
    response_model=ArticleExtraction,
    messages=[
        {
            "role": "user",
            "content": f"Extract structured information from the following article:\n\n{SCRAPED_TEXT}"
        }
    ]
)

print(extraction.model_dump_json(indent=2))
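
# Hedged sketch, not part of the original script: publication_date is typed as
# a plain string, so the ISO 8601 format is requested in the field description
# but not validated by the schema. The hypothetical helper below checks it.
from datetime import date

def is_iso_date(value):
    # None is acceptable because the field is Optional
    if value is None:
        return True
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

# Usage (assumption): is_iso_date(extraction.publication_date)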

print("\n" + "=" * 60)
print("INSPECTION OUTPUT")
print("=" * 60)
for field, value in extraction.model_dump().items():
    print(f"  {field}: {repr(value)}  &#8594;  type: {type(value).__name__}")</code></code></pre><p>The above code leverages two fundamental libraries:</p><ul><li><p><strong><a href="https://pydantic.dev/">Pydantic</a></strong>: This is a Python data validation library. You define a schema as a Python class, declare the fields and their types, and Pydantic enforces that any data you put into that class matches what you declared.</p></li><li><p><strong><a href="https://python.useinstructor.com/">Instructor</a></strong>: This is the bridge between Pydantic and the LLM. The core problem it solves is that LLM APIs return text, but Pydantic validates Python objects. So, something has to sit in the middle, take the LLM&#8217;s response, parse it into the structure your Pydantic model expects, and retry the call if the output doesn&#8217;t validate. That&#8217;s what Instructor does. Without Instructor, you would have to manually prompt the model to return JSON, parse that JSON yourself, handle malformed responses, write retry logic, and coerce types by hand.</p></li></ul><p>By using these two libraries, the <em>ArticleExtraction</em> class does the following:</p><ul><li><p><strong>Type enforcement:</strong> Defines <em>cpi_march_value</em> as a float. This ensures the validated value is an actual number instead of a string (3.5 instead of "3.5 percent", as in the previous example).</p></li><li><p><strong>Controls formatting and vocabulary:</strong> The <em>Literal</em> type on <em>market_sentiment</em> restricts the LLM&#8217;s output to <em>"positive"</em>, <em>"negative"</em>, or <em>"neutral"</em>. The model cannot invent new categories. Similarly, the description for <em>publication_date</em> explicitly requests the ISO 8601 format. Note that the field is still typed as a string, so the format is requested through the prompt rather than validated by the schema.</p></li><li><p><strong>Built-in prompting:</strong> The <em>Field(description="...")</em> parameters serve a dual purpose. First, they document the code for developers. 
Second, under the hood, the Instructor library passes these descriptions to the LLM as part of the schema, where they act as targeted instructions. This helps the model understand <em>exactly</em> what &#8220;key claim&#8221; or &#8220;publication date&#8221; means in this context.</p></li><li><p><strong>Graceful omissions:</strong> Wrapping fields like <code>author</code> in <em>Optional[...]</em> gives the model permission to safely return a null value if the information isn&#8217;t present in the scraped text. This lowers the risk that the model invents a value to fill the field.</p></li></ul><p>The JSON output is as follows:</p><pre><code><code>{
  "title": "Fed Signals Caution as Inflation Data Disappoints",
  "author": "Sarah M. Connelly",
  "publication_date": "2026-04-14",
  "mentioned_organizations": [
    "Federal Reserve",
    "Labor Department",
    "Reuters",
    "Goldman Sachs",
    "JPMorgan"
  ],
  "cpi_march_value": 3.5,
  "key_claim": "The Federal Reserve remains cautious about cutting interest rates because inflation has not yet shown sufficient progress toward its two-percent target.",
  "market_sentiment": "negative"
}</code></code></pre><p>As you can see, now the CPI is a float, and the publication date is in ISO 8601.</p><p>The inspection output is the following:</p><pre><code><code>============================================================
INSPECTION OUTPUT
============================================================
  title: 'Fed Signals Caution as Inflation Data Disappoints'  &#8594;  type: str
  author: 'Sarah M. Connelly'  &#8594;  type: str
  publication_date: '2026-04-14'  &#8594;  type: str
  mentioned_organizations: ['Federal Reserve', 'Labor Department', 'Reuters', 'Goldman Sachs', 'JPMorgan']  &#8594;  type: list
  cpi_march_value: 3.5  &#8594;  type: float
  key_claim: 'The Federal Reserve remains cautious about cutting interest rates because inflation has not yet shown sufficient progress toward its two-percent target.'  &#8594;  type: str
  market_sentiment: 'negative'  &#8594;  type: str</code></code></pre><p>This inspection makes it immediately clear that the data types are correct.</p><h2>Conclusion</h2><p>In this article, you learned what unstructured text actually costs a data pipeline. You saw how classical NLP made structured extraction possible but fragile, and how LLMs removed the domain constraints that NLP never solved. You also learned why LLMs alone are not enough and saw a practical solution to provide &#8220;guardrails&#8221; for LLMs so that their output follows a defined schema.</p><p>So, let us know: how are you managing unstructured text after you scrape it?</p>]]></content:encoded></item><item><title><![CDATA[Cloudflare Crawl Endpoint: Everything You Need to Know]]></title><description><![CDATA[Is the Cloudflare /crawl endpoint a real game-changer?]]></description><link>https://substack.thewebscraping.club/p/cloudflare-crawl-endpoint-for-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/cloudflare-crawl-endpoint-for-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 03 May 2026 20:24:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/898316de-e54e-4a62-8089-2ad66bc363b8_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cloudflare just shook the Web by announcing its first API for crawling entire websites. It&#8217;s built for RAG systems and website monitoring, but can it really be used for real-world web scraping scenarios?</p><p>In this article, you&#8217;ll find the answer to that question and more. 
I&#8217;ll walk you through a complete guided example of how to use it, and break down its (Spoiler: undoubtedly serious) limitations.</p><h2>An Introduction to the Cloudflare Crawl Endpoint</h2><p>Before exploring the technical aspects behind the Cloudflare <em>/crawl</em> endpoint and seeing it in action, let me first give you some context!</p><h3>What Is the Cloudflare <em>/crawl</em> Endpoint?</h3><p>The <em><a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/">/crawl</a></em> endpoint is a new addition to <a href="https://developers.cloudflare.com/fundamentals/api/">Cloudflare&#8217;s REST APIs</a>. Its goal is to crawl an entire website (or just a portion of it) starting from a single URL.</p><p><strong>Note</strong>: The Crawl endpoint is currently in beta and was introduced on March 10, 2026, <a href="https://developers.cloudflare.com/changelog/post/2026-03-10-br-crawl-endpoint/">as highlighted in the Cloudflare changelog</a>.</p><p>Under the hood, it automatically discovers and visits new pages, <a href="https://developers.cloudflare.com/browser-rendering/">rendering them in a headless browser</a>. It returns the discovered content as HTML, Markdown, or structured JSON, making it ideal for RAG pipelines, monitoring, or dataset creation.</p><p>As I&#8217;ll dive into later, it respects <em>robots.txt</em> and <em>doesn&#8217;t</em> bypass bot protection or captchas. 
Thus, it&#8217;s designed as a <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">compliant approach to web crawling!</a></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" loading="lazy" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>How It Works at a High Level</strong></h2><p>At a high level, the <em>/crawl</em> endpoint involves a two-step flow:</p><ol><li><p>You kick off an asynchronous crawl job, passing a starting URL. 
Cloudflare returns a job ID.</p></li><li><p>You use that job ID to periodically check the job&#8217;s status or fetch results as they become available, following typical <a href="https://en.wikipedia.org/wiki/Polling_(computer_science)">polling behavior</a>.</p></li></ol><p><strong>Important</strong>: A crawl job can run for <em>up to seven days!</em> Results remain available for 14 days after completion, after which the job data is deleted.</p><p>Behind the scenes, the crawler expands outward from the starting URL. By default, the API discovers URLs in a clear order:</p><ol><li><p>The initial page.</p></li><li><p>Sitemap URLs.</p></li><li><p>Links discovered within pages.</p></li></ol><p>Still, you can tweak that depending on whether you want to prioritize sitemaps, page links, or both.</p><h3>Supported Use Cases</h3><p>The officially promoted use cases for the Cloudflare <em>/crawl</em> API are just two:</p><ul><li><p>Creating knowledge bases or training AI systems (like <a href="https://substack.thewebscraping.club/p/ingest-web-data-rag-llm">RAG applications</a>) using up-to-date web content.</p></li><li><p>Collecting and analyzing content across multiple pages <a href="https://substack.thewebscraping.club/p/build-an-ai-agent-for-scraping-papers">for research</a>, summarization, or monitoring purposes.</p></li></ul><h3>Pricing</h3><p>Unlike most other web crawling or discovery APIs on the market, Cloudflare&#8217;s <em>/crawl</em> API doesn&#8217;t charge by the number of pages. 
Instead, costs are based on resource usage, which depends on whether you enable the headless browser rendering feature.</p><p>If headless rendering is active, pricing follows the <a href="https://developers.cloudflare.com/browser-rendering/pricing/">Browser Rendering model</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vrIj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vrIj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 424w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 848w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1272w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png" width="1456" height="238" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:238,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48862,&quot;alt&quot;:&quot;The Browser Rendering pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Browser Rendering pricing model" title="The Browser Rendering pricing model" srcset="https://substackcdn.com/image/fetch/$s_!vrIj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 424w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 848w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1272w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Browser Rendering pricing 
model</figcaption></figure></div><p>If rendering isn&#8217;t active, pricing follows the <a href="https://developers.cloudflare.com/workers/platform/pricing/">Workers model</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HVsl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HVsl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 424w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 848w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1272w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png" width="1456" height="356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66389,&quot;alt&quot;:&quot;The Workers pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Workers pricing model" title="The Workers pricing model" srcset="https://substackcdn.com/image/fetch/$s_!HVsl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 424w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 848w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1272w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Workers pricing model</figcaption></figure></div><p><em>Yeah, I 
know&#8230; It&#8217;s honestly a bit confusing, and it&#8217;s almost impossible to predict the exact cost of a crawling task. The good news? You can test it for free!</em></p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>Cloudflare Crawl Endpoints: Technical Analysis</h2><p>Now that you know what the Cloudflare <em>/crawl</em> endpoint is and what it brings to the table, it&#8217;s time to look at how it works, along with its strengths and limitations.</p><h3><strong>Endpoint Presentation</strong></h3><p>The Cloudflare Crawl API is built around two main endpoints. Both share the same base URL:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl</code></pre></div><p>Where <em>&lt;ACCOUNT_ID&gt;</em> is your <a href="https://developers.cloudflare.com/fundamentals/account/find-account-and-zone-ids/#copy-your-account-id">Cloudflare account ID</a>.</p><h4>1. Initiate the Crawl Job (POST)</h4><p>To start a new crawl, send a POST request with the target URL (and optional parameters like depth, rendering mode, etc.) as below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -X POST 'https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl' \
  -H 'Authorization: Bearer &lt;CLOUDFLARE_API_TOKEN&gt;' \
  -H 'Content-Type: application/json' \
  -d '{ "url": "https://example.com" }'</code></pre></div><p>Keep in mind that the endpoint supports several parameters, allowing you to greatly customize the crawling behavior, output format (JSON, HTML, or Markdown), rendering options, caching, and more. Check out the <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#optional-parameters">full list of supported body parameters for all available options</a>.</p><p>Cloudflare immediately returns a job ID that you&#8217;ll use to track or retrieve results. A possible response looks like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{
  "success": true,
  "result": "9f1c2d3a-4b5e-6f7a-8c9d-0e1f2a3b4c5d"
}</code></pre></div><p>The UUID in the <em>result</em> field is the Crawl job ID you&#8217;ll use to poll for updates.</p><h4>2. Request Crawl Results (GET)</h4><p>Once the crawl is running, make a GET request with the job ID to check the status or fetch results:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -X GET 'https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl/&lt;JOB_ID&gt;' \
  -H 'Authorization: Bearer &lt;CLOUDFLARE_API_TOKEN&gt;'</code></pre></div><p>Here, the <em>&lt;JOB_ID&gt;</em> placeholder is the UUID returned earlier in the <em>result</em> field.</p><p>The response either reports the job&#8217;s progress through a <em>status</em> field, like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">{
  "result": {
    "id": "3e7a1c92-b4d8-4f6e-9a21-6c0d5b8e2f14",
    "status": "running"
    // ...
  }
}</code></pre></div><p>The possible <em>status</em> values are: <em>running</em>, <em>completed</em>, <em>errored</em>, or one of several cancellation states (<em>cancelled_due_to_timeout</em>, <em>cancelled_due_to_limits</em>, <em>cancelled_by_user</em>).</p><p>Or, once the job is completed, calling the API returns the full results in the <em>records</em> field:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">{
  "result": {
    "id": "3e7a1c92-b4d8-4f6e-9a21-6c0d5b8e2f14",
    "status": "completed",
    "browserSecondsUsed": 98.3,
    "total": 12,
    "finished": 12,
    "records": [
      {
        "url": "https://example.com/",
        "status": "completed",
        "markdown": "# Example Domain\nThis domain is for use in illustrative examples...",
        "metadata": {
          "status": 200,
          "title": "Example Domain",
          "url": "https://example.com/"
        }
      },
      {
        "url": "https://example.com/about",
        "status": "completed",
        "markdown": "## About\nLearn more about this example site...",
        "metadata": {
          "status": 200,
          "title": "About - Example Domain",
          "url": "https://example.com/about"
        }
      }
      // additional entries omitted for brevity...
    ],
    "cursor": 10
  },
  "success": true
}</code></pre></div><p>Note that the response will vary based on the specified query parameters. For example, you can filter by specific statuses, limit the number of results, and <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#polling-for-completion">navigate through them using a pagination system</a>.</p><h3>Features</h3><p>Below is a list of the most relevant capabilities provided by the Cloudflare Crawl API:</p><ul><li><p><strong>Asynchronous crawl jobs</strong>: Trigger crawling jobs and poll results when they are ready, enabling non-blocking, large-scale crawling workflows.</p></li><li><p><strong>Automatic URL discovery</strong>: Finds pages from the starting URL, sitemaps, and in-page links, with configurable source control.</p></li><li><p><strong>Flexible output formats</strong>: Returns HTML, Markdown, or structured JSON. JSON leverages <a href="https://developers.cloudflare.com/workers-ai/features/json-mode/">Workers AI for schema-driven data extraction</a>.</p></li><li><p><strong>Headless browser rendering</strong>: Control JavaScript execution with <em>render: true</em> or perform fast static HTML fetches with <em>render: false</em>.</p></li><li><p><strong>Fine-grained crawl control</strong>: Configure <em>limit</em>, <em>depth</em>, and URL inclusion/exclusion with the <em>includePatterns</em>/<em>excludePatterns</em> fields.</p></li><li><p><strong>Incremental and cache-aware crawling</strong>: Use the <em>modifiedSince</em> and <em>maxAge</em> parameters to avoid re-fetching unchanged content, optimizing performance and cost.</p></li><li><p><strong>Advanced filtering and pagination</strong>: Retrieve results using <em>limit</em>, <em>cursor</em>, and <em>status</em> filters to handle large datasets efficiently.</p></li><li><p><strong>Authentication and custom headers</strong>: Supports HTTP auth, cookies, and custom headers for crawling protected or API-driven 
content.</p></li><li><p><strong>Dynamic content handling</strong>: Wait for JS-rendered content using <em>gotoOptions</em> and <em>waitForSelector</em>, ideal for SPAs and interactive pages.</p></li><li><p><strong>Resource skipping for performance</strong>: Optionally block images, media, fonts, or stylesheets to speed up crawling and reduce unnecessary bandwidth usage.</p></li></ul><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, 
https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h3>Limitations</h3><p>Cloudflare&#8217;s <em>/crawl</em> API also comes with several important limitations, such as:</p><ul><li><p><strong>Respects bot protection</strong>: The crawler can&#8217;t <a href="https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026">bypass CAPTCHAs (including Turnstile challenges) or Cloudflare bot protections</a>. As a rule of thumb, sites protected via Cloudflare Bot Management or other WAFs tend to block crawling tasks entirely, limiting automated access and leading to incomplete datasets.</p></li><li><p><strong>Fixed User-Agent</strong>: The <em>/crawl</em> endpoint sets a non-customizable <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent">User-Agent</a> </em>value<em> </em>(<em>CloudflareBrowserRenderingCrawler/1.0</em>). 
You can&#8217;t change it, which may cause sites to block requests or serve different content based on the <em>User-Agent</em>.</p></li><li><p><strong>Content Signals enforcement</strong>: If a site disallows AI usage via <a href="https://contentsignals.org/">Cloudflare Content Signals</a>, crawl requests for those purposes are rejected with a <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/400">400 Bad Request</a></em> error. Even if the site allows other uses, attempts to crawl disallowed categories will fail, limiting AI-specific data collection.</p></li><li><p><strong>Rate limiting and crawl pacing</strong>: Sites with strict rate limits can slow down crawling. The crawler respects the robots.txt <em>Crawl-delay </em>directive and implements backoff. Large crawls may need to be split into smaller jobs to avoid throttling or skipped URLs.</p></li><li><p><strong>Browser usage limits and job cancellation</strong>: Accounts on Workers free plans are capped at 10 minutes of browser time per day. Exceeding this limit results in a <em>cancelled_due_to_limits</em> status. To avoid that, you can upgrade to a paid plan.</p></li></ul><h2>How to Use the Cloudflare Crawl Endpoint: Step-by-Step Tutorial</h2><p>In this guided section, I&#8217;ll show you how to use the Cloudflare Crawl Endpoint to crawl a website in Python. The target site will be the &#8220;<a href="https://quotes.toscrape.com/">Quotes to Scrape</a>&#8221; sandbox. 
The goal here is to demonstrate how to use the API, rather than actually collecting relevant data.</p><p>Follow the instructions below!</p><h3>Prerequisites</h3><p>To follow this tutorial section, make sure you have:</p><ul><li><p>Your <a href="https://developers.cloudflare.com/fundamentals/account/find-account-and-zone-ids/#copy-your-account-id">Cloudflare account ID</a> at hand.</p></li><li><p>A <a href="https://developers.cloudflare.com/fundamentals/api/get-started/create-token/">Cloudflare API token</a> with the &#8220;Browser Rendering - Edit&#8221; permission.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nJvY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nJvY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 424w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 848w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png" width="1456" height="781" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission" title="Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission" srcset="https://substackcdn.com/image/fetch/$s_!nJvY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 424w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 848w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission</figcaption></figure></div><p>For the sake of simplicity and to keep this tutorial concise, I&#8217;ll assume you already have a Python project set up with <em><a href="https://substack.thewebscraping.club/p/python-http-request-explained">requests</a></em> 
installed. That said, you can use any programming language and any HTTP client, because the high-level logic remains the same.</p><h3>Step #1: Set Up the Configurations</h3><p>Start by importing the required libraries and reading the necessary secrets (your Cloudflare API token and account ID). Use these secrets to prepare the Cloudflare Crawl base URL and headers. Also, specify the starting target URL as a constant.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import requests
import time
import json

# The Cloudflare secrets required for authenticating Crawl API calls
CLOUDFLARE_API_TOKEN = "&lt;YOUR_CLOUDFLARE_API_TOKEN&gt;"
CLOUDFLARE_ACCOUNT_ID = "&lt;YOUR_CLOUDFLARE_ACCOUNT_ID&gt;"

# The base URL used for all Crawl API calls
CLOUDFLARE_CRAWL_BASE_URL = f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl"

# Common headers shared by all API calls
HEADERS = {
    "Authorization": f"Bearer {CLOUDFLARE_API_TOKEN}",
    "Content-Type": "application/json"
}

# The URL to crawl
START_URL = "http://quotes.toscrape.com/"</code></pre></div><p><strong>Tip</strong>: In a production script, read the Cloudflare API token and account ID from environment variables rather than hardcoding them.</p><h3>Step #2: Trigger the Crawling Job</h3><p>Define a <em>start_crawl()</em> function to send a POST request to Cloudflare&#8217;s Crawl API:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def start_crawl(start_url):
    # Customize according to your needs
    payload = {
        "url": start_url,
        "limit": 20,
        "depth": 2,
        "formats": ["markdown"],
        "render": True
    }

    # A request timeout keeps the script from hanging on an unresponsive connection
    response = requests.post(CLOUDFLARE_CRAWL_BASE_URL, headers=HEADERS, json=payload, timeout=30)
    response.raise_for_status()

    job_id = response.json()["result"]
    return job_id</code></pre></div><p>This creates a new crawling job for the target URL and returns a <em>job_id</em> that identifies this specific crawl.</p><p><strong>Tip</strong>: In a production script, expose the <em>payload</em> fields as function arguments for greater flexibility and reusability.</p><h3>Step #3: Poll the Job Status</h3><p>Next, add a <em>wait_for_completion()</em> function that checks the job status every few seconds until the crawl finishes or the polling times out:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def wait_for_completion(job_id, max_attempts=60, delay=5):
    status_url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}?limit=1"

    for attempt in range(max_attempts):
        # A request timeout ensures a stalled call cannot block the polling loop forever
        res = requests.get(status_url, headers=HEADERS, timeout=30)
        res.raise_for_status()

        status = res.json()["result"]["status"]
        print(f"Attempt {attempt+1}: {status}")

        if status != "running":
            return status

        time.sleep(delay)

    raise Exception("Timeout waiting for crawl job")</code></pre></div><p>This sends GET requests to the Cloudflare <em>/crawl</em> endpoint until the job is no longer running, so you only fetch the crawled records once processing has finished.</p><p><strong>Tip</strong>: The <em>limit=1</em> query parameter is recommended because it restricts the number of retrieved records, keeping the response lightweight. At this stage, you&#8217;re only interested in the job status, not in the actual output data.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Step #4: Get the Crawled Content Pages</h3><p>Build a <em>fetch_records()</em> function to collect all crawled pages:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def fetch_records(job_id):
    print("Fetching results...")

    all_records = []
    cursor = None

    while True:
        url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}"
        # Getting 10 records at a time
        params = {
            "limit": 10
        }

        if cursor:
            params["cursor"] = cursor

        # A request timeout keeps pagination from hanging on an unresponsive connection
        response = requests.get(url, headers=HEADERS, params=params, timeout=30)
        response.raise_for_status()

        result = response.json()["result"]
        records = result.get("records", [])

        all_records.extend(records)
        print(f"+{len(records)} records (Total: {len(all_records)})")

        cursor = result.get("cursor")
        if not cursor:
            break

    return all_records</code></pre></div><p>This handles pagination with a <em>cursor</em>, retrieving records in batches (<em>10</em> per request) until all results have been returned.</p><h3>Step #5: Put It All Together</h3><p>Finally, in the <em>main()</em> function, orchestrate the workflow:</p><ol><li><p>Start the crawl</p></li><li><p>Wait for completion</p></li><li><p>Fetch all results</p></li></ol><p>From there, you can export the crawled records to a local JSON file (as done here), store them in a database, or process them further.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def main():
    job_id = start_crawl(START_URL)
    status = wait_for_completion(job_id)

    if status != "completed":
        raise Exception(f"Crawl failed: {status}")

    print("Crawl completed!")

    records = fetch_records(job_id)

    # Export the crawled pages to an output JSON file
    with open("records.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    main()</code></pre></div><h3>Step #6: Complete Code</h3><p>This is what your Python script for interacting with the Cloudflare Crawl API will look like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># pip install requests

import requests
import time
import json
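import os

# Optional sketch, not part of the original script: a small helper for reading
# the secrets below from environment variables, as the tip in Step #1 recommends.
# The helper name and any variable names you pass to it are assumptions.
# Possible usage, once the constants below are defined:
#   CLOUDFLARE_API_TOKEN = read_secret("CLOUDFLARE_API_TOKEN", CLOUDFLARE_API_TOKEN)
def read_secret(env_name, fallback):
    # Use the environment variable when it is set and non-empty, else the fallback
    return os.environ.get(env_name) or fallback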

# The Cloudflare secrets required for authenticating Crawl API calls
CLOUDFLARE_API_TOKEN = "&lt;YOUR_CLOUDFLARE_API_TOKEN&gt;"
CLOUDFLARE_ACCOUNT_ID = "&lt;YOUR_CLOUDFLARE_ACCOUNT_ID&gt;"

# The base URL used for all Crawl API calls
CLOUDFLARE_CRAWL_BASE_URL = f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl"

# Common headers shared by all API calls
HEADERS = {
    "Authorization": f"Bearer {CLOUDFLARE_API_TOKEN}",
    "Content-Type": "application/json"
}

# The URL to crawl
START_URL = "http://quotes.toscrape.com/"

def start_crawl(start_url):
    """
    Triggers the Cloudflare Crawl API job
    """

    # Customize according to your needs
    payload = {
        "url": start_url,
        "limit": 20,
        "depth": 2,
        "formats": ["markdown"],
        "render": True
    }

    # A request timeout keeps the script from hanging on an unresponsive connection
    response = requests.post(CLOUDFLARE_CRAWL_BASE_URL, headers=HEADERS, json=payload, timeout=30)
    response.raise_for_status()

    job_id = response.json()["result"]
    return job_id
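
# Optional sketch, not part of the original script: build the crawl payload from
# arguments, as the production tip in Step #2 suggests. The helper name is an
# assumption; the keys and defaults mirror the payload in start_crawl() above.
def build_payload(start_url, limit=20, depth=2, formats=("markdown",), render=True):
    return {
        "url": start_url,
        "limit": limit,
        "depth": depth,
        "formats": list(formats),
        "render": render,
    }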

def wait_for_completion(job_id, max_attempts=60, delay=5):
    """
    Waits for the crawling task to complete
    """

    status_url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}?limit=1"

    for attempt in range(max_attempts):
        # A request timeout ensures a stalled call cannot block the polling loop forever
        res = requests.get(status_url, headers=HEADERS, timeout=30)
        res.raise_for_status()

        status = res.json()["result"]["status"]
        print(f"Attempt {attempt+1}: {status}")

        if status != "running":
            return status

        time.sleep(delay)

    raise Exception("Timeout waiting for crawl job")

def fetch_records(job_id):
    """
    Collects all records from the paginated results
    """

    print("Fetching results...")

    all_records = []
    cursor = None

    while True:
        url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}"
        # Getting 10 records at a time
        params = {
            "limit": 10
        }

        if cursor:
            params["cursor"] = cursor

        # A request timeout keeps pagination from hanging on an unresponsive connection
        response = requests.get(url, headers=HEADERS, params=params, timeout=30)
        response.raise_for_status()

        result = response.json()["result"]
        records = result.get("records", [])

        all_records.extend(records)
        print(f"+{len(records)} records (Total: {len(all_records)})")

        cursor = result.get("cursor")
        if not cursor:
            break

    return all_records
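
# Optional sketch, not part of the original script: count the fetched records by
# their per-record status, which makes skipped pages easy to spot. It assumes
# each record is a dict with a "status" key (an assumption about the response
# shape) and falls back to "unknown" when the key is missing.
def summarize_records(records):
    counts = {}
    for record in records:
        status = record.get("status", "unknown")
        counts[status] = counts.get(status, 0) + 1
    return counts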

def main():
    job_id = start_crawl(START_URL)
    status = wait_for_completion(job_id)

    if status != "completed":
        raise Exception(f"Crawl failed: {status}")

    print("Crawl completed!")

    records = fetch_records(job_id)

    # Export the crawled pages to an output JSON file
    with open("records.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    main()</code></pre></div><h3>Step #7: Test the Script</h3><p>Launch the script, and it&#8217;ll produce an output like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XDal!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XDal!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 424w, https://substackcdn.com/image/fetch/$s_!XDal!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 848w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1272w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png" width="1175" height="412" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1175,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output produced by the script in the terminal&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output produced by the script in the terminal" title="The output produced by the script in the terminal" srcset="https://substackcdn.com/image/fetch/$s_!XDal!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 424w, https://substackcdn.com/image/fetch/$s_!XDal!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 848w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1272w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" 
fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The output produced by the script in the terminal</figcaption></figure></div><p>The polling mechanism required 5 attempts (~25 seconds), and the API discovered and retrieved 22 pages.</p><p>A <em>records.json</em> file will appear in your project directory. 
Open it, and you&#8217;ll see:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZwCj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 424w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 848w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png" width="1456" height="1071" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1071,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 424w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 848w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The content of the records.json file produced by the script</figcaption></figure></div><p>Notice how the &#8220;Quotes to Scrape&#8221; entries contain a <em>markdown</em> field with the Markdown version of the page. By contrast, external links such as Zyte&#8217;s homepage and Goodreads.com are skipped, since <em>includeExternalLinks</em> is set to <em>false</em> by default. In other words, the Cloudflare Crawl API doesn&#8217;t follow links to domains other than that of the starting URL unless you ask it to.</p><p>Et voil&#224;! 
Implementation complete.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Benchmark Against Protected Websites</h3><p>Cool! The Cloudflare Crawl endpoint works like a charm and is easy to use. However, I was particularly concerned about its documented limitations and wanted to verify whether they actually hold up in practice&#8230;</p><p>So, I ran tests against several well-known sites protected by common WAF and anti-bot solutions (from different providers). Here&#8217;s a summary of the results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!chL4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!chL4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 424w, https://substackcdn.com/image/fetch/$s_!chL4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 848w, 
https://substackcdn.com/image/fetch/$s_!chL4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1272w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png" width="1456" height="605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111887,&quot;alt&quot;:&quot;Cloudflare Crawl API vs anti-bot solutions&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cloudflare Crawl API vs anti-bot solutions" title="Cloudflare Crawl API vs anti-bot solutions" srcset="https://substackcdn.com/image/fetch/$s_!chL4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 424w, 
https://substackcdn.com/image/fetch/$s_!chL4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 848w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1272w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cloudflare Crawl API vs anti-bot solutions</figcaption></figure></div><p>As you can tell, the limitations are very real. The results are quite discouraging:<strong> the Cloudflare Crawl API failed against all anti-bot&#8211;protected websites I tested.</strong></p><p>So, is this solution reliable for web scraping? When (and how) should you actually use it? Let me break that down in a final comment!</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Final Comment</h2><p>In this article, I introduced you to one of the newest tools in Cloudflare&#8217;s growing ecosystem: the Crawl API! This endpoint is designed to help you crawl entire websites using distributed crawling tasks running on Cloudflare&#8217;s infrastructure.</p><p>Sure, the crawling mechanism works and is easy to launch, control, and implement. With just a few lines of code, you can get started. 
Still, several concerns should be raised:</p><ol><li><p><strong>Opaque pricing</strong>: Costs are tied to resource usage rather than the number of pages crawled, making them harder to predict.</p></li><li><p><strong>Fixed </strong><em><strong>User-Agent</strong></em>: The API doesn&#8217;t allow <em>User-Agent</em> customization, meaning even basic server-side checks can block it.</p></li><li><p><strong>Limited effectiveness on protected sites</strong>: By design, the API has a very low success rate against anti-bot&#8211;protected websites (unless you own the target site and, in its Cloudflare Bot Protection settings, <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#robotstxt-and-bot-protection">allow it against your site</a>).</p></li><li><p><strong>Rate limiting constraints</strong>: It strictly respects <em>robots.txt</em> directives and crawl delays, which can significantly slow or limit large crawls.</p></li></ol><p>In simple terms, if you want to use it for general-purpose, large-scale web crawling, I wouldn&#8217;t recommend it. The market offers more effective solutions that can actually bypass anti-bot protections. Plus, remember that around <em><a href="https://www.securitymagazine.com/articles/101188-65-of-websites-arent-protected-from-bots">35% of all websites</a></em> are estimated to be protected against bots (i.e., you won&#8217;t be able to crawl them with this API).</p><p>Yet, if you know the target site is not protected, budget isn&#8217;t a concern, and you want to remain (<em>overly?</em>) ethical and compliant, the Cloudflare Crawl API can be an option.</p><p>I hope this breakdown helps you better understand this new solution and make an informed decision on whether to adopt it. Lastly, remember that the Cloudflare Crawl API is still in beta, so things may change soon. Just <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/">keep an eye on the docs for updates</a>. 
Until next time!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #103: Bypassing DataDome-Protected Websites in the Agentic Era]]></title><description><![CDATA[Fifteen browser configurations, one tough anti-bot, and only a couple made it to the cart]]></description><link>https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 30 Apr 2026 21:34:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e5dad0e-b094-41c0-942c-c76f3783b289_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This year, every web infrastructure company seems to be shipping a browser. Not a regular browser, though: one designed to be driven by an AI agent and to look human while doing it. We wanted to know if any of those browsers actually work against a serious anti-bot, so we picked a hard target, leroymerlin.fr behind DataDome, and tested more than a dozen different setups on the same four-step task: open the homepage, search for a product, open the first result, add it to the cart.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>The short answer is that a couple of tools finished the task, just one with any consistency. The story behind why is worth telling, because it explains what is happening at the intersection of AI agents and web data right now. We ran a similar exercise <a href="https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026">against Cloudflare earlier this year</a>, and the conclusion is broadly the same: each anti-bot needs its own answer, and the answer changes every quarter.</p><h2>From workflows to agents, and why that changes the data problem</h2><p>Most code shipped under the AI banner is not really agentic. It is workflow code with an LLM dropped into a slot: generate a summary here, classify a record there, draft an email at the end. The control flow is hard-coded, and the model is one component among many.</p><p>The definition of an agent is quite different. The model decides the next action, observes the outcome, and decides again. The control flow lives inside the loop, not outside it. 
The agent has goals rather than scripts, and it picks tools and steps based on what it sees. That is what makes the engineering interesting, that is what makes it hard, and that is what sometimes makes it unreliable.</p><p>It also forces a different relationship with data. An agent that only sees its training corpus is stuck in the past. To make decisions worth anything, it has to read prices that change daily, stocks that move minute by minute, listings that did not exist last week. Some of that data sits behind APIs. Most of it does not. The web is still the largest and most current dataset in the world, and most of it is reachable only through a browser. So if we want our agents to act on real information, we have to give them a way to browse: opening a page, reading it, clicking a link, typing into a search bar, following a result, filling a form, all on sites that were never built for machines.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:69.35779816513761,&quot;width&quot;:630,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><p>This is the constraint that produced the wave of &#8220;agentic browser&#8221; launches we have seen over the last twelve months. Y Combinator alone has backed a long string of them. 
<a href="https://www.hyperbrowser.ai/">Hyperbrowser</a> (S21) was an early entry: scalable cloud browser infrastructure with built-in CAPTCHA solving, proxy management, and now a multi-agent playground. The newer cohort followed the agent wave more directly: <a href="https://www.browseros.com/">BrowserOS</a> (S24) is an open-source agentic browser that runs the agent locally on the user&#8217;s machine; <a href="https://browser-use.com/">Browser Use</a> (W25) offers an open-source agent loop on top of Playwright, plus a cloud version. <a href="https://www.skyvern.com/">Skyvern</a> is a self-hostable browser agent that uses an LLM and computer vision instead of fixed selectors. Outside the YC pipeline, <a href="https://lightpanda.io/">Lightpanda</a> is doing something different again: a headless browser engine written from scratch in Zig and aimed squarely at agents and crawlers (its authors claim roughly 9x faster execution and 16x lower memory use than Chrome). It fits the &#8220;browser built for machines&#8221; line of thought we covered in <a href="https://substack.thewebscraping.club/p/rethinking-the-web-browser">Rethinking the web browser</a> earlier this year. <a href="https://www.browserbase.com/">Browserbase</a> ships a managed browser plus Stagehand for natural-language automation. And the big AI labs are now in the same space: OpenAI shipped Operator and the ChatGPT Atlas browser, Anthropic shipped Computer Use, Perplexity launched Comet. Each project attacks the same problem from a slightly different angle, but the goal is identical: a browser an agent can drive without immediately tripping every detection mechanism on the other side.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h2>The same problem scrapers have been chasing for a decade</h2><p>For anyone who has worked in web data, none of this is new. The fight over whether a request looks human or automated has been going on as long as commercial scraping has existed. The product names have changed, but the purpose has not.</p><p>What has changed is who is selling the bypass. The companies that have spent years selling residential proxies and unblockers noticed quickly that the agentic boom is good for their business. They already have the IP networks, the fingerprint research, the bypass code, the cat-and-mouse experience. They know what TLS handshake Chrome sends in October 2025 and what it sent in October 2024. Pivoting all of that into a managed browser is a smaller leap than building one from scratch. <a href="https://brightdata.com">Bright Data</a>, <a href="https://oxylabs.io">Oxylabs</a>, <a href="https://rayobyte.com">Rayobyte</a>, and <a href="https://www.zenrows.com">ZenRows</a> have all added a managed browser product alongside their proxies.</p><p>The other side of the line is moving in the opposite direction. Bot traffic has grown faster than human traffic for years, and the operators of large public sites care more about it than ever. <a href="https://datadome.co">DataDome</a>, <a href="https://www.cloudflare.com/products/bot-management/">Cloudflare Bot Management</a>, <a href="https://www.akamai.com/products/bot-manager">Akamai Bot Manager</a>, <a href="https://www.humansecurity.com">HUMAN</a>, <a href="https://www.kasada.io">Kasada</a>: every one of them ships updates that target the exact tools we just listed. Fingerprint checks get stricter. Behavioral models get more sensitive. 
The JavaScript challenge changes shape every few weeks. There is no silver bullet, and there is no tool, browser, proxy, or service that bypasses every anti-bot on every site at all times. Anyone who claims otherwise is selling something that worked last quarter and might still work this week. The useful question is what works on a given target, today, at what cost.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Picking a hard target</h2><p>To answer that question concretely, we needed a target where the anti-bot was good and the signal was clean. We picked leroymerlin.fr, the French DIY retailer. Leroy Merlin runs DataDome standalone, with no other anti-bot layer on top, so attribution is straightforward. It also runs one of the more verbose DataDome configurations we have come across: response headers expose <code>x-datadome-riskscore</code>, <code>x-datadome-protection</code>, <code>x-datadome-cid</code>, and <code>x-datadome-endpointid</code>. Most DataDome-protected sites only show us the outcome. Here we see the score the engine assigns at every request, which is rare and very useful when comparing tools side by side.</p><p>The task we picked is small but realistic. From the homepage, the agent has to type &#8220;ampoule B22 led blanc&#8221; into the search bar, click the first product result, and add the product to the cart. Four steps. We dropped the login step on purpose: leroymerlin.fr requires an OTP to sign in, and we did not want OTP friction to confound an anti-bot test.</p><p>A run is a pass if the agent reaches the cart confirmation. 
Otherwise we record where it stopped and what DataDome said about it. Each tool runs ten times back to back, and we aggregate the results. Tools that support an external proxy share the same two residential pools: Bright Data residential FR for the Bright Data runs, and Geonode residential FR for the Geonode runs. Tools that ship their own proxy use it. We used two providers to diversify the IP addresses and make sure that blocks were not just a matter of IP reputation.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>The contestants</h2><p>As you&#8217;ve seen above, the browser landscape is quite crowded and we could not cover all the tools. We picked four open-source projects and seven commercial products. Let&#8217;s start with the open source.</p><p><a href="https://camoufox.com">Camoufox</a> is the stealth Firefox fork most people in the scraping world have already met (we <a href="https://substack.thewebscraping.club/p/open-source-python-libraries-scraping">introduced it</a> on TWSC back in September 2024). It rotates real-world fingerprints, patches the obvious automation tells, and ships a Playwright-compatible API. We pair it with both Bright Data and Geonode residential proxies in France.</p><p><a href="https://github.com/autoscrape-labs/pydoll">Pydoll</a> takes a different route: it drives Chromium directly over CDP without WebDriver, with built-in humanized cursor movement and typing. 
Importantly, Pydoll implements an explicit <code>Fetch.authRequired</code> handler, which lets it authenticate proxies that require Basic auth. </p><p><a href="https://scrapling.readthedocs.io">Scrapling</a> is a higher-level Python library. We use it in two modes. <code>DynamicFetcher</code> launches vanilla Playwright Chromium driven by Scrapling&#8217;s session manager. <code>StealthyFetcher</code> does the same, but under the hood uses an improved and customized version of <a href="https://github.com/Kaliiiiiiiiii-Vinyzu/patchright">patchright</a>, a stealth-patched Playwright fork. Each gets its own row in the comparison. </p><p><a href="https://github.com/rayobyte-data/rayobrowse">RayoBrowse</a> is the self-hosted stealth Chromium fork from Rayobyte, distributed as a Docker container that exposes a CDP endpoint on port 9222. Here we hit a wall worth flagging: for some reason RayoBrowse could not use the Bright Data residential proxy in our setup. Every navigation through that proxy failed instantly, even though the same credentials worked fine through <code>curl</code> from inside the same container. The same RayoBrowse setup worked fine with Geonode. We did not isolate the root cause, so we report RayoBrowse on Geonode only.</p><p>The commercial side is more crowded. </p><p><a href="https://browser-use.com/">Browser Use</a> exists in two flavors, and we tested both. The cloud version is the managed Browser Use, with its own residential proxy, its own stealth fingerprinting, and a fixed set of supported models; we drove it once in raw CDP mode (we steer it ourselves with Playwright) and once in agent mode (we hand the LLM the task in natural language and let it plan the steps). </p><p><a href="https://www.browserbase.com/">Browserbase</a> is a managed Chromium with optional residential proxies, Cloudflare Web Bot Auth verification, and the Stagehand agent SDK. 
We discovered during the test that the free tier excludes proxies entirely; without one, the session egresses from a US datacenter. We left this configuration in the test because it is what a free user would experience.</p><p><a href="https://www.browserless.io">Browserless</a> is a managed browser-as-a-service whose anti-bot story is a stealth path (<code>/chromium/stealth</code>) plus optional residential proxies for paid plans. The free plan caps sessions at 60 seconds, which is tight for a four-step flow. We tested it with the built-in residential proxy targeting France, and tried to test it with our external proxies via the <code>externalProxyServer</code> parameter; the external mode failed at connection time on every run, with what looked like the same Chromium-side authentication failure that broke RayoBrowse, so we dropped those configurations from the comparison.</p><p><a href="https://zenrows.com/">ZenRows</a> Scraping Browser is a managed Chromium with a built-in residential proxy network and built-in CAPTCHA solving; we connect via the WSS endpoint with <code>proxy_country=fr</code> to get a French exit point.</p><p><a href="https://brightdata.com/">Bright Data Browser API</a> sits at the other end of the same product category: a managed Chromium with built-in residential rotation and CAPTCHA solving, on a dedicated Browser API zone we configured on their dashboard.</p><p>As always, the code can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">103.BROWSERS</a>.</strong></p><h2>What we had to fix before the numbers made sense</h2>
      <p>
          <a href="https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Stop Paying for Bandwidth: How to Leverage IPv6 Subnets for Infinite Proxy Rotation]]></title><description><![CDATA[Escape metered residential proxy billing. Discover how to build a self-hosted, rotating proxy gateway using IPv6 /64 subnets to drastically cut your web scraping costs at scale.]]></description><link>https://substack.thewebscraping.club/p/use-ipv6-scraping-nyxproxy</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/use-ipv6-scraping-nyxproxy</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 26 Apr 2026 20:30:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/21b6b18a-a1f6-4511-aec6-c5fc9ba435cd_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p style="text-align: justify;">When your data extraction pipelines scale from a few thousand requests a day to thousands of requests per second, the bottleneck becomes network egress and IP reputation. Modern web architectures are defended by sophisticated Web Application Firewalls (WAFs) that deploy strict rate limiting, fingerprinting, and behavioral analysis.</p><p style="text-align: justify;">This means that if you route all your traffic through a single egress IP, you will be rate-limited in seconds and blacklisted in minutes. To survive at scale, you need to distribute your requests across a massive pool of IP addresses.</p><p style="text-align: justify;">Traditionally, the web scraping industry has solved this issue by relying on commercial proxy providers. However, this is not the only approach. This article answers the following question: &#8220;<em>Is there a way to scrape at scale without burning budget on proxies</em>?&#8221;</p><p style="text-align: justify;">The answer is yes. But let&#8217;s be clear from the beginning: this approach is not a universal silver bullet. 
Let&#8217;s see how it works, how to build it, and what its limitations are.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>The Typical Solution for Scraping at Scale: Proxy Provider Services</h2><p style="text-align: justify;">Let&#8217;s start this discussion with the typical choice for scraping at scale. 
IP bans and rate limits are the most common operational problem in scraping, especially at scale. The standard answer is to route traffic through proxy servers, for a simple reason: <a href="https://substack.thewebscraping.club/i/164246773/what-are-proxies-and-why-are-they-used">proxies act as intermediaries between your scrapers and the Internet</a>, so the target never sees your scrapers&#8217; own IPs and a ban hits the proxy instead. To do so, companies buy proxy IPs from proxy providers. The two most common categories, each with its own flaws, are the following:</p><ul><li><p style="text-align: justify;"><strong>Datacenter proxies:</strong> These are cheap and fast, but their ASNs (Autonomous System Numbers) are heavily scrutinized. WAFs maintain databases of known datacenter CIDR (Classless Inter-Domain Routing) blocks, so hitting a target with a static list of 100 datacenter proxies usually results in those IPs being flagged and blocked within hours.</p></li><li><p style="text-align: justify;"><strong>Residential proxies:</strong> These route traffic through actual consumer devices. They have highly trusted IP reputations, making them excellent for bypassing anti-bot systems. However, they are priced by bandwidth, so they become very expensive when scraping at scale.</p></li></ul><p style="text-align: justify;">The main limitation of this approach is cost. So, what if you need to scrape at scale but don&#8217;t have the budget for it?</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>An Alternative Approach: Scraping at Scale With Dedicated Infrastructure</h2><p style="text-align: justify;">To escape metered billing, you can move egress back to dedicated infrastructure. But before presenting the solution, let&#8217;s briefly look at what happens at the infrastructure level when you buy and use proxies.</p><h3>Buying Proxies Means Delegating Your Infrastructure</h3><p style="text-align: justify;">When you buy proxies from providers, you are delegating 100% of your infrastructure. When your scrapers make requests, under the hood, they connect to the provider&#8217;s gateway, which is a massive load balancer controlled entirely by the provider itself.</p><p style="text-align: justify;">Let&#8217;s consider the case of residential proxies, for simplicity. Behind the gateway is a peer-to-peer (P2P) network of millions of consumer devices that the provider has acquired bandwidth from. When your request hits the gateway, <strong>their proprietary routing algorithm decides which consumer device in which country will act as your final exit node</strong>.</p><p style="text-align: justify;">The moment you route traffic through their gateway is the moment you delegate 100% of your scraping infrastructure.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h3>NyxProxy: The Infrastructural Solution</h3><p style="text-align: justify;"><a href="https://github.com/phanes-io/nyxproxy-oss?tab=readme-ov-file">NyxProxy</a> is a self-hosted HTTP/SOCKS5 proxy server that exploits a well-known IPv6 networking trick: when a cloud provider gives you a <em>/64</em> subnet, you are allocated roughly 18.4 <em>quintillion</em> IPv6 addresses.</p><p style="text-align: justify;">Let&#8217;s explain the number and the trick behind it. An IPv6 address looks like this:</p><pre><code><code>2a05:f480:1800:25db:0000:0000:0000:0001</code></code></pre><p style="text-align: justify;">An IPv6 address is 128 bits long, which gives <em>2^128</em> possible addresses. The number is so large that the designers could reason: <em>&#8220;We can afford to give every organization a massive block and never worry about running out&#8221;</em>.</p><p style="text-align: justify;">Now, here is the trick. With a <em>/64</em>, an IPv6 address is split into two halves, 64 bits each:</p><pre><code><code>2a05:f480:1800:25db : 0000:0000:0000:0001
|___________________|   |_________________|
   Network prefix            Host part
   (your subnet)          (you control this)</code></code></pre><p style="text-align: justify;">The <em>/64</em> notation means: the first 64 bits identify the network, the last 64 bits are yours to assign however you want. The last 64 bits can be any value from <em>0000:0000:0000:0000</em> to <em>ffff:ffff:ffff:ffff</em>. That&#8217;s <em>2^64</em>, or about 18.4 quintillion combinations. All valid addresses, all routable to your server.</p><p style="text-align: justify;">Thanks to this trick, NyxProxy can assign a pool of those addresses to your network interface at startup, then rotate your outgoing traffic across them, so consecutive requests can leave from different source IPs. The tool handles pool management, background rotation, NDP proxying via <em>ndppd</em>, and exposes a monitoring endpoint.</p><p style="text-align: justify;">The key piece is the NDP proxying. When your server uses a random address like <em>2a05:f480:1800:25db:a3f1:9922:beef:1234</em> as a source IP, the upstream router needs to know <em>your server is responsible for that address</em>. Otherwise, the response packets have nowhere to go.</p><p style="text-align: justify;">IPv6 uses NDP (Neighbor Discovery Protocol) for this. The router sends an NDP query: <em>&#8220;who has 2a05:f480:1800:25db:a3f1:9922:beef:1234?&#8221;</em> and your server must answer.</p><p style="text-align: justify;"><em><a href="https://github.com/DanielAdolfsson/ndppd">ndppd</a></em> (NDP Proxy Daemon) runs on your server and answers those queries automatically for your entire /64 subnet, essentially saying <em>&#8220;yes, all of those addresses are mine&#8221;</em>. 
Without it, your packets go out, but responses never come back.</p><p style="text-align: justify;">Below is a summary schema of how this whole process works:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;ac241add-8e8d-40d0-a7df-518bccfc20bc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Provider gives you:  2a05:f480:1800:25db::/64
                     &#8595;
Your server can use: 2a05:f480:1800:25db:[anything]
                     &#8595;
NyxProxy assigns 200 random IPs to your interface
                     &#8595;
Each outgoing request binds to a different one
                     &#8595;
Target sees 200 different source IPs
                     &#8595;
ndppd makes sure responses route back correctly</code></pre></div><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>How To Use NyxProxy</h2><p>Let&#8217;s now see how to use NyxProxy with a practical implementation.</p><h3>Environment Setup &amp; Prerequisites</h3><p style="text-align: justify;">To replicate this tutorial for deploying NyxProxy and utilizing it in your scraping scripts, you must have the following system and hardware requirements:</p><ul><li><p style="text-align: justify;"><strong>Hardware</strong>: A Virtual Private Server (VPS) or bare-metal server with at least 512 MB of RAM and 100 MB of disk space. Supported architectures are <em>amd64</em> or <em>arm64</em>.</p></li><li><p style="text-align: justify;"><strong>Subnet</strong>: A cloud provider that natively delegates a full IPv6 <em>/64</em> subnet to your network interface. 
Note that not all the VPS providers are supported: Check out the <a href="https://github.com/phanes-io/nyxproxy-oss?tab=readme-ov-file#network-requirements">NyxProxy documentation to learn more about supported VPSs</a>.</p></li><li><p style="text-align: justify;"><strong>Operating system</strong>: A modern Linux distribution, specifically Ubuntu or Debian, to ensure compatibility with the automated setup scripts and <em>sysctl</em> kernel modifications.</p></li><li><p style="text-align: justify;"><strong>Python</strong>: <a href="https://www.python.org/downloads/">Python 3.7 or higher</a> installed on your local machine to run the scraping scripts.</p></li></ul><p style="text-align: justify;">To get your server ready to run the proxy daemon, you need to verify your IPv6 setup and gain root access. Ensure you are logged into your VPS via SSH as the <em>root</em> user, or have <em>sudo</em> privileges.</p><p style="text-align: justify;">First, verify that your server has a globally routable IPv6 <em>/64</em> subnet assigned to it. You can check this by running the following command in your server&#8217;s terminal:</p><pre><code><code>ip -6 addr show | grep "scope global"</code></code></pre><p>If done correctly, you should see an output similar to the following:</p><pre><code><code>inet6 2a05:f480:1800:25db::1/64 scope global</code></code></pre><p>If you do not see a <em>/64</em> subnet, you will not be able to rotate IPs, and you must review your cloud provider&#8217;s network settings.</p><p>Next, prepare your local development environment. Suppose you call the main folder of your Python project <em>nyxproxy_scraper/</em>. At the end of this step, the folder will have the following structure:</p><pre><code><code>nyxproxy_scraper/
    &#9500;&#9472;&#9472; main.py
    &#9492;&#9472;&#9472; venv/</code></code></pre><p>Where:</p><ul><li><p><em>main.py</em> is the Python file that will store your proxy request logic.</p></li><li><p><em>venv/</em> contains the standard Python virtual environment.</p></li></ul><p>You can create the <em>venv/</em> <a href="https://docs.python.org/3/library/venv.html">virtual environment</a> directory like so:</p><pre><code><code>python -m venv venv</code></code></pre><p>To activate it, on Windows, run:</p><pre><code><code>venv\Scripts\activate</code></code></pre><p>Equivalently, on macOS and Linux, execute:</p><pre><code><code>source venv/bin/activate</code></code></pre><p>As a final prerequisite, install the <a href="https://requests.readthedocs.io/en/latest/">Requests library</a> in your activated virtual environment so your Python script can make HTTP calls:</p><pre><code><code>pip install requests</code></code></pre><p>Well done! You are now ready to install and use NyxProxy.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3><strong>Installing and Configuring NyxProxy</strong></h3><p style="text-align: justify;">NyxProxy provides a quick setup script that handles the infrastructural heavy lifting. 
It auto-detects your network interface, installs <em>ndppd</em>, tweaks the Linux kernel parameters via <em>sysctl</em> to allow non-local binding, and downloads the compiled Go binary.</p><p style="text-align: justify;">You can launch it with the following single command:</p><pre><code><code>wget https://raw.githubusercontent.com/jannik-schroeder/nyxproxy-oss/main/scripts/quick-setup.sh &amp;&amp; chmod +x quick-setup.sh &amp;&amp; sudo ./quick-setup.sh</code></code></pre><p style="text-align: justify;">During the setup, you will be prompted to configure your proxy credentials and set your rotation rules. Behind the scenes, the script generates a <em>config.yaml</em> file. Let&#8217;s look at the crucial subset of that configuration:</p><pre><code><code>network:
  rotate_ipv6: true
  ipv6_subnet: "2a05:f480:1800:25db::/64"

  # The rotation mechanics:
  ipv6_pool_size: 200
  ipv6_max_usage: 100
  ipv6_max_age: 30</code></code></pre><p style="text-align: justify;">Below is an explanation of what these three parameters mean for your scraping pipeline:</p><ul><li><p style="text-align: justify;"><em>ipv6_pool_size</em>: NyxProxy keeps 200 distinct IPs &#8220;hot&#8221; and bound to your network interface at any given time. This keeps proxy startup times under 100ms while maintaining IP diversity.</p></li><li><p style="text-align: justify;"><em>ipv6_max_usage</em>: After a specific IP has been used for 100 requests, it is considered &#8220;burned.&#8221; NyxProxy destroys the route and spins up a fresh address to replace it.</p></li><li><p style="text-align: justify;"><em>ipv6_max_age</em>: If an IP hasn&#8217;t hit 100 requests but has been alive for 30 minutes, it gets rotated out anyway. This limits time-based tracking by the target WAF.</p></li></ul><p style="text-align: justify;">Once the daemon is running as a systemd service, your VPS is acting as a rotating proxy gateway. When NyxProxy receives a scraper request, the underlying Go binary takes over. It picks one of the 200 IPv6 addresses currently in its pool and binds to that specific address to establish the outbound connection.</p><p>On startup, the daemon should log output similar to the following:</p><pre><code><code>IPv6 rotation mode: IP Pool with dynamic rotation
  Interface: enp1s0
  Subnet: 2a05:f480:1800:25db::/64
  Pool size: 200 IPs
  Rotation: Every 100 uses or 30m0s
  Initializing IP pool...
  Progress: 50/200 IPs added
  Progress: 100/200 IPs added
  Progress: 150/200 IPs added
  Progress: 200/200 IPs added
  IP pool ready with 200 addresses
  Background IP rotation started

Starting https proxy on 0.0.0.0:8080 (Protocol: IPv6)</code></code></pre><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3><strong>Testing the Proxy Logic</strong></h3><p style="text-align: justify;">At this point, NyxProxy has done its job. To verify it works correctly, you can use the following Python script that hits <em><a href="https://www.ipify.org/">api6.ipify.org</a></em>, which is an API that simply bounces back the IP address it sees:</p><pre><code><code>import requests

# Point this to your VPS IP and the credentials you set during setup
proxies = {
    'http': 'http://admin:password@your-vps-ip:8080',
    'https': 'http://admin:password@your-vps-ip:8080'
}

# Test 5 consecutive scraping requests
for i in range(5):
    response = requests.get('https://api6.ipify.org', proxies=proxies, timeout=10)
    print(f"Request {i+1}: Target sees IP -&gt; {response.text}")
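
# Optional sanity check: a minimal sketch, not part of NyxProxy itself.
# in_subnet is a hypothetical helper; the default subnet is the example
# value from config.yaml above, so replace it with your own /64.
import ipaddress

def in_subnet(ip_text, subnet="2a05:f480:1800:25db::/64"):
    """Return True if ip_text is an address inside subnet."""
    return ipaddress.ip_address(ip_text.strip()) in ipaddress.ip_network(subnet)

# Example: call in_subnet(response.text) inside the loop above; a False result
# suggests the request did not leave through an address from your pool.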
</code></code></pre><p style="text-align: justify;">(NOTE: If you are already familiar with ipify.org, note that the &#8220;api6&#8221; prefix can be used for IPv6 requests only.)</p><p>The result should be similar to the following:</p><pre><code><code>Request 1: Target sees IP -&gt; 2a05:f480:1800:25db:1a2b:3c4d:5e6f:7890
Request 2: Target sees IP -&gt; 2a05:f480:1800:25db:9988:7766:5544:3322
Request 3: Target sees IP -&gt; 2a05:f480:1800:25db:aaaa:bbbb:cccc:dddd
Request 4: Target sees IP -&gt; 2a05:f480:1800:25db:1122:3344:5566:7788
Request 5: Target sees IP -&gt; 2a05:f480:1800:25db:dead:beef:cafe:babe</code></code></pre><p style="text-align: justify;">This shows that every single HTTP request uses a different, globally routable IPv6 address generated from your subnet block. To a target that only tracks individual addresses, these look like distinct clients, even though they all share the same <em>/64</em> prefix.</p><p style="text-align: justify;">Perfect! You have successfully built a self-healing, rotating proxy pool without handing over your budget for metered residential bandwidth.</p><h2>The Illusion of Infinity: Critical Limitations of IPv6 Subnet Rotation</h2><p style="text-align: justify;">At this point, you may think you have found a solution to all of your budgeting problems for scraping at scale. But before you tear down your commercial proxy infrastructure, you must understand that a $5/month VPS and an open-source rotation daemon are not a universal silver bullet. If it were that simple, the commercial proxy industry would not exist.</p><p>This architecture has the following main limitations:</p><ul><li><p style="text-align: justify;"><strong>The IPv4 compatibility wall:</strong> This entire architecture is built on one absolute prerequisite: your target endpoint must support IPv6. If you are scraping legacy enterprise systems or platforms that haven&#8217;t migrated to dual-stack networking, this setup is useless. You cannot route an IPv6 packet to an IPv4-only server.</p></li><li><p style="text-align: justify;"><strong>Subnet-level bans (</strong><em><strong>/64</strong></em><strong> prefix blocking):</strong> Enterprise WAFs are fully aware of IPv6 prefix delegation standards. They know that hosting providers allocate a <em>/64</em> subnet to a single client. 
If their heuristics detect suspicious behavioral patterns at high concurrency (like missing <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">browser fingerprints</a> or anomalous TLS handshakes) originating from <em>2a05:f480...:1a2b</em>, they can ban the entire <em>/64</em> CIDR block. Once your <em>/64</em> prefix is banned, all 18 quintillion of your &#8220;infinite&#8221; IPs are simultaneously dead. To recover, you must destroy the VPS and provision a new one in a different IP range.</p></li><li><p style="text-align: justify;"><strong>ASN reputation:</strong> No matter how many IPs you rotate, your traffic still originates from a datacenter Autonomous System Number (ASN). Target firewalls assign a baseline trust score to every ASN. Traffic originating from a datacenter ASN typically starts with a much lower trust score than traffic from a residential ASN. For highly restrictive targets, any request from a datacenter IP is instantly met with an unpassable CAPTCHA or a hard <em>403 Forbidden</em>, regardless of whether it&#8217;s IPv4 or IPv6.</p></li><li><p style="text-align: justify;"><em>nf_conntrack</em><strong> and hardware exhaustion:</strong> You cannot push enterprise-grade throughput on a $5, 1-vCPU server without consequence. Rotating thousands of IPv6 addresses requires the Linux kernel to aggressively maintain the <em><a href="https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.txt">nf_conntrack</a></em> table and the NDP proxy table. At high concurrency, the overhead of establishing, tracking, and tearing down thousands of TCP sockets across rotating addresses can exhaust the memory or CPU of a low-tier VPS. 
The kernel will begin dropping packets natively, your latency will spike to useless levels, and your scrapers will be greeted with errors.</p></li></ul><h2>Conclusion</h2><p style="text-align: justify;">In this article, you learned how to leverage your hosting provider&#8217;s IPv6 <em>/64</em> subnets to build an infinitely rotating proxy pool with NyxProxy, escaping the metered billing of residential proxy networks.</p><p style="text-align: justify;">The competitive advantage of engineering your own proxy infrastructure is in your unit economics and architectural control. However, you also learned that this solution is not a universal silver bullet for every scraping scenario: It comes with trade-offs and constraints.</p><p style="text-align: justify;">So, let us know: Have you already experimented with bare-metal IPv6 rotation for your scraping pipelines? What targets did it work best for? Let&#8217;s discuss in the comments!</p>]]></content:encoded></item><item><title><![CDATA[The Trick to Scrape Next.js Websites in Seconds]]></title><description><![CDATA[Scraping data from the most widely used full-stack framework in the world with just 3 lines of code!]]></description><link>https://substack.thewebscraping.club/p/scrape-nextjs-websites</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/scrape-nextjs-websites</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 19 Apr 2026 19:18:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/17ee7337-9a3d-445a-a255-2895a6ed8235_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Next.js is one of the most widely adopted full-stack JavaScript frameworks on the planet. If you&#8217;ve ever built or deployed a web app, you definitely know it&#8212;or at least you&#8217;ve heard of it.</p><p>Behind the scenes, it relies on hydration to make server-rendered pages interactive. 
And here&#8217;s the interesting part: the same mechanism that makes Next.js fast and popular also exposes a significant amount of structured data in the HTML sent by the server. From a scraping perspective, that&#8217;s a huge opportunity!</p><p>In this post, I&#8217;ll show you a simple trick to scrape data from virtually any Next.js website. Follow along as I break down how it works and how you can apply it yourself.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Next.js in Numbers</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1F7B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1F7B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 424w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 848w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1272w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png" width="1456" height="1065" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1065,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Next.js&#8217; GitHub star growth&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Next.js&#8217; GitHub star growth" title="Next.js&#8217; GitHub star growth" srcset="https://substackcdn.com/image/fetch/$s_!1F7B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 424w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 848w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1272w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Next.js&#8217; GitHub star growth</figcaption></figure></div><p>Next.js needs no introduction, but it&#8217;s worth giving some context to truly understand how popular it is (<em>and therefore how useful the trick I&#8217;m about to present for Next.js web scraping can be</em>):</p><ul><li><p>According to the <a href="https://survey.stackoverflow.co/2025/">2025 Stack Overflow Developer Survey</a>, 20.8% of respondents used Next.js extensively over the past year.</p></li><li><p>Next.js is the 14th largest repository on GitHub, with <a href="https://github.com/vercel/next.js">over 138k stars</a> (and still growing!).</p></li><li><p><a 
href="https://w3techs.com/technologies/overview/javascript_library">According to W3Techs</a>, Next.js has a 2.9% market share among JavaScript libraries.</p></li><li><p>Major brands such as <a href="https://nextjs.org/showcase">Nike, Stripe, and Notion have chosen this full-stack framework</a> to build their official websites.</p></li></ul><h2>Before Getting Started: A Bit of Context on Hydration</h2><p>I know you probably just want the trick&#8230; Still, let me take a minute to explain why it works in the first place, why it&#8217;s even possible, and what kind of data you&#8217;ll actually retrieve with it!</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><h3>What Is Hydration?</h3><p><a href="https://en.wikipedia.org/wiki/Hydration_(web_development)">Hydration</a> is the process that makes a server-rendered page interactive in the browser.</p><p>Frameworks like Next.js, Remix, Nuxt, and SvelteKit employ this mechanism to combine the performance benefits of <a href="https://nextjs.org/docs/pages/building-your-application/rendering/server-side-rendering">server-side rendering (SSR)</a> with the interactivity of client-side applications.</p><p>The idea is that the server first sends fully rendered static HTML to the browser. 
Hydration happens next.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jt2I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 424w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 848w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1272w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png" width="1227" height="633" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:633,&quot;width&quot;:1227,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)" title="The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)" srcset="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 424w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 848w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)</figcaption></figure></div><p>The browser downloads the JavaScript bundle, and the frontend framework reconstructs the component tree in memory, attaches event 
listeners, and links that virtual tree to the existing DOM instead of re-rendering it from scratch. The result is a fully interactive application built on top of server-rendered HTML.</p><h3>How Does the Hydration Mechanism Work?</h3><p>It&#8217;s now clear that in Next.js and similar frameworks, hydration is the process where a static, server-rendered HTML page &#8220;comes to life&#8221; and becomes fully interactive in the browser. But what&#8217;s actually happening under the hood?</p><p>At a high level, hydration is a 3-step process:</p><ol><li><p>The server generates and sends a fully rendered HTML snapshot. The user immediately sees the content (great for <a href="https://web.dev/articles/fcp">First Contentful Paint</a>). At this point, though, the page is just static HTML. Buttons, forms, and other interactive elements are visible, but they don&#8217;t work yet because no JavaScript is attached.</p></li><li><p>The client&#8217;s browser downloads the JavaScript bundle (which includes React and your frontend application code) and executes it.</p></li><li><p>React rebuilds the component tree in memory and attaches event listeners to the existing DOM nodes. Instead of discarding the HTML and re-rendering everything from scratch, React &#8220;hydrates&#8221; the existing markup, meaning it reuses it and wires it up with state and interactivity.</p></li></ol><p>Once hydration completes, the page behaves like a normal single-page application: it responds to clicks, manages state, and updates dynamically.</p><p>And here&#8217;s an important detail: if the browser doesn&#8217;t support JavaScript (or it fails to load), the user still sees the server-rendered HTML. It won&#8217;t be interactive, but the core content is there. 
That&#8217;s great for SEO and perceived performance!</p><h3>Why It Matters for Scraping Next.js (and Other Full-Stack Frameworks&#8230;)</h3><p>The key insight you need to understand is simple: <strong>hydration requires data</strong>, and that data must be embedded somewhere in the HTML sent by the server!</p><p>In Next.js, when the server renders a page, it doesn&#8217;t only send markup. It also serializes the data required to rebuild the React component tree on the client. That serialized payload is embedded directly into the page&#8217;s HTML.</p><p>That&#8217;s exactly why hydration matters for scraping. Instead of parsing the DOM or simulating user interactions through <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browser automation</a>, you can extract the structured data that React itself uses to hydrate the page.</p><p>In many cases, hydration data is cleaner and easier to parse than the rendered HTML. It can also contain more information than what&#8217;s visibly displayed on the page, including hidden and interesting metadata.</p><p>Keep in mind that this principle applies not only to Next.js! All other full-stack frameworks that rely on hydration, such as Remix, Nuxt, Angular Universal, and SvelteKit, tend to dehydrate state on the server and rehydrate it on the client.</p><p>So remember this simple rule. If a framework hydrates, it must serialize data. 
And if it serializes data into the HTML, you can scrape it.</p><h2>How to Scrape Next.js Websites: 2 Approaches</h2><p>How you scrape Next.js hydration data depends on how that data is embedded in the server-generated HTML.</p><p>I won&#8217;t go too deep into framework internals here (if you&#8217;re a Next.js dev, you already know things shift depending on whether you&#8217;re using the <em><a href="https://nextjs.org/docs/app/getting-started">App Router</a></em> or the <em><a href="https://nextjs.org/docs/pages/getting-started">Pages Router</a></em>), but there are essentially two scenarios you&#8217;ll run into.</p><p>In this section, I&#8217;ll walk through both of them and show you exactly how I retrieve data from each!</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Approach #1: Target the __NEXT_DATA__ Script</h3><p>As a target, I&#8217;ll use a <a href="https://www.nike.com/t/air-jordan-5-retro-wolf-grey-mens-shoes-0M9kM1yX/DD0587-002">Nike product page</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uJcE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!uJcE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 424w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 848w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png" width="1456" height="785" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target Nike page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target Nike page" title="The target Nike page" 
srcset="https://substackcdn.com/image/fetch/$s_!uJcE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 424w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 848w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The target Nike page</figcaption></figure></div><p>That&#8217;s actually a great example because Nike.com is even showcased on the Next.js homepage as a real-world site built with the framework.</p><p>Now, right-click on the page and select the &#8220;Inspect&#8221; option in your browser to open the DevTools. Scroll through the DOM and get familiar with the page structure. If the Next.js site is using the <em>Pages Router</em>, you&#8217;ll notice a <em>&lt;script&gt;</em> tag with the id <em>__NEXT_DATA__</em> containing a large JSON blob:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rV1e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rV1e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 848w, 
https://substackcdn.com/image/fetch/$s_!rV1e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png" width="1456" height="1260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1260,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the JSON data inside the #__NEXT_DATA__ element&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the JSON data inside the #__NEXT_DATA__ element" title="Note the JSON data inside the #__NEXT_DATA__ element" srcset="https://substackcdn.com/image/fetch/$s_!rV1e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 848w, 
https://substackcdn.com/image/fetch/$s_!rV1e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the JSON data inside the #__NEXT_DATA__ element</figcaption></figure></div><p>That JSON data is precisely the 
hydration data I was referring to earlier.</p><p>When a site uses the Pages Router approach in Next.js, the server embeds all the page data directly into that <em>&lt;script&gt;</em> tag. From a scraping perspective, that&#8217;s gold, as the data is already structured and ready to be captured.</p><p>Here&#8217;s a simple JavaScript snippet to extract it:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const hydrationScript = document.querySelector("#__NEXT_DATA__")
const hydrationData = JSON.parse(hydrationScript.innerHTML)
console.log(hydrationData)</code></pre></div><p>What&#8217;s happening here is straightforward. The JS script:</p><ul><li><p>Selects the <em>&lt;script&gt;</em> element with <em>id</em> <em>__NEXT_DATA__</em>.</p></li><li><p>Reads its inner HTML (which is a JSON string).</p></li><li><p>Parses it into a JavaScript object.</p></li><li><p>Logs it to the console.</p></li></ul><p>Run this directly in the DevTools Console, and you&#8217;ll immediately see the result:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2AK7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2AK7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 848w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png" width="1456" height="1260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dc28944-8842-4605-be59-b746fef469db_1746x1511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1260,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2AK7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 848w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the structured hydration data</figcaption></figure></div><p>What&#8217;s interesting is how much structured data you get right away. This includes product details, images, metadata, and more. 
Everything is neatly organized, and it only took three lines of code!</p><p>If you want to store the JSON hydration object, just right-click the object in the Console and select the &#8220;Copy object&#8221; option:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m1uv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m1uv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 424w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 848w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1272w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png" width="1456" height="483" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Selecting the &#8220;Copy object&#8221; option&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Selecting the &#8220;Copy object&#8221; option" title="Selecting the &#8220;Copy object&#8221; option" srcset="https://substackcdn.com/image/fetch/$s_!m1uv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 424w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 848w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1272w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Selecting the &#8220;Copy object&#8221; option</figcaption></figure></div><p>From there, you can paste it wherever you need (e.g., into a local <em>.json</em> file, a MongoDB collection, etc.).</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Approach #2: Target the self.__next_f.push() Elements</h3><p>Another, more complex approach to scraping Next.js involves pages built with the <em>App Router</em>.</p><p>Even though the <em>App 
Router</em> has been the recommended direction for a while, in my experience, it&#8217;s still not as widely adopted as the <em>Pages Router</em>. And honestly, that&#8217;s a bit of a gift for us (as scraping hydration data in <em>App Router</em> sites is definitely more complex!)</p><p>As a reference, let&#8217;s look at the &#8220;<a href="https://openai.com/business/">Business Overview</a>&#8221; page on the OpenAI website, which is built with Next.js <em>App Router</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CAEI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CAEI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 424w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 848w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1272w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png" width="1456" height="709" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:709,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target page" title="The target page" srcset="https://substackcdn.com/image/fetch/$s_!CAEI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 424w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 848w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1272w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The target page</figcaption></figure></div><p>Just like before, open DevTools and inspect the page. 
This time, focus on the <em>&lt;script&gt;</em> tags inside the <em>&lt;body&gt;</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LTkB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LTkB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png" width="1456" height="1182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the hydration script elements&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the hydration script elements" title="Note the hydration script elements" srcset="https://substackcdn.com/image/fetch/$s_!LTkB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the hydration script elements</figcaption></figure></div><p>You&#8217;ll notice several scripts containing content like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">self.__next_f.push(&lt;some_data&gt;)</code></pre></div><p>That &#8220;<em>&lt;some_data&gt;</em>&#8221; is serialized using the <a href="https://tonyalicea.dev/blog/understanding-react-server-components/">React Flight protocol for React Server Components (RSC)</a>. I won&#8217;t go too deep into the internals here (it&#8217;s a dense topic!), but what matters is that <strong>deserializing that data is </strong><em><strong>not</strong></em><strong> straightforward!</strong></p><p>React Flight isn&#8217;t plain JSON. 
It mixes control records (<em>HL</em>, <em>I</em>, <em>J</em>, etc.), module references, streaming boundaries, and serialized model fragments into a transport format that React incrementally resolves at runtime.</p><p>You might think: &#8220;Why not just reuse the frontend deserialization library?&#8221; In practice, that doesn&#8217;t work well because:</p><ul><li><p>The client decoder (<em><a href="https://www.npmjs.com/package/react-server-dom-webpack">react-server-dom-webpack</a></em>) expects a full React runtime.</p></li><li><p>It relies on module maps and webpack IDs generated at build time.</p></li><li><p>It resolves component references against the exact bundle that produced the stream.</p></li><li><p>It assumes streaming semantics and internal React wiring.</p></li></ul><p>Basically, outside that exact environment, you don&#8217;t have the module graph, build manifest, or hydration context. So even if you import the decoder, you can&#8217;t reconstruct the component tree the way the browser does.</p><p>Recent security issues in React Flight payload deserialization highlight just how sensitive and complex this layer is. For more details, refer to:</p><ul><li><p><em><a href="https://nextjs.org/blog/CVE-2025-66478">Security Advisory: CVE-2025-66478</a></em></p></li><li><p><em><a href="https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components">Critical Security Vulnerability in React Server Components</a></em></p></li></ul><p>Thus, instead of fighting the protocol, I&#8217;d keep things simple and accept that, in this case, it&#8217;s better to extract the raw, unparsed React Flight strings. 
You can do that with the JS script below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const nextFlightScripts = [...document.querySelectorAll("script")]
  .filter(script =&gt; script.textContent.includes("self.__next_f"))
  .map(script =&gt; script.textContent.trim())
console.log(nextFlightScripts)</code></pre></div><p>This selects all <em>&lt;script&gt;</em> elements containing &#8220;self.__next_f&#8221; and builds an array of their raw contents.</p><p>Run it in the Console, and you&#8217;ll get an array of React Flight chunks:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LBAG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LBAG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png" width="1456" height="1182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the React Flight strings&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the React Flight strings" title="Note the React Flight strings" srcset="https://substackcdn.com/image/fetch/$s_!LBAG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the React Flight strings</figcaption></figure></div><p>From there, the simplest way to extract structured data is often to copy the array, feed it to an AI, and ask it to reconstruct a parsed JSON representation of the meaningful payload sections:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!08ee!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!08ee!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 424w, https://substackcdn.com/image/fetch/$s_!08ee!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 848w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1272w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png" width="1456" height="973" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:973,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the parsed version of the source data produced by Gemini&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the parsed version of the source data produced by Gemini" title="Note the parsed version of the source 
data produced by Gemini" srcset="https://substackcdn.com/image/fetch/$s_!08ee!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 424w, https://substackcdn.com/image/fetch/$s_!08ee!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 848w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1272w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the parsed version of the source data produced by Gemini</figcaption></figure></div><p>Is this more complicated than the <em>__NEXT_DATA__</em> trick? Absolutely! Yet it&#8217;s still a powerful way to access a large amount of page data with just a few lines of code.</p><h2>Final Script to Quickly Access Data From Next.js Sites</h2><p>If you combine the two approaches, you get a single script that checks a Next.js page for both kinds of hydration data:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// Pages Router approach (__NEXT_DATA__)
const hydrationScript = document.querySelector("#__NEXT_DATA__")
let nextData = null
if (hydrationScript) {
  try {
    nextData = JSON.parse(hydrationScript.textContent)
    console.log("__NEXT_DATA__ found:")
    console.log(nextData)
  } catch (err) {
    console.warn("Failed to parse __NEXT_DATA__:", err)
  }
} else {
  console.log("No __NEXT_DATA__ script found.")
}

// App Router approach (self.__next_f)
const nextFlightScripts = [...document.querySelectorAll("script")]
  .map(script =&gt; script.textContent.trim())
  .filter(content =&gt; content.includes("self.__next_f.push"))
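
// Optional sketch, not part of the original trick: join the string payloads
// carried by the Flight scripts into one text blob that is easier to inspect.
// Assumption: each script makes a single self.__next_f.push(...) call whose
// argument is a JSON-serializable array such as [1, "payload"].
// decodeFlightChunks is a hypothetical helper name.
function decodeFlightChunks(scripts) {
  const marker = "self.__next_f.push("
  return scripts
    .map(function (content) {
      const start = content.indexOf(marker)
      const end = content.lastIndexOf(")")
      if (start === -1 || end === -1) return ""
      try {
        // The text between "push(" and the last ")" should be the array literal
        const chunk = JSON.parse(content.slice(start + marker.length, end))
        // Keep only string payloads; any other chunk shape is skipped
        return typeof chunk[1] === "string" ? chunk[1] : ""
      } catch (err) {
        // The argument was not plain JSON after all: skip this script
        return ""
      }
    })
    .join("")
}
// Usage: console.log(decodeFlightChunks(nextFlightScripts))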

if (nextFlightScripts.length &gt; 0) {
  console.log("React Flight scripts found:")
  console.log(nextFlightScripts)
} else {
  console.log("No React Flight scripts found.")
}</code></pre></div><p>To test it, just open the Console in DevTools, paste the script, and run it.</p><p><strong>Important</strong>: The <em>&lt;script&gt;</em> elements containing hydration data aren&#8217;t injected dynamically by client-side rendering. They&#8217;re embedded directly in the HTML generated by the server.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Km-I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Km-I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png" width="1456" height="865" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Km-I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the #__NEXT_DATA__ element in the page source</figcaption></figure></div><p>That means you can:</p><ol><li><p>Fetch the target Next.js-powered page with an HTTP client.</p></li><li><p>Parse the HTML using an HTML parsing library like Beautiful Soup or Cheerio.</p></li><li><p>Apply the same logic as the JavaScript snippet above, adapted to the API of your HTML parser.</p></li></ol><p>In other words, this trick for scraping Next.js isn&#8217;t limited to the browser DevTools. 
It also works perfectly in regular scraping scripts!</p><h2>Pros and Cons of This Approach to Next.js Scraping</h2><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Simple and effective, requiring only a few lines of code.</p></li><li><p>Works on virtually any Next.js website (and, more generally, on most sites that rely on hydration).</p></li><li><p>Can expose more data than is actually displayed on the page.</p></li><li><p>No need for browser automation, waiting for client-side rendering, or simulating user interactions.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>You may only get partial data, meaning you might still need to complement it with a more traditional scraping approach.</p></li><li><p>React Flight data is difficult to parse and may require custom logic or even <a href="https://substack.thewebscraping.club/p/llms-ai-web-scraping">AI-assisted parsing</a>.</p></li></ul><div><hr></div><h2>Conclusion</h2><p>Here, I&#8217;ve shared <a href="https://brightdata.com/blog/how-tos/web-scraping-with-next-js">a trick I personally documented years ago</a> that still works to this day. 
It allows you to quickly scrape data from virtually any Next.js site by targeting the hydration data embedded in the HTML document generated by the server and sent to the client for rendering.</p><p>As you&#8217;ve seen, with just a few lines of JavaScript, you can extract hydration data from any Next.js-powered page. What you get back is clean, or at least almost clean, data that you can process directly in your data pipelines.</p><p>Instead of fighting the frontend, this Next.js web scraping approach helps you leverage the data the framework itself needs to function!</p><p>I hope you found this useful and insightful. If you have questions or thoughts, feel free to share them in the comments below!</p>]]></content:encoded></item></channel></rss>