<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Web Scraping Club]]></title><description><![CDATA[News, solutions and interviews about web scraping.
In this substack you will find weekly content about:
- Web Scraping techniques
- Interviews with key people in the industry
- Anti-bot information and countermeasures
- Real world examples and code]]></description><link>https://substack.thewebscraping.club</link><image><url>https://substackcdn.com/image/fetch/$s_!gJt2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1e343ec9-7946-4440-8c00-57209a1d99a1_1024x1024.png</url><title>The Web Scraping Club</title><link>https://substack.thewebscraping.club</link></image><generator>Substack</generator><lastBuildDate>Mon, 25 May 2026 17:41:46 GMT</lastBuildDate><atom:link href="https://substack.thewebscraping.club/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Web Scraping Club SRL]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[pier@thewebscraping.club]]></webMaster><itunes:owner><itunes:email><![CDATA[pier@thewebscraping.club]]></itunes:email><itunes:name><![CDATA[Pierluigi Vinciguerra]]></itunes:name></itunes:owner><itunes:author><![CDATA[Pierluigi Vinciguerra]]></itunes:author><googleplay:owner><![CDATA[pier@thewebscraping.club]]></googleplay:owner><googleplay:email><![CDATA[pier@thewebscraping.club]]></googleplay:email><googleplay:author><![CDATA[Pierluigi Vinciguerra]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How to Scrape Open-Source Datasets Ethically]]></title><description><![CDATA[How to collect open data responsibly, without breaking rules or burning bridges]]></description><link>https://substack.thewebscraping.club/p/how-to-scrape-open-source-datasets</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-scrape-open-source-datasets</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 24 May 2026 19:58:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1ef53778-bd4a-4fd8-9911-912fc9f8ea67_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you need to scrape data from the web, 
&#8220;open data&#8221; and &#8220;open-source datasets&#8221; sound like a green light. No paywall, no login, no restrictions: just data sitting there, ready to be collected. It is a reasonable assumption, right?</p><p>Well, not so fast.</p><p>Open data does not automatically mean free to use, free to redistribute, or free from privacy obligations. And scraping it without thinking through the implications can land you in legal trouble, get your IP banned from a public infrastructure that was never designed to handle aggressive crawlers, or cause you to expose people&#8217;s personal information.</p><p>In this article, we will go through a complete picture of the &#8220;open data&#8221; world: what the problem actually is, how to approach it correctly, and how to implement responsible open data scrapers in Python. </p><p>Let&#8217;s dive into it!</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank <strong>NetNut</strong>, the platinum partner of the month. 
Their set of solutions cover all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>What &#8220;Open Data&#8221; Actually Means Legally, Ethically, and Practically</strong></h2><p>&#8220;Open&#8221; is one of the most overloaded words in the data world. Depending on the license, the jurisdiction, and the type of data involved, the same publicly accessible dataset can be freely redistributable, commercially restricted, privacy-sensitive, or legally off-limits entirely. </p><p>So, before anything else, let&#8217;s establish what you are actually dealing with.</p><h3>What &#8220;Open-Source Dataset&#8221; Actually Means (and What It Doesn&#8217;t)</h3><p>Where a dataset sits on the licensing spectrum determines everything: whether you can redistribute it, whether you can use it commercially, and whether collecting it at all exposes you to liability. Here is how the spectrum breaks down:</p><ul><li><p><strong>CC0</strong> (Creative Commons Zero): Essentially, it is a public domain dedication. The author waives all rights. You can scrape it, redistribute it, use it commercially, and modify it.</p></li><li><p><strong>CC-BY</strong> (Creative Commons Attribution): It requires you to credit the original source. This means you must clearly state where the data came from, who created it, and link back to the original when you publish or redistribute it. 
This is the most permissive license after CC0, and it is generally easy to comply with.</p></li><li><p><strong>CC-BY-SA</strong> (Share-Alike): This carries the same attribution requirement as CC-BY, but adds a condition: any derivative work you publish must carry the same license. In practice, this means you cannot fold a CC-BY-SA dataset into a proprietary product and lock it down.</p></li><li><p><strong>CC-BY-NC</strong> (Non-Commercial): It also requires attribution, but restricts commercial use entirely. You can use the data for research, journalism, or personal projects, but the moment money is involved, you need a separate agreement with the data owner.</p></li><li><p><strong>ODbL</strong> (Open Database License), used by OpenStreetMap: It requires both attribution and share-alike, specifically for databases. It is worth noting that ODbL distinguishes between the database itself and the contents. Basically, you can use individual facts freely, but any public use of the database as a whole must comply with the license terms.</p></li></ul><p>And then there is the grey zone, which is where most scraping engineers actually operate: data that is publicly accessible but carries no explicit license. Common cases are government portals, academic repositories, open court records, and municipal datasets. This is a huge portion of what people call &#8220;open data&#8221;. And here is the thing that matters for scraping professionals: <strong>no license does not mean free to use</strong>. In most jurisdictions, the absence of a license means the default copyright law applies. 
Which means the creator reserves all rights.</p><p>So before you write a single line of scraper code, the first question is not <em>&#8220;Can I access this?&#8221;</em> but <em>&#8220;Under what terms am I allowed to use what I access?&#8221;</em></p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" 
width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo 
Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3>Where the Ethical (and Legal) Risks Hide</h3><p>Once you have cleared the license question, there are still several risk areas that are easy to overlook:</p><ul><li><p><strong>License violations</strong>: This is the most obvious one. If a dataset requires attribution and you redistribute it without crediting the source, you are in breach. If it has a non-commercial clause and you use it in a commercial product, the same applies. These are the kinds of things that generate cease-and-desist letters.</p></li><li><p><strong>PII embedded in &#8220;open&#8221; datasets</strong>: This is a subtler and arguably more dangerous problem than license violations. Consider open court records: they are public by design, but they contain names, addresses, and sometimes sensitive personal details. Census microdata, even when anonymized, consists of individual-level records that may be re-identifiable. Similarly, GitHub commit history is public, but it contains email addresses, which are personal data. So, the fact that data was made public by someone else does not strip it of its privacy implications when you collect, aggregate, and store it.</p></li><li><p><strong>Jurisdictional complexity</strong>: A dataset hosted on a European government portal can carry GDPR obligations even if you are scraping it from the United States. The GDPR&#8217;s reach depends largely on where the data subjects are located, not on where the scraper is running. If you are collecting personal data about people in the EU, you may well be in GDPR territory regardless of your own geography.</p></li><li><p><strong>The aggregation problem</strong>: This is probably one of the most underappreciated risks in the scraping industry. 
Individually, a dataset of names, a dataset of addresses, and a dataset of employment records might each be harmless and openly licensed. But combine them, and you have created a detailed profile of real people. This is something that privacy regulations were specifically designed to prevent.</p></li></ul><h3>The Infrastructure Problem: Open Data Portals Are Not Built for Scrapers</h3><p>Many scraping engineers come to open data with habits built on commercial targets. That experience can be misleading, because the infrastructure behind open data portals is completely different.</p><p><a href="https://substack.thewebscraping.club/p/sentiment-analysis-product-reviews">When you scrape a large e-commerce website</a> or a <a href="https://substack.thewebscraping.club/p/scraping-linkedin-public-data">major social media platform</a>, you are hitting servers that are engineered to handle millions of requests per day, backed by CDNs, load balancers, and dedicated anti-bot teams. In other words, they can take a hit.</p><p>A municipal open data portal, a university&#8217;s research repository, or a small NGO&#8217;s dataset hosting is an entirely different story: these services often run on modest, underfunded infrastructure. A scraper that would barely register as noise on Amazon&#8217;s servers could genuinely degrade performance for a public data portal serving thousands of researchers.</p><p>This is why scraping open data portals aggressively is arguably more unethical than doing the same to a commercial target. You are not fighting a corporation&#8217;s anti-bot system. You are potentially taking down a public resource that other people depend on.</p><h2><strong>A Four-Step Framework for Scraping Open Datasets Without Breaking Rules or Infrastructure</strong></h2><p>Every risk outlined above has a straightforward mitigation, but only if you apply it at the right point in your workflow. 
The mistake most scraping engineers make is treating these as afterthoughts: checking the license after the scraper is already built, thinking about PII after the data is already stored. Let&#8217;s discuss a framework that inverts this.</p><h3>License-First Workflow: Read Before You Scrape</h3><p>The fix for the license problem is simple in principle, even if it requires discipline in practice: make license verification the first step of your workflow.</p><p>Most well-maintained open data portals provide license information in one of these three places: a <code>LICENSE</code> file in the dataset&#8217;s root directory, a metadata field in the dataset&#8217;s API response, or the dataset&#8217;s documentation page. Here is a quick reference for what the licenses described above mean for your use case:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AbdL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AbdL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 424w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 848w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1272w, 
https://substackcdn.com/image/fetch/$s_!AbdL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AbdL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png" width="1021" height="376" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:1021,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48171,&quot;alt&quot;:&quot;Summary table for data licenses by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196400924?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Summary table for data licenses by Federico Trotta" title="Summary table for data licenses by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!AbdL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 424w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 848w, 
https://substackcdn.com/image/fetch/$s_!AbdL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1272w, https://substackcdn.com/image/fetch/$s_!AbdL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec701e4-c030-42d0-be3b-21fcda67b48c_1021x376.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Summary table for data licenses</figcaption></figure></div><p>When there is no license, the safe default is not to 
scrape and redistribute without seeking explicit permission from the dataset owner. A short email asking for clarification is a sign of professionalism.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Prefer APIs and Bulk Downloads Over Scraping</h3><p>This is a rule that experienced scraping engineers sometimes forget because they are so used to reaching for their scraper toolkit: always check for an official API or bulk download endpoint before writing a scraper.</p><p>Most serious open data portals expose REST APIs or provide direct bulk download links. Using these is usually better on every count: it is faster, more reliable, more respectful of the server, and often gives you cleaner, better-structured data than you would get from parsing HTML.</p><p>Your workflow should be:</p><ol><li><p>Check the portal&#8217;s documentation for an API.</p></li><li><p>Check for a sitemap (often advertised through the <code>Sitemap</code> directive in robots.txt) or a structured data endpoint (as discussed in our <a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">article on robots.txt and its implications</a>).</p></li><li><p>Check for bulk download links (CSV, JSON, Parquet).</p></li><li><p>Only fall back to HTML scraping if none of the above exist.</p></li></ol><p>Scraping should be your last resort, not your first instinct.</p><h3>Responsible Scraping Behavior for Open Infrastructure</h3><p>When scraping is genuinely the only option, the rules of polite scraping apply. 
But in the case of open data portals, you should apply a higher standard than you would on a commercial target.</p><p>As covered in &#8220;<a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">best practices for ethical web scraping</a>&#8221;, respecting rate limits, introducing delays between requests, and using a descriptive User-Agent are baseline requirements. Because open data portals run on weaker infrastructure, the following additional rules are worth applying:</p><ul><li><p><strong>Respect </strong><em><strong>Crawl-delay</strong></em><strong> strictly</strong>: Some major crawlers ignore this non-standard robots.txt directive, but on underfunded infrastructure it is a useful signal of how much load the server can handle.</p></li><li><p><strong>Cache responses locally</strong>: If you need to re-run your scraper for testing or debugging, you should not be hitting the server again. Cache what you have already fetched.</p></li><li><p><strong>Scrape during off-peak hours</strong>: For public portals serving researchers and government users, off-peak typically means nights and weekends in the portal&#8217;s local timezone.</p></li><li><p><strong>Scrape only what you need</strong>: This sounds obvious, but it&#8217;s easy to over-collect data &#8220;just in case&#8221;. For open portals, remember that every unnecessary request is a cost imposed on a public resource that runs on underfunded infrastructure.</p></li></ul><h3>Handling PII in Open Datasets</h3><p>PII stands for Personally Identifiable Information: any data that can be used, alone or in combination with other data, to identify a specific individual. Think names, email addresses, and phone numbers, but also subtler things like IP addresses or device IDs.</p><p>The reality is that most well-maintained open data portals put datasets through a review process before publication, and raw PII in open datasets is not as common as you might think. 
The most common cases where PII can slip through are quite specific: older government datasets published before modern privacy review processes, improperly anonymized academic research deposits, or crowdsourced datasets where contributors included personal details voluntarily.</p><p>Outside those cases, the real risk for most scraping engineers sits at the aggregation level. A dataset of names, a dataset of ZIP codes, and a dataset of employment records might each be perfectly clean and openly licensed in isolation. But combine them, and you have built a detailed profile of real individuals. This is something that privacy regulations like the GDPR and CPRA were specifically designed to prevent. And once you collect, store, and process that combined data, you become responsible for it, regardless of where it originally came from.</p><p>The key principle is to identify and handle PII at collection time, not after the data is already stored. Here is a schema you can use to audit the fields that are likely to contain PII:</p><ul><li><p><strong>Direct identifiers</strong>: names, email addresses, phone numbers, national ID numbers, passport numbers, and social security numbers. These are the clearest cases, as they point to a specific individual on their own, without needing to be combined with anything else. If you see these fields in a dataset, there is no ambiguity: you are dealing with PII.</p></li><li><p><strong>Quasi-identifiers</strong>: dates of birth, ZIP codes, job titles, gender, ethnicity, and salary ranges. None of these identifies a person on its own, but they become dangerous in combination. 
A classic example is combining just three fields, say date of birth, gender, and ZIP code: this can be enough to uniquely identify a large share of a country&#8217;s population (Latanya Sweeney&#8217;s well-known analysis of US census data estimated about 87%).</p></li><li><p><strong>Sensitive categories under GDPR</strong>: racial or ethnic origin, health and medical data, political opinions, religious or philosophical beliefs, biometric data, genetic data, sexual orientation, and trade union membership. This is a legally distinct class that carries stricter obligations regardless of context. In practice, you cannot process this data based on legitimate interest alone: you need explicit consent or one of the other specific conditions the GDPR lists, and the bar is significantly higher than for ordinary PII.</p></li></ul><p>For each PII field, decide upfront: do you need it? If not, drop it at collection time. If you do need it, apply pseudonymization (replacing the identifier with a token that can only be mapped back using a separately stored key) or anonymization (irreversible removal or generalization) before storage.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Python Implementation: Putting the Full Responsible Scraping Pipeline Into Code</h2><p>Principles are only useful if they translate into implementation. 
Below are two concrete components you can adapt for your own pipelines:</p><ul><li><p>Checking a dataset&#8217;s license before downloading anything, using CKAN&#8217;s metadata API, with a practical fallback strategy for portals that don&#8217;t run CKAN.</p></li><li><p>Running PII detection at collection time, using field-level schema classification, with an honest discussion of where that approach has limits.</p></li></ul><p>Note that the examples below omit an API-first fetch pattern and a polite scraper skeleton, even though both are covered in the framework section above. Those problems have well-known, straightforward solutions that every scraping engineer should already be familiar with. The following sections focus instead on lesser-known techniques that you can adapt to your own pipelines.</p><h3>Checking a Dataset&#8217;s License Programmatically</h3><p>Many open data portals are built on <a href="https://ckan.org/">CKAN</a>, an open-source data management system used by governments and enterprises. CKAN exposes an HTTP API (the Action API) whose dataset metadata includes license fields, which makes programmatic license checking straightforward.</p><p>Here is how to query a CKAN-based portal and extract license information before proceeding:</p><pre><code><code>import requests

def check_dataset_license(portal_base_url: str, dataset_id: str) -&gt; dict:
    """
    Queries a CKAN portal API to retrieve license information
    for a given dataset before any scraping begins.
    """
    api_url = f"{portal_base_url}/api/3/action/package_show"
    params = {"id": dataset_id}

    response = requests.get(api_url, params=params, timeout=10)
    response.raise_for_status()

    data = response.json()
    result = data.get("result", {})

    license_info = {
        "dataset_name": result.get("title", "Unknown"),
        "license_id": result.get("license_id", "Not specified"),
        "license_title": result.get("license_title", "Not specified"),
        "license_url": result.get("license_url", "Not specified"),
    }

    return license_info
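
# --- Added sketch: turning the license metadata into a go/no-go decision ---
# Assumption: OPEN_LICENSE_IDS is an illustrative allowlist of identifiers
# commonly seen on CKAN portals, not an authoritative list. Confirm each ID
# against the portal's own license register, and read the license terms
# themselves, before relying on it.
OPEN_LICENSE_IDS = {"uk-ogl", "cc-by", "cc-zero", "odc-by", "odc-pddl"}

def is_license_allowlisted(license_info):
    """
    Returns True only when the dataset's license_id is on the allowlist.
    Missing or unrecognized licenses are treated as unknown, so they fail.
    """
    license_id = str(license_info.get("license_id") or "").strip().lower()
    return license_id in OPEN_LICENSE_IDS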

# Example: querying the UK government's open data portal
portal = "https://data.gov.uk"
dataset = "road-accidents-safety-data"

license_info = check_dataset_license(portal, dataset)

print(f"Dataset: {license_info['dataset_name']}")
print(f"License: {license_info['license_title']}")
print(f"License ID: {license_info['license_id']}")
print(f"License URL: {license_info['license_url']}")</code></code></pre><p>Which outputs the following:</p><pre><code><code>Dataset: Road Safety Data
License: UK Open Government Licence (OGL)
License ID: uk-ogl
License URL: https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/</code></code></pre><p>With this information in hand, you can make an informed decision before a single byte of dataset content is downloaded. Specifically, you can read the terms directly on the <a href="https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/">Open Government Licence page</a>. The image below shows part of that page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hLlt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hLlt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 424w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 848w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1272w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hLlt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png" width="1211" height="690" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:690,&quot;width&quot;:1211,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135747,&quot;alt&quot;:&quot;The license page of the National Archive of the UK Government by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/196400924?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The license page of the National Archive of the UK Government by Federico Trotta" title="The license page of the National Archive of the UK Government by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!hLlt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 424w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 848w, 
https://substackcdn.com/image/fetch/$s_!hLlt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1272w, https://substackcdn.com/image/fetch/$s_!hLlt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b78522-2f2d-4c50-acee-c199264886d1_1211x690.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The license page of the National Archive of the UK Government</figcaption></figure></div><p>But what if the portal you 
need to scrape doesn&#8217;t run CKAN? Not all open data portals do. <a href="https://dev.socrata.com/">Socrata</a> (used by many US city and state governments), <a href="https://getdkan.org/">DKAN</a>, and custom-built portals each expose a different metadata API, or none at all. In those cases, your fallback options are the following:</p><ul><li><p>Check for a <em>LICENSE</em> or <em>METADATA</em> file in the dataset&#8217;s root directory or bulk download package. Many portals include one.</p></li><li><p>Look for a <em>&lt;link rel=&quot;license&quot;&gt;</em> tag in the dataset&#8217;s HTML page, which some portals emit as structured metadata.</p></li><li><p>Check the portal&#8217;s documentation or &#8220;About&#8221; page, where license terms are often stated globally for all datasets.</p></li></ul><p>If none of the above yields a clear answer, treat the license as unknown and do not redistribute the data without explicit written permission from the dataset owner. A short email asking for clarification is a professional move.</p><div><hr></div><h2>PII Detection at Scrape Time</h2><p>How you detect PII depends heavily on what you actually know about the data you need to scrape. You will encounter two situations in practice, each calling for a different strategy:</p><ul><li><p><strong>You know the schema</strong>: If you are retrieving structured data, field-level detection is the right approach. You know which fields are likely to carry PII, so you can target them directly. This is faster, more precise, and produces far fewer false positives than running a general named entity recognition (NER) model over free text.</p></li><li><p><strong>You have no schema</strong>: For unstructured data, NER-based detection is a reasonable starting point, but go in with realistic expectations. A common solution is using <a href="https://spacy.io/models/en">spaCy&#8217;s </a><em><a href="https://spacy.io/models/en">en_core_web_sm</a></em>, which is a small general-purpose model, so don&#8217;t expect miracles on domain-specific data. Another approach, which can give much better results, is <a href="https://substack.thewebscraping.club/p/how-to-use-llms-data-extraction-unstructured-text">using LLMs to give a structure to unstructured text</a>.</p></li></ul><p>For the structured case, here is a field-level PII detection pipeline:</p><pre><code><code>import re
import hashlib
from dataclasses import dataclass
from typing import Any

# Fields that are unambiguously PII on their own
DIRECT_IDENTIFIER_FIELDS = {
    "name", "full_name", "first_name", "last_name",
    "email", "email_address",
    "phone", "phone_number", "mobile",
    "ssn", "national_id", "passport_number",
    "ip_address", "device_id"
}

# Fields that are not PII alone but dangerous in combination
QUASI_IDENTIFIER_FIELDS = {
    "date_of_birth", "dob", "birth_date",
    "zip_code", "postcode", "zip",
    "gender", "sex",
    "job_title", "occupation",
    "salary", "income",
    "ethnicity", "race"
}

# Regex patterns for validating suspected PII values at the content level
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")
PHONE_PATTERN = re.compile(r"\b(\+?\d[\d\s\-().]{7,}\d)\b")

@dataclass
class FieldAudit:
    field_name: str
    classification: str    # "direct", "quasi", or "clean"
    original_value: Any    # raw value: keep in memory only, never log or persist it
    processed_value: Any   # pseudonymized, generalized, or original
    action_taken: str       # "pseudonymized", "generalized", "dropped", "kept"

def pseudonymize(value: Any) -&gt; str:
    """
    Replaces a PII value with a consistent, one-way token.
    Using a hash means the same value always produces the same token,
    which preserves referential integrity across records (e.g., you can
    still count unique users without knowing who they are).
    A plain hash can be undone by hashing guessed values, so in
    production use an HMAC with a secret key instead of plain SHA-256.
    """
    return hashlib.sha256(str(value).encode()).hexdigest()[:16]
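
# --- Added sketch: keyed pseudonymization, as the docstring above suggests ---
# Assumptions: the secret is read from a hypothetical PSEUDONYM_KEY
# environment variable, and the 16-character truncation mirrors
# pseudonymize(). hashlib is re-imported so the sketch runs on its own.
import hashlib
import hmac
import os

def pseudonymize_keyed(value, key=None):
    """
    Same idea as pseudonymize(), but the token depends on a secret key,
    so it cannot be recomputed by someone who only knows the algorithm.
    """
    if key is None:
        key = os.environ["PSEUDONYM_KEY"].encode()
    return hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()[:16]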

def generalize_date(value: str) -&gt; str:
    """
    Reduces a full date of birth to a birth year only.
    A simple but effective generalization for quasi-identifiers.
    """
    # Handles common formats: YYYY-MM-DD, DD/MM/YYYY, MM/DD/YYYY
    match = re.search(r"\b(19|20)\d{2}\b", str(value))
    return match.group(0) if match else "UNKNOWN_YEAR"
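
# --- Added sketch: a matching generalization for postal codes ---
# Assumption: this heuristic is illustrative and is not wired into
# audit_record() below. It keeps the outward part of a UK-style postcode
# ("SW1A 1AA" becomes "SW1A") or the first three characters otherwise.
# In sparsely populated areas even this prefix may be too specific.
def generalize_postcode(value):
    text = str(value).strip().upper()
    if " " in text:
        return text.split()[0]
    return text[:3]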

def audit_record(record: dict) -&gt; tuple[dict, list[FieldAudit]]:
    """
    Processes a single structured record field by field.
    Returns a cleaned record and a full audit trail of what was done to each field.

    Strategy:
    - Direct identifiers: pseudonymize (preserve referential integrity)
    - Quasi-identifiers: generalize where possible, pseudonymize otherwise
    - Everything else: pass through unchanged
    """
    clean_record = {}
    audit_trail = []

    for field_name, value in record.items():
        normalized = field_name.lower().strip()

        if normalized in DIRECT_IDENTIFIER_FIELDS:
            processed = pseudonymize(value)
            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="direct",
                original_value=value,
                processed_value=processed,
                action_taken="pseudonymized"
            ))
            clean_record[field_name] = processed

        elif normalized in QUASI_IDENTIFIER_FIELDS:
            # Apply field-specific generalization where we can
            if normalized in {"date_of_birth", "dob", "birth_date"}:
                processed = generalize_date(value)
                action = "generalized"
            else:
                # For other quasi-identifiers, pseudonymize as a safe default
                processed = pseudonymize(value)
                action = "pseudonymized"

            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="quasi",
                original_value=value,
                processed_value=processed,
                action_taken=action
            ))
            clean_record[field_name] = processed

        else:
            # Field is not in either PII list &#8212; pass through, but still
            # run a regex check on string values as a safety net
            if isinstance(value, str):
                if EMAIL_PATTERN.search(value) or PHONE_PATTERN.search(value):
                    # Unexpected PII in a non-PII field: flag it and pseudonymize
                    processed = pseudonymize(value)
                    audit_trail.append(FieldAudit(
                        field_name=field_name,
                        classification="direct",
                        original_value=value,
                        processed_value=processed,
                        action_taken="pseudonymized (unexpected PII in non-PII field)"
                    ))
                    clean_record[field_name] = processed
                    continue

            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="clean",
                original_value=value,
                processed_value=value,
                action_taken="kept"
            ))
            clean_record[field_name] = value

    return clean_record, audit_trail

def process_records(records: list[dict]) -&gt; list[dict]:
    """
    Runs field-level PII detection and handling across a list of records.
    Prints an audit summary for any record where PII was found.
    """
    clean_records = []

    for i, record in enumerate(records):
        clean_record, audit_trail = audit_record(record)
        pii_fields = [a for a in audit_trail if a.classification != "clean"]

        if pii_fields:
            print(f"Record {i}: PII detected and handled in {len(pii_fields)} field(s):")
            for audit in pii_fields:
                print(f"  [{audit.classification.upper()}] {audit.field_name} "
                      f"&#8594; {audit.action_taken}")

        clean_records.append(clean_record)

    return clean_records
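
# --- Added sketch: checking the aggregation risk on a batch of records ---
# Assumption: quasi_fields is whatever subset of QUASI_IDENTIFIER_FIELDS
# your dataset actually contains; the helper name is illustrative.
from collections import Counter

def find_unique_combinations(records, quasi_fields):
    """
    Returns the quasi-identifier combinations that occur exactly once.
    A combination held by a single record can point to one individual,
    even when no direct identifier is left in the data.
    """
    combos = [tuple(r.get(f) for f in quasi_fields) for r in records]
    counts = Counter(combos)
    return [combo for combo, n in counts.items() if n == 1]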

# Example: a batch of records from a scraped open dataset
records = [
    {
        "record_id": "A001",
        "name": "Jane Doe",
        "date_of_birth": "1985-03-22",
        "zip_code": "SW1A 1AA",
        "incident_type": "Road accident",
        "severity": "Slight"
    },
    {
        "record_id": "A002",
        "name": "John Smith",
        "date_of_birth": "1973-11-04",
        "zip_code": "EC1A 1BB",
        "incident_type": "Road accident",
        "severity": "Serious",
        # An email that slipped into a free-text notes field
        "notes": "Witness contact: witness@example.com"
    }
]

clean = process_records(records)</code></code></pre><p>The output is the following:</p><pre><code><code>Record 0: PII detected and handled in 3 field(s):
  [DIRECT] name &#8594; pseudonymized
  [QUASI] date_of_birth &#8594; generalized
  [QUASI] zip_code &#8594; pseudonymized
Record 1: PII detected and handled in 4 field(s):
  [DIRECT] name &#8594; pseudonymized
  [QUASI] date_of_birth &#8594; generalized
  [QUASI] zip_code &#8594; pseudonymized
  [DIRECT] notes &#8594; pseudonymized (unexpected PII in non-PII field)</code></code></pre><p>A few things worth calling out in this implementation:</p><ul><li><p><strong>Pseudonymization preserves referential integrity</strong>: Because the same input always produces the same hash token, you can still count unique individuals, join records, or track entities across datasets, without storing the raw PII. In production, replace the plain SHA-256 with an HMAC keyed on a secret. Names, postcodes, and similar low-entropy values are easy to enumerate, so anyone who knows the hashing algorithm could otherwise recover them by hashing guesses and comparing tokens.</p></li><li><p><strong>The regex safety net on non-PII fields</strong>: This catches the common real-world case where PII slips into a free-text or notes field that your schema classification didn&#8217;t anticipate. It is not foolproof, but it catches the obvious cases.</p></li><li><p><strong>The audit trail is intentional</strong>: Every field-level decision is recorded in the audit trail. If you persist that trail (without the raw values) and are ever asked to demonstrate that your collection process handled PII responsibly, you have a record of exactly what was done to each field in each record.</p></li></ul><h2>Conclusion</h2><p>Open data is a shared resource, and how you interact with it says something about you as a professional. In this article, you learned what &#8220;open&#8221; means in the context of data scraping and how you should treat it if you want to be an ethical scraper.</p><p>So, let us know: Did we miss something? What&#8217;s your approach to handling open datasets in your scraping projects? Let&#8217;s discuss in the comments.</p>]]></content:encoded></item><item><title><![CDATA[Using Web Scraping in Finance to Discover Investment Insights]]></title><description><![CDATA[Tired of guessing? 
Use web scraping to make data-backed financial decisions!]]></description><link>https://substack.thewebscraping.club/p/web-scraping-in-finance</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/web-scraping-in-finance</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 17 May 2026 16:03:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8d7b98ff-dc95-41cf-bc83-5cfa5241ed96_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve ever invested, you know how challenging it can be (even if you don&#8217;t <em>YOLO</em> all your money into a single stock, lol). Thankfully, things get a lot easier when you build data-powered processes to guide your decision-making.</p><p>No wonder nearly half a trillion dollars are spent every year by financial firms on technology. Now, you probably don&#8217;t have that kind of money in the first place (and if you do, you don&#8217;t need to invest much anyway), but you might still want to collect financial data for personal use, research, academic projects, backtesting, or even just for selling it to industry giants.</p><p>No matter what you want to do with scraped financial data, there are a few pivotal tips to understand before embarking on this journey, which is exactly what I will explain here!</p><p>In this blog post, I will show why web scraping and finance are a match made in heaven and cover everything you need to know about retrieving both historical and real-time financial data from the web.</p><h2>Web Scraping + Finance: A Happy Marriage</h2><p>Before diving into web scraping for finance, let me explain why this is such a powerful approach and the advantages you can gain from it.</p><h3>Finance Runs on (Web) Data</h3><p>If there&#8217;s one thing that&#8217;s become clear over the past decade, it&#8217;s this: <a href="https://www.acceldata.io/blog/the-critical-role-of-data-in-finance">finance runs on 
data!</a></p><p>Financial institutions process massive volumes of market, customer, and transactional data every single day. In finance, data powers everything, from investment strategies to risk management. And the stakes are high, as <a href="https://www.gartner.com/smarterwithgartner/how-to-improve-your-data-quality">bad data alone costs organizations an average of $12.9 million per year</a>!</p><p>Data drives real-time decision-making, predictive modeling, and scenario planning. Finance teams feed that data into pipelines built around statistical analysis, machine learning, and <a href="https://substack.thewebscraping.club/p/detect-pattern-scraped-data-with-ai">AI to identify patterns</a>, forecast market movements, and manage uncertainty in increasingly complex environments.</p><p>Now, here&#8217;s the central question that we web scraping enthusiasts are all interested in: <em>where does most of that data actually come from?</em> A big portion of it comes from the web (not that surprising, huh?).</p><p>I&#8217;m talking about news sites, financial portals, company pages, official exchange websites, regulatory filings, institutional reports, and more. The web is essentially the largest and most dynamic data source available for financial purposes.</p><p>That&#8217;s exactly why web scraping in finance isn&#8217;t just useful. 
It&#8217;s foundational!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
Their set of solutions covers all your scraping needs.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Benefits of a Data-Driven Approach in Finance</h2><p>Keep in mind that it&#8217;s not just big corporations or financial firms that benefit from data. Even individual retail investors can leverage financial data scraping to gain an edge. In particular, the main advantages include:</p><ul><li><p><strong>Informed decisions</strong>: Access to accurate historical data supports smarter investment decisions, while real-time data enables more solid trading choices.</p></li><li><p><strong>Market trend insights</strong>: Spot patterns and emerging trends before the wider market does.</p></li><li><p><strong>Risk management</strong>: Identify potential risks early and adjust strategies proactively.</p></li><li><p><strong>Portfolio optimization</strong>: <a href="https://substack.thewebscraping.club/p/llm-fine-tuning-for-scraping">Fine-tune asset allocation</a> based on backtesting and up-to-date market and company data.</p></li><li><p><strong>Efficiency and speed</strong>: Automate data collection, reducing time spent on manual research.</p></li></ul><p>I mean, financial firms wouldn&#8217;t be <a href="https://www.forrester.com/blogs/us-financial-services-tech-spending-hits-495-billion/">spending over $495 billion a year</a> (yeah, you read that right!) 
on technology (mostly built around collecting, processing, and leveraging data) if it didn&#8217;t give them a real edge!</p><h3>Getting vs Selling Financial Web Data: High-Level Overview</h3><p>There&#8217;s no doubt that financial firms invest billions into data. But what about you, as a web scraping expert, <em>how can you leverage financial data for potential gain?</em> There are two high-level approaches:</p><ol><li><p><strong>For yourself or your company</strong>: Build custom web scraping pipelines to gather data from multiple sources. Use it to <a href="https://substack.thewebscraping.club/p/the-evolving-career-scrapers">feed investment models, AI agents</a>, trading algorithms, or analytics pipelines. This is usually highly tailored to your strategies, risk appetite, or operational goals.</p></li><li><p><strong>To sell to financial services</strong>: Collect, aggregate, and potentially enrich data from various sources to sell. You can offer broad datasets for many clients or fully customized solutions for a specific customer&#8217;s needs.</p></li></ol><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo 
Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>How to Approach Financial Data Scraping: Historical vs Real-Time</h2><p>When it comes to finance, the web is packed with countless data fields and categories (e.g., news, stock prices, filings, analyst reports, and more). It&#8217;s a huge industry, and almost anything can be scraped!</p><p>At a high level, though, the key distinction for web scraping is simple: the financial data you want to collect is either historical or real-time. That&#8217;s what actually makes a difference in the approach to data scraping.</p><p>In the following chapters, I&#8217;ll dive deeper into each of the two categories of financial data. I&#8217;ll cover which fields are most interesting to scrape, where to find them, and how to collect them efficiently and effectively.</p><p>For now, start with a brief introduction to historical and real-time financial web data scraping!</p><h3>Historical Financial Web Data</h3><p>This includes all past financial data collected from the web, from historical stock prices to inflation rates and archived news. 
It&#8217;s used for analysis supporting long-term investment decisions.</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Enables backtesting of investment and trading strategies.</p></li><li><p>Easier to scrape, as it isn&#8217;t time-sensitive.</p></li><li><p>Data itself is stable and doesn&#8217;t change over time&#8230;</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>&#8230;but the web pages displaying it (e.g., in tables and static charts) can still change, breaking your static parsing logic.</p></li><li><p>Misses recent market shifts or breaking events.</p></li><li><p>Data completeness varies across websites, often requiring aggregation from multiple sources.</p></li></ul><h3>Real-Time Financial Web Data</h3><p>This includes live financial data extracted from the web, such as stock prices, market news, order books, etc. It&#8217;s employed for trading and short-term investment decisions.</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Enables fast, data-driven trading decisions.</p></li><li><p>Captures live market movements and breaking news.</p></li><li><p>Can be passed to AI agents and pipelines directly, as it tends to require minimal preprocessing.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>Harder to scrape reliably due to latency constraints and <a href="https://substack.thewebscraping.club/p/rate-limit-scraping-exponential-backoff">rate limits</a>.</p></li><li><p>Requires robust infrastructure for real-time ingestion and analysis, as every second counts.</p></li><li><p>Data storage can grow rapidly because new data arrives continuously.</p></li></ul><h2>Mastering Historical Financial Data Scraping</h2><p>As promised, let me guide you through the world of scraping historical financial data from the web.</p><h3>Main Types of Historical Financial Web Data</h3><p>The most important types of historical financial data you can retrieve from websites are:</p><ul><li><p><strong>Historical stock and commodity 
prices</strong>: Open, high, low, close (OHLC) prices and trading volumes for stocks, ETFs, indices, and commodities, used for <a href="https://substack.thewebscraping.club/p/predictive-analytics-web-scraped-data">time-series analysis, modeling, and predictions</a>.</p></li><li><p><strong>Summary info and infographics</strong>: Stock profiles, key metrics, and past indicators (e.g., P/E, EPS, moving averages), presented in dashboards or visual charts for quick insights.</p></li><li><p><strong>Macroeconomic indicators</strong>: Inflation, GDP, interest rates, unemployment, CPI, and PCE data, essential for understanding economic cycles and long-term market behavior.</p></li><li><p><strong>Financial statements</strong>: Company filings (income statements, balance sheets, cash flow), utilized for fundamental analysis and valuation models.</p></li><li><p><strong>News data</strong>: Archived headlines and press releases analyzed via NLP to correlate past market movements with specific events and sentiment shifts.</p></li><li><p><strong>ESG scores and sustainability reports</strong>: Historical environmental, social, and governance metrics employed to assess how &#8220;green&#8221; or ethical a company has been over time.</p></li><li><p><strong>Alternative data</strong>: Non-traditional datasets like web traffic, social media, satellite imagery (e.g., new headquarters or production plants), or credit card data for early performance signals.</p></li></ul><h3>Most Popular Targets</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!grNv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!grNv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 424w, https://substackcdn.com/image/fetch/$s_!grNv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 848w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!grNv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png" width="1456" height="1184" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:228005,&quot;alt&quot;:&quot;Popular historical financial data scraping 
sources&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192509947?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Popular historical financial data scraping sources" title="Popular historical financial data scraping sources" srcset="https://substackcdn.com/image/fetch/$s_!grNv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 424w, https://substackcdn.com/image/fetch/$s_!grNv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 848w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!grNv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0af2c44-101d-426f-8398-a4ecc9166729_1490x1212.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Popular historical financial data scraping sources</figcaption></figure></div><p>Also, if you&#8217;re interested in how to scrape historical data from the Wayback Machine, <a href="https://substack.thewebscraping.club/p/scraping-wayback-machine">read my previous guide for this newsletter!</a></p><h3>Scraping Techniques</h3><p>Typical examples of historical financial data include lists of open, high, low, and close prices for a given stock:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S5x0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!S5x0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 424w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 848w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1272w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S5x0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png" width="1456" height="750" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;NVDA historical stock data (Source: Yahoo Finance)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="NVDA historical stock data (Source: Yahoo Finance)" title="NVDA historical stock data (Source: Yahoo Finance)" 
srcset="https://substackcdn.com/image/fetch/$s_!S5x0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 424w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 848w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1272w, https://substackcdn.com/image/fetch/$s_!S5x0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78e13563-e990-4776-816b-3af4b4e32423_3030x1561.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">NVDA historical stock data (Source: Yahoo Finance)</figcaption></figure></div><p>Or, as another example, the historical returns of a specific index (e.g., the S&amp;P 500) over time:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XcNT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XcNT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 424w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 848w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1272w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XcNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png" width="1456" height="979" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:979,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)" title="100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)" srcset="https://substackcdn.com/image/fetch/$s_!XcNT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 424w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 848w, https://substackcdn.com/image/fetch/$s_!XcNT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XcNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461333ea-5215-4178-bb27-d0f9744201ad_2314x1556.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">100-year monthly historical table for the S&amp;P500 (Source: Macrotrends)</figcaption></figure></div><p>These cases fall into the category of table-based data scraping, one of the most common web scraping scenarios. You&#8217;re probably already familiar with it, so there&#8217;s no need to go too deep here. 
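</p><p>As a quick refresher, cells pulled from tables like the ones above arrive as display strings, so a small normalization step usually sits between extraction and storage. Here&#8217;s a minimal sketch in Python, assuming US-style date and number formats. The column names, the <code>normalize_row()</code> and <code>to_csv()</code> helpers, and the sample row are all hypothetical and made up for illustration (this is not real market data):</p><pre><code class="language-python">import csv
import io
from datetime import datetime
from decimal import Decimal

# Hypothetical raw row, as strings scraped from an OHLC table (not real market data).
RAW_ROW = {
    "date": "Mar 14, 2025",
    "open": "1,234.50",
    "high": "1,250.00",
    "low": "1,220.25",
    "close": "1,245.75",
    "volume": "12,345,678",
}

PRICE_FIELDS = ("open", "high", "low", "close")


def normalize_row(raw):
    """Convert display strings into typed values (assumes US-style formats)."""
    row = {"date": datetime.strptime(raw["date"], "%b %d, %Y").date().isoformat()}
    for field in PRICE_FIELDS:
        # Drop thousands separators before converting to an exact decimal.
        row[field] = Decimal(raw[field].replace(",", ""))
    row["volume"] = int(raw["volume"].replace(",", ""))
    # Sanity check: the high must be the largest price and the low the smallest.
    if row["high"] != max(row[f] for f in PRICE_FIELDS):
        raise ValueError("high is not the maximum price")
    if row["low"] != min(row[f] for f in PRICE_FIELDS):
        raise ValueError("low is not the minimum price")
    return row


def to_csv(rows):
    """Serialize normalized rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["date", *PRICE_FIELDS, "volume"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv([normalize_row(RAW_ROW)]))
</code></pre><p>Keeping prices as <code>Decimal</code> values avoids float rounding surprises, and the high/low check is a cheap way to flag rows whose columns were misaligned during extraction.</p><p>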
Scraping older news and media can be slightly more challenging due to the unstructured nature of the target data, but it&#8217;s still a simple task.</p><p>At a high level, the process for getting historical finance data via web scraping follows a standard workflow:</p><ol><li><p>Visit the target web page, either via an HTTP client or a browser automation tool.</p></li><li><p>Parse the page using an HTML parser, either directly or after rendering in a controlled browser.</p></li><li><p>Select the HTML elements of interest and extract the data.</p></li><li><p>Store the scraped data in your desired format (e.g., XLS, CSV, JSON) or in a database.</p></li></ol><p>The main challenges involve generic anti-scraping mechanisms, such as CAPTCHAs, WAFs, IP bans, as well as <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">browser</a>, TLS, and <a href="https://substack.thewebscraping.club/p/what-is-device-fingerprint">device fingerprinting</a>.</p><h3>Best Practices</h3><p>Based on my experience with financial web scraping, especially when focusing on historical data, these are the tips you should apply:</p><ul><li><p><strong>Normalize and validate data</strong>: Standardize formats (dates, currencies, units) and validate across sources to catch inconsistencies early.</p></li><li><p><strong>Be cautious with AI parsing</strong>: Avoid <a href="https://substack.thewebscraping.club/p/llms-ai-web-scraping">using AI for automatically parsing structured data</a> (tables, metrics, structured fields). It can introduce subtle errors and hallucinations, so prefer deterministic parsing. Harness AI mainly for retrieving unstructured text like news.</p></li><li><p><strong>Store raw HTML snapshots</strong>: Always keep the original page HTML. 
It lets you <a href="https://substack.thewebscraping.club/p/offline-web-scraping">re-parse data later and extract new signals without re-scraping</a>.</p></li><li><p><strong>Avoid single-source bias</strong>: When scraping news or market analysis pieces, pull data from multiple sources to reduce bias and improve reliability.</p></li><li><p><strong>Handle pagination properly</strong>: Many sites split historical data across pages or date ranges. Make sure your scraper fully traverses them all.</p></li><li><p><strong>Respect rate limits and retries</strong>: Even for historical data, implement retries and throttling to avoid blocks and incomplete datasets.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Understanding Real-Time Financial Data Scraping</h2><p>This is where things get a bit more interesting. 
Let me introduce you to real-time financial scraping!</p><h3>Main Types of Real-Time Financial Web Data</h3><p>The most relevant types of real-time financial web data are:</p><ul><li><p><strong>Live price tickers</strong>: Continuously updated &#8220;last trade&#8221; prices and bid/ask spreads for stocks, crypto, and forex, used to detect breakouts and short-term trading opportunities.</p></li><li><p><strong>Order book and market depth</strong>: Incoming buy/sell orders, liquidity levels, and spreads, fundamental for execution strategies and high-frequency trading.</p></li><li><p><strong>Breaking news</strong>: Immediate updates and announcements that trigger sentiment models as soon as key figures (CEOs, central banks, governments) release information.</p></li><li><p><strong>Corporate event triggers</strong>: Monitoring press releases or SEC feeds for earnings surprises, M&amp;A rumors, or sudden executive changes.</p></li><li><p><strong>Social media signals</strong>: <a href="https://substack.thewebscraping.club/p/how-to-scrape-reddit-with-scrapy">Tracking ticker mentions on platforms like Reddit</a> or X to detect retail-driven momentum, hype cycles, or panic selling in near real time.</p></li><li><p><strong>Institutional &#8220;whale&#8221; activity</strong>: Observing large trades or major wallet movements (especially in crypto) to identify where significant capital is flowing.</p></li><li><p><strong>Alternative digital signals</strong>: Web traffic spikes, app store ranking changes, or &#8220;out of stock&#8221; alerts on retail sites as proxies for real-world demand.</p></li></ul><p>As you can tell, this category is more varied than historical financial data, including social media tracking and other less conventional practices. 
Thus, the sources to monitor for live financial web scraping can be less standardized and intuitive.</p><h3>Most Popular Targets</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UIN0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UIN0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 424w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 848w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UIN0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png" width="1456" height="1487" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/adedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1487,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:289145,&quot;alt&quot;:&quot;Popular live financial data scraping sources&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192509947?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Popular live financial data scraping sources" title="Popular live financial data scraping sources" srcset="https://substackcdn.com/image/fetch/$s_!UIN0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 424w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 848w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1272w, https://substackcdn.com/image/fetch/$s_!UIN0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadedce2f-a9d6-4850-98a1-09d8cba60383_1490x1522.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Popular live financial data scraping sources</figcaption></figure></div><h3>Scraping Techniques</h3><p>Imagine applying a traditional scraping pattern to real-time financial data. You send a request to a target site, extract a stock price, and repeat the operation every few seconds or even milliseconds.</p><p>The problem is latency. By the time the server responds, the page is rendered or parsed, the target data field is collected, and stored or sent to your pipeline, that piece of data is already outdated.</p><p>On top of that, this approach requires a crazy number of requests in a very short time. 
That increases the risk of triggering rate limiting or even IP bans. You might think proxies solve that through IP rotation, but most proxy networks introduce additional latency, often 2 to 5 seconds per request. In real-time scenarios, that delay is simply not acceptable!</p><p>Even if you <a href="https://substack.thewebscraping.club/p/choosing-proxy-provider-scraping">switch to faster or dedicated proxies</a>, you may end up with a smaller IP pool, which increases the likelihood of those IPs getting blocked.</p><p>A more advanced idea is to rely on browser automation and keep a page open, capturing updates as they happen. This is smarter, but still problematic. Long-lived sessions with little or no user interaction are highly suspicious and can easily trigger anti-bot systems. Plus, browser automation at scale tends to be flaky and is not reliable enough for persistent connections.</p><p>Long story short, scraping real-time financial data this way quickly turns into a losing game.</p><p>The solution? Stop targeting the data presentation layer in HTML and instead go directly to the data source!</p><h4>API/WebSocket Scraping as the Solution</h4><p>Web pages showing real-time financial data aren&#8217;t doing anything magical. Behind the scenes, they either poll APIs at regular intervals or (more commonly) <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSocket">maintain a persistent connection via WebSockets</a> to receive continuous updates. 
The page simply renders that incoming data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q9lT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q9lT!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 424w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 848w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1272w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif" width="1080" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the live price 
update&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the live price update" title="Note the live price update" srcset="https://substackcdn.com/image/fetch/$s_!Q9lT!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 424w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 848w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1272w, https://substackcdn.com/image/fetch/$s_!Q9lT!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b9343c-5d2d-4663-92f9-6a5fa7ab2d39_1080x582.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 
15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the live price update</figcaption></figure></div><p>As a result, a much better approach is to intercept and replicate those data flows. You can do this through <a href="https://substack.thewebscraping.club/p/apis-in-web-scraping">AJAX/API request inspection</a> or WebSocket sniffing. Open the browser developer tools, go to the &#8220;Network&#8221; tab, and check where the data is coming from.</p><p>If it&#8217;s an API call, you&#8217;ll see it under the &#8220;Fetch/XHR&#8221; tab:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d22T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d22T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 424w, 
https://substackcdn.com/image/fetch/$s_!d22T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 848w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1272w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d22T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png" width="1456" height="1180" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1180,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the API used by Yahoo Finance to determine whether the market is open in real time&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the API used by Yahoo Finance to determine whether the market is open in real time" title="Note the API used by Yahoo Finance to determine whether the market is open in real time" 
srcset="https://substackcdn.com/image/fetch/$s_!d22T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 424w, https://substackcdn.com/image/fetch/$s_!d22T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 848w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1272w, https://substackcdn.com/image/fetch/$s_!d22T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd45f7536-0a5b-4d95-8773-9deb507fcef8_1754x1421.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the API used by Yahoo Finance to determine whether the market is open in real time</figcaption></figure></div><p>If it&#8217;s a WebSocket, you&#8217;ll find it under the &#8220;Socket&#8221; tab:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!quUU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!quUU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 424w, https://substackcdn.com/image/fetch/$s_!quUU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 848w, https://substackcdn.com/image/fetch/$s_!quUU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1272w, https://substackcdn.com/image/fetch/$s_!quUU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!quUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png" width="1456" height="778" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/171d48c6-665a-4753-86bb-c30793609101_3059x1634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:778,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page" title="Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page" srcset="https://substackcdn.com/image/fetch/$s_!quUU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 424w, https://substackcdn.com/image/fetch/$s_!quUU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 848w, https://substackcdn.com/image/fetch/$s_!quUU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1272w, 
https://substackcdn.com/image/fetch/$s_!quUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F171d48c6-665a-4753-86bb-c30793609101_3059x1634.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the live BTC price data retrieved from a message over a WebSocket connection established by the TradingView page</figcaption></figure></div><p>Once identified, replicate those API calls or connect directly to the WebSocket in your scraping script. 
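</p><p>As a rough sketch of what connecting directly can look like in Python, the snippet below subscribes to a WebSocket feed and decodes each JSON message. The endpoint URL, the subscription payload, and the <code>symbol</code>/<code>price</code> field names are placeholders made up for illustration, since every site uses its own message format, and the third-party <code>websockets</code> package is assumed to be installed:</p>
<pre>
```python
import asyncio
import json
import time

# Hypothetical endpoint and subscription payload: copy the real ones
# from the messages you see in the browser's Network tab.
WS_URL = "wss://example.com/realtime"
SUBSCRIBE_MESSAGE = {"action": "subscribe", "symbols": ["BTCUSD"]}


def parse_tick(raw_message):
    """Decode one JSON message and attach a local receive timestamp.

    The "symbol" and "price" keys are assumed names, not a real schema.
    Returns None for messages without a price (heartbeats, acks, ...).
    """
    payload = json.loads(raw_message)
    if not isinstance(payload, dict) or "price" not in payload:
        return None
    return {
        "symbol": payload.get("symbol"),
        "price": float(payload["price"]),
        "received_at": time.time(),
    }


async def stream_ticks(url=WS_URL):
    """Connect once, subscribe, and print ticks as they arrive."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(url) as ws:
        await ws.send(json.dumps(SUBSCRIBE_MESSAGE))
        async for raw_message in ws:
            tick = parse_tick(raw_message)
            if tick is not None:
                print(tick)


# To run against a real endpoint: asyncio.run(stream_ticks())
```
</pre>
<p>Keeping the parsing in a small pure function like <code>parse_tick()</code> makes it easy to adapt when the site changes its message format. In practice, you would also wrap <code>stream_ticks()</code> in a retry loop, since dropped connections are normal for long-lived sockets.</p><p>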
This gives you access to near real-time financial data in a structured format (typically JSON) without the overhead of parsing HTML.</p><p>Of course, that&#8217;s not trivial. <a href="https://substack.thewebscraping.club/p/websocket-bot-detection-scraping">WebSockets require proper anti-bot bypass</a>, and APIs may still enforce rate limits, request tracking, and TLS fingerprinting checks. However, this approach is generally faster, more reliable, and much easier to maintain than scraping rendered pages!</p><h4>And What About Live News or Social Media Scraping?</h4><p>For news, it makes sense to subscribe to the public RSS feeds that websites expose, when available, and use them to monitor updates. This allows you to trigger scraping only when new and relevant content is published, instead of constantly re-fetching pages.</p><p>Otherwise, you can build a polling mechanism that periodically checks news sites, social media platforms, and similar sources to capture fresh data. In these cases, you usually can&#8217;t rely on techniques like API or WebSocket scraping, as that&#8217;s not how those platforms fetch data.</p><p>Instead, you need a robust infrastructure built around speed and efficiency: fast connections, high-quality proxies, optimized parsing, and lightweight requests. 
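</p><p>To make the feed-monitoring idea above concrete, here is a minimal sketch using only the Python standard library. It assumes a plain RSS 2.0 feed; the function names and the choice of <code>guid</code> (falling back to <code>link</code>) as the deduplication key are illustrative choices, not a standard recipe:</p>
<pre>
```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def new_items(rss_xml, seen_ids):
    """Return feed items not seen before and record their IDs.

    Assumes a plain RSS 2.0 layout (channel/item with guid, link, title).
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        # Fall back to the link when a feed omits the guid element.
        item_id = item.findtext("guid") or item.findtext("link")
        if item_id is None or item_id in seen_ids:
            continue
        seen_ids.add(item_id)
        fresh.append(
            {
                "id": item_id,
                "title": item.findtext("title"),
                "link": item.findtext("link"),
            }
        )
    return fresh


def poll_feed(feed_url, seen_ids, timeout=10):
    """Download the feed once and return only the unseen items."""
    with urlopen(feed_url, timeout=timeout) as response:
        return new_items(response.read(), seen_ids)
```
</pre>
<p>Calling <code>poll_feed()</code> on a schedule with a persistent <code>seen_ids</code> set lets you run the heavier scraping step only for the items it returns.</p><p>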
The goal is to minimize latency while maintaining reliability at scale.</p><h3>Best Practices</h3><p>Scraping real-time financial data is a demanding art, but it becomes easier with the following best practices:</p><ul><li><p><strong>Prefer APIs and WebSockets over HTML parsing</strong>: Whenever possible, save time by extracting data directly from the underlying APIs or WebSocket streams used by web pages instead of scraping data from rendered pages.</p></li><li><p><strong>Choose clean, structured sources</strong>: Prioritize endpoints that return well-formatted JSON to minimize preprocessing and reduce latency.</p></li><li><p><strong>Stream data into pipelines immediately</strong>: Send incoming data directly to processing pipelines for real-time insights, while storing it in parallel for later analysis.</p></li><li><p><strong>Use specialized AI for sentiment analysis</strong>: Prefer AI/ML models tuned for finance and social media, as Reddit and X content often includes slang, memes, and non-standard language.</p></li><li><p><strong>Optimize browser automation</strong>: Configure Playwright, Selenium, or similar browser automation tools to block images, stylesheets, and fonts. 
This reduces bandwidth usage and significantly shortens rendering time.</p></li><li><p><strong>Design for low latency</strong>: Optimize your stack (async requests, streaming ingestion, fast JSON parsers) to minimize delays, as even milliseconds matter.</p></li><li><p><strong>Prefer high-quality premium proxies</strong>: Rely on <a href="https://substack.thewebscraping.club/p/how-many-ip-needed-scraping">proxy providers with a proven track record of fast, stable connections</a> to minimize latency and avoid disruptions.</p></li><li><p><strong>Time-synchronize everything</strong>: Attach a timestamp to every scraped record to enable time-series analysis and accurately reconstruct events.</p></li><li><p><strong>Build fault-tolerant systems</strong>: Expect disconnections (especially with WebSockets) and other failures, so add reconnection logic and configure fallback data sources.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Top 5 Open-Source Financial Web Scraping Libraries</h3><p>Below is a curated set of fully open-source libraries, packages, and projects that simplify financial web scraping:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XR35!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!XR35!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 424w, https://substackcdn.com/image/fetch/$s_!XR35!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 848w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1272w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XR35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png" width="1456" height="1136" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1136,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:267258,&quot;alt&quot;:&quot;Top open-source financial web scraping 
libraries&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192509947?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Top open-source financial web scraping libraries" title="Top open-source financial web scraping libraries" srcset="https://substackcdn.com/image/fetch/$s_!XR35!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 424w, https://substackcdn.com/image/fetch/$s_!XR35!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 848w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1272w, https://substackcdn.com/image/fetch/$s_!XR35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fef06d-e3d2-4da8-8cd3-b309dd9b270f_1920x1498.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 
4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Top open-source financial web scraping libraries</figcaption></figure></div><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Conclusion</h2><p>Here, I&#8217;ve gone down the rabbit hole of financial web scraping, the task of collecting finance-related data from the Internet. This is one of the main use cases of corporate web scraping, powering enterprise data pipelines for decision-making and market analysis.</p><p>As you&#8217;ve seen, the main difference in approach comes down to whether you&#8217;re targeting historical or real-time data. The first follows standard web scraping practices you&#8217;re likely already familiar with. The second is trickier and requires more advanced techniques.</p><p>I hope you found this helpful and insightful. If you have questions, feel free to share them in the comments below!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #104: Bypassing AWS WAF on IMDB with Scrapling]]></title><description><![CDATA[A hands-on test of tools for TLS spoofing and Scrapling]]></description><link>https://substack.thewebscraping.club/p/bypassing-aws-waf-with-scrapling</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/bypassing-aws-waf-with-scrapling</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 14 May 2026 22:23:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e0fccfe7-d622-4fe8-a6d2-d99c1a73a9d9_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AWS WAF is the protection we run into most often on Amazon&#8217;s public properties. 
It also sits in front of a long tail of third-party sites whose operators built on AWS and clicked the WAF checkbox. We wrote about it two years ago in <a href="https://substack.thewebscraping.club/p/bypassing-aws-waf-scraping">The Lab #53: Bypassing AWS WAF</a>, but that article was not about AWS WAF alone: Traveloka used DataDome on top of AWS WAF, and our analysis had to account for both systems at once.</p><p>This time, we wanted AWS WAF on its own, on a target with no other protection in front of it, and we wanted to see what changes when the 2024 Scrapy-Playwright stack is replaced with the 2026 toolbox.</p><p>The target we picked is <a href="https://www.imdb.com">imdb.com</a>. It is owned by Amazon, runs a standard AWS WAF deployment, and Wappalyzer confirms that there is no other anti-bot solution on the website. That makes IMDB a perfect use case for our article.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gCo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png" width="626" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:390972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/197764148?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gCo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 424w, 
https://substackcdn.com/image/fetch/$s_!gCo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!gCo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7017ca-9ba7-4512-91e1-2682ab05ca13_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. Their set of solutions covers all your needs for scraping.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack&quot;,&quot;text&quot;:&quot;Visit Netnut&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/?utm_source=webscrapingclub&amp;utm_medium=newsletter&amp;utm_campaign=130526&amp;utm_content=product-stack"><span>Visit Netnut</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Today we&#8217;ll test three Python HTTP clients with strong TLS fingerprint impersonation: <code>curl_cffi</code>, the newer <code>httpx-curl-cffi</code>, and Rust-backed <code>rnet</code>. Each one produces a TLS handshake indistinguishable from that of real Chrome. Is that enough to scrape an AWS WAF target without spinning up a browser? And if not, what is the smallest browser step that gets us past the gate so the rest of the work can run on a cheap HTTP client?</p><h2>The tools we used</h2><p>Four libraries are in scope. Three are HTTP-only; one runs a real browser.</p><p><strong><a href="https://github.com/yifeikong/curl_cffi">curl_cffi</a></strong> is a Python binding for <code>curl-impersonate</code>, a patched build of curl. It exposes a requests-like API, ships impersonation profiles for recent Chrome, Firefox, and Safari builds, and works at the TLS layer. JA3 and JA4 fingerprints match the impersonated browser, along with HTTP/2 settings and header order. 
We tested with <code>chrome142</code>, the latest Chrome profile in version 0.14.0.</p><p><strong><a href="https://github.com/vgavro/httpx-curl-cffi">httpx-curl-cffi</a></strong> is a transport for <code>httpx</code> that delegates the actual HTTP work to <code>curl_cffi</code>. It does not add any new fingerprinting capability, but it lets you keep the <code>httpx</code> programming model: sync <code>Client</code>, async <code>AsyncClient</code>, event hooks, and the same response object you get from the rest of an <code>httpx</code>-based codebase. We tested with the Chrome profile and <code>default_headers=True</code>.</p><p><strong><a href="https://github.com/0x676e67/rnet">rnet</a></strong> is a Rust HTTP client with Python bindings. It implements its own impersonation stack rather than wrapping <code>curl-impersonate</code>. The enum <code>rnet.Impersonate</code> exposes a wide range of Chrome, Firefox, Safari, Edge, Opera, and OkHttp profiles. We tested with <code>Chrome137</code>.</p><p><strong><a href="https://github.com/D4Vinci/Scrapling">Scrapling</a></strong> is the only browser-driven tool in the set. Our <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide">Scrapling: A Complete Hands-On Guide</a> goes through the library in depth, with Cloudflare as the test target. Its <code>StealthyFetcher</code> drives a stealth-patched Chromium that runs JavaScript and applies fingerprint countermeasures. The library README only advertises Cloudflare Turnstile, but the same machinery handles AWS WAF&#8217;s challenge too.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h2>How AWS WAF protects IMDB</h2><p>A quick introduction to the system helps in interpreting the results that follow. AWS WAF is not a dedicated anti-bot platform like DataDome or Kasada. It is a general-purpose web application firewall with a bot-control module that operators can enable per rule. When the bot-control rule is in challenge mode, AWS WAF inserts a single JavaScript gate at the start of a session.</p><p>A request without a valid cookie returns <code>HTTP 202</code> with <code>x-amzn-waf-action: challenge</code> and a short HTML body. The body contains a <code>window.gokuProps</code> object holding three base64 blobs (<code>key</code>, <code>iv</code>, <code>context</code>), a <code>&lt;script src&gt;</code> pointing to a customer-specific URL on <code>*.token.awswaf.com</code>, and a small inline script that calls <code>AwsWafIntegration.saveReferrer()</code>, <code>AwsWafIntegration.checkForceRefresh()</code>, and <code>AwsWafIntegration.getToken()</code>. The remote <code>challenge.js</code> tests the browser environment and posts a validation payload back to AWS; on success, the response sets <code>Set-Cookie: aws-waf-token=...</code>. The inline script then reloads the page, and the second request, now carrying the token, gets the real content.</p><p>This works very differently from systems that score every request. Once the token is in our cookie jar, AWS WAF lets us through with no further behavioral checks beyond IP reputation and rate limits. 
<br>What we want to discover with this article is whether we can bypass AWS WAF with &#8220;convincing&#8221; requests that carry a proper TLS fingerprint and set of headers, or whether we need a JS rendering engine.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Test setup</h2><p>As always, the code can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">104.IMDB</a>. If you&#8217;re not able to access the repository, <a href="https://twsc-private-form.lovable.app/">please use this form to request access.</a></strong></p><p>The libraries we pinned at the time of writing are <code>curl_cffi==0.14.0</code>, <code>httpx==0.28.1</code>, <code>httpx-curl-cffi==0.1.5</code>, <code>rnet==2.4.2</code>, and <code>scrapling==0.4.7</code>. Python is 3.11.</p><p>Each HTTP test issues a <code>GET</code> against two URLs: the IMDB home page <code>https://www.imdb.com/</code> and a title page <code>https://www.imdb.com/title/tt0111161/</code>. We use two URLs to confirm the challenge fires the same way on both, not only on one entry point. We do not follow redirects (<code>follow_redirects=False</code>) because the AWS WAF response is a 202 with content rather than a redirect, and we want to see it raw. 
</p><p>We capture the status code, HTTP version, the full response headers, any cookies, body length, and the first 600 characters of the body, and we save everything to JSON under <code>aws_waf_imdb/responses/</code> for later inspection.</p><p>The baseline probe in <a href="../code/aws_waf_imdb/probe_plain.py">probe_plain.py</a> uses an unmodified <code>httpx.Client(http2=True)</code> with a generic Chrome User-Agent header and the standard <code>Accept</code> headers. This is the control: no TLS impersonation, no fingerprint trickery, just a normal Python HTTP client.</p>
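<p>For readers without access to the repository, here is a minimal sketch of what such a baseline probe could look like. It is not the actual <code>probe_plain.py</code>: the header values, the helper names <code>build_record</code> and <code>probe</code>, the <code>label</code> parameter, and the output file naming are all assumptions made for illustration. Only the captured fields, the <code>httpx.Client(http2=True)</code> call, <code>follow_redirects=False</code>, and the challenge signature (status 202 plus <code>x-amzn-waf-action: challenge</code>) come from the description above.</p>

```python
import json
from pathlib import Path

# Placeholder header values (assumptions, not copied from probe_plain.py).
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_record(url, status, http_version, headers, cookies, body):
    """Collect the captured fields into a JSON-serializable dict (hypothetical helper)."""
    lowered = {key.lower(): value for key, value in dict(headers).items()}
    return {
        "url": url,
        "status": status,
        "http_version": http_version,
        "headers": lowered,
        "cookies": dict(cookies),
        "body_length": len(body),
        "body_head": body[:600],  # first 600 characters of the body
        # Challenge signature: HTTP 202 plus the WAF action header.
        "waf_challenge": status == 202
        and lowered.get("x-amzn-waf-action") == "challenge",
    }


def probe(url, label, out_dir="aws_waf_imdb/responses"):
    """Fetch one URL with plain httpx (no TLS impersonation) and save the record."""
    # Third-party dependency; http2=True also needs the h2 package,
    # which is installed with: pip install "httpx[http2]"
    import httpx

    with httpx.Client(http2=True, headers=HEADERS, follow_redirects=False) as client:
        response = client.get(url)
    record = build_record(
        url,
        response.status_code,
        response.http_version,
        response.headers,
        response.cookies,
        response.text,
    )
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    # File naming is an assumption: one JSON file per labelled probe.
    (target / f"{label}.json").write_text(json.dumps(record, indent=2))
    return record
```

<p>Calling <code>probe("https://www.imdb.com/", "plain_home")</code> and then the same function on the title page would produce one JSON file per URL. Keeping <code>build_record</code> separate from the network call means the same record format can be reused for the impersonating clients, so their responses stay directly comparable with the control.</p>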
      <p>
          <a href="https://substack.thewebscraping.club/p/bypassing-aws-waf-with-scrapling">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Use LLMs to Enhance Data Extraction From Unstructured Text]]></title><description><![CDATA[How combining LLMs with schema validation solves the extraction problem that NLP never could]]></description><link>https://substack.thewebscraping.club/p/how-to-use-llms-data-extraction-unstructured-text</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-use-llms-data-extraction-unstructured-text</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 10 May 2026 19:06:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9c72c499-8522-4fcd-b662-e37cf857c78a_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#127465;&#127466; Before starting this article, let me remind you that on Friday the 15th, there will be the first TWSC meetup in Munich. For more details and to confirm your attendance, go to <a href="https://www.meetup.com/the-web-scraping-club/events/314567280/">the event page</a>  &#127465;&#127466; </em></p><div><hr></div><p>The web contains an extraordinary volume of information, the majority of which is in textual form. Blogs, forums, and newsletters alone generate millions of words of domain-specific knowledge every week. And they&#8217;re not the only sources of text on the web.</p><p>When you want to get insights from that kind of data, successfully extracting it from the web is only half of the battle, even now that <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">LLMs can use vision to scrape complex visual layouts</a>. The second part of the challenge is structuring this data to get it ready for analytics. Why? Because when you point a scraper at a news article, you get back a wall of text. But you cannot query it. You cannot aggregate it. 
You cannot feed it reliably into a machine learning pipeline or a database without significant preprocessing.</p><p>This article addresses the preprocessing problem of unstructured text when you scrape it from the web. It traces the evolution of solutions from classical NLP to large language models, identifies where each approach breaks down, and proposes a practical architectural solution.</p><p>Let&#8217;s get into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What &#8220;Unstructured&#8221; Really Means in Practice</h2><p>Unstructured text refers to content that carries no machine-readable schema. The information exists in the data you retrieved from the web, but it has no field boundaries, no consistent labels, and no guaranteed position for any given fact.</p><p>The following diagram shows the difference between unstructured text and structured text with a machine-readable schema:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FNRZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FNRZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 424w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 848w, 
https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1272w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png" width="1037" height="456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:1037,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58633,&quot;alt&quot;:&quot;The difference between unstructured and machine-readable text by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The difference between unstructured and machine-readable text by Federico Trotta" title="The difference between unstructured and machine-readable text by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!FNRZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 424w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 848w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1272w, https://substackcdn.com/image/fetch/$s_!FNRZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18887b6d-d211-4070-8555-c8beb4b95ad1_1037x456.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The difference between unstructured and machine-readable text</figcaption></figure></div><p>Let&#8217;s consider three concrete scraping targets to illustrate what this costs you in practice.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3>News Articles: When Signals Are Buried in Noise</h3><p>Suppose you scraped a Reuters article about an ECB rate decision. The text you get back from the scraper could look like this:</p><pre><code><code>European Central Bank decides on rates.
Listen to this article. 2 min audio. 
You might also like: Eurozone inflation hits 3-year low. 
Christine Lagarde announced Thursday a 25 basis point reduction, bringing the main 
refinancing rate to 3.40%. 
SPONSORED: Track macro events with Bloomberg Terminal. 
The decision was widely anticipated after last month's CPI print. Share this article. 
4 comments. John M. writes: this was priced in already</code></code></pre><p>Your raw text contains the article body, a teaser for a related story, a sponsored insertion, and reader comments. The fact you want is buried in there: the ECB cut its main refinancing rate to 3.40% on a specific date. But your extractor receives the full content.</p><p>Such a wall of text, which in practice is far bigger than this example, is useless for analytics purposes without preprocessing.</p><h3><strong>Financial Newsletters: When &#8220;Just Under Two Percent&#8221; Breaks Your Aggregation</strong></h3><p>Suppose you scrape a financial newsletter to extract an updated macroeconomic forecast. You need to capture a specific fact, something like &#8220;Goldman Sachs revised its 2026 US GDP growth forecast down to 1.8%&#8221;. Your scraper captures the entire page output, which resembles that of an article. As in the previous example, the resulting raw text mixes the core facts with boilerplate and unrelated news:</p><pre><code><code>Market Daily Newsletter. November 12.
Jan Hatzius (Goldman Sachs) and his team were out with a note early Tuesday.
SPONSORED: Get 50% off your trading fees today. 
They see tariffs shaving roughly 0.7 points off the baseline. Meanwhile, 
European markets rallied on ECB news. 
Read our full coverage of the Eurozone here.
The revised number now sits just under two percent for the full year.
Subscribe for premium insights.</code></code></pre><p>The text distributes the target fact across the entire document. Also, the wording &#8220;just under two percent&#8221; must be interpreted numerically before you can link it to the figure you were searching for, the 1.8% forecast.</p><p>Now, imagine scaling this up: you scrape hundreds of financial news articles and newsletters and want to aggregate the numbers they report. Getting insights would be impossible. Why? Because some sources will give you the exact information you want (growth forecast down to 1.8%), while others will describe the trend with different phrasing (&#8220;An expected growth under 2 percent&#8221;, &#8220;a slightly shrinking trend&#8221;, etc.).</p><p>Without a way to create a structure for such data, you can&#8217;t get any insights from it.</p><h3>Job Posting Offers: They Are Always Messier Than They Look</h3><p>Consider the case where you want to scrape job offers to estimate what the market pays on average for a specific position, given comparable technical skills and day-to-day activities. Job offers can contain the following ambiguities:</p><ul><li><p>A sentence might read &#8220;3+ years of experience with Python&#8221;. This sets a floor but no ceiling. Alternatively, the text might read &#8220;Senior-level candidates only&#8221;. This uses qualitative seniority as a proxy for an exact quantitative number.</p></li><li><p>Salary breaks in a different direction. One posting can say <em>&#8220;$120,000 - $145,000 base&#8221;</em>. Another can say <em>&#8220;competitive compensation commensurate with experience&#8221;</em>. A third could say <em>&#8220;&#8364;100,000&#8221;</em>, which you need to convert to dollars to make an actual comparison.</p></li><li><p>Employment type can introduce further ambiguity and difficulties. 
<em>&#8220;Full-time&#8221;</em>, <em>&#8220;FTE&#8221;</em>, <em>&#8220;permanent&#8221;</em>, and <em>&#8220;direct hire&#8221;</em> basically mean the same thing but are written differently. Also, the text might specify the role is &#8220;Hybrid&#8221;, which means multiple different things across companies. It could mean two days in the office. It could mean occasional travel with headquarters-optional rules.</p></li></ul><div><hr></div><blockquote><p>When sites get tough, skip the heavy lifting. Get clean, structured CSV datasets,  ready for Excel, BI or your apps</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KpSw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KpSw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 424w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 848w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1272w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png" width="592" height="149.84467881112175" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:1043,&quot;resizeWidth&quot;:592,&quot;bytes&quot;:81723,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KpSw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 424w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 848w, https://substackcdn.com/image/fetch/$s_!KpSw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KpSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16722680-673e-472e-a0dd-6aa9fe9d2acb_1043x264.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databoutique.com/buy-data-list&quot;,&quot;text&quot;:&quot;Find your dataset&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databoutique.com/buy-data-list"><span>Find your dataset</span></a></p></blockquote><div><hr></div><h2>How Classical NLP Tried to Solve This (and Where It Stopped)</h2><p>Before large language models were released, the standard answer to this problem was Natural Language Processing. The classical NLP toolkit gave developers a set of tools that could, with enough effort, extract meaningful structure from text using different, but often interconnected, processes like the following:</p><ul><li><p><strong>Named Entity Recognition (NER)</strong>: NER is a process used in <a href="https://substack.thewebscraping.club/p/using-nlp-scraped-data">NLP to extract entities from text corpora</a>. It can particularly identify spans of text as persons, organizations, locations, or dates. An NLP model trained on news corpora, for example, is able to scan an article and tag &#8220;Jane Doe&#8221; as a person and &#8220;Washington D.C.&#8221; as a geopolitical entity.</p></li><li><p><strong>Part-of-speech tagging</strong>: Is a process in which NLP models can identify nouns, verbs, and adjectives. 
This enables the downstream logic to focus on the right parts of a sentence.</p></li><li><p><strong>Dependency parsing:</strong> Maps grammatical relationships between words, helping to extract which subject performed which action on which object.</p></li><li><p><strong>Relation extraction:</strong> Identifies when two co-occurring entities have a specific relationship. For example, a person who was affiliated with an organization, or an event that occurred in a specific location.</p></li></ul><p>Libraries like <a href="https://spacy.io/">spaCy</a>, <a href="https://nlp.stanford.edu/">Stanford NLP</a>, and <a href="https://www.nltk.org/">NLTK</a> made these processes widely accessible. They work well for well-defined, narrow tasks on consistent text domains, but their limitations appear quickly at the edges:</p><ul><li><p><strong>Domain shift breaks everything:</strong> An NER model trained on news articles often performs poorly on scientific abstracts. A model tuned for English financial text tends to fail on multilingual content. In other words, every new domain requires retraining, re-labeling, and re-evaluation. These processes are costly in terms of both money and time.</p></li><li><p><strong>Context is invisible:</strong> Classical NLP models operate at the token and sentence level. They have no built-in mechanism for understanding that &#8220;Apple&#8221; in a technology article refers to a corporation, while &#8220;apple&#8221; in a nutrition blog refers to a fruit. Disambiguation requires hand-crafted rules or separate classification layers bolted on top (which, again, is costly).</p></li></ul><p>Before NLP, you could basically only use regex (with all the difficulties associated with manually cleaning data, standardizing it, and&#8230;using regex!). 
So, NLP was a genuine (big) step forward: it made large-scale text analysis possible in ways that pure pattern matching never could (which is a way to <a href="https://substack.thewebscraping.club/p/detect-pattern-scraped-data-with-ai">find patterns in scraped data using AI</a>). But it still required substantial domain expertise, constant maintenance, and produced results that were narrow, fragile, and difficult to generalize.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The Modern Solution: LLMs as Universal Structure Extractors</h2><p>Large language models fundamentally changed the extraction problem. On the side of the underlying technology, a classical NLP model learns the statistical patterns inside the text. An LLM, instead, learns to understand language. This distinction matters enormously because it opened the doors to the following:</p><ul><li><p><strong>Context disambiguation that works out of the box:</strong> Feed an LLM with a paragraph from a technology article containing the word &#8220;Apple&#8221; and it will correctly identify it as a company. Feed it with a paragraph from a recipe blog, and it will correctly identify it as a fruit. No separate disambiguation layer. The model resolves ambiguity the same way a human reader does: by reading the surrounding context.</p></li><li><p><strong>Semantic equivalence that is understood, not computed:</strong> An LLM knows that &#8220;$40,&#8221; &#8220;forty dollars,&#8221; &#8220;40 USD,&#8221; and &#8220;forty bucks&#8221; all express the same value. 
You don&#8217;t need to instruct it to understand that.</p></li><li><p><strong>Implicit information that becomes accessible:</strong> A sentence like &#8220;the study, conducted over three months at a Boston hospital, found no significant effect&#8221; contains a location, a duration, and a finding. An LLM can extract all three without requiring the text to follow any particular structure.</p></li><li><p><strong>Domain generalization that requires no retraining:</strong> The same LLM that extracts entities from political news articles can extract findings from scientific abstracts, event mentions from cultural journalism, and source attributions from investigative reporting. You just need to change the prompt, not the model.</p></li></ul><p>The practical workflow becomes straightforward:</p><ul><li><p>You scrape unstructured text from the web.</p></li><li><p>You pass the content to an LLM with a prompt that describes what you want to extract.</p></li><li><p>The model returns a response.</p></li><li><p>You use that response downstream.</p></li></ul><p>This process works. But using LLMs alone introduces a different class of problems:</p><ul><li><p><strong>Output format is not guaranteed:</strong> Ask an LLM to return a price, and it might return <em>$40</em> in one run, <em>40 dollars</em> in another, and <em>40 USD</em> in a third. The model understands the value when it retrieves it from scraped content. But it does not guarantee how it expresses that value unless you explicitly constrain it.</p></li><li><p><strong>Required fields can go missing:</strong> If the article you extracted the content from does not mention a publication date, the model might omit the field, return <em>null</em>, return <em>"not mentioned"</em>, or invent a plausible date (which is far worse). 
Each behavior is different, and none of them is predictable without enforcement.</p></li><li><p><strong>Hallucination is a real risk:</strong> When the model is uncertain, it still tends to generate a plausible-sounding answer. For extraction tasks, that means it can invent entity names, fabricate statistics, or fill in missing information with confident-sounding fiction. Without validation, these errors pass into your data, creating issues at the analytics level.</p></li></ul><p>At scale, this lack of guaranteed consistency becomes a problem of its own. A pipeline processing 10,000 articles requires every output to follow the same schema. But a model that returns slightly different structures across runs cannot feed a database reliably without significant error handling.</p><p>In other words, LLMs provide you with the understanding that NLP lacked. But they do not, on their own, provide the structural guarantees that production pipelines require.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>How to Get Semantic Power and Structural Guarantees at the Same Time: A Practical Approach</h3><p>One possible solution to the unpredictability of LLM outputs is to separate the two concerns that these models conflate: semantic understanding and structure enforcement.</p><p>To do so, you can:</p><ul><li><p>Use the LLM for what it does well: reading text, resolving ambiguity, extracting meaning, and normalizing inconsistent expressions.</p></li><li><p>Use dedicated libraries to define schemas, enforce types, validate outputs, and 
reject malformed data before it enters your pipeline.</p></li></ul><p>Below is how this solution works, at a high level:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zCax!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zCax!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 424w, https://substackcdn.com/image/fetch/$s_!zCax!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 848w, https://substackcdn.com/image/fetch/$s_!zCax!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1272w, https://substackcdn.com/image/fetch/$s_!zCax!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zCax!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png" width="998" height="376" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2508960-4a9d-44cb-9877-0df6263956b9_998x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:998,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52459,&quot;alt&quot;:&quot;The high-level process of creating machine-readable content from unstructured text by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/195720195?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The high-level process of creating machine-readable content from unstructured text by Federico Trotta" title="The high-level process of creating machine-readable content from unstructured text by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!zCax!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 424w, https://substackcdn.com/image/fetch/$s_!zCax!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 848w, https://substackcdn.com/image/fetch/$s_!zCax!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zCax!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2508960-4a9d-44cb-9877-0df6263956b9_998x376.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The high-level process of creating machine-readable content from unstructured text</figcaption></figure></div><p>Let&#8217;s see how to implement this process and how the two approaches differ in practice.</p><div><hr></div><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>The Baseline Approach: A Direct LLM Call (and What It Gives You)</h3><p>Consider the following content that can come from scraping a news article:</p><pre><code><code>Fed Signals Caution as Inflation Data Disappoints

By Sarah M. Connelly | April 14, 2026 | Economics &amp; Policy

The Federal Reserve signaled on Monday that it remains in no rush to cut interest rates,
after fresh inflation data showed consumer prices rose more than expected last month.
Jerome Powell, speaking at a conference in Washington, said the central bank needs
"greater confidence" that inflation is moving sustainably toward its two-percent target
before reducing borrowing costs.

The Consumer Price Index climbed 3.5 percent in March, up from 3.2 percent in February,
according to the Labor Department. Economists polled by Reuters had forecast a reading
of 3.3 percent. Core CPI, which strips out food and energy, rose 3.8 percent year-over-year.

Markets reacted sharply. The S&amp;P 500 fell nearly one point five percent by midday,
while the yield on the 10-year Treasury note jumped to four point six percent.
Goldman Sachs revised its forecast, pushing back its expected first rate cut from June
to September. JPMorgan analysts said two cuts in 2026 now look "optimistic."

Powell emphasized that the Fed is not considering rate hikes at this stage, but stressed
that the path back to two percent inflation "may take longer than previously thought."
The next Fed meeting is scheduled for the first week of May.</code></code></pre><p>To pass this text directly to a GPT model and ask it for specific fields, you can use the following code:</p><pre><code><code>import os
import json
from openai import OpenAI

# Scraped content
SCRAPED_TEXT = """
Fed Signals Caution as Inflation Data Disappoints

By Sarah M. Connelly | April 14, 2026 | Economics &amp; Policy

The Federal Reserve signaled on Monday that it remains in no rush to cut interest rates,
after fresh inflation data showed consumer prices rose more than expected last month.
Jerome Powell, speaking at a conference in Washington, said the central bank needs
"greater confidence" that inflation is moving sustainably toward its two-percent target
before reducing borrowing costs.

The Consumer Price Index climbed 3.5 percent in March, up from 3.2 percent in February,
according to the Labor Department. Economists polled by Reuters had forecast a reading
of 3.3 percent. Core CPI, which strips out food and energy, rose 3.8 percent year-over-year.

Markets reacted sharply. The S&amp;P 500 fell nearly one point five percent by midday,
while the yield on the 10-year Treasury note jumped to four point six percent.
Goldman Sachs revised its forecast, pushing back its expected first rate cut from June
to September. JPMorgan analysts said two cuts in 2026 now look "optimistic."

Powell emphasized that the Fed is not considering rate hikes at this stage, but stressed
that the path back to two percent inflation "may take longer than previously thought."
The next Fed meeting is scheduled for the first week of May.
"""

# Define LLM client
raw_client = OpenAI(api_key=os.environ.get("YOUR_OPENAI_API_KEY"))

# Define prompt for the LLM
raw_prompt = """
Extract the following information from the article below and return it as JSON:
- title
- author
- publication_date
- mentioned_organizations
- cpi_march_value
- key_claim
- market_sentiment

Article:
""" + SCRAPED_TEXT

# Get response from LLM
raw_response = raw_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": raw_prompt}]
)

raw_output = raw_response.choices[0].message.content
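
# Optional check (a sketch, not part of the original example). The reply above is
# plain text, so nothing guarantees that it parses as JSON, and models sometimes
# wrap JSON in a Markdown code fence. parse_json_reply is a hypothetical helper
# that strips such a fence and returns None when the reply is not valid JSON.
import json  # already imported at the top; repeated so the sketch stands alone


def parse_json_reply(text):
    fence = "`" * 3
    cleaned = text.strip()
    if cleaned.startswith(fence):
        # Keep only what sits between the opening fence line and the closing fence
        cleaned = cleaned.partition("\n")[2].rpartition(fence)[0]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # the caller decides whether to retry or skip the article


# Usage with the reply above: parsed_output = parse_json_reply(raw_output)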

# Print results
print(raw_output)</code></code></pre><p>A typical result looks like this:</p><pre><code><code>{
  "title": "Fed Signals Caution as Inflation Data Disappoints",
  "author": "Sarah M. Connelly",
  "publication_date": "April 14, 2026",
  "mentioned_organizations": [
    "Federal Reserve",
    "Labor Department",
    "Reuters",
    "Goldman Sachs",
    "JPMorgan"
  ],
  "cpi_march_value": "3.5 percent",
  "key_claim": "The Federal Reserve is in no rush to cut interest rates and needs greater confidence that inflation is moving sustainably toward its two-percent target before reducing borrowing costs.",
  "market_sentiment": "Negative"
}</code></code></pre><p>At first sight, this looks good. The prompt asked the GPT model to return JSON with specific fields, and the model did so. But two problems affect the next steps when you analyze this data:</p><ul><li><p>The publication date is reported as &#8220;April 14, 2026&#8221;. This is not ISO 8601 format (2026-04-14), so any parser that expects ISO dates will reject it.</p></li><li><p>The CPI is reported as &#8220;3.5 percent&#8221;, which is a string, not a float. You need a numeric value if you want to analyze it without intermediate cleaning steps.</p></li></ul><p>So, the LLM was able to give structure to unstructured text after being prompted to do so, but it did not provide the data in the right format. For that, you have to constrain the model&#8217;s output explicitly.</p><h3>What Changes When You Define the Schema</h3><p>To get guarantees on the output format, you can use the following code:</p><pre><code><code>import os
import json
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Optional, Literal

# Scraped content
SCRAPED_TEXT = """
Fed Signals Caution as Inflation Data Disappoints

By Sarah M. Connelly | April 14, 2026 | Economics &amp; Policy

The Federal Reserve signaled on Monday that it remains in no rush to cut interest rates,
after fresh inflation data showed consumer prices rose more than expected last month.
Jerome Powell, speaking at a conference in Washington, said the central bank needs
"greater confidence" that inflation is moving sustainably toward its two-percent target
before reducing borrowing costs.

The Consumer Price Index climbed 3.5 percent in March, up from 3.2 percent in February,
according to the Labor Department. Economists polled by Reuters had forecast a reading
of 3.3 percent. Core CPI, which strips out food and energy, rose 3.8 percent year-over-year.

Markets reacted sharply. The S&amp;P 500 fell nearly one point five percent by midday,
while the yield on the 10-year Treasury note jumped to four point six percent.
Goldman Sachs revised its forecast, pushing back its expected first rate cut from June
to September. JPMorgan analysts said two cuts in 2026 now look "optimistic."

Powell emphasized that the Fed is not considering rate hikes at this stage, but stressed
that the path back to two percent inflation "may take longer than previously thought."
The next Fed meeting is scheduled for the first week of May.
"""

# Validation schema
class ArticleExtraction(BaseModel):
    title: str = Field(description="The article's headline")
    author: Optional[str] = Field(description="Full name of the author if explicitly mentioned")
    publication_date: Optional[str] = Field(description="Publication date in ISO 8601 format (YYYY-MM-DD)")
    mentioned_organizations: list[str] = Field(description="All organizations referenced in the article")
    cpi_march_value: Optional[float] = Field(description="CPI value as a float (e.g. 3.5)")
    key_claim: str = Field(description="The central argument or finding of the article in one sentence")
    market_sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Overall market sentiment expressed in the article"
    )
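
# Optional hardening (a sketch, not part of the original example). The field
# description above only asks the model for ISO 8601; nothing in the schema
# rejects a reply such as "April 14, 2026". is_iso_date is a hypothetical
# stdlib-only helper you could call on extraction.publication_date before
# storing the record. Declaring the field as Optional[date] instead of
# Optional[str] would move this check into Pydantic itself.
from datetime import date


def is_iso_date(value):
    if value is None:
        return True  # the field is Optional, so a missing date is acceptable
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False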

structured_client = instructor.from_openai(OpenAI(api_key=os.environ.get("YOUR_OPENAI_API_KEY")))

extraction = structured_client.chat.completions.create(
    model="gpt-4o",
    response_model=ArticleExtraction,
    messages=[
        {
            "role": "user",
            "content": f"Extract structured information from the following article:\n\n{SCRAPED_TEXT}"
        }
    ]
)

print(extraction.model_dump_json(indent=2))

print("\n" + "=" * 60)
print("INSPECTION OUTPUT")
print("=" * 60)
for field, value in extraction.model_dump().items():
    print(f"  {field}: {repr(value)}  &#8594;  type: {type(value).__name__}")</code></code></pre><p>The above code relies on two libraries:</p><ul><li><p><strong><a href="https://pydantic.dev/">Pydantic</a></strong>: This is a Python data validation library. You define a schema as a Python class, declare the fields and their types, and Pydantic enforces that any data you put into that class matches what you declared.</p></li><li><p><strong><a href="https://python.useinstructor.com/">Instructor</a></strong>: This is the bridge between Pydantic and the LLM. The core problem it solves is that LLM APIs return text, but Pydantic validates Python objects. So, something has to sit in the middle, take the LLM&#8217;s response, parse it into the structure your Pydantic model expects, and retry the call if the output doesn&#8217;t validate. That&#8217;s what Instructor does. Without Instructor, you would have to manually prompt the model to return JSON, parse that JSON yourself, handle malformed responses, write retry logic, and coerce types by hand.</p></li></ul><p>With these two libraries in place, the <em>ArticleExtraction</em> class does the following:</p><ul><li><p><strong>Type enforcement:</strong> It declares <em>cpi_march_value</em> as a float, so the validated result holds an actual number instead of a string (3.5 instead of "3.5 percent", as in the previous example).</p></li><li><p><strong>Controls formatting and vocabulary:</strong> The <em>Literal</em> type on <em>market_sentiment</em> restricts the accepted values to <em>"positive"</em>, <em>"negative"</em>, or <em>"neutral"</em>, so any other category fails validation instead of reaching your data. Similarly, the description for <em>publication_date</em> explicitly asks for the ISO 8601 format. Note that this is an instruction to the model rather than a type check.</p></li><li><p><strong>Built-in prompting:</strong> The <em>Field(description="...")</em> parameters serve a dual purpose. First, they document the code for developers. 
Second, under the hood, Instructor passes these descriptions to the LLM as part of the schema, where they act as targeted instructions. This helps the model understand what &#8220;key claim&#8221; or &#8220;publication date&#8221; means in this context.</p></li><li><p><strong>Graceful omissions:</strong> Wrapping fields like <em>author</em> in <em>Optional[...]</em> allows the model to return a null value if the information isn&#8217;t present in the scraped text. This reduces the pressure to invent a value and, with it, the risk of hallucinations.</p></li></ul><p>The JSON output is as follows:</p><pre><code><code>{
  "title": "Fed Signals Caution as Inflation Data Disappoints",
  "author": "Sarah M. Connelly",
  "publication_date": "2026-04-14",
  "mentioned_organizations": [
    "Federal Reserve",
    "Labor Department",
    "Reuters",
    "Goldman Sachs",
    "JPMorgan"
  ],
  "cpi_march_value": 3.5,
  "key_claim": "The Federal Reserve remains cautious about cutting interest rates because inflation has not yet shown sufficient progress toward its two-percent target.",
  "market_sentiment": "negative"
}</code></code></pre><p>As you can see, now the CPI is a float, and the publication date is in ISO 8601.</p><p>The inspection output is the following:</p><pre><code><code>============================================================
INSPECTION OUTPUT
============================================================
  title: 'Fed Signals Caution as Inflation Data Disappoints'  &#8594;  type: str
  author: 'Sarah M. Connelly'  &#8594;  type: str
  publication_date: '2026-04-14'  &#8594;  type: str
  mentioned_organizations: ['Federal Reserve', 'Labor Department', 'Reuters', 'Goldman Sachs', 'JPMorgan']  &#8594;  type: list
  cpi_march_value: 3.5  &#8594;  type: float
  key_claim: 'The Federal Reserve remains cautious about cutting interest rates because inflation has not yet shown sufficient progress toward its two-percent target.'  &#8594;  type: str
  market_sentiment: 'negative'  &#8594;  type: str</code></code></pre><p>This inspection lets you confirm at a glance that the data types are correct.</p><h2>Conclusion</h2><p>In this article, you learned what unstructured text actually costs a data pipeline. You saw how classical NLP made structured extraction possible but fragile, and how LLMs removed the domain constraints that NLP never solved. You also learned why LLMs alone are not enough and saw a practical solution to provide &#8220;guardrails&#8221; for LLMs so that their output follows a defined schema.</p><p>So, let us know: how are you managing unstructured text after you have scraped it?</p>]]></content:encoded></item><item><title><![CDATA[Cloudflare Crawl Endpoint: Everything You Need to Know]]></title><description><![CDATA[Is the Cloudflare /crawl endpoint a real game-changer?]]></description><link>https://substack.thewebscraping.club/p/cloudflare-crawl-endpoint-for-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/cloudflare-crawl-endpoint-for-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 03 May 2026 20:24:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/898316de-e54e-4a62-8089-2ad66bc363b8_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cloudflare just shook the Web by announcing its first API for crawling entire websites. It&#8217;s built for RAG systems and website monitoring, but can it really be used for real-world web scraping scenarios?</p><p>In this article, you&#8217;ll find out this and more. 
I&#8217;ll walk you through a complete guided example of how to use it, and break down its (spoiler: undoubtedly serious) limitations.</p><h2>An Introduction to the Cloudflare Crawl Endpoint</h2><p>Before exploring the technical aspects behind the Cloudflare <em>/crawl</em> endpoint and seeing it in action, let me first give you some context!</p><h3>What Is the Cloudflare <em>/crawl</em> Endpoint?</h3><p>The <em><a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/">/crawl</a></em> endpoint is a new addition to <a href="https://developers.cloudflare.com/fundamentals/api/">Cloudflare&#8217;s REST APIs</a>. Its goal is to crawl an entire website (or just a portion of it) starting from a single URL.</p><p><strong>Note</strong>: The Crawl endpoint is currently in beta and was introduced on March 10, 2026, <a href="https://developers.cloudflare.com/changelog/post/2026-03-10-br-crawl-endpoint/">as highlighted in the Cloudflare changelog</a>.</p><p>Under the hood, it automatically discovers and visits new pages, <a href="https://developers.cloudflare.com/browser-rendering/">rendering them in a headless browser</a>. It returns the discovered content as HTML, Markdown, or structured JSON, making it ideal for RAG pipelines, monitoring, or dataset creation.</p><p>As I&#8217;ll dive into later, it respects <em>robots.txt</em> and <em>doesn&#8217;t</em> bypass bot protection or CAPTCHAs. 
Thus, it&#8217;s designed as a <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">compliant approach to web crawling!</a></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" loading="lazy" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>How It Works at a High Level</strong></h2><p>At a high level, the <em>/crawl</em> endpoint involves a two-step flow:</p><ol><li><p>You kick off an asynchronous crawl job, passing a starting URL. 
Cloudflare returns a job ID.</p></li><li><p>You use that job ID to periodically check the job&#8217;s status or fetch results as they become available, following typical <a href="https://en.wikipedia.org/wiki/Polling_(computer_science)">polling behavior</a>.</p></li></ol><p><strong>Important</strong>: A crawl job can run for <em>up to seven days!</em> Results remain available for 14 days after completion, after which the job data is deleted.</p><p>Behind the scenes, the crawler expands outward from the starting URL. By default, the API processes URLs in this order:</p><ol><li><p>The initial page.</p></li><li><p>Sitemap URLs.</p></li><li><p>Links discovered within pages.</p></li></ol><p>Still, you can tweak that depending on whether you want to prioritize sitemaps, page links, or both.</p><h3>Supported Use Cases</h3><p>The officially promoted use cases for the Cloudflare <em>/crawl</em> API are just two:</p><ul><li><p>Creating knowledge bases or training AI systems (like <a href="https://substack.thewebscraping.club/p/ingest-web-data-rag-llm">RAG applications</a>) using up-to-date web content.</p></li><li><p>Collecting and analyzing content across multiple pages <a href="https://substack.thewebscraping.club/p/build-an-ai-agent-for-scraping-papers">for research</a>, summarization, or monitoring purposes.</p></li></ul><h3>Pricing</h3><p>Unlike most other web crawling or discovery APIs on the market, Cloudflare&#8217;s <em>/crawl</em> API doesn&#8217;t charge by the number of pages. 
Instead, costs are based on resource usage, which depends on whether you enable the headless browser rendering feature.</p><p>If headless rendering is active, pricing follows the <a href="https://developers.cloudflare.com/browser-rendering/pricing/">Browser Rendering model</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vrIj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vrIj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 424w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 848w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1272w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png" width="1456" height="238" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:238,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48862,&quot;alt&quot;:&quot;The Browser Rendering pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Browser Rendering pricing model" title="The Browser Rendering pricing model" srcset="https://substackcdn.com/image/fetch/$s_!vrIj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 424w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 848w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1272w, https://substackcdn.com/image/fetch/$s_!vrIj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba6210b-ad61-44b3-91fe-b61a82e72e75_1490x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Browser Rendering pricing 
model</figcaption></figure></div><p>If rendering isn&#8217;t active, pricing follows the <a href="https://developers.cloudflare.com/workers/platform/pricing/">Workers model</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HVsl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HVsl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 424w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 848w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1272w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png" width="1456" height="356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66389,&quot;alt&quot;:&quot;The Workers pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Workers pricing model" title="The Workers pricing model" srcset="https://substackcdn.com/image/fetch/$s_!HVsl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 424w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 848w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1272w, https://substackcdn.com/image/fetch/$s_!HVsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb51cfce0-bd1b-4846-b038-15ab24483733_1536x376.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Workers pricing model</figcaption></figure></div><p><em>Yeah, I 
know&#8230; It&#8217;s honestly a bit confusing, and it&#8217;s almost impossible to predict the exact cost of a crawling task. The good news? You can test it for free!</em></p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>Cloudflare Crawl Endpoints: Technical Analysis</h2><p>Now that you know what the Cloudflare <em>/crawl</em> endpoint is and what it brings to the table, it&#8217;s time to take a closer look at how it works, along with its strengths and limitations.</p><h3><strong>Endpoint Presentation</strong></h3><p>The Cloudflare Crawl API is built around two main endpoints. Both share the same base URL:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl</code></pre></div><p>Where <em>&lt;ACCOUNT_ID&gt;</em> is your <a href="https://developers.cloudflare.com/fundamentals/account/find-account-and-zone-ids/#copy-your-account-id">Cloudflare account ID</a>.</p><h4>1. Initiate the Crawl Job (POST)</h4><p>To start a new crawl, send a POST request with the target URL (and optional parameters like depth, rendering mode, etc.) as shown below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -X POST 'https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl' \
  -H 'Authorization: Bearer &lt;CLOUDFLARE_API_TOKEN&gt;' \
  -H 'Content-Type: application/json' \
  -d '{ "url": "https://example.com" }'</code></pre></div><p>Keep in mind that the endpoint supports several parameters, allowing you to greatly customize the crawling behavior, output format (JSON, HTML, or Markdown), rendering options, caching, and more. Check out the <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#optional-parameters">full list of supported body parameters for all available options</a>.</p><p>Cloudflare immediately returns a job ID that you&#8217;ll use to track or retrieve results. A possible response looks like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{
  "success": true,
  "result": "9f1c2d3a-4b5e-6f7a-8c9d-0e1f2a3b4c5d"
}</code></pre></div><p>The UUID in the <em>result</em> field is the Crawl job ID you&#8217;ll use to poll for updates.</p><h4>2. Request Crawl Results (GET)</h4><p>Once the crawl is running, make a GET request with the job ID to check the status or fetch results:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -X GET 'https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/crawl/&lt;JOB_ID&gt;' \
  -H 'Authorization: Bearer &lt;CLOUDFLARE_API_TOKEN&gt;'</code></pre></div><p>Here, the <em>&lt;JOB_ID&gt;</em> placeholder is the UUID returned earlier in the <em>result</em> field.</p><p>While the job is still in progress, the response includes a <em>status</em> field like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">{
  "result": {
    "id": "3e7a1c92-b4d8-4f6e-9a21-6c0d5b8e2f14",
    "status": "running"
    // ...
  }
}</code></pre></div><p>The possible <em>status</em> values are: <em>running</em>, <em>completed</em>, <em>errored</em>, or one of several cancellation states (<em>cancelled_due_to_timeout</em>, <em>cancelled_due_to_limits</em>, <em>cancelled_by_user</em>).</p><p>Or, once the job is completed, calling the API returns the full results in the <em>records</em> field:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">{
  "result": {
    "id": "3e7a1c92-b4d8-4f6e-9a21-6c0d5b8e2f14",
    "status": "completed",
    "browserSecondsUsed": 98.3,
    "total": 12,
    "finished": 12,
    "records": [
      {
        "url": "https://example.com/",
        "status": "completed",
        "markdown": "# Example Domain\nThis domain is for use in illustrative examples...",
        "metadata": {
          "status": 200,
          "title": "Example Domain",
          "url": "https://example.com/"
        }
      },
      {
        "url": "https://example.com/about",
        "status": "completed",
        "markdown": "## About\nLearn more about this example site...",
        "metadata": {
          "status": 200,
          "title": "About - Example Domain",
          "url": "https://example.com/about"
        }
      }
      // additional entries omitted for brevity...
    ],
    "cursor": 10
  },
  "success": true
}</code></pre></div><p>Note that the response will vary based on the specified query parameters. For example, you can filter by specific statuses, limit the number of results, and <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#polling-for-completion">navigate through them using a pagination system</a>.</p><h3>Features</h3><p>Below are the most relevant capabilities provided by the Cloudflare Crawl API:</p><ul><li><p><strong>Asynchronous crawl jobs</strong>: Trigger crawling jobs and poll for results as they become available, enabling non-blocking, large-scale crawling workflows.</p></li><li><p><strong>Automatic URL discovery</strong>: Finds pages from the starting URL, sitemaps, and in-page links, with configurable discovery sources.</p></li><li><p><strong>Flexible output formats</strong>: Returns HTML, Markdown, or structured JSON. JSON leverages <a href="https://developers.cloudflare.com/workers-ai/features/json-mode/">Workers AI for schema-driven data extraction</a>.</p></li><li><p><strong>Headless browser rendering</strong>: Enable JavaScript execution with <em>render: true</em> or perform fast static HTML fetches with <em>render: false</em>.</p></li><li><p><strong>Fine-grained crawl control</strong>: Configure <em>limit</em>, <em>depth</em>, and URL inclusion/exclusion with the <em>includePatterns</em>/<em>excludePatterns</em> fields.</p></li><li><p><strong>Incremental and cache-aware crawling</strong>: Use the <em>modifiedSince</em> and <em>maxAge</em> parameters to avoid re-fetching unchanged content, optimizing performance and cost.</p></li><li><p><strong>Advanced filtering and pagination</strong>: Retrieve results using <em>limit</em>, <em>cursor</em>, and <em>status</em> filters to handle large datasets efficiently.</p></li><li><p><strong>Authentication and custom headers</strong>: Supports HTTP auth, cookies, and custom headers for crawling protected or API-driven 
content.</p></li><li><p><strong>Dynamic content handling</strong>: Wait for JS-rendered content using <em>gotoOptions</em> and <em>waitForSelector</em>, ideal for SPAs and interactive pages.</p></li><li><p><strong>Resource skipping for performance</strong>: Optionally block images, media, fonts, or stylesheets to speed up crawling and reduce unnecessary bandwidth usage.</p></li></ul><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, 
https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h3>Limitations</h3><p>Cloudflare&#8217;s <em>/crawl</em> API also comes with several important limitations, such as:</p><ul><li><p><strong>Respects bot protection</strong>: The crawler can&#8217;t <a href="https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026">bypass CAPTCHAs (including Turnstile challenges) or Cloudflare bot protections</a>. As a rule of thumb, sites protected via Cloudflare Bot Management or other WAFs tend to block crawling tasks entirely, limiting automated access and leading to incomplete datasets.</p></li><li><p><strong>Fixed User-Agent</strong>: The <em>/crawl</em> endpoint sets a non-customizable <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent">User-Agent</a> </em>value<em> </em>(<em>CloudflareBrowserRenderingCrawler/1.0</em>). 
You can&#8217;t change it, which may cause sites to block requests or serve different content based on the <em>User-Agent</em>.</p></li><li><p><strong>Content Signals enforcement</strong>: If a site disallows AI usage via <a href="https://contentsignals.org/">Cloudflare Content Signals</a>, crawl requests for those purposes are rejected with a <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/400">400 Bad Request</a></em> error. Even if the site allows other uses, attempts to crawl disallowed categories will fail, limiting AI-specific data collection.</p></li><li><p><strong>Rate limiting and crawl pacing</strong>: Sites with strict rate limits can slow down crawling. The crawler respects the robots.txt <em>Crawl-delay </em>directive and implements backoff. Large crawls may need to be split into smaller jobs to avoid throttling or skipped URLs.</p></li><li><p><strong>Browser usage limits and job cancellation</strong>: Accounts on Workers free plans are capped at 10 minutes of browser time per day. Exceeding this limit results in a <em>cancelled_due_to_limits</em> status. To avoid that, you can upgrade to a paid plan.</p></li></ul><h2>How to Use the Cloudflare Crawl Endpoint: Step-by-Step Tutorial</h2><p>In this guided section, I&#8217;ll show you how to use the Cloudflare Crawl Endpoint to crawl a website in Python. The target site will be the &#8220;<a href="https://quotes.toscrape.com/">Quotes to Scrape</a>&#8221; sandbox. 
The goal here is to demonstrate how to use the API, rather than actually collecting relevant data.</p><p>Follow the instructions below!</p><h3>Prerequisites</h3><p>To follow this tutorial section, make sure you have:</p><ul><li><p>Your <a href="https://developers.cloudflare.com/fundamentals/account/find-account-and-zone-ids/#copy-your-account-id">Cloudflare account ID</a> at hand.</p></li><li><p>A <a href="https://developers.cloudflare.com/fundamentals/api/get-started/create-token/">Cloudflare API token</a> with the &#8220;Browser Rendering - Edit&#8221; permission.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nJvY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nJvY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 424w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 848w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png" width="1456" height="781" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission" title="Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission" srcset="https://substackcdn.com/image/fetch/$s_!nJvY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 424w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 848w, https://substackcdn.com/image/fetch/$s_!nJvY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nJvY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56a37dbb-602d-41fc-a6ea-2fbddf3fb48e_3027x1624.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the API key with the required &#8220;Browser Rendering - Edit&#8221; permission</figcaption></figure></div><p>For the sake of simplicity and to keep this tutorial concise, I&#8217;ll assume you already have a Python project set up with <em><a href="https://substack.thewebscraping.club/p/python-http-request-explained">requests</a></em> 
installed. That said, you can use any programming language and any HTTP client, because the high-level logic remains the same.</p><h3>Step #1: Set Up the Configurations</h3><p>Start by importing the required libraries and reading the necessary secrets (your Cloudflare API token and account ID). Use these secrets to prepare the Cloudflare Crawl base URL and headers. Also, specify the starting target URL as a constant.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import requests
import time
import json

# The Cloudflare secrets required for authenticating Crawl API calls
CLOUDFLARE_API_TOKEN = "&lt;YOUR_CLOUDFLARE_API_TOKEN&gt;"
CLOUDFLARE_ACCOUNT_ID = "&lt;YOUR_CLOUDFLARE_ACCOUNT_ID&gt;"
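
# Optional sketch (not part of the original script): read the secrets from
# environment variables instead of hardcoding them. The helper name
# read_secret() and the variable names are assumptions made for illustration.
import os

def read_secret(name, default=None):
    # Look the value up in the process environment, falling back to the default
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value

# Possible usage, replacing the hardcoded values above:
# CLOUDFLARE_API_TOKEN = read_secret("CLOUDFLARE_API_TOKEN")
# CLOUDFLARE_ACCOUNT_ID = read_secret("CLOUDFLARE_ACCOUNT_ID")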

# The base URL used for all Crawl API calls
CLOUDFLARE_CRAWL_BASE_URL = f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl"

# Common headers shared by all API calls
HEADERS = {
    "Authorization": f"Bearer {CLOUDFLARE_API_TOKEN}",
    "Content-Type": "application/json"
}

# The URL to crawl
START_URL = "http://quotes.toscrape.com/"</code></pre></div><p><strong>Tip</strong>: In a production script, read the Cloudflare API token and account ID from environment variables rather than hardcoding them.</p><h3>Step #2: Trigger the Crawling Job</h3><p>Define a <em>start_crawl()</em> function to send a POST request to Cloudflare&#8217;s Crawl API:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def start_crawl(start_url):
    # Customize according to your needs
    payload = {
        "url": start_url,
        "limit": 20,
        "depth": 2,
        "formats": ["markdown"],
        "render": True
    }

    response = requests.post(CLOUDFLARE_CRAWL_BASE_URL, headers=HEADERS, json=payload)
    response.raise_for_status()

    job_id = response.json()["result"]
    return job_id</code></pre></div><p>This creates a new crawling job for the target URL and returns a <em>job_id</em> that identifies this specific crawl.</p><p><strong>Tip</strong>: In a production-level script, make the <em>payload</em> object configurable via function input arguments for greater flexibility and reusability.</p><h3>Step #3: Poll the Job Status</h3><p>Next, add a <em>wait_for_completion()</em> function to repeatedly check the job status every few seconds until the crawl finishes or times out:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def wait_for_completion(job_id, max_attempts=60, delay=5):
    status_url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}?limit=1"

    for attempt in range(max_attempts):
        res = requests.get(status_url, headers=HEADERS)
        res.raise_for_status()

        status = res.json()["result"]["status"]
        print(f"Attempt {attempt+1}: {status}")

        if status != "running":
            return status

        time.sleep(delay)

    raise Exception("Timeout waiting for crawl job")</code></pre></div><p>This makes GET calls to the Cloudflare <em>/crawl</em> endpoint and returns as soon as the job is no longer <em>running</em>. The returned status can be <em>completed</em> or a failure state, so the caller still has to check it before fetching the crawled records.</p><p><strong>Tip</strong>: The <em>limit=1</em> query parameter is recommended to restrict the number of retrieved records, keeping the response lightweight. After all, at this stage, you&#8217;re only interested in checking the job status, not in retrieving the actual output data.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Step #4: Get the Crawled Content Pages</h3><p>Build a <em>fetch_records()</em> function to collect all crawled pages:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def fetch_records(job_id):
    print("Fetching results...")

    all_records = []
    cursor = None

    while True:
        url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}"
        # Getting 10 records at a time
        params = {
            "limit": 10
        }

        if cursor:
            params["cursor"] = cursor

        response = requests.get(url, headers=HEADERS, params=params)
        response.raise_for_status()

        result = response.json()["result"]
        records = result.get("records", [])

        all_records.extend(records)
        print(f"+{len(records)} records (Total: {len(all_records)})")

        cursor = result.get("cursor")
        if not cursor:
            break

    return all_records</code></pre></div><p>This handles pagination using a <em>cursor</em>, accessing records in batches (<em>10</em> per request) until all results are returned.</p><h3>Step #5: Put It All Together</h3><p>Finally, in the <em>main()</em> function, orchestrate the workflow:</p><ol><li><p>Start the crawl</p></li><li><p>Wait for completion</p></li><li><p>Fetch all results</p></li></ol><p>Then, you can export the crawled records to a local JSON file for further use, store the retrieved data in a database, process it there, etc.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def main():
    job_id = start_crawl(START_URL)
    status = wait_for_completion(job_id)

    if status != "completed":
        raise Exception(f"Crawl failed: {status}")

    print("Crawl completed!")

    records = fetch_records(job_id)

    # Export the crawled pages to an output JSON file
    with open("records.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
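
# Optional sketch (not part of the original script): count the crawled records
# by status before exporting them. It assumes each record is a dict exposing a
# "status" key, which should be checked against the actual API response.
from collections import Counter

def summarize_records(records):
    # Records without a "status" key are grouped under "unknown"
    return Counter(record.get("status", "unknown") for record in records)

# Possible usage inside main(), right after fetch_records():
# print(dict(summarize_records(records)))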

if __name__ == "__main__":
    main()</code></pre></div><h3>Step #6: Complete Code</h3><p>This is what your Python script for interacting with the Cloudflare Crawl API will look like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># pip install requests

import requests
import time
import json

# The Cloudflare secrets required for authenticating Crawl API calls
CLOUDFLARE_API_TOKEN = "&lt;YOUR_CLOUDFLARE_API_TOKEN&gt;"
CLOUDFLARE_ACCOUNT_ID = "&lt;YOUR_CLOUDFLARE_ACCOUNT_ID&gt;"

# The base URL used for all Crawl API calls
CLOUDFLARE_CRAWL_BASE_URL = f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl"

# Common headers shared by all API calls
HEADERS = {
    "Authorization": f"Bearer {CLOUDFLARE_API_TOKEN}",
    "Content-Type": "application/json"
}

# The URL to crawl
START_URL = "http://quotes.toscrape.com/"
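
# Optional sketch for the Step #2 tip about a configurable payload (not part
# of the original script). build_payload() is a hypothetical helper; its
# defaults mirror the values hardcoded in start_crawl() below.
def build_payload(start_url, limit=20, depth=2, formats=("markdown",), render=True):
    return {
        "url": start_url,
        "limit": limit,
        "depth": depth,
        "formats": list(formats),
        "render": render,
    }

# Possible usage inside start_crawl():
# payload = build_payload(start_url, limit=50)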

def start_crawl(start_url):
    """
    Triggers the Cloudflare Crawl API job
    """

    # Customize according to your needs
    payload = {
        "url": start_url,
        "limit": 20,
        "depth": 2,
        "formats": ["markdown"],
        "render": True
    }

    response = requests.post(CLOUDFLARE_CRAWL_BASE_URL, headers=HEADERS, json=payload)
    response.raise_for_status()

    job_id = response.json()["result"]
    return job_id

def wait_for_completion(job_id, max_attempts=60, delay=5):
    """
    Waits for the crawling task to complete
    """

    status_url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}?limit=1"

    for attempt in range(max_attempts):
        res = requests.get(status_url, headers=HEADERS)
        res.raise_for_status()

        status = res.json()["result"]["status"]
        print(f"Attempt {attempt+1}: {status}")

        if status != "running":
            return status

        time.sleep(delay)

    raise Exception("Timeout waiting for crawl job")
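
# Optional sketch (not part of the original script): turn a job status into a
# short hint. Only the statuses mentioned in this article are mapped;
# describe_status() is a hypothetical helper, and any other value falls back
# to a generic message.
STATUS_HINTS = {
    "completed": "Crawl finished, records are ready to fetch.",
    "running": "Crawl still in progress, keep polling.",
    "cancelled_due_to_limits": "Browser time limit reached, consider a paid plan.",
}

def describe_status(status):
    return STATUS_HINTS.get(status, f"Unexpected status: {status}")

# Possible usage inside main():
# raise Exception(f"Crawl failed: {describe_status(status)}")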

def fetch_records(job_id):
    """
    Collects all records from the paginated results
    """

    print("Fetching results...")

    all_records = []
    cursor = None

    while True:
        url = f"{CLOUDFLARE_CRAWL_BASE_URL}/{job_id}"
        # Getting 10 records at a time
        params = {
            "limit": 10
        }

        if cursor:
            params["cursor"] = cursor

        response = requests.get(url, headers=HEADERS, params=params)
        response.raise_for_status()

        result = response.json()["result"]
        records = result.get("records", [])

        all_records.extend(records)
        print(f"+{len(records)} records (Total: {len(all_records)})")

        cursor = result.get("cursor")
        if not cursor:
            break

    return all_records

def main():
    job_id = start_crawl(START_URL)
    status = wait_for_completion(job_id)

    if status != "completed":
        raise Exception(f"Crawl failed: {status}")

    print("Crawl completed!")

    records = fetch_records(job_id)

    # Export the crawled pages to an output JSON file
    with open("records.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    main()</code></pre></div><h3>Step #7: Test the Script</h3><p>Launch the script, and it&#8217;ll produce an output like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XDal!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XDal!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 424w, https://substackcdn.com/image/fetch/$s_!XDal!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 848w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1272w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png" width="1175" height="412" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1175,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output produced by the script in the terminal&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output produced by the script in the terminal" title="The output produced by the script in the terminal" srcset="https://substackcdn.com/image/fetch/$s_!XDal!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 424w, https://substackcdn.com/image/fetch/$s_!XDal!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 848w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1272w, https://substackcdn.com/image/fetch/$s_!XDal!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bc028e2-7b42-49b3-880d-60fad6598d6d_1175x412.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The output produced by the script in the terminal</figcaption></figure></div><p>The polling mechanism required 5 attempts (~25 seconds), and the API discovered and retrieved 22 pages.</p><p>A <em>records.json</em> file will appear in your project directory. 
Open it, and you&#8217;ll see:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZwCj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 424w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 848w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png" width="1456" height="1071" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1071,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZwCj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 424w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 848w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwCj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063ebe32-5c8d-4c01-bdae-e6c599631afa_2286x1682.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The output produced by the script</figcaption></figure></div><p>Notice how the &#8220;Quotes to Scrape&#8221; entries contain a <em>markdown</em> field with the Markdown version of the page. By contrast, external links like Zyte&#8217;s homepage and Goodreads.com are skipped, since <em>includeExternalLinks</em> is set to <em>false</em> by default. In other words, unless you change that option, the Cloudflare Crawl API doesn&#8217;t fetch pages from domains other than that of the starting URL.</p><p>Et voil&#224;! 
Implementation complete.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Benchmark Against Protected Websites</h3><p>Cool! The Cloudflare Crawl endpoint works like a charm and is easy to use. However, I was particularly concerned about its documented limitations and wanted to verify whether they actually hold up in practice&#8230;</p><p>So, I ran tests against several well-known sites protected by common WAF and anti-bot solutions (from different providers). Here&#8217;s a summary of the results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!chL4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!chL4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 424w, https://substackcdn.com/image/fetch/$s_!chL4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 848w, 
https://substackcdn.com/image/fetch/$s_!chL4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1272w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png" width="1456" height="605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111887,&quot;alt&quot;:&quot;Cloudflare Crawl API vs anti-bot solutions&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192320016?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cloudflare Crawl API vs anti-bot solutions" title="Cloudflare Crawl API vs anti-bot solutions" srcset="https://substackcdn.com/image/fetch/$s_!chL4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 424w, 
https://substackcdn.com/image/fetch/$s_!chL4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 848w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1272w, https://substackcdn.com/image/fetch/$s_!chL4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da65ba2-fc8d-467f-aba1-b001b975b1ad_1490x619.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Cloudflare Crawl API vs anti-bot solutions</figcaption></figure></div><p>As you can tell, the limitations are very real. The results are quite discouraging:<strong> the Cloudflare Crawl API failed against all anti-bot&#8211;protected websites I tested.</strong></p><p>So, is this solution reliable for web scraping? When (and how) should you actually use it? Let me break that down in a final comment!</p><div><hr></div><h2>Final Comment</h2><p>In this article, I introduced you to one of the newest tools in Cloudflare&#8217;s growing ecosystem: the Crawl API! This endpoint is designed to help you crawl entire websites using distributed crawling tasks running on Cloudflare&#8217;s infrastructure.</p><p>Sure, the crawling mechanism works and is easy to launch, control, and implement. With just a few lines of code, you can get started. 
Still, several concerns should be raised:</p><ol><li><p><strong>Opaque pricing</strong>: Costs are tied to resource usage rather than the number of pages crawled, making them harder to predict.</p></li><li><p><strong>Fixed </strong><em><strong>User-Agent</strong></em>: The API doesn&#8217;t allow <em>User-Agent</em> customization, meaning even basic server-side checks can block it.</p></li><li><p><strong>Limited effectiveness on protected sites</strong>: By design, the API has a very low success rate against anti-bot&#8211;protected websites (unless you own the site and <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#robotstxt-and-bot-protection">allow it in your Cloudflare Bot Protection settings</a>).</p></li><li><p><strong>Rate-limiting constraints</strong>: It strictly respects <em>robots.txt</em> directives and crawl delays, which can significantly slow or limit large crawls.</p></li></ol><p>In simple terms, if you want to use it for general-purpose, large-scale web crawling, I wouldn&#8217;t recommend it. The market offers more effective solutions that can actually bypass anti-bot limitations. Plus, remember that around <em><a href="https://www.securitymagazine.com/articles/101188-65-of-websites-arent-protected-from-bots">35% of websites</a></em> are estimated to be protected against bots (i.e., you won&#8217;t be able to crawl them with this API).</p><p>Yet, if you know the target site is not protected, budget isn&#8217;t a concern, and you want to remain (<em>overly?</em>) ethical and compliant, the Cloudflare Crawl API can be an option.</p><p>I hope this breakdown helps you better understand this new solution and make an informed decision on whether to adopt it. Lastly, remember that the Cloudflare Crawl API is still in beta, so things may change soon. Just <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/">keep an eye on the docs for updates</a>. 
Until next time!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #103: Bypassing DataDome-Protected Websites in the Agentic Era]]></title><description><![CDATA[Fifteen browser configurations, one tough anti-bot, and only a couple made it to the cart]]></description><link>https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 30 Apr 2026 21:34:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e5dad0e-b094-41c0-942c-c76f3783b289_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This year, every web infrastructure company seems to be shipping a browser: not a regular one, but one designed to be driven by an AI agent and to look human while doing it. We wanted to know if any of those browsers actually work against a serious anti-bot, so we picked a hard target, leroymerlin.fr behind DataDome, and tested more than a dozen different setups on the same four-step task: open the homepage, search for a product, open the first result, add it to the cart.<br></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>The short answer is that a couple of tools finished the task, just one with any consistency. The story behind why is worth telling, because it explains what is happening at the intersection of AI agents and web data right now. We ran a similar exercise <a href="https://substack.thewebscraping.club/p/bypassing-cloudflare-in-2026">against Cloudflare earlier this year</a>, and the conclusion is broadly the same: each anti-bot needs its own answer, and the answer changes every quarter.</p><h2>From workflows to agents, and why that changes the data problem</h2><p>Most code shipped under the AI banner is not really agentic. It is workflow code with an LLM dropped into a slot: generate a summary here, classify a record there, draft an email at the end. The control flow is hard-coded, and the model is one component among many.</p><p>The definition of an agent is quite different. The model decides the next action, observes the outcome, and decides again. The control flow lives inside the loop, not outside it. 
The agent has goals rather than scripts, and it picks tools and steps based on what it sees. That is what makes the engineering interesting, that is what makes it hard, and that is what sometimes makes it unreliable.</p><p>It also forces a different relationship with data. An agent that only sees its training corpus is stuck in the past. To make decisions worth anything, it has to read prices that change daily, stocks that move minute by minute, listings that did not exist last week. Some of that data sits behind APIs. Most of it does not. The web is still the largest and most current dataset in the world, and most of it is reachable only through a browser. So if we want our agents to act on real information, we have to give them a way to browse: opening a page, reading it, clicking a link, typing into a search bar, following a result, filling a form, all on sites that were never built for machines.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:69.35779816513761,&quot;width&quot;:630,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><p>This is the constraint that produced the wave of &#8220;agentic browser&#8221; launches we have seen over the last twelve months. Y Combinator alone has backed a long string of them. 
<a href="https://www.hyperbrowser.ai/">Hyperbrowser</a> (S21) was an early entry: scalable cloud browser infrastructure with built-in CAPTCHA solving, proxy management, and now a multi-agent playground. The newer cohort followed the agent wave more directly: <a href="https://www.browseros.com/">BrowserOS</a> (S24) is an open-source agentic browser that runs the agent locally on the user&#8217;s machine; <a href="https://browser-use.com/">Browser Use</a> (W25) offers an open-source agent loop on top of Playwright, plus a cloud version. <a href="https://www.skyvern.com/">Skyvern</a> is a self-hostable browser agent that uses an LLM and computer vision instead of fixed selectors.  Outside the YC pipeline, <a href="https://lightpanda.io/">Lightpanda</a> is doing something different again, a headless browser engine written from scratch in Zig and aimed squarely at agents and crawlers (claiming roughly 9x faster execution and 16x lower memory than Chrome). It fits the &#8220;browser built for machines&#8221; line of thought we covered in <a href="https://substack.thewebscraping.club/p/rethinking-the-web-browser">Rethinking the web browser</a> earlier this year. <a href="https://www.browserbase.com/">Browserbase</a> ships a managed browser plus Stagehand for natural-language automation. And the big AI labs are now in the same space: OpenAI shipped Operator and the ChatGPT Atlas browser, Anthropic shipped Computer Use, Perplexity launched Comet. Each project attacks the same problem from a slightly different angle, but the goal is identical: a browser an agent can drive without immediately tripping every detection mechanism on the other side.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h2>The same problem scrapers have been chasing for a decade</h2><p>For anyone who has worked in web data, none of this is new. The fight over whether a request looks human or automated has been going on as long as commercial scraping has existed. The product names have changed, but the purpose has not.</p><p>What has changed is who is selling the bypass. The companies that have spent years selling residential proxies and unblockers noticed quickly that the agentic boom is good for their business. They already have the IP networks, the fingerprint research, the bypass code, the cat-and-mouse experience. They know what TLS handshake Chrome sent in October 2025 and what it sent in October 2024. Pivoting all of that into a managed browser is a smaller leap than building one from scratch. <a href="https://brightdata.com">Bright Data</a>, <a href="https://oxylabs.io">Oxylabs</a>, <a href="https://rayobyte.com">Rayobyte</a>, and <a href="https://www.zenrows.com">ZenRows</a> have all added a managed browser product alongside their proxies.</p><p>The other side of the line is moving in the opposite direction. Bot traffic has grown faster than human traffic for years, and the operators of large public sites care more about it than ever. <a href="https://datadome.co">DataDome</a>, <a href="https://www.cloudflare.com/products/bot-management/">Cloudflare Bot Management</a>, <a href="https://www.akamai.com/products/bot-manager">Akamai Bot Manager</a>, <a href="https://www.humansecurity.com">HUMAN</a>, <a href="https://www.kasada.io">Kasada</a>: every one of them ships updates that target the exact tools we just listed. Fingerprint checks get stricter. Behavioral models get more sensitive. 
The JavaScript challenge changes shape every few weeks. There is no silver bullet, and there is no tool, browser, proxy, or service that bypasses every anti-bot on every site at all times. Anyone who claims otherwise is selling something that worked last quarter and might still work this week. The useful question is what works on a given target, today, at what cost.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Picking a hard target</h2><p>To answer that question concretely, we needed a target where the anti-bot was good and the signal was clean. We picked leroymerlin.fr, the French DIY retailer. Leroy Merlin runs DataDome standalone, with no other anti-bot layer on top, so attribution is straightforward. It also runs one of the more verbose DataDome configurations we have come across: response headers expose <code>x-datadome-riskscore</code>, <code>x-datadome-protection</code>, <code>x-datadome-cid</code>, and <code>x-datadome-endpointid</code>. Most DataDome-protected sites only show us the outcome. Here we see the score the engine assigns at every request, which is rare and very useful when comparing tools side by side.</p><p>The task we picked is small but realistic. From the homepage, the agent has to type &#8220;ampoule B22 led blanc&#8221; into the search bar, click the first product result, and add the product to the cart. Four steps. We dropped the login step on purpose: leroymerlin.fr requires an OTP to sign in, and we did not want OTP friction to confound an anti-bot test.</p><p>A run is a pass if the agent reaches the cart confirmation. 
Otherwise, we record where it stopped and what DataDome said about it. Each tool runs ten times back to back, and we aggregate the results. Tools that support an external proxy use the same two residential pools: Bright Data residential FR for the Bright Data runs, and <a href="https://geonode.com">Geonode</a> residential FR for the Geonode runs. Tools that ship their own proxy use it. We used two different providers to diversify the IP addresses and make sure that blocks were not a matter of IP reputation.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>The contestants</h2><p>As you saw above, the browser landscape is quite crowded, and we could not cover all the tools. We picked four open-source projects and seven commercial products. Let&#8217;s start with the open-source ones.</p><p><a href="https://camoufox.com">Camoufox</a> is the stealth Firefox fork most people in the scraping world have already met (we <a href="https://substack.thewebscraping.club/p/open-source-python-libraries-scraping">introduced it</a> on TWSC back in September 2024). It rotates real-world fingerprints, patches the obvious automation tells, and ships a Playwright-compatible API. We pair it with both Bright Data and Geonode residential proxies in France.</p><p><a href="https://github.com/autoscrape-labs/pydoll">Pydoll</a> takes a different route: it drives Chromium directly over CDP without WebDriver, with built-in humanized cursor movement and typing. 
Importantly, Pydoll implements an explicit <code>Fetch.authRequired</code> handler, which lets it authenticate proxies that require Basic auth. </p><p><a href="https://scrapling.readthedocs.io">Scrapling</a> is a higher-level Python library. We use it in two modes. <code>DynamicFetcher</code> launches vanilla Playwright Chromium driven by Scrapling&#8217;s session manager. <code>StealthyFetcher</code> does the same, but under the hood uses an improved and customized version of <a href="https://github.com/Kaliiiiiiiiii-Vinyzu/patchright">patchright</a>, a stealth-patched Playwright fork. Each gets its own row in the comparison. </p><p><a href="https://github.com/rayobyte-data/rayobrowse">RayoBrowse</a> is the self-hosted stealth Chromium fork from Rayobyte, distributed as a Docker container that exposes a CDP endpoint on port 9222. Here we hit a wall worth flagging: for some reason RayoBrowse could not use the Bright Data residential proxy in our setup. Every navigation through that proxy failed instantly, even though the same credentials worked fine through <code>curl</code> from inside the same container. The same RayoBrowse setup worked fine with Geonode. We did not isolate the root cause, so we report RayoBrowse on Geonode only.</p><p>The commercial side is more crowded. </p><p><a href="https://browser-use.com/">Browser Use</a> exists in two flavors, and we tested both. The cloud version is the managed Browser Use, with its own residential proxy, its own stealth fingerprinting, and a fixed set of supported models; we drove it once in raw CDP mode (we steer it ourselves with Playwright) and once in agent mode (we hand the LLM the task in natural language and let it plan the steps). </p><p><a href="https://www.browserbase.com/">Browserbase</a> is a managed Chromium with optional residential proxies, Cloudflare Web Bot Auth verification, and the Stagehand agent SDK. 
We discovered during the test that the free tier excludes proxies entirely; without one, the session egresses from a US datacenter. We left this configuration in the test because it is what a free user would experience.</p><p><a href="https://www.browserless.io">Browserless</a> is a managed browser-as-a-service whose anti-bot story is a stealth path (<code>/chromium/stealth</code>) plus optional residential proxies for paid plans. The free plan caps sessions at 60 seconds, which is tight for a four-step flow. We tested it with the built-in residential proxy targeting France, and tried to test it with our external proxies via the <code>externalProxyServer</code> parameter; the external mode failed at connection time on every run, with what looked like the same Chromium-side proxy authentication failure that broke RayoBrowse, so we dropped those configurations from the comparison.</p><p><a href="https://zenrows.com/">ZenRows</a> Scraping Browser is a managed Chromium with a built-in residential proxy network and built-in CAPTCHA solving; we connect via the WSS endpoint with <code>proxy_country=fr</code> to get a French exit point.</p><p><a href="https://brightdata.com/">Bright Data Browser API</a> sits at the other end of the same product category: a managed Chromium with built-in residential rotation and CAPTCHA solving, on a dedicated Browser API zone we configured on their dashboard.</p><p>As always, the code can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">103.BROWSERS</a>.</strong></p><h2>What we had to fix before the numbers made sense</h2>
      <p>
          <a href="https://substack.thewebscraping.club/p/the-lab-103-bypassing-datadome-protected">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Stop Paying for Bandwidth: How to Leverage IPv6 Subnets for Infinite Proxy Rotation]]></title><description><![CDATA[Escape metered residential proxy billing. Discover how to build a self-hosted, rotating proxy gateway using IPv6 /64 subnets to drastically cut your web scraping costs at scale.]]></description><link>https://substack.thewebscraping.club/p/use-ipv6-scraping-nyxproxy</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/use-ipv6-scraping-nyxproxy</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 26 Apr 2026 20:30:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/21b6b18a-a1f6-4511-aec6-c5fc9ba435cd_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p style="text-align: justify;">When your data extraction pipelines scale from a few thousand requests a day to thousands of requests per second, the bottleneck becomes network egress and IP reputation. Modern web architectures are defended by sophisticated Web Application Firewalls (WAFs) that deploy strict rate limiting, fingerprinting, and behavioral analysis.</p><p style="text-align: justify;">This means that if you route all your traffic through a single egress IP, you will be rate-limited in seconds and blacklisted in minutes. To survive at scale, you need to distribute your requests across a massive pool of IP addresses.</p><p style="text-align: justify;">Traditionally, the web scraping industry has solved this issue thanks to commercial proxy providers. However, this is not the only approach. This article responds to the following question: &#8220;<em>Is there a way to scrape at scale without burning budget on proxies</em>?&#8221;</p><p style="text-align: justify;">The answer is yes. But let&#8217;s be clear from the beginning: This approach is not a universal silver bullet. 
Let&#8217;s see how it works, how to build it, and what its limitations are.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>The Typical Solution for Scraping at Scale: Proxy Provider Services</h2><p style="text-align: justify;">Let&#8217;s start this discussion with the typical choice for scraping at scale. 
IP bans and rate limits are the number one operational problem in scraping, especially at scale. The standard solution is to use proxy servers, for a simple reason: <a href="https://substack.thewebscraping.club/i/164246773/what-are-proxies-and-why-are-they-used">proxies act as intermediaries between your scrapers and the Internet</a>, which keeps your own IPs from getting banned. To do so, companies buy proxy IPs from proxy providers. The two most common categories, each with its own flaws, are the following:</p><ul><li><p style="text-align: justify;"><strong>Datacenter proxies:</strong> These are cheap and fast, but their ASNs (Autonomous System Numbers) are heavily scrutinized. WAFs maintain databases of known datacenter CIDR (Classless Inter-Domain Routing) blocks, so hitting a target with a static list of 100 datacenter proxies usually results in those IPs being flagged and blocked within hours.</p></li><li><p style="text-align: justify;"><strong>Residential proxies:</strong> These route traffic through actual consumer devices. They have highly trusted IP reputations, making them excellent for bypassing anti-bot systems. However, they are priced by bandwidth, so they become very expensive when scraping at scale.</p></li></ul><p style="text-align: justify;">The main limitation of this approach is cost. So, what if you need to scrape at scale but don&#8217;t have the budget for it?</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>An Alternative Approach: Scraping at Scale With Dedicated Infrastructure</h2><p style="text-align: justify;">To escape metered billing, you can move egress back to dedicated infrastructure. But before presenting the solution, let&#8217;s briefly look at what happens, at the infrastructure level, when you buy and use proxies.</p><h3>Buying Proxies Means Delegating Your Infrastructure</h3><p style="text-align: justify;">When you buy proxies from providers, you are delegating 100% of your infrastructure. When your scrapers make a request, it first reaches the provider&#8217;s gateway, which is a massive load balancer controlled entirely by the provider itself.</p><p style="text-align: justify;">Let&#8217;s consider the case of residential proxies, for simplicity. Behind the gateway is a peer-to-peer (P2P) network of millions of consumer devices that the provider has acquired bandwidth from. When your request hits the gateway, <strong>their proprietary routing algorithm decides which consumer device in which country will act as your final exit node</strong>.</p><p style="text-align: justify;">The second you route traffic through their gateway is the exact moment you delegate 100% of your scraping infrastructure.</p><div><hr></div><blockquote><p>Your scraping workflows deserve a proxy infrastructure that just works. 
With <strong>Swiftproxy</strong> on your side, consistency is built-in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g3qW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png" width="670" height="83.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:445760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/193806031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g3qW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 424w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 848w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1272w, https://substackcdn.com/image/fetch/$s_!g3qW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49af1519-bfa5-4c03-afea-f45743bd057b_2880x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.swiftproxy.net/?ref=webscrapingclub&quot;,&quot;text&quot;:&quot;Try Swiftproxy 
today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.swiftproxy.net/?ref=webscrapingclub"><span>Try Swiftproxy today</span></a></p></blockquote><div><hr></div><h3>NyxProxy: The Infrastructural Solution</h3><p style="text-align: justify;"><a href="https://github.com/phanes-io/nyxproxy-oss?tab=readme-ov-file">NyxProxy</a> is a self-hosted HTTP/SOCKS5 proxy server that exploits a well-known IPv6 networking trick: When a cloud provider gives you a <em>/64</em> subnet, you legally own 18.4 <em>quintillion</em> IPv6 addresses.</p><p style="text-align: justify;">Let&#8217;s explain the number and the trick around IPv6s. An IPv6 address looks like this:</p><pre><code><code> 2a05:f480:1800:25db:0000:0000:0000:0001</code></code></pre><p style="text-align: justify;">They are 128 bits long. That gives <em>2^128</em> possible addresses. The number is so large that the designers said: &#8220;W<em>e can afford to give every organization a massive block and never worry about running out&#8221;.</em></p><p style="text-align: justify;">Now, here is the trick. An IPv6 address is split into two halves, 64 bits each:</p><pre><code><code>2a05:f480:1800:25db : 0000:0000:0000:0001
|___________________|   |_________________|
   Network prefix            Host part
   (your subnet)          (you control this)</code></code></pre><p style="text-align: justify;">The <em>/64</em> notation means: the first 64 bits identify the network, the last 64 bits are yours to assign however you want. The last 64 bits can be any value from <em>0000:0000:0000:0000</em> to <em>ffff:ffff:ffff:ffff</em>: That&#8217;s <em>2^64</em>, or roughly 18.4 quintillion, combinations. All valid addresses, all routable to your server.</p><p style="text-align: justify;">Thanks to this trick, NyxProxy can assign a pool of those addresses to your network interface at startup, then rotate your outgoing traffic across them. This means each request can leave from a different IP. The tool handles pool management, background rotation, and NDP proxying via <em>ndppd</em>, and it exposes a monitoring endpoint.</p><p style="text-align: justify;">The crucial piece is the NDP proxying. When your server uses a random address like <em>2a05:f480:1800:25db:a3f1:9922:beef:1234</em> as a source IP, the upstream router needs to know that <em>your server is responsible for that address</em>. Otherwise, the response packets have nowhere to go.</p><p style="text-align: justify;">IPv6 uses NDP (Neighbor Discovery Protocol) for this. The router sends an NDP query: <em>&#8220;who has 2a05:f480:1800:25db:a3f1:9922:beef:1234?&#8221;</em> and your server must answer.</p><p style="text-align: justify;"><em><a href="https://github.com/DanielAdolfsson/ndppd">ndppd</a></em> (NDP Proxy Daemon) runs on your server and answers those queries automatically for your entire /64 subnet, essentially saying <em>&#8220;yes, all of those addresses are mine&#8221;</em>. 
Without it, your packets go out, but responses never come back.</p><p style="text-align: justify;">Below is a summary schema of how this whole process works:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;ac241add-8e8d-40d0-a7df-518bccfc20bc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Provider gives you:  2a05:f480:1800:25db::/64
                     &#8595;
Your server can use: 2a05:f480:1800:25db:[anything]
                     &#8595;
NyxProxy assigns 200 random IPs to your interface
                     &#8595;
Each outgoing request binds to a different one
                     &#8595;
Target sees 200 different source IPs
                     &#8595;
ndppd makes sure responses route back correctly</code></pre></div><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>How To Use NyxProxy</h2><p>Let&#8217;s now see how to use NyxProxy with a practical implementation.</p><h3>Environment Setup &amp; Prerequisites</h3><p style="text-align: justify;">To replicate this tutorial for deploying NyxProxy and utilizing it in your scraping scripts, you must have the following system and hardware requirements:</p><ul><li><p style="text-align: justify;"><strong>Hardware</strong>: A Virtual Private Server (VPS) or bare-metal server with at least 512 MB of RAM and 100 MB of disk space. Supported architectures are <em>amd64</em> or <em>arm64</em>.</p></li><li><p style="text-align: justify;"><strong>Subnet</strong>: A cloud provider that natively delegates a full IPv6 <em>/64</em> subnet to your network interface. 
Note that not all the VPS providers are supported: Check out the <a href="https://github.com/phanes-io/nyxproxy-oss?tab=readme-ov-file#network-requirements">NyxProxy documentation to learn more about supported VPSs</a>.</p></li><li><p style="text-align: justify;"><strong>Operating system</strong>: A modern Linux distribution, specifically Ubuntu or Debian, to ensure compatibility with the automated setup scripts and <em>sysctl</em> kernel modifications.</p></li><li><p style="text-align: justify;"><strong>Python</strong>: <a href="https://www.python.org/downloads/">Python 3.7 or higher</a> installed on your local machine to run the scraping scripts.</p></li></ul><p style="text-align: justify;">To get your server ready to run the proxy daemon, you need to verify your IPv6 setup and gain root access. Ensure you are logged into your VPS via SSH as the <em>root</em> user, or have <em>sudo</em> privileges.</p><p style="text-align: justify;">First, verify that your server has a globally routable IPv6 <em>/64</em> subnet assigned to it. You can check this by running the following command in your server&#8217;s terminal:</p><pre><code><code>ip -6 addr show | grep "scope global"</code></code></pre><p>If done correctly, you should see an output similar to the following:</p><pre><code><code>inet6 2a05:f480:1800:25db::1/64 scope global</code></code></pre><p>If you do not see a <em>/64</em> subnet, you will not be able to rotate IPs, and you must review your cloud provider&#8217;s network settings.</p><p>Next, prepare your local development environment. Suppose you call the main folder of your Python project <em>nyxproxy_scraper/</em>. At the end of this step, the folder will have the following structure:</p><pre><code><code>nyxproxy_scraper/
    &#9500;&#9472;&#9472; main.py
    &#9492;&#9472;&#9472; venv/</code></code></pre><p>Where:</p><ul><li><p><em>main.py</em> is the Python file that will store your proxy request logic.</p></li><li><p><em>venv/</em> contains the standard Python virtual environment.</p></li></ul><p>You can create the <em>venv/</em> <a href="https://docs.python.org/3/library/venv.html">virtual environment</a> directory like so:</p><pre><code><code>python -m venv venv</code></code></pre><p>To activate it, on Windows, run:</p><pre><code><code>venv\Scripts\activate</code></code></pre><p>Equivalently, on macOS and Linux, execute:</p><pre><code><code>source venv/bin/activate</code></code></pre><p>As a final prerequisite, install the <a href="https://requests.readthedocs.io/en/latest/">Requests library</a> in your activated virtual environment so your Python script can make HTTP calls:</p><pre><code><code>pip install requests</code></code></pre><p>Well done! You are now ready to install and test NyxProxy.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3><strong>Installing and Configuring NyxProxy</strong></h3><p style="text-align: justify;">NyxProxy provides a quick setup script that handles the infrastructural heavy lifting. 
It auto-detects your network interface, installs <em>ndppd</em>, tweaks the Linux kernel parameters via <em>sysctl</em> to allow non-local binding, and downloads the compiled Go binary.</p><p style="text-align: justify;">You can launch it with the following single command:</p><pre><code><code>wget https://raw.githubusercontent.com/jannik-schroeder/nyxproxy-oss/main/scripts/quick-setup.sh &amp;&amp; chmod +x quick-setup.sh &amp;&amp; sudo ./quick-setup.sh</code></code></pre><p style="text-align: justify;">During the setup, you will be prompted to configure your proxy credentials and set your rotation rules. Behind the scenes, the script generates a <em>config.yaml</em> file. Let&#8217;s look at the crucial subset of that configuration:</p><pre><code><code>network:
  rotate_ipv6: true
  ipv6_subnet: "2a05:f480:1800:25db::/64"

  # The rotation mechanics:
  ipv6_pool_size: 200
  ipv6_max_usage: 100
  ipv6_max_age: 30</code></code></pre><p style="text-align: justify;">Below is an explanation of what these three parameters mean for your scraping pipeline:</p><ul><li><p style="text-align: justify;"><em>ipv6_pool_size</em>: NyxProxy keeps 200 mathematically unique IPs &#8220;hot&#8221; and bound to your network interface at any given time. This keeps proxy startup times under 100ms while maintaining IP diversity.</p></li><li><p style="text-align: justify;"><em>ipv6_max_usage</em>: After a specific IP has been utilized for 100 requests, it is considered &#8220;burned.&#8221; NyxProxy destroys the route and spins up a fresh address to dynamically replace it.</p></li><li><p style="text-align: justify;"><em>ipv6_max_age:</em> If an IP hasn&#8217;t hit 100 requests but has been alive for 30 minutes, it gets forcefully rotated out. This prevents time-based algorithmic tracking by the target WAF.</p></li></ul><p style="text-align: justify;">Once the daemon is running as a systemd service, your VPS is officially acting as a rotating proxy gateway. When NyxProxy receives a scraper request, the underlying Go binary takes over. It looks at its internal memory, picks one of the 200 rotating IPv6 addresses in its pool, and binds to that specific address to establish the outbound connection.</p><p>The expected output is as follows:</p><pre><code><code>IPv6 rotation mode: IP Pool with dynamic rotation
  Interface: enp1s0
  Subnet: 2a05:f480:1800:25db::/64
  Pool size: 200 IPs
  Rotation: Every 100 uses or 30m0s
  Initializing IP pool...
  Progress: 50/200 IPs added
  Progress: 100/200 IPs added
  Progress: 150/200 IPs added
  Progress: 200/200 IPs added
  IP pool ready with 200 addresses
  Background IP rotation started

Starting https proxy on 0.0.0.0:8080 (Protocol: IPv6)</code></code></pre><div><hr></div><h3><strong>Testing the Proxy Logic</strong></h3><p style="text-align: justify;">At this point, NyxProxy has done its job. To verify it works correctly, you can use the following Python script that hits <em><a href="https://www.ipify.org/">api6.ipify.org</a></em>, which is an API that simply bounces back the IP address it sees:</p><pre><code><code>import requests

# Point this to your VPS IP and the credentials you set during setup
proxies = {
    'http': 'http://admin:password@your-vps-ip:8080',
    'https': 'http://admin:password@your-vps-ip:8080'
}

# Test 5 consecutive scraping requests
for i in range(5):
    response = requests.get('https://api6.ipify.org', proxies=proxies)
    print(f"Request {i+1}: Target sees IP -&gt; {response.text}")
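
# --- Optional sanity check (an illustrative sketch, not part of NyxProxy) ---
# Assumptions: the default subnet below is the example /64 used in this article,
# so replace it with your own; check_rotation is a hypothetical helper name.
import ipaddress


def check_rotation(seen_ips, subnet="2a05:f480:1800:25db::/64"):
    """Return True if every IP is inside the subnet and no IP repeats."""
    network = ipaddress.ip_network(subnet)
    addresses = [ipaddress.ip_address(ip.strip()) for ip in seen_ips]
    all_inside = all(addr in network for addr in addresses)
    all_unique = len(set(addresses)) == len(addresses)
    return all_inside and all_unique

# Possible usage: append each response.text from the loop above to a list,
# then call check_rotation(that_list) once the loop has finished.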
</code></code></pre><p style="text-align: justify;">(NOTE: If you are already familiar with ipify.org, keep in mind that the &#8220;api6&#8221; subdomain accepts IPv6 requests only.)</p><p>The result should be similar to the following:</p><pre><code><code>Request 1: Target sees IP -&gt; 2a05:f480:1800:25db:1a2b:3c4d:5e6f:7890
Request 2: Target sees IP -&gt; 2a05:f480:1800:25db:9988:7766:5544:3322
Request 3: Target sees IP -&gt; 2a05:f480:1800:25db:aaaa:bbbb:cccc:dddd
Request 4: Target sees IP -&gt; 2a05:f480:1800:25db:1122:3344:5566:7788
Request 5: Target sees IP -&gt; 2a05:f480:1800:25db:dead:beef:cafe:babe</code></code></pre><p style="text-align: justify;">This shows that every HTTP request uses a different, globally routable IPv6 address generated from your subnet block. To a target that only tracks individual addresses, these look like distinct clients, even though they all share the same <em>/64</em> prefix.</p><p style="text-align: justify;">Perfect! You have successfully built a self-healing, rotating proxy pool without paying for metered residential bandwidth.</p><h2>The Illusion of Infinity: Critical Limitations of IPv6 Subnet Rotation</h2><p style="text-align: justify;">At this point, you may think you have found a solution to all of your budgeting problems for scraping at scale. But before you tear down your commercial proxy infrastructure, you must understand that a $5/month VPS and an open-source rotation daemon are not a universal silver bullet. If it were that simple, the commercial proxy industry would not exist.</p><p>This architecture has the following main limitations:</p><ul><li><p style="text-align: justify;"><strong>The IPv4 compatibility wall:</strong> This entire architecture is built on one absolute prerequisite: Your target endpoint must support IPv6. If you are scraping legacy enterprise systems or platforms that haven&#8217;t migrated to dual-stack networking, this setup is useless. You cannot route an IPv6 packet to an IPv4-only server.</p></li><li><p style="text-align: justify;"><strong>Subnet-level bans (</strong><em><strong>/64</strong></em><strong> prefix blocking):</strong> Enterprise WAFs are fully aware of IPv6 prefix delegation standards. They know that hosting providers allocate a <em>/64</em> subnet to a single client. 
If their heuristics detect suspicious behavioral patterns at high concurrency (such as missing <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">browser fingerprints</a> or anomalous TLS handshakes) originating from <em>2a05:f480...:1a2b</em>, they can ban the entire <em>/64</em> CIDR block. Once your <em>/64</em> prefix is banned, all of your roughly 18 quintillion &#8220;infinite&#8221; IPs are dead at once. To recover, you must destroy the VPS and provision a new one in a different IP range.</p></li><li><p style="text-align: justify;"><strong>ASN reputation:</strong> No matter how many IPs you rotate, your traffic still originates from a datacenter Autonomous System Number (ASN). Target firewalls assign a baseline trust score to every ASN. Traffic originating from a datacenter ASN typically starts with a much lower trust score than traffic from a residential ASN. On highly restrictive targets, any request from a datacenter IP may be met immediately with an unsolvable CAPTCHA or a hard <em>403 Forbidden</em>, regardless of whether it&#8217;s IPv4 or IPv6.</p></li><li><p style="text-align: justify;"><em>nf_conntrack</em><strong> and hardware exhaustion:</strong> You cannot push enterprise-grade throughput on a $5, 1-vCPU server without consequences. Rotating thousands of IPv6 addresses requires the Linux kernel to aggressively maintain the <em><a href="https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.txt">nf_conntrack</a></em> table and the NDP proxy table. At high concurrency, the overhead of establishing, tracking, and tearing down thousands of TCP sockets across rotating addresses can exhaust the memory or CPU of a low-tier VPS. 
The kernel will then start dropping packets, latency will spike to unusable levels, and your scrapers will start receiving errors.</p></li></ul><h2>Conclusion</h2><p style="text-align: justify;">In this article, you learned how to leverage your hosting provider&#8217;s IPv6 <em>/64</em> subnets to build an infinitely rotating proxy pool with NyxProxy, escaping the metered billing of residential proxy networks.</p><p style="text-align: justify;">The competitive advantage of engineering your own proxy infrastructure lies in your unit economics and architectural control. However, you also learned that this solution is not a silver bullet for every scraping scenario: It comes with trade-offs and constraints.</p><p style="text-align: justify;">So, let us know: Have you already experimented with bare-metal IPv6 rotation for your scraping pipelines? What targets did it work best for? Let&#8217;s discuss in the comments!</p>]]></content:encoded></item><item><title><![CDATA[The Trick to Scrape Next.js Websites in Seconds]]></title><description><![CDATA[Scraping data from the most widely used full-stack framework in the world with just 3 lines of code!]]></description><link>https://substack.thewebscraping.club/p/scrape-nextjs-websites</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/scrape-nextjs-websites</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 19 Apr 2026 19:18:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/17ee7337-9a3d-445a-a255-2895a6ed8235_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Next.js is one of the most widely adopted full-stack JavaScript frameworks on the planet. If you&#8217;ve ever built or deployed a web app, you definitely know it, or at least you&#8217;ve heard of it.</p><p>Behind the scenes, it relies on hydration to make server-rendered pages interactive. 
And here&#8217;s the interesting part: the same mechanism that makes Next.js fast and popular also exposes a significant amount of structured data in the HTML sent by the server. From a scraping perspective, that&#8217;s a huge opportunity!</p><p>In this post, I&#8217;ll show you a simple trick to scrape data from virtually any Next.js website. Follow along as I break down how it works and how you can apply it yourself.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>Next.js in Numbers</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1F7B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1F7B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 424w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 848w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1272w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png" width="1456" height="1065" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1065,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Next.js&#8217; GitHub star growth&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Next.js&#8217; GitHub star growth" title="Next.js&#8217; GitHub star growth" srcset="https://substackcdn.com/image/fetch/$s_!1F7B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 424w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 848w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1272w, https://substackcdn.com/image/fetch/$s_!1F7B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9314d495-cf86-4469-8672-23f6cb568aef_3156x2309.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Next.js&#8217; GitHub star growth</figcaption></figure></div><p>Next.js needs no introduction, but it&#8217;s worth giving some context to truly understand how popular it is (<em>and therefore how useful the trick I&#8217;m about to present for Next.js web scraping can be</em>):</p><ul><li><p>According to the <a href="https://survey.stackoverflow.co/2025/">2025 Stack Overflow Developer Survey</a>, 20.8% of respondents used Next.js extensively over the past year.</p></li><li><p>Next.js is the 14th largest repository on GitHub, with <a href="https://github.com/vercel/next.js">over 138k stars</a> (and still growing!).</p></li><li><p><a 
href="https://w3techs.com/technologies/overview/javascript_library">According to W3Techs</a>, Next.js has a 2.9% market share among JavaScript libraries.</p></li><li><p>Major brands such as <a href="https://nextjs.org/showcase">Nike, Stripe, and Notion have chosen this full-stack framework</a> to build their official websites.</p></li></ul><h2>Before Getting Started: A Bit of Context on Hydration</h2><p>I know you probably just want the trick&#8230; Still, let me take a minute to explain why it works in the first place, why it&#8217;s even possible, and what kind of data you&#8217;ll actually retrieve with it!</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><h3>What Is Hydration?</h3><p><a href="https://en.wikipedia.org/wiki/Hydration_(web_development)">Hydration</a> is the process that makes a server-rendered page interactive in the browser.</p><p>Frameworks like Next.js, Remix, Nuxt, and SvelteKit employ this mechanism to combine the performance benefits of <a href="https://nextjs.org/docs/pages/building-your-application/rendering/server-side-rendering">server-side rendering (SSR)</a> with the interactivity of client-side applications.</p><p>The idea is that the server first sends fully rendered static HTML to the browser. 
Hydration then takes over.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jt2I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 424w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 848w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1272w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png" width="1227" height="633" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:633,&quot;width&quot;:1227,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)" title="The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)" srcset="https://substackcdn.com/image/fetch/$s_!Jt2I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 424w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 848w, https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Jt2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb216295b-09dc-4a62-afc3-14d96f1ccc00_1227x633.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The JavaScript hydration mechanism (source: https://thefrontenddev.medium.com/demystifying-hydration-and-streaming-in-react-19-core-concepts-for-modern-web-development-bef8479b8a26)</figcaption></figure></div><p>The browser downloads the JavaScript bundle, and the frontend framework reconstructs the component tree in memory, attaches event 
listeners, and links that virtual tree to the existing DOM instead of re-rendering it from scratch. The result is a fully interactive application built on top of server-rendered HTML.</p><h3>How Does the Hydration Mechanism Work?</h3><p>It&#8217;s now clear that in Next.js and similar frameworks, hydration is the process where a static, server-rendered HTML page &#8220;comes to life&#8221; and becomes fully interactive in the browser. But what&#8217;s actually happening under the hood?</p><p>At a high level, hydration is a 3-step process:</p><ol><li><p>The server generates and sends a fully rendered HTML snapshot. The user immediately sees the content (great for <a href="https://web.dev/articles/fcp">First Contentful Paint</a>). At this point, though, the page is just static HTML. Buttons, forms, and other interactive elements are visible, but they don&#8217;t work yet because no JavaScript is attached.</p></li><li><p>The client&#8217;s browser downloads the JavaScript bundle (which includes React and your frontend application code) and executes it.</p></li><li><p>React rebuilds the component tree in memory and attaches event listeners to the existing DOM nodes. Instead of discarding the HTML and re-rendering everything from scratch, React &#8220;hydrates&#8221; the existing markup, meaning it reuses it and wires it up with state and interactivity.</p></li></ol><p>Once hydration completes, the page behaves like a normal single-page application: it responds to clicks, manages state, and updates dynamically.</p><p>And here&#8217;s an important detail: if the browser doesn&#8217;t support JavaScript (or it fails to load), the user still sees the server-rendered HTML. It won&#8217;t be interactive, but the core content is there. 
That&#8217;s great for SEO and perceived performance!</p><h3>Why It Matters for Scraping Next.js (and Other Full-Stack Frameworks&#8230;)</h3><p>The key insight you need to understand is simple: <strong>hydration requires data</strong>, and that data must be embedded somewhere in the HTML sent by the server!</p><p>In Next.js, when the server renders a page, it doesn&#8217;t only send markup. It also serializes the data required to rebuild the React component tree on the client. That serialized payload is embedded directly into the page&#8217;s HTML.</p><p>That&#8217;s exactly why hydration matters for scraping. Instead of parsing the DOM or simulating user interactions through <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browser automation</a>, you can extract the structured data that React itself uses to hydrate the page.</p><p>In many cases, hydration data is cleaner and easier to parse than the rendered HTML. It can also contain more information than what&#8217;s visibly displayed on the page, including hidden and interesting metadata.</p><p>Keep in mind that this principle applies not only to Next.js! All other full-stack frameworks that rely on hydration, such as Remix, Nuxt, Angular Universal, and SvelteKit, tend to dehydrate state on the server and rehydrate it on the client.</p><p>So remember this simple rule. If a framework hydrates, it must serialize data. 
And if it serializes data into the HTML, you can scrape it.</p><h2>How to Scrape Next.js Websites: 2 Approaches</h2><p>How you scrape Next.js hydration data depends on how that data is embedded in the HTML generated on the server side.</p><p>I won&#8217;t go too deep into framework internals here (if you&#8217;re a Next.js dev, you already know things shift depending on whether you&#8217;re using the <em><a href="https://nextjs.org/docs/app/getting-started">App Router</a></em> or the <em><a href="https://nextjs.org/docs/pages/getting-started">Pages Router</a></em>), but there are essentially two scenarios you&#8217;ll run into.</p><p>In this section, I&#8217;ll walk through both of them and show you exactly how I retrieve data from each!</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Approach #1: Target the __NEXT_DATA__ Script</h3><p>As a reference target, I&#8217;ll use a <a href="https://www.nike.com/t/air-jordan-5-retro-wolf-grey-mens-shoes-0M9kM1yX/DD0587-002">Nike product page</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uJcE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!uJcE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 424w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 848w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png" width="1456" height="785" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target Nike page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target Nike page" title="The target Nike page" 
srcset="https://substackcdn.com/image/fetch/$s_!uJcE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 424w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 848w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!uJcE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a7a8ad3-e83b-4b70-9c1f-c017f10c452e_3018x1628.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The target Nike page</figcaption></figure></div><p>That&#8217;s actually a great example because Nike.com is even showcased on the Next.js homepage as a real-world site built with the framework.</p><p>Now, right-click on the page and select the &#8220;Inspect&#8221; option in your browser to open the DevTools. Scroll through the DOM and get familiar with the page structure. If the Next.js site is using the <em>Pages Router</em>, you&#8217;ll notice a <em>&lt;script&gt;</em> tag with the id <em>__NEXT_DATA__</em> containing a large JSON blob:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rV1e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rV1e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 848w, 
https://substackcdn.com/image/fetch/$s_!rV1e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png" width="1456" height="1260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1260,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the JSON data inside the #__NEXT_DATA__ element&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the JSON data inside the #__NEXT_DATA__ element" title="Note the JSON data inside the #__NEXT_DATA__ element" srcset="https://substackcdn.com/image/fetch/$s_!rV1e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 848w, 
https://substackcdn.com/image/fetch/$s_!rV1e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!rV1e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477af555-f3dd-487e-ab55-d7f15e91c990_1746x1511.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the JSON data inside the #__NEXT_DATA__ element</figcaption></figure></div><p>That JSON data is precisely the 
hydration data I was referring to earlier.</p><p>When a site uses the Pages Router approach in Next.js, the server embeds all the page data directly into that <em>&lt;script&gt;</em> tag. From a scraping perspective, that&#8217;s gold, as the data is already structured and ready to be captured.</p><p>Here&#8217;s a simple JavaScript snippet to extract it:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const hydrationScript = document.querySelector("#__NEXT_DATA__")
const hydrationData = JSON.parse(hydrationScript.innerHTML)
console.log(hydrationData)</code></pre></div><p>What&#8217;s happening here is straightforward. The JS script:</p><ul><li><p>Selects the <em>&lt;script&gt;</em> element with <em>id</em> <em>__NEXT_DATA__</em>.</p></li><li><p>Reads its inner HTML (which is a JSON string).</p></li><li><p>Parses it into a JavaScript object.</p></li><li><p>Logs it to the console.</p></li></ul><p>Run this directly in the DevTools Console, and you&#8217;ll immediately see the result:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2AK7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2AK7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 848w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png" width="1456" height="1260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dc28944-8842-4605-be59-b746fef469db_1746x1511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1260,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2AK7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 424w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 848w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!2AK7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc28944-8842-4605-be59-b746fef469db_1746x1511.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the structured hydration data</figcaption></figure></div><p>What&#8217;s interesting is how much structured data you get right away. This includes product details, images, metadata, and more. 
Everything is neatly organized, and it only took three lines of code!</p><p>If you want to store the JSON hydration object, just right-click the object in the Console and select the &#8220;Copy object&#8221; option:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m1uv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m1uv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 424w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 848w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1272w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png" width="1456" height="483" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Selecting the &#8220;Copy object&#8221; option&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Selecting the &#8220;Copy object&#8221; option" title="Selecting the &#8220;Copy object&#8221; option" srcset="https://substackcdn.com/image/fetch/$s_!m1uv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 424w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 848w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1272w, https://substackcdn.com/image/fetch/$s_!m1uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64676fbc-8bbe-49df-a872-cf2eebef77b2_1683x558.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Selecting the &#8220;Copy object&#8221; option</figcaption></figure></div><p>From there, you can paste it wherever you need (e.g., into a local <em>.json</em> file, a MongoDB collection, etc.).</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Approach #2: Target the self.__next_f.push() Elements</h3><p>Another, more complex approach to scraping Next.js involves pages built with the <em>App Router</em>.</p><p>Even if the <em>App 
Router</em> has been the recommended direction for a while, in my experience, it&#8217;s still not as widely adopted as the <em>Pages Router</em>. And honestly, that&#8217;s a bit of a gift for us (as scraping hydration data in <em>App Router</em> sites is definitely more complex!)</p><p>As a reference, let&#8217;s look at the &#8220;<a href="https://openai.com/business/">Business Overview</a>&#8221; page on the OpenAI website, which is built with Next.js <em>App Router</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CAEI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CAEI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 424w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 848w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1272w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png" width="1456" height="709" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:709,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The target page&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The target page" title="The target page" srcset="https://substackcdn.com/image/fetch/$s_!CAEI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 424w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 848w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1272w, https://substackcdn.com/image/fetch/$s_!CAEI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31106eaa-4ef3-4c93-9e5b-88efdbd58029_3022x1471.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The target page</figcaption></figure></div><p>Just like before, open DevTools and inspect the page. 
This time, focus on the <em>&lt;script&gt;</em> tags inside the <em>&lt;body&gt;</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LTkB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LTkB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png" width="1456" height="1182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the hydration script elements&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the hydration script elements" title="Note the hydration script elements" srcset="https://substackcdn.com/image/fetch/$s_!LTkB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LTkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a29000-447c-4e23-babd-68b121bd6c0b_1746x1417.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Note the hydration script elements</figcaption></figure></div><p>You&#8217;ll notice several scripts containing content like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">self.__next_f.push(&lt;some_data&gt;)</code></pre></div><p>That &#8220;<em>&lt;some_data&gt;</em>&#8221; is serialized using the <a href="https://tonyalicea.dev/blog/understanding-react-server-components/">React Flight protocol for React Server Components (RSC)</a>. I won&#8217;t go too deep into the internals here (it&#8217;s a dense topic!), but what matters is that <strong>deserializing that data is </strong><em><strong>not</strong></em><strong> straightforward!</strong></p><p>React Flight isn&#8217;t plain JSON. 
It mixes control records (<em>HL</em>, <em>I</em>, <em>J</em>, etc.), module references, streaming boundaries, and serialized model fragments into a transport format that React incrementally resolves at runtime.</p><p>You might think: &#8220;Why not just reuse the frontend deserialization library?&#8221; In practice, that doesn&#8217;t work well because:</p><ul><li><p>The client decoder (<em><a href="https://www.npmjs.com/package/react-server-dom-webpack">react-server-dom-webpack</a></em>) expects a full React runtime.</p></li><li><p>It relies on module maps and webpack IDs generated at build time.</p></li><li><p>It resolves component references against the exact bundle that produced the stream.</p></li><li><p>It assumes streaming semantics and internal React wiring.</p></li></ul><p>Basically, outside that exact environment, you don&#8217;t have the module graph, build manifest, or hydration context. So even if you import the decoder, you can&#8217;t reconstruct the component tree the way the browser does.</p><p>There have been recent security issues in the React Flight payload deserialization system, highlighting just how sensitive and complex this layer is. For more details, refer to:</p><ul><li><p><em><a href="https://nextjs.org/blog/CVE-2025-66478">Security Advisory: CVE-2025-66478</a></em></p></li><li><p><em><a href="https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components">Critical Security Vulnerability in React Server Components</a></em></p></li></ul><p>Thus, instead of fighting the protocol, I&#8217;d simplify and accept that in this case, it&#8217;s better to extract the unparsed React Flight string data. 
Achieve that with the JS script below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const nextFlightScripts = [...document.querySelectorAll("script")]
  .filter(script =&gt; script.textContent.includes("self.__next_f"))
  .map(script =&gt; script.textContent.trim())
console.log(nextFlightScripts)</code></pre></div><p>This selects all <em>&lt;script&gt;</em> elements containing &#8220;self.__next_f&#8221; and builds an array of their raw contents.</p><p>Run it in the Console, and you&#8217;ll get an array of React Flight chunks:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LBAG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LBAG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png" width="1456" height="1182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the React Flight strings&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the React Flight strings" title="Note the React Flight strings" srcset="https://substackcdn.com/image/fetch/$s_!LBAG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 424w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 848w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!LBAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e8ece6-5fb7-45e3-bc35-2a640e7360ff_1746x1417.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the React Flight strings</figcaption></figure></div><p>From there, the simplest way to extract structured data is often to copy the array, feed it to an AI, and ask it to reconstruct a parsed JSON representation of the meaningful payload sections:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!08ee!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!08ee!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 424w, https://substackcdn.com/image/fetch/$s_!08ee!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 848w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1272w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png" width="1456" height="973" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:973,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the parsed version of the source data produced by Gemini&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the parsed version of the source data produced by Gemini" title="Note the parsed version of the source 
data produced by Gemini" srcset="https://substackcdn.com/image/fetch/$s_!08ee!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 424w, https://substackcdn.com/image/fetch/$s_!08ee!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 848w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1272w, https://substackcdn.com/image/fetch/$s_!08ee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee9957a-70cb-4d9a-9017-8de87ce20d8d_2422x1619.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the parsed version of the source data produced by Gemini</figcaption></figure></div><p>Is this more complicated than the <em>__NEXT_DATA__</em> trick? Absolutely! Yet, it&#8217;s still a powerful way to access a large amount of page data with just a few lines of code.</p><h2>Final Script to Quickly Access Data From Next.js Sites</h2><p>If you combine the two approaches, you get a single script that checks a Next.js page for both kinds of hydration data:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// Pages Router approach (__NEXT_DATA__)
const hydrationScript = document.querySelector("#__NEXT_DATA__")
let nextData = null
if (hydrationScript) {
  try {
    nextData = JSON.parse(hydrationScript.textContent)
    console.log("__NEXT_DATA__ found:")
    console.log(nextData)
  } catch (err) {
    console.warn("Failed to parse __NEXT_DATA__:", err)
  }
} else {
  console.log("No __NEXT_DATA__ script found.")
}

// App Router approach (self.__next_f)
const nextFlightScripts = [...document.querySelectorAll("script")]
  .map(script =&gt; script.textContent.trim())
  .filter(content =&gt; content.includes("self.__next_f.push"))
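
// Optional helper (a sketch, not part of the original trick): unwrap the
// string payload carried by a single push() call. It only unwraps the
// string; it does not decode the React Flight format itself.
// Assumption: the script text holds exactly one call shaped like
// self.__next_f.push([1,"..."]) and its array literal is valid JSON.
// extractFlightPayloads is a hypothetical name used for illustration.
function extractFlightPayloads(content) {
  const start = content.indexOf("[")
  const end = content.lastIndexOf("]")
  try {
    const chunk = JSON.parse(content.slice(start, end + 1))
    return typeof chunk[1] === "string" ? [chunk[1]] : []
  } catch (err) {
    // Not a single JSON-shaped push(): keep relying on the raw script text
    return []
  }
}
// Example: nextFlightScripts.flatMap(extractFlightPayloads)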

if (nextFlightScripts.length &gt; 0) {
  console.log("React Flight scripts found:")
  console.log(nextFlightScripts)
} else {
  console.log("No React Flight scripts found.")
}</code></pre></div><p>To test it, just open the Console in DevTools, paste the script, and run it.</p><p><strong>Important</strong>: The <em>&lt;script&gt;</em> elements containing hydration data aren&#8217;t injected dynamically via client-side rendering. They&#8217;re embedded directly in the HTML generated by the server.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Km-I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Km-I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png" width="1456" height="865" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Km-I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!Km-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e97102e-8234-48f2-b232-6e0ff0ad273c_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the #__NEXT_DATA__ element in the page source</figcaption></figure></div><p>That means you can:</p><ol><li><p>Fetch the target Next.js-powered page with an HTTP client.</p></li><li><p>Parse the HTML using an HTML parsing library like Beautiful Soup or Cheerio.</p></li><li><p>Apply a similar version of the JavaScript script above, but adapt it to the API provided by your HTML parser.</p></li></ol><p>In other words, this trick for scraping Next.js doesn&#8217;t only work in the browser DevTools. 
It works just as well in regular scraping scripts!</p><h2>Pros and Cons of This Approach to Next.js Scraping</h2><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Simple and effective, requiring only a few lines of code.</p></li><li><p>Works on virtually any server-rendered Next.js website (and, more generally, on most sites that rely on hydration).</p></li><li><p>Can expose more data than is actually displayed on the page.</p></li><li><p>No need for browser automation, waiting for client-side rendering, or simulating user interactions.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>You may only get partial data, meaning you might still need to complement it with a more traditional scraping approach.</p></li><li><p>React Flight data is difficult to parse and may require custom logic or even <a href="https://substack.thewebscraping.club/p/llms-ai-web-scraping">AI-assisted parsing</a>.</p></li></ul><div><hr></div><h2>Conclusion</h2><p>Here, I&#8217;ve shared <a href="https://brightdata.com/blog/how-tos/web-scraping-with-next-js">a trick I personally documented years ago</a> that still works to this day. 
It allows you to quickly scrape data from virtually any Next.js site by targeting the hydration data embedded in the HTML document generated by the server and sent to the client for rendering.</p><p>As you&#8217;ve seen, with just a few lines of JavaScript, you can extract hydration data from any Next.js-powered page. What you get back is clean, or at least almost clean, data that you can process directly in your data pipelines.</p><p>Instead of fighting the frontend, this Next.js web scraping approach helps you leverage the data the framework itself needs to function!</p><p>I hope you found this useful and insightful. If you have questions or thoughts, feel free to share them in the comments below!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #102: How Fast Can You Call Polymarket's APIs?]]></title><description><![CDATA[Three languages, four locations, 1,000 requests. The biggest speed gain has nothing to do with code.]]></description><link>https://substack.thewebscraping.club/p/how-to-get-data-from-polymarket-fast</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/how-to-get-data-from-polymarket-fast</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 16 Apr 2026 14:08:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/dd002f1e-6fe6-4cde-8c7d-1fdaa94d11d3_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There is a platform whose name has been bouncing around the news for over a year now. A new military action? Someone predicted it on Polymarket. An event that moves the price of oil or shakes a currency? Someone else, or maybe the same person, placed a bet a few hours before and walked away with a pile of money. Every time a headline breaks, Polymarket seems to have already priced it in, or worse, someone appears to have known in advance. 
<br>Even here on Substack, you can share the predictions coming from the platform.<br></p><div class="polymarket-embed" data-attrs="{&quot;eventSlug&quot;:&quot;claude-5-released-by&quot;,&quot;marketSlug&quot;:&quot;&quot;,&quot;profileName&quot;:&quot;&quot;,&quot;belowTheFold&quot;:false,&quot;fullEmbedUrl&quot;:&quot;https://substack.com/embed/polymarket/claude-5-released-by&quot;,&quot;isGraphMode&quot;:false}" data-component-name="PolymarketToDOM"></div><p><br>But what is Polymarket, and how does it work?<br></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What Polymarket is, and why it keeps making headlines</h2><p>Polymarket is a prediction market, a platform where you buy and sell shares tied to the outcome of real-world events. If the event happens, your share pays $1. If it doesn&#8217;t, it pays $0. The trading price at any moment reflects what the market collectively believes the probability of that outcome is. You can bet on elections, geopolitics, sports, crypto prices, and increasingly anything else with a verifiable resolution. </p><p>It is the largest prediction market by volume, built on the Polygon blockchain. Its main competitor, Kalshi, operates as a CFTC-regulated exchange in the US. Both are attracting billions in volume, and Wall Street firms are now building dedicated trading desks around them.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><p>The platform handled <a href="https://www.ccn.com/news/crypto/polymarket-7-5-billion-2025-prediction-markets/">at least $7.5 billion in volume during 2025</a> (a conservative figure, since <a href="https://www.paradigm.xyz/2025/12/polymarket-volume-is-being-double-counted">Polymarket volume is commonly double-counted</a> due to how OrderFilled events are summed) <a href="https://www.trmlabs.com/resources/blog/how-prediction-markets-scaled-to-usd-21b-in-monthly-volume-in-2026">and set a single-day record of $425 million in February 2026</a> when Iran-related markets resolved simultaneously. Those are not toy numbers. And with that kind of money flowing through, the headlines have followed.</p><p>In January 2026, a newly created Polymarket account invested $30,000 <a href="https://www.npr.org/2026/01/05/nx-s1-5667232/polymarket-maduro-bet-insider-trading">and walked away with $436,759</a> after correctly betting on Maduro&#8217;s removal from power. The account was created less than a week before the U.S. military operation, and the bulk of bids were placed hours before Trump&#8217;s announcement. In a separate case, <a href="https://www.haaretz.com/israel-news/israel-security/2026-03-28/ty-article/.premium/court-clears-air-force-officer-charged-with-leaking-iran-strike-for-online-bets/0000019d-2f2e-d868-a1bd-7fef78860000">an Israeli Air Force reservist was indicted for leaking classified details</a> about a strike on Iran to guide Polymarket bets, netting roughly $244,000. 
<a href="https://www.cnn.com/2026/03/24/politics/iran-war-bets-prediction-markets">A different trader has made nearly $1 million since 2024</a> from dozens of well-timed bets correctly predicting U.S. and Israeli military actions against Iran, winning 93% of five-figure wagers. <a href="https://www.cnbc.com/2026/04/15/kalshi-and-polymarket-congress-regulation-washington-influence.html">These incidents triggered at least eight prediction market bills in Congress</a> since January 2026, and federal prosecutors in Manhattan are <a href="https://www.cnn.com/2026/03/30/politics/prediction-markets-justice-department">actively exploring whether certain prediction market bets violate insider trading laws</a>.</p><p>But insider trading is not the only way people make money on Polymarket. There is a quieter, more interesting story happening in parallel.</p><p></p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EOo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png" width="630" height="69.35779816513761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:1090,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:133037,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/194398731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EOo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 424w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 848w, 
https://substackcdn.com/image/fetch/$s_!EOo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EOo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2905ffa2-e3cd-4be3-ab69-02cceb4073a9_1090x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE">Start your scraping journey with Byteful</a></strong>: 10GB New Customer Trial | Use TWSC for 15% OFF | $1.75/GB Residential Data | ISP Proxies in 15+ Countries</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE&quot;,&quot;text&quot;:&quot;Claim your 10GB here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://byteful.com?utm_source=twsc&amp;utm_medium=new_banner&amp;utm_id=twsc&amp;promotion_public_id=10GB_TWSC_TRIAL&amp;promotion_code=NJKLBIOCWYENIPFE"><span>Claim your 10GB here</span></a></p></blockquote><div><hr></div><p><br></p><h2>The efficiency gap</h2><p><a href="https://www.princeton.edu/~ceps/workingpapers/91malkiel.pdf">The Efficient Market Hypothesis</a>, formalized by Eugene Fama in the 1960s, states that asset prices reflect all available information, making it impossible to consistently beat the market. In traditional equity markets, this largely holds because massive institutional capital from hedge funds, pension funds, and proprietary trading firms constantly hunts for and eliminates mispricings. 
The S&amp;P 500 trades roughly $500 billion daily. Any pricing error gets corrected in milliseconds by algorithms running in colocated data centers.</p><p>Polymarket&#8217;s individual markets often have only tens of thousands of dollars in liquidity. The ratio of &#8220;smart money&#8221; to &#8220;total market cap&#8221; is fundamentally different from equity markets, and that is why edges persist longer than they would on Wall Street. <a href="https://arxiv.org/abs/2508.03474">A 2025 study by IMDEA Networks Institute</a> documented $40 million in arbitrage profits extracted from Polymarket alone between April 2024 and April 2025, based on an analysis of 86 million bets. <a href="https://www.financemagnates.com/trending/prediction-markets-are-turning-into-a-bot-playground/">Arbitrage opportunities on the platform last an average of 4 seconds, with 73% of profits captured by bots</a> executing in under 100 milliseconds.</p><p>The institutional side is catching up. <a href="https://www.financemagnates.com/fintech/wall-street-quants-move-into-prediction-markets-to-hunt-for-arbitrage-not-to-bet/">DRW is hiring dedicated prediction market traders</a> at a $200,000 base salary. Susquehanna International Group became the first official market maker on Kalshi (a competing platform). Jump Trading is building specialized desks. But the market is not there yet. 
Liquidity is too thin for these firms to deploy serious capital without moving prices, leaving room for smaller, faster actors.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Speed as edge: what people are building</h2><p>I&#8217;ve been studying Polymarket for some months now, and I&#8217;ve probably ended up in a bubble on Instagram and other social media. I&#8217;m seeing a growing number of traders share systems designed to exploit exactly this kind of market inefficiency. The approaches vary, but the pattern is the same: faster information, faster execution, profit.</p><p>One notable case involves a trader who claims to use computer vision models processing live football match video feeds. His system watches the match in real time, detects events (goals, red cards, penalties) through frame analysis, and places bets on prediction markets seconds before the event registers on official data feeds and bookmaker odds adjust. He claims an 8-second advantage over other traders (unfortunately, I can no longer find the Instagram post about it). Whether or not that specific claim holds up, the idea is nothing new: courtsiders have been doing this in tennis for years, <a href="https://fivethirtyeight.com/features/inside-the-shadowy-world-of-high-speed-tennis-betting">attending live matches and transmitting scores</a> faster than official data feeds reach bookmakers. In 2016, tennis umpires from Kazakhstan, Turkey, and Ukraine were banned for deliberately delaying score updates for courtside accomplices.</p><p>The same principle applies at a larger scale. 
<a href="https://www.bloomberg.com/news/features/2018-05-03/the-gambler-who-cracked-the-horse-racing-code">Bill Benter built a multinomial logit model</a> with over 120 variables per horse for Hong Kong racing and extracted over $1 billion between 1987 and 2001. <a href="https://www.racingpost.com/news/britain/high-court-case-alleges-tony-blooms-betting-empire-makes-600m-a-year-so-what-do-we-know-about-his-starlizard-syndicate-aNlkE7t8daxQ/">Tony Bloom&#8217;s Starlizard syndicate employs 160 people </a>to model Asian handicap football markets and reportedly generates 600 million GBP per year. <a href="https://www.financemagnates.com/trending/prediction-markets-are-turning-into-a-bot-playground/">On Polymarket itself, 14 of the top 20 most profitable wallets are bots</a>. <a href="https://www.coindesk.com/markets/2026/02/21/how-ai-is-helping-retail-traders-exploit-prediction-market-glitches-to-make-easy-money">One bot turned $313 into $414,000</a> in a single month, exploiting temporal arbitrage in 15-minute crypto markets.</p><p>All of these systems share two requirements: data and speed. They need real-time access to market prices, order books, and event outcomes, and they need to act on that data faster than everyone else. 
All of this is possible because Polymarket provides a full set of APIs that can be used to operate programmatically on the platform.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Polymarket&#8217;s API architecture</h2><p>Polymarket exposes <a href="https://docs.polymarket.com/api-reference/introduction">three distinct APIs</a>, each serving a different purpose. Understanding which one to use and when is the first step toward building anything that trades or monitors this market.</p><h3>Gamma API: market discovery</h3><p>The Gamma API is the browsing layer. It returns human-readable market data: questions, descriptions, outcome prices, volume, liquidity, event metadata. No authentication required.</p><p><strong>Base URL</strong>: https://gamma-api.polymarket.com</p><p>Key endpoints:</p><p>- <code>GET /markets</code> returns a paginated list of markets with filtering options (limit, offset, closed, tag_id)</p><p>- <code>GET /markets/{id} </code>returns a single market by ID or slug</p><p>- <code>GET /events</code> and <code>GET /events/{id}</code> return event-level data (events group related markets)</p><p>- <code>GET /search?query=... 
</code>performs keyword search across markets and events</p><p>A single call to <code>/markets?limit=1&amp;closed=false</code> returns a market that looks something like this (most fields omitted):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;7e25266f-af9b-4b4c-b40f-660d9c8e031f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">{
  "id": "540816",
  "question": "Russia-Ukraine Ceasefire before GTA VI?",
  "conditionId": "0x9c1a953fe92c8357f1b646ba25d983aa83e90c525992db14fb726fa895cb5763",
  "outcomes": "[\"Yes\", \"No\"]",
  "outcomePrices": "[\"0.545\", \"0.455\"]",
  "volume": "1516211.89",
  "liquidity": "62104.61",
  "clobTokenIds": "[\"850149715...\", \"252731249...\"]"
}</code></pre></div><p>The <code>clobTokenIds</code> field is the bridge to the trading layer. Each outcome (Yes/No) gets its own token ID, which is what you pass to the CLOB API to get real-time prices and order book data. Note that in the payload above the field arrives as a JSON-encoded string, not as an array, so it has to be decoded before use.</p><p>The Gamma API is rate-limited at roughly 60 requests per minute. It is useful for discovery and metadata, not for real-time price monitoring.</p><h3>CLOB API: the order book</h3><p>The CLOB (Central Limit Order Book) API is where trading happens. It has both public and authenticated endpoints.</p><p><strong>Base URL</strong>: https://clob.polymarket.com</p><p><strong>Public endpoints (no authentication):</strong></p><p>- <code>GET /price?token_id=X&amp;side=BUY|SELL</code> returns the current best price</p><p>- <code>GET /midpoint?token_id=X</code> returns the midpoint between best bid and ask</p><p>- <code>GET /spread?token_id=X</code> returns the current spread</p><p>- <code>GET /book?token_id=X</code> returns the full order book with all bids and asks</p><p>- <code>GET /last-trade-price?token_id=X</code> returns the last executed trade price</p><p>- <code>GET /tick-size?token_id=X</code> returns the minimum price increment</p><p>A call to <code>/midpoint</code> returns a minimal payload:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;5b93dac3-b924-4bde-9cca-74db20b575d9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">{"mid": "0.545"}</code></pre></div><p>The <code>/book</code> endpoint returns the full depth:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;7ba6f05f-de54-4ad7-ba8b-50d1d9f4eddc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">{
  "market": "0x9c1a95...",
  "asset_id": "850149715...",
  "bids": [
    {"price": "0.54", "size": "15234.50"},
    {"price": "0.53", "size": "8920.00"}
  ],
  "asks": [
    {"price": "0.55", "size": "12100.00"},
    {"price": "0.56", "size": "6500.00"}
  ]
}</code></pre></div><p>These public endpoints are what matter for price monitoring. They are lightweight, return small payloads, and have no authentication overhead.</p><p><strong>Authenticated endpoints</strong> use a <a href="https://docs.polymarket.com/developers/CLOB/authentication">two-level authentication system</a>:</p><p><strong>Level 1 (L1)</strong> uses EIP-712 wallet signatures. You sign a structured message proving you control a specific Ethereum wallet address. This is a one-time operation that generates API credentials:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;a46d1352-1fcd-4eb2-98dd-5baea0815327&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">POST /auth/api-key
Headers: POLY_ADDRESS, POLY_SIGNATURE, POLY_TIMESTAMP, POLY_NONCE
Returns: { apiKey, secret, passphrase }</code></pre></div><p><strong>Level 2 (L2)</strong> uses HMAC-SHA256 signing on every request. Every authenticated call requires five headers: <code>POLY_ADDRESS</code>, <code>POLY_SIGNATURE</code> (the HMAC computed over the request), <code>POLY_TIMESTAMP</code>, <code>POLY_API_KEY</code>, and <code>POLY_PASSPHRASE</code>.</p><p>Even with L2 auth, placing an order requires the user to sign the order payload locally with their private key. That makes three cryptographic operations in total: key derivation (once), request signing (per call), and order signing (per order).</p><p>The authenticated endpoints are:</p><p>- <code>POST /order</code> places a single order</p><p>- <code>POST /orders</code> places a batch of orders</p><p>- <code>DELETE /order</code> cancels an order</p><p><strong>WebSocket feeds</strong> provide real-time streaming at <code>wss://ws-subscriptions-clob.polymarket.com/ws/</code> for order book updates, price changes, and user-specific events.</p><h3>Data API: analytics</h3><p>The Data API at <code>https://data-api.polymarket.com</code> provides analytics-oriented data: user positions, trade history, leaderboards, and holder information. It is less documented and less stable than the other two. Some endpoints returned 404 or empty responses during our testing. It is useful for research, but not reliable for production.</p><h2>The speed game: calling the APIs as fast as possible</h2><p>If arbitrage opportunities on Polymarket last 4 seconds on average, and 73% of profits go to bots executing in under 100 milliseconds, then the speed at which you can read prices and place orders is a direct competitive advantage. 
We set up a benchmark to answer two questions: where should you run your code, and which language and HTTP strategy get you there fastest?</p><p>We chose the <code>/midpoint</code> endpoint for the benchmark because it requires no authentication, returns a minimal payload, and isolates HTTP client performance from payload parsing. Each benchmark runs 1,000 requests in two modes: sequential (one request at a time, measuring per-request latency) and concurrent (50 simultaneous workers, measuring throughput).<br><br>As always, the code can be found <a href="https://github.com/TheWebScrapingClub/thelab">in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">102.POLYMARKET</a>.</strong></p>
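<p>The two modes described above can be sketched with nothing but the Python standard library. This is a minimal illustration, not the benchmark code from the repository: the helper names (<code>token_ids_from_market</code>, <code>fetch_midpoint</code>, <code>timed</code>, <code>benchmark</code>) are made up for this example, it assumes <code>/midpoint</code> accepts a plain unauthenticated GET as described earlier, and <code>urllib</code> opens a new connection for every request, so absolute latencies will be pessimistic compared with a client that reuses connections.</p>
<div class="highlighted_code_block"><pre class="shiki"><code class="language-python"># Illustrative sketch: helper names are hypothetical, not part of any Polymarket SDK.
import json
import statistics
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

CLOB_BASE = "https://clob.polymarket.com"


def token_ids_from_market(market):
    """Gamma returns clobTokenIds as a JSON-encoded string, so decode it first."""
    return json.loads(market["clobTokenIds"])


def fetch_midpoint(token_id, timeout=5.0):
    """GET /midpoint for one outcome token and return the mid price as a float."""
    query = urllib.parse.urlencode({"token_id": token_id})
    url = f"{CLOB_BASE}/midpoint?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return float(json.load(resp)["mid"])


def timed(fetch, token_id):
    """Run one call and return its latency in milliseconds."""
    start = time.perf_counter()
    fetch(token_id)
    return (time.perf_counter() - start) * 1000.0


def benchmark(fetch, token_id, n=1000, workers=1):
    """workers=1 is the sequential mode; workers=50 mirrors the concurrent mode."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda _: timed(fetch, token_id), range(n)))
    wall = time.perf_counter() - wall_start
    return {
        "requests": n,
        "workers": workers,
        "median_ms": statistics.median(latencies),
        # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile
        "p95_ms": statistics.quantiles(latencies, n=20)[-1],
        "throughput_rps": n / wall,
    }</code></pre></div>
<p>Under those assumptions, <code>benchmark(fetch_midpoint, token_id, n=1000, workers=1)</code> would give the sequential numbers and <code>workers=50</code> the concurrent ones, with <code>token_id</code> taken from <code>token_ids_from_market(...)</code> on a Gamma market payload. Passing the fetch function as an argument makes it easy to swap in a different HTTP client and compare like for like.</p>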
      <p>
          <a href="https://substack.thewebscraping.club/p/how-to-get-data-from-polymarket-fast">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Stealth Stack: A Guide to Preventing Data Leaks in Web Scraping Infrastructure]]></title><description><![CDATA[A four-layer defense strategy for making your web scraping infrastructure indistinguishable from real users]]></description><link>https://substack.thewebscraping.club/p/the-stealth-stack-web-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/the-stealth-stack-web-scraping</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 12 Apr 2026 03:00:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ef273b12-ade2-4ba6-a14a-701876041775_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you hear about &#8220;data leaks&#8221;, you probably think of cybersecurity, databases, and personal information exposed through malicious intent. But what if I told you that your web scraper is leaking data too? In the context of web scraping, no one is stealing your data. Rather, your scraper is revealing its automated nature through a set of signals.</p><p>In particular, your scrapers leak information at four distinct layers. Modern anti-bot systems fingerprint your browser, analyze your TLS handshake, trace your network infrastructure, and track your behavioral patterns. A single inconsistency across these layers can be enough to get you blocked.</p><p>This means your scrapers aren&#8217;t competing only against rate limits anymore. Today, they are competing against <a href="https://substack.thewebscraping.club/p/machine-learning-for-detecting-bots">machine learning models trained on billions of legitimate requests</a>, and any deviation from the expected pattern is a signal. 
So, if you want to scrape at scale, your infrastructure must be indistinguishable from a real user&#8217;s browser, network stack, and behavior.</p><p>This article guides you through a systematic approach: First, understanding where leaks occur, then learning how anti-bot systems detect them, and finally building a layered defense that makes your scraper invisible.</p><p></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" 
height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2><strong>Identifying the Leaks: Where Your Scraper Exposes Itself</strong></h2><p>Before fixing anything, you need to understand the complete attack surface. 
Modern anti-bot systems analyze your scraper at four distinct layers, and a leak at any layer can expose you.</p><h3><strong>Layer 1: The Browser Level</strong></h3><p>Headless browsers are loud by default. Launch a <a href="https://pptr.dev/">Puppeteer</a> instance and check the <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver">navigator.webdriver</a></em> flag. By default it returns <em>true</em>, and that&#8217;s a signal major anti-bot systems check within roughly the first 100ms of page load.</p><p>But this obvious flag is just the beginning. Anti-bot systems probe deeper:</p><ul><li><p><strong>Error messages and stack traces</strong>: They differ between headless and headed modes. The execution context leaves fingerprints in error objects.</p></li><li><p><strong>Window dimensions</strong>: Properties like <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/outerWidth#:~:text=outerWidth%20read%2Donly%20property%20returns,and%20window%20resizing%20borders%2Fhandles.">window.outerWidth</a></em> and <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/outerHeight">window.outerHeight</a></em> can reveal headless operation because headless mode doesn&#8217;t render a visible window frame.</p></li><li><p><strong>Canvas rendering</strong>: It can produce pixel-level differences. Software rendering (headless) creates different anti-aliasing and color values than GPU-accelerated rendering (headed). Color channels can differ by 1-2 units per pixel.</p></li><li><p><strong><a href="https://developer.mozilla.org/en-US/docs/Web/API/WebGLShader">WebGL shader</a> timing</strong>: This varies widely with the rendering backend. GPU-accelerated browsers complete WebGL operations in microseconds. Software-rendered headless browsers take milliseconds.</p></li><li><p><strong>Font rendering</strong>: Headless environments often lack the full system font stack. 
This creates detectable layout differences when JavaScript measures text dimensions.</p></li><li><p><strong>Performance benchmarks</strong>: When run, they can reveal software rendering. For example, there are websites that run JavaScript stress tests, creating thousands of DOM elements, calculating layouts, and triggering reflows. In such scenarios, real browsers with GPU acceleration show consistent performance. Headless browsers, instead, show different timing patterns.</p></li><li><p><strong>The </strong><em><strong><a href="https://developer.chrome.com/docs/extensions/reference/api/windows">window.chrome</a></strong></em><strong> object behaves differently</strong>: Real Chrome populates this object with specific properties for extension management and runtime APIs. Headless Chrome, instead, either lacks this object or provides an incomplete implementation.</p></li></ul><h3><strong>Layer 2: The Network Level</strong></h3><p>Your SSL/TLS handshake identifies you before you send any application data. When your scraper connects over HTTPS, it sends a TLS Client Hello message containing supported encryption methods, protocol versions, and extensions, all in a specific order.</p><p>Here&#8217;s what makes this dangerous:</p><ul><li><p><strong>Every browser and HTTP library has a characteristic TLS pattern:</strong> Real browsers send their TLS parameters in a specific sequence that matches their version and underlying platform. Python&#8217;s standard HTTP libraries send a completely different pattern. So do Node.js, Go, and any other runtime you use to write your scrapers.</p></li><li><p><strong>Anti-bot systems fingerprint your TLS handshake:</strong> They capture these patterns and convert them into a fingerprint, commonly called a <a href="https://github.com/salesforce/ja3">JA3 hash</a>. 
They maintain databases of known fingerprints for every major browser and HTTP library.</p></li><li><p><strong>Mismatches between User-Agent and TLS fingerprint are instant red flags:</strong> When you claim to be Chrome in your User-Agent header but your TLS handshake matches Python&#8217;s urllib library, that inconsistency triggers blocking.</p></li><li><p><strong>Detection happens before you send any application data:</strong> The TLS handshake at the start of the very first connection already identifies you as automated traffic.</p></li><li><p><strong>HTTP/2 fingerprinting adds another layer:</strong> Beyond TLS, the order and priority of HTTP/2 frames, settings, and window updates create additional fingerprints. Your HTTP library&#8217;s frame ordering must match your claimed browser identity.</p></li></ul><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong>, with high-reputation IPs, on your side improves your chances of success.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h3><strong>Layer 3: The Infrastructure Level</strong></h3><p>Your proxy configuration can expose your real infrastructure through network-level leaks via the following main mechanisms:</p><ul><li><p><strong>DNS leaks:</strong> They happen when your browser resolves domain names using your local DNS server instead of routing through the proxy. Your scraper might send requests through a Miami residential proxy, but if DNS queries go through your AWS datacenter in Virginia, the target site knows your real location.</p></li><li><p><strong>WebRTC leaks:</strong> <a href="https://webrtc.org/">WebRTC </a>is a browser API designed for peer-to-peer communication. 
Even with a proxy configured, WebRTC can try to discover your real local and public IP addresses through STUN servers, completely bypassing your proxy.</p></li><li><p><strong>IP reputation:</strong> Not all IPs are created equal. Cloudflare and similar services maintain databases of AWS, Google Cloud, and Azure IP ranges. Requests from known cloud providers start with a higher suspicion score before any other analysis happens.</p></li></ul><h3><strong>Layer 4: The Behavioral Level</strong></h3><p>Even if your browser, network, and infrastructure are perfectly disguised, your behavior patterns can still expose you:</p><ul><li><p><strong>Timing patterns:</strong> Requesting data at fixed, precise intervals creates perfect periodicity. No human browses with mathematical precision.</p></li><li><p><strong>Mouse and scroll behavior:</strong> Real humans accelerate and decelerate smoothly. Instant jumps from point A to point B are physically impossible for a human hand.</p></li><li><p><strong>Session state:</strong> Stateless scrapers that never accumulate cookies or maintain persistent sessions across days look like fresh bots on every run.</p></li><li><p><strong>Interaction sequences:</strong> The time between page load and first click, the time between mouse-over and click, and the way you scroll through content all follow detectable human patterns.</p></li></ul><h2><strong>Understanding the Detection: How Anti-Bot Systems Catch You</strong></h2><p>Now that you know where leaks occur, let&#8217;s look at how anti-bot systems actually detect them.</p><h3><strong>Fingerprint Consistency Checks</strong></h3><p>Anti-bot systems cross-reference your claimed identity with actual behavior. 
If your <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent">User-Agent</a> says &#8220;Chrome 120 on Windows 10,&#8221; they verify that your JavaScript features, WebGL capabilities, canvas rendering, and TLS handshake all match Chrome 120 on Windows 10.</p><p>A single mismatch anywhere can flag the entire request. You can&#8217;t be Chrome in your User-Agent, Firefox in your TLS handshake, and headless Chrome in your canvas fingerprint. <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">Anti-bot systems create composite fingerprints combining dozens of properties</a>, then compare them against databases of known legitimate and bot patterns.</p><h3><strong>Machine Learning Pattern Recognition</strong></h3><p>Modern anti-bot systems use ML models trained on billions of requests. They learn what &#8220;normal&#8221; looks like for each type of visitor. For example, consumer browsers on residential IPs behave differently from datacenter scrapers, and the models pick up on that difference.</p><p>For ML models, statistical anomalies trigger investigation. Perfectly regular intervals, impossible mouse movements, or delays that don&#8217;t match human variance distributions are scored as anomalous. These models adapt continuously, so when new stealth techniques emerge, the models retrain on that data. This means that what works today might fail tomorrow.</p><h3><strong>Progressive Trust Scoring</strong></h3><p>Anti-bot systems don&#8217;t just block or allow requests. They also score them. Requests with lower trust scores receive degraded service: slower response times, rate limits, or <a href="https://substack.thewebscraping.club/p/are-captchas-still-a-thing">CAPTCHA challenges</a> before being blocked outright.</p><p>Also, scores accumulate across sessions. If you leak information across multiple visits, the system builds a profile associating your various identities. 
In other words, one leak can poison future requests, and even fixing the leak might not restore trust if your IP or fingerprint is already marked.</p><div><hr></div><h2><strong>Building the Defense: A Layered Approach to Stealth</strong></h2><p>Defending against data leaks in web scraping requires addressing each layer systematically. Your stealth stack must work from the inside out: browser &#8594; network &#8594; infrastructure &#8594; behavior. Each layer must remain consistent with your claimed identity.</p><h3><strong>Defense Layer 1: Hardening the Browser</strong></h3><p>The goal at this layer is to make the browser fingerprint indistinguishable from a real user&#8217;s browser and ensure every property is consistent with your claimed identity.</p><p><strong>Step 1: Mask Automation Signals</strong></p><p>Start with stealth libraries that patch the most common detection vectors:</p><ul><li><p><strong>For Puppeteer:</strong> Use <em><a href="https://www.npmjs.com/package/puppeteer-extra-plugin-stealth">puppeteer-extra-plugin-stealth</a></em> to automatically override <em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver">navigator.webdriver</a></em>, DevTools Protocol signatures, and plugin arrays.</p></li><li><p><strong>For <a href="https://www.selenium.dev/">Selenium</a>:</strong> Use <em><a href="https://pypi.org/project/undetected-chromedriver/">undetected-chromedriver</a></em>, which patches the ChromeDriver binary to strip common automation signals while driving a regular Chrome installation.</p></li><li><p><strong>For Playwright:</strong> Playwright ships without a dedicated stealth mode, so rely on launch flags and community stealth plugins to cover the common detection vectors.</p></li></ul><p>Additionally, disable automation flags at launch. For example, in Playwright:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-dev-shm-usage',
            '--no-sandbox'
        ]
    )</code></code></pre><p>But remember: Stealth libraries handle the most common 20-30 leak vectors but miss advanced fingerprinting techniques. They&#8217;re your foundation, not your complete solution.</p><p><strong>Step 2: Spoof Hardware Signatures</strong></p><p>Cloud server canvas and WebGL fingerprints are obvious red flags. AWS, GCP, and Azure rendering signatures are well-known to anti-bot systems.</p><p>You have two approaches for your defense here:</p><ul><li><p><strong>Add consistent noise:</strong> Inject deterministic noise into canvas operations so the fingerprint remains stable across sessions but doesn&#8217;t match your server&#8217;s real hardware. Override canvas methods to modify pixel data slightly before it&#8217;s read back. Keep noise minimal: just enough to mask the real hardware signature without appearing obviously manipulated.</p></li><li><p><strong>Emulate common consumer hardware:</strong> Spoof WebGL parameters to mimic common consumer GPUs. Override vendor and renderer strings returned by WebGL APIs to match your chosen hardware profile. Use existing libraries designed for canvas fingerprint defense or implement your own parameter overrides.</p></li></ul><p><strong>Step 3: Ensure Version Consistency</strong></p><p>This is where most scrapers fail, even with stealth libraries. Your User-Agent string must match your actual browser engine behavior precisely. Consider the following rules of thumb:</p><ul><li><p><strong>Use real browser binaries instead of spoofing:</strong> Tools like Playwright can launch actual Chrome, ensuring perfect consistency between claimed version and actual behavior.</p></li><li><p><strong>If you must spoof, maintain complete version profiles:</strong> Track which JavaScript features, WebGL capabilities, and API behaviors correspond to each browser version. 
Every property must align.</p></li><li><p><strong>Never mix components from different versions:</strong> If you claim Chrome 120 on Windows 10, every single API, from JavaScript features to WebGL renderers, must behave exactly like Chrome 120 on Windows 10.</p></li></ul><h3><strong>Defense Layer 2: Hardening the Network Stack</strong></h3><p>Your goal at this layer is to make your TLS handshake and HTTP traffic indistinguishable from the browser you&#8217;re claiming to be.</p><p><strong>Step 4: Match TLS Fingerprints to Your Browser Identity</strong></p><p>Standard HTTP libraries can&#8217;t mimic browser TLS fingerprints because they use different SSL/TLS implementations. The solution requires specialized libraries that replicate browser behavior at the protocol level:</p><ul><li><p><strong>For Python:</strong> Use <em><a href="https://curl-cffi.readthedocs.io/en/latest/">curl_cffi</a></em> or similar wrappers. These libraries use <em><a href="https://curl.se/libcurl/">libcurl</a></em> compiled with <em><a href="https://github.com/google/boringssl">BoringSSL</a></em>, which is the same SSL library Chrome uses. This creates identical JA3 fingerprints to real browsers.</p></li><li><p><strong>For Node.js:</strong> Use <em><a href="https://www.npmjs.com/package/cycletls">cycletls</a></em> or equivalent libraries that allow you to specify exact JA3 fingerprint strings matching real browsers.</p></li></ul><p><strong>Critical requirement:</strong> Your TLS fingerprint must match your User-Agent. Chrome 120&#8217;s JA3 fingerprint is different from Firefox 115&#8217;s fingerprint. The browser identity must be consistent across all layers.</p><p><strong>Step 5: Match HTTP/2 Fingerprints</strong></p><p>Beyond TLS, HTTP/2 frame ordering creates additional fingerprints. 
Libraries like <em>curl_cffi</em> handle this automatically when you specify a browser to impersonate, but verify that:</p><ul><li><p>Settings frames match your target browser.</p></li><li><p>Window update sequences align.</p></li><li><p>Priority headers follow the correct pattern.</p></li></ul><p>In Python, you can inspect the frames your client actually sends with the following code:</p><pre><code><code>from curl_cffi import requests

# Ask a fingerprint echo service what our client actually sent
response = requests.get(
    'https://tls.peet.ws/api/all',
    impersonate='chrome120'
)
print(response.json()['http2']['sent_frames'])
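

# --- Sketch: checking a fingerprint report against the identity you claim ---
# A minimal, hedged illustration of the consistency idea described above.
# The field names in `claimed` and `observed` are hypothetical placeholders;
# map them onto whatever your fingerprint echo service really returns.
def find_mismatches(claimed, observed):
    """Return the sorted names of fields whose observed value differs from the claim."""
    return sorted(
        name for name, value in claimed.items()
        if observed.get(name) != value
    )


# Usage with made-up values: the TLS layer disagrees with the User-Agent claim
claimed_identity = {'browser': 'chrome', 'tls_family': 'chrome', 'http_version': 'h2'}
observed_report = {'browser': 'chrome', 'tls_family': 'python', 'http_version': 'h2'}
print(find_mismatches(claimed_identity, observed_report))  # ['tls_family']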
</code></code></pre><div><hr></div><h3><strong>Defense Layer 3: Hardening Infrastructure</strong></h3><p>Your goal at this layer is to ensure your network traffic originates from legitimate-looking IPs and doesn&#8217;t leak your real location or identity.</p><p><strong>Step 6: Choose the Right Proxy Type</strong></p><p>IP reputation is the first filter that anti-bot systems check. This means that your <a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies">proxy choice determines your baseline trust score</a>. Consider the following guidelines:</p><ul><li><p><strong>Datacenter IPs = instant red flag:</strong> Requests from AWS, Google Cloud, and Azure IP ranges start with a higher suspicion score.</p></li><li><p><strong>Residential proxies = high legitimacy:</strong> These IPs come from real ISP connections, so they look legitimate because they are legitimate consumer connections.</p></li><li><p><strong>Mobile proxies = premium legitimacy:</strong> These IPs originate from cellular networks (4G/5G) and receive the highest trust scores. Mobile IPs rotate naturally as devices move between cell towers, making them appear even more organic than static residential connections.</p></li></ul><p><strong>Step 7: Prevent DNS Leaks</strong></p><p>Force all DNS resolution through your proxy tunnel. 
For SOCKS5 proxies, use the SOCKS5h protocol variant, which forces DNS resolution on the remote proxy server instead of locally.</p><p>For example, in Python, write the following:</p><pre><code><code>import requests

proxies = {
    'http': 'socks5h://proxy.example.com:1080',
    'https': 'socks5h://proxy.example.com:1080'
}

# SOCKS support in requests needs the optional extra: pip install "requests[socks]"
response = requests.get('https://example.com', proxies=proxies)
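

# --- Sketch: normalising proxy URLs so DNS always resolves remotely ---
# A hedged helper, not part of requests itself: `force_remote_dns` and
# `build_proxies` are hypothetical names. The first rewrites socks5:// to
# socks5h:// so the proxy, not the local machine, resolves hostnames.
from urllib.parse import urlsplit, urlunsplit


def force_remote_dns(proxy_url):
    """Return proxy_url with the socks5 scheme upgraded to socks5h."""
    parts = urlsplit(proxy_url)
    if parts.scheme == 'socks5':
        parts = parts._replace(scheme='socks5h')
    return urlunsplit(parts)


def build_proxies(proxy_url):
    """Build a requests-style proxies mapping that uses remote DNS."""
    safe_url = force_remote_dns(proxy_url)
    return {'http': safe_url, 'https': safe_url}


print(build_proxies('socks5://proxy.example.com:1080'))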
</code></code></pre><p>For browser automation, configure DNS-over-HTTPS to prevent local DNS leakage. The following is an example that applies to Playwright:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        args=[
            '--dns-over-https-server=https://cloudflare-dns.com/dns-query'
        ]
    )
</code></code></pre><p><strong>Step 8: Disable WebRTC Completely</strong></p><p>WebRTC will expose your real IP unless you completely disable it in browser automation. For example, in Playwright, you can do so as follows:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    
    # Remove WebRTC entirely
    page.add_init_script("""
        delete window.RTCPeerConnection;
        delete window.RTCSessionDescription;
        delete window.RTCIceCandidate;
        delete navigator.mediaDevices;
    """)
</code></code></pre><p>When you&#8217;ve done this, verify it&#8217;s actually disabled before deploying your scraper. Visit <a href="http://browserleaks.com/webrtc">browserleaks.com/webrtc</a> with your scraper. You should see either &#8220;WebRTC is not supported by your browser&#8221; or only your proxy IP, never your real IP.</p><h3><strong>Defense Layer 4: Mimicking Human Behavior</strong></h3><p>Your goal at this layer is to make your interaction patterns indistinguishable from those of real human users.</p><p><strong>Step 9: Add Timing Jitter and Randomization</strong></p><p>Humans are inconsistent. Perfect patterns are robotic. Adding randomness alone is not enough: you also need to match the statistical distribution of real human behavior. To do so, consider the following example in Python:</p><pre><code><code>import numpy as np
import random
import time

# Wrong example (do not use this)

# Fixed interval
time.sleep(5)  # Always 5 seconds - DETECTABLE

# Random uniform
time.sleep(random.uniform(3, 7))  # Still doesn't match human patterns

# ------------

# Correct example (use this!)

# Log-normal distribution (matches real human reaction times)
delay = np.random.lognormal(mean=1.5, sigma=0.5)
time.sleep(delay)
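

# --- Sketch: one delay model per action type ---
# A hedged example using only the standard library. The median and sigma
# values are illustrative assumptions chosen to land inside rule-of-thumb
# ranges for clicks (0.3-2 s), reading (5-45 s) and scrolling (1-8 s).
import math
import random

ACTION_PROFILES = {
    # action: (median seconds, sigma, lower bound, upper bound)
    'click': (0.8, 0.4, 0.3, 2.0),
    'read': (15.0, 0.6, 5.0, 45.0),
    'scroll': (3.0, 0.5, 1.0, 8.0),
}


def human_delay(action, rng=random):
    """Draw a log-normal delay for `action`, clamped to its plausible range."""
    median, sigma, low, high = ACTION_PROFILES[action]
    # lognormvariate takes the mean of the underlying normal: mu = ln(median)
    sample = rng.lognormvariate(math.log(median), sigma)
    return min(max(sample, low), high)


print(round(human_delay('read'), 2))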
</code></code></pre><p>To improve randomization, model each action type with its own distribution. Use the following rules of thumb:</p><ul><li><p>Clicks: 0.3-2 seconds (short delays)</p></li><li><p>Reading: 5-45 seconds (high variance)</p></li><li><p>Scrolling: 1-8 seconds (irregular intervals)</p></li></ul><p><strong>Step 10: Implement Realistic Mouse and Scroll Behavior</strong></p><p>High-security sites like banking, ticketing, and heavily protected e-commerce websites track interaction patterns in real time. To avoid giving yourself away on such websites, you have to script realistic mouse movements and scrolling.</p><p>For mouse movements, you can:</p><ul><li><p>Use Bezier curves to create natural arcing movements between points.</p></li><li><p>Add slight randomness to destination coordinates.</p></li><li><p>Include hover delays before clicking.</p></li><li><p>Vary the number of intermediate steps based on distance.</p></li></ul><p>The following is an example you can try in Python:</p><pre><code><code>import numpy as np
from playwright.async_api import async_playwright  # the helpers below are coroutines

def bezier_curve(start, end, control_points, num_steps=20):
    """Generate points along a Bezier curve for natural mouse movement"""
    t = np.linspace(0, 1, num_steps)
    points = []
    
    # Simplified cubic Bezier
    for t_val in t:
        x = (1-t_val)**3 * start[0] + \
            3*(1-t_val)**2*t_val * control_points[0][0] + \
            3*(1-t_val)*t_val**2 * control_points[1][0] + \
            t_val**3 * end[0]
        y = (1-t_val)**3 * start[1] + \
            3*(1-t_val)**2*t_val * control_points[0][1] + \
            3*(1-t_val)*t_val**2 * control_points[1][1] + \
            t_val**3 * end[1]
        points.append((x, y))
    
    return points

async def human_like_click(page, selector, start=(0.0, 0.0)):
    """Move along a Bezier curve from `start` to the element, then click.

    Playwright does not expose the current mouse position, so the caller
    passes the last known position and keeps the one returned here.
    """
    element = await page.query_selector(selector)
    box = await element.bounding_box()
    
    # Add slight randomness to destination
    target_x = box['x'] + box['width']/2 + np.random.normal(0, 2)
    target_y = box['y'] + box['height']/2 + np.random.normal(0, 2)
    
    # Pick two random control points so the path arcs instead of running straight
    control_points = [
        (start[0] + np.random.uniform(-50, 50), 
         start[1] + np.random.uniform(-50, 50)),
        (target_x + np.random.uniform(-20, 20), 
         target_y + np.random.uniform(-20, 20))
    ]
    
    points = bezier_curve(start, (target_x, target_y), control_points)
    
    # Move mouse along curve with small, irregular pauses
    for x, y in points:
        await page.mouse.move(x, y)
        await page.wait_for_timeout(np.random.uniform(5, 15))
    
    # Hover briefly before clicking
    await page.wait_for_timeout(np.random.uniform(100, 300))
    await page.mouse.click(target_x, target_y)
    
    # Return the new position so the next call can start from it
    return target_x, target_y
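

# --- Sketch: easing the curve parameter so the cursor speeds up, then slows ---
# A hedged add-on, assuming you want non-uniform spacing along the curve.
# `eased_steps` is a hypothetical helper: it returns values of t in [0, 1]
# that are bunched near both ends (smoothstep), so a pointer moved through
# them accelerates away from the start and decelerates into the target.
def eased_steps(num_steps=20):
    """Return num_steps values of t in [0, 1] following smoothstep easing."""
    if num_steps == 1:
        return [1.0]
    steps = []
    for i in range(num_steps):
        u = i / (num_steps - 1)
        steps.append(u * u * (3 - 2 * u))
    return steps


# Pass these values instead of np.linspace(0, 1, num_steps) inside bezier_curve
print([round(t, 3) for t in eased_steps(5)])  # [0.0, 0.156, 0.5, 0.844, 1.0]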
</code></code></pre><p>For scrolling, you can:</p><ul><li><p>Pause between scroll actions for variable amounts of time (simulating reading).</p></li><li><p>Scroll in chunks of varying size, not uniform pixels.</p></li><li><p>Occasionally scroll backwards (humans re-read).</p></li><li><p>Avoid perfect increments and constant speeds.</p></li></ul><p>Use the following Python code to try such scrolling behavior:</p><pre><code><code>import numpy as np

async def human_like_scroll(page, total_distance):
    """Scroll with human-like patterns"""
    scrolled = 0
    
    while scrolled &lt; total_distance:
        # Vary chunk size
        chunk = int(np.random.randint(100, 400))  # cast the NumPy integer to a plain int
        
        await page.mouse.wheel(0, chunk)
        scrolled += chunk
        
        # Pause to simulate reading
        pause = np.random.lognormal(mean=1.2, sigma=0.8)
        await page.wait_for_timeout(pause * 1000)
        
        # Occasionally scroll backwards (humans re-read)
        if np.random.random() &lt; 0.15:
            back = int(np.random.randint(50, 150))
            await page.mouse.wheel(0, -back)
            scrolled -= back  # keep the running total accurate
            await page.wait_for_timeout(np.random.uniform(500, 1500))
</code></code></pre><p><strong>Step 11: Maintain Persistent Session State</strong></p><p>Stateless scrapers look like bots. Real browsers, by contrast, accumulate state over time:</p><ul><li><p>Cookies persist across requests and sessions.</p></li><li><p>localStorage accumulates tracking data over time.</p></li><li><p>Session IDs remain stable across days or weeks.</p></li></ul><p>To mimic real browser state, you can use the following Python code:</p><pre><code><code>import pickle
import requests

# Save cookies to disk after each session
session = requests.Session()

# ... perform scraping ...

with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)

# Before next scraping session
with open('cookies.pkl', 'rb') as f:
    session.cookies.update(pickle.load(f))
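

# --- Sketch: JSON-based cookie persistence that survives a first run ---
# A hedged alternative to pickle, assuming a flat name/value view of the
# cookies is enough for your target (domain and path details are dropped).
# `save_cookies` and `load_cookies` are hypothetical helpers. With requests
# you could feed them requests.utils.dict_from_cookiejar(session.cookies)
# and pass the loaded dict to session.cookies.update(...).
import json
from pathlib import Path


def save_cookies(cookies, path):
    """Write a name/value cookie mapping to disk as JSON."""
    Path(path).write_text(json.dumps(dict(cookies)), encoding='utf-8')


def load_cookies(path):
    """Return saved cookies, or an empty dict when no file exists yet."""
    cookie_file = Path(path)
    if not cookie_file.exists():
        return {}
    return json.loads(cookie_file.read_text(encoding='utf-8'))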
</code></code></pre><p>In case you use a browser automation tool:</p><pre><code><code>from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    
    # Save browser storage state
    context = browser.new_context()
    # ... perform scraping ...
    context.storage_state(path='state.json')
    
    # Reload in next session
    context = browser.new_context(storage_state='state.json')
</code></code></pre><p>As a final note, consider keeping sessions alive for weeks to allow third-party tracking cookies to build up. Long-lived sessions with accumulated tracking data appear more legitimate than constantly refreshed clean states.</p><div><hr></div><h2><strong>Conclusion</strong></h2><p>In this article, you learned that preventing leaks while scraping takes several defensive measures, because no single technique makes you invisible. Anti-bot systems analyze multiple signals simultaneously, and any inconsistency across layers can trigger detection and get your scrapers blocked.</p><p>Detection methods also evolve, so what works today might fail tomorrow. Keep monitoring the defenses you implemented and test new ones.</p><p>Now, let us know: How do you prevent data leaks in your scrapers? Did we miss a technique?</p>]]></content:encoded></item><item><title><![CDATA[rayobrowse: A Hands-On Look at the Stealth Browser From Rayobyte]]></title><description><![CDATA[Looking for a Camoufox alternative? 
Here&#8217;s an interesting stealth browser worth checking out!]]></description><link>https://substack.thewebscraping.club/p/rayobrowse-browser-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/rayobrowse-browser-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 05 Apr 2026 03:00:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/442d19ad-ddc9-4b14-afda-71c81a91ffc4_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The open&#8209;source nature of Camoufox is what made the project so popular and appealing. Unfortunately, that same openness is also what allowed anti&#8209;bot giants to study it closely and eventually crack down on it.</p><p>Rayobyte, the proxy and web scraping solutions provider, has taken a different approach. They recently released <em>rayobrowse</em>, a closed&#8209;source yet Docker&#8209;based, self&#8209;hostable stealth browser built for local browser automation and web scraping.</p><p>In this post, I&#8217;ll take a deep look at this solution and walk you through everything you need to know about it. 
By the end, you&#8217;ll understand what rayobrowse is, how its stealth browser approach works, how to set it up, and whether it&#8217;s actually worth paying attention to.</p><blockquote><div><hr></div></blockquote><h2>An Introduction to rayobrowse</h2><p>Let me introduce you to the world of rayobrowse, helping you understand what it is and what makes this project special.</p><h3>What is rayobrowse?</h3><p><a href="https://github.com/rayobyte-data/rayobrowse">rayobrowse</a> is a self-hosted, Chromium-based stealth browser engineered for web scraping, AI agents, and automation workflows. It&#8217;s available as a Docker image, with optional support via a Python SDK (<em><a href="https://pypi.org/project/rayobrowse">rayobrowse</a></em> on PyPI) for simplified connection. The project is developed and maintained by Rayobyte.</p><p>The stealth browser runs inside Docker and is available via the <a href="https://substack.thewebscraping.club/p/webdriver-vs-cdp-vs-bidi">Chrome DevTools Protocol (CDP)</a>. That means tools like Playwright, Puppeteer, and Selenium (or any other tool that speaks CDP) can natively connect to it for automation purposes.</p><p>What makes it noteworthy is its approach to <a href="https://substack.thewebscraping.club/p/what-is-device-fingerprint">device fingerprinting</a>. User agents, screen size, WebGL, fonts, timezone, and other signals are tuned so each session looks like a real browser. 
That way, it helps your automation avoid detection on protected websites.</p><h3>Core Principles Driving the Solution</h3><p>These are the core principles and goals behind the project:</p><ol><li><p>It should run on Linux server environments without GPUs or a GUI/desktop interface.</p></li><li><p>It should patch Chromium at the C++ level, rather than at higher layers like CDP, which are easier for anti-bot systems to detect.</p></li><li><p>It should work with Playwright, a common framework in <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browsing automation stacks</a>.</p></li><li><p>It should support both headful mode (via <a href="https://www.x.org/archive/X11R7.7/doc/man/man1/Xvfb.1.xhtml">Xvfb</a>) and headless mode.</p></li><li><p>It should emulate fingerprints from real-world devices across different regions.</p></li><li><p>It should be self-hostable, so you can run it locally without relying on cloud infrastructure.</p></li><li><p>It should be free to test and use for certain user segments.</p></li><li><p>It should reliably bypass major anti-bot systems and scraping targets, including complex ecommerce and SERP platforms.</p></li></ol><p><strong>Note</strong>: If you&#8217;re not familiar with Xvfb, that&#8217;s an in&#8209;memory display server for Unix-like systems that implements the X11 display protocol without requiring a physical display or input devices. In simpler terms, it allows GUI applications to run in headless environments. 
rayobrowse relies on it to launch headful browser sessions even on servers without a graphical interface (which is beneficial, as headful sessions are harder to detect than purely headless ones).</p><h2>Main Features for Stealth Browsing and More</h2><p>Here is a list of the most relevant rayobrowse features:</p><ul><li><p><strong>Fingerprint spoofing</strong>:<strong> </strong>Each browser session comes with a realistic device fingerprint drawn from a database of thousands of real-world profiles. Signals include user agent, OS metadata, screen resolution, fonts, WebGL, hardware concurrency, and timezone.</p></li><li><p><strong>Human&#8209;like mouse movement</strong>: Optional human&#8209;style cursor behavior (inspired by <a href="https://github.com/riflosnake/HumanCursor">HumanCursor</a>) makes automation appear more natural. When using standard Playwright actions like <em>page.click()</em> or <em>page.mouse.move()</em>, the library applies realistic curves and timing.</p></li><li><p><strong>Proxy integration</strong>: Traffic can be routed through any HTTP proxy, including authenticated and rotating proxies.</p></li><li><p><strong>Headless and headful support</strong>: rayobrowse supports both execution modes, even on GUI-less Linux servers.</p></li><li><p><strong>Live session viewer</strong>:<strong> </strong>A built&#8209;in noVNC interface (available at http://localhost:6080) lets you watch browser sessions in real time directly from your web browser. 
This is particularly useful for debugging scraping flows and visually verifying fingerprint behavior.</p></li><li><p><strong>Official integrations</strong>:<strong> </strong>The browser integrates with common automation frameworks, namely Playwright, Puppeteer, Selenium, and Scrapy (via <em><a href="https://substack.thewebscraping.club/p/basic-scrapy-configuration">scrapy-playwright</a></em>), as well as emerging <a href="https://substack.thewebscraping.club/p/my-first-week-with-openclaw">AI&#8209;driven tools such as OpenClaw</a>. As of this writing, additional integrations (e.g., Firecrawl and LangChain) are planned.</p></li><li><p><strong>Remote/Cloud mode</strong>: rayobrowse can run as a <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#remote--cloud-mode-beta">remote browser service</a>. Your server requests new browser instances through a REST API, and workers connect directly to the returned CDP WebSocket endpoint. This is still a beta feature.</p></li><li><p><strong>API&#8209;driven browser management</strong>:<strong> </strong>The daemon exposes REST endpoints for creating, listing, and deleting browser sessions, allowing you to orchestrate multiple browsers across a distributed scraping infrastructure.</p></li></ul><h2>Technical Details About the Project</h2><p>Now that you know what the project is and the features it provides, you&#8217;re ready to dive into the technical aspects.</p><h3>How rayobrowse Works</h3><p>At a high level, rayobrowse follows these steps:</p><ol><li><p><strong>Chromium patching</strong>:<strong> </strong>The project tracks upstream Chromium releases and applies a focused set of patches (relying on an <a href="https://github.com/brave/brave-core/blob/master/tools/cr/plaster.py">approach similar to Brave&#8217;s &#8220;plaster&#8221; model</a>). 
These patches normalize exposed browser APIs, reduce fingerprint entropy leaks, improve automation compatibility, and preserve native Chromium behavior whenever possible.</p></li><li><p><strong>Fingerprint assignment</strong>: When a browser session starts, rayobrowse assigns a realistic device fingerprint.</p></li><li><p><strong>Automation integration</strong>: Browser automation libraries connect to rayobrowse through the native CDP.</p></li></ol><h3>Architecture</h3><p>Architecturally, rayobrowse follows a clean separation between the browser runtime and the automation code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vdVO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vdVO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 424w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 848w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png" width="1456" height="697" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;rayobrowse&#8217;s architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="rayobrowse&#8217;s architecture" title="rayobrowse&#8217;s architecture" srcset="https://substackcdn.com/image/fetch/$s_!vdVO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 424w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 848w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!vdVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde554c2-0d14-41dc-b594-bca040c1b0a4_2123x1016.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">rayobrowse&#8217;s architecture</figcaption></figure></div><p>In particular, the system runs as a Docker container that bundles three core components:</p><ol><li><p>A daemon server that manages browser sessions.</p></li><li><p>A browser manager that downloads and retrieves the correct version of Chromium, a fingerprint engine that injects realistic device profiles, and a stealth browser layer containing a custom Chromium build with stealth patches.</p></li><li><p>A <a href="https://github.com/novnc/noVNC">noVNC viewer</a>, which lets you watch browser sessions in real time. 
This is useful for debugging and demos.</p></li></ol><p>As you can see, the automation scripts don&#8217;t run inside the container. Instead, they run on the host machine and connect to the browser remotely through the Chrome DevTools Protocol.</p><p>When a new session starts, rayobrowse assigns a real-user-looking fingerprint from a large database of actual devices, containing thousands of permutations collected from websites Rayobyte owns.</p><h3>Requirements</h3><p>The rayobrowse project is designed to run on Linux servers without GPUs (which is a common deployment environment).</p><p>These are the required prerequisites:</p><ul><li><p>Docker, as the browser runs entirely inside a container.</p></li><li><p>~2GB of available RAM, as each browser instance uses ~300MB.</p></li></ul><p>The main benefit of this Docker-based approach is that you don&#8217;t need to install Chromium locally, configure fonts, or set up Xvfb manually. All of those dependencies live inside the container, which keeps the host machine clean, portable, and reproducible.</p><p>It also makes the project well-suited for self-hosted environments without exposing its internal Chromium patching logic, making it much harder for anti-bot solution providers to reverse engineer how it works.</p><p>In terms of compatibility, rayobrowse works on Linux, Windows (native or WSL2), and macOS. The supported architectures are <em>x86_64 (amd64)</em> and <em>ARM64</em> (Apple Silicon and AWS Graviton). 
Still, you don&#8217;t have to worry about the architecture, as Docker automatically pulls the correct image for the host machine.</p><p><strong>Optional</strong>: If you plan to use the stealth browser through the Python SDK, an additional requirement is Python 3.10+.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>How to Access rayobrowse</h2><p>There are two main ways you can access rayobrowse:</p><ol><li><p>The <em>/connect</em> endpoint.</p></li><li><p>The built-in Python SDK.</p></li></ol><h3>Method #1: Use the /connect Endpoint</h3><p>The first rayobrowse usage method involves connecting directly to the <em>/connect</em> endpoint. This allows any CDP&#8209;compatible tool (including Selenium, Playwright, and Puppeteer) to open a browser session simply by pointing to a WebSocket URL like <em>ws://localhost:9222/connect</em>.</p><p>For instance, take a look at the Playwright connection example below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Connect to rayobrowse via CDP
    browser = p.chromium.connect_over_cdp("ws://localhost:9222/connect")
    page = browser.new_context().new_page()

    # Automation logic...

    browser.close()</code></pre></div><p>Keep in mind that the WebSocket browser connection URL can be customized using query parameters, as follows:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ws://localhost:9222/connect?headless=false&amp;os=android&amp;proxy=http://user:pass@host:port</code></pre></div><p>This URL creates a rayobrowse Chromium browser session in headful mode, using Android-based fingerprints, while routing all requests through the proxy <em><a href="http://user:pass@host:port">http://user:pass@host:port</a></em>.</p><p>Explore all <em>/connect</em> query parameters <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#using-connect-simplest">in the docs</a>.</p><h3>Method #2: Use the Python SDK</h3><p>You can also interact with rayobrowse through the built-in Python SDK. This exposes a <em><a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#api-reference">create_browser()</a></em> function that returns a CDP WebSocket URL for a newly created browser instance. From there, connect using Playwright or another automation framework, as shown below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from rayobrowse import create_browser
from playwright.sync_api import sync_playwright

# Configure the rayobrowse connection to run in headful mode 
# while simulating a Windows-based fingerprint
ws_url = create_browser(headless=False, target_os="windows")

with sync_playwright() as p:
    # Connect to rayobrowse with the configured URL via CDP
    browser = p.chromium.connect_over_cdp(ws_url)
    page = browser.contexts[0].pages[0]
 
    # Automation logic...

    browser.close()</code></pre></div><p>This approach gives you more control over the browser lifecycle, but it also involves more configuration and setup.</p><p>For more examples (e.g., proxy integration, multi-browser management, etc.), <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#using-the-python-sdk">check out the docs</a>.</p><h2>Get Started with rayobrowse: Step-by-Step Guide</h2><p>In this guided section, I&#8217;ll show you how to build a simple Playwright script that connects to rayobrowse.</p><p>For the sake of simplicity, I&#8217;ll assume you already have:</p><ul><li><p>A Unix-based system (Linux, macOS, or Windows via WSL).</p></li><li><p>Docker installed and running on your machine.</p></li><li><p>Git installed locally.</p></li><li><p>A Python environment set up <a href="https://substack.thewebscraping.club/p/scraping-vs-playwright-web-scraping">with Playwright installed</a>.</p></li></ul><p>Follow the instructions below!</p><h3>Step #1: Clone the rayobrowse Repository</h3><p>The first step is to clone the rayobrowse repository to your machine:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">git clone https://github.com/rayobyte-data/rayobrowse</code></pre></div><p>Then, enter the project folder with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">cd rayobrowse</code></pre></div><p>The cloned folder already includes everything you need to get started, including:</p><ul><li><p><em>docker-compose.yml</em>:<strong> </strong>For running the browser container.</p></li><li><p><em>requirements.txt</em>: For installing the Python SDK.</p></li></ul><h3>Step #2: Set Up the Environment</h3><p>rayobrowse requires a 
.env file that contains the configuration needed to run the browser daemon. For a full list of available environment variables and what they enable, <a href="https://github.com/rayobyte-data/rayobrowse?tab=readme-ov-file#environment-variables">explore the official documentation</a>.</p><p>Start by creating a <em>.env</em> file as a copy of the <em>.env.example</em> file coming with the repository:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">cp .env.example .env</code></pre></div><p>Then open the <em>.env</em> file and make sure it contains:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">STEALTH_BROWSER_ACCEPT_TERMS=true</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zjWr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zjWr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 848w, 
https://substackcdn.com/image/fetch/$s_!zjWr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png" width="1456" height="865" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Setting the STEALTH_BROWSER_ACCEPT_TERMS env&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Setting the STEALTH_BROWSER_ACCEPT_TERMS env" title="Setting the STEALTH_BROWSER_ACCEPT_TERMS env" srcset="https://substackcdn.com/image/fetch/$s_!zjWr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 848w, 
https://substackcdn.com/image/fetch/$s_!zjWr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!zjWr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61227ac4-7b8f-48ae-9224-04ca00a85c5f_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Setting the STEALTH_BROWSER_ACCEPT_TERMS env</figcaption></figure></div><p>This confirms that you accept the 
project&#8217;s <a href="https://github.com/rayobyte-data/rayobrowse/blob/main/LICENSE">LICENSE</a>. Without that setting, the daemon will refuse to create browser sessions.</p><h3>Step #3: Start the Docker Container</h3><p>Launch the rayobrowse Docker container:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">docker compose up -d</code></pre></div><p>Docker will automatically pull the appropriate image for your system architecture (<em>x86_64</em> or <em>ARM64</em>). Then, it&#8217;ll start the container, as explained earlier.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FB1x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FB1x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 424w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 848w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png" width="1456" height="522" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The output of the &#8220;docker compose up -d&#8221; command&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The output of the &#8220;docker compose up -d&#8221; command" title="The output of the &#8220;docker compose up -d&#8221; command" srcset="https://substackcdn.com/image/fetch/$s_!FB1x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 424w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 848w, 
https://substackcdn.com/image/fetch/$s_!FB1x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1272w, https://substackcdn.com/image/fetch/$s_!FB1x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c661685-9c52-4469-8241-d86dc7e9a0ef_2282x818.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The output of the &#8220;docker compose up -d&#8221; command</figcaption></figure></div><h3>Step #4: Connect via CDP 
and Apply the Automation Logic</h3><p>You can now connect to the running rayobrowse instance through the <em>/connect</em> endpoint using any CDP-compatible client. In this example, I&#8217;ll use Playwright with Python:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from playwright.sync_api import sync_playwright
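from urllib.parse import urlencode

# Optional sketch, not part of rayobrowse: build the /connect URL from keyword
# arguments rather than hand-writing the query string. build_connect_url is a
# hypothetical helper name; headless and os are the query parameters used in
# this article, and the host and port assume the default local setup.
def build_connect_url(base="ws://localhost:9222/connect", **params):
    # Render booleans as lowercase "true"/"false", as in headless=false
    normalized = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    # urlencode joins the pairs and percent-encodes any special characters
    query = urlencode(normalized)
    return f"{base}?{query}" if query else base

# Example: build_connect_url(headless=False, os="windows")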

with sync_playwright() as p:
    # Connect to the rayobrowse browser through the CDP WebSocket endpoint
    browser = p.chromium.connect_over_cdp(
        "ws://localhost:9222/connect?headless=false&amp;os=windows"
    )

    # Create a new browser context and page
    page = browser.new_context().new_page()

    # Navigate to the target (sample) page
    page.goto("https://quotes.toscrape.com/")

    # Print the page title to verify the session is working
    print(page.title())  # Output: "Quotes to Scrape"

    # Add your scraping logic here...

    # Close the browser session
    browser.close()</code></pre></div><p>At this point, write your scraping or automation logic, which will run inside the stealth Chromium browser provided by rayobrowse.</p><p>For debugging, you can watch the browser session live through noVNC at <em><a href="http://localhost:6080/vnc.html">http://localhost:6080/vnc.html</a></em>. While the script is running, you should see a headful Chromium session opening and navigating to the target page specified in the script:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v1V8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v1V8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png" width="1456" height="865" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Monitoring the target browser session at http://localhost:6080/vnc.html&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Monitoring the target browser session at http://localhost:6080/vnc.html" title="Monitoring the target browser session at http://localhost:6080/vnc.html" srcset="https://substackcdn.com/image/fetch/$s_!v1V8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 424w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 848w, https://substackcdn.com/image/fetch/$s_!v1V8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v1V8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21ecb3c-22ea-4515-b3f7-70630429ecd4_3072x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Monitoring the target browser session at http://localhost:6080/vnc.html</figcaption></figure></div><p>As you can tell, the server creates a headful Chromium session (due to the <em>headless=false</em> query parameter) and connects it to the page requested by the script.</p><p><strong>Optional</strong>: If you want more control over the browser 
lifecycle, install the Python SDK with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">pip install -r requirements.txt</code></pre></div><p>Take a look at the <a href="https://github.com/rayobyte-data/rayobrowse/tree/main/examples">official examples in the repository</a> for more guidance.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Pricing and Limitations</h3><p>This is how the rayobrowse pricing model works:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bDvq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bDvq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 424w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 848w, 
https://substackcdn.com/image/fetch/$s_!bDvq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png" width="1456" height="1065" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1065,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202193,&quot;alt&quot;:&quot;rayobrowse&#8217;s pricing model&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/190103610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="rayobrowse&#8217;s pricing model" title="rayobrowse&#8217;s pricing model" srcset="https://substackcdn.com/image/fetch/$s_!bDvq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 424w, 
https://substackcdn.com/image/fetch/$s_!bDvq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 848w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!bDvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456ac570-01b5-4251-87e5-f0c339deeccc_1536x1124.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">rayobrowse&#8217;s pricing model</figcaption></figure></div><p>What matters most for us developers is that you can run rayobrowse for free via self&#8209;hosting. In practice, the only real cost comes from proxies, which are necessary for scaling scraping workloads and avoiding IP bans (something that&#8217;s standard in most production scraping setups).</p><p>The main thing to keep in mind is that rayobrowse is still in beta. Rayobyte already uses it to scrape millions of pages per day, but results can vary depending on the target site and configuration.</p><p>Fingerprint coverage is currently strongest for Windows and Android, while macOS and Linux profiles are less mature. In addition, Canvas and WebGL fingerprinting are still evolving, which means some websites may detect the current implementation.</p><h2>Benchmarks and Final Comment</h2><p>To put rayobrowse to the test, I ran a simple script against a single page for each of the most popular anti&#8209;bot detection systems. 
These are the results I obtained:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lZAd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lZAd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 424w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 848w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1272w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png" width="1456" height="369" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:369,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80344,&quot;alt&quot;:&quot;Playwright vs rayobrowse: Benchmark comparison table&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/190103610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Playwright vs rayobrowse: Benchmark comparison table" title="Playwright vs rayobrowse: Benchmark comparison table" srcset="https://substackcdn.com/image/fetch/$s_!lZAd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 424w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 848w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1272w, https://substackcdn.com/image/fetch/$s_!lZAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e16bf2-3b9d-447e-8490-2d9ff4b403b5_1920x486.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Playwright vs rayobrowse: Benchmark comparison table</figcaption></figure></div><p><strong>Note:</strong> These tests were performed on my local machine using my ISP&#8217;s IP address.</p><p>As you can see, in this simple experiment rayobrowse achieved a 100% success rate, while Playwright failed consistently in headless mode and even struggled in some headful scenarios.</p><p>This suggests that the project is worth keeping an eye on, especially thanks to its self&#8209;hosted nature.</p><p><em>To be honest, and this is just my personal opinion as an expert who works in this field, I don&#8217;t usually get very excited about 
projects like this&#8230; In my experience, many libraries of this type either get cracked down on or simply don&#8217;t receive the long&#8209;term support they deserve. In this case, however, things are a bit different. The project is closed&#8209;source and backed by a well&#8209;known company in the industry, which understandably raises expectations for its future!</em></p><p>Here, I covered what the project is about, what it offers, how it works, and how to use it. As always, remember to use rayobrowse only for legal and <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical web scraping</a>. Until next time!</p><div><hr></div><h2>FAQ</h2><h3>Why is rayobrowse based on Chromium and not Chrome?</h3><p>rayobrowse is based on Chromium simply because Chrome is closed-source. Plus, tests performed on difficult websites show no meaningful difference in detection rates between Chrome and Chromium. 
Using Chromium also avoids false positives and reflects the broader ecosystem of Chromium-based browsers like Brave, Edge, and Samsung Internet.</p><h3>Is rayobrowse open source?</h3><p>No. rayobrowse is closed-source, to prevent anti-bot companies from reverse-engineering it. Similar projects, like <a href="https://github.com/daijro/camoufox">Camoufox</a>, were quickly studied and countered once their code became public. Rayobyte decided to keep the project closed-source to help maintain its effectiveness and reliability over the long term.</p><h3>Can everyone use rayobrowse?</h3><p>No, not all companies can use rayobrowse. Its license prohibits organizations listed in <a href="https://cdn.sb.rayobyte.com/list-of-prohibited-companies.txt">Rayobyte&#8217;s restricted list</a> from using the software. For everyone else, the project is free to download and run locally.</p><h3>Does rayobrowse support proxy integration?</h3><p>Yes, rayobrowse fully supports proxy integration. You can route traffic through any HTTP proxy using the <em>proxy</em> query parameter on the <em>/connect</em> endpoint or via the <em>proxy</em> option exposed by the <em>create_browser()</em> function from the Python SDK. 
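</p><p>As a minimal sketch of the first option, the snippet below builds a <em>/connect</em> URL that carries the <em>proxy</em> and <em>headless</em> query parameters. The <em>ws://localhost:9222</em> base address, the proxy credentials, the proxy URL format, and the <em>build_connect_url()</em> helper are all assumptions made for illustration, so check them against your own setup and the official documentation:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from urllib.parse import urlencode

# Assumed address of the local rayobrowse server; adjust host and port to your setup.
BASE_ENDPOINT = "ws://localhost:9222/connect"


def build_connect_url(proxy_url, headless=False):
    """Return the /connect URL with the headless and proxy query parameters."""
    params = {
        "headless": "true" if headless else "false",
        "proxy": proxy_url,
    }
    # urlencode percent-encodes the proxy URL so it survives as a single query value.
    return BASE_ENDPOINT + "?" + urlencode(params)


# Hypothetical proxy credentials, for illustration only.
connect_url = build_connect_url("http://user:secret@proxy.example.com:8000")
print(connect_url)</code></pre></div><p>Percent-encoding the proxy value keeps characters such as <em>@</em> and <em>:</em> from being misread as part of the outer URL, assuming the server decodes query parameters in the standard way. The resulting string would take the place of the endpoint used by the connection script shown earlier.</p><p>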
The proxy support includes authentication and rotating proxies.</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #101: Building an Internal Knowledge Base for Your Scraping Team]]></title><description><![CDATA[Every scraping team that survives long enough develops the same disease.]]></description><link>https://substack.thewebscraping.club/p/building-knowledge-base-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/building-knowledge-base-scraping</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 02 Apr 2026 19:17:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3dba6c6a-f027-4c60-ad27-2c2378c217c6_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every scraping team that survives long enough develops the same disease. Someone figures out how to bypass Cloudflare&#8217;s latest challenge, writes it up in Notion, and moves on. Three months later, a teammate runs into the same problem, spends two days reinventing the solution, and documents it in a Google Doc. Meanwhile, the original Notion page has become outdated because Cloudflare changed its challenge flow, and nobody updated it.</p><p>We have seen this pattern in every scraping operation we have worked with. The knowledge exists. It is just scattered across wikis, Slack threads, internal repos, and people&#8217;s heads. The real problem is not documentation; it is retrieval. People write things down. 
They just cannot find them when it matters.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>In <a href="https://substack.thewebscraping.club/p/ingest-web-data-rag-llm">THE LAB #77</a>, we explored the concept of RAG (Retrieval-Augmented Generation) applied to scraped data and showed how to build a basic knowledge assistant using FAISS. That was a proof of concept. This time we are going deeper. We are showing the production system we actually built and use daily, and we are explaining the reasoning behind each design choice: why markdown, how embeddings work, which chunking strategy actually performs better, and what role auto-tagging plays in retrieval.</p><p>After reading this article, we hope you will understand the mechanics well enough to build the same system for your team.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try 
Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>What we are building and why</h2><p>At TWSC, we have published around 300 articles over the past four years. Tutorials, reverse-engineering deep dives, tool comparisons, anti-bot analysis. When we sit down to write a new article, we need to remember what we have already covered, find previous work to link to, and check whether a technique we are about to describe was already explained in a past issue. Doing this by memory or by searching Substack&#8217;s archive stops working after the first hundred articles. </p><p>We also follow what the broader community publishes. Projects like <a href="https://crawl4ai.dev">Crawl4AI</a>, which appeared on Hacker News, show that the need to ingest web content into structured, LLM-ready knowledge bases is shared across the industry. The tools for crawling and extracting content keep getting better, but the retrieval side, finding the right piece of information in a growing archive, still requires a purpose-built system.</p><p>So we built one. Here is what the complete pipeline looks like:<br></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;631f6d4b-586d-4ef5-ba12-640a3cb186b0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Sources                                  Processing              Storage &amp; Retrieval
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;                                &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;              &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
Substack articles                   &#9472;&#9472;&#9488;
                                      &#9500;&#9472;&#9472;&gt; HTML-to-Markdown &#9472;&#9472;&gt; Frontmatter + Tagging &#9472;&#9472;&gt; Markdown files
Hacker News and other sources       &#9472;&#9472;&#9496;

Markdown files &#9472;&#9472;&gt; Chunker &#9472;&#9472;&gt; Embedder (e5-large-v2) &#9472;&#9472;&gt; PostgreSQL + pgvector

Search query &#9472;&#9472;&gt; Query embedding &#9472;&#9472;&gt; Cosine similarity search &#9472;&#9472;&gt; Ranked results</code></pre></div><p>Three stages, each independent and replaceable. You scrape content from your sources. You process and embed it. You search it. </p><p>If your team writes in Confluence instead of Substack, you swap the scraper. If you prefer Qdrant over pgvector, you swap the vector store. The architecture remains the same.<br><br>And here&#8217;s the hardware used for most of the steps, from embedding to the storage and retrieval: my DGX Spark.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yhsw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg" width="566" height="511.689557855127" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:961,&quot;width&quot;:1063,&quot;resizeWidth&quot;:566,&quot;bytes&quot;:181504,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/192358785?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40edcbd4-e6c6-4172-bf4c-ee62da325b0f_1280x1707.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yhsw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!Yhsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5842da-4730-47a8-9e4a-283198c42263_1063x961.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Yes, I know, probably overkill.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary 
button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The tools</h2><p><strong>Playwright</strong> handles browser-based scraping for our own Substack articles. Substack serves content dynamically and requires authentication for premium posts, so a plain HTTP client is not an option.</p><p><strong>Algolia API</strong> (via Hacker News) provides structured search over HN stories. No scraping needed: HN exposes its full search index through public endpoints.</p><p><strong><a href="https://scrapegraphai.com/">ScrapegraphAI</a> and <a href="https://www.firecrawl.dev/">Firecrawl</a></strong> convert external article URLs into clean markdown. ScrapegraphAI is the primary extractor, Firecrawl is the fallback.</p><p><strong>sentence-transformers</strong> with the <code>intfloat/e5-large-v2</code> model generates 1024-dimensional embeddings. We will explain why we chose this model later in the article.</p><p><strong>PostgreSQL with pgvector</strong> stores embeddings and handles similarity search. 
We chose it over dedicated vector databases because we already need PostgreSQL for metadata, and pgvector with HNSW indexing handles our scale without adding infrastructure.</p><p><strong>Docker Compose</strong> ties everything together as three containers: PostgreSQL, the API server, and the indexer.</p><p><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved to paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">101.KNOWLEDGE_BASE</a>.</strong></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Why markdown as the universal format</h2><p>The first design choice we had to make was what format our knowledge base would store. We had content from Substack (HTML), Hacker News links (various formats), and potentially Confluence, Google Docs, or Slack in the future. We needed a common representation.</p><p>We chose markdown for three reasons.</p><p><strong>First</strong>, markdown preserves document structure without carrying rendering noise. An HTML page contains navigation bars, ad slots, JavaScript, CSS classes, and layout dividers. None of that is content. When you convert to markdown, you keep headings, paragraphs, code blocks, links, and lists. Everything the embedding model needs, nothing it would choke on.</p><p><strong>Second</strong>, markdown is readable by humans and machines alike. When something goes wrong in the pipeline, you can open a markdown file and immediately see what the system is working with. 
Try doing that with a serialized HTML DOM or a JSON blob from an API response.</p><p><strong>Third</strong>, YAML frontmatter is a natural fit for markdown and gives us a structured metadata header without mixing it into the content. Each file gets an <code>id</code>, <code>type</code>, <code>title</code>, <code>publish_date</code>, <code>topics</code>, and <code>visibility</code> field. This metadata drives filtering at search time and never enters the embedding model. The separation is important: embeddings capture meaning, frontmatter captures facts.</p><p>There are two paths to get content into markdown. You can build your own converter using open-source libraries, or you can use commercial services that handle extraction and conversion for you. In this article we show both approaches deliberately. For our own Substack articles, we built a converter from scratch with BeautifulSoup and markdownify. It costs nothing, we control every detail, and it works because we know the source HTML structure intimately. For external content discovered on Hacker News, we use commercial services like ScrapegraphAI and Firecrawl instead, because every URL leads to a different site with a different HTML structure. Building custom converters for thousands of unknown domains would be impractical. The trade-off is clear: when you control the source, build your own; when you are scraping the open web, commercial extraction services save an enormous amount of development time.</p><p>Our Substack HTML-to-markdown converter is deliberately simple. It strips scripts, styles, buttons, forms, navigation, and footers, then converts the remaining HTML:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;aa028f4f-e1d2-412f-88bc-29153974e70e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import re

from bs4 import BeautifulSoup
from markdownify import markdownify


def html_to_markdown(html: str) -&gt; str:
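    # Parse the page and remove elements that never carry article content.
    soup = BeautifulSoup(html, "lxml")
    for tag in soup.find_all(["script", "style", "button", "form", "nav", "footer"]):
        tag.decompose()

    # Convert what is left to markdown with ATX (#) headings and "-" bullets.
    md = markdownify(
        str(soup),
        heading_style="ATX",
        bullets="-",
        strip=["script", "style", "button", "form", "nav"],
    )
    # Collapse runs of four or more newlines down to three.
    md = re.sub(r"\n{4,}", "\n\n\n", md)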
    return md.strip()</code></pre></div><p>The final output for each document looks like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;95188632-dd66-4b1e-a5fe-167c1807dcdc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">---
id: a1b2c3d4e5f6...
type: twsc_article
title: "THE LAB #94: Using Cookies and Session Persistence"
slug: the-lab-94-using-cookies-and-session
canonical_url: https://substack.thewebscraping.club/p/the-lab-94-using-cookies-and-session
publish_date: 2025-11-15
visibility: premium
topics:
  - browser-automation
  - cloudflare
  - scraping-infra
---

[article body in markdown]</code></pre></div><h2>Scraping your own content</h2><p>The first source we built was a scraper for our own Substack articles. The pattern applies to any CMS: discover URLs, authenticate if needed, extract content, convert to markdown with frontmatter.</p><h3>URL discovery and authentication</h3><p>Most publishing platforms expose a sitemap. We fetch it, filter for article URLs (Substack uses <code>/p/</code> in the path), and track the <code>lastmod</code> date to detect changes:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;cf6dd5c0-8f99-4466-bc88-5bfe8f8b109a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen


def fetch_sitemap(sitemap_url: str) -&gt; list[dict]:
    req = Request(sitemap_url)
    req.add_header("User-Agent", "Mozilla/5.0 ...")
    with urlopen(req) as response:
        content = response.read()

    root = ET.fromstring(content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    articles = []
    for url_elem in root.findall("sm:url", ns):
        loc = url_elem.find("sm:loc", ns)
        lastmod = url_elem.find("sm:lastmod", ns)
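        if loc is not None and loc.text and "/p/" in loc.text:
            # lastmod is optional in sitemaps, so guard against a missing or empty element
            modified = lastmod.text.strip() if lastmod is not None and lastmod.text else ""
            articles.append({"url": loc.text.strip(), "lastmod": modified})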
    return articles</code></pre></div><p>Substack gates premium content behind authentication. We handle this with a persistent Playwright browser context that stores cookies across runs. On the first run you log in manually; after that, the saved session keeps you authenticated. For cron jobs, we verify the session by loading a known premium article and checking if the full content appears.</p><p>We try multiple CSS selectors for extraction because Substack has changed its DOM structure over time. The extracted HTML goes through the markdown converter we showed earlier.</p><h2>Ingesting external sources: Hacker News</h2>
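<p>As mentioned in the tools section, Hacker News exposes its search index through public Algolia endpoints, so discovering candidate articles takes a plain HTTP request rather than a scraper. What follows is a minimal sketch under stated assumptions, not the code from our repository: the helper names (<code>build_search_url</code>, <code>extract_stories</code>, <code>search_stories</code>), the User-Agent string, and the points filter are hypothetical and only there for illustration.</p>

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public Hacker News search endpoint served by Algolia.
HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search"


def build_search_url(query: str, min_points: int = 0, hits_per_page: int = 50) -> str:
    """Build a story-search URL (hypothetical helper, not from the repository)."""
    params = {
        "query": query,
        "tags": "story",  # restrict results to stories, excluding comments
        "hitsPerPage": hits_per_page,
    }
    if min_points > 0:
        # numericFilters lets the API drop low-scoring stories server-side
        params["numericFilters"] = f"points>={min_points}"
    return f"{HN_SEARCH_URL}?{urlencode(params)}"


def extract_stories(payload: dict) -> list[dict]:
    """Keep only the hits that link to an external article."""
    stories = []
    for hit in payload.get("hits", []):
        url = hit.get("url")
        if not url:
            # Text-only posts (for example Ask HN) carry no external URL
            continue
        stories.append({
            "hn_id": hit.get("objectID"),
            "title": hit.get("title") or "",
            "url": url,
            "points": hit.get("points") or 0,
        })
    return stories


def search_stories(query: str, min_points: int = 0) -> list[dict]:
    """Fetch one page of results; this is the only function that hits the network."""
    req = Request(build_search_url(query, min_points))
    req.add_header("User-Agent", "knowledge-base-indexer/0.1")  # assumed value
    with urlopen(req, timeout=30) as response:
        return extract_stories(json.load(response))
```

<p>Only <code>search_stories</code> touches the network; the other two helpers are pure functions, which keeps URL building and response filtering easy to check offline. The URLs it returns are what you would hand to an extraction service for conversion to markdown.</p>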
      <p>
          <a href="https://substack.thewebscraping.club/p/building-knowledge-base-scraping">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Data Scraping for Market Research: A Developers Guide]]></title><description><![CDATA[Build scrapers that deliver real market intelligence, not just raw data dumps]]></description><link>https://substack.thewebscraping.club/p/data-scraping-market-research</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/data-scraping-market-research</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 29 Mar 2026 20:38:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e95388da-deb3-4a33-9e90-438b2658fddd_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Market research has always been about answering a simple question: &#8220;<em>What&#8217;s happening in the market, and how do I use that to make better decisions?&#8221;</em></p><p>The traditional way to answer that question involved surveys, focus groups, and expensive reports from firms that charge you a fortune for data that&#8217;s already a few months old by the time you read it. 
Today, the data you need is sitting on public web pages: You just need to collect it.</p><p>In this article, we&#8217;ll discuss how to scrape data for market research, what sources actually matter, how to build a pipeline that doesn&#8217;t fall apart after a week, and where the legal lines are.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What &#8220;Market Research&#8221; Actually Means for Web Scraping Professionals</h2><p>Market research needs to answer three questions:</p><ul><li><p>&#8220;<em>What are our competitors doing?</em>&#8221;</p></li><li><p>&#8220;<em>What are our customers saying?</em>&#8221;</p></li><li><p>&#8220;<em>How is the market moving?</em>&#8221;</p></li></ul><p>That&#8217;s it. Everything else is a variation of those three. And if you think about it, the web gives you access to all three, if you know where to look.</p><p>In practice, scraped market intelligence sits on three pillars:</p><ul><li><p><strong>Competitive data</strong>: Pricing, product catalogs, feature changes, hiring signals. This is the &#8220;what are they doing?&#8221; pillar.</p></li><li><p><strong>Customer sentiment</strong>: Reviews, forum discussions, social media posts. This is the &#8220;what are people saying?&#8221; pillar.</p></li><li><p><strong>Market signals</strong>: Job postings, regulatory filings, trend volumes, new product launches. This is the &#8220;where is the market going?&#8221; pillar.</p></li></ul><p>Now, why scraping instead of traditional research? Because scraping is real-time, it&#8217;s continuous, and it doesn&#8217;t depend on people filling out forms. A survey tells you what 500 people said last month. A scraper tells you what thousands of customers are saying right now, every single day, without anyone having to opt in.</p><p>That&#8217;s the competitive advantage. 
And it&#8217;s a big one.</p><div><hr></div><blockquote><p><em>For your scraping activity, you need IPs with good reputation. For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">that&#8217;s sharing with TWSC readers this offer</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" 
height="87.84" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a 
href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><h2>Where to Scrape: Sources That Actually Matter</h2><p>Not all sources are worth your time. You could scrape the entire Internet and still end up with nothing useful if you&#8217;re not targeting the right places. Below is a list of high-value targets for market research and what you can extract from each:</p><ul><li><p><strong>Competitor websites</strong>: Pricing pages, product pages, feature matrices, changelog, and blog posts. This is your primary source for understanding what competitors are offering and how they position themselves. Pricing pages, in particular, are gold. They change more often than you&#8217;d think, and tracking those changes over time tells you a lot about a competitor&#8217;s strategy.</p></li><li><p><strong>Review platforms</strong> <strong>(G2, Trustpilot, Amazon, Yelp)</strong>: Customer pain points, feature requests, sentiment shifts. Reviews are unfiltered customer feedback. Nobody writes a G2 review because they were asked nicely in a survey. They write it because they feel strongly about something&#8212;and that&#8217;s exactly the kind of signal you want.</p></li><li><p><strong>Job boards</strong> <strong>(LinkedIn, Indeed)</strong>: Hiring patterns reveal where a company is investing. If a competitor suddenly posts 20 machine learning engineer roles, that tells you something no press release will. Job postings are one of the most underrated market research signals out there.</p></li><li><p><strong>Social media and forums (Reddit, X, niche communities)</strong>: Unfiltered opinions, emerging trends, early complaints about products. 
Reddit threads and niche forums are where people say what they actually think, not what they&#8217;d say in a focus group.</p></li><li><p><strong>Government and public data portals</strong>: SEC filings, patent databases, import/export records. These are slower-moving signals, but they&#8217;re authoritative. A patent filing can tell you what a competitor is building 18 months before it ships.</p></li></ul><p>Here&#8217;s the key question to ask yourself before adding a source to your scraper: <em>&#8220;Does this data answer a specific research question, or am I just hoarding?&#8221;</em> If you can&#8217;t tie a source to a concrete insight, skip it. You&#8217;ll save yourself storage costs, maintenance headaches, and potential legal issues.</p><h2>Building the Pipeline: From Raw HTML to Market Intelligence</h2><p>A market research scraper is not a one-off script you run from your terminal. It&#8217;s a pipeline. And pipelines need structure. If you treat it like a quick script, you&#8217;ll end up with a mess of cron jobs, inconsistent data formats, and no idea whether your data is fresh or stale. So, build it properly from the start.</p><p>A scraping pipeline for market intelligence should have four stages:</p><ol><li><p><strong>Collection</strong>: Fetch the pages, extract the fields you need, and throw the rest away. Don&#8217;t store raw HTML &#8220;just in case&#8221; (you&#8217;ll learn why in the legal section of this article).</p></li><li><p><strong>Storage</strong>: Store facts and metadata (source URL, timestamp, extracted fields). Use a structure that makes deduplication and versioning easy. 
In practice, this means designing your schema around a composite key (for example: <em>source</em> + <em>entity ID</em> + <em>scraped timestamp</em>) so you can track how a data point changes over time without overwriting previous records.</p></li><li><p><strong>Transformation</strong>: Normalize the data across sources, deduplicate records, and enrich them with additional context (geocoding, industry classification, entity linking).</p></li><li><p><strong>Analysis</strong>: Turn rows into insights. This is where the actual market research happens. And to be clear: &#8220;Analysis&#8221; doesn&#8217;t mean opening a CSV and scrolling through it. The goal is to turn your pipeline&#8217;s output into dashboards, scheduled reports, or Slack alerts that reach the people who make decisions. If the data sits in a database and nobody looks at it, the whole pipeline is wasted effort.</p></li></ol><h3>Scheduling Matters More Than You Think</h3><p>Different data types have different freshness requirements. Getting this wrong means either wasting resources or working with stale data. When deciding how often each scraper should run, consider the following:</p><ul><li><p><strong>Price tracking</strong>: Daily or hourly, depending on the market. Consider that e-commerce prices can change multiple times a day. SaaS pricing pages change less often, but when they do, it&#8217;s significant.</p></li><li><p><strong>Review monitoring</strong>: Monitoring reviews daily is usually enough. Reviews don&#8217;t appear in real time, and sentiment trends are measured in weeks, not minutes.</p></li><li><p><strong>Job postings</strong>: A weekly schedule works for trend analysis of the job market. Remember that you&#8217;re looking for patterns, not individual listings.</p></li><li><p><strong>Social media</strong>: This depends on your use case. If you&#8217;re tracking a product launch or a PR crisis, you might need near-real-time collection. 
For general trend analysis, daily or even weekly batches work fine.</p></li></ul><h3>Tools That Work Well for Market Research Scraping</h3><p>You don&#8217;t need to reinvent the wheel. The software industry already provides you with the best tools for your market research scraping pipeline. Here&#8217;s a solid stack for a market research pipeline:</p><ul><li><p><strong><a href="https://www.scrapy.org/">Scrapy</a></strong> for structured crawling. <a href="https://substack.thewebscraping.club/p/scrapy-ten-years-of-scraping-framework">Scrapy&#8217;s architecture is designed for exactly this kind of work</a>: You define spiders per source, plug in middleware for proxy rotation and retry logic, and use item pipelines to clean and store data as it flows through. For market research specifically, Scrapy&#8217;s built-in feed exports let you dump results straight to JSON, CSV, or even S3 without writing custom I/O code. And if you need to coordinate multiple spiders (say, one per competitor), Scrapy&#8217;s project structure keeps things organized as your source list grows.</p></li><li><p><strong><a href="https://playwright.dev/">Playwright</a></strong> or <strong><a href="https://pptr.dev/">Puppeteer</a></strong> for JS-heavy pages. The key difference from Scrapy is that <a href="https://substack.thewebscraping.club/p/handling-infinite-scrolling-python-js">you&#8217;re running a real browser, which means you can handle dynamic content, infinite scroll</a>, and client-side rendering. The trade-off is resource cost: Each browser instance eats memory and CPU, so you don&#8217;t want to use this for targets that serve static HTML.</p></li><li><p><strong>A</strong> <strong>task queue</strong> for scheduling and orchestration. This is what turns a collection of scrapers into an actual pipeline. 
Instead of running scripts manually or relying on cron jobs, a task queue lets you schedule scrapes per source at different intervals, retry failed jobs automatically, and <a href="https://substack.thewebscraping.club/p/python-async-for-faster-scraping">control concurrency so you&#8217;re not overwhelming a target site with parallel requests.</a> It also gives you visibility: you can see what&#8217;s queued, what&#8217;s running, what failed, and why.</p></li><li><p><strong><a href="https://www.postgresql.org/">PostgreSQL</a></strong> for structured market data that needs querying and versioning. Relational databases shine here because market research data is inherently relational: competitors have products, products have prices, prices change over time.</p></li></ul><p>The point is this: Pick tools that let you build a maintainable system, not just a working script. Every tool in this stack solves a specific problem, and none of them requires you to build infrastructure from scratch. The best market research pipeline is the one that&#8217;s boring to operate, because boring means reliable.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Scaling Without Getting Blocked</h2><p>If you&#8217;re scraping one competitor once a week, you don&#8217;t need this section. If you&#8217;re tracking 50 competitors daily across thousands of pages, you do.</p><p>Here&#8217;s the reality: The moment you start scraping at scale, you become visible, and sites don&#8217;t like bots, even polite ones. So you need to be smart about how you scale. 
Consider the following rules of thumb to avoid getting blocked:</p><ul><li><p><strong>Proxy rotation</strong>: <a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies">Residential proxies for sensitive targets (sites with aggressive anti-bot systems), datacenter proxies for everything else</a>. Rotate per request or per session, depending on the site&#8217;s detection mechanisms. The key is to not send thousands of requests from the same IP in an hour.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/rate-limit-scraping-exponential-backoff">Rate limiting and backoff</a></strong>: Be a good citizen. If you hammer a site with concurrent requests, you&#8217;ll get blocked, and you&#8217;ll deserve it. Implement exponential backoff on failures, and set reasonable delays between requests. A 2-3 second delay between requests is a good starting point for most sites.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">Fingerprint management</a></strong>: Headers, TLS fingerprint, and browser-level signals matter on sites with serious anti-bot systems. Make sure your request headers look consistent and realistic.</p></li><li><p><strong>CAPTCHAs</strong>: <a href="https://substack.thewebscraping.club/p/are-captchas-still-a-thing">If you&#8217;re hitting CAPTCHAs regularly, your approach is too aggressive</a>. Fix the root cause (rate, fingerprint, proxy quality) before reaching for solver services. 
CAPTCHA solvers are a band-aid, not a solution.</p></li></ul><p>The general principle is simple: Scrape at a pace that doesn&#8217;t degrade the target site&#8217;s performance.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Turning Scraped Data into Actual Market Insights</h2><p>Let&#8217;s be clear about something: Raw scraped data is not market research. It&#8217;s just data. A CSV with 50,000 rows of competitor prices is not an insight. A chart showing that competitor X has dropped their enterprise tier price by 15% over three months: That&#8217;s an insight.</p><p>Here&#8217;s where the value gets created:</p><ul><li><p><strong>Price tracking and competitive benchmarking</strong>: Track changes over time, visualize trends, and set alerts for significant moves. The goal is not to know what a competitor charges today. It&#8217;s to understand their pricing trajectory. Are they moving upmarket? Are they running more frequent discounts? Are they simplifying their tier structure? This is where predictive <a href="https://substack.thewebscraping.club/p/predictive-analytics-web-scraped-data">analytics meets scraped data with the goal of predicting future moves</a> from your competitors.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/sentiment-analysis-amazon-reviews">Sentiment analysis on reviews</a></strong><a href="https://substack.thewebscraping.club/p/sentiment-analysis-amazon-reviews">: Use NLP to extract themes from customer reviews</a>. 
This is powerful for product teams who want to understand what customers love and hate about competitors. But remember: You&#8217;re analyzing the data internally, not republishing the reviews.</p></li><li><p><strong>Hiring signal analysis</strong>: Aggregate job postings by role type, department, and location. A competitor suddenly posting 15 ML engineer roles tells you they&#8217;re investing in AI. A wave of sales hiring in EMEA tells you they&#8217;re expanding geographically. This is a signal that&#8217;s almost impossible to get from any other source.</p></li><li><p><strong>Trend detection</strong>: Time-series analysis on product launches, feature changes, pricing moves, or social media mentions. <a href="https://substack.thewebscraping.club/p/scraping-data-anomaly-detection">The goal is to spot patterns or anomalies</a> before they become obvious. If three competitors all add the same feature within two months, that&#8217;s a market trend, not a coincidence.</p></li></ul><p>Overall, the <a href="https://substack.thewebscraping.club/p/building-a-scraper-dashboard-streamlit">output of your scraping pipeline should be dashboards</a>, reports, or automated alerts, not a database dump that someone has to manually dig through. If the insights don&#8217;t reach decision-makers in a usable format, the whole pipeline is wasted effort.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Legal and Ethical Considerations: Don&#8217;t Skip This Section</h2><p>I know, I know. You&#8217;re a developer, not a lawyer. But here&#8217;s something I&#8217;m sure you know: Most legal problems in scraping are self-inflicted. They happen because someone scraped &#8220;everything on the page,&#8221; stored it &#8220;for later,&#8221; and only then asked: <em>&#8220;Wait, can we actually use this?&#8221;</em></p><p>These points are covered in detail in &#8220;<a href="https://substack.thewebscraping.club/p/avoid-copyright-violations-scraping">How to Avoid Copyright Violations While Scraping</a>&#8221;, so let&#8217;s briefly go through the key legal and <a href="https://substack.thewebscraping.club/p/best-practices-for-ethical-web-scraping">ethical principles of web scraping</a>:</p><ul><li><p><strong>Scrape facts, not expression</strong>: Copyright protects expression, not facts. Prices, SKUs, dates, availability, and job titles are facts. No one owns the fact that a SaaS product costs $49/month. On the other hand, product descriptions, review text, and blog posts are creative expression.</p></li><li><p><strong>Don&#8217;t store raw pages by default</strong>: Storing the HTML of entire pages means creating copies of copyrighted content. Instead, parse in memory, extract only the fields you need, and discard the rest. 
If you need to debug, store a small sample with short retention.</p></li><li><p><strong>Respect </strong><em><strong>robots.txt</strong></em>: <a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">The </a><em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications">robots.txt</a></em><a href="https://substack.thewebscraping.club/p/understanding-robotstxt-and-its-implications"> file is not the law, but ignoring it is evidence of bad faith if things go sideways</a>. In disputes, it can be used to show that you knew you were unwelcome and kept going anyway.</p></li><li><p><strong>Terms of Service matter</strong>: If the ToS explicitly forbids scraping and you scrape anyway, you may have a breach-of-contract problem. This is often easier for the site owner to prove than copyright infringement, because the argument is straightforward: you agreed to a contract, then you violated it.</p></li><li><p><strong>Don&#8217;t scrape behind a login</strong>: Once you log in, you&#8217;ve affirmatively agreed to a contract. Breaking that contract to scrape is a fast track to legal trouble. If your plan requires authenticated access, treat it as a licensing problem, not an engineering challenge.</p></li><li><p><strong>GDPR/CCPA</strong>: If you&#8217;re scraping anything that could be personal data (usernames, reviewer names, profile information), you need to know which privacy laws apply. This is especially relevant for review scraping and social media monitoring.</p></li></ul><p>Here&#8217;s the mental model that works: A price comparison tool that shows prices and links back to the source? Generally safe. A product catalog that copies descriptions, images, and reviews so users never need to visit the original site? 
That&#8217;s where you get into trouble, even if you don&#8217;t publicly display the results because you use them for internal analysis.</p><h2>Keeping Your Scrapers Alive: Monitoring and Maintenance</h2><p>Scrapers in production break for several reasons. Sites change layouts, add anti-bot measures, restructure their URLs, or just go down for maintenance. If you don&#8217;t monitor your scrapers, your data goes stale silently, and you won&#8217;t know until someone asks why the pricing dashboard hasn&#8217;t updated in three weeks.</p><p>Here&#8217;s a breakdown of what you need:</p><ul><li><p><strong>Dead selector detection</strong>: Alert when a CSS selector or XPath returns empty across multiple consecutive runs. A selector that worked yesterday and returns nothing today means the site changed its HTML structure. The keyword here is &#8220;multiple consecutive runs&#8221;. A single empty result could be a transient issue, so consider not triggering alerts on the first failure. Instead, set a threshold, like three consecutive empty results, before flagging it. When it does fire, you need to inspect the current page structure and update your selectors. Alternatively, try to <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">go beyond the DOM using AI and LLMs</a>, to make your extraction more resilient to layout changes in the first place.</p></li><li><p><strong>HTTP status monitoring</strong>: A spike in 403s means you&#8217;re getting blocked. A spike in 429s means you&#8217;re hitting rate limits. A spike in 404s means URLs have changed. Each of these requires a different response. For 403s, check your proxy pool and rotation logic: You might need fresher IPs or a lower request rate. For 429s, back off and increase your delays between requests; the site is telling you exactly what the problem is. 
For 404s, the target has likely restructured its URL patterns, which means you need to update your URL generation logic, not just retry the same broken links. Log these status codes per source and per run so you can spot trends early. A gradual increase in 403s over a week is a warning sign that your current setup is losing effectiveness, even if individual runs still return some data.</p></li><li><p><strong><a href="https://substack.thewebscraping.club/p/ensuring-data-quality-in-web-scraping">Data quality checks</a></strong>: Row counts, null rates, value distributions. If your price tracker suddenly shows all prices as $0 or your review scraper returns empty text fields, you want to know immediately. Build quality checks into your pipeline as a post-scrape validation step, not as something you run manually. Compare each run&#8217;s output against baseline expectations: If you normally get 200 rows from a source and today you got 12, something is wrong, even if those 12 rows look fine individually.</p></li><li><p><strong>Automated tests against fixture HTML</strong>: Save sample HTML pages from your targets and write tests against them. When a test fails, you know the site has changed before your production scraper breaks. Treat your scrapers like production code, because they are. In practice, this means saving a snapshot of a relevant section in the target page as a local HTML file. Then, write unit tests that run your extraction logic against that fixture and assert expected outputs. Store these fixtures in version control alongside your scraper code. When a site changes and your production scraper breaks, update the fixture with the new HTML. This gives you a repeatable workflow for handling site changes instead of scrambling every time something breaks.</p></li></ul><p>The goal is simple: You should know when something breaks before your stakeholders do. 
A Slack alert that says &#8220;Competitor X pricing scraper returned 0 results&#8221; is infinitely better than a product manager asking why the dashboard is empty.</p><h2>Conclusion</h2><p>In this article, you learned that market research scraping is about building a reliable pipeline that collects the right facts, transforms them into insights, and doesn&#8217;t get you in legal trouble.</p><p>The competitive advantage of scraping for market research is in what you do with the data. Anyone can code a scraper. But building a system that delivers reliable, actionable market intelligence week after week? That&#8217;s where the real value is!</p><p>So, let us know: Are you using web scraping for market research? What sources have you found most valuable? How did you structure your scraping pipeline? Let&#8217;s discuss in the comments!</p>]]></content:encoded></item><item><title><![CDATA[Two stealth browsers just dropped. Also, your proxy provider might be overcharging you.]]></title><description><![CDATA[Use the new TWSC tools to discover proxy prices and news in the web scraping industry]]></description><link>https://substack.thewebscraping.club/p/two-stealth-browsers-proxy-prices</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/two-stealth-browsers-proxy-prices</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Wed, 25 Mar 2026 15:39:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8b292fc2-af4f-4f3e-8e94-0204b7fd08bb_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few things landed on my desk this week that I did not want to wait until the next issue to share. 
So here is a quick bonus edition: an update to a tool I have been building, and two projects from the scraping world that caught my attention.</p><div><hr></div><h2><strong>The Proxy Price Benchmark is now updated weekly</strong></h2><p><br>If you haven't checked it yet, the <a href="https://proxyprice.thewebscraping.club/">Proxy Price Benchmark</a> is the tool I built to answer a simple but important question: how much should you actually be paying for your proxies?<br><br>Every week, I (or better, my fleet of agents) update the pricing data directly from the vendors, so you always have a reliable reference to compare offers or negotiate with your current provider.<br><br>This week, we added two new vendors: <strong>Dataimpulse</strong> and <strong>AnyIP</strong>, bringing the total number of monitored providers to 27.<br><br><a href="https://proxyprice.thewebscraping.club/">Check the latest prices</a><br><br>If you use proxies at scale and would find API access to this data useful, I am considering a paid API plan. If you are interested, join the waitlist and tell me about your use case. I want to understand demand before I build it.<br></p><div><hr></div><h2><br><strong>This week on Scraping News: stealth browsers are getting serious</strong></h2><p><br>The <a href="https://news.thewebscraping.club/">Scraping News feed</a> has been tracking an interesting trend this week: two new stealth browser projects worth watching.<br><br><strong><a href="https://owlbrowser.net/">Owl Browser</a></strong> is a purpose-built browser engine for automation at scale. Not a Playwright wrapper but a full engine built on Chromium (CEF) with a custom C99 HTTP server, 256 parallel contexts, and sub-12ms cold start. Self-hosted, Docker-ready, with Python and TypeScript SDKs. 
If you are running high-volume scraping and hitting the limits of standard headless setups, this is worth a closer look.<br><br><strong><a href="https://github.com/rayobyte-data/rayobrowse">Rayobrowse</a></strong> is Rayobyte's open-source stealth Chromium browser, released from their production scraping infrastructure. It handles fingerprint randomization at the browser level (user agent, WebGL, fonts, screen resolution, timezone) and connects via CDP, so it works with Playwright, Puppeteer, Selenium, or any custom script. It runs on headless Linux with no GPU required.<br><br>Both address the same problem from different angles: standard headless Chromium is easily detected, and the solution is now moving from patch-level evasion to full browser-level stealth. We will be covering both in depth on TWSC soon.<br><br><a href="https://news.thewebscraping.club/">See all the latest news on Scraping News</a><br></p><div><hr></div><p>Keep in mind that both the Proxy Price Benchmark tool and Scraping News are still early versions; feel free to suggest improvements and report bugs.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[WebSocket Bot Detection Techniques and How to Bypass Them]]></title><description><![CDATA[You may already know generic anti-bot techniques, but what about WebSocket-specific ones? Let&#8217;s find out!]]></description><link>https://substack.thewebscraping.club/p/websocket-bot-detection-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/websocket-bot-detection-scraping</guid><dc:creator><![CDATA[Antonello Zanini]]></dc:creator><pubDate>Sun, 22 Mar 2026 09:30:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/48bfc637-7402-4dd9-b7ab-d007f6fa773d_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Websites and web applications are becoming more complex than ever, with live data powering features that deliver fast insights. If you&#8217;re wondering which technology makes those live updates possible, the answer is WebSockets.</p><p>You might think that, in a web scraping scenario, the solution is simply to connect directly to the WebSocket channels. Sure, that&#8217;s possible, but there are a few obstacles along the way. 
The main ones are WebSocket anti-bot techniques and bot detection measures.</p><p>In this post, I&#8217;ll walk through the most common ones, explain how they work, and share proven tips and tricks to help you avoid them.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KA5B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KA5B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KA5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png" width="560" height="315.38461538461536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:560,&quot;bytes&quot;:1650775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656767?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KA5B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!KA5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667617e8-0f8b-4603-82b2-8eb965078c98_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><h2>A Quick Intro to WebSockets</h2><p>Before diving into WebSocket bot detection, let me first provide some context about WebSocket as a protocol and its role in web scraping.</p><h3>What Is the WebSocket Protocol?</h3><p><a href="https://websocket.org/guides/websocket-protocol/">WebSocket</a>, often abbreviated as <em>WS</em>, is a web protocol standardized in <a href="https://datatracker.ietf.org/doc/html/rfc6455">RFC 6455</a> that enables full-duplex communication between clients and servers over a single, persistent TCP connection.</p><p>Unlike HTTP, which is stateless and request-driven, WebSockets establish a long-lived connection through an initial HTTP handshake that upgrades the connection to the WebSocket protocol. 
After the handshake, both client and server can send messages independently, with data transmitted in frames that can be text, binary, or control frames (ping, pong, close).</p><p>WebSockets support fragmentation, masking, and optional compression via extensions like permessage-deflate, while newer HTTP/2 and <a href="https://substack.thewebscraping.club/p/faster-web-scraping-with-http3">HTTP/3 mechanisms</a> allow multiplexing, reduced latency, and better proxy traversal.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WRwD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WRwD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 424w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 848w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1272w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WRwD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png" width="1456" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;HTTP vs WebSocket&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="HTTP vs WebSocket" title="HTTP vs WebSocket" srcset="https://substackcdn.com/image/fetch/$s_!WRwD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 424w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 848w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1272w, https://substackcdn.com/image/fetch/$s_!WRwD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af58e47-e333-4b99-8f69-92d98d0222db_1490x596.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">HTTP vs WebSocket</figcaption></figure></div><div><hr></div><blockquote><p><em><br>For your ethical scraping activity, you need IPs with good reputation. 
For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">that&#8217;s sharing with TWSC readers this offer</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" height="87.84" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC 
for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><blockquote><div><hr></div></blockquote><h3>Why and When Web Pages Use WebSockets</h3><p>The WebSocket protocol opens the door to live, bidirectional web communication. Unlike HTTP&#8217;s request-response model, it lets servers and clients exchange data continuously over a single, persistent connection.</p><p>In general, WebSockets are essential for any application where low latency and frequent updates are required. Common use cases include:</p><ul><li><p><strong>Live streaming</strong>: YouTube Live, TikTok LIVE, Kick, Twitch, and similar platforms.</p></li><li><p><strong>Chat applications</strong>: Slack, Discord, and other messaging services.</p></li><li><p><strong>Collaboration tools</strong>: Google Docs, Figma, and online whiteboards.</p></li><li><p><strong>Gaming and multiplayer experiences</strong>: Browser-based MMO games, turn-based games, and PvP games.</p></li><li><p><strong>Financial data feeds</strong>: Stock tickers, cryptocurrency price updates, and trading dashboards.</p></li><li><p><strong>IoT and telemetry</strong>: Sensor updates, home automation, and device monitoring.</p></li><li><p><strong>Notifications and alerts</strong>: Push updates for social networks, dashboards, or monitoring systems.</p></li></ul><p>In short, WebSocket comes into play wherever instant, continuous communication is necessary (and standard HTTP polling would be too slow or resource-intensive).</p><h3>Main Challenges of Scraping Data from WebSockets</h3><p>Connecting to a WebSocket server for collecting data isn&#8217;t as straightforward as <a href="https://substack.thewebscraping.club/p/apis-in-web-scraping">spoofing API requests for web scraping</a>. 
In particular, the main challenges of scraping data straight from WebSockets include:</p><ul><li><p><strong>Finding the right client implementation</strong>: You must use a WebSocket client (and there are far fewer of them than HTTP clients&#8230;) that supports the correct protocol version, plus any negotiated subprotocols and extensions (such as compression).</p></li><li><p><strong>Limited documentation and examples</strong>: WebSocket scraping is less common than API scraping, so there are fewer guides, tools, and community resources available.</p></li><li><p><strong>Proxy integration complexity</strong>: Not all clients support proxies, which makes IP rotation harder.</p></li><li><p><strong>No request&#8211;response model</strong>: You can&#8217;t simply send a request and receive a response, as with API scraping. Instead, you must send the right messages and then listen to a continuous stream of events.</p></li><li><p><strong>Real-time data handling</strong>: You need a system that collects, processes, and stores messages in real time, often while dealing with high-frequency updates.</p></li></ul><h2>Main WebSocket Anti-Bot Techniques and Solutions</h2><p>Now you&#8217;re ready to explore the most important WebSocket-specific bot detection techniques, along with practical tips to avoid and bypass them. The idea here is to target a WebSocket server from an automated script, relying on a WS client in Python, Node.js, or another programming language of your choice.</p><h3>WebSocket Handshake Issues</h3><p>The <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers#client_handshake_request">WebSocket handshake</a> is a transition phase in which an HTTP connection is upgraded to a persistent WebSocket connection. 
During this step, both the client and the server negotiate the connection parameters, and either side can abort the process if the conditions aren&#8217;t acceptable.</p><p>Because the handshake is where the protocol upgrade happens, it&#8217;s also a pivotal security and bot-detection point. The server must carefully validate everything the client requests. Otherwise, protocol misuse or security issues may occur.</p><p>In detail, during the handshake, a WebSocket client must send a valid HTTP/1.1 GET request with specific headers, for example:</p><pre><code>GET /live-data HTTP/1.1
Host: example.com:9000
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: JKjeFfYU8mti9re0prPQrw==
Sec-WebSocket-Protocol: chat, superchat
Sec-WebSocket-Version: 13</code></pre><p>In practice, browsers also include additional headers such as <em>Origin</em>, <em>User-Agent</em>, <em>Referer</em>, and <em>Cookie</em>, as well as authentication headers (e.g., <em>Authorization</em>). While these HTTP headers aren&#8217;t strictly required by the WebSocket specification, they are extremely valuable for <a href="https://substack.thewebscraping.club/p/browser-fingerprinting-test-online">fingerprinting and bot detection</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6chN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6chN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 424w, https://substackcdn.com/image/fetch/$s_!6chN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 848w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1272w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6chN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png" width="1456" height="1239" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1239,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note all extra HTTP headers&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note all extra HTTP headers" title="Note all extra HTTP headers" srcset="https://substackcdn.com/image/fetch/$s_!6chN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 424w, https://substackcdn.com/image/fetch/$s_!6chN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 848w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1272w, https://substackcdn.com/image/fetch/$s_!6chN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34640876-45bf-4d04-b896-eea6f78c6a15_1744x1484.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Note all extra HTTP headers</figcaption></figure></div><p>The server should respond with <em>400 Bad Request</em> and immediately close the connection if it encounters:</p><ul><li><p>A required header that is missing or malformed.</p></li><li><p>An invalid <em><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Sec-WebSocket-Key">Sec-WebSocket-Key</a></em>.</p></li></ul><p>If the requested WebSocket version is unsupported, the server should instead reject the handshake with an error status (such as <em>426 Upgrade Required</em>) and return a <em>Sec-WebSocket-Version</em> header listing the versions it supports (most modern servers only accept <a 
href="https://datatracker.ietf.org/doc/html/rfc6455#section-1.2">version </a><em><a href="https://datatracker.ietf.org/doc/html/rfc6455#section-1.2">13</a></em>).</p><p>In practice, repeated handshake failures or non-browser-like handshake patterns are often treated as a bot indicator. Those may result in blocking, particularly after repeated handshake attempts from the same IP or when fingerprinting enables identification even across IP changes.</p><p><strong>&#128204; Tips</strong>:</p><ul><li><p><strong>Always send a valid </strong><em><strong>Origin</strong></em><strong> header</strong>: All major browsers include it, and many servers automatically reject WebSocket requests without one.</p></li><li><p><strong>Replicate real browser handshakes as closely as possible</strong>: Inspect the WebSocket request made by a real browser and match all headers (e.g., <em>User-Agent </em>and similar extra headers).</p></li><li><p><strong>Avoid excessive handshake attempts from the same machine</strong>: Too many connection attempts in a short time window are a common bot signal.</p></li><li><p><strong>Use IP rotation carefully</strong>: Rotation can help avoid rate-based blocks, but it doesn&#8217;t protect against fingerprint-based detection if the handshake remains identical.</p></li></ul><h3>Honeypot WebSocket Events and Channels</h3><p>If you&#8217;re familiar with <a href="https://substack.thewebscraping.club/p/scraping-high-frequency-python">common anti-bot techniques</a>, you&#8217;ve probably heard of honeypots. A honeypot is a decoy mechanism designed to attract bots by exposing fake or hidden resources, allowing systems to detect automated behavior when those resources are accessed or interacted with (e.g., invisible links or fake pages created to study bots).</p><p>In the context of WebSockets, honeypot events are a possible anti-bot technique to detect automated clients. 
With this approach, the server deliberately sends fake, misleading, or non-actionable events over the WebSocket connection. Similarly, the server might expose channels that aren&#8217;t meant to be accessed by regular clients.</p><p>Automated scraping bots may give themselves away by reacting to these honeypots, for example by:</p><ul><li><p>Acting on incoming data that is fake or intentionally invalid.</p></li><li><p>Requesting access to or subscribing to channels they aren&#8217;t supposed to use.</p></li></ul><p><strong>&#128204; Tips</strong>:</p><ul><li><p><strong>Study real browser behavior carefully</strong>: Inspect WebSocket traffic in your browser&#8217;s DevTools (&#8220;Network&#8221; &#8594; &#8220;Socket&#8221;) and observe which server messages actually trigger data flow or UI updates.</p></li><li><p><strong>Avoid assuming every message is meaningful</strong>: Remember that reacting to every event can lead to detection.</p></li></ul><h3>Connection Lifecycle Anomalies and Patterns</h3><p>Since WebSocket channels are stateful (unlike stateless HTTP requests), servers can detect bots by analyzing connection behavior over time. Scraping bots tend to prioritize speed over realistic user behavior, which can produce identifiable patterns.</p><p>In this regard, common bot-like indicators include:</p><ul><li><p><strong>Very short-lived connections</strong>: Opening and closing sockets rapidly to collect data.</p></li><li><p><strong>Immediate reconnections after closure</strong>: Reconnecting instantly without human-like delays.</p></li><li><p><strong>High connection churn per IP</strong>: Multiple connections from the same IP within a short period.</p></li><li><p><strong>Missing browser events</strong>: Browsers normally complete a proper closing handshake when a socket is shut down, whereas bots often drop the connection abruptly.</p></li><li><p><strong>Unnatural latency patterns</strong>: Servers use ping frames as heartbeats to check responsiveness. 
Real users on home Wi-Fi or mobile networks exhibit variable latency (jitter), while automated scripts deployed in data centers generally show extremely stable, low-latency responses.</p></li></ul><p><strong>&#128204; Tips</strong>:</p><ul><li><p><strong>Introduce some randomness</strong>: Add realistic, variable delays between connections and reconnections.</p></li><li><p><strong>Replicate intended behavior</strong>: Close connections the way a browser does, with a proper closing handshake, rather than dropping them abruptly.</p></li><li><p><strong>Add latency variation</strong>: Vary the timing of the frames you send and of your replies to mimic real-world network jitter.</p></li><li><p><strong>Rotate connection IPs</strong>: Use proxies to <a href="https://substack.thewebscraping.club/p/how-many-ip-needed-scraping">distribute WebSocket connections across multiple IPs</a>.</p></li></ul><h3>WebSocket Binary Data Transmission</h3><p>WebSocket servers sometimes choose to send binary data instead of plain text or JSON. The main technical reasons for this are:</p><ul><li><p><strong>Reduced bandwidth</strong>: Binary messages omit field names and whitespace, making messages smaller than JSON strings and supporting high-frequency updates.</p></li><li><p><strong>Faster parsing</strong>: Binary data can be read as typed arrays or fixed-size fields, avoiding JSON parsing overhead.</p></li><li><p><strong>Custom protocols</strong>: Web apps can define their own compact binary format for predictable, high-frequency data.</p></li><li><p><strong>Efficient number storage</strong>: Numeric values can be stored in 1&#8211;8 bytes rather than as multi-character strings, saving space.</p></li></ul><p>For instance, TikTok LIVE pages use WebSockets to stream updates (e.g., chat messages, view counters, and other statistics) in binary format:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!GVMj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GVMj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 424w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 848w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1272w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GVMj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png" width="1456" height="862" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:862,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the binary message sent from the 
server&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the binary message sent from the server" title="Note the binary message sent from the server" srcset="https://substackcdn.com/image/fetch/$s_!GVMj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 424w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 848w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1272w, https://substackcdn.com/image/fetch/$s_!GVMj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a4dea3-ac88-4ef3-a161-e3dcbe529297_3071x1818.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note the binary message sent from the server</figcaption></figure></div><p>Sure, binary data can be converted to text. So, you may think that&#8217;s not a problem&#8230;</p><p>Well, keep in mind that many web applications that send binary data also apply some form of compression or encryption. This adds significant complexity!</p><p>Reverse-engineering these systems is technically possible by inspecting the browser&#8217;s WebSocket client code, analyzing handshake headers for compression hints, or trying common compression methods by trial and error. Still, that&#8217;s time-consuming and error-prone. Plus, encryption keys, salts, or other details can easily change with each deployment.</p><p><strong>&#128204; Tips</strong>:</p><p>This time, the only piece of advice I have is to look for alternative data sources. 
Many WebSocket-based pages, including TikTok LIVE, use regular HTTP APIs to retrieve initial data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZW_3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZW_3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 424w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 848w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1272w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png" width="1456" height="779" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:779,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Note the RESTful HTTP request made by the client during rendering&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Note the RESTful HTTP request made by the client during rendering" title="Note the RESTful HTTP request made by the client during rendering" srcset="https://substackcdn.com/image/fetch/$s_!ZW_3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 424w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 848w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1272w, https://substackcdn.com/image/fetch/$s_!ZW_3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eae33f3-6ef1-422d-905d-339b51577816_3031x1622.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Note the RESTful HTTP request made by the client during rendering</figcaption></figure></div><p><strong>Note</strong>: Why aren&#8217;t these APIs called server-side when the HTML page is generated? 
In the case of live data, it&#8217;s more reliable to fetch it on the client, because even a single second of latency could result in outdated or inconsistent information.</p><p>Thus, polling those RESTful APIs instead of consuming the WebSocket data streams can allow you to retrieve the information of interest without dealing with binary encoding, compression, or encryption challenges.</p><div><hr></div><h2>WebSocket-Based Bot Detection Measures</h2><p>WebSocket connections start with an HTTP handshake, so they inherit <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">many anti-bot techniques commonly used for HTTP requests</a>. At the same time, due to the protocol&#8217;s stateful and persistent nature, anti-bot solutions like WAFs (Web Application Firewalls) can leverage WebSockets to detect automated behavior even more effectively&#8230;</p><p>As a result, WebSocket-based anti-bot measures are not only relevant when connecting directly to WS servers, but also when interacting with web pages through browser automation tools like Playwright and Selenium. That&#8217;s why you must know them!</p><h3>Advanced TLS Fingerprinting</h3><p>Traditional HTTP fingerprinting checks headers and TLS details. WebSockets extend this by combining the TLS handshake with WebSocket-specific framing, which is much harder to spoof. 
Signals include <a href="https://developers.cloudflare.com/bots/additional-configurations/ja3-ja4-fingerprint/">JA3/JA4 fingerprints</a>, unusual cipher suite ordering, frame fragmentation patterns, and incorrect masking behavior.</p><h3>Continuous Device Fingerprinting</h3><p>HTTP allows basic fingerprinting on a per-request basis, but it can&#8217;t verify whether the client&#8217;s environment remains consistent. The stateful nature of WebSockets enables servers to continuously <a href="https://substack.thewebscraping.club/p/what-is-device-fingerprint">validate device fingerprints</a> over time. For example, servers can request Canvas/WebGL renders, available fonts, and other browser characteristics repeatedly. Any inconsistency can lead to an immediate block.</p><h3>Real-Time User Behavior Monitoring</h3><p>WebSockets allow live streaming of mouse, keyboard, and scrolling events back to the server. This enables a much deeper level of user behavior analysis compared to static HTTP requests.</p><p>After all, most <a href="https://substack.thewebscraping.club/p/browser-automation-landscape-2025">browser automation scripts</a> produce perfectly straight mouse movements or instantaneous clicks, while human interactions naturally include slight jitter, variable speed, and reaction delays. 
These differences make automated clients easier to detect when behavior is constantly monitored over a WebSocket connection.</p><div><hr></div><h2>Conclusion</h2><p>Here, I introduced the WebSocket protocol and explained why and when it comes in handy. Specifically, you learned that it powers live data updates on web applications. Want to access that data? Well, it&#8217;s not as straightforward as you might think due to WebSocket anti-bot techniques.</p><p>In this post, I explored the most relevant WS bot detection methods, along with useful advice for bypassing them successfully. You also saw how WebSocket&#8217;s stateful, continuous data streaming can be used by WAFs and other advanced anti-bot systems for enhanced detection.</p><p>I hope you found this helpful and informative. If you have any questions or comments, drop them below. 
Until next time!</p>]]></content:encoded></item><item><title><![CDATA[THE LAB #100: Hybrid Scraping - One Browser Login, Thousands of HTTP Requests]]></title><description><![CDATA[Building a pipeline that uses Camoufox for authentication and curl_cffi for extraction on Akamai-protected targets.]]></description><link>https://substack.thewebscraping.club/p/hybrid-scraping-camoufox-curl-cffi</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/hybrid-scraping-camoufox-curl-cffi</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 19 Mar 2026 22:07:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e52f7e3-270c-41cc-ba33-7bbbfb446247_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Browser-based scraping tools have become the default answer when a website deploys anti-bot protection. When a target runs Akamai, Cloudflare, or Datadome, the natural reflex is to reach for Playwright, Puppeteer, or one of their stealth variants like Camoufox or Pydoll. And it works. A real browser renders JavaScript, solves challenges, and presents a legitimate fingerprint. The success rate is high.</p><p>But a browser does everything the hard way. It downloads the full page, parses HTML, executes JavaScript, renders the DOM, loads images, fonts, and stylesheets. For each request, it allocates hundreds of megabytes of RAM and takes seconds to complete what an HTTP client could do in milliseconds. When a pipeline needs to scrape ten pages, this overhead is irrelevant. 
When it needs to scrape ten thousand pages, the browser becomes the bottleneck.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Consider a concrete scenario: we need to monitor the wishlist of an e-commerce account, pulling product data, stock levels, and price changes every hour across hundreds of items. 
Running Camoufox for every single API call would mean spinning up a full browser instance, navigating to each page, waiting for JavaScript to execute, extracting the data, and closing. For a hundred items, that is minutes of execution time and gigabytes of memory. The same API calls through an HTTP client would complete in seconds using a fraction of the resources.</p><p>As we measured in <a href="https://substack.thewebscraping.club/p/scraping-nike-with-open-source">THE LAB #96</a>, HTTP clients with TLS impersonation can be 27x faster than browsers on the same target. The difference is not marginal. It is the difference between a pipeline that runs on a single machine and one that requires a cluster.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><p>The problem is that these two approaches are usually treated as mutually exclusive. Either you use a browser for everything, accepting the overhead, or you try an HTTP client and hope the anti-bot system does not block it. But many websites only need a browser at the gate: for the login, the initial challenge, or the session establishment. Everything after that is plain API calls.</p><p>If we can use a browser to earn a valid session and then hand it off to an HTTP client, we get the reliability of browser automation where it matters and the speed of HTTP everywhere else. That is the pattern we want to build. But the handoff is not as simple as copying a few cookies, and the traps along the way are worth understanding before building a pipeline around this idea.</p><h2>The hybrid pattern</h2><p>The idea is simple in principle. Many websites require a browser only at the gate: the login flow, the initial anti-bot challenge, or the session establishment. 
Once that gate is passed, subsequent requests are plain API calls or page fetches that do not require JavaScript execution. If we can extract the session state from the browser and replay it through an HTTP client, we skip the browser for 99% of the work.</p><p>The session state, in practice, means cookies. An authentication flow sets session cookies that the server trusts for subsequent requests. If we transfer those cookies from the browser to an HTTP client, the server should treat the HTTP client as the same authenticated user.</p><p>But cookies alone are often not enough. Modern anti-bot systems like Akamai do not just check whether you have the right cookies. They also check whether the client presenting those cookies looks like the same client that earned them. </p><p>This is where TLS fingerprinting enters the picture: if the browser that logged in was Firefox, but the HTTP client that reuses the cookies presents a Python TLS fingerprint, the server may reject the request or simply drop the connection without responding.</p><p>So the real challenge is not just transferring cookies. It is maintaining continuity across two different execution models: the browser and the HTTP client must look like the same entity to the server.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>Tool landscape</h2><p>For this experiment, we used two tools.</p><p><a href="https://github.com/daijro/camoufox">Camoufox</a> is a custom Firefox build designed for stealth. 
It spoofs fingerprints (WebGL, canvas, audio, navigator properties), patches headless detection vectors, and uses Playwright&#8217;s Juggler protocol for automation. We covered it extensively in <a href="https://substack.thewebscraping.club/p/scraping-datadome-camoufox">THE LAB #65: Scraping Datadome-protected websites with Camoufox</a>. Its role here is limited to one thing: logging in.</p><p><a href="https://github.com/yifeikong/curl_cffi">curl_cffi</a> is a Python binding for curl-impersonate, a modified version of curl that mimics the TLS and HTTP/2 fingerprint of real browsers. It supports impersonating Chrome and Firefox at specific versions, which means it can present a TLS fingerprint matching that of the browser that established the session. Unlike a browser, it uses negligible resources per request and can process thousands of pages per minute.</p><p>The key property that makes this pairing work is that Camoufox is Firefox-based, and curl_cffi can impersonate Firefox&#8217;s TLS fingerprint. The server sees a consistent Firefox identity across both steps.</p><p><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">100.HYBRID_SCRAPING</a>.</strong></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>The target: Net-a-Porter</h2><p>We chose <a href="https://www.net-a-porter.com">Net-a-Porter</a> as our target. 
It is a luxury e-commerce platform protected by Akamai Bot Manager, with authenticated features (wishlists, account details) exposed through internal JSON APIs. This gives us a clean test case: the login requires a real browser (Akamai blocks automation tools at the login endpoint), but the authenticated API calls are plain HTTP requests that return structured JSON.</p><p><em><strong>Please keep in mind that this is an experiment for study purposes, and we&#8217;re not encouraging you to scrape Net-a-Porter or any other website, especially the parts behind a login.</strong></em></p><p>Before diving into code, we need to understand what we&#8217;re dealing with. Net-a-Porter&#8217;s architecture has three layers relevant to us:</p><p><strong>Akamai Bot Manager</strong> sits in front of everything. It sets a cluster of tracking cookies (<code>_abck</code>, <code>bm_sz</code>, <code>bm_s</code>, <code>ak_bmsc</code>, and others) that are generated through JavaScript execution on the client side. These cookies serve as evidence that a real browser visited the page. Without them, API calls either fail or hang indefinitely.</p><p><strong>The login API</strong> at <code>/api/nap/wcs/resources/store/nap_il/loginidentity/v2</code> accepts a JSON payload with email and password. On success, it returns a 201 status with an <code>Ubertoken</code> in the response body. This token is the key to all authenticated endpoints.</p><p><strong>Authenticated API endpoints</strong> like the wishlist API at <code>/api/nap/wcs/resources/store/nap_il/wishlist/v2/{id}</code> require both the session cookies and the <code>Ubertoken</code> passed as an <code>x-ubertoken</code> header. They return clean JSON with product details, stock levels, and metadata.</p><h2>The experiment: what worked and what did not</h2><p>We did not arrive at the final solution directly. The investigation path itself reveals the constraints of session handoff, so it is worth walking through each attempt.</p>
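<p>To make the three layers above more concrete, here is a minimal sketch of the data a handoff has to carry: the cookies that apply to the target host and the token header. The function names are hypothetical, the cookie shape (dicts with <code>name</code>, <code>value</code>, and <code>domain</code> keys) is an assumption modeled on what Playwright-style browser contexts export, and all values are placeholders. It only prepares the inputs and sends no request, so it is not the code from our repository.</p>

```python
"""Sketch: turn cookies exported by a browser into inputs for an HTTP client."""


def cookies_for_host(browser_cookies, host):
    """Return a name-to-value dict of the cookies that apply to `host`.

    Assumes each cookie is a dict with "name", "value" and "domain" keys,
    the shape Playwright-style browser contexts export.
    """
    jar = {}
    for cookie in browser_cookies:
        domain = cookie.get("domain", "").lstrip(".")
        # A cookie applies to its own domain and to any subdomain of it.
        if domain and (host == domain or host.endswith("." + domain)):
            jar[cookie["name"]] = cookie["value"]
    return jar


def authenticated_headers(ubertoken, user_agent):
    """Build headers for an authenticated call (header name taken from the article)."""
    return {
        "x-ubertoken": ubertoken,
        # Reuse the browser's own User-Agent so both clients look alike.
        "User-Agent": user_agent,
        "Accept": "application/json",
    }


if __name__ == "__main__":
    # Placeholder values only; real ones would come from the browser session.
    exported = [
        {"name": "_abck", "value": "abc", "domain": ".example.com"},
        {"name": "unrelated", "value": "zzz", "domain": ".tracker.test"},
    ]
    print(cookies_for_host(exported, "www.example.com"))
    print(authenticated_headers("token-from-login", "Mozilla/5.0 (placeholder)"))
```

<p>In principle, the two dictionaries could then be passed to an HTTP client such as curl_cffi together with a Firefox impersonation target. Whether the server accepts them depends on the continuity constraints discussed earlier, which is exactly what the attempts below explore.</p>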
      <p>
          <a href="https://substack.thewebscraping.club/p/hybrid-scraping-camoufox-curl-cffi">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Stop Getting Blocked: Upgrade Your Scraping Infrastructure with Dolphin{anty}]]></title><description><![CDATA[My review of Dolphin{anty}. Weighing the pros, cons, and unique capabilities of this anti-detect browser.]]></description><link>https://substack.thewebscraping.club/p/dolphin-anty-product-review</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/dolphin-anty-product-review</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 15 Mar 2026 15:33:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d9486151-c072-4ffa-b126-fe482a216e7e_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The web scraping industry has evolved very fast in recent years. The fact that rotating proxies is no longer enough to guarantee success is a clear sign of how advanced anti-bot systems have become.</p><p>Many tools have emerged to solve the issue of browser fingerprinting, which is one of the <a href="https://substack.thewebscraping.club/p/differences-residential-mobile-proxies">primary reasons for blocks even when using high-quality residential proxies</a>. As a result, companies that need stable, scalable data collection now treat anti-detect solutions as essential.</p><p>In this article, you&#8217;ll discover Dolphin{anty}, a powerful anti-detect browser that lets you orchestrate hundreds of unique, isolated browser profiles. 
You&#8217;ll learn its strengths, why you should consider it for your scraping or multi-accounting projects, and how it works with a practical guide.</p><p>Let&#8217;s dive into it!</p><h2>What is an Anti-detect Browser?</h2><p>An anti-detect browser is a specialized web browsing tool designed <a href="https://substack.thewebscraping.club/p/understanding-browser-fingerprint">to mask a user&#8217;s digital fingerprint</a>, allowing them to appear as a distinct, unique visitor to websites and tracking systems. Standard browsers like Chrome or Firefox broadcast a user&#8217;s hardware and software data. An anti-detect browser, instead, enables users to customize and spoof these parameters for every session.</p><p>In web scraping, professionals use this technology to bypass anti-bot measures that rely on browser fingerprinting to identify and block automated traffic. Anti-detect browsers can also be used in &#8220;multi-accounting&#8221; strategies. You can use them to create isolated browser profiles, each with its own unique fingerprint, cookies, and proxy IP. A common use case is a single user managing hundreds of social media, e-commerce, or ad accounts simultaneously without triggering security flags that would normally link the accounts together and lead to mass bans.</p><div><hr></div><blockquote><p><em>A successful data pipeline depends not only on the right tool, but also on the right IP address. 
Proxy providers like <strong>Decodo</strong> help you achieve your scraping goals.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>What Is Dolphin{anty} and Why Consider It for Your Web Scraping Projects?</h2><p><a href="https://dolphin-anty.com/">Dolphin{anty}</a> is an anti-detect browser that lets you manage hundreds of unique, isolated browser profiles for web scraping and multi-accounting. You can use it through its desktop application or programmatically, as it provides a flexible API that integrates with your scripts.</p><p>The main benefit is that you can run large-scale scraping operations without handling browser fingerprinting yourself. Forget about immediate IP bans, CAPTCHAs triggered by suspicious metadata, or complex cookie management: Dolphin{anty} handles the masking of your digital identity for you. Also, thanks to its <a href="https://dolphin-anty.com/blog/en/dolphin-anty-has-become-even-more-effective-a-significant-update-to-the-scenarios-capabilities/">built-in &#8220;Scenarios&#8221; builder and synchronizer</a>, it can automatically replicate human-like actions across multiple profiles simultaneously. So, say goodbye to manual warm-up routines and the fear of losing accounts to anti-fraud systems.</p><p>The top reasons to consider it for your projects are the following:</p><ul><li><p><strong>Advanced anti-detect capabilities:</strong> If you&#8217;ve been scraping for a while, you know that standard headless browsers often leak metadata that triggers anti-bot defenses. Dolphin{anty} addresses this by providing real, unique digital fingerprints for every profile.
It mimics user behavior at a granular level, helping you get past sophisticated detection systems without the constant headache of being blocked.</p></li><li><p><strong>Mass profile management:</strong> Managing a few accounts is easy, but scaling to hundreds or thousands is a different beast. Dolphin{anty} is built for scale: it lets you orchestrate hundreds of isolated browser profiles from a single interface. Whether you are managing a large farm of accounts for data collection or segmenting your scraping tasks, the tool provides the infrastructure to keep everything organized and efficient.</p></li><li><p><strong>Flexible API integration:</strong> For those who prefer code, Dolphin{anty} offers a robust API that integrates with your existing Python or Node.js pipelines. This allows you to automate profile creation, launch browsers programmatically, and build the anti-detect capabilities directly into your custom scraping infrastructure.</p></li></ul><div><hr></div><blockquote><p><em>For your ethical scraping activity, you need IPs with a good reputation.
For this reason, we&#8217;re using a proxy provider like our partner <strong>Ping Proxies</strong>, <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">which is sharing this offer with TWSC readers</a>.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMCv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png" width="432" height="87.84"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1200,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:274315,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMCv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 424w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 848w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1272w, https://substackcdn.com/image/fetch/$s_!cMCv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc4621b-d173-4d25-a896-9a358b03971d_1200x244.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#128176; - <a href="https://dashboard.pingproxies.com/sign-up?utm_source=twsc&amp;utm_medium=banner_one&amp;utm_id=twsc">Use TWSC 
for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries</a></strong></p></blockquote><div><hr></div><h2>Dolphin{anty}&#8217;s Main Features</h2><p>Dolphin{anty} is packed with features designed to make multi-accounting and scraping easier. The main ones you should know about are the following:</p><ul><li><p><strong>Real fingerprint generation:</strong> The core of Dolphin{anty} is its ability to provide genuine device fingerprints. Instead of just blocking trackers, it creates a unique digital identity for every profile you run. In practice, it manages over 20 parameters, from WebRTC to Canvas, so your scrapers look like real users on real devices.</p></li><li><p><strong>Built-in automation:</strong> You don&#8217;t need to be a coding wizard to automate tasks. Dolphin{anty} offers a &#8220;Scenarios&#8221; builder that lets you create automated workflows visually. Whether it&#8217;s warming up accounts or parsing data, you can set these scenarios to run automatically. And for those who prefer code, the flexible API allows you to integrate these profiles directly into your existing scripts.</p></li><li><p><strong>Profile synchronizer:</strong> This is a game-changer if you need to perform the same action across multiple accounts. The Synchronizer lets you perform an action in a &#8220;master&#8221; profile, and the tool automatically repeats that exact action across all other selected profiles in real time. This saves you a lot of time on routine interactions.</p></li><li><p><strong>Team collaboration:</strong> If you work in a team, you know that sharing browser sessions and cookies can be a nightmare. Dolphin{anty} simplifies this by allowing you to transfer profiles, cookies, and proxies to colleagues in just a few clicks.
You can also manage permissions, ensuring that team members only have access to the functionality they need.</p></li><li><p><strong>Smart profile management:</strong> When you are dealing with hundreds of profiles, organization is key. The tool provides an interface where you can use tags, statuses, and notes to sort and find your profiles quickly. It&#8217;s built to help you navigate a large farm of accounts without getting lost in the chaos.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2><strong>Hands-on Dolphin{anty}: Step-by-step Scraping Tutorial</strong></h2><p>In this section, you will see how quickly you can get started with Dolphin{anty}. Get ready for the tutorial!</p><h3>Setting Up Dolphin{anty}</h3><p>First, you need an account. After <a href="https://dolphin-anty.com/panel/#/auth/registration">creating a new account on Dolphin{anty}</a>, you will be asked to download the software.
As you can see from the image below, it supports all the major operating systems:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qICh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qICh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 424w, https://substackcdn.com/image/fetch/$s_!qICh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 848w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1272w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qICh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png" width="1456" height="605"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:158491,&quot;alt&quot;:&quot;Dolphin Anty supports all major operating systems by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Dolphin Anty supports all major operating systems by Federico Trotta" title="Dolphin Anty supports all major operating systems by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!qICh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 424w, https://substackcdn.com/image/fetch/$s_!qICh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 848w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1272w, https://substackcdn.com/image/fetch/$s_!qICh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8309e44a-d25e-486a-9db2-a55cdff0c9eb_1532x637.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Dolphin{anty} supports all major operating systems</figcaption></figure></div><p>Below is how Dolphin{anty}&#8217;s interface appears after you install it on your machine:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kn4O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kn4O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 424w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 848w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1272w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png" width="1456" height="762" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:762,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:185539,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kn4O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 424w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 848w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1272w, https://substackcdn.com/image/fetch/$s_!Kn4O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cee3920-ba3d-46f4-862f-5407524ddda3_1912x1001.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Dolphin{anty}&#8217;s first interface</figcaption></figure></div><p>Good. Everything is set up. Time to create new profiles!</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h3>Create New Profiles</h3><p>Before using Dolphin{anty}, you have to create a new profile.
To do so, click on <strong>CREATE PROFILE</strong> and fill in the fields:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xU9q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xU9q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 424w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 848w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1272w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xU9q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png" width="1152" height="871" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:871,&quot;width&quot;:1152,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xU9q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 424w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 848w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1272w, https://substackcdn.com/image/fetch/$s_!xU9q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac09413c-7651-4b6f-85f7-509b8e65c148_1152x871.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Creating new profiles in Dolphin{anty}</figcaption></figure></div><p>Profiles are the core of Dolphin{anty}. This is where, for example, you can change the fingerprint for your anti-detect strategies. To do so, you only need to click on <strong>NEW FINGERPRINT</strong>, and the tool will regenerate all the fingerprint data for you.
And if the standard fingerprinting is not sufficient, you can manage advanced configurations:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rafe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rafe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 424w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 848w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1272w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rafe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png" width="1166" height="873" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:873,&quot;width&quot;:1166,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184964,&quot;alt&quot;:&quot;Changing fingerprint configuration in Dolphin Anty by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Changing fingerprint configuration in Dolphin Anty by Federico Trotta" title="Changing fingerprint configuration in Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Rafe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 424w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 848w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1272w, https://substackcdn.com/image/fetch/$s_!Rafe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414720b-3b93-451b-b7f5-bd57781fe568_1166x873.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Changing fingerprint configuration in Dolphin{anty}</figcaption></figure></div><p>Also, if your use case involves a specific social media platform such as Facebook, you can set Facebook&#8217;s URL as the starting page and enter the credentials for the account you need to manage:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank"
href="https://substackcdn.com/image/fetch/$s_!ONwa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ONwa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 424w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 848w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1272w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ONwa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png" width="1173" height="872" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:872,&quot;width&quot;:1173,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187856,&quot;alt&quot;:&quot;How to set up your social media profile&#8217;s login with Dolphin Anty by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to set up your social media profile&#8217;s login with Dolphin Anty by Federico Trotta" title="How to set up your social media profile&#8217;s login with Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!ONwa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 424w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 848w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1272w, https://substackcdn.com/image/fetch/$s_!ONwa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F372f1f30-449d-49a4-88f7-93d1051e204d_1173x872.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">How to set up your social media profile&#8217;s login with Dolphin{anty} </figcaption></figure></div><p>When everything is set up, click <strong>SAVE</strong>, and your profile is complete! You are now ready to use Dolphin{anty} via the UI or code.</p><div><hr></div>
<h3>Use Dolphin{anty} Via The UI</h3><p>The power of anti-detect browsers lies in letting you create different profiles and use them side by side from a single browser. So, after you have created the profiles, click <strong>START</strong> to launch the instances:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!38xz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!38xz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 424w, https://substackcdn.com/image/fetch/$s_!38xz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 848w, https://substackcdn.com/image/fetch/$s_!38xz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1272w, 
https://substackcdn.com/image/fetch/$s_!38xz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!38xz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png" width="1456" height="285" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:285,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124225,&quot;alt&quot;:&quot;How to launch instances with Dolphin Anty by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to launch instances with Dolphin Anty by Federico Trotta" title="How to launch instances with Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!38xz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 424w, https://substackcdn.com/image/fetch/$s_!38xz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 848w, 
https://substackcdn.com/image/fetch/$s_!38xz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1272w, https://substackcdn.com/image/fetch/$s_!38xz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ad16a7-737f-408c-b52a-d451c3bd33ed_1908x373.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">How to launch instances with Dolphin{anty}</figcaption></figure></div><p>Dolphin{anty} will launch a new browser instance, allowing you to manage as many profiles as you have created and activated. Below is the expected result:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_xVL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_xVL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 424w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 848w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_xVL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_xVL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png" width="1058" height="916" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:916,&quot;width&quot;:1058,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:218925,&quot;alt&quot;:&quot;Launching an instance with two different profiles with Dolphin Anty by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Launching an instance with two different profiles with Dolphin Anty by Federico Trotta" title="Launching an instance with two different profiles with Dolphin Anty by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!_xVL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 424w, 
https://substackcdn.com/image/fetch/$s_!_xVL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 848w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1272w, https://substackcdn.com/image/fetch/$s_!_xVL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ac724-4892-47a9-8d6e-731d92d421f0_1058x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Launching an instance with two different profiles with Dolphin{anty} </figcaption></figure></div><p>That&#8217;s it for using Dolphin{anty} via the UI!</p><h3>Use Dolphin{anty} Via Code</h3><p>Before using Dolphin{anty} via code, you have to create an API key. To do so, go to the <strong><a href="https://dolphin-anty.com/panel/#/api">API</a></strong><a href="https://dolphin-anty.com/panel/#/api"> panel in the web app</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oWZ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oWZ3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 424w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 848w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1272w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png" width="1456" height="581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224126,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/188140801?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oWZ3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 424w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 848w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1272w, https://substackcdn.com/image/fetch/$s_!oWZ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88ac408-382d-4d60-855d-2ee0c3b468cf_1903x760.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Creating an API key in Dolphin{anty} </figcaption></figure></div><p>Now you can connect to a profile through a port generated at startup and automate the browser using tools like <a href="https://substack.thewebscraping.club/p/improving-performance-puppeteer-scraping">Puppeteer</a>, <a href="https://substack.thewebscraping.club/p/web-scraping-from-0-to-hero-our-first">Playwright</a>, <a href="https://substack.thewebscraping.club/p/selenium-tutorial-course">Selenium</a>, and others.</p><p>A basic automation workflow includes the following steps:</p><ol><li><p>Start a profile 
via API with DevTools Protocol enabled.</p></li><li><p>Connect to the profile&#8217;s port using a browser automation tool.</p></li><li><p>Run your own automation script through the open connection.</p></li></ol><p>Because automation goes through a REST API, Dolphin{anty} gives you maximum flexibility: you can use your favourite programming language. For example, below is how you can write an authorization script in Python:</p><pre><code><code>import requests
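# Local Dolphin{anty} API endpoint used to authorize with a token
api_url = "http://localhost:3001/v1.0/auth/login-with-token"
token = "your-api-key"  # replace with the API key created in the panel
request_data = {"token": token}
headers = {"Content-Type": "application/json"}

# Send the token as JSON and report whether authorization succeeded
response = requests.post(api_url, json=request_data, headers=headers)
if response.status_code == 200:
&#9;print("OK", response.json())
else: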
&#9;print("Error", response.status_code)</code></code></pre><p>If the request succeeds, you will receive a response like the following:</p><pre><code><code>{"success": true}</code></code></pre><p>To learn more, read the <a href="https://help.dolphin-anty.com/en/collections/4645237-api">documentation on using Dolphin{anty} via API</a>.</p><h2>Pros and Cons of Dolphin{anty}</h2><p>Like any tool, Dolphin{anty} has its strengths and weaknesses. Here is a breakdown of what you need to know before deciding if it fits your stack.</p><p>&#128077; <strong>Pros:</strong></p><ul><li><p><strong>Top-tier fingerprinting:</strong> The ability to generate real, unique fingerprints for every profile is its biggest selling point. It goes beyond simple user-agents, making your scrapers look genuinely human.</p></li><li><p><strong>Built-in automation tools:</strong> The &#8220;Scenarios&#8221; builder and the Synchronizer are massive time-savers. You can automate routine warm-up tasks or replicate actions across dozens of profiles without writing a single line of code.</p></li><li><p><strong>Team-centric design:</strong> If you work with a team, the ability to transfer profiles and share them instantly is invaluable. It removes the friction of sharing session data manually via files or text.</p></li></ul><p>&#128078; <strong>Cons:</strong></p><ul><li><p><strong>REST API complexity:</strong> This is a significant friction point for developers. Unlike other solutions that offer native SDK wrappers, Dolphin{anty} relies only on REST API calls for automation. This adds &#8220;boilerplate&#8221; complexity compared to simply importing a library.</p></li><li><p><strong>Resource intensive:</strong> Running multiple browser profiles with full fingerprinting requires significant system resources. 
You will need a powerful machine if you plan to run dozens of concurrent sessions locally.</p></li></ul><h2>Conclusion</h2><p>In this article, you discovered Dolphin{anty}, a flexible anti-detect browser that can be used both via UI and via code. As you&#8217;ve learned, it comes packed with interesting features that can speed up your processes. In particular, we found that the &#8220;Scenarios&#8221; feature is the one that actually makes it stand out.</p><p>So, let&#8217;s discuss in the comments: Were you already using Dolphin{anty} before reading this article? What&#8217;s your experience with it?</p>]]></content:encoded></item><item><title><![CDATA[The DMCA Was Built to Stop DVD Piracy. Google Wants to Use It Against Scrapers]]></title><description><![CDATA[How a 12-page complaint is trying to turn every CAPTCHA into a federal copyright perimeter]]></description><link>https://substack.thewebscraping.club/p/google-vs-serpapi-web-scraping-case</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/google-vs-serpapi-web-scraping-case</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Sun, 08 Mar 2026 17:52:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/97060676-c153-4ea6-a3c9-7e70cd1f3c22_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On December 19, 2025, Google filed a lawsuit against SerpApi in the Northern District of California. The case number is 25-10826, and the complaint is 12 pages long. Twelve pages that could reshape how the entire scraping industry operates.</p><p>We are not talking about a cease-and-desist letter or a Terms of Service dispute. Google did not send SerpApi any communication before filing the lawsuit. No cease-and-desist, no attempt to resolve their concerns directly. 
SerpApi told us this was highly unusual, and that had Google reached out, they might have learned that their claims lack merit.</p><p>Google is invoking the Digital Millennium Copyright Act, specifically Section 1201, the anti-circumvention provision. The same statute originally designed to prevent people from cracking DVD encryption is now being pointed at a SERP scraping API.</p><div><hr></div><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. 
They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><div><hr></div><p>We reached out to both Google and SerpApi for comment on this case. Google did not respond. SerpApi did, and we will include their statements throughout this article where relevant.</p><p>Let us break down what happened, why it matters, and what it could mean for anyone who scrapes the web for a living.</p><h2>The Facts</h2><p>Google&#8217;s complaint tells a straightforward story. SerpApi, founded in 2017 by Julien Khaleghy, operates a paid API that sends automated queries to Google Search and returns the results as structured JSON. Google estimates that SerpApi sends hundreds of millions of artificial search requests per day, and that this volume has increased by as much as 25,000% over the past two years.</p><p>In January 2025, Google deployed a technological protection measure called SearchGuard. SearchGuard works by sending JavaScript challenges to incoming search queries. For regular browser users, the challenge is invisible: the browser runs the JavaScript, sends back the expected response, and the search results load normally. For automated systems, the challenge is a wall. Bots that cannot execute JavaScript or that fail behavioral checks get blocked.</p><p>According to Google&#8217;s complaint, SerpApi&#8217;s response to SearchGuard was to build circumvention mechanisms. 
The complaint alleges that SerpApi creates &#8220;fake browsers using a multitude of IP addresses that Google sees as normal users,&#8221; misrepresents device and location information when solving challenges, and syndicates authorization tokens from legitimate requests to unauthorized machines around the world. Google also alleges that SerpApi uses automated means to bypass CAPTCHAs that SearchGuard deploys as a secondary verification layer. SerpApi disputes these factual allegations.</p><p>The complaint cites SerpApi&#8217;s own blog posts, where the company reportedly described SearchGuard as making &#8220;web scraping more difficult&#8221; but claimed to be &#8220;fortunate to be minimally impacted&#8221; because its services had &#8220;already pre-solved Google&#8217;s JavaScript challenge.&#8221;</p><h2>The Legal Theory</h2><p>This is where it gets interesting for the scraping industry, because Google chose not to sue under the Computer Fraud and Abuse Act (CFAA). That would have been the traditional route. Instead, Google went with the DMCA.</p><p>The context matters. The CFAA path has been significantly narrowed by the hiQ Labs v. LinkedIn case. In that landmark decision, the Ninth Circuit held that scraping publicly available data does not violate the CFAA, and warned against allowing companies to create &#8220;information monopolies.&#8221; The Supreme Court vacated and remanded the case under its Van Buren ruling, but on remand, the Ninth Circuit reaffirmed its original position.</p><p>After hiQ, the CFAA is a much weaker weapon against scraping of publicly visible content. Google needed a different legal framework. Section 1201 of the DMCA provides one.</p><p>Section 1201 has two relevant provisions. The first, Section 1201(a)(1)(A), prohibits the act of circumventing a technological measure that effectively controls access to a copyrighted work. The second, Section 1201(a)(2), prohibits trafficking in technology designed to circumvent such measures. 
Google&#8217;s complaint invokes both.</p><p>The argument chain goes like this: Google&#8217;s search results contain copyrighted content, specifically images in Knowledge Panels licensed from third parties, merchant-supplied product images in Google Shopping, and licensed content from Google Maps. SearchGuard is a technological measure that controls access to these search results pages (and therefore to the copyrighted works within them). SerpApi circumvents SearchGuard. Therefore, SerpApi violates Section 1201.</p><p>Each act of circumvention carries statutory damages of between $200 and $2,500. Google alleges billions of individual circumventions. Do the math, and the potential damages exceed what SerpApi could ever pay. Google itself notes in the complaint that SerpApi &#8220;reportedly earns a few million dollars in annual revenue, but already faces liability that is orders of magnitude higher and growing.&#8221;</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>SerpApi&#8217;s Position</h2><p>When we reached out to SerpApi, they were clear about their stance. On the fundamental legality of what they do, SerpApi told us: &#8220;<em>We embrace the term &#8216;scraping,&#8217; and we practice it legally and transparently. SerpApi accesses publicly visible search results, the same ones available to any browser, and delivers clean, structured JSON back to our customers. 
We&#8217;ve operated this way since 2017, serving developers, researchers, and businesses who need reliable access to public information at scale.&#8221;</em></p><p>On the legal boundaries of automated access to search results, their position is equally direct: &#8220;<em>The law on this is clear, and we&#8217;re prepared to defend that position in court. Scraping is legal, and we stand behind our products and customers. Our API replicates real-time searches with no login, no bypass of any paywall, and no access to anything that isn&#8217;t already available to anyone with a browser. U.S. courts have upheld this repeatedly; hiQ Labs v. LinkedIn is a key precedent. The data Google surfaces lives on the open web. Google didn&#8217;t create it.</em>&#8221;</p><p>In February 2026, <a href="https://serpapi.com/blog/google-v-serpapi-motion-to-dismiss-why-were-in-the-right/">SerpApi filed a motion to dismiss</a>. Their arguments include the assertion that the DMCA is a copyright protection statute, not a website protection statute, and that Google is improperly trying to use it to control access to public portions of its website. They also argue that mimicking browser behavior to access publicly available pages is not the same as cracking encryption or disabling authentication, and that any ambiguity in the definition of "circumvention" must be given its narrowest reasonable reading, citing the "First Amendment interest in maintaining accessibility of the Internet as an open forum."</p><p>SerpApi also pointed out what they see as an absurdity in Google&#8217;s theory. If statutory damages were calculated at scale, the total &#8220;would exceed U.S. GDP.&#8221; Congress, they argue, never intended Section 1201 to be used this way.</p><p>On the DMCA claim specifically, SerpApi told us: &#8220;<em>The DMCA&#8217;s anti-circumvention provision was designed to protect copyrighted works, full stop. Google is not protecting access to copyrighted works. 
Google is improperly attempting to use the DMCA to limit access to the public portions of its website. We believe that the law is on our side.</em>&#8221;</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>The Hypocrisy Argument</h2><p>SerpApi is not shy about making this point. <a href="https://serpapi.com/blog/google-v-serpapi-threatening-access-to-public-data/">In a blog post about the lawsuit</a>, they argue that Google&#8217;s case threatens access to public data on the open internet and this resonates widely in the scraping community. As they told us: &#8220;<em>Google indexed the web without anyone&#8217;s permission. That&#8217;s how search works. Now it&#8217;s trying to pull up the ladder behind it, prohibiting the practices that it used, and still uses today, to build its business empire. That&#8217;s why SerpApi is standing up to Google. Not just to protect our business, but to protect legal competition and open access to public information on the internet.</em>&#8221;</p><p>Google Search operates by crawling, indexing, and presenting content from billions of websites. Many of those website owners never explicitly consented to being indexed. Google&#8217;s position has always been that robots.txt provides the mechanism for opting out, and that the default state of the open web is crawlable. Now Google is arguing that its own search results should be exempt from the same logic.</p><p>The irony is not lost on legal commentators either. 
<a href="https://abovethelaw.com/2025/12/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google">Above the Law</a> described the case as Google &#8220;<em>pulling up the ladder after climbing it.</em>&#8221; <a href="https://blog.ericgoldman.org/archives/2026/01/relitigating-hiq-labs-and-scraping-through-the-lens-of-the-dmca-1201-anti-circumvention-guest-blog-post.htm">Eric Goldman&#8217;s blog published an extensive guest analysis</a> arguing that Google&#8217;s DMCA strategy represents an attempt to relitigate hiQ Labs through a different statutory framework.</p><h2>Why This Matters Beyond SerpApi</h2><p>If Google&#8217;s legal theory prevails, the implications extend far beyond one API company. The core question is whether deploying an anti-bot system on a publicly accessible website is enough to invoke federal copyright law against anyone who bypasses it.</p><p>Think about what that means in practice. Every CAPTCHA, every JavaScript challenge, every behavioral analysis system deployed on a public website could potentially become a &#8220;technological protection measure&#8221; under Section 1201. Any scraper that solves a CAPTCHA, executes JavaScript to render a page, or rotates IP addresses to avoid detection could be committing a federal offense.</p><p>This is not hypothetical. The legal theory applies to any website that hosts copyrighted content (which is almost all of them) and deploys some form of bot detection (which is increasingly all of them).</p><p>Eric Goldman&#8217;s blog highlighted this exact concern. 
<a href="https://blog.ericgoldman.org/archives/2026/01/relitigating-hiq-labs-and-scraping-through-the-lens-of-the-dmca-1201-anti-circumvention-guest-blog-post.htm">The guest analysis by Kieran McCarthy</a> warns that accepting Google&#8217;s theory would allow any website deploying anti-bot technology to invoke federal law against circumvention, &#8220;transforming speed bumps and CAPTCHAs into federally enforceable copyright perimeters.&#8221;</p><p>The <a href="https://www.eff.org/">Electronic Frontier Foundation</a> has also weighed in. Staff attorney Tori Noble stated that &#8220;the right to scrape publicly available information keeps the Internet free and open,&#8221; cautioning that overly broad DMCA interpretations undermine innovation and research.</p><p>SerpApi made a similar point when we asked about the impact on consumers: &#8220;<em>Scraping-powered services benefit all kinds of consumers who use the web every day. Scraping helps to maintain the free and open flow of information across the internet, ultimately encouraging things like price transparency, competition, and informed decision-making, all to benefit consumers. Expanding the DMCA as Google has suggested would only benefit the largest tech incumbents and hinder transparency and healthy competition.</em>&#8221;</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The Web Scraping Club is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>The Emerging Legal Pattern</h2><p>Google&#8217;s lawsuit does not exist in isolation. In October 2025, <a href="https://copyrightalliance.org/wp-content/uploads/2025/10/Reddit-v.-SerpApi.pdf">Reddit filed a 41-page complaint</a> against SerpApi, Perplexity AI, Oxylabs, and AWMProxy in the Southern District of New York. The complaint is far more aggressive than Google&#8217;s, both in tone and in scope: six legal counts including three separate DMCA claims, unfair competition, unjust enrichment, and civil conspiracy.</p><p>Reddit&#8217;s framing is vivid. It describes the defendants as &#8220;similar to would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead.&#8221; AWMProxy is characterized as &#8220;a former Russian botnet.&#8221; Perplexity is compared to &#8220;a North Korean hacker.&#8221; The language is clearly designed to make scrapers look like criminals.</p><p>The underlying theory is similar to Google&#8217;s. Reddit has signed licensing deals with both <a href="https://blog.google/inside-google/company-announcements/expanded-reddit-partnership/">Google</a> and <a href="https://openai.com/index/openai-and-reddit-partnership/">OpenAI</a> to grant them programmatic access to its data. Companies that want Reddit content at scale are expected to pay for it. But when scrapers circumvent SearchGuard to harvest Google&#8217;s search results, they also harvest Reddit content without paying a cent. 
According to data Reddit obtained through a subpoena to Google, the three scraping defendants accessed almost three billion Google SERPs containing Reddit content in just two weeks during July 2025. SerpApi alone accounted for over 1.8 billion of those page accesses. Like Google, Reddit did not send SerpApi any communication before filing suit. SerpApi disputes these figures and the other factual allegations in Reddit&#8217;s complaint, and has filed a motion to dismiss in that case as well.</p><p>Reddit also produced a piece of evidence that reads like a detective novel. It created a hidden &#8220;test post&#8221; that could only be crawled by Google&#8217;s search engine and was not otherwise accessible anywhere on the internet. Within hours, the contents of that post appeared in Perplexity&#8217;s &#8220;answer engine.&#8221; The only way Perplexity could have obtained that content was through scraping Google&#8217;s search results. Reddit calls this technique the equivalent of &#8220;marked bills&#8221; in a bank robbery investigation.</p><p>The Reddit complaint also reveals a detail that connects directly to our industry: after Reddit sent a cease-and-desist letter to Perplexity in May 2024, Perplexity&#8217;s citations to Reddit content did not decrease. They increased forty-fold.</p><p>And in December 2025, in Ziff Davis v. OpenAI, a federal judge in the Southern District of New York ruled that robots.txt files do not &#8220;effectively control access&#8221; under Section 1201. Judge Sidney Stein compared robots.txt to a &#8220;keep off the grass&#8221; sign that &#8220;relies on readers to decide to comply rather than enforcing any kind of access control itself.&#8221; The ruling is important because it sets a baseline: passive, voluntary measures are not enough to trigger DMCA protection.</p><p>But SearchGuard is not robots.txt. 
It is an active system that executes JavaScript, performs behavioral analysis, deploys CAPTCHAs, and makes real-time decisions about whether to grant access. Whether this kind of system meets the &#8220;effectively controls access&#8221; standard is the open legal question. The answer will likely set the direction for the entire industry.</p><p><a href="https://blog.ericgoldman.org/archives/2026/01/relitigating-hiq-labs-and-scraping-through-the-lens-of-the-dmca-1201-anti-circumvention-guest-blog-post.htm">Legal commentators</a> have identified what they call the &#8220;DMCA 1201 scraping strategy&#8221;: platforms deploy technological protection measures specifically to create legal standing under Section 1201, then sue when those measures are circumvented. The sequence is intentional. Deploy, document, sue. Whether courts view this as legitimate copyright protection or as <a href="https://abovethelaw.com/2025/12/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google/">strategic rent-seeking</a> will determine the outcome.</p><p>There is also a relevant doctrinal debate. The Lexmark case in the Sixth Circuit introduced the &#8220;front door/back door&#8221; argument: if a house&#8217;s front door is unlocked, putting a lock on the back door does not mean the house is &#8220;access-controlled.&#8221; Applied here: if anyone with a regular browser can access Google Search results, does deploying SearchGuard against automated systems meaningfully &#8220;control access&#8221; to the copyrighted works within those results?</p><h2>The AI Angle</h2><p>There is one more layer worth noting. <a href="https://searchengineland.com/openai-chatgpt-serpapi-google-search-results-461226">As Search Engine Land reported</a>, OpenAI used SerpApi to scrape Google Search results for ChatGPT responses on current events, after Google declined to provide direct access to its search index. 
SerpApi listed OpenAI as a customer on its website as recently as May 2024 before removing the listing. Other reported customers include Meta, Apple, and Perplexity.</p><p>This context matters because Google already has a massive structural advantage in the AI race when it comes to fresh web data. <a href="https://finance.yahoo.com/news/google-huge-edge-over-openai-110102636.html">Cloudflare CEO Matthew Prince put numbers on it</a>: &#8220;For every one page that OpenAI sees, Google is seeing 3.2 pages.&#8221; Against Microsoft, the ratio is 4.8 to 1. The reason is simple. Publishers cannot block Googlebot without disappearing from search results. So Google gets access to the web at a scale that no competitor can match, and it can use that data not just for search but also for training and running its AI products.</p><p>In this context, suing companies that make it easier for competitors to scrape Google&#8217;s search results is not just about protecting copyrighted images in Knowledge Panels. It is also an act of defense of a competitive advantage. If OpenAI or any other AI company can get structured search data through SerpApi, they partially close the gap that Google&#8217;s crawler monopoly creates. Shutting down that channel through litigation serves Google&#8217;s position in the AI race, even if the complaint is framed purely in terms of copyright protection.</p><h2>What Happens Next</h2><p>The case is still in its early stages. <a href="https://ppc.land/serpapi-files-motion-to-dismiss-googles-dmca-scraping-lawsuit/">SerpApi filed its motion to dismiss</a> on February 20, 2026. 
<a href="https://www.courtlistener.com/docket/72059948/google-llc-v-serpapi-llc/">According to the court docket</a>, the initial case management conference before Judge Yvonne Gonzalez Rogers is scheduled for March 30, 2026, and a hearing on the motion to dismiss is set for May 19, 2026.</p><p>If the motion to dismiss fails and the case proceeds to discovery and trial, it will force courts to answer questions that have been left open since hiQ. Is a JavaScript challenge a &#8220;technological protection measure&#8221; under the DMCA? Can anti-bot systems on publicly accessible websites invoke federal anti-circumvention law? Does the DMCA protect the act of accessing a public webpage, or only the copyrighted works behind genuine access controls like encryption and authentication?</p><p>For the scraping industry, the stakes are high. A ruling in Google&#8217;s favor would give any website with copyrighted content and a bot-detection system a federal cause of action against scrapers. A ruling in SerpApi&#8217;s favor would confirm that the DMCA was not designed to protect public webpages from automated access, regardless of the technical measures deployed.</p><p>We will follow the case closely. Whatever happens, the days of operating in a legal gray area are coming to an end. The courts will have to draw a line, and that line will define the rules for the next decade of web scraping.</p><p><em>Disclaimer: We are not lawyers. This article represents our analysis of publicly available court filings and legal commentary. 
Consult legal counsel for advice specific to your situation.</em></p>]]></content:encoded></item><item><title><![CDATA[THE LAB #99: HTTP Caching for Web Scraping]]></title><description><![CDATA[How Conditional Requests Can Cut Your Proxy Bill, using HTTP caching.]]></description><link>https://substack.thewebscraping.club/p/http-caching-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/http-caching-scraping</guid><dc:creator><![CDATA[Pierluigi Vinciguerra]]></dc:creator><pubDate>Thu, 05 Mar 2026 15:18:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5c39bf0e-6c50-4c30-bb29-fe68b7b616d5_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the biggest cost drivers in recurring scraping operations is fetching pages once or several times a day, especially through proxies, only to discover that they have not changed since the last run. <br>In price monitoring applications this is fairly common: if you are monitoring prices every hour across 50,000 product pages, most of them very likely still show the same price they showed an hour ago. You are paying your proxy provider for bandwidth that carries identical data, over and over.</p><p>The scraping industry is well aware of this problem. A <a href="https://scrapeops.io/blog/scraping-shock/">recent analysis by ScrapeOps</a> found that even though proxy prices have dropped by 67% over the past five years, the cost per successful payload has actually increased by 133%, mostly because anti-bot defenses now require heavier infrastructure. 
When each request costs more, wasting them on unchanged pages hurts even more.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><p>Several approaches try to solve this. Tools like <a href="http://changedetection.io">changedetection.io</a> monitor pages for visual or structural changes and alert you when something is different. 
On the more technical side, <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Altay Akkus&quot;,&quot;id&quot;:272178059,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13f918be-3a3b-4cc1-b442-cd912cb5efbe_144x144.png&quot;,&quot;uuid&quot;:&quot;dcac9616-c6b5-4389-950b-847aa67e589d&quot;}" data-component-name="MentionToDOM"></span> <a href="https://altayakkus.substack.com/p/partial-content-web-crawling-using">recently explored</a> using SimHash as a client-side fingerprint to determine whether a document has changed since the last crawl, without downloading the full body. These are valid strategies, but they all share one trait: they require you to build and maintain the change detection logic yourself.</p><p>What you might not know is that the HTTP protocol already has a native mechanism for this, and it has been part of the spec since 1999. It is called conditional requests, and it lets the server itself tell your scraper &#8220;nothing has changed&#8221; by responding with a 304 status and zero bytes of body. No diffing, no hashing, no client-side state management beyond storing a single header value.</p><p>We have written about proxy cost optimization before in articles like <a href="https://substack.thewebscraping.club/p/optimizing-proxy-costs">Optimizing Proxy Usage for Large-Scale Scraping</a> and <a href="https://substack.thewebscraping.club/p/analyzing-cost-web-scraping">Analyzing the Cost of a Web Scraping Project</a>, but we have never covered this technique. In this article, we will test it against real e-commerce sites and measure exactly how much bandwidth and money it can save.</p><div><hr></div><blockquote><p><em>For your scraping needs, having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try 
Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>How HTTP caching works (the short version)</h2><p>When a web server responds to a request, it can include headers that describe the freshness and identity of the content. Two of these headers are relevant for our purposes.</p><p><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/ETag">The first is </a><code>ETag</code><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/ETag">, short for Entity Tag</a>. It is a string that uniquely identifies a specific version of a resource. Think of it as a fingerprint of the page content. When the content changes, the ETag changes.</p><p><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Last-Modified">The second is </a><code>Last-Modified</code>, a timestamp indicating when the resource was last updated.</p><p>These two headers enable what HTTP calls conditional requests. The idea is simple. After your first request, you store the ETag (or the Last-Modified value) returned by the server. On the next request to the same URL, you send it back using the <code>If-None-Match</code> header (for ETags) or <code>If-Modified-Since</code> (for timestamps). The server compares your stored value with the current one. If they match, the server responds with status 304 Not Modified and an empty body. If they do not match, you get a regular 200 response with the fresh content.</p><p>A 304 response contains zero bytes of body. For a proxy billed per GB, that is a request that costs almost nothing in bandwidth.</p><h2>The tools we used</h2><p>The HTTP caching technique itself is protocol-level and works with any HTTP client that allows setting custom headers. 
You could implement it with Python&#8217;s <code>requests</code>, <code>httpx</code>, or even raw <code>curl</code>.</p><p>For this article, we used <a href="https://github.com/lexiforest/curl_cffi">curl_cffi</a>, a Python HTTP client built on top of curl-impersonate. Its main strength for our purposes is TLS fingerprint impersonation: it can reproduce the TLS handshake of real browsers (Chrome, Firefox, Safari), which helps keep e-commerce sites from blocking the request before we even get to test caching behavior. Without it, some of the e-commerce targets we wanted to test would have returned 403 immediately, making it impossible to evaluate their caching support.</p><p>Later in the article, we&#8217;ll see whether the same approach works with Scrapy.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h2>The audit methodology</h2><p>Before attempting conditional requests, we need to check whether a target supports them. We wrote a simple audit function that makes up to two requests to a given URL.</p><p>The first request is a standard GET. We capture the <code>ETag</code>, <code>Last-Modified</code>, and <code>Cache-Control</code> headers from the response, along with the response body size.</p><p>If an ETag or Last-Modified header is present, we make a second request with the corresponding conditional header (<code>If-None-Match</code> or <code>If-Modified-Since</code>). 
If the server responds with 304, the site supports conditional requests and we measure the bandwidth saving.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;38438c17-6240-4fb7-b820-bcab5f5bf7d7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import time
from curl_cffi import requests


def audit_caching(url: str) -&gt; dict:
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }

    resp = requests.get(url, headers=headers, impersonate="chrome", timeout=30)

    # Header names are case-insensitive, so normalise them before the lookups.
    resp_headers = {k.lower(): v for k, v in resp.headers.items()}
    etag = resp_headers.get("etag")
    last_modified = resp_headers.get("last-modified")
    cache_control = resp_headers.get("cache-control")
    response_size = len(resp.content)

    result = {
        "url": url,
        "status": resp.status_code,
        "etag": etag,
        "last_modified": last_modified,
        "cache_control": cache_control,
        "response_size_bytes": response_size,
        "supports_304": False,
    }

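    # Only replay the request when the server gave us a validator to send back.
    if etag or last_modified:
        time.sleep(2)  # brief pause between the two requests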

        cond_headers = dict(headers)
        if etag:
            cond_headers["If-None-Match"] = etag
        if last_modified:
            cond_headers["If-Modified-Since"] = last_modified

        cond_resp = requests.get(
            url, headers=cond_headers, impersonate="chrome", timeout=30
        )

        result["conditional_status"] = cond_resp.status_code
        result["conditional_size_bytes"] = len(cond_resp.content)
        result["supports_304"] = cond_resp.status_code == 304

    return result</code></pre></div><p><a href="https://github.com/TheWebScrapingClub/thelab">You can find the code in our GitHub repository reserved for paying users, inside the folder </a><strong><a href="https://github.com/TheWebScrapingClub/thelab">99.CONDITIONAL_SCRAPING</a>.</strong></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.thewebscraping.club/p/consulting&quot;,&quot;text&quot;:&quot;Need help with your scraping project?&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.thewebscraping.club/p/consulting"><span>Need help with your scraping project?</span></a></p><div><hr></div><h2>Shopify stores: full conditional request support</h2><p>We focused our testing on Shopify stores because, while working on various scraping projects, we came across several Shopify-hosted sites that had this caching system enabled. Shopify powers hundreds of thousands of online stores and is one of the most common scraping targets in e-commerce, so the finding felt worth investigating systematically. The results were clear: Shopify stores with the native page cache enabled support conditional requests out of the box.</p><p>Allbirds, Kylie Cosmetics, and Brooklinen all returned 304 responses consistently. Here is what we measured on Allbirds:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;df447db3-cd85-4495-a2ce-1e3336e6b09e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">URL: https://www.allbirds.com/products/mens-tree-runners.json
Status: 200
Response size: 7,961 bytes

Caching headers:
  ETag: "page_cache:11044168:ProductDetailsController:de822deb7906aa6f9932541f4fe3dae9"
  Last-Modified: not present
  Cache-Control: not present

Conditional request support:
  304 Not Modified: YES
  Conditional response size: 0 bytes
  Bandwidth saving: 100.0%</code></pre></div><p>The saving is 100% of the body because the 304 response carries no body at all. The only remaining cost is the request and response headers, which are a few hundred bytes.</p><p>This behavior was consistent across three types of Shopify endpoints. The Product HTML page is the standard storefront URL that a browser would load (e.g. <code>/products/mens-tree-runners</code>), which includes the full rendered page with images, reviews, and theme assets. The Product JSON endpoint is the same URL with <code>.json</code> appended (e.g. <code>/products/mens-tree-runners.json</code>), which returns only the structured product data: variants, prices, inventory, and metadata. The Catalog JSON endpoint (<code>/products.json</code>) returns the first page of the store&#8217;s entire product catalog in a single response.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sMAm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sMAm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 424w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 848w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sMAm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sMAm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png" width="914" height="159" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:159,&quot;width&quot;:914,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39412,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/189924926?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sMAm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 424w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 848w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 
1272w, https://substackcdn.com/image/fetch/$s_!sMAm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e0d43a-9f08-48e6-95e2-026361410d18_914x159.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We ran repeated conditional requests on each endpoint and confirmed that all returned 304 consistently. The ETag stayed stable as long as the product data did not change.</p>
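<p>A minimal sketch of how that stable ETag could be reused between scraping runs, assuming you persist the validators in a local JSON file. The helper names (<code>load_etags</code>, <code>save_etags</code>, <code>fetch_if_changed</code>) and the injected <code>http_get</code> callable are hypothetical and not part of the audit script above; any client whose response exposes <code>status_code</code>, <code>headers</code>, and <code>content</code> should fit.</p>

```python
import json
from pathlib import Path


def load_etags(path):
    """Read the url-to-ETag map saved by a previous run (empty if none)."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_etags(path, etags):
    """Persist the url-to-ETag map so the next run can send If-None-Match."""
    Path(path).write_text(json.dumps(etags, indent=2))


def fetch_if_changed(url, etags, http_get):
    """Return the fresh body, or None when the server answers 304.

    http_get is a hypothetical callable(url, headers) returning an object
    with .status_code, .headers and .content.
    """
    headers = {}
    if url in etags:
        # Replay the validator we stored the last time we saw this URL.
        headers["If-None-Match"] = etags[url]

    resp = http_get(url, headers)
    if resp.status_code == 304:
        return None  # unchanged: keep using the data already extracted

    # Header names are case-insensitive, so normalise before the lookup.
    new_etag = {k.lower(): v for k, v in resp.headers.items()}.get("etag")
    if resp.status_code == 200 and new_etag:
        etags[url] = new_etag
    return resp.content
```

<p>With curl_cffi, <code>http_get</code> could be a thin wrapper around the same <code>requests.get(url, headers=headers, impersonate="chrome", timeout=30)</code> call used in the audit function. Passing the client in as a parameter keeps the caching logic independent of the HTTP library.</p>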
      <p>
          <a href="https://substack.thewebscraping.club/p/http-caching-scraping">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Kadoa: Simplify Your Scraping Workflows with Automation and AI]]></title><description><![CDATA[My review of Kadoa: An AI-powered tool that lets you create scraping workflows in minutes]]></description><link>https://substack.thewebscraping.club/p/kadoa-review-ai-powered-scraping</link><guid isPermaLink="false">https://substack.thewebscraping.club/p/kadoa-review-ai-powered-scraping</guid><dc:creator><![CDATA[Federico Trotta]]></dc:creator><pubDate>Sun, 01 Mar 2026 12:34:31 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ad9731e7-7825-4d82-afea-27d4bd727905_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The web scraping industry has evolved quickly in recent years. That <a href="https://substack.thewebscraping.club/p/the-evolving-career-scrapers">web scraping professionals needed to pivot their careers from scripts to agents</a> is only one sign of how resilient this industry is. The industry has changed not only because of AI, which is relatively recent, but also because of developments in infrastructure, bot detection, and more.</p><p>A wide range of tools and libraries for the main programming languages has driven significant growth in web scraping, and companies&#8217; need for data is what sustains that growth.</p><p>In this article, I&#8217;ll talk about Kadoa, a tool that lets you create resilient scraping workflows in minutes. 
I&#8217;ll show you its strengths, why you should consider it, and how it works, with a practical guide.</p><p>Let&#8217;s dive into it!</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LobM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png" width="526" height="296.2362637362637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:1650775,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/164781961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LobM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 424w, https://substackcdn.com/image/fetch/$s_!LobM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 848w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1272w, https://substackcdn.com/image/fetch/$s_!LobM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526bd4b2-5ff9-4fa2-9bf7-d5fb0134e98c_1920x1081.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to <strong>1 TB of web unblocker</strong> for free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://netnut.io/unblocker&quot;,&quot;text&quot;:&quot;Claim your offer&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://netnut.io/unblocker"><span>Claim your offer</span></a></p></blockquote><blockquote><div><hr></div></blockquote><h2>What is Kadoa?</h2><p><a href="https://www.kadoa.com/">Kadoa</a> is a web scraping tool that automatically and programmatically extracts web data at scale. 
You can use it either via the UI or via code, as it has SDKs and provides you with REST APIs.</p><p>The best part of using it is that you can just paste the target URL and the tool retrieves the data for you. Forget about anti-bot measures, fingerprinting issues, or proxy management: Kadoa handles all of that for you. Also, thanks to its AI engine, it can automatically recognize the structure of the data you want to scrape from a target website. So you can also say goodbye to <a href="https://substack.thewebscraping.club/p/scraping-with-llms-gpt-vision">CSS selectors and any other strategy you use to go beyond the DOM using LLMs</a>.</p><div><hr></div><blockquote><p><em>Using the right tool is just the first step toward a successful data extraction pipeline. Having a reliable proxy provider like <strong>Decodo</strong> on your side improves the chances of success. </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png" width="401" height="225.5625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:554497,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/187315628?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 424w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!qJZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1272w, https://substackcdn.com/image/fetch/$s_!qJZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec072bc-2ac2-49bf-8084-43858298808f_800x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://visit.decodo.com/WyQ3mA&quot;,&quot;text&quot;:&quot;Try Decodo Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://visit.decodo.com/WyQ3mA"><span>Try Decodo Now</span></a></p></blockquote><div><hr></div><h2>Why Consider Kadoa for Your Web Scraping Projects?</h2><p>The top reasons why you should consider Kadoa are the following:</p><ul><li><p><strong>Scrape via workflows</strong>: Kadoa&#8217;s UI is built to help you set up scraping workflows step by step. Insert your target URL(s), define the data schema (or let AI do the work for you), and choose whether to scrape all the available pages or to stay on the current page and watch the agent work for you.</p></li><li><p><strong>Write code only if you need it</strong>: Besides the UI, Kadoa provides you with Python and JavaScript SDKs and a wide set of REST APIs you can call. This allows you to create workflows via the UI and then manage and call them via code if you need to.</p></li><li><p><strong>Integrated data quality management</strong>: Before you start scraping your target data, Kadoa allows you to manage data quality. 
In practice, you can define your own data quality rules or manage the rules its AI agent provides.</p></li><li><p><strong>Easy proxy management</strong>: If you&#8217;ve been scraping for a while, you know that your chances of successfully scraping most of the content you need are low without proxies. Using proxies is not a big issue if you are used to them and already have a favourite provider. Still, Kadoa simplifies proxy management: it provides a list of countries you can choose from and, under the hood, manages everything needed to integrate proxies into your workflow.</p></li><li><p><strong>Scheduling feature</strong>: There are cases where you need to scrape the same target data from time to time, or where you want to be notified when data on a target page has changed. Kadoa provides both features. You can schedule your workflow to scrape at precise time intervals, and you can choose among different notifications, one of which fires when the data has changed.</p></li></ul><h2>Kadoa&#8217;s Main Features</h2><p>Below is a list of Kadoa&#8217;s top features to help you better understand its potential:</p><ul><li><p><strong>Simple and intuitive UI</strong>: Kadoa&#8217;s UI is simple and intuitive, and it allows you to create workflows in minutes. Every scraping workflow is divided into steps, each with its own screen. In a few minutes, you can define your preferred setup, insert the target page(s), and leave Kadoa to scrape for you.</p></li><li><p><strong>Chrome extension</strong>: Besides the UI, <a href="https://www.kadoa.com/chrome-extension">Kadoa provides you with a Chrome extension</a>. 
If you are a Chrome user, this feature allows you to define everything you need directly on the target page, then trigger the workflow to let Kadoa&#8217;s agent start scraping.</p></li><li><p><strong>Code integrations</strong>: If you are a developer or if you simply need to invoke your workflows via code, Kadoa offers you two possibilities. It provides you with <a href="https://github.com/kadoa-org/kadoa-sdks">Python and JavaScript SDKs</a> in an open-source repository, so that you can use custom code to invoke your scrapers. Also, if you like to use code but prefer <a href="https://docs.kadoa.com/api-reference/introduction">REST APIs, Kadoa provides you with several endpoints</a>.</p></li><li><p><strong>Scraping suitable for structured or unstructured data</strong>: One of the difficult aspects you may encounter when manually scraping websites is defining how to grab unstructured data. This is one of the typical use cases where you could <a href="https://substack.thewebscraping.club/p/detect-pattern-scraped-data-with-ai">use AI to detect patterns in data in your scraping projects</a>. The good news is that you don&#8217;t need to come up with imaginative solutions. Kadoa automatically retrieves unstructured data for you thanks to its AI engine.</p></li><li><p><strong>Data schemas definition</strong>: The tool provides you with a feature that allows you to define recurrent data structures. This can be helpful when you retrieve similar data from different websites. If you leave its AI engine to automatically define the data structure, in such cases, you could lose consistency across similar data.</p></li><li><p><strong>Proxy and anti-detection features</strong>: Forget about anti-bot measures and proxy management. Kadoa manages anti-bot solutions under the hood. 
It also provides you with a predefined list of locations you can choose from, and it will automatically select matching proxies.</p></li><li><p><strong>Error handling</strong>: It provides you with advanced error handling. Common cases are when the target site goes offline, is under maintenance, or encounters a technical issue. When this happens, Kadoa detects the problem, notifies you, and automatically retries the data extraction. If recovery still fails, its support team is notified and investigates.</p></li><li><p><strong>Integration capabilities</strong>: The software allows you to integrate with several third parties. A notable example is the <a href="https://n8n.io/integrations/kadoa/">integration between n8n and Kadoa</a>, which allows you to take your scraping automation workflow a step further.</p></li><li><p><strong>Pricing model and usage graphs</strong>: Kadoa offers a <a href="https://www.kadoa.com/pricing">free tier option</a> that includes 500 credits. Its pricing model is based on credit consumption, and it provides you with a UI section where you can see a graph of your consumption.</p></li><li><p><strong>Extensive docs</strong>: <a href="https://docs.kadoa.com/docs/introduction">Kadoa has extensive documentation</a> that covers both UI and API usage.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@thewebscrapingclub&quot;,&quot;text&quot;:&quot;Check the TWSC YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.youtube.com/@thewebscrapingclub"><span>Check the TWSC YouTube Channel</span></a></p><div><hr></div><h3>Hands-on Kadoa: Step-by-step Scraping Tutorial</h3><p>In this section, I&#8217;ll show you how to use Kadoa on an actual scraping task via the UI. 
The workflow will retrieve <a href="https://finance.yahoo.com/quote/INTC/history/?period1=1737538396&amp;period2=1769074385&amp;frequency=1wk">Intel&#8217;s historical price from Yahoo Finance</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XhPg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XhPg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 424w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 848w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1272w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XhPg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png" width="1106" height="686" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1106,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138487,&quot;alt&quot;:&quot;Intel historical stock price data, image from their website taken by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Intel historical stock price data, image from their website taken by Federico Trotta" title="Intel historical stock price data, image from their website taken by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!XhPg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 424w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 848w, https://substackcdn.com/image/fetch/$s_!XhPg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XhPg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d40113-8cb0-484d-8353-621aea2896c0_1106x686.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Intel historical stock price data</figcaption></figure></div><p>In this scraping workflow, I will:</p><ul><li><p>Set the target web page.</p></li><li><p>Define the data schema.</p></li><li><p>Set scheduling options and notifications.</p></li><li><p>Retrieve the actual data.</p></li></ul><p>Before starting the actual workflow, log in to Kadoa. 
Below is the first access page you will see:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!taKn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!taKn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 424w, https://substackcdn.com/image/fetch/$s_!taKn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 848w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1272w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!taKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png" width="1456" height="705" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:705,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:166980,&quot;alt&quot;:&quot;Kadoa's first access page by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa's first access page by Federico Trotta" title="Kadoa's first access page by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!taKn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 424w, https://substackcdn.com/image/fetch/$s_!taKn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 848w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1272w, https://substackcdn.com/image/fetch/$s_!taKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a5107e-6935-4e6a-8310-1fc9049d532e_1897x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Kadoa's first access page</figcaption></figure></div><p>Perfect! You are now ready to create your first scraping workflow with Kadoa.</p><h3>Step #1: Create a New Workflow</h3><p>From the main page, click on <strong>Add workflow</strong> to create a new one and paste the target URL. The <strong>Proxy location</strong> box lets you select the country where the proxies are located; leave it set to <strong>AUTO</strong> to let the tool manage it automatically. 
Click on <strong>Continue</strong> to proceed with the next step:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nd3l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nd3l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 424w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 848w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1272w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png" width="1456" height="713" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:713,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240236,&quot;alt&quot;:&quot;A new workflow in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A new workflow in Kadoa by Federico Trotta" title="A new workflow in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Nd3l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 424w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 848w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1272w, https://substackcdn.com/image/fetch/$s_!Nd3l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe04bb38-0e05-43e3-b128-9fd7a9009780_1885x923.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">A new workflow in Kadoa</figcaption></figure></div><p>Note that you have to enter the target page in the <strong>Enter one or more URLs</strong> box. If you have more than one target page, you can enter all of them there.</p><p>Alright, you have created a new workflow in Kadoa. 
Let&#8217;s proceed with the next step and customize it!</p><h3>Step #2: Define the Data Schema</h3><p>As the next step, define the data schema:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ib01!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ib01!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 424w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 848w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1272w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ib01!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png" width="1456" height="681" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8febc847-1387-489a-815c-cb8fa342897e_1840x861.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:681,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:200560,&quot;alt&quot;:&quot;Define the data schema in a Kadoa workflow by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Define the data schema in a Kadoa workflow by Federico Trotta" title="Define the data schema in a Kadoa workflow by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Ib01!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 424w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 848w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1272w, https://substackcdn.com/image/fetch/$s_!Ib01!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8febc847-1387-489a-815c-cb8fa342897e_1840x861.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a><figcaption class="image-caption">Define the data schema in a Kadoa workflow</figcaption></figure></div><p>If you want to define the schema manually, Kadoa provides some predefined schemas to start from. For this tutorial, I chose to let the AI do the job, so I selected <strong>AI Suggest Fields</strong>.</p><p>The system then asks how you want to navigate the data. 
For this example, I decided to scrape only the current page, but Kadoa lets you choose among three different options:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2REz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2REz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 424w, https://substackcdn.com/image/fetch/$s_!2REz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 848w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1272w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2REz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png" width="1456" height="785" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:318930,&quot;alt&quot;:&quot;Scraping data on a single page in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scraping data on a single page in Kadoa by Federico Trotta" title="Scraping data on a single page in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!2REz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 424w, https://substackcdn.com/image/fetch/$s_!2REz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 848w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1272w, https://substackcdn.com/image/fetch/$s_!2REz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414c4a6e-cd39-4894-82ff-314445f92c46_1727x931.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a><figcaption class="image-caption">Scraping data on a single page in Kadoa</figcaption></figure></div><p>After clicking on <strong>Continue</strong>, the agent will start doing its job:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!05pv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!05pv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 424w, https://substackcdn.com/image/fetch/$s_!05pv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 848w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1272w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!05pv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png" width="1456" height="702" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:702,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183128,&quot;alt&quot;:&quot;Kadoa&#8217;s AI agent working by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s AI agent working by Federico Trotta" title="Kadoa&#8217;s AI agent working by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!05pv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 424w, https://substackcdn.com/image/fetch/$s_!05pv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 848w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1272w, https://substackcdn.com/image/fetch/$s_!05pv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a901e2-fc69-4982-a56a-c11144ad9346_1882x907.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 
6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kadoa&#8217;s AI agent working</figcaption></figure></div><h3>Step #3: Review Extracted Fields and Schedule the Workflow</h3><p>Because I let the AI suggest the fields, the agent automatically tries to extract the data from the target page. 
But before proceeding, Kadoa asks for your review:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uYWt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uYWt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 424w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 848w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 1272w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uYWt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png" width="1456" height="783" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:783,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:305054,&quot;alt&quot;:&quot;The proposed extraction data schema by Kadoa&#8217;s AI agent by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The proposed extraction data schema by Kadoa&#8217;s AI agent by Federico Trotta" title="The proposed extraction data schema by Kadoa&#8217;s AI agent by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!uYWt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 424w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 848w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 1272w, https://substackcdn.com/image/fetch/$s_!uYWt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d93c4-ef9a-4add-b118-ac0154723ca9_1735x933.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The proposed extraction data schema by Kadoa&#8217;s AI agent</figcaption></figure></div><p>As you can see from the previous image, the agent has correctly detected the data to extract from the target page. 
The tool also makes this review easier by showing a screenshot of the data it will extract, so you can check it visually.</p><p>In the next step, you define the scheduling:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qxMP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qxMP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 424w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 848w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1272w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qxMP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png" width="1456" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:260383,&quot;alt&quot;:&quot;Scheduling workflows in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scheduling workflows in Kadoa by Federico Trotta" title="Scheduling workflows in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!qxMP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 424w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 848w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1272w, https://substackcdn.com/image/fetch/$s_!qxMP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a33bea-0430-4208-9a08-9a19926974e7_1745x863.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Scheduling workflows in Kadoa</figcaption></figure></div><p>For the sake of this example, I decided to run the workflow only once. 
But, as you can see, you can choose among several scheduling options.</p><h3>Step #4: Set Notifications and Final Details</h3><p>As the next step, define the way you want to be notified:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CuRd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CuRd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 424w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 848w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1272w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CuRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png" width="1456" height="772" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:772,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:288124,&quot;alt&quot;:&quot;Setting up your Kadoa&#8217;s workflow latest details by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Setting up your Kadoa&#8217;s workflow latest details by Federico Trotta" title="Setting up your Kadoa&#8217;s workflow latest details by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!CuRd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 424w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 848w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1272w, https://substackcdn.com/image/fetch/$s_!CuRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca7ec8e-2743-47c0-9c5c-b49d561d1ca9_1738x922.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Setting up notifications in Kadoa</figcaption></figure></div><p>In this case, I decided to be notified via email if the workflow fails. 
You can add different notification channels by clicking on <strong>Add channel</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qp2I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qp2I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 424w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 848w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1272w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png" width="856" height="471" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e10a8400-3f90-4793-832d-01537d8ef16b_856x471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:471,&quot;width&quot;:856,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45084,&quot;alt&quot;:&quot;Adding notification channels in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Adding notification channels in Kadoa by Federico Trotta" title="Adding notification channels in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!Qp2I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 424w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 848w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1272w, https://substackcdn.com/image/fetch/$s_!Qp2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10a8400-3f90-4793-832d-01537d8ef16b_856x471.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Adding notification channels in Kadoa</figcaption></figure></div><p>Next, define the final details of your scraping workflow:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zCk9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!zCk9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 424w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 848w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1272w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zCk9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png" width="1456" height="674" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:674,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:261303,&quot;alt&quot;:&quot;Define your workflow&#8217;s latest details by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Define your workflow&#8217;s latest details by Federico Trotta" title="Define your workflow&#8217;s latest details by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!zCk9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 424w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 848w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1272w, https://substackcdn.com/image/fetch/$s_!zCk9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0345f7d-bc97-4b8c-aed1-19682dec6785_1740x806.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Define your workflow&#8217;s final details</figcaption></figure></div><p>Before the actual scraping starts, the system asks you to either approve the proposed sample data or review the data quality rules:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hpCi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hpCi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 424w, 
https://substackcdn.com/image/fetch/$s_!hpCi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 848w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1272w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hpCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png" width="1456" height="678" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:229183,&quot;alt&quot;:&quot;Decide whether to review rules or not by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Decide whether to review rules or not by Federico Trotta" title="Decide whether to review rules or not by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!hpCi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 424w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 848w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1272w, https://substackcdn.com/image/fetch/$s_!hpCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6426c06c-619a-4a60-8d8e-3a34fdd6a6d1_1897x883.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Decide whether to review data quality rules or not</figcaption></figure></div><p>Clicking <strong>Review rules</strong> shows the automated data quality rules the tool provides. Select the ones you think will improve the quality of the scraping result:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1tPf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1tPf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 424w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 848w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1272w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1tPf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png" width="1456" height="685" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:328419,&quot;alt&quot;:&quot;Reviewing data quality rules in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reviewing data quality rules in Kadoa by Federico Trotta" title="Reviewing data quality rules in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!1tPf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 424w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 848w, https://substackcdn.com/image/fetch/$s_!1tPf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1tPf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31b72d9b-26ec-44d2-aa76-d819b326a3fd_1895x892.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Reviewing data quality rules in Kadoa</figcaption></figure></div><p>When you are done reviewing quality rules, click on <strong>Approve</strong>. 
The scraping workflow is then queued and starts running:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cov2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cov2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 424w, https://substackcdn.com/image/fetch/$s_!cov2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 848w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1272w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cov2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png" width="1456" height="624" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161278,&quot;alt&quot;:&quot;New Kadoa&#8217;s workflow queued by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="New Kadoa&#8217;s workflow queued by Federico Trotta" title="New Kadoa&#8217;s workflow queued by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!cov2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 424w, https://substackcdn.com/image/fetch/$s_!cov2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 848w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1272w, https://substackcdn.com/image/fetch/$s_!cov2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a72677-04ac-4520-9bf9-e0dd337f632d_1891x811.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">New Kadoa&#8217;s workflow queued</figcaption></figure></div><p>Et voil&#224;! 
You have launched your first scraping workflow with Kadoa.</p><div><hr></div><h3>Download Data, See Logs and Statistics in Kadoa</h3><p>The <strong>workflow</strong> section reports all the workflows you created, their status, and the token consumption for each scraper:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GQl0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GQl0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 424w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 848w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GQl0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GQl0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png" width="1456" height="631" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:631,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176411,&quot;alt&quot;:&quot;Kadoa&#8217;s workflows summary and statistics by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s workflows summary and statistics by Federico Trotta" title="Kadoa&#8217;s workflows summary and statistics by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!GQl0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 424w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 848w, 
https://substackcdn.com/image/fetch/$s_!GQl0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1272w, https://substackcdn.com/image/fetch/$s_!GQl0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac1ec0d-24b1-4e9a-93c5-fd72593e342d_1893x820.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kadoa&#8217;s workflows summary and statistics</figcaption></figure></div><p>By clicking on one workflow, you can see 
the data it retrieved and choose the format in which to download it:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QgG1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QgG1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 424w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 848w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1272w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QgG1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png" width="1456" height="724" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:724,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233419,&quot;alt&quot;:&quot;Visualizing and retrieving scraped data in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Visualizing and retrieving scraped data in Kadoa by Federico Trotta" title="Visualizing and retrieving scraped data in Kadoa by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!QgG1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 424w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 848w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1272w, https://substackcdn.com/image/fetch/$s_!QgG1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c2baf44-5895-48b0-a5d9-91a421034f69_1883x936.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Visualizing and retrieving scraped data in Kadoa</figcaption></figure></div><p>The <strong>Activity log</strong> page shows detailed logs of every action performed on your workflows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UduM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UduM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 424w, https://substackcdn.com/image/fetch/$s_!UduM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 848w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1272w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UduM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png" width="1456" height="686" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178535,&quot;alt&quot;:&quot;Kadoa&#8217;s logs page by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s logs page by Federico Trotta" title="Kadoa&#8217;s logs page by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!UduM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 424w, https://substackcdn.com/image/fetch/$s_!UduM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 848w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1272w, https://substackcdn.com/image/fetch/$s_!UduM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d8c5fd9-084d-4285-85e7-0dd5f1c4b673_1903x897.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kadoa&#8217;s logs page</figcaption></figure></div><p>The <strong>Usage</strong> page charts the trend in active workflows and the number of rows extracted per workflow, and shows the total tokens remaining on your plan:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!snXw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!snXw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 424w, 
https://substackcdn.com/image/fetch/$s_!snXw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 848w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1272w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!snXw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png" width="1456" height="843" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:843,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:139493,&quot;alt&quot;:&quot;Kadoa&#8217;s tokens usage page by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Kadoa&#8217;s tokens usage page by Federico Trotta" title="Kadoa&#8217;s tokens usage page by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!snXw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 424w, https://substackcdn.com/image/fetch/$s_!snXw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 848w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1272w, https://substackcdn.com/image/fetch/$s_!snXw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9f6b75-58e6-4cc2-99cd-9867f6c98273_1585x918.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kadoa&#8217;s tokens usage page</figcaption></figure></div><div><hr></div><h2>Manage Kadoa&#8217;s Workflows via APIs</h2><p>As mentioned earlier, <a href="https://docs.kadoa.com/api-reference/introduction">Kadoa provides several endpoints that you can call via REST APIs</a>. The APIs also let you perform actions that are not tied to workflows you have already created. 
For example, you can start <a href="https://docs.kadoa.com/api-reference/crawling/start-crawling-session">crawling sessions</a> and <a href="https://docs.kadoa.com/api-reference/schemas/create-schema">create data schemas</a>.</p><p>Before using the API, get your API Key under the <strong>Settings</strong> page:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pwMH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pwMH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 424w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 848w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1272w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pwMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png" width="1456" height="374" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108703,&quot;alt&quot;:&quot;Get your Kadoa API key by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Get your Kadoa API key by Federico Trotta" title="Get your Kadoa API key by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!pwMH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 424w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 848w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1272w, https://substackcdn.com/image/fetch/$s_!pwMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea3c304a-ae51-4505-99bd-8e52f6d62105_1721x442.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Get your Kadoa API key</figcaption></figure></div><p>To manage existing workflows, whether created via the UI or the API, you need the workflow&#8217;s ID, which you can find in the UI:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b0VW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!b0VW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 424w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 848w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1272w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b0VW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png" width="1456" height="205" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:205,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66109,&quot;alt&quot;:&quot;Get a workflow ID by Federico 
Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Get a workflow ID by Federico Trotta" title="Get a workflow ID by Federico Trotta" srcset="https://substackcdn.com/image/fetch/$s_!b0VW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 424w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 848w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1272w, https://substackcdn.com/image/fetch/$s_!b0VW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d15234-e449-47ec-8192-79ffc018e02c_1899x267.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Get a workflow ID</figcaption></figure></div><p>You can then perform several actions by calling the REST endpoints. For example, you can <a href="https://docs.kadoa.com/api-reference/workflows/schedule-a-workflow">schedule a particular workflow</a> for later:</p><pre><code><code>curl --request PUT \
  --url 'https://api.kadoa.com/v4/workflows/{workflowId}/schedule' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: &lt;api-key&gt;' \
  --data '
{
  "date": "2025-02-07T10:00:00.000Z"
}
'</code></code></pre><p>Replace the following placeholders with your own values:</p><ul><li><p><em>workflowId</em>: the ID of the workflow you want to schedule.</p></li><li><p><em>&lt;api-key&gt;</em>: your Kadoa API key.</p></li><li><p><em>date</em>: the moment you want your workflow to start the scraping task, written as an ISO 8601 timestamp in UTC.</p></li></ul><h2>Kadoa: Final Comments</h2><p>After analyzing and testing the tool, I can say the following are its main advantages and disadvantages:</p><p><strong>&#128077; Pros</strong>:</p><ul><li><p>Ready for AI integration. You can download the scraped data or integrate it into your AI projects directly via API.</p></li><li><p>Suits different kinds of users, as it provides APIs, SDKs, and a UI.</p></li><li><p>Supports structured output formats, including JSON.</p></li><li><p>Offers virtually unlimited scalability, both in infrastructure management and in the number of URLs to scrape.</p></li><li><p>Focuses on data quality before scraping, not afterward.</p></li></ul><p><strong>&#128078; Cons</strong>:</p><ul><li><p>Currently, it supports only 5 proxy locations.</p></li><li><p>You can&#8217;t scrape every website you&#8217;d like, as some URLs are not supported:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nPP4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nPP4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 424w, 
https://substackcdn.com/image/fetch/$s_!nPP4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 848w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1272w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nPP4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png" width="1226" height="616" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:1226,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93999,&quot;alt&quot;:&quot;Unsupported scraping URL in Kadoa by Federico Trotta&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.thewebscraping.club/i/184656414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Unsupported scraping URL in Kadoa by Federico Trotta" title="Unsupported scraping URL in Kadoa by Federico Trotta" 
srcset="https://substackcdn.com/image/fetch/$s_!nPP4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 424w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 848w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1272w, https://substackcdn.com/image/fetch/$s_!nPP4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedabe0b-a420-4121-8157-2a9f7f2acbd5_1226x616.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Unsupported scraping URL in Kadoa</figcaption></figure></div><h2>Conclusion</h2><p>In this article, I&#8217;ve presented Kadoa, an AI-powered scraping tool that helps you simplify your scraping projects. As you&#8217;ve seen, it is a ready-to-use tool that lets you create scraping workflows in minutes via the UI and also supports code.</p><p>Let us know in the comments: Did you already know this tool? Have you tested it yet?</p>]]></content:encoded></item></channel></rss>