How AI is changing the web scraping industry
How AI brought to life a new set of tools and services for web scraping
The buzzword of 2023 was, without a doubt, LLM. It held on for almost the entire following year, but then it was replaced by the word “agent” or, better yet, AI agent.
OpenAI recently released its own ChatGPT agent, Operator, and hundreds of startups are developing vertical agents for a wide variety of tasks.
An AI agent is a program capable of autonomously performing tasks on behalf of a user or another system, choosing the right tools and deciding the workflow needed to complete the task. For example, an AI agent in the travel industry could browse different websites and find the flight tickets that best answer a prompt like: “I want to fly from Paris to London on one of the weekends in March; find me the cheapest option.”
Breaking this task into chunks, the agent should be able to:
understand human language
break down the instruction into a series of subtasks and create a workflow
access the web to retrieve data from multiple sources
elaborate the data to compose an answer
For obvious reasons, the data-retrieval step is what interests us the most. The agent needs access to a browser, must be able to automate its actions, and must be capable of reading the website’s rendered Document Object Model (DOM), including dynamic content loaded after the initial page load.
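As a concrete sketch of that retrieval step, here is what it could look like with Playwright’s Python API; the URL and selector are placeholders, not taken from a real agent:

```python
# Minimal sketch of the data-retrieval step: load a page in a real
# browser and read the rendered DOM, including JS-generated content.
# The URL and selector below are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://flights.example.com/search?from=PAR&to=LON")
    # Wait until the network is idle so dynamic content has rendered
    page.wait_for_load_state("networkidle")
    rendered_dom = page.content()  # the full rendered DOM as HTML
    prices = page.locator(".price").all_text_contents()
    browser.close()
```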
The pre-agent era: Playwright on Cloud
From the description above, it might seem that giving the agent access to a browser automation tool like Puppeteer or Playwright would be enough.
If you’re doing web scraping and have tried deploying a fleet of Playwright instances, you know it's not as easy or cheap as it seems. Let me give you one personal example.
I often scrape entire e-commerce product catalogs, including prices, links to products, and other essential information, across multiple countries. Regardless of the technology, I usually split this task by assigning one country per VM (with some exceptions). When using Scrapy and not needing any proxy, the data comes almost for free: a few hours (if not minutes) of a small virtual machine (a t3.micro on AWS) cost me a few cents. But when it comes to Playwright, you need larger machines (at least a t3.medium) and almost no concurrency, to avoid overloading the machine, which makes runs longer. I consider Playwright my last resort, and I use it only if there’s some anti-bot software on the target website. Usually, it’s not even enough: it’s detectable, and in most cases I need alternatives.
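To illustrate the throttling I’m describing, here is a hedged sketch with an illustrative concurrency cap and made-up URLs; on a small VM you can’t push the cap much higher without overloading the machine:

```python
# Illustrative sketch of why Playwright runs cost more: every page
# needs a full browser, so on a small VM concurrency stays very low.
# The cap and URLs are illustrative, not from the article.
import asyncio
from playwright.async_api import async_playwright

MAX_CONCURRENCY = 2  # a t3.medium barely handles more than this

async def scrape(url, browser, sem):
    async with sem:
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await page.close()
        return html

async def main(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        results = await asyncio.gather(*(scrape(u, browser, sem) for u in urls))
        await browser.close()
    return results

asyncio.run(main([f"https://shop.example.com/product/{i}" for i in range(20)]))
```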
Stealthier alternatives
As of today, when we need to bypass an anti-bot solution, we have several options.
Web Unblockers
Web unblockers are a sort of super API handling JS rendering, fingerprint management, and everything else needed to bypass anti-bots. They are usually developed by proxy providers, so they rely on a solid infrastructure and, just like proxies, can be plugged into browserless scrapers. This way, you can maximize speed and request concurrency while delegating the infrastructure work to the provider.
Depending on the case, this can be a more economically viable solution than spawning larger AWS VMs that run for longer.
The downside of this approach is that you don’t have a browser to interact with: if you need to click some buttons or filter some results, you cannot do it.
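To show what plugging an unblocker into a browserless scraper can look like, here is a hedged sketch; the proxy-style endpoint and credentials are hypothetical, and each provider’s setup differs:

```python
# Hedged sketch: many web unblockers expose a proxy-style endpoint,
# so a plain HTTP client can use them with full concurrency.
# The endpoint and credentials below are hypothetical.
import requests

UNBLOCKER = "http://USERNAME:PASSWORD@unblocker.example-provider.com:8000"

resp = requests.get(
    "https://shop.example.com/category/shoes",
    proxies={"http": UNBLOCKER, "https": UNBLOCKER},
    verify=False,  # some unblockers re-sign TLS; check your provider's docs
    timeout=60,
)
html = resp.text  # JS-rendered, anti-bot handling done upstream
```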
Anti-detect browsers
Browser automation tools like Playwright or Puppeteer were initially designed for testing web and mobile apps, but they also gained traction in the web scraping industry. To simulate a human browsing a website (and bypass anti-bot protections), you need a browser that performs actions the way a human would.
But nothing lasts forever, and anti-bot software soon started detecting automated browsers, since automation leaves traces in the browser fingerprint.
That’s where anti-detect browsers came into play. Instead of automating standard builds of traditional browsers, we can point our Playwright scripts at browsers explicitly designed to cover their traces, masking the underlying hardware and software and generating fingerprints that look legitimate in the eyes of anti-bot solutions.
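As a sketch of how this looks in practice, assuming an anti-detect browser that exposes a local CDP endpoint (the port is hypothetical; check your tool’s documentation):

```python
# Hedged sketch: attach Playwright to an already-running anti-detect
# browser over CDP, so the fingerprint comes from the anti-detect
# profile rather than a stock Chromium build. Endpoint is hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://127.0.0.1:35000")
    # Reuse the profile's existing context if the browser created one
    context = browser.contexts[0] if browser.contexts else browser.new_context()
    page = context.new_page()
    page.goto("https://bot-protected-site.example.com")
    print(page.title())
```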
While they are super helpful for bypassing anti-bots, they still share the limits of every browser automation solution, since the hardware required to run them is the same.
The agent needs
Given everything covered so far and the boom of AI agents, it’s not surprising that more startups are rethinking browsers and their infrastructure.
Services like HyperBrowser AI, Browserless, BrowserBase, and many others manage the infrastructure for running hundreds of Playwright/Puppeteer sessions. You just connect your Playwright script to their infrastructure via WebSocket, and you can use hundreds of sessions in parallel without worrying about scaling anything yourself, as in the sketch below.
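A minimal sketch of that connection, assuming a hypothetical WebSocket endpoint and token (some providers expose a CDP endpoint instead, which requires connect_over_cdp):

```python
# Hedged sketch: the scraping logic stays the same; only the launch
# step changes. Instead of starting Chromium locally, we attach to a
# provider's remote browser fleet. Endpoint and token are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect("wss://browsers.example-provider.com?token=API_TOKEN")
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```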
Last Tuesday, the team from LightPanda wrote on these pages about what they’re building. Instead of “simply” creating the infrastructure for managing actual browsers, with more or less advanced stealth features, they built a new type of browser, made to measure for machines instead of humans. Agents, in fact, have different needs than humans, the main users of traditional browsers.
First, AI agents need a fast browser to complete their tasks as quickly as possible. The browser should also be able to access every website, so stealth features are needed to bypass anti-bot protections. It should also be scalable: from just one machine, I should be able to launch hundreds of agents to automate whatever I can.
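Putting those requirements together, here is a hedged sketch of the “scalable” part: fanning out many short agent sessions from a single machine against remote browser infrastructure. The endpoint, the task URLs, and the session counts are all illustrative.

```python
# Illustrative fan-out: one small machine coordinating many short
# browser sessions running on remote infrastructure.
import asyncio
from playwright.async_api import async_playwright

WS_ENDPOINT = "wss://browsers.example-provider.com?token=API_TOKEN"  # hypothetical

async def run_agent_task(p, url, sem):
    async with sem:  # cap concurrent remote sessions
        browser = await p.chromium.connect(WS_ENDPOINT)
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return title

async def main():
    sem = asyncio.Semaphore(50)  # illustrative cap
    urls = [f"https://example.com/task/{i}" for i in range(200)]
    async with async_playwright() as p:
        return await asyncio.gather(*(run_agent_task(p, u, sem) for u in urls))

asyncio.run(main())
```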
Did you notice? This list of features is exactly what’s needed to scale web scraping operations using browser automation tools. The only difference between web scraping and AI agent tasks is that agent tasks, as far as I’ve seen so far, tend to last a few minutes, while scraping jobs can potentially last hours.
But the infrastructure and the tools could potentially be the same, and that could open a new frontier in web scraping.
For this reason, you’ll shortly see some reviews of these services on these pages: we’re at the dawn of a new era, and I’d like to discover what it brings us together with this amazing community.
Since these services are popping up daily, if you notice one, please share it with me in the comment section or by email at pier@thewebscraping.club.
Thanks to Pier for sharing.
Thanks to our strong industry collaboration with The Webscraping Club, we’ve always stayed at the forefront of scraping innovation.
The waitlist for Scrapeless's Browserless is now open! 🚀
https://browserless.scrapeless.com/