THE LAB #75: Building self-healing scrapers with AI
How can we use LLMs to analyze HTML and fix our web scrapers?
Imagine having hundreds of scrapers up and running in your production environment. We all know what happens after a while: They start to break and return partial data, and you need to fix them.
The main challenge is that every website has a unique structure requiring a custom scraper. Even after developing a scraper, maintaining it becomes time-consuming as websites change layouts or implement anti-bot measures. This has led to an ever-growing need for developers specializing in building and maintaining scrapers, making scaling operations expensive and cumbersome.
The more scrapers you have, the more work is needed to keep them running, which requires more people and money.
Companies with large-scale scraping needs have been forced to dedicate significant resources to maintaining scraper fleets, managing IP rotation, solving CAPTCHAs, and keeping up with evolving anti-bot mechanisms.
This is why traditional web scraping companies don’t scale. But what about AI-assisted web scraping? LLMs and AI are great productivity enablers; could they also help us in web scraping?
The AI Promise: A Revolution in Web Scraping?
AI has the potential to revolutionize web scraping by increasing efficiency in two fundamental ways:
Retrieving data without scraping: AI models, particularly large language models (LLMs), can sometimes return structured data without scraping. This is especially true when AI agents are equipped with browser capabilities and can interact with web pages dynamically.
Automating scraper generation: AI can generate and maintain scrapers, reducing the need for manual coding and maintenance.
AI as a Data Retrieval Tool
In some cases, AI eliminates the need for traditional web scraping by providing direct responses to data queries. Some AI-powered agents can browse the web, interact with pages, and extract information dynamically, simulating a human user. Many commercial and open-source solutions, like OpenAI's Operator, are available today; they eliminate the need to create a scraper and provide structured data on demand, regardless of the website's structure.
This approach usually works best for one-off or repetitive small tasks, but it doesn’t work well for extensive web scraping activities. It can be suitable for searching online for the best deal on a particular pair of shoes, starting with Google, but not for scraping an entire large e-commerce catalog. The costs, reliability, and execution times of this approach can be barriers to completing this task.
AI-Generated Scrapers
Another promising application of AI is using machine learning models to generate scrapers programmatically. These systems analyze web pages, understand their structure, and generate the necessary code to extract data, effectively automating a traditionally manual process. Compared to AI agents, the scope of these tools is limited to generating code for web scraping, and for this reason, only a few tools can do that, like ScrapeGraphAI and OxyCopilot from Oxylabs.
This is the best approach for extensive web scraping projects, and it can help shrink the size of a web scraping team or increase its productivity.
Imagine a scenario where you have a scraper running, and after a minor UX change, the scraper stops retrieving all the information needed. If you’ve set up an alert for that, you can trigger a workflow where your AI assistant downloads the website's HTML code and adjusts the scraper accordingly.
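To make the idea concrete, here is a minimal sketch of that repair loop. All the helper names (run_scraper, download_html, fix_scraper_with_llm) and the field list are hypothetical placeholders for the pieces built in this article, not code from its repository.

EXPECTED_FIELDS = ["product_code", "price", "itemurl", "title"]  # hypothetical subset

def records_look_broken(records: list[dict]) -> bool:
    # Alert condition: no records at all, or records missing expected fields
    return not records or any(
        not rec.get(field) for rec in records for field in EXPECTED_FIELDS
    )

def check_and_heal(url: str, scraper_code: str) -> str:
    records = run_scraper(scraper_code, url)  # placeholder: execute the current scraper
    if records_look_broken(records):
        html = download_html(url)  # placeholder: e.g. via a browser automation tool
        scraper_code = fix_scraper_with_llm(scraper_code, html)  # placeholder: LLM repair step
    return scraper_code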
Of course, it can’t be a silver bullet. If your scraper needs a rework because the website changed the anti-bot solution used or you need to add proxies to it now, you still need some humans in the loop, at least to choose the right proxy providers for your needs.
In this article, I'm building a proof of concept for an AI pipeline that, given a broken scraper and the URL of the target website as input, returns a fixed scraper.
Fixing scrapers using GPTs
Before diving into the code, we need to set our expectations and divide the task into smaller chunks.
First, it would be impossible for an LLM to fix any kind of scraper just by looking at the website's code. We need to share as much context as possible, including the desired output, or at least the data structure we'd like to obtain.
For this reason, in this experiment I'm creating a working scraper capable of reading product list pages (PLPs) from an e-commerce website like this one.
The desired output has the following format:
data_structure = {
    "product_code": "A unique code for the product",
    "gender": "The gender of the product",
    "full_price": "The full price of the product",
    "price": "The price of the product",
    "currency": "The currency of the product",
    "itemurl": "The URL of the product detail page",
    "brand": "The brand of the product",
    "category1_code": "The first value of the breadcrumb of the current page",
    "category2_code": "The second value of the breadcrumb of the current page",
    "category3_code": "The third value of the breadcrumb of the current page",
    "imageurl": "The URL of the product image",
    "title": "The name of the product"
}
After retrieving the code from a browser automation tool, I'll ask the GPT-4o model to find the relevant HTML code and, in the following steps, write XPath selectors for each of the fields in the output and fix a scraper I'll pass as input.
Let’s see what will happen!
Retrieve the HTML code
First, we need to retrieve the HTML code from the website, which is the easiest part. For this example, we can use any browser automation tool. I chose Camoufox for its anti-bot features, but we could also use Playwright or Puppeteer and optionally connect them to an external browser. In any case, this is not the core of this project.
import time

from camoufox.sync_api import Camoufox

with Camoufox(humanize=True, os="windows", geoip=True) as browser:
    page = browser.new_page()
    # timeout=0 disables the default navigation timeout
    page.goto('https://www.balenciaga.com/it-it/uomo/accessori', timeout=0)
    time.sleep(5)  # give client-side rendering time to finish
    HTML_text = page.content()
    browser.close()
Extract meaningful HTML parts
Now that we have the website's HTML, we need to pass it to an LLM to extract meaningful sections for use in the next steps.
In this prompt, I’ll provide GPT with all the context needed to understand the first and most difficult task of the project: finding where the product’s data is and extracting the HTML section from the code. This required hours of prompt engineering on my side, and, spoiler alert, I’m not happy with how it works.
I tried to handle both cases: when product data is inside HTML tags and when it's embedded as JSON.
Using the OpenAI SDK, I can send the HTML and my prompt to the GPT-4o model and read the result.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def ask_chatgpt(html_content, fields):
    prompt = f"""
    You are given an HTML page from an e-commerce website.
    This page is a catalog of products. Focus your efforts on understanding the structure of the product catalog. Keep in mind that we're looking for the following information for each product:
    {fields}
    Return the HTML code that contains the product catalog, in whatever format you see it. Do not modify the HTML code, just return it as is.
    If there's a JSON object in the HTML containing the product catalog, return the portion of the HTML where the JSON object is included, together with an XPath selector to match it.
    If there's no JSON with product data, then extract only the portion of the HTML that contains the product information.
    In either case, also return the HTML needed for the pagination of the catalog (the link to the next page).
    At the end, return the breadcrumb of the page, with the values of the categories and its HTML code.
    """
    response = client.chat.completions.create(
        model="gpt-4o",  # use the latest available model
        messages=[
            {"role": "system", "content": "You are an AI that extracts HTML code from a webpage, returning the portions of it that contain a product catalog. You return the HTML code as is."},
            {"role": "user", "content": prompt},
            {"role": "user", "content": html_content},
        ],
    )
    return response.choices[0].message.content
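With the page source captured in the previous step, the call could look like this (a minimal usage sketch; data_structure is the dict defined earlier):

import json

# Serialize the desired output schema and ask GPT-4o to isolate the catalog
fields = json.dumps(data_structure, indent=2)
catalog_html = ask_chatgpt(HTML_text, fields)
print(catalog_html)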
Apart from the prompt, which can be improved, the most interesting part of this code is the messages array.
Each message has a role and content. As far as I understand, the system message is a high-level instruction that sets the AI's behavior, tone, and constraints throughout the conversation.
Conversely, the user message represents input from the person or program interacting with the AI.
Please note that to send the HTML content inside a message, I needed to upgrade to Tier 2 of my account on the OpenAI API platform, which cost me more than 100 USD.
Below Tier 2, you don't have enough tokens per minute to send a reasonably sized HTML page.
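If you want to check in advance whether a page fits within your tier's limits, you can count tokens locally with the tiktoken library (a small sketch, assuming a recent tiktoken version that knows the gpt-4o encoding):

import tiktoken

# Rough token count of the captured HTML, to compare against your
# tier's tokens-per-minute limit before calling the API
enc = tiktoken.encoding_for_model("gpt-4o")
print(f"The HTML is ~{len(enc.encode(HTML_text))} tokens")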
The script is in the GitHub repository's folder 75.GPTSCRAPING, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
A small rant before going on
As mentioned, I've spent hours prompt engineering this first step (and it could be much better), but it frustrates me. It feels like playing a slot machine: you try again and again with small changes in the wording, and every time you get different results, even if you don't change anything.
It also feels like a continuous loop of flaky tests: you keep failing without knowing how to fix the bug until, suddenly, you do, without knowing what made the difference. Then you hope it will keep working long enough.
Is it just me who feels like this? Do you have any resources for learning more about writing effective prompts?
Writing the XPath selectors