About LLMs, AI and Web Scraping
Some notes on the impact of AI and LLMs on the web scraping industry
In the latest post of The LAB series, we used a library called ScrapeGraphAI to scrape some web pages using LLMs.
The answers we obtained from the GPT models were not always good: with GPT-3.5 Turbo, the only model that allowed us enough tokens to parse medium-sized pages, the resulting data was plausible but not correct.
Things got better with GPT-4o, especially on websites like TripAdvisor, which was parsed correctly on the first try even with a minimal prompt.
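For context, a call like the ones used in those tests looks roughly like the sketch below. This is a minimal example assuming a recent ScrapeGraphAI release with an OpenAI backend; the API key, model name, prompt, and URL are placeholders, not the exact values from the previous post.

```python
# Minimal sketch, assuming a recent ScrapeGraphAI release with an OpenAI backend.
# API key, model name, prompt, and URL are placeholders.
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o",  # or a GPT-3.5 Turbo model, as in the cheaper tests
    },
}

smart_scraper = SmartScraperGraph(
    prompt="List all the reviews on the page with author, rating, and text",
    source="https://www.tripadvisor.com/SOME_HOTEL_PAGE",
    config=graph_config,
)

result = smart_scraper.run()  # a dict with the extracted data
print(result)
```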
Since I'm not an expert in AI, for this article I asked Marco Vinciguerra, teaching assistant and ML engineer at Università degli Studi di Bergamo, for some additional comments, to better understand the pros and cons of this AI-centric approach to web scraping.
HTML parsing is only half of the job
Web scraping a page is a task that can be split into two different actions: crawling to the target page to get the HTML, and parsing it. When we talk about LLMs in web scraping, we're dealing only with the second part, the HTML parsing. Depending on the website and its countermeasures, parsing can be the most challenging part (when no anti-bot is installed and getting the HTML is trivial) or the easiest one (when the website is heavily protected). This means that the advantages of an AI solution for HTML parsing are more or less relevant depending on how hard it is to reach the HTML of the target website: if we struggle to get the code, we spend most of our time bypassing an anti-bot, and the advantage given by the parsing solution will be minimal compared to the total time spent on the project.
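To make the split concrete, here is a rough sketch with the two phases kept separate, assuming a simple site with no anti-bot protection; parse_with_llm is a hypothetical placeholder for whatever parsing step (LLM-based or classic selectors) you plug in.

```python
# Sketch of the two phases kept separate, assuming no anti-bot is in place.
# parse_with_llm is a hypothetical placeholder for any LLM-based (or classic) parsing step.
import requests


def fetch_html(url: str) -> str:
    # Phase 1: crawling - reach the target page and get its raw HTML.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_with_llm(html: str, prompt: str) -> dict:
    # Phase 2: parsing - turn the HTML into structured data.
    # Here you would call ScrapeGraphAI, an LLM directly, or CSS/XPath selectors.
    return {}


html = fetch_html("https://example.com/product-page")
data = parse_with_llm(html, "Extract the product name and price")
```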
For this reason, more and more companies are adding HTML parsing capabilities to their existing anti-bot bypass products, creating a unified solution that covers both phases.
Are LLMs good enough for web scraping?
One of the main issues in web scraping is data quality, with different nuances depending on the industry. Working on a dynamic pricing algorithm requires web data to be perfect and timely, and the same happens whenever quantities are involved, as in inventory tracking or price comparison tools.
What we saw in some examples from the previous article is that ChatGPT in some cases could not retrieve all the data on the page, and in others hallucinated and returned wrong results.
We have seen LLMs perform very well on a popular website like TripAdvisor, where, given a minimal prompt, the model returned a detailed and correct answer containing all the information about a place's reviews. When collecting data from an e-commerce site, instead, we needed to write an elaborate prompt and rework it multiple times to get an acceptable answer.
Is this because TripAdvisor reviews were probably in the GPT models' training data? If so, is it still true that a generic LLM could scrape the whole web? Or will we see, in the future, models dedicated to website categories like e-commerce or social media?
I had the chance to discuss these questions with Marco and got some interesting insights.
First, hallucinations are reduced with GPT-4o, and we saw this in our examples whenever we could use it. Of course, general-purpose models like the GPTs cannot be as effective as models specifically designed for web scraping, which is why the ScrapeGraphAI team is working on training a new model dedicated to it. I can't wait to try it!
How can we detect and handle errors in scraping?
LLMs bring a new challenge for data quality controls. With traditional methods, selectors have two states: working, when they retrieve the right information, or broken, when they retrieve the wrong information or no information at all. With LLMs, instead, we can get plausible data that seems correct at first sight but is not what actually comes from the HTML: the model generates it.
We also have a consistency issue: how can we know whether the model keeps returning coherent results over time?
Last but not least, if the model returns a wrong piece of information, what can we do to select a different one?
Again, a model designed for web scraping should reduce hallucinations, avoiding most of these issues at the source.
We could still have some errors in the scraped data, but they can be detected just as we normally do with standard scraping: comparing known outcomes with the scraper's results gives us control over what's happening.
Writing some tests on our scraper or at the end of our data pipeline can help us detect errors both when writing the scraper and when running it in production.
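As a sketch of what such a check can look like, here is a minimal validation step to run on the scraper's output before loading it into the pipeline; the field names, ranges, and sample item are hypothetical.

```python
# Minimal sketch of a data-quality check; field names, ranges, and the sample item are hypothetical.

REQUIRED_FIELDS = ("name", "price", "currency")


def validate_item(item: dict) -> list:
    """Return a list of problems found in a scraped item."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in item or item[field] in (None, ""):
            errors.append(f"missing field: {field}")
    # Plausibility checks help catch hallucinated values that merely look correct.
    price = item.get("price")
    if isinstance(price, (int, float)) and not (0 < price < 10_000):
        errors.append(f"price out of range: {price}")
    return errors


# In a real pipeline the item would come from the scraper (LLM-based or not);
# here it is hardcoded to keep the sketch self-contained.
item = {"name": "Acme Leather Bag", "price": 129.90, "currency": "EUR"}
assert validate_item(item) == []
```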
If we want to change the output for a single field, we can act on the prompt or, using ScrapeGraphAI, rely on a dedicated node: the Generate Scraper Node.
Talking with Marco, he showed me the potential of this type of node: given a page, a model, and a Python library as input, the node writes the selectors to use in your scraper.
In this way, you can automate the writing of your scrapers, increasing the web scraping team's productivity.
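At the time of writing, ScrapeGraphAI exposes this through the ScriptCreatorGraph, which relies on the Generate Scraper Node under the hood. Below is a minimal sketch, assuming the configuration keys in the current documentation; the API key, model, library, prompt, and URL are placeholders.

```python
# Minimal sketch of scraper generation with ScrapeGraphAI's ScriptCreatorGraph.
# API key, model, library, prompt, and URL are placeholders.
from scrapegraphai.graphs import ScriptCreatorGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o",
    },
    "library": "beautifulsoup",  # the Python library the generated scraper should use
}

script_creator = ScriptCreatorGraph(
    prompt="Extract the product name, price, and availability",
    source="https://example.com/product-page",
    config=graph_config,
)

generated_script = script_creator.run()  # the generated scraping code, as a string
print(generated_script)
```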
This also addresses one of my pain points about using LLMs in production: response times. While for small websites and projects we can wait 10 seconds per request to get data from an LLM, this is not feasible on a larger scale. Keeping Scrapy or your preferred framework while not having to write the selector code yourself could be a real advantage for a team.
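As a sketch of that workflow, the generated selectors can simply be dropped into a regular Scrapy spider; the URL and CSS selectors below are hypothetical placeholders standing in for whatever the node produced.

```python
# Sketch of a Scrapy spider using selectors produced offline by the scraper-generation step.
# The URL and the CSS selectors are hypothetical placeholders.
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/category"]

    def parse(self, response):
        for product in response.css("div.product-card"):  # generated selector (placeholder)
            yield {
                "name": product.css("h2.title::text").get(),
                "price": product.css("span.price::text").get(),
            }
```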
Of course, there's a trade-off: when using LLMs directly we don't have to rewrite the scraper when the website changes, since in theory the model can still extract the data automatically; AI-generated selectors, instead, will break at every website change just like human-written ones.
What's certain is that exciting times lie ahead, and we can't wait for new releases in this field.