THE LAB #60: Writing scrapers with LLMs
Comparing Llama 3.1, GPT-4 and Mistral at creating scrapers
"The factory of the future will have only two employees: a man and a dog. The man will be there to feed the dog. The dog will be there to keep the man from touching the equipment." – Warren G. Bennis
LLM-powered web scraping tools seem to be everywhere today. From Hacker News to webinars, everyone seems to be implementing and using these tools.
This thread on Hacker News is one of many on the topic in recent weeks: the author starts with a complaint about the cost of scraping with LLMs, but in the 100+ comments you can find at least ten people promoting their own AI-powered scraping tool.
Pros and cons of using LLMs for web scraping
When talking about LLMs and AI, there’s a big premise to make: every consideration and conclusion here refers to their current state. Given the pace of evolution in LLMs, any judgment risks aging badly within months, as soon as a new breakthrough model is released.
This is why, some months after the first article, I wanted to write about ScrapeGraph-AI again, to understand how these tools can be useful in our scraping operations.
Setting the expectations
Web scraping is a rabbit hole: at first it seems easy. You can see the data on your screen, and with a no-code tool and three clicks you can get it into an Excel spreadsheet. This is true for many websites when you have a one-off requirement, but a recurring task needs a few more steps.
Then you hit websites where these tools misinterpret the layout and return incorrect or partial data, and you begin to understand that you need something more flexible.
Then you turn to LLMs to get the data, but even they won’t work in 100% of cases. That’s why, before implementing LLMs in your web scraping pipeline, it’s important to set expectations about the results.
What can we ask of LLM-powered tools?
These tools usually rely on a browser automation tool to load the web page and then pass its response to the model.
So, when there’s no anti-bot protecting the website, we can expect:
once the desired output data is defined, the tool can write the scraper for us
the tool can heal the scraper in case the website’s code changes over time
the tool can also return the desired data without even writing a scraper (see the sketch right after this list)
the tool can shorten the development time of scrapers, reducing building costs
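To give an idea of the last point, here’s a minimal sketch using ScrapeGraph-AI’s SmartScraperGraph (the same library we’ll use later in this article). The prompt, model name and API key are just placeholders, not the exact setup used in the tests below.
from scrapegraphai.graphs import SmartScraperGraph

# Illustrative configuration: API key and model name are placeholders
graph_config = {
    "llm": {
        "api_key": "APIKEY",
        "model": "openai/chatgpt-4o-latest",
    },
    "verbose": True,
    "headless": True,
}

# SmartScraperGraph returns the extracted data directly,
# without producing a reusable scraper script
smart_scraper_graph = SmartScraperGraph(
    prompt="List all the repositories of this user, with name and description.",
    source="https://github.com/berstend?tab=repositories",
    config=graph_config,
)

print(smart_scraper_graph.run())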
What we cannot expect from LLM-powered tools
Since they simply interpret the response they get from the browser, we can’t expect them to use advanced techniques, like scraping data from apps, and we can’t ask them to:
inspect the network tab to use the internal APIs of the website
return data not included in the response, like inventory levels available only from the internal APIs but not in the HTML, as a consequence of the previous point
bypass anti-bot solutions, since they basically run a browser automation tool with standard settings, if not simply using Python requests.
be cheaper than traditional scraping, at least in the execution phase. Once the selectors are written, extracting data from HTML is free in Python, while the same operation with an LLM consumes tokens, which must be paid for (see the rough cost sketch after this list)
be faster than traditional web scrapers. LLMs need time to generate an output, so using them for scraping will be slower than a Scrapy or Playwright scraper
add proxies or any other advanced feature to the scrapers
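To put the cost argument into numbers, here’s a rough back-of-the-envelope sketch. The page size, the characters-per-token ratio and the price per million tokens are illustrative assumptions, not real vendor pricing.
# Back-of-the-envelope comparison: once selectors are written, parsing HTML
# has no per-page marginal cost, while sending the same HTML to an LLM
# consumes tokens. All numbers below are illustrative assumptions only.

html_size_chars = 300_000              # assumed size of a typical product page
chars_per_token = 4                    # rough heuristic, not an exact tokenizer
price_per_million_input_tokens = 5.0   # hypothetical price in USD

tokens_per_page = html_size_chars / chars_per_token
llm_cost_per_page = tokens_per_page / 1_000_000 * price_per_million_input_tokens

print(f"LLM extraction: ~${llm_cost_per_page:.4f} per page")   # ~$0.375 with these assumptions
print("Selector-based extraction: $0 marginal cost per page")
With these (made-up) numbers, even a modest daily crawl of a few thousand pages adds a noticeable bill that a traditional selector-based scraper simply doesn’t have.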
Speaking of proxies, the new video on our YouTube channel is live. Here’s the chat I had with Fabien Vauchelles, creator of Scrapoxy.
As mentioned at the beginning, this is the current situation. Maybe in the near future tokens will cost far less and performance will be 10x better, making LLMs a good choice for running scrapers too, but today I see the main advantage in having them write the scrapers for us.
Using LLMs as our junior programmer
As of today, I think we get the most value by using LLM-powered scraping tools to write the code for us.
It’s like delegating the first draft of the code to an intern, then reviewing it before deploying it to production.
To do so, and to test different models on the same website, I’m using ScrapeGraph-AI: a powerful and versatile tool that lets me swap models just by changing a few parameters.
For these tests, I’m using OpenAI’s GPT models, Mistral and Llama 3.1, asking them to create scrapers for different types of websites.
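To give an idea of what swapping models looks like, here’s a sketch of the three configurations. The model identifiers and the Ollama base_url are assumptions that depend on your local setup and on the ScrapeGraph-AI version you’re running, not necessarily the exact strings I used in my tests.
# Swapping models in ScrapeGraph-AI only means changing the "llm" block of the
# config. The identifiers and base_url below are assumptions that depend on
# your setup and on the ScrapeGraph-AI version, so double-check them.

openai_config = {
    "llm": {
        "api_key": "APIKEY",
        "model": "openai/chatgpt-4o-latest",
    },
    "library": "beautifulsoup",
}

llama_config = {
    "llm": {
        "model": "ollama/llama3.1",           # assumes Llama 3.1 served locally via Ollama
        "base_url": "http://localhost:11434",
        "format": "json",
    },
    "library": "beautifulsoup",
}

mistral_config = {
    "llm": {
        "api_key": "MISTRAL_APIKEY",
        "model": "mistralai/mistral-large-latest",   # hypothetical identifier
    },
    "library": "beautifulsoup",
}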
As always, if you want to have a look at the code, you can access the GitHub repository, available for paying readers. For this article, I’ve created different files under the folder 60.SCRAPEGRAPH-AI, split by target website.
If you’re a paying reader but get a 404 error when accessing this link, please write me at pier@thewebscraping.club since I need to add you manually to the repository.
List GitHub repositories
GitHub is surely one of the websites that all these companies have scraped to train their models. Thanks to the knowledge contained in its repositories, we now have LLMs that act like pair programmers.
Since it has been widely used for training, I expect its structure to be familiar to the models. I will ask GPT-4, Llama 3.1 and Mistral to write a scraper that, given a user profile, lists all of that person’s repositories.
GPT4
Thanks to ScrapeGraph-AI, I can use the ScriptCreatorGraph to ask for the code of a BeautifulSoup scraper. I’ve also reinforced this in the prompt, adding more details on exception handling.
The output schema is defined in the GitHubRepository class.
You can find this code and the output under the 60.SCRAPEGRAPH-AI/GitHub folder of the repository.
from scrapegraphai.graphs import ScriptCreatorGraph
from langchain_core.pydantic_v1 import BaseModel, Field

# Graph configuration: which LLM to use and which scraping library to target
graph_config = {
    "llm": {
        "api_key": "APIKEY",
        "model": "openai/chatgpt-4o-latest",
    },
    "library": "beautifulsoup",
    "verbose": True,
    "headless": False,
}

# Desired output schema
class GitHubRepository(BaseModel):
    Author: str = Field(description="Author of the repository, get it dynamically and not from the input URL")
    RepositoryName: str = Field(description="the name of the repository")
    RepositoryDescription: str = Field(description="the description of the repository, use N.A. if missing")

# Create the script creator graph
script_creator_graph = ScriptCreatorGraph(
    prompt="Create the scraper needed for getting data from the GitHub user's list of repositories. The script must work with the beautifulsoup library. Add exception handling in case a field is missing.",
    source="https://github.com/berstend?tab=repositories",
    config=graph_config,
    schema=GitHubRepository
)

result = script_creator_graph.run()
print(result)
The result, a basic scraper, meets the expectations.
import requests
from bs4 import BeautifulSoup
import json

def scrape_github_repositories(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    repositories = []
    author = soup.find('meta', {'name': 'octolytics-dimension-user_login'})['content']
    for repo in soup.find_all('li', class_='public'):
        try:
            repo_name = repo.find('a', itemprop='name codeRepository').text.strip()
        except AttributeError:
            repo_name = "N.A."
        try:
            repo_description = repo.find('p', itemprop='description').text.strip()
        except AttributeError:
            repo_description = "N.A."
        repositories.append({
            "Author": author,
            "RepositoryName": repo_name,
            "RepositoryDescription": repo_description
        })
    return json.dumps(repositories, indent=4)

if __name__ == "__main__":
    url = "https://github.com/berstend?tab=repositories"
    print(scrape_github_repositories(url))
It’s a real working scraper, returning the first page of repositories, with the selected data structure in JSON format.
[
    {
        "Author": "berstend",
        "RepositoryName": "chrome-versions",
        "RepositoryDescription": "Google Chrome release and version info as JSON (self updating)"
    },
    {
        "Author": "berstend",
        "RepositoryName": "browser-monitor",
        "RepositoryDescription": "N.A."
    },
    {
        "Author": "berstend",
        "RepositoryName": "puppeteer-extra",
        "RepositoryDescription": "\ud83d\udcaf Teach puppeteer new tricks through plugins."
    },
    ....
LLAMA3.1
With Llama 3.1 I didn’t have the same luck.
The scraper it created uses selectors that simply don’t work; probably some more tuning of the prompt is needed.
MISTRAL
Mistral really left me astonished.