How LLMs are affecting the costs of web scraping
Some examples of using LLMs in web scraping with ScrapeGraphAI
This article is written together with Marco Vinciguerra, founder of ScrapeGraphAI: while I cover the more theoretical aspects, Marco explains in detail the code and the different benefits of using ScrapeGraphAI.
If you’re in Milan and want to know more about it, you can meet us both at the “ScrapeGraphAI: You Only Scrape Once + Lightning Talks” meetup at the Microsoft HQ.
Let’s continue our deep dive into the costs of web scraping: today, together with Marco Vinciguerra, founder of ScrapeGraphAI, we’ll look at the impact of LLMs and AI on the web scraping industry.
Where AI has the most impact on web scraping
In the previous post, we saw the different types of costs in web scraping.
One of the coolest projects in the AI scraping landscape is ScrapeGraphAI, a multi-purpose tool for scraping with LLMs.
Given its flexibility, it can be used in different phases of web scraping: it can write the scraping code for you or, given a prompt, extract data from a website without you needing a scraper at all.
This makes ScrapeGraphAI useful in different use cases, for experts and non-experts alike. Let’s look at two different scenarios.
Scenario 1: a professional with occasional web data needs but no web scraping experience
Let’s say a journalist needs some data for an article: they have a tech background but no time to create a web scraper for each website, and they don’t want to run one either. They just need some data, and the easiest way to get it in this case is the SmartScraperGraph module.
SmartScraperGraph
The SmartScraperGraph is an advanced web scraping tool that utilizes a directed graph structure to navigate and extract data from websites. It leverages large language models (LLMs) to interpret and respond to prompts, enabling it to adapt to different website structures and content dynamically. This flexible solution minimizes the need for manual configuration, making it ideal for scraping a wide range of web pages.
The output of this graph is a JSON object with the information requested by the prompt.
Benefits
1. Increased Productivity: By automating the web scraping process, users can focus on higher-level tasks.
2. Improved Accuracy: Using LLMs and dynamic graph construction minimizes the risk of errors, ensuring that extracted data is accurate and relevant.
Construction
The SmartScraperGraph is built from four main components, sketched conceptually after the list:
FetchNode: fetches the HTML from the URL (or reuses an already downloaded document, if one is provided) and stores it in the state.
ParseNode: parses the HTML in the state into a document; the parsed content is then split into chunks for further processing.
RAGNode: embeds the chunks and stores them in a RAG (retrieval-augmented generation) database.
GenerateAnswerNode: retrieves the relevant elements from the database and calls an LLM (OpenAI, Gemini, or a local model via Ollama) to generate the answer.
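To make the flow concrete, here’s a minimal conceptual sketch of a node pipeline like the one above. Everything in it (function names, state keys, the call_llm stub) is illustrative, not ScrapeGraphAI’s actual internals.

import requests
from bs4 import BeautifulSoup

def call_llm(prompt, context):
    # Stand-in for a real LLM call (OpenAI, Gemini, or a local Ollama model)
    return f"<answer to {prompt!r} based on {len(context)} chunks>"

def fetch_node(state):
    # Download the page (a real FetchNode can also reuse a provided document)
    state["html"] = requests.get(state["url"]).text
    return state

def parse_node(state):
    # Strip the HTML down to text and split it into chunks
    text = BeautifulSoup(state["html"], "html.parser").get_text()
    size = 2000
    state["chunks"] = [text[i:i + size] for i in range(0, len(text), size)]
    return state

def rag_node(state):
    # A real implementation would embed the chunks into a vector store;
    # here the "index" is just the chunk list itself
    state["index"] = state["chunks"]
    return state

def generate_answer_node(state):
    # Retrieve the most relevant chunks and ask the LLM to answer the prompt
    state["answer"] = call_llm(state["prompt"], context=state["index"][:3])
    return state

# The graph is just these nodes executed in order, passing the state along
def run_graph(url, prompt):
    state = {"url": url, "prompt": prompt}
    for node in (fetch_node, parse_node, rag_node, generate_answer_node):
        state = node(state)
    return state["answer"]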
Implementation
Here’s an example of SmartScraperGraph in action, using Mistral (served locally via Ollama) as the LLM.
from scrapegraphai.graphs import SmartScraperGraph

# Configuration: a local Mistral model served by Ollama, plus a local
# embedding model for the RAG step
graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the output format to be specified
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
}

# The prompt describes the data we want; the source is the page to scrape
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
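The exact output depends on the page and the model, but for this prompt the result is a structure along these lines (illustrative placeholders, not an actual run):

{
    "projects": [
        {"title": "Project A", "description": "A short description of the first project."},
        {"title": "Project B", "description": "A short description of the second project."}
    ]
}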
Another version of this graph, SmartScraperMultiGraph, makes it possible to execute multiple SmartScraperGraph instances in parallel, and can be coded like this:
import json

from scrapegraphai.graphs import SmartScraperMultiGraph

graph_config = {
    "llm": {
        "api_key": "openai_key",  # replace with your OpenAI API key
        "model": "gpt-4o",
    },
    "verbose": True,
    "headless": False,
}

# One prompt, several sources: each URL is handled by its own
# SmartScraperGraph instance running in parallel
multiple_search_graph = SmartScraperMultiGraph(
    prompt="Who is Marco Perini?",
    source=[
        "https://perinim.github.io/",
        "https://perinim.github.io/cv/"
    ],
    schema=None,
    config=graph_config
)

result = multiple_search_graph.run()
print(json.dumps(result, indent=4))
We’ve just seen how quick and easy it is to get data with ScrapeGraphAI. Even in the worst case, where the journalist spends two or three hours on prompt engineering, the time a non-expert would need to create a working scraper from scratch would be much higher.
We can say that AI and ScrapeGraphAI are enabling more professionals to dip their toes into web data without being web scraping professionals. Going back to our cost classification, they’re drastically reducing the setup cost for these users.
Scenario 2: companies with several running scrapers in a production environment
Let’s say we already have several scrapers running in a production environment, for both large and small websites.
In this case, it would be extremely inefficient, in both time and cost, to send every request to an LLM and have it parse the data, as we did in the previous scenario.
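To see why, here’s a rough back-of-envelope comparison. Every number below (page volumes, token counts, prices) is an assumption for illustration, not a measured benchmark.

# Back-of-envelope cost sketch; all numbers are illustrative assumptions
pages_per_day = 100_000
tokens_per_page = 4_000          # assumed HTML + prompt tokens sent per page
cost_per_1k_tokens = 0.005       # assumed price in USD per 1K input tokens

# Sending every page through an LLM, every day
llm_daily = pages_per_day * tokens_per_page / 1_000 * cost_per_1k_tokens
print(f"LLM on every page: ~${llm_daily:,.0f} per day")      # ~$2,000/day

# Paying the LLM once per site to generate a script, then running plain Python
sites = 50
generation_tokens_per_site = 20_000
one_off = sites * generation_tokens_per_site / 1_000 * cost_per_1k_tokens
print(f"Generating scripts once: ~${one_off:,.2f} in total")  # ~$5 total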
Here the ScriptCreatorGraph can help us: instead of asking for data, we ask ScrapeGraphAI to write the scraper for us, once again reducing our setup cost.
ScriptCreatorGraph
The ScriptCreatorGraph is a groundbreaking tool that utilizes a directed graph structure and large language models (LLMs) to automate the creation of scripts for web scraping, data processing, and more. This innovative solution empowers users to define their requirements and generate customized script code without requiring extensive programming knowledge.
The output of this graph is a Python script.
Construction
It is similar to the SmartScraperGraph; the only difference is that instead of a GenerateAnswerNode it has a GenerateScraperNode that, given the documents, generates a scraping script using the requested library (BeautifulSoup in the example below).
Implementation
Here’s what the ScriptCreatorGraph code looks like:
from scrapegraphai.graphs import ScriptCreatorGraph

graph_config = {
    "llm": {
        "api_key": "openai_key",  # replace with your OpenAI API key
        "model": "gpt-3.5-turbo",
    },
    "library": "beautifulsoup"  # the library the generated script should use
}

script_creator_graph = ScriptCreatorGraph(
    prompt="List me all the projects with their description.",
    # source also accepts a string with already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = script_creator_graph.run()
print(result)
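The result is a standalone script we can run without further LLM calls. The generated code varies from run to run, but it typically looks something like this (an illustrative example with assumed CSS selectors, not actual ScrapeGraphAI output):

# Illustrative example of a generated script; the selectors
# (.project-card, h3, p) are assumptions about the page structure
import requests
from bs4 import BeautifulSoup

response = requests.get("https://perinim.github.io/projects")
soup = BeautifulSoup(response.text, "html.parser")

projects = []
for card in soup.select(".project-card"):
    projects.append({
        "title": card.select_one("h3").get_text(strip=True),
        "description": card.select_one("p").get_text(strip=True),
    })

print(projects)

When the site’s layout changes and these selectors break, the graph can simply be rerun to produce an updated script.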
The advantage of having an AI model write the scraper for you, with the proper precautions, is not limited to saving setup costs for new websites: it extends to their maintenance too.
A futuristic scenario: self-healing web data pipelines
Given the ability of LLMs to write code, we can imagine a futuristic scenario where a web scraper heals itself. When our QA process detects a failure in some or all of our scraped data, it can automatically trigger a fix request to an LLM, for a specific field or for the whole scraper.
Using the LLM’s capabilities, we get back a new scraper that, after some tests, can be released to production with the new selectors.
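A minimal sketch of such a loop might look like this; run_scraper, qa_check, regenerate_scraper, and deploy are hypothetical helpers standing in for your own pipeline’s components:

# Hypothetical self-healing loop; every helper here is a stand-in
# for whatever your pipeline actually uses
def self_healing_run(site, scraper):
    data = run_scraper(scraper, site)        # execute the current scraper
    failures = qa_check(data)                # e.g. missing fields, empty values
    if not failures:
        return data

    # Ask an LLM (e.g. via ScriptCreatorGraph) for a fixed scraper,
    # scoped to the broken fields or to the whole script
    candidate = regenerate_scraper(site, scraper, failures)

    # Never release blindly: validate against known-good expectations first
    if qa_check(run_scraper(candidate, site)):
        raise RuntimeError(f"Auto-fix failed for {site}, escalate to a human")

    deploy(candidate, site)                  # promote the new selectors
    return run_scraper(candidate, site)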
In this scenario, the maintenance costs for scrapers, at least for the simplest ones, approach zero, giving companies the ability to scale, which is a current pain point for data factories.
Final remarks
This article concludes our mini-series about the costs of web scraping: after identifying the cost categories in the previous post, today we’ve seen how LLMs can help lower them, both today and in the future.
We’re still far from LLMs being blindly used in production environments, but keep in mind that we’re in the early days of this technology, and we can already catch a glimpse of the advantages for the industry if its promises are kept.
Thanks to Marco for sharing the details of ScrapeGraphAI with us; I hope to have him again on these pages.