THE LAB #84: AI-Driven Web Scraping: OpenAI Codex vs Cursor vs AI Scraping Tools
Is OpenAI Codex the new silver bullet for scraping?
Web scraping is experiencing a new golden age thanks to LLMs and AI-driven tools.
In this article, we’re comparing five approaches that bring AI into the scraping workflow: OpenAI Codex, Cursor (with Model Context Protocol), ScrapeGraphAI, Firecrawl, and Zyte API. Each represents a different strategy, with its own strengths and trade-offs:
OpenAI Codex – Released just a few weeks ago, it’s currently available only to ChatGPT Pro, Team, and Enterprise users. It’s still a beta product, but I wanted to share my first impressions of it. Codex is a coding assistant that connects to a specific GitHub repository and pulls/pushes code from it.
Cursor + MCP (Model Context Protocol) – Uses an AI-augmented IDE with external tool plugins. Supports local development and the AI can fetch and parse pages via user-defined tools, making standard scraping tasks easier and improving as you refine your tools/rules.
ScrapeGraphAI API – A commercial “AI-first” scraping service. Simplifies scraping to just specifying the desired output and target URL, letting an LLM handle the rest. It adapts to page changes automatically, reducing maintenance in the long run.
Firecrawl – Another commercial solution for extracting markdown and structured data from the web, with no scraping expertise required.
Zyte API (AI Extraction) – A veteran scraping platform (formerly ScrapingHub) enhanced with AI. It combines Zyte’s robust scraping/unblocking stack with AI (both in-house ML models and LLMs) to directly return structured data. Handles proxying, headless browsers, and parsing for you, and it’s highly scalable and reliable for common data types.
In the sections below, we dive deeper into each tool’s current capabilities and limitations.
Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.
OpenAI Codex
OpenAI Codex can assist developers by generating Python scraping scripts or suggesting how to parse data. As mentioned, it connects to a GitHub repository and pulls/pushes the code from there. It can explain existing code, help with bug fixing, and, as we’re trying to do here, write code from scratch. However, it has apparent limitations for web scraping tasks:
No Internet Access: Codex runs code in a sandboxed environment without internet connectivity. This means it cannot fetch webpages directly on its own. You might get code suggestions based on generic patterns, but you’ll have to run that code yourself in a real environment to actually retrieve data. The agent itself cannot access external websites, APIs, or other services, so it’s not a fully autonomous scraper.
Latency and Overhead: Using Codex adds an extra layer to development. You describe the scraping task in natural language, the AI generates code, and then you execute or refine it. While this can speed up coding, it can also introduce latency, especially if the AI’s suggestions aren’t exactly right and you enter a loop of prompts and fixes, while a developer might tweak code more directly. Like most coding tools, it’s a trade-off between convenience and speed.
Large HTML = Context Issues: Codex (and GPT models in general) has a fixed context window for prompts. I tried to pass an HTML page in the prompt to work around the lack of internet access, but its size crashed the environment. This forces you to chunk the HTML or summarize it, which is another complication to handle. In short, Codex struggles with voluminous content – you can’t just paste a whole site’s HTML and expect reliable extraction if it’s beyond the token limit.
Additionally, because Codex isn’t specialized for web data, it might not know which parts of HTML are important without guidance. To get the correct output, it may require careful prompting (e.g., “ignore the navigation menus, focus on the product list”).
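To give an idea of the kind of pre-processing this implies, here’s a minimal sketch that strips scripts, styles, and navigation from a page and keeps only the container you care about, so the remaining markup fits in the context window. The URL and the CSS selector are placeholders, not taken from a real project.

```python
# Sketch: trim a page down to the part worth passing to a model.
# Assumes requests and beautifulsoup4 are installed; the URL and the
# ".product-list" selector are placeholders for your target site.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Remove markup the model doesn't need to see.
for tag in soup(["script", "style", "nav", "header", "footer"]):
    tag.decompose()

# Keep only the container that actually holds the data (hypothetical selector).
container = soup.select_one(".product-list") or soup.body or soup
trimmed_html = str(container)

print(f"original: {len(html)} chars, trimmed: {len(trimmed_html)} chars")
# trimmed_html is what you would paste into (or send to) the model.
```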
In summary, OpenAI Codex can be a helpful code generator, but it is not a one-stop scraping solution. You’ll still need to do the heavy lifting of running code and dealing with data outside the AI, and you must be mindful of context size and other constraints.
At this stage of development, it’s a no in my opinion: without internet access, it’s almost unusable for this task.
Thanks to the gold partners of the month: Smartproxy, IPRoyal, Oxylabs, Massive, Scrapeless, Rayobyte, SOAX, ScraperAPI and Syphoon. They prepared great offers for you, have a look at the Club Deals page.
Cursor (with MCP and Rules)
Cursor is an AI-powered IDE (similar to GitHub Copilot but more advanced) that allows you to integrate external tools via the Model Context Protocol (MCP). In the context of web scraping, this means you can have the AI assistant work alongside your code, calling custom scraper functions you provide. The approach, as we’ve seen in a past episode of The Lab, effectively turns Cursor into a smart local scraping assistant.
You set up MCP “servers” (essentially plugins) that can perform actions like fetching a URL, parsing HTML, or saving data. Within Cursor, the AI (acting as an MCP client) can invoke these tools as needed.
For example, you might implement a tool that takes a URL and returns the page’s HTML. When you ask Cursor’s AI to, say, “scrape all book titles from this site”, it can call your HTML-fetch tool with the guessed URL, get the HTML, and then help generate code (like XPath or CSS selectors) to extract the titles. The key benefit is that the AI is not working blindly – real data from your environment supports it, and it can execute code or use tools to gather context.
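As a rough idea of what such a tool looks like, here’s a minimal sketch built with the official Python MCP SDK, exposing a single fetch_html tool (the server and tool names are mine, not from any existing project). Once the script is registered in Cursor’s MCP settings, the assistant can call the tool whenever a prompt needs real page content.

```python
# Sketch of a minimal MCP server exposing a page-fetching tool.
# Assumes the official Python MCP SDK (pip install "mcp[cli]") and requests;
# the server/tool names are illustrative, not part of any existing project.
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("scraping-tools")

@mcp.tool()
def fetch_html(url: str) -> str:
    """Download a URL and return its raw HTML."""
    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, which is what Cursor expects
```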
Another interesting feature of Cursor is coding rules: they’re basically your guidelines and procedures for writing code that meets your standards.
For example, you can require only XPath selectors instead of CSS ones, specify the templates your scrapers should follow, and so on. Rules help reduce the fuzziness of the code written by the AI, since you’re enriching your prompt with a set of instructions. The more specific, clear, and concise they are, the more reliable Cursor becomes.
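As a purely hypothetical example, a rule file (a .cursorrules file, or a project rule under .cursor/rules) could look something like this:

```text
# Scraping project rules (hypothetical example)
- Use only XPath selectors; never use CSS selectors.
- Every new scraper must follow the structure of scrapers/template.py
  (a template file of your own) and expose a parse(response) function
  returning a list of dicts.
- Output items must contain exactly these fields: name, price, currency, url.
- Always set a custom User-Agent and a polite download delay.
```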
Over time, as you add more tools or refine the rules (prompts) for how the AI should extract data, the system becomes more capable. In other words, it improves with use: your library of scraping functions can grow, and the AI will leverage those in future requests.
If you’re writing scrapers with the same data model and targeting websites with light or absent anti-bot protection, this is a good way to speed up development. It’s like having a junior programmer who knows some web scraping assisting you inside your IDE. You stay in control of the code on your machine, which appeals to developers who need local solutions rather than hosted services. And it’s almost free: for 20 USD a month you get Cursor Pro, which is everything you need (on top of your OpenAI or Claude bills).
On the other hand, advanced or non-standard scenarios are still challenging. For instance, if a site’s data is rendered via an internal API call or requires complex interaction (multi-page navigation, form submissions, solving CAPTCHAs, etc.), the AI won’t automatically know that or handle it. You would need to guide it, perhaps by building additional MCP tools (e.g. a tool to call an API endpoint or run a headless browser). At that point, you’re essentially back to writing custom scrapers – the AI can assist, but it can’t magically intuit hidden endpoints or intricate data schemas without your direction. Similarly, highly custom data extraction strategies (say, extracting meaning from images, or applying business-specific logic to scraped data) are beyond the out-of-the-box capabilities; you’d implement those manually or via other libraries.
In summary, Cursor + MCP is a powerful augmented coding approach for scraping. It offers a significant productivity boost for routine scraping tasks by keeping the loop tight (code and AI in one place) and letting you gradually expand what the AI can do. Remember that it’s not a turnkey web scraper for every situation – for challenging cases, you’ll still be writing code and guiding the process closely.
ScrapeGraphAI and its API
ScrapeGraphAI is a relatively new open-source tool that takes an “LLM-first” approach to web scraping. It allows you to create scrapers, but also to get results without the need to code a fully working scraper. To simplify the data retrieval even more, they just launched a commercial API version, which can extract the data for you.
If you use the open-source project, you can get data from a web page (or a list of pages) using any LLM, local or commercial, and a prompt.
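A minimal sketch of that usage, assuming an OpenAI key; the exact config keys and model identifiers change between versions, so check the project’s documentation before copying it:

```python
# Sketch of the open-source ScrapeGraphAI usage: a prompt, a source URL
# and an LLM config. Config keys and model names vary across versions.
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o-mini",
    },
}

smart_scraper = SmartScraperGraph(
    prompt="List every book title and its price on the page",
    source="https://books.toscrape.com/",
    config=graph_config,
)

result = smart_scraper.run()
print(result)
```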
You can also ask it to write a scraper and selectors for BeautifulSoup and other libraries, as you can see from this example, but in my opinion, the Cursor + MCP + rules solution gives you more flexibility and customization for your needs.
The scripts mentioned in this article are in the GitHub repository's folder 84.AI-WEB-DRIVEN-SCRAPING, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
The ScrapeGraphAI API, instead, given a URL and a desired output structure, returns the data to you. While this can be expensive for large vertical scraping (1 million pages from the same website), it can be highly convenient if you need to scrape 1 million pages from 10k websites: the cost of making 1 million API requests has to be weighed against the cost of developing 10k web scrapers and the corresponding delivery time.
This was one of the points of my speech at PragueCrawl, the event organized by Apify and Massive last week. You can find the recording of my 2 cents here.
I suggest checking out Apify's YouTube channel for the other talks, which were really interesting.
In the repository, you’ll find the code I used for the presentation, where I asked ScrapeGraphAI API to extract data for me.
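Roughly, a call looks like the sketch below, based on the scrapegraph-py Python SDK. It’s not the exact code from the repository, and method and field names may differ in the current SDK, so treat it as an assumption and refer to the official docs.

```python
# Rough sketch of a call to the commercial ScrapeGraphAI API through its
# Python SDK (scrapegraph-py). Names and response format are assumptions:
# check the official documentation before relying on this.
from scrapegraph_py import Client

client = Client(api_key="YOUR_SGAI_API_KEY")

response = client.smartscraper(
    website_url="https://example.com/products",
    user_prompt="Extract name, price and availability for every product",
)

print(response)  # the structured data extracted by the service
```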
The results were quite good on almost every website, so this is a great solution in case you need to scrape many different websites.
The advantage of this solution, compared with Cursor and with ScrapeGraphAI's open-source version, is that under the hood there’s an anti-bot bypass. In many cases, you can get the data you want without worrying about the website’s anti-bot protection.
As for HTML parsing, since it relies on LLMs, ScrapeGraphAI inherits their pros and limits: it’s incredibly efficient for starting a project, but can be expensive in certain circumstances. Also, if the data is not contained in the HTML of the rendered page but hidden in one of the website’s internal APIs, it cannot be retrieved with any LLM-powered tool, only with custom scrapers.
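In that scenario you’re back to the classic approach: find the internal endpoint in the browser’s network inspector and call it directly. A hypothetical sketch, with a made-up endpoint and parameters:

```python
# Hypothetical sketch: calling a website's internal JSON API directly.
# The endpoint, parameters and response fields are invented for illustration;
# the real ones come from your browser's network inspector.
import requests

response = requests.get(
    "https://example.com/api/v1/products",   # made-up internal endpoint
    params={"page": 1, "per_page": 48},
    headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
    },
    timeout=30,
)
response.raise_for_status()

for item in response.json().get("items", []):
    print(item.get("name"), item.get("price"))
```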
Firecrawl
Similar to ScrapeGraphAI, Firecrawl is another open-source project with a commercial API for extracting data from the web.
Here too, given a list of URLs, a prompt, and a desired output structure, we can extract data from web pages in a few minutes.
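A rough sketch of that workflow through the firecrawl-py Python SDK is shown below; the extract method’s exact signature and the schema format may differ across SDK versions, so take it as an assumption and check Firecrawl’s documentation.

```python
# Rough sketch of Firecrawl's extract workflow via its Python SDK
# (firecrawl-py). The extract() signature and schema format are assumptions
# and may not match the current SDK: check the official docs.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_FIRECRAWL_API_KEY")

result = app.extract(
    ["https://example.com/products"],  # list of target URLs
    {
        "prompt": "Extract every product with its name and price",
        "schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                        },
                    },
                }
            },
        },
    },
)

print(result)
```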