THE LAB #79: Use Cursor as web scraping assistant with MCP servers
Add MCP Servers to Cursor to increase our web scraping capabilities
In past episodes of The Lab series on The Web Scraping Club, we used Firecrawl to scrape the content of this newsletter, a vector DB like Pinecone to store the articles' markdown, and the OpenAI API to retrieve information and append the context to our prompt to get better answers.
This approach has limitations. The context window size (the amount of information we can include in the prompt) is limited, so the answers were not always great.
The AI landscape changes every few months, and RAG with Pinecone already feels old school. Today, the new buzzword is the Model Context Protocol (MCP), so I’ve conducted this small experiment to see how MCP can be used for web scraping.
Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.
What is the Model Context Protocol?
The Model Context Protocol (MCP) is an open standard, initially developed by Anthropic, that enables large language models (LLMs) to interact with external tools and data through a standardized interface. In essence, MCP provides a universal connector between AI models and the systems where data lives. Instead of building custom integrations for every data source or scraping tool, developers can expose their data or functionality via an MCP server. AI assistants (the MCP clients) can consistently request those resources, and this standardized, two-way connection enriches raw content with metadata, context, and instructions so that AI models can more effectively interpret information.
But how does it work?
The MCP follows a client-server architecture: The host application (e.g., an IDE or AI assistant) runs an MCP client (Claude, Cursor, etc.) that connects via the MCP protocol to one or more MCP servers. Each server interfaces with a specific data source or tool (e.g., a database, file system, or web scraper) on your local machine or network, providing structured context or actions to the LLM.
An MCP server can expose resources (read-only context data analogous to GET endpoints) and tools (actions or functions the model can invoke analogous to POST endpoints) to the LLM.
Anthropic open-sourced MCP in late 2024 to encourage industry-wide adoption. The goal was to establish a common framework for AI-to-tool communication, reducing reliance on proprietary APIs and making AI integrations more modular and interoperable.
This structured context handling improves on plain prompts. The model no longer just receives a blob of text; instead, it can load data via resources and call tools with well-defined inputs/outputs. In summary, MCP provides a standardized, secure way to feed context into AI models and receive actions/results, which is especially powerful in complex tasks like web scraping.
Thanks to the gold partners of the month: Smartproxy, Oxylabs, Massive and Scrapeless. They’re offering great deals to the community. Have a look yourself.
Why Use MCP?
As we’ve seen in previous posts using RAG, one challenge was fitting the downloaded article into the prompt, since the context size is limited. We tried different workarounds, like splitting the articles into chunks in Pinecone and storing only an OpenAI-generated summary to reduce their length.
Using MCP, especially in this case, where we’re using it inside Cursor IDE, we can create a tool that extracts the HTML and saves it to a file that can be read and chunked by Cursor and used to generate the XPaths.
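To make a server like this available inside Cursor, you register it in Cursor’s MCP configuration (under Settings → MCP, or in a `.cursor/mcp.json` file in the project root). A minimal entry might look like the following; the server name and script path here are placeholders, not the actual files from this article:

```json
{
  "mcpServers": {
    "xpath-server": {
      "command": "python",
      "args": ["/path/to/xpath_server.py"]
    }
  }
}
```

Once registered, Cursor starts the server process and lists its tools, so the agent can call them directly from the chat.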
Another interesting aspect of using MCP is that it mixes programming and prompt engineering. Creating an MCP tool is like creating a function in our code: once we craft a prompt in the IDE, the model evaluates it and calls our tool with the needed arguments (like the URL to fetch) inferred from the prompt. The tool itself, by contrast, is a programmed function, so there’s none of the fuzziness of a prompt: given an input, we know what output to expect. This is great because it reduces the uncertainty of the whole process and, at the same time, opens up a myriad of use cases for enriching the context of your chat with the LLM.
Last but not least, MCP is a standard, and theoretically, once we have a server up and running, we could plug it into any model and tool that supports it. However, the fact that OpenAI does not currently support it is not a good sign for the protocol's diffusion.
What we’re going to use to implement MCP
We’ll use two key technologies to implement our solution: the MCP Python SDK and Camoufox.
MCP Python SDK. The MCP Python SDK is the official library for building MCP servers (and clients) in Python. It implements the full MCP specification, handling all the underlying protocol messaging so you can focus on defining the resources and tools you need.
With this SDK, you can create an MCP server in just a few lines of code. It provides a high-level FastMCP server class that manages connections and lets you register functions as tools or resources via decorators. For example, you can annotate a function with @mcp.tool() to expose it to the model as an actionable tool.
The SDK packages the function’s result and sends it back to the AI client in a consistent format. In summary, setting up an MCP server that can feed context to LLMs or perform tasks on their behalf is straightforward.
Camoufox for HTML retrieval. In web scraping, getting the raw HTML from target pages (especially those with anti-scraping measures or heavy JavaScript) is half the battle. I decided to use Camoufox, an open-source stealth browsing tool designed for such scenarios, to be reasonably sure of getting the HTML from almost any page. This works especially well because the MCP server runs locally on my machine, so I won’t need any proxy. Beyond its stealth capabilities, I wanted to use Camoufox to build the MCP logic from scratch. If you want to save time, you can use the BrowserBase MCP server or the Hyperbrowser one. They come with prebuilt tools for extracting data and interacting with the page, making life easier for us.
Technical Implementation of an MCP Server for writing a Camoufox scraper
Now, we’ll build the MCP server that helps us write a Camoufox scraper in three steps. Each step of the process has its own tool:
fetch_page_content will be a tool that opens Camoufox and stores the HTML of the target page on a file.
generate_xpaths will be a tool that reads the HTML file and, given a template of the output data, creates the selectors, again saving them to a file. We want it flexible enough to handle different page types (for example, a product listing page vs. a product detail page in e-commerce).
write_camoufox_scraper will be a tool that reads the selectors and a Camoufox spider template (Camoufox_template.py) and creates a new spider based on both.
The server code (xpath_server.py) will be saved in the repository under folder 79.MCP/MCPFiles.
The scripts mentioned in this article are in the GitHub repository's folder 79.MCP, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
Step 1: Set Up the Environment
First, make sure you have Python 3.10+ installed, then install the necessary packages. We’ll need the MCP SDK and Camoufox, both of which you can install via pip.
pip install mcp camoufox
The mcp package includes the MCP Python SDK and CLI tool. Camoufox may require an additional step to fetch its browser binary (for example, running python -m camoufox fetch to download the stealth Firefox – refer to Camoufox docs for details). Once these are installed, you’re ready to write the server code.