Writing a web scraper with ChatGPT. Is it a good idea?
Is AI really improving our productivity when we write web scrapers?
This article is sponsored by MobileHop, your mobile IP proxy provider.
MobileHop provides native mobile IPs on dedicated 4G/5G modems via Verizon and AT&T Wireless to bypass almost all website blocks. A single multihop license gives you access to 50 USA markets and growing!
In November, when OpenAI released ChatGPT, built on GPT-3.5, the news was everywhere. I wrote about AI and web scraping on that occasion and, since then, in every interview I've conducted I've asked my guests for their point of view on the state of AI in the web scraping industry.
Five months later we have GPT-4, and tons of applications have been built on top of GPT models, so it's time to take a closer look at AI for web scraping.
Can AI write scrapers for us?
At the moment, we cannot expect ChatGPT to write a fully working scraper for a chosen website. It will return a syntactically correct scraper, but with generic selectors that are not useful for our case. If we ask it to scrape some well-known website, it might return a correct mapping, but only if the answer was already available somewhere like Stack Overflow.
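To give an idea, a generic request like "write a Scrapy spider for an e-commerce site" typically produces a skeleton along these lines. The URL and selectors below are placeholders I made up for illustration, not anything that works on a real page:

import scrapy

class GenericProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']  # placeholder URL

    def parse(self, response):
        # Generic selectors like these rarely match a real site's markup
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2.title::text').get(),
                'price': product.css('span.price::text').get(),
            }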
Given that, let's try to build a scraper from scratch with ChatGPT for a niche website such as https://www.gianvitorossi.com/it_it/
I'll go straight to the final prompt I wrote to get a correct Scrapy spider, where I asked ChatGPT to map four fields and explained how to navigate the website.
Create a Python web scraper using the Scrapy framework and XPath selectors.
The start_urls parameter should contain the URL https://www.gianvitorossi.com/it_it to be scraped.
After scraping the first page, the scraper should enter a function called get_categories.
In this function the scraper extracts all the relative links to different product types, using the following XPath selector: product_pages = response.xpath('//a[@role="menuitem"]/@href').extract()
For each product page, it should enter a function called get_product_list.
Extract the list of products on the page using the XPath selector '//a[contains(@class, "b-product_tile-image_link")]/@href'.
If there are more products to be shown, crawl to the next product page using the XPath ('//a[data-event-click.prevent="loadMore"]/@href').extract()[0], which contains the absolute path to the next page, and include this selector in a try/except clause. If there is no match for the selector, a pass instruction should be used in the except clause.
For each product extracted, enter a function called get_product_details.
Write the logic to scrape the following fields:
- product code
- full_price
- price
- product_url
The product detail page contains a JSON with most of the data we need. It can be found by looking for the first occurrence of the script type="application/ld+json" and extracting the text from there. It follows the standard Schema.org Product type, and you can map the field product_code to the JSON's mpn field.
The field price can be mapped by entering the offers dict in the JSON and selecting the price field.
The product_url field can instead be mapped using the response.url value.
Since there are no discounts on this website, full_price field can be mapped as price field.
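The prompt relies on the structured data embedded in each product page. For reference, a Schema.org Product block of this kind looks roughly like the following (the values here are illustrative, not taken from the actual site):

{
  "@context": "http://schema.org/",
  "@type": "Product",
  "name": "Example product",
  "mpn": "G12345",
  "offers": {
    "@type": "Offer",
    "price": "650.00",
    "priceCurrency": "EUR"
  }
}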
And here’s the code given in the response:
import scrapy
import json


class GianvitoRossiSpider(scrapy.Spider):
    name = 'gianvitorossi'
    start_urls = ['https://www.gianvitorossi.com/it_it']

    def parse(self, response):
        # Extract product categories
        product_pages = response.xpath('//a[@role="menuitem"]/@href').extract()
        for product_page in product_pages:
            yield scrapy.Request(response.urljoin(product_page), callback=self.get_product_list)

    def get_product_list(self, response):
        # Extract product links
        product_links = response.xpath('//a[contains(@class, "b-product_tile-image_link")]/@href')
        for link in product_links:
            yield scrapy.Request(response.urljoin(link.extract()), callback=self.get_product_details)

        # Check for more product pages
        try:
            next_page = response.xpath('//a[data-event-click.prevent="loadMore"]/@href').extract()[0]
            yield scrapy.Request(response.urljoin(next_page), callback=self.get_product_list)
        except:
            pass

    def get_product_details(self, response):
        # Extract JSON data
        json_data = response.xpath('//script[@type="application/ld+json"]/text()').extract_first()
        data = json.loads(json_data)

        # Extract product details
        product_code = data['mpn']
        full_price = data['offers']['price']
        price = full_price
        product_url = response.url

        # Return item
        yield {
            'product_code': product_code,
            'full_price': full_price,
            'price': price,
            'product_url': product_url
        }
The scraper runs correctly and retrieves all the items. But did writing the prompt needed to create the spider require less time than writing the spider by myself? It's difficult to say. In both cases I had to write down the XPath selectors, so the AI didn't save me time there. Assuming we have a properly engineered prompt and a reusable Scrapy spider template, the time spent modifying the prompt or the scraper is roughly the same. The only improvement in the process is that ChatGPT wrote the scraper without any bugs.
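If you want to reproduce the test, a single-file spider like this one can be launched with Scrapy's runspider command (the file name here is just an assumption):

scrapy runspider gianvitorossi.py -o products.json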
Bugfixing with AI
Let's test the utility of ChatGPT when we're facing a bug in our scrapers. I'll modify the previous spider by deleting a quote from an XPath expression and renaming a variable to one that was never initialized.
    def get_product_details(self, response):
        # Extract JSON data
        json_data = response.xpath('//script[@type="application/ld+json"]/text()).extract_first()
        data = json.loads(json_data)

        # Extract product details
        product_code = data['mpn']
        full_price = data['offers']['price']
        price = full_price
        product_url = response.url

        # Return item
        yield {
            'product_code': product_code3,
            'full_price': full_price,
            'price': price,
            'product_url': product_url
        }
Will ChatGPT be able to fix it?
Yes, and not only did it return the corrected code, it also explained what it fixed.
The error in the original code was a syntax error in the XPath selector used to extract the JSON data. The closing quotation mark was placed incorrectly, resulting in an invalid selector. The corrected code places the closing quotation mark in the correct location, allowing the selector to extract the JSON data correctly. Additionally, there was a typo in the name of the product_code field in the yield statement, which has been fixed in the corrected code.
This is a helpful way to use ChatGPT, since we sometimes spend a lot of time hunting down the solution to these silly mistakes.
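Beyond fixing outright bugs, the same conversation can be used to tighten fragile spots. As a minimal sketch, assuming we keep the same pagination logic, the bare try/except could be replaced with Scrapy's extract_first(), which returns None instead of raising when there is no match:

        # Safer pagination: extract_first() returns None when nothing matches
        next_page = response.xpath('//a[data-event-click.prevent="loadMore"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.get_product_list)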
Porting to another language
Instead of creating a scraper with Python, I then used the same prompt to create a Node.js scraper. I don't know anything about this language, so let's see if ChatGPT can help me create my first scraper in Node.js.
Here’s the result:
const axios = require('axios');
const cheerio = require('cheerio');

const startUrl = 'https://www.gianvitorossi.com/it_it';

async function scrape(url) {
    try {
        const response = await axios.get(url);
        const $ = cheerio.load(response.data);

        // Extract product categories
        const productPages = $('a[role="menuitem"]').map(function() {
            return $(this).attr('href');
        }).get();

        for (const productPage of productPages) {
            await getProducts(`https://www.gianvitorossi.com${productPage}`);
        }
    } catch (error) {
        console.error(error);
    }
}

async function getProducts(url) {
    try {
        const response = await axios.get(url);
        const $ = cheerio.load(response.data);

        // Extract product links
        const productLinks = $('a.b-product_tile-image_link').map(function() {
            return $(this).attr('href');
        }).get();

        for (const link of productLinks) {
            await getProductDetails(`https://www.gianvitorossi.com${link}`);
        }

        // Check for more product pages
        try {
            const nextPage = $('a[data-event-click.prevent="loadMore"]').attr('href');
            await getProducts(`https://www.gianvitorossi.com${nextPage}`);
        } catch (error) {
            // No more product pages
        }
    } catch (error) {
        console.error(error);
    }
}

async function getProductDetails(url) {
    try {
        const response = await axios.get(url);
        const $ = cheerio.load(response.data);

        // Extract JSON data
        const jsonText = $('script[type="application/ld+json"]')[0].children[0].data;
        const data = JSON.parse(jsonText);

        // Extract product details
        const productCode = data.mpn;
        const fullPrice = data.offers.price;
        const price = fullPrice;
        const productUrl = response.config.url;

        // Output data
        console.log({
            productCode: productCode,
            fullPrice: fullPrice,
            price: price,
            productUrl: productUrl
        });
    } catch (error) {
        console.error(error);
    }
}

scrape(startUrl);
This is great. The scraper is correct and works perfectly.
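For anyone following along, the script only needs its two dependencies installed before running (scraper.js is an assumed file name):

npm install axios cheerio
node scraper.js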
Final remarks
I have to admit I'm impressed by the results of these tests. While it's true it didn't save me much time when I wrote the Python scraper, it enabled me to write the Node.js one.
AI can lower the barriers to adopting a new language: it can help you develop scrapers while you're still studying the language, and it can help you debug the code you write. It doesn't substitute good old hands-on practice, but it can help you learn faster.
In the end, AI at the moment is more an aid than a threat that could replace humans in the near future.