Scraping E-Commerce websites 101
How to approach scraping e-commerce websites before you start coding.
This post is sponsored by Smartproxy, the premium proxy and web scraping infrastructure focused on the best price, ease of use, and performance.
For all readers of The Web Scraping Club, the discount code WEBSCRAPINGCLUB10 gives 10% off every purchase.
The Web Scraping Club is a free weekly newsletter about web scraping. Once every two weeks, I publish The Lab, paid content with deep dives into more technical aspects and solutions to common issues, along with code in a GitHub repository. You can access the articles listed at the end of this post with a 7-day trial and then subscribe if you find them useful.
In this post, we’ll step away from the technical aspects of web scraping for a moment to better understand what it means to scrape an e-commerce website and which aspects should be weighed when approaching such a project.
I’ll draw on my several years of experience scraping e-commerce websites, but I’d be glad to hear more from you in the comments and in the polls inside this article.
How many types of data are inside an e-commerce website?
When I think of scraping an e-commerce website, I immediately think about prices, because that’s what I’ve scraped for years.
But prices are only a small part of the data we can extract and, depending on the industry in which the target website operates, some types of data are more important than others.
Here’s a list of what can be scraped from e-commerce websites (and I’m pretty sure it isn’t exhaustive):
prices and promotions, of course, to discover pricing strategies for brands and websites
product features, useful when you need to discover trends and patterns in upcoming products or in competitors’ offerings
reviews, to understand the sentiment of the buyers
availability, for checking the inventories of sellers for products or their variants
positioning, to understand if a brand or a product is placed in a highlighted place on a website
distribution, to see which brands are present across a group of websites and which are missing
And I’m sure I’m missing other aspects.
Unfortunately, I cannot add more options to the poll, so I had to merge some of them.
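To make the distribution point more concrete, here is a minimal sketch of how it could be computed once the brand lists have been scraped. The website names, brand names, and the `brand_distribution` helper are all hypothetical and only serve as an illustration; the input is assumed to be a mapping from each website to the set of brands found on it.

```python
from collections import defaultdict


def brand_distribution(catalog: dict[str, set[str]]) -> dict[str, dict[str, list[str]]]:
    """For each brand, list the websites that carry it and those that do not."""
    all_sites = sorted(catalog)
    present = defaultdict(set)
    for site, brands in catalog.items():
        for brand in brands:
            present[brand].add(site)
    return {
        brand: {
            "present_on": sorted(sites),
            "missing_from": [site for site in all_sites if site not in sites],
        }
        for brand, sites in sorted(present.items())
    }


# Hypothetical scraped data: website -> brands found on it.
catalog = {
    "shop-a.example": {"BrandX", "BrandY"},
    "shop-b.example": {"BrandX"},
    "shop-c.example": {"BrandY", "BrandZ"},
}
report = brand_distribution(catalog)
for brand, info in report.items():
    print(brand, "is missing from", info["missing_from"])
```

The same structure can be extended with counts per website if you also need to know how many products each brand lists.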
Is it legal to scrape e-commerce websites?
Yes, as long as we respect the general rules for ethical web scraping:
Do not harm the target website’s functionality
Do not save copyrighted data or items
Do not scrape anything behind a log-in
Do not break Terms of Service that must be accepted explicitly (clickwrap)
Use an API whenever possible to reduce the load on the target website
Do not interfere with the target website’s business and operations
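One practical way to respect the first rule is to space out requests so the scraper never hammers the target. The snippet below is only a sketch of that idea: the `Throttle` class and the two-second delay are my own assumptions, not a standard or a recommended value, and the right delay depends on the website.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one website."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_seconds = min_delay_seconds
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request_at = None

    def wait(self):
        """Sleep until the minimum delay has elapsed; return the seconds slept."""
        slept = 0.0
        if self._last_request_at is not None:
            elapsed = self._clock() - self._last_request_at
            remaining = self.min_delay_seconds - elapsed
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request_at = self._clock()
        return slept


# Usage sketch (the delay value is an arbitrary assumption):
# throttle = Throttle(min_delay_seconds=2.0)
# for url in urls:
#     throttle.wait()
#     ...fetch url with the HTTP client of your choice...
```

Calling `wait()` before every request keeps the pace constant regardless of how fast the parsing code runs.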
E-commerce websites in different industries have much the same structure:
A product list page, usually abbreviated as PLP, where you have different products that match a certain filter. It’s like a bookshelf in a library.
A product detail page, usually abbreviated as PDP, where you have all the details of a single product, like a single book inside the library.
Depending on the type of data you need to extract, the scraper could enter only the product list page or also the details.
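This decision can be sketched as a simple check on the fields you need. The `PLP_FIELDS` set below is an assumption about what a list page typically exposes; real websites differ, so it should be adjusted for each target. The function name is hypothetical as well.

```python
# Assumption: fields usually visible on a product list page (PLP).
# Real websites differ, so adjust this set per target.
PLP_FIELDS = {"name", "brand", "price", "promotion", "position"}


def pages_to_visit(required_fields: set[str]) -> list[str]:
    """Return which page types the scraper must enter for the requested fields."""
    pages = ["PLP"]  # the list page is always needed to discover the products
    if required_fields - PLP_FIELDS:
        # At least one field is not on the list page, so open the details too.
        pages.append("PDP")
    return pages


print(pages_to_visit({"price", "position"}))  # list page only
print(pages_to_visit({"price", "reviews"}))   # list page and detail pages
```

Staying on the list page whenever possible matters because entering every detail page multiplies the number of requests by the number of products.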
Another classification, useful when looking at e-commerce data from a business perspective, groups websites by their distribution model.
First, there are the so-called monobrand websites, typically operated by the brands themselves to sell their own products, for example, lululemon.com. There is only one seller and one brand sold on such a website.
Then there are multibrand websites, like Footlocker.com, where one seller offers multiple brands to its customers.
Then we have marketplaces, like Amazon.com, where multiple sellers sell multiple brands to the customers.
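The three models boil down to how many sellers and how many brands are on the website, which can be written as a small rule. This is only an illustration: the function name is hypothetical, and treating a website with several sellers but a single brand as a marketplace is my assumption, since that case is not covered above.

```python
def distribution_model(seller_count: int, brand_count: int) -> str:
    """Classify a website by its number of sellers and brands."""
    if seller_count < 1 or brand_count < 1:
        raise ValueError("a website needs at least one seller and one brand")
    if seller_count > 1:
        # Assumption: several sellers always mean a marketplace,
        # even in the unusual case of a single brand.
        return "marketplace"
    if brand_count > 1:
        return "multibrand"
    return "monobrand"


print(distribution_model(seller_count=1, brand_count=1))
print(distribution_model(seller_count=1, brand_count=40))
print(distribution_model(seller_count=5000, brand_count=20000))
```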
Each of these website types poses different challenges when you need to scrape them.
Monobrands typically have few items to scrape, but they are highly localized across regions. This means you can expect Chanel’s website in China to differ from its website in Europe, for example.
Multibrands have more items to collect and sometimes have a strong anti-bot solution, to avoid fraud or discourage web scraping.
Marketplaces are usually huge, and the main challenge is to split the execution wisely so that you get all the data you need without going broke. They may also have different product page layouts and structures depending on the seller.
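As an illustration of splitting the execution, the sketch below packs categories into separate runs so that each run stays within a page budget. The category names, page estimates, and budget are made up, and the greedy strategy is just one simple option among many, not a recommendation for every marketplace.

```python
def split_into_runs(page_estimates: dict[str, int], max_pages_per_run: int) -> list[list[str]]:
    """Greedily pack categories into runs that stay within the page budget.

    A category larger than the budget still gets a run of its own,
    so it would need to be split further (for example by sub-category).
    """
    runs, current, current_pages = [], [], 0
    # Largest categories first, with the name as a tie-breaker for stable output.
    ordered = sorted(page_estimates.items(), key=lambda item: (-item[1], item[0]))
    for category, pages in ordered:
        if current and current_pages + pages > max_pages_per_run:
            runs.append(current)
            current, current_pages = [], 0
        current.append(category)
        current_pages += pages
    if current:
        runs.append(current)
    return runs


# Hypothetical estimates of list pages per category.
estimates = {"shoes": 500, "bags": 300, "hats": 200, "belts": 100}
runs = split_into_runs(estimates, max_pages_per_run=600)
print(runs)
```

Each run can then be scheduled separately, which makes it easier to keep an eye on proxy and infrastructure costs per execution.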
Scraping e-commerce websites can provide valuable insights for businesses and individuals alike.
However, it’s crucial to consider the peculiarities and challenges of each type of e-commerce website.
By understanding these nuances, you can design a more effective and efficient web scraping strategy and make the most of the data extracted from these websites.
The Lab - premium content with real-world cases
THE LAB #14: Scraping Cloudflare Protected Websites (early 2023 version)
THE LAB #8: Using Bezier curves for human-like mouse movements
THE LAB #6: Changing Ciphers in Scrapy to avoid bans by TLS Fingerprinting
THE LAB #4: Scrapyd - how to manage and schedule a fleet of scrapers
THE LAB #2: scraping data from a website with Datadome and xsrf tokens