Scraping E-Commerce websites 101
How to approach scraping e-commerce websites before you start coding.
How many types of data are inside an e-commerce website?
When I think of scraping an e-commerce website, I immediately think of prices, because that’s what I’ve scraped for years.
But this is only a small part of the data that we can extract from it and, depending on the industry where the target website is operating, some types of data are more important than others.
Here’s a list of what can be scraped from e-commerce websites (and I’m pretty sure it isn’t exhaustive):
prices and promotions, of course, to discover pricing strategies for brands and websites
product features, useful for spotting trends and patterns in newly launched products or in competitors’ offerings
reviews, to understand the sentiment of the buyers
availability, to check sellers’ inventory for products or their variants
positioning, to understand whether a brand or a product gets a prominent placement on a website
distribution, to see which brands are present across a group of websites, and which are missing
And I’m sure I’m missing other aspects.
Is it legal to scrape e-commerce websites?
Yes, as long as we respect the general rules for ethical web scraping:
Do not harm the target website’s functionality
Do not save copyrighted data or items
Do not scrape anything behind a log-in
Do not break Terms of Service that you have to accept explicitly (clickwrap)
Use an API whenever possible to reduce the load on the target website
Do not interfere with the target website’s business and operations.
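One practical way to follow the first rule is to honor the website's robots.txt and pace your requests. Here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the "my-bot" user agent, the shop.example URLs, and the helper names are all made up for illustration, and robots.txt is only one signal of what a website tolerates, not a legal guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
Disallow: /account/
Crawl-delay: 2
"""


def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a checker object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str, default: float = 1.0) -> float:
    """Seconds to wait between requests: the declared Crawl-delay, or a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


parser = build_robots_checker(SAMPLE_ROBOTS)
product_allowed = parser.can_fetch("my-bot", "https://shop.example/products/123")
checkout_allowed = parser.can_fetch("my-bot", "https://shop.example/checkout/cart")
wait_seconds = polite_delay(parser, "my-bot")
```

In a real scraper you would download the robots.txt of the target website, skip every URL for which can_fetch returns False, and sleep for wait_seconds between requests.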
A short glossary
E-commerce websites in different industries share much the same structure:
A product list page, usually abbreviated as PLP, where you have different products that match a certain filter. It’s like the bookshelf in a library.
A product detail page, usually abbreviated as PDP, where you have all the details of a single product, like a single book inside the library.
Depending on the type of data you need to extract, the scraper may need to visit only the product list pages, or the product detail pages as well.
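To make that choice concrete, here is a small Python sketch that decides which page types a scraper has to visit for a given set of required fields. The field names and the split between PLP and PDP fields are my assumptions for illustration; every website exposes different data on each page type, so you have to check the target first.

```python
# Hypothetical field sets: what is assumed to be visible on each page type.
PLP_FIELDS = {"name", "price", "product_url", "position"}
PDP_ONLY_FIELDS = {"description", "reviews", "availability_by_variant"}


def pages_to_visit(required_fields: set) -> list:
    """Return the page types needed to collect the required fields."""
    unknown = required_fields - PLP_FIELDS - PDP_ONLY_FIELDS
    if unknown:
        raise ValueError(f"Unknown fields: {sorted(unknown)}")
    if required_fields <= PLP_FIELDS:
        # Everything is on the list page: no need to open each product.
        return ["PLP"]
    # The PLP is still needed to discover the product URLs.
    return ["PLP", "PDP"]


price_only = pages_to_visit({"name", "price"})
with_reviews = pages_to_visit({"price", "reviews"})
```

Stopping at the PLP matters for cost: a list page often shows dozens of products in one request, while each PDP costs one request per product.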
Another classification, useful when looking at e-commerce data from a business perspective, groups websites by their distribution model.
First, there are the so-called monobrand websites, typically operated by brands to sell their own products, for example lululemon.com. There is only one seller and one brand on such a website.
Then there are multibrand websites, like Footlocker.com, where one seller offers multiple brands to its customers.
Then we have marketplaces, like Amazon.com, where multiple sellers sell multiple brands to the customers.
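The three models boil down to how many sellers and how many brands a website hosts, which can be sketched in a few lines of Python. The enum and the classify function are hypothetical names of mine, and treating any website with more than one seller as a marketplace is a simplifying assumption.

```python
from enum import Enum


class DistributionModel(Enum):
    MONOBRAND = "monobrand"      # one seller, one brand
    MULTIBRAND = "multibrand"    # one seller, many brands
    MARKETPLACE = "marketplace"  # many sellers, many brands


def classify(seller_count: int, brand_count: int) -> DistributionModel:
    """Classify a website by its number of sellers and brands."""
    if seller_count < 1 or brand_count < 1:
        raise ValueError("A website needs at least one seller and one brand")
    if seller_count > 1:
        # Assumption: several sellers means marketplace, whatever the brand count.
        return DistributionModel.MARKETPLACE
    if brand_count > 1:
        return DistributionModel.MULTIBRAND
    return DistributionModel.MONOBRAND


brand_site = classify(seller_count=1, brand_count=1)
retailer = classify(seller_count=1, brand_count=250)
marketplace = classify(seller_count=10_000, brand_count=50_000)
```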
Each of these website types poses different challenges when you need to scrape them.
Monobrands typically have few items to scrape, but they are highly localized across regions. You can expect Chanel’s website in China to differ from the one in Europe, for example.
Multibrands have more items to collect and sometimes deploy a strong anti-bot solution to prevent fraud or discourage web scraping.
Marketplaces are usually huge, and the main challenge is splitting the execution wisely so that you get all the data you need without going broke. They may also have different product page layouts and structures depending on the seller.
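As a rough illustration of splitting the execution, the sketch below groups categories into batches that each stay under a request budget, so a run can be scheduled, parallelized, or costed piece by piece. The category names, product counts, page size, and budget are invented numbers, and the function names are hypothetical; it only counts PLP requests and ignores PDP visits and retries.

```python
import math


def plp_requests(product_count: int, page_size: int) -> int:
    """Number of list-page requests needed to page through a category."""
    return math.ceil(product_count / page_size)


def split_into_batches(categories: dict, page_size: int, max_requests: int) -> list:
    """Greedily group categories so each batch stays within max_requests.

    A category that exceeds the budget on its own gets a batch to itself.
    """
    batches, current, used = [], [], 0
    for name, product_count in categories.items():
        needed = plp_requests(product_count, page_size)
        if current and used + needed > max_requests:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += needed
    if current:
        batches.append(current)
    return batches


# Invented product counts per category, 48 products per list page.
categories = {"shoes": 480, "bags": 120, "watches": 960, "hats": 48}
batches = split_into_batches(categories, page_size=48, max_requests=15)
```

With these numbers, shoes and bags need 10 and 3 requests and share a batch, while watches needs 20 and is isolated, which is a hint that such a category should be split further, for example by brand or price range.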
Wrap up
Scraping e-commerce websites can provide valuable insights for businesses and individuals alike.
However, it's crucial to consider the peculiarities and challenges associated with each type of e-commerce platform.
By understanding these nuances, you can design a more effective and efficient web scraping strategy and make the most of the data extracted from these websites.