THE LAB #26: From internal API to insights.
Getting insights on the automotive industry by scraping a car resell website.
Internal API?
When approaching a new scraping project, a study phase is advisable, if not strictly necessary. One of the first steps is to understand how the website works: if it has dynamic content, such as product listings on an e-commerce site, that means understanding how this data is loaded.
Very often, this is done via APIs: depending on the page you’re loading, a request is made to an internal API endpoint and the results are rendered on the front end. Later in this post, we’ll see how to spot these APIs with your browser and use them to scrape data from a website.
If APIs are available and return all the data your scraper needs, the scraper should use them. APIs are usually more stable than HTML markup, they’re made to be queried (with proper throttling), and their responses carry far less overhead than full pages, which makes the scraping lighter on both the server and your bandwidth.
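To make this concrete, here is a minimal sketch of a throttled, paginated scraper for an internal API. Everything specific in it is an assumption for illustration: the endpoint URL, the `page` query parameter, and the `items` key in the response are hypothetical and must be replaced with whatever the target website actually uses.

```python
import json
import time
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace it with the one the website really calls.
API_URL = "https://www.example.com/api/listings"


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def scrape_listings(max_pages, delay_seconds=1.0, fetch=fetch_json):
    """Collect items page by page, pausing between requests (throttling)."""
    items = []
    for page in range(1, max_pages + 1):
        # The "page" parameter name is an assumption about the API.
        url = API_URL + "?" + urllib.parse.urlencode({"page": page})
        payload = fetch(url)
        # The "items" key is an assumption about the response shape.
        page_items = payload.get("items", [])
        if not page_items:
            break  # an empty page is treated as the end of the results
        items.extend(page_items)
        time.sleep(delay_seconds)  # be gentle with the server
    return items
```

The `fetch` argument is there only so the pagination logic can be tried without touching the network, for example by passing a function that returns canned responses.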
And when no API is available?
If no API is available, we should check the HTML code and look for embedded JSON containing the data we need. It’s not rare, especially on websites built with Next.js, to find in the HTML a tag like
<script id="__NEXT_DATA__" type="application/json">
and then the JSON containing the data that populates the dynamic part of the web page.
This is the second-best approach for web scraping. Using the JSON embedded in the HTML code brings no advantage in bandwidth, but it should at least be more stable than scraping the HTML with selectors.
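As a rough sketch of this second approach, the snippet below pulls the JSON out of a `__NEXT_DATA__` script tag using only Python's standard library. The helper names (`NextDataParser`, `extract_next_data`) are made up for this example, and the keys you will find inside the JSON depend entirely on the website.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._inside:
            self._chunks.append(data)

    def payload(self):
        raw = "".join(self._chunks)
        return json.loads(raw) if raw else None


def extract_next_data(html):
    """Return the embedded JSON as a Python object, or None if absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    return parser.payload()
```

Once the page's HTML has been downloaded, `extract_next_data(html)` returns a regular dictionary that can be navigated like any API response, with no selectors involved.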
Finally, if neither an API nor embedded JSON is available, we have to fall back on writing our own selectors for the plain HTML code.
How to find internal APIs on a website
As we said before, when there’s some dynamic content on a website, there’s a chance that it’s loaded by an internal API and we can intercept it using the browser.
The way I usually observe what’s happening under the hood of a website is by opening the browser’s developer tools on the Network tab.
With this view, you can see what’s happening in your browser in real time. You can track the requests happening in the background, typically made by user tracking services, and when you click somewhere on the website, you can see which requests that action triggers.
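When the Network tab gets crowded, it can help to export the captured traffic as a HAR file (both Chrome and Firefox offer this in their developer tools) and filter it offline. The sketch below lists the requests whose responses were JSON, which are usually the best candidates for an internal API. The function names are made up for this example; the field names follow the HAR format.

```python
import json


def load_har(path):
    """Read a HAR file exported from the browser's developer tools."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_json_requests(har):
    """Return (method, url) pairs for entries whose response is JSON."""
    candidates = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        mime_type = content.get("mimeType", "")
        # Matches "application/json" and variants such as "application/ld+json".
        if "json" in mime_type.lower():
            request = entry.get("request", {})
            candidates.append((request.get("method"), request.get("url")))
    return candidates
```

Calling `find_json_requests(load_har("capture.har"))`, where `capture.har` is whatever file name you chose when exporting, gives a short list of endpoints to inspect one by one instead of scrolling through every image, stylesheet, and tracking call.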