THE LAB #31: Scraping location data using a world grid
Building a fundamental tool for scraping location data in a cost-effective way
One of the most popular dataset categories created with web scraping is location data: it may be reviews of places, accommodation prices and occupancy, store locators, and so on. However, extracting this data, particularly when coordinates are required as inputs for website APIs, presents a peculiar set of challenges.
Let’s analyze two of the most common cases we can encounter: radius-based and grid-based APIs.
Store locators - Radius-based API
One of the most common location datasets is store locator data. It’s used by brands and investors to understand how retail operations are going: a growing number of stores can be a good signal for the health of a retailer, while a shrinking one can be the consequence of major troubles inside the company.
Since all these retail locations need to be found by people, almost every brand has its own store locator on its website, and they all look pretty much the same.
There’s a map, a list of stores, and typically an internal API that draws dots on the map.
This case is no exception: the stores on the map are retrieved with the following API call
https://www.alexandermcqueen.com/on/demandware.store/Sites-AMQ-WEUR-Site/it_IT/Stores-FindStoresData?countryCode=US
where the countryCode parameter filters the results by country.
That’s the easiest kind of store locator to scrape: just call the API while iterating over every country in the world and you’ll get all the locations you need.
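As a rough sketch (not production code), the loop could look like this in Python; the list of country codes and the shape of the JSON response are assumptions you’d verify by inspecting the API in the browser:

```python
import requests

# Store locator endpoint seen in the browser's network tab;
# countryCode filters the returned stores by country.
API_URL = (
    "https://www.alexandermcqueen.com/on/demandware.store/"
    "Sites-AMQ-WEUR-Site/it_IT/Stores-FindStoresData"
)

# Illustrative subset of ISO 3166-1 alpha-2 codes; a real run would
# iterate over the full list (e.g. from the pycountry package).
COUNTRY_CODES = ["US", "IT", "FR", "GB", "JP"]

all_stores = []
for code in COUNTRY_CODES:
    resp = requests.get(API_URL, params={"countryCode": code}, timeout=30)
    resp.raise_for_status()
    # Assuming a JSON payload with a list of stores under a "stores" key;
    # inspect the real response to confirm the structure.
    all_stores.extend(resp.json().get("stores", []))

print(f"Collected {len(all_stores)} store records")
```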
Another similar case is when we encounter websites like the following:
https://boutique.dolcegabbana.com/index.html?q=34.29208802950000%2C129.85823039930000&qp=34.29208802950000,129.85823039930000&r=65&l=en
In this case, store discovery happens using coordinates and a radius. But with this method, how can we be sure we’re scraping all the locations?
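To make the setup concrete, here is a small helper that builds such a URL for an arbitrary center point; the meaning of q, qp, r (radius in km) and l is assumed from the URL above rather than from any documentation:

```python
from urllib.parse import urlencode

def locator_url(lat: float, lon: float, radius_km: int = 65, lang: str = "en") -> str:
    """Build a store-locator URL for a given center point and radius.

    Parameter meanings (q/qp = "lat,lon", r = radius in km, l = language)
    are assumptions based on the URL observed on the site.
    """
    point = f"{lat:.8f},{lon:.8f}"
    query = urlencode({"q": point, "qp": point, "r": radius_km, "l": lang})
    return f"https://boutique.dolcegabbana.com/index.html?{query}"

print(locator_url(34.292088, 129.858230))
```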
Airbnb - Grid-based APIs
In my life as a “data provider”, I’ve found myself in this situation many times. In the previous example, the radius used by the API is 65 km, but it varies from website to website and could be much smaller.
Take Airbnb as an example: you can scroll the map on the website to find accommodations, but when you zoom in, previously hidden locations pop up.
To populate this portion of Florida, the Airbnb website makes a POST request to its internal API. Among the various filters about the stay, we can spot the zoom level, which is analogous to the radius we saw before, as well as the coordinates of a north-east and a south-west point: the corners of the rectangular bounding box that contains the results shown.
In fact, when zooming in, the zoom level parameter increases and the two corner points get closer to each other.
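To illustrate the pattern, here is a generic sketch of how such a request body can be built from a bounding box and a zoom level. The field names and example coordinates are placeholders, not Airbnb’s actual parameters, which you’d copy from the real POST request captured in the network tab:

```python
def map_search_payload(sw, ne, zoom_level, filters=None):
    """Build a POST body for a grid-based map API from a bounding box.

    `sw` and `ne` are (lat, lon) tuples for the south-west and north-east
    corners of the visible map area. Field names here are placeholders:
    each website uses its own, so inspect the real request and copy them.
    """
    payload = {
        "swLat": sw[0],
        "swLng": sw[1],
        "neLat": ne[0],
        "neLng": ne[1],
        "zoomLevel": zoom_level,
    }
    payload.update(filters or {})
    return payload

# Example bounding box roughly covering a portion of southern Florida
print(map_search_payload(sw=(25.5, -81.9), ne=(26.4, -80.0), zoom_level=9))
```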
The need for a (smart) grid of the world
The last two examples show how location data is generally retrieved when displayed on a map. Either we pass a point and get all the places inside an imaginary circle with a given radius, or we pass a polygon, typically a square or a rectangle, that contains the locations, providing the coordinates of at least two of its corners.
To handle both cases, we should split the world into sectors, ideally squares, small enough for any scenario: the center of each square can act as the center of a circle with a given radius (first case), while its corners can be passed to bounding-box APIs like the second one.
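Before discussing sizing, here is a minimal sketch of the idea in Python. It builds cells directly in latitude/longitude degrees, crudely correcting the longitude step by the cosine of the latitude and ignoring map projections entirely, so treat it as a starting point rather than the tool from the repository:

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate km per degree of latitude

def world_grid(cell_km: float, lat_min=-60.0, lat_max=75.0):
    """Yield square-ish cells of roughly `cell_km` per side covering the
    given latitude band. Each cell carries its center (for radius-based
    APIs) and its SW/NE corners (for bounding-box APIs)."""
    d_lat = cell_km / KM_PER_DEG_LAT
    lat = lat_min
    while lat < lat_max:
        # km per degree of longitude shrinks with the cosine of the latitude
        km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(lat + d_lat / 2))
        d_lon = cell_km / max(km_per_deg_lon, 1e-6)
        lon = -180.0
        while lon < 180.0:
            yield {
                "center": (lat + d_lat / 2, lon + d_lon / 2),
                "sw": (lat, lon),
                "ne": (min(lat + d_lat, lat_max), min(lon + d_lon, 180.0)),
            }
            lon += d_lon
        lat += d_lat

# Example: ~100 km cells, counting how many requests a full sweep would need
cells = list(world_grid(100))
print(f"{len(cells)} cells to query")
```

Each cell’s center can feed a radius-based endpoint (with a radius slightly larger than half the cell’s diagonal, so circles overlap and nothing falls through the gaps), while the sw/ne pairs can feed bounding-box endpoints like the one seen on Airbnb.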
But how big should these squares be, and how can we create them properly?
It varies from case to case, but we can build some tools that help us out without reinventing the wheel each time. You will find the code for these tools in The Web Scraping Club GitHub repository reserved for our paying readers. If you’re one of them and don’t have access to it, please write me at pier@thewebscraping.club