THE LAB #66: How to properly scrape a booking website
Business logic and best practices for scraping booking websites like Airbnb and Booking in the most efficient way
In the hospitality sector, web data extracted from platforms like Airbnb and Booking.com has become instrumental in implementing effective revenue management strategies. Data from these platforms offers granular insights into market dynamics, competitor pricing, booking patterns, occupancy trends, and seasonal variations. For example, tracking real-time competitor pricing helps hoteliers implement dynamic pricing models, enabling them to adjust room rates based on demand fluctuations, special events, and competitor actions to maximize revenue per available room (RevPAR). Additionally, data on listing availability and length of stay requirements across different properties provides valuable context, helping hotels optimize their occupancy strategies and avoid over-reliance on discounting during peak periods.
In addition to pricing and availability, these platforms reveal trends in guest preferences, such as popular amenities or emerging accommodation types. This enables hoteliers to tailor their offerings and more effectively align with market demand.
When collected over a long period, this data can also show trends in demographics or tourism routes, with some locations becoming more popular and others losing their allure.
To extract this data efficiently, sophisticated web scraping and API integration tools are essential, as they can gather structured data on listings at scale. However, optimizing the number of calls is crucial, given the scale of the operations and the peculiarities of the industry that we’re going to explore now.
Key parameters for scraping online travel platforms
If you’ve ever booked a stay online, you’re probably aware that the price depends on several factors, some of them unique to the industry. Let’s briefly go through them.
Length of stay and number of rooms
Of course, the length of the stay and the number of rooms you’re booking are the basic inputs that make up the total price of your vacation. Depending on your business needs, you can exploit these to extract insights from online travel platforms.
If you want to understand how prices behave in a certain area, looking for one room for one night is probably enough, but we’ll see other details later in this section.
However, if you need to estimate the fill rate of the structures, you can use these parameters to your advantage. Depending on the platform you’re querying, you can ask for the availability of ten or twenty rooms on dates in the next 30/60/90 days: if the request cannot be satisfied, fewer rooms than that are still free. For bigger hotels with more than one hundred rooms, this only tells you when the fill rate is over 80% or 90%, while for medium-sized structures the same request fails at a much lower fill rate, so the information is more meaningful.
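The probing logic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `probe_plan` and `fill_rate_lower_bound` are hypothetical helper names, the total number of rooms of each structure is assumed to be known from another source, and the actual availability request to the platform is left out.

```python
from datetime import date, timedelta


def probe_plan(start, horizon_days=30, room_counts=(10, 20)):
    """Build one-night availability probes for every day in the horizon.

    Returns (check_in, check_out, rooms) tuples. How each probe is sent to
    the platform is out of scope here.
    """
    plan = []
    for offset in range(1, horizon_days + 1):
        check_in = start + timedelta(days=offset)
        check_out = check_in + timedelta(days=1)
        for rooms in room_counts:
            plan.append((check_in, check_out, rooms))
    return plan


def fill_rate_lower_bound(total_rooms, rooms_requested, available):
    """Translate a probe result into a bound on the fill rate.

    If the request for `rooms_requested` rooms failed, fewer than that many
    rooms are free, so the fill rate is strictly above the returned value.
    If the request succeeded, the probe gives no lower bound (None).
    `total_rooms` is an assumption: it must come from another source.
    """
    if available:
        return None
    return (total_rooms - rooms_requested) / total_rooms
```

For a 100-room hotel, a failed request for 20 rooms means the fill rate is above 0.8, while the same failed request on a 40-room hotel already signals a fill rate above 0.5.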
At Re Analytics, we have used this approach in the past for hundreds of cities worldwide. It was interesting to see the wave of bookings in a city ahead of a big event, like a Formula 1 GP or some lesser-known exhibition. Several times we had to manually check whether a near-100% fill rate was correct, only to find out that there was an event in the city we were not aware of.
The same approach can be used to check the availability of Airbnb structures, but since these structures host fewer people and are probably not open for the full year, you need to adjust the parameters in your request accordingly.
Room type
If your main focus is revenue optimization or, generally speaking, extracting insights from a structure's nightly prices, one key element for determining the price is selecting a particular room type.
This is true, especially for hotels with different rooms at different prices. Therefore, this should be considered when defining your request to the website.
Advance time
We all know that the price of a last-minute booking will differ from that of a booking made months in advance.
This must be taken into consideration depending on your business needs. If you’re building a hotel price monitoring system to understand the housing market, for example, you can extract the prices of bookings made today for tomorrow, since they are used only for statistical purposes. If you’re building a revenue management system for hotels, you probably want to know how prices are moving for stays one or two months out, so your customers can adjust their prices accordingly.
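As a minimal sketch of how the advance time could drive the requests, the hypothetical helper below turns a list of lead times (in days) into check-in and check-out dates. The function name and the chosen lead times are assumptions made for illustration only.

```python
from datetime import date, timedelta


def stay_windows(today, lead_days, nights=1):
    """Return (check_in, check_out) pairs, one per requested lead time."""
    return [
        (today + timedelta(days=lead), today + timedelta(days=lead + nights))
        for lead in lead_days
    ]


today = date(2025, 3, 1)
# Statistical price monitoring: today for tomorrow.
monitoring = stay_windows(today, lead_days=[1])
# Revenue management: stays roughly one and two months out.
revenue_mgmt = stay_windows(today, lead_days=[30, 60])
```

Keeping the lead times as an explicit list makes it easy to see how many requests each property generates per run.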
Amenities of the structures
Again, this may or may not be interesting, depending on your business needs. If your solution is tailored to maximizing revenues for structure owners, both short-term rentals and hotels, understanding which amenities are available in hotels with above-average prices can be a good input for planning investments or benchmarking your structure. For example, if your hotel has a spa, you can compare the price trends of structures with spas in the following months and adjust your prices accordingly.
How do travel reservation websites work?
One key aspect to consider when you book accommodation online is its location. How far is it from the beach or the city center? Or is it easily reachable by car?
Because of this, every platform has a “map view” feature on its website.
As we pan across the map or zoom in and out, the results displayed change, thanks to the underlying APIs that retrieve listings from the database for a given set of coordinates and a radius (or a viewport delimited by those coordinates).
Here’s a part of the underlying payload passed to the API for this request, where we can see the coordinates of the bounding box passed.
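To make the idea of a viewport concrete, here is a rough sketch that derives a bounding box from a center point and a radius. It is not the real payload of Booking.com or Airbnb: the key names (`sw_lat`, `sw_lng`, `ne_lat`, `ne_lng`) are hypothetical, the conversion uses the common approximation of about 111.32 km per degree of latitude, and the coordinates are only an approximate center of Genova, the city used as an example later in this article.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # rough average, good enough for city-scale boxes


def bbox_from_center(lat, lon, radius_km):
    """Return a bounding box around (lat, lon) covering `radius_km` each way.

    The dictionary keys are hypothetical; each platform uses its own names.
    """
    dlat = radius_km / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = radius_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return {
        "sw_lat": lat - dlat,
        "sw_lng": lon - dlon,
        "ne_lat": lat + dlat,
        "ne_lng": lon + dlon,
    }


# Approximate center of Genova, Italy, with a 5 km radius.
box = bbox_from_center(44.4056, 8.9463, 5)
```

At this latitude the box spans more degrees of longitude than of latitude, which is why the cosine correction matters when tiling a map.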
If we want to map a whole city or state, simulating calls from a user interested in the entire territory is the key to scraping the website.
But how can we scrape all the listings in the portion of territory we’re interested in? When we pass coordinates in the API calls, we don’t know beforehand where the results will fall. We can tell by reading the responses, since they contain the address of each structure, but how can we make our requests more efficient, instead of discovering only afterwards that we’re out of scope?
In a previous article from The Lab about scraping location data, you can find how to build a grid of the world (or a smaller location) and then divide it into smaller squares, ignoring uninhabited portions.
By creating different layers of grids, we can pass them to our scrapers to improve the completeness of the extraction while minimizing the number of requests.
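A layered grid like this can be sketched as a recursive split of a bounding box. The code below is an illustration under assumptions, not the implementation from the previous article: `count_results` is a hypothetical callable standing in for a request that returns how many listings fall inside a box, and `cap` is an assumed limit on how many results the platform returns for a single viewport.

```python
from typing import Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (south, west, north, east)


def split_bbox(bbox: BBox, rows: int = 2, cols: int = 2) -> List[BBox]:
    """Divide a bounding box into rows x cols equal tiles."""
    south, west, north, east = bbox
    dlat = (north - south) / rows
    dlon = (east - west) / cols
    return [
        (south + r * dlat, west + c * dlon,
         south + (r + 1) * dlat, west + (c + 1) * dlon)
        for r in range(rows)
        for c in range(cols)
    ]


def plan_tiles(bbox: BBox, count_results: Callable[[BBox], int],
               cap: int, max_depth: int = 6) -> List[BBox]:
    """Return the tiles worth scraping.

    Empty tiles are dropped, and tiles with more results than `cap` are
    split into a finer layer until they fit or `max_depth` is reached.
    """
    total = count_results(bbox)
    if total == 0:
        return []  # nothing here, no further requests needed
    if total <= cap or max_depth == 0:
        return [bbox]
    tiles: List[BBox] = []
    for child in split_bbox(bbox):
        tiles.extend(plan_tiles(child, count_results, cap, max_depth - 1))
    return tiles


# Toy stand-in for the platform: three fake listings in a unit square.
listings = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9)]


def fake_count(bbox: BBox) -> int:
    south, west, north, east = bbox
    return sum(south <= lat < north and west <= lon < east
               for lat, lon in listings)


tiles = plan_tiles((0.0, 0.0, 1.0, 1.0), fake_count, cap=2)
```

In this toy run, the two empty quadrants are discarded after a single count each, so only the populated tiles are passed on to the scraper.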
But how can we use this approach on a real-world project?
Let’s see it with an example: we want to scrape the hotels in Genova, Italy, from Booking.com.
Disclaimer: I’m not creating a full scraper of Booking.com in the paid section, since it would mean wrapping a GraphQL API, but I’m focusing more on the logic of creating the requests.