How We Scraped Global Hotel Data to Track Economic Trends
Tracking Tourism and Economic Flows with Global Hotel Data Scraping
In 2017, at RE Analytics, we embarked on an ambitious project to monitor global tourism flows on a scale unprecedented for the company at the time. The objective was to track economic and human movements tied to tourism while uncovering long-term trends. Although the project is no longer active today, I’m pleased to share some of the key insights and methodologies with the web scraping and market analysis community.
Our data proved invaluable to global investors and market analysts during critical moments. These included tracking the impact of the 2019 Hong Kong protests, assessing the success of the 2020 Tokyo Olympic Games (postponed to 2021) in driving travel inflows, analyzing how perceptions of terrorist threats in the 2010s reshaped travel to European cities like London, Paris, Berlin, and Milan, monitoring post-2008 financial crisis recovery in major Western cities, and understanding tourism shifts during COVID-19 recovery in collaboration with Bernstein Research. We even evaluated the success of global events like the 2019 CES in Las Vegas in real time. In each instance, we explored the interplay between political, economic, and social events and their influence on travel patterns.
By the end of the project, we had collected and analyzed billions of data points, offering granular and thought-provoking perspectives on global tourism dynamics.
The Challenge
Before the COVID-19 pandemic, the global economy relied heavily on the physical movement of people and goods. Tracking changes in these movements in near real-time—at a daily scale—was the core purpose of this project.
Travel flows, including both tourism and business travel, are often tracked inconsistently across the world, with data typically aggregated in fragmented and diverse ways. This presented a significant challenge when attempting to derive timely and actionable intelligence.
Our aim was to gain real-time insights into how travel patterns responded to major events and broader economic dynamics. By focusing on hotel occupancy data, we sought to explore the interplay between political, economic, and social forces, answering questions such as:
“How are travel flows impacted by political events such as protests or instability?”
“What are the economic implications of major global gatherings or conferences?”
“How do shifting business trends affect regional travel and accommodation demand?”
These insights fed into several critical areas of analysis:
1. Economic Indicators
Travel Health: High occupancy rates signal robust travel activity, reflecting consumer confidence, business investments, and disposable income levels.
Economic Growth: Regions with rising occupancy rates often correlate with broader economic expansion, driven by both leisure and business travel.
Industry Trends: Declining occupancy rates can signal challenges for industries dependent on travel, such as hospitality, transportation, and events.
2. Sector-Specific Investment Decisions
Real Estate and REITs: Hotel occupancy data is crucial for evaluating the performance of hospitality-focused real estate investment trusts (REITs) and other property investments.
Corporate Stocks: Investors in airlines, travel agencies, and hospitality monitor these trends to assess the broader travel sector’s performance.
3. Regional Analysis for Currency and Commodities
Currency Movements: Strong travel inflows can strengthen local currencies, influencing forex markets.
Commodity Impact: Tourism- and business-driven demand for goods and services may indirectly affect commodity consumption patterns.
4. Macro Trends and Shifts
Pandemic Recovery: Post-pandemic recovery rates in hotel occupancy provide insight into how economies are adapting and rebounding.
Geopolitical Stability: Regional declines in travel can signal instability, making it a proxy for broader macroeconomic or political concerns.
5. Inflation and Pricing Trends
Demand-Side Inflation: High occupancy rates can lead to price increases in accommodation, influencing inflation trends in travel-heavy regions.
Supply Constraints: Combined with supply chain issues, rising occupancy rates can exacerbate price pressures, offering actionable insights for inflation-focused investors.
Why Hotels?
We chose to monitor hotel occupancy rates, as opposed to other travel-related metrics, for several key reasons:
Established Industry with Accessible Data
Hotels represent a well-established global industry with consistent online availability of data, making them ideal for web scraping. Unlike other sectors, the hotel industry provides standardized interfaces across most platforms, facilitating efficient data collection.
Coverage Across Transportation Types
Monitoring hotels captures travel behavior that spans multiple modes of transportation (car, train, flight), offering a broader perspective than focusing solely on flight data, which excludes other forms of travel.
Global Uniformity
Hotels are a relatively uniform service type worldwide. The process of booking a hotel in Tokyo is fundamentally the same as in Bogotá, San Francisco, or Cape Town. This consistency simplifies data collection and analysis on a global scale.
Real Estate Market Correlation
Hotel data provides one of the quickest indicators of real estate market trends. Hotels represent the smallest rentable fraction of real estate, offering insights into demand patterns and price shifts. I remember the stunned reactions when I presented this view at a Berlin conference in 2017. However, it has since proven accurate, as seen in cities across Europe where the rise of short-term rentals, such as Airbnb, has driven up real estate prices, sparking public protests and regulatory debates.
Airbnb’s Evolving Role
In 2017, Airbnb was not as prevalent as it is today, so it wasn’t included in the project. If we were starting this initiative in 2024, Airbnb would undoubtedly be a critical component, given its disruptive influence on both the travel and real estate markets.
What Metrics Did We Use?
Ideally, the most comprehensive metric would be the total nights spent per hotel globally. Unfortunately, web scraping does not yet provide direct access to this level of detail, so we needed to develop proxy metrics to achieve similar insights.
Our best approach was to monitor hotel occupancy rates through online platforms such as hotel websites and online travel agencies (OTAs) like Booking.com, Expedia, and others. By continuously tracking how many rooms were available for each property, every day, and at what price, we could extract valuable data points.
Although limited to rooms bookable online, this method provided reasonable KPIs that correlated strongly with three key metrics in the hotel industry:
1. Average Daily Rate (ADR)
ADR represents the average revenue generated per occupied room, excluding extras like meals. Since direct ADR data was inaccessible, we used a close proxy: the Average Published Rate (APR).
While APR doesn’t reflect actual guest spending, it reveals price trends over time, offering an invaluable indicator of economic dynamics, such as demand elasticity and price sensitivity.
2. Occupancy Rate (%OCC)
To calculate occupancy rate, two inputs are needed:
Total room stock (the number of rooms in a hotel).
Vacant rooms (rooms still available for booking).
Tracking the vacant rooms online was straightforward—after all, who books a hotel offline these days? Determining the total stock was more challenging, but with creative methods, we eventually solved this problem.
3. Revenue per Available Room (RevPAR)
RevPAR, a key profitability metric, is calculated as:
ADR × %OCC
By using our proxies for ADR (APR) and occupancy rates, we could derive what we called an Online Projected RevPAR, offering insights into revenue trends with sufficient accuracy for market analysis.
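As a minimal sketch of how the three proxies fit together, the snippet below computes an Online Projected RevPAR for one hotel and one stay date. The `DailyObservation` schema, the field names, and the unweighted mean used for APR are assumptions made for illustration, not a description of our production pipeline.

```python
from dataclasses import dataclass


@dataclass
class DailyObservation:
    """One scraped snapshot for a hotel and stay date (hypothetical schema)."""
    published_rates: list[float]  # rates seen online for the stay date
    available_rooms: int          # rooms still bookable online
    total_stock: int              # estimated number of rooms sold online


def average_published_rate(obs: DailyObservation) -> float:
    """APR as a plain, unweighted mean of published rates (an assumption)."""
    return sum(obs.published_rates) / len(obs.published_rates)


def occupancy_rate(obs: DailyObservation) -> float:
    """%OCC as 1 - (available rooms / total stock), clamped to [0, 1]."""
    occ = 1 - obs.available_rooms / obs.total_stock
    return min(1.0, max(0.0, occ))


def online_projected_revpar(obs: DailyObservation) -> float:
    """Online Projected RevPAR = APR x %OCC."""
    return average_published_rate(obs) * occupancy_rate(obs)


obs = DailyObservation(
    published_rates=[120.0, 150.0, 180.0],
    available_rooms=30,
    total_stock=120,
)
print(online_projected_revpar(obs))  # APR 150.0 x %OCC 0.75
```

The clamp guards against days when scraped availability briefly exceeds the stock estimate, which would otherwise produce a negative occupancy.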
In the following sections, we’ll dive deeper into these metrics and explain how we derived them using web scraping techniques, along with the challenges and solutions we encountered.
Average Daily Rate (ADR)
Without a direct data partnership with hotel owners, web scraping can only provide a proxy for ADR: the Average Published Rate (APR).
ADR, or Average Daily Rate, is one of the most important performance metrics in the hospitality industry. It represents the average revenue earned per occupied room over a given time period and is a critical measure of profitability and pricing strategy. Together with Occupancy Rate (%OCC) and Revenue per Available Room (RevPAR), ADR forms the foundation of hotel industry analytics.
However, the true ADR can only be accurately calculated by hotel owners or entities with direct access to the hotel’s revenue and occupancy data. This is because real ADR takes into account variables such as:
Discounts applied to bookings (e.g., group rates or loyalty discounts).
Booking channel commissions or fees.
Historical booking prices locked in months before the stay.
Who Can Access ADR?
Property Owners and Investors (e.g., Private Equity, REITs)
Owners and investors involved in hotel real estate use ADR data to monitor asset performance and plan investments. Aggregate ADR data often appears in annual reports, earnings calls, and investment prospectuses.
Benchmarking Services (e.g., STR, CBRE Hotels)
Benchmarking platforms access ADR through data-sharing agreements with hotels. These services ensure confidentiality and publish aggregated reports by region or hotel segment (e.g., luxury or economy) on a weekly, monthly, or annual basis.
Tourism Boards and Government Agencies
Some agencies collect ADR data from local hotels through voluntary participation or regulatory reporting. However, their scope is typically limited to specific jurisdictions.
What Are the Alternatives to ADR?
For external observers without direct access to ADR, alternative data sources can serve as proxies. Below are common methods:
1. Average Published Rate (APR)
Pros: Granular, real-time insights; scalable via web scraping.
Cons: Reflects publicly available rates, which may differ from actual realized revenue.
2. Credit Card Transaction Data
Pros: Captures actual amounts paid and is highly granular.
Cons: Expensive; biased by geography and financial institutions (e.g., it may represent local residents more than international tourists).
3. Email Receipts Data
Pros: Provides detailed data, including amounts paid, booking dates, channels, and room types.
Cons: Like credit card data, it is costly and limited in scope by demographic biases.
Since this blog focuses on web scraping, we’ll delve deeper into APR as the preferred proxy for ADR.
Average Published Rate (APR)
APR, or Average Published Rate, is a valuable proxy metric for ADR. Sometimes referred to as Average Offered Rate (AOR), it represents the average rate publicly available for booking a hotel room on a given date. Unlike ADR, which reflects realized revenue, APR focuses on the prices offered to potential customers, regardless of whether the rooms are booked.
How APR Is Collected
APR can be obtained through web scraping from the following sources:
Direct-to-Consumer hotel websites
Online Travel Agents (OTAs) like Booking.com or Expedia
Metasearch Engines like Google Hotel Search, Trivago, or Kayak
Key Differences Between APR and ADR
Unlike ADR, which is a static value for a specific date and hotel, APR fluctuates based on factors such as:
Channel Variability: Rates can differ across websites or sales channels due to discounts or promotions.
Advance Booking Window: The same room may have different prices depending on how far in advance it is booked.
Why APR Matters
APR’s dynamic nature makes it a powerful tool for market analysis. Tracking APR over time can reveal pricing trends and the speed at which prices adjust to changing demand, providing insights into market elasticity and competitive positioning.
Web Scraping the APR
The technical process of collecting Average Published Rate (APR) data via web scraping has been thoroughly covered in a previous post on this blog, which offers a detailed guide on the tools, techniques, and best practices involved.
Derived Metrics: APR by Advance Reservation Window
One of the most fascinating aspects of working with a metric like the Average Published Rate (APR) is its dependency on a key variable: the advance reservation window. This dependency creates an opportunity to define derived metrics that are simpler to interpret and highly actionable.
For example, the Next Day Published Rate—the price a traveler would pay to book a room 24 hours in advance—is particularly useful. As one of the least advantageous booking scenarios for consumers, this metric provides a fast-moving indicator of how external events influence hotel rates in the short term.
The Next Day Published Rate proved invaluable for analyzing real-time dynamics, such as sudden demand spikes due to local events or disruptions caused by geopolitical or natural occurrences. By focusing on this derived metric, we were able to track the immediate impact of external events on hotel pricing trends with remarkable precision.
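To make the derived metric concrete, here is a small sketch that groups published prices by advance reservation window and reads off the Next Day Published Rate as the one-day bucket. The sample rows and the function name `apr_by_window` are hypothetical, and a simple mean per window is assumed.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# (collection date, stay date, published price): hypothetical sample rows
rows = [
    (date(2019, 6, 1), date(2019, 6, 2), 210.0),
    (date(2019, 6, 1), date(2019, 6, 2), 190.0),
    (date(2019, 6, 1), date(2019, 6, 8), 150.0),
    (date(2019, 6, 1), date(2019, 7, 1), 130.0),
]


def apr_by_window(rows):
    """Mean published rate keyed by advance window, in days before the stay."""
    buckets = defaultdict(list)
    for collected_on, stay_date, price in rows:
        window_days = (stay_date - collected_on).days
        buckets[window_days].append(price)
    return {window: mean(prices) for window, prices in sorted(buckets.items())}


by_window = apr_by_window(rows)
next_day_published_rate = by_window[1]  # price for booking one day ahead
print(by_window)
```

Comparing the one-day bucket against longer windows for the same stay date gives a rough view of how much of a premium last-minute demand commands.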
This metric, like all the others we measured, highlights a subtle but important challenge in global data collection: time zone calibration. Since the target event—the next day—depends on the local context, the timing of data collection must be carefully adjusted for each time zone worldwide.
To ensure consistency across all data points, we standardized our data collection to occur at 9:00 AM local time in each time zone. This approach allowed us to maintain uniformity in the timing of requests, ensuring that the data reflected comparable conditions regardless of location.
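The scheduling logic can be sketched as follows: given the current UTC time and a city's time zone, find the next 9:00 AM local slot and the local "next day" stay date it targets. The function names are hypothetical. The example uses fixed UTC offsets for Tokyo and Bogotá, which do not observe daylight saving time; for zones that do, a `zoneinfo.ZoneInfo` object could be passed instead.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo

COLLECTION_TIME = time(9, 0)  # 9:00 AM local time, as described above


def next_collection_utc(local_tz: tzinfo, now_utc: datetime) -> datetime:
    """Return the next 9:00 AM in local_tz, expressed in UTC."""
    local_now = now_utc.astimezone(local_tz)
    candidate = datetime.combine(local_now.date(), COLLECTION_TIME, tzinfo=local_tz)
    if candidate <= local_now:  # today's slot has already passed
        candidate += timedelta(days=1)
    return candidate.astimezone(timezone.utc)


def next_day_stay_date(local_tz: tzinfo, collected_at_utc: datetime) -> date:
    """The 'next day' is defined on the hotel's local calendar, not in UTC."""
    return collected_at_utc.astimezone(local_tz).date() + timedelta(days=1)


# Fixed offsets are enough here because neither city uses daylight saving time.
TOKYO = timezone(timedelta(hours=9))
BOGOTA = timezone(timedelta(hours=-5))

now = datetime(2019, 6, 1, 12, 0, tzinfo=timezone.utc)
tokyo_run = next_collection_utc(TOKYO, now)
bogota_run = next_collection_utc(BOGOTA, now)
print(tokyo_run, next_day_stay_date(TOKYO, tokyo_run))
print(bogota_run, next_day_stay_date(BOGOTA, bogota_run))
```

Note that the two runs target different stay dates even though they are scheduled from the same UTC instant, which is exactly the calibration issue described above.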
Occupancy Rate (%OCC)
Fast is better than perfect: Achieving the exact occupancy rate is less important than obtaining a highly correlated index that can anticipate this key metric.
Occupancy rate measures the percentage of available rooms in a hotel that are occupied over a specific period. It’s a critical indicator of how effectively a hotel is utilizing its room inventory to generate revenue—in simple terms, how full the hotel is.
Real Occupancy Rate vs. Estimated from OTAs
The real occupancy rate includes all bookings, such as those made directly through hotel websites, walk-ins, corporate contracts, group bookings, and OTA reservations.
By contrast, scraped estimates from OTAs may exclude offline bookings and other channels not listed on the OTA. These estimates can also be skewed if hotels deliberately block room categories or adjust availability to create a sense of urgency.
Factors in Calculating Occupancy Rate
Available Rooms
Occupancy rate is calculated as:
Occupied Rooms / Total Stock
or
1 − (Available Rooms / Total Stock)
The first step is determining the total number of available rooms. To approximate this, we conducted a series of simulated reservation requests across various configurations. Essentially, it was like asking a hotel desk, "Do you have X rooms available for one night on this date?"
Incremental Requests: We increased X in discrete units (e.g., increments of 5 rather than one-by-one) to reduce query volume. When the requested number of rooms exceeded the hotel's remaining availability, the property no longer appeared in search results, signaling the availability limit.
Projections Over Time: By incrementing the booking date, we tracked room availability into the future, identifying how quickly hotels filled over time.
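The incremental probing described above can be sketched as a simple loop. Here `is_listed` is a hypothetical callable standing in for one simulated reservation request ("does the hotel still appear when I ask for X rooms?"), and the `step` and `max_rooms` defaults are illustrative assumptions. Because X grows in steps, the result is a range rather than an exact count.

```python
from typing import Callable


def probe_available_rooms(
    is_listed: Callable[[int], bool],
    step: int = 5,
    max_rooms: int = 500,
) -> tuple[int, int]:
    """Return (lower, upper) bounds on the rooms still available.

    Requests grow by `step` until the hotel drops out of the results.
    If it never drops out, the upper bound is simply capped at max_rooms.
    """
    last_ok = 0
    requested = step
    while requested <= max_rooms:
        if not is_listed(requested):
            return last_ok, requested - 1
        last_ok = requested
        requested += step
    return last_ok, max_rooms


# Simulated hotel with 23 rooms left, counting how many requests are sent.
calls = []


def fake_is_listed(rooms: int) -> bool:
    calls.append(rooms)
    return rooms <= 23


bounds = probe_available_rooms(fake_is_listed)
print(bounds, len(calls))  # five requests instead of twenty-four
```

A larger step cuts traffic further at the cost of a wider range; a follow-up binary search inside the range would be one way to tighten it when precision matters.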
Total Stock of Rooms
Determining the total stock of rooms was more complex. Fortunately, the hotel environment in 2017 was relatively stable compared to today's dynamic Airbnb listings, where capacity can expand or shrink overnight. Adding rooms to a hotel requires physical construction, making stock changes slow and predictable.
Dynamic Factors: OTA platforms occasionally added new properties to their inventories or changed the share of rooms available online due to evolving strategies.
Future Projections: We identified the total stock by projecting searches far enough into the future (e.g., 6, 12, and 18 months ahead) to find periods of peak vacancy. These rolling weekly inquiries created a moving window for measuring total stock.
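One way to express the moving-window idea in code is shown below. Each weekly inquiry records the availability seen at the far-future horizons, and the total stock is taken as the highest value within the rolling window. The class name, the eight-week window, and the sample numbers are assumptions for illustration only.

```python
from collections import deque


class RollingStockEstimator:
    """Estimate total stock as the peak far-future availability in a window."""

    def __init__(self, window_weeks: int = 8):  # window length is an assumption
        self._weekly_peaks = deque(maxlen=window_weeks)

    def add_weekly_inquiry(self, availability_by_horizon: dict[int, int]) -> None:
        """Record one weekly inquiry: months ahead -> rooms available."""
        self._weekly_peaks.append(max(availability_by_horizon.values()))

    def total_stock(self) -> int:
        """Peak vacancy observed across the weeks still inside the window."""
        return max(self._weekly_peaks)


def occupancy_rate(available_rooms: int, total_stock: int) -> float:
    """1 - (available rooms / total stock), clamped to [0, 1]."""
    if total_stock <= 0:
        raise ValueError("total_stock must be positive")
    return min(1.0, max(0.0, 1 - available_rooms / total_stock))


estimator = RollingStockEstimator()
estimator.add_weekly_inquiry({6: 95, 12: 110, 18: 120})
estimator.add_weekly_inquiry({6: 90, 12: 118, 18: 115})

stock = estimator.total_stock()
print(stock, occupancy_rate(30, stock))
```

Because old weeks fall out of the window, the estimate can adapt when a platform changes the share of rooms it sells online, at the price of reacting slowly to genuine stock changes.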
Advancing Industry Standards
While this approach has limitations, it provided complete independence in measurement and ensured exact comparability between hotels across any global location. This method represented a significant improvement over traditional KPIs, offering standardized insights for tracking occupancy trends on a global scale.
Reaching Global Scale
Scaling this project globally presented significant challenges, particularly in managing the volume of requests generated by our method. Applying this systematically across all inhabited regions of the world (yes, we had an algorithm to exclude deserts and oceans) required careful planning to minimize resource usage and avoid interference with the observed websites.
Back in 2017, proxy networks were not a necessity for accessing most platforms, so the primary concern was bandwidth rather than cost. However, to ensure the sustainability of the project, we adhered to a fundamental principle of web scraping: avoid disrupting the technical or business operations of the target websites. This required optimizing our query structure to reduce unnecessary traffic.
Focus on Cities
Rather than attempting to scrape data for the entire globe, we concentrated on the most important cities worldwide. This decision was guided by research on human behavior and urbanization trends.
In 2007, the urban share of the global population surpassed 50%, and by 2023 it had grown to 57%. In regions like North America (83%), Latin America (82%), and Europe (75%), urban centers dominate. While urbanization is less pronounced in Asia (53%) and Africa (45%), cities remain critical hubs for travel.
By focusing on the top 1,000 cities globally, we captured the majority of meaningful travel flows. This method proved far more efficient than attempting to monitor less densely populated areas, enabling us to deliver insights at scale while reducing data collection overhead.
However, this city-focused strategy had a notable limitation. By excluding smaller but significant tourist destinations, we had incomplete coverage of global hotel chains such as Hilton, Hyatt, or Sheraton, which also operate properties in less urbanized areas or niche tourist locations. While this drawback was acceptable for our use case, it highlights the trade-offs inherent in balancing scalability with comprehensive coverage.
Targeting Large OTAs
We prioritized large online travel agency (OTA) platforms for several reasons:
Efficient Development: Scraping from a few large OTAs reduced the effort required for coding and maintenance.
Stable APIs: Many OTAs provide APIs that are deeply integrated into the travel ecosystem, changing infrequently and offering long-term reliability.
Robust Infrastructure: These platforms are built to handle immense volumes of legitimate traffic, making them capable of accommodating our queries without strain.
It’s important to note that our query volume was significantly lower than:
The legitimate traffic generated by real users.
The volume of scrapers that OTAs deploy against each other for competitive intelligence.
We ensured compliance with all applicable terms of service and used ethical scraping practices to collect data.
Discretionary Increments
To balance efficiency and scalability, we made deliberate choices to simplify our queries:
Simplified Variables: Rather than querying all possible combinations (e.g., every room type or every number of reserved rooms), we limited our requests to discrete increments that provided sufficient granularity.
Advance Booking Windows: We queried specific timeframes, such as 6, 12, and 18 months into the future, to capture meaningful patterns without unnecessary redundancy.
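A query plan along these lines can be generated with a few lines of code. The sketch below builds the list of stay dates to request on a given collection day. The far horizons of 6, 12, and 18 months come from the text, while the near-term offsets (1, 7, and 30 days) and the helper names are assumptions.

```python
import calendar
from datetime import date, timedelta


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def stay_dates(collected_on: date,
               near_days=(1, 7, 30),       # hypothetical near-term windows
               far_months=(6, 12, 18)):    # horizons mentioned in the text
    """Discrete set of stay dates to query instead of every possible date."""
    near = [collected_on + timedelta(days=n) for n in near_days]
    far = [add_months(collected_on, m) for m in far_months]
    return near + far


plan = stay_dates(date(2019, 8, 31))
for stay in plan:
    print(stay.isoformat())
```

Six requests per hotel per day, multiplied by a handful of room-count increments, is a very different traffic profile from querying every date and every room configuration.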
Advanced Algorithms: The "Tetris Rule"
We developed innovative algorithms to further optimize occupancy calculations. One method, which we nicknamed the Tetris Rule, involved querying room availability for longer stays across multiple days. This allowed us to infer occupancy rates without querying every single room configuration. While not perfect, this approach was highly efficient and reduced the computational and traffic load significantly.
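The details of the Tetris Rule are not spelled out here, so the following is only one possible reading, not the algorithm we ran. It assumes that a multi-night query reporting N rooms implies at least N rooms are free on every night of the stay, so a few overlapping long stays can bound many individual nights at once, much like a block covering several columns. The function name and sample data are hypothetical.

```python
from datetime import date, timedelta


def per_night_lower_bounds(stay_results):
    """Derive a lower bound on nightly availability from multi-night queries.

    stay_results: iterable of (check_in, nights, rooms_available).
    Assumption: rooms available for the whole stay are available each night,
    so every covered night has at least that many rooms free.
    """
    bounds = {}
    for check_in, nights, rooms in stay_results:
        for offset in range(nights):
            night = check_in + timedelta(days=offset)
            # Keep the tightest (largest) lower bound seen for this night.
            bounds[night] = max(bounds.get(night, 0), rooms)
    return bounds


results = [
    (date(2019, 6, 1), 3, 10),  # 3-night stay, 10 rooms bookable throughout
    (date(2019, 6, 3), 2, 25),  # 2-night stay, 25 rooms bookable throughout
]
bounds = per_night_lower_bounds(results)
for night, rooms in sorted(bounds.items()):
    print(night.isoformat(), rooms)
```

Under this reading, two requests cover four nights. A lower bound on availability translates into an upper bound on occupancy, which is coarser than a per-night probe but far cheaper to collect.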
The Closing of the Project and a Call for Data
As discussed extensively on this blog, web scraping costs have become a significant factor in data acquisition projects. Over time, these rising costs made large-scale data collection less competitive compared to alternative datasets, such as credit card transaction or email receipt data, which many funds already possessed for other purposes. This narrowing scope and increasing cost ultimately made the project less viable in the market.
Despite these challenges, I still consider this one of the most fascinating and ambitious data intelligence projects we’ve ever undertaken—and there have been many. This experience also became one of the driving forces behind the creation of the Data Boutique platform. Very few companies can afford to sustain data collection at scale like this, and even the largest funds face constraints in terms of budget and priorities.
At Data Boutique, our mission is to make all types of data—not just hotel reservation data—accessible to a wide range of projects. We recognize that OTAs, hotels, airlines, and travel agencies collectively gather enormous amounts of data. These stakeholders would benefit greatly from having a uniform, complete, and globally updated dataset, accessible to all while reducing their own costs.
In fact, we advocate not only for scraping but also for data sharing directly from OTAs. This approach would allow them to monetize their data and simultaneously lower their anti-bot expenses.
But that’s a broader conversation for another day. For now, this was the story of one project—challenging, inspiring, and a cornerstone of what we’ve built since.