Scraping The Inflation
Measuring inflation might seem the domain of economists and statisticians, but it is increasingly reliant on vast networks of web scraping—a practice as intricate as the economy it seeks to decode.
Using web scraping to calculate and improve real-time Consumer Price Indices (CPI) globally is a monumental challenge. The path forward lies in syndicating raw data collection efforts and making CPI goods baskets—used by national statistics institutes—public at bar-code and retailer level.
Web scraping acts as a synapse, transmitting information from the internet to intelligent systems.
Some of these systems are action-driven. They respond to specific triggers: adjusting the price of a product, a hotel room, or a flight based on conditions; or raising alerts when the latest Air Jordan or U2 tickets are listed online.
Others are observers. Their purpose is to inform—predicting which colors will dominate fashion next year or identifying the fastest-growing startup in proptech.
Among these observational challenges, inflation measurement stands out for its complexity and impact. A force that shapes economies and everyday lives alike, it is perhaps the most ambitious scraping challenge of all, requiring millions of invisible connections and robust systems to uncover its hidden story.
Unlike action-oriented systems, which often operate on relatively narrow triggers, observer systems are vast, requiring a panoramic view of complex ecosystems. They are more complex and expensive to build, as they demand an intricate web of custom-built synapses.
Inflation: The Hard Quest
Inflation is one of the most globally impactful phenomena. It is visible to everyone and directly affects our daily lives. Yet, its true significance lies in its influence on policymaker decisions, which ripple out to shape economies, societies, and individuals.
As more products are offered online, new opportunities emerge to track price changes and monitor inflation more dynamically. However, despite its promise, capturing online pricing data is a harder nut to crack than it first appears.
The Historical Data Drama
Building long time series through large-scale web scraping of key e-commerce websites is fraught with high costs and the instability of the observed universe, turning it into an exceptionally challenging marathon.
One of the key challenges in using web scraping for inflation measurement is representativeness. Online prices are only a subset of all prices in an economy, and no one knows precisely how representative they are: on the one hand, they have never been systematically measured; on the other, their share keeps growing daily. To validate their insights, long time series are essential. But building these time series is a marathon, not a sprint; it can take years of data collection before they provide meaningful benchmarks against traditional methods like manual or scanner-based collection.
Complicating matters further, the process of refining how data is collected is iterative. It takes several iterations, often verified through benchmarking time series, to determine which features and attributes are most relevant, how to standardize them, and how to enrich the data. This means that both the start date of data collection and the moment proper collection begins are critical.
Adding to this complexity is the dynamic nature of the online environment. Websites and e-commerce platforms in 2018 were not the same as they are in 2024—in terms of quantity, detail, and structure. The number of products listed online has also grown dramatically. Take grocery retail as an example: many retailers only recently began displaying EAN or bar codes. This effectively resets the clock, wiping out the utility of historical data collected before these details were available.
The panel of websites chosen also has a significant impact. Different choices lead to different results. For instance, the assortment, promotions, and product selections in grocery retail have shifted dramatically before, during, and after the COVID-19 pandemic—and they continue to evolve today. This means the time series data is not consistent with itself, let alone with external data economists use to track the Consumer Price Index (CPI).
The Basket Selection Drama
Replicating the CPI goods baskets used by national statistics institutes, or even improving upon them, is hindered by the lack of detailed bar-code- and retailer-level basket disclosures.
Different actors have different goals when it comes to tracking inflation. Central bank decision-makers focus on collecting prices that align as closely as possible with official CPI baskets. On the other hand, investment funds seeking financial opportunities (alpha) may share similar goals but for different reasons—anticipating central bank decisions to inform trading strategies.
However, CPI baskets are not universal. Each industry, consumer group, or even individual consumer has unique needs that demand customized baskets. For example, luxury goods, such as the Small Lady Dior Bag mentioned in the infographic above, may not be represented in official CPI measurements. Yet, with its retail price in France rising by a staggering 59% over four years, this data is critical for luxury brands, their consumers, and the entire supply chain. Sector-specific indices like this could help retailers, investors, and policymakers understand inflation at a granular level.
The primary challenge lies in the absence of public microdata-level disclosures about the exact products (e.g., bar codes) and retailers (e.g., specific locations) contributing to inflation calculations by central banks and national statistical institutes. While aggregate information is often disclosed, typically in the form of taxonomy updates, critical details remain opaque. For instance, we may know that dehumidification and air purification appliances were added to the Italian HICP basket by ISTAT in 2024, but we have no visibility into which brands and models are mapped to this category or the platforms where their prices are collected. Anyone who shops online knows the immense price arbitrage opportunities that can exist across different platforms or local retail chains.
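To make the missing granularity concrete, a microdata-level basket entry could look something like the record sketched below. Every field name and value is a hypothetical illustration, not an actual ISTAT or Eurostat format.

```python
# Hypothetical example of the microdata-level basket disclosure argued for above;
# field names and values are illustrative, not an official statistical format.
basket_entry = {
    "category_code": "05.3.1",              # illustrative COICOP-style taxonomy code
    "description": "dehumidifier, portable, 20 L/day",
    "ean": "8001234567890",                 # bar code identifying the exact product
    "retailer": "Example Retail Chain",     # hypothetical retailer name
    "channel": "online",                    # online store vs. physical location
    "location": "Milan, IT",                # store or delivery area considered
    "weight_in_index": 0.0004,              # share of this item in the overall basket
}
```

A disclosure at this level of detail is what would allow an independent economist to re-collect the same prices and reproduce, or challenge, the published index.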
Without such granular details, it becomes challenging for economists to replicate these indices from the ground up. Moreover, the lack of transparency limits public debate and the ability to develop improved indices that better reflect the economy.
Furthermore, web scraping primarily aids in collecting consumer inflation data, leaving out a significant portion of inflation related to B2B services. These areas often lack sufficient online visibility, making them more difficult to track through automated methods.
The Illusion of AI Data Collection
While customized baskets highlight the granularity required in inflation tracking, the challenges of gathering and maintaining accurate data at scale bring us to the role of automation and AI.
The repetitive, high-maintenance work of coding web scrapers seems like a natural fit for AI. Generative AI tools like spider.cloud, Firecrawl, or Jina.ai have sparked hope for automating large-scale data collection. However, the reality falls short of expectations.
While these tools show promise on low-complexity websites, they struggle with the vast variability of web structures found in large-scale inflation tracking projects. Debugging missing elements—or worse, correcting hallucinated data such as fabricated prices—often consumes more time than manually coding the scrapers in the first place.
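For contrast, the "manual" baseline these tools compete with looks roughly like the sketch below: a small, per-site parser. The URL and CSS selectors are hypothetical placeholders, and this is exactly the kind of code that must be rewritten whenever a page layout changes.

```python
# A minimal sketch of a hand-coded, per-site price scraper.
# The URL and CSS selectors are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def scrape_price(url: str) -> dict:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Selectors are site-specific and break whenever the layout changes,
    # which is what makes maintenance expensive at scale.
    name = soup.select_one("h1.product-title").get_text(strip=True)
    raw_price = soup.select_one("span.price").get_text(strip=True)
    price = float(raw_price.replace("€", "").replace(",", ".").strip())

    # Unlike a generative model, a parser either finds the element or fails
    # loudly; it cannot invent a plausible-looking price.
    return {"name": name, "price_eur": price}

if __name__ == "__main__":
    print(scrape_price("https://example-retailer.example/product/12345"))
```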
Despite advancements, the reliability of AI-powered tools for massive data collection remains inconsistent, especially for the precision and stability required in inflation analysis. Addressing these challenges requires more than AI alone—it calls for human oversight, specialized expertise, and a robust infrastructure for quality assurance.
The details of our research on these and other tools warrant a deeper exploration, which I hope to cover in a future edition of this blog.
Breaking The Quest in Two
Separating global raw data collection from the open generation of indices—where national institutions transparently disclose the data they rely on—could create a market for independent, more efficient, and auditable representations of the global economy.
As is often the case, when a task is too titanic, splitting it into smaller, manageable parts makes it easier to solve. Here, the problem can be divided into two distinct challenges: raw data acquisition and index construction. These can be tackled separately on the market, by different actors, as they require very different skill sets.
1. Raw Data Acquisition
The first challenge is raw data acquisition. The absence of standardized data interfaces across websites and the current limitations of AI-assisted data collection make it impractical for most organizations to handle independently. Building a vast, intricate network of "synapses" to extract data from countless global sources requires immense resources and expertise.
Given the scale and complexity involved, it is inefficient—and even wasteful—for multiple organizations to construct their own infrastructure for the same purpose. Instead, a shared, syndicated platform for scraping—one that pools global capacity while maintaining quality and stability—would address this inefficiency. Such a platform could act as a collective "synaptic network," enabling organizations to channel their resources into transforming raw data into actionable insights.
To gauge the scale of the advantage of having access to “all” or most web data, rather than relying on the isolated efforts of individual organizations, consider this: in the same press release cited earlier, ISTAT claims it uses 33 million price quotations monthly to estimate Italian inflation.
Given that Italy has over 10,000 supermarkets and the average online assortment per store is around 20,000 products, a daily web-scraped collection of grocery and consumer packaged goods (CPG) data could generate approximately 8 billion price quotations monthly. This represents roughly a 200-fold increase in data points compared to ISTAT's current figures (or more, considering that ISTAT's figure does not solely comprise CPG data). Most importantly, if this raw data were made widely available to independent operators, it would create opportunities for alternative indices and sub-indices, offering richer and more granular insights.
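A back-of-the-envelope sketch of where such an estimate comes from is shown below. All inputs are the rough, lower-bound figures quoted above, so the result should be read as an order of magnitude; pushing the store count somewhat above 10,000 brings the total toward the roughly 8 billion cited.

```python
# Order-of-magnitude estimate only; all inputs are rough figures from the text.
stores = 10_000              # "over 10,000" supermarkets in Italy (lower bound)
skus_per_store = 20_000      # average online assortment per store
days_per_month = 30          # daily collection over one month

monthly_quotations = stores * skus_per_store * days_per_month
istat_monthly = 33_000_000   # price quotations ISTAT reports using per month

print(f"{monthly_quotations:,} scraped quotations per month")        # 6,000,000,000
print(f"~{monthly_quotations / istat_monthly:.0f}x ISTAT's sample")   # ~182x
```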
The only example I am aware of that focuses on creating a distributed, open pool of independent data collection on the web is Data Boutique (yes, I may be biased). The platform focuses on separating raw data acquisition from its processing, creating a scalable and efficient model for tackling the inflation-tracking challenge.
2. Index Construction
While raw data acquisition addresses the technical challenge of data collection, index construction focuses on creating meaningful economic representations from this data.
This task—whether it involves creating sector-specific inflation metrics, subsector CPIs, or AI-enhanced indices—requires deep expertise in economics and industry-specific knowledge. Unlike scraping, which is a generalized capability, constructing meaningful indices depends on highly specialized skills and a nuanced understanding of the underlying data.
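As a minimal illustration of what index construction means at the lowest level, the sketch below computes an elementary aggregate using the Jevons formula, a geometric mean of price relatives commonly used for CPI elementary aggregates, over a handful of matched scraped price quotes. The bar codes and prices are purely illustrative.

```python
# Minimal sketch of a Jevons elementary price index (geometric mean of
# current/base price relatives) over matched products; all values are made up.
from math import prod

def jevons_index(base_prices: dict[str, float],
                 current_prices: dict[str, float]) -> float:
    """Geometric mean of price relatives over products present in both periods."""
    matched = base_prices.keys() & current_prices.keys()
    relatives = [current_prices[ean] / base_prices[ean] for ean in matched]
    return prod(relatives) ** (1 / len(relatives))

base  = {"8001234567890": 1.20, "8009876543210": 2.50, "8005555555555": 0.99}
today = {"8001234567890": 1.29, "8009876543210": 2.60, "8005555555555": 1.05}

print(f"Elementary index: {jevons_index(base, today) * 100:.1f}")  # ~105.8
```

Real indices layer weighting, quality adjustment, and product replacement on top of such elementary aggregates, which is where the specialized expertise comes in.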
An open market where hedge funds, research institutes, and other private-sector entities compete to develop better indices would provide significant benefits. For this to work, governments, central banks, and national statistical institutes would need to disclose the precise microdata-level composition of their baskets, including bar codes and retailer-specific information. This transparency could foster competition akin to that of stock indices, where multiple providers vie to offer the most accurate representation of the economy—or specific sectors within it.
Who’s Been Dealing With This
While there seems to be a decline in publicly visible research projects using web scraping for inflation measurement in recent years, several noteworthy initiatives still stand out and are worth mentioning.
The projects below illustrate the growing importance of web scraping in inflation measurement. Even as some initiatives move into the private domain or are absorbed into larger frameworks, they collectively underscore the value of automated data collection for understanding one of the most complex economic phenomena. At the same time, the fragmentation of these efforts highlights the need for unified frameworks and collaborative approaches.
The Billion Prices Project (BPP)
The first project I know of that tackled this massive quest was the Billion Prices Project (BPP), an academic initiative founded in 2008 by Professors Alberto Cavallo and Roberto Rigobon at the Massachusetts Institute of Technology (MIT). Its primary objective was to collect and analyze daily online prices from hundreds of retailers worldwide to conduct research in macro and international economics and to compute real-time inflation metrics.
The BPP gathered prices from over 300 online retailers across more than 70 countries, amassing approximately 5 million prices daily. The original BPP is no longer active, but its methodologies and objectives have been carried forward by PriceStats (a private company that continues to produce daily inflation and purchasing power parity indicators in over 25 countries) and the HBS Pricing Lab (an academic research initiative at Harvard Business School).
Italian Statistics Institute (ISTAT)
The Italian National Institute of Statistics (Istat) has integrated web scraping to collect price data for various goods and services. Since 2018, Istat has applied these techniques to sectors such as train transportation, electricity in the free market, town gas, and food delivery.
Statistics Sweden (SCB)
SCB has transitioned from traditional in-store price collection to digital data gathering methods, including web scraping, to measure inflation more accurately. This shift has enabled the collection of larger samples and more precise data, enhancing the reliability of Sweden's consumer price indices.
Office for National Statistics (ONS) in the UK
The ONS has explored the use of web-scraped data to produce experimental price indices, particularly in the clothing sector. By employing methods like the Clustering Large datasets Into Price indices (CLIP), the ONS aims to enhance the granularity and responsiveness of inflation metrics.
European Central Bank (ECB)
The ECB issued a working paper, “Nowcasting consumer price inflation using high-frequency scanner data: evidence from Germany”, on how non-traditional high-frequency data such as web-scraped and scanner data, combined with machine learning (ML) techniques, can help with nowcasting. Nowcasting is a forecasting technique that estimates current or very near-future economic, financial, or other trends from real-time data; the term combines "now" and "forecasting", reflecting its focus on immediate conditions rather than far-off future outcomes.
The research network PRISMA (Price-setting Microdata Analysis Network) itself lists web-scraped price data as a data source, although it is still marked as “in progress”.
European Union (EU)
In 2020, the EU issued the Practical Guidelines on web scraping for the HICP (Harmonised Indices of Consumer Prices). These guidelines assist member states in implementing online price collection methods to improve the timeliness and accuracy of inflation measurement across Europe.
National Bank of Serbia (NBS)
The bank has investigated the use of web scraping to collect online prices for nowcasting inflation. The intention of the NBS was to cover as many items in the CPI as possible, in an endeavour to acquire a more reliable nowcast of the inflation central tendency.
Bank for International Settlements (BIS)
On November 19, 2024, the BIS launched Project Spectrum, an initiative that explores the use of generative AI to automatically categorize vast amounts of product descriptions and price observations, enhancing the nowcasting of inflation. By processing extensive web-scraped data, the project aims to improve the timeliness and accuracy of inflation estimates.
U.S. Bureau of Labor Statistics (BLS)
The BLS states that it has incorporated web scraping to obtain necessary data, especially during periods when traditional data collection methods face challenges. The process is probably not yet fully developed, since the following year (2022) the National Academies of Sciences, Engineering, and Medicine issued a report advising the BLS to accelerate the use of new data sources, including web scraping, to modernize the Consumer Price Index (CPI). The report emphasized automating web scraping for categories like food, electronics, and apparel to improve the timeliness and accuracy of inflation measurement.
Drawing Some Conclusions
Understanding inflation is vitally important—not only because of its tangible effects on daily life but also because it profoundly influences policy decisions that shape economies and societies. Technology has brought us closer to understanding this complex phenomenon, but we’re still only halfway there.
Political decision-making, fact-checking, and accountability would also benefit enormously from better tools for inflation observation. A comprehensive, reliable framework for tracking inflation—spanning consumer and business sectors—would empower policymakers and organizations to make informed decisions with greater transparency.
As discussed, a syndicated web scraping approach would address these inefficiencies and free organizations to focus on transforming data into actionable insights.
In the end, our ability to accurately observe and measure inflation will guide smarter decisions in business, in politics, and in our own choices as educated consumers.
Thank you Andrea; this is very well-written.
Hi, and thank you very much for your answer. I'm not an economist, but I have always heard that "inflation runs on basic products", and yes, this would give you a way to get data for far more products than that.
But I think the goal of data, in many cases, is to help us make decisions. In this case, it would also help build an inflation index once a month. Imagine a scenario where 20 products each rise in price by between 0.5% and 1%: that is very hard to perceive with traditional methods, yet by the end of the month it adds noise to the inflation index. People don't lose much on any single product, but they lose the same or more through a little of each. I still think that fast, timely data in the right hands will help people more than it helps others in this system.
Your new friend, Cristian