THE LAB #86: Querying Web Data Using a GPT-Like Web Interface
How LLMs disrupted the concept of self-service business intelligence
In these pages, we focus mainly on how to extract data from the web; that's only natural, since this is The Web Scraping Club.
However, gathering the correct data is just the first step of a complex journey that begins with a dataset and should culminate in insights for the final user, in whatever form is desired, ranging from Excel spreadsheets to dashboards or tables in a report.
Countless actors pivot around the concept of business intelligence, the process of extracting value from data. In my early career, I was sucked into it: it was the Data Warehouse boom, when every corporation seemed to need a huge Oracle database to store its data and spent several thousand dollars on the ETLs that moved data from it to the reporting systems, and I was personally involved in building this stuff.
Fifteen years later, the tools have changed, but the pains remain the same.
Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.
The need for BI
What certainly hasn't changed is the need for decision-makers within a company to understand how the business is performing: how much revenue was generated last week, perhaps compared with the same week of the previous year? Are costs rising or not? What's the margin on item X or Y?
Especially when it comes to a company that produces physical products, there’s a high chance that data from different departments is siloed in various software systems, ranging from SAP to HR or production line systems. The task of the business intelligence unit is to gather all this information under one roof, typically a Data Warehouse (at least, it was so 15 years ago), and create views over it using data visualization tools (Qlik, Tableau, Looker, and many others).
Each decision-maker has their own set of reports, focused on what matters most to them. This escalates quickly, with hundreds of different reports produced, some overlapping with each other and each typically showing different numbers due to varying business logic across departments.
If this proliferation is not kept under control, after a few years the business intelligence team will have so many reports to maintain that, essentially, there will be no more control over how insights are distributed within the company.
This episode is brought to you by our Gold Partners. Be sure to have a look at the Club Deals page to discover their generous offers available for the TWSC readers.
Self-service BI, the cure that never worked
There was a time when self-service BI was the buzzword of the industry. To curb the proliferation of reports, the solution proposed by tools like Looker and its peers was quite simple.
Let's create a standard data dictionary for the entire company, where we expose the data available to a particular person, and then let them build the reports they need, with a simplified UX compared to traditional old-school tools.
This, in theory, should have simplified the life of business intelligence teams: they could focus on creating reports for a restricted number of people, while the other offices built their own.
The truth is that, especially in old-school businesses, people usually don't have the time, skills, or will (and in some cases, all three are missing) to learn a new tool, so self-service BI hasn't delivered what it promised.
Not because the tools aren't good, but because these companies underestimated the friction involved in changing workers' tools and habits, especially in certain geographies, industries, and age ranges.
AI-powered BI: buzzword or reality?
With the advancements in LLM technology, the BI landscape is also likely to change rapidly.
While a restricted number of certified core reports will still be needed, there's now a chance to add a conversational interface where ad hoc questions can find their answers without building dedicated dashboards.
In this way, the time from data (and web data) to insights shrinks: no more meetings to explain the need for a new chart, no waiting for the BI team to build a dashboard for the final user, and, even worse, no final user having to find spare time to build their own.
Thanks to LLM APIs and a handful of Python packages, building a proof of concept for this solution is quite straightforward, and that's what we're doing in this episode of The Lab.
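To make the idea concrete, here's a minimal sketch of the pattern, assuming the openai Python SDK (v1+): send the dataset's schema and the user's question to an LLM, ask it to reply with a pandas expression, and evaluate it against the dataframe. The model name, prompt wording, and `ask` helper are illustrative assumptions, not the exact stack used later in the article.

```python
# Minimal sketch of NL-to-pandas question answering.
# Assumes openai>=1.0 and OPENAI_API_KEY set in the environment;
# model name and prompt are illustrative choices, not the article's stack.
import pandas as pd
from openai import OpenAI

client = OpenAI()

def ask(df: pd.DataFrame, question: str) -> str:
    # Describe the dataframe so the model knows what it can query.
    schema = ", ".join(f"{col} ({dtype})" for col, dtype in df.dtypes.astype(str).items())
    prompt = (
        "You are a data analyst. Given a pandas DataFrame `df` with columns: "
        f"{schema}. Reply ONLY with a single pandas expression that answers: "
        f"{question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
    )
    expression = resp.choices[0].message.content.strip()
    # eval() is acceptable for a proof of concept; sandbox it in anything real.
    return str(eval(expression, {"df": df, "pd": pd}))
```

A production version would validate the generated expression before running it, but for a proof of concept, this round trip is essentially the whole trick.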
Before continuing with the article, I wanted to let you know that I've started my community in Circle. It’s a place where we can share our experiences and knowledge, and it’s included in your subscription. Give it a try at this link.
Building our AI Data Analyst
To illustrate this, let’s set up a proof of concept. Suppose we have purchased or collected a web-scraped dataset – for example, a dataset of Farfetch e-commerce product data (product names, categories, prices, etc.) from DataBoutique. The data file is stored on Amazon S3. We want to allow anyone to ask questions about this dataset in natural language and get answers instantly, without manually writing SQL or scanning through CSVs.
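As a sketch of that first step, here's how the file could be pulled from S3 into a dataframe with boto3 and pandas. The bucket and key names are hypothetical, and the repository scripts may load the data differently.

```python
# Load the web-scraped dataset from S3 into pandas.
# Bucket/key names are hypothetical; assumes AWS credentials are configured.
import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-datasets", Key="farfetch/products.csv")
df = pd.read_csv(obj["Body"])

# Quick sanity check before handing the dataframe to the LLM layer.
print(df.shape)
print(df.columns.tolist())
```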
The scripts mentioned in this article are in the GitHub repository's folder 86.AI-POWERED-BI, available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
Our toolset will include: