Web Scraping news recap - May 2023
The most interesting articles on Substack about data and web data
Hi everyone and welcome back to The Web Scraping Club, this post is our monthly review of what happened in the web scraping industry in May.
Substack and Data
Thanks to Substack’s community and suggestion features, I’ve discovered so many exciting publications about data (and web data) that I decided to highlight the most interesting articles I’ve read this month.
by Chad Sanderson - Data Products newsletter
When things scale, in every department of a company, things can go pretty bad if the growth is not supported properly.
In this article, Chad discusses the difficulties of scaling data products in a growing company.
Usually in a start-up, data follows the ELT path: it gets extracted from various sources, loaded into a data lake, and then the different departments transform it according to their needs and skills. As a result, every department has its own rules, with logic sometimes duplicated or, even worse, the same KPI calculated with different formulas.
In a small organization, this is acceptable because it’s the fastest way to get insights, and everyone in the company is aware of the whole picture.
But as the company grows, all these inefficiencies and hidden issues erode the overall data quality, mainly because of one big problem: the lack of a single source of truth.
A place where data is stored, cleaned, and already transformed for the whole company: the good old data warehouse.
Chad’s article is great and, having been a DWH analyst in my early days, I could not agree more with the picture he describes. I also see a lot of parallels with web scraping projects.
You start with a few scrapers, no stored logs, no standards, and no data quality checks, since you detect issues with some ad-hoc queries. But then you find out that even a dozen websites are hard to scrape in a continuous and accurate way if you don’t have a proper data ingestion and quality control process. Scale is killing your data quality, and you don’t have a source of truth to refer to.
And this leads to the second post I want to share with you.
Databoutique.com is the company I founded with Andrea, and we’ve been in the web scraping industry for more than 10 years. Having scraped hundreds of websites with a lean and highly efficient team, we know something about scale and web scraping.
In this article, Andrea perfectly describes the difficulties of delivering high-quality web data to customers in a continuous and accurate way.
Just to be clear: a near real-time web data feed with 100% uptime is a utopia, even for a single website, for the simple reason that the source of the data, the target website, is not under our control.
But we can use several techniques to reduce the data quality issues in our in-house scraping projects.
Time series help us detect anomalies in data when there are sudden drops or spikes in counts and values, but they’re not effective against gradual degradation of data quality.
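To make the idea concrete, here is a minimal sketch of spike/drop detection on a time series of daily scraped item counts. The function, threshold, and numbers are all illustrative assumptions, not from the article.

```python
def detect_anomalies(counts, window=7, threshold=0.3):
    """Flag days whose item count deviates more than `threshold`
    (as a fraction) from the trailing `window`-day average."""
    anomalies = []
    for i in range(window, len(counts)):
        baseline = sum(counts[i - window:i]) / window
        deviation = abs(counts[i] - baseline) / baseline
        if deviation > threshold:
            anomalies.append(i)
    return anomalies

# Ten days of roughly stable counts, then a sudden drop on the last day
counts = [1000, 1010, 990, 1005, 995, 1002, 998, 1001, 997, 400]
print(detect_anomalies(counts))  # → [9]
```

Note how this matches the limitation above: a sudden drop is flagged immediately, but a count slowly decaying by 1% per day would always stay within the threshold of its own recent average.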
You can set your own source of truth by manually counting items on target websites. It’s not the most scalable approach, but it’s the most effective technique we’ve found so far.
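A source-of-truth check like this boils down to a completeness ratio: scraped items versus the manually verified count. The helper and the figures below are hypothetical, just to show the shape of the check.

```python
def completeness_check(scraped_count, manual_count, min_ratio=0.98):
    """Compare scraper output against a manually verified count.

    Returns the completeness ratio and whether it clears `min_ratio`.
    """
    ratio = scraped_count / manual_count
    return ratio, ratio >= min_ratio

# e.g. we manually counted 1200 products on the site; the scraper got 1140
ratio, ok = completeness_check(1140, 1200)
print(f"{ratio:.1%} complete, passed: {ok}")  # → 95.0% complete, passed: False
```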
Last but not least, you can add redundancy, comparing your extraction with the one offered by other providers. This can be expensive (but not if you use databoutique.com as your provider). Still on the topic of the costs of web scraping projects, I also suggest this other article from Data Boutique’s Substack, where we help you calculate the ROI (return on investment) of your web scraping projects.
Calculating ROI is useful not only for internal web scraping projects but also for data sellers approaching a new customer. Does this dataset provide enough value for my customer to give them an adequate ROI?
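As a back-of-the-envelope illustration of the kind of ROI math involved, here is a sketch using the standard formula ROI = (value − cost) / cost. All figures are made up; the real article covers the topic in far more depth.

```python
def scraping_roi(yearly_value, setup_cost, yearly_running_cost):
    """ROI = (value generated - total cost) / total cost."""
    total_cost = setup_cost + yearly_running_cost
    return (yearly_value - total_cost) / total_cost

# e.g. data worth 50k/year, 10k one-off setup, 15k/year for
# maintenance, proxies, and infrastructure
roi = scraping_roi(50_000, 10_000, 15_000)
print(f"{roi:.0%}")  # → 100%
```

For a data buyer, the same arithmetic works in reverse: the dataset price plus integration effort is the cost, and the improved decisions it enables are the value.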
An excellent way to understand how customers in finance evaluate the ROI for alternative data and data partners is described in the next post.
Can the data provide insights that change the decisions of the business for the better? Know how you will measure if it’s adding value in the future and estimate those metrics in advance based on the inclusion of the data.
That’s the key point in deciding whether the data you’re going to buy is valuable enough for you or not.
But to understand it, you need to identify any biases or limitations in the dataset itself.
But the ROI is not the only key point for a data provider to be accepted.
In this post, we can see a list of requirements a data provider should meet to be considered.
Is the data collected in an ethical and legitimate way? Especially in the financial world, this is key.
Is there some backtesting that proves the effectiveness of this dataset?
Is the provider able to adapt to feedback quickly?
These and other great points are in the original article, which is a must-read, like the whole Substack, if you want to enter the alternative data market as a data provider.
Video of the month
Speaking of the complexity of web scraping operations, I was invited by Smartproxy to speak in their webinar series. We discussed the web scraping industry in general and best practices for handling the challenges and complexities of this always-changing world.
The most-read article of the month
And for the third month in a row, AI seems to be the most interesting topic even in the web scraping industry.
If you wish to receive articles like this directly in your email, you can subscribe below.