From 0 to 2 Billion Prices Scraped per Month
Some high level principles to scale a web data business
In this post on The Web Scraping Club blog, I'll describe what we did at Databoutique.com to scale from 0 to 2 billion prices scraped per month, bootstrapped and with a minimal team of developers. I'll cover the principles that guided development in our company and let us scale as a data provider while keeping costs and headcount low.
Things are far from perfect and we made tons of mistakes along the way, but I think it's an impressive milestone, and I hope someone can take away some useful hints from reading this.
In other words, this is the kind of post I wish I'd read before starting the journey.
Think big from the start
When we started we had an empty database and an idea of what kind of data we wanted to get from the web: prices.
Both of us founders came from business intelligence consulting, so designing the data model was the easiest part for us. We knew that things could escalate quickly, and a good data model is the foundation for building a sustainable data business. Focusing on only one kind of data also simplified the job a lot.
From the very first moment, we thought about which tables were likely to grow over time and tried to build everything so that processes wouldn't get stuck behind queries running too long.
This principle is applied at every tech layer we have in-house, trying to balance two needs: a quick time to production and a sustainable solution in the long term.
Keep it stupid simple, whenever you can. This is the key to managing the complexity of these large projects. Nobody likes writing documentation for processes, so they should almost be able to speak for themselves.
But choosing simplicity is not as simple as it seems: it means making some hard and counterintuitive decisions.
As we said before about the data model, choosing simplicity and accepting the loss of some details of the products we're scraping has pros and cons. We can't fulfill requests from customers who want to dig into the finest details of the products they're seeing, but as a huge pro, it makes our job much easier and the business scalable across different industries, since we adopt a common data model for everything sold online.
Also at the process level, we decided to build our very simple ETLs in-house in Python and integrate them into our homemade logging system. It's a bit like reinventing the wheel, but we didn't want the overhead of also operating a third-party tool like Airflow or others.
We didn't need ETLs with thousands of features we would never use: we have a very straightforward and standardized pipeline from the scraper's output to the database. Keeping this pipeline simple made it even simpler for us to write and run our ETLs.
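To give an idea of what "very simple ETL" can mean in practice, here is a minimal sketch of a transform step that normalizes raw scraper output into a common price schema and reports to a logger. The field names and schema are invented for illustration, not Databoutique's actual ones.

```python
import datetime
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

# Hypothetical common schema: every price record is reduced to the same
# fields, whatever the source website.
REQUIRED_FIELDS = {"website", "sku", "price", "currency"}

def transform(raw_rows):
    """Normalize raw scraper output into the common price schema,
    logging and skipping malformed rows instead of failing the run."""
    clean, skipped = [], 0
    for row in raw_rows:
        if not REQUIRED_FIELDS.issubset(row):
            skipped += 1
            log.warning("skipping malformed row: %s", row)
            continue
        clean.append({
            "website": row["website"],
            "sku": str(row["sku"]),
            "price": float(row["price"]),
            "currency": row["currency"],
            "scraped_at": row.get("scraped_at",
                                  datetime.date.today().isoformat()),
        })
    log.info("transform done: %d clean, %d skipped", len(clean), skipped)
    return clean, skipped
```

Because every scraper emits the same fields, one transform like this can serve hundreds of websites, which is exactly what keeps the pipeline boring to operate.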
I only understood later the importance of standardizing internal processes, and I think its advantages are widely underrated.
Standardization, when it does not turn into bureaucracy, has a great advantage: it leads to simplicity (see the paragraph above).
Having standards for writing scrapers, as an example, means that two or three scraper variants kickstarted hundreds of different scrapers running on different websites.
It means that everyone in the web scraping team, even on their first day in the organization, knows what to expect from the behavior of the scrapers they're working on.
They log in the same way, have the same input structure and output fields, and use the same tech stack.
All the scrapers are launched in the same way, so adding a new website to the scope basically means starting a scraper from a template and, when it's finished, pushing it to production and adding a row to a table to schedule its runs.
Processes as LEGO bricks
Planning, simplicity, and standardization allow us to see our whole process ecosystem as a LEGO build.
Every brick is a simple step of the process that, combined with the others, builds up the walls of a beautiful castle.
This means we have a brick that launches a VM instance on AWS, GCP, or Azure, or a brick that lets our scraper decide whether to use proxies and, depending on the parameters, which vendor and type of proxy.
The concept is similar to having internal APIs for every step of the process, and this is where the future of our company is headed.
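In code, a "brick" is just a small, composable function parameterized by provider. The sketch below is purely illustrative; the function names and return shapes are invented, and a real implementation would call each provider's SDK inside the branches.

```python
from typing import Optional

def launch_vm(provider: str, size: str = "small") -> dict:
    """Brick: start a VM on the chosen cloud. Each provider branch would
    delegate to that provider's SDK in a real implementation."""
    if provider not in {"aws", "gcp", "azure"}:
        raise ValueError(f"unknown provider: {provider}")
    return {"provider": provider, "size": size, "status": "running"}

def pick_proxy(use_proxy: bool, vendor: Optional[str] = None,
               proxy_type: str = "datacenter") -> Optional[dict]:
    """Brick: decide whether the scraper uses a proxy, and which one."""
    if not use_proxy:
        return None
    return {"vendor": vendor or "default-vendor", "type": proxy_type}

# Bricks compose like LEGO: the same pieces assemble into many pipelines.
vm = launch_vm("aws")
proxy = pick_proxy(True, vendor="acme-proxies", proxy_type="residential")
```

Swapping a cloud provider or proxy vendor then becomes a parameter change rather than a rewrite, which is what makes the bricks feel like internal APIs.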
But this approach also extends to non-technical tasks like data quality checks and data enrichment.
Splitting complex processes into small and simple autonomous tasks means that these tasks can be accomplished faster and with a shorter learning curve, improving the efficiency of the whole company.
Over these years we've had several freelancers come and go, and making their onboarding and operating tasks clear and simple is a key point for us.
As an example, when we assign a web scraping freelancer a task that requires fixing a running scraper, they can already see the full picture: which cloud provider it runs on, the scraper's code from our repository, the type of issue, what happened in the latest execution thanks to the logs, and what "type" of scraper it is.
While working on the fix, they can test it on their preferred cloud provider and, once the issue is solved, push the code and notify our QA team, which, after a second check, promotes it to production.
The only two steps that are not automated are the two human interventions.
Keep an eye on everything
As I've already mentioned, we built our own internal log system, not because there were no options on the market, but because we needed something simple yet tailor-made to our needs.
Writing scrapers is not like writing common software, because the input of the program (the website) changes without notice. The scraper keeps running without any error, but you can find yourself with partial or missing data.
For this reason, we needed two layers of controls:
Process controls, which alert us to any issue on the software side, like errors in ETLs, instance launches, and so on.
Data quality controls, which, given our experience in the industry, are tailored to our needs. We know what to expect from every website (an approximate item count, number of countries, brands, and so on) and we compare these expectations with reality.
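A data quality control of this kind can be sketched in a few lines: compare what a run produced against what experience says to expect. The website name, metrics, and tolerance below are invented for illustration.

```python
# Expected per-website figures, learned from past runs (values are made up).
EXPECTATIONS = {
    "example-shop.com": {"items": 1200, "countries": 5, "brands": 40},
}

def quality_issues(website: str, actual: dict,
                   tolerance: float = 0.2) -> list:
    """Flag any metric deviating more than `tolerance` (20% by default)
    from expectation. This catches the silent failures where a scraper
    finishes without errors but returns partial data."""
    issues = []
    for metric, expected in EXPECTATIONS[website].items():
        got = actual.get(metric, 0)
        if abs(got - expected) > expected * tolerance:
            issues.append(
                f"{website}: {metric} expected ~{expected}, got {got}")
    return issues
```

A check like this turns "the scraper ran fine" into "the scraper ran fine *and* returned roughly what it should have", which is the distinction that matters in scraping.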
All the thousands of rows of daily logs are then distilled into issues that, depending on their type, are automatically routed to our engineering or web scraping team.
The more accurate the issue report, the easier it is for the team in charge to find the solution, reducing the time to get a fix into production.
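The routing itself can be as simple as a lookup table from issue type to team. The issue-type names here are illustrative, not our real taxonomy.

```python
# Hypothetical mapping from issue type to the team that owns the fix.
ROUTING = {
    "etl_error": "engineering",
    "instance_failure": "engineering",
    "missing_data": "web-scraping",
    "layout_change": "web-scraping",
}

def route_issue(issue_type: str) -> str:
    """Send each distilled issue to the right team; anything
    unrecognized falls back to manual triage."""
    return ROUTING.get(issue_type, "triage")
```

Keeping the mapping in data rather than code means adding a new issue type is a one-line change.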
Being bootstrapped, another thing to monitor constantly is the cost and usage of the cloud providers. Again, a daily automated report built on the providers' APIs helps a lot in hitting the gas or the brakes according to the financials.
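The reporting side of that can be sketched as a small aggregation over the day's cost records. In reality the records would come from each provider's billing API; here they are hardcoded examples, and the budget figures are invented.

```python
from collections import defaultdict

# Invented daily budgets per provider.
DAILY_BUDGET = {"aws": 100.0, "gcp": 60.0, "azure": 30.0}

def cost_report(records: list) -> dict:
    """Sum the day's spend per provider and flag anything over budget."""
    totals = defaultdict(float)
    for r in records:
        totals[r["provider"]] += r["cost"]
    return {p: {"spent": round(t, 2),
                "over_budget": t > DAILY_BUDGET.get(p, 0)}
            for p, t in totals.items()}

# Example records, standing in for a pull from the billing APIs.
records = [
    {"provider": "aws", "service": "ec2", "cost": 80.0},
    {"provider": "aws", "service": "s3", "cost": 30.0},
    {"provider": "gcp", "service": "compute", "cost": 12.5},
]
```

An `over_budget` flag in a daily email is a crude instrument, but for a bootstrapped company it is often enough to decide whether to pull or push the gas.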
There's no magic recipe for making a web data company work, but what we found out is that, being bootstrapped, we need to raise our productivity every day. And to do this, we need planning, simplicity, standardization, modularity, and monitoring in our processes.
I hope this read makes you think about some aspects of your work that can be improved; if you think I've missed something, just let me know in the comments below or via DM.
The Web Scraping Club is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.