Ensuring data quality in web scraping projects
An example of a modern web data quality control pipeline
Data quality is one of the critical pain points in web scraping: how do you know the fields in the output are correctly mapped to the information you’re looking for? Did you scrape the whole scope you’re interested in? Is the data formatted correctly?
As web scraping projects grow in scope, answering all these questions becomes more and more difficult, leading to lower quality in the extracted data.
In this article, we’ll see some common techniques to improve data quality in your web scraping operations.
Monitoring the scraping process
Data quality starts with designing talkative scrapers: during their runs, they can encounter several return codes that may indicate issues in data gathering.
As an example, a 404 return code on one of your HTTP requests indicates that a page was not found. Maybe there’s a broken link on the target website and the scraper halted without moving on to the next items to scrape.
Or there could have been a networking error or some anti-bot blocking after a number of requests: there’s a wide range of events that can lead to incomplete data collection.
Including a recap of all the return codes encountered by your scraper, just like Scrapy does, allows you to investigate potential issues in your data pipeline more efficiently.
Of course, when a scraper returns 0 results, that’s the easiest case to investigate, but when it returns partial data, you can often find out why by having a look at its log.
Whether you’re using Scrapy or not, storing this execution recap in a single place for all the executions is vital as the number of scrapers grows, since it lets you keep track of all the outcomes without opening thousands of different log files.
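As a minimal sketch of the idea, the snippet below (plain Python with the requests library, not Scrapy) counts the status codes seen during a run and appends a recap row per status to a shared SQLite table. The table name and schema are purely illustrative assumptions; in a Scrapy project you would lean on its built-in stats collector instead.

```python
import sqlite3
from collections import Counter
from datetime import datetime, timezone

import requests


def crawl_with_recap(urls, spider_name, db_path="scraper_runs.db"):
    """Fetch a list of URLs and store a per-run recap of HTTP status codes.

    Minimal sketch: table name, schema and function name are illustrative.
    """
    status_counts = Counter()
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            status_counts[response.status_code] += 1
        except requests.RequestException:
            status_counts["network_error"] += 1

    # Persist the recap in a single place shared by all scrapers,
    # so you don't have to open thousands of log files.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS run_recaps "
        "(spider TEXT, run_at TEXT, status TEXT, count INTEGER)"
    )
    run_at = datetime.now(timezone.utc).isoformat()
    for status, count in status_counts.items():
        conn.execute(
            "INSERT INTO run_recaps VALUES (?, ?, ?, ?)",
            (spider_name, run_at, str(status), count),
        )
    conn.commit()
    conn.close()
    return status_counts
```

Querying this single table at the end of the day gives you the same kind of recap Scrapy prints per run, but across your whole fleet of scrapers.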
Data Ingestion
Sometimes the scrapers’ selectors break after a website changes and we scrape data in a format we don’t expect, for example long text where we expect numbers.
While we could check for this directly in the scraper, doing it when loading data into the database gives us a centralized point of control, instead of duplicating the same rules across hundreds of scrapers (hopefully).
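Here’s a hedged sketch of what such an ingestion-time check could look like. The field names (price, currency, title) and the rules are illustrative assumptions, and the insert and alert callables stand in for whatever database loader and alerting system you actually use.

```python
def validate_record(record):
    """Return a list of problems found in a scraped record before loading it.

    Field names and rules here are just illustrative.
    """
    problems = []

    # Price must be convertible to a positive number,
    # not long text scraped by a broken selector.
    try:
        price = float(record.get("price", ""))
        if price <= 0:
            problems.append("price is not positive")
    except (TypeError, ValueError):
        problems.append("price is not numeric")

    # Currency should be a 3-letter ISO code.
    currency = record.get("currency", "")
    if not (isinstance(currency, str) and len(currency) == 3 and currency.isalpha()):
        problems.append("currency is not a 3-letter code")

    # A title longer than a few hundred characters is usually a sign that
    # the selector grabbed the wrong portion of the page.
    title = record.get("title", "")
    if not title or len(title) > 300:
        problems.append("title missing or suspiciously long")

    return problems


def load_records(records, insert_fn, alert_fn):
    """Insert only valid records; route the rest to an alerting function."""
    for record in records:
        problems = validate_record(record)
        if problems:
            alert_fn(record, problems)
        else:
            insert_fn(record)
```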
Automatic data quality controls
It goes without saying that the tests you can run on your data depend on the data type itself.
While for product prices you can run some automatic checks on the numeric fields, if you’re extracting qualitative data, like the text of reviews or product descriptions, these controls can be useless.
However, there are still some general rules to follow independently from the data type.
Data completeness
Generally speaking, you expect a certain number of items in your scraper’s output.
If you’re collecting reviews, there’s usually a total review count on the main page and, if that number is correct, you can use it to check the number of items in your output.
This can be done programmatically by sending an alert to your logging system, either at the end of the scrape or in later stages of your data pipeline.
Once you’ve set a baseline that you’re sure is correct, you have a starting point for more advanced controls.
As an example, you can use this number to monitor future executions. Assuming the number of reviews of a product keeps growing, if one day you see a drop in the number of results, you can send another alert to your logging system.
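A minimal sketch of this kind of completeness check might look like the following; the 10% drop threshold and the function name are assumptions for illustration, not a prescription.

```python
import logging

logger = logging.getLogger("data_quality")


def check_completeness(scraped_count, expected_count, previous_count=None,
                       drop_threshold=0.10):
    """Alert when the output looks incomplete.

    expected_count:  the total shown on the website (e.g. "1,234 reviews").
    previous_count:  the item count of the last successful run, if known.
    drop_threshold:  illustrative 10% tolerance before raising an alert.
    """
    if scraped_count == 0:
        logger.error("Scraper returned 0 items: easiest case to investigate.")
        return False

    if scraped_count < expected_count * (1 - drop_threshold):
        logger.warning(
            "Scraped %d items but the page declares %d: possible partial run.",
            scraped_count, expected_count,
        )
        return False

    # Review counts usually only grow, so a drop vs. the last run is suspicious.
    if previous_count is not None and scraped_count < previous_count * (1 - drop_threshold):
        logger.warning(
            "Item count dropped from %d to %d since the last execution.",
            previous_count, scraped_count,
        )
        return False

    return True
```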
At Databoutique.com, as an example, we use the Ground Truth method. When submitting a new website, every seller needs to enter the number of items expected for a certain brand on that website. This value is peer-reviewed and updated over time, both by the seller and by other sellers. As soon as the output of their extraction drops below this value by more than a certain threshold, they get an alert and data publication is halted. Since the drop can be a natural one, the seller can submit a new ground truth value and data publication resumes.
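The snippet below is not Databoutique’s actual implementation, just an illustration of the rule described above: compare the extracted item count with the stored ground truth and block publication when the drop exceeds a threshold (20% here, purely as an example).

```python
def can_publish(item_count, ground_truth, max_drop=0.20):
    """Return True when the extracted item count is close enough to the
    ground truth value to allow publication (thresholds are illustrative)."""
    if ground_truth <= 0:
        raise ValueError("ground truth must be a positive item count")

    drop = (ground_truth - item_count) / ground_truth
    # A drop larger than max_drop halts publication until the ground truth
    # is reviewed and, if the drop is natural, updated.
    return drop <= max_drop
```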
Data coherence
These controls can be applied only in some cases, but they are extremely useful, especially when dealing with prices.
Let’s say we’re scraping pricing from an e-commerce website that sells in several countries.
So we’ll have the same items sold in different countries, with different currencies and prices, but we expect them to be coherent in two directions: over places and over time.
By coherent over places I mean that, once all the prices from every country are normalized to a reference currency, let’s say USD, we expect the average USD price for each country, allowing for different taxation and markups, not to differ by more than a certain threshold. The average price in Japan cannot be 10 times the average price in Germany. If it is, there’s typically an issue in reading the currency symbol, which gets translated to the wrong currency code. In fact, Japan and China use the same ¥ symbol for their currencies, but one JPY is about 0.050 CNY. The same happens with the dollar symbol or the krona/krone of the Nordic European countries.
By coherent over time, instead, I mean we can monitor the average price for a single country over time, just like we did for the item count. As soon as we see a drop or a spike, we send an alert.
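Both checks can be sketched in a few lines of Python; the exchange rates, thresholds, and data shapes below are illustrative assumptions only.

```python
from statistics import mean

# Illustrative exchange rates to USD; in practice you would pull fresh rates
# from a dedicated source.
USD_RATES = {"USD": 1.0, "EUR": 1.08, "JPY": 0.0067, "CNY": 0.14, "SEK": 0.095}


def average_usd_price(prices, currency):
    """Average of a list of local prices converted to USD."""
    return mean(p * USD_RATES[currency] for p in prices)


def check_coherence_over_places(prices_by_country, max_ratio=3.0):
    """Flag countries whose average USD price deviates too much from the
    overall average (the 3x ratio is just an illustrative threshold).

    prices_by_country maps a country code to (currency_code, [local prices]).
    """
    averages = {
        country: average_usd_price(prices, currency)
        for country, (currency, prices) in prices_by_country.items()
    }
    overall = mean(averages.values())
    # A "¥ read as CNY instead of JPY" mistake typically shows up here.
    return {
        country: avg
        for country, avg in averages.items()
        if avg > overall * max_ratio or avg < overall / max_ratio
    }


def check_coherence_over_time(daily_averages, max_change=0.30):
    """Flag day-over-day drops or spikes above max_change (30% here)."""
    alerts = []
    for previous, current in zip(daily_averages, daily_averages[1:]):
        change = abs(current - previous) / previous
        if change > max_change:
            alerts.append((previous, current, change))
    return alerts
```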
Data quality in qualitative fields
When handling qualitative fields instead, like descriptions or free text, it can be much more difficult to understand whether we’re getting everything we need or whether the text we’re getting differs from our expectations.
The easiest case is when we already know the domain of the target field (as an example, we’re expecting only “Man” and “Woman” in the gender field of an e-commerce website), so we can set up some alerts if we see different values.
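For example, a tiny domain check could look like this (the field name and allowed values are, of course, specific to your target website):

```python
ALLOWED_GENDERS = {"Man", "Woman"}  # the known domain of the field


def check_field_domain(records, field="gender", allowed=ALLOWED_GENDERS):
    """Return the unexpected values found in a field with a known domain,
    together with how often they appear, so an alert can be raised."""
    unexpected = {}
    for record in records:
        value = record.get(field)
        if value not in allowed:
            unexpected[value] = unexpected.get(value, 0) + 1
    return unexpected
```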
Manual data quality control
Along the same lines, we can check the format of fields like e-mails and URLs, at least verifying that they contain an “@” or start with http, but this might not be enough.
Who guarantees that the URL pointing to a product image is working and leads to the correct one?
Unfortunately, at the moment, I believe there’s no solution other than sample-inspecting the results, but I’d be happy to hear from you if you have smarter ideas.
The same applies to fields like product descriptions: we can check whether they’re empty or too short, but we cannot tell from their content alone whether the selector we’re using is still working or whether it’s now pointing to another portion of the website.
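A hedged sketch of these format checks, plus a HEAD-request spot check on a random sample of image URLs, could look like the following; the regular expression, sample size, and timeout are illustrative assumptions.

```python
import random
import re

import requests

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(value):
    """Cheap format check: the string at least resembles an e-mail address."""
    return bool(EMAIL_RE.match(value or ""))


def looks_like_url(value):
    """Cheap format check: the string at least starts with http(s)."""
    return bool(value) and value.startswith(("http://", "https://"))


def sample_check_image_urls(urls, sample_size=20, timeout=10):
    """Spot-check a random sample of image URLs with a HEAD request.

    This only tells us the link responds with an image content type, not
    that it shows the *right* product: that part still needs manual review.
    """
    broken = []
    for url in random.sample(urls, min(sample_size, len(urls))):
        try:
            response = requests.head(url, timeout=timeout, allow_redirects=True)
            content_type = response.headers.get("Content-Type", "")
            if response.status_code != 200 or not content_type.startswith("image/"):
                broken.append(url)
        except requests.RequestException:
            broken.append(url)
    return broken
```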
Data Publishing
Only after our web-scraped data has passed all the previous steps can we be confident enough to publish it and deliver it to our end users.
Since we’re talking about web data, where we don’t have any control over the source, a certain level of “noise” in the data is endemic, but we can try to mitigate it.
Ensuring web data quality is an ongoing process, and techniques and tools are constantly evolving. I’ve recently discovered changedetection.io, which seems useful for monitoring changes in web pages and in the output of selectors. I’ve never used it yet, but I’ll try it in one of the next posts. Do you have any other tools to suggest? Please write them in the comments section.