Change detection for web scraping: tools and techniques
How to get warned proactively before your scraper breaks
One of the main difficulties of web-scraped data pipelines is that we have no control over the data source. While companies’ internal data pipelines can rely on plenty of tools and practices for data management and propagation, like data contracts, in web scraping we usually discover that a website has changed its structure only when our scraper stops working.
Of course, this is not optimal: if we need to deliver our data within a tight timeline, there might not be enough time to fix the scraper.
But web scraping is becoming an increasingly mature industry, and new services are launched every day to help us improve our processes.
What kind of changes could we expect?
There are several aspects to consider when thinking about change detection in web scraping data pipelines, and they can be grouped into two main categories.
HTML modifications
This is the most obvious and common case: when a website changes its HTML code, there’s a chance that the scraper’s selectors won’t work anymore.
Depending on the website, the industry it operates in, and how often its content changes, there can be some predictability in these occurrences. In fashion retail, for example, when a new collection hits the website, there’s usually also some visual restyling that can lead to changes in the HTML. The same happens during the discount season or Black Friday.
But these are only soft signals, not hard rules that hold for every single website: changes can happen at any time.
Tech stack changes
Web scraping is affected not only by changes in the HTML code but also by the website owner adding or changing an anti-bot solution, which can require a different approach on the web scraping side, from the tools used to the running environment.
The framework chosen for building the website could also change, but since this implies a change in the HTML too, this case falls under what we’ve already seen before.
How to mitigate the effects of website changes
To reduce the impact of HTML changes on the target website, following some best practices when writing the scraper is key.
Use the website’s API whenever possible: APIs are less prone to change and are not affected by modifications to the HTML structure.
If that’s not possible, use the JSON embedded by the framework used to build the website. For example, when a website is built with React and Redux and needs to list items on a page, the result of the database query can often be read from the “window.__PRELOADED_STATE__ =” JSON string inside the HTML. You can read more about this pattern in the Redux documentation on server rendering.
The same happens with the “__NEXT_DATA__” payload in the Next.js framework (see the sketch right after this list).
Use the most generic selector expression possible, to reduce the number of HTML nodes involved that could break. For example, instead of writing the following XPath expression
//div/span/a[@class="price"]/text()
write directly
//a[@class="price"]/text()
so that whatever happens in the higher-level nodes doesn’t affect the success of the expression.
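To make the embedded-JSON approach concrete, here’s a minimal sketch of both extraction patterns. The URL, data paths, and regex are hypothetical and will need adjusting to the actual page:

import json
import re

import requests
from bs4 import BeautifulSoup

html_text = requests.get("https://example.com/products", timeout=30).text

# Next.js embeds the page data in a script tag with id="__NEXT_DATA__"
soup = BeautifulSoup(html_text, "html.parser")
next_data = soup.find("script", id="__NEXT_DATA__")
if next_data:
    payload = json.loads(next_data.string)
    print(payload.get("props", {}).get("pageProps", {}))  # path varies per site

# Redux SSR apps often assign their state to window.__PRELOADED_STATE__
match = re.search(r"window\.__PRELOADED_STATE__\s*=\s*(\{.*?\});", html_text, re.DOTALL)
if match:
    state = json.loads(match.group(1))  # naive extraction, adjust to the page
    print(list(state.keys()))

Since these payloads come straight from the backend, they usually survive the cosmetic restyling that breaks CSS and XPath selectors.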
After applying all these best practices, what we can do is monitor the website to check if something changes between our scraper executions.
In fact, with a daily or intraday data refresh we immediately see the outcome of the scrapers and can quickly intercept any issue in the downloaded data; with a weekly or monthly refresh, a website has many more days in which it could have changed without us being aware of it.
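A lightweight health check that runs between executions can close this gap. Here’s a minimal sketch, assuming a hypothetical URL and XPath expressions, with a print statement standing in for a real alert channel:

import requests
from lxml import html

# Hypothetical page and the XPath expressions our scraper relies on
URL = "https://example.com/products"
KEY_SELECTORS = {
    "price": '//a[@class="price"]/text()',
    "title": '//h1[@class="product-title"]/text()',
}

def check_selectors():
    tree = html.fromstring(requests.get(URL, timeout=30).content)
    broken = [name for name, xpath in KEY_SELECTORS.items() if not tree.xpath(xpath)]
    if broken:
        # Swap the print for an email or Slack webhook in production
        print(f"ALERT: selectors returning nothing: {broken}")

if __name__ == "__main__":
    check_selectors()

Scheduled daily with cron, this warns you about a breaking change before the next scraping run does.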
Useful tools for change detection
As mentioned before, we’re lucky enough to live in a tech era where there’s a tool for almost everything, and this niche is no exception.
If we want to find out whether a website has added an anti-bot solution to its tech stack, we can use Wappalyzer. It’s a free browser extension that analyzes the cookies and other signals coming from the page you’re visiting and shows you the whole tech stack used by the website’s developers.
The section that interests us the most is the security one, where almost every well-known anti-bot is detected if installed.
I’m sure there are many other similar tools on the market, but I still haven’t found one as precise in detecting anti-bots. Feel free to write me at pier@thewebscraping.club if you’re using another one.
Wappalyzer gathers all this information in a database that can be accessed via API, so you can easily build your own tech stack monitoring tool for the websites you’re targeting. There’s only one big issue: API access costs 450 USD per month, a price not for every pocket.
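Here’s a sketch of what such a monitor could look like. The endpoint, header, and response fields below are assumptions based on Wappalyzer’s public Lookup API documentation, so verify them before relying on this:

import requests

# Assumed endpoint and auth header for Wappalyzer's Lookup API (verify in the docs)
API_URL = "https://api.wappalyzer.com/v2/lookup/"
HEADERS = {"x-api-key": "your-api-key"}

resp = requests.get(
    API_URL,
    params={"urls": "https://example.com"},
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()

for site in resp.json():
    # Response fields assumed: each detected technology lists its categories
    security = [
        tech["name"]
        for tech in site.get("technologies", [])
        if any(cat.get("slug") == "security" for cat in tech.get("categories", []))
    ]
    print(site.get("url"), "-> security stack:", security)

Persisting this output per run and diffing it against the previous one is enough to get warned when an anti-bot shows up in the stack.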
Another useful tool for discovering changes in the HTML code of websites is changedetection.io, a service founded by Leigh Morresi. Given a URL, you can ask to be notified when something on the website changes, not only visually but also at the selector level.
This way, you can set up a test using one or more selectors for the key fields of your scraper, and when they stop returning the expected values, you’re alerted and can take the actions needed to fix the issue.
You can also add proxies so the checks can be performed even on websites with an anti-bot installed, there’s a visual editor to highlight the part of the page that needs to be monitored, and there are plenty of integrations with messaging platforms where you can receive your alerts.
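Watches can also be created programmatically, since changedetection.io exposes a small REST API. The endpoint and field names below follow its published API docs, but treat them as assumptions and double-check them against your instance:

import requests

# Your instance URL and the API key from the settings page (both hypothetical)
BASE_URL = "http://localhost:5000"
HEADERS = {"x-api-key": "your-api-key"}

# Create a watch on a hypothetical product page
watch = {
    "url": "https://example.com/products",
    "title": "Product page watch",
}
resp = requests.post(f"{BASE_URL}/api/v1/watch", json=watch, headers=HEADERS, timeout=30)
resp.raise_for_status()
print("Watch created:", resp.status_code, resp.text)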
The price is quite affordable for any freelancer working in web scraping (around 9 USD per month). I still haven’t used it in production, but I’m planning to do so soon.
Do you know any other tools for change detection on websites? Please let me know at pier@thewebscraping.club