The Web Data Extraction Summit 2023 wrap up
What happened in the latest edition of Zyte's in-person event
The 2023 edition of the Web Data Extraction Summit ended last Thursday and, while I’m still in Dublin, I wanted to share my two cents on the event with the Club’s readers, especially those who could not be here.
Does it make any sense to hold an in-person event in 2023?
It totally does. Even though any tech-related subject can be covered online in a webinar, gathering under one roof and meeting each other reminds us that we’re not alone. We may work in the same landscape, even compete with each other, but that doesn’t mean we shouldn’t meet. Web scraping is a niche, and companies are scattered around the world, so in our everyday lives it can be hard to find an occasion to exchange ideas in person with someone in the same industry. In the end, a rising tide lifts all boats, and we need to work together to make that tide rise higher and higher.
How big is the web scraping industry?
At the 2022 edition, market research presented during the event estimated that 6 billion USD would be spent on web scraping in 2023.
Of course, it’s an estimate, but I have the feeling it’s a significant underestimate, first of all because the research was done before the generative AI boom. Nowadays, everyone who needs to train an AI model needs data, and most of it comes from web scraping. There are no figures available, but given the scale of its web scraping operations, OpenAI alone probably spent 6B on web scraping in 2023. On top of that, many companies are probably still not admitting publicly that they do web scraping, and many more would like to but remain unserved due to a lack of skills and limited budgets. So I think, and hope, the market is much larger than we perceive.
But let’s go back to the 2023 event and wrap up the insights that emerged from the brilliant talks given.
AI will be more and more present in web scraping solutions
From the opening talk by Shane, CEO of Zyte, to several others by Konstantin and Iain, a major trend was confirmed: AI will be embedded in web scraping solutions to solve both the anti-bot bypassing problem and the parsing one. Of course, such solutions should be scalable, robust, and reliable, and, at the same time, we should be able to override the decisions the AI makes with our own. This way we keep control of what happens and avoid surprises in the output, while still saving time when creating a web scraper.
The death of datacenter proxies?
This provocative claim was the center of the presentation by Isaac Coleman from Rayobyte. The market has seen a decline in datacenter proxy usage as more and more websites block them when used for scraping, but the real question is another one: do you know how much your scraping operations cost, and, if so, will your company survive if you suddenly need to switch to more expensive solutions to evade anti-bots, which are becoming harder and harder to bypass? Something worth more than one thought!
A deep dive on anti-bot
Another great talk was given by Fabien Vauchelles, whom you might already know for his work on Scrapoxy, an advanced proxy manager, or for the article he wrote some months ago about reverse engineering a mobile API.
His explanation of how anti-bots work, which techniques they use to detect bots, and, consequently, what we can do to avoid being detected was truly interesting. It was both highly accessible and deeply technical, and you should have a look at it as soon as the recording is out.
The web scraping legal boundaries
Another talk you cannot miss is the one by the always-brilliant Sanaea Daruwalla, Chief of Legal and People at Zyte. With some clear and practical examples, she helped us understand the boundaries we must stay within to keep our scraping operations fully legal.
There were also some insights into what it means, from a legal perspective, to scrape web data for training AI models, and which key aspects to be aware of and careful about, since legislation in this field is still in its early days.
The advantages of a marketplace specialized in web-data
Last but not least, I cannot avoid mentioning the talk by Andrea from Databoutique.com, who examined what the introduction of dedicated marketplaces means for the web scraping industry. In a few words, for companies looking for web data, it’s like moving from a made-to-order artisan to mass-distribution stores: you won’t get the most customized product, but, at a lower price and in no time, you should be able to get standard products that satisfy common needs. This brings into the market companies that could not afford made-to-order solutions and, from a seller’s perspective, drives more sales of the same product. It’s a win-win situation if the marketplace does its job of ensuring the quality and legality of the products on offer.
I haven’t mentioned all the other great talks, but they were all worth listening to, offering great insight into different aspects of the industry, from large-scale extraction to the peculiarities of alternative and location data.
Considering how much effort went into organizing the event (about six months of work), I want to say a huge thank you to the whole Zyte team for putting together such an event. Hope to see you next year!