These months are the busiest of the year for the web scraping industry, since most of its events take place in this period.
Last week we had the Austin date of the Extract Summit by Zyte, which, unfortunately, I was unable to attend. Yesterday in Vilnius we had Oxycon 2025, where I had the privilege of being invited as a speaker alongside other great panelists. While we wait for the recordings to become publicly available, I’d like to share a brief summary of the key topics covered.
AI, LLMs, and scraping
Of course, this is the unavoidable topic, but it can be approached from several angles.
Zia Ahmad’s talk gave us a taste of how “old school” AI, like computer vision and NLP, is still relevant for sentiment analysis and CAPTCHA solving, which was a refreshing perspective.
Each day, we’re inundated with news about the latest LLMs, benchmarks, and improvements, and we forget that traditional AI is still usable and relevant for solving specific tasks.
But of course, most of the speeches, including mine, were about LLMs and how to use them in scraping.
AI Studio
Let’s start with the latest developments coming from the Oxylabs team: we had a sneak peek at some of the tools inside the new AI Studio.
In particular, we saw how to get data from different websites without writing a single line of code, just with a prompt.
Once the studio is configured in your IDE (in this case, Cursor), you can simply prompt it with what you want, say, some categories of items from a list of websites, and the magic happens.
The AI Browser opens the website, the crawler scans it and extracts the URLs that match the criteria in the prompt, and the scraper pulls the data from those pages in the desired output format. The most interesting part is that all the tools are powered by Oxylabs’ unblocking solutions, so you don’t have to worry about anti-bot protections.
LLMs for content creators and scraping professionals
Still on the topic of LLMs, my talk was about how they’re helping me both in my “content creator” career and as a scraping professional.
I described how I keep myself informed by using LLMs, which extract data from a virtually endless list of web sources and create a summary for me, allowing me to decide what to read and when. This simply wasn’t possible with traditional scraping alone, so we can say that LLMs made this kind of scraping feasible at scale, at least for this use case.
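To give an idea of the pattern (this is a minimal sketch, not my exact setup), the pipeline boils down to: fetch each source, strip the HTML down to text, and ask an LLM for a digest. The source list, model name, and prompts below are placeholders.

```python
# Minimal sketch of the "LLM digest" pipeline described above.
# The source list, model name, and prompts are placeholders, not my actual setup.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

SOURCES = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def page_text(url: str) -> str:
    """Download a page and return its visible text, truncated to keep the prompt small."""
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    return text[:8000]


articles = "\n\n".join(f"URL: {url}\n{page_text(url)}" for url in SOURCES)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You summarize articles for someone working in web scraping."},
        {"role": "user", "content": "Summarize each article in two or three bullet points "
                                    f"and flag which ones are worth reading in full:\n\n{articles}"},
    ],
)
print(response.choices[0].message.content)
```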
Another interesting use case is making the LLM write a scraper for you, which was another recurring topic, at least in the chats we had at the event. In this case, with Cursor, we can describe our scraping process, best practices, and rules, so that the LLM clearly knows what we expect as an outcome. MCPs, in turn, solve the problem of having real-time HTML to apply these rules to, so we can get a fully working scraper just by prompting our needs.
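To make the idea concrete outside of Cursor, here’s a hedged sketch of the same pattern: your rules plus the live HTML of a target page go into the prompt, and the model returns the scraper code. The rules, target URL, and model are illustrative assumptions, and a plain HTTP request stands in for the MCP server that would normally provide the fresh HTML.

```python
# Sketch of the "LLM writes the scraper" pattern: rules + live HTML in the prompt.
# The rules text, target URL, and model are illustrative placeholders; a plain
# requests call stands in for the MCP server that would normally fetch the page.
import requests
from openai import OpenAI

RULES = """
- Use Python with the parsel library.
- Extract product name, price, and availability.
- Return the results as a list of dicts and print them as JSON.
- Prefer stable attributes (ids, data-* attributes) over brittle CSS chains.
"""

TARGET_URL = "https://example.com/category/shoes"  # hypothetical target page

html = requests.get(TARGET_URL, timeout=30).text[:15000]  # trim to fit the context window

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You write clean, production-quality web scrapers."},
        {"role": "user", "content": f"Following these rules:\n{RULES}\nWrite a scraper for this page:\n\n{html}"},
    ],
)

# Save the generated scraper so it can be reviewed before being run.
with open("generated_scraper.py", "w") as f:
    f.write(response.choices[0].message.content)
```

In the Cursor workflow the rules live in the project configuration and the fresh HTML arrives through an MCP server, but the division of labor is the same.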
Content creation and AI
After the talk, I was asked how, as a content creator, I’m coping with the fact that AI companies scrape my articles, and this was another recurring topic during the day.
There’s no simple answer to that, and the landscape is evolving so rapidly that whatever I say today may no longer hold in a few months. For the moment, I think content creators, especially in certain industries and business models, will have a hard time.
For decades, we have begged Google to show us on the first page of its results, and it was a clear trade-off: we create content the way you like it, so that we can be found and more traffic is directed to our websites. This doesn’t always mean better content, but that’s another story.
Now we’re experiencing something unprecedented: companies with enormous scraping budgets have gathered large portions of the web, mostly for their internal use, and it’s up to them whether and when to show the source of the information they’re displaying.
Yes, I can also see on The Web Scraping Club that some new visitors are coming from ChatGPT, Claude, or Perplexity. However, what I don’t see are the AI users who read my content in a summarized way and didn’t (or couldn't) visit my website.
Luckily for me, I write about web scraping, which is constantly evolving, so people interested in this topic are more likely to subscribe and read firsthand content rather than a version that has already been chewed up and digested. However, this is not always true for other companies and industries, and pay-per-crawl compensation, in my opinion, isn’t a solution, since the math doesn’t work.
New times, old issues: parsing, scaling, and anti-bots
Last but not least, we had some interesting panels and talks about the foundations of web scraping: HTML parsing and bypassing anti-bots.
Oxylabs presented their new tool for parsing HTML, which I’m keen to try in order to understand how it can streamline scraper development.
The talk by Fred de Villamil about how NielsenIQ scaled its scraping operations using the right combination of great people, smart processes, and wise resource management was enlightening, as large-scale scraping is always a challenge for anyone in this industry.
The final panel, about anti-bot bypassing, was a clear demonstration that building a reliable scraping data pipeline is a tough process. During my presentation, I joked that web scraping is not rocket science, but that’s only true until you have to face an anti-bot. At that point, having people with excellent skills and a hacker mentality on your team is necessary to bypass it, and even then it’s not always possible. That’s why the important thing is to set realistic expectations with your stakeholders.
If you weren’t able to follow the live stream of the event, I highly recommend catching up on the recordings once they’re available; all the panels and talks were valuable and worth listening to.
As a final note, let me thank Oxylabs again and, in particular, the event team behind the Oxycon organization. They made us feel welcome from the first minute and have done a really great job over these past months. It was a pleasure to meet so many brilliant people in person, all in one place.