Zyte's Extract Summit 2024 Wrap-Up

From advanced scraping techniques to AI, here are the latest trends shown at the summit

Oct 20, 2024

The annual Zyte in-person conference, this time in Austin, took place on the 9th and 10th of October.

Unfortunately, I could not be there, and it was a real pity: I love to meet all the main actors in the web scraping industry in person, and this is the only occasion we have to gather. Hopefully, there will be a chance to meet next year!

Anyway, I’ve just seen all the videos published on YouTube, and in this post, I’m trying to summarize the main trends that emerged at the conference.

One of them is pretty clear from the talk titles: LLMs are now center stage, both as data extractors and as data users.

LLMs for data extraction

The last talk of the day was by Asim Shrestha, co-founder of Reworkd AI, a startup that creates LLM agents to scrape data from the web.

In this talk, he discussed the challenges of scraping data from the web from an LLM perspective, such as the kind of input you need to pass and how to reproduce the browser state. In fact, passing the raw HTML works, but it's probably not the best option on larger websites with a lot of scripts and other irrelevant code.

The Reworkd approach is interesting since it uses a so-called 2D rendered page, a sort of Markdown file that keeps the page's visual formatting and annotates its different elements, allowing the prompt to be written more efficiently.
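To give a feeling for why trimming the input matters, here is a minimal sketch that reduces raw HTML to a compact, Markdown-like text before it is handed to an LLM. This is not Reworkd's 2D rendering, which was not shown in detail; it is a much simpler illustration of the same underlying idea, and the class and function names (`CompactTextExtractor`, `html_to_compact_text`) are made up for this example. It only uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class CompactTextExtractor(HTMLParser):
    """Collect visible text, skip scripts and styles, annotate headings and links."""

    SKIPPED_TAGS = {"script", "style", "noscript"}
    HEADING_PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._prefix = ""      # Markdown-style heading marker, if any
        self._href = None      # target of the link we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.HEADING_PREFIXES:
            self._prefix = self.HEADING_PREFIXES[tag]
        elif tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.HEADING_PREFIXES:
            self._prefix = ""
        elif tag == "a":
            self._href = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or self._skip_depth:
            return
        if self._href:
            text = f"[{text}]({self._href})"
        self.lines.append(self._prefix + text)


def html_to_compact_text(html: str) -> str:
    """Return a short text rendering of the page, one text node per line."""
    parser = CompactTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


sample_html = (
    "<html><head><script>var x = 1;</script><style>p{}</style></head>"
    "<body><h1>Shoes</h1><p>Price: <a href='/p/1'>49.99</a></p></body></html>"
)
compact = html_to_compact_text(sample_html)
print(compact)
```

The script and style contents never reach the output, so the prompt carries only the text a visitor would actually see, plus light annotations for headings and links.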

But a good prompt is not enough to build a production-ready product. The system should be built by splitting up the individual actions a human performs when creating a scraper, so that each one can be adjusted when a specific website changes, adapting to the long tail of cases we have on the web.

Last but not least, the output should be tested constantly to ensure the LLM is still working after website changes or bans. It was a great talk with a lot of inspirational themes, at least for me, as I’m just approaching these techniques.
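As a rough idea of what testing the output constantly could look like, here is a small sketch that flags fields whose fill rate drops below a threshold in a batch of scraped records. The function name, the field names, and the 90% threshold are assumptions chosen for illustration, not something presented in the talk.

```python
def find_degraded_fields(records, required_fields, min_fill_rate=0.9):
    """Return {field: fill_rate} for every required field below the threshold.

    A field counts as filled when its value is neither None nor an empty string.
    An empty batch is treated as a 0% fill rate for every field.
    """
    failures = {}
    total = len(records)
    for field in required_fields:
        filled = sum(1 for record in records if record.get(field) not in (None, ""))
        rate = filled / total if total else 0.0
        if rate < min_fill_rate:
            failures[field] = rate
    return failures


batch = [
    {"title": "A", "price": "10"},
    {"title": "B", "price": ""},
    {"title": "C", "price": None},
    {"title": "D", "price": "5"},
]
degraded = find_degraded_fields(batch, ["title", "price"])
print(degraded)
```

A sudden drop on a single field usually points to a layout change, while an empty batch is more likely a ban or a block, so both cases are worth an alert.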

Along the same lines, Jan Curn, Apify's CTO, showed how LLMs' knowledge can be improved with fresh data from the web, using Apify actors to transform web pages into Markdown files that are later passed to the LLMs.

Of course, speaking of AI scraping at Extract Summit, Ian Lennon, CPO at Zyte, introduced us to Zyte API and its potential.

I’m becoming increasingly confident that this family of tools will be the future of web scraping.

What I appreciate about Zyte API is its variable cost per request: as Ian also mentioned, if you’re using Zyte API for a website without anti-bot protection, the runtime cost of the scraper is close to what we would pay to run a plain Python scraper.

Also from the Zyte team, Iván Sánchez, Senior Data Scientist at Zyte, gave a great talk on the challenges and techniques involved in using LLMs to extract data from the web.

LLMs are also used for data engineering

Neelabh Pant, Senior Manager of Data Science at Walmart, showed another interesting use case for LLMs.

While data pipelines work great in theory, the data ingested in the real world has flaws in most cases: it can be incomplete or wrong, and it requires fixing and integration work that can be painful and time-intensive.

Thanks to LLMs, some of these steps can be automated, especially when you need to complete your input data with some content coming from unstructured data. For example, let’s say the “color” column is missing for some products in the input dataset, but this information is available in their description. You can use an LLM to integrate the color information.
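The color example can be sketched as follows. To keep the snippet runnable without any external service, the LLM call is replaced by a simple keyword matcher (`keyword_color_extractor`); in a real pipeline you would pass your own function that prompts a model instead. All names and the color list are hypothetical and only serve the illustration.

```python
import re
from typing import Callable, Optional

KNOWN_COLORS = ("black", "white", "red", "blue", "green")


def keyword_color_extractor(description: str) -> Optional[str]:
    """Stand-in for an LLM call: return the first known color word found."""
    words = re.findall(r"[a-z]+", description.lower())
    for word in words:
        if word in KNOWN_COLORS:
            return word
    return None


def fill_missing_color(
    rows,
    extract_color: Callable[[str], Optional[str]] = keyword_color_extractor,
):
    """Return copies of the rows, filling an empty 'color' from 'description'."""
    completed = []
    for row in rows:
        new_row = dict(row)  # never mutate the input dataset
        if not new_row.get("color"):
            new_row["color"] = extract_color(new_row.get("description", ""))
        completed.append(new_row)
    return completed


products = [
    {"description": "Classic Blue denim jacket", "color": None},
    {"description": "Running shoes", "color": "red"},
    {"description": "Plain mug"},
]
enriched = fill_missing_color(products)
print(enriched)
```

Keeping the extractor as a pluggable function means the existing value is never overwritten, and the model is only called for the rows that actually need it.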

This also works when you need to enrich your datasets using unstructured data, for example, if you want to perform a sentiment analysis on the reviews.

Other great talks

There’s no Extract Summit without a talk about the legal landscape of web scraping. This year, together with Sanaea Daruwalla, Chief Legal & People Officer at Zyte, there were Hope Skibitsky, Partner at Quinn Emanuel, Stacey Brandenburg, Shareholder at ZwillGen, and Don D'Amico, Founder & CEO at Glacier Network and former General Counsel at Neudata.

Whether scraping operations are legal is always a hot topic, and this panel confirmed once again that it’s a complex subject. The answer may vary depending on what you're scraping, how you're doing it, and what you’re using the data for.

One of the main topics in the legal landscape of web scraping is breach of contract: the panelists clearly explained the difference between browsewrap and clickwrap terms of service and when they’re legally enforceable.

Another interesting point, which emerged from the recent X vs. Bright Data case, is X's allegation about the impact of scraping on the target server. While the court dismissed the case on this occasion, it’s important to remember that every action we perform on the target server could cause monetary damage, so keeping our operations polite and lightweight is essential.

Of course, this panel could not end without discussing legal cases in the AI world, given the number of cases filed in the past year.
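On the practical side, one common way to stay polite is to enforce a minimum delay between two requests to the same domain. Here is a minimal sketch of that idea; the class name `PoliteThrottle` and the delay value are assumptions for illustration, and the right delay depends on the target website. The clock and sleep functions are injectable only to make the example easy to test.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests to one domain."""

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # domain -> time of the last request

    def wait(self, domain):
        """Sleep if needed before hitting `domain`; return the seconds waited."""
        waited = 0.0
        last = self._last_request.get(domain)
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request[domain] = self._clock()
        return waited


# Usage with a fake clock, so the example runs instantly.
fake_now = [100.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


throttle = PoliteThrottle(2.0, clock=lambda: fake_now[0], sleep=fake_sleep)
first_wait = throttle.wait("example.com")    # first request: no delay
fake_now[0] += 0.5                           # half a second goes by
second_wait = throttle.wait("example.com")   # too soon: waits the remainder
other_wait = throttle.wait("other.example")  # different domain: no delay
```

In a real scraper you would call `throttle.wait(domain)` right before each request and leave the default `time.monotonic` and `time.sleep` in place.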

Another talk deserving a mention is the one by Matthew Blumberg, Co-founder of Charity Engine.

He showed us how Charity Engine created a distributed infrastructure for different activities, one of which is web scraping.

One of the most incredible things about it is that most of the revenue goes to charity, which makes this approach unique.

Thanks to the Zyte team for creating the best event on web scraping, which gets more interesting year after year. This is a short wrap-up of the day, but I strongly recommend having a look at the full YouTube playlist with all the videos from the summit; they’re all extremely interesting.

The Web Scraping Club
