AI and data: two sides of the same coin
How the training of LLMs changed the web data industry
AI was not invented with LLMs: it was founded as an academic discipline in 1956, and commercial solutions using some form of artificial intelligence have been spreading across many fields for at least twenty years.
Think about dynamic pricing algorithms, price optimization, supervised classification, autonomous driving, clustering of any kind of entity, and sentiment analysis. These are all mature fields of application, and any company can build its own tools based on these algorithms.
The advent of ChatGPT (and also Midjourney) a few years ago was a wake-up call for the non-business audience: for the first time, anyone in the world could touch the power of a new type of AI, so-called generative AI, with their own hands.
After decades of research, we can say that AI has become mainstream and is everywhere, literally.
But what are the differences between traditional and generative AI that made the latter so popular?
Traditional and generative AI: a matter of scope and scale
First of all, a big disclaimer: I’m not an AI specialist, so I won’t discuss the technical details of the subject. But I’m feeling firsthand the repercussions of this shift in the demand for web data, and I’ve tried to reflect on why it is happening.
The first big difference between the “traditional AI” models and generative AI is the scope of their application.
Traditional AI was designed and trained to solve one specific task, be it driving autonomously on the road, suggesting the price of an item, clustering different customers, or detecting whether workers were wearing helmets on a building site. Each model solved one specific task, even a complex one like driving safely in traffic.
ChatGPT, like all the other generative AI applications that take a prompt as input, is designed to receive an open request from the user, understand the context, and provide a plausible, unpredetermined answer. As you can easily imagine, this requires an incredible amount of input data and computing power.
In just a few years, ChatGPT became the eighth most visited website in the world, according to Similarweb.
In addition to maintaining state-of-the-art infrastructure, just think of the variety of prompts ChatGPT receives in a single day. To give a meaningful answer to everyone, an enormous amount of data of very different natures must be ingested and processed.
This leads to the second big difference between traditional and generative AI: the scale of the operations.
While almost every business could build its own “traditional AI” solution, at least the simplest ones like dynamic pricing or clustering, few companies in the world have the resources to train a new LLM from the ground up, given the capital needed to train these models.
For traditional AI, data collection happens at a “normal” scale: if you’re using external data for your algorithm, the scope of your scraping is limited, prices for your dynamic pricing or revenue optimization service, social media for your sentiment analysis SaaS, and so on. I can’t even imagine (since it’s not publicly disclosed) the list of websites scraped to feed ChatGPT.
That’s why the easiest way for your company to use LLMs is to stand on the shoulders of giants and customize existing models for your specific needs.
If you want to see a practical experiment on creating your custom GPT, don’t miss the next Oxylabs webinar. I’ll show you how to build a knowledge base, from scraping data to training a custom GPT of The Web Scraping Club.
You can try The Web Scraping Club GPT to see my test results at this link.
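If you’re curious about what that pipeline can look like before the webinar, here is a minimal sketch, assuming a couple of placeholder article URLs and the BeautifulSoup library, that scrapes pages into a plain-text file you could then upload as a knowledge file for a custom GPT. It’s an illustration of the idea, not the exact setup I’ll show in the webinar.

```python
# Minimal sketch: scrape a few articles and save them as a plain-text
# knowledge file that can be uploaded to a custom GPT.
# The URLs below are placeholders, not real endpoints.
import requests
from bs4 import BeautifulSoup

ARTICLE_URLS = [
    "https://example.com/articles/web-scraping-101",
    "https://example.com/articles/anti-bot-basics",
]

def extract_text(url: str) -> str:
    """Download a page and keep only its readable text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop scripts and styles, keep the visible text only.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

if __name__ == "__main__":
    with open("knowledge_base.txt", "w", encoding="utf-8") as f:
        for url in ARTICLE_URLS:
            f.write(f"### Source: {url}\n")
            f.write(extract_text(url) + "\n\n")
```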
How generative AI impacted web scraping
AI, by definition, needs data to train its algorithms and models. As mentioned in the previous paragraph, generative AI requires a huge amount of data, which presents unprecedented challenges in several areas.
Taking for granted that the engineers at OpenAI and similar companies can bypass all the anti-bot protections they find in their way, there are still critical challenges when you need to feed your models with every meaningful part of the web.
Data management and usage
Ingesting this huge amount of data and training the models on it is difficult: as we can see from a simple prompt, ChatGPT's knowledge lags by about a year, even considering all the resources OpenAI has for this.
To provide fresher data, ChatGPT recently added a web search capability, which allows it to browse the web and return the latest information related to your prompt, just like Perplexity does.
Data and output quality
Controlling data quality when scraping prices or other factual elements is much easier than with unstructured data. You can check whether a number is really a number and whether certain fields have the values you expected. With unstructured data, you scrape textual content from social or news feeds, and you can only check, when possible, whether you scraped the whole post or article.
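To make the difference concrete, here is a toy sketch of both kinds of checks; the field names, accepted currencies, and the truncation heuristic are illustrative assumptions, not a real validation suite.

```python
# Toy validation: structured records can be checked field by field,
# while for unstructured text we can only apply weak heuristics.
# Field names and thresholds below are illustrative assumptions.

def validate_price_record(record: dict) -> list[str]:
    """Return a list of problems found in a scraped price record."""
    errors = []
    try:
        price = float(record.get("price", ""))
        if price <= 0:
            errors.append("price must be positive")
    except ValueError:
        errors.append("price is not a number")
    if record.get("currency") not in {"EUR", "USD", "GBP"}:
        errors.append("unexpected currency")
    if not record.get("product_name"):
        errors.append("missing product name")
    return errors

def looks_truncated(article_text: str) -> bool:
    """Weak heuristic: flag articles that look suspiciously short or cut off."""
    return len(article_text) < 200 or not article_text.rstrip().endswith((".", "!", "?"))

print(validate_price_record({"price": "19.9", "currency": "EUR", "product_name": "T-shirt"}))  # []
print(looks_truncated("Breaking news: the article stops mid-sent"))  # True
```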
Imagine having thousands of structured and unstructured data feeds from websites in any domain: how can you control the content you ingest? How can you prevent your LLM from answering prompts with hate speech, racist language, or suggestions of harmful things, all of which are spread all over the web?
While the training data can be filtered on input, the scope of the answers must also be fenced so that nothing harmful is returned as a response to a prompt.
Again, it’s a huge task, since the model needs to understand what is dangerous and what is not, recognizing sarcasm, hate speech, and controversial content.
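Just to illustrate how limited a naive approach is, here is a deliberately oversimplified sketch of pre-ingestion filtering based on a placeholder blocklist; real pipelines rely on trained classifiers and human review, precisely because a keyword filter cannot catch sarcasm or context.

```python
# Deliberately oversimplified sketch of pre-ingestion filtering.
# Real pipelines use trained toxicity classifiers and human review;
# a keyword blocklist like this misses sarcasm and context entirely.
BLOCKLIST = {"slur_example_1", "slur_example_2"}  # placeholder terms

def is_safe_for_training(document: str) -> bool:
    """Reject documents containing blocklisted terms."""
    words = {w.strip(".,!?").lower() for w in document.split()}
    return BLOCKLIST.isdisjoint(words)

corpus = [
    "A harmless article about web scraping.",
    "Some text containing slur_example_1 that should be dropped.",
]
clean_corpus = [doc for doc in corpus if is_safe_for_training(doc)]
print(len(clean_corpus))  # 1
```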
Data collection
Yes, OpenAI used Common Crawl to ingest a large chunk of the web's history. It also has commercial deals with websites like Reddit. But, in the end, like every AI company, it needs to scrape websites to get more data, and the more, the better.
And not always using best practices.
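For readers who want to see what working with Common Crawl looks like, here is a minimal sketch that queries its public CDX index for the captures of a domain; the crawl label is just an example, since new crawls are released regularly and the available ones are listed at index.commoncrawl.org.

```python
# Minimal sketch: query the Common Crawl CDX index for captures of a domain.
# The crawl label below is an example; pick a current one from
# https://index.commoncrawl.org/.
import json
import requests

CRAWL = "CC-MAIN-2024-10"  # example crawl label
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL}-index"

def list_captures(domain: str, limit: int = 5) -> list[dict]:
    """Return index records (URL, timestamp, WARC location) for a domain."""
    params = {"url": f"{domain}/*", "output": "json", "limit": limit}
    response = requests.get(INDEX_URL, params=params, timeout=60)
    response.raise_for_status()
    return [json.loads(line) for line in response.text.splitlines()]

for record in list_captures("example.com"):
    print(record["timestamp"], record["url"], record["filename"])
```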
Copyright infringement
Do you remember when Google News shut down in Spain because publishers thought it was stealing their business and the government wanted to tax Google for showing their news? Well, after eight years, Google News came back after reaching an agreement with the publishers. Reading between the lines: traffic to the publishers' websites had dropped, they were losing money, and they asked Google to come back.
Well, now it can get worse. While for the newest content ChatGPT can link to external websites and send traffic to them, all the content previously ingested into the model can be shown to users without linking to any other website. This means the profits derived from the search stay with OpenAI. This happens with all the copyrighted content ingested: it can be hard to tell whether it has been scraped and reworked before being shown to the end user, and this is true for every content format: images, videos, audio, and text.
Given all these challenges, how has the advent of generative AI impacted the web scraping industry?
Well, for sure, the need for data is good for the whole industry: proxy providers, extraction tools, and data companies are all involved in this data race, selling their services not only to the big players but also to the new startups building on generative AI today. Web scraping has become more mainstream, since more people are now aware of what can be done by extracting data from the web.
On the other hand, I’m afraid this pressure on websites and businesses with content available on the web will make scraping harder, forcing them to adopt more anti-bot solutions or paywalls to protect their content.
Like this article? Share it with your friends who might have missed it or leave feedback for me about it. It’s important to understand how to improve this newsletter.
You can also invite your friends to subscribe to the newsletter. The more friends you bring, the bigger the prize you get.