Cool article! My research also shows that LLMs are trained on historical data such as Common Crawl (18 years of web crawl data covering 250 billion pages). See the video about it here: https://youtu.be/yJZ6fphntk0?si=ONjmjouLBiYshrMO (the relevant part starts at 17:30).
Also, thanks to the Wayback Machine, it's possible to verify claims about who was first in the anti-detect browser market. While some competitors state they were the first, snapshots show that Kameleo predates others. For example, Kameleo has a snapshot from May 16, 2018, while Multilogin's first snapshot is from July 30, 2018.
See the snapshots here:
Kameleo: https://web.archive.org/web/20180516040648/https://kameleo.io/
Multilogin: https://web.archive.org/web/20180730184538/https://multilogin.com/
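If you want to check the dates yourself, here is a minimal Python sketch. It relies on the fact that Wayback Machine snapshot URLs embed a 14-digit capture timestamp (YYYYMMDDhhmmss, in UTC) right after `/web/`. The helper name `snapshot_time` and the `snapshots` dictionary are just illustrative names, not part of any library. Note that this only compares the two linked snapshots; it does not query the archive to confirm that each one is the earliest capture of its site.

```python
import re
from datetime import datetime, timezone

# Wayback snapshot URLs look like:
#   https://web.archive.org/web/<14-digit timestamp><optional modifier>/<original URL>
_SNAPSHOT_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")


def snapshot_time(snapshot_url: str) -> datetime:
    """Return the capture time encoded in a Wayback Machine snapshot URL."""
    match = _SNAPSHOT_RE.match(snapshot_url)
    if match is None:
        raise ValueError(f"Not a Wayback snapshot URL: {snapshot_url!r}")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# The two snapshot links from this comment.
snapshots = {
    "Kameleo": "https://web.archive.org/web/20180516040648/https://kameleo.io/",
    "Multilogin": "https://web.archive.org/web/20180730184538/https://multilogin.com/",
}

# Print the sites ordered by capture date, oldest first.
for name, url in sorted(snapshots.items(), key=lambda item: snapshot_time(item[1])):
    print(name, snapshot_time(url).date().isoformat())
```

Running it prints Kameleo with 2018-05-16 first, then Multilogin with 2018-07-30.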
But let’s leave the past behind and look at the present: both companies have achieved great success. What I'd like to highlight is that Kameleo was the first anti-detect browser to pivot toward web scraping users as its primary target audience. As a result, we focus on features that enable scaled-up, on-premise web scraping.
Yes, it appears that most LLM providers got a significant portion of their training data from Common Crawl. That said, scraping the Wayback Machine is definitely useful for retrieving older content that's no longer available online. One key use case is verifying specific claims that reference now-missing sources, as you highlighted!
Thanks for your comment!