When talking about web scraping, we usually think about websites that are publicly accessible and indexed on search engines.
But there’s a much larger portion of the web, estimated to be 400-500 times bigger, called the deep web: all the content that is not indexed by search engines. Think of your Gmail inbox or your Drive account: those are the classic examples of the deep web.
Even more hidden from the mainstream, there’s the Dark Web, estimated at somewhere between 0.01% and 5% of the size of the standard internet. It’s accessible through the TOR network, which provides anonymity on both the client and the server side.
Before going on with the article, I’d like to invite you to a webinar with Tamas, CEO of Kameleo.
We’ll talk about anti-detect browsers and their role in the web scraping industry.
Join us at this link: https://register.gotowebinar.com/register/8448545530897058397
What’s the TOR network?
The TOR network is an open-source project designed to enable anonymous communication across the internet. Its underlying technique, onion routing, was initially developed in the mid-1990s by employees of the United States Naval Research Laboratory, with the primary aim of protecting government communications. The software was later released under a free license and has since evolved into a crucial tool for a wide range of users who demand privacy and anonymity online.
Just like web scraping and cryptocurrencies, the TOR network is simply a technology that can be used for good or for ill: while it’s a crucial tool for protecting the safety of journalists and activists in some countries, the Dark Web is also often linked to criminal activity.
How TOR Works: The Basics
At its core, TOR is based on the principle of onion routing. This process involves encrypting internet traffic and routing it through a series of relays, each peeling away a layer of encryption, before reaching its final destination. This method ensures that the original data and the identity of the user remain concealed throughout the transmission.
Here's a step-by-step breakdown of how it works:
Layered Encryption: When a user initiates a connection through the TOR network, their data is encrypted multiple times, akin to layers of an onion.
Routing Through Relays: The encrypted data packet is then sent through a randomly selected path of at least three relays within the TOR network. These relays are volunteer-operated servers distributed across the globe.
Peeling Off Layers: As the data packet traverses each relay, a layer of encryption is peeled off, revealing the next relay's address in the chain. However, no single relay ever has access to both the originating IP address and the final destination of the data, ensuring the anonymity of the user.
Final Destination: Upon reaching the final relay in the chain, known as the exit node, the last layer of encryption is removed, and the data is sent to its intended destination. The destination server receives the request as coming from the exit node, with no trace of the original sender's identity.
For a more complete guide, I suggest this article from Robert Heaton’s blog.
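To make this concrete, here’s a minimal sketch of how a Python script can route its requests through TOR. It assumes a local Tor client is running with its SOCKS proxy on 127.0.0.1:9050 (the default for the tor daemon; the Tor Browser bundles its own proxy on 9150) and that requests is installed with SOCKS support (pip install requests[socks]). The check.torproject.org endpoint is used only to confirm that traffic really exits from a TOR relay.

```python
# Minimal sketch: route an HTTP request through a local Tor client.
# Assumes Tor is exposing its SOCKS proxy on 127.0.0.1:9050 and that
# requests[socks] (PySocks) is installed.
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: hostnames are resolved inside Tor
    "https": "socks5h://127.0.0.1:9050",
}

def get_exit_ip() -> str:
    # check.torproject.org reports the IP it sees, i.e. the exit node's IP
    response = requests.get(
        "https://check.torproject.org/api/ip",
        proxies=TOR_PROXIES,
        timeout=60,
    )
    response.raise_for_status()
    return response.json().get("IP", "unknown")

if __name__ == "__main__":
    print("Requests exit the TOR network from:", get_exit_ip())
```

If the reported IP differs from your own and changes when Tor builds a new circuit, your traffic is indeed being relayed through the chain of nodes described above.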
Challenges and Considerations on the TOR Network
While the TOR network provides significant advantages, it also faces challenges and considerations that users should be aware of:
Performance: The process of routing traffic through multiple relays can slow down internet speeds, affecting the user experience.
Malicious Use: The anonymity offered by TOR has attracted illicit activities, including illegal marketplaces and services on the Dark Web. This has led to scrutiny and attempts by authorities to compromise TOR's anonymity.
Exit Node Vulnerability: Since the exit node decrypts the data before sending it to the final destination, it can potentially be monitored. Users are advised to employ end-to-end encryption, such as HTTPS, to mitigate this risk.
What is an Onion URL and where can you find one?
An onion URL is a web address used specifically within the TOR (The Onion Router) network to access websites anonymously and securely. These URLs are distinct from standard web addresses in several key aspects, primarily in their structure and the level of privacy and security they offer. Onion URLs are designed to provide both the website hosts and their visitors with anonymity, making it extremely difficult to trace the identity or location of either party.
Structure of an Onion URL
An onion URL consists of a seemingly random string of alphanumeric characters followed by ".onion" as the domain suffix. For example, an onion URL might look like this:
http://3g2upl4pq6kufc4m.onion/
The seemingly random string is derived from the onion service's public key, part of its cryptographic identity, which makes the URL both unique and hard to impersonate. Note that the example above uses the older 16-character v2 format, which has since been retired; current v3 onion addresses are 56 characters long.
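As a quick illustration, here’s a minimal Python sketch of how a scraper could check whether a hostname follows the onion address format, assuming only the 16-character v2 and 56-character v3 lengths mentioned above (base32 here means lowercase a-z and digits 2-7):

```python
# Minimal sketch: check whether a hostname looks like an onion address.
# v2 addresses (now retired) were 16 base32 characters; current v3
# addresses are 56 base32 characters.
import re
from urllib.parse import urlparse

ONION_RE = re.compile(r"^(?:[a-z2-7]{16}|[a-z2-7]{56})\.onion$")

def is_onion_url(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return bool(ONION_RE.match(host))

print(is_onion_url("http://3g2upl4pq6kufc4m.onion/"))  # True (old v2 format)
print(is_onion_url("https://example.com/"))            # False
```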
How Onion URLs Work
Onion URLs and the services they lead to operate through the TOR network, which uses multi-layer encryption and a global network of volunteer-operated servers to anonymize internet traffic. When a user accesses an onion URL, their request is encrypted and routed through multiple relays in the TOR network before reaching the destination. Each relay decrypts a layer of encryption to reveal the next relay's address, but no single relay ever knows both the origin and the destination of the data, ensuring the anonymity of both the user and the website.
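In practice, this means a scraper never resolves a .onion hostname itself: it hands the full URL to the local Tor proxy. Below is a minimal sketch assuming Tor's SOCKS proxy on 127.0.0.1:9050 and a placeholder onion address; note the socks5h:// scheme, which delegates hostname resolution to Tor (a plain socks5:// proxy would try, and fail, to resolve the .onion name via regular DNS).

```python
# Minimal sketch: fetch a .onion URL through a local Tor SOCKS proxy.
# The onion address below is a placeholder, not a real service.
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_onion(url: str) -> str:
    response = requests.get(url, proxies=TOR_PROXIES, timeout=120)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    html = fetch_onion("http://someplaceholderonionaddress.onion/")
    print(html[:500])
```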
Where to find Onion URLs?
Due to the anonymous nature of the Dark Web, discovering onion URLs can be challenging. However, there are several methods and resources that individuals can use to find onion URLs for legitimate purposes, such as accessing privacy-focused services, participating in secure communication, or retrieving information from regions under heavy censorship. Here are some of the most reliable ways to find onion URLs:
1. Dark Web Search Engines
Several search engines are designed specifically for the Dark Web and can be accessed using the TOR browser. These search engines index onion sites and provide a more familiar way for users to find onion URLs. Some of the notable dark web search engines include:
Ahmia.fi: Offers a search interface similar to surface web search engines and provides access to a wide range of onion services (see the query sketch after this list).
DuckDuckGo Onion Search: The privacy-focused search engine DuckDuckGo has an onion version that can be reached from within the TOR network; note that it lets you search the surface web privately rather than indexing onion sites.
Candle: A minimalist search engine for the Dark Web, Candle provides basic search capabilities for finding onion sites.
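As a sketch of how these engines can be queried programmatically, the snippet below assumes Ahmia's clearnet search endpoint at https://ahmia.fi/search/ with a q parameter (both the endpoint and its result markup may change), and simply extracts v3 onion hostnames from the returned HTML with a regex rather than relying on a specific page structure:

```python
# Minimal sketch: query a dark-web search engine from the clearnet and
# collect the onion hostnames appearing in the results page.
import re
import requests

ONION_HOST_RE = re.compile(r"\b[a-z2-7]{56}\.onion\b")

def search_ahmia(keyword: str) -> set[str]:
    # Assumed endpoint: https://ahmia.fi/search/?q=<keyword>
    response = requests.get(
        "https://ahmia.fi/search/",
        params={"q": keyword},
        timeout=60,
    )
    response.raise_for_status()
    return set(ONION_HOST_RE.findall(response.text))

if __name__ == "__main__":
    for host in sorted(search_ahmia("privacy")):
        print(host)
```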
2. Directories and Indexes
There are directories and indexes that list onion URLs across various categories, from news and media to more niche interests. These directories can serve as a starting point for exploring the Dark Web and, as sketched after the list below, their pages can also be harvested programmatically:
The Hidden Wiki: One of the most well-known directories on the Dark Web, The Hidden Wiki offers links to various onion sites, categorized by their content.
OnionDir: Another directory that lists onion sites by category, OnionDir can help users find sites of interest on the Dark Web.
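Here’s a minimal sketch of how such a directory page could be harvested. The directory URL is a placeholder: point it at a real directory reached through your local Tor proxy. It assumes requests[socks] and beautifulsoup4 are installed.

```python
# Minimal sketch: collect onion links from a directory page fetched over Tor.
# The directory URL is a placeholder, not a real site.
import re
import requests
from bs4 import BeautifulSoup

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}
# Accept both the retired 16-char v2 and current 56-char v3 formats
ONION_LINK_RE = re.compile(r"(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion", re.IGNORECASE)

def harvest_onion_links(directory_url: str) -> set[str]:
    response = requests.get(directory_url, proxies=TOR_PROXIES, timeout=120)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        match = ONION_LINK_RE.search(anchor["href"])
        if match:
            links.add(match.group(0).lower())
    return links

if __name__ == "__main__":
    for host in sorted(harvest_onion_links("http://directoryplaceholder.onion/")):
        print(host)
```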
3. Forums and Discussion Boards
Dark Web forums and discussion boards can be valuable resources for discovering onion URLs. Users share links and discuss services, providing insights into trustworthy sites and warning against potential scams. Participation in these communities can lead to discovering a wide range of onion URLs.
4. Specialized Websites
Certain legitimate onion websites often link to other onion services, creating a network of related sites. For example, news organizations with a presence on the Dark Web may link to sources or databases relevant to their reporting.
How do you scrape onion websites?
Now that we’ve seen how the TOR network works and how to reach an onion website, we can proceed to build our scraper for a TOR website.
You can find the code of the scrapers on The Web Scraping Club GitHub repository, available for paying readers. If you’re one of them but don’t have access, write me at pier@thewebscraping.club with your GH username.