This video demonstrates how to scrape websites hosted on the dark web, specifically those using Tor hidden services or .onion addresses. The presenter, John Hammond, explains the process of setting up Tor, using the torify command, configuring Tor's settings, and then leveraging Python libraries like requests-python-socks to programmatically access and scrape these sites. He also touches on the potential applications of this technique, such as threat intelligence gathering.
The video covers several key points:

- torify command: introduced as a wrapper for tunneling other command-line tools through Tor, though it requires specific configuration in the torrc file to function correctly.
- Tor configuration (torrc): key settings, such as enabling the control port and disabling default authentication, are modified in the /etc/tor/torrc file to allow programmatic interaction with the Tor network.
- Python scraping: the requests-python-socks library is used in Python to make HTTP requests to .onion addresses, effectively automating the scraping of dark web content.
- Alternative libraries: torpy and stem are mentioned as alternatives for interacting with the Tor network, offering different levels of control and functionality.

Yes, that's an accurate interpretation. The video's focus is not on casual browsing of dark web sites out of curiosity. Instead, it is primarily geared towards the analysis of cybercrime-related content and the automation of data gathering from these sources.
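For reference, the two torrc changes the video describes might look like the following sketch. ControlPort and CookieAuthentication are real Tor options, but the exact values here are assumptions; verify against your Tor version before editing /etc/tor/torrc:

```
# /etc/tor/torrc (excerpt)
ControlPort 9051          # enable the control port for programmatic access
CookieAuthentication 0    # disable Tor's default cookie authentication
```

After restarting the Tor service, torify can then wrap other command-line tools, e.g. `torify curl http://<some-address>.onion/`.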
The presenter demonstrates how to programmatically interact with onion sites (which can be used to scrape their content) for purposes like threat intelligence gathering and monitoring cybercrime-related sources such as ransomware leak sites.
While the tools and techniques shown could technically be used to find any onion site, the context and examples provided strongly lean towards security research and cybercrime investigation rather than general exploration. The idea of populating onion sites based on keywords is not directly covered, but rather searching existing onion sites for specific keywords related to threats or illicit activities.
That's a great clarifying question, and you've hit on a key point that might seem contradictory at first glance.
In short, yes: the Python script requires you to provide the onion link within the script itself. It does not randomly discover active onion sites in real time.
Here's why and how it works:
Configuring requests with a SOCKS proxy (via the requests-python-socks library or a similar setup) tells your Python script to route its HTTP requests through the Tor network. This is what provides the anonymity and allows the script to reach the .onion address, which is not accessible via standard internet browsing.

Why this approach? Onion services are not indexed like the regular web, so the script cannot discover sites on its own; it can only visit addresses you already know.
So, to reiterate: you provide the onion URL in the Python script, and the script uses Tor to access that specific URL and scrape its content.
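As a minimal sketch of that workflow, the following assumes a local Tor daemon is listening on its default SOCKS port 9050; the commented-out onion URL is a placeholder, not a real address:

```python
import requests  # the video pairs requests with SOCKS support for Tor

# The "socks5h" scheme matters: hostname resolution happens through the
# proxy itself, which is required for .onion addresses (they do not exist
# in public DNS).
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}


def fetch_onion(url, timeout=60):
    """Fetch a page through the Tor SOCKS proxy and return its HTML text."""
    response = requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
    response.raise_for_status()
    return response.text


# Usage (supply the onion URL you are researching):
# html = fetch_onion("http://<your-target-site>.onion/")
```

Note that the onion URL is hard-coded by you, matching the point above: the script visits a known address rather than discovering sites.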
The speaker does not explicitly mention the challenges of scraping onion sites that require credentials with 2FA or captchas. The examples shown in the video are primarily for accessing publicly available HTML content on sites like ransomware leak sites.
Here's how the scraping discussed in the video would work for those types of sites, and why it wouldn't directly handle sites with 2FA or captchas:
An HTML parser (like BeautifulSoup; although not explicitly shown in this snippet, it's a common next step) would be used to:

- Extract the page's text content to search for keywords of interest.
- Parse <a> tags to extract URLs (including other .onion links or external links).

Limitations Regarding 2FA and Captchas:
The methods demonstrated in the video would not be able to log in with credentials, complete two-factor authentication challenges, or solve captchas.
To scrape sites with 2FA or captchas, more advanced techniques would be necessary, such as browser automation, captcha-solving services, or scripted session management for authenticated logins.
In essence, the video showcases scraping for readily available information on the dark web, not for content protected by robust authentication or anti-bot measures.
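That "readily available HTML" case amounts to pulling text and links out of fetched pages. The video points to BeautifulSoup for this role; as a dependency-free sketch of the same idea, here is <a>-tag extraction using only the Python standard library (the sample HTML is made up):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every <a href="..."> URL found in the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Example: isolate .onion links from a scraped page.
sample = ('<a href="http://abcdef.onion/leaks">leaks</a> '
          '<a href="https://clearnet.example">home</a>')
onion_links = [url for url in extract_links(sample) if ".onion" in url]
# onion_links -> ["http://abcdef.onion/leaks"]
```

The same loop could be pointed at pages fetched over Tor to harvest further .onion links for manual review.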