This video demonstrates how to scrape websites hosted on the dark web, specifically those using Tor hidden services or .onion addresses. The presenter, John Hammond, explains the process of setting up Tor, using the torify command, configuring Tor's settings, and then leveraging Python libraries like requests-python-socks to programmatically access and scrape these sites. He also touches on the potential applications of this technique, such as threat intelligence gathering.
The video covers several key points:

- torify command: introduced as a wrapper for tunneling other command-line tools through Tor, though it requires specific configuration in the torrc file to function correctly.
- Tor configuration (torrc): key settings, such as enabling the control port and disabling default authentication, are modified in the /etc/tor/torrc file to allow programmatic interaction with the Tor network.
- Python scraping: the requests-python-socks library is used in Python to make HTTP requests to .onion addresses, effectively automating the scraping of dark web content.
- Alternative libraries: torpy and stem are mentioned as alternatives for interacting with the Tor network, offering different levels of control and functionality.

Yes, that's an accurate interpretation. The video's focus is not on casual browsing of dark web sites out of curiosity. Instead, it is primarily geared towards the analysis of cybercrime-related content and the automation of data gathering from these sources.
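For reference, the two torrc changes the video describes might look like the following sketch. ControlPort and CookieAuthentication are real Tor options, but the exact values here are assumptions; verify against your Tor version before editing /etc/tor/torrc:

```
# /etc/tor/torrc (excerpt)
ControlPort 9051          # enable the control port for programmatic access
CookieAuthentication 0    # disable Tor's default cookie authentication
```

After restarting the Tor service, torify can then wrap other command-line tools, e.g. `torify curl http://<some-address>.onion/`.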
The presenter demonstrates how to programmatically interact with onion sites (which can be used to scrape their content) for purposes like threat intelligence gathering and monitoring cybercrime-related sources such as ransomware leak sites.
While the tools and techniques shown could technically be used to find any onion site, the context and examples provided strongly lean towards security research and cybercrime investigation rather than general exploration. The idea of populating onion sites based on keywords is not directly covered, but rather searching existing onion sites for specific keywords related to threats or illicit activities.
That's a great clarifying question, and you've hit on a key point that might seem contradictory at first glance.
In short, yes: the Python script requires you to provide the onion link within the script itself. It does not randomly discover active onion sites in real time.
Here's why and how it works:
Configuring requests with a SOCKS proxy (via the requests-python-socks library or a similar setup) tells your Python script to route its HTTP requests through the Tor network. This is what provides the anonymity and allows the script to reach the .onion address, which is not accessible via standard internet browsing.

Why this approach? Onion services are not indexed like the regular web, so the script cannot discover sites on its own; it can only visit addresses you already know.
So, to reiterate: you provide the onion URL in the Python script, and the script uses Tor to access that specific URL and scrape its content.
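As a minimal sketch of that workflow, the following assumes a local Tor daemon is listening on its default SOCKS port 9050; the commented-out onion URL is a placeholder, not a real address:

```python
import requests  # the video pairs requests with SOCKS support for Tor

# The "socks5h" scheme matters: hostname resolution happens through the
# proxy itself, which is required for .onion addresses (they do not exist
# in public DNS).
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}


def fetch_onion(url, timeout=60):
    """Fetch a page through the Tor SOCKS proxy and return its HTML text."""
    response = requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
    response.raise_for_status()
    return response.text


# Usage (supply the onion URL you are researching):
# html = fetch_onion("http://<your-target-site>.onion/")
```

Note that the onion URL is hard-coded by you, matching the point above: the script visits a known address rather than discovering sites.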
The speaker does not explicitly mention the challenges of scraping onion sites that require credentials with 2FA or captchas. The examples shown in the video are primarily for accessing publicly available HTML content on sites like ransomware leak sites.
Here's how the scraping discussed in the video would work for those types of sites, and why it wouldn't directly handle sites with 2FA or captchas:
An HTML parser (like BeautifulSoup; although not explicitly shown in this snippet, it's a common next step) would be used to:

- Extract the page's text content to search for keywords of interest.
- Parse <a> tags to extract URLs (including other .onion links or external links).

Limitations Regarding 2FA and Captchas:
The methods demonstrated in the video would not be able to log in with credentials, complete two-factor authentication challenges, or solve captchas.
To scrape sites with 2FA or captchas, more advanced techniques would be necessary, such as browser automation, captcha-solving services, or scripted session management for authenticated logins.
In essence, the video showcases scraping for readily available information on the dark web, not for content protected by robust authentication or anti-bot measures.
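That "readily available HTML" case amounts to pulling text and links out of fetched pages. The video points to BeautifulSoup for this role; as a dependency-free sketch of the same idea, here is <a>-tag extraction using only the Python standard library (the sample HTML is made up):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every <a href="..."> URL found in the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Example: isolate .onion links from a scraped page.
sample = ('<a href="http://abcdef.onion/leaks">leaks</a> '
          '<a href="https://clearnet.example">home</a>')
onion_links = [url for url in extract_links(sample) if ".onion" in url]
# onion_links -> ["http://abcdef.onion/leaks"]
```

The same loop could be pointed at pages fetched over Tor to harvest further .onion links for manual review.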