---
name: open-tor
description: Search and access the Tor dark web and .onion hidden services. Use when user asks to search dark web, investigate .onion sites, check for dark web data leaks, fetch onion URLs, hunt ransomware groups, look up leaked credentials, conduct Tor-based OSINT, or monitor dark web activity. Provides 12 search engines, batch scraping, entity extraction, relevance scoring, and export to CSV/JSON/STIX/MISP. Requires Tor running locally (SOCKS5 proxy at 127.0.0.1:9050).
---

# OpenTor — Dark Web Access for OpenCode

OpenTor uses an **orchestrator-conductor architecture**. You (the orchestrator) are the
intelligence. The Python modules in `scripts/` are mechanical tools — they route
traffic through Tor, query search engines, and scrape pages. Every strategic
decision flows through you.

## ⚠️ Permission Handling

**Before running any command that requires sudo, ask the user for their sudo password.**
Never assume you have it. If a command fails with "Permission denied", immediately
ask the user. Common cases:

- `sudo systemctl start tor` — starting Tor daemon
- `sudo apt install tor` — installing Tor
- `sudo chmod` — fixing cookie auth for renew_identity
- `pip install` does not need sudo; if PEP 668 is enforced ("externally-managed-environment" error), use a venv, because `--user` installs are refused as well

If the user hasn't provided a password and a command fails, say:
"I need sudo access to run this. What's your sudo password?"

## 🔍 Clearnet-First Search Strategy

**ALWAYS search the public internet first before going to the dark web.**

Dark web search engines have limited, noisy indexes. You will get far better results
by understanding the context from clearnet sources first, then using that intelligence
to craft precise dark web queries.

### The Pattern (follow this for every investigation):

```
STEP 1: CLEARNET (public internet — fast, high-quality)
  → Search Google/DuckDuckGo/Bing for the investigation topic
  → Find: threat actor names, .onion addresses, ransomware tracker IDs,
          victim details, published OSINT reports, IOCs
  → Use the webfetch and websearch tools for this
  → Goal: build context — who, what, when, how

STEP 2: REFINE (use clearnet intel to craft dark web queries)
  → From clearnet results: extract specific .onion addresses to target
  → Identify the exact threat actor/ransomware group name
  → Find known leak site patterns (e.g. "ransomgroupname" → ransomblog_____.onion) for the target actor
  → Craft 3-5 targeted dark web queries using this intelligence

STEP 3: DARK WEB (Tor/onion — slow, targeted)
  → Check Tor: python3 {baseDir}/scripts/opentor.py check
  → Search with precise queries derived from step 1-2
  → Fetch specific .onion addresses discovered in step 1
  → Extract entities, correlate with clearnet findings

STEP 4: SYNTHESIZE (you — the orchestrator)
  → Merge clearnet context + dark web findings
  → Produce structured OSINT report
```

**Example:** Investigating a company believed to have been breached:
1. Clearnet: search for "company name ransomware leak" → find group name, tracker entry, onion address
2. Refine: target the identified leak blog URL, search for victim-specific keywords
3. Dark web: crawl the leak directory → discover department folders, database backups, contracts
4. Synthesize: cross-reference dark web findings with public threat intel → produce IR report

**Never skip step 1.** Dark web search engines rarely index specific victim names, whereas
Google and other public search engines surface ransomware trackers, group names, and onion addresses.
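
The refine step depends on pulling .onion addresses out of clearnet text. The sketch below shows one way to do that with a regular expression. It assumes only v3 addresses (56 base32 characters) are of interest; `extract_onions` is a hypothetical helper for illustration, not the skill's own `extract_entities()`.

```python
import re

# v3 onion hostnames: 56 base32 characters (a-z, 2-7) followed by ".onion".
ONION_V3 = re.compile(r"\b[a-z2-7]{56}\.onion\b")

def extract_onions(text: str) -> list[str]:
    """Return unique v3 .onion hostnames in first-seen order (hypothetical helper)."""
    seen: dict[str, None] = {}
    for match in ONION_V3.finditer(text.lower()):
        seen.setdefault(match.group(0), None)
    return list(seen)

# Placeholder hostname with the right length; it is not a real service.
sample_host = "a" * 56 + ".onion"
clearnet_text = (
    f"Tracker entry: http://{sample_host}/blog "
    f"(also cited as {sample_host.upper()})"
)
print(extract_onions(clearnet_text))
```

Lowercasing before matching collapses case variants of the same address, so the two mentions above yield a single entry.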

## Architecture

```
ORCHESTRATOR (you)          ← All intelligence, analysis, decisions
    │
    ├─ opentor.py             ← CLI entry point (subcommands)
    │
    ├─ osint.py               ← High-level investigation tools
    │   search_darkweb()      → search + score + deduplicate
    │   batch_scrape()        → concurrent .onion fetch
    │   extract_entities()    → regex IOCs (emails, crypto, onions)
    │   score_results()       → BM25 relevance scoring
    │   format_output()       → CSV/JSON/STIX/MISP export
    │   content_fingerprint() → MD5 dedup detection
    │
    ├─ engines.py             ← Search engine management
    │   search()              → query 12 engines in parallel
    │   check_engines()       → health/latency per engine
    │   mode_engines()        → recommend engines per mode
    │
    └─ torcore.py             ← Pure transport (no intelligence)
        tor_session()         → SOCKS5 session through Tor
        fetch()               → GET any URL via Tor
        check_tor()           → verify exit node
        renew_identity()      → rotate Tor circuit
```
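
To make the "MD5 dedup detection" entry concrete, here is a minimal sketch of fingerprint-based deduplication. It is an illustration under assumptions, not the code in `osint.py`: the whitespace and case normalization, the `deduplicate` helper, and the `{"url", "text"}` page shape are all hypothetical.

```python
import hashlib
import re

def content_fingerprint(text: str) -> str:
    """MD5 of whitespace-normalized, lowercased text (assumed normalization)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def deduplicate(pages: list[dict]) -> list[dict]:
    """Keep the first page seen for each fingerprint (hypothetical helper)."""
    seen: set[str] = set()
    unique = []
    for page in pages:
        digest = content_fingerprint(page["text"])
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

pages = [
    {"url": "http://mirror-one.example/", "text": "Leak  index\nupdated"},
    {"url": "http://mirror-two.example/", "text": "leak index updated"},
    {"url": "http://other.example/", "text": "different page"},
]
print([p["url"] for p in deduplicate(pages)])
```

MD5 is used here only to detect identical content across mirrors; it is not a security control.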

## Setup (run once)

Install dependencies:
```bash
pip install -r {baseDir}/requirements.txt
# or: source {baseDir}/.venv/bin/activate
```

Run the setup wizard:
```bash
python3 {baseDir}/scripts/setup.py
```

Start Tor (required before any command):
```bash
# If you have sudo: (ask user for password if needed!)
sudo apt install tor && sudo systemctl start tor    # Linux
brew install tor && brew services start tor          # macOS
```

## CLI Commands

All commands use the unified CLI wrapper. Run via bash.

| Command | What it does |
|---------|-------------|
| `python3 {baseDir}/scripts/opentor.py check` | Verify Tor is running, show exit IP |
| `python3 {baseDir}/scripts/opentor.py engines` | Ping all 12 search engines, show status |
| `python3 {baseDir}/scripts/opentor.py search "query"` | Search dark web (all engines) |
| `python3 {baseDir}/scripts/opentor.py fetch "url"` | Fetch any URL through Tor |
| `python3 {baseDir}/scripts/opentor.py renew` | Rotate Tor circuit (new identity) |
| *(LLM composes its own flow: check → engines → search → fetch → entities)* | *(No fixed pipeline — you orchestrate)* |

### Common Options

```
--mode MODE        threat_intel (default) | ransomware | personal_identity | corporate
--engines NAME     Specific engines (e.g. Ahmia Tor66)
--max N            Max results (default 20)
--format FMT       json (default) | csv | stix | misp | text
--out FILE         Write output to file
--json             Machine-readable JSON output only
```

### Examples

```bash
# Verify Tor
python3 {baseDir}/scripts/opentor.py check

# Check which engines are alive
python3 {baseDir}/scripts/opentor.py engines

# Quick dark web search
python3 {baseDir}/scripts/opentor.py search "ransomware healthcare" --mode ransomware --max 15

# Fetch a specific .onion page
python3 {baseDir}/scripts/opentor.py fetch "http://example.onion/page" --json

# The LLM composes its own investigation flow from individual commands.
# Example: check → search → fetch → entities
python3 {baseDir}/scripts/opentor.py check && \
python3 {baseDir}/scripts/opentor.py search "acme.com data leak" --mode corporate --max 15 --out results.json && \
python3 {baseDir}/scripts/opentor.py fetch "http://example.onion" && \
python3 {baseDir}/scripts/opentor.py entities --file results.json

# Rotate identity (use between sessions)
python3 {baseDir}/scripts/opentor.py renew

# Export to STIX
python3 {baseDir}/scripts/opentor.py search "ransomware" --format stix --out iocs.json
```

## Investigation Workflows

### "Search the dark web for X"

1. **Clearnet first:** Search Google/DuckDuckGo for "X" to understand context
2. **Refine:** From clearnet results, identify relevant .onion addresses, group names, keywords
3. **Check Tor:** `python3 {baseDir}/scripts/opentor.py check` — abort if not active
4. **Check engines:** `python3 {baseDir}/scripts/opentor.py engines` — note alive ones
5. **Search dark web:** `python3 {baseDir}/scripts/opentor.py search "refined keywords" --mode threat_intel`
6. **Fetch promising URLs:** `python3 {baseDir}/scripts/opentor.py fetch "URL"` on top results
7. **Synthesize:** Combine clearnet context + dark web findings. You produce the analysis.

### "Has company.com been leaked?"

1. **Clearnet:** Search for "company.com ransomware breach leaked" to find trackers
2. **Identify:** Which ransomware group? What .onion address? When did it happen?
3. **Check Tor** → **Search dark web:** `opentor.py search "company data" --mode corporate`
4. **Fetch specific leak site** if .onion address was found in step 1
5. **Synthesize** into a corporate OSINT report (see Analysis Prompts below)

### "Investigate ransomware group X"

1. **Clearnet:** Research group X — TTPs, known .onion addresses, victim count, IOCs
2. **Find leak site:** Look for the group's .onion blog address (ransomware.live tracks these)
3. **Check Tor** → **Fetch leak site:** `opentor.py fetch "http://group.onion"`
4. **Search for victims:** `opentor.py search "GROUP victim" --mode ransomware`
5. **Extract IOCs** and **synthesize** into a ransomware intelligence report

### When you find a data leak directory

If you discover a leaked data directory (Apache index, nginx autoindex, JSON API,
HTML file list, etc.), don't rely on fixed parsers. Use `fetch()` to get the raw
content, observe the format, and write a purpose-built parser for what you see.

1. **Fetch the listing:** `python3 {baseDir}/scripts/opentor.py fetch "URL" --json`
2. **Observe the format:** Is it an Apache index? nginx? JSON API? HTML table?
3. **Write a parser inline** that matches the observed format:
```bash
python3 -c "
import sys, json, re
sys.path.insert(0, '{baseDir}/scripts')
from torcore import fetch
r = fetch('URL')
text = r['text']

# Write your parser here based on what you observed.
# Examples:
#  - Apache two-line: name line → date+size line → pair them
#  - nginx: single line with date, size, name
#  - JSON: json.loads(text); iterate entries
#  - HTML table: BeautifulSoup table parsing

# Print results as JSON for further processing
"
```
4. **Verify at least 2 levels deep** before reporting contents as `✓ Observed`
5. **Label everything** with verification status per Rule 1
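
As an illustration of step 3, the sketch below parses one possible format: an nginx-style autoindex page with exact byte sizes. The sample HTML is invented for the example, and the row pattern is an assumption about what was observed. A real listing may differ, in which case the pattern has to be rewritten to match.

```python
import re

# One nginx-style autoindex row: link, date, time, then size ("-" for directories).
ROW = re.compile(
    r'<a href="(?P<href>[^"]+)">[^<]*</a>\s+'
    r"(?P<date>\d{2}-[A-Za-z]{3}-\d{4})\s+(?P<time>\d{2}:\d{2})\s+"
    r"(?P<size>-|\d+)"
)

def parse_autoindex(html: str) -> list[dict]:
    """Return one dict per listing row; values are raw observations only."""
    entries = []
    for m in ROW.finditer(html):
        entries.append({
            "name": m["href"],
            "is_dir": m["href"].endswith("/"),
            "modified": f'{m["date"]} {m["time"]}',
            "size_bytes": None if m["size"] == "-" else int(m["size"]),
        })
    return entries

# Invented sample text standing in for r['text'] from a fetch.
SAMPLE = """<a href="../">../</a>
<a href="Finance/">Finance/</a>          22-Mar-2026 10:15        -
<a href="db.bak">db.bak</a>              22-Mar-2026 10:20        36557728256
"""
entries = parse_autoindex(SAMPLE)
print(entries)
```

The parser records names, timestamps, and sizes exactly as listed. What a folder or file actually contains remains unverified until it is crawled.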

## 🧾 Professional Reporting Standards

OpenTor investigations are used by incident response teams, CISOs, and regulators.
Your reports must be trustworthy. These are not optional guidelines — they define
how you think about and present evidence.

### Rule 1: Everything Gets a Label

For every assertion in your report, classify it into one of these four categories:

| Label | Definition | When to use it |
|-------|-----------|----------------|
| `✓ Observed` | You saw it directly — raw data from a listing, page, or file | Use for: directory entries, file sizes from server, page titles, HTTP status codes, exact names and URLs you fetched |
| `⚡ Inferred` | You deduced it from naming patterns, context, or domain knowledge | Use for: interpreting what a folder or file likely contains based on its name, size, and surrounding context |
| `❓ Uncertain` | You genuinely cannot determine — ambiguous name, inaccessible, truncated | Use for: any item you haven't verified by crawling deeper, any directory beyond max_depth, any fetch that failed |
| `🤖 AI Analysis` | Your synthesis — connecting dots across multiple sources | Use for: conclusions drawn from combining clearnet intel + dark web findings + naming analysis + threat actor profiles |

The test: would a reviewer who was NOT present during the investigation be able
to tell which parts you saw directly vs which parts you reasoned about?
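
One way to keep labels attached to assertions is to make the label a required field. The sketch below is a hypothetical structure (`Label`, `Assertion`, and `render` are illustrative names, not part of OpenTor) that makes an unlabeled statement impossible to construct.

```python
from dataclasses import dataclass
from enum import Enum

class Label(Enum):
    OBSERVED = "✓ Observed"
    INFERRED = "⚡ Inferred"
    UNCERTAIN = "❓ Uncertain"
    AI_ANALYSIS = "🤖 AI Analysis"

@dataclass(frozen=True)
class Assertion:
    text: str
    label: Label  # required: there is no default label

    def render(self) -> str:
        return f"[{self.label.value}] {self.text}"

report = [
    Assertion("Listing contains an entry named Finance/", Label.OBSERVED),
    Assertion("Finance/ likely holds financial records", Label.INFERRED),
]
for item in report:
    print(item.render())
```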

### Rule 2: Every Incomplete Observation Is a Hypothesis

Dark web investigations operate on layers of inference. You rarely see raw truth
directly — you see directory listings, file metadata, attacker claims, and search
engine snippets. Each is a hypothesis, not a fact.

**The universal test:** Did you directly observe the thing you're asserting, or did
you observe something else and then reason from it? If there is even one logical
step between observation and assertion, it is an inference that needs a label.

Here is what "verified" actually means for different data types you will encounter:

| What you see | What you might conclude | How to verify |
|---|---|---|
| Folder named `Finance/` | "Contains financial records" | Crawl inside — observe actual file names |
| File named `passwords.xlsx` | "Contains credentials" | Can only confirm by downloading and inspecting. Until then → `❓ Uncertain` |
| `.bak` file, 40 GB | "Full SQL Server database backup" | Extension and size are hints, not proof. Could be renamed .zip, VM snapshot, encrypted blob. → `⚡ Inferred` until forensic analysis |
| File size from server: `36557728256` | "36.6 GB" | Size in bytes → `✓ Observed`. Conversion to GB is plain arithmetic on the observed value, so it stays `✓ Observed`. Interpretation of what that size means → `⚡ Inferred` |
| Timestamp `22-Mar-2026` | "Data stolen on March 22" | This is the **upload date** to the leak server, not the theft date, not the encryption date. Timestamps are `✓ Observed`; their meaning is `⚡ Inferred` |
| Search result title: "100K customer records leaked" | "100K records were leaked" | Attacker claims in titles and snippets are self-serving and often exaggerated. → `❓ Uncertain` until you fetch the page. |
| Victim listed on group X's leak blog | "Group X attacked this company" | Attribution is an inference chain: blog post → group identity → attack responsibility. Cross-check at least 2 independent sources before labeling `⚡ Inferred`. |
| `confidence: 0.78` from BM25 scoring | "This result is highly relevant" | Statistical similarity ≠ relevance. BM25 measures keyword overlap, not truth. → `⚡ Inferred` — always review result content yourself. |
| 11/12 engines alive at 3pm | "Engines are working" | Engine status is a **snapshot**, not a guarantee. Re-check before every search session. |
| Naming pattern `HRMS` in a filename | "Human Resources Management System" | Standard abbreviation, but not guaranteed. Organization may use non-standard naming. → `⚡ Inferred` |

**The rule for every data type:** Report what you saw (the observation), state what
you think it means (the inference), and separate these with the correct label
(`✓` for the observation, `⚡` or `🤖` for the inference). Never collapse them
into a single statement.

For directories specifically, you must crawl at minimum 2 levels before labeling
contents `✓ Observed`. Until then → `❓ Uncertain`.
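
The BM25 row in the table above can be illustrated with a small scorer. This is a textbook-style BM25 sketch with common default parameters (`k1=1.5`, `b=0.75`) and whitespace tokenization, all of which are assumptions. It is not the scoring code in `osint.py`. The three documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n) or 1.0
    terms = query.lower().split()
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in terms:
            df = sum(1 for t in tokenized if term in t)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            freq = tf[term]
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / (freq + norm)
        scores.append(score)
    return scores

docs = [
    "acme leak acme leak acme leak",  # keyword-stuffed spam
    "forum post discussing the acme leak timeline and affected systems",
    "unrelated marketplace listing",
]
scores = bm25_scores("acme leak", docs)
print(scores)
```

The keyword-stuffed first document outscores the more informative second one, which is why a high score is only `⚡ Inferred` relevance until the content has been read.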

### Rule 3: Raw Data First, Summary Second

Structure every data finding as: raw → formatted → interpretation.

- **Raw**: the exact value as received (`36557728256`)
- **Formatted**: human-readable conversion (`34.0 GiB`, which is `36.6 GB` in decimal units)
- **Interpretation**: what it means (`This size and .bak extension suggest a full database backup`)

Always include the raw value. The formatted version is a convenience. The
interpretation must be clearly labeled as `⚡ Inferred` or `🤖 AI Analysis`.
Never collapse all three into a single summary sentence.
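
The raw → formatted → interpretation structure can be sketched as follows. `format_size` and the `finding` dictionary layout are hypothetical. The byte count is the one used in this section.

```python
def format_size(raw_bytes: int) -> dict:
    """Keep the raw value and derive both decimal and binary renderings from it."""
    return {
        "raw": raw_bytes,
        "decimal": f"{raw_bytes / 10**9:.1f} GB",
        "binary": f"{raw_bytes / 2**30:.1f} GiB",
    }

size = format_size(36557728256)
finding = {
    "raw": size["raw"],                              # ✓ Observed
    "formatted": f'{size["binary"]} ({size["decimal"]})',
    "interpretation": {
        "label": "⚡ Inferred",
        "text": "This size and .bak extension suggest a full database backup",
    },
}
print(finding)
```

Stating the unit explicitly matters because the same byte count reads as 36.6 GB in decimal units and 34.0 GiB in binary units.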

### Rule 4: Uncertainty Is Information — Report It

When you cannot determine something, that fact itself is valuable intelligence.
Report it explicitly:

- "Found 8 subdirectories. Crawled 3. Remaining 5 are `❓ Uncertain`: max_depth reached."
- "File listing suggests a 40 GB .bak file. Fetch timed out. `❓ Uncertain`: retry recommended."
- "Directory name is in a language I cannot interpret. `❓ Uncertain`: needs human analyst."

Never omit uncertain items to make a report look more complete. An honest gap
is more valuable than a confident guess.
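
A coverage statement like the first example above can be generated rather than written by hand, which makes it harder to drop the unexplored items. The sketch below uses a hypothetical `coverage_note` helper and invented directory names.

```python
def coverage_note(found: list[str], crawled: list[str], reason: str) -> str:
    """Summarize what was crawled and list everything that was not."""
    crawled_set = set(crawled)
    remaining = [name for name in found if name not in crawled_set]
    note = f"Found {len(found)} subdirectories. Crawled {len(found) - len(remaining)}."
    if remaining:
        note += (
            f" Remaining {len(remaining)} are ❓ Uncertain ({reason}): "
            + ", ".join(remaining)
        )
    return note

found = [f"dir{i}/" for i in range(1, 9)]  # invented names
note = coverage_note(found, crawled=found[:3], reason="max_depth reached")
print(note)
```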

### Rule 5: Every Finding Must Be Traceable

The reader must be able to independently verify every claim. For each finding,
include the trace:

- **Source URL**: the exact URL that was fetched
- **Method**: which tool was used (fetch, search_darkweb, extract_entities, purpose-built parser)
- **Timestamp**: when the data was observed (from server response or crawl time)
- **Confidence**: the `✓`/`⚡`/`❓`/`🤖` label
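
The four trace fields map naturally onto a small record. The following is a sketch under assumptions: `Trace` and `make_trace` are hypothetical names, the URL is a placeholder, and the timestamp defaults to crawl time in UTC when no server timestamp is supplied.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class Trace:
    source_url: str
    method: str
    observed_at: str
    label: str

def make_trace(source_url: str, method: str, label: str,
               observed_at: Optional[str] = None) -> Trace:
    """Build a trace record, refusing findings that lack a source, method, or label."""
    if not (source_url and method and label):
        raise ValueError("source_url, method, and label are all required")
    when = observed_at or datetime.now(timezone.utc).isoformat()
    return Trace(source_url, method, when, label)

trace = make_trace(
    "http://example.onion/page",  # placeholder URL
    method="fetch",
    label="✓ Observed",
    observed_at="2026-03-22T10:20:00+00:00",
)
print(json.dumps(asdict(trace), ensure_ascii=False, indent=2))
```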

### Rule 6: Revise Publicly, Not Silently

If new evidence contradicts a previous finding:

1. State the original finding with its original label
2. State the new finding with its source
3. Explain what changed and why
4. Update downstream conclusions that depended on the original finding

Correcting yourself builds trust. The report is a living document, not a final
verdict. Stakeholders understand that OSINT evolves as more data is uncovered.

## OSINT Analysis — The Orchestrator's Role

You ARE the LLM. When you have scraped dark web content, analyze it yourself
using these prompts as guidance. Produce structured reports directly in your response.

### threat_intel mode

Output format: 1. Input Query → 2. Source Links → 3. Investigation Artifacts (names,
emails, crypto, domains, markets, threat actors, malware, TTPs) → 4. Key Insights
(3-5, data-driven) → 5. Recommended Next Steps

### ransomware mode

Output format: 1. Input Query → 2. Source Links → 3. Malware/Ransomware Indicators
(hashes, C2s, payload names, MITRE TTPs) → 4. Threat Actor Profile → 5. Key Insights
→ 6. Next Steps (hunting queries, detection rules)

### personal_identity mode

Output format: 1. Input Query → 2. Source Links → 3. Exposed PII Artifacts →
4. Breach/Marketplace Sources → 5. Exposure Risk Assessment → 6. Key Insights →
7. Next Steps (protective actions). Handle all personal data with discretion.

### corporate mode

Output format: 1. Input Query → 2. Source Links → 3. Leaked Corporate Artifacts
(credentials, docs, source code, databases) → 4. Threat Actor/Broker Activity →
5. Business Impact Assessment → 6. Key Insights → 7. Next Steps (IR, legal)

## Search Engines

12 engines are available and queried in parallel through Tor; 11 are .onion services and one is a clearnet endpoint:

| Engine | Type | Notes |
|--------|------|-------|
| Ahmia | .onion | Most reliable |
| OnionLand | .onion | Good coverage |
| Amnesia | .onion | Frequently down |
| Torland | .onion | |
| Excavator | .onion | Best for marketplace results |
| Onionway | .onion | |
| Tor66 | .onion | Fast, reliable |
| OSS | .onion | |
| Torgol | .onion | |
| TheDeepSearches | .onion | |
| DuckDuckGo-Tor | .onion | DDG on Tor |
| Ahmia-clearnet | clearnet | Only use with --include-clearnet |

## Mode → Engine Routing

| Mode | Preferred Engines |
|------|------------------|
| threat_intel | All alive engines |
| ransomware | Ahmia, Tor66, Excavator, Ahmia-clearnet (+ seed blogs) |
| personal_identity | Ahmia, OnionLand, Tor66, DuckDuckGo-Tor, Ahmia-clearnet |
| corporate | Ahmia, Excavator, Tor66, TheDeepSearches, Ahmia-clearnet |
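
The routing table combined with a health check amounts to a filter over the engines that are currently alive. The sketch below is a hypothetical illustration (`MODE_ENGINES` and `pick_engines` are not the actual `mode_engines()` implementation), and it leaves out the seed blogs mentioned for ransomware mode.

```python
from typing import Optional

# Preferred engines per mode, copied from the table; None means "all alive engines".
MODE_ENGINES: dict[str, Optional[list[str]]] = {
    "threat_intel": None,
    "ransomware": ["Ahmia", "Tor66", "Excavator", "Ahmia-clearnet"],
    "personal_identity": ["Ahmia", "OnionLand", "Tor66", "DuckDuckGo-Tor", "Ahmia-clearnet"],
    "corporate": ["Ahmia", "Excavator", "Tor66", "TheDeepSearches", "Ahmia-clearnet"],
}

def pick_engines(mode: str, alive: set[str]) -> list[str]:
    """Return the preferred engines for a mode, restricted to those alive right now."""
    preferred = MODE_ENGINES[mode]
    if preferred is None:
        return sorted(alive)
    return [name for name in preferred if name in alive]

alive_now = {"Ahmia", "Tor66", "OSS"}  # invented health-check result
print(pick_engines("ransomware", alive_now))
print(pick_engines("threat_intel", alive_now))
```

Because engine status is a snapshot, the `alive` set should come from a fresh health check rather than a cached one.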

## Key Principles

1. **Clearnet first, dark web second.** Always build context from public sources before
   hitting Tor. This is the single most important rule for productive investigations.
2. **You drive the investigation.** Python code queries and scrapes. You decide what
   to search, which results to pursue, how to interpret findings.
3. **Adapt to observed formats.** When you encounter a data leak directory listing,
   use `fetch()` to get the raw content, observe the format, and write a purpose-built
   parser for what you see. Folder names are hypotheses — only crawled contents are facts.
4. **Report facts, not guesses.** Every assertion must be labeled `✓ Observed`,
   `⚡ Inferred`, `❓ Uncertain`, or `🤖 AI Analysis`. If something is unclear, say
   so explicitly. Never fill gaps with assumptions.
5. **Efficiency.** Dark web queries are slow (30-60s per search). Be strategic and
   use short keyword queries (≤5 words).
6. **Ask for passwords.** If a command fails with "Permission denied" or needs sudo,
   ask the user before retrying. Never assume you have root.
7. **Content safety.** The engine has a built-in blacklist for illegal content.
   It cannot be disabled.
8. **Transparency.** Tell the user traffic routes through Tor. Note when .onion
   sites are offline (status 0).
9. **Clearnet noise filtering is your job.** Dark web search engines return
   noisy, low-relevance results — clearnet spam links often appear in Tor
   search results. You (the LLM) are responsible for filtering out clearnet
   noise, assessing relevance, and deciding which results to pursue. Do not
   pass raw engine output to the user without curation.
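
A first pass at the noise filtering in principle 9 can be mechanical: drop results whose host is not a .onion address, then review the rest by hand. The sketch below assumes each result is a dict with a `url` key, which is an assumption about the result shape. `is_onion` and `drop_clearnet` are hypothetical helpers.

```python
from urllib.parse import urlparse

def is_onion(url: str) -> bool:
    """True only when the URL's hostname itself ends in .onion."""
    host = urlparse(url).hostname or ""
    return host.endswith(".onion")

def drop_clearnet(results: list[dict]) -> list[dict]:
    """Remove results that point at clearnet hosts (hypothetical helper)."""
    return [r for r in results if is_onion(r["url"])]

results = [
    {"url": "http://example.onion/page", "title": "listing"},
    {"url": "https://spam.example.com/?q=example.onion", "title": "ad"},
    {"url": "http://example.onion.evil.example/", "title": "lookalike"},
]
kept = drop_clearnet(results)
print([r["url"] for r in kept])
```

Checking the parsed hostname rather than searching the whole URL string avoids keeping lookalike clearnet links that merely mention an onion address. Relevance still has to be judged by reading the remaining results.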

## Reference Files

- `README.md` — Project overview, install, quick-start
- `CORE_ENGINE.md` — Full API reference for all modules
- `EXAMPLES.md` — End-to-end usage examples
- `references/safety.md` — Content safety and responsible use
- `references/troubleshooting.md` — Common issues and fixes