Comprehensive OSINT Guide

A practitioner’s reference for Open Source Intelligence — methodology, collection disciplines, tooling, pivoting techniques, and operational security. Compiled from 34 research sources.


Table of Contents

  1. Fundamentals
  2. The OSINT Lifecycle
  3. People OSINT (HUMINT/SOCMINT)
  4. Company & Corporate OSINT
  5. Infrastructure & Network OSINT
  6. Domain, DNS & Certificate Intel
  7. Social Media Intelligence
  8. Geolocation & Imagery (GEOINT)
  9. Breach, Leak & Paste Intel
  10. Metadata Extraction
  11. Code & Repository OSINT
  12. Dark Web & Threat Intel
  13. IoT & Device Discovery
  14. Tools Reference
  15. Automation & Visualization
  16. AI-Assisted OSINT
  17. Operational Security
  18. Legal & Ethical Considerations
  19. Quick Reference

1. Fundamentals

Open Source Intelligence (OSINT) is the discipline of collecting, correlating, and analyzing information that is publicly or legally available to produce actionable intelligence. “Open source” does not mean “easy” or “low value” — it means no clandestine collection is involved. The sources are lawful: the skill lies in knowing where to look, how to pivot, and how to assemble fragments into a coherent picture.

Why it matters:

| Use case | Practitioners |
|---|---|
| Adversary reconnaissance | Red teams, pentesters, bug bounty hunters |
| Attack surface management | Blue teams, security engineers, CISOs |
| Threat intelligence | SOC analysts, CTI teams, IR responders |
| Fraud and KYC investigation | Financial crime analysts, compliance |
| Journalism and research | Investigative reporters, academic researchers |
| Law enforcement | Missing persons, criminal investigations |
| Due diligence | M&A, investor research, hiring |
| Personal self-defense | Privacy audits, stalker detection |

Core principles:

  • Every fact is a pivot. An email address is not an endpoint — it is a seed for breach lookups, social profile enumeration, domain registrations, and search engine dorks.
  • Triangulate before trusting. Any single source can be wrong, stale, or planted. Cross-reference at least two independent sources before treating a data point as confirmed.
  • Document as you go. If you cannot reproduce a finding in six months, it did not happen. Screenshot, hash, archive.
  • Stay passive until you must be active. The default mode is observation. Only escalate to direct interaction when the intelligence you need cannot be harvested from existing public records.
  • Scope creep kills investigations. Define the question up front and resist chasing shiny tangents unless they directly serve the objective.
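
The "document as you go" principle can be made mechanical. The following is a minimal sketch, assuming each artifact is available as bytes; the `record_artifact` helper and its field names are illustrative and not part of any standard.

```python
import hashlib
from datetime import datetime, timezone


def record_artifact(content: bytes, source_url: str, note: str = "") -> dict:
    """Build a provenance entry for one collected artifact.

    Field names are hypothetical; adapt them to your own case-note format.
    """
    return {
        # The hash lets you show later that the stored copy is unchanged.
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
        "source_url": source_url,
        # Always record collection time in UTC to avoid timezone ambiguity.
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }
```

Appending each entry to a log file next to the raw artifact is usually enough to reproduce a finding months later.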

Passive vs. active collection:

| | Passive | Active |
|---|---|---|
| Contact with target | None; consult third-party data only | Direct queries against target infrastructure |
| Detection risk | Near zero | Logs, rate limits, WAF alerts |
| Data freshness | Can be stale (days to years) | Real-time |
| Examples | crt.sh, Shodan, archive.org, Google dorks | Nmap scan, directory brute-force, HTTP probing |
| When used | Always first; enumerate scope and context | After passive exhausted, to confirm/expand |

The cardinal rule: finish passive recon before touching the target. Anything you can learn from Censys or certificate transparency logs is something you do not need to poke a production server for.


2. The OSINT Lifecycle

Every investigation, whether a two-hour recon sprint or a month-long deep-dive, follows the same phases. Discipline here separates practitioners from tourists.

Phase 1: Planning & Requirements

Write down the question. Who is the target? What decisions will the intelligence support? What is in scope, and what is off-limits? What is the deadline? What format does the deliverable take? Investigations without a defined question wander forever.

Phase 2: Collection

Gather raw data from identified sources. The temptation is to start here — resist it until the plan is clear. Collection spans subdomains, WHOIS records, social profiles, PDF metadata, breach dumps, code repos, certificate logs, historical archives, and more. Keep raw artifacts separate from processed notes.

Phase 3: Processing

Normalize the data. Dedupe subdomains, resolve hostnames to IPs, extract EXIF from images, parse PDFs for authors. Convert everything into a form you can query and pivot against. A messy collection phase dies here.
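
As an illustration of the normalization step, here is a small sketch that cleans and dedupes a raw hostname list. The function name is an assumption, and DNS resolution is deliberately left out so the example stays offline.

```python
def normalize_hostnames(raw_names):
    """Lowercase, strip wildcards and trailing dots, dedupe, and sort."""
    cleaned = set()
    for name in raw_names:
        name = name.strip().lower().rstrip(".")
        if name.startswith("*."):
            # A wildcard entry still proves the parent name exists.
            name = name[2:]
        # Skip empty strings and obviously malformed entries.
        if name and " " not in name:
            cleaned.add(name)
    return sorted(cleaned)
```

Running this before any resolution or probing keeps later stages from repeating work on case or wildcard variants of the same name.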

Phase 4: Analysis

Turn data into intelligence. Correlate findings across sources: the email on the WHOIS record matches the Gravatar on GitHub, which matches a LinkedIn photo, which matches a conference speaker bio. Link analysis tools (Maltego, spreadsheets, link graphs) help surface non-obvious connections.

Phase 5: Dissemination

Deliver findings in the format the consumer expects. A bug bounty report, a pentest recon appendix, an executive briefing, a due diligence memo. Include provenance for every claim — where it came from, when it was collected, and how confident you are.

Phase 6: Feedback

Does the intelligence answer the question? What was missed? What should have been collected sooner? Feed lessons back into Phase 1 for the next engagement.


3. People OSINT (HUMINT/SOCMINT)

People investigations map a target’s digital footprint: identifiers, aliases, affiliations, locations, and relationships. The process is iterative — each fact opens new pivots.

Starting identifiers

| Seed | Immediate pivots |
|---|---|
| Full name | Search engines, LinkedIn, Wikipedia, voter rolls, academic directories |
| Email | HaveIBeenPwned, Hunter.io, Gravatar, breach dumps, domain WHOIS, Google |
| Username | Sherlock, WhatsMyName, Namechk, Maigret |
| Phone number | PhoneInfoga, Truecaller, reverse lookup, Telegram/WhatsApp checks |
| Profile photo | Reverse image search (Google, Yandex, TinEye, PimEyes) |
| Employer | LinkedIn, press releases, company filings |
| Address | Property records, voter rolls, Google Maps Street View |

The Maltego-style pivot graph

Treating a person investigation as a graph (nodes = identifiers/entities, edges = “associated with”) prevents losing track of where each fact originated. A typical pivot chain from a Maltego-style workflow:

  1. Name → search page titles, Wikipedia, personal website
  2. Personal website → footer emails, phone numbers, historical WHOIS (DomainTools)
  3. WHOIS email → other domains registered with same email (reverse WHOIS via WhoXY)
  4. Social handles (marc_clotet, marcclotetoficial) → Instagram, Twitter, Facebook profiles
  5. Mutual followers/following → close contacts, private accounts
  6. Affiliated company (mentioned in bio) → corporate registry → officers → other affiliated parties
  7. Historical Hotmail address uncovered via DomainTools → Pipl person search → age, relatives, locations
  8. Phone number → messaging app profile photos, account discovery

Each step is a pivot from a confirmed entity to new related entities. Maltego Transforms automate the individual hops; you can run the same workflow manually with curl, whois, and careful note-taking.
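
The graph model described above can be kept in a few lines of code when Maltego is not available. This is a sketch only: the `PivotGraph` class and the `type:value` node labels are illustrative conventions, not a Maltego format.

```python
from collections import defaultdict


class PivotGraph:
    """Tiny undirected graph: nodes are identifiers, edges record the source of the link."""

    def __init__(self):
        self.edges = defaultdict(dict)

    def link(self, a: str, b: str, source: str) -> None:
        """Record that a and b are associated, and where that claim came from."""
        self.edges[a][b] = source
        self.edges[b][a] = source

    def path(self, start: str, goal: str):
        """Breadth-first search: return the chain of pivots from start to goal, or None."""
        queue, seen = [[start]], {start}
        while queue:
            trail = queue.pop(0)
            if trail[-1] == goal:
                return trail
            for nxt in self.edges[trail[-1]]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(trail + [nxt])
        return None
```

Storing the source on every edge means any node in the final report can be traced back to the seed through `path`, with provenance for each hop.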

Username enumeration

Most people reuse handles across platforms. Tools that check hundreds of sites in parallel:

  • Sherlock — Python tool checking 300+ social networks for a username
  • WhatsMyName — web/CLI tool with a community-maintained JSON list of sites
  • Maigret — fork of Sherlock with richer profile data extraction
  • Namechk / KnowEm — brand/username availability checkers repurposed for OSINT

A hit on a niche forum is often worth more than another Twitter account — niche forums surface real interests, writing samples, and contact patterns.

Email enumeration and validation

  • Hunter.io — finds email addresses by domain, infers patterns (first.last@, flast@), verifies deliverability
  • Email permutator — generates plausible addresses from a name plus domain
  • HaveIBeenPwned — reveals which breaches an email appears in (reveals services used)
  • Gravatar: https://gravatar.com/<md5(email)>.json returns the profile if one is registered (hash the trimmed, lowercased address)
  • Epieos / Holehe — checks dozens of services for account registration without triggering password reset emails
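
Two of the items above are simple enough to sketch directly: generating candidate addresses from a name and building the Gravatar profile URL. The pattern list is an illustrative assumption (only first.last@ and flast@ come from the text), and nothing here checks whether an address actually exists.

```python
import hashlib


def email_permutations(first: str, last: str, domain: str) -> list:
    """Generate plausible addresses from a name and a domain."""
    first, last = first.strip().lower(), last.strip().lower()
    local_parts = [
        f"{first}.{last}",    # first.last@
        f"{first[0]}{last}",  # flast@
        f"{first}{last}",
        f"{first}",
        f"{first}_{last}",
        f"{last}.{first}",
    ]
    return [f"{part}@{domain.lower()}" for part in local_parts]


def gravatar_profile_url(email: str) -> str:
    """Build the Gravatar JSON profile URL from the MD5 of the normalized address."""
    digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
    return f"https://gravatar.com/{digest}.json"
```

Each candidate still needs validation through one of the services listed above before it is treated as a real address.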

Phone numbers

  • PhoneInfoga — country, carrier, line type, breach hits
  • Truecaller — crowdsourced caller ID; risky (reveals your query to Truecaller)
  • Messaging apps — adding a number to contacts often reveals the registered profile name and avatar (opsec-heavy; use burner)

Reverse image search for profile photos

  • Google Images / Google Lens — best for product, landmark, and Western content
  • Yandex Images — best for faces and people; still the strongest for facial matches
  • TinEye — best for finding the original source and earliest occurrence
  • PimEyes / FaceCheck — facial recognition across the open web (paid, ethically fraught)
  • Bing Visual Search — decent for products and landmarks

4. Company & Corporate OSINT

Company investigations combine infrastructure recon with corporate filings, personnel mapping, and vendor/technology fingerprinting.

Corporate identity sources

| Source | Data |
|---|---|
| OpenCorporates | Global company registry metadata: officers, addresses, status |
| SEC EDGAR | US public company filings (10-K, 10-Q, insider transactions) |
| Companies House (UK) | Officers, filings, beneficial owners |
| Bureau van Dijk / Orbis | Paid, comprehensive global company intelligence |
| Dun & Bradstreet | Business credit, corporate family trees |
| Crunchbase / PitchBook | Funding, investors, board members (paid tiers for depth) |
| LinkedIn company pages | Headcount, departments, employee list |
| FullContact / Clearbit | Enrichment APIs: size, industry, tech stack, key people |

Subsidiary and domain discovery

Large companies have sprawling digital footprints. Start from the primary name and expand:

  • Reverse WHOIS (WhoXY, DomainTools) — find all domains registered to the same name or email. Remember: WhoXY requires exact-string matches, so “Blizzard Entertainment” and “Blizzard Entertainment, Inc” will return different sets.
  • Trademark search — USPTO, EUIPO filings reveal product codenames and subsidiaries.
  • Press releases and SEC filings — mention subsidiary names that never appear on the website.
  • Job postings — often mention internal tool names, cloud providers, and office locations.

Employee enumeration

  • LinkedIn: the Google dork site:linkedin.com "Acme Corp" reveals public profiles even without login
  • Hunter.io / phonebook.cz — bulk email harvest by domain
  • GitHub — commits by @company.com email addresses expose engineers
  • Conference talks, CVE credits, paper authorship — lists specialists
  • RocketReach, Lusha, Apollo — sales tools repurposed for contact discovery (paid)

Technology fingerprinting

Knowing the stack narrows exploitation research later:

  • BuiltWith / Wappalyzer — web stack detection from rendered HTML and headers
  • Shodan / Censys — banner grabs reveal server software and versions
  • DNS records — MX (pphosted.com = Proofpoint), SPF (spf.protection.outlook.com = O365), CNAMEs revealing CDN/CMS
  • JavaScript bundles — library imports, API endpoints, third-party integrations
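
The DNS-record item lends itself to a lookup table. Below is a rough sketch that maps MX hosts and SPF includes to providers. The first two suffixes come from the text; the others are common suffixes added for illustration, and the function name is an assumption.

```python
# Suffix -> provider. Extend with your own observations.
PROVIDER_SUFFIXES = {
    "pphosted.com": "Proofpoint",
    "spf.protection.outlook.com": "Microsoft 365",
    "mail.protection.outlook.com": "Microsoft 365",
    "google.com": "Google Workspace",
    "googlemail.com": "Google Workspace",
    "mimecast.com": "Mimecast",
}


def identify_providers(record_values):
    """Return the providers whose suffix appears in MX hosts or SPF include mechanisms."""
    found = set()
    for value in record_values:
        # Split SPF "include:host" mechanisms so the host becomes its own token.
        for token in value.lower().replace("include:", " ").split():
            host = token.rstrip(".")
            for suffix, provider in PROVIDER_SUFFIXES.items():
                if host == suffix or host.endswith("." + suffix):
                    found.add(provider)
    return found
```

The input is whatever your resolver returned as record strings, for example the answer lines from a dig query for MX and TXT.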

5. Infrastructure & Network OSINT

This is where OSINT crosses most directly into security recon. The goal is to enumerate every externally reachable asset and catalog what is running on it — without touching the target.

IP space and ASN

  • ARIN / RIPE / APNIC / LACNIC / AFRINIC RDAP — WHOIS for IP blocks, netblock ownership
  • bgp.he.net — AS number lookups, peering relationships, announced prefixes
  • ipinfo.io / ipdata — enrichment APIs with geoloc, ASN, org
  • RIPEstat — authoritative routing, abuse contacts, historical data

A company that owns its own ASN gives you a clean IP perimeter and usually signals infrastructure maturity. For a company hosted entirely in public cloud (AWS/GCP/Azure), you instead map its domains back to the cloud provider's ranges.

Search engines for infrastructure

These are the indispensable tools. None of them touch the target; they query pre-indexed scan data.

| Tool | Best for | Free tier |
|---|---|---|
| Shodan | Banners, service versions, SCADA/ICS, webcams, IoT, vulnerability filters (vuln:) | Limited queries, paid plans unlock filters |
| Censys | Certificate search, service fingerprinting, precise field queries | 250 searches/month |
| Netlas.io | Domains, IPs, WHOIS, DNS combined; Maltego integration | 50 searches/day, 2500 results/month |
| FOFA | Chinese alternative, strong for APAC infrastructure | Limited |
| ZoomEye | Another Chinese alternative | Limited |
| BinaryEdge | Scans, leaked databases, risk scoring | Paid |
| GreyNoise | Classifies “background noise” IPs to filter scan traffic | Community tier |
| Hunter.how | Cyberspace search engine | Limited |

Shodan query patterns:

hostname:example.com
ip:203.0.113.0/24
port:22 country:US
product:"nginx"
vuln:CVE-2021-44228
org:"Acme Corp"
ssl:"example.com"

Censys field queries:

parsed.names: example.com
services.service_name: "HTTP" and location.country: "United States"
services.tls.certificates.leaf_data.names: "*.example.com"
autonomous_system.asn: 13335

Cloud asset discovery

Most modern targets live in public cloud. Mapping cloud assets:

  • S3 buckets: use awscli or boto3. Checks made with any authenticated AWS account surface more than unauthenticated HTTP probes, because some buckets grant access to any authenticated AWS user. Bucket names are globally unique, so try company-name variants with common prefixes/suffixes: qa, dev, staging, prod, bak, backup, logs, assets, uat, legacy, internal, public, private, docs.
  • Digital Ocean Spaces — same API shape as S3, separate namespace to enumerate.
  • Azure blob storage: <name>.blob.core.windows.net
  • GCS buckets: <name>.storage.googleapis.com
  • Firebase databases: <name>.firebaseio.com
  • Dangling CNAMEs — records pointing to deleted cloud resources are ripe for subdomain takeover. The can-i-take-over-xyz repo catalogs the fingerprints.
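
The bucket-naming advice above amounts to a small candidate generator. This sketch uses the affixes listed in the text; the separators and the function name are assumptions. It only produces names and does not query any cloud provider.

```python
from itertools import product

# Affixes taken from the list in the text.
AFFIXES = ["qa", "dev", "staging", "prod", "bak", "backup", "logs", "assets",
           "uat", "legacy", "internal", "public", "private", "docs"]


def bucket_candidates(company: str, separators=("-", ".", "")) -> list:
    """Generate candidate bucket names: the bare name plus each affix as prefix and suffix."""
    base = company.strip().lower().replace(" ", "")
    names = {base}
    for affix, sep in product(AFFIXES, separators):
        names.add(f"{base}{sep}{affix}")
        names.add(f"{affix}{sep}{base}")
    # S3 bucket names must be between 3 and 63 characters long.
    return sorted(n for n in names if 3 <= len(n) <= 63)
```

The same list can be reused against the other storage hostnames in this section by substituting each candidate for `<name>`.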

Historical data

  • Wayback Machine (archive.org) — snapshots of old pages, forgotten endpoints, robots.txt evolution, admin panel references
  • CommonCrawl — bulk web archive suitable for scripted search
  • SecurityTrails — historical DNS records, WHOIS changes, subdomain discovery
  • DomainTools — historical WHOIS (the closest thing to a time machine for registration data)
  • Google Cache — the cached view is gradually being removed, but still useful when present

Historical data is often more valuable than current data. A subdomain that vanished last year may still point at a forgotten S3 bucket. An old WHOIS record may contain an admin’s personal email that was scrubbed from the current record.
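
For the Wayback Machine, archived URLs can be listed through its CDX endpoint. The sketch below builds the query URL and parses a JSON response that has already been downloaded. The helper names are illustrative, and the fetch itself is omitted to keep the example offline.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain: str) -> str:
    """Build a CDX query for every archived URL under a domain, one row per unique URL."""
    params = {
        "url": f"{domain}/*",
        "output": "json",
        "fl": "original,timestamp",
        "collapse": "urlkey",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """With output=json the first row is the header; map the remaining rows to dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]
```

Filtering the parsed rows for paths such as admin panels or old API versions is a cheap way to find endpoints that no longer appear on the live site.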


6. Domain, DNS & Certificate Intel

Domain-level intel is the connective tissue of infrastructure OSINT.

Subdomain enumeration

Passive sources pull from pre-indexed databases — no traffic to the target:

  • crt.sh: free, no API key required; queries Certificate Transparency logs. Every publicly trusted TLS certificate is logged, so a certificate issued for hr.example.com reveals that subdomain even before it goes live.
  • Certificate Transparency (transparencyreport.google.com) — Google’s CT aggregator
  • VirusTotal — passive DNS from submissions
  • SecurityTrails / DNSDumpster / Netcraft — historical DNS aggregators
  • Subfinder — orchestrates queries against 30+ passive sources
  • Amass (intel/enum passive mode) — OWASP’s enumeration framework
  • Assetfinder — tomnomnom’s lightweight passive finder
  • chaos-client — ProjectDiscovery’s Chaos dataset
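
As an example of working with a passive source directly, this sketch builds a crt.sh JSON query and extracts hostnames from an already-downloaded response. It assumes the response is a list of objects with a newline-separated `name_value` field; the function names are illustrative.

```python
from urllib.parse import quote


def crtsh_url(domain: str) -> str:
    """Build the crt.sh JSON query; '%' is the wildcard and must be URL-encoded."""
    return f"https://crt.sh/?q={quote('%.' + domain)}&output=json"


def names_from_crtsh(entries, domain: str) -> list:
    """Collect unique in-scope hostnames from crt.sh JSON entries."""
    found = set()
    for entry in entries:
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]
            # Keep only the domain itself and its subdomains.
            if name == domain or name.endswith("." + domain):
                found.add(name)
    return sorted(found)
```

The scope check matters: certificates often list unrelated names, and a suffix match without the leading dot would let look-alike domains through.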

Active enumeration adds brute-force, permutation, and zone walking:

  • puredns / shuffledns — mass DNS resolution with wildcards filtered
  • altdns / dnsgen — permutation wordlists from discovered names
  • dnsrecon — zone transfers, cache snooping
  • massdns — raw resolver for large lists

Active validation

Once you have a list of candidate names, resolve and probe them:

  • dnsx — bulk resolution, filtering by record type
  • httpx — probes live HTTP(S) services, captures titles, tech stacks, status codes
  • aquatone — visual recon; screenshots and clusters subdomains by similarity
  • gowitness / eyewitness — alternative screenshotters

DNS record mining

Beyond A/CNAME, records leak intel:

| Record | Intel |
|---|---|
| MX | Email provider (Google, Microsoft, Proofpoint, Mimecast) |
| TXT | SPF records list third-party senders (Marketo, Salesforce, Zendesk); DKIM selectors hint at tooling |
| SRV | Exposes specific services (XMPP, SIP, LDAP) |
| CAA | Allowed certificate authorities |
| NS | DNS provider (Route53, Cloudflare, NS1) |
| SOA | Admin email, zone refresh parameters |

Weak or missing SPF/DMARC (e.g. v=DMARC1; p=none) signals exploitable email spoofing potential. DKIMValidator is a classic utility for testing DMARC alignment without interacting with the target infrastructure.
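
To make the DMARC check concrete, here is a small sketch that parses a DMARC TXT record string and summarizes its policy. The function names and the wording of the summaries are illustrative; fetching the record from _dmarc.<domain> is left to your resolver.

```python
def parse_dmarc(txt_record: str) -> dict:
    """Split a DMARC record into its tag=value pairs."""
    tags = {}
    for part in txt_record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def spoofing_posture(txt_record):
    """Summarize how strictly a domain asks receivers to treat unauthenticated mail."""
    if not txt_record:
        return "no DMARC record"
    tags = parse_dmarc(txt_record)
    if tags.get("v", "").upper() != "DMARC1":
        return "not a DMARC record"
    policy = tags.get("p", "").lower()
    if policy == "none":
        return "monitor only (p=none)"
    if policy in ("quarantine", "reject"):
        return f"enforced (p={policy})"
    return "invalid or missing policy"
```

A "monitor only" or missing result is the weak case described above; an enforced policy does not rule out spoofing but raises the bar considerably.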

WHOIS

  • Current WHOIS — often redacted under GDPR for individuals, still useful for corporate registrants
  • Historical WHOIS — DomainTools, WhoisXML API, Whoxy — the unredacted gold
  • Reverse WHOIS — find all domains sharing a registrant email, name, phone, or organization

7. Social Media Intelligence

SOCMINT is high-signal but high-noise. Treat public social media as a window into a target’s relationships, routines, locations, and interests — and treat closely curated accounts (execs, celebrities) as performative artifacts, not ground truth.

Platform-by-platform quick reference

| Platform | High-value intel |
|---|---|
| LinkedIn | Employer history, skills, internal tool mentions, team maps, location |
| Twitter/X | Writing style, real-time location, device fingerprints (Twitter for iPhone etc.), interests, connections |
| Facebook | Family, relationships, check-ins, photos, events, groups |
| Instagram | Geolocation from photos, friend networks, routines, physical spaces |
| TikTok | Schedules, location context, behavioral patterns |
| Reddit | Long-form writing, niche communities, real-world interests |
| GitHub | Code, commits, emails, working hours, associated accounts |
| Strava/fitness apps | Routines, home location, military base exposure |
| Telegram | Phone number → profile → channels joined |
| Discord | Real-time presence, community affiliations |

Techniques

  • Close contact mapping — mutual followers on Instagram, Facebook friend overlap, Twitter interaction graphs
  • Temporal analysis — timestamp clusters reveal timezone, sleep schedule, work hours
  • Linguistic fingerprinting — consistent phrasing across accounts links aliases
  • Photo OSINT — backgrounds reveal location, device clock shows timezone, reflections leak environment
  • Story/ephemeral content — archive quickly; gone in 24 hours
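
Temporal analysis can start with a simple hourly histogram. This sketch assumes you already have post timestamps as ISO 8601 strings with an explicit UTC offset. The function names and the 8-hour window are illustrative choices, and the quiet window is only a hint about sleep hours, not proof of a timezone.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_histogram(timestamps):
    """Count posts per UTC hour. Timestamps must carry an offset, e.g. +00:00."""
    hours = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
        hours[dt.hour] += 1
    return hours


def quietest_window(hours, length=8):
    """Return the UTC start hour of the `length`-hour window with the fewest posts."""
    def activity(start):
        return sum(hours.get((start + i) % 24, 0) for i in range(length))
    # On ties, the earliest start hour wins.
    return min(range(24), key=activity)
```

If the quietest window starts at 05:00 UTC, for example, one hypothesis is a night running roughly 05:00 to 13:00 UTC, which then needs corroboration from other evidence.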

Tools

  • Sherlock / WhatsMyName / Maigret — username across platforms
  • Osintgram — Instagram enumeration (rate-limited; may violate ToS)
  • Twint / snscrape — Twitter scraping without API (fragile post-API lockdown)
  • Social Analyzer — API and CLI for social profile discovery
  • Social-Searcher — keyword monitoring across platforms
  • Blackbird — username/email search across 500+ sites
  • IntelX — indexed social content and leaked data

Note: platform APIs and terms have tightened significantly. Many classic scraping tools are in a state of perpetual repair. Always check recency.


8. Geolocation & Imagery (GEOINT)

Determining where a photo, video, or person is located from visual evidence.

Classic technique stack

  • Shadow analysis: sun angle constrains latitude and time of day (SunCalc, suncalc.org)
  • Landmark identification — monuments, logos, business signage
  • Language and script — signage language narrows region
  • Vegetation — tree species and agriculture indicate climate zone
  • Vehicle makes and license plate formats — country/region disambiguation
  • Electrical plug shapes and pole construction — power grid standards vary by region
  • Road markings — lane widths, stripe patterns, sign shapes (MUTCD vs. Vienna Convention)
  • Architecture — roofing styles, window frames, construction materials
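
The shadow-analysis item reduces to basic trigonometry. This is a rough sketch, assuming a vertical object on level ground whose height and shadow length can be estimated in the same units. Both function names are illustrative, and the solar declination for the date has to come from an external source such as SunCalc.

```python
import math


def sun_elevation_deg(object_height: float, shadow_length: float) -> float:
    """Sun elevation angle implied by a vertical object and its shadow on level ground."""
    return math.degrees(math.atan2(object_height, shadow_length))


def noon_latitude_candidates(elevation_deg: float, declination_deg: float):
    """At local solar noon, elevation = 90 - |latitude - declination|.

    Solving for latitude gives two candidates, one on each side of the subsolar point.
    """
    offset = 90.0 - elevation_deg
    return declination_deg - offset, declination_deg + offset
```

The noon formula only applies when the shadow is at its shortest for the day; at other times the elevation constrains a curve of possible locations rather than a latitude.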

Reverse image search

Run the same image through all of these — coverage varies wildly:

  • Google Images / Lens
  • Yandex Images — still the best for Russian/Eastern European and general face matching
  • TinEye — best for finding originals and earliest occurrences
  • Bing Visual Search
  • Baidu — better for Chinese content

AI-assisted geolocation

Modern multimodal models can synthesize the classic technique stack in seconds. The Hackers Arise walkthrough demonstrates using custom GPTs like GeoGuessr GPT for first-pass geolocation: upload an image, ask where it was taken. These models do not do reverse image search — they visually reason over architectural, vegetation, and signage cues. They are often wrong on specifics but provide a valuable starting framework of observations (“the road signs suggest Cyrillic-script Eastern Europe; the utility pole style matches post-Soviet construction”).

Practitioners should treat AI guesses as hypotheses, not conclusions, and verify every claim against ground-truth imagery.

Mapping and imagery sources

  • Google Maps / Earth Pro — Street View, historical imagery, 3D buildings
  • Yandex Maps / Mapillary / KartaView — alternative street-level imagery, stronger coverage in some regions
  • Sentinel Hub / EO Browser — free satellite imagery (Sentinel-2, Landsat)
  • Planet Labs — commercial high-cadence satellite imagery
  • OpenStreetMap — community mapping with extractable POI data
  • Overpass Turbo — query OSM for arbitrary features (e.g. “all churches in this bounding box with a spire over 30m”)
  • Wikimapia — crowd-sourced photo-annotated POI database

Video and live stream OSINT

  • EarthCam / Insecam — aggregated public webcams (many unintentional)
  • Windy.com — live webcams for weather
  • YouTube geosearch tools — find videos shot within a geographic radius
  • FlightRadar24 / ADS-B Exchange — real-time civilian aircraft tracking
  • MarineTraffic / VesselFinder — real-time ship AIS data
  • RailSense / similar — train tracking by region

EXIF and video metadata

Original image files from phones and GPS-equipped cameras often embed GPS coordinates unless location tagging was disabled or the metadata was stripped. Most social platforms strip EXIF, but platforms that preserve it (Flickr, some forums, raw email attachments) can hand-deliver the answer. Tools: exiftool, ExifTool Online, Jeffrey's Image Metadata Viewer.
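
EXIF stores GPS positions as degrees, minutes, and seconds plus a hemisphere reference, so they usually need converting before they can be pasted into a map. A minimal sketch, with an illustrative function name:

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert an EXIF-style degrees/minutes/seconds value to signed decimal degrees.

    `ref` is the hemisphere letter from GPSLatitudeRef or GPSLongitudeRef (N, S, E, W).
    """
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal notation.
    return -value if ref.upper() in ("S", "W") else value
```

Usage: `dms_to_decimal(40, 26, 46, "N")` and `dms_to_decimal(79, 58, 56, "W")` give a latitude and longitude pair that most mapping tools accept directly.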


9. Breach, Leak & Paste Intel

Credentials, PII, and internal data exposed through historical breaches and paste sites are a cornerstone of offensive OSINT.

Breach lookup services

  • HaveIBeenPwned — free, non-commercial use; reveals which breaches an email appears in. Also exposes pastes containing the email. Pwned Passwords lets you check whether a specific password has been seen in any breach without sending the password (k-anonymity via SHA-1 prefix).
  • Dehashed — paid, searchable index of actual credential content
  • IntelligenceX / IntelX — indexed breach and leak content, darknet sources
  • LeakCheck / Snusbase / LeakPeek — commercial breach databases
  • Breach-parse / h8mail — local tools for searching personal breach archives
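
The k-anonymity scheme behind Pwned Passwords is easy to show: only the first five hex characters of the SHA-1 hash are sent (to https://api.pwnedpasswords.com/range/<prefix>), and the match against the returned suffixes happens locally. The sketch below covers the hashing and the local match; the HTTP request is omitted, and the function names are illustrative.

```python
import hashlib


def pwned_range_query(password: str):
    """Return (prefix, suffix) of the uppercase SHA-1 hex digest; only the prefix is sent."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range_response(suffix: str, response_text: str) -> int:
    """Range responses are lines of 'SUFFIX:COUNT'; return the count for our suffix, or 0."""
    for line in response_text.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0
```

Because the service only ever sees the five-character prefix, it cannot tell which of the many matching hashes you were checking.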

Operational notes

  • HIBP tells you an email was in Collection #1, but not the password. Commercial services can supply the cleartext; whether using it is acceptable depends on the terms of your engagement.
  • “Sensitive” breach flags (Ashley Madison, etc.) require judgment — referencing them in a client deliverable is frequently inappropriate even when technically accurate.
  • Breach data ages: a password from 2013 is probably not current, but hints at password patterns and reveals services the user has engaged with.
  • Pastes live and die quickly. If a paste URL 404s, check Google cache and Wayback Machine immediately.

Paste sites and dumps

  • Pastebin — classic source, still productive
  • Ghostbin / Hastebin / Rentry / Privatebin — newer alternatives
  • GitHub Gist — frequently overlooked; indexed by Google (site:gist.github.com)
  • Telegram channels — many dump channels operate exclusively on Telegram
  • Darknet forums — BreachForums, XSS, Exploit — require careful opsec

10. Metadata Extraction

Documents, images, and files published by a target frequently leak internal usernames, software versions, file paths, and timestamps.

Document metadata

  • exiftool — the canonical CLI tool; handles EXIF, XMP, IPTC, PDF metadata, Office documents
  • FOCA (Fingerprinting Organizations with Collected Archives) — downloads documents from a target domain, extracts metadata in bulk, builds org charts from author fields
  • metagoofil — FOCA-alike in Python, uses Google/Bing to find documents by filetype on a target domain
  • PDFiD / peepdf — PDF internals inspection
  • oletools — OLE/Office document internals
  • mat2 — metadata anonymization tool; useful for understanding what it strips and therefore what is leaked

Google dorks for document hunts

site:example.com filetype:pdf
site:example.com filetype:xlsx
site:example.com filetype:docx
site:example.com ext:doc OR ext:docx OR ext:xls OR ext:xlsx
site:example.com "for internal use only"

What metadata reveals

| Field | Leak |
|---|---|
| Author | Internal username (often the domain login) |
| Creation software | Microsoft Office 2016, LibreOffice 7.4: software inventory |
| Last modified by | Another internal user |
| Printer | Printer model and possibly IP |
| Revision history | Earlier drafts, collaborators |
| Embedded images | Secondary EXIF data |
| Hyperlinks | Internal SharePoint/intranet URLs |
| File paths | C:\Users\jdoe\Documents\... reveals username |
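
The file-path leak is simple to harvest in bulk once metadata has been dumped to text, for example from exiftool output. The patterns below are illustrative assumptions covering common Windows, Linux, and macOS profile paths; the function name is hypothetical.

```python
import re

USER_PATH_PATTERNS = [
    # Windows profile directories, e.g. C:\Users\jdoe\...
    re.compile(r"[A-Za-z]:\\Users\\([^\\]+)\\", re.IGNORECASE),
    re.compile(r"[A-Za-z]:\\Documents and Settings\\([^\\]+)\\", re.IGNORECASE),
    # Linux and macOS home directories, e.g. /home/asmith/...
    re.compile(r"/(?:home|Users)/([^/\s]+)/"),
]


def usernames_from_text(text: str) -> set:
    """Extract usernames from profile paths found anywhere in metadata text."""
    found = set()
    for pattern in USER_PATH_PATTERNS:
        found.update(pattern.findall(text))
    return found
```

Usernames recovered this way often match the Author field and the local part of corporate email addresses, which makes them useful for cross-checking.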

11. Code & Repository OSINT

Source code hosting platforms are a gold mine. Every commit is a historical record, and secrets leak constantly.

GitHub search techniques

Surface leaked credentials and sensitive content:

org:acmecorp password
org:acmecorp apikey
"@acmecorp.com" password
filename:.env acmecorp
filename:config.yml acmecorp
"BEGIN RSA PRIVATE KEY" acmecorp
extension:sql acmecorp INSERT INTO users

Note that GitHub’s secret scanning reports many leaked tokens to the issuing provider, which often revokes them automatically, so old dumps may contain stale credentials. They are still useful for mapping the services in use.

Tools

  • gitleaks — scans repos and Git history for secret patterns
  • trufflehog: secret detection using pattern detectors and entropy checks (recent versions can also verify whether a credential is live); supports GitHub org scanning
  • git-secrets — AWS Labs tool; primarily for preventing commits but usable for audit
  • gitrob — catalogs secrets across an organization’s public repos
  • github-dorks / gh-dork — curated dork lists
  • GitHound / GitMiner — deep search across public GitHub

Pivoting from a single repo

  • Commit metadata — author email, name, timestamps (working hours)
  • .github/CODEOWNERS — team structure
  • Issue comments — internal tool names, vendors, ticket systems
  • PR reviewers — collaboration networks
  • Starred/forked repos — interests, technology exposure
  • GitHub Pages — hosted sites under <user>.github.io often have separate content

Beyond GitHub

  • GitLab.com — same techniques, smaller dork coverage
  • Bitbucket — less searchable but still scannable
  • Self-hosted instances — Gitea, Forgejo, cgit — find via Shodan (http.title:"Gitea")
  • DockerHub — images often ship with embedded secrets or leaked file paths
  • npm / PyPI / crates.io — package authors, private package mentions in public packages

12. Dark Web & Threat Intel

Aggregators and commercial platforms fold darknet content, malware telemetry, and threat actor intelligence into the OSINT pipeline.

Platforms

  • Intel 471 — cybercriminal forum and actor intelligence
  • Recorded Future — broad threat intel with OSINT and closed-source blend
  • CloudSEK (XVigil) — external threat monitoring, brand exposure, dark web
  • Flashpoint — illicit community monitoring
  • DarkOwl — darknet content search
  • ShadowDragon (SocialNet, etc.) — investigative toolkits with 200+ data sources integrated
  • ZeroFox — brand protection, social and dark web
  • Digital Shadows / ReliaQuest — digital risk protection
  • Maltego + Transform Hub — glue for integrating many of the above

Threat intel feeds

  • MISP — open-source threat intelligence sharing platform
  • AlienVault OTX — free community threat exchange
  • abuse.ch (URLhaus, MalwareBazaar, ThreatFox, Feodo Tracker) — free high-quality IoC feeds
  • VirusTotal Intelligence — paid search over submitted samples, URLs, domains
  • GreyNoise — distinguishes targeted scans from internet background noise

13. IoT & Device Discovery

Specialized search for internet-connected devices and sensors, from industrial control systems to smart home devices.

  • Shodan — still the best for ICS/SCADA (port:502, port:102, category:ics)
  • Censys — complementary coverage
  • ZoomEye — strong APAC IoT coverage
  • Thingful — the “search engine for the Internet of Things” — aggregates public IoT sensor data (air quality, weather, energy, transport) across millions of devices globally, suitable for environmental research and urban analytics
  • Kamerka: geolocation-focused ICS/IoT scanner using Shodan/BinaryEdge data
  • Insecam — lists public webcams (many with default credentials)

These tools are powerful for researchers mapping exposure and for defenders cataloging their own attack surface. They are equally abused by attackers — defenders should track their own presence in them.


14. Tools Reference

A consolidated lookup of the tools practitioners reach for. The overlap between “OSINT tool” and “recon tool” is large; most of these appear repeatedly in the source surveys.

Frameworks and aggregators

| Tool | Purpose |
|---|---|
| Maltego | Graph-based link analysis, Transform Hub with 70+ data sources, the standard for investigations that must produce a visual link chart |
| SpiderFoot | Automated OSINT framework, 200+ modules, web UI, runs scheduled scans, correlates findings |
| recon-ng | Framework with Metasploit-style module system for recon workflows |
| theHarvester | Email, subdomain, employee name enumeration from search engines and PGP servers |
| OSINT Framework (osintframework.com) | Curated web directory of tools by category; not a scanner, but the best starting map of the ecosystem |
| IntelTechniques (OSINT Techniques) | Michael Bazzell’s methodology and tool collection |

Maltego

  • Model: graph of entities (Person, Domain, IP, Email, etc.) connected by relationships. Transforms run against an entity to produce related entities.
  • Data sources: the Transform Hub integrates DomainTools, Shodan, Pipl, OpenCorporates, Censys, Have I Been Pwned, Vetric, Netlas, IBM Watson, and many more. Many are paid.
  • Use cases: person of interest investigations, corporate link analysis, threat actor attribution, fraud networks.
  • Typical workflow: seed with names, domains, or emails → run passive Transforms → pivot on interesting results → prune noise → export as report or visual graph. A complete person investigation can move from a name to Wikipedia to personal website to historical WHOIS to personal email to person profile (age, relatives) in a handful of Transform runs.

SpiderFoot

  • Model: modular scanner with 200+ modules, each tapping a specific data source. Configure the target and scan profile, run, review.
  • Data sources: Shodan, VirusTotal, HIBP, SecurityTrails, HackerTarget, crt.sh, Censys, IntelX, and many more (some require API keys).
  • Use cases: baseline external exposure audit, continuous monitoring, bug bounty asset discovery, threat investigation.
  • Strengths: fire-and-forget automation, depth of coverage, built-in correlation rules that highlight interesting findings across modules.

theHarvester

  • Model: CLI tool that queries search engines, DNS sources, and PGP key servers for emails, subdomains, IPs, and employee names.
  • Sources: Google, Bing, DuckDuckGo, LinkedIn, Baidu, crt.sh, Shodan, Censys, and many more.
  • Typical invocation: theHarvester -d example.com -l 500 -b all
  • Strengths: simple, scriptable, pairs well with automation pipelines.

recon-ng

  • Model: Metasploit-style framework (workspaces, modules, options). Modules fetch specific data types into a workspace database.
  • Strengths: good persistence of results across sessions, scriptable, reasonable module coverage for core recon tasks.
  • Typical flow: workspaces create acme → add seed domains → run recon/domains-hosts/* modules → export.

Sherlock

  • Purpose: username enumeration across 300+ social sites. Typical invocation: python3 sherlock jdoe
  • Strengths: fast, easy, no API keys. Good for alias discovery.
  • Caveats: false positives on generic 200 responses; validate manually.

Shodan

  • Purpose: search engine over internet-connected service banners. Queries scan data, not live services.
  • Filters: port:, product:, version:, org:, hostname:, country:, vuln:, category:, ssl:, http.title:, http.html:
  • CLI: shodan host 1.2.3.4, shodan search 'apache country:US', shodan download, shodan parse
  • Best for: attack surface snapshots, finding forgotten assets, identifying vulnerable software at scale.
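
When queries are generated by a script rather than typed by hand, a small helper keeps quoting consistent. This is a minimal sketch: build_shodan_query is a hypothetical name, not part of the Shodan library, and it only assembles the query string from filters like those listed above.

```python
def build_shodan_query(filters: dict[str, str], free_text: str = "") -> str:
    """Assemble a Shodan query string from filter name/value pairs."""
    parts = [free_text.strip()] if free_text.strip() else []
    for name, value in filters.items():
        # Drop embedded quotes so the value cannot break out of its filter.
        value = str(value).replace('"', "")
        # Values containing whitespace must be quoted to stay in one filter.
        if any(ch.isspace() for ch in value):
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    return " ".join(parts)
```

The resulting string can be passed to shodan search on the command line or to whichever API client is in use.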

Censys

  • Purpose: internet-wide scan data with particular strength in TLS certificates and precise field queries.
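  • Query language: Lucene-style with parsed fields. Examples: services.service_name: "HTTP" (a hosts-index field) and parsed.names: example.com (a certificates-index field). The two fields belong to different indexes, so they are run as separate queries.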
  • Strengths: certificate history, subdomain discovery via cert parsed names, strong API.
  • Free tier: 250 web searches/month; API access requires a paid plan.

Specialized tools referenced across the surveys

Category              | Tools
Subdomain enum        | Subfinder, Amass, Assetfinder, chaos-client, Findomain, Sublist3r
HTTP probing          | httpx, aquatone, gowitness, EyeWitness
URL discovery         | waybackurls, gau, katana, hakrawler, gospider
Port scanning         | Nmap, Masscan, RustScan, naabu
Content discovery     | ffuf, gobuster, feroxbuster, dirsearch
Email hunting         | Hunter.io, theHarvester, phonebook.cz, Clearbit, Skymem
Username hunting      | Sherlock, WhatsMyName, Maigret, Namechk, Holehe
Image search          | Google Lens, Yandex, TinEye, PimEyes
Metadata              | exiftool, FOCA, metagoofil, mat2
Phone                 | PhoneInfoga
Breach                | HaveIBeenPwned, Dehashed, h8mail, IntelX
Geolocation           | SunCalc, Overpass Turbo, Mapillary, GeoGuessr GPT
Visualization         | Maltego, Gephi, yEd
IoT                   | Shodan, Censys, Thingful, Kamerka
Dark web              | IntelX, DarkOwl, Ahmia
Continuous monitoring | SpiderFoot, recon-ng, custom crons, ShadowDragon

Commercial platforms

The surveys repeatedly reference a cluster of commercial OSINT/threat intel platforms for enterprise use: Maltego, ShadowDragon, Recorded Future, Intel 471, Flashpoint, CloudSEK XVigil, ZeroFox, DarkOwl, SpiderFoot HX, Babel Street, Dataminr, Palantir Gotham. These bundle data access, analyst tooling, and curated feeds at cost. Free alternatives exist for most individual capabilities; the commercial value is integration, freshness, and support.


15. Automation & Visualization

Manual OSINT is unsustainable past a few targets. Automation and visualization amplify the analyst.

Automation patterns

  • Scripts orchestrating free tools — a shell script that runs subfinder → httpx → nuclei → slack notify gives continuous monitoring on a cron
  • Recon-ng workspaces — persistent state across sessions
  • SpiderFoot scans — scheduled or triggered by webhook
  • Custom Python pipelines — requests, beautifulsoup, platform APIs, networkx for graphs
  • Jupyter notebooks — for exploratory analysis with inline visualization

One practitioner-authored pipeline (ODIN) strings together WHOIS, reverse WHOIS, subdomain discovery, DNS records, Shodan, RDAP, email harvesting, breach lookups, paste searches, and bucket hunting into a single run against a target name and primary domain, producing a structured report. The underlying techniques are the ones in this guide; the automation just glues them together.
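
For the cron-driven pattern, the useful output of each run is usually what changed since the last one. The sketch below assumes hostnames have already been collected (for example, as lines of subfinder output) and shows only the comparison step; the function names and the JSON snapshot file are illustrative, not taken from ODIN or any other tool named here.

```python
import json
from pathlib import Path


def normalize(hostnames) -> set[str]:
    """Lowercase, trim, and drop empty entries so runs compare cleanly."""
    return {h.strip().lower().rstrip(".") for h in hostnames if h.strip()}


def diff_assets(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Report hostnames that appeared or disappeared between two runs."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


def update_snapshot(path: Path, hostnames) -> dict[str, list[str]]:
    """Compare this run against the stored snapshot, then overwrite it."""
    previous = set(json.loads(path.read_text())) if path.exists() else set()
    current = normalize(hostnames)
    path.write_text(json.dumps(sorted(current), indent=2))
    return diff_assets(previous, current)
```

A scheduled job would call update_snapshot with the latest tool output and send a notification only when the "added" list is non-empty.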

Visualization

Human eyes excel at spotting patterns in graphs that are invisible in tables.

  • Maltego — the reference tool for investigative link analysis
  • Gephi — open-source network visualization for large graphs
  • yEd — free diagramming with auto-layout for medium graphs
  • Neo4j — graph database for queryable link analysis at scale
  • D3.js / vis.js / cytoscape.js — web-based custom visualizations
  • Kibana / Grafana — dashboards for continuous OSINT feeds
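
Most of these tools can ingest a plain edge list. As a minimal sketch, the function below writes relationships to CSV with Source and Target columns, the layout Gephi's spreadsheet importer expects; the Label column and the sample relationship names are assumptions chosen for illustration.

```python
import csv
import io


def edges_to_csv(edges: list[tuple[str, str, str]]) -> str:
    """Serialize (source, target, label) relationships as an edge-list CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Label"])
    # Deduplicate and sort so repeated runs produce identical files.
    for source, target, label in sorted(set(edges)):
        writer.writerow([source, target, label])
    return buf.getvalue()
```

Writing the returned string to a .csv file gives an artifact that can be imported into a graph tool and also diffed between runs.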

16. AI-Assisted OSINT

Multimodal LLMs have become legitimate OSINT accelerants. They do not replace classic techniques but compress the first-pass analysis dramatically.

Where AI helps

  • Image analysis — a multimodal model can enumerate visible clues (signage, architecture, vegetation, vehicles) and propose geolocations in seconds
  • Document summarization — long PDFs, financial filings, court documents
  • Translation and transliteration — foreign-language sources at scale
  • Entity extraction — pulling structured entities (names, dates, orgs) from unstructured text
  • Writing style analysis — comparing two corpora for likely authorship
  • Code understanding — interpreting obfuscated JS, reverse engineering APIs
  • Query generation — proposing Google dorks, Shodan filters, or Censys queries from natural-language intent

Where AI fails

  • Hallucinated facts — models confidently fabricate names, dates, and attributions
  • Stale training data — nothing past the cutoff
  • Confirmation bias — will happily pretend to “find” what you ask for
  • Source attribution — outputs typically lack provenance

The rule: AI outputs are hypotheses. Every claim must be independently verified against a primary source before it enters a deliverable.
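
One way to enforce that rule in tooling is to make verification a property of the data rather than a habit. The sketch below is an assumption about how a case file might model this; the Claim class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    origin: str  # where the claim came from, e.g. "llm" or "analyst"
    primary_sources: list[str] = field(default_factory=list)

    @property
    def verified(self) -> bool:
        """A claim counts as verified once it cites a primary source."""
        return len(self.primary_sources) > 0


def deliverable_claims(claims: list[Claim]) -> list[Claim]:
    """Keep only claims that are backed by at least one primary source."""
    return [c for c in claims if c.verified]
```

Anything a model proposes enters the case file as an unverified Claim and is filtered out of the report until an analyst attaches a source.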

Specific tools and workflows

  • GeoGuessr GPT and similar custom GPTs — image geolocation first-pass
  • ChatGPT / Claude with vision — general image and document analysis
  • Recon agents — emerging autonomous agents that chain passive recon tools (early stage; reliability is poor)
  • AI-powered dark web monitoring — vendors offering semantic search over crawled forum content
  • AI entity extraction (IBM Watson NLU, spaCy, transformer-based NER) — scalable entity extraction from corpora

Several of the 2026 tool surveys highlight AI-enabled workflows shipping in mainstream platforms (Maltego, SpiderFoot, Hunchly, ShadowDragon) for automated enrichment, pattern detection, and natural-language query translation.


17. Operational Security

OSINT is only passive if you do it right. Sloppy operators leak as much as they collect. Whether you are a defender running recon on your own company, an investigator looking into hostile actors, or a researcher probing sensitive communities, the target should never learn you were looking.

Attribution risks

  • IP address — the target’s analytics and logs capture it
  • User-Agent — fingerprints browser, OS, sometimes tool
  • Account identity — logging into LinkedIn to view a profile attaches your real name
  • Cookies / localStorage — cross-session tracking
  • Referer headers — leaks where you clicked from
  • DNS lookups — your ISP sees every domain you resolve
  • Browser fingerprint — canvas, fonts, screen size, timezone
  • TLS JA3/JA4 — tooling-specific TLS fingerprints
  • Timing patterns — your working hours reveal your timezone
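
The User-Agent risk is easy to demonstrate with scripted collection. Python's standard urllib announces itself as Python-urllib unless told otherwise, which stands out in a target's logs. The sketch below inspects that default and builds a request with an explicit header, without sending anything; the helper names and the browser string are illustrative.

```python
import urllib.request


def default_user_agent() -> str:
    """Return the User-Agent urllib sends when none is set explicitly."""
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)["User-agent"]


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a request with an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Changing the header addresses only one item on the list above: the source IP, the TLS fingerprint, and timing patterns are unaffected.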

Layered defenses

  • Dedicated investigation VM — never mix with personal or work browsing. Keep it disposable (snapshots, revert after every engagement).
  • Separate OS profile or container — at minimum, a segregated browser profile
  • VPN or residential proxy — Mullvad, IVPN, Proton VPN, or a commercial residential proxy for sensitive investigations. Know the provider’s logging policy.
  • Tor — for the most sensitive operations and dark-web access. Never log into personal accounts over Tor.
  • Burner accounts — sock puppets with their own email, phone (VoIP or burner SIM), aged over time, with plausible background activity
  • Hardened browser — Firefox with resist fingerprinting, uBlock Origin, Cookie AutoDelete, NoScript; or Tor Browser; or Brave with strict settings
  • Screenshot and archive tools with opsec-safe settings — Hunchly is purpose-built for investigators and captures every page automatically, with hash verification
  • Separate phone / hardware — for investigations where device fingerprinting matters
  • No personal accounts, ever — a single Google login while “just checking something” burns the entire persona

Sock puppet hygiene

  • Create accounts well in advance; aged accounts draw less suspicion
  • Use non-obvious names; avoid giveaway patterns (sequential usernames, shared avatars)
  • Build plausible activity: followers, posts, reactions over weeks or months
  • Different sock for different investigations — compartmentalize
  • Record credentials and backstory in a secure, central store
  • Never cross-contaminate between sock, work, and personal identities
  • Accept that sock puppets burn — plan for rotation

Hunchly and investigation capture

Hunchly is one of the few tools in the space purpose-built for investigative OSINT capture. It records every page an investigator visits, preserving exact HTML, screenshots, hashes, and a searchable case database. This solves two perennial problems: (1) reproducibility — you can demonstrate exactly what was on the page when you looked, and (2) note-keeping — the tool captures in real time instead of after the fact. For any investigation that may be scrutinized (legal, regulatory, publication), capture-by-default tooling is essential.
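
The core idea behind hash-verified capture can be sketched in a few lines. This is not Hunchly's implementation; it is a minimal illustration, with hypothetical function and field names, of recording a page's URL, capture time, and SHA-256 digest so the stored copy can later be shown to be unaltered.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def capture_record(url: str, content: bytes,
                   captured_at: Optional[datetime] = None) -> dict:
    """Describe a captured page: where, when, and a digest of its bytes."""
    ts = captured_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "captured_at": ts.isoformat(),
        "size_bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def verify_record(record: dict, content: bytes) -> bool:
    """Check that stored content still matches the digest in its record."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]
```

Storing the record separately from the raw bytes means a later mismatch reveals that one of the two was changed.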

Safe data handling

  • Treat collected PII as sensitive from the moment it arrives
  • Encrypt investigation data at rest
  • Scrub workstations between engagements if commingling is a risk
  • Understand your deliverable’s exposure — who will see this report, and does it contain information that could re-identify protected sources?
  • Observe retention limits — delete when no longer needed

18. Legal & Ethical Considerations

OSINT is legal in broad strokes but varied in detail, and ethical only when practiced with judgment.

  • Computer Fraud and Abuse Act (US) and similar — unauthorized access laws. Passive consumption of public data is generally low risk; active probing without authorization is not.
  • GDPR (EU) — applies to processing personal data of people in the EU. Investigators must have a lawful basis; “legitimate interest” often applies but must be documented.
  • CFAA precedent — US courts have held that scraping publicly accessible data is unlikely to violate the CFAA (hiQ v. LinkedIn and progeny), but terms-of-service violations can still create civil exposure, such as breach-of-contract claims.
  • Platform ToS — scraping LinkedIn, Facebook, Instagram commonly violates ToS even if legal. Accounts can be banned; repeat offenders can face lawsuits.
  • Anti-stalking and harassment laws — aggregating public data about an individual can become unlawful harassment depending on intent and jurisdiction.
  • Breach data handling — possessing breach data is often legal, but further use (extorting victims, publishing PII) is not.
  • Export controls — some OSINT tooling is regulated under dual-use export regimes.

Ethical guardrails

  • Purpose test — can you articulate why you need this intelligence and who benefits?
  • Proportionality test — is the depth of collection proportional to the stakes?
  • Harm test — could publishing this information enable stalking, doxing, or physical harm?
  • Consent test — would the subject reasonably expect this information to be collected and used this way?
  • Transparency test — could you defend your methodology openly if challenged?

Investigators routinely face situations where the legal answer and the ethical answer diverge. A finding that is legal to discover may be unethical to publish. A technique that is clearly ethical may be restricted by platform ToS. Practitioners who survive long-term in the field develop judgment, not just skills.

Defender’s perspective

Defenders using these techniques against their own organization are on firm legal ground — you have implicit authorization over your own assets. The real risks are:

  • Accidentally probing a third party — vendors, customers, partners, lookalike domains
  • Storing personal data of employees — even collected from public sources, it falls under privacy law
  • Tipping off attackers — noisy recon against your own infrastructure can alert adversaries that you are looking

19. Quick Reference

The five-minute external exposure check

Run this on your own domain periodically:

# Subdomains
subfinder -d example.com -all -silent | tee subs.txt
cat subs.txt | dnsx -silent | tee live.txt
cat live.txt | httpx -silent -title -tech-detect -status-code

# Certificates
curl -s "https://crt.sh/?q=%25.example.com&output=json" | jq -r '.[].name_value' | sort -u

# Shodan
shodan search "ssl:example.com" --fields ip_str,port,product,version

# Historical URLs
echo "example.com" | waybackurls | sort -u

# Leaked secrets on GitHub
# Manual: https://github.com/search?q=%22example.com%22+password&type=code
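
The crt.sh name_value field can hold several newline-separated names per certificate, including wildcard entries and the occasional email address. As a sketch of post-processing in Python, the function below takes the JSON text returned by the curl command above and normalizes it into a clean hostname list; the function name is hypothetical.

```python
import json


def subdomains_from_crtsh(json_text: str, domain: str) -> list[str]:
    """Extract unique in-scope hostnames from crt.sh JSON output."""
    domain = domain.lower().rstrip(".")
    names = set()
    for entry in json.loads(json_text):
        # One certificate entry may list several names, one per line.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]  # treat wildcard entries as the parent name
            # Keep only the domain itself and true subdomains of it.
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)
```

The suffix check also drops email-style entries and lookalike domains that merely end in the same characters.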

Seed-to-report pivot map

NAME ──┬──▶ search engines ──▶ Wikipedia, personal sites
       ├──▶ LinkedIn ──▶ employer, history
       ├──▶ socials ──▶ aliases ──▶ Sherlock ──▶ more platforms
       └──▶ images ──▶ reverse search ──▶ more accounts

EMAIL ─┬──▶ HIBP ──▶ breach list ──▶ services used
       ├──▶ Hunter.io ──▶ company patterns ──▶ more employees
       ├──▶ Gravatar ──▶ profile image
       ├──▶ Google ──▶ forum posts, paste hits
       └──▶ historical WHOIS ──▶ owned domains

DOMAIN ┬──▶ crt.sh ──▶ subdomains
       ├──▶ subfinder/amass ──▶ more subdomains
       ├──▶ whoxy ──▶ reverse WHOIS ──▶ related domains
       ├──▶ Shodan hostname: ──▶ services
       ├──▶ DNS ──▶ MX/TXT ──▶ vendors
       └──▶ wayback ──▶ historical endpoints

IP ────┬──▶ Shodan/Censys ──▶ services, vulns
       ├──▶ RDAP ──▶ owner, netblock
       ├──▶ reverse DNS ──▶ hostnames
       └──▶ bgp.he.net ──▶ ASN ──▶ more IPs

IMAGE ─┬──▶ Google Lens/Yandex/TinEye ──▶ source
       ├──▶ exiftool ──▶ GPS, camera, timestamp
       ├──▶ AI analysis ──▶ location hypothesis
       └──▶ visual clues ──▶ landmark/sign/architecture

Common Google dorks

site:target.com filetype:pdf
site:target.com ext:doc OR ext:docx OR ext:xls OR ext:xlsx
site:target.com inurl:admin
site:target.com intitle:"index of"
site:target.com "password" OR "confidential"
site:github.com "target.com"
site:linkedin.com/in "Target Corp"
site:pastebin.com "target.com"
site:s3.amazonaws.com target
"@target.com"
intext:"@target.com" site:pastebin.com
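
For repeat engagements, the same dorks can be generated from templates instead of retyped. The sketch below covers a subset of the list above; DORK_TEMPLATES and build_dorks are hypothetical names, and the function only produces query strings without submitting them to any search engine.

```python
DORK_TEMPLATES = (
    "site:{domain} filetype:pdf",
    "site:{domain} inurl:admin",
    'site:{domain} intitle:"index of"',
    'site:github.com "{domain}"',
    'site:pastebin.com "{domain}"',
    '"@{domain}"',
)


def build_dorks(domain: str, templates=DORK_TEMPLATES) -> list[str]:
    """Fill each dork template with the target domain."""
    domain = domain.strip().lower()
    return [template.format(domain=domain) for template in templates]
```

Keeping the templates in one place makes it easy to record exactly which queries were run for a given engagement.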

Common Shodan queries

hostname:target.com
ssl:"target.com"
org:"Target Corp"
port:3389 country:US org:"Target Corp"
http.title:"Login" hostname:target.com
product:"nginx" version:"1.18.0" hostname:target.com
vuln:CVE-2023-1234
has_screenshot:true port:5900
category:ics country:US

Checklist: before declaring recon complete

  • All known domains and subdomains enumerated from at least three passive sources
  • Certificate transparency logs checked for the last 90 days
  • Historical WHOIS reviewed for original/hidden contact data
  • Wayback Machine checked for historical endpoints and scrubbed content
  • Shodan and Censys both queried for hostname and org
  • Cloud bucket namespaces checked (S3, Spaces, Azure, GCS)
  • GitHub/GitLab/Bitbucket searched for leaked secrets and configs
  • Employee emails and usernames harvested
  • Key employees’ breach exposure checked
  • Metadata extracted from published documents
  • DNS records analyzed for third-party vendors (SPF/MX/CNAME)
  • Dangling DNS records screened for takeover potential
  • All findings documented with source URL, timestamp, and confidence level
  • Raw artifacts archived separately from analysis notes
  • Opsec review: no personal accounts touched, no direct target interaction beyond what’s documented
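
The multi-source and confidence items on this checklist can be checked mechanically once results are grouped by source. The sketch below assumes a mapping from source name to the hostnames it reported; the function names and the sample source labels are illustrative.

```python
from collections import defaultdict


def sources_per_host(findings: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each hostname to the set of sources that reported it."""
    seen: dict[str, set[str]] = defaultdict(set)
    for source, hostnames in findings.items():
        for host in hostnames:
            seen[host.strip().lower()].add(source)
    return dict(seen)


def single_source_hosts(findings: dict[str, set[str]]) -> list[str]:
    """List hostnames confirmed by only one source; these need a second look."""
    return sorted(
        host for host, sources in sources_per_host(findings).items()
        if len(sources) < 2
    )
```

Hosts returned by single_source_hosts are candidates for a lower confidence rating until another source confirms them.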

Closing notes

OSINT rewards patience and punishes shortcuts. The tools listed here will all be different in two years — platforms will lock down, APIs will change, services will die, and new ones will appear. What persists is the methodology: ask a clear question, collect broadly, process rigorously, analyze honestly, cite meticulously, and protect the investigation from blowback. Every identifier is a pivot. Every fact needs a source. Every finding needs a second source.

The defender’s version of this guide is the same document read sideways: every technique an attacker can use to map your external footprint is a technique you should be running against yourself, on a schedule, with alerts. The asymmetry between attackers and defenders collapses when defenders start doing their own OSINT first.