Comprehensive OSINT Guide

A practitioner’s reference for Open Source Intelligence — methodology, collection disciplines, tooling, pivoting techniques, and operational security. Compiled from 34 research sources.

Fundamentals
The OSINT Lifecycle
People OSINT (HUMINT/SOCMINT)
Company & Corporate OSINT
Infrastructure & Network OSINT
Domain, DNS & Certificate Intel
Social Media Intelligence
Geolocation & Imagery (GEOINT)
Breach, Leak & Paste Intel
Metadata Extraction
Code & Repository OSINT
Dark Web & Threat Intel
IoT & Device Discovery
Tools Reference
Automation & Visualization
AI-Assisted OSINT
Operational Security
Legal & Ethical Considerations
Quick Reference

1. Fundamentals

Open Source Intelligence (OSINT) is the discipline of collecting, correlating, and analyzing information that is publicly or legally available to produce actionable intelligence. “Open source” does not mean “easy” or “low value” — it means no clandestine collection is involved. The sources are lawful: the skill lies in knowing where to look, how to pivot, and how to assemble fragments into a coherent picture.

Why it matters:

Use case	Practitioners
Adversary reconnaissance	Red teams, pentesters, bug bounty hunters
Attack surface management	Blue teams, security engineers, CISOs
Threat intelligence	SOC analysts, CTI teams, IR responders
Fraud and KYC investigation	Financial crime analysts, compliance
Journalism and research	Investigative reporters, academic researchers
Law enforcement	Missing persons, criminal investigations
Due diligence	M&A, investor research, hiring
Personal self-defense	Privacy audits, stalker detection

Core principles:

Every fact is a pivot. An email address is not an endpoint — it is a seed for breach lookups, social profile enumeration, domain registrations, and search engine dorks.
Triangulate before trusting. Any single source can be wrong, stale, or planted. Cross-reference at least two independent sources before treating a data point as confirmed.
Document as you go. If you cannot reproduce a finding in six months, it did not happen. Screenshot, hash, archive.
Stay passive until you must be active. The default mode is observation. Only escalate to direct interaction when the intelligence you need cannot be harvested from existing public records.
Scope creep kills investigations. Define the question up front and resist chasing shiny tangents unless they directly serve the objective.
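The "screenshot, hash, archive" habit can be sketched in a few lines. This is a minimal illustration rather than a prescribed workflow: the function name record_artifact and the record fields are assumptions made for this example.

```python
import hashlib
from datetime import datetime, timezone


def record_artifact(content: bytes, source_url: str) -> dict:
    """Build a provenance record for a captured artifact (hypothetical schema)."""
    return {
        "source_url": source_url,
        # The SHA-256 digest lets you prove later that the capture is unchanged.
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
        # Always store collection time in UTC to avoid timezone ambiguity.
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    page = b"<html>example capture</html>"
    print(record_artifact(page, "https://example.com/about"))
```

Storing the record next to the raw file keeps every later claim traceable to a specific capture.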

Passive vs. active collection:

	Passive	Active
Contact with target	None — consult third-party data only	Direct queries against target infrastructure
Detection risk	Near zero	Logs, rate limits, WAF alerts
Data freshness	Can be stale (days to years)	Real-time
Examples	crt.sh, Shodan, archive.org, Google dorks	Nmap scan, directory brute-force, HTTP probing
When used	Always first; enumerate scope and context	After passive exhausted, to confirm/expand

The cardinal rule: finish passive recon before touching the target. Anything you can learn from Censys or certificate transparency logs is something you do not need to poke a production server for.

2. The OSINT Lifecycle

Every investigation, whether a two-hour recon sprint or a month-long deep-dive, follows the same phases. Discipline here separates practitioners from tourists.

Phase 1: Planning & Requirements

Write down the question. Who is the target? What decisions will the intelligence support? What is in scope, and what is off-limits? What is the deadline? What format does the deliverable take? Investigations without a defined question wander forever.

Phase 2: Collection

Gather raw data from identified sources. The temptation is to start here — resist it until the plan is clear. Collection spans subdomains, WHOIS records, social profiles, PDF metadata, breach dumps, code repos, certificate logs, historical archives, and more. Keep raw artifacts separate from processed notes.

Phase 3: Processing

Normalize the data. Dedupe subdomains, resolve hostnames to IPs, extract EXIF from images, parse PDFs for authors. Convert everything into a form you can query and pivot against. A messy collection phase dies here.

Phase 4: Analysis

Turn data into intelligence. Correlate findings across sources: the email on the WHOIS record matches the Gravatar on GitHub, which matches a LinkedIn photo, which matches a conference speaker bio. Link analysis tools (Maltego, spreadsheets, link graphs) help surface non-obvious connections.

Phase 5: Dissemination

Deliver findings in the format the consumer expects. A bug bounty report, a pentest recon appendix, an executive briefing, a due diligence memo. Include provenance for every claim — where it came from, when it was collected, and how confident you are.

Phase 6: Feedback

Does the intelligence answer the question? What was missed? What should have been collected sooner? Feed lessons back into Phase 1 for the next engagement.

3. People OSINT (HUMINT/SOCMINT)

People investigations map a target’s digital footprint: identifiers, aliases, affiliations, locations, and relationships. The process is iterative — each fact opens new pivots.

Starting identifiers

Seed	Immediate pivots
Full name	Search engines, LinkedIn, Wikipedia, voter rolls, academic directories
Email	HaveIBeenPwned, Hunter.io, Gravatar, breach dumps, domain WHOIS, Google
Username	Sherlock, WhatsMyName, Namechk, Maigret
Phone number	PhoneInfoga, Truecaller, reverse lookup, Telegram/WhatsApp checks
Profile photo	Reverse image search (Google, Yandex, TinEye, PimEyes)
Employer	LinkedIn, press releases, company filings
Address	Property records, voter rolls, Google Maps Street View

The Maltego-style pivot graph

Treating a person investigation as a graph (nodes = identifiers/entities, edges = “associated with”) prevents losing track of where each fact originated. A typical pivot chain from a Maltego-style workflow:

Name → search page titles, Wikipedia, personal website
Personal website → footer emails, phone numbers, historical WHOIS (DomainTools)
WHOIS email → other domains registered with same email (reverse WHOIS via WhoXY)
Social handles (marc_clotet, marcclotetoficial) → Instagram, Twitter, Facebook profiles
Mutual followers/following → close contacts, private accounts
Affiliated company (mentioned in bio) → corporate registry → officers → other affiliated parties
Historical Hotmail address uncovered via DomainTools → Pipl person search → age, relatives, locations
Phone number → messaging app profile photos, account discovery

Each step is a pivot from a confirmed entity to new related entities. Maltego Transforms automate the individual hops; you can run the same workflow manually with curl, whois, and careful note-taking.
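For the manual route, a small graph structure that stores the source of every hop is enough to answer "how did I get from the seed to this fact?". The sketch below is one possible shape, not Maltego's data model: the class name PivotGraph, the node labels, and the example entities are all hypothetical.

```python
from collections import deque


class PivotGraph:
    """Directed graph of identifiers; each edge remembers where it came from."""

    def __init__(self):
        self.edges = {}  # node -> list of (neighbor, source)

    def add_pivot(self, src, dst, source):
        self.edges.setdefault(src, []).append((dst, source))
        self.edges.setdefault(dst, [])

    def trace(self, seed, target):
        """Breadth-first search; returns the hops from seed to target, or None."""
        parent = {seed: None}
        queue = deque([seed])
        while queue:
            node = queue.popleft()
            if node == target:
                break
            for nxt, source in self.edges.get(node, []):
                if nxt not in parent:
                    parent[nxt] = (node, source)
                    queue.append(nxt)
        if target not in parent:
            return None
        hops = []
        node = target
        while parent[node] is not None:
            prev, source = parent[node]
            hops.append((prev, node, source))
            node = prev
        return list(reversed(hops))


if __name__ == "__main__":
    g = PivotGraph()
    g.add_pivot("name:Jane Doe", "domain:janedoe.example", "search engine")
    g.add_pivot("domain:janedoe.example", "email:jane@example.org", "historical WHOIS")
    for hop in g.trace("name:Jane Doe", "email:jane@example.org"):
        print(hop)
```

Because every edge carries its source, the trace doubles as the provenance trail for the final report.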

Username enumeration

Most people reuse handles across platforms. Tools that check hundreds of sites in parallel:

Sherlock — Python tool checking 300+ social networks for a username
WhatsMyName — web/CLI tool with a community-maintained JSON list of sites
Maigret — fork of Sherlock with richer profile data extraction
Namechk / KnowEm — brand/username availability checkers repurposed for OSINT

A hit on a niche forum is often worth more than another Twitter account — niche forums surface real interests, writing samples, and contact patterns.

Email enumeration and validation

Hunter.io — finds email addresses by domain, infers patterns (first.last@, flast@), verifies deliverability
Email permutator — generates plausible addresses from a name plus domain
HaveIBeenPwned — reveals which breaches an email appears in (reveals services used)
Gravatar — https://gravatar.com/<md5(email)>.json returns profile if registered
Epieos / Holehe — checks dozens of services for account registration without triggering password reset emails
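Two of the items above are simple enough to sketch directly: generating candidate addresses from a name, and building the Gravatar profile URL from the MD5 of the trimmed, lowercased address. The pattern list is an illustrative assumption, not the output of any particular permutator tool, and both function names are hypothetical.

```python
import hashlib


def email_permutations(first, last, domain):
    """Return plausible addresses for a person at a domain (illustrative patterns only)."""
    f, l = first.strip().lower(), last.strip().lower()
    if not f or not l:
        return []
    local_parts = [
        f"{f}.{l}",     # first.last
        f"{f[0]}{l}",   # flast
        f"{f}{l}",      # firstlast
        f,              # first
        f"{l}.{f}",     # last.first
        f"{f}_{l}",     # first_last
    ]
    return [f"{lp}@{domain.lower()}" for lp in local_parts]


def gravatar_profile_url(email):
    """Gravatar hashes the trimmed, lowercased address with MD5."""
    digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
    return f"https://gravatar.com/{digest}.json"


if __name__ == "__main__":
    for candidate in email_permutations("Jane", "Doe", "example.com"):
        print(candidate, gravatar_profile_url(candidate))
```

Requesting the generated URLs is a query against Gravatar, not the target, so it stays on the passive side of the line.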

Phone numbers

PhoneInfoga — country, carrier, line type, breach hits
Truecaller — crowdsourced caller ID; risky (reveals your query to Truecaller)
Messaging apps — adding a number to contacts often reveals the registered profile name and avatar (opsec-heavy; use burner)

Reverse image search

Google Images / Google Lens — best for product, landmark, and Western content
Yandex Images — best for faces and people; still the strongest for facial matches
TinEye — best for finding the original source and earliest occurrence
PimEyes / FaceCheck — facial recognition across the open web (paid, ethically fraught)
Bing Visual Search — decent for products and landmarks

4. Company & Corporate OSINT

Company investigations combine infrastructure recon with corporate filings, personnel mapping, and vendor/technology fingerprinting.

Corporate identity sources

Source	Data
OpenCorporates	Global company registry metadata — officers, addresses, status
SEC EDGAR	US public company filings (10-K, 10-Q, insider transactions)
Companies House (UK)	Officers, filings, beneficial owners
Bureau van Dijk / Orbis	Paid, comprehensive global company intelligence
Dun & Bradstreet	Business credit, corporate family trees
Crunchbase / PitchBook	Funding, investors, board members (paid tiers for depth)
LinkedIn company pages	Headcount, departments, employee list
Full Contact / Clearbit	Enrichment APIs — size, industry, tech stack, key people

Subsidiary and domain discovery

Large companies have sprawling digital footprints. Start from the primary name and expand:

Reverse WHOIS (WhoXY, DomainTools) — find all domains registered to the same name or email. Remember: WhoXY requires exact-string matches, so “Blizzard Entertainment” and “Blizzard Entertainment, Inc” will return different sets.
Trademark search — USPTO, EUIPO filings reveal product codenames and subsidiaries.
Press releases and SEC filings — mention subsidiary names that never appear on the website.
Job postings — often mention internal tool names, cloud providers, and office locations.

Employee enumeration

LinkedIn — site:linkedin.com "Acme Corp" via Google reveals public profiles even without login
Hunter.io / phonebook.cz — bulk email harvest by domain
GitHub — commits by @company.com email addresses expose engineers
Conference talks, CVE credits, paper authorship — lists specialists
RocketReach, Lusha, Apollo — sales tools repurposed for contact discovery (paid)

Technology fingerprinting

Knowing the stack narrows exploitation research later:

BuiltWith / Wappalyzer — web stack detection from rendered HTML and headers
Shodan / Censys — banner grabs reveal server software and versions
DNS records — MX (pphosted.com = Proofpoint), SPF (spf.protection.outlook.com = O365), CNAMEs revealing CDN/CMS
JavaScript bundles — library imports, API endpoints, third-party integrations

5. Infrastructure & Network OSINT

This is where OSINT crosses most directly into security recon. The goal is to enumerate every externally reachable asset and catalog what is running on it — without touching the target.

IP space and ASN

ARIN / RIPE / APNIC / LACNIC / AFRINIC RDAP — WHOIS for IP blocks, netblock ownership
bgp.he.net — AS number lookups, peering relationships, announced prefixes
ipinfo.io / ipdata — enrichment APIs with geoloc, ASN, org
RIPEstat — authoritative routing, abuse contacts, historical data

A company that owns its own ASN signals maturity and gives you a clean IP perimeter. A company entirely on cloud (all AWS/GCP/Azure) means you map their domains back to cloud ranges instead.
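Mapping resolved IPs back to known ranges is a membership test against a list of CIDR blocks. The sketch below uses Python's standard ipaddress module; the labels and the ranges (taken from the reserved documentation blocks) are placeholders, and in practice the list would come from RDAP lookups or a cloud provider's published ranges.

```python
import ipaddress


def owning_range(ip, labeled_ranges):
    """Return the label of the first CIDR block containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for label, cidr in labeled_ranges:
        if addr in ipaddress.ip_network(cidr):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical perimeter: one cloud block and one block announced by the target's ASN.
    ranges = [
        ("cloud-provider", "198.51.100.0/24"),
        ("target-asn", "203.0.113.0/24"),
    ]
    for ip in ("203.0.113.7", "198.51.100.20", "192.0.2.1"):
        print(ip, owning_range(ip, ranges))
```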

Search engines for infrastructure

These are the indispensable tools. None of them touch the target; they query pre-indexed scan data.

Tool	Best for	Free tier
Shodan	Banners, service versions, SCADA/ICS, webcams, IoT, vulnerability filters (`vuln:`)	Limited queries, paid plans unlock filters
Censys	Certificate search, service fingerprinting, precise field queries	250 searches/month
Netlas.io	Domains, IPs, WHOIS, DNS combined; Maltego integration	50 searches/day, 2500 results/month
FOFA	Chinese alternative, strong for APAC infrastructure	Limited
ZoomEye	Another Chinese alternative	Limited
BinaryEdge	Scans, leaked databases, risk scoring	Paid
GreyNoise	Classifies “background noise” IPs to filter scan traffic	Community tier
Hunter.how	Cyberspace search engine	Limited

Shodan query patterns:

hostname:example.com
net:203.0.113.0/24
port:22 country:US
product:"nginx"
vuln:CVE-2021-44228
org:"Acme Corp"
ssl:"example.com"

Censys field queries:

parsed.names: example.com
services.service_name: "HTTP" and location.country: "United States"
services.tls.certificates.leaf_data.names: "*.example.com"
autonomous_system.asn: 13335

Cloud asset discovery

Most modern targets live in public cloud. Mapping cloud assets:

S3 buckets — use awscli or boto3: checks made with any authenticated AWS account surface more than anonymous HTTP probes, because some buckets grant access to any authenticated AWS user. Bucket names are globally unique, so try company-name variants with common prefixes/suffixes: qa, dev, staging, prod, bak, backup, logs, assets, uat, legacy, internal, public, private, docs.
DigitalOcean Spaces — same API shape as S3, separate namespace to enumerate.
Azure blob storage — <name>.blob.core.windows.net
GCS buckets — <name>.storage.googleapis.com
Firebase databases — <name>.firebaseio.com
Dangling CNAMEs — records pointing to deleted cloud resources are ripe for subdomain takeover. The can-i-take-over-xyz repo catalogs the fingerprints.
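Candidate bucket names can be generated offline before any lookup is attempted. The sketch below combines a company name with a subset of the affixes listed above; the separators and the function name bucket_candidates are assumptions for illustration, and the length filter reflects the 3 to 63 character limit on S3 bucket names.

```python
from itertools import product

AFFIXES = ["qa", "dev", "staging", "prod", "bak", "backup", "logs", "assets"]


def bucket_candidates(company, affixes=AFFIXES, separators=("-", "", ".")):
    """Return sorted, de-duplicated bucket name guesses for a company name."""
    base = company.lower().replace(" ", "")
    names = {base}
    for affix, sep in product(affixes, separators):
        names.add(f"{base}{sep}{affix}")  # suffix form, e.g. acme-dev
        names.add(f"{affix}{sep}{base}")  # prefix form, e.g. dev-acme
    # S3 bucket names must be between 3 and 63 characters long.
    return sorted(n for n in names if 3 <= len(n) <= 63)


if __name__ == "__main__":
    for name in bucket_candidates("Acme"):
        print(name)
```

Generating the list is passive; checking whether each name exists is an active step against the cloud provider and should follow the engagement's scope rules.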

Historical data

Wayback Machine (archive.org) — snapshots of old pages, forgotten endpoints, robots.txt evolution, admin panel references
CommonCrawl — bulk web archive suitable for scripted search
SecurityTrails — historical DNS records, WHOIS changes, subdomain discovery
DomainTools — historical WHOIS (the closest thing to a time machine for registration data)
Google Cache — retired by Google during 2024 and rarely available now; fall back to the Wayback Machine when a cached copy is missing

Historical data is often more valuable than current data. A subdomain that vanished last year may still point at a forgotten S3 bucket. An old WHOIS record may contain an admin’s personal email that was scrubbed from the current record.

6. Domain, DNS & Certificate Intel

Domain-level intel is the connective tissue of infrastructure OSINT.

Subdomain enumeration

Passive sources pull from pre-indexed databases — no traffic to the target:

crt.sh — free, no API key required (the public instance can be slow or time out under load); queries Certificate Transparency logs. Every publicly trusted TLS cert is logged, so issuing a cert for hr.example.com is enough to reveal that subdomain even before it goes live.
Certificate Transparency (transparencyreport.google.com) — Google’s CT aggregator
VirusTotal — passive DNS from submissions
SecurityTrails / DNSDumpster / Netcraft — historical DNS aggregators
Subfinder — orchestrates queries against 30+ passive sources
Amass (intel/enum passive mode) — OWASP’s enumeration framework
Assetfinder — tomnomnom’s lightweight passive finder
chaos-client — ProjectDiscovery’s Chaos dataset
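As an example of what these passive sources return, crt.sh exposes a JSON view whose entries carry a newline-separated name_value field. The sketch below separates URL building and parsing from the network call so the parsing can be checked offline; the function names are hypothetical, and the live fetch may need retries given how often the public instance is overloaded.

```python
import json
import urllib.parse
import urllib.request


def crtsh_url(domain):
    """Build the crt.sh JSON query URL; '%' is the wildcard and gets URL-encoded."""
    query = urllib.parse.urlencode({"q": f"%.{domain}", "output": "json"})
    return "https://crt.sh/?" + query


def names_from_crtsh(json_text, domain):
    """Extract unique in-scope hostnames from a crt.sh JSON response."""
    names = set()
    for entry in json.loads(json_text):
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # treat wildcard certs as the parent name
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


def fetch_subdomains(domain, timeout=30):
    """Live lookup; contacts crt.sh only, never the target."""
    with urllib.request.urlopen(crtsh_url(domain), timeout=timeout) as resp:
        return names_from_crtsh(resp.read().decode("utf-8"), domain)
```

The suffix check matters: without it, look-alike names such as evil-example.com that match a loose search would leak into the scope list.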

Active enumeration adds brute-force, permutation, and zone walking:

puredns / shuffledns — mass DNS resolution with wildcards filtered
altdns / dnsgen — permutation wordlists from discovered names
dnsrecon — zone transfers, cache snooping
massdns — raw resolver for large lists

Active validation

Once you have a list of candidate names, resolve and probe them:

dnsx — bulk resolution, filtering by record type
httpx — probes live HTTP(S) services, captures titles, tech stacks, status codes
aquatone — visual recon; screenshots and clusters subdomains by similarity
gowitness / eyewitness — alternative screenshotters

DNS record mining

Beyond A/CNAME, records leak intel:

Record	Intel
MX	Email provider (Google, Microsoft, Proofpoint, Mimecast)
TXT	SPF records list third-party senders (Marketo, Salesforce, Zendesk); DKIM selectors hint at tooling
SRV	Exposes specific services (XMPP, SIP, LDAP)
CAA	Allowed certificate authorities
NS	DNS provider (Route53, Cloudflare, NS1)
SOA	Admin email, zone refresh parameters

Weak or missing SPF/DMARC (e.g. v=DMARC1; p=none) signals exploitable email spoofing potential. DKIMValidator is a classic utility for testing DMARC alignment without interacting with the target infrastructure.
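Once the TXT records have been retrieved (for example with dig or a passive DNS source), reading them is plain string parsing. The sketch below is a simplified reading of the tag syntax, not a full validator: the function names are hypothetical, and a DMARC record is treated as enforcing only when p is quarantine or reject.

```python
def parse_dmarc(record):
    """Split a DMARC TXT record into a dict of lowercased tags."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_is_enforcing(record):
    """True when the policy actually tells receivers to quarantine or reject."""
    tags = parse_dmarc(record)
    if tags.get("v", "").upper() != "DMARC1":
        return False
    return tags.get("p", "none").lower() in {"quarantine", "reject"}


def spf_includes(record):
    """List the third-party sender domains referenced by include: mechanisms."""
    return [
        term.split(":", 1)[1]
        for term in record.split()
        if term.lower().startswith("include:")
    ]


if __name__ == "__main__":
    print(dmarc_is_enforcing("v=DMARC1; p=none"))
    print(spf_includes("v=spf1 include:spf.protection.outlook.com -all"))
```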

WHOIS

Current WHOIS — often redacted under GDPR for individuals, still useful for corporate registrants
Historical WHOIS — DomainTools, WhoisXML API, WhoXY — the unredacted gold
Reverse WHOIS — find all domains sharing a registrant email, name, phone, or organization

7. Social Media Intelligence

SOCMINT is high-signal but high-noise. Treat public social media as a window into a target’s relationships, routines, locations, and interests, and treat closely curated accounts (execs, celebrities) as performative artifacts, not ground truth.

Platform-by-platform quick reference

Platform	High-value intel
LinkedIn	Employer history, skills, internal tool mentions, team maps, location
Twitter/X	Writing style, real-time location, device fingerprints (Twitter for iPhone etc.), interests, connections
Facebook	Family, relationships, check-ins, photos, events, groups
Instagram	Geolocation from photos, friend networks, routines, physical spaces
TikTok	Schedules, location context, behavioral patterns
Reddit	Long-form writing, niche communities, real-world interests
GitHub	Code, commits, emails, working hours, associated accounts
Strava/fitness apps	Routines, home location, military base exposure
Telegram	Phone number → profile → channels joined
Discord	Real-time presence, community affiliations

Techniques

Close contact mapping — mutual followers on Instagram, Facebook friend overlap, Twitter interaction graphs
Temporal analysis — timestamp clusters reveal timezone, sleep schedule, work hours
Linguistic fingerprinting — consistent phrasing across accounts links aliases
Photo OSINT — backgrounds reveal location, device clock shows timezone, reflections leak environment
Story/ephemeral content — archive quickly; gone in 24 hours
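Temporal analysis can be sketched as an hourly histogram of post times plus a search for the quietest stretch, which is a reasonable first guess at sleeping hours. The function names and the seven-hour window are assumptions for illustration; the input must be timezone-aware datetimes, and the result is a hypothesis to corroborate, not a finding.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_histogram(timestamps):
    """Count posts per UTC hour; timestamps must be timezone-aware datetimes."""
    return Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)


def quietest_window(histogram, width=7):
    """Return the UTC start hour of the least active window, wrapping past midnight."""
    def activity(start):
        return sum(histogram.get((start + i) % 24, 0) for i in range(width))
    # Ties resolve to the earliest start hour.
    return min(range(24), key=activity)


if __name__ == "__main__":
    # Synthetic account that posts once an hour from 08:00 to 22:00 UTC.
    posts = [datetime(2024, 1, 1, h, tzinfo=timezone.utc) for h in range(8, 23)]
    hist = hourly_histogram(posts)
    print("quiet window starts at", quietest_window(hist), "UTC")
```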

Tools

Sherlock / WhatsMyName / Maigret — username across platforms
Osintgram — Instagram enumeration (rate-limited; may violate ToS)
Twint / snscrape — Twitter scraping without API (fragile post-API lockdown)
Social Analyzer — API and CLI for social profile discovery
Social-Searcher — keyword monitoring across platforms
Blackbird — username/email search across 500+ sites
IntelX — indexed social content and leaked data

Note: platform APIs and terms have tightened significantly. Many classic scraping tools are in a state of perpetual repair. Always check recency.

8. Geolocation & Imagery (GEOINT)

Determining where a photo, video, or person is located from visual evidence.

Classic technique stack

Shadow analysis — shadow length and direction give the sun’s elevation and azimuth, which constrain latitude and time of day (SunCalc, suncalc.org)
Landmark identification — monuments, logos, business signage
Language and script — signage language narrows region
Vegetation — tree species and agriculture indicate climate zone
Vehicle makes and license plate formats — country/region disambiguation
Electrical plug shapes and pole construction — power grid standards vary by region
Road markings — lane widths, stripe patterns, sign shapes (MUTCD vs. Vienna Convention)
Architecture — roofing styles, window frames, construction materials
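The arithmetic behind shadow analysis is basic trigonometry: the sun's elevation is the arctangent of object height over shadow length, measured on level ground. The sketch below assumes both lengths are estimated in the same unit from the image; the function name is hypothetical, and the resulting angle is what you would compare against a sun-position tool for candidate places and times.

```python
import math


def sun_elevation_deg(object_height, shadow_length):
    """Sun elevation above the horizon, in degrees, from a vertical object's shadow."""
    return math.degrees(math.atan2(object_height, shadow_length))


if __name__ == "__main__":
    # A 2 m signpost casting a 3.5 m shadow (both estimated from the photo).
    print(round(sun_elevation_deg(2.0, 3.5), 1))
```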

Reverse image search

Run the same image through all of these — coverage varies wildly:

Google Images / Lens
Yandex Images — still the best for Russian/Eastern European and general face matching
TinEye — best for finding originals and earliest occurrences
Bing Visual Search
Baidu — better for Chinese content

AI-assisted geolocation

Modern multimodal models can synthesize the classic technique stack in seconds. The Hackers Arise walkthrough demonstrates using custom GPTs like GeoGuessr GPT for first-pass geolocation: upload an image, ask where it was taken. These models do not do reverse image search — they visually reason over architectural, vegetation, and signage cues. They are often wrong on specifics but provide a valuable starting framework of observations (“the road signs suggest Cyrillic-script Eastern Europe; the utility pole style matches post-Soviet construction”).

Practitioners should treat AI guesses as hypotheses, not conclusions, and verify every claim against ground-truth imagery.

Mapping and imagery sources

Google Maps / Earth Pro — Street View, historical imagery, 3D buildings
Yandex Maps / Mapillary / KartaView — alternative street-level imagery, stronger coverage in some regions
Sentinel Hub / EO Browser — free satellite imagery (Sentinel-2, Landsat)
Planet Labs — commercial high-cadence satellite imagery
OpenStreetMap — community mapping with extractable POI data
Overpass Turbo — query OSM for arbitrary features (e.g. “all churches in this bounding box with a spire over 30m”)
Wikimapia — crowd-sourced photo-annotated POI database
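Overpass queries are short text programs, so they are easy to template. The sketch below builds a simple tag-in-bounding-box query in Overpass QL; it is deliberately less specific than the spire example above, the function name is hypothetical, and the bounding box order (south, west, north, east) follows the Overpass convention.

```python
def overpass_bbox_query(key, value, south, west, north, east, timeout=25):
    """Build an Overpass QL query for nodes, ways and relations with key=value in a box."""
    bbox = f"({south},{west},{north},{east})"
    return (
        f"[out:json][timeout:{timeout}];"
        f'nwr["{key}"="{value}"]{bbox};'
        "out center;"
    )


if __name__ == "__main__":
    # Places of worship in a small box over central Paris (coordinates are illustrative).
    print(overpass_bbox_query("amenity", "place_of_worship", 48.85, 2.29, 48.87, 2.31))
```

The printed query can be pasted into Overpass Turbo as-is, which keeps the geographic filter reproducible in the case notes.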

Video and live stream OSINT

EarthCam / Insecam — aggregated public webcams (many unintentional)
Windy.com — live webcams for weather
YouTube geosearch tools — find videos shot within a geographic radius
FlightRadar24 / ADS-B Exchange — real-time civilian aircraft tracking
MarineTraffic / VesselFinder — real-time ship AIS data
RailSense / similar — train tracking by region

EXIF and video metadata

Photos embed GPS coordinates when the camera or phone has location tagging enabled, and the coordinates stay in the file unless stripped. Most social platforms strip EXIF on upload, but platforms that preserve it (Flickr, some forums, raw email attachments) can hand-deliver the answer. Tools: exiftool, ExifTool Online, Jeffrey’s Image Metadata Viewer.
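EXIF stores GPS positions as degrees, minutes, and seconds plus a hemisphere reference, while most mapping tools expect signed decimal degrees. A minimal conversion sketch follows; the function name is hypothetical and the input values are assumed to have been read already with a tool such as exiftool.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative.
    return -value if ref.upper() in ("S", "W") else value


if __name__ == "__main__":
    lat = dms_to_decimal(48, 51, 29.6, "N")
    lon = dms_to_decimal(2, 17, 40.2, "E")
    print(f"{lat:.6f},{lon:.6f}")
```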

9. Breach, Leak & Paste Intel

Credentials, PII, and internal data exposed through historical breaches and paste sites are a cornerstone of offensive OSINT.

Breach lookup services

HaveIBeenPwned — free, non-commercial use; reveals which breaches an email appears in. Also exposes pastes containing the email. Pwned Passwords lets you check whether a specific password has been seen in any breach without sending the password (k-anonymity via SHA-1 prefix).
Dehashed — paid, searchable index of actual credential content
IntelligenceX / IntelX — indexed breach and leak content, darknet sources
LeakCheck / Snusbase / LeakPeek — commercial breach databases
Breach-parse / h8mail — local tools for searching personal breach archives
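The k-anonymity scheme behind Pwned Passwords works by sending only the first five hex characters of the SHA-1 hash and matching the remaining 35 locally against the returned SUFFIX:COUNT lines. The sketch below separates hashing and response parsing from the network request; the helper names are hypothetical, and the sample response in the usage is made up.

```python
import hashlib


def split_sha1(password):
    """Return (prefix, suffix): 5 hex chars to send, 35 hex chars kept local."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def range_url(prefix):
    """URL of the range endpoint for a 5-character hash prefix."""
    return f"https://api.pwnedpasswords.com/range/{prefix}"


def count_in_range(range_response, suffix):
    """Find our suffix in the 'SUFFIX:COUNT' lines returned for the prefix."""
    for line in range_response.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


if __name__ == "__main__":
    prefix, suffix = split_sha1("password")
    print(range_url(prefix))
    # Fabricated response body, for illustration only.
    fake_body = f"0018A45C4D1DEF81644B54AB7F969B88D65:1\r\n{suffix}:42"
    print(count_in_range(fake_body, suffix))
```

Only the prefix ever leaves the machine, which is why the check is safe to run against real candidate passwords.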

Operational notes

HIBP tells you an email was in Collection #1, but not the password. Commercial services provide the cleartext, if ethically acceptable for your engagement.
“Sensitive” breach flags (Ashley Madison, etc.) require judgment — referencing them in a client deliverable is frequently inappropriate even when technically accurate.
Breach data ages: a password from 2013 is probably not current, but hints at password patterns and reveals services the user has engaged with.
Pastes live and die quickly. If a paste URL 404s, check the Wayback Machine and any surviving search-engine cache immediately.

Paste sites and dumps

Pastebin — classic source, still productive
Ghostbin / Hastebin / Rentry / PrivateBin — newer alternatives
GitHub Gist — frequently overlooked; indexed by Google (site:gist.github.com)
Telegram channels — many dump channels operate exclusively on Telegram
Darknet forums — BreachForums, XSS, Exploit — require careful opsec

10. Metadata Extraction

Documents, images, and files published by a target frequently leak internal usernames, software versions, file paths, and timestamps.

Document metadata

exiftool — the canonical CLI tool; handles EXIF, XMP, IPTC, PDF metadata, Office documents
FOCA (Fingerprinting Organizations with Collected Archives) — downloads documents from a target domain, extracts metadata in bulk, builds org charts from author fields
metagoofil — FOCA-alike in Python, uses Google/Bing to find documents by filetype on a target domain
PDFiD / peepdf — PDF internals inspection
oletools — OLE/Office document internals
mat2 — metadata anonymization tool; useful for understanding what it strips and therefore what is leaked

Google dorks for document hunts

site:example.com filetype:pdf
site:example.com filetype:xlsx
site:example.com filetype:docx
site:example.com ext:doc OR ext:docx OR ext:xls OR ext:xlsx
site:example.com "for internal use only"

What metadata reveals

Field	Leak
Author	Internal username (often the domain login)
Creation software	Microsoft Office 2016, LibreOffice 7.4 — software inventory
Last modified by	Another internal user
Printer	Printer model and possibly IP
Revision history	Earlier drafts, collaborators
Embedded images	Secondary EXIF data
Hyperlinks	Internal SharePoint/intranet URLs
File paths	`C:\Users\jdoe\Documents\...` reveals username
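Harvesting usernames from leaked file paths is a small regex job once the metadata strings have been extracted. The sketch below covers Windows, Linux, and macOS home directory layouts; the pattern and function name are assumptions for illustration, and the sample paths are invented.

```python
import re

# Matches C:\Users\<name>\..., /home/<name>/... and /Users/<name>/...
USER_PATH = re.compile(r"(?:[A-Za-z]:\\Users\\|/home/|/Users/)([^\\/]+)")


def usernames_from_paths(strings):
    """Return sorted unique usernames found in metadata strings."""
    found = set()
    for value in strings:
        found.update(USER_PATH.findall(value))
    return sorted(found)


if __name__ == "__main__":
    metadata_values = [
        r"C:\Users\jdoe\Documents\report.docx",
        "/home/asmith/exports/q3.pdf",
        "Microsoft Office 2016",
    ]
    print(usernames_from_paths(metadata_values))
```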

11. Code & Repository OSINT

Source code hosting platforms are a gold mine. Every commit is a historical record, and secrets leak constantly.

GitHub search techniques

Surface leaked credentials and sensitive content:

org:acmecorp password
org:acmecorp apikey
"@acmecorp.com" password
filename:.env acmecorp
filename:config.yml acmecorp
"BEGIN RSA PRIVATE KEY" acmecorp
extension:sql acmecorp INSERT INTO users

Note that GitHub’s secret scanning notifies partner providers, many of which revoke exposed tokens automatically, so old dumps often contain stale credentials. They are still useful for mapping which services are in use.

Tools

gitleaks — scans repos and Git history for secret patterns
trufflehog — entropy-based secret detection, supports GitHub org scanning
git-secrets — AWS Labs tool; primarily for preventing commits but usable for audit
gitrob — catalogs secrets across an organization’s public repos
github-dorks / gh-dork — curated dork lists
GitHound / GitMiner — deep search across public GitHub
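The two detection ideas these tools rely on, known-format patterns and high-entropy strings, can be sketched briefly. This is a toy illustration and not how any of the listed tools is implemented: the two patterns are widely published formats, the function names are hypothetical, and real scanners ship far larger rule sets.

```python
import math
import re
from collections import Counter

PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def shannon_entropy(text):
    """Bits of entropy per character; random tokens score higher than words."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def scan_text(text):
    """Return (line_number, label) for every pattern hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
    sample = "aws_key = AKIAIOSFODNN7EXAMPLE\nname = demo"
    print(scan_text(sample))
    print(round(shannon_entropy("AKIAIOSFODNN7EXAMPLE"), 2))
```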

Pivoting from a single repo

Commit metadata — author email, name, timestamps (working hours)
.github/CODEOWNERS — team structure
Issue comments — internal tool names, vendors, ticket systems
PR reviewers — collaboration networks
Starred/forked repos — interests, technology exposure
GitHub Pages — hosted sites under <user>.github.io often have separate content

Beyond GitHub

GitLab.com — same techniques, smaller dork coverage
Bitbucket — less searchable but still scannable
Self-hosted instances — Gitea, Forgejo, cgit — find via Shodan (http.title:"Gitea")
DockerHub — images often ship with embedded secrets or leaked file paths
npm / PyPI / crates.io — package authors, private package mentions in public packages

12. Dark Web & Threat Intel

Aggregators and commercial platforms fold darknet content, malware telemetry, and threat actor intelligence into the OSINT pipeline.

Platforms

Intel 471 — cybercriminal forum and actor intelligence
Recorded Future — broad threat intel with OSINT and closed-source blend
CloudSEK (XVigil) — external threat monitoring, brand exposure, dark web
Flashpoint — illicit community monitoring
DarkOwl — darknet content search
ShadowDragon (SocialNet, etc.) — investigative toolkits with 200+ data sources integrated
ZeroFox — brand protection, social and dark web
Digital Shadows / ReliaQuest — digital risk protection
Maltego + Transform Hub — glue for integrating many of the above

Threat intel feeds

MISP — open-source threat intelligence sharing platform
AlienVault OTX — free community threat exchange
abuse.ch (URLhaus, MalwareBazaar, ThreatFox, Feodo Tracker) — free high-quality IoC feeds
VirusTotal Intelligence — paid search over submitted samples, URLs, domains
GreyNoise — distinguishes targeted scans from internet background noise

13. IoT & Device Discovery

Specialized search for internet-connected devices and sensors, from industrial control systems to smart home devices.

Shodan — still the best for ICS/SCADA (port:502, port:102, category:ics)
Censys — complementary coverage
ZoomEye — strong APAC IoT coverage
Thingful — the “search engine for the Internet of Things” — aggregates public IoT sensor data (air quality, weather, energy, transport) across millions of devices globally, suitable for environmental research and urban analytics
Kamerka — geolocation-focused ICS/IoT scanner using Shodan/BinaryEdge data
Insecam — lists public webcams (many with default credentials)

These tools are powerful for researchers mapping exposure and for defenders cataloging their own attack surface. They are equally abused by attackers — defenders should track their own presence in them.

14. Tools Reference

A consolidated lookup of the tools practitioners reach for. The overlap between “OSINT tool” and “recon tool” is large; most of these appear repeatedly in the source surveys.

Frameworks and aggregators

Tool	Purpose
Maltego	Graph-based link analysis, Transform Hub with 70+ data sources, the standard for investigations that must produce a visual link chart
SpiderFoot	Automated OSINT framework, 200+ modules, web UI, runs scheduled scans, correlates findings
recon-ng	Framework with Metasploit-style module system for recon workflows
theHarvester	Email, subdomain, employee name enumeration from search engines and PGP servers
OSINT Framework (osintframework.com)	Curated web directory of tools by category; not a scanner, but the best starting map of the ecosystem
IntelTechniques (OSINT Techniques)	Michael Bazzell’s methodology and tool collection

Maltego

Model: graph of entities (Person, Domain, IP, Email, etc.) connected by relationships. Transforms run against an entity to produce related entities.
Data sources: the Transform Hub integrates DomainTools, Shodan, Pipl, OpenCorporates, Censys, Have I Been Pwned, Vetric, Netlas, IBM Watson, and many more. Many are paid.
Use cases: person of interest investigations, corporate link analysis, threat actor attribution, fraud networks.
Typical workflow: seed with names, domains, or emails → run passive Transforms → pivot on interesting results → prune noise → export as report or visual graph. A complete person investigation can move from a name to Wikipedia to personal website to historical WHOIS to personal email to person profile (age, relatives) in a handful of Transform runs.

SpiderFoot

Model: modular scanner with 200+ modules, each tapping a specific data source. Configure the target and scan profile, run, review.
Data sources: Shodan, VirusTotal, HIBP, SecurityTrails, HackerTarget, crt.sh, Censys, IntelX, and many more (some require API keys).
Use cases: baseline external exposure audit, continuous monitoring, bug bounty asset discovery, threat investigation.
Strengths: fire-and-forget automation, depth of coverage, built-in correlation rules that highlight interesting findings across modules.

theHarvester

Model: CLI tool that queries search engines, DNS sources, and PGP key servers for emails, subdomains, IPs, and employee names.
Sources: Google, Bing, DuckDuckGo, LinkedIn, Baidu, crt.sh, Shodan, Censys, and many more.
Typical invocation: theHarvester -d example.com -l 500 -b all
Strengths: simple, scriptable, pairs well with automation pipelines.

recon-ng

Model: Metasploit-style framework (workspaces, modules, options). Modules fetch specific data types into a workspace database.
Strengths: good persistence of results across sessions, scriptable, reasonable module coverage for core recon tasks.
Typical flow: workspaces create acme → add seed domains → run recon/domains-hosts/* modules → export.

Sherlock

Purpose: username enumeration across 300+ social sites. Typical invocation: sherlock jdoe (or python3 sherlock jdoe from an older source checkout).
Strengths: fast, easy, no API keys. Good for alias discovery.
Caveats: false positives on generic 200 responses; validate manually.

Shodan

Purpose: search engine over internet-connected service banners. Queries scan data, not live services.
Filters: port:, product:, version:, org:, hostname:, country:, vuln:, category:, ssl:, http.title:, http.html:
CLI: shodan host 1.2.3.4, shodan search 'apache country:US', shodan download, shodan parse
Best for: attack surface snapshots, finding forgotten assets, identifying vulnerable software at scale.

Censys

Purpose: internet-wide scan data with particular strength in TLS certificates and precise field queries.
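Query language: Lucene-style with parsed fields, for example services.service_name: "HTTP" and services.tls.certificates.leaf_data.names: example.com against the host index; parsed.names belongs to the certificate index and cannot be mixed into a host query.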
Strengths: certificate history, subdomain discovery via cert parsed names, strong API.
Free tier: 250 web searches/month; API access requires a paid plan.

Specialized tools referenced across the surveys

Category	Tools
Subdomain enum	Subfinder, Amass, Assetfinder, chaos-client, Findomain, Sublist3r
HTTP probing	httpx, aquatone, gowitness, EyeWitness
URL discovery	waybackurls, gau, katana, hakrawler, gospider
Port scanning	Nmap, Masscan, RustScan, naabu
Content discovery	ffuf, gobuster, feroxbuster, dirsearch
Email hunting	Hunter.io, theHarvester, phonebook.cz, Clearbit, Skymem
Username hunting	Sherlock, WhatsMyName, Maigret, Namechk, Holehe
Image search	Google Lens, Yandex, TinEye, PimEyes
Metadata	exiftool, FOCA, metagoofil, mat2
Phone	PhoneInfoga
Breach	HaveIBeenPwned, Dehashed, h8mail, IntelX
Geolocation	SunCalc, Overpass Turbo, Mapillary, GeoGuessr GPT
Visualization	Maltego, Gephi, yEd
IoT	Shodan, Censys, Thingful, Kamerka
Dark web	IntelX, DarkOwl, Ahmia
Continuous monitoring	SpiderFoot, Recon-ng, custom crons, ShadowDragon

Commercial platforms

The surveys repeatedly reference a cluster of commercial OSINT/threat intel platforms for enterprise use: Maltego, ShadowDragon, Recorded Future, Intel 471, Flashpoint, CloudSEK XVigil, ZeroFox, DarkOwl, SpiderFoot HX, Babel Street, Dataminr, Palantir Gotham. These bundle data access, analyst tooling, and curated feeds at cost. Free alternatives exist for most individual capabilities; the commercial value is integration, freshness, and support.

15. Automation & Visualization

Manual OSINT is unsustainable past a few targets. Automation and visualization amplify the analyst.

Automation patterns

Scripts orchestrating free tools — a shell script that runs subfinder → httpx → nuclei → slack notify gives continuous monitoring on a cron
Recon-ng workspaces — persistent state across sessions
SpiderFoot scans — scheduled or triggered by webhook
Custom Python pipelines — requests, beautifulsoup, platform APIs, networkx for graphs
Jupyter notebooks — for exploratory analysis with inline visualization

One practitioner-authored pipeline (ODIN) strings together WHOIS, reverse WHOIS, subdomain discovery, DNS records, Shodan, RDAP, email harvesting, breach lookups, paste searches, and bucket hunting into a single run against a target name and primary domain, producing a structured report. The underlying techniques are the ones in this guide; the automation just glues them together.
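
Cron-driven monitoring only pays off when each run is compared with the previous one, so that the alert contains what changed rather than the full inventory. The sketch below shows that comparison step on its own, assuming each run writes one hostname per line to a text file (as subfinder does with tee). The function names are hypothetical and the notification step is left out.

```python
from pathlib import Path


def load_hosts(path: Path) -> set[str]:
    """Read one hostname per line, lowercased and without trailing dots."""
    if not path.exists():
        return set()  # first run: no previous snapshot yet
    return {
        line.strip().lower().rstrip(".")
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    }


def diff_snapshots(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Return hosts that appeared or disappeared since the previous run."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }
```

A wrapper script would call diff_snapshots(load_hosts(yesterday), load_hosts(today)) and send only the "added" list to the alerting channel.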

Visualization

Human eyes excel at spotting patterns in graphs that are invisible in tables.

Maltego — the reference tool for investigative link analysis
Gephi — open-source network visualization for large graphs
yEd — free diagramming with auto-layout for medium graphs
Neo4j — graph database for queryable link analysis at scale
D3.js / vis.js / cytoscape.js — web-based custom visualizations
Kibana / Grafana — dashboards for continuous OSINT feeds
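
Most of these tools accept a plain edge list, so findings can be moved from notes into a graph with very little code. The following sketch, using only the standard library, writes (source, target) pairs as a CSV with Source and Target columns, a layout Gephi's spreadsheet importer accepts, and computes node degree as a first hint at which entities deserve attention. The sample identifiers are invented.

```python
import csv
import io
from collections import Counter


def edges_to_csv(edges: list[tuple[str, str]]) -> str:
    """Serialise (source, target) pairs as a Source,Target edge list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()


def degree(edges: list[tuple[str, str]]) -> Counter:
    """Count how many edges touch each node; high-degree nodes are natural pivots."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


# Invented sample data for illustration only.
sample_edges = [
    ("jdoe@example.com", "example.com"),
    ("jdoe@example.com", "github.com/jdoe"),
    ("example.com", "192.0.2.10"),
]
```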

16. AI-Assisted OSINT

Multimodal LLMs have become legitimate OSINT accelerants. They do not replace classic techniques but compress the first-pass analysis dramatically.

Where AI helps

Image analysis — a multimodal model can enumerate visible clues (signage, architecture, vegetation, vehicles) and propose geolocations in seconds
Document summarization — long PDFs, financial filings, court documents
Translation and transliteration — foreign-language sources at scale
Link extraction — pulling structured entities (names, dates, orgs) from unstructured text
Writing style analysis — comparing two corpora for likely authorship
Code understanding — interpreting obfuscated JS, reverse engineering APIs
Query generation — proposing Google dorks, Shodan filters, or Censys queries from natural-language intent

Where AI fails

Hallucinated facts — models confidently fabricate names, dates, and attributions
Stale training data — no knowledge past the training cutoff unless the model is given live search or retrieval
Confirmation bias — models tend to “find” whatever the prompt suggests is there
Source attribution — outputs typically lack provenance

The rule: AI outputs are hypotheses. Every claim must be independently verified against a primary source before it enters a deliverable.
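
One cheap first gate for that rule is to check that every entity a model claims to have extracted actually appears in the source document. The sketch below is deliberately naive (exact substring match after whitespace and case normalisation) and the function names are hypothetical. Passing it does not verify a claim; failing it flags a likely hallucination.

```python
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_entities(claimed: list[str], source_text: str) -> list[str]:
    """Return model-claimed entities that never appear verbatim in the source text."""
    haystack = normalise(source_text)
    return [entity for entity in claimed if normalise(entity) not in haystack]
```

Anything this returns should go back to the analyst for manual review before it reaches a deliverable.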

Specific tools and workflows

GeoGuessr GPT and similar custom GPTs — image geolocation first-pass
ChatGPT / Claude with vision — general image and document analysis
Recon agents — emerging autonomous agents that chain passive recon tools (early stage; reliability is poor)
AI-powered dark web monitoring — vendors offering semantic search over crawled forum content
AI entity extraction (IBM Watson NLU, spaCy, transformer-based NER) — scalable entity extraction from corpora

Several of the 2026 tool surveys highlight AI-enabled workflows shipping in mainstream platforms (Maltego, SpiderFoot, Hunchly, ShadowDragon) for automated enrichment, pattern detection, and natural-language query translation.

17. Operational Security

OSINT is only passive if you do it right. Sloppy operators leak as much as they collect. Whether you are a defender running recon on your own company, an investigator looking into hostile actors, or a researcher probing sensitive communities, the target should never learn you were looking.

Attribution risks

IP address — the target’s analytics and logs capture it
User-Agent — fingerprints browser, OS, sometimes tool
Account identity — logging into LinkedIn to view a profile attaches your real name
Cookies / localStorage — cross-session tracking
Referer headers — leaks where you clicked from
DNS lookups — your resolver (usually your ISP, unless you use a VPN or encrypted DNS) sees every domain you resolve
Browser fingerprint — canvas, fonts, screen size, timezone
TLS JA3/JA4 — tooling-specific TLS fingerprints
Timing patterns — your working hours reveal your timezone
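
The timing risk is easy to underestimate, so it helps to see how little data the inference needs. The sketch below guesses a whole-hour UTC offset from a list of activity timestamps by finding the offset that places the most events inside a 09:00 to 17:00 local workday. The workday window is an assumption, the function name is hypothetical, and timestamps must be timezone-aware.

```python
from collections import Counter
from datetime import datetime, timezone


def likely_utc_offset(timestamps: list[datetime], start: int = 9, end: int = 17) -> int:
    """Return the whole-hour UTC offset that puts the most events in local working hours.

    Assumes activity clusters in a start..end workday, which is a guess, not a fact.
    Ties resolve to the smallest offset.
    """
    hours = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)

    def score(offset: int) -> int:
        # Number of events whose local hour falls inside the assumed workday.
        return sum(n for hour, n in hours.items() if start <= (hour + offset) % 24 < end)

    return max(range(-12, 15), key=score)
```

Run against your own request logs, this shows what a target could infer from when you browse; scheduling collection at varied times reduces that signal.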

Layered defenses

Dedicated investigation VM — never mix with personal or work browsing. Keep it disposable (snapshots, revert after every engagement).
Separate OS profile or container — at minimum, a segregated browser profile
VPN or residential proxy — Mullvad, IVPN, Proton VPN, or a commercial residential proxy for sensitive investigations. Know the provider’s logging policy.
Tor — for the most sensitive operations and dark-web access. Never log into personal accounts over Tor.
Burner accounts — sock puppets with their own email, phone (VoIP or burner SIM), aged over time, with plausible background activity
Hardened browser — Firefox with privacy.resistFingerprinting enabled, uBlock Origin, Cookie AutoDelete, NoScript; or Tor Browser; or Brave with strict settings
Screenshot and archive tools with opsec-safe settings — Hunchly is purpose-built for investigators and captures every page automatically, with hash verification
Separate phone / hardware — for investigations where device fingerprinting matters
No personal accounts, ever — a single Google login while “just checking something” burns the entire persona

Sock puppet hygiene

Create accounts well in advance; aged accounts draw less suspicion
Use non-obvious names; avoid giveaway patterns (sequential usernames, shared avatars)
Build plausible activity: followers, posts, reactions over weeks or months
Different sock for different investigations — compartmentalize
Record credentials and backstory in a secure, central store
Never cross-contaminate between sock, work, and personal identities
Accept that sock puppets burn — plan for rotation

Hunchly and investigation capture

Hunchly is one of the few tools in the space purpose-built for investigative OSINT capture. It records every page an investigator visits, preserving exact HTML, screenshots, hashes, and a searchable case database. This solves two perennial problems: (1) reproducibility — you can demonstrate exactly what was on the page when you looked, and (2) note-keeping — the tool captures in real time instead of after the fact. For any investigation that may be scrutinized (legal, regulatory, publication), capture-by-default tooling is essential.
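
Where dedicated capture tooling is not available, the reproducibility half of the problem can be approximated by hashing every saved artifact at capture time. This is a minimal sketch of such a manifest entry; it is not Hunchly's format, and the field names are assumptions chosen for illustration.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def manifest_entry(url: str, content: bytes,
                   captured_at: Optional[datetime] = None) -> dict:
    """Describe one captured artifact: where it came from, when, and its SHA-256."""
    captured_at = captured_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "captured_at": captured_at.isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "size": len(content),
    }


def verify(entry: dict, content: bytes) -> bool:
    """True if the stored bytes still match the hash recorded at capture time."""
    return hashlib.sha256(content).hexdigest() == entry["sha256"]
```

Appending each entry to a write-once log kept apart from the artifacts makes later tampering or accidental edits detectable.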

Safe data handling

Treat collected PII as sensitive from the moment it arrives
Encrypt investigation data at rest
Scrub workstations between engagements if commingling is a risk
Understand your deliverable’s exposure — who will see this report, and does it contain information that could re-identify protected sources?
Observe retention limits — delete when no longer needed

18. Legal & Ethical Considerations

OSINT is legal in broad strokes but varied in detail, and ethical only when practiced with judgment.

Legal surface area

Computer Fraud and Abuse Act (US) and similar — unauthorized access laws. Passive consumption of public data is generally low-risk; active probing without authorization is not.
GDPR (EU) — applies to processing personal data of EU residents. Investigators must have a lawful basis; “legitimate interest” often applies but must be documented.
CFAA precedent — US courts have held that scraping publicly accessible pages is unlikely to violate the CFAA (hiQ v. LinkedIn and progeny), but terms-of-service violations can still create civil exposure, including breach-of-contract claims.
Platform ToS — scraping LinkedIn, Facebook, Instagram commonly violates ToS even if legal. Accounts can be banned; repeat offenders can face lawsuits.
Anti-stalking and harassment laws — aggregating public data about an individual can become unlawful harassment depending on intent and jurisdiction.
Breach data handling — possessing breach data is often legal, but further use (extorting victims, publishing PII) is not.
Export controls — some OSINT tooling is regulated under dual-use export regimes.

Ethical guardrails

Purpose test — can you articulate why you need this intelligence and who benefits?
Proportionality test — is the depth of collection proportional to the stakes?
Harm test — could publishing this information enable stalking, doxing, or physical harm?
Consent test — would the subject reasonably expect this information to be collected and used this way?
Transparency test — could you defend your methodology openly if challenged?

Investigators routinely face situations where the legal answer and the ethical answer diverge. A finding that is legal to discover may be unethical to publish. A technique that is clearly ethical may be restricted by platform ToS. Practitioners who survive long-term in the field develop judgment, not just skills.

Defender’s perspective

Defenders using these techniques against their own organization are on firm legal ground — you have implicit authorization over your own assets. The real risks are:

Accidentally probing a third party — vendors, customers, partners, lookalike domains
Storing personal data of employees — even collected from public sources, it falls under privacy law
Tipping off attackers — noisy recon against your own infrastructure can alert adversaries that you are looking
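
The first risk is the easiest to guard against in code: filter every candidate hostname against an explicit list of owned domains before anything active touches it. The sketch below uses suffix matching on a dot boundary so that lookalikes do not pass; the function names are hypothetical. A subdomain that points at a vendor's infrastructure still passes this check, so it reduces the risk without removing it.

```python
def in_scope(hostname: str, owned_domains: set[str]) -> bool:
    """True if hostname equals an owned domain or is a subdomain of one."""
    host = hostname.strip().lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in owned_domains)


def split_by_scope(hostnames: list[str],
                   owned_domains: set[str]) -> tuple[list[str], list[str]]:
    """Partition candidates into (in_scope, out_of_scope), preserving input order."""
    inside, outside = [], []
    for hostname in hostnames:
        (inside if in_scope(hostname, owned_domains) else outside).append(hostname)
    return inside, outside
```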

19. Quick Reference

The five-minute external exposure check

Run this on your own domain periodically:

# Subdomains
subfinder -d example.com -all -silent | tee subs.txt
cat subs.txt | dnsx -silent | tee live.txt
cat live.txt | httpx -silent -title -tech-detect -status-code

# Certificates
curl -s "https://crt.sh/?q=%25.example.com&output=json" | jq -r '.[].name_value' | sort -u

# Shodan
shodan search "ssl:example.com" --fields ip_str,port,product,version

# Historical URLs
echo "example.com" | waybackurls | sort -u

# Leaked secrets on GitHub
# Manual: https://github.com/search?q=%22example.com%22+password&type=code
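
The crt.sh output above usually needs cleaning before it is useful: a single name_value can hold several newline-separated names, wildcard entries appear with a leading *., and unrelated names occasionally show up. A small post-processing sketch follows, assuming the JSON body has already been saved or fetched; names_from_crtsh is a hypothetical helper, not part of any tool.

```python
import json


def names_from_crtsh(payload: str, domain: str) -> list[str]:
    """Extract unique hostnames under domain from a crt.sh JSON response body."""
    names = set()
    for entry in json.loads(payload):
        # name_value may hold several newline-separated names per certificate.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # keep the parent of a wildcard entry
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)
```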

Seed-to-report pivot map

NAME ──┬──▶ search engines ──▶ Wikipedia, personal sites
       ├──▶ LinkedIn ──▶ employer, history
       ├──▶ socials ──▶ aliases ──▶ Sherlock ──▶ more platforms
       └──▶ images ──▶ reverse search ──▶ more accounts

EMAIL ─┬──▶ HIBP ──▶ breach list ──▶ services used
       ├──▶ Hunter.io ──▶ company patterns ──▶ more employees
       ├──▶ Gravatar ──▶ profile image
       ├──▶ Google ──▶ forum posts, paste hits
       └──▶ historical WHOIS ──▶ owned domains

DOMAIN ┬──▶ crt.sh ──▶ subdomains
       ├──▶ subfinder/amass ──▶ more subdomains
       ├──▶ whoxy ──▶ reverse WHOIS ──▶ related domains
       ├──▶ Shodan hostname: ──▶ services
       ├──▶ DNS ──▶ MX/TXT ──▶ vendors
       └──▶ wayback ──▶ historical endpoints

IP ────┬──▶ Shodan/Censys ──▶ services, vulns
       ├──▶ RDAP ──▶ owner, netblock
       ├──▶ reverse DNS ──▶ hostnames
       └──▶ bgp.he.net ──▶ ASN ──▶ more IPs

IMAGE ─┬──▶ Google Lens/Yandex/TinEye ──▶ source
       ├──▶ exiftool ──▶ GPS, camera, timestamp
       ├──▶ AI analysis ──▶ location hypothesis
       └──▶ visual clues ──▶ landmark/sign/architecture

Common Google dorks

site:target.com filetype:pdf
site:target.com ext:doc OR ext:docx OR ext:xls OR ext:xlsx
site:target.com inurl:admin
site:target.com intitle:"index of"
site:target.com "password" OR "confidential"
site:github.com "target.com"
site:linkedin.com/in "Target Corp"
site:pastebin.com "target.com"
site:s3.amazonaws.com target
"@target.com"
intext:"@target.com" site:pastebin.com
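
When the same dorks are reused across engagements, templating them avoids copy-paste mistakes. The sketch below fills a few of the patterns above for a given domain and organisation name and builds a URL-encoded search link for manual use in a browser. The template selection and function names are illustrative assumptions.

```python
from urllib.parse import quote_plus

# A subset of the dork list above; {domain} and {org} are filled per target.
DORK_TEMPLATES = [
    "site:{domain} filetype:pdf",
    'site:{domain} intitle:"index of"',
    'site:github.com "{domain}"',
    'site:linkedin.com/in "{org}"',
    'intext:"@{domain}" site:pastebin.com',
]


def build_dorks(domain: str, org: str) -> list[str]:
    """Fill every template with the target's domain and organisation name."""
    return [template.format(domain=domain, org=org) for template in DORK_TEMPLATES]


def search_url(dork: str) -> str:
    """Return a URL-encoded Google search link for opening by hand."""
    return "https://www.google.com/search?q=" + quote_plus(dork)
```

The links are meant to be opened manually from the investigation browser; automated querying of search engines typically runs into rate limits and terms-of-service restrictions.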

Common Shodan queries

hostname:target.com
ssl:"target.com"
org:"Target Corp"
port:3389 country:US org:"Target Corp"
http.title:"Login" hostname:target.com
product:"nginx" version:"1.18.0" hostname:target.com
vuln:CVE-2023-1234
has_screenshot:true port:5900
category:ics country:US

Checklist: before declaring recon complete

All known domains and subdomains enumerated from at least three passive sources
Certificate transparency logs checked for the last 90 days
Historical WHOIS reviewed for original/hidden contact data
Wayback Machine checked for historical endpoints and scrubbed content
Shodan and Censys both queried for hostname and org
Cloud bucket namespaces checked (S3, Spaces, Azure, GCS)
GitHub/GitLab/Bitbucket searched for leaked secrets and configs
Employee emails and usernames harvested
Key employees’ breach exposure checked
Metadata extracted from published documents
DNS records analyzed for third-party vendors (SPF/MX/CNAME)
Dangling DNS records screened for takeover potential
All findings documented with source URL, timestamp, and confidence level
Raw artifacts archived separately from analysis notes
Opsec review: no personal accounts touched, no direct target interaction beyond what’s documented
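
The documentation item in this checklist is easier to enforce when findings are recorded in a fixed structure from the start. The sketch below is one possible shape, with assumed field names and confidence labels: each finding carries its source URLs, a timestamp, and a confidence level, and counts as confirmed only when its sources span at least two distinct hostnames. Distinct hostnames are a rough proxy for independence, since two sites can republish the same upstream data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from urllib.parse import urlparse


@dataclass
class Finding:
    claim: str
    source_urls: list[str]
    confidence: str  # "low", "medium" or "high" (labels are an assumption)
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def independent_sources(self) -> int:
        """Count distinct hostnames among the sources."""
        hosts = {urlparse(url).hostname for url in self.source_urls}
        hosts.discard(None)  # URLs without a parseable host do not count
        return len(hosts)

    def is_confirmed(self) -> bool:
        """Apply the two-source rule before a finding is treated as confirmed."""
        return self.independent_sources() >= 2
```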

Closing notes

OSINT rewards patience and punishes shortcuts. The tools listed here will all be different in two years — platforms will lock down, APIs will change, services will die, and new ones will appear. What persists is the methodology: ask a clear question, collect broadly, process rigorously, analyze honestly, cite meticulously, and protect the investigation from blowback. Every identifier is a pivot. Every fact needs a source. Every finding needs a second source.

The defender’s version of this guide is the same document read sideways: every technique an attacker can use to map your external footprint is a technique you should be running against yourself, on a schedule, with alerts. The asymmetry between attackers and defenders collapses when defenders start doing their own OSINT first.
