Comprehensive OSINT Guide#
A practitioner’s reference for Open Source Intelligence — methodology, collection disciplines, tooling, pivoting techniques, and operational security. Compiled from 34 research sources.
Table of Contents#
- Fundamentals
- The OSINT Lifecycle
- People OSINT (HUMINT/SOCMINT)
- Company & Corporate OSINT
- Infrastructure & Network OSINT
- Domain, DNS & Certificate Intel
- Social Media Intelligence
- Geolocation & Imagery (GEOINT)
- Breach, Leak & Paste Intel
- Metadata Extraction
- Code & Repository OSINT
- Dark Web & Threat Intel
- IoT & Device Discovery
- Tools Reference
- Automation & Visualization
- AI-Assisted OSINT
- Operational Security
- Legal & Ethical Considerations
- Quick Reference
1. Fundamentals#
Open Source Intelligence (OSINT) is the discipline of collecting, correlating, and analyzing information that is publicly or legally available to produce actionable intelligence. “Open source” does not mean “easy” or “low value” — it means no clandestine collection is involved. The sources are lawful: the skill lies in knowing where to look, how to pivot, and how to assemble fragments into a coherent picture.
Why it matters:
| Use case | Practitioners |
|---|---|
| Adversary reconnaissance | Red teams, pentesters, bug bounty hunters |
| Attack surface management | Blue teams, security engineers, CISOs |
| Threat intelligence | SOC analysts, CTI teams, IR responders |
| Fraud and KYC investigation | Financial crime analysts, compliance |
| Journalism and research | Investigative reporters, academic researchers |
| Law enforcement | Missing persons, criminal investigations |
| Due diligence | M&A, investor research, hiring |
| Personal self-defense | Privacy audits, stalker detection |
Core principles:
- Every fact is a pivot. An email address is not an endpoint — it is a seed for breach lookups, social profile enumeration, domain registrations, and search engine dorks.
- Triangulate before trusting. Any single source can be wrong, stale, or planted. Cross-reference at least two independent sources before treating a data point as confirmed.
- Document as you go. If you cannot reproduce a finding in six months, it did not happen. Screenshot, hash, archive.
- Stay passive until you must be active. The default mode is observation. Only escalate to direct interaction when the intelligence you need cannot be harvested from existing public records.
- Scope creep kills investigations. Define the question up front and resist chasing shiny tangents unless they directly serve the objective.
Passive vs. active collection:
| | Passive | Active |
|---|---|---|
| Contact with target | None — consult third-party data only | Direct queries against target infrastructure |
| Detection risk | Near zero | Logs, rate limits, WAF alerts |
| Data freshness | Can be stale (days to years) | Real-time |
| Examples | crt.sh, Shodan, archive.org, Google dorks | Nmap scan, directory brute-force, HTTP probing |
| When used | Always first; enumerate scope and context | After passive exhausted, to confirm/expand |
The cardinal rule: finish passive recon before touching the target. Anything you can learn from Censys or certificate transparency logs is something you do not need to poke a production server for.
2. The OSINT Lifecycle#
Every investigation, whether a two-hour recon sprint or a month-long deep-dive, follows the same phases. Discipline here separates practitioners from tourists.
Phase 1: Planning & Requirements#
Write down the question. Who is the target? What decisions will the intelligence support? What is in scope, and what is off-limits? What is the deadline? What format does the deliverable take? Investigations without a defined question wander forever.
Phase 2: Collection#
Gather raw data from identified sources. The temptation is to start here — resist it until the plan is clear. Collection spans subdomains, WHOIS records, social profiles, PDF metadata, breach dumps, code repos, certificate logs, historical archives, and more. Keep raw artifacts separate from processed notes.
Phase 3: Processing#
Normalize the data. Dedupe subdomains, resolve hostnames to IPs, extract EXIF from images, parse PDFs for authors. Convert everything into a form you can query and pivot against. A messy collection phase dies here.
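As a rough illustration of this normalization step, the sketch below merges hostname lists from several collection tools into one deduplicated map that remembers which source reported each name. The function name `merge_hostnames` and the source labels are hypothetical and not part of any tool mentioned in this guide.

```python
from collections import defaultdict


def merge_hostnames(results_by_source):
    """Merge per-source hostname lists into {hostname: set_of_sources}."""
    merged = defaultdict(set)
    for source, hostnames in results_by_source.items():
        for raw in hostnames:
            # Normalize: trim whitespace, lowercase, drop a trailing dot
            # and any leading wildcard label ("*.").
            host = raw.strip().lower().rstrip(".")
            if host.startswith("*."):
                host = host[2:]
            if host:
                merged[host].add(source)
    return dict(merged)
```

Keeping the source set per hostname preserves provenance through processing, so a name reported by only one tool can be treated with more suspicion than one reported by several.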
Phase 4: Analysis#
Turn data into intelligence. Correlate findings across sources: the email on the WHOIS record matches the Gravatar on GitHub, which matches a LinkedIn photo, which matches a conference speaker bio. Link analysis tools (Maltego, spreadsheets, link graphs) help surface non-obvious connections.
Phase 5: Dissemination#
Deliver findings in the format the consumer expects. A bug bounty report, a pentest recon appendix, an executive briefing, a due diligence memo. Include provenance for every claim — where it came from, when it was collected, and how confident you are.
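One way to carry provenance with every claim is a small record that stores the source, the collection time, a confidence label, and a hash of the archived artifact. This is a minimal sketch; the `Finding` fields and the confidence vocabulary are assumptions chosen for illustration, not a standard schema.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Finding:
    claim: str          # what the report asserts
    source_url: str     # where the evidence was found
    collected_at: str   # ISO 8601 UTC timestamp of collection
    confidence: str     # e.g. "confirmed", "probable", "unverified"
    sha256: str         # hash of the archived raw artifact


def record_finding(claim, source_url, artifact, confidence, now=None):
    """Build a provenance record for one claim and its raw evidence bytes."""
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(artifact).hexdigest()
    return Finding(claim, source_url, now.isoformat(), confidence, digest)
```

Hashing the raw artifact at collection time lets a reviewer later confirm that the archived copy is the one the claim was based on.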
Phase 6: Feedback#
Does the intelligence answer the question? What was missed? What should have been collected sooner? Feed lessons back into Phase 1 for the next engagement.
3. People OSINT (HUMINT/SOCMINT)#
People investigations map a target’s digital footprint: identifiers, aliases, affiliations, locations, and relationships. The process is iterative — each fact opens new pivots.
Starting identifiers#
| Seed | Immediate pivots |
|---|---|
| Full name | Search engines, LinkedIn, Wikipedia, voter rolls, academic directories |
| Email | HaveIBeenPwned, Hunter.io, Gravatar, breach dumps, domain WHOIS, Google |
| Username | Sherlock, WhatsMyName, Namechk, Maigret |
| Phone number | PhoneInfoga, Truecaller, reverse lookup, Telegram/WhatsApp checks |
| Profile photo | Reverse image search (Google, Yandex, TinEye, PimEyes) |
| Employer | LinkedIn, press releases, company filings |
| Address | Property records, voter rolls, Google Maps Street View |
The Maltego-style pivot graph#
Treating a person investigation as a graph (nodes = identifiers/entities, edges = “associated with”) prevents losing track of where each fact originated. A typical pivot chain from a Maltego-style workflow:
- Name → search page titles, Wikipedia, personal website
- Personal website → footer emails, phone numbers, historical WHOIS (DomainTools)
- WHOIS email → other domains registered with same email (reverse WHOIS via WhoXY)
- Social handles (`marc_clotet`, `marcclotetoficial`) → Instagram, Twitter, Facebook profiles
- Mutual followers/following → close contacts, private accounts
- Affiliated company (mentioned in bio) → corporate registry → officers → other affiliated parties
- Historical Hotmail address uncovered via DomainTools → Pipl person search → age, relatives, locations
- Phone number → messaging app profile photos, account discovery
Each step is a pivot from a confirmed entity to new related entities. Maltego Transforms automate the individual hops; you can run the same workflow manually with curl, whois, and careful note-taking.
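The manual version of this workflow can be tracked with a very small graph structure. The sketch below is an assumption about how one might keep such notes in code, not how Maltego stores entities: each edge records the source that justified the pivot, and a breadth-first search recovers the chain linking two identifiers.

```python
from collections import deque


class PivotGraph:
    """Undirected graph of identifiers; each edge stores its source note."""

    def __init__(self):
        self.edges = {}  # node -> {neighbor: source_note}

    def link(self, a, b, source):
        self.edges.setdefault(a, {})[b] = source
        self.edges.setdefault(b, {})[a] = source

    def path(self, start, goal):
        """Breadth-first search for the shortest pivot chain, or None."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            chain = queue.popleft()
            if chain[-1] == goal:
                return chain
            for nxt in self.edges.get(chain[-1], {}):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(chain + [nxt])
        return None
```

Calling `path()` on a finished graph answers the question a reviewer will ask of any finding: which sequence of pivots connects the seed to this entity.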
Username enumeration#
Most people reuse handles across platforms. Tools that check hundreds of sites in parallel:
- Sherlock — Python tool checking 300+ social networks for a username
- WhatsMyName — web/CLI tool with a community-maintained JSON list of sites
- Maigret — fork of Sherlock with richer profile data extraction
- Namechk / KnowEm — brand/username availability checkers repurposed for OSINT
A hit on a niche forum is often worth more than another Twitter account — niche forums surface real interests, writing samples, and contact patterns.
Email enumeration and validation#
- Hunter.io — finds email addresses by domain, infers patterns (`first.last@`, `flast@`), verifies deliverability
- Email permutator — generates plausible addresses from a name plus domain
- HaveIBeenPwned — reveals which breaches an email appears in (reveals services used)
- Gravatar — `https://gravatar.com/<md5(email)>.json` returns profile if registered
- Epieos / Holehe — checks dozens of services for account registration without triggering password reset emails
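A permutator and the Gravatar lookup URL are both simple to sketch. In the example below only `first.last@` and `flast@` come from the text; the remaining patterns are common conventions added as assumptions, and the function names are hypothetical. The Gravatar hash is the MD5 of the trimmed, lowercased address.

```python
import hashlib


def email_candidates(first, last, domain):
    """Generate plausible corporate addresses for one person."""
    fn, ln = first.strip().lower(), last.strip().lower()
    if not fn or not ln:
        raise ValueError("first and last name are required")
    local_parts = [
        f"{fn}.{ln}",    # first.last@
        f"{fn[0]}{ln}",  # flast@
        f"{fn}{ln}",     # assumed pattern: firstlast@
        f"{fn}",         # assumed pattern: first@
        f"{fn}_{ln}",    # assumed pattern: first_last@
        f"{ln}.{fn}",    # assumed pattern: last.first@
    ]
    return [f"{local}@{domain.lower()}" for local in local_parts]


def gravatar_profile_url(email):
    """Build the Gravatar JSON profile URL from the MD5 of the address."""
    digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
    return f"https://gravatar.com/{digest}.json"
```

Candidates produced this way are guesses until a verification source confirms them.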
Phone numbers#
- PhoneInfoga — country, carrier, line type, breach hits
- Truecaller — crowdsourced caller ID; risky (reveals your query to Truecaller)
- Messaging apps — adding a number to contacts often reveals the registered profile name and avatar (opsec-heavy; use burner)
Reverse image search#
- Google Images / Google Lens — best for product, landmark, and Western content
- Yandex Images — best for faces and people; still the strongest for facial matches
- TinEye — best for finding the original source and earliest occurrence
- PimEyes / FaceCheck — facial recognition across the open web (paid, ethically fraught)
- Bing Visual Search — decent for products and landmarks
4. Company & Corporate OSINT#
Company investigations combine infrastructure recon with corporate filings, personnel mapping, and vendor/technology fingerprinting.
Corporate identity sources#
| Source | Data |
|---|---|
| OpenCorporates | Global company registry metadata — officers, addresses, status |
| SEC EDGAR | US public company filings (10-K, 10-Q, insider transactions) |
| Companies House (UK) | Officers, filings, beneficial owners |
| Bureau van Dijk / Orbis | Paid, comprehensive global company intelligence |
| Dun & Bradstreet | Business credit, corporate family trees |
| Crunchbase / PitchBook | Funding, investors, board members (paid tiers for depth) |
| LinkedIn company pages | Headcount, departments, employee list |
| Full Contact / Clearbit | Enrichment APIs — size, industry, tech stack, key people |
Subsidiary and domain discovery#
Large companies have sprawling digital footprints. Start from the primary name and expand:
- Reverse WHOIS (WhoXY, DomainTools) — find all domains registered to the same name or email. Remember: WhoXY requires exact-string matches, so “Blizzard Entertainment” and “Blizzard Entertainment, Inc” will return different sets.
- Trademark search — USPTO, EUIPO filings reveal product codenames and subsidiaries.
- Press releases and SEC filings — mention subsidiary names that never appear on the website.
- Job postings — often mention internal tool names, cloud providers, and office locations.
Employee enumeration#
- LinkedIn — `site:linkedin.com "Acme Corp"` via Google reveals public profiles even without login
- Hunter.io / phonebook.cz — bulk email harvest by domain
- GitHub — commits by `@company.com` email addresses expose engineers
- Conference talks, CVE credits, paper authorship — lists specialists
- RocketReach, Lusha, Apollo — sales tools repurposed for contact discovery (paid)
Technology fingerprinting#
Knowing the stack narrows exploitation research later:
- BuiltWith / Wappalyzer — web stack detection from rendered HTML and headers
- Shodan / Censys — banner grabs reveal server software and versions
- DNS records — MX (`pphosted.com` = Proofpoint), SPF (`spf.protection.outlook.com` = O365), CNAMEs revealing CDN/CMS
- JavaScript bundles — library imports, API endpoints, third-party integrations
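The DNS-based fingerprinting above amounts to substring matching on record values. A minimal sketch follows; the first two fingerprints come from the text, while the other entries are commonly seen values added here as assumptions, and `identify_providers` is a hypothetical helper name.

```python
# Substring -> provider. Only the first two entries are taken from the
# guide; the rest are illustrative assumptions.
MAIL_FINGERPRINTS = {
    "pphosted.com": "Proofpoint",
    "spf.protection.outlook.com": "Microsoft 365",
    "mail.protection.outlook.com": "Microsoft 365",
    "_spf.google.com": "Google Workspace",
    "aspmx.l.google.com": "Google Workspace",
    "mimecast.com": "Mimecast",
}


def identify_providers(records):
    """Return the set of providers whose fingerprint appears in any record."""
    found = set()
    for record in records:
        value = record.lower()
        for needle, provider in MAIL_FINGERPRINTS.items():
            if needle in value:
                found.add(provider)
    return found
```

Feed it the raw MX and TXT values already collected from passive DNS sources; no query against the target is needed.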
5. Infrastructure & Network OSINT#
This is where OSINT crosses most directly into security recon. The goal is to enumerate every externally reachable asset and catalog what is running on it — without touching the target.
IP space and ASN#
- ARIN / RIPE / APNIC / LACNIC / AFRINIC RDAP — WHOIS for IP blocks, netblock ownership
- bgp.he.net — AS number lookups, peering relationships, announced prefixes
- ipinfo.io / ipdata — enrichment APIs with geoloc, ASN, org
- RIPEstat — authoritative routing, abuse contacts, historical data
A company that owns its own ASN signals maturity and gives you a clean IP perimeter. A company entirely on cloud (all AWS/GCP/Azure) means you map their domains back to cloud ranges instead.
Search engines for infrastructure#
These are the indispensable tools. None of them touch the target; they query pre-indexed scan data.
| Tool | Best for | Free tier |
|---|---|---|
| Shodan | Banners, service versions, SCADA/ICS, webcams, IoT, vulnerability filters (vuln:) | Limited queries, paid plans unlock filters |
| Censys | Certificate search, service fingerprinting, precise field queries | 250 searches/month |
| Netlas.io | Domains, IPs, WHOIS, DNS combined; Maltego integration | 50 searches/day, 2500 results/month |
| FOFA | Chinese alternative, strong for APAC infrastructure | Limited |
| ZoomEye | Another Chinese alternative | Limited |
| BinaryEdge | Scans, leaked databases, risk scoring | Paid |
| GreyNoise | Classifies “background noise” IPs to filter scan traffic | Community tier |
| Hunter.how | Cyberspace search engine | Limited |
Shodan query patterns:
hostname:example.com
ip:203.0.113.0/24
port:22 country:US
product:"nginx"
vuln:CVE-2021-44228
org:"Acme Corp"
ssl:"example.com"
Censys field queries:
parsed.names: example.com
services.service_name: "HTTP" and location.country: "United States"
services.tls.certificates.leaf_data.names: "*.example.com"
autonomous_system.asn: 13335
Cloud asset discovery#
Most modern targets live in public cloud. Mapping cloud assets:
- S3 buckets — use `awscli` or boto3. Checks made with any authenticated AWS account surface more than unauthenticated HTTP probes, because some buckets grant access to all authenticated AWS users. Bucket names must be globally unique; try company-name variants with common prefixes/suffixes: `qa`, `dev`, `staging`, `prod`, `bak`, `backup`, `logs`, `assets`, `uat`, `legacy`, `internal`, `public`, `private`, `docs`.
- DigitalOcean Spaces — same API shape as S3, separate namespace to enumerate.
- Azure blob storage — `<name>.blob.core.windows.net`
- GCS buckets — `<name>.storage.googleapis.com`
- Firebase databases — `<name>.firebaseio.com`
- Dangling CNAMEs — records pointing to deleted cloud resources are ripe for subdomain takeover. The can-i-take-over-xyz repo catalogs the fingerprints.
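Candidate generation for the name-based services above can be sketched as follows. The affix list and hostname templates are the ones given in the list; the separator choices and function names are assumptions. The sketch only builds names and performs no network requests. Azure candidates containing a dash are dropped because storage account names allow only lowercase letters and digits.

```python
AFFIXES = ["qa", "dev", "staging", "prod", "bak", "backup", "logs",
           "assets", "uat", "legacy", "internal", "public", "private", "docs"]

# Hostname templates from the list above; "{name}" is the candidate name.
ENDPOINTS = {
    "azure_blob": "{name}.blob.core.windows.net",
    "gcs": "{name}.storage.googleapis.com",
    "firebase": "{name}.firebaseio.com",
}


def bucket_names(company, separators=("-", "")):
    """Yield the bare company name plus prefix and suffix variants."""
    base = company.lower()
    yield base
    for affix in AFFIXES:
        for sep in separators:
            yield f"{base}{sep}{affix}"  # suffix form, e.g. acme-dev
            yield f"{affix}{sep}{base}"  # prefix form, e.g. dev-acme


def candidate_hosts(company):
    """Map each endpoint type to the hostnames worth checking."""
    hosts = {}
    for kind, template in ENDPOINTS.items():
        names = bucket_names(company)
        if kind == "azure_blob":
            # Azure storage account names cannot contain dashes.
            names = (n for n in names if n.isalnum())
        hosts[kind] = [template.format(name=n) for n in names]
    return hosts
```

Resolving these hostnames is an active step against the cloud provider rather than the target, but it still belongs after passive sources have been exhausted.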
Historical data#
- Wayback Machine (archive.org) — snapshots of old pages, forgotten endpoints, `robots.txt` evolution, admin panel references
- CommonCrawl — bulk web archive suitable for scripted search
- SecurityTrails — historical DNS records, WHOIS changes, subdomain discovery
- DomainTools — historical WHOIS (the closest thing to a time machine for registration data)
- Google Cache — Google removed cached-page links from its search results in 2024, so treat this as a legacy source and rely on the archives above for removed pages
Historical data is often more valuable than current data. A subdomain that vanished last year may still point at a forgotten S3 bucket. An old WHOIS record may contain an admin’s personal email that was scrubbed from the current record.
6. Domain, DNS & Certificate Intel#
Domain-level intel is the connective tissue of infrastructure OSINT.
Subdomain enumeration#
Passive sources pull from pre-indexed databases — no traffic to the target:
- crt.sh — free, no API key, queries Certificate Transparency logs. Every publicly trusted TLS certificate is logged, so issuing a cert for `hr.example.com` is enough to reveal that subdomain even before it goes live.
- Certificate Transparency (`transparencyreport.google.com`) — Google’s CT aggregator
- VirusTotal — passive DNS from submissions
- SecurityTrails / DNSDumpster / Netcraft — historical DNS aggregators
- Subfinder — orchestrates queries against 30+ passive sources
- Amass (intel/enum passive mode) — OWASP’s enumeration framework
- Assetfinder — tomnomnom’s lightweight passive finder
- chaos-client — ProjectDiscovery’s Chaos dataset
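As an example of consuming one of these passive sources, crt.sh can return JSON when queried with a URL such as `https://crt.sh/?q=%25.example.com&output=json`. The sketch below parses that output offline. It assumes each entry carries a `name_value` field with one or more newline-separated names, which matches the output as commonly observed but is not a documented contract.

```python
import json


def names_from_crtsh(json_text, domain):
    """Extract unique hostnames under `domain` from crt.sh JSON output."""
    domain = domain.lower().rstrip(".")
    found = set()
    for entry in json.loads(json_text):
        # name_value may hold several names, one per line.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # wildcard cert: keep the parent name
            if name == domain or name.endswith("." + domain):
                found.add(name)
    return sorted(found)
```

The suffix check uses a leading dot so that lookalike registrations such as `evil-example.com` are not mistaken for subdomains.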
Active enumeration adds brute-force, permutation, and zone walking:
- puredns / shuffledns — mass DNS resolution with wildcards filtered
- altdns / dnsgen — permutation wordlists from discovered names
- dnsrecon — zone transfers, cache snooping
- massdns — raw resolver for large lists
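Permutation tools take names already discovered and mutate them with a wordlist. The generator below is a simplified sketch of that idea, not the actual algorithm of altdns or dnsgen; the four mutation rules are assumptions chosen for illustration.

```python
def permute(hostname, words):
    """Yield simple wordlist-based variants of one discovered hostname."""
    label, _, parent = hostname.partition(".")
    if not parent:
        return  # need at least "label.parent" to permute safely
    for word in words:
        yield f"{word}.{hostname}"        # extra label: dev.api.example.com
        yield f"{word}-{label}.{parent}"  # dashed prefix: dev-api.example.com
        yield f"{label}-{word}.{parent}"  # dashed suffix: api-dev.example.com
        yield f"{label}{word}.{parent}"   # concatenation: apidev.example.com
```

The output is a candidate list only; it still has to pass through a resolver with wildcard filtering before any name counts as a finding.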
Active validation#
Once you have a list of candidate names, resolve and probe them:
- dnsx — bulk resolution, filtering by record type
- httpx — probes live HTTP(S) services, captures titles, tech stacks, status codes
- aquatone — visual recon; screenshots and clusters subdomains by similarity
- gowitness / eyewitness — alternative screenshotters
DNS record mining#
Beyond A/CNAME, records leak intel:
| Record | Intel |
|---|---|
| MX | Email provider (Google, Microsoft, Proofpoint, Mimecast) |
| TXT | SPF records list third-party senders (Marketo, Salesforce, Zendesk); DKIM selectors hint at tooling |
| SRV | Exposes specific services (XMPP, SIP, LDAP) |
| CAA | Allowed certificate authorities |
| NS | DNS provider (Route53, Cloudflare, NS1) |
| SOA | Admin email, zone refresh parameters |
Weak or missing SPF/DMARC (e.g. v=DMARC1; p=none) signals exploitable email spoofing potential. DKIMValidator is a classic utility for testing DMARC alignment without interacting with the target infrastructure.
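Classifying a DMARC record that has already been retrieved is a small parsing exercise. The sketch below splits the TXT value into tags and labels the policy; the labels returned by `spoofing_risk` are hypothetical wording, and the function deliberately ignores subdomain policy (`sp`) and percentage (`pct`) tags for brevity.

```python
def parse_dmarc(txt):
    """Split a DMARC TXT record into a tag -> value dict."""
    tags = {}
    for part in txt.split(";"):
        key, sep, value = part.partition("=")
        if sep:
            tags[key.strip().lower()] = value.strip()
    return tags


def spoofing_risk(txt):
    """Classify a domain's DMARC posture from its TXT record (or None)."""
    if not txt:
        return "no DMARC record"
    tags = parse_dmarc(txt)
    if tags.get("v", "").upper() != "DMARC1":
        return "invalid DMARC record"
    policy = tags.get("p", "").lower()
    if policy == "none":
        return "monitor only (p=none)"
    if policy in ("quarantine", "reject"):
        return f"enforced ({policy})"
    return "invalid DMARC record"
```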
WHOIS#
- Current WHOIS — often redacted under GDPR for individuals, still useful for corporate registrants
- Historical WHOIS — DomainTools, WhoisXML API, Whoxy — the unredacted gold
- Reverse WHOIS — find all domains sharing a registrant email, name, phone, or organization
7. Social Media Intelligence#
SOCMINT is high-signal but high-noise. Treat public social media as a window into a target’s relationships, routines, locations, and interests — and treat closely curated accounts (execs, celebrities) as performative artifacts, not ground truth.
Platform-by-platform quick reference#
| Platform | High-value intel |
|---|---|
| LinkedIn | Employer history, skills, internal tool mentions, team maps, location |
| Twitter/X | Writing style, real-time location, device fingerprints (Twitter for iPhone etc.), interests, connections |
| Facebook | Family, relationships, check-ins, photos, events, groups |
| Instagram | Geolocation from photos, friend networks, routines, physical spaces |
| TikTok | Schedules, location context, behavioral patterns |
| Reddit | Long-form writing, niche communities, real-world interests |
| GitHub | Code, commits, emails, working hours, associated accounts |
| Strava/fitness apps | Routines, home location, military base exposure |
| Telegram | Phone number → profile → channels joined |
| Discord | Real-time presence, community affiliations |
Techniques#
- Close contact mapping — mutual followers on Instagram, Facebook friend overlap, Twitter interaction graphs
- Temporal analysis — timestamp clusters reveal timezone, sleep schedule, work hours
- Linguistic fingerprinting — consistent phrasing across accounts links aliases
- Photo OSINT — backgrounds reveal location, device clock shows timezone, reflections leak environment
- Story/ephemeral content — archive quickly; gone in 24 hours
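Temporal analysis can be approximated with an hour-of-day histogram. The sketch below finds the quietest span of hours in UTC and converts it to a UTC offset under the stated assumption that the quiet span is sleep beginning around local midnight. That assumption, the six-hour window, and the function names are all illustrative; shift workers and scheduled posts will defeat it. Timestamps must be timezone-aware.

```python
from collections import Counter
from datetime import datetime, timezone


def quiet_window_start(timestamps, window=6):
    """Return the UTC hour that starts the least active `window`-hour span."""
    per_hour = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)

    def activity(start):
        return sum(per_hour[(start + i) % 24] for i in range(window))

    return min(range(24), key=activity)


def estimate_utc_offset(timestamps, assumed_sleep_start=0, window=6):
    """Guess the UTC offset, assuming the quiet span is local-time sleep.

    assumed_sleep_start is the local hour at which the account is assumed
    to go quiet (midnight by default); this is a heuristic, not a fact.
    """
    offset = (assumed_sleep_start - quiet_window_start(timestamps, window)) % 24
    return offset - 24 if offset > 12 else offset
```

Treat the result as a hypothesis to triangulate against other signals such as language, stated location, or photo backgrounds.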
Tools#
- Sherlock / WhatsMyName / Maigret — username across platforms
- Osintgram — Instagram enumeration (rate-limited; may violate ToS)
- Twint / snscrape — Twitter scraping without API (fragile post-API lockdown)
- Social Analyzer — API and CLI for social profile discovery
- Social-Searcher — keyword monitoring across platforms
- Blackbird — username/email search across 500+ sites
- IntelX — indexed social content and leaked data
Note: platform APIs and terms have tightened significantly. Many classic scraping tools are in a state of perpetual repair. Always check recency.
8. Geolocation & Imagery (GEOINT)#
Determining where a photo, video, or person is located from visual evidence.
Classic technique stack#
- Shadow analysis — shadow length and direction constrain latitude and time of day (SunCalc at suncalc.org)
- Landmark identification — monuments, logos, business signage
- Language and script — signage language narrows region
- Vegetation — tree species and agriculture indicate climate zone
- Vehicle makes and license plate formats — country/region disambiguation
- Electrical plug shapes and pole construction — power grid standards vary by region
- Road markings — lane widths, stripe patterns, sign shapes (MUTCD vs. Vienna Convention)
- Architecture — roofing styles, window frames, construction materials
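The shadow analysis item at the top of this list rests on basic trigonometry: for a vertical object on level ground, the sun's elevation angle is the arctangent of object height over shadow length. A minimal sketch, assuming both lengths are measured in the same unit (pixel measurements work if perspective distortion is small):

```python
import math


def sun_elevation_deg(object_height, shadow_length):
    """Solar elevation angle implied by a vertical object and its shadow."""
    if object_height <= 0 or shadow_length <= 0:
        raise ValueError("height and shadow length must be positive")
    return math.degrees(math.atan2(object_height, shadow_length))
```

The resulting angle can then be compared against a sun-position calculator for candidate locations and dates to narrow down when the image was taken.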
Reverse image search#
Run the same image through all of these — coverage varies wildly:
- Google Images / Lens
- Yandex Images — still the best for Russian/Eastern European and general face matching
- TinEye — best for finding originals and earliest occurrences
- Bing Visual Search
- Baidu — better for Chinese content
AI-assisted geolocation#
Modern multimodal models can synthesize the classic technique stack in seconds. The Hackers Arise walkthrough demonstrates using custom GPTs like GeoGuessr GPT for first-pass geolocation: upload an image, ask where it was taken. These models do not do reverse image search — they visually reason over architectural, vegetation, and signage cues. They are often wrong on specifics but provide a valuable starting framework of observations (“the road signs suggest Cyrillic-script Eastern Europe; the utility pole style matches post-Soviet construction”).
Practitioners should treat AI guesses as hypotheses, not conclusions, and verify every claim against ground-truth imagery.
Mapping and imagery sources#
- Google Maps / Earth Pro — Street View, historical imagery, 3D buildings
- Yandex Maps / Mapillary / KartaView — alternative street-level imagery, stronger coverage in some regions
- Sentinel Hub / EO Browser — free satellite imagery (Sentinel-2, Landsat)
- Planet Labs — commercial high-cadence satellite imagery
- OpenStreetMap — community mapping with extractable POI data
- Overpass Turbo — query OSM for arbitrary features (e.g. “all churches in this bounding box with a spire over 30m”)
- Wikimapia — crowd-sourced photo-annotated POI database
Video and live stream OSINT#
- EarthCam / Insecam — aggregated public webcams (many unintentional)
- Windy.com — live webcams for weather
- YouTube geosearch tools — find videos shot within a geographic radius
- FlightRadar24 / ADS-B Exchange — real-time civilian aircraft tracking
- MarineTraffic / VesselFinder — real-time ship AIS data
- RailSense / similar — train tracking by region
EXIF and video metadata#
Photos from smartphones and GPS-equipped cameras embed GPS coordinates in EXIF when location tagging is enabled, and keep them unless stripped. Most social platforms strip EXIF, but platforms that preserve it (Flickr, some forums, raw email attachments) can hand-deliver the answer. Tools: exiftool, ExifTool Online, Jeffrey's Image Metadata Viewer.
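EXIF stores latitude and longitude as degrees, minutes, and seconds plus a hemisphere reference, whereas mapping tools expect signed decimal degrees. The conversion is a one-liner; the function name below is hypothetical.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style degrees/minutes/seconds plus N/S/E/W to decimal."""
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative.
    return -value if ref.upper() in ("S", "W") else value
```

For instance, `dms_to_decimal(40, 26, 46, "N")` and `dms_to_decimal(79, 58, 56, "W")` together give a coordinate pair that can be pasted straight into a map search.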
9. Breach, Leak & Paste Intel#
Credentials, PII, and internal data exposed through historical breaches and paste sites are a cornerstone of offensive OSINT.
Breach lookup services#
- HaveIBeenPwned — free, non-commercial use; reveals which breaches an email appears in. Also exposes pastes containing the email. Pwned Passwords lets you check whether a specific password has been seen in any breach without sending the password (k-anonymity via SHA-1 prefix).
- Dehashed — paid, searchable index of actual credential content
- IntelligenceX / IntelX — indexed breach and leak content, darknet sources
- LeakCheck / Snusbase / LeakPeek — commercial breach databases
- Breach-parse / h8mail — local tools for searching personal breach archives
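The k-anonymity scheme mentioned for Pwned Passwords works by sending only the first five hex characters of the password's SHA-1 hash to `https://api.pwnedpasswords.com/range/<prefix>` and matching the remainder locally against the returned `SUFFIX:COUNT` lines. The sketch below covers the local half only; the HTTP request is left out, and the function names are hypothetical.

```python
import hashlib


def split_hash(password):
    """Return (prefix, suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(range_body, suffix):
    """Look up our suffix in a range response of SUFFIX:COUNT lines."""
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0
```

Because only the five-character prefix leaves the machine, the service never learns which of the many hashes sharing that prefix was being checked.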
Operational notes#
- HIBP tells you an email was in Collection #1, but not the password. Commercial services can provide the cleartext; use it only where that is ethically acceptable for your engagement.
- “Sensitive” breach flags (Ashley Madison, etc.) require judgment — referencing them in a client deliverable is frequently inappropriate even when technically accurate.
- Breach data ages: a password from 2013 is probably not current, but hints at password patterns and reveals services the user has engaged with.
- Pastes live and die quickly. If a paste URL 404s, check Google cache and Wayback Machine immediately.
Paste sites and dumps#
- Pastebin — classic source, still productive
- Ghostbin / Hastebin / Rentry / Privatebin — newer alternatives
- GitHub Gist — frequently overlooked; indexed by Google (`site:gist.github.com`)
- Telegram channels — many dump channels operate exclusively on Telegram
- Darknet forums — BreachForums, XSS, Exploit — require careful opsec
10. Metadata Extraction#
Documents, images, and files published by a target frequently leak internal usernames, software versions, file paths, and timestamps.
Document metadata#
- exiftool — the canonical CLI tool; handles EXIF, XMP, IPTC, PDF metadata, Office documents
- FOCA (Fingerprinting Organizations with Collected Archives) — downloads documents from a target domain, extracts metadata in bulk, builds org charts from author fields
- metagoofil — FOCA-alike in Python, uses Google/Bing to find documents by filetype on a target domain
- PDFiD / peepdf — PDF internals inspection
- oletools — OLE/Office document internals
- mat2 — metadata anonymization tool; useful for understanding what it strips and therefore what is leaked
Google dorks for document hunts#
site:example.com filetype:pdf
site:example.com filetype:xlsx
site:example.com filetype:docx
site:example.com ext:doc OR ext:docx OR ext:xls OR ext:xlsx
site:example.com "for internal use only"
What metadata reveals#
| Field | Leak |
|---|---|
| Author | Internal username (often the domain login) |
| Creation software | Microsoft Office 2016, LibreOffice 7.4 — software inventory |
| Last modified by | Another internal user |
| Printer | Printer model and possibly IP |
| Revision history | Earlier drafts, collaborators |
| Embedded images | Secondary EXIF data |
| Hyperlinks | Internal SharePoint/intranet URLs |
| File paths | C:\Users\jdoe\Documents\... reveals username |
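Pulling account names out of leaked file paths is easy to automate once metadata has been extracted. The sketch below assumes the paths have already been collected (for example from exiftool output) and recognizes three common home-directory layouts; the pattern and function name are illustrative assumptions.

```python
import re

# Windows, macOS and Linux home-directory layouts.
_HOME_DIR = re.compile(r"(?:[A-Za-z]:\\Users\\|/Users/|/home/)([^\\/]+)")


def usernames_from_paths(paths):
    """Return the distinct account names embedded in leaked file paths."""
    found = set()
    for path in paths:
        match = _HOME_DIR.search(path)
        if match:
            found.add(match.group(1))
    return sorted(found)
```

Names recovered this way often match the internal login format, which makes them useful seeds for the email pattern inference described earlier.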
11. Code & Repository OSINT#
Source code hosting platforms are a gold mine. Every commit is a historical record, and secrets leak constantly.
GitHub search techniques#
Surface leaked credentials and sensitive content:
org:acmecorp password
org:acmecorp apikey
"@acmecorp.com" password
filename:.env acmecorp
filename:config.yml acmecorp
"BEGIN RSA PRIVATE KEY" acmecorp
extension:sql acmecorp INSERT INTO users
Note that GitHub’s secret scanning reports many leaked tokens to the issuing providers, which often revoke them automatically, so old dumps may hold stale credentials. They are still useful for mapping the services in use.
Tools#
- gitleaks — scans repos and Git history for secret patterns
- trufflehog — entropy-based secret detection, supports GitHub org scanning
- git-secrets — AWS Labs tool; primarily for preventing commits but usable for audit
- gitrob — catalogs secrets across an organization’s public repos
- github-dorks / gh-dork — curated dork lists
- GitHound / GitMiner — deep search across public GitHub
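The two detection styles these tools rely on, fixed patterns and entropy scoring, can be shown in a few lines. This is a teaching sketch, not the rule set of gitleaks or trufflehog: the two regexes, the entropy threshold, and the minimum token length are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Illustrative patterns only; real scanners ship far larger rule sets.
PATTERNS = {
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def shannon_entropy(text):
    """Bits of entropy per character; random-looking tokens score high."""
    counts = Counter(text)
    return -sum(n / len(text) * math.log2(n / len(text)) for n in counts.values())


def scan(text, min_entropy=4.0, min_length=20):
    """Return (rule, match) pairs for known patterns and high-entropy tokens."""
    hits = [(rule, m.group(0)) for rule, rx in PATTERNS.items()
            for m in rx.finditer(text)]
    for token in re.findall(r"[A-Za-z0-9+/=_-]{%d,}" % min_length, text):
        if shannon_entropy(token) >= min_entropy:
            hits.append(("high_entropy", token))
    return hits
```

Pattern rules catch well-known formats with few false positives, while the entropy pass catches unknown token types at the cost of more noise; any hit still needs manual validation.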
Pivoting from a single repo#
- Commit metadata — author email, name, timestamps (working hours)
- `.github/CODEOWNERS` — team structure
- Issue comments — internal tool names, vendors, ticket systems
- PR reviewers — collaboration networks
- Starred/forked repos — interests, technology exposure
- GitHub Pages — hosted sites under `<user>.github.io` often have separate content
Beyond GitHub#
- GitLab.com — same techniques, smaller dork coverage
- Bitbucket — less searchable but still scannable
- Self-hosted instances (Gitea, Forgejo, cgit) — find via Shodan (`http.title:"Gitea"`)
- DockerHub — images often ship with embedded secrets or leaked file paths
- npm / PyPI / crates.io — package authors, private package mentions in public packages
12. Dark Web & Threat Intel#
Aggregators and commercial platforms fold darknet content, malware telemetry, and threat actor intelligence into the OSINT pipeline.
Platforms#
- Intel 471 — cybercriminal forum and actor intelligence
- Recorded Future — broad threat intel with OSINT and closed-source blend
- CloudSEK (XVigil) — external threat monitoring, brand exposure, dark web
- Flashpoint — illicit community monitoring
- DarkOwl — darknet content search
- ShadowDragon (SocialNet, etc.) — investigative toolkits with 200+ data sources integrated
- ZeroFox — brand protection, social and dark web
- Digital Shadows / ReliaQuest — digital risk protection
- Maltego + Transform Hub — glue for integrating many of the above
Threat intel feeds#
- MISP — open-source threat intelligence sharing platform
- AlienVault OTX — free community threat exchange
- abuse.ch (URLhaus, MalwareBazaar, ThreatFox, Feodo Tracker) — free high-quality IoC feeds
- VirusTotal Intelligence — paid search over submitted samples, URLs, domains
- GreyNoise — distinguishes targeted scans from internet background noise
13. IoT & Device Discovery#
Specialized search for internet-connected devices and sensors, from industrial control systems to smart home devices.
- Shodan — still the best for ICS/SCADA (`port:502`, `port:102`, `category:ics`)
- Censys — complementary coverage
- ZoomEye — strong APAC IoT coverage
- Thingful — the “search engine for the Internet of Things” — aggregates public IoT sensor data (air quality, weather, energy, transport) across millions of devices globally, suitable for environmental research and urban analytics
- Kamerka — geolocation-focused ICS/IoT scanner using Shodan/BinaryEdge data
- Insecam — lists public webcams (many with default credentials)
These tools are powerful for researchers mapping exposure and for defenders cataloging their own attack surface. They are equally abused by attackers — defenders should track their own presence in them.
14. Tools Reference#
A consolidated lookup of the tools practitioners reach for. The overlap between “OSINT tool” and “recon tool” is large; most of these appear repeatedly in the source surveys.
Frameworks and aggregators#
| Tool | Purpose |
|---|---|
| Maltego | Graph-based link analysis, Transform Hub with 70+ data sources, the standard for investigations that must produce a visual link chart |
| SpiderFoot | Automated OSINT framework, 200+ modules, web UI, runs scheduled scans, correlates findings |
| recon-ng | Framework with Metasploit-style module system for recon workflows |
| theHarvester | Email, subdomain, employee name enumeration from search engines and PGP servers |
| OSINT Framework (osintframework.com) | Curated web directory of tools by category; not a scanner, but the best starting map of the ecosystem |
| IntelTechniques (OSINT Techniques) | Michael Bazzell’s methodology and tool collection |
Maltego#
- Model: graph of entities (Person, Domain, IP, Email, etc.) connected by relationships. Transforms run against an entity to produce related entities.
- Data sources: the Transform Hub integrates DomainTools, Shodan, Pipl, OpenCorporates, Censys, Have I Been Pwned, Vetric, Netlas, IBM Watson, and many more. Many are paid.
- Use cases: person of interest investigations, corporate link analysis, threat actor attribution, fraud networks.
- Typical workflow: seed with names, domains, or emails → run passive Transforms → pivot on interesting results → prune noise → export as report or visual graph. A complete person investigation can move from a name to Wikipedia to personal website to historical WHOIS to personal email to person profile (age, relatives) in a handful of Transform runs.
SpiderFoot#
- Model: modular scanner with 200+ modules, each tapping a specific data source. Configure the target and scan profile, run, review.
- Data sources: Shodan, VirusTotal, HIBP, SecurityTrails, HackerTarget, crt.sh, Censys, IntelX, and many more (some require API keys).
- Use cases: baseline external exposure audit, continuous monitoring, bug bounty asset discovery, threat investigation.
- Strengths: fire-and-forget automation, depth of coverage, built-in correlation rules that highlight interesting findings across modules.
theHarvester#
- Model: CLI tool that queries search engines, DNS sources, and PGP key servers for emails, subdomains, IPs, and employee names.
- Sources: Google, Bing, DuckDuckGo, LinkedIn, Baidu, crt.sh, Shodan, Censys, and many more.
- Typical invocation: `theHarvester -d example.com -l 500 -b all`
- Strengths: simple, scriptable, pairs well with automation pipelines.
recon-ng#
- Model: Metasploit-style framework (`workspaces`, `modules`, `options`). Modules fetch specific data types into a workspace database.
- Strengths: good persistence of results across sessions, scriptable, reasonable module coverage for core recon tasks.
- Typical flow: `workspaces create acme` → add seed domains → run `recon/domains-hosts/*` modules → export.
Sherlock#
- Purpose: username enumeration across 300+ social sites, e.g. `python3 sherlock jdoe`.
- Strengths: fast, easy, no API keys. Good for alias discovery.
- Caveats: false positives on generic 200 responses; validate manually.
Shodan#
- Purpose: search engine over internet-connected service banners. Queries scan data, not live services.
- Filters: `port:`, `product:`, `version:`, `org:`, `hostname:`, `country:`, `vuln:`, `category:`, `ssl:`, `http.title:`, `http.html:`
- CLI: `shodan host 1.2.3.4`, `shodan search 'apache country:US'`, `shodan download`, `shodan parse`
- Best for: attack surface snapshots, finding forgotten assets, identifying vulnerable software at scale.
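When queries are generated by a script rather than typed by hand, a small helper keeps the `filter:value` syntax consistent. This is a minimal sketch that only builds the query string; `build_shodan_query` is a hypothetical helper, not part of the Shodan library, and the double-underscore convention for dotted filter names is an assumption made here for convenience.

```python
def build_shodan_query(terms: str = "", **filters: object) -> str:
    """Compose a Shodan search string from free text and filter:value pairs."""
    parts = [terms] if terms else []
    for name, value in filters.items():
        # Dotted filters such as http.title are passed as http__title.
        key = name.replace("__", ".")
        text = str(value)
        # Quote values containing whitespace so they stay one token.
        if " " in text:
            text = f'"{text}"'
        parts.append(f"{key}:{text}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_shodan_query(port=3389, country="US", org="Target Corp"))
```

The resulting string can be passed to `shodan search` or pasted into the web interface.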
Censys#
- Purpose: internet-wide scan data with particular strength in TLS certificates and precise field queries.
- Query language: Lucene-style with parsed fields. `services.service_name: "HTTP"` and `parsed.names: example.com`
- Strengths: certificate history, subdomain discovery via cert parsed names, strong API.
- Free tier: 250 web searches/month; API access requires a paid plan.
Specialized tools referenced across the surveys#
| Category | Tools |
|---|---|
| Subdomain enum | Subfinder, Amass, Assetfinder, chaos-client, Findomain, Sublist3r |
| HTTP probing | httpx, aquatone, gowitness, EyeWitness |
| URL discovery | waybackurls, gau, katana, hakrawler, gospider |
| Port scanning | Nmap, Masscan, RustScan, naabu |
| Content discovery | ffuf, gobuster, feroxbuster, dirsearch |
| Email hunting | Hunter.io, theHarvester, phonebook.cz, Clearbit, Skymem |
| Username hunting | Sherlock, WhatsMyName, Maigret, Namechk, Holehe |
| Image search | Google Lens, Yandex, TinEye, PimEyes |
| Metadata | exiftool, FOCA, metagoofil, mat2 |
| Phone | PhoneInfoga |
| Breach | HaveIBeenPwned, Dehashed, h8mail, IntelX |
| Geolocation | SunCalc, Overpass Turbo, Mapillary, GeoGuessr GPT |
| Visualization | Maltego, Gephi, yEd |
| IoT | Shodan, Censys, Thingful, Kamerka |
| Dark web | IntelX, DarkOwl, Ahmia |
| Continuous monitoring | SpiderFoot, Recon-ng, custom crons, ShadowDragon |
Commercial platforms#
The surveys repeatedly reference a cluster of commercial OSINT/threat intel platforms for enterprise use: Maltego, ShadowDragon, Recorded Future, Intel 471, Flashpoint, CloudSEK XVigil, ZeroFox, DarkOwl, SpiderFoot HX, Babel Street, Dataminr, Palantir Gotham. These bundle data access, analyst tooling, and curated feeds at cost. Free alternatives exist for most individual capabilities; the commercial value is integration, freshness, and support.
15. Automation & Visualization#
Manual OSINT is unsustainable past a few targets. Automation and visualization amplify the analyst.
Automation patterns#
- Scripts orchestrating free tools — a shell script that runs subfinder → httpx → nuclei → slack notify gives continuous monitoring on a cron
- Recon-ng workspaces — persistent state across sessions
- SpiderFoot scans — scheduled or triggered by webhook
- Custom Python pipelines — `requests`, `beautifulsoup`, platform APIs, `networkx` for graphs
- Jupyter notebooks — for exploratory analysis with inline visualization
One practitioner-authored pipeline (ODIN) strings together WHOIS, reverse WHOIS, subdomain discovery, DNS records, Shodan, RDAP, email harvesting, breach lookups, paste searches, and bucket hunting into a single run against a target name and primary domain, producing a structured report. The underlying techniques are the ones in this guide; the automation just glues them together.
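For the cron-driven monitoring pattern, the useful signal is usually what changed since the last run. A minimal sketch, assuming each run saves one hostname per line (as `subfinder -silent` does); the function names are hypothetical:

```python
def load_hosts(text: str) -> set[str]:
    """Parse one-hostname-per-line output, normalizing case and trailing dots."""
    return {
        line.strip().lower().rstrip(".")
        for line in text.splitlines()
        if line.strip()
    }


def diff_snapshots(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Return hosts that appeared or disappeared between two runs."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


if __name__ == "__main__":
    yesterday = load_hosts("www.example.com\napi.example.com\n")
    today = load_hosts("www.example.com\nstaging.example.com\n")
    print(diff_snapshots(yesterday, today))
```

Alerting only on the `added` list keeps notifications low-noise; the `removed` list is worth reviewing for dangling DNS records.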
Visualization#
Human eyes excel at spotting patterns in graphs that are invisible in tables.
- Maltego — the reference tool for investigative link analysis
- Gephi — open-source network visualization for large graphs
- yEd — free diagramming with auto-layout for medium graphs
- Neo4j — graph database for queryable link analysis at scale
- D3.js / vis.js / cytoscape.js — web-based custom visualizations
- Kibana / Grafana — dashboards for continuous OSINT feeds
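Most of these tools can ingest a plain edge list, so pivots recorded during an investigation can be exported without a graph library. The sketch below writes a CSV with `Source`, `Target`, and `Label` columns, a layout Gephi's spreadsheet import is expected to accept as an edge table; the tuple format and relation names are assumptions for illustration.

```python
import csv
import io


def edges_to_csv(edges: list[tuple[str, str, str]]) -> str:
    """Serialize (source, target, relation) pivots as a deduplicated edge list."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Label"])
    # Sort for stable output so successive exports can be diffed.
    for source, target, relation in sorted(set(edges)):
        writer.writerow([source, target, relation])
    return buf.getvalue()


if __name__ == "__main__":
    pivots = [
        ("jdoe@example.com", "example.com", "registrant"),
        ("example.com", "203.0.113.10", "resolves_to"),
    ]
    print(edges_to_csv(pivots))
```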
16. AI-Assisted OSINT#
Multimodal LLMs have become legitimate OSINT accelerants. They do not replace classic techniques but compress the first-pass analysis dramatically.
Where AI helps#
- Image analysis — a multimodal model can enumerate visible clues (signage, architecture, vegetation, vehicles) and propose geolocations in seconds
- Document summarization — long PDFs, financial filings, court documents
- Translation and transliteration — foreign-language sources at scale
- Link extraction — pulling structured entities (names, dates, orgs) from unstructured text
- Writing style analysis — comparing two corpora for likely authorship
- Code understanding — interpreting obfuscated JS, reverse engineering APIs
- Query generation — proposing Google dorks, Shodan filters, or Censys queries from natural-language intent
Where AI fails#
- Hallucinated facts — models confidently fabricate names, dates, and attributions
- Stale training data — nothing past the cutoff
- Confirmation bias — will happily pretend to “find” what you ask for
- Source attribution — outputs typically lack provenance
The rule: AI outputs are hypotheses. Every claim must be independently verified against a primary source before it enters a deliverable.
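One way to enforce that rule in tooling is to make verification a property of the data rather than a habit. A minimal sketch, assuming a hypothetical `Claim` record in which a claim stays a hypothesis until at least one primary source is attached:

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A statement under investigation (hypothetical structure)."""
    text: str
    origin: str  # where the claim came from, e.g. "llm" or "analyst"
    primary_sources: list[str] = field(default_factory=list)

    @property
    def status(self) -> str:
        # No primary source attached means the claim is still a hypothesis.
        return "verified" if self.primary_sources else "hypothesis"


def deliverable_claims(claims: list[Claim]) -> list[Claim]:
    """Keep only claims that may enter a report."""
    return [claim for claim in claims if claim.status == "verified"]
```

Filtering at export time means an unverified model output cannot slip into a deliverable simply because someone forgot to check it.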
Specific tools and workflows#
- GeoGuessr GPT and similar custom GPTs — image geolocation first-pass
- ChatGPT / Claude with vision — general image and document analysis
- Recon agents — emerging autonomous agents that chain passive recon tools (early stage; reliability is poor)
- AI-powered dark web monitoring — vendors offering semantic search over crawled forum content
- AI entity extraction (IBM Watson NLU, spaCy, transformer-based NER) — scalable entity extraction from corpora
Several of the 2026 tool surveys highlight AI-enabled workflows shipping in mainstream platforms (Maltego, SpiderFoot, Hunchly, ShadowDragon) for automated enrichment, pattern detection, and natural-language query translation.
17. Operational Security#
OSINT is only passive if you do it right. Sloppy operators leak as much as they collect. Whether you are a defender running recon on your own company, an investigator looking into hostile actors, or a researcher probing sensitive communities, the target should never learn you were looking.
Attribution risks#
- IP address — the target’s analytics and logs capture it
- User-Agent — fingerprints browser, OS, sometimes tool
- Account identity — logging into LinkedIn to view a profile attaches your real name
- Cookies / localStorage — cross-session tracking
- Referer headers — leaks where you clicked from
- DNS lookups — your ISP sees every domain you resolve
- Browser fingerprint — canvas, fonts, screen size, timezone
- TLS JA3/JA4 — tooling-specific TLS fingerprints
- Timing patterns — your working hours reveal your timezone
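The User-Agent risk is easy to demonstrate without touching the network. The sketch below uses Python's standard `urllib` to show the default identifier a script would send and how to set an explicit one; the helper names are hypothetical, and changing the header does nothing about the IP, TLS, or timing signals listed above.

```python
import urllib.request


def default_user_agent() -> str:
    """Return the User-Agent urllib sends when none is set explicitly."""
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)["User-agent"]


def request_with_agent(url: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Prints something like "Python-urllib/3.x", an obvious tooling marker.
    print(default_user_agent())
    req = request_with_agent("https://example.com/", "Mozilla/5.0 (X11; Linux x86_64)")
    print(req.get_header("User-agent"))
```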
Layered defenses#
- Dedicated investigation VM — never mix with personal or work browsing. Keep it disposable (snapshots, revert after every engagement).
- Separate OS profile or container — at minimum, a segregated browser profile
- VPN or residential proxy — Mullvad, IVPN, Proton VPN, or a commercial residential proxy for sensitive investigations. Know the provider’s logging policy.
- Tor — for the most sensitive operations and dark-web access. Never log into personal accounts over Tor.
- Burner accounts — sock puppets with their own email, phone (VoIP or burner SIM), aged over time, with plausible background activity
- Hardened browser — Firefox with resist fingerprinting, uBlock Origin, Cookie AutoDelete, NoScript; or Tor Browser; or Brave with strict settings
- Screenshot and archive tools with opsec-safe settings — Hunchly is purpose-built for investigators and captures every page automatically, with hash verification
- Separate phone / hardware — for investigations where device fingerprinting matters
- No personal accounts, ever — a single Google login while “just checking something” burns the entire persona
Sock puppet hygiene#
- Create accounts well in advance; aged accounts draw less suspicion
- Use non-obvious names; avoid giveaway patterns (sequential usernames, shared avatars)
- Build plausible activity: followers, posts, reactions over weeks or months
- Different sock for different investigations — compartmentalize
- Record credentials and backstory in a secure, central store
- Never cross-contaminate between sock, work, and personal identities
- Accept that sock puppets burn — plan for rotation
Hunchly and investigation capture#
Hunchly is one of the few tools in the space purpose-built for investigative OSINT capture. It records every page an investigator visits, preserving exact HTML, screenshots, hashes, and a searchable case database. This solves two perennial problems: (1) reproducibility — you can demonstrate exactly what was on the page when you looked, and (2) note-keeping — the tool captures in real time instead of after the fact. For any investigation that may be scrutinized (legal, regulatory, publication), capture-by-default tooling is essential.
Safe data handling#
- Treat collected PII as sensitive from the moment it arrives
- Encrypt investigation data at rest
- Scrub workstations between engagements if commingling is a risk
- Understand your deliverable’s exposure — who will see this report, and does it contain information that could re-identify protected sources?
- Observe retention limits — delete when no longer needed
18. Legal & Ethical Considerations#
OSINT is legal in broad strokes but varied in detail, and ethical only when practiced with judgment.
Legal surface area#
- Computer Fraud and Abuse Act (US) and similar — unauthorized access laws. Passive consumption of public data is safe; active probing without authorization is not.
- GDPR (EU) — applies to processing personal data of EU residents. Investigators must have a lawful basis; “legitimate interest” often applies but must be documented.
- CFAA precedent — scraping public data from websites is generally legal (hiQ v. LinkedIn and progeny), but terms-of-service violations can create civil exposure.
- Platform ToS — scraping LinkedIn, Facebook, Instagram commonly violates ToS even if legal. Accounts can be banned; repeat offenders can face lawsuits.
- Anti-stalking and harassment laws — aggregating public data about an individual can become unlawful harassment depending on intent and jurisdiction.
- Breach data handling — possessing breach data is often legal, but further use (extorting victims, publishing PII) is not.
- Export controls — some OSINT tooling is regulated under dual-use export regimes.
Ethical guardrails#
- Purpose test — can you articulate why you need this intelligence and who benefits?
- Proportionality test — is the depth of collection proportional to the stakes?
- Harm test — could publishing this information enable stalking, doxing, or physical harm?
- Consent test — would the subject reasonably expect this information to be collected and used this way?
- Transparency test — could you defend your methodology openly if challenged?
Investigators routinely face situations where the legal answer and the ethical answer diverge. A finding that is legal to discover may be unethical to publish. A technique that is clearly ethical may be restricted by platform ToS. Practitioners who survive long-term in the field develop judgment, not just skills.
Defender’s perspective#
Defenders using these techniques against their own organization are on firm legal ground — you have implicit authorization over your own assets. The real risks are:
- Accidentally probing a third party — vendors, customers, partners, lookalike domains
- Storing personal data of employees — even collected from public sources, it falls under privacy law
- Tipping off attackers — noisy recon against your own infrastructure can alert adversaries that you are looking
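The first risk can be reduced with a scope gate between passive discovery and any active probing. A minimal sketch, assuming scope is defined by a set of apex domains you own; the function names are hypothetical:

```python
def in_scope(hostname: str, owned_domains: set[str]) -> bool:
    """True if hostname is an owned apex domain or a subdomain of one."""
    host = hostname.strip().lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in owned_domains)


def split_by_scope(
    hostnames: list[str], owned_domains: set[str]
) -> tuple[list[str], list[str]]:
    """Partition discovered hostnames into (in scope, out of scope)."""
    owned = {d.lower() for d in owned_domains}
    inside: list[str] = []
    outside: list[str] = []
    for hostname in hostnames:
        (inside if in_scope(hostname, owned) else outside).append(hostname)
    return inside, outside
```

Matching on the leading dot matters: `notexample.com` and `example.com.evil.net` both contain the string `example.com` but belong to someone else. A name-based check is necessary but not sufficient, since an in-scope subdomain can still resolve to a vendor's infrastructure.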
19. Quick Reference#
The five-minute external exposure check#
Run this on your own domain periodically:
# Subdomains
subfinder -d example.com -all -silent | tee subs.txt
cat subs.txt | dnsx -silent | tee live.txt
cat live.txt | httpx -silent -title -tech-detect -status-code
# Certificates
curl -s "https://crt.sh/?q=%25.example.com&output=json" | jq -r '.[].name_value' | sort -u
# Shodan
shodan search "ssl:example.com" --fields ip_str,port,product,version
# Historical URLs
echo "example.com" | waybackurls | sort -u
# Leaked secrets on GitHub
# Manual: https://github.com/search?q=%22example.com%22+password&type=code
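The certificate step returns JSON in which a single `name_value` field can hold several newline-separated names, including wildcards. When the results feed a script rather than a terminal, a small parser helps. This is a sketch assuming the JSON has already been downloaded; `names_from_crtsh` is a hypothetical helper.

```python
import json


def names_from_crtsh(raw_json: str, apex: str) -> list[str]:
    """Extract unique hostnames under `apex` from crt.sh JSON output."""
    apex = apex.lower()
    names: set[str] = set()
    for entry in json.loads(raw_json):
        # One certificate entry may list several names, one per line.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # treat a wildcard as its parent name
            if name == apex or name.endswith("." + apex):
                names.add(name)
    return sorted(names)


if __name__ == "__main__":
    sample = '[{"name_value": "www.example.com\\n*.example.com"}]'
    print(names_from_crtsh(sample, "example.com"))
```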
Seed-to-report pivot map#
NAME ──┬──▶ search engines ──▶ Wikipedia, personal sites
├──▶ LinkedIn ──▶ employer, history
├──▶ socials ──▶ aliases ──▶ Sherlock ──▶ more platforms
└──▶ images ──▶ reverse search ──▶ more accounts
EMAIL ─┬──▶ HIBP ──▶ breach list ──▶ services used
├──▶ Hunter.io ──▶ company patterns ──▶ more employees
├──▶ Gravatar ──▶ profile image
├──▶ Google ──▶ forum posts, paste hits
└──▶ historical WHOIS ──▶ owned domains
DOMAIN ┬──▶ crt.sh ──▶ subdomains
├──▶ subfinder/amass ──▶ more subdomains
├──▶ whoxy ──▶ reverse WHOIS ──▶ related domains
├──▶ Shodan hostname: ──▶ services
├──▶ DNS ──▶ MX/TXT ──▶ vendors
└──▶ wayback ──▶ historical endpoints
IP ────┬──▶ Shodan/Censys ──▶ services, vulns
├──▶ RDAP ──▶ owner, netblock
├──▶ reverse DNS ──▶ hostnames
└──▶ bgp.he.net ──▶ ASN ──▶ more IPs
IMAGE ─┬──▶ Google Lens/Yandex/TinEye ──▶ source
├──▶ exiftool ──▶ GPS, camera, timestamp
├──▶ AI analysis ──▶ location hypothesis
└──▶ visual clues ──▶ landmark/sign/architecture
Common Google dorks#
site:target.com filetype:pdf
site:target.com ext:doc OR ext:docx OR ext:xls OR ext:xlsx
site:target.com inurl:admin
site:target.com intitle:"index of"
site:target.com "password" OR "confidential"
site:github.com "target.com"
site:linkedin.com/in "Target Corp"
site:pastebin.com "target.com"
site:s3.amazonaws.com target
"@target.com"
intext:"@target.com" site:pastebin.com
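These patterns differ only in the target, so they can be generated from a template list when you review several domains. A minimal sketch using a subset of the dorks above; `dorks_for` is a hypothetical helper that only builds the query strings and does not submit them anywhere.

```python
DORK_TEMPLATES = (
    "site:{domain} filetype:pdf",
    "site:{domain} inurl:admin",
    'site:{domain} intitle:"index of"',
    'site:github.com "{domain}"',
    'site:pastebin.com "{domain}"',
    '"@{domain}"',
)


def dorks_for(domain: str) -> list[str]:
    """Fill each dork template with a normalized domain."""
    domain = domain.strip().lower()
    return [template.format(domain=domain) for template in DORK_TEMPLATES]


if __name__ == "__main__":
    for dork in dorks_for("target.com"):
        print(dork)
```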
Common Shodan queries#
hostname:target.com
ssl:"target.com"
org:"Target Corp"
port:3389 country:US org:"Target Corp"
http.title:"Login" hostname:target.com
product:"nginx" version:"1.18.0" hostname:target.com
vuln:CVE-2023-1234
has_screenshot:true port:5900
category:ics country:US
Checklist: before declaring recon complete#
- All known domains and subdomains enumerated from at least three passive sources
- Certificate transparency logs checked for last 90 days
- Historical WHOIS reviewed for original/hidden contact data
- Wayback Machine checked for historical endpoints and scrubbed content
- Shodan and Censys both queried for hostname and org
- Cloud bucket namespaces checked (S3, Spaces, Azure, GCS)
- GitHub/GitLab/Bitbucket searched for leaked secrets and configs
- Employee emails and usernames harvested
- Key employees’ breach exposure checked
- Metadata extracted from published documents
- DNS records analyzed for third-party vendors (SPF/MX/CNAME)
- Dangling DNS records screened for takeover potential
- All findings documented with source URL, timestamp, and confidence level
- Raw artifacts archived separately from analysis notes
- Opsec review: no personal accounts touched, no direct target interaction beyond what’s documented
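The documentation items in this checklist are easier to satisfy when every finding is recorded in a fixed shape at collection time. A minimal sketch, assuming a hypothetical `Finding` record that stores the source URL, a UTC timestamp, a confidence level, and a SHA-256 hash of the raw artifact so the archived copy can be matched to the note later:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

CONFIDENCE_LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class Finding:
    summary: str
    source_url: str
    collected_at: str      # ISO 8601 timestamp in UTC
    confidence: str        # one of CONFIDENCE_LEVELS
    artifact_sha256: str   # hash of the raw artifact stored elsewhere


def record_finding(
    summary: str,
    source_url: str,
    confidence: str,
    artifact: bytes,
    when: Optional[datetime] = None,
) -> Finding:
    """Create a finding, hashing the raw artifact it was derived from."""
    if confidence not in CONFIDENCE_LEVELS:
        raise ValueError(f"confidence must be one of {CONFIDENCE_LEVELS}")
    when = when or datetime.now(timezone.utc)
    digest = hashlib.sha256(artifact).hexdigest()
    return Finding(summary, source_url, when.isoformat(), confidence, digest)
```

Storing only the hash in the finding keeps analysis notes separate from raw artifacts while still linking the two.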
Closing notes#
OSINT rewards patience and punishes shortcuts. The tools listed here will all be different in two years — platforms will lock down, APIs will change, services will die, and new ones will appear. What persists is the methodology: ask a clear question, collect broadly, process rigorously, analyze honestly, cite meticulously, and protect the investigation from blowback. Every identifier is a pivot. Every fact needs a source. Every finding needs a second source.
The defender’s version of this guide is the same document read sideways: every technique an attacker can use to map your external footprint is a technique you should be running against yourself, on a schedule, with alerts. The asymmetry between attackers and defenders collapses when defenders start doing their own OSINT first.