Crawler & Discovery

The crawler (src/modules/crawler/) is responsible for discovering all pages on a domain before scanning begins.

Discovery algorithm

The crawler runs a breadth-first search (BFS) using lightweight HTTP fetches rather than Playwright. This keeps discovery fast: no JavaScript execution, no screenshots, just HTML parsing.

For each URL in the queue:

  1. Fetch the page HTML (30s timeout)
  2. Parse <a href>, <link href>, <iframe src> tags
  3. Add same-origin URLs not yet seen to the queue
  4. On the first pass only, also fetch and parse /sitemap.xml

Using HTTP fetch for discovery means JavaScript-only navigation (e.g., React Router client-side links that aren’t in the initial HTML) won’t be followed. A sitemap.xml helps bridge this gap.
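
The loop above can be sketched roughly as follows. This is a minimal illustration, not the code in discovery.ts: the names discover, extractLinks, extractSitemapUrls, collectUrls, and httpFetchText are hypothetical, links are extracted with a simple regex rather than a real HTML parser, and the fetcher is passed in as a parameter so the traversal can be exercised without a network.

```typescript
// Illustrative only: names and structure are assumptions, not the real discovery.ts.
type FetchText = (url: string) => Promise<string>;

// Plain HTTP fetch with the 30s timeout described above (needs Node 18+ for global fetch).
async function httpFetchText(url: string): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 30_000);
  try {
    const response = await fetch(url, { signal: controller.signal });
    return await response.text();
  } finally {
    clearTimeout(timer);
  }
}

// Collect capture group 1 of every match and resolve it against baseUrl.
function collectUrls(text: string, pattern: RegExp, baseUrl: string): string[] {
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    try {
      urls.push(new URL(match[1].trim(), baseUrl).href);
    } catch {
      // Skip values that do not parse as URLs.
    }
  }
  return urls;
}

// href of <a> and <link>, src of <iframe>. A regex is a simplification here.
function extractLinks(html: string, pageUrl: string): string[] {
  const tagPattern =
    /<(?:(?:a|link)\b[^>]*?\bhref|iframe\b[^>]*?\bsrc)\s*=\s*["']([^"']+)["']/gi;
  return collectUrls(html, tagPattern, pageUrl);
}

function extractSitemapUrls(xml: string, baseUrl: string): string[] {
  return collectUrls(xml, /<loc>\s*([^<]+?)\s*<\/loc>/gi, baseUrl);
}

async function discover(seed: string, fetchText: FetchText = httpFetchText): Promise<string[]> {
  const origin = new URL(seed).origin;
  const queue: string[] = [];
  const seen = new Set<string>();

  const enqueue = (candidate: string): void => {
    const url = new URL(candidate);
    url.hash = ""; // minimal normalization for this sketch
    if (url.origin === origin && !seen.has(url.href)) {
      seen.add(url.href);
      queue.push(url.href);
    }
  };

  enqueue(seed);
  // First pass: seed the queue from /sitemap.xml when one exists.
  try {
    extractSitemapUrls(await fetchText(origin + "/sitemap.xml"), origin).forEach(enqueue);
  } catch {
    // No sitemap, or it could not be fetched; carry on with link discovery.
  }

  // BFS: the queue is only appended to, so an index serves as the head pointer.
  for (let head = 0; head < queue.length; head++) {
    const pageUrl = queue[head];
    try {
      extractLinks(await fetchText(pageUrl), pageUrl).forEach(enqueue);
    } catch {
      // A page that fails to load contributes no links.
    }
  }
  return queue;
}
```

Because the sitemap is read before the first page is parsed, a URL that is reachable only through client-side navigation still enters the queue if the sitemap lists it.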

URL normalization

src/modules/crawler/url-normalizer.ts normalizes URLs before deduplication:

  • Strips fragments (#section)
  • Strips common tracking params (utm_*, fbclid, etc.)
  • Normalizes trailing slashes
  • Lowercases the hostname

This prevents the same page from being scanned twice due to URL variants.
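
As a rough illustration of these four rules, the sketch below uses the standard URL class. It is not the contents of url-normalizer.ts: the function name normalizeUrl is hypothetical, the tracking-parameter list contains only the entries named above, and the choice to strip (rather than add) a trailing slash is an assumption, since the rule above does not say which direction is used.

```typescript
// Illustrative sketch; the real url-normalizer.ts may differ.
// Only the parameters named in the text are listed; the real list is presumably longer.
const TRACKING_PARAMS = new Set<string>(["fbclid"]);

function normalizeUrl(input: string): string {
  const url = new URL(input);

  // The URL parser already lowercases http(s) hostnames; this makes the rule explicit.
  url.hostname = url.hostname.toLowerCase();

  // Strip the fragment.
  url.hash = "";

  // Strip utm_* and other tracking parameters. Keys are copied first so that
  // deleting does not disturb the iteration.
  Array.from(url.searchParams.keys()).forEach((key) => {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      url.searchParams.delete(key);
    }
  });

  // Assumption: trailing slashes are removed, except for the root path "/".
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }

  return url.href;
}

// Two variants of the same page collapse to one key.
const a = normalizeUrl("https://Example.COM/Docs/?utm_source=x&id=7&fbclid=abc#top");
const b = normalizeUrl("https://example.com/Docs?id=7");
console.log(a === b);
```

The normalized string can then serve as the key in the crawler's "seen" set, so variants never reach the queue twice.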

Same-origin enforcement

The crawler only follows links to the same origin as the seed URL. Subdomains are treated as separate origins.
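
One way to express this check is to compare URL origins, as in the sketch below. It assumes "origin" means scheme, host, and port together, as the standard URL class defines it; the helper name isSameOrigin is hypothetical.

```typescript
// Illustrative sketch: compares origins (scheme + host + port) of a link and the seed.
function isSameOrigin(candidate: string, seed: string): boolean {
  try {
    // Relative links are resolved against the seed before comparing.
    return new URL(candidate, seed).origin === new URL(seed).origin;
  } catch {
    return false; // unparseable links are never followed
  }
}

const seedUrl = "https://example.com/";
console.log(isSameOrigin("https://example.com/pricing", seedUrl)); // true
console.log(isSameOrigin("https://blog.example.com/", seedUrl));   // false: subdomain
console.log(isSameOrigin("/contact", seedUrl));                    // true: relative link
```

Under this definition a different scheme or port would also count as a different origin, not only a different subdomain.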

Crawl delay

CRAWL_DELAY_MS (default 200ms) adds a delay between discovery fetches to avoid hammering the target server.
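
A minimal sketch of how such a delay might be applied, assuming CRAWL_DELAY_MS is read from an environment-style map and that fetches run one at a time. The helper names readCrawlDelayMs, sleep, and fetchSequentially are hypothetical.

```typescript
// Assumption: CRAWL_DELAY_MS arrives as a string in an environment-style map.
function readCrawlDelayMs(env: Record<string, string | undefined>): number {
  const raw = env["CRAWL_DELAY_MS"];
  if (raw === undefined || raw.trim() === "") return 200; // documented default
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : 200;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Fetch URLs one at a time, pausing between consecutive requests.
async function fetchSequentially<T>(
  urls: string[],
  fetchOne: (url: string) => Promise<T>,
  delayMs: number,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no wait before the first request
    results.push(await fetchOne(urls[i]));
  }
  return results;
}
```

In a Node.js process the delay could be obtained with readCrawlDelayMs(process.env). Invalid or negative values fall back to the default in this sketch.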

Issue tracking across crawls

src/modules/crawler/issue-tracker.ts handles cross-crawl issue state:

  • Issues are fingerprinted by (pageUrl + ruleId + elementSelector) hash
  • On crawl completion, each issue’s hash is compared against the previous crawl
  • Hashes from the previous crawl not seen in the new crawl → marked fixed
  • New hashes not seen before → marked open (new)
  • Hashes seen in both → carry over their existing status
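
The comparison above can be sketched as a small reconciliation function. This is an illustration under stated assumptions, not the code in issue-tracker.ts: SHA-256 is an assumed hash algorithm, the names fingerprint and reconcile are hypothetical, and "acknowledged" is a made-up status standing in for whatever triage state an issue might carry over.

```typescript
import { createHash } from "node:crypto";

// "acknowledged" is a hypothetical status used only to demonstrate carry-over.
type IssueStatus = "open" | "fixed" | "acknowledged";

interface IssueKey {
  pageUrl: string;
  ruleId: string;
  elementSelector: string;
}

// Assumption: SHA-256 over the three fields, joined with a separator that
// cannot appear inside them, so ("ab", "c") and ("a", "bc") never collide.
function fingerprint(issue: IssueKey): string {
  return createHash("sha256")
    .update([issue.pageUrl, issue.ruleId, issue.elementSelector].join("\u0000"))
    .digest("hex");
}

// previous: hash -> status after the last crawl; currentHashes: hashes seen in this crawl.
function reconcile(
  previous: Map<string, IssueStatus>,
  currentHashes: string[],
): Map<string, IssueStatus> {
  const current = new Set<string>(currentHashes);
  const next = new Map<string, IssueStatus>();

  previous.forEach((status, hash) => {
    // Seen in both crawls: keep the status. Missing from the new crawl: fixed.
    next.set(hash, current.has(hash) ? status : "fixed");
  });
  current.forEach((hash) => {
    // Never seen before: a new, open issue.
    if (!previous.has(hash)) next.set(hash, "open");
  });
  return next;
}
```

Because the fingerprint includes the element selector, an issue whose element moves to a different selector would be treated as one fixed issue plus one new issue in this sketch.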

Key files

| File | Responsibility |
| --- | --- |
| src/modules/crawler/discovery.ts | BFS crawl implementation |
| src/modules/crawler/url-normalizer.ts | URL normalization and deduplication |
| src/modules/crawler/issue-tracker.ts | Cross-crawl issue lifecycle |
| src/modules/crawler/types.ts | Crawler type definitions |
| src/worker/processors/crawl.processor.ts | BullMQ processor for crawl-discovery queue |
