OrbTop

Sitemap Walker Pro — Recursive URL Discovery

SEO TOOLS | DEVELOPER TOOLS

Sitemap Walker Pro (Recursive URL Discovery + Filters)

Walk sitemaps and sitemap-index trees recursively. Falls back to robots.txt when only a site root is given. Filters URLs by glob, regex, last-modified date, and priority. Emits per-URL rows tagged with optional chunk indexes for parallel downstream actors.


Sitemap Walker Pro Features

  • Recursive sitemap-index walk handles real-world sites that fan out across nested indexes (most large sites do), with cycle detection and a depth cap of 10 levels.
  • robots.txt fallback discovers the sitemap location when you pass a bare site root.
  • Auto-tries /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, and /sitemaps.xml when no sitemap is given.
  • Glob and regex include / exclude patterns. Wrap a pattern in /.../ to treat it as regex; everything else is picomatch glob.
  • lastmodSince and priorityMin filters for incremental crawls and SEO triage.
  • Optional chunkSize tags each row with a 0-based chunk index so you can fan out to parallel downstream actors without re-implementing pagination.
  • Captures hreflang alternates, image / video / news sitemap metadata when publishers ship them.
  • Pure HTTP, no browser, no proxies. Polite 250 ms courtesy delay per host with automatic 429 / 5xx retry.
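
The discovery behaviour in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the actor's source: the helper names `candidate_sitemaps` and `sitemaps_from_robots` are hypothetical, and the order in which the actor actually probes these locations is not stated here.

```python
from urllib.parse import urljoin

# Conventional paths named in the feature list above.
CANDIDATE_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/sitemap-index.xml",
    "/sitemaps.xml",
]


def candidate_sitemaps(site_root: str) -> list:
    """Return the conventional sitemap URLs to probe for a bare site root."""
    return [urljoin(site_root, path) for path in CANDIDATE_PATHS]


def sitemaps_from_robots(robots_txt: str) -> list:
    """Extract the values of Sitemap: directives from robots.txt text."""
    found = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

Both helpers are pure functions over strings, so the HTTP layer (courtesy delay, retries) can be tested separately.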

Who Uses Sitemap Walker Data?

  • SEO teams — discover the canonical URL set for a site before crawling, deduping, or running rich-result audits.
  • Content engineering — feed downstream Lighthouse, screenshot, or structured-data validator runs with the canonical URL list.
  • Migration QA — diff the URL set before and after a CMS migration, with lastmodSince for incremental snapshots.
  • AI training-set curators — pull the publisher-blessed URL list straight from the sitemap, instead of crawling and guessing.
  • Competitive research — see exactly which content competitors mark up for indexing, and how often each section updates.

How Sitemap Walker Pro Works

  1. Pass in a list of seed URLs. Each seed can be a sitemap directly, a sitemap-index, or a bare site root.
  2. For each seed the actor either fetches the sitemap directly, tries the standard /sitemap.xml paths, or (when fallbackToRobotsTxt is on) parses /robots.txt for Sitemap: directives.
  3. The walker descends into <sitemapindex> entries recursively up to 10 levels deep, dropping cycles via a shared seen-set across all seeds.
  4. URLs are filtered in order: includePatterns → excludePatterns → lastmodSince → priorityMin. Chunk indexes are applied last when chunkSize is set.
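
Step 3 can be sketched as a recursive generator with a shared seen-set and a depth cap. This is a hedged illustration rather than the actor's code: `walk` and the injected `fetch` callable are hypothetical names, the sketch assumes sitemaps that declare the standard sitemaps.org namespace, and it assumes the seed counts as depth 0 (how the actor counts levels is not specified).

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_DEPTH = 10  # depth cap stated in step 3; seed counted as depth 0 (assumption)


def walk(url: str, fetch: Callable[[str], bytes], seen: set,
         depth: int = 0) -> Iterator[dict]:
    """Yield one row per <url> entry, descending into <sitemapindex> children."""
    if depth > MAX_DEPTH or url in seen:
        return  # depth cap hit, or this sitemap was already visited (cycle)
    seen.add(url)
    root = ET.fromstring(fetch(url))
    if root.tag.endswith("sitemapindex"):
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text and loc.text.strip():
                yield from walk(loc.text.strip(), fetch, seen, depth + 1)
        return
    for entry in root.findall("sm:url", NS):
        loc = (entry.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        if not loc:
            continue
        priority = entry.findtext("sm:priority", namespaces=NS)
        yield {
            "url": loc,
            "lastmod": entry.findtext("sm:lastmod", namespaces=NS),
            "priority": float(priority) if priority else None,
            "sourceSitemap": url,
        }
```

Passing the same `seen` set to every seed's `walk` call is what makes the cycle check global: a child sitemap reachable from two seeds is parsed only once.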

Input

{
  "seeds": ["https://www.apify.com/sitemap.xml"],
  "fallbackToRobotsTxt": true,
  "recurseSitemapIndex": true,
  "includePatterns": ["**/blog/**"],
  "excludePatterns": ["/draft/"],
  "lastmodSince": "2026-01-01",
  "priorityMin": 0.5,
  "chunkSize": 100,
  "maxUrls": 0,
  "maxItems": 15
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| seeds | array | required | Sitemap URLs OR site roots. Site roots fall back to robots.txt when fallbackToRobotsTxt is on. |
| fallbackToRobotsTxt | boolean | true | When a seed lacks an explicit sitemap, parse /robots.txt for Sitemap: directives. |
| recurseSitemapIndex | boolean | true | Walk into nested <sitemapindex> entries (most large sites use these). |
| includePatterns | array | | Glob (**/blog/**) or /regex/. Empty = include all. |
| excludePatterns | array | | Same syntax, applied after include. Exclude wins. |
| lastmodSince | string | | ISO date. Only emit URLs with lastmod >= this. |
| priorityMin | number | | Only emit URLs with priority >= this (0.0-1.0). |
| chunkSize | integer | | Group output into chunks of this size; tag each row with a chunk index. |
| maxUrls | integer | 0 | Hard cap on emitted URLs. 0 = unlimited. |
| maxItems | integer | 15 | Apify-tester safety cap. Override (or set to 0) for production batches. |

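The effective cap is the smaller of maxUrls and maxItems when both are non-zero. A value of 0 removes that field's cap, so the sample input above (maxUrls 0, maxItems 15) stops at 15 rows.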
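
The filter order (includePatterns → excludePatterns → lastmodSince → priorityMin, chunk index last) and the cap rule can be sketched as follows. This is an approximation under stated assumptions, not the actor's implementation: the function names are hypothetical, Python's `fnmatch` stands in for picomatch (its `*` also matches across `/`, which picomatch's does not), rows with no lastmod or priority are assumed to be dropped when the matching filter is set, and lastmod values are assumed to start with a full YYYY-MM-DD date.

```python
import fnmatch
import re


def compile_pattern(pattern: str):
    """Return a URL predicate: /.../ is a regex, anything else a glob."""
    if len(pattern) > 2 and pattern.startswith("/") and pattern.endswith("/"):
        regex = re.compile(pattern[1:-1])
        return lambda url: regex.search(url) is not None
    # Stand-in for picomatch: fnmatch lets * cross "/" boundaries.
    return lambda url: fnmatch.fnmatchcase(url, pattern)


def effective_cap(max_urls: int, max_items: int) -> int:
    """Smaller of the non-zero caps; 0 means no cap at all."""
    caps = [c for c in (max_urls, max_items) if c > 0]
    return min(caps) if caps else 0


def apply_filters(rows, include=(), exclude=(), lastmod_since=None,
                  priority_min=None, chunk_size=0, cap=0):
    """Filter in the documented order, then tag chunk indexes last."""
    inc = [compile_pattern(p) for p in include]
    exc = [compile_pattern(p) for p in exclude]
    kept = []
    for row in rows:
        url = row["url"]
        if inc and not any(match(url) for match in inc):
            continue
        if any(match(url) for match in exc):
            continue  # exclude wins over include
        # Assumption: rows missing lastmod / priority are dropped when
        # the matching filter is set; the actor may treat them differently.
        if lastmod_since and (row.get("lastmod") or "")[:10] < lastmod_since:
            continue
        if priority_min is not None and (row.get("priority") is None
                                         or row["priority"] < priority_min):
            continue
        if cap and len(kept) >= cap:
            break
        chunk = len(kept) // chunk_size if chunk_size else 0
        kept.append({**row, "chunk": chunk})
    return kept
```

Because the chunk index is computed from the position among kept rows, chunk numbers stay dense (0, 1, 2, ...) regardless of how many rows the filters discard.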


Sitemap Walker Pro Output Fields

{
  "url": "https://example.com/blog/post-1",
  "lastmod": "2026-04-15T10:00:00Z",
  "changefreq": "weekly",
  "priority": 0.8,
  "sourceSitemap": "https://example.com/sitemap.xml",
  "chunk": 0,
  "alternates":  ["es=https://example.com/es/blog/post-1"],
  "imageRefs":   ["https://example.com/img.jpg;Hero shot"],
  "videoRefs":   [],
  "newsTitle":           null,
  "newsPublication":     null,
  "newsPublicationDate": null,
  "newsLanguage":        null,
  "scrapedAt": "2026-04-30T18:00:00Z"
}
| Field | Type | Description |
| --- | --- | --- |
| url | string | Discovered URL. |
| lastmod | string | Last-modified timestamp from the sitemap (ISO string). |
| changefreq | string | always, hourly, daily, weekly, monthly, yearly, or never. |
| priority | number | Priority hint from the sitemap (0.0-1.0). |
| sourceSitemap | string | URL of the sitemap that contained this entry. |
| chunk | number | 0-based chunk index when chunkSize is set; 0 otherwise. |
| alternates | array | Pipe-joined hreflang=href entries from xhtml:link rel=alternate. |
| imageRefs | array | Pipe-joined loc;title entries from image sitemaps. |
| videoRefs | array | Pipe-joined title;content_loc entries from video sitemaps. |
| newsTitle | string | Google News sitemap title (when present). |
| newsPublication | string | Google News publication name (when present). |
| newsPublicationDate | string | Google News publication date (when present). |
| newsLanguage | string | Google News language code (when present). |
| scrapedAt | string | Timestamp when this URL was discovered. |
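
A downstream consumer might group rows by their chunk field and unpack the alternates entries. The sketch below works from the sample row shown above, where alternates is a list of hreflang=href strings; the helper names are hypothetical and not part of the actor.

```python
from collections import defaultdict


def group_by_chunk(rows) -> dict:
    """Map each chunk index to the list of URLs tagged with it."""
    chunks = defaultdict(list)
    for row in rows:
        chunks[row.get("chunk", 0)].append(row["url"])
    return dict(chunks)


def parse_alternates(entries) -> dict:
    """Split 'hreflang=href' strings into a {hreflang: href} mapping."""
    result = {}
    for entry in entries:
        # Split on the first "=" only, since the href may contain "=" itself.
        lang, sep, href = entry.partition("=")
        if sep:
            result[lang] = href
    return result
```

Each value in the `group_by_chunk` result can then be handed to one parallel downstream run.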

Pricing

Token charge, functionally free. Apify rejects pay-per-event (PPE) prices of exactly $0, so the per-URL price is the smallest practical floor.

| Event | Price |
| --- | --- |
| Actor start | $0.10 |
| Per discovered URL | $0.0001 |

| Volume | Cost |
| --- | --- |
| 100 URLs | $0.11 |
| 1,000 URLs | $0.20 |
| 10,000 URLs | $1.10 |
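
The volume figures follow from one start fee plus the per-URL price. A small sketch for estimating other volumes, using `Decimal` to avoid floating-point rounding (the function name `run_cost` is illustrative):

```python
from decimal import Decimal

START_FEE = Decimal("0.10")    # actor start event
PER_URL = Decimal("0.0001")    # per discovered URL


def run_cost(url_count: int) -> Decimal:
    """Cost in USD of one run that discovers url_count URLs."""
    return START_FEE + PER_URL * url_count
```

For example, `run_cost(1_000)` equals `Decimal("0.20")`, matching the 1,000-URL row.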

This actor is the cheap discovery primitive that pairs with paid downstream actors. Walk sitemaps liberally.


Limits

  • maxItems defaults to 15 — sized for the Apify tester's 5-minute timeout. Override for production batches by setting maxItems higher and / or relying on maxUrls.
  • The actor does not fetch the URLs it discovers. Pair with a downstream actor (HTML scraper, Lighthouse, screenshot, structured-data validator) for that.
  • Sitemap-index recursion caps at 10 levels deep. Cycles are detected via a shared seen-set across all seeds.
  • robots.txt Disallow: rules are not enforced. Sitemaps are explicitly the publisher's invitation to fetch the listed URLs.
  • Crawl-delay: directives are not parsed. The walker uses a fixed 250 ms courtesy delay between requests, plus automatic 429 / 5xx retry handling.
  • Some publishers compress sitemaps as .xml.gz — these are auto-decompressed.
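
Handling .xml.gz sitemaps typically comes down to checking for the gzip magic bytes before parsing. The sketch below shows one common way to do that; it is an assumption about the approach, not the actor's code, and `maybe_decompress` is a hypothetical name.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_decompress(body: bytes) -> bytes:
    """Return the decompressed body if it is gzip data, else the body as-is."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body
```

Sniffing the bytes rather than trusting the file extension also covers servers that return plain XML from a .gz URL.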

Related Actors

  • Structured Data Validator Pro — feed the discovered URL list straight into structured-data audits.
  • SSL & Security Headers Checker — discover URLs for a site, then probe each one's TLS and header posture.
  • Angular SSR State Extractor — discover an Angular site's URLs, then pull each page's TransferState payload.

Need More Features?

Need a different output shape, a warehouse integration, or a pre-wired sitemap → fetch → validate chain? File an issue or get in touch.

Why Use Sitemap Walker Pro?

  • Functionally free: $0.0001 per URL. Walk a million-URL sitemap for about $100 and stop arguing about cost.
  • Recursive index walk + robots.txt fallback — handles the messy real world. Most other walkers handle one sitemap and call it a day.
  • Chunked output — tag each row with a 0-based chunk index and fan out to N parallel downstream actors without writing a coordinator.

Built by OrbTop.