Sitemap Walker Pro — Recursive URL Discovery
Sitemap Walker Pro (Recursive URL Discovery + Filters)
Walk sitemaps and sitemap-index trees recursively. Falls back to robots.txt when only a site root is given. Filters URLs by glob, regex, last-modified date, and priority. Emits per-URL rows tagged with optional chunk indexes for parallel downstream actors.
Sitemap Walker Pro Features
- Recursive sitemap-index walk: handles real-world sites that fan out across nested indexes (most large sites do), recursing up to 10 levels deep with cycle detection.
- robots.txt fallback discovers the sitemap location when you pass a bare site root.
- Auto-tries /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, and /sitemaps.xml when no sitemap is given.
- Glob and regex include / exclude patterns. Wrap a pattern in /.../ to treat it as regex; everything else is picomatch glob.
- lastmodSince and priorityMin filters for incremental crawls and SEO triage.
- Optional chunkSize tags each row with a 0-based chunk index so you can fan out to parallel downstream actors without re-implementing pagination.
- Captures hreflang alternates and image / video / news sitemap metadata when publishers ship them.
- Pure HTTP, no browser, no proxies. Polite 250 ms courtesy delay per host with automatic 429 / 5xx retry.
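The discovery fallbacks in the list above can be sketched roughly as follows. This is a minimal illustration, not the actor's code: the helper names `candidate_sitemap_urls` and `sitemaps_from_robots` are hypothetical, and the order in which the actor tries each fallback is not stated in this document.

```python
from urllib.parse import urljoin

# Candidate paths named in the feature list above.
CANDIDATE_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/sitemap-index.xml",
    "/sitemaps.xml",
]


def candidate_sitemap_urls(site_root: str) -> list[str]:
    """Build the sitemap URLs to probe for a bare site root (hypothetical helper)."""
    return [urljoin(site_root, path) for path in CANDIDATE_PATHS]


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the values of `Sitemap:` lines from a robots.txt body.

    Assumption: the field name is matched case-insensitively and anything
    after a `#` is treated as a comment.
    """
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        # Split at the first colon only, so the URL's own colons survive.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

Every URL returned by either helper would then be fetched and walked like a seed that was passed in directly.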
Who Uses Sitemap Walker Data?
- SEO teams: discover the canonical URL set for a site before crawling, deduping, or running rich-result audits.
- Content engineering: feed downstream Lighthouse, screenshot, or structured-data validator runs with the canonical URL list.
- Migration QA: diff the URL set before and after a CMS migration, with lastmodSince for incremental snapshots.
- AI training-set curators: pull the publisher-blessed URL list straight from the sitemap, instead of crawling and guessing.
- Competitive research: see exactly which content competitors mark up for indexing, and how often each section updates.
How Sitemap Walker Pro Works
- Pass in a list of seed URLs. Each seed can be a sitemap directly, a sitemap-index, or a bare site root.
- For each seed the actor either fetches the sitemap directly, tries the standard /sitemap.xml paths, or (when fallbackToRobotsTxt is on) parses /robots.txt for Sitemap: directives.
- The walker descends into <sitemapindex> entries recursively up to 10 levels deep, dropping cycles via a shared seen-set across all seeds.
- URLs are filtered in order: includePatterns → excludePatterns → lastmodSince → priorityMin. Chunk indexes are applied last when chunkSize is set.
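The recursive descent with a shared seen-set can be sketched as below. This is an illustration under stated assumptions, not the actor's implementation: `walk` and the injected `fetch` callable are hypothetical, the seed is counted as depth 0, and HTTP, retries, and the courtesy delay are left out.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_DEPTH = 10  # recursion cap described above


def walk(
    url: str,
    fetch: Callable[[str], str],
    seen: set[str],
    depth: int = 0,
) -> Iterator[tuple[str, str]]:
    """Yield (page_url, source_sitemap) pairs, descending into sitemap indexes.

    `seen` is shared by the caller across all seeds, so a sitemap that is
    reachable twice (or that points back at itself) is fetched only once.
    """
    if url in seen or depth > MAX_DEPTH:
        return  # cycle, duplicate, or too deep
    seen.add(url)
    root = ET.fromstring(fetch(url))
    if root.tag.endswith("sitemapindex"):
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                yield from walk(loc.text.strip(), fetch, seen, depth + 1)
    else:
        for loc in root.findall("sm:url/sm:loc", NS):
            if loc.text:
                yield loc.text.strip(), url
```

Passing `fetch` in as a parameter keeps the sketch testable with an in-memory dictionary of documents; a real run would supply an HTTP client instead.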
Input
{
  "seeds": ["https://www.apify.com/sitemap.xml"],
  "fallbackToRobotsTxt": true,
  "recurseSitemapIndex": true,
  "includePatterns": ["**/blog/**"],
  "excludePatterns": ["/draft/"],
  "lastmodSince": "2026-01-01",
  "priorityMin": 0.5,
  "chunkSize": 100,
  "maxUrls": 0,
  "maxItems": 15
}
| Field | Type | Default | Description |
|---|---|---|---|
| seeds | array | required | Sitemap URLs OR site roots. Site roots fall back to robots.txt when fallbackToRobotsTxt is on. |
| fallbackToRobotsTxt | boolean | true | When a seed lacks an explicit sitemap, parse /robots.txt for Sitemap: directives. |
| recurseSitemapIndex | boolean | true | Walk into nested <sitemapindex> entries (most large sites use these). |
| includePatterns | array | — | Glob (**/blog/**) or /regex/. Empty = include all. |
| excludePatterns | array | — | Same syntax, applied after include. Exclude wins. |
| lastmodSince | string | — | ISO date. Only emit URLs with lastmod >= this. |
| priorityMin | number | — | Only emit URLs with priority >= this (0.0-1.0). |
| chunkSize | integer | — | Group output into chunks of this size; tag each row with a chunk index. |
| maxUrls | integer | 0 | Hard cap on emitted URLs. 0 = unlimited. |
| maxItems | integer | 15 | Apify-tester safety cap. Override (or set to 0) for production batches. |
The effective cap is the smaller of maxUrls and maxItems when both are set to a non-zero value; a value of 0 disables that cap.
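The filter fields in the table can be read as a small pipeline. The sketch below is one possible interpretation, not the actor's code. It rests on several assumptions: Python's `fnmatch` stands in for picomatch (the two glob dialects differ in details), patterns are matched against the full URL string, and rows with no lastmod or priority are dropped once the matching filter is set. The names `compile_pattern`, `parse_date`, and `keep` are hypothetical.

```python
import fnmatch
import re
from datetime import datetime, timezone


def compile_pattern(pattern: str):
    """Return a predicate: /.../ is treated as regex, anything else as glob."""
    if len(pattern) > 2 and pattern.startswith("/") and pattern.endswith("/"):
        rx = re.compile(pattern[1:-1])
        return lambda url: rx.search(url) is not None
    # Stand-in for picomatch; assumption, not the actor's matcher.
    return lambda url: fnmatch.fnmatchcase(url, pattern)


def parse_date(value: str) -> datetime:
    """Parse an ISO date or timestamp into a timezone-aware UTC datetime."""
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def keep(row: dict, include, exclude, lastmod_since=None, priority_min=None) -> bool:
    """Apply the filters in the documented order: include, exclude, lastmod, priority."""
    url = row["url"]
    if include and not any(match(url) for match in include):
        return False  # an empty include list means "include all"
    if any(match(url) for match in exclude):
        return False  # exclude wins over include
    if lastmod_since is not None:
        # Assumption: rows with no lastmod fail the filter.
        if not row.get("lastmod") or parse_date(row["lastmod"]) < lastmod_since:
            return False
    if priority_min is not None:
        # Assumption: rows with no priority fail the filter.
        if row.get("priority") is None or row["priority"] < priority_min:
            return False
    return True
```

With the sample input above, `include` would hold `compile_pattern("**/blog/**")` and `exclude` would hold `compile_pattern("/draft/")`, which the slash rule turns into the regex `draft`.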
Sitemap Walker Pro Output Fields
{
  "url": "https://example.com/blog/post-1",
  "lastmod": "2026-04-15T10:00:00Z",
  "changefreq": "weekly",
  "priority": 0.8,
  "sourceSitemap": "https://example.com/sitemap.xml",
  "chunk": 0,
  "alternates": ["es=https://example.com/es/blog/post-1"],
  "imageRefs": ["https://example.com/img.jpg;Hero shot"],
  "videoRefs": [],
  "newsTitle": null,
  "newsPublication": null,
  "newsPublicationDate": null,
  "newsLanguage": null,
  "scrapedAt": "2026-04-30T18:00:00Z"
}
| Field | Type | Description |
|---|---|---|
| url | string | Discovered URL. |
| lastmod | string | Last-modified timestamp from the sitemap (ISO string). |
| changefreq | string | always, hourly, daily, weekly, monthly, yearly, or never. |
| priority | number | Priority hint from the sitemap (0.0-1.0). |
| sourceSitemap | string | URL of the sitemap that contained this entry. |
| chunk | number | 0-based chunk index when chunkSize is set; 0 otherwise. |
| alternates | array | Pipe-joined hreflang=href entries from xhtml:link rel=alternate. |
| imageRefs | array | Pipe-joined loc;title entries from image sitemaps. |
| videoRefs | array | Pipe-joined title;content_loc entries from video sitemaps. |
| newsTitle | string | Google News sitemap title (when present). |
| newsPublication | string | Google News publication name (when present). |
| newsPublicationDate | string | Google News publication date (when present). |
| newsLanguage | string | Google News language code (when present). |
| scrapedAt | string | Timestamp when this URL was discovered. |
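A downstream consumer might regroup rows by their chunk index and unpack the composite alternates entries. The sketch below assumes rows shaped like the sample JSON above, where alternates is a list of hreflang=href strings. The helper names `group_by_chunk` and `parse_alternates` are hypothetical.

```python
from collections import defaultdict


def group_by_chunk(rows: list[dict]) -> dict[int, list[str]]:
    """Bucket URLs by the row's 0-based chunk index for parallel fan-out."""
    chunks = defaultdict(list)
    for row in rows:
        chunks[row.get("chunk", 0)].append(row["url"])
    return dict(chunks)


def parse_alternates(entries: list[str]) -> dict[str, str]:
    """Turn 'es=https://...' entries into a {hreflang: href} mapping."""
    result = {}
    for entry in entries:
        # Split at the first '=' only; hrefs may contain '=' in query strings.
        lang, sep, href = entry.partition("=")
        if sep:
            result[lang] = href
    return result
```

Each bucket returned by `group_by_chunk` could then be handed to a separate downstream run.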
Pricing
Token charge, functionally free. Apify rejects truly $0 pay-per-event (PPE) events, so the per-URL price is the smallest practical floor.
| Event | Price |
|---|---|
| Actor start | $0.10 |
| Per discovered URL | $0.0001 |
| Volume | Cost |
|---|---|
| 100 URLs | $0.11 |
| 1,000 URLs | $0.20 |
| 10,000 URLs | $1.10 |
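The volume figures follow from the two event prices: one start fee plus the per-URL price times the number of discovered URLs. A small sketch of that arithmetic, where `run_cost` is a hypothetical helper and `Decimal` avoids floating-point drift:

```python
from decimal import Decimal

START_FEE = Decimal("0.10")   # "Actor start" event from the price table
PER_URL = Decimal("0.0001")   # "Per discovered URL" event from the price table


def run_cost(urls: int) -> Decimal:
    """Estimated cost in USD for one run that discovers `urls` URLs."""
    return START_FEE + PER_URL * urls
```

For example, `run_cost(1000)` is 0.10 + 1000 × 0.0001 = 0.20, which matches the 1,000-URL row.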
This actor is the cheap discovery primitive that pairs with paid downstream actors. Walk sitemaps liberally.
Limits
- maxItems defaults to 15, sized for the Apify tester's 5-minute timeout. Override for production batches by setting maxItems higher and / or relying on maxUrls.
- The actor does not fetch the URLs it discovers. Pair with a downstream actor (HTML scraper, Lighthouse, screenshot, structured-data validator) for that.
- Sitemap-index recursion caps at 10 levels deep. Cycles are detected via a shared seen-set across all seeds.
- robots.txt Disallow: rules are not enforced. Sitemaps are explicitly the publisher's invitation to fetch the listed URLs.
- Crawl-delay: directives are not parsed. The walker uses a fixed 250 ms courtesy delay between requests, plus automatic 429 / 5xx retry handling.
- Some publishers compress sitemaps as .xml.gz; these are auto-decompressed.
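Transparent decompression of .xml.gz sitemaps could look like the sketch below. How the actor actually detects compression is not stated here; this version assumes a check on the gzip magic bytes rather than the file extension, and `decode_sitemap` is a hypothetical name.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_sitemap(body: bytes) -> str:
    """Return sitemap XML as text, decompressing gzip bodies when detected."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    # Assumption: the sitemap is UTF-8, as the sitemaps.org protocol requires.
    return body.decode("utf-8")
```

Checking the payload rather than the URL also covers servers that serve gzip data under a plain .xml path.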
Related Actors
- Structured Data Validator Pro — feed the discovered URL list straight into structured-data audits.
- SSL & Security Headers Checker — discover URLs for a site, then probe each one's TLS and header posture.
- Angular SSR State Extractor — discover an Angular site's URLs, then pull each page's TransferState payload.
Need More Features?
Need a different output shape, a warehouse integration, or a pre-wired sitemap → fetch → validate chain? File an issue or get in touch.
Why Use Sitemap Walker Pro?
- Functionally free — $0.0001 per URL. Walk a million-URL sitemap for $100 and stop arguing about cost.
- Recursive index walk + robots.txt fallback — handles the messy real world. Most other walkers handle one sitemap and call it a day.
- Chunked output — tag each row with a 0-based chunk index and fan out to N parallel downstream actors without writing a coordinator.
Built by OrbTop.