Sitemap URL Extractor Pro — Sitemap Walker & XML Sitemap Scraper (Recursive URL Discovery + Filters)

Extract URLs from any sitemap. This sitemap URL extractor walks sitemap.xml and sitemap-index trees recursively, parses XML sitemaps and gzipped .xml.gz sitemaps, and falls back to robots.txt sitemap discovery when only a site root is given. Filters URLs by glob, regex, last-modified date, and priority. Emits per-URL rows tagged with optional chunk indexes for parallel downstream actors.

Sitemap Walker Pro Features

Recursive sitemap-index walk that handles real-world sites fanning out across nested indexes (most large sites do), with cycle detection and a recursion cap of 10 levels.
robots.txt fallback discovers the sitemap location when you pass a bare site root.
Auto-tries /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, and /sitemaps.xml when no sitemap is given.
Glob and regex include / exclude patterns. Wrap a pattern in /.../ to treat it as regex; everything else is picomatch glob.
lastmodSince and priorityMin filters for incremental crawls and SEO triage.
Optional chunkSize tags each row with a 0-based chunk index so you can fan out to parallel downstream actors without re-implementing pagination.
Captures hreflang alternates, image / video / news sitemap metadata when publishers ship them.
Pure HTTP, no browser, no proxies. Polite 250 ms courtesy delay per host with automatic 429 / 5xx retry.
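
As a rough illustration of the robots.txt fallback and the auto-tried paths listed above, the sketch below builds the candidate sitemap list for a bare site root. The function names are hypothetical, and trying robots.txt entries before the standard paths is an assumption, not the actor's documented order.

```python
from typing import List, Optional
from urllib.parse import urljoin

# Standard locations tried when no sitemap is given (from the feature list).
FALLBACK_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml", "/sitemaps.xml"]


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs named in `Sitemap:` directives, in file order."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split on the first colon only,
        # because the URL itself contains colons.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_sitemaps(site_root: str, robots_txt: Optional[str]) -> List[str]:
    """Candidates for a bare site root (assumed order: robots.txt, then standard paths)."""
    candidates = sitemaps_from_robots(robots_txt) if robots_txt else []
    candidates += [urljoin(site_root, path) for path in FALLBACK_PATHS]
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(candidates))
```

Passing `None` for `robots_txt` mirrors a run with `fallbackToRobotsTxt` turned off: only the four standard paths are returned.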

Who Uses Sitemap Walker Data?

SEO teams — discover the canonical URL set for a site before crawling, deduping, or running rich-result audits.
Content engineering — feed downstream Lighthouse, screenshot, or structured-data validator runs with the canonical URL list.
Migration QA — diff the URL set before and after a CMS migration, with lastmodSince for incremental snapshots.
AI training-set curators — pull the publisher-blessed URL list straight from the sitemap, instead of crawling and guessing.
Competitive research — see exactly which content competitors mark up for indexing, and how often each section updates.

How To Extract URLs From A Sitemap

Pass in a list of seed URLs. Each seed can be a sitemap directly, a sitemap-index, or a bare site root.
For each seed, the actor fetches the sitemap directly, tries the standard paths (/sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, /sitemaps.xml), or (when fallbackToRobotsTxt is on) parses /robots.txt for Sitemap: directives.
The walker descends into <sitemapindex> entries recursively up to 10 levels deep, dropping cycles via a shared seen-set across all seeds.
URLs are filtered in order: includePatterns → excludePatterns → lastmodSince → priorityMin. Chunk indexes are applied last when chunkSize is set.
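
The filter order in the last step can be sketched as follows. This is an illustration under stated assumptions, not the actor's code: `fnmatch` stands in for picomatch (their glob semantics differ), rows without `lastmod` are dropped when `lastmodSince` is set, a missing `priority` falls back to the sitemap protocol default of 0.5, and the chunk index is taken as the row's position divided by `chunkSize`. All function names are hypothetical.

```python
import fnmatch
import re
from datetime import datetime, timezone


def compile_pattern(pattern):
    """Return a predicate: /.../ is a regex search, anything else a glob."""
    if len(pattern) > 2 and pattern.startswith("/") and pattern.endswith("/"):
        regex = re.compile(pattern[1:-1])
        return lambda url: regex.search(url) is not None
    # Assumption: fnmatch approximates picomatch; its `*` also matches `/`.
    return lambda url: fnmatch.fnmatchcase(url, pattern)


def parse_date(value):
    """Parse an ISO date or timestamp into a timezone-aware UTC datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def filter_and_chunk(rows, include=(), exclude=(), lastmod_since=None,
                     priority_min=None, chunk_size=None):
    """Apply include, exclude, lastmodSince, priorityMin, then tag chunks."""
    include_tests = [compile_pattern(p) for p in include]
    exclude_tests = [compile_pattern(p) for p in exclude]
    since = parse_date(lastmod_since) if lastmod_since else None
    kept = []
    for row in rows:
        url = row["url"]
        # 1. includePatterns: an empty list includes everything.
        if include_tests and not any(test(url) for test in include_tests):
            continue
        # 2. excludePatterns: exclude wins over include.
        if any(test(url) for test in exclude_tests):
            continue
        # 3. lastmodSince (assumption: rows without lastmod are dropped).
        if since is not None:
            if not row.get("lastmod") or parse_date(row["lastmod"]) < since:
                continue
        # 4. priorityMin (assumption: missing priority counts as 0.5).
        if priority_min is not None:
            priority = row.get("priority")
            if (0.5 if priority is None else priority) < priority_min:
                continue
        # Chunk indexes are applied last, over the surviving rows only.
        chunk = len(kept) // chunk_size if chunk_size else 0
        kept.append({**row, "chunk": chunk})
    return kept
```

Because chunking happens after filtering, chunk indexes stay dense: a filtered-out URL never leaves a gap in a chunk.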

Input

{
  "seeds": ["https://www.apify.com/sitemap.xml"],
  "fallbackToRobotsTxt": true,
  "recurseSitemapIndex": true,
  "includePatterns": ["**/blog/**"],
  "excludePatterns": ["/draft/"],
  "lastmodSince": "2026-01-01",
  "priorityMin": 0.5,
  "chunkSize": 100,
  "maxUrls": 0,
  "maxItems": 15
}

Field	Type	Default	Description
`seeds`	array	required	Sitemap URLs OR site roots. Site roots fall back to robots.txt when `fallbackToRobotsTxt` is on.
`fallbackToRobotsTxt`	boolean	true	When a seed lacks an explicit sitemap, parse `/robots.txt` for `Sitemap:` directives.
`recurseSitemapIndex`	boolean	true	Walk into nested `<sitemapindex>` entries (most large sites use these).
`includePatterns`	array	—	Glob (`**/blog/**`) or `/regex/`. Empty = include all.
`excludePatterns`	array	—	Same syntax, applied after include. Exclude wins.
`lastmodSince`	string	—	ISO date. Only emit URLs with `lastmod >= this`.
`priorityMin`	number	—	Only emit URLs with `priority >= this` (0.0-1.0).
`chunkSize`	integer	—	Group output into chunks of this size; tag each row with a chunk index.
`maxUrls`	integer	0	Hard cap on emitted URLs. `0` = unlimited.
`maxItems`	integer	15	Apify-tester safety cap. Override (or set to 0) for production batches.

When both maxUrls and maxItems are non-zero, the effective cap is the smaller of the two; a value of 0 lifts that field's cap.
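
One way to read the cap rule, assuming that 0 disables a cap for either field (the table states this for `maxUrls` and implies it for `maxItems`). The helper name is hypothetical.

```python
from typing import Optional


def effective_cap(max_urls: int = 0, max_items: int = 15) -> Optional[int]:
    """Return the row limit, or None when neither cap is active."""
    # Assumption: 0 means "no cap" for both fields.
    active = [cap for cap in (max_urls, max_items) if cap > 0]
    return min(active) if active else None
```

With the defaults this yields 15, which is why production batches need `maxItems` raised or zeroed even when `maxUrls` is left at 0.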

Sitemap Walker Pro Output Fields

{
  "url": "https://example.com/blog/post-1",
  "lastmod": "2026-04-15T10:00:00Z",
  "changefreq": "weekly",
  "priority": 0.8,
  "sourceSitemap": "https://example.com/sitemap.xml",
  "chunk": 0,
  "alternates":  ["es=https://example.com/es/blog/post-1"],
  "imageRefs":   ["https://example.com/img.jpg;Hero shot"],
  "videoRefs":   [],
  "newsTitle":           null,
  "newsPublication":     null,
  "newsPublicationDate": null,
  "newsLanguage":        null,
  "scrapedAt": "2026-04-30T18:00:00Z"
}

Field	Type	Description
`url`	string	Discovered URL.
`lastmod`	string	Last-modified timestamp from the sitemap (ISO string).
`changefreq`	string	`always`, `hourly`, `daily`, `weekly`, `monthly`, `yearly`, or `never`.
`priority`	number	Priority hint from the sitemap (0.0-1.0).
`sourceSitemap`	string	URL of the sitemap that contained this entry.
`chunk`	number	0-based chunk index when `chunkSize` is set; `0` otherwise.
`alternates`	array	`hreflang=href` strings from `xhtml:link rel=alternate`, one per alternate.
`imageRefs`	array	`loc;title` strings from image sitemaps, one per image.
`videoRefs`	array	`title;content_loc` strings from video sitemaps, one per video.
`newsTitle`	string	Google News sitemap title (when present).
`newsPublication`	string	Google News publication name (when present).
`newsPublicationDate`	string	Google News publication date (when present).
`newsLanguage`	string	Google News language code (when present).
`scrapedAt`	string	Timestamp when this URL was discovered.
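
On the consumer side, rows shaped like the example above could be regrouped for fan-out as sketched below. Both helpers are hypothetical and assume `alternates` arrives as an array of `hreflang=href` strings, as in the example output.

```python
from collections import defaultdict


def group_by_chunk(rows):
    """Map each chunk index to the list of URLs tagged with it."""
    batches = defaultdict(list)
    for row in rows:
        batches[row.get("chunk", 0)].append(row["url"])
    return dict(sorted(batches.items()))


def parse_alternates(row):
    """Turn ["es=https://..."] into {"es": "https://..."}."""
    result = {}
    for entry in row.get("alternates") or []:
        # Split on the first "=" only; the href may contain "=" itself.
        lang, sep, href = entry.partition("=")
        if sep:
            result[lang] = href
    return result
```

Each value returned by `group_by_chunk` is a URL batch that can be handed to one downstream run.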

Pricing

Pricing is a token charge and functionally free. Apify rejects pay-per-event (PPE) prices of exactly $0, so the per-URL price sits at the smallest practical floor.

Event	Price
Actor start	$0.10
Per discovered URL	$0.0001

Volume	Cost
100 URLs	$0.11
1,000 URLs	$0.20
10,000 URLs	$1.10

This actor is the cheap discovery primitive that pairs with paid downstream actors. Walk sitemaps liberally.
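
The volume table follows from the two event prices. A small check, assuming exactly one actor-start event per run (`run_cost` is a hypothetical helper):

```python
from decimal import Decimal

ACTOR_START = Decimal("0.10")   # charged once per run (assumption)
PER_URL = Decimal("0.0001")     # charged per discovered URL


def run_cost(url_count: int) -> Decimal:
    """Total cost in USD for one run that emits `url_count` URLs."""
    return ACTOR_START + PER_URL * url_count


for count in (100, 1_000, 10_000):
    print(f"{count:>6} URLs -> ${run_cost(count):.2f}")
```

`Decimal` is used instead of floats so the sums match the table to the cent.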

Limits

maxItems defaults to 15 — sized for the Apify tester's 5-minute timeout. Override for production batches by setting maxItems higher and / or relying on maxUrls.
The actor does not fetch the URLs it discovers. Pair with a downstream actor (HTML scraper, Lighthouse, screenshot, structured-data validator) for that.
Sitemap-index recursion caps at 10 levels deep. Cycles are detected via a shared seen-set across all seeds.
robots.txt Disallow: rules are not enforced. Sitemaps are explicitly the publisher's invitation to fetch the listed URLs.
Crawl-delay: directives are not parsed. The walker uses a fixed 250 ms courtesy delay between requests, plus automatic 429 / 5xx retry handling.
Some publishers compress sitemaps as .xml.gz — these are auto-decompressed.
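
The recursion cap and shared seen-set from the list above can be pictured with the sketch below. It is not the actor's implementation: `fetch` is a hypothetical callable that returns a sitemap body for a URL, and counting the seed as depth 0 is an assumption about how the 10-level cap is measured.

```python
import xml.etree.ElementTree as ET

MAX_DEPTH = 10  # recursion cap stated in the limits
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def walk(sitemap_url, fetch, seen, depth=0):
    """Yield (page_url, source_sitemap) pairs, descending into indexes."""
    # Stop on the depth cap or on a sitemap already visited (cycle guard).
    if depth > MAX_DEPTH or sitemap_url in seen:
        return
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        # An index lists child sitemaps; recurse one level deeper.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            yield from walk((loc.text or "").strip(), fetch, seen, depth + 1)
    else:
        # A urlset lists pages; emit each with the sitemap it came from.
        for loc in root.findall("sm:url/sm:loc", NS):
            yield (loc.text or "").strip(), sitemap_url
```

Passing the same `seen` set to every seed's walk is what makes the cycle check global: a sitemap reached from two seeds is fetched only once.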

FAQ

How do I extract all URLs from an XML sitemap? Pass the sitemap URL (or a bare site root) in seeds. The extractor fetches the sitemap.xml, recurses into any nested sitemap-index entries, and returns one row per discovered URL with its lastmod, priority, and source sitemap. Gzipped .xml.gz sitemaps are auto-decompressed, and robots.txt is parsed for the sitemap location when no explicit sitemap is given.
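
For readers curious what the parsing step in that answer involves, here is a minimal sketch using only the Python standard library. Detecting gzip by its magic bytes (rather than by the .xml.gz extension) is an assumption made for illustration, and `parse_urlset` is a hypothetical name.

```python
import gzip
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def parse_urlset(body: bytes):
    """Decompress if gzipped, then return one dict per <url> entry."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    root = ET.fromstring(body)
    rows = []
    for url in root.findall("sm:url", NS):
        priority = url.findtext("sm:priority", default=None, namespaces=NS)
        rows.append({
            "url": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            "priority": float(priority) if priority is not None else None,
        })
    return rows
```

The same function accepts plain and gzipped bodies, so a caller does not need to know in advance how the publisher served the file.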

Related Actors

Structured Data Validator Pro — feed the discovered URL list straight into structured-data audits.
SSL & Security Headers Checker — discover URLs for a site, then probe each one's TLS and header posture.
Angular SSR State Extractor — discover an Angular site's URLs, then pull each page's TransferState payload.

Need More Features?

Need a different output shape, a warehouse integration, or a pre-wired sitemap → fetch → validate chain? File an issue or get in touch.

Why Use Sitemap Walker Pro?

Functionally free at $0.0001 per URL. Walk a million-URL sitemap for about $100 and stop arguing about cost.
Recursive index walk + robots.txt fallback — handles the messy real world. Most other walkers handle one sitemap and call it a day.
Chunked output — tag each row with a 0-based chunk index and fan out to N parallel downstream actors without writing a coordinator.

Built by OrbTop.
