Wayback Machine Bulk Lookup

Look up Wayback Machine (archive.org) snapshots for a single URL or a list of URLs. The actor returns the full capture timeline and can optionally include each snapshot's content converted from HTML to markdown, plus a text diff between the live page and the latest snapshot. Built for OSINT analysts, journalists verifying sources, SEO teams recovering from link rot, and legal teams collecting evidence.

What this actor does

For each input URL, the actor:

Queries the Wayback CDX API to retrieve the snapshot index, filtered to your date range and capped at your capture limit
Optionally fetches snapshot HTML for each capture and converts it to markdown (for reading or archiving)
Optionally fetches the current live URL and computes a line-level text diff against the most recent snapshot (to detect page changes)

Each output record contains the full snapshot timeline plus optional diff and content fields.
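
The CDX lookup in the first step can be sketched in Python as follows. This is a minimal illustration, not the actor's actual code: the helper names `build_cdx_url` and `parse_cdx_rows` are hypothetical, and the mapping of `dateFrom`, `dateTo`, and `captureLimit` onto the CDX `from`, `to`, and `limit` parameters is an assumption about how the actor calls the API.

```python
from urllib.parse import urlencode

# Public Wayback CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, date_from=None, date_to=None, capture_limit=100):
    """Build a CDX query URL for one input URL (hypothetical helper)."""
    params = {"url": url, "output": "json", "limit": capture_limit}
    if date_from:
        # CDX timestamps are digit strings, so 2023-01-01 becomes 20230101.
        params["from"] = date_from.replace("-", "")
    if date_to:
        params["to"] = date_to.replace("-", "")
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Convert CDX JSON output (a header row followed by data rows) to dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


print(build_cdx_url("https://example.com/about", "2023-01-01", "2024-12-31", 50))
```

Fetching the printed URL with any HTTP client and passing the decoded JSON to `parse_cdx_rows` would yield one dict per capture, keyed by the CDX column names such as `timestamp`, `original`, `mimetype`, and `statuscode`.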

Input

Field	Type	Default	Description
`urls`	array of strings	—	Required. URLs to look up in the Wayback Machine
`maxItems`	integer	—	Maximum total output records across all URLs
`dateFrom`	string	—	Earliest snapshot date to include (ISO date, e.g. `2020-01-01`)
`dateTo`	string	—	Latest snapshot date to include (ISO date, e.g. `2024-12-31`)
`captureLimit`	integer	`100`	Max snapshots per URL
`fetchSnapshotContent`	boolean	`false`	Download snapshot HTML and convert to markdown
`diffWithLive`	boolean	`false`	Compute text diff between latest snapshot and current live URL
`proxyConfiguration`	object	none	Optional proxy config (usually not needed for Wayback)

Example input:

{
  "urls": [
    "https://example.com/news/2024-article",
    "https://example.com/about"
  ],
  "dateFrom": "2023-01-01",
  "dateTo": "2024-12-31",
  "captureLimit": 50,
  "diffWithLive": true
}

Output

One record per input URL.

Field	Type	Description
`url`	string	The input URL
`snapshotCount`	number	Number of snapshots found in the date range
`firstCaptured`	string	Earliest snapshot timestamp (ISO 8601)
`lastCaptured`	string	Latest snapshot timestamp (ISO 8601)
`captures`	array	Snapshot entries — each a JSON-encoded string with `timestamp`, `archiveUrl`, `status`, `mimetype`, and optionally `contentMarkdown`
`diff`	object	`{ addedLines, removedLines, changedRatio }` — only present when `diffWithLive=true`
`liveStatus`	number	Current HTTP status of the live URL — only present when `diffWithLive=true`
`finalLiveUrl`	string	Final URL of the live page after following redirects
`status`	string	`success`, `timeout`, or `error`
`errorMsg`	string	Error details on failure, `null` on success
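
To give a feel for the `diff` object, here is a rough Python sketch of a line-level diff built on the standard `difflib` module. It is an assumption-laden illustration, not the actor's implementation: the function name `line_diff` is hypothetical, and the README does not say how `changedRatio` is calculated, so the formula below (changed lines divided by total lines on both sides) is only one plausible definition.

```python
import difflib


def line_diff(snapshot_text, live_text):
    """Count added and removed lines between snapshot and live text."""
    old = snapshot_text.splitlines()
    new = live_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)

    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # lines present only in the snapshot
        if tag in ("replace", "insert"):
            added += j2 - j1  # lines present only in the live page

    total = len(old) + len(new)
    # ASSUMPTION: the actor's real changedRatio formula is not documented.
    changed_ratio = round((added + removed) / total, 4) if total else 0.0
    return {"addedLines": added, "removedLines": removed, "changedRatio": changed_ratio}


print(line_diff("Title\nOld paragraph\nFooter", "Title\nFooter\nNew banner"))
```

A `changedRatio` near 0 under this definition means the live page still closely matches the latest snapshot; a value near 1 means almost every line differs.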

Example output record:

{
  "url": "https://example.com/news/2024-article",
  "snapshotCount": 14,
  "firstCaptured": "2024-03-12T08:42:00Z",
  "lastCaptured": "2026-04-29T22:11:00Z",
  "captures": [
    "{\"timestamp\":\"2026-04-29T22:11:00Z\",\"archiveUrl\":\"https://web.archive.org/web/20260429221100/https://example.com/news/2024-article\",\"status\":200,\"mimetype\":\"text/html\"}"
  ],
  "diff": { "addedLines": 12, "removedLines": 3, "changedRatio": 0.04 },
  "liveStatus": 200,
  "finalLiveUrl": "https://example.com/news/2024-article",
  "status": "success",
  "errorMsg": null
}
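
Because each entry in `captures` is a JSON-encoded string rather than a nested object, consumers need one extra decoding step. The Python sketch below shows one way to do that; `decode_captures` is a hypothetical helper name, and the record is a trimmed copy of the example above.

```python
import json


def decode_captures(record):
    """Return the record's captures as dicts, decoding JSON-encoded strings."""
    decoded = []
    for entry in record.get("captures") or []:
        # Entries that are already objects are passed through unchanged.
        decoded.append(json.loads(entry) if isinstance(entry, str) else entry)
    return decoded


record = {
    "url": "https://example.com/news/2024-article",
    "captures": [
        json.dumps({
            "timestamp": "2026-04-29T22:11:00Z",
            "archiveUrl": "https://web.archive.org/web/20260429221100/https://example.com/news/2024-article",
            "status": 200,
            "mimetype": "text/html",
        })
    ],
}

for capture in decode_captures(record):
    print(capture["timestamp"], capture["status"], capture["archiveUrl"])
```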

Dataset views

The actor produces two dataset views in the Apify console:

Capture Timeline — url, snapshotCount, firstCaptured, lastCaptured, captures
Live vs Snapshot Diff — url, liveStatus, diff, lastCaptured

Rate limits and performance

The actor respects Wayback Machine's rate limits:

CDX API queries: ~10 requests/second (110ms minimum delay)
Snapshot content fetches: ~1-2 requests/second (700ms minimum delay)

For large batches with fetchSnapshotContent=true, expect longer runtimes. The default timeout is 2 hours. Start with a small captureLimit (e.g. 10) to estimate runtime before running at full scale.
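
A back-of-the-envelope estimate can be derived from the minimum delays listed above. The Python sketch below is only a lower bound under stated assumptions: requests are sequential, each URL needs one CDX query, every URL actually has `captureLimit` snapshots, and network latency and the optional live-page fetch are ignored. The function name is hypothetical.

```python
CDX_DELAY_MS = 110       # minimum delay between CDX queries
SNAPSHOT_DELAY_MS = 700  # minimum delay between snapshot content fetches


def estimate_min_runtime_seconds(num_urls, capture_limit, fetch_snapshot_content):
    """Lower-bound runtime implied by the throttling delays alone."""
    total_ms = num_urls * CDX_DELAY_MS  # ASSUMPTION: one CDX query per URL
    if fetch_snapshot_content:
        # Worst case: every URL has capture_limit snapshots to download.
        total_ms += num_urls * capture_limit * SNAPSHOT_DELAY_MS
    return total_ms / 1000


full = estimate_min_runtime_seconds(200, 100, True)
trial = estimate_min_runtime_seconds(200, 10, True)
print(f"captureLimit=100: {full / 3600:.1f} h, captureLimit=10: {trial / 60:.1f} min")
```

Under these assumptions, 200 URLs with `captureLimit` 100 and `fetchSnapshotContent=true` would need roughly 3.9 hours of throttling delay alone, which exceeds the 2-hour default timeout, while the same batch with `captureLimit` 10 needs under 24 minutes.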

Use cases

OSINT / research: Check whether a source URL existed, when it was captured, and how its content has changed
Journalism: Verify archived versions of articles or government pages for fact-checking
SEO / link-rot recovery: Find archived versions of dead inbound links and plan redirects or outreach
Legal evidence: Retrieve timestamped snapshots of web pages for documentation
Web archiving: Bulk-check coverage for a list of URLs before deeper archiving work
