OrbTop

Wayback Machine Bulk Lookup

DEVELOPER TOOLSNEWS

Wayback Machine Bulk Lookup

Look up Wayback Machine (archive.org) snapshots for any URL or list of URLs. Returns the full capture timeline, optional snapshot HTML-to-markdown content, and a live-vs-snapshot text diff. Built for OSINT analysts, journalists verifying sources, SEO teams recovering link-rot, and legal evidence collection.


What this actor does

For each input URL, the actor:

  1. Queries the Wayback CDX API to retrieve the snapshot index in your specified date range and capture limit
  2. Optionally fetches snapshot HTML for each capture and converts it to markdown (for reading or archiving)
  3. Optionally fetches the current live URL and computes a line-level text diff against the most recent snapshot (to detect page changes)

Each output record contains the full snapshot timeline plus optional diff and content fields.


Input

Field Type Default Description
urls array of strings Required. URLs to look up in the Wayback Machine
maxItems integer Maximum total output records across all URLs
dateFrom string Earliest snapshot date to include (ISO date, e.g. 2020-01-01)
dateTo string Latest snapshot date to include (ISO date, e.g. 2024-12-31)
captureLimit integer 100 Max snapshots per URL
fetchSnapshotContent boolean false Download snapshot HTML and convert to markdown
diffWithLive boolean false Compute text diff between latest snapshot and current live URL
proxyConfiguration object none Optional proxy config (usually not needed for Wayback)

Example input:

{
  "urls": [
    "https://example.com/news/2024-article",
    "https://example.com/about"
  ],
  "dateFrom": "2023-01-01",
  "dateTo": "2024-12-31",
  "captureLimit": 50,
  "diffWithLive": true
}

Output

One record per input URL.

Field Type Description
url string The input URL
snapshotCount number Number of snapshots found in the date range
firstCaptured string Earliest snapshot timestamp (ISO 8601)
lastCaptured string Latest snapshot timestamp (ISO 8601)
captures array Snapshot entries — each a JSON-encoded string with timestamp, archiveUrl, status, mimetype, and optionally contentMarkdown
diff object { addedLines, removedLines, changedRatio } — only present when diffWithLive=true
liveStatus number Current HTTP status of the live URL — only present when diffWithLive=true
finalLiveUrl string Final URL after redirects
status string success, timeout, or error
errorMsg string Error details on failure, null on success

Example output record:

{
  "url": "https://example.com/news/2024-article",
  "snapshotCount": 14,
  "firstCaptured": "2024-03-12T08:42:00Z",
  "lastCaptured": "2026-04-29T22:11:00Z",
  "captures": [
    "{\"timestamp\":\"2026-04-29T22:11:00Z\",\"archiveUrl\":\"https://web.archive.org/web/20260429221100/https://example.com/news/2024-article\",\"status\":200,\"mimetype\":\"text/html\"}"
  ],
  "diff": { "addedLines": 12, "removedLines": 3, "changedRatio": 0.04 },
  "liveStatus": 200,
  "finalLiveUrl": "https://example.com/news/2024-article",
  "status": "success",
  "errorMsg": null
}

Dataset views

The actor produces two dataset views in the Apify console:

  • Capture Timelineurl, snapshotCount, firstCaptured, lastCaptured, captures
  • Live vs Snapshot Diffurl, liveStatus, diff, lastCaptured

Rate limits and performance

The actor respects Wayback Machine's rate limits:

  • CDX API queries: ~10 requests/second (110ms minimum delay)
  • Snapshot content fetches: ~1-2 requests/second (700ms minimum delay)

For large batches with fetchSnapshotContent=true, expect longer runtimes. The default timeout is 2 hours. Start with a small captureLimit (e.g. 10) to estimate runtime before running at full scale.


Use cases

  • OSINT / research: Check whether a source URL existed, when it was captured, and how its content has changed
  • Journalism: Verify archived versions of articles or government pages for fact-checking
  • SEO / link-rot recovery: Find archived versions of dead inbound links and plan redirects or outreach
  • Legal evidence: Retrieve timestamped snapshots of web pages for documentation
  • Web archiving: Bulk-check coverage for a list of URLs before deeper archiving work