Wayback Machine Bulk Lookup
Look up Wayback Machine (archive.org) snapshots for a single URL or a list of URLs. Returns the full capture timeline and, optionally, each snapshot's HTML converted to markdown and a text diff between the live page and the latest snapshot. Built for OSINT analysts, journalists verifying sources, SEO teams recovering from link rot, and legal teams collecting evidence.
What this actor does
For each input URL, the actor:
- Queries the Wayback CDX API to retrieve the snapshot index within the date range you specify, up to the capture limit
- Optionally fetches snapshot HTML for each capture and converts it to markdown (for reading or archiving)
- Optionally fetches the current live URL and computes a line-level text diff against the most recent snapshot (to detect page changes)
Each output record contains the full snapshot timeline plus optional diff and content fields.
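The first step above can be sketched in Python. This is a minimal illustration of how a CDX query could be built from the input fields, not the actor's actual code. The helper names (to_cdx_date, build_cdx_query, parse_cdx_rows) are hypothetical, and mapping captureLimit to the CDX limit parameter is an assumption; the endpoint and the url, output, from, to, and limit parameters belong to the public CDX API.

```python
from datetime import date
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def to_cdx_date(iso_date: str) -> str:
    """Convert an ISO date such as 2020-01-01 to the compact form 20200101."""
    return date.fromisoformat(iso_date).strftime("%Y%m%d")


def build_cdx_query(url, date_from=None, date_to=None, capture_limit=100):
    """Build a CDX query URL (hypothetical helper, not the actor's code)."""
    # Assumption: captureLimit is passed straight through as the CDX limit.
    params = {"url": url, "output": "json", "limit": str(capture_limit)}
    if date_from:
        params["from"] = to_cdx_date(date_from)
    if date_to:
        params["to"] = to_cdx_date(date_to)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Turn the CDX JSON layout (header row, then data rows) into dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


if __name__ == "__main__":
    print(build_cdx_query("https://example.com/about", "2023-01-01", "2024-12-31", 50))
```

With output=json the CDX API returns a list of rows whose first row names the columns, which is why parse_cdx_rows zips each data row against the header rather than relying on fixed positions.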
Input
| Field | Type | Default | Description |
|---|---|---|---|
| urls | array of strings | - | Required. URLs to look up in the Wayback Machine |
| maxItems | integer | - | Maximum total output records across all URLs |
| dateFrom | string | - | Earliest snapshot date to include (ISO date, e.g. 2020-01-01) |
| dateTo | string | - | Latest snapshot date to include (ISO date, e.g. 2024-12-31) |
| captureLimit | integer | 100 | Max snapshots per URL |
| fetchSnapshotContent | boolean | false | Download snapshot HTML and convert to markdown |
| diffWithLive | boolean | false | Compute text diff between latest snapshot and current live URL |
| proxyConfiguration | object | none | Optional proxy config (usually not needed for Wayback) |
Example input:
{
  "urls": [
    "https://example.com/news/2024-article",
    "https://example.com/about"
  ],
  "dateFrom": "2023-01-01",
  "dateTo": "2024-12-31",
  "captureLimit": 50,
  "diffWithLive": true
}
Output
One record per input URL.
| Field | Type | Description |
|---|---|---|
| url | string | The input URL |
| snapshotCount | number | Number of snapshots found in the date range |
| firstCaptured | string | Earliest snapshot timestamp (ISO 8601) |
| lastCaptured | string | Latest snapshot timestamp (ISO 8601) |
| captures | array | Snapshot entries, each a JSON-encoded string with timestamp, archiveUrl, status, mimetype, and optionally contentMarkdown |
| diff | object | { addedLines, removedLines, changedRatio }; only present when diffWithLive=true |
| liveStatus | number | Current HTTP status of the live URL; only present when diffWithLive=true |
| finalLiveUrl | string | Final URL after redirects |
| status | string | success, timeout, or error |
| errorMsg | string | Error details on failure, null on success |
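To make the diff fields concrete, here is a minimal sketch of a line-level diff that produces an object of the same shape, using Python's standard difflib. How the actor actually computes changedRatio is not documented here, so the definition used below (changed lines divided by the combined line count of both texts) is an assumption for illustration only, and line_diff is a hypothetical helper name.

```python
import difflib


def line_diff(snapshot_text: str, live_text: str) -> dict:
    """Count lines added to and removed from the snapshot text."""
    old_lines = snapshot_text.splitlines()
    new_lines = live_text.splitlines()

    added = removed = 0
    # ndiff prefixes each output line with a two-character code:
    # "+ " for lines only in the live text, "- " for lines only in the snapshot.
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added += 1
        elif line.startswith("- "):
            removed += 1

    # Assumed definition: share of changed lines over all lines in both texts.
    total = len(old_lines) + len(new_lines)
    ratio = round((added + removed) / total, 2) if total else 0.0
    return {"addedLines": added, "removedLines": removed, "changedRatio": ratio}


if __name__ == "__main__":
    print(line_diff("title\nold paragraph\nfooter", "title\nfooter\nnew banner"))
```

A ratio near 0 then means the live page still matches the snapshot almost line for line, while a ratio near 1 means almost every line differs.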
Example output record:
{
  "url": "https://example.com/news/2024-article",
  "snapshotCount": 14,
  "firstCaptured": "2024-03-12T08:42:00Z",
  "lastCaptured": "2026-04-29T22:11:00Z",
  "captures": [
    "{\"timestamp\":\"2026-04-29T22:11:00Z\",\"archiveUrl\":\"https://web.archive.org/web/20260429221100/https://example.com/news/2024-article\",\"status\":200,\"mimetype\":\"text/html\"}"
  ],
  "diff": { "addedLines": 12, "removedLines": 3, "changedRatio": 0.04 },
  "liveStatus": 200,
  "finalLiveUrl": "https://example.com/news/2024-article",
  "status": "success",
  "errorMsg": null
}
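Because each entry in captures is a JSON-encoded string rather than a nested object, consumers need a second decoding step. The sketch below shows one way to do that in Python on a trimmed copy of the record above; decode_captures is a hypothetical helper name, and the tolerance for already-decoded entries is a defensive assumption, not documented behavior.

```python
import json

# Trimmed copy of the example record; the capture is stored as a JSON string.
record = {
    "url": "https://example.com/news/2024-article",
    "snapshotCount": 14,
    "captures": [
        json.dumps({
            "timestamp": "2026-04-29T22:11:00Z",
            "archiveUrl": "https://web.archive.org/web/20260429221100/https://example.com/news/2024-article",
            "status": 200,
            "mimetype": "text/html",
        })
    ],
}


def decode_captures(record: dict) -> list:
    """Decode each capture string into a dict, passing dicts through as-is."""
    decoded = []
    for entry in record.get("captures", []):
        decoded.append(json.loads(entry) if isinstance(entry, str) else entry)
    return decoded


for capture in decode_captures(record):
    print(capture["timestamp"], capture["status"], capture["archiveUrl"])
```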
Dataset views
The actor produces two dataset views in the Apify console:
- Capture Timeline: url, snapshotCount, firstCaptured, lastCaptured, captures
- Live vs Snapshot Diff: url, liveStatus, diff, lastCaptured
Rate limits and performance
The actor respects Wayback Machine's rate limits:
- CDX API queries: at most ~9 requests/second (110 ms minimum delay between requests)
- Snapshot content fetches: at most ~1.4 requests/second (700 ms minimum delay between requests)
For large batches with fetchSnapshotContent=true, expect longer runtimes. The default timeout is 2 hours. Start with a small captureLimit (e.g. 10) to estimate runtime before running at full scale.
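A rough way to size a run before launching it is to add up the minimum delays alone. The Python sketch below does that arithmetic under stated assumptions: requests are issued one after another, every URL yields a full captureLimit of snapshots, and network latency and processing time are ignored. The function name estimate_delay_seconds is hypothetical, and the real runtime will differ.

```python
# Minimum delays quoted above, in milliseconds (integers avoid float drift).
CDX_DELAY_MS = 110
SNAPSHOT_DELAY_MS = 700


def estimate_delay_seconds(url_count: int, capture_limit: int,
                           fetch_snapshot_content: bool) -> float:
    """Sum the throttling delays for a run, assuming sequential requests."""
    # One CDX query per input URL.
    total_ms = url_count * CDX_DELAY_MS
    if fetch_snapshot_content:
        # Worst case: every URL has capture_limit snapshots to download.
        total_ms += url_count * capture_limit * SNAPSHOT_DELAY_MS
    return total_ms / 1000


if __name__ == "__main__":
    for limit in (10, 100):
        seconds = estimate_delay_seconds(500, limit, True)
        print(f"500 URLs, captureLimit={limit}: {seconds / 3600:.1f} h of delay")
```

Under these assumptions, 500 URLs with captureLimit 100 and fetchSnapshotContent=true accumulate about 9.7 hours of delay, well past the 2-hour default timeout, while captureLimit 10 brings the same batch down to roughly 59 minutes.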
Use cases
- OSINT / research: Check whether a source URL existed, when it was captured, and how its content has changed
- Journalism: Verify archived versions of articles or government pages for fact-checking
- SEO / link-rot recovery: Find archived versions of dead inbound links and plan redirects or outreach
- Legal evidence: Retrieve timestamped snapshots of web pages for documentation
- Web archiving: Bulk-check coverage for a list of URLs before deeper archiving work