CBS 60 Minutes Transcripts Scraper
Scrapes full Q&A interview transcripts from CBS News 60 Minutes, the most-recognized US investigative news magazine. Returns one record per transcript page: title, correspondent, broadcast date, subject list, speaker-labeled body text, and topic metadata. Discovers transcript pages automatically from the CBS News article sitemap. Video-only segments without a published transcript are skipped.
60 Minutes is the most-watched US news magazine, known for long-form sit-down interviews with heads of state, CEOs, whistleblowers, and scientists. Each transcript runs 5,000-30,000 words of clean, on-the-record Q&A — high-signal content for media research, RAG pipelines, and investigative journalism datasets.
What It Scrapes
Targets two URL patterns on cbsnews.com:
/news/<slug>-60-minutes-transcript/ (primary transcript pattern)
/news/read-the-full-transcript-of-<slug>/ (extended interview variant)
Discovery walks the CBS News monthly article sitemaps, filters by these patterns, and scrapes each matching page. Video-only stories (e.g. /news/<slug>-60-minutes/) are explicitly excluded.
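The filtering step can be sketched as follows. This is a minimal illustration, not the scraper's actual code: the regular expressions are assumptions derived from the two patterns listed above, and `is_transcript_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse

# Assumed regexes for the two transcript URL patterns described above.
# Real slugs may contain characters not covered here.
TRANSCRIPT_PATTERNS = [
    re.compile(r"^/news/[a-z0-9-]+-60-minutes-transcript/?$"),
    re.compile(r"^/news/read-the-full-transcript-of-[a-z0-9-]+/?$"),
]


def is_transcript_url(url: str) -> bool:
    """Return True when the URL path matches one of the transcript patterns."""
    path = urlparse(url).path
    return any(pattern.match(path) for pattern in TRANSCRIPT_PATTERNS)
```

Because a video-only path such as /news/<slug>-60-minutes/ lacks the "-transcript" suffix, it matches neither pattern and is dropped without a separate exclusion rule.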
Output Schema
| Field | Type | Description |
|---|---|---|
| story_slug | string | URL slug of the transcript page |
| story_title | string | Article headline |
| story_url | string | Canonical CBS News URL |
| aired_date | string | Broadcast date (YYYY-MM-DD) |
| published_date | string | CBS News publish timestamp (ISO 8601) |
| segment_type | string | Inferred type: interview, investigation, or profile |
| correspondent | string | CBS News correspondent (e.g. Major Garrett, Lesley Stahl) |
| subjects | string | Interviewed subjects extracted from speaker labels (comma-separated) |
| synopsis | string | Article dek / meta description |
| body_html | string | Full transcript HTML preserving Q&A paragraph structure |
| body_text | string | Plain-text version of the transcript |
| speakers | string | All speaker labels found in the transcript (comma-separated) |
| is_transcript | boolean | Always true; non-transcripts are skipped |
| has_video_only_variant | boolean | True when a paired video-only story exists |
| related_story_urls | string | Related CBS News links on the page (comma-separated) |
| topics | string | CBS News topic tags (comma-separated) |
| canonical_url | string | Canonical URL from page head |
| source | string | Fixed: cbsnews.com/60-minutes |
| scraped_at | datetime | ISO 8601 scrape timestamp |
Speaker labels follow two CBS conventions: "Major Garrett:" (Title Case) and "MAJOR GARRETT:" (ALL-CAPS, used in the extended-interview variant). Both formats are normalized and extracted.
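One way to handle both label conventions is sketched below. The regular expression, the five-word limit on a label, and the helper names `normalize_label` and `extract_speakers` are all assumptions made for illustration; the scraper's own rules may differ.

```python
import re

# Assumed label shape: one to five name-like words at the start of a
# paragraph, followed by a colon and whitespace.
SPEAKER_RE = re.compile(
    r"^\s*([A-Za-z][A-Za-z.'-]*(?:\s+[A-Za-z][A-Za-z.'-]*){0,4}):\s"
)


def normalize_label(label: str) -> str:
    """Collapse whitespace and convert ALL-CAPS labels to Title Case."""
    label = " ".join(label.split())
    return label.title() if label.isupper() else label


def extract_speakers(paragraphs):
    """Return unique speaker labels in order of first appearance."""
    seen = []
    for paragraph in paragraphs:
        match = SPEAKER_RE.match(paragraph)
        if match:
            name = normalize_label(match.group(1))
            if name not in seen:
                seen.append(name)
    return seen
```

Normalizing before de-duplicating means "MAJOR GARRETT:" and "Major Garrett:" collapse into one entry. Note that `str.title()` is a blunt tool: a surname such as MCDONALD would become "Mcdonald", so a production version would likely need a name-aware casing step.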
Input Options
maxItems (integer, required) — Maximum number of transcript records to scrape. Set a higher value for bulk runs.
startDate (string, optional) — Limit sitemap discovery to a given month onwards (YYYY-MM format, e.g. "2024-01"). Defaults to all available months when omitted.
startUrls (array, optional) — One or more direct CBS News transcript URLs. When provided, sitemap discovery is skipped and only the supplied URLs are scraped. Useful for targeted re-runs of specific episodes.
Example: Specific episode
{
  "maxItems": 1,
  "startUrls": [
    {"url": "https://www.cbsnews.com/news/netanyahu-us-israel-iran-60-minutes-transcript/"}
  ]
}
Example: All 2025 transcripts
{
  "maxItems": 200,
  "startDate": "2025-01"
}
Example: Full archive (all available transcripts)
{
  "maxItems": 1000
}
How It Works
Discovery uses the CBS News sitemap index at cbsnews.com/xml-sitemap/index.xml. Monthly article sitemaps (article-YYYY-MM.xml) are walked in order, newest first. Each sitemap lists 3,000+ news articles; only URLs matching the transcript patterns are fetched.
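The month selection described above might look like the sketch below. It assumes the sitemap index has already been fetched and reduced to a list of URLs; `select_sitemaps` is a hypothetical helper, and only the `article-YYYY-MM.xml` filename shape is taken from the description.

```python
import re

# Matches the monthly article sitemap filename, e.g. article-2025-01.xml.
SITEMAP_RE = re.compile(r"article-(\d{4})-(\d{2})\.xml")


def select_sitemaps(sitemap_urls, start_date=None):
    """Keep monthly article sitemaps at or after start_date, newest first.

    start_date uses the same YYYY-MM format as the startDate input option.
    """
    dated = []
    for url in sitemap_urls:
        match = SITEMAP_RE.search(url)
        if not match:
            continue  # not a monthly article sitemap
        month = f"{match.group(1)}-{match.group(2)}"
        # Zero-padded YYYY-MM strings sort chronologically as plain text.
        if start_date is None or month >= start_date:
            dated.append((month, url))
    dated.sort(reverse=True)
    return [url for _, url in dated]
```

Comparing the YYYY-MM strings directly avoids date parsing, which works only because both year and month are zero-padded.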
Metadata is parsed from JSON-LD NewsArticle blocks present on every CBS article page — giving reliable correspondent name, publish date, and keywords. The transcript body lives in <section class="content__body"> as a sequence of <p> tags. Speaker labels are extracted from paragraph-leading Name: patterns. Ad wrappers are stripped before body extraction.
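The JSON-LD step can be illustrated with the standard library alone. This is a sketch under stated assumptions: `JsonLdCollector` and `find_news_article` are hypothetical names, and the example only locates the NewsArticle object; which of its properties map to which output fields is not shown here.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_news_article(html):
    """Return the first JSON-LD object whose @type is NewsArticle, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the page
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "NewsArticle":
                return item
    return None
```

The sketch handles a JSON-LD block that is either a single object or a list of objects. It does not handle an "@type" given as a list or objects nested under "@graph", both of which are valid JSON-LD and may need covering in practice.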
CBS News pages are server-rendered behind a Varnish edge cache, and no bot protection was observed. Neither a proxy nor a headless browser is required.
Coverage Notes
60 Minutes airs approximately 45 episodes per US broadcast season, with 3-4 segments per episode. Roughly 50-70% of segments receive a published transcript — the remainder are video-only. This scraper covers transcript-bearing segments only and makes that boundary explicit in every record (is_transcript: true, video-only pages are skipped). The active transcript archive covers approximately 5 years back, with sparser coverage for earlier seasons.
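The figures above can be turned into a rough per-season estimate, which may help when choosing a maxItems value. The function below is only arithmetic on the approximate numbers quoted in this section; `estimate_transcripts_per_season` is a hypothetical helper, not part of the scraper.

```python
def estimate_transcripts_per_season(episodes=45, segments=(3, 4), share_pct=(50, 70)):
    """Return a (low, high) estimate of published transcripts per season.

    Defaults are the approximate figures quoted above: 45 episodes,
    3-4 segments per episode, 50-70% of segments with a transcript.
    """
    low = episodes * segments[0] * share_pct[0] // 100       # rounded down
    high = -(-episodes * segments[1] * share_pct[1] // 100)  # rounded up
    return low, high
```

With the default figures this gives roughly 67 to 126 transcripts per season, or about 335 to 630 across five seasons.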
Pricing
Charged per transcript record scraped. Long-form interviews (5,000-30,000 words each) are priced at a modest premium reflecting their per-record research value versus wire-copy or short-form corpora.