CBS 60 Minutes Transcript Scraper — Interview Archive

60 Minutes transcripts typically run 5,000–30,000 words of on-the-record Q&A per segment, with consistent correspondent and speaker-label fields across the archive. This actor scrapes full interview transcripts from CBS News 60 Minutes and returns one record per segment, containing the headline, correspondent, broadcast date, speaker-labeled body text, and topic metadata. It discovers transcript pages automatically from the CBS News article sitemap. Video-only segments without a published transcript are skipped.

What's included and what isn't

60 Minutes airs approximately 45 episodes per US broadcast season, with 3–4 segments per episode. Roughly 50–70% of segments have a published transcript — the remainder are video-only. This scraper covers transcript-bearing segments only, makes that boundary explicit in every record (is_transcript: true), and skips video-only pages entirely. The active transcript archive covers approximately 5 years back, with sparser coverage for earlier seasons.

What does the CBS 60 Minutes Transcript Scraper do?

Discovery walks the CBS News sitemap index at cbsnews.com/xml-sitemap/index.xml, filtering monthly article sitemaps for two URL patterns:

/news/<slug>-60-minutes-transcript/ — primary transcript pattern
/news/read-the-full-transcript-of-<slug>/ — extended interview variant
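
The filtering step can be illustrated with a short Python sketch. The regular expressions below are an assumed translation of the two documented URL patterns, and `is_transcript_url` is a hypothetical helper name; the actor's actual matching rules are not shown in this README.

```python
import re
from urllib.parse import urlparse

# Assumed regex forms of the two documented URL patterns.
# <slug> is taken to be lowercase letters, digits, and hyphens.
TRANSCRIPT_PATTERNS = [
    re.compile(r"^/news/[a-z0-9-]+-60-minutes-transcript/?$"),
    re.compile(r"^/news/read-the-full-transcript-of-[a-z0-9-]+/?$"),
]


def is_transcript_url(url: str) -> bool:
    """Return True when the URL path matches either transcript pattern."""
    path = urlparse(url).path
    return any(pattern.match(path) for pattern in TRANSCRIPT_PATTERNS)


if __name__ == "__main__":
    print(is_transcript_url(
        "https://www.cbsnews.com/news/netanyahu-us-israel-iran-60-minutes-transcript/"
    ))
```

Matching on the parsed path rather than the full URL keeps the check independent of scheme, host, and query string.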

Metadata is parsed from JSON-LD NewsArticle blocks on each page. Transcript body text lives in <section class="content__body"> as <p> tags, with ad wrappers stripped before extraction. Speaker labels are extracted from paragraph-leading Name: patterns in both Title Case and ALL-CAPS formats. No headless browser or proxy required.
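
As a rough illustration of the speaker-label step, the sketch below matches a paragraph-leading `Name:` prefix in either Title Case or ALL-CAPS. The regex, the one-to-four-word limit on names, and the `extract_speakers` helper are assumptions made for this example, not the actor's actual code.

```python
import re

# Assumed label shape: one to four capitalized words at the start of a
# paragraph, followed by a colon and whitespace. This covers both
# "Lesley Stahl: ..." and "PRIME MINISTER NETANYAHU: ..." style labels.
SPEAKER_RE = re.compile(
    r"^\s*([A-Z][A-Za-z.'-]*(?:\s+[A-Z][A-Za-z.'-]*){0,3}):\s+"
)


def extract_speakers(paragraphs):
    """Return unique speaker labels in order of first appearance."""
    seen = []
    for text in paragraphs:
        match = SPEAKER_RE.match(text)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    sample = [
        "Lesley Stahl: Why now?",
        "PRIME MINISTER NETANYAHU: Because the moment called for it.",
        "This paragraph carries no speaker label.",
    ]
    print(", ".join(extract_speakers(sample)))
```

A heuristic like this will also pick up non-speaker prefixes such as "Note:", so a real implementation would likely need additional filtering.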

What data does it extract?

Field	Type	Description
`story_slug`	string	URL slug of the transcript page
`story_title`	string	Article headline
`story_url`	string	Canonical CBS News URL
`aired_date`	string	Broadcast date (YYYY-MM-DD)
`published_date`	string	CBS News publish timestamp (ISO 8601)
`segment_type`	string	Inferred type: `interview`, `investigation`, or `profile`
`correspondent`	string	CBS News correspondent (e.g. Major Garrett, Lesley Stahl)
`subjects`	string	Interviewed subjects extracted from speaker labels (comma-separated)
`synopsis`	string	Article meta description
`body_html`	string	Full transcript HTML preserving Q&A paragraph structure
`body_text`	string	Plain-text version of the transcript
`speakers`	string	All speaker labels found in the transcript (comma-separated)
`is_transcript`	boolean	Always `true` — non-transcripts are skipped
`has_video_only_variant`	boolean	True when a paired video-only story exists
`related_story_urls`	string	Related CBS News links on the page (comma-separated)
`topics`	string	CBS News topic tags (comma-separated)
`canonical_url`	string	Canonical URL from page head
`source`	string	Fixed: `cbsnews.com/60-minutes`
`scraped_at`	datetime	ISO 8601 scrape timestamp

How to use it

{
  "maxItems": 1,
  "startUrls": [
    {"url": "https://www.cbsnews.com/news/netanyahu-us-israel-iran-60-minutes-transcript/"}
  ]
}

Scrapes a specific episode transcript.

{
  "maxItems": 200,
  "startDate": "2025-01"
}

Scrapes all 60 Minutes transcripts published from January 2025 onward (up to 200 records).

{ "maxItems": 1000 }

Full archive crawl: returns every available transcript in the active archive, up to the 1,000-record cap set by maxItems.

Field	Type	Description
`maxItems`	integer	Required. Maximum transcript records to scrape
`startDate`	string	Optional. Limit discovery to sitemaps from this month onward (YYYY-MM format)
`startUrls`	array	Optional. Direct CBS News transcript URLs — skips sitemap discovery when provided

Pricing

Charged per transcript record scraped. A 200-transcript run at the 1.2x coefficient on the default_2603_basic profile costs approximately $0.35 ($0.10 start + $0.00125 per record × 200 records).
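
The arithmetic above can be reproduced for other run sizes with a small Python sketch. It assumes the $0.10 start fee and $0.00125 per-record fee quoted above already reflect the 1.2x coefficient; `estimate_cost` is a hypothetical helper, not part of the actor.

```python
from decimal import Decimal

# Rates quoted in the pricing example; assumed to already include
# the 1.2x coefficient for the default_2603_basic profile.
START_FEE = Decimal("0.10")
PER_RECORD_FEE = Decimal("0.00125")


def estimate_cost(records: int) -> Decimal:
    """Estimate the cost in USD of a run that returns `records` transcripts."""
    return START_FEE + PER_RECORD_FEE * records


if __name__ == "__main__":
    for count in (1, 200, 1000):
        print(count, estimate_cost(count))
```

`Decimal` is used instead of floats so that small per-record fees add up without rounding drift.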

Use cases

Media and political research — build a structured corpus of 60 Minutes interviews with heads of state, CEOs, and scientists spanning multiple years
NLP corpora — long-form Q&A transcripts with consistent speaker labeling are well-suited for dialog modeling, summarization, or entity extraction
Journalism datasets — index correspondent names, broadcast dates, and topics across the archive to analyze coverage patterns or research specific subjects
RAG pipelines — ingest high-quality, on-the-record interview text as a retrieval source for investigative journalism or policy research applications
Academic research — track how specific topics (foreign policy, corporate governance, public health) are covered on network news over time

FAQ

Why is coverage 50–70% rather than 100%?

CBS News publishes transcripts for most but not all 60 Minutes segments. Some segments are video-only by editorial choice, particularly shorter news-break items and some documentary segments. The URL-pattern filter ensures that only genuine transcript pages are returned, and every returned record carries is_transcript: true.

Can I scrape by correspondent?

The input does not have a correspondent filter, but every record returns the correspondent field. Fetch the relevant date range and filter downstream by correspondent name.
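
A minimal downstream filter might look like the Python sketch below. It assumes the records have already been loaded as a list of dictionaries with the field names from the output table; `by_correspondent` and `split_field` are hypothetical helper names, and the second sample record is made-up placeholder data.

```python
def split_field(value):
    """Split a comma-separated field such as `speakers` or `topics` into a list."""
    return [part.strip() for part in (value or "").split(",") if part.strip()]


def by_correspondent(records, name):
    """Keep records whose `correspondent` matches `name`, ignoring case."""
    wanted = name.casefold()
    return [
        record for record in records
        if (record.get("correspondent") or "").casefold() == wanted
    ]


if __name__ == "__main__":
    # Placeholder records shaped like the actor's output.
    records = [
        {"story_slug": "example-a", "correspondent": "Lesley Stahl",
         "speakers": "Lesley Stahl, Example Guest"},
        {"story_slug": "example-b", "correspondent": "Example Correspondent",
         "speakers": ""},
    ]
    for record in by_correspondent(records, "lesley stahl"):
        print(record["story_slug"], split_field(record["speakers"]))
```

The same `split_field` helper applies to the other comma-separated fields, such as `subjects`, `topics`, and `related_story_urls`.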

Results export as JSON, CSV, or Excel from the Apify dataset view.
