OrbTop

CBS 60 Minutes Transcripts Scraper

Scrapes full Q&A interview transcripts from CBS News 60 Minutes, the most-recognized US investigative news magazine. Returns one record per transcript page: title, correspondent, broadcast date, subject list, speaker-labeled body text, and topic metadata. Transcript pages are discovered automatically from the CBS News monthly article sitemaps. Video-only segments without a published transcript are skipped.

60 Minutes is the most-watched US news magazine, known for long-form sit-down interviews with heads of state, CEOs, whistleblowers, and scientists. Each transcript runs 5,000-30,000 words of clean, on-the-record Q&A — high-signal content for media research, RAG pipelines, and investigative journalism datasets.

What It Scrapes

Targets two URL patterns on cbsnews.com:

  • /news/<slug>-60-minutes-transcript/ — primary transcript pattern
  • /news/read-the-full-transcript-of-<slug>/ — extended interview variant

Discovery walks the CBS News monthly article sitemaps, filters by these patterns, and scrapes each matching page. Video-only stories (e.g. /news/<slug>-60-minutes/) are explicitly excluded.
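
The filtering step can be sketched as a pair of regular expressions applied to the URL path. This is a minimal illustration, not the scraper's actual code: the expressions and the function name `is_transcript_url` are assumptions derived from the two patterns listed above.

```python
import re
from urllib.parse import urlparse

# Assumed regexes built from the two documented URL patterns; the scraper's
# real expressions are not published.
TRANSCRIPT_PATTERNS = [
    re.compile(r"^/news/[^/]+-60-minutes-transcript/?$"),
    re.compile(r"^/news/read-the-full-transcript-of-[^/]+/?$"),
]


def is_transcript_url(url: str) -> bool:
    """Return True when the URL path matches one of the transcript patterns."""
    path = urlparse(url).path
    return any(pattern.match(path) for pattern in TRANSCRIPT_PATTERNS)


urls = [
    "https://www.cbsnews.com/news/netanyahu-us-israel-iran-60-minutes-transcript/",
    "https://www.cbsnews.com/news/example-story-60-minutes/",  # video-only shape
]
transcript_urls = [u for u in urls if is_transcript_url(u)]
```

Because the video-only pattern lacks the `-transcript` suffix, it fails both expressions and drops out without a separate exclusion rule.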

Output Schema

| Field | Type | Description |
|---|---|---|
| story_slug | string | URL slug of the transcript page |
| story_title | string | Article headline |
| story_url | string | Canonical CBS News URL |
| aired_date | string | Broadcast date (YYYY-MM-DD) |
| published_date | string | CBS News publish timestamp (ISO 8601) |
| segment_type | string | Inferred type: interview, investigation, or profile |
| correspondent | string | CBS News correspondent (e.g. Major Garrett, Lesley Stahl) |
| subjects | string | Interviewed subjects extracted from speaker labels (comma-separated) |
| synopsis | string | Article dek / meta description |
| body_html | string | Full transcript HTML preserving Q&A paragraph structure |
| body_text | string | Plain-text version of the transcript |
| speakers | string | All speaker labels found in the transcript (comma-separated) |
| is_transcript | boolean | Always true; non-transcripts are skipped |
| has_video_only_variant | boolean | True when a paired video-only story exists |
| related_story_urls | string | Related CBS News links on the page (comma-separated) |
| topics | string | CBS News topic tags (comma-separated) |
| canonical_url | string | Canonical URL from page head |
| source | string | Fixed value: cbsnews.com/60-minutes |
| scraped_at | datetime | ISO 8601 scrape timestamp |

Speaker labels follow two CBS conventions: Major Garrett: (Title Case) and MAJOR GARRETT: (ALL-CAPS, used in the extended-interview variant). Both formats are normalized and extracted.
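
A rough sketch of how both label conventions could be recognized and normalized follows. The regular expression and the `extract_speakers` helper are hypothetical, and the sample paragraphs are placeholders rather than real transcript text. The heuristic (one to four capitalized words followed by a colon) would also catch a leading "Note:", so a production version would likely need extra filtering.

```python
import re

# Hypothetical pattern: a paragraph-leading name of one to four capitalized
# words followed by a colon. Covers "Major Garrett:" and "MAJOR GARRETT:".
SPEAKER_RE = re.compile(
    r"^\s*([A-Z][A-Za-z.'\-]*(?:\s+[A-Z][A-Za-z.'\-]*){0,3}):\s"
)


def extract_speakers(paragraphs):
    """Return unique speaker labels in first-seen order, with ALL-CAPS
    labels converted to Title Case so both conventions collapse together."""
    seen = {}
    for paragraph in paragraphs:
        match = SPEAKER_RE.match(paragraph)
        if not match:
            continue
        label = " ".join(match.group(1).split())
        key = label.casefold()
        if key not in seen:
            seen[key] = label.title() if label.isupper() else label
    return list(seen.values())


sample = [
    "MAJOR GARRETT: Placeholder question.",
    "Jane Doe: Placeholder answer.",
    "Major Garrett: Placeholder follow-up.",
    "The segment continues with narration.",
]
speakers = extract_speakers(sample)
```

Joining the result with ", " would give the comma-separated form used by the speakers field.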

Input Options

maxItems (integer, required) — Maximum number of transcript records to scrape. Set a higher value for bulk runs.

startDate (string, optional) — Limit sitemap discovery to a given month onwards (YYYY-MM format, e.g. "2024-01"). Defaults to all available months when omitted.

startUrls (array, optional) — One or more direct CBS News transcript URLs. When provided, sitemap discovery is skipped and only the supplied URLs are scraped. Useful for targeted re-runs of specific episodes.
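
The interaction between these options can be illustrated with a small validation sketch. The `ScraperInput` class and both functions are hypothetical names used for illustration; only the option names, their types, and the rule that startUrls bypasses sitemap discovery come from the descriptions above.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScraperInput:
    max_items: int
    start_date: Optional[str] = None
    start_urls: List[str] = field(default_factory=list)


def parse_input(raw: dict) -> ScraperInput:
    """Validate a raw input object against the documented options."""
    max_items = raw.get("maxItems")
    if not isinstance(max_items, int) or isinstance(max_items, bool) or max_items < 1:
        raise ValueError("maxItems is required and must be a positive integer")
    start_date = raw.get("startDate")
    if start_date is not None and not re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", start_date):
        raise ValueError("startDate must use YYYY-MM format")
    # startUrls entries are objects with a "url" key, as in the examples.
    start_urls = [entry["url"] for entry in raw.get("startUrls", [])]
    return ScraperInput(max_items, start_date, start_urls)


def discovery_mode(cfg: ScraperInput) -> str:
    """Direct URLs take precedence; otherwise fall back to sitemap discovery."""
    return "direct" if cfg.start_urls else "sitemap"
```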

Example: Specific episode

{
    "maxItems": 1,
    "startUrls": [
        {"url": "https://www.cbsnews.com/news/netanyahu-us-israel-iran-60-minutes-transcript/"}
    ]
}

Example: All 2025 transcripts

{
    "maxItems": 200,
    "startDate": "2025-01"
}

Example: Full archive (all available transcripts)

{
    "maxItems": 1000
}

How It Works

Discovery uses the CBS News sitemap index at cbsnews.com/xml-sitemap/index.xml. Monthly article sitemaps (article-YYYY-MM.xml) are walked in order, newest first. Each sitemap lists 3,000+ news articles; only URLs matching the transcript patterns are fetched.
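
As an illustration of the newest-first walk, the sketch below selects monthly sitemaps from a sitemap index and applies the startDate cutoff. It assumes the index follows the standard sitemaps.org schema and that monthly files are named article-YYYY-MM.xml as described; the function name and the example.com URLs are placeholders, and fetching is left out.

```python
import re
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MONTHLY_RE = re.compile(r"article-(\d{4}-\d{2})\.xml")


def monthly_sitemaps(index_xml, start_date=None):
    """Return monthly article sitemap URLs, newest first.

    start_date is an optional "YYYY-MM" cutoff; earlier months are dropped.
    """
    root = ET.fromstring(index_xml)
    found = []
    for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        match = MONTHLY_RE.search(url)
        # Zero-padded YYYY-MM strings sort the same way as the dates do.
        if match and (start_date is None or match.group(1) >= start_date):
            found.append((match.group(1), url))
    found.sort(reverse=True)
    return [url for _, url in found]


index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/xml-sitemap/article-2024-12.xml</loc></sitemap>
  <sitemap><loc>https://example.com/xml-sitemap/article-2025-02.xml</loc></sitemap>
  <sitemap><loc>https://example.com/xml-sitemap/article-2025-01.xml</loc></sitemap>
  <sitemap><loc>https://example.com/xml-sitemap/video-2025-01.xml</loc></sitemap>
</sitemapindex>"""
selected = monthly_sitemaps(index_xml, start_date="2025-01")
```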

Metadata is parsed from JSON-LD NewsArticle blocks present on every CBS article page — giving reliable correspondent name, publish date, and keywords. The transcript body lives in <section class="content__body"> as a sequence of <p> tags. Speaker labels are extracted from paragraph-leading Name: patterns. Ad wrappers are stripped before body extraction.
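
The JSON-LD step might look roughly like the following, using only the Python standard library. This is a sketch under assumptions: the class and function names are hypothetical, and the fields in the sample (headline, datePublished, author) are standard schema.org NewsArticle properties rather than a confirmed list of what CBS pages emit.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect parsed JSON from <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the page


def find_news_article(html):
    """Return the first JSON-LD object whose @type is NewsArticle, or None."""
    parser = JsonLdCollector()
    parser.feed(html)
    for block in parser.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "NewsArticle":
                return item
    return None


sample_html = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "NewsArticle", "headline": "Example headline", '
    '"datePublished": "2025-01-05T19:00:00-05:00", '
    '"author": {"@type": "Person", "name": "Major Garrett"}}'
    "</script></head><body></body></html>"
)
article = find_news_article(sample_html)
```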

CBS News pages are server-rendered (behind a Varnish edge cache), and no bot protection was observed. No proxy or headless browser is required.

Coverage Notes

60 Minutes airs approximately 45 episodes per US broadcast season, with 3-4 segments per episode. Roughly 50-70% of segments receive a published transcript; the remainder are video-only. This scraper covers transcript-bearing segments only and makes that boundary explicit: every record carries is_transcript: true, and video-only pages are skipped. The active transcript archive goes back approximately 5 years, with sparser coverage for earlier seasons.
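
For sizing maxItems, the figures above can be turned into a rough per-season range. The helper below is only a back-of-the-envelope sketch using the approximate numbers quoted in this section; the function name is hypothetical, and integer arithmetic rounds the lower bound down.

```python
def transcripts_per_season(episodes=45, segments=(3, 4), share_pct=(50, 70)):
    """Estimate the (low, high) number of transcripts per broadcast season."""
    low = episodes * segments[0] * share_pct[0] // 100
    high = episodes * segments[1] * share_pct[1] // 100
    return low, high


low, high = transcripts_per_season()  # (67, 126)
```

On these estimates a single season yields roughly 67 to 126 transcripts, and five seasons roughly 335 to 630.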

Pricing

Charged per transcript record scraped. Long-form interviews (5,000-30,000 words each) are priced at a modest premium reflecting their per-record research value versus wire-copy or short-form corpora.