OrbTop

CNN Transcripts Scraper

Scrape broadcast transcripts from CNN's public archive at transcripts.cnn.com. Covers every CNN show in the archive, which dates back to approximately 2000, with per-segment granularity, speaker-label extraction, and structured metadata.

What you get

Each output record represents one broadcast segment:

| Field | Description |
|---|---|
| show_slug | CNN show identifier (e.g. cnr, fzgps, sotu) |
| show_title | Full show name (e.g. "CNN Newsroom") |
| aired_date | Broadcast date, YYYY-MM-DD |
| segment_number | Index within the show-date (1, 2, 3 …) |
| segment_title | Segment headline and topic summary |
| segment_url | Canonical URL on transcripts.cnn.com |
| body_html | Full transcript HTML (preserves timestamps, paragraph breaks) |
| body_text | Plain-text version with speaker labels and newlines preserved |
| speakers | Comma-separated list of detected speaker names |
| aired_at_local | ET broadcast time (e.g. 02:00 ET) |
| source | Always transcripts.cnn.com |
| scraped_at | ISO timestamp of when the record was scraped |
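
Because aired_date, scraped_at, and speakers all arrive as strings, a consumer will often want to convert them to typed values first. The following is a minimal sketch of that step, not part of the scraper itself: the sample record is invented for illustration, the `normalize` helper is hypothetical, and the exact `scraped_at` format (ISO 8601 with a numeric UTC offset) is an assumption.

```python
from datetime import date, datetime

# Hypothetical record: field names follow the table above, values are made up.
record = {
    "show_slug": "cnr",
    "show_title": "CNN Newsroom",
    "aired_date": "2026-05-08",
    "segment_number": 1,
    "speakers": "ANDERSON COOPER, UNIDENTIFIED MALE",
    "scraped_at": "2026-05-09T03:15:00+00:00",  # assumed format
}

def normalize(rec):
    """Return a copy with typed dates and the speakers string split into a list."""
    out = dict(rec)
    out["aired_date"] = date.fromisoformat(rec["aired_date"])
    out["scraped_at"] = datetime.fromisoformat(rec["scraped_at"])
    # speakers is documented as comma-separated; drop empty entries defensively.
    out["speakers"] = [s.strip() for s in rec["speakers"].split(",") if s.strip()]
    return out

clean = normalize(record)
print(clean["aired_date"].year, clean["speakers"])
```

Records for a single broadcast can then be grouped by (show_slug, aired_date) and ordered by segment_number.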

Usage

Basic — scrape all shows for a single day

{
  "startDate": "2026-05-08",
  "maxItems": 50
}

Date range with show filter

{
  "startDate": "2026-05-01",
  "endDate": "2026-05-08",
  "showSlugs": ["cnr", "fzgps", "sotu"],
  "maxItems": 500
}

Input fields

| Field | Type | Required | Description |
|---|---|---|---|
| startDate | string | Yes | Start date YYYY-MM-DD |
| endDate | string | No | End date YYYY-MM-DD (defaults to startDate) |
| showSlugs | string[] | No | Filter to specific shows (e.g. ["cnr", "fzgps"]). Leave empty for all shows. |
| maxItems | integer | No | Cap on total segments returned. 0 = no limit. Default: 0. |
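
The defaults described above can be applied on the client side before a run is submitted. This is a rough sketch under stated assumptions: `build_input` is a hypothetical helper, and the checks for a reversed date range or a negative maxItems are illustrative, since the source does not say how the scraper itself handles those cases.

```python
from datetime import date

def build_input(start_date, end_date=None, show_slugs=None, max_items=0):
    """Build an input payload using the documented field names and defaults."""
    start = date.fromisoformat(start_date)  # raises ValueError on a malformed date
    end = date.fromisoformat(end_date) if end_date else start  # endDate defaults to startDate
    if end < start:
        raise ValueError("endDate must not be earlier than startDate")
    if max_items < 0:
        raise ValueError("maxItems must be 0 (no limit) or a positive integer")
    payload = {
        "startDate": start.isoformat(),
        "endDate": end.isoformat(),
        "maxItems": max_items,
    }
    if show_slugs:  # omit the key entirely to request all shows
        payload["showSlugs"] = list(show_slugs)
    return payload

print(build_input("2026-05-01", "2026-05-08", ["cnr", "fzgps", "sotu"], 500))
```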

Common show slugs

| Slug | Show |
|---|---|
| cnr | CNN Newsroom |
| fzgps | Fareed Zakaria GPS |
| sotu | State of the Union |
| acd | Anderson Cooper 360 |
| ebo | Erin Burnett OutFront |
| cg | The Lead with Jake Tapper |
| sitroom | The Situation Room |
| ip | Inside Politics |
| ctmo | CNN This Morning |
| ampr | Amanpour |

Dataset size

  • ~30 active CNN shows, 1–22 segments per show per day
  • ~30–50 new segments published daily
  • Archive goes back to approximately 2000
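
These figures can help when choosing a maxItems value for an unfiltered run. The sketch below simply multiplies the quoted daily range by the number of days. It assumes that endDate is inclusive and that the approximate 30 to 50 segments per day holds uniformly across the range, neither of which the source guarantees. `estimate_segments` is a hypothetical helper.

```python
from datetime import date

# Approximate daily volume quoted above; treated here as a uniform rate (assumption).
DAILY_LOW, DAILY_HIGH = 30, 50

def estimate_segments(start_date, end_date=None):
    """Return a rough (low, high) segment count for an inclusive date range."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date) if end_date else start
    days = (end - start).days + 1  # assumes endDate is inclusive
    if days < 1:
        raise ValueError("end_date must not be earlier than start_date")
    return days * DAILY_LOW, days * DAILY_HIGH

low, high = estimate_segments("2026-05-01", "2026-05-08")
print(f"expect roughly {low} to {high} segments")
```

A show filter reduces the count further, so the estimate is best read as an upper bound for filtered runs.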

Notes on speaker extraction

The speakers field parses ALL-CAPS labels preceding a colon (e.g. ANDERSON COOPER:, TRUMP:) using a regex pass on the plain-text body. It covers named hosts and guests; unnamed contributors appear as UNIDENTIFIED MALE / UNIDENTIFIED FEMALE where present.
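
The scraper's actual pattern is not published, so the following is only an approximation of the idea. It assumes that labels start a line, that a label may carry a title after a comma (as in "ANDERSON COOPER, CNN HOST:"), and that only the part before the comma is kept. `LABEL_RE` and `extract_speakers` are hypothetical names.

```python
import re

# A line-initial run of capitals, spaces, and basic punctuation, ending in a colon.
LABEL_RE = re.compile(r"^([A-Z][A-Z .,'\-]*):", re.MULTILINE)

def extract_speakers(body_text):
    """Return detected speaker names as a comma-separated string, in first-seen order."""
    seen = []
    for match in LABEL_RE.finditer(body_text):
        # Keep only the name, dropping any title that follows a comma.
        name = match.group(1).split(",")[0].strip()
        if name and name not in seen:
            seen.append(name)
    return ", ".join(seen)

sample = """ANDERSON COOPER, CNN HOST: Good evening.
TRUMP: Thank you.
UNIDENTIFIED MALE: We saw it happen.
TRUMP: More at 9:00 tonight.
Note: this line is not a speaker label."""

print(extract_speakers(sample))
```

Mixed-case prefixes such as "Note:" and mid-line colons such as "9:00" are skipped because the pattern requires an all-caps run anchored at the start of a line.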

Responsible use

Transcripts on transcripts.cnn.com are published publicly by CNN for informational access. Users are responsible for ensuring their downstream use of transcript data complies with applicable copyright law and CNN's terms of service.