CNN Transcripts Scraper — Full Text and Speakers

Scrape broadcast transcripts from CNN's public archive at transcripts.cnn.com. The archive covers approximately 2000 to the present, over two decades of CNN programming. Output has per-segment granularity, with extracted speaker labels and structured show metadata on every record. Roughly 30–50 new segments are published daily across ~30 active shows.

What does the CNN Transcripts Scraper do?

The scraper walks the CNN transcript date index to discover segment URLs for each requested date, then fetches each segment page to extract full text, speaker labels, and show metadata. Speaker labels are parsed from ALL-CAPS identifiers preceding a colon (e.g. ANDERSON COOPER:, TRUMP:) — this covers named hosts and guests; unnamed contributors appear as UNIDENTIFIED MALE / UNIDENTIFIED FEMALE where the transcript uses them.
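
The speaker-label rule described above can be sketched as a regular expression. This is a minimal illustration, not the scraper's actual implementation: the pattern, the `SPEAKER_RE` name, and the `extract_speakers` helper are all assumptions, and real transcripts may use label forms (for example, labels that include a title after a comma) that this pattern does not cover.

```python
import re

# Assumed pattern: a line that starts with an ALL-CAPS label (letters,
# spaces, periods, apostrophes, hyphens) followed directly by a colon.
SPEAKER_RE = re.compile(r"^([A-Z][A-Z .'\-]*[A-Z]):", re.MULTILINE)


def extract_speakers(body_text: str) -> list[str]:
    """Return unique speaker labels in order of first appearance."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for match in SPEAKER_RE.finditer(body_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


sample = (
    "ANDERSON COOPER: Good evening.\n"
    "TRUMP: Thank you.\n"
    "UNIDENTIFIED MALE: Yes.\n"
    "ANDERSON COOPER: We'll be right back.\n"
)
print(extract_speakers(sample))
```

Mixed-case text before a colon (such as `He said: no.`) is not matched, because the label must consist entirely of capital letters and the listed punctuation.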

What data does it extract?

Each output record represents one broadcast segment.

Field	Type	Description
`show_slug`	string	CNN show identifier (e.g. `cnr`, `fzgps`, `sotu`)
`show_title`	string	Full show name (e.g. `CNN Newsroom`)
`aired_date`	string	Broadcast date (YYYY-MM-DD)
`segment_number`	number	Segment index within the show-date (1, 2, 3, …)
`segment_title`	string	Segment headline and topic summary
`segment_url`	string	Canonical URL on transcripts.cnn.com
`body_html`	string	Full transcript HTML (preserves timestamps, paragraph breaks)
`body_text`	string	Plain-text version with speaker labels and newlines preserved
`speakers`	string	Comma-separated list of detected speaker labels
`aired_at_local`	string	ET broadcast time (e.g. `02:00 ET`)
`source`	string	Always `transcripts.cnn.com`
`scraped_at`	string	ISO timestamp when the record was scraped

Sample record (abridged; `body_html`, `source`, and `scraped_at` are omitted):

{
  "show_slug": "fzgps",
  "show_title": "Fareed Zakaria GPS",
  "aired_date": "2026-05-08",
  "segment_number": 3,
  "segment_title": "Discussion on US-China trade and tariff negotiations",
  "aired_at_local": "10:00 ET",
  "speakers": "FAREED ZAKARIA,GUEST NAME",
  "body_text": "FAREED ZAKARIA: Welcome back. Today we examine...",
  "segment_url": "https://transcripts.cnn.com/show/fzgps/date/2026-05-08/segment/03"
}
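
Because `speakers` is a single comma-separated string, downstream code usually needs to split it first. The helper below is a small sketch; `speaker_list` is a hypothetical name, and it assumes that individual speaker labels never contain a comma themselves.

```python
import json


def speaker_list(record: dict) -> list[str]:
    """Split the comma-separated `speakers` field into a clean list."""
    raw = record.get("speakers") or ""
    return [label.strip() for label in raw.split(",") if label.strip()]


# A trimmed copy of the sample record above.
record = json.loads("""
{
  "show_slug": "fzgps",
  "aired_date": "2026-05-08",
  "segment_number": 3,
  "speakers": "FAREED ZAKARIA,GUEST NAME"
}
""")

print(speaker_list(record))
```

Records with a missing or empty `speakers` field yield an empty list rather than raising an error.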

How to use it

Common show slugs — use these in the showSlugs filter:

Slug	Show
`cnr`	CNN Newsroom
`fzgps`	Fareed Zakaria GPS
`sotu`	State of the Union
`acd`	Anderson Cooper 360
`ebo`	Erin Burnett OutFront
`cg`	The Lead with Jake Tapper
`sitroom`	The Situation Room
`ip`	Inside Politics
`ctmo`	CNN This Morning
`ampr`	Amanpour

Single day, all shows:

{
  "startDate": "2026-05-08",
  "maxItems": 50
}

Date range with show filter:

{
  "startDate": "2026-05-01",
  "endDate": "2026-05-08",
  "showSlugs": ["cnr", "fzgps", "sotu"],
  "maxItems": 500
}

Field	Type	Required	Description
`startDate`	string	Yes	Start date (YYYY-MM-DD)
`endDate`	string	No	End date, inclusive (YYYY-MM-DD). Defaults to `startDate`.
`showSlugs`	string[]	No	Show slugs to filter. Empty = all shows.
`maxItems`	integer	No	Max segments. `0` = no limit.
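
If you build the input programmatically, it can help to validate it before starting a run. The function below is a sketch under the rules in the table above; `build_run_input` is a hypothetical helper, not part of the scraper, and the checks it performs (date order, non-negative `maxItems`) are assumptions about what the scraper would accept.

```python
from datetime import date
from typing import Optional


def build_run_input(
    start_date: str,
    end_date: Optional[str] = None,
    show_slugs: Optional[list[str]] = None,
    max_items: int = 0,
) -> dict:
    """Build an input object using the field names from the table above."""
    start = date.fromisoformat(start_date)  # raises ValueError on bad dates
    end = date.fromisoformat(end_date) if end_date else start
    if end < start:
        raise ValueError("endDate must not be earlier than startDate")
    if max_items < 0:
        raise ValueError("maxItems must be 0 (no limit) or a positive integer")

    run_input = {
        "startDate": start.isoformat(),
        "endDate": end.isoformat(),
        "maxItems": max_items,
    }
    if show_slugs:  # an empty or missing filter means all shows
        run_input["showSlugs"] = list(show_slugs)
    return run_input


print(build_run_input("2026-05-01", "2026-05-08", ["cnr", "fzgps", "sotu"], 500))
```

The resulting dictionary has the same shape as the JSON examples above and can be serialized with `json.dumps` for whichever way you start the run.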

Use cases

Media monitoring and research — Track how a topic, name, or organization is covered across CNN shows over a date range. Filter by show to compare how different programs frame the same news event.
NLP and training datasets — CNN transcripts are a large, structured corpus of broadcast English with speaker labels, making them useful for topic modeling, fine-tuning language models, and the text side of speech recognition work (the scraper output is text only).
Journalism and fact-checking — Pull exact quotes from a specific broadcast with the aired_date, segment_title, and speakers fields for citation and verification.
Political and communications research — Analyze which guests appear on which shows, how often, and in what context using the speakers field across extended date ranges.
Media archive access — The archive dates to approximately 2000, making this a practical way to retrieve historical CNN coverage without manual navigation of the transcript site.
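
As an illustration of the guest-appearance use case, the sketch below counts how many segments each speaker label appears in, grouped by show. It assumes the records have already been exported and loaded as a list of dictionaries with the `show_slug` and `speakers` fields; `appearances_by_show` is a hypothetical helper, and the speaker labels in the sample data are placeholders.

```python
from collections import Counter, defaultdict


def appearances_by_show(records: list[dict]) -> dict[str, Counter]:
    """Count, per show, the number of segments each speaker label appears in."""
    counts: defaultdict[str, Counter] = defaultdict(Counter)
    for rec in records:
        raw = rec.get("speakers") or ""
        # A set ensures each label counts once per segment.
        labels = {label.strip() for label in raw.split(",") if label.strip()}
        counts[rec["show_slug"]].update(labels)
    return dict(counts)


records = [
    {"show_slug": "fzgps", "speakers": "HOST A,GUEST A"},
    {"show_slug": "fzgps", "speakers": "HOST A,GUEST B"},
    {"show_slug": "sotu", "speakers": "HOST B,GUEST A"},
]

for show, counter in appearances_by_show(records).items():
    print(show, counter.most_common(3))
```

Note that the same person may appear under different labels (for example, a full name in one segment and a surname in another), so some normalization is likely needed before drawing conclusions from these counts.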

FAQ

How far back does the archive go? The CNN transcript archive at transcripts.cnn.com extends to approximately 2000 for most shows. Coverage varies by program — older shows may have gaps.

Is this data public? Yes. CNN publishes transcripts on transcripts.cnn.com for public informational access. Users are responsible for ensuring their downstream use complies with applicable copyright law and CNN's terms of service.

What export formats are available? Apify supports JSON, CSV, and Excel export. The body_text and body_html fields contain the full transcript text, which may be long — JSON is recommended for large text fields.
