PBS Frontline Transcripts Scraper — Documentary Archive

Frontline is one of the few public-broadcasting sources whose transcripts are routinely cited in academic and journalism contexts: each film runs 60–120 minutes and produces 30–80 KB of clean, fact-checked, single-narrative text. This actor scrapes full transcripts from PBS Frontline documentary films: title, air date, synopsis, speaker-labeled transcript body, producer and correspondent credits, and PBS topic tags. It covers 250+ films dating back to approximately 1995.

Copyright and permitted use

PBS Frontline transcripts are copyrighted content owned by WGBH Educational Foundation. Before bulk extraction, particularly for AI training datasets, commercial redistribution, or republication, review PBS's terms of use. This actor provides access to publicly available transcript pages; users are responsible for ensuring that their downstream use complies with PBS's terms.

What does the PBS Frontline Transcripts Scraper do?

Discovery uses Frontline's sitemap index at pbs.org/wgbh/frontline/sitemap.xml. Nine documentary sub-sitemaps hold up to 100 film URLs each, ordered newest-first. The actor visits each documentary page, extracts metadata from JSON-LD structured data blocks, and parses the transcript and credits from the page's accordion panels.
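The discovery step can be sketched with the standard library alone. This is a minimal illustration, not the actor's actual code: the `documentary` substring filter and the sample URLs are assumptions, and the real sub-sitemap file names may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text: str) -> list[str]:
    """Return every <loc> value from a sitemap index or a URL sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS) if loc.text]


def documentary_sitemaps(index_xml: str, marker: str = "documentary") -> list[str]:
    """Keep sub-sitemaps whose URL contains `marker` (an assumed naming pattern)."""
    return [url for url in extract_locs(index_xml) if marker in url]


# Illustrative sample only; the real index layout and file names may differ.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.org/sitemaps/documentary-1.xml</loc></sitemap>
  <sitemap><loc>https://example.org/sitemaps/article-1.xml</loc></sitemap>
</sitemapindex>"""

print(documentary_sitemaps(SAMPLE_INDEX))
```

The same `extract_locs` helper works on a sub-sitemap, since `<urlset>` entries use the same `<loc>` element and namespace.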

No headless browser is required. Frontline is server-rendered Next.js with edge caching — HTML responses contain everything needed.
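Because the metadata ships in the raw HTML, JSON-LD blocks can be read with a plain HTML parser. The sketch below uses Python's `html.parser`; the `TVEpisode` type and the field names in the sample are assumptions for illustration, not a confirmed description of what Frontline pages emit.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks rather than failing the whole page.


def extract_jsonld(html: str) -> list:
    """Return all JSON-LD payloads found in an HTML document."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical page fragment; real pages may use different types and keys.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@type": "TVEpisode", "name": "Example Film", "datePublished": "2020-01-15"}
</script>
<script>console.log("not JSON-LD");</script>
</head><body><p>Transcript goes here.</p></body></html>"""

blocks = extract_jsonld(SAMPLE_HTML)
print(blocks[0]["name"], blocks[0]["datePublished"])
```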

Films without a published transcript are skipped silently and do not count toward maxItems.
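The counting rule can be expressed as a small generator. This is a sketch of the described behavior under the assumption that a missing transcript shows up as an empty `body_text`; the function name and sample slugs are hypothetical.

```python
from typing import Iterable, Iterator


def take_with_transcripts(films: Iterable[dict], max_items: int = 0) -> Iterator[dict]:
    """Yield films that have a transcript, stopping after max_items are emitted.

    max_items == 0 means no limit. Films without a transcript are skipped
    and do not count toward the cap.
    """
    emitted = 0
    for film in films:
        if not film.get("body_text"):
            continue  # No transcript: skip without consuming the budget.
        yield film
        emitted += 1
        if max_items and emitted >= max_items:
            return


# Hypothetical records, newest first.
FILMS = [
    {"film_slug": "film-a", "body_text": "NARRATOR: ..."},
    {"film_slug": "film-b", "body_text": ""},
    {"film_slug": "film-c", "body_text": "NARRATOR: ..."},
    {"film_slug": "film-d", "body_text": "NARRATOR: ..."},
]

print([f["film_slug"] for f in take_with_transcripts(FILMS, max_items=2)])
```

With a cap of 2, the transcript-less `film-b` is passed over and the run continues to `film-c` before stopping.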

What data does it extract?

Field	Type	Description
`film_slug`	string	URL slug (e.g. `the-deal-trump-bukele-gangs-el-salvador`)
`film_title`	string	Documentary title
`film_url`	string	Canonical PBS URL
`air_date`	string	Original broadcast date (YYYY-MM-DD)
`duration_minutes`	number	Runtime in minutes
`synopsis`	string	Brief description from page metadata
`producers`	string	Comma-separated producing and directing credits
`correspondents`	string	Comma-separated correspondent credits
`related_topics`	string	Comma-separated PBS topic tags
`body_html`	string	Full transcript HTML with `<strong>SPEAKER:</strong>` spans preserved
`body_text`	string	Plain-text transcript with inline speaker labels
`speakers`	string	Comma-separated unique speaker labels extracted from the transcript
`has_viewer_discretion_notice`	boolean	True if the film flags mature content
`related_film_urls`	string	Comma-separated URLs of cross-linked Frontline films
`canonical_url`	string	Canonical page URL
`source`	string	Fixed: `pbs.org/wgbh/frontline`
`scraped_at`	datetime	ISO 8601 scrape timestamp

Speaker labels follow Frontline convention: NARRATOR, PRESIDENT DONALD TRUMP, NAYIB BUKELE, etc. They are extracted from <strong>LABEL:</strong> spans — no inference required.
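A rough sketch of how unique labels could be pulled from `body_html` with a regular expression. It assumes labels sit in bare `<strong>` tags with no attributes and end in a colon; the sample markup and the JANE DOE speaker are made up for illustration.

```python
import re

# Matches <strong>LABEL:</strong>; the label is the text before the trailing colon.
SPEAKER_RE = re.compile(r"<strong>\s*([^<:]+?)\s*:\s*</strong>")


def unique_speakers(body_html: str) -> list[str]:
    """Return speaker labels in order of first appearance, without duplicates."""
    seen: dict[str, None] = {}  # dicts preserve insertion order
    for match in SPEAKER_RE.finditer(body_html):
        seen.setdefault(match.group(1), None)
    return list(seen)


# Hypothetical transcript fragment.
SAMPLE_BODY = (
    "<p><strong>NARRATOR:</strong> Opening narration.</p>"
    "<p><strong>JANE DOE:</strong> An interview answer.</p>"
    "<p><strong>NARRATOR:</strong> Closing narration.</p>"
    "<p>A paragraph with <strong>emphasis</strong> but no label.</p>"
)

print(", ".join(unique_speakers(SAMPLE_BODY)))
```

Bold text without a trailing colon is ignored, so ordinary emphasis in the transcript does not leak into the speaker list.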

Data quality

Frontline's transcript markup is consistent for films published since the site's modern rebuild, but coverage has gaps at the older end of the archive. A fraction of older films have transcripts that were rebuilt without speaker-label markup; in these cases body_text is still returned, but speakers is empty. Films with no transcript at all are skipped entirely. Newer films (roughly post-2010) have reliable speaker-label coverage.
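Downstream code that depends on speaker turns may want to separate labeled from unlabeled transcripts before processing. A minimal sketch, assuming records are dicts keyed by the output field names; the helper name and sample slugs are hypothetical.

```python
def partition_by_speaker_labels(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (labeled, unlabeled) based on the `speakers` field."""
    labeled: list[dict] = []
    unlabeled: list[dict] = []
    for record in records:
        has_labels = bool((record.get("speakers") or "").strip())
        (labeled if has_labels else unlabeled).append(record)
    return labeled, unlabeled


# Hypothetical records.
RECORDS = [
    {"film_slug": "recent-film", "speakers": "NARRATOR, JANE DOE"},
    {"film_slug": "older-film", "speakers": ""},
]

labeled, unlabeled = partition_by_speaker_labels(RECORDS)
print(len(labeled), len(unlabeled))
```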

How to use it

{
  "startUrls": [
    {"url": "https://www.pbs.org/wgbh/frontline/documentary/the-deal-trump-bukele-gangs-el-salvador/"}
  ]
}

Scrapes a single specified documentary.

{ "maxItems": 0 }

Runs the full sitemap discovery and scrapes all available transcripts (~250 films).

{ "maxItems": 50 }

Scrapes the 50 most recent documentaries that have a published transcript.

Field	Type	Description
`startUrls`	array	Specific documentary URLs to scrape. Leave empty for full sitemap discovery
`maxItems`	integer	Cap on total records. `0` = no limit

Use cases

Media research and journalism studies — build a structured corpus of longform documentary transcripts from a single authoritative source with consistent formatting
NLP and text analysis — use body_html with preserved <strong>SPEAKER:</strong> spans for speaker-turn detection and dialog analysis pipelines
RAG and retrieval applications — ingest Frontline transcripts as high-quality, fact-checked documentary text for retrieval-augmented generation systems
Political science and policy research — extract transcripts by related_topics tag (Criminal Justice, Immigration, National Security, etc.) for topic-specific corpora
Archival and library datasets — build a queryable index of Frontline films with air dates, credits, and full text for institutional research access

Results export as JSON, CSV, or Excel from the Apify dataset view.
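As an illustration of the topic-specific use case, an exported JSON file can be filtered on `related_topics` by splitting the comma-separated string. This is a sketch under the assumption that individual tags contain no commas; the sample records and slugs are invented.

```python
import json


def split_list_field(value: str) -> list[str]:
    """Split a comma-separated field into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def films_with_topic(records: list[dict], topic: str) -> list[dict]:
    """Return records whose related_topics include `topic` (case-insensitive)."""
    wanted = topic.casefold()
    return [
        record
        for record in records
        if wanted in {t.casefold() for t in split_list_field(record.get("related_topics", ""))}
    ]


# Stand-in for the contents of an exported JSON file.
SAMPLE_EXPORT = """[
  {"film_slug": "example-film-a", "related_topics": "Immigration, Criminal Justice"},
  {"film_slug": "example-film-b", "related_topics": "National Security"},
  {"film_slug": "example-film-c", "related_topics": "immigration"}
]"""

records = json.loads(SAMPLE_EXPORT)
print([r["film_slug"] for r in films_with_topic(records, "Immigration")])
```

The same `split_list_field` helper applies to the other comma-separated fields, such as `producers`, `correspondents`, and `speakers`.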
