PBS Frontline Transcripts Scraper
Scrapes full transcripts from PBS Frontline documentary films. Returns one record per documentary — title, synopsis, air date, speaker-labeled transcript body, and topic metadata. Covers the active Frontline archive (250+ films back to ~1995) using the site's sitemap index for discovery.
Frontline is one of the few public-broadcasting sources where transcripts are routinely cited in academic and journalism contexts. Each film runs 60-120 minutes and produces 30-80KB of clean, fact-checked, single-narrative text — a different unit of value from fragmented panel-show or wire-copy corpora.
What It Scrapes
Every record comes from pbs.org/wgbh/frontline/documentary/<slug>/. The transcript lives inline on the main documentary page (the /transcript/ subpath was retired; transcripts are now embedded directly). Metadata comes from JSON-LD structured data blocks on the same page.
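As an illustration of the JSON-LD step, here is a minimal Python sketch that pulls every `application/ld+json` block out of a page's HTML. It is not the scraper's actual code: the function name `extract_json_ld`, the sample markup, and the property names in the sample (`name`, `datePublished`) are assumptions chosen for the example, and the real pages may expose different properties.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks.
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_json_ld(html: str) -> list[dict]:
    """Return every JSON-LD object found in the page, skipping malformed blocks."""
    found = []
    for raw in JSON_LD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate a broken block instead of failing the whole page
        # A block may hold a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        found.extend(item for item in items if isinstance(item, dict))
    return found


# Hypothetical page fragment, for illustration only.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "TVEpisode", "name": "Example Film", "datePublished": "2020-01-14"}
</script>
</head></html>
"""

blocks = extract_json_ld(sample_html)
print(blocks[0]["name"])
```

A regex is enough here because JSON-LD lives in self-contained script tags; a full HTML parser would be a reasonable alternative if the markup turned out to be less regular.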
Output Schema
| Field | Type | Description |
|---|---|---|
| `film_slug` | string | URL slug (e.g. the-deal-trump-bukele-gangs-el-salvador) |
| `film_title` | string | Documentary title |
| `film_url` | string | Canonical PBS URL |
| `air_date` | string | Original broadcast date (YYYY-MM-DD) |
| `duration_minutes` | number | Runtime in minutes |
| `synopsis` | string | Brief description from page metadata |
| `producers` | string | Comma-separated producing and directing credits |
| `correspondents` | string | Comma-separated correspondent credits |
| `related_topics` | string | Comma-separated PBS topic tags |
| `body_html` | string | Full transcript HTML with `<strong>SPEAKER:</strong>` spans |
| `body_text` | string | Plain-text transcript with inline speaker labels |
| `speakers` | string | Comma-separated unique speaker labels |
| `has_viewer_discretion_notice` | boolean | True if the film flags mature content |
| `related_film_urls` | string | Comma-separated URLs of cross-linked Frontline films |
| `canonical_url` | string | Canonical page URL |
| `source` | string | Fixed: pbs.org/wgbh/frontline |
| `scraped_at` | datetime | ISO 8601 scrape timestamp |
Speaker labels follow the Frontline convention: NARRATOR, PRESIDENT DONALD TRUMP, NAYIB BUKELE, etc. They are extracted directly from <strong>LABEL:</strong> spans — no inference, no cleanup required.
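Several fields arrive as comma-separated strings, so downstream code will usually want lists. The sketch below shows one way to normalize a record in Python. The helper names (`split_csv_field`, `normalize_record`) and the sample record are hypothetical, and the split assumes no individual value contains a comma of its own (a label such as "JOHN SMITH, JR." would be split in two).

```python
from datetime import date

# Fields the schema describes as comma-separated strings.
LIST_FIELDS = (
    "producers",
    "correspondents",
    "related_topics",
    "speakers",
    "related_film_urls",
)


def split_csv_field(value: str) -> list[str]:
    """Split a comma-separated field into trimmed, non-empty items."""
    return [part.strip() for part in (value or "").split(",") if part.strip()]


def normalize_record(record: dict) -> dict:
    """Return a copy with list fields split and air_date parsed to a date."""
    out = dict(record)
    for field in LIST_FIELDS:
        out[field] = split_csv_field(record.get(field, ""))
    if record.get("air_date"):
        out["air_date"] = date.fromisoformat(record["air_date"])
    return out


# Hypothetical record, trimmed to a few fields for illustration.
record = {
    "film_slug": "example-film",
    "air_date": "2020-01-14",
    "speakers": "NARRATOR, PRESIDENT DONALD TRUMP, NAYIB BUKELE",
    "related_topics": "",
}

clean = normalize_record(record)
print(clean["speakers"])
```

Missing or empty list fields become empty lists, so consumers can iterate without checking for `None` first.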
Input Options
- `startUrls` (array, optional): Specific documentary URLs to scrape. Leave empty to run the full sitemap discovery and scrape all available transcripts.
- `maxItems` (integer, optional): Cap on total records. Default `0` (no limit). When using sitemap discovery, the cap applies globally across all sitemaps.
Example: Single film
    {
      "startUrls": [
        {"url": "https://www.pbs.org/wgbh/frontline/documentary/the-deal-trump-bukele-gangs-el-salvador/"}
      ]
    }
Example: Full archive crawl (all ~250 films)
    {
      "maxItems": 0
    }
Example: Recent 50 films
    {
      "maxItems": 50
    }
How It Works
Discovery uses PBS Frontline's sitemap index at pbs.org/wgbh/frontline/sitemap.xml. The nine sitemap-documentary sub-sitemaps each hold up to 100 film URLs, ordered newest-first. Films without a transcript (some pre-rebuild older entries) are silently skipped.
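The discovery step can be sketched in Python with the standard library's XML parser. This is an illustration under assumptions, not the actor's code: the function names are hypothetical, the sample XML is hand-written, and the exact sub-sitemap file names (`sitemap-documentary-1.xml` and so on) are guesses. Only the `sitemap-documentary` substring comes from the description above.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def locs(xml_text: str) -> list[str]:
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iterfind(".//sm:loc", NS) if el.text]


def documentary_sitemaps(index_xml: str) -> list[str]:
    """Keep only the sub-sitemaps that list documentary pages."""
    return [url for url in locs(index_xml) if "sitemap-documentary" in url]


# Hand-written sample documents; the file names are assumed, not confirmed.
INDEX_XML = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://www.pbs.org/wgbh/frontline/sitemap-documentary-1.xml</loc></sitemap>"
    "<sitemap><loc>https://www.pbs.org/wgbh/frontline/sitemap-article-1.xml</loc></sitemap>"
    "</sitemapindex>"
)
SUB_XML = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.pbs.org/wgbh/frontline/documentary/"
    "the-deal-trump-bukele-gangs-el-salvador/</loc></url>"
    "</urlset>"
)

print(documentary_sitemaps(INDEX_XML))
print(locs(SUB_XML))
```

In a real run the two XML strings would be fetched over HTTP, and each film URL would then be visited to check whether a transcript is present.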
Metadata is parsed from JSON-LD blocks on each documentary page. The transcript and credits live in two Chakra UI accordion panels — panel 0 is the transcript, panel 1 is the credits. Speaker labels are extracted via a single regex pass on the <strong>LABEL:</strong> pattern Frontline uses consistently across its archive.
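The speaker-label pass might look like the following Python sketch. The regex is an assumption about what "a single regex pass" means here, the function name `unique_speakers` is hypothetical, and the sample HTML uses placeholder text and a made-up speaker, not real transcript content.

```python
import re

# Matches <strong>LABEL:</strong>, with the colon inside the tag.
SPEAKER_RE = re.compile(r"<strong>\s*([^<:]+?)\s*:\s*</strong>")


def unique_speakers(body_html: str) -> list[str]:
    """Return speaker labels in first-appearance order, without duplicates."""
    seen = {}
    for label in SPEAKER_RE.findall(body_html):
        seen.setdefault(label, None)  # dicts preserve insertion order
    return list(seen)


# Placeholder transcript fragment, for illustration only.
sample = (
    "<p><strong>NARRATOR:</strong> Opening narration.</p>"
    "<p><strong>JANE DOE:</strong> A reply.</p>"
    "<p><strong>NARRATOR:</strong> Closing narration.</p>"
)

labels = unique_speakers(sample)
print(", ".join(labels))  # NARRATOR, JANE DOE
```

Joining the result with `", "` produces a string in the same shape as the `speakers` field.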
The site is server-rendered Next.js with aggressive edge caching — no headless browser required, no proxy required.
Pricing
Charged per record scraped. Long-form transcripts (30-80KB each) are priced at a modest premium reflecting per-record research value. Start price applies per actor run regardless of record count.
Notes
- Films without a transcript are skipped gracefully and do not count toward `maxItems`.
- Some older archive films have had their transcript pages rebuilt and may appear without speaker-label markup; body text is still returned when a transcript exists.
- `body_html` preserves the original `<strong>` speaker spans for downstream NLP pipelines that want to distinguish speaker turns programmatically.
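For pipelines that need speaker turns, one possible approach is to split `body_html` on the `<strong>LABEL:</strong>` spans. The Python sketch below is a consumer-side illustration, not part of the scraper. The function names and sample HTML are hypothetical, and text that precedes the first label, or a transcript with no label markup at all, is returned with a speaker of `None`.

```python
import re
from html import unescape

SPEAKER_RE = re.compile(r"<strong>\s*([^<:]+?)\s*:\s*</strong>")
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment: str) -> str:
    """Strip tags, decode entities, and collapse whitespace."""
    text = unescape(TAG_RE.sub(" ", fragment))
    return " ".join(text.split())


def speaker_turns(body_html: str) -> list[tuple]:
    """Split transcript HTML into (speaker, text) pairs."""
    # With one capture group, re.split returns
    # [preamble, label1, text1, label2, text2, ...].
    parts = SPEAKER_RE.split(body_html)
    turns = []
    preamble = clean(parts[0])
    if preamble:
        turns.append((None, preamble))
    for label, chunk in zip(parts[1::2], parts[2::2]):
        turns.append((label, clean(chunk)))
    return turns


# Placeholder transcript fragment, for illustration only.
sample = (
    "<p><strong>NARRATOR:</strong> Opening narration.</p>"
    "<p><strong>JANE DOE:</strong> A reply &amp; more.</p>"
)

for speaker, text in speaker_turns(sample):
    print(speaker, "->", text)
```

Consecutive paragraphs by the same speaker stay in one turn, because a new turn only starts at the next label.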
Need Custom Fields or a Different Source?
File an issue or get in touch. We can add fields, filter by topic, or build adjacent scrapers in the same broadcast-transcript vertical.