PBS Frontline Transcripts Scraper

AINEWSAUTOMATION

Scrapes full transcripts from PBS Frontline documentary films. Returns one record per documentary — title, synopsis, air date, speaker-labeled transcript body, and topic metadata. Covers the active Frontline archive (250+ films back to ~1995) using the site's sitemap index for discovery.

Frontline is one of the few public-broadcasting sources where transcripts are routinely cited in academic and journalism contexts. Each film runs 60-120 minutes and produces 30-80KB of clean, fact-checked, single-narrative text — a different unit of value from fragmented panel-show or wire-copy corpora.

What It Scrapes

Every record comes from pbs.org/wgbh/frontline/documentary/<slug>/. The transcript lives inline on the main documentary page (the /transcript/ subpath was retired; transcripts are now embedded directly). Metadata comes from JSON-LD structured data blocks on the same page.

Output Schema

| Field | Type | Description |
| --- | --- | --- |
| film_slug | string | URL slug (e.g. the-deal-trump-bukele-gangs-el-salvador) |
| film_title | string | Documentary title |
| film_url | string | Canonical PBS URL |
| air_date | string | Original broadcast date (YYYY-MM-DD) |
| duration_minutes | number | Runtime in minutes |
| synopsis | string | Brief description from page metadata |
| producers | string | Comma-separated producing and directing credits |
| correspondents | string | Comma-separated correspondent credits |
| related_topics | string | Comma-separated PBS topic tags |
| body_html | string | Full transcript HTML with <strong>SPEAKER:</strong> spans |
| body_text | string | Plain-text transcript with inline speaker labels |
| speakers | string | Comma-separated unique speaker labels |
| has_viewer_discretion_notice | boolean | True if the film flags mature content |
| related_film_urls | string | Comma-separated URLs of cross-linked Frontline films |
| canonical_url | string | Canonical page URL |
| source | string | Fixed: pbs.org/wgbh/frontline |
| scraped_at | datetime | ISO 8601 scrape timestamp |
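Several fields in the schema are comma-separated strings rather than arrays. For downstream use it can help to turn them into lists. The following is a minimal sketch, not part of the actor: the helper name `split_list_fields` is hypothetical, and it assumes no individual value contains a comma of its own (a credit such as "John Smith, Jr." would be split incorrectly).

```python
from typing import Dict, List

# Fields the output schema describes as comma-separated strings.
LIST_FIELDS = (
    "producers",
    "correspondents",
    "related_topics",
    "speakers",
    "related_film_urls",
)


def split_list_fields(record: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of the record with comma-separated fields as lists.

    Hypothetical helper: assumes values never contain embedded commas.
    Missing or empty fields become empty lists.
    """
    out = dict(record)
    for field in LIST_FIELDS:
        raw = out.get(field) or ""
        parts: List[str] = [p.strip() for p in str(raw).split(",")]
        out[field] = [p for p in parts if p]
    return out
```

Applied to a record whose speakers value is "NARRATOR, NAYIB BUKELE", the helper yields a two-element list and leaves non-list fields such as film_title unchanged.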

Speaker labels follow the Frontline convention: NARRATOR, PRESIDENT DONALD TRUMP, NAYIB BUKELE, etc. They are extracted directly from <strong>LABEL:</strong> spans, with no inference and no further cleanup needed wherever that markup is present.
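To illustrate how a list of unique labels could be recovered from body_html, here is a rough sketch. The regex and the function name `unique_speakers` are assumptions for illustration, not the actor's actual implementation.

```python
import re
from typing import List

# Assumed pattern: a <strong> span whose text ends in a colon, e.g.
# <strong>NARRATOR:</strong>. The label is captured without the colon.
SPEAKER_RE = re.compile(r"<strong>\s*([^<:]+?)\s*:\s*</strong>")


def unique_speakers(body_html: str) -> List[str]:
    """Return speaker labels in order of first appearance, without duplicates."""
    seen: List[str] = []
    for label in SPEAKER_RE.findall(body_html):
        if label not in seen:
            seen.append(label)
    return seen
```

Joining the result with ", " would give a string in the same shape as the speakers field.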

Input Options

startUrls (array, optional) — Specific documentary URLs to scrape. Leave empty to run the full sitemap discovery and scrape all available transcripts.

maxItems (integer, optional) — Cap on total records. Default 0 (no limit). When using sitemap discovery, applies globally across all sitemaps.

Example: Single film

{
    "startUrls": [
        {"url": "https://www.pbs.org/wgbh/frontline/documentary/the-deal-trump-bukele-gangs-el-salvador/"}
    ]
}

Example: Full archive crawl (all ~250 films)

{
    "maxItems": 0
}

Example: 50 most recent films (sitemap discovery is ordered newest-first)

{
    "maxItems": 50
}

How It Works

Discovery uses PBS Frontline's sitemap index at pbs.org/wgbh/frontline/sitemap.xml. The nine sitemap-documentary sub-sitemaps each hold up to 100 film URLs, ordered newest-first. Films without a transcript (some pre-rebuild older entries) are silently skipped.
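The discovery step can be sketched roughly as follows, using only the standard library and operating on XML text that has already been fetched. This is an illustration under assumptions: the function names are hypothetical, the sub-sitemap file names are assumed to contain the string "sitemap-documentary", and the XML is assumed to follow the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET
from typing import List

# Standard sitemap protocol namespace (assumed to be what the site uses).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def documentary_sitemaps(index_xml: str) -> List[str]:
    """Pick the documentary sub-sitemap URLs out of a sitemap index."""
    root = ET.fromstring(index_xml)
    locs = [
        el.text.strip()
        for el in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if el.text
    ]
    # Assumption: documentary sub-sitemaps are identifiable by name.
    return [url for url in locs if "sitemap-documentary" in url]


def film_urls(sitemap_xml: str) -> List[str]:
    """List page URLs from one sub-sitemap, preserving document order."""
    root = ET.fromstring(sitemap_xml)
    return [
        el.text.strip()
        for el in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if el.text
    ]
```

Because each sub-sitemap is ordered newest-first, iterating the sub-sitemaps in index order and stopping once enough records have been produced would yield the most recent films first.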

Metadata is parsed from JSON-LD blocks on each documentary page. The transcript and credits live in two Chakra UI accordion panels: panel 0 is the transcript, panel 1 is the credits. Speaker labels are extracted in a single regex pass over the <strong>LABEL:</strong> pattern that Frontline uses across most of its archive (see Notes for the exceptions).
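As a sketch of the JSON-LD step, the snippet below collects every script block of type application/ld+json from a page's HTML using the standard library parser. The class and function names are hypothetical, and which keys the resulting objects contain on Frontline pages is not specified here, so any key lookups downstream should be treated as assumptions.

```python
import json
from html.parser import HTMLParser
from typing import Any, List


class JsonLdParser(HTMLParser):
    """Collects parsed JSON-LD payloads from <script> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._chunks: List[str] = []
        self.blocks: List[Any] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks rather than failing the page.


def extract_jsonld(page_html: str) -> List[Any]:
    """Return all JSON-LD objects found in the page, in document order."""
    parser = JsonLdParser()
    parser.feed(page_html)
    return parser.blocks
```

Malformed blocks are skipped silently in this sketch, so one broken block does not discard the others on the same page.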

The site is server-rendered Next.js with aggressive edge caching — no headless browser required, no proxy required.

Pricing

Charged per record scraped. Long-form transcripts (30-80KB each) are priced at a modest premium reflecting per-record research value. A start fee applies to each actor run, regardless of record count.

Notes

  • Films without a transcript are skipped gracefully and do not count toward maxItems.
  • Some older archive films have had their transcript pages rebuilt and may appear without speaker-label markup — body text is still returned when a transcript exists.
  • body_html preserves the original <strong> speaker spans for downstream NLP pipelines that want to distinguish speaker turns programmatically.
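For pipelines that want speaker turns, one possible approach is to split body_html at each speaker span. The sketch below is illustrative only: the function name `speaker_turns` is hypothetical, and it assumes every <strong> tag in the transcript marks a speaker label, so an emphasis-only <strong> inside a turn would cut that turn short. Transcripts without speaker markup produce an empty list.

```python
import re
from html import unescape
from typing import List, Tuple

# A speaker span followed by everything up to the next <strong> or end of input.
TURN_RE = re.compile(
    r"<strong>\s*([^<:]+?)\s*:\s*</strong>(.*?)(?=<strong>|\Z)",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def speaker_turns(body_html: str) -> List[Tuple[str, str]]:
    """Split transcript HTML into (speaker, plain_text) pairs."""
    turns: List[Tuple[str, str]] = []
    for label, chunk in TURN_RE.findall(body_html):
        # Drop remaining tags, decode entities, and collapse whitespace.
        text = unescape(TAG_RE.sub(" ", chunk))
        turns.append((label, " ".join(text.split())))
    return turns
```

Consecutive paragraphs by the same speaker are merged into one turn, since the split happens only where a new label appears.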

Need Custom Fields or a Different Source?

File an issue or get in touch. We can add fields, filter by topic, or build adjacent scrapers in the same broadcast-transcript vertical.