TED Talks Transcript Scraper

Scrape full transcripts from TED.com in any available language. Returns timed cue segments, plain text, SRT subtitles, WebVTT captions, and speaker metadata for every talk — packaged in one record per language per talk.

TED Talks Transcript Scraper Features

Extracts complete transcripts with millisecond-accurate timing (427 cues for an average talk)
Returns four formats per transcript: JSON segments, plain text, SRT, and WebVTT — most actors pick one format and call it done
Collects speaker name, role, full bio, event name, recorded date, duration, view count, and topic tags alongside the transcript
Reports all available language codes so you can plan multi-language runs
Fetches only the talk's native language by default; optionally fetches every translation the talk has, or a specific list of languages you provide
Accepts custom start URLs for targeted scraping of individual talks
Discovers all talks automatically via TED's year-by-year sitemap index when no URLs are given
No proxy required: TED serves transcripts publicly, with no authentication or Cloudflare challenge to handle

What Can You Do With TED Transcript Data?

NLP researchers — Build or extend corpora for text classification, summarization, or speaker style analysis; TED-LIUM, a standard speech recognition benchmark, is built from TED talks, and this actor gives you fresh data from the same source
Language-learning app developers — Pull parallel transcripts (for example, the English transcript alongside its Japanese translation) for aligned bilingual reading and listening exercises
AI training teams — Collect multi-speaker, multi-language text at scale; TED's volunteer-translated transcripts cover 100+ languages with consistent quality
Public speaking coaches — Analyze rhetorical structure, pacing cues, and paragraph breaks across thousands of talks
Translation quality researchers — Compare the same content across 60+ language variants for benchmarking MT and human translation output
Educators and content curators — Build searchable archives of transcript text with metadata for curriculum alignment or topic discovery

How TED Talks Transcript Scraper Works

Seed the run. If you provide startUrls, those talks are processed directly. Otherwise the scraper walks TED's year-by-year sitemap index (2006–2025) and collects every talk URL up to your maxItems budget.
For each talk, the scraper fetches the transcript page HTML and parses the embedded __NEXT_DATA__ JSON blob. This yields the numeric talk ID, speaker details, event name, dates, view count, tags, and the full list of available language codes.
Using the language list, the scraper calls TED's public subtitles API — one request per language — and retrieves millisecond-timed caption cues.
The cues are assembled into four transcript formats, merged with the talk metadata, and saved as one dataset record per language.
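
The metadata step above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual code: the regex-based extraction, the function name `extract_next_data`, and the keys in the sample payload (`props`, `pageProps`, `talkId`) are assumptions made for the example. The real TED payload layout is not documented here and may differ.

```python
import json
import re

# Next.js pages embed their page state in a <script id="__NEXT_DATA__"> tag.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> dict:
    """Return the parsed __NEXT_DATA__ JSON blob, or raise ValueError if absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))


# Hypothetical, heavily trimmed page used only to exercise the function.
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"talkId": "66"}}}'
    "</script></body></html>"
)

data = extract_next_data(sample_html)
print(data["props"]["pageProps"]["talkId"])
```

A regex is enough here because the blob sits in a single script tag; an HTML parser would be the more defensive choice if the page markup varies.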

Input

{
  "maxItems": 15,
  "startUrls": [
    { "url": "https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity" }
  ],
  "languages": ["en", "ja"],
  "fetchAllLanguages": false
}

Field	Type	Default	Description
`maxItems`	integer	15	Maximum transcript records to save. One record = one talk × one language.
`startUrls`	array	—	Specific TED talk URLs to scrape. When empty, the scraper discovers talks from the sitemap.
`languages`	array	—	ISO 639-1 codes to fetch (e.g. `["en", "ja", "es"]`). Leave empty for the talk's native language only.
`fetchAllLanguages`	boolean	false	When true, fetches every available translation for each talk. Overrides `languages`.

Fetch all languages for a single talk:

{
  "maxItems": 100,
  "startUrls": [
    { "url": "https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity" }
  ],
  "fetchAllLanguages": true
}

Ken Robinson's talk has 64 translations, so that input produces 64 records, which fits within the maxItems limit of 100.
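
The precedence between `fetchAllLanguages`, `languages`, and the native-language default can be pictured with a small sketch. The function `resolve_languages` and its parameters are hypothetical names for illustration. Dropping requested codes that a talk does not offer is an assumption based on the skip behaviour described in the FAQ, not a statement of the actor's internals.

```python
def resolve_languages(available, native, languages=None, fetch_all=False):
    """Pick which language codes to fetch for one talk.

    available: codes the talk offers; native: the talk's original language.
    """
    if fetch_all:
        # fetchAllLanguages overrides any explicit languages list.
        return list(available)
    if languages:
        # Keep only requested codes that this talk actually has.
        return [code for code in languages if code in available]
    # Nothing requested: fall back to the native language only.
    return [native]


available = ["en", "ja", "es"]
print(resolve_languages(available, "en"))
print(resolve_languages(available, "en", languages=["ja", "fr"]))
print(resolve_languages(available, "en", languages=["ja"], fetch_all=True))
```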

TED Talks Transcript Scraper Output Fields

{
  "talk_id": "66",
  "slug": "sir_ken_robinson_do_schools_kill_creativity",
  "title": "Do schools kill creativity?",
  "speaker_name": "Sir Ken Robinson",
  "speaker_role": "Author, educator",
  "speaker_bio": "Creativity expert Sir Ken Robinson challenged the way we educate our children...",
  "event": "TED2006",
  "recorded_date": "2006-02-25",
  "published_date": "2006-06-27T00:11:00.000Z",
  "duration_seconds": 1148,
  "language": "en",
  "language_name": "English",
  "tags": "culture, education, creativity, dance, parenting, teaching, kids",
  "description": "Sir Ken Robinson makes an entertaining and profoundly moving case for creating an education system...",
  "view_count": 80149052,
  "thumbnail_url": "https://pi.tedcdn.com/r/pe.tedcdn.com/...",
  "canonical_url": "https://www.ted.com/talks/sir_ken_robinson_do_schools_kill_creativity",
  "available_languages": "pt-br, el, eo, en, vi, ca, it, sv, cs, ar, ...",
  "transcript_plain": "Good morning. How are you? (Audience) Good. It's been great, hasn't it?...",
  "transcript_srt": "1\n00:00:02,103 --> 00:00:04,678\nGood morning. How are you?\n\n2\n...",
  "transcript_vtt": "WEBVTT\n\n1\n00:00:02.103 --> 00:00:04.678\nGood morning. How are you?\n\n2\n...",
  "transcript_segments": "[{\"start_ms\":2103,\"duration_ms\":2575,\"text\":\"Good morning. How are you?\",\"start_of_paragraph\":true},...]"
}

Field	Type	Description
`talk_id`	string	Numeric TED talk ID
`slug`	string	Canonical URL slug
`title`	string	Talk title in English
`speaker_name`	string	Speaker display name
`speaker_role`	string	One-line speaker description
`speaker_bio`	string	Full speaker biography
`event`	string	Event where the talk was given (e.g. TED2006, TEDxBoston)
`recorded_date`	string	Recording date (YYYY-MM-DD)
`published_date`	string	Publication date (ISO 8601)
`duration_seconds`	number	Talk duration in seconds
`language`	string	TED language code for this transcript (typically ISO 639-1, e.g. `en`; some codes carry a region suffix, e.g. `pt-br`)
`language_name`	string	Full language name in English
`tags`	string	Comma-separated TED topic tags
`description`	string	Talk abstract
`view_count`	number	Total view count across platforms
`thumbnail_url`	string	Talk thumbnail image URL
`canonical_url`	string	Canonical TED.com URL
`available_languages`	string	Comma-separated codes of all available translations
`transcript_plain`	string	Full transcript as plain text
`transcript_srt`	string	Transcript in SRT subtitle format
`transcript_vtt`	string	Transcript in WebVTT format
`transcript_segments`	string	JSON-serialized timed cue array: `[{start_ms, duration_ms, text, start_of_paragraph}]`
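
Several fields arrive as strings that need decoding before use: `tags` and `available_languages` are comma-separated, and `transcript_segments` is JSON-serialized. A minimal sketch of unpacking one record, using a trimmed copy of the example values above (the helper `split_csv` is a hypothetical name):

```python
import json

# Trimmed record using values from the example output above.
record = {
    "tags": "culture, education, creativity",
    "available_languages": "pt-br, el, eo, en",
    "transcript_segments": (
        '[{"start_ms":2103,"duration_ms":2575,'
        '"text":"Good morning. How are you?","start_of_paragraph":true}]'
    ),
}


def split_csv(value: str) -> list:
    """Split a comma-separated field into a clean list of strings."""
    return [item.strip() for item in value.split(",") if item.strip()]


tags = split_csv(record["tags"])
languages = split_csv(record["available_languages"])
segments = json.loads(record["transcript_segments"])

# Each cue stores a start and a duration; the end time is their sum.
first = segments[0]
end_ms = first["start_ms"] + first["duration_ms"]
print(tags, languages, end_ms)
```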

🔍 FAQ

How do I scrape TED talk transcripts?

TED Talks Transcript Scraper handles discovery automatically. Provide a startUrls list for specific talks or leave it empty to pull from the sitemap. Set maxItems to cap the output, then run.

How much does TED Talks Transcript Scraper cost to run?

TED Talks Transcript Scraper charges $0.003 per transcript record (one talk × one language) plus a small platform start fee. Fetching the English transcript for 100 talks costs roughly $0.30.
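
For budgeting, the per-record price can be turned into a quick estimate. The sketch below uses only the $0.003 figure stated above. The platform start fee is not specified here, so it is left as a parameter defaulting to zero, and `estimate_cost` is a hypothetical helper name.

```python
from decimal import Decimal

PRICE_PER_RECORD = Decimal("0.003")  # USD per record, from the pricing above


def estimate_cost(talks: int, languages_per_talk: int = 1,
                  start_fee: Decimal = Decimal("0")) -> Decimal:
    """Estimate run cost in USD; one record = one talk x one language."""
    records = talks * languages_per_talk
    return PRICE_PER_RECORD * records + start_fee


print(estimate_cost(100))    # English only, 100 talks
print(estimate_cost(1, 64))  # one talk in 64 languages
```

`Decimal` is used instead of floats so that small per-record prices add up without rounding drift.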

Can I get transcripts in multiple languages?

Yes. Set fetchAllLanguages: true to retrieve every translation for each talk, or pass a languages array with specific ISO 639-1 codes. A popular talk like Ken Robinson's "Do Schools Kill Creativity?" has 64 language variants.

Does TED Talks Transcript Scraper need proxies?

No. TED publishes transcripts publicly — no authentication, no Cloudflare challenge, no residential proxy required. The scraper runs on standard infrastructure at a courteous pace.

What format do the timed segments come in?

Each record includes transcript_segments as a JSON string containing an array of cue objects: {start_ms, duration_ms, text, start_of_paragraph}. Timing is in milliseconds, matching TED's source data. SRT and VTT formats are derived from the same cue data.
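
If you only keep `transcript_segments`, the SRT and WebVTT forms can be rebuilt from the cue array. The following is a minimal sketch under the cue shape shown above; the function names are hypothetical and the actor's own formatter may differ in details such as line wrapping or cue identifiers.

```python
def format_timestamp(ms: int, sep: str) -> str:
    """Render milliseconds as HH:MM:SS<sep>mmm (sep is ',' for SRT, '.' for VTT)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"


def render_cues(cues, sep: str) -> str:
    """Number the cues from 1 and join them with blank lines."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_timestamp(cue["start_ms"], sep)
        end = format_timestamp(cue["start_ms"] + cue["duration_ms"], sep)
        blocks.append(f"{index}\n{start} --> {end}\n{cue['text']}")
    return "\n\n".join(blocks) + "\n"


def to_srt(cues) -> str:
    return render_cues(cues, ",")


def to_vtt(cues) -> str:
    # WebVTT files start with a WEBVTT header line followed by a blank line.
    return "WEBVTT\n\n" + render_cues(cues, ".")


cues = [{"start_ms": 2103, "duration_ms": 2575,
         "text": "Good morning. How are you?", "start_of_paragraph": True}]
print(to_srt(cues))
print(to_vtt(cues))
```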

Are transcripts available for all TED talks?

Most established talks have English transcripts. Translations depend on TED's volunteer community — popular talks often have 50+ languages, while talks published in the last few months may have none yet. The scraper logs a warning and skips talks with no available transcripts for the requested language.

Need More Features?

Need filtering by event, speaker, or topic? Custom language combinations? File an issue or get in touch.

Why Use TED Talks Transcript Scraper?

Four formats, one run — plain text, SRT, WebVTT, and timestamped JSON segments in a single record; most alternatives force you to choose one and convert the rest yourself
Multi-language by design — fetch every translation of a talk (64 for Ken Robinson's) with a single flag, which is what makes this corpus useful for NLP alignment work
No setup required — public access, no API keys, no proxies, sitemap-driven discovery out of the box
