OrbTop

TED Talks Transcript Scraper

AI · EDUCATION · AUTOMATION

Scrape full transcripts from TED.com in any available language. Each record covers one talk in one language and contains timed cue segments, plain text, SRT subtitles, WebVTT captions, and talk and speaker metadata.


TED Talks Transcript Scraper Features

  • Extracts complete transcripts with millisecond-accurate timing (427 cues for an average talk)
  • Returns four formats per transcript: JSON segments, plain text, SRT, and WebVTT, where most actors offer only one
  • Collects speaker name, role, full bio, event name, recorded date, duration, view count, and topic tags alongside the transcript
  • Reports all available language codes so you can plan multi-language runs
  • Fetches only the talk's native language by default; alternatively fetches every translation the talk has, or a specific list of languages you provide
  • Accepts custom start URLs for targeted scraping of individual talks
  • Discovers all talks automatically via TED's year-by-year sitemap index when no URLs are given
  • No proxy required: TED serves transcripts publicly, with no authentication or Cloudflare challenge to handle

What Can You Do With TED Transcript Data?

  • NLP researchers: Build or extend corpora for text classification, summarization, or speaker style analysis; TED-LIUM, a standard speech-recognition benchmark, is built from TED talks, and this actor gives you fresh transcripts from the same source
  • Language-learning app developers: Pull parallel transcripts of the same talk (for example English and Japanese) for aligned bilingual reading and listening exercises
  • AI training teams — Collect multi-speaker, multi-language text at scale; TED's volunteer-translated transcripts cover 100+ languages with consistent quality
  • Public speaking coaches — Analyze rhetorical structure, pacing cues, and paragraph breaks across thousands of talks
  • Translation quality researchers — Compare the same content across 60+ language variants for benchmarking MT and human translation output
  • Educators and content curators — Build searchable archives of transcript text with metadata for curriculum alignment or topic discovery

How TED Talks Transcript Scraper Works

  1. Seed the run. If you provide startUrls, those talks are processed directly. Otherwise the scraper walks TED's year-by-year sitemap index (2006–2025) and collects every talk URL up to your maxItems budget.
  2. For each talk, the scraper fetches the transcript page HTML and parses the embedded __NEXT_DATA__ JSON blob. This yields the numeric talk ID, speaker details, event name, dates, view count, tags, and the full list of available language codes.
  3. Using the language list, the scraper calls TED's public subtitles API — one request per language — and retrieves millisecond-timed caption cues.
  4. The cues are assembled into four transcript formats, merged with the talk metadata, and saved as one dataset record per language.
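
Step 2 above can be illustrated with a minimal sketch of pulling the __NEXT_DATA__ blob out of page HTML. It relies only on the general Next.js convention of embedding page state in a script tag with that id. The sample HTML and the key names inside it (props, talkId, languages) are hypothetical placeholders, not TED's actual payload layout, and this is not necessarily how the scraper itself does it.

```python
import json
import re

# Next.js pages embed their state as JSON in <script id="__NEXT_DATA__">.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> dict:
    """Return the parsed __NEXT_DATA__ JSON found in an HTML document."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))


# Hypothetical, heavily simplified page used only for illustration.
SAMPLE_HTML = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"talkId": "66", "languages": ["en", "ja"]}}'
    "</script></body></html>"
)

data = extract_next_data(SAMPLE_HTML)
print(data["props"]["talkId"], data["props"]["languages"])
```

A regular expression is enough here because the blob sits in a single well-known tag; a full HTML parser would be the more robust choice if the markup varies.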

Input

{
  "maxItems": 15,
  "startUrls": [
    { "url": "https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity" }
  ],
  "languages": ["en", "ja"],
  "fetchAllLanguages": false
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| maxItems | integer | 15 | Maximum transcript records to save. One record = one talk × one language. |
| startUrls | array | | Specific TED talk URLs to scrape. When empty, the scraper discovers talks from the sitemap. |
| languages | array | | ISO 639-1 codes to fetch (e.g. ["en", "ja", "es"]). Leave empty for the talk's native language only. |
| fetchAllLanguages | boolean | false | When true, fetches every available translation for each talk. Overrides languages. |
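
The way languages and fetchAllLanguages interact can be sketched as a small selection function. This is an illustration of the precedence described in the field descriptions, not the scraper's actual code; the function name select_languages and its parameters are hypothetical.

```python
from typing import Optional


def select_languages(
    available: list[str],
    native: str,
    languages: Optional[list[str]] = None,
    fetch_all: bool = False,
) -> list[str]:
    """Pick which transcript languages to fetch for one talk."""
    if fetch_all:
        # fetchAllLanguages overrides any explicit languages list.
        return list(available)
    if languages:
        # Keep only requested codes the talk actually has.
        return [code for code in languages if code in available]
    # Default: the talk's native language only.
    return [native] if native in available else []


available = ["en", "ja", "es"]
print(select_languages(available, native="en"))
print(select_languages(available, native="en", languages=["ja", "de"]))
print(select_languages(available, native="en", languages=["ja"], fetch_all=True))
```

Each returned code corresponds to one dataset record, so the length of the result is what counts against maxItems.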

Fetch all languages for a single talk:

{
  "maxItems": 100,
  "startUrls": [
    { "url": "https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity" }
  ],
  "fetchAllLanguages": true
}

Ken Robinson's talk is available in 64 languages, so that input produces 64 records, which fits within the maxItems budget of 100.


TED Talks Transcript Scraper Output Fields

{
  "talk_id": "66",
  "slug": "sir_ken_robinson_do_schools_kill_creativity",
  "title": "Do schools kill creativity?",
  "speaker_name": "Sir Ken Robinson",
  "speaker_role": "Author, educator",
  "speaker_bio": "Creativity expert Sir Ken Robinson challenged the way we educate our children...",
  "event": "TED2006",
  "recorded_date": "2006-02-25",
  "published_date": "2006-06-27T00:11:00.000Z",
  "duration_seconds": 1148,
  "language": "en",
  "language_name": "English",
  "tags": "culture, education, creativity, dance, parenting, teaching, kids",
  "description": "Sir Ken Robinson makes an entertaining and profoundly moving case for creating an education system...",
  "view_count": 80149052,
  "thumbnail_url": "https://pi.tedcdn.com/r/pe.tedcdn.com/...",
  "canonical_url": "https://www.ted.com/talks/sir_ken_robinson_do_schools_kill_creativity",
  "available_languages": "pt-br, el, eo, en, vi, ca, it, sv, cs, ar, ...",
  "transcript_plain": "Good morning. How are you? (Audience) Good. It's been great, hasn't it?...",
  "transcript_srt": "1\n00:00:02,103 --> 00:00:04,678\nGood morning. How are you?\n\n2\n...",
  "transcript_vtt": "WEBVTT\n\n1\n00:00:02.103 --> 00:00:04.678\nGood morning. How are you?\n\n2\n...",
  "transcript_segments": "[{\"start_ms\":2103,\"duration_ms\":2575,\"text\":\"Good morning. How are you?\",\"start_of_paragraph\":true},...]"
}

| Field | Type | Description |
| --- | --- | --- |
| talk_id | string | Numeric TED talk ID |
| slug | string | Canonical URL slug |
| title | string | Talk title in English |
| speaker_name | string | Speaker display name |
| speaker_role | string | One-line speaker description |
| speaker_bio | string | Full speaker biography |
| event | string | Event where the talk was given (e.g. TED2006, TEDxBoston) |
| recorded_date | string | Recording date (YYYY-MM-DD) |
| published_date | string | Publication date (ISO 8601) |
| duration_seconds | number | Talk duration in seconds |
| language | string | Language code for this transcript (ISO 639-1 where one exists; TED also uses regional codes such as pt-br) |
| language_name | string | Full language name in English |
| tags | string | Comma-separated TED topic tags |
| description | string | Talk abstract |
| view_count | number | Total view count across platforms |
| thumbnail_url | string | Talk thumbnail image URL |
| canonical_url | string | Canonical TED.com URL |
| available_languages | string | Comma-separated codes of all available translations |
| transcript_plain | string | Full transcript as plain text |
| transcript_srt | string | Transcript in SRT subtitle format |
| transcript_vtt | string | Transcript in WebVTT format |
| transcript_segments | string | JSON-serialized timed cue array: [{start_ms, duration_ms, text, start_of_paragraph}] |
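
Because tags, available_languages, and transcript_segments arrive as strings, a consumer usually decodes them before use. The sketch below shows one way to do that; the helper name decode_record is hypothetical, and the sample record is a trimmed, single-cue version of the example output above.

```python
import json


def split_csv(value: str) -> list[str]:
    """Split a comma-separated field into a clean list of items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def decode_record(record: dict) -> dict:
    """Return a copy of a record with its string-encoded fields decoded."""
    decoded = dict(record)
    decoded["tags"] = split_csv(record["tags"])
    decoded["available_languages"] = split_csv(record["available_languages"])
    # transcript_segments is a JSON string, not a nested array.
    decoded["transcript_segments"] = json.loads(record["transcript_segments"])
    return decoded


# Trimmed sample record (one cue, three languages) for illustration only.
sample = {
    "talk_id": "66",
    "tags": "culture, education, creativity",
    "available_languages": "pt-br, el, en",
    "transcript_segments": json.dumps(
        [{"start_ms": 2103, "duration_ms": 2575,
          "text": "Good morning. How are you?", "start_of_paragraph": True}]
    ),
}

record = decode_record(sample)
first = record["transcript_segments"][0]
print(record["tags"], first["start_ms"] + first["duration_ms"])
```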

FAQ

How do I scrape TED talk transcripts?

TED Talks Transcript Scraper handles discovery automatically. Provide a startUrls list for specific talks or leave it empty to pull from the sitemap. Set maxItems to cap the output, then run.

How much does TED Talks Transcript Scraper cost to run?

TED Talks Transcript Scraper charges $0.003 per transcript record (one talk × one language) plus a small platform start fee. Fetching the English transcript for 100 talks costs roughly $0.30.
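
A quick estimate of the per-record charge can be worked out as follows. The $0.003 price comes from the answer above; the platform start fee is not specified here, so the start_fee parameter is a placeholder that defaults to zero.

```python
from decimal import Decimal

PRICE_PER_RECORD = Decimal("0.003")  # USD per transcript record, as stated above


def estimate_cost(talks: int, languages_per_talk: int = 1,
                  start_fee: Decimal = Decimal("0")) -> Decimal:
    """Estimate run cost in USD; start_fee is an unspecified placeholder."""
    records = talks * languages_per_talk  # one record = one talk x one language
    return PRICE_PER_RECORD * records + start_fee


print(estimate_cost(100))        # 100 talks, English only
print(estimate_cost(1, 64))      # one talk in 64 languages
```

Decimal is used instead of float so that small per-record prices add up without binary rounding noise.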

Can I get transcripts in multiple languages?

Yes. Set fetchAllLanguages: true to retrieve every translation for each talk, or pass a languages array with specific ISO 639-1 codes. A popular talk like Ken Robinson's "Do schools kill creativity?" has 64 language variants.

Does TED Talks Transcript Scraper need proxies?

No. TED publishes transcripts publicly — no authentication, no Cloudflare challenge, no residential proxy required. The scraper runs on standard infrastructure at a courteous pace.

What format do the timed segments come in?

Each record includes transcript_segments as a JSON string containing an array of cue objects: {start_ms, duration_ms, text, start_of_paragraph}. Timing is in milliseconds, matching TED's source data. SRT and VTT formats are derived from the same cue data.
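
To show how subtitle formats can be derived from the cue data, here is a minimal sketch that turns cue objects into SRT text. It assumes each cue ends at start_ms + duration_ms, which matches the example record (2103 + 2575 = 4678 ms); the function names are hypothetical and the scraper's own conversion may differ.

```python
def format_timestamp(ms: int, sep: str = ",") -> str:
    """Format milliseconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"


def cues_to_srt(cues: list[dict]) -> str:
    """Build SRT text from cue dicts with start_ms, duration_ms, and text."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = cue["start_ms"]
        end = start + cue["duration_ms"]  # assumed end time of the cue
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{cue['text']}"
        )
    return "\n\n".join(blocks) + "\n"


cues = [{"start_ms": 2103, "duration_ms": 2575,
         "text": "Good morning. How are you?", "start_of_paragraph": True}]
print(cues_to_srt(cues))
```

Passing sep="." to format_timestamp gives the WebVTT timestamp style, so the same cue list can feed both formats.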

Are transcripts available for all TED talks?

Most established talks have English transcripts. Translations depend on TED's volunteer community — popular talks often have 50+ languages, while talks published in the last few months may have none yet. The scraper logs a warning and skips talks with no available transcripts for the requested language.


Need More Features?

Need filtering by event, speaker, or topic? Custom language combinations? File an issue or get in touch.

Why Use TED Talks Transcript Scraper?

  • Four formats, one run — plain text, SRT, WebVTT, and timestamped JSON segments in a single record; most alternatives force you to choose one and convert the rest yourself
  • Multi-language by design: fetch every translation of a talk (64 for Ken Robinson's) with a single flag, which is what makes this corpus useful for NLP alignment work
  • No setup required — public access, no API keys, no proxies, sitemap-driven discovery out of the box