OrbTop

Cambridge Dictionary Definition & IPA Scraper

Scrapes Cambridge Dictionary (dictionary.cambridge.org) entries and returns the structured learner-oriented metadata that vocabulary apps, NLP pipelines, and language-curriculum tools actually need: headword, CEFR level (A1–C2), UK and US IPA pronunciation strings, audio MP3 URLs, part of speech, guidewords, definitions, and example sentences.

What you get

Each output record represents one sense block for a headword (a word like "bank" has multiple senses — MONEY, GROUND, STORE, etc. — each becoming a separate row).

Field              Description
headword           Dictionary headword (e.g. "hello")
part_of_speech     Part of speech (exclamation, noun, verb, …)
cefr_level         Cambridge CEFR tag: A1, A2, B1, B2, C1, C2, or empty if untagged
uk_ipa             British English IPA string (e.g. heˈləʊ)
us_ipa             American English IPA string (e.g. heˈloʊ)
uk_audio_url       Absolute URL to the UK pronunciation MP3
us_audio_url       Absolute URL to the US pronunciation MP3
guideword          Sense disambiguator (e.g. "MONEY" for bank's financial sense)
definitions        Pipe-separated list of definitions for this sense
example_sentences  Pipe-separated list of example sentences
url                Canonical source URL for the entry page
scrapedAt          ISO-8601 scrape timestamp
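
Because definitions and example_sentences arrive as pipe-separated strings, downstream code usually needs to split them back into lists. The following is a minimal sketch of that step, not part of the actor itself. It assumes the separator is a bare "|" with optional surrounding whitespace and that no definition contains a literal pipe; the Sense class, the helper names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sense:
    """Hypothetical container for the fields used in this sketch."""
    headword: str
    part_of_speech: str
    cefr_level: Optional[str]
    definitions: List[str]
    example_sentences: List[str]


def split_pipes(value: Optional[str]) -> List[str]:
    """Split a pipe-separated field into a list, dropping empty pieces."""
    return [part.strip() for part in (value or "").split("|") if part.strip()]


def parse_record(record: dict) -> Sense:
    """Convert one output record (one sense block) into a Sense."""
    return Sense(
        headword=record["headword"],
        part_of_speech=record.get("part_of_speech", ""),
        # An empty cefr_level means the headword is untagged.
        cefr_level=record.get("cefr_level") or None,
        definitions=split_pipes(record.get("definitions")),
        example_sentences=split_pipes(record.get("example_sentences")),
    )


# Placeholder values for illustration only, not real dictionary content.
sample = {
    "headword": "bank",
    "part_of_speech": "noun",
    "cefr_level": "",
    "definitions": "first definition | second definition",
    "example_sentences": "",
}
sense = parse_record(sample)
print(sense.definitions, sense.cefr_level)
```

Mapping an empty cefr_level to None keeps "untagged" distinct from a real level when filtering by CEFR band.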

How to use

Look up specific words (fastest)

Supply a list of headwords and the actor fetches only those entry pages:

{
  "startWords": ["hello", "bank", "run", "beautiful"],
  "maxItems": 0
}

Set maxItems: 0 for no limit, or a positive integer to cap output at that many senses.

Crawl the full English dictionary (A–Z browse)

Leave startWords empty to crawl all ~140,000 English headwords via Cambridge's A-Z browse hierarchy:

{
  "startWords": [],
  "maxItems": 0
}

A full crawl processes the browse hierarchy: root → letter pages (A–Z) → sub-group pages → individual entries. Set maxItems to limit output for testing.
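
To make the traversal order and the maxItems cap concrete, here is a toy sketch over an in-memory hierarchy. It is not the actor's implementation and performs no network requests: the crawl function, the TREE and SENSES dictionaries, and the node names are all hypothetical, and treating "a node with no children" as an entry page is an assumption made for the example.

```python
from itertools import islice
from typing import Callable, Iterator, List


def crawl(
    root: str,
    children: Callable[[str], List[str]],
    senses: Callable[[str], List[str]],
    max_items: int = 0,
) -> Iterator[str]:
    """Walk root -> letter pages -> sub-group pages -> entries, yielding senses.

    max_items caps the number of senses yielded; 0 means no limit.
    """

    def walk(node: str) -> Iterator[str]:
        kids = children(node)
        if kids:
            # Browse page: descend into each linked page in order.
            for kid in kids:
                yield from walk(kid)
        else:
            # Assumed leaf: an entry page that yields one record per sense.
            yield from senses(node)

    stream = walk(root)
    return islice(stream, max_items) if max_items > 0 else stream


# Hypothetical stand-in for the browse hierarchy.
TREE = {
    "root": ["a", "b"],
    "a": ["a-group-1"],
    "b": ["b-group-1"],
    "a-group-1": ["apple"],
    "b-group-1": ["bank"],
}
SENSES = {"apple": ["apple/FRUIT"], "bank": ["bank/MONEY", "bank/GROUND"]}

capped = list(crawl("root", lambda n: TREE.get(n, []), lambda n: SENSES.get(n, []), max_items=2))
print(capped)
```

Because the walk is lazy, capping with a small max_items stops the traversal early instead of visiting every page first, which is why a low limit is practical for testing.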

Input parameters

Parameter   Type     Description
startWords  array    Headwords to look up directly. Empty = full A-Z crawl.
maxItems    integer  Max sense records to output. 0 = no limit. Default: 10.
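
If the input is assembled programmatically, a small helper can enforce the rules stated above before the run starts. This is a sketch under assumptions: build_input is a hypothetical name, trimming whitespace from headwords is a convenience not described by the actor, and only the two documented parameters and the documented default of 10 come from this page.

```python
import json
from typing import Iterable, Optional


def build_input(start_words: Optional[Iterable[str]] = None, max_items: int = 10) -> dict:
    """Build the actor input object from the two documented parameters."""
    # Trim whitespace and drop blank entries (a convenience assumption).
    words = [w.strip() for w in (start_words or []) if w and w.strip()]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 0:
        raise ValueError("maxItems must be a non-negative integer (0 = no limit)")
    return {"startWords": words, "maxItems": max_items}


# Direct lookup of two headwords with no output cap.
print(json.dumps(build_input(["hello", "bank"], max_items=0), indent=2))

# Empty startWords means a full A-Z crawl, here capped at the default of 10.
print(json.dumps(build_input(), indent=2))
```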

Data source

All data is scraped from dictionary.cambridge.org (the Cambridge Advanced Learner's Dictionary sub-domain). This scraper does not cover bilingual or specialized Cambridge dictionaries. The site serves static HTML, so no JavaScript rendering is required, and no proxy is needed.

Notes

  • CEFR tags (A1–C2) appear only on headwords Cambridge has officially tagged; less common words may have an empty cefr_level.
  • UK and US audio URLs point directly to Cambridge's CDN MP3 files.
  • Some entries appear in both the CALD4 (British) and CACD (American) dictionaries on the same page — the actor may produce two records for the same headword with differing US IPA strings.
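
When the same headword comes back more than once with differing US IPA strings, one way to handle it downstream is to group the records and keep every distinct variant. The sketch below assumes that headword plus part_of_speech is a reasonable grouping key, which may not suit every use case; group_us_ipa is a hypothetical helper and the second IPA value in the sample is illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_us_ipa(records: Iterable[dict]) -> Dict[Tuple[str, str], List[str]]:
    """Collect the distinct us_ipa strings seen per (headword, part_of_speech)."""
    grouped: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for rec in records:
        key = (rec.get("headword", ""), rec.get("part_of_speech", ""))
        variants = grouped[key]  # creates the key even if no IPA is present
        ipa = rec.get("us_ipa", "")
        if ipa and ipa not in variants:
            variants.append(ipa)  # keep first-seen order
    return dict(grouped)


# Illustrative records; only the first IPA string appears on this page.
records = [
    {"headword": "hello", "part_of_speech": "exclamation", "us_ipa": "heˈloʊ"},
    {"headword": "hello", "part_of_speech": "exclamation", "us_ipa": "həˈloʊ"},
    {"headword": "hello", "part_of_speech": "exclamation", "us_ipa": "heˈloʊ"},
]
for key, variants in group_us_ipa(records).items():
    print(key, variants)
```

Keeping all variants rather than picking one avoids silently discarding a pronunciation; a consumer that needs a single value can take the first element.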