Cambridge Dictionary Definition & IPA Scraper

Scrapes Cambridge Dictionary (dictionary.cambridge.org) entries and returns structured, learner-oriented metadata for vocabulary apps, NLP pipelines, and language-curriculum tools: headword, CEFR level (A1–C2), UK and US IPA pronunciation strings, audio MP3 URLs, part of speech, guidewords, definitions, and example sentences.

What you get

Each output record represents one sense block for a headword. A word like "bank" has multiple senses (MONEY, GROUND, STORE, etc.), and each sense becomes a separate row.

Field	Description
`headword`	Dictionary headword (e.g. "hello")
`part_of_speech`	Part of speech (exclamation, noun, verb, …)
`cefr_level`	Cambridge CEFR tag: A1, A2, B1, B2, C1, C2 — or empty if untagged
`uk_ipa`	British English IPA string (e.g. `heˈləʊ`)
`us_ipa`	American English IPA string (e.g. `heˈloʊ`)
`uk_audio_url`	Absolute URL to the UK pronunciation MP3
`us_audio_url`	Absolute URL to the US pronunciation MP3
`guideword`	Sense disambiguator (e.g. "MONEY" for bank's financial sense)
`definitions`	Pipe-separated list of definitions for this sense
`example_sentences`	Pipe-separated list of example sentences
`url`	Canonical source URL for the entry page
`scrapedAt`	ISO-8601 scrape timestamp
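
Because `definitions` and `example_sentences` arrive as pipe-separated strings, most consumers will want to split them back into lists. The following is a minimal Python sketch of that step. The helper name `split_pipe_field` and the sample values are illustrative assumptions, not actual scraped output, and the exact spacing around the `|` delimiter is also assumed, so the helper trims whitespace defensively.

```python
from typing import Dict, List


def split_pipe_field(value: str) -> List[str]:
    """Split a pipe-separated field into trimmed, non-empty strings."""
    return [part.strip() for part in value.split("|") if part.strip()]


# Illustrative record: field names follow the table above,
# but the values are placeholders rather than real dictionary text.
record: Dict[str, str] = {
    "headword": "bank",
    "part_of_speech": "noun",
    "guideword": "MONEY",
    "definitions": "first definition text | second definition text",
    "example_sentences": "First example. | Second example.",
}

definitions = split_pipe_field(record["definitions"])
examples = split_pipe_field(record["example_sentences"])

print(definitions)
print(examples)
```

Note that a definition containing a literal `|` character would be split incorrectly by this approach; whether that can occur in practice is not stated here.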

How to use

Look up specific words (fastest)

Supply a list of headwords and the actor fetches only those entry pages:

{
  "startWords": ["hello", "bank", "run", "beautiful"],
  "maxItems": 0
}

Set maxItems to 0 for no limit, or to a positive integer to cap the output at that many sense records.

Crawl the full English dictionary (A–Z browse)

Leave startWords empty to crawl all ~140,000 English headwords via Cambridge's A-Z browse hierarchy:

{
  "startWords": [],
  "maxItems": 0
}

A full crawl processes the browse hierarchy: root → letter pages (A–Z) → sub-group pages → individual entries. Set maxItems to limit output for testing.
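
To make the traversal order concrete, here is a rough Python sketch of a breadth-first walk over a four-level hierarchy like the one described above. It is not the actor's actual implementation: the `BROWSE_TREE` dictionary is a hypothetical in-memory stand-in for the browse pages, the path names are invented, and the cap here counts entry pages, whereas the actor's maxItems counts sense records.

```python
from collections import deque
from typing import Dict, List

# Hypothetical stand-in for the browse pages: each page maps to its child links.
BROWSE_TREE: Dict[str, List[str]] = {
    "root": ["letter/a", "letter/b"],
    "letter/a": ["group/a-1", "group/a-2"],
    "letter/b": ["group/b-1"],
    "group/a-1": ["entry/apple", "entry/arm"],
    "group/a-2": ["entry/axe"],
    "group/b-1": ["entry/bank"],
}


def crawl(tree: Dict[str, List[str]], start: str = "root", max_items: int = 0) -> List[str]:
    """Walk root -> letters -> sub-groups -> entries; max_items=0 means no limit."""
    entries: List[str] = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page.startswith("entry/"):
            # Leaf page: this is where a real scraper would parse the entry.
            entries.append(page)
            if max_items and len(entries) >= max_items:
                break
            continue
        # Index page: enqueue its children for later processing.
        queue.extend(tree.get(page, []))
    return entries


print(crawl(BROWSE_TREE))
print(crawl(BROWSE_TREE, max_items=2))
```

A queue-based walk keeps memory use proportional to the widest level of the hierarchy and makes it straightforward to stop early once the cap is reached.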

Input parameters

Parameter	Type	Description
`startWords`	array	Headwords to look up directly. Empty = full A-Z crawl.
`maxItems`	integer	Max sense records to output. 0 = no limit. Default: 10.
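
If the input is generated by a script rather than written by hand, a small helper can enforce the constraints in the table above. This is only a sketch: `build_run_input` is a hypothetical name, and the validation rules are inferred from the parameter descriptions rather than taken from the actor's own schema.

```python
import json
from typing import Any, Dict, List, Optional


def build_run_input(start_words: Optional[List[str]] = None, max_items: int = 0) -> Dict[str, Any]:
    """Build the actor input; an empty word list means a full A-Z crawl."""
    if max_items < 0:
        raise ValueError("max_items must be 0 (no limit) or a positive integer")
    # Drop blank strings so they are not sent as headwords.
    words = [w.strip() for w in (start_words or []) if w.strip()]
    return {"startWords": words, "maxItems": max_items}


run_input = build_run_input(["hello", "bank", "run", "beautiful"], max_items=50)
print(json.dumps(run_input, indent=2))
```

Passing maxItems explicitly avoids relying on the default of 10, which would silently truncate a larger lookup.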

Data source

All data is scraped from dictionary.cambridge.org (the Cambridge Advanced Learner's Dictionary sub-domain). This scraper does not cover bilingual or specialized Cambridge dictionaries. The site serves static HTML — no JavaScript rendering required, no proxy needed.

Notes

CEFR tags (A1–C2) appear only on headwords Cambridge has officially tagged; less common words may have an empty cefr_level.
UK and US audio URLs point directly to Cambridge's CDN MP3 files.
Some entries appear in both the CALD4 (British) and CACD (American) dictionaries on the same page — the actor may produce two records for the same headword with differing US IPA strings.
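
Downstream code may want to handle the two situations above: untagged CEFR levels and multiple records per headword. The Python sketch below groups records by headword and part of speech, and filters by CEFR level while skipping untagged rows. The function names are hypothetical, and the sample records use placeholder values (including the CEFR levels and the second US IPA string), not real scraped data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]
Record = Dict[str, str]


def group_senses(records: List[Record]) -> Dict[Tuple[str, str], List[Record]]:
    """Group records that share a headword and part of speech."""
    grouped: Dict[Tuple[str, str], List[Record]] = defaultdict(list)
    for rec in records:
        grouped[(rec["headword"], rec["part_of_speech"])].append(rec)
    return dict(grouped)


def tagged_up_to(records: List[Record], max_level: str) -> List[Record]:
    """Keep records whose CEFR tag is at or below max_level; untagged rows are dropped."""
    limit = CEFR_ORDER.index(max_level)
    return [
        rec for rec in records
        if rec.get("cefr_level", "") in CEFR_ORDER
        and CEFR_ORDER.index(rec["cefr_level"]) <= limit
    ]


# Placeholder records for illustration only.
records: List[Record] = [
    {"headword": "hello", "part_of_speech": "exclamation", "cefr_level": "A1", "us_ipa": "heˈloʊ"},
    {"headword": "hello", "part_of_speech": "exclamation", "cefr_level": "", "us_ipa": "placeholder-variant"},
    {"headword": "bank", "part_of_speech": "noun", "cefr_level": "B2", "us_ipa": "placeholder"},
]

groups = group_senses(records)
beginner = tagged_up_to(records, "A2")
print({key: len(value) for key, value in groups.items()})
print([rec["headword"] for rec in beginner])
```

Grouping rather than deduplicating keeps both IPA variants available, leaving the choice of which one to prefer to the consumer.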