CUC China Media University — Dance & Performing Arts Scraper

Scrapes faculty rosters, admissions announcements, program pages, and news from Communication University of China (中国传媒大学 / CUC), the school that trains CCTV anchors, Mango TV hosts, and the performers you see at the CCTV Spring Festival Gala.

CUC runs one of China's top performing-arts programs. Its alumni pipeline feeds state broadcasters, provincial TV networks, and the national competition circuit. This actor extracts that data as structured UTF-8 records and follows the pagination for you, so you don't have to navigate the WebPlus CMS manually.

What It Scrapes

Four configurable section categories, all from www.cuc.edu.cn:

Category	Content
`admissions`	招生就业 — enrollment notices, admission policies, exam requirements
`faculty`	Leadership rosters, special collections, departmental faculty pages
`programs`	Academic affairs notices, curriculum docs, departmental announcements
`news`	Main news feed, school of arts culture network, academic exchanges

Articles follow a predictable URL pattern (/YYYY/MMDD/c<channel>a<id>/page.htm). Pagination uses numbered .psp pages. No JavaScript rendering required.
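
The URL pattern above can be decomposed with a regular expression. The following is a minimal sketch, not the actor's own code: the names `ARTICLE_RE`, `ArticleRef`, and `parse_article_url` are hypothetical, and it assumes that both the channel and the article id in the path are made of digits only.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Pattern taken from the README: /YYYY/MMDD/c<channel>a<id>/page.htm
# Assumption: channel and id are digit strings.
ARTICLE_RE = re.compile(
    r"/(?P<year>\d{4})/(?P<month>\d{2})(?P<day>\d{2})"
    r"/c(?P<channel>\d+)a(?P<article_id>\d+)/page\.htm$"
)


class ArticleRef(NamedTuple):
    published: date   # date encoded in the URL path
    channel: str      # raw channel code
    article_id: str   # article id within the channel


def parse_article_url(url: str) -> Optional[ArticleRef]:
    """Return the parts of an article URL, or None if it does not match."""
    m = ARTICLE_RE.search(url)
    if m is None:
        return None
    try:
        published = date(int(m["year"]), int(m["month"]), int(m["day"]))
    except ValueError:
        # Digits matched but do not form a real calendar date.
        return None
    return ArticleRef(published, m["channel"], m["article_id"])
```

A URL that does not fit the pattern, such as a list page, returns None, which makes the function usable as a filter for separating article links from navigation links.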

Output Fields

Field	Type	Description
`page_url`	String	Canonical URL of the scraped page
`title`	String	Article or page title (Chinese characters preserved)
`title_zh`	String	Chinese title — identical to `title` for CUC pages
`category`	String	Section: `admissions`, `faculty`, `programs`, or `news`
`publish_date`	String	Publication date as shown on the page (e.g. `2024-05-10`)
`body_html`	String	Full article HTML including embedded content references
`body_text`	String	Plain-text article body
`department`	String	Channel code identifying the originating department
`attachments`	String	PDF/DOC attachment URLs, pipe-separated (admissions docs, curriculum PDFs)
`source_url`	String	Originating article URL
`scrapedAt`	String	ISO-8601 timestamp of the scrape

On faculty name-card pages, body_html and body_text are empty by design, because the page contains only a name and a date. The title and department fields are always populated.
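
Downstream code that expects article text may want to separate name-card records from full articles. A minimal sketch, assuming each dataset item is a plain dictionary keyed by the field names in the table above; the helper names and the sample records are invented for illustration.

```python
from typing import Iterable, List, Mapping


def is_name_card(record: Mapping[str, str]) -> bool:
    """True for a faculty record whose body fields are both empty."""
    body_text = (record.get("body_text") or "").strip()
    body_html = (record.get("body_html") or "").strip()
    return record.get("category") == "faculty" and not body_text and not body_html


def records_with_text(records: Iterable[Mapping[str, str]]) -> List[Mapping[str, str]]:
    """Keep only records that carry a non-empty plain-text body."""
    return [r for r in records if (r.get("body_text") or "").strip()]


# Invented sample records, shaped like the output fields above.
sample = [
    {"title": "Example name card", "category": "faculty", "body_text": "", "body_html": ""},
    {"title": "Example notice", "category": "admissions", "body_text": "Some text", "body_html": "<p>Some text</p>"},
]
name_cards = [r for r in sample if is_name_card(r)]
articles = records_with_text(sample)
```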

Input Parameters

Parameter	Type	Default	Description
`maxItems`	Integer	10	Maximum number of article pages to scrape (0 = unlimited)
`categories`	Array	all four	Which sections to crawl: `admissions`, `faculty`, `programs`, `news`

Run with maxItems: 0 and all four categories for a full archive crawl. The main news channel alone has hundreds of pages going back several years.
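
The input for such a run can be assembled and validated before it is submitted. This is a sketch under stated assumptions: `build_run_input` and `VALID_CATEGORIES` are hypothetical names, and the only facts taken from the README are the two parameter names, their defaults, and the four category values.

```python
from typing import Iterable, Optional

# The four category values documented in the parameter table.
VALID_CATEGORIES = ("admissions", "faculty", "programs", "news")


def build_run_input(max_items: int = 10,
                    categories: Optional[Iterable[str]] = None) -> dict:
    """Build the actor input dictionary; max_items=0 means unlimited."""
    if max_items < 0:
        raise ValueError("maxItems must be 0 (unlimited) or a positive integer")
    chosen = list(categories) if categories is not None else list(VALID_CATEGORIES)
    unknown = [c for c in chosen if c not in VALID_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return {"maxItems": max_items, "categories": chosen}


# Full archive crawl: no item limit, all four sections.
full_archive = build_run_input(max_items=0)

# Smaller run limited to admissions and news.
quick_check = build_run_input(max_items=25, categories=["admissions", "news"])
```

The resulting dictionary has the same shape as the JSON input shown in the parameter table, so it can be passed to whichever client is used to start the run.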

How It Works

The actor uses a hierarchical crawl. It seeds from the section entry points (/zsjy/list.htm, /9996/list.htm, etc.), follows pagination forward, and enqueues every article URL it finds. Article pages get a full extraction pass — title, date, body content, and any PDF attachments.
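
The crawl described above can be sketched as a queue of list pages that yields article URLs. This is an illustration of the general technique, not the actor's source: `crawl`, `ListPage`, and the in-memory `FAKE_SITE` are hypothetical, the page-2 key is a placeholder rather than a real CUC path, and fetching is replaced by a dictionary lookup so the example runs offline.

```python
from collections import deque
from typing import Callable, Iterable, List, NamedTuple, Optional


class ListPage(NamedTuple):
    article_urls: List[str]      # article links found on this list page
    next_page: Optional[str]     # next pagination page, if any


def crawl(seeds: Iterable[str],
          fetch_list: Callable[[str], ListPage],
          max_items: int = 10) -> List[str]:
    """Collect article URLs in discovery order; max_items=0 means unlimited."""
    queue = deque(seeds)
    seen_lists, seen_articles = set(), set()
    articles: List[str] = []
    while queue:
        list_url = queue.popleft()
        if list_url in seen_lists:
            continue
        seen_lists.add(list_url)
        page = fetch_list(list_url)
        for url in page.article_urls:
            if url in seen_articles:
                continue  # same article linked from two sections
            seen_articles.add(url)
            articles.append(url)
            if max_items and len(articles) >= max_items:
                return articles
        if page.next_page:
            queue.append(page.next_page)  # follow pagination forward
    return articles


# Stand-in for the real site; keys other than the two seeds are placeholders.
FAKE_SITE = {
    "/zsjy/list.htm": ListPage(["/a1", "/a2"], "zsjy-page-2-placeholder"),
    "zsjy-page-2-placeholder": ListPage(["/a3"], None),
    "/9996/list.htm": ListPage(["/a2", "/a4"], None),
}

everything = crawl(["/zsjy/list.htm", "/9996/list.htm"], FAKE_SITE.__getitem__, max_items=0)
first_three = crawl(["/zsjy/list.htm", "/9996/list.htm"], FAKE_SITE.__getitem__, max_items=3)
```

Because the queue is first-in, first-out, every seed section is visited once before any second pagination page, so a small maxItems still samples all requested sections in this sketch.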

No proxy required. CUC's servers respond cleanly to datacenter IPs. No Cloudflare. No anti-bot measures. Concurrency is kept at 5 to stay polite with a university web server.
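
One common way to hold concurrency at a fixed ceiling is a semaphore around each request. The sketch below shows that general pattern with asyncio; it is an assumption about how such a limit could be enforced, not a description of the actor's internals. `fetch_all` and `fake_fetch` are hypothetical, and the fake fetch sleeps instead of making a network call.

```python
import asyncio
from typing import Awaitable, Callable, List

MAX_CONCURRENCY = 5  # the ceiling stated in this README


async def fetch_all(urls: List[str],
                    fetch: Callable[[str], Awaitable[str]]) -> List[str]:
    """Run fetch(url) for every URL with at most MAX_CONCURRENCY in flight."""
    limiter = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(url: str) -> str:
        async with limiter:  # waits while 5 requests are already running
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


# Instrumented stand-in for a real HTTP request.
in_flight = 0
peak = 0


async def fake_fetch(url: str) -> str:
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)  # record the highest observed concurrency
    await asyncio.sleep(0.01)
    in_flight -= 1
    return url


results = asyncio.run(fetch_all([f"page-{i}" for i in range(20)], fake_fetch))
```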

Use Cases

Chinese-language NLP training corpora (performing arts domain)
Talent-pipeline research tracking which departments feed CCTV and Mango TV
Competitive analysis for Chinese broadcasting education programs
Admissions document archives for research on Chinese university policy

Notes

The site's CMS (WebPlus) uses numeric channel codes for some departments. The department field preserves the raw channel code; map it to a human-readable name using the CUC department directory as needed.
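
A lookup table with a fallback to the raw code is usually enough for this mapping. In the sketch below the codes and names in `DEPARTMENT_NAMES` are invented placeholders, not real CUC channel codes; fill the table from the department directory. `department_label` is a hypothetical helper.

```python
from typing import Mapping

# Placeholder entries only: replace with codes and names taken from the
# CUC department directory.
DEPARTMENT_NAMES = {
    "0000": "Example Department A",
    "0001": "Example Department B",
}


def department_label(record: Mapping[str, str],
                     names: Mapping[str, str] = DEPARTMENT_NAMES) -> str:
    """Return a readable department name, or the raw code if it is unmapped."""
    code = (record.get("department") or "").strip()
    return names.get(code, code)


known = department_label({"department": "0000"})
unmapped = department_label({"department": "9999"})
```

Falling back to the raw code keeps unmapped records identifiable instead of collapsing them into a single "unknown" bucket.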

CUC's admissions section includes PDFs of curriculum plans and judging panel documents for performance programs. These appear as pipe-separated URLs in the attachments field.
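
Splitting the attachments field back into individual URLs is a one-line operation, and grouping by file extension helps when only the PDFs are wanted. A minimal sketch: the helper names and the sample URLs are invented for illustration, and it assumes that no attachment URL contains a literal `|` character.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional
from urllib.parse import urlsplit


def split_attachments(value: Optional[str]) -> List[str]:
    """Split the pipe-separated attachments field into a list of URLs."""
    return [u.strip() for u in (value or "").split("|") if u.strip()]


def attachments_by_type(value: Optional[str]) -> Dict[str, List[str]]:
    """Group attachment URLs by lowercase file extension, e.g. '.pdf'."""
    grouped: Dict[str, List[str]] = {}
    for url in split_attachments(value):
        suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
        grouped.setdefault(suffix, []).append(url)
    return grouped


# Invented sample value, shaped like the attachments field.
sample = "https://www.cuc.edu.cn/example/plan.pdf|https://www.cuc.edu.cn/example/panel.DOC"
urls = split_attachments(sample)
grouped = attachments_by_type(sample)
```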

Part of the OrbTop Chinese media education dataset — companion to the BDA Beijing Dance Academy Scraper.