OrbTop

CUC China Media University — Dance & Performing Arts Scraper

Scrapes faculty rosters, admissions announcements, program pages, and news from the Communication University of China (中国传媒大学, CUC), the school that trains CCTV anchors, Mango TV hosts, and performers who appear at the CCTV Spring Festival Gala.

CUC runs one of China's top performing-arts programs. Its alumni pipeline feeds state broadcasters, provincial TV networks, and the national competition circuit. This actor extracts that data as structured UTF-8 records and follows pagination for you, so you don't have to navigate the WebPlus CMS manually.

What It Scrapes

Four configurable section categories, all from www.cuc.edu.cn:

| Category | Content |
| --- | --- |
| admissions | 招生就业 (Admissions and Employment): enrollment notices, admission policies, exam requirements |
| faculty | Leadership rosters, special collections, departmental faculty pages |
| programs | Academic affairs notices, curriculum docs, departmental announcements |
| news | Main news feed, school of arts culture network, academic exchanges |

Articles follow a predictable URL pattern (/YYYY/MMDD/c<channel>a<id>/page.htm). Pagination uses numbered .psp pages. No JavaScript rendering required.
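
The URL pattern above can be matched with a regular expression. The following is a minimal sketch, not the actor's own code: the regex, the function name `parse_article_url`, and the assumption that channel and article ids are purely numeric are all illustrative.

```python
import re
from typing import Optional

# Assumed regex for the documented pattern /YYYY/MMDD/c<channel>a<id>/page.htm.
# Channel and id are assumed to be digits only.
ARTICLE_RE = re.compile(
    r"/(?P<year>\d{4})/(?P<month>\d{2})(?P<day>\d{2})"
    r"/c(?P<channel>\d+)a(?P<article_id>\d+)/page\.htm$"
)


def parse_article_url(url: str) -> Optional[dict]:
    """Return date, channel, and article id, or None for non-article URLs."""
    match = ARTICLE_RE.search(url)
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "date": f"{parts['year']}-{parts['month']}-{parts['day']}",
        "channel": parts["channel"],
        "article_id": parts["article_id"],
    }
```

A list page such as `/zsjy/list.htm` returns `None`, which makes the function usable as a filter when separating article links from navigation links.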

Output Fields

| Field | Type | Description |
| --- | --- | --- |
| page_url | String | Canonical URL of the scraped page |
| title | String | Article or page title (Chinese characters preserved) |
| title_zh | String | Chinese title, identical to title for CUC pages |
| category | String | Section: admissions, faculty, programs, or news |
| publish_date | String | Publication date as shown on the page (e.g. 2024-05-10) |
| body_html | String | Full article HTML including embedded content references |
| body_text | String | Plain-text article body |
| department | String | Channel code identifying the originating department |
| attachments | String | PDF/DOC attachment URLs, pipe-separated (admissions docs, curriculum PDFs) |
| source_url | String | Originating article URL |
| scrapedAt | String | ISO-8601 timestamp of the scrape |

On faculty name-card pages, body_html and body_text are empty by design because the page contains only a name and a date. The title and department fields are always populated.

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxItems | Integer | 10 | Maximum number of article pages to scrape (0 = unlimited) |
| categories | Array | all four | Which sections to crawl: admissions, faculty, programs, news |

Run with maxItems: 0 and all four categories for a full archive crawl. The main news channel alone has hundreds of pages going back several years.
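
As an illustration, the input for a full archive crawl could be assembled like this. The helper `build_run_input` and its validation rules are hypothetical conveniences; only the two parameter names, the default of 10, and the four category values come from the table above.

```python
import json

# The four documented category values.
VALID_CATEGORIES = ("admissions", "faculty", "programs", "news")


def build_run_input(max_items: int = 10, categories=None) -> dict:
    """Build an input object using the two documented parameters."""
    if max_items < 0:
        raise ValueError("maxItems must be 0 (unlimited) or a positive integer")
    chosen = list(VALID_CATEGORIES) if categories is None else list(categories)
    unknown = [c for c in chosen if c not in VALID_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return {"maxItems": max_items, "categories": chosen}


# Full archive crawl: no item limit, all four sections.
full_archive = build_run_input(max_items=0)
print(json.dumps(full_archive, indent=2))
```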

How It Works

The actor uses a hierarchical crawl. It seeds from the section entry points (/zsjy/list.htm, /9996/list.htm, etc.), follows pagination forward, and enqueues every article URL it finds. Article pages get a full extraction pass — title, date, body content, and any PDF attachments.
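
The crawl described above can be sketched roughly as follows. This is an assumption-laden outline rather than the actor's implementation: the `crawl` function, the injected `fetch` callable, and both regexes are illustrative, and the pagination link format (`list2.psp`, `list3.psp`, ...) is a guess based on the "numbered .psp pages" description.

```python
import re
from collections import deque
from typing import Callable, List
from urllib.parse import urljoin

# Assumed link shapes; the real markup may differ.
ARTICLE_HREF = re.compile(r'href="([^"]*/\d{4}/\d{4}/c\d+a\d+/page\.htm)"')
PAGE_HREF = re.compile(r'href="([^"]*list\d+\.psp)"')


def crawl(seeds: List[str], fetch: Callable[[str], str], max_items: int = 10) -> List[str]:
    """Walk list pages breadth-first and return article URLs in discovery order.

    `fetch` takes a URL and returns its HTML; max_items=0 means unlimited.
    """
    queue = deque(seeds)
    seen_lists = set(seeds)
    seen_articles = set()
    articles: List[str] = []
    while queue:
        list_url = queue.popleft()
        html = fetch(list_url)
        # Enqueue every article link found on this list page.
        for href in ARTICLE_HREF.findall(html):
            url = urljoin(list_url, href)
            if url in seen_articles:
                continue
            seen_articles.add(url)
            articles.append(url)
            if max_items and len(articles) >= max_items:
                return articles
        # Follow pagination links that have not been visited yet.
        for href in PAGE_HREF.findall(html):
            url = urljoin(list_url, href)
            if url not in seen_lists:
                seen_lists.add(url)
                queue.append(url)
    return articles
```

Passing `fetch` in as a parameter keeps the traversal logic separate from HTTP handling, so the same function can be exercised against saved HTML fixtures.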

No proxy required. CUC's servers respond cleanly to datacenter IPs, with no Cloudflare and no anti-bot protection. Concurrency is kept at 5 to stay polite with a university web server.
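
One common way to cap concurrency at 5 is a semaphore around each request. The sketch below shows that technique with `asyncio`; it is not taken from the actor, and `fetch_all` and the injected `fetch` coroutine are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List

MAX_CONCURRENCY = 5  # the limit stated above


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    limit: int = MAX_CONCURRENCY,
) -> List[str]:
    """Fetch every URL with at most `limit` requests in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await fetch(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(bounded(u) for u in urls))
```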

Use Cases

  • Chinese-language NLP training corpora (performing arts domain)
  • Talent-pipeline research tracking which departments feed CCTV and Mango TV
  • Competitive analysis for Chinese broadcasting education programs
  • Admissions document archives for research on Chinese university policy

Notes

The site CMS (WebPlus) uses numeric channel codes for some departments. The department field preserves the raw channel code. Map to human-readable names using the CUC department directory as needed.

CUC's admissions section includes PDFs of curriculum plans and judging panel documents for performance programs. These appear as pipe-separated URLs in the attachments field.
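
Downstream code will usually want the attachments as a list. A minimal sketch, assuming attachment URLs never contain a literal pipe character; `split_attachments` is a hypothetical helper, not part of the actor:

```python
from typing import List


def split_attachments(value: str) -> List[str]:
    """Split the pipe-separated attachments field into a list of URLs."""
    # Empty segments (e.g. from an empty field) are dropped.
    return [part.strip() for part in value.split("|") if part.strip()]
```

An empty `attachments` string yields an empty list, so records without attachments need no special handling.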


Part of the OrbTop Chinese media education dataset — companion to the BDA Beijing Dance Academy Scraper.