OrbTop

CNSA English News Scraper

Scrape the official English-language news and announcements published by the China National Space Administration (CNSA) at cnsa.gov.cn/english.

CNSA's English mirror is the agency's own English-language outlet and a primary source for Chinese space agency news, cited by Reuters, BBC, AP, and SpaceNews. This actor collects full article text, publish dates, images, and attachment links across all five English subchannels: News, Policies & Announcements, Intergovernmental Cooperation, International Cooperation Coordinate Commission, and Special Information.

What you get

Each scraped record contains:

Field         Description
articleId     Unique numeric article ID from the CNSA CMS URL
subchannel    Subchannel name (News, Policies and Announcement, etc.)
title         Full article title
bodyHtml      Article body as raw HTML
bodyText      Article body as plain text
publishDate   Publish date in MM/DD/YYYY format
sourceUrl     Canonical URL of the article detail page
mirrorZhUrl   Chinese-language counterpart URL (always null; not exposed by the English CMS)
images        Comma-separated absolute URLs of all images in the article body
attachments   Comma-separated absolute URLs of any PDF/document attachments
scrapedAt     ISO-8601 timestamp when the record was scraped
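
For downstream code, the field list above could be mirrored as a typed record. The following is a minimal Python sketch, not the actor's own schema: the names CnsaRecord and split_urls are hypothetical, articleId is typed as a string because the example output shows it quoted, and treating an empty images or attachments field as null is an assumption based on that example.

```python
from typing import Optional, TypedDict


class CnsaRecord(TypedDict):
    """Assumed shape of one output record, following the field list."""
    articleId: str              # numeric ID, delivered as a string
    subchannel: str
    title: str
    bodyHtml: str
    bodyText: str
    publishDate: str            # MM/DD/YYYY
    sourceUrl: str
    mirrorZhUrl: None           # documented as always null
    images: Optional[str]       # comma-separated absolute URLs, or null
    attachments: Optional[str]  # comma-separated absolute URLs, or null
    scrapedAt: str              # ISO-8601 timestamp


def split_urls(value: Optional[str]) -> list[str]:
    """Turn a comma-separated URL field into a list; null or empty gives []."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]
```

Splitting on commas is only safe as long as the URLs themselves contain no commas, which holds for the example record but is not guaranteed.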

How it works

The actor crawls three levels:

  1. Index — Seeds five subchannel listing pages (News, Policies, Cooperation, etc.)
  2. Listing — Extracts article links from each listing page. Discovers all pagination pages from the embedded JavaScript (maxPageNum) and enqueues them automatically.
  3. Article — Fetches each article detail page and extracts title, body HTML/text, date, images, and attachment links.
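
The pagination discovery in step 2 can be illustrated with a small Python sketch. It assumes the listing page embeds the value as an assignment or object property such as `maxPageNum = 12` or `maxPageNum: "12"`; the exact form in the CNSA markup is not documented here, and the function name discover_page_count is hypothetical. How page numbers map to listing URLs is left out because the source does not state it.

```python
import re
from typing import Optional

# Assumed forms: `maxPageNum = 12`, `maxPageNum:12`, `maxPageNum: "12"`.
_MAX_PAGE_RE = re.compile(r"maxPageNum\s*[=:]\s*[\"']?(\d+)")


def discover_page_count(listing_html: str) -> Optional[int]:
    """Return the page count embedded in a listing page, or None if absent."""
    match = _MAX_PAGE_RE.search(listing_html)
    return int(match.group(1)) if match else None
```

Returning None rather than a default lets the caller decide whether a missing value means a single-page listing or a markup change worth logging.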

External links (CGTN, China Daily) that appear in the listing are skipped — only articles hosted on cnsa.gov.cn are scraped.
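
As a rough Python sketch of this filter: the check below keeps only URLs on cnsa.gov.cn and reads the article ID from the `/c<digits>/content.html` path segment seen in the example sourceUrl. The helper names and the assumption that every article URL follows that path pattern are illustrative, not taken from the actor's code.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Path shape taken from the example sourceUrl: .../c10743249/content.html
_ARTICLE_PATH_RE = re.compile(r"/c(\d+)/content\.html$")


def is_cnsa_hosted(url: str) -> bool:
    """True when the URL's host is cnsa.gov.cn or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == "cnsa.gov.cn" or host.endswith(".cnsa.gov.cn")


def extract_article_id(url: str) -> Optional[str]:
    """Return the numeric article ID for CNSA-hosted articles, else None."""
    if not is_cnsa_hosted(url):
        return None
    match = _ARTICLE_PATH_RE.search(urlparse(url).path)
    return match.group(1) if match else None
```

Comparing the parsed hostname, rather than searching the URL string for "cnsa.gov.cn", avoids accepting look-alike hosts or external URLs that merely mention the domain in a query string.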

Usage

Set Max Items to limit how many articles to collect. Leave it at the default (10) for a quick sample, or increase it to collect the full archive (~500 English articles).

Example input

{
  "maxItems": 50
}

Example output record

{
  "articleId": "10743249",
  "subchannel": "News",
  "title": "Chinese scientists discover two new lunar minerals",
  "bodyHtml": "<p>Chinese scientists recently discovered...</p>",
  "bodyText": "Chinese scientists recently discovered two new lunar minerals...",
  "publishDate": "04/24/2026",
  "sourceUrl": "https://www.cnsa.gov.cn/english/n6465652/n6465653/c10743249/content.html",
  "mirrorZhUrl": null,
  "images": "https://www.cnsa.gov.cn/english/n6465652/n6465653/c10743249/part/10743247.jpg",
  "attachments": null,
  "scrapedAt": "2026-05-31T08:14:23.000Z"
}
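
Because publishDate uses MM/DD/YYYY while scrapedAt is ISO-8601, consumers will usually want to normalize both. A minimal Python sketch follows, assuming scrapedAt always carries milliseconds and a trailing Z as in the record above; the helper names are hypothetical.

```python
from datetime import date, datetime, timezone


def parse_publish_date(value: str) -> date:
    """Parse the MM/DD/YYYY publishDate field into a date."""
    return datetime.strptime(value, "%m/%d/%Y").date()


def parse_scraped_at(value: str) -> datetime:
    """Parse scrapedAt (assumed form: YYYY-MM-DDTHH:MM:SS.mmmZ) as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


record = {"publishDate": "04/24/2026", "scrapedAt": "2026-05-31T08:14:23.000Z"}
published = parse_publish_date(record["publishDate"])
scraped = parse_scraped_at(record["scrapedAt"])

print(published.isoformat())               # 2026-04-24
print((scraped.date() - published).days)   # 37 days between publish and scrape
```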

Notes

  • The site does not require a proxy — direct datacenter egress works reliably.
  • Some listing items link to external publications (for example CGTN, China Daily, or Xinhua) rather than CNSA-hosted articles. These are filtered out automatically.
  • The mirrorZhUrl field is always null — CNSA's English CMS does not expose cross-links to the Chinese counterpart articles.
  • Coverage: approximately 500 English-translated articles across all five subchannels as of mid-2026.