CNSA English News Scraper
Scrape the official English-language news and announcements published by the China National Space Administration (CNSA) at cnsa.gov.cn/english.
CNSA's English mirror is the highest-authority English-language source for Chinese space agency news — cited by Reuters, BBC, AP, and SpaceNews. This actor collects full article text, publish dates, images, and attachment links across all five English subchannels: News, Policies & Announcements, Intergovernmental Cooperation, International Cooperation Coordinate Commission, and Special Information.
What you get
Each scraped record contains:
| Field | Description |
|---|---|
| `articleId` | Unique numeric article ID from the CNSA CMS URL |
| `subchannel` | Subchannel name (News, Policies and Announcement, etc.) |
| `title` | Full article title |
| `bodyHtml` | Article body as raw HTML |
| `bodyText` | Article body as plain text |
| `publishDate` | Publish date in MM/DD/YYYY format |
| `sourceUrl` | Canonical URL of the article detail page |
| `mirrorZhUrl` | Chinese-language counterpart URL (always null: not exposed by the English CMS) |
| `images` | Comma-separated absolute URLs of all images in the article body |
| `attachments` | Comma-separated absolute URLs of any PDF/document attachments |
| `scrapedAt` | ISO-8601 timestamp when the record was scraped |
How it works
The actor crawls three levels:
- Index: Seeds five subchannel listing pages (News, Policies, Cooperation, etc.)
- Listing: Extracts article links from each listing page. Discovers all pagination pages from the embedded JavaScript (`maxPageNum`) and enqueues them automatically.
- Article: Fetches each article detail page and extracts title, body HTML/text, date, images, and attachment links.
External links (CGTN, China Daily) that appear in the listing are skipped — only articles hosted on cnsa.gov.cn are scraped.
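The listing step described above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual code: it assumes the page count appears in the embedded script as `maxPageNum = <number>` (or `maxPageNum: <number>`), and it infers the article ID pattern from the example `sourceUrl` (`.../c10743249/content.html`). The helper names `find_max_page_num`, `extract_article_id`, and `select_cnsa_articles` are hypothetical.

```python
import re
from typing import List, Optional
from urllib.parse import urljoin, urlparse

# Assumption: the listing page embeds the page count as `maxPageNum = 7`
# or `maxPageNum: 7`. The real markup may differ.
MAX_PAGE_RE = re.compile(r"maxPageNum\s*[=:]\s*(\d+)")

# Inferred from the example sourceUrl: .../c10743249/content.html
ARTICLE_ID_RE = re.compile(r"/c(\d+)/content\.html$")

CNSA_HOSTS = {"cnsa.gov.cn", "www.cnsa.gov.cn"}


def find_max_page_num(listing_html: str) -> int:
    """Return the page count found in the embedded script, or 1 if absent."""
    match = MAX_PAGE_RE.search(listing_html)
    return int(match.group(1)) if match else 1


def extract_article_id(url: str) -> Optional[str]:
    """Return the numeric ID from a CNSA article URL, or None if it has none."""
    match = ARTICLE_ID_RE.search(urlparse(url).path)
    return match.group(1) if match else None


def select_cnsa_articles(listing_url: str, hrefs: List[str]) -> List[str]:
    """Resolve hrefs against the listing URL and keep CNSA-hosted articles only."""
    kept = []
    for href in hrefs:
        absolute = urljoin(listing_url, href)
        host = urlparse(absolute).hostname or ""
        # Skip external publications and non-article links on the same host.
        if host in CNSA_HOSTS and extract_article_id(absolute):
            kept.append(absolute)
    return kept
```

Matching on the hostname rather than on a substring of the URL avoids accidentally keeping an external link that merely mentions `cnsa.gov.cn` in its path or query string.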
Usage
Set Max Items to limit how many articles to collect. Leave it at the default (10) for a quick sample, or increase it to collect the full archive (~500 English articles).
Example input
```json
{
  "maxItems": 50
}
```
Example output record
```json
{
  "articleId": "10743249",
  "subchannel": "News",
  "title": "Chinese scientists discover two new lunar minerals",
  "bodyHtml": "<p>Chinese scientists recently discovered...</p>",
  "bodyText": "Chinese scientists recently discovered two new lunar minerals...",
  "publishDate": "04/24/2026",
  "sourceUrl": "https://www.cnsa.gov.cn/english/n6465652/n6465653/c10743249/content.html",
  "mirrorZhUrl": null,
  "images": "https://www.cnsa.gov.cn/english/n6465652/n6465653/c10743249/part/10743247.jpg",
  "attachments": null,
  "scrapedAt": "2026-05-31T08:14:23.000Z"
}
```
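Because `images` and `attachments` arrive as comma-separated strings (or null) and `publishDate` uses MM/DD/YYYY, downstream code will usually want to normalize a record before working with it. The following is one possible sketch; the `Article` class and the `split_urls` and `parse_record` helpers are hypothetical names, not part of the actor. It assumes no URL contains a literal comma.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


def split_urls(value: Optional[str]) -> List[str]:
    """Turn a comma-separated URL string (or null) into a list of URLs."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


@dataclass
class Article:
    article_id: str
    title: str
    published: date
    scraped_at: datetime
    images: List[str]
    attachments: List[str]


def parse_record(record: dict) -> Article:
    """Convert one scraped record (as a dict) into typed fields."""
    return Article(
        article_id=record["articleId"],
        title=record["title"],
        # publishDate is documented as MM/DD/YYYY.
        published=datetime.strptime(record["publishDate"], "%m/%d/%Y").date(),
        # Swap the trailing "Z" for an explicit offset so that
        # fromisoformat also accepts it on Python versions before 3.11.
        scraped_at=datetime.fromisoformat(record["scrapedAt"].replace("Z", "+00:00")),
        images=split_urls(record.get("images")),
        attachments=split_urls(record.get("attachments")),
    )
```

Applied to the example record above, `parse_record` would yield a `published` date of 24 April 2026, a single-element `images` list, and an empty `attachments` list.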
Notes
- The site does not require a proxy — direct datacenter egress works reliably.
- Some listing items link to external publications (China Daily, Xinhua) rather than CNSA-hosted articles. These are filtered out automatically.
- The `mirrorZhUrl` field is always `null`: CNSA's English CMS does not expose cross-links to the Chinese counterpart articles.
- Coverage: approximately 500 English-translated articles across all five subchannels as of mid-2026.