BAAI / Zhiyuan AI Research Papers Scraper

Extract the current curated AI research paper feed from BAAI (Beijing Academy of Artificial Intelligence, 智源研究院) at hub.baai.ac.cn. Each run fetches the hotness-sorted daily paper feed and enriches every paper with full editorial curator notes written in Chinese by BAAI staff.

What You Get

Each record includes:

Field	Description
`paper_title_en`	Paper title in English
`arxiv_id`	arXiv paper ID (e.g. `2606.06624`)
`authors`	List of author names
`publication_date`	Release date (ISO 8601)
`abstract_zh`	Full Chinese-language abstract
`keywords_zh`	Chinese subject tags (e.g. 机器学习, 生成模型)
`keywords_en`	arXiv category codes (e.g. cs.LG, cs.CL)
`pdf_url`	Direct PDF download link (BAAI-hosted mirror)
`baai_curator_note`	Structured editorial notes: [简介] abstract, [问题] problem addressed, [思路] key approach, [亮点] highlights, [相关] related work
`baai_url`	Canonical BAAI paper page URL
`cited_by_count`	BAAI hotness score (despite the field name, not a citation count)
`source`	Always `hub.baai.ac.cn`
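
For typed pipelines, the field table above can be written down as a Python `TypedDict`. This is only a sketch: the names `BaaiPaperRecord` and `missing_fields` are hypothetical, and the types are inferred from the field descriptions and the sample output rather than from a published schema. Whether any field can be null or absent is not documented here.

```python
from typing import List, TypedDict


class BaaiPaperRecord(TypedDict):
    """One output record; types inferred from the field table (assumption)."""

    paper_title_en: str
    arxiv_id: str
    authors: List[str]
    publication_date: str  # ISO 8601 date string
    abstract_zh: str
    keywords_zh: List[str]
    keywords_en: List[str]
    pdf_url: str
    baai_curator_note: str
    baai_url: str
    cited_by_count: int  # BAAI hotness score
    source: str  # always "hub.baai.ac.cn"


def missing_fields(record: dict) -> List[str]:
    """Return the expected field names that are absent from a record."""
    return [name for name in BaaiPaperRecord.__annotations__ if name not in record]
```

Calling `missing_fields(record)` on each item before loading it downstream is a cheap way to notice if the output shape changes.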

Why BAAI?

BAAI (智源研究院) is a leading government-backed AI research institute in China, behind the WuDao foundation model series, the BGE embedding family, and the Aquila LLM. Its curated daily paper feed covers ~10–30 papers per day and adds Chinese-language editorial summaries that are not available on arXiv. Those summaries are the main value this dataset adds over raw arXiv metadata.

Use cases:

Track Chinese AI research output for competitive intelligence
Join the output with an arXiv scraper's dataset on the shared `arxiv_id` key
Monitor BAAI's Chinese-language research highlights for China-focused analysis
Feed into downstream LLM pipelines with Chinese-language summaries
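
The join use case can be sketched in a few lines of Python. The function name `join_on_arxiv_id` is hypothetical, and the sketch assumes the other dataset also stores the ID in an `arxiv_id` field in the same format (no version suffix such as `v2`). The `abstract_en` field in the usage note is likewise only illustrative.

```python
from typing import Dict, Iterable, List


def join_on_arxiv_id(baai_records: Iterable[dict], arxiv_records: Iterable[dict]) -> List[dict]:
    """Left-join BAAI records with arXiv records that share the same arxiv_id."""
    arxiv_by_id: Dict[str, dict] = {
        r["arxiv_id"]: r for r in arxiv_records if r.get("arxiv_id")
    }
    joined = []
    for record in baai_records:
        match = arxiv_by_id.get(record.get("arxiv_id", ""), {})
        # BAAI fields win on key collisions; arXiv-only fields are added.
        joined.append({**match, **record})
    return joined
```

A BAAI record with `abstract_zh` joined to an arXiv record carrying a hypothetical `abstract_en` ends up with both fields. BAAI records without a match pass through unchanged.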

Input

Parameter	Required	Default	Description
`maxItems`	Yes	5	Maximum number of papers to return. The server-rendered feed currently exposes about 9 papers per run, so larger values return no additional records.

How It Works

Fetches hub.baai.ac.cn/papers — a Nuxt SSR page that embeds the current hotness feed in window.__NUXT__ state (no JavaScript execution required)
Extracts the paper UUIDs from the SSR data (up to 9, and no more than `maxItems`)
Fetches each paper's detail page (hub.baai.ac.cn/paper/<uuid>) — also fully SSR-rendered
Merges listing data (basic fields) with detail data (curator notes, extended keywords)
Emits one record per paper
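
The extract and merge steps above can be sketched as follows. This is not the actor's actual code. It assumes that paper UUIDs can be recognized in the listing HTML as `/paper/<uuid>` paths, whereas the steps above describe reading them from the `window.__NUXT__` state, whose exact layout is not documented here. The function names and the rule that non-empty detail values override listing values are also assumptions.

```python
import re
from typing import Dict, List

# Assumption: detail pages are referenced as /paper/<uuid> somewhere in the HTML.
PAPER_PATH = re.compile(
    r"/paper/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def extract_paper_uuids(html: str, max_items: int) -> List[str]:
    """Return unique paper UUIDs in page order, capped at max_items."""
    seen: Dict[str, None] = {}
    for uuid in PAPER_PATH.findall(html):
        seen.setdefault(uuid, None)  # dict keeps first-seen order
    return list(seen)[:max_items]


def merge_listing_and_detail(listing: dict, detail: dict) -> dict:
    """Combine listing fields with detail fields; non-empty detail values win."""
    merged = dict(listing)
    for key, value in detail.items():
        if value not in (None, "", []):
            merged[key] = value
    return merged
```

Keeping the page order matters because the listing is hotness-sorted, so truncating to `max_items` keeps the highest-ranked papers.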

Note on scope: The BAAI listing page server-renders only the top of the current editorial feed (~9 papers). Older entries load client-side through infinite scroll, which this scraper does not follow. Each run therefore captures the current curated snapshot; run it daily to build a historical archive.
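
Building the historical archive from daily runs mostly comes down to deduplicating snapshots. The sketch below keys the archive on `baai_url`. The function name `update_archive` and the added `first_seen` field are hypothetical and not part of the actor's output.

```python
from typing import Dict, Iterable


def update_archive(archive: Dict[str, dict], snapshot: Iterable[dict], run_date: str) -> int:
    """Add unseen papers from one run to the archive; return how many were new."""
    added = 0
    for record in snapshot:
        key = record["baai_url"]
        if key in archive:
            # Refresh the hotness score but keep the original first_seen date.
            archive[key]["cited_by_count"] = record.get("cited_by_count")
            continue
        archive[key] = {**record, "first_seen": run_date}
        added += 1
    return added
```

Because papers can stay in the feed for more than one day, refreshing `cited_by_count` on repeat sightings keeps the latest hotness score while preserving the date the paper first appeared.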

Sample Output

{
  "paper_title_en": "Rethinking the Trust Region in LLM Reinforcement Learning",
  "arxiv_id": "2602.04879",
  "authors": ["Penghui Qi", "Xiangxin Zhou", "Zichen Liu"],
  "publication_date": "2026-02-04",
  "abstract_zh": "强化学习（RL）已成为大语言模型（LLM）微调的基石...",
  "keywords_zh": ["机器学习", "强化学习", "大语言模型"],
  "keywords_en": ["cs.LG", "cs.CL", "cs.AI"],
  "pdf_url": "https://simg.baai.ac.cn/paperfile/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7.pdf",
  "baai_curator_note": "[简介] 强化学习（RL）已成为大语言模型...\n\n[问题] 如何设计更合理的信任域约束...\n\n[思路] 提出散度近端策略优化（DPPO）...",
  "baai_url": "https://hub.baai.ac.cn/paper/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7",
  "cited_by_count": 120,
  "source": "hub.baai.ac.cn"
}
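
Since `baai_curator_note` is a single string with bracketed section tags, downstream code will often want it split by section. A minimal sketch, assuming the five tags listed in the field table are the only section markers and that each appears at most once. The function name `split_curator_note` is hypothetical.

```python
import re
from typing import Dict

# The five section tags named in the field table.
SECTION_TAG = re.compile(r"\[(简介|问题|思路|亮点|相关)\]")


def split_curator_note(note: str) -> Dict[str, str]:
    """Split a curator note into a {tag: text} mapping, one entry per section."""
    parts = SECTION_TAG.split(note)
    # parts = [text before first tag, tag1, body1, tag2, body2, ...]
    return {tag: body.strip() for tag, body in zip(parts[1::2], parts[2::2])}
```

Matching only the known tags means other bracketed text inside a section, such as a citation marker, is left in place rather than treated as a new section.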

Notes

China-hosted: The site is hosted in China. Cross-border latency is factored into timeouts (45 seconds per request). Runs from US/EU Apify datacenters may experience occasional delays.
No authentication required: The papers feed is publicly accessible without login.
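Daily curation: BAAI curates ~10–30 papers per day, while each run captures the ~9 papers at the top of the feed at that moment. Running this actor daily gives you a rolling archive of the top editorial picks rather than every curated paper.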
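
If you fetch the same pages in your own code, the 45-second per-request timeout from the China-hosted note can be paired with a simple retry. The sketch below uses only the standard library. The retry count, the backoff, and the function names are assumptions; the notes above do not say whether the actor itself retries.

```python
import time
import urllib.request
from typing import Callable

REQUEST_TIMEOUT_S = 45  # per-request timeout stated in the notes above


def http_get(url: str, timeout: float = REQUEST_TIMEOUT_S) -> str:
    """Fetch a URL with the standard library and decode it as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retry(
    url: str,
    fetch: Callable[[str], str] = http_get,
    attempts: int = 3,
    backoff_s: float = 2.0,
) -> str:
    """Call fetch(url), retrying on OSError with linearly growing pauses."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError:  # URLError and socket timeouts are OSError subclasses
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)
    raise ValueError("attempts must be at least 1")
```

Passing the fetch function as a parameter keeps the retry logic testable without network access.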