BAAI / Zhiyuan AI Research Papers Scraper
Extract the current curated AI research paper feed from BAAI (Beijing Academy of Artificial Intelligence, 智源研究院) at hub.baai.ac.cn. Each run fetches the hotness-sorted daily paper feed and enriches every paper with full editorial curator notes written in Chinese by BAAI staff.
What You Get
Each record includes:
| Field | Description |
|---|---|
| `paper_title_en` | Paper title in English |
| `arxiv_id` | ArXiv paper ID (e.g. 2606.06624) |
| `authors` | List of author names |
| `publication_date` | Release date (ISO 8601) |
| `abstract_zh` | Full Chinese-language abstract |
| `keywords_zh` | Chinese subject tags (e.g. 机器学习, 生成模型) |
| `keywords_en` | ArXiv category codes (e.g. cs.LG, cs.RL) |
| `pdf_url` | Direct PDF download link (BAAI-hosted mirror) |
| `baai_curator_note` | Structured editorial notes: [简介] abstract, [问题] problem addressed, [思路] key approach, [亮点] highlights, [相关] related work |
| `baai_url` | Canonical BAAI paper page URL |
| `cited_by_count` | BAAI hotness score |
| `source` | Always `hub.baai.ac.cn` |
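Because `baai_curator_note` is a single string with bracketed section tags, downstream code will usually want it split into parts. The following is a minimal sketch, assuming the tags appear literally as `[简介]`, `[问题]`, `[思路]`, `[亮点]`, and `[相关]`; the English key names and the `split_curator_note` helper are illustrative and not part of the actor's output.

```python
import re

# Section tags as listed in the field table. The English key names on the
# right are illustrative choices, not fields emitted by the scraper.
SECTION_TAGS = {
    "简介": "summary",
    "问题": "problem",
    "思路": "approach",
    "亮点": "highlights",
    "相关": "related_work",
}

_TAG_PATTERN = re.compile(r"\[(" + "|".join(SECTION_TAGS) + r")\]")


def split_curator_note(note: str) -> dict:
    """Split a baai_curator_note string into a dict keyed by section name."""
    sections = {}
    matches = list(_TAG_PATTERN.finditer(note))
    for i, match in enumerate(matches):
        # A section runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[SECTION_TAGS[match.group(1)]] = note[match.end():end].strip()
    return sections
```

Sections that are missing from a note are simply absent from the returned dict, so callers can use `.get("highlights")` without special-casing short notes.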
Why BAAI?
BAAI (智源研究院) is a leading government-backed AI research institute in China, behind the WuDao foundation model series, the BGE embedding family, and the Aquila LLM. Its curated daily paper feed covers ~10–30 papers per day with Chinese-language editorial summaries that are not available on arXiv. Those summaries are the main value this dataset adds over a plain arXiv scrape.
Use cases:
- Track Chinese AI research output for competitive intelligence
- Build a joinable dataset with an ArXiv scraper (shared `arxiv_id` key)
- Monitor BAAI's curated AI research highlights in Chinese for sino-watchers
- Feed into downstream LLM pipelines with Chinese-language summaries
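The join on `arxiv_id` mentioned above can be sketched as follows. This assumes the other dataset is a list of dicts that also carries an `arxiv_id` field; its remaining field names, the `arxiv_` prefix, and both helper functions are hypothetical. ArXiv IDs sometimes carry a version suffix such as `v2`, so the sketch strips it before matching.

```python
import re


def normalize_arxiv_id(value):
    """Strip whitespace and a trailing version suffix (e.g. 'v2')."""
    if not value:
        return None
    return re.sub(r"v\d+$", "", value.strip())


def join_on_arxiv_id(baai_records, arxiv_records):
    """Inner-join BAAI records with records from another arXiv dataset.

    Fields from the second dataset are prefixed with 'arxiv_' (an
    illustrative convention) so they never overwrite BAAI fields.
    """
    arxiv_by_id = {}
    for rec in arxiv_records:
        key = normalize_arxiv_id(rec.get("arxiv_id"))
        if key:
            arxiv_by_id[key] = rec

    joined = []
    for rec in baai_records:
        match = arxiv_by_id.get(normalize_arxiv_id(rec.get("arxiv_id")))
        if match is None:
            continue  # no counterpart in the other dataset
        merged = dict(rec)
        for field, value in match.items():
            if field != "arxiv_id":
                merged[f"arxiv_{field}"] = value
        joined.append(merged)
    return joined
```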
Input
| Parameter | Required | Default | Description |
|---|---|---|---|
| `maxItems` | Yes | 5 | Maximum number of papers to return (current feed has ~9 per run) |
How It Works
- Fetches `hub.baai.ac.cn/papers`, a Nuxt SSR page that embeds the current hotness feed in `window.__NUXT__` state (no JavaScript execution required)
- Extracts up to 9 paper UUIDs from the SSR data
- Fetches each paper's detail page (`hub.baai.ac.cn/paper/<uuid>`), which is also fully SSR-rendered
- Merges listing data (basic fields) with detail data (curator notes, extended keywords)
- Emits one record per paper
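The merge and emit steps above can be illustrated with a small sketch. It is not the actor's actual implementation: it assumes the listing and detail data have already been parsed into dicts, and the `uuid` key, the helper names, and the rule that empty detail values do not overwrite listing values are all assumptions made for illustration.

```python
BASE_URL = "https://hub.baai.ac.cn"


def detail_url(uuid: str) -> str:
    """Build the detail page URL for a paper UUID."""
    return f"{BASE_URL}/paper/{uuid}"


def merge_paper(listing: dict, detail: dict) -> dict:
    """Overlay detail fields on listing fields, skipping empty detail values."""
    record = dict(listing)
    for key, value in detail.items():
        if value not in (None, "", []):
            record[key] = value
    record["source"] = "hub.baai.ac.cn"
    return record


def build_records(listing_items, details_by_uuid, max_items):
    """Emit one merged record per listed paper, capped at max_items."""
    records = []
    for item in listing_items[:max_items]:
        uuid = item["uuid"]  # hypothetical key name for the paper UUID
        record = merge_paper(item, details_by_uuid.get(uuid, {}))
        record["baai_url"] = detail_url(uuid)
        records.append(record)
    return records
```

Keeping the listing value when the detail value is empty means a partially rendered detail page degrades a record instead of blanking it.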
Note on scope: The BAAI listing page renders the current editorial feed (~9 papers) via server-side rendering. Further pagination is client-side only (infinite scroll). Each run captures the current curated snapshot — run daily to build a historical archive.
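Since consecutive daily snapshots can overlap, a historical archive needs deduplication. Below is a minimal sketch that appends only unseen papers to a JSON Lines file, keyed on `baai_url`. The file layout and the `append_new_records` helper are assumptions for illustration; the actor itself does not write this file.

```python
import json
from pathlib import Path


def append_new_records(archive_path, records):
    """Append records whose baai_url is not yet in the JSONL archive.

    Returns the number of records actually written.
    """
    path = Path(archive_path)
    seen = set()
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    seen.add(json.loads(line)["baai_url"])

    added = 0
    with path.open("a", encoding="utf-8") as fh:
        for record in records:
            url = record["baai_url"]
            if url in seen:
                continue  # already archived by an earlier run
            # ensure_ascii=False keeps the Chinese text readable on disk
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            seen.add(url)
            added += 1
    return added
```

This keeps the first-seen version of each paper. If the changing hotness score matters, store each run's snapshot separately instead of deduplicating.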
Sample Output
{
"paper_title_en": "Rethinking the Trust Region in LLM Reinforcement Learning",
"arxiv_id": "2602.04879",
"authors": ["Penghui Qi", "Xiangxin Zhou", "Zichen Liu"],
"publication_date": "2026-02-04",
"abstract_zh": "强化学习(RL)已成为大语言模型(LLM)微调的基石...",
"keywords_zh": ["机器学习", "强化学习", "大语言模型"],
"keywords_en": ["cs.LG", "cs.CL", "cs.AI"],
"pdf_url": "https://simg.baai.ac.cn/paperfile/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7.pdf",
"baai_curator_note": "[简介] 强化学习(RL)已成为大语言模型...\n\n[问题] 如何设计更合理的信任域约束...\n\n[思路] 提出散度近端策略优化(DPPO)...",
"baai_url": "https://hub.baai.ac.cn/paper/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7",
"cited_by_count": 120,
"source": "hub.baai.ac.cn"
}
Notes
- China-hosted: The site is hosted in China. Cross-border latency is factored into timeouts (45 seconds per request). Runs from US/EU Apify datacenters may experience occasional delays.
- No authentication required: The papers feed is publicly accessible without login.
- Daily curation: BAAI curates ~10–30 papers per day. Running this actor daily gives you a rolling archive of their editorial picks.
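For readers fetching the public pages themselves, the 45-second timeout from the latency note can be combined with a simple retry loop. This is a sketch under stated assumptions: only the timeout value comes from the notes above, while the retry count, the exponential backoff, and the `fetch_with_retry` helper are illustrative and do not describe how the actor behaves internally.

```python
import time
import urllib.request

REQUEST_TIMEOUT_SECONDS = 45  # per-request timeout stated in the notes


def fetch_with_retry(url, attempts=3, backoff_seconds=2.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, retrying on network errors with exponential backoff.

    `attempts` and `backoff_seconds` are assumed values, not documented
    actor settings. `opener` and `sleep` are injectable for testing.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url, timeout=REQUEST_TIMEOUT_SECONDS) as response:
                return response.read()
        except OSError as exc:
            # URLError and socket timeouts are both OSError subclasses.
            last_error = exc
            if attempt < attempts - 1:
                sleep(backoff_seconds * 2 ** attempt)
    raise last_error
```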