OrbTop

BAAI / Zhiyuan AI Research Papers Scraper

Extract the current curated AI research paper feed from BAAI (Beijing Academy of Artificial Intelligence, 智源研究院) at hub.baai.ac.cn. Each run fetches the hotness-sorted daily paper feed and enriches every paper with full editorial curator notes written in Chinese by BAAI staff.

What You Get

Each record includes:

| Field | Description |
| --- | --- |
| paper_title_en | Paper title in English |
| arxiv_id | arXiv paper ID (e.g. 2606.06624) |
| authors | List of author names |
| publication_date | Release date (ISO 8601) |
| abstract_zh | Full Chinese-language abstract |
| keywords_zh | Chinese subject tags (e.g. 机器学习, 生成模型) |
| keywords_en | arXiv category codes (e.g. cs.LG, cs.RL) |
| pdf_url | Direct PDF download link (BAAI-hosted mirror) |
| baai_curator_note | Structured editorial notes: [简介] abstract, [问题] problem addressed, [思路] key approach, [亮点] highlights, [相关] related work |
| baai_url | Canonical BAAI paper page URL |
| cited_by_count | BAAI hotness score |
| source | Always hub.baai.ac.cn |

Why BAAI?

BAAI (智源研究院) is a leading government-backed AI research institute in China, behind the WuDao foundation model series, the BGE embedding family, and the Aquila LLM. Its curated daily paper feed covers ~10–30 papers per day and adds Chinese-language editorial summaries that are not available on arXiv. Those summaries are the main value this dataset offers over a plain arXiv scrape.

Use cases:

  • Track Chinese AI research output for competitive intelligence
  • Join with the output of an arXiv scraper on the shared arxiv_id key
  • Monitor BAAI's curated Chinese-language research highlights for China-watchers
  • Feed into downstream LLM pipelines with Chinese-language summaries
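
The join on the shared arxiv_id key can be sketched in Python. This is a minimal illustration that assumes both datasets are already loaded as lists of dicts. The function name `join_on_arxiv_id` and the arXiv-side field `primary_category` are hypothetical; the real field names depend on which arXiv scraper you pair this with.

```python
from typing import Any

Record = dict[str, Any]


def join_on_arxiv_id(baai_records: list[Record], arxiv_records: list[Record]) -> list[Record]:
    """Left-join BAAI records with arXiv records on the shared arxiv_id key."""
    # Index the arXiv side once so each lookup is a dict access.
    arxiv_by_id = {r["arxiv_id"]: r for r in arxiv_records if r.get("arxiv_id")}
    joined = []
    for rec in baai_records:
        match = arxiv_by_id.get(rec.get("arxiv_id"), {})
        # Prefix arXiv-side fields so they never overwrite BAAI fields.
        extra = {f"arxiv_{k}": v for k, v in match.items() if k != "arxiv_id"}
        joined.append({**rec, **extra})
    return joined
```

BAAI records without a matching arXiv entry pass through unchanged, so no curator notes are lost in the join.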

Input

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| maxItems | Yes | 5 | Maximum number of papers to return (current feed has ~9 per run) |

How It Works

  1. Fetches hub.baai.ac.cn/papers — a Nuxt SSR page that embeds the current hotness feed in window.__NUXT__ state (no JavaScript execution required)
  2. Extracts up to 9 paper UUIDs from the SSR data
  3. Fetches each paper's detail page (hub.baai.ac.cn/paper/<uuid>) — also fully SSR-rendered
  4. Merges listing data (basic fields) with detail data (curator notes, extended keywords)
  5. Emits one record per paper
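
Steps 2 and 4 can be sketched in Python. This is not the actor's own code. Instead of decoding the window.__NUXT__ state, the sketch scans the raw HTML for detail-page paths of the form /paper/<uuid>, and it assumes those paths appear in the page either literally or with slashes escaped as \u002F. The function names and the rule that non-empty detail values override listing values are illustrative choices.

```python
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# Matches /paper/<uuid>, with "/" either literal or escaped as \u002F.
PAPER_PATH_RE = re.compile(r"(?:/|\\u002F)paper(?:/|\\u002F)(" + UUID + r")")


def extract_paper_uuids(html: str, limit: int = 9) -> list[str]:
    """Return unique paper UUIDs in page order, capped at `limit`."""
    seen: dict[str, None] = {}
    for uuid in PAPER_PATH_RE.findall(html):
        seen.setdefault(uuid, None)  # dicts keep first-seen order
    return list(seen)[:limit]


def merge_records(listing: dict, detail: dict) -> dict:
    """Merge listing fields with detail fields; non-empty detail values win."""
    merged = dict(listing)
    for key, value in detail.items():
        if value not in (None, "", []):
            merged[key] = value
    return merged
```

Keeping listing values when the detail page returns an empty field means a partially failed detail fetch still yields a usable record.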

Note on scope: The BAAI listing page renders the current editorial feed (~9 papers) via server-side rendering. Further pagination is client-side only (infinite scroll). Each run captures the current curated snapshot — run daily to build a historical archive.
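
Building a historical archive from daily snapshots needs de-duplication, because a paper may remain on the feed across several runs. A minimal sketch, assuming the archive is a single local JSON file keyed by baai_url. The file layout, the `update_archive` name, and the `first_seen` field are illustrative and not part of the actor's output.

```python
import json
from pathlib import Path


def update_archive(archive_path: Path, records: list[dict], run_date: str) -> int:
    """Add records not yet archived; return how many were added."""
    archive = {}
    if archive_path.exists():
        archive = json.loads(archive_path.read_text(encoding="utf-8"))
    added = 0
    for rec in records:
        key = rec["baai_url"]  # canonical page URL, one per paper
        if key not in archive:
            # first_seen is a hypothetical bookkeeping field added here.
            archive[key] = {**rec, "first_seen": run_date}
            added += 1
    archive_path.write_text(
        json.dumps(archive, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return added
```

Writing with ensure_ascii=False keeps the Chinese abstracts and curator notes readable in the file.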

Sample Output

{
  "paper_title_en": "Rethinking the Trust Region in LLM Reinforcement Learning",
  "arxiv_id": "2602.04879",
  "authors": ["Penghui Qi", "Xiangxin Zhou", "Zichen Liu"],
  "publication_date": "2026-02-04",
  "abstract_zh": "强化学习(RL)已成为大语言模型(LLM)微调的基石...",
  "keywords_zh": ["机器学习", "强化学习", "大语言模型"],
  "keywords_en": ["cs.LG", "cs.CL", "cs.AI"],
  "pdf_url": "https://simg.baai.ac.cn/paperfile/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7.pdf",
  "baai_curator_note": "[简介] 强化学习(RL)已成为大语言模型...\n\n[问题] 如何设计更合理的信任域约束...\n\n[思路] 提出散度近端策略优化(DPPO)...",
  "baai_url": "https://hub.baai.ac.cn/paper/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7",
  "cited_by_count": 120,
  "source": "hub.baai.ac.cn"
}
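
The baai_curator_note field is a single string with bracketed section tags, so downstream pipelines will usually want to split it. A minimal sketch in Python, assuming the tags always appear in the bracketed form shown in the sample. The English key names and the `parse_curator_note` function are our own choices, not part of the actor's output.

```python
import re

# Section tags from the field table, mapped to English keys (names are illustrative).
SECTION_KEYS = {
    "简介": "abstract",
    "问题": "problem",
    "思路": "approach",
    "亮点": "highlights",
    "相关": "related_work",
}
TAG_RE = re.compile(r"\[(" + "|".join(SECTION_KEYS) + r")\]")


def parse_curator_note(note: str) -> dict[str, str]:
    """Split a curator note into sections keyed by English names."""
    # re.split with one capture group yields [prefix, tag1, body1, tag2, body2, ...]
    parts = TAG_RE.split(note)
    sections = {}
    for tag, body in zip(parts[1::2], parts[2::2]):
        sections[SECTION_KEYS[tag]] = body.strip()
    return sections
```

Sections that a note omits are simply absent from the result, so check for a key before reading it.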

Notes

  • China-hosted: The site is hosted in China. Cross-border latency is factored into timeouts (45 seconds per request). Runs from US/EU Apify datacenters may experience occasional delays.
  • No authentication required: The papers feed is publicly accessible without login.
  • Daily curation: BAAI curates ~10–30 papers per day, while each run captures only the ~9 papers in the server-rendered feed. Running this actor daily gives you a rolling archive of their editorial picks.
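
If you fetch the same pages yourself, the 45-second per-request timeout noted above can be combined with a simple retry. A minimal sketch using only the Python standard library; the retry count and backoff are assumptions, since the notes specify only the timeout, and the function names are illustrative.

```python
import time
import urllib.request
from typing import Callable

REQUEST_TIMEOUT_S = 45.0  # per-request timeout stated in the notes


def http_get(url: str, timeout: float = REQUEST_TIMEOUT_S) -> str:
    """Fetch a URL and return the body decoded as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def fetch_with_retry(
    url: str,
    fetch: Callable[[str], str] = http_get,
    attempts: int = 3,       # assumed value, not from the actor
    backoff_s: float = 2.0,  # assumed value, not from the actor
) -> str:
    """Call fetch(url), retrying on OSError with linearly growing waits."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError:  # covers URLError and socket timeouts
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)
    raise ValueError("attempts must be at least 1")
```

Passing the fetch function as a parameter lets you swap in another HTTP client or a test stub without touching the retry logic.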