OrbTop

ModelScope Model Catalog Scraper

Scrape the ModelScope (modelscope.cn) AI model catalog — China's Alibaba-backed model registry hosting ~200k models. Export model IDs, tasks, frameworks, download statistics, star counts, licenses, READMEs, and full metadata for all models in the catalog.

What it does

Sweeps the ModelScope JSON API task-by-task (text-generation, image-generation, multimodal, and 26 other task categories), deduplicates across task overlaps, and optionally enriches each model record with the full README from the per-model detail endpoint.
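
The cross-task deduplication described above can be sketched as follows. This is an illustrative sketch only, not the scraper's actual code: the function name `dedupe_across_tasks`, the shape of the input (a mapping from task slug to already-fetched records), and the sample model IDs are all assumptions. It keeps the first record seen for each model_id and stores the task that first surfaced it in task.

```python
from typing import Dict, Iterable, List


def dedupe_across_tasks(records_by_task: Dict[str, Iterable[dict]]) -> List[dict]:
    """Merge per-task result lists into one list with a single row per model_id.

    Hypothetical helper: assumes each record is a dict with a "model_id" key.
    """
    merged: Dict[str, dict] = {}
    for task, records in records_by_task.items():
        for record in records:
            model_id = record["model_id"]
            if model_id not in merged:
                # First sighting: remember which task sweep discovered the model.
                merged[model_id] = {**record, "task": task}
            # Later sightings under other tasks are dropped as duplicates.
    return list(merged.values())


# Made-up sample data: one model appears under two tasks.
sample = {
    "text-generation": [{"model_id": "org-a/model-x"}, {"model_id": "org-b/model-y"}],
    "multimodal": [{"model_id": "org-a/model-x"}, {"model_id": "org-c/model-z"}],
}
unique = dedupe_across_tasks(sample)
print([(r["model_id"], r["task"]) for r in unique])
```

Keying on model_id rather than name avoids collisions between models that share a name under different namespaces.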

Output fields per model:

  • model_id — full identifier (namespace/name)
  • namespace, name — publisher slug and model name
  • chinese_name — display name in Chinese if present
  • task — primary task tag used for discovery
  • tasks_all — all task tags, pipe-separated
  • frameworks — ML frameworks (pytorch, tensorflow, mindspore, etc.), pipe-separated
  • languages — supported languages (en, zh, multilingual, etc.), pipe-separated
  • license — SPDX identifier (apache-2.0, mit, etc.)
  • downloads_30d — downloads in the last 30 days
  • stars — star count
  • last_updated, created_at — ISO-8601 timestamps
  • readme_text — README content, truncated to 8 KB (requires includeDetails: true)
  • model_size_params — parameter count label when tagged (7B, 72B, MoE-22B-A2B)
  • quantization_variants — available quantization types from tensor metadata, pipe-separated
  • base_model — base model ID if this is a fine-tune
  • publisher_org, publisher_url — organization name and profile URL
  • has_demo, has_inference_api — boolean flags

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| tasks | array | (all tasks) | Limit to specific task slugs (e.g. text-generation, image-generation). Leave empty to sweep all 29 canonical tasks. |
| maxItems | integer | 100 | Maximum number of models to return. Set to 0 for unlimited (full catalog run). |
| includeDetails | boolean | true | Fetch the per-model detail endpoint for full README text and quantization variant metadata. Disabling this speeds up runs but leaves readme_text and quantization_variants empty. |
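
As an illustration of how the three input fields fit together, the sketch below assembles an input object and prints it as JSON. The helper `build_run_input` and its Python parameter names are hypothetical. Only the keys tasks, maxItems, and includeDetails and their defaults come from the field descriptions above. The negative-value check is an assumption about what counts as sensible input, not documented behavior.

```python
import json
from typing import List, Optional


def build_run_input(tasks: Optional[List[str]] = None,
                    max_items: int = 100,
                    include_details: bool = True) -> dict:
    """Assemble an input object from the three documented fields (hypothetical helper)."""
    if max_items < 0:
        # Assumption: negative values are meaningless; 0 means unlimited.
        raise ValueError("maxItems must be 0 (unlimited) or a positive integer")
    return {
        "tasks": list(tasks or []),        # empty list = sweep all tasks
        "maxItems": max_items,             # 0 = unlimited (full catalog run)
        "includeDetails": include_details, # False = faster run, no readme_text
    }


# A quick, targeted run: one task, capped at 500 models, no README enrichment.
run_input = build_run_input(tasks=["text-generation"], max_items=500, include_details=False)
print(json.dumps(run_input, indent=2))
```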

Example use cases

  • West+East parity datasets — pair with the HuggingFace Model Scraper to build a combined index of both Western and Chinese open-weights releases (Qwen, DeepSeek, Yi, GLM, InternLM, ERNIE, MiniMax, etc.).
  • Model landscape research — filter by task, framework, or license to survey which Chinese labs are publishing in specific domains.
  • Download trend tracking — schedule regular runs and track downloads_30d growth for specific namespaces or model families.
  • README content analysis — extract model cards from readme_text for NLP-based capability assessment or feature extraction.
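
For the download trend use case, one possible approach is to compare downloads_30d between two exported runs. The sketch below is a minimal example under the assumption that each run has been loaded as a list of dicts with model_id and downloads_30d keys. The function name `downloads_growth` and the sample numbers are made up for illustration.

```python
from typing import Dict, List


def downloads_growth(previous: List[dict], current: List[dict]) -> Dict[str, int]:
    """Return the change in downloads_30d per model_id between two runs.

    Hypothetical helper: models missing from the earlier run are skipped.
    """
    before = {r["model_id"]: int(r["downloads_30d"]) for r in previous}
    growth: Dict[str, int] = {}
    for record in current:
        model_id = record["model_id"]
        if model_id in before:
            growth[model_id] = int(record["downloads_30d"]) - before[model_id]
    return growth


# Made-up snapshots from two scheduled runs.
last_week = [
    {"model_id": "org-a/model-x", "downloads_30d": 1000},
    {"model_id": "org-b/model-y", "downloads_30d": 400},
]
this_week = [
    {"model_id": "org-a/model-x", "downloads_30d": 1250},
    {"model_id": "org-b/model-y", "downloads_30d": 380},
    {"model_id": "org-c/model-z", "downloads_30d": 90},
]
growth = downloads_growth(last_week, this_week)
print(growth)
```

Because downloads_30d is a rolling 30-day figure, the difference reflects a change in the rolling count, not the number of new downloads between runs.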

Notes

  • The API requires no authentication. No proxy is needed — direct access from Apify infrastructure works without restriction.
  • Full catalog sweeps (all tasks, maxItems: 0, includeDetails: true) are long-running. For targeted queries, cap output with maxItems or set includeDetails: false to skip the per-model detail requests.
  • Array output fields (tasks_all, frameworks, languages, quantization_variants) use | as separator for flat dataset compatibility. Split on | in downstream processing.
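
Splitting the pipe-separated fields downstream might look like the sketch below. The four field names come from the note above. The helper `split_pipe_fields` and the sample record are hypothetical. Treating empty or missing values as empty lists is an assumption about how the export represents "no values".

```python
# Field names taken from the note above; everything else is illustrative.
PIPE_FIELDS = ("tasks_all", "frameworks", "languages", "quantization_variants")


def split_pipe_fields(record: dict) -> dict:
    """Return a copy of the record with pipe-separated fields turned into lists."""
    out = dict(record)
    for field in PIPE_FIELDS:
        raw = out.get(field) or ""  # missing or None is treated like an empty string
        out[field] = [part.strip() for part in raw.split("|") if part.strip()]
    return out


# Made-up record in the flat export shape.
row = {
    "model_id": "org-a/model-x",
    "tasks_all": "text-generation",
    "frameworks": "pytorch|tensorflow",
    "languages": "en|zh",
    "quantization_variants": "",
}
parsed = split_pipe_fields(row)
print(parsed["frameworks"], parsed["quantization_variants"])
```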