OrbTop

HuggingFace Model Scraper - AI/ML Model Data

AI · DEVELOPER TOOLS · BUSINESS

Scrape AI/ML model metadata from the HuggingFace Hub — over 1M public models. Returns model name, task type, download counts, likes, library, author, tags, license, parameter size, model-card README excerpt, Spaces count, and referenced datasets. Filter by task, library, author, or free-text search.


HuggingFace Scraper Features

  • Queries HuggingFace's public API directly. No HTML scraping, no authentication.
  • Filters by task type, ML library, author, or search query — combine them as needed
  • Sorts by downloads, likes, trending, or last-modified date, whichever ranking signal matters for your use case
  • Enriches each model with a model-card README excerpt, Spaces count, and dataset references from the cardData YAML front matter
  • Extracts license identifier (apache-2.0, mit, cc-by-4.0, etc.) and parameter size (7B, 13B, 70B) from tag patterns
  • Handles cursor pagination by following the API's Link response headers (RFC 5988, since superseded by RFC 8288), so a 10K-result run pages through every result without manual intervention
  • No proxies needed. The Hub is public.
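
The Link-header pagination mentioned above can be sketched as follows. This is a minimal illustration and not the scraper's actual code: the function name `next_page_url` is hypothetical, and the cursor value in the sample header is made up for the example.

```python
import re
from typing import Optional

# Matches one link-value of the form: <https://...>; rel="next"
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="?([^",;]+)"?')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the URL tagged rel="next" in a Link header, or None on the last page."""
    if not link_header:
        return None
    for url, rel in _LINK_RE.findall(link_header):
        if rel.strip() == "next":
            return url
    return None


# Hypothetical header value; a real cursor is an opaque token chosen by the server.
header = '<https://huggingface.co/api/models?cursor=abc123&limit=100>; rel="next"'
print(next_page_url(header))
```

A caller would keep requesting the returned URL until the function yields `None`. If the HTTP client is `requests`, `response.links.get("next", {}).get("url")` exposes the same value without a hand-written parser.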

Who Uses HuggingFace Model Data?

  • ML engineers — find the most-downloaded models for a specific task or framework without browsing the Hub manually
  • AI tooling builders — feed model metadata into agent platforms, model routers, or evaluation harnesses
  • Researchers — track adoption signals (downloads, likes, Spaces count) across model families over time
  • Procurement and licensing teams — pull license identifiers across hundreds of models in one pass for compliance review
  • Market analysts — monitor which authors and organizations are gaining traction on the Hub

How HuggingFace Scraper Works

  1. Pick your filters: task type (text-generation, image-classification), library (transformers, diffusers), author, or search query. Sort by downloads, likes, trending, or last modified.
  2. The scraper hits the HuggingFace /api/models endpoint with your filters and walks every page of results using cursor pagination from the response Link header.
  3. For each model in the list, a follow-up detail fetch pulls Spaces count, cardData datasets, and the README. The README is stripped of YAML front matter and truncated to a 500-character excerpt.
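
The README handling in step 3 might look roughly like the sketch below. It assumes the front matter is a leading block delimited by `---` lines and that surrounding whitespace is trimmed before truncation; the name `readme_excerpt` and the sample text are illustrative only.

```python
import re

# YAML front matter: a leading '---' line, optional content, then a closing '---' line.
_FRONT_MATTER_RE = re.compile(r"\A---\s*\n(?:.*?\n)?---\s*(?:\n|\Z)", re.DOTALL)


def readme_excerpt(readme: str, limit: int = 500) -> str:
    """Drop a leading YAML front matter block, then keep the first `limit` characters."""
    body = _FRONT_MATTER_RE.sub("", readme, count=1).strip()
    return body[:limit]


# Made-up model card used only to exercise the function.
sample = "---\nlicense: mit\ndatasets:\n- squad\n---\n# My Model\n\nA small demo model."
print(readme_excerpt(sample))
```

Stripping before truncating matters: a long front matter block would otherwise consume most of the 500 characters and leave little of the actual model card in the excerpt.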

Input

Top text-generation models by downloads

{
  "pipelineTag": "text-generation",
  "sortBy": "downloads",
  "maxItems": 50
}

All models from a single author

{
  "author": "meta-llama",
  "sortBy": "downloads",
  "maxItems": 100
}

Free-text search

{
  "searchQuery": "llama",
  "library": "transformers",
  "sortBy": "likes",
  "maxItems": 25
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| searchQuery | string | "" | Free-text search across model names, authors, and descriptions. Empty means browse all. |
| pipelineTag | string | "" | Filter by primary task type (text-generation, image-classification, automatic-speech-recognition, etc.). Empty means all tasks. |
| library | string | "" | Filter by ML framework (transformers, diffusers, sentence-transformers, timm, etc.). Empty means all libraries. |
| author | string | "" | Filter by author or organization (e.g. meta-llama, google, microsoft). |
| sortBy | string | downloads | One of downloads, likes, lastModified, trending. |
| maxItems | integer | 10 | Maximum models to return. Set to 0 for unlimited; the Hub has 1M+ public models, so filters are recommended. |
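
To make the defaults and allowed values concrete, here is a small validation sketch built only from the input fields documented above. The helper names `normalize_input` and `item_limit` are hypothetical, and rejecting unknown fields is a choice made for this sketch; the scraper itself may behave differently.

```python
from typing import Any, Dict, Optional

ALLOWED_SORTS = {"downloads", "likes", "lastModified", "trending"}

# Defaults copied from the input field table.
DEFAULTS: Dict[str, Any] = {
    "searchQuery": "",
    "pipelineTag": "",
    "library": "",
    "author": "",
    "sortBy": "downloads",
    "maxItems": 10,
}


def normalize_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in documented defaults and reject values the table does not allow."""
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    cfg = {**DEFAULTS, **raw}
    if cfg["sortBy"] not in ALLOWED_SORTS:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORTS)}")
    if not isinstance(cfg["maxItems"], int) or cfg["maxItems"] < 0:
        raise ValueError("maxItems must be a non-negative integer")
    return cfg


def item_limit(cfg: Dict[str, Any]) -> Optional[int]:
    """A maxItems of 0 means unlimited, represented here as None."""
    return cfg["maxItems"] or None


cfg = normalize_input(
    {"searchQuery": "llama", "library": "transformers", "sortBy": "likes", "maxItems": 25}
)
print(cfg["pipelineTag"] == "", item_limit(cfg))
```

Checking the input before a run starts surfaces a mistyped `sortBy` value immediately instead of after the first API call.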

HuggingFace Scraper Output Fields

{
  "model_name": "Meta-Llama-3-8B-Instruct",
  "model_id": "meta-llama/Meta-Llama-3-8B-Instruct",
  "pipeline_tag": "text-generation",
  "downloads_total": 4823910,
  "downloads_30d": 612400,
  "likes": 3812,
  "library": "transformers",
  "author": "meta-llama",
  "tags": ["text-generation", "conversational", "llama-3", "en"],
  "license": "llama3",
  "model_size_params": "8B",
  "last_modified": "2026-04-12T10:24:33.000Z",
  "readme_excerpt": "Meta Llama 3 is a family of large language models (LLMs) developed by Meta...",
  "spaces_count": 412,
  "datasets_used": ["meta-llama/Meta-Llama-3-eval"]
}

| Field | Type | Description |
| --- | --- | --- |
| model_name | string | Human-readable model name (without the author prefix) |
| model_id | string | Full model identifier in author/model-name format |
| pipeline_tag | string | Primary task type (text-generation, image-classification, etc.) |
| downloads_total | integer | All-time download count |
| downloads_30d | integer | Download count in the last 30 days |
| likes | integer | Number of likes on HuggingFace |
| library | string | Primary ML library (transformers, diffusers, etc.) |
| author | string | Model author or organization username |
| tags | string[] | Tags including language, dataset, and custom labels (license tags are stripped) |
| license | string | License identifier (apache-2.0, mit, cc-by-4.0, etc.) |
| model_size_params | string | Parameter count (7B, 13B, 70B, 175B) when present in tags |
| last_modified | string | ISO 8601 timestamp of the model's last update |
| readme_excerpt | string | First 500 characters of the model card README, YAML front matter stripped |
| spaces_count | integer | Number of HuggingFace Spaces that reference this model |
| datasets_used | string[] | Datasets declared in the model card's YAML front matter |
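
The `tags`, `license`, and `model_size_params` fields are all derived from the raw tag list. The sketch below shows one plausible way to do that. It assumes license tags carry a `license:` prefix and that a parameter size appears as a hyphen- or underscore-delimited token such as `8b`; the exact patterns the scraper uses are not documented here, and `parse_tags` is a hypothetical name.

```python
import re
from typing import Dict, List, Optional, Union

# Assumed size token: digits, optional decimal part, then 'b' (e.g. "7b", "1.5B").
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)b", re.IGNORECASE)


def parse_tags(raw_tags: List[str]) -> Dict[str, Union[List[str], Optional[str]]]:
    """Split raw tags into kept tags, a license identifier, and a parameter size."""
    license_id: Optional[str] = None
    size: Optional[str] = None
    kept: List[str] = []
    for tag in raw_tags:
        if tag.startswith("license:"):
            # License tags are stripped from the tag list and surfaced separately.
            if license_id is None:
                license_id = tag.split(":", 1)[1]
            continue
        kept.append(tag)
        if size is None:
            for part in re.split(r"[-_]", tag):
                match = _SIZE_RE.fullmatch(part)
                if match:
                    size = match.group(1) + "B"
                    break
    return {"tags": kept, "license": license_id, "model_size_params": size}


# Illustrative tag list, not taken from a real API response.
print(parse_tags(["text-generation", "llama-3-8b", "en", "license:llama3"]))
```

When no tag matches the size pattern, `model_size_params` stays `None`, which lines up with the "when present in tags" caveat in the table.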

FAQ

How do I scrape HuggingFace?

HuggingFace Scraper hits the public Hub API directly — no key, no login, no rate-limit pain at the default settings. Set your filters and the actor handles pagination and enrichment.

Does HuggingFace Scraper need proxies?

HuggingFace Scraper runs without proxies. The Hub API is publicly accessible and the actor stays well under the unauthenticated rate limit with a 100ms courtesy delay between detail fetches.

What data does HuggingFace Scraper return?

HuggingFace Scraper returns 15 fields per model: name, full model ID, author, task, downloads (total and 30-day), likes, library, tags, license, parameter size, last-modified timestamp, README excerpt, Spaces count, and dataset references.

Can I filter HuggingFace models by license?

HuggingFace Scraper doesn't filter by license at the API level, but the license field is parsed from each model's tags. Run the scrape with your other filters and post-filter the dataset by license — apache-2.0, mit, or whatever the compliance review allows.
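
Post-filtering by license takes only a few lines once the dataset is loaded. The sketch below assumes the records are dictionaries shaped like the output example; the model IDs are placeholders and `filter_by_license` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List


def filter_by_license(
    records: Iterable[Dict[str, Any]], allowed: Iterable[str]
) -> List[Dict[str, Any]]:
    """Keep records whose license field is in the allowed set (case-insensitive)."""
    allowed_norm = {name.lower() for name in allowed}
    return [r for r in records if (r.get("license") or "").lower() in allowed_norm]


# Placeholder records; only the fields needed for the filter are shown.
records = [
    {"model_id": "example-org/model-a", "license": "apache-2.0"},
    {"model_id": "example-org/model-b", "license": "llama3"},
    {"model_id": "example-org/model-c", "license": None},
    {"model_id": "example-org/model-d", "license": "MIT"},
]
approved = filter_by_license(records, {"apache-2.0", "mit"})
print([r["model_id"] for r in approved])
```

Records with a missing or empty license are dropped rather than passed through, which is usually the safer default for a compliance review.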

How much does HuggingFace Scraper cost to run?

HuggingFace Scraper uses pay-per-event pricing at the default 1.0 coefficient. You pay per record saved, so cost scales linearly with the number of models returned: a 500-model run is billed for 500 records. No browser time, no proxy bill.


Need More Features?

Need additional model fields, a GGUF or safetensors filter, or model-card body extraction beyond the 500-character excerpt? File an issue or get in touch.

Why Use HuggingFace Scraper?

  • Direct API access — pulls structured JSON from the official Hub API, no HTML parsing, no breakage when the site redesigns
  • Enriched output — model-card README excerpt, Spaces count, and dataset references come from a second API call so each record carries more than just list-view metadata
  • Filter combinations — task + library + author + sort, all in one input, so you don't have to script the cartesian product yourself