HuggingFace Model Scraper

Scrape AI/ML model metadata from the HuggingFace Hub — over 1M public models. Returns model name, task type, download counts, likes, library, author, tags, license, parameter size, model-card README excerpt, Spaces count, and referenced datasets. Filter by task, library, author, or free-text search.

HuggingFace Scraper Features

Queries HuggingFace's public API directly. No HTML scraping, no authentication.
Filters by task type, ML library, author, or search query — combine them as needed
Sorts by downloads, likes, trending, or last modified
Enriches each model with a model-card README excerpt, Spaces count, and dataset references from the cardData YAML front matter
Extracts license identifier (apache-2.0, mit, cc-by-4.0, etc.) and parameter size (7B, 13B, 70B) from tag patterns
Handles cursor pagination via the API's Link response headers (RFC 8288, which obsoletes RFC 5988), so a 10,000-result run needs no manual paging
No proxies needed. The Hub is public.
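
As an illustration of the Link-header pagination mentioned above, here is a minimal Python sketch. It is not the scraper's actual code: the function names `next_page_url` and `walk_pages` are hypothetical, and the network call is passed in as a `fetch` callable so the loop can be read (and tested) without touching the Hub. The header parsing is deliberately simplified and only looks at the `rel` parameter.

```python
import re
from typing import Callable, List, Optional, Tuple

# Matches one `<url>; rel="name"` entry of a Link header (simplified:
# other link parameters such as `title` are not handled).
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="?([^",;]+)"?')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    for url, rel in _LINK_RE.findall(link_header):
        if rel.strip() == "next":
            return url
    return None


def walk_pages(
    fetch: Callable[[str], Tuple[List[dict], Optional[str]]],
    first_url: str,
    max_items: int = 0,
) -> List[dict]:
    """Collect items page by page.

    `fetch(url)` is a caller-supplied function (hypothetical here) that
    returns the page's items and the raw Link header value.
    `max_items=0` means no limit, mirroring the `maxItems` input.
    """
    results: List[dict] = []
    url: Optional[str] = first_url
    while url:
        items, link_header = fetch(url)
        results.extend(items)
        if max_items and len(results) >= max_items:
            return results[:max_items]
        url = next_page_url(link_header)
    return results
```

In a real run, `fetch` would wrap an HTTP GET and return the parsed JSON body together with the response's Link header.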

Who Uses HuggingFace Model Data?

ML engineers — find the most-downloaded models for a specific task or framework without browsing the Hub manually
AI tooling builders — feed model metadata into agent platforms, model routers, or evaluation harnesses
Researchers — track adoption signals (downloads, likes, Spaces count) across model families over time
Procurement and licensing teams — pull license identifiers across hundreds of models in one pass for compliance review
Market analysts — monitor which authors and organizations are gaining traction on the Hub

How HuggingFace Scraper Works

Pick your filters: task type (text-generation, image-classification), library (transformers, diffusers), author, or search query. Sort by downloads, likes, trending, or last modified.
The scraper hits the HuggingFace /api/models endpoint with your filters and walks every page of results using cursor pagination from the response Link header.
For each model in the list, a follow-up detail fetch pulls Spaces count, cardData datasets, and the README. The README is stripped of YAML front matter and truncated to a 500-character excerpt.
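
The README step can be sketched roughly as follows. This is an assumption about how such an excerpt might be produced, not the scraper's own implementation: the function name `readme_excerpt` is hypothetical, front matter is assumed to be delimited by `---` lines at the very top of the file, and the only rule taken from the description above is the 500-character cut.

```python
def readme_excerpt(readme: str, limit: int = 500) -> str:
    """Strip a leading YAML front matter block and truncate the rest.

    Assumes front matter starts on the first line with `---` and ends at
    the next `---` line. If no closing delimiter is found, the text is
    left untouched before truncation.
    """
    lines = readme.lstrip("\ufeff").splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                lines = lines[i + 1:]  # drop everything up to the closing delimiter
                break
    body = "\n".join(lines).strip()
    return body[:limit]
```

Truncating on a character count can cut a word or a Markdown link in half; a production version might prefer to cut at the last whitespace before the limit.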

Input

Top text-generation models by downloads

{
  "pipelineTag": "text-generation",
  "sortBy": "downloads",
  "maxItems": 50
}

All models from a single author

{
  "author": "meta-llama",
  "sortBy": "downloads",
  "maxItems": 100
}

Free-text search

{
  "searchQuery": "llama",
  "library": "transformers",
  "sortBy": "likes",
  "maxItems": 25
}

Field	Type	Default	Description
`searchQuery`	string	`""`	Free-text search across model names, authors, and descriptions. Empty means browse all.
`pipelineTag`	string	`""`	Filter by primary task type (text-generation, image-classification, automatic-speech-recognition, etc.). Empty means all tasks.
`library`	string	`""`	Filter by ML framework (transformers, diffusers, sentence-transformers, timm, etc.). Empty means all libraries.
`author`	string	`""`	Filter by author or organization (e.g. `meta-llama`, `google`, `microsoft`). Empty means all authors.
`sortBy`	string	`downloads`	One of `downloads`, `likes`, `lastModified`, `trending`.
`maxItems`	integer	`10`	Maximum models to return. Set to `0` for unlimited — though the Hub has 1M+ public models, so filters are recommended.
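
For readers preparing inputs programmatically, the table above can be expressed as a small validation helper. This is a sketch only: `normalize_input` is a hypothetical name, and the checks simply mirror the documented fields, defaults, and `sortBy` values rather than any validation the actor itself is known to perform.

```python
# Defaults and allowed values copied from the input table.
DEFAULTS = {
    "searchQuery": "",
    "pipelineTag": "",
    "library": "",
    "author": "",
    "sortBy": "downloads",
    "maxItems": 10,
}
ALLOWED_SORTS = {"downloads", "likes", "lastModified", "trending"}


def normalize_input(raw: dict) -> dict:
    """Fill in defaults and reject values the input table does not allow."""
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    cfg = {**DEFAULTS, **raw}
    if cfg["sortBy"] not in ALLOWED_SORTS:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORTS)}")
    max_items = cfg["maxItems"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 0:
        raise ValueError("maxItems must be a non-negative integer (0 = unlimited)")
    return cfg
```

For example, `normalize_input({"author": "meta-llama"})` returns the full configuration with `sortBy` set to `downloads` and `maxItems` set to `10`.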

HuggingFace Scraper Output Fields

{
  "model_name": "Meta-Llama-3-8B-Instruct",
  "model_id": "meta-llama/Meta-Llama-3-8B-Instruct",
  "pipeline_tag": "text-generation",
  "downloads_total": 4823910,
  "downloads_30d": 612400,
  "likes": 3812,
  "library": "transformers",
  "author": "meta-llama",
  "tags": ["text-generation", "conversational", "llama-3", "en"],
  "license": "llama3",
  "model_size_params": "8B",
  "last_modified": "2026-04-12T10:24:33.000Z",
  "readme_excerpt": "Meta Llama 3 is a family of large language models (LLMs) developed by Meta...",
  "spaces_count": 412,
  "datasets_used": ["meta-llama/Meta-Llama-3-eval"]
}

Field	Type	Description
`model_name`	string	Human-readable model name (without the author prefix)
`model_id`	string	Full model identifier in `author/model-name` format
`pipeline_tag`	string	Primary task type (text-generation, image-classification, etc.)
`downloads_total`	integer	All-time download count
`downloads_30d`	integer	Download count in the last 30 days
`likes`	integer	Number of likes on HuggingFace
`library`	string	Primary ML library (transformers, diffusers, etc.)
`author`	string	Model author or organization username
`tags`	string[]	Tags including language, dataset, and custom labels (license tags are stripped)
`license`	string	License identifier (apache-2.0, mit, cc-by-4.0, etc.)
`model_size_params`	string	Parameter count (7B, 13B, 70B, 175B) when present in tags
`last_modified`	string	ISO 8601 timestamp of the model's last update
`readme_excerpt`	string	First 500 characters of the model card README, YAML front matter stripped
`spaces_count`	integer	Number of HuggingFace Spaces that reference this model
`datasets_used`	string[]	Datasets declared in the model card's YAML front matter
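
The `license`, `model_size_params`, and `tags` fields are all described as derived from the Hub's tag list. The sketch below shows one way that split could work. It rests on two assumptions: that license tags use the Hub's `license:<id>` form, and that a parameter size appears as a standalone tag such as `7b` or `70B`. The exact patterns the scraper matches are not documented here, and `split_tags` is a hypothetical name.

```python
import re
from typing import Dict, List, Optional

# Assumed shape of a size tag: a number followed by B (billions) or M (millions).
_SIZE_RE = re.compile(r"^(\d+(?:\.\d+)?)([bm])$", re.IGNORECASE)


def split_tags(tags: List[str]) -> Dict[str, object]:
    """Pull license and parameter size out of a raw tag list.

    License tags are removed from the returned tag list; size tags are
    kept, since only license tags are described as stripped.
    """
    license_id: Optional[str] = None
    size: Optional[str] = None
    kept: List[str] = []
    for tag in tags:
        if tag.startswith("license:"):
            if license_id is None:
                license_id = tag.split(":", 1)[1]
            continue  # license tags never reach the output tag list
        match = _SIZE_RE.match(tag)
        if match and size is None:
            size = match.group(1) + match.group(2).upper()
        kept.append(tag)
    return {"license": license_id, "model_size_params": size, "tags": kept}
```

Both derived fields come back as `None` when no matching tag exists, so downstream code should treat them as optional.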

FAQ

How do I scrape HuggingFace?

HuggingFace Scraper queries the public Hub API directly: no API key, no login, and no rate-limit errors expected at the default settings. Set your filters and the actor handles pagination and enrichment.

Does HuggingFace Scraper need proxies?

HuggingFace Scraper runs without proxies. The Hub API is publicly accessible, and the actor stays well under the unauthenticated rate limit by waiting 100 ms between detail fetches.

What data does HuggingFace Scraper return?

HuggingFace Scraper returns 15 fields per model: model name, model ID, task, downloads (total and 30-day), likes, library, author, tags, license, parameter size, last-modified timestamp, README excerpt, Spaces count, and dataset references.

Can I filter HuggingFace models by license?

HuggingFace Scraper doesn't filter by license at the API level, but the license field is parsed from each model's tags. Run the scrape with your other filters and post-filter the dataset by license — apache-2.0, mit, or whatever the compliance review allows.
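
A post-filter over the exported records can be as small as the following sketch. It assumes the dataset has been loaded as a list of dictionaries shaped like the output sample above; `filter_by_license` is a hypothetical helper name, and records with a missing or empty `license` are dropped.

```python
from typing import Iterable, List


def filter_by_license(records: Iterable[dict], allowed: Iterable[str]) -> List[dict]:
    """Keep only records whose `license` is in the allowed set (case-insensitive)."""
    allowed_lower = {name.lower() for name in allowed}
    return [
        record
        for record in records
        if (record.get("license") or "").lower() in allowed_lower
    ]


# Example with records shaped like the scraper's output.
records = [
    {"model_id": "org-a/model-1", "license": "apache-2.0"},
    {"model_id": "org-b/model-2", "license": "llama3"},
    {"model_id": "org-c/model-3", "license": None},
]
permissive = filter_by_license(records, ["apache-2.0", "mit"])
print([r["model_id"] for r in permissive])
```

The model IDs in the example are made up for illustration.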

How much does HuggingFace Scraper cost to run?

HuggingFace Scraper uses pay-per-event pricing at the default 1.0 coefficient. You pay per record saved, so a 500-model run costs what 500 records cost. No browser time, no proxy bill.

Need More Features?

Need additional model fields, a GGUF or safetensors filter, or model-card body extraction beyond the 500-character excerpt? File an issue or get in touch.

Why Use HuggingFace Scraper?

Direct API access — pulls structured JSON from the official Hub API, no HTML parsing, no breakage when the site redesigns
Enriched output — model-card README excerpt, Spaces count, and dataset references come from a second API call so each record carries more than just list-view metadata
Filter combinations — task + library + author + sort, all in one input, so you don't have to script the cartesian product yourself
