HuggingFace Model Scraper - AI/ML Model Data
Scrape AI/ML model metadata from the HuggingFace Hub — over 1M public models. Returns model name, task type, download counts, likes, library, author, tags, license, parameter size, model-card README excerpt, Spaces count, and referenced datasets. Filter by task, library, author, or free-text search.
HuggingFace Scraper Features
- Queries HuggingFace's public API directly. No HTML scraping, no authentication.
- Filters by task type, ML library, author, or search query — combine them as needed
- Sorts by downloads, likes, trending, or last-modified, depending on which axis matters
- Enriches each model with a model-card README excerpt, Spaces count, and dataset references from the cardData YAML front matter
- Extracts license identifier (`apache-2.0`, `mit`, `cc-by-4.0`, etc.) and parameter size (`7B`, `13B`, `70B`) from tag patterns
- Handles cursor pagination via the API's RFC 5988 Link headers, so a 10K-result run walks itself
- No proxies needed. The Hub is public.
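As a rough illustration of the tag-based extraction mentioned in the list above, here is a minimal Python sketch. It is not the actor's actual code: the `license:` tag prefix and the size pattern (a number followed by `B`) are assumptions about how tags look, and `split_tags` is a hypothetical helper name.

```python
import re

# Assumption: license tags look like "license:apache-2.0".
LICENSE_PREFIX = "license:"
# Assumption: a size tag is a bare number followed by "b"/"B", e.g. "7b" or "1.5B".
SIZE_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)b$", re.IGNORECASE)


def split_tags(tags):
    """Pull a license id and a parameter size out of a raw tag list.

    License tags are dropped from the returned tag list; size tags are kept.
    """
    license_id = None
    size = None
    remaining = []
    for tag in tags:
        if tag.startswith(LICENSE_PREFIX):
            license_id = tag[len(LICENSE_PREFIX):]
            continue  # license tags do not appear in the output tags
        match = SIZE_PATTERN.match(tag)
        if match and size is None:
            size = match.group(1) + "B"  # normalise to an upper-case suffix
        remaining.append(tag)
    return {"license": license_id, "model_size_params": size, "tags": remaining}


result = split_tags(["text-generation", "license:apache-2.0", "7b", "en"])
print(result["license"], result["model_size_params"], result["tags"])
```

Real tag lists are messier than this pattern admits (sizes embedded in longer tags, for example), so treat the regular expression as a starting point.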
Who Uses HuggingFace Model Data?
- ML engineers — find the most-downloaded models for a specific task or framework without browsing the Hub manually
- AI tooling builders — feed model metadata into agent platforms, model routers, or evaluation harnesses
- Researchers — track adoption signals (downloads, likes, Spaces count) across model families over time
- Procurement and licensing teams — pull license identifiers across hundreds of models in one pass for compliance review
- Market analysts — monitor which authors and organizations are gaining traction on the Hub
How HuggingFace Scraper Works
- Pick your filters: task type (`text-generation`, `image-classification`), library (`transformers`, `diffusers`), author, or search query. Sort by downloads, likes, trending, or last modified.
- The scraper hits the HuggingFace `/api/models` endpoint with your filters and walks every page of results using cursor pagination from the response Link header.
- For each model in the list, a follow-up detail fetch pulls Spaces count, cardData datasets, and the README. The README is stripped of YAML front matter and truncated to a 500-character excerpt.
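The pagination step can be sketched roughly as follows. This is an illustrative Python outline, not the actor's implementation: `next_page_url` and `walk_pages` are hypothetical names, and `fetch` stands in for whatever HTTP call returns a page of models together with its Link header value.

```python
import re

# Matches one entry of a Link header: <url>; rel="name"
LINK_PART = re.compile(r'<([^>]+)>\s*;\s*rel="?([^";,]+)"?')


def next_page_url(link_header):
    """Return the rel="next" URL from a Link header value, or None."""
    if not link_header:
        return None
    for url, rel in LINK_PART.findall(link_header):
        if rel.strip() == "next":
            return url
    return None


def walk_pages(fetch, first_url, max_items=0):
    """Follow rel="next" links until they run out or max_items is reached.

    `fetch(url)` is a hypothetical callable returning (models, link_header).
    max_items=0 means no limit, mirroring the input option of the same name.
    """
    url, collected = first_url, []
    while url:
        models, link_header = fetch(url)
        collected.extend(models)
        if max_items and len(collected) >= max_items:
            return collected[:max_items]
        url = next_page_url(link_header)
    return collected
```

With the `requests` library, the header value would come from `response.headers.get("Link")`. Keeping `fetch` as a parameter makes the loop easy to exercise against canned pages without touching the network.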
Input
Top text-generation models by downloads
```json
{
  "pipelineTag": "text-generation",
  "sortBy": "downloads",
  "maxItems": 50
}
```
All models from a single author
```json
{
  "author": "meta-llama",
  "sortBy": "downloads",
  "maxItems": 100
}
```
Free-text search
```json
{
  "searchQuery": "llama",
  "library": "transformers",
  "sortBy": "likes",
  "maxItems": 25
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `searchQuery` | string | `""` | Free-text search across model names, authors, and descriptions. Empty means browse all. |
| `pipelineTag` | string | `""` | Filter by primary task type (`text-generation`, `image-classification`, `automatic-speech-recognition`, etc.). Empty means all tasks. |
| `library` | string | `""` | Filter by ML framework (`transformers`, `diffusers`, `sentence-transformers`, `timm`, etc.). Empty means all libraries. |
| `author` | string | `""` | Filter by author or organization (e.g. `meta-llama`, `google`, `microsoft`). |
| `sortBy` | string | `downloads` | One of `downloads`, `likes`, `lastModified`, `trending`. |
| `maxItems` | integer | `10` | Maximum models to return. Set to 0 for unlimited; the Hub has 1M+ public models, so filters are recommended. |
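To check an input object before starting a run, something like the sketch below may help. It only encodes the defaults and allowed values from the table above; the function names `normalize_input` and `active_filters` are hypothetical, and the actor itself may validate differently.

```python
ALLOWED_SORTS = {"downloads", "likes", "lastModified", "trending"}
DEFAULTS = {
    "searchQuery": "",
    "pipelineTag": "",
    "library": "",
    "author": "",
    "sortBy": "downloads",
    "maxItems": 10,
}


def normalize_input(raw):
    """Fill in documented defaults and reject values the table does not allow."""
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    cfg = {**DEFAULTS, **raw}
    if cfg["sortBy"] not in ALLOWED_SORTS:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORTS)}")
    max_items = cfg["maxItems"]
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 0:
        raise ValueError("maxItems must be a non-negative integer (0 = unlimited)")
    return cfg


def active_filters(cfg):
    """Return only the filters that are actually set (non-empty strings)."""
    keys = ("searchQuery", "pipelineTag", "library", "author")
    return {k: cfg[k] for k in keys if cfg[k]}


cfg = normalize_input({"pipelineTag": "text-generation", "maxItems": 50})
print(active_filters(cfg), cfg["sortBy"], cfg["maxItems"])
```

A run with `maxItems` set to 0 and no active filters would try to walk the whole Hub, so a check on `active_filters` is a cheap guard to add before launching.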
HuggingFace Scraper Output Fields
```json
{
  "model_name": "Meta-Llama-3-8B-Instruct",
  "model_id": "meta-llama/Meta-Llama-3-8B-Instruct",
  "pipeline_tag": "text-generation",
  "downloads_total": 4823910,
  "downloads_30d": 612400,
  "likes": 3812,
  "library": "transformers",
  "author": "meta-llama",
  "tags": ["text-generation", "conversational", "llama-3", "en"],
  "license": "llama3",
  "model_size_params": "8B",
  "last_modified": "2026-04-12T10:24:33.000Z",
  "readme_excerpt": "Meta Llama 3 is a family of large language models (LLMs) developed by Meta...",
  "spaces_count": 412,
  "datasets_used": ["meta-llama/Meta-Llama-3-eval"]
}
```
| Field | Type | Description |
|---|---|---|
| `model_name` | string | Human-readable model name (without the author prefix) |
| `model_id` | string | Full model identifier in `author/model-name` format |
| `pipeline_tag` | string | Primary task type (`text-generation`, `image-classification`, etc.) |
| `downloads_total` | integer | All-time download count |
| `downloads_30d` | integer | Download count in the last 30 days |
| `likes` | integer | Number of likes on HuggingFace |
| `library` | string | Primary ML library (`transformers`, `diffusers`, etc.) |
| `author` | string | Model author or organization username |
| `tags` | string[] | Tags including language, dataset, and custom labels (license tags are stripped) |
| `license` | string | License identifier (`apache-2.0`, `mit`, `cc-by-4.0`, etc.) |
| `model_size_params` | string | Parameter count (`7B`, `13B`, `70B`, `175B`) when present in tags |
| `last_modified` | string | ISO 8601 timestamp of the model's last update |
| `readme_excerpt` | string | First 500 characters of the model card README, YAML front matter stripped |
| `spaces_count` | integer | Number of HuggingFace Spaces that reference this model |
| `datasets_used` | string[] | Datasets declared in the model card's YAML front matter |
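For readers curious how a field like `readme_excerpt` could be derived, here is a minimal Python sketch. It assumes the front matter is a leading block delimited by `---` lines and that surrounding whitespace is trimmed before truncation; both are assumptions, and `readme_excerpt` here is a hypothetical function, not the actor's own code.

```python
import re

# A leading YAML front matter block: "---" line, content, closing "---" line.
FRONT_MATTER = re.compile(
    r"\A---[ \t]*\r?\n.*?\r?\n---[ \t]*(?:\r?\n|\Z)",
    re.DOTALL,
)


def readme_excerpt(readme, limit=500):
    """Strip leading YAML front matter, then keep the first `limit` characters."""
    body = FRONT_MATTER.sub("", readme, count=1)
    return body.strip()[:limit]  # trimming whitespace is an assumption


card = "---\nlicense: mit\ndatasets:\n- squad\n---\n# Model\nHello"
print(readme_excerpt(card))
```

Because the pattern is anchored at the start of the text, a `---` horizontal rule further down the README is left alone.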
FAQ
How do I scrape HuggingFace?
HuggingFace Scraper hits the public Hub API directly — no key, no login, no rate-limit pain at the default settings. Set your filters and the actor handles pagination and enrichment.
Does HuggingFace Scraper need proxies?
HuggingFace Scraper runs without proxies. The Hub API is publicly accessible and the actor stays well under the unauthenticated rate limit with a 100ms courtesy delay between detail fetches.
What data does HuggingFace Scraper return?
HuggingFace Scraper returns 15 fields per model: name, ID, task, downloads (total and 30-day), likes, library, author, tags, license, parameter size, last-modified timestamp, README excerpt, Spaces count, and dataset references.
Can I filter HuggingFace models by license?
HuggingFace Scraper doesn't filter by license at the API level, but the license field is parsed from each model's tags. Run the scrape with your other filters and post-filter the dataset by license — apache-2.0, mit, or whatever the compliance review allows.
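A post-filter along those lines might look like the following Python sketch. It assumes the dataset has been loaded as a list of dictionaries with the `license` field described in the output table; `filter_by_license` and the sample records are illustrative only.

```python
def filter_by_license(records, allowed):
    """Keep records whose license is in `allowed` (case-insensitive).

    Records with a missing or null license are dropped.
    """
    allowed_lower = {name.lower() for name in allowed}
    return [
        record
        for record in records
        if (record.get("license") or "").lower() in allowed_lower
    ]


# Illustrative records, not real scrape output.
records = [
    {"model_id": "org-a/model-1", "license": "apache-2.0"},
    {"model_id": "org-b/model-2", "license": "llama3"},
    {"model_id": "org-c/model-3", "license": None},
]
permitted = filter_by_license(records, {"apache-2.0", "mit"})
print([r["model_id"] for r in permitted])
```

Dropping records without a license is the conservative choice for a compliance review; flip the condition if unlicensed models should be flagged separately instead.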
How much does HuggingFace Scraper cost to run?
HuggingFace Scraper uses pay-per-event pricing at the default 1.0 coefficient. You pay per record saved, so a 500-model run costs what 500 records cost. No browser time, no proxy bill.
Need More Features?
Need additional model fields, a GGUF or safetensors filter, or model-card body extraction beyond the 500-character excerpt? File an issue or get in touch.
Why Use HuggingFace Scraper?
- Direct API access — pulls structured JSON from the official Hub API, no HTML parsing, no breakage when the site redesigns
- Enriched output — model-card README excerpt, Spaces count, and dataset references come from a second API call so each record carries more than just list-view metadata
- Filter combinations — task + library + author + sort, all in one input, so you don't have to script the cartesian product yourself