arXiv Scraper
Export preprints and papers from arXiv.org — the leading open-access repository for 2.5 million+ scientific papers across physics, mathematics, computer science, biology, economics, and quantitative finance.
This actor queries the official arXiv Atom API (export.arxiv.org/api/query), the interface arXiv supports for programmatic data access. It does no HTML scraping or JavaScript rendering and requires no account.
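To make the request shape concrete, here is a minimal sketch of how a query URL for that endpoint could be built. The function name `build_query_url` is hypothetical and the `https` scheme is an assumption; `search_query`, `start`, and `max_results` are the arXiv API's own query parameters. The actor's internal code is not shown in this document and may differ.

```python
from urllib.parse import urlencode

# Endpoint named in the text above; the https scheme is an assumption.
API_URL = "https://export.arxiv.org/api/query"


def build_query_url(search_query: str, start: int = 0, max_results: int = 10) -> str:
    """Return a GET URL for one page of results from the arXiv Atom API."""
    params = {
        "search_query": search_query,  # arXiv query expression, e.g. "cat:cs.AI"
        "start": start,                # zero-based offset of the first result
        "max_results": max_results,    # number of results in this page
    }
    # urlencode percent-encodes reserved characters such as ":".
    return f"{API_URL}?{urlencode(params)}"


print(build_query_url("cat:cs.AI", max_results=5))
```

The response to such a request is an Atom XML feed with one entry per paper.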
What you get
Each result includes:
- arxiv_id — the canonical short ID (e.g. 2301.12345)
- abs_url — link to the abstract page
- pdf_url — direct PDF download link
- title — full paper title
- abstract — complete abstract / summary
- authors — comma-separated author names
- primary_category — primary subject category (e.g. cs.AI)
- categories — all subject categories, comma-separated
- published — original submission date (ISO 8601)
- updated — date of the latest version
- comment — author notes (page count, conference, etc.) if available
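As an illustration of where these fields could come from, the sketch below maps one Atom entry onto the record shape listed above using only the Python standard library. The sample entry is made up (its ID reuses the placeholder from the list), the function name `parse_entry` is hypothetical, and the element-to-field mapping is an assumption about how the actor works rather than its confirmed implementation.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Made-up entry for illustration only; not a real paper.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:arxiv="http://arxiv.org/schemas/atom">
  <id>http://arxiv.org/abs/2301.12345v2</id>
  <updated>2023-02-10T12:00:00Z</updated>
  <published>2023-01-30T18:00:00Z</published>
  <title>An Example
  Title</title>
  <summary>A short example abstract.</summary>
  <author><name>Ada Example</name></author>
  <author><name>Bo Sample</name></author>
  <arxiv:comment>12 pages</arxiv:comment>
  <link title="pdf" href="http://arxiv.org/pdf/2301.12345v2" rel="related" type="application/pdf"/>
  <arxiv:primary_category term="cs.AI"/>
  <category term="cs.AI"/>
  <category term="cs.LG"/>
</entry>"""


def _text(entry, path):
    """Return whitespace-normalized text of the first match, or None."""
    node = entry.find(path, NS)
    if node is None or not node.text:
        return None
    return " ".join(node.text.split())


def parse_entry(entry):
    """Convert one Atom <entry> element into a flat record."""
    abs_url = _text(entry, "atom:id")
    # Drop the URL prefix and the trailing version suffix (v1, v2, ...).
    arxiv_id = re.sub(r"v\d+$", "", abs_url.split("/abs/", 1)[-1])
    pdf_url = next(
        (link.get("href") for link in entry.findall("atom:link", NS)
         if link.get("title") == "pdf"),
        None,
    )
    primary = entry.find("arxiv:primary_category", NS)
    names = entry.findall("atom:author/atom:name", NS)
    return {
        "arxiv_id": arxiv_id,
        "abs_url": abs_url,
        "pdf_url": pdf_url,
        "title": _text(entry, "atom:title"),
        "abstract": _text(entry, "atom:summary"),
        "authors": ", ".join(n.text.strip() for n in names if n.text),
        "primary_category": primary.get("term") if primary is not None else None,
        "categories": ", ".join(c.get("term", "") for c in entry.findall("atom:category", NS)),
        "published": _text(entry, "atom:published"),
        "updated": _text(entry, "atom:updated"),
        "comment": _text(entry, "arxiv:comment"),  # None when the author left no note
    }


record = parse_entry(ET.fromstring(SAMPLE_ENTRY))
print(record["arxiv_id"], record["authors"])
```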
Search query syntax
The searchQuery field supports arXiv's full query language:

| Pattern | Example | Meaning |
|---|---|---|
| Plain keyword | `machine learning` | Search across all metadata fields |
| Title | `ti:attention` | Papers with "attention" in the title |
| Author | `au:Hinton` | Papers by Hinton |
| Abstract | `abs:transformer` | Papers with "transformer" in the abstract |
| Category | `cat:cs.AI` | Papers in the cs.AI category |
| Boolean | `cat:cs.LG AND ti:diffusion` | Category AND title filter |
| Date range | `submittedDate:[202301010000 TO 202312312359]` | Papers from 2023 |

See the arXiv query language reference for the full syntax.
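Date-range expressions are easy to mistype by hand, so a small helper can generate them. The sketch below assumes the `YYYYMMDDHHMM` timestamp format shown in the date-range example and covers whole days from 00:00 to 23:59; the helper names `submitted_between` and `all_of` are hypothetical and not part of the actor.

```python
from datetime import date


def submitted_between(start: date, end: date) -> str:
    """Build a submittedDate range clause covering whole days, inclusive."""
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    return f"submittedDate:[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]"


def all_of(*clauses: str) -> str:
    """Join query clauses with the AND operator."""
    return " AND ".join(clauses)


query = all_of("ti:diffusion", submitted_between(date(2024, 1, 1), date(2024, 12, 31)))
print(query)
```

The resulting string can be passed as the searchQuery input.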
Common arXiv categories
| Category | Field |
|---|---|
| `cs.AI` | Artificial Intelligence |
| `cs.LG` | Machine Learning |
| `cs.CL` | Computation and Language (NLP) |
| `cs.CV` | Computer Vision |
| `physics.hep-th` | High Energy Physics Theory |
| `math.CO` | Combinatorics |
| `q-bio.NC` | Neurons and Cognition |
| `econ.GN` | General Economics |
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `searchQuery` | string | required | arXiv query expression |
| `maxItems` | integer | 50 | Maximum number of papers to return |
| `sortBy` | string | `submittedDate` | Sort field: `relevance`, `lastUpdatedDate`, `submittedDate` |
| `sortOrder` | string | `descending` | `ascending` or `descending` |
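The defaults and allowed values in the table can be expressed as a small validation step, which is handy when preparing inputs in your own scripts before starting a run. This is a sketch under the assumption that the table is complete; `normalize_input` is a hypothetical helper, and the actor's own validation may behave differently.

```python
ALLOWED_SORT_BY = {"relevance", "lastUpdatedDate", "submittedDate"}
ALLOWED_SORT_ORDER = {"ascending", "descending"}


def normalize_input(raw: dict) -> dict:
    """Apply the documented defaults and reject values outside the table."""
    query = str(raw.get("searchQuery", "")).strip()
    if not query:
        raise ValueError("searchQuery is required")

    max_items = raw.get("maxItems", 50)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 1:
        raise ValueError("maxItems must be a positive integer")

    sort_by = raw.get("sortBy", "submittedDate")
    if sort_by not in ALLOWED_SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT_BY)}")

    sort_order = raw.get("sortOrder", "descending")
    if sort_order not in ALLOWED_SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(ALLOWED_SORT_ORDER)}")

    return {
        "searchQuery": query,
        "maxItems": max_items,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }


print(normalize_input({"searchQuery": "au:LeCun"}))
```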
Usage examples
Fetch the 100 most recent cs.AI papers:
{
"searchQuery": "cat:cs.AI",
"maxItems": 100,
"sortBy": "submittedDate",
"sortOrder": "descending"
}
Find papers by a specific author:
{
"searchQuery": "au:LeCun",
"maxItems": 50,
"sortBy": "relevance"
}
Search for diffusion model papers from 2024:
{
"searchQuery": "ti:diffusion AND submittedDate:[202401010000 TO 202412312359]",
"maxItems": 200
}
Technical notes
- Uses the arXiv Atom API, arXiv's official programmatic interface
- Pagination is handled automatically; set maxItems to any number
- Rate-limited to ~1 request/second per arXiv usage guidelines
- No authentication required
- Results span all of arXiv's subject areas (2.5M+ papers total)
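The pagination and rate-limiting notes above can be sketched as a loop that splits maxItems into page-sized windows and pauses between requests. The page size of 100 and the names `page_windows` and `fetch_all` are assumptions for illustration; the one-second delay follows the figure quoted above. `fetch_page` is a caller-supplied function, so the sketch runs without network access.

```python
import time
from typing import Callable, Iterator


def page_windows(max_items: int, page_size: int = 100) -> Iterator[tuple[int, int]]:
    """Yield (start, size) pairs that together cover max_items results."""
    start = 0
    while start < max_items:
        size = min(page_size, max_items - start)
        yield start, size
        start += size


def fetch_all(
    fetch_page: Callable[[int, int], list],
    max_items: int,
    page_size: int = 100,          # assumed page size, not confirmed by the actor
    delay_seconds: float = 1.0,    # roughly one request per second
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Collect up to max_items results, pausing between page requests."""
    results = []
    for index, (start, size) in enumerate(page_windows(max_items, page_size)):
        if index > 0:
            sleep(delay_seconds)  # throttle every request after the first
        page = fetch_page(start, size)
        results.extend(page)
        if len(page) < size:
            break  # a short page means the result set is exhausted
    return results


# Stand-in for a real HTTP call: returns integers instead of papers.
def fake_page(start: int, size: int) -> list:
    return list(range(start, start + size))


print(list(page_windows(250)))
print(len(fetch_all(fake_page, 250, sleep=lambda seconds: None)))
```

Passing `sleep` as a parameter keeps the throttle easy to disable in tests while defaulting to a real pause in normal use.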