Project Gutenberg Ebook Scraper (Gutendex)
Scrape the full Project Gutenberg public-domain catalog via the Gutendex JSON API. Filter by search query, language, subject, author era, and minimum download count. Returns book metadata with direct EPUB, Kindle, plain-text, and HTML download URLs — built for AI training corpora, NLP datasets, and TTS pipelines.
What does this actor do?
Project Gutenberg hosts over 78,000 public-domain books that are legally free to download and use, including for commercial AI/ML training. This actor paginates the entire catalog through the Gutendex API, an open-source JSON web API for Project Gutenberg catalog metadata, and delivers clean, structured records with:
- Full text download URLs: direct links to `.txt`, `.epub`, `.mobi`, and `.html` versions
- Rich metadata: authors with birth/death years, translators, subjects, bookshelves, and language codes
- Copyright status flag: `"true"` / `"false"` / `null`, essential for corpus licensing decisions
- 30-day download count: a proxy for public interest and training-set priority
Every record is a flat JSON object with no nested arrays — ready to pipe into a vector database, fine-tuning pipeline, or document store.
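The pagination described above can be sketched roughly as follows. This is an illustrative sketch, not the actor's actual source: it assumes the Gutendex list endpoint at `https://gutendex.com/books/` returns pages with `results` and `next` keys, and the helper names `fetch_json` and `iter_books` are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

GUTENDEX_BOOKS_URL = "https://gutendex.com/books/"


def fetch_json(url: str, timeout: float = 90.0) -> dict:
    """Fetch one JSON page over HTTP using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def iter_books(start_url: str = GUTENDEX_BOOKS_URL,
               fetch: Callable[[str], dict] = fetch_json,
               max_items: Optional[int] = None) -> Iterator[dict]:
    """Yield raw book records, following each page's `next` link.

    `fetch` is injectable so the loop can be exercised without network access.
    """
    url = start_url
    yielded = 0
    while url:
        page = fetch(url)
        for book in page.get("results", []):
            if max_items is not None and yielded >= max_items:
                return
            yield book
            yielded += 1
        # Assumed: `next` is null (None) on the last page.
        url = page.get("next")
```

Writing the loop as a generator lets a `maxItems`-style cap stop fetching as soon as enough records have been produced.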
Use cases
- LLM training corpora: bulk-fetch plain-text URLs for thousands of public-domain works in a target language
- NLP datasets: filter by subject headings (e.g. `Philosophy`, `Science Fiction`) for domain-specific datasets
- TTS pipelines: filter by language code (`en`, `fr`, `de`, `ja`) and pull plain-text URLs for audio synthesis
- Academic research: enumerate works by author era using `authorYearStart` / `authorYearEnd`
- Digital library projects: mirror metadata for discovery interfaces
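For the corpus-building use cases, a downstream consumer might select plain-text URLs along these lines. This is a minimal sketch that assumes records shaped like the output documented in this README (pipe-separated `languages`, string-valued `copyright`); the function name `select_corpus_urls` is hypothetical.

```python
def select_corpus_urls(records, language="en"):
    """Return plain-text URLs for records flagged public domain in one language."""
    urls = []
    for rec in records:
        # `languages` is a pipe-separated string, e.g. "en" or "en | fr".
        langs = [code.strip() for code in rec.get("languages", "").split("|")]
        if rec.get("copyright") != "false":
            continue  # skip copyrighted ("true") and unknown (null) records
        if language not in langs:
            continue
        url = rec.get("textPlainUrl")
        if url:
            urls.append(url)
    return urls
```

Records with an unknown copyright status are skipped here on purpose; loosen that check only after reviewing the licensing of those works.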
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `maxItems` | integer | (empty) | Maximum number of books to return. Omit to fetch the full catalog (~78,000 books). |
| `search` | string | (empty) | Free-text search across titles, authors, and subjects. |
| `languages` | string | (empty) | Comma-separated ISO 639-1 codes, e.g. `en,fr,de`. Leave blank for all languages. |
| `topic` | string | (empty) | Topic or subject keyword filter, e.g. `science fiction`. |
| `authorYearStart` | integer | (empty) | Return only books whose authors were born after this year. |
| `authorYearEnd` | integer | (empty) | Return only books whose authors were born before this year. |
| `minDownloadCount` | integer | `0` | Return only books with at least this many 30-day downloads. |
| `sortBy` | string | `popular` | `popular` (most downloaded first) or `ascending` (by Gutenberg ID). |
Minimal run — top 10 most popular books:
{ "maxItems": 10 }
French classic literature — authors born before 1900:
{ "maxItems": 500, "languages": "fr", "authorYearEnd": 1900, "sortBy": "popular" }
Full corpus, high-demand English books only:
{ "languages": "en", "minDownloadCount": 100 }
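One plausible way these inputs translate into a Gutendex request is sketched below. The mapping is an assumption, not the actor's confirmed behavior: it relies on the Gutendex query parameters `search`, `languages`, `topic`, `author_year_start`, `author_year_end`, and `sort`, and it treats `maxItems` and `minDownloadCount` as client-side filters because Gutendex has no matching parameter as far as this sketch assumes. Both function names are hypothetical.

```python
from urllib.parse import urlencode


def build_gutendex_url(actor_input: dict) -> str:
    """Translate actor input fields into a Gutendex list URL (assumed mapping)."""
    params = {}
    if actor_input.get("search"):
        params["search"] = actor_input["search"]
    if actor_input.get("languages"):
        # Drop stray spaces so "en, fr" becomes "en,fr".
        params["languages"] = actor_input["languages"].replace(" ", "")
    if actor_input.get("topic"):
        params["topic"] = actor_input["topic"]
    if actor_input.get("authorYearStart") is not None:
        params["author_year_start"] = actor_input["authorYearStart"]
    if actor_input.get("authorYearEnd") is not None:
        params["author_year_end"] = actor_input["authorYearEnd"]
    sort = actor_input.get("sortBy", "popular")
    if sort != "popular":  # popularity is assumed to be the API's default order
        params["sort"] = sort
    query = urlencode(params)
    return "https://gutendex.com/books/" + ("?" + query if query else "")


def passes_min_downloads(book: dict, min_count: int = 0) -> bool:
    """Client-side check on a raw Gutendex record's `download_count`."""
    return book.get("download_count", 0) >= min_count
```

With the French example above, `build_gutendex_url` yields a URL carrying only `languages` and `author_year_end`, since `maxItems` and the default sort order add no query parameters.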
Output
Each result is a flat JSON object:
{
"gutenbergId": "1342",
"title": "Pride and Prejudice",
"authors": "Austen, Jane",
"authorBirthYears": "1775",
"authorDeathYears": "1817",
"translators": "",
"subjects": "Domestic fiction | England -- Social life and customs -- 19th century -- Fiction | Love stories",
"bookshelves": "Best Books Ever Listings | Harvard Classics",
"languages": "en",
"copyright": "false",
"mediaType": "Text",
"downloadCount": "52793",
"coverImageUrl": "https://www.gutenberg.org/cache/epub/1342/pg1342.cover.medium.jpg",
"epubUrl": "https://www.gutenberg.org/ebooks/1342.epub3.images",
"kindleUrl": "https://www.gutenberg.org/ebooks/1342.kf8.images",
"textPlainUrl": "https://www.gutenberg.org/ebooks/1342.txt.utf-8",
"htmlUrl": "https://www.gutenberg.org/ebooks/1342.html.images",
"gutendexUrl": "https://gutendex.com/books/1342/",
"scrapedAt": "2026-05-29T02:00:00.000Z"
}
| Field | Description |
|---|---|
| `gutenbergId` | Project Gutenberg numeric ID (as string) |
| `title` | Full book title |
| `authors` | Pipe-separated author names in Last, First format |
| `authorBirthYears` | Pipe-separated birth years (same order as authors) |
| `authorDeathYears` | Pipe-separated death years |
| `translators` | Pipe-separated translator names |
| `subjects` | Pipe-separated Library of Congress subject headings |
| `bookshelves` | Pipe-separated Gutenberg category tags |
| `languages` | Pipe-separated ISO 639-1 language codes |
| `copyright` | `"true"` / `"false"` / `null` |
| `mediaType` | Typically `"Text"` |
| `downloadCount` | 30-day download count (as string) |
| `coverImageUrl` | Direct URL to cover image (JPEG) |
| `epubUrl` | Direct URL to EPUB file |
| `kindleUrl` | Direct URL to Kindle (MOBI) file |
| `textPlainUrl` | Direct URL to plain-text .txt file |
| `htmlUrl` | Direct URL to HTML version |
| `gutendexUrl` | Gutendex API URL for this specific record |
| `scrapedAt` | ISO-8601 scrape timestamp |
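To illustrate how a nested Gutendex record could map onto the flat fields above, here is a hedged sketch. It assumes raw records carry `id`, `authors` (with `name`, `birth_year`, `death_year`), `translators`, `subjects`, `bookshelves`, `languages`, `copyright`, `media_type`, `download_count`, and a `formats` mapping keyed by MIME type. The helpers `pipe_join`, `pick_format`, and `flatten_book` are hypothetical, and the MIME-prefix matching is a guess at how the URLs are chosen.

```python
from datetime import datetime, timezone


def pipe_join(values) -> str:
    """Join values with ' | ', keeping empty slots so positions stay aligned."""
    return " | ".join("" if v is None else str(v) for v in values)


def pick_format(formats: dict, mime_prefix: str) -> str:
    """Return the first URL whose MIME type starts with the prefix, else ''."""
    for mime, url in formats.items():
        if mime.startswith(mime_prefix):
            return url
    return ""


def flatten_book(book: dict) -> dict:
    authors = book.get("authors", [])
    formats = book.get("formats", {})
    flag = book.get("copyright")
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "gutenbergId": str(book["id"]),
        "title": book.get("title", ""),
        "authors": pipe_join(a.get("name") for a in authors),
        "authorBirthYears": pipe_join(a.get("birth_year") for a in authors),
        "authorDeathYears": pipe_join(a.get("death_year") for a in authors),
        "translators": pipe_join(t.get("name") for t in book.get("translators", [])),
        "subjects": pipe_join(book.get("subjects", [])),
        "bookshelves": pipe_join(book.get("bookshelves", [])),
        "languages": pipe_join(book.get("languages", [])),
        # Booleans become "true"/"false"; an unknown status stays None (null).
        "copyright": None if flag is None else str(flag).lower(),
        "mediaType": book.get("media_type", ""),
        "downloadCount": str(book.get("download_count", 0)),
        "coverImageUrl": pick_format(formats, "image/jpeg"),
        "epubUrl": pick_format(formats, "application/epub+zip"),
        "kindleUrl": pick_format(formats, "application/x-mobipocket-ebook"),
        "textPlainUrl": pick_format(formats, "text/plain"),
        "htmlUrl": pick_format(formats, "text/html"),
        "gutendexUrl": f"https://gutendex.com/books/{book['id']}/",
        "scrapedAt": now.replace("+00:00", "Z"),
    }
```

Keeping an empty slot for a missing birth or death year preserves the "same order as authors" guarantee when a book has several authors.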
Performance and cost
- No proxy required: Gutendex is a free public API with no authentication or IP restrictions.
- Rate limiting: No documented cap. The actor uses a 300ms inter-page delay as a courtesy.
- Memory: 256 MB is sufficient for any run size.
- Full catalog run: ~78,000 books across ~2,440 pages at 32 books/page.
- PPE (pay-per-event) pricing: charged per result record via the standard profile.
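As a rough back-of-the-envelope check on a full catalog run, the page count and the time spent purely on the courtesy delay follow from the figures above. The numbers are the approximate ones stated in this README; actual run time also depends on how quickly Gutendex answers each page.

```python
import math

CATALOG_SIZE = 78_000   # approximate number of books
PAGE_SIZE = 32          # books per Gutendex page
PAGE_DELAY_S = 0.3      # courtesy delay between pages, in seconds

# 78,000 / 32 = 2437.5, so a full run needs 2,438 page requests.
pages = math.ceil(CATALOG_SIZE / PAGE_SIZE)

# Time spent waiting between pages alone, ignoring response time.
delay_minutes = pages * PAGE_DELAY_S / 60

print(f"{pages} pages, about {delay_minutes:.1f} minutes of inter-page delay")
```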
Notes
- Gutendex occasionally responds slowly (10-30 seconds per page). The actor uses a 90-second timeout with automatic retries — transient slowness does not cause failures.
- The `copyright: "false"` flag indicates the work is in the public domain in the United States. Verify licensing for your specific jurisdiction and use case before commercial use.
- Download URLs point directly to the `gutenberg.org` CDN. The actor does not download the book files themselves; it returns the URLs for downstream processing.
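The timeout-and-retry behavior mentioned in the notes could look something like the sketch below. Only the 90-second timeout comes from this README; the attempt count, the exponential backoff, and the name `fetch_with_retries` are assumptions made for illustration.

```python
import time
from typing import Callable


def fetch_with_retries(fetch: Callable[[str], dict], url: str,
                       max_attempts: int = 3, base_delay_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `fetch(url)`, retrying on timeouts and other network errors.

    `fetch` is expected to enforce its own timeout (e.g. 90 seconds).
    Waits base_delay_s, then twice that, and so on between attempts.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError as err:  # covers TimeoutError and urllib's URLError
            last_error = err
            if attempt < max_attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError(
        f"giving up on {url} after {max_attempts} attempts"
    ) from last_error
```

Passing `sleep` as a parameter keeps the helper easy to test and lets callers swap in a different wait strategy.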