Project Gutenberg Ebook Scraper (Gutendex)

Scrape the full Project Gutenberg public-domain catalog via the Gutendex JSON API. Filter by search query, language, subject, author era, and minimum download count. Returns book metadata with direct EPUB, Kindle, plain-text, and HTML download URLs — built for AI training corpora, NLP datasets, and TTS pipelines.

What does this actor do?

Project Gutenberg hosts over 78,000 public-domain books that are legally free to download and use, including for commercial AI/ML training. This actor paginates the entire catalog through the Gutendex API, an open-source JSON web API for Project Gutenberg's catalog metadata (a third-party project, not an official Project Gutenberg service), and delivers clean, structured records with:

Full text download URLs: direct links to .txt, .epub, .mobi, and .html versions
Rich metadata: authors with birth/death years, translators, subjects, bookshelves, and language codes
Copyright status flag: "true"/"false"/null — essential for corpus licensing decisions
30-day download count: a proxy for public interest and training-set priority

Every record is a flat JSON object with no nested arrays — ready to pipe into a vector database, fine-tuning pipeline, or document store.
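The pagination described above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual source: `iter_books` and `fetch_json` are hypothetical names, and the sketch relies only on the `results` and `next` keys that Gutendex returns on each page. The 0.3-second pause and 90-second timeout mirror the figures quoted later in this document.

```python
import json
import time
from typing import Callable, Iterator, Optional
from urllib.request import urlopen

GUTENDEX_BOOKS_URL = "https://gutendex.com/books/"


def fetch_json(url: str, timeout: float = 90.0) -> dict:
    """Fetch one Gutendex page and decode it as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def iter_books(
    start_url: str = GUTENDEX_BOOKS_URL,
    max_items: Optional[int] = None,
    fetch: Callable[[str], dict] = fetch_json,  # injectable for testing
    delay_seconds: float = 0.3,
) -> Iterator[dict]:
    """Yield raw Gutendex book objects, following `next` links page by page."""
    if max_items is not None and max_items <= 0:
        return
    url: Optional[str] = start_url
    yielded = 0
    while url:
        page = fetch(url)
        for book in page.get("results", []):
            yield book
            yielded += 1
            if max_items is not None and yielded >= max_items:
                return  # stop before requesting another page
        url = page.get("next")  # None on the last page
        if url:
            time.sleep(delay_seconds)  # courtesy pause between pages
```

Passing `fetch` as a parameter keeps the loop testable without network access; in real use the default `fetch_json` would be left in place.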

Use cases

LLM training corpora: bulk-fetch plain-text URLs for thousands of public-domain works in a target language
NLP datasets: filter by subject headings (e.g. Philosophy, Science Fiction) for domain-specific datasets
TTS pipelines: filter by language code (en, fr, de, ja) and pull plain-text URLs for audio synthesis
Academic research: enumerate works by author era using authorYearStart/authorYearEnd
Digital library projects: mirror metadata for discovery interfaces

Input

Field	Type	Default	Description
`maxItems`	integer	(none)	Maximum number of books to return. Optional: omit to fetch the full catalog (~78,000 books).
`search`	string	(empty)	Free-text search across titles, authors, and subjects.
`languages`	string	(empty)	Comma-separated ISO 639-1 codes, e.g. `en,fr,de`. Leave blank for all languages.
`topic`	string	(empty)	Topic or subject keyword filter, e.g. `science fiction`.
`authorYearStart`	integer	(empty)	Return only books with at least one author alive in or after this year (Gutendex `author_year_start`).
`authorYearEnd`	integer	(empty)	Return only books with at least one author alive in or before this year (Gutendex `author_year_end`).
`minDownloadCount`	integer	`0`	Return only books with at least this many 30-day downloads.
`sortBy`	string	`popular`	`popular` (most downloaded first) or `ascending` (by Gutenberg ID).

Minimal run — top 10 most popular books:

{ "maxItems": 10 }

French classic literature — authors born before 1900:

{ "maxItems": 500, "languages": "fr", "authorYearEnd": 1900, "sortBy": "popular" }

Full corpus, high-demand English books only:

{ "languages": "en", "minDownloadCount": 100 }
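To make the relationship between these inputs and the underlying API concrete, here is a hedged sketch of how the input fields might translate into a Gutendex request URL. The query parameter names (`search`, `languages`, `topic`, `author_year_start`, `author_year_end`, `sort`) are Gutendex's own; the mapping table and the `build_gutendex_url` helper are hypothetical, and the assumption that `maxItems` and `minDownloadCount` are applied on the client side (Gutendex has no matching parameters) is not confirmed by the actor's source.

```python
from urllib.parse import urlencode

# Hypothetical mapping from actor input fields to Gutendex query parameters.
# maxItems and minDownloadCount are assumed to be handled client-side.
PARAM_MAP = {
    "search": "search",
    "languages": "languages",
    "topic": "topic",
    "authorYearStart": "author_year_start",
    "authorYearEnd": "author_year_end",
    "sortBy": "sort",
}


def build_gutendex_url(actor_input: dict, base_url: str = "https://gutendex.com/books/") -> str:
    """Translate an actor input object into a Gutendex list URL."""
    params = {}
    for input_key, api_key in PARAM_MAP.items():
        value = actor_input.get(input_key)
        if value not in (None, ""):  # skip unset and blank fields
            params[api_key] = value
    return f"{base_url}?{urlencode(params)}" if params else base_url


french = {"maxItems": 500, "languages": "fr", "authorYearEnd": 1900, "sortBy": "popular"}
print(build_gutendex_url(french))
# https://gutendex.com/books/?languages=fr&author_year_end=1900&sort=popular
```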

Output

Each result is a flat JSON object:

{
    "gutenbergId": "1342",
    "title": "Pride and Prejudice",
    "authors": "Austen, Jane",
    "authorBirthYears": "1775",
    "authorDeathYears": "1817",
    "translators": "",
    "subjects": "Domestic fiction | England -- Social life and customs -- 19th century -- Fiction | Love stories",
    "bookshelves": "Best Books Ever Listings | Harvard Classics",
    "languages": "en",
    "copyright": "false",
    "mediaType": "Text",
    "downloadCount": "52793",
    "coverImageUrl": "https://www.gutenberg.org/cache/epub/1342/pg1342.cover.medium.jpg",
    "epubUrl": "https://www.gutenberg.org/ebooks/1342.epub3.images",
    "kindleUrl": "https://www.gutenberg.org/ebooks/1342.kf8.images",
    "textPlainUrl": "https://www.gutenberg.org/ebooks/1342.txt.utf-8",
    "htmlUrl": "https://www.gutenberg.org/ebooks/1342.html.images",
    "gutendexUrl": "https://gutendex.com/books/1342/",
    "scrapedAt": "2026-05-29T02:00:00.000Z"
}

Field	Description
`gutenbergId`	Project Gutenberg numeric ID (as string)
`title`	Full book title
`authors`	Pipe-separated author names in `Last, First` format
`authorBirthYears`	Pipe-separated birth years (same order as authors)
`authorDeathYears`	Pipe-separated death years
`translators`	Pipe-separated translator names
`subjects`	Pipe-separated Library of Congress subject headings
`bookshelves`	Pipe-separated Gutenberg category tags
`languages`	Pipe-separated ISO 639-1 language codes
`copyright`	`"true"` / `"false"` / `null`
`mediaType`	Typically `"Text"`
`downloadCount`	30-day download count (as string)
`coverImageUrl`	Direct URL to cover image (JPEG)
`epubUrl`	Direct URL to EPUB file
`kindleUrl`	Direct URL to Kindle file (KF8 in the example above, not legacy `.mobi`)
`textPlainUrl`	Direct URL to plain-text `.txt` file
`htmlUrl`	Direct URL to HTML version
`gutendexUrl`	Gutendex API URL for this specific record
`scrapedAt`	ISO-8601 scrape timestamp
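Because numbers arrive as strings and multi-valued fields are pipe-separated, downstream code usually needs a small parsing step. The sketch below shows one way to do it; `parse_record`, `split_pipes`, and `to_int` are hypothetical helper names, and the handling of blank slots (an empty entry becomes `None`) is an assumption, since the source does not say how missing birth or death years are encoded.

```python
from itertools import zip_longest
from typing import List, Optional


def split_pipes(value: Optional[str]) -> List[str]:
    """Split a pipe-separated field, keeping empty slots so positions stay aligned."""
    if not value:
        return []
    return [part.strip() for part in value.split("|")]


def to_int(value: Optional[str]) -> Optional[int]:
    """Convert a numeric string to int; return None for blanks or non-numbers."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def parse_record(record: dict) -> dict:
    """Turn one flat output record into typed, structured values."""
    authors = [
        {"name": name, "birthYear": to_int(born), "deathYear": to_int(died)}
        for name, born, died in zip_longest(
            split_pipes(record.get("authors")),
            split_pipes(record.get("authorBirthYears")),
            split_pipes(record.get("authorDeathYears")),
            fillvalue="",
        )
    ]
    return {
        "gutenbergId": to_int(record.get("gutenbergId")),
        "title": record.get("title"),
        "authors": authors,
        "subjects": split_pipes(record.get("subjects")),
        "languages": split_pipes(record.get("languages")),
        # "true"/"false" become booleans; anything else (including null) becomes None
        "copyright": {"true": True, "false": False}.get(record.get("copyright")),
        "downloadCount": to_int(record.get("downloadCount")),
        "textPlainUrl": record.get("textPlainUrl") or None,
    }
```

Note that subject headings contain ` -- ` internally, so splitting on the pipe character alone keeps each heading intact.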

Performance and cost

No proxy required: Gutendex is a free public API with no authentication or IP restrictions.
Rate limiting: No documented cap. The actor uses a 300 ms inter-page delay as a courtesy.
Memory: 256 MB is sufficient for any run size.
Full catalog run: ~78,000 books across ~2,440 pages at 32 books/page.
PPE (pay-per-event) pricing: charged per result record via the standard profile.
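The full-catalog figures above follow from simple arithmetic, sketched here. `estimate_run` is a hypothetical helper; the result is a lower bound on run time because it counts only the inter-page delay and ignores how long Gutendex takes to answer each request.

```python
import math
from typing import Tuple


def estimate_run(catalog_size: int, page_size: int = 32, delay_seconds: float = 0.3) -> Tuple[int, float]:
    """Return (page count, total seconds spent in inter-page delays)."""
    pages = math.ceil(catalog_size / page_size)
    # One delay between each pair of consecutive pages, none after the last.
    delay_total = max(pages - 1, 0) * delay_seconds
    return pages, delay_total


pages, delay_total = estimate_run(78_000)
print(pages)                        # 2438
print(round(delay_total / 60, 1))   # 12.2 (minutes of deliberate waiting)
```

So a full run needs about 2,438 requests, which the text above rounds, and spends roughly 12 minutes in courtesy delays before any response time is counted.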

Notes

Gutendex occasionally responds slowly (10-30 seconds per page). The actor uses a 90-second timeout with automatic retries, so transient slowness should not cause a run to fail.
The copyright: "false" flag indicates the work is in the public domain in the United States. Verify licensing for your specific jurisdiction and use case before commercial use.
Download URLs point directly to the gutenberg.org CDN. The actor does not download the book files themselves — it returns the URLs for downstream processing.
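The timeout-and-retry behaviour mentioned in the first note can be approximated with a small wrapper like the one below. This is a sketch under stated assumptions: the source gives only the 90-second timeout, so the attempt count, the exponential backoff, and the `with_retries` name are all illustrative rather than the actor's actual settings.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,            # assumed; the real retry count is not documented
    base_delay: float = 1.0,      # assumed backoff base, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on timeouts and other network-level errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return call()
        except OSError as error:  # covers TimeoutError and urllib's URLError
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
    assert last_error is not None
    raise last_error
```

A page request would then be wrapped as `with_retries(lambda: urlopen(url, timeout=90).read())`, with `urlopen` imported from `urllib.request`.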
