OrbTop

Project Gutenberg Ebook Scraper (Gutendex)

Scrape the full Project Gutenberg public-domain catalog via the Gutendex JSON API. Filter by search query, language, subject, author era, and minimum download count. Returns book metadata with direct EPUB, Kindle, plain-text, and HTML download URLs — built for AI training corpora, NLP datasets, and TTS pipelines.

What does this actor do?

Project Gutenberg hosts over 78,000 books that are in the public domain in the United States and free to download and use, including for commercial AI/ML training. This actor paginates the entire catalog through Gutendex, a JSON web API for Project Gutenberg ebook metadata, and delivers clean, structured records with:

  • Full text download URLs: direct links to .txt, .epub, .mobi, and .html versions
  • Rich metadata: authors with birth/death years, translators, subjects, bookshelves, and language codes
  • Copyright status flag: "true"/"false"/null — essential for corpus licensing decisions
  • 30-day download count: a proxy for public interest and training-set priority

Every record is a flat JSON object with no nested arrays — ready to pipe into a vector database, fine-tuning pipeline, or document store.
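
As a rough illustration of what "flat" means here, the sketch below turns one Gutendex book object into a record shaped like the actor's output. It is an assumption about how such a mapping could work, not the actor's actual code: the helper names (flatten_book, FORMAT_FIELDS) and the MIME-type matching rule are hypothetical, while the input keys (id, authors, formats, download_count and so on) follow the Gutendex book format.

```python
from typing import Any, Dict, Iterable

# Hypothetical mapping from Gutendex MIME-type prefixes to output field names.
FORMAT_FIELDS = {
    "application/epub+zip": "epubUrl",
    "application/x-mobipocket-ebook": "kindleUrl",
    "text/plain": "textPlainUrl",
    "text/html": "htmlUrl",
    "image/jpeg": "coverImageUrl",
}


def _join(values: Iterable[Any]) -> str:
    """Join values with ' | ', rendering missing values as empty strings."""
    return " | ".join("" if v is None else str(v) for v in values)


def flatten_book(book: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one Gutendex book object into a record with scalar values only."""
    authors = book.get("authors", [])
    copyright_flag = book.get("copyright")
    record = {
        "gutenbergId": str(book["id"]),
        "title": book.get("title", ""),
        "authors": _join(a.get("name") for a in authors),
        "authorBirthYears": _join(a.get("birth_year") for a in authors),
        "authorDeathYears": _join(a.get("death_year") for a in authors),
        "translators": _join(t.get("name") for t in book.get("translators", [])),
        "subjects": _join(book.get("subjects", [])),
        "bookshelves": _join(book.get("bookshelves", [])),
        "languages": _join(book.get("languages", [])),
        # Keep null as None; render booleans as "true" / "false".
        "copyright": None if copyright_flag is None else str(copyright_flag).lower(),
        "mediaType": book.get("media_type", ""),
        "downloadCount": str(book.get("download_count", 0)),
    }
    # Format keys may carry a charset suffix, so match on the MIME prefix.
    for mime, url in book.get("formats", {}).items():
        for prefix, field in FORMAT_FIELDS.items():
            if mime.startswith(prefix):
                record.setdefault(field, url)
    return record
```

Joining lists into delimited strings keeps every value scalar, which is what lets a record drop straight into a CSV row or a document store without further reshaping.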

Use cases

  • LLM training corpora: bulk-fetch plain-text URLs for thousands of public-domain works in a target language
  • NLP datasets: filter by subject headings (e.g. Philosophy, Science Fiction) for domain-specific datasets
  • TTS pipelines: filter by language code (en, fr, de, ja) and pull plain-text URLs for audio synthesis
  • Academic research: enumerate works by author era using authorYearStart/authorYearEnd
  • Digital library projects: mirror metadata for discovery interfaces

Input

Field | Type | Default | Description
maxItems | integer | (empty) | Maximum number of books to return. Omit to fetch the full catalog (~78,000 books).
search | string | (empty) | Free-text search across titles and author names.
languages | string | (empty) | Comma-separated ISO 639-1 codes, e.g. en,fr,de. Leave blank for all languages.
topic | string | (empty) | Keyword filter matched against subjects and bookshelves, e.g. science fiction.
authorYearStart | integer | (empty) | Return only books with at least one author alive in or after this year.
authorYearEnd | integer | (empty) | Return only books with at least one author alive in or before this year.
minDownloadCount | integer | 0 | Return only books with at least this many 30-day downloads.
sortBy | string | popular | popular (most downloaded first) or ascending (by Gutenberg ID).
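
For orientation, the input fields above line up closely with Gutendex query parameters. The sketch below shows one way that translation could look; it is an assumption about the mapping, not the actor's source, and the function name build_gutendex_url is hypothetical. The parameter names (search, languages, topic, author_year_start, author_year_end, sort) are the ones Gutendex documents. Gutendex does not appear to offer a query parameter for a download-count threshold or a result cap, so minDownloadCount and maxItems are presumably applied by the actor after fetching.

```python
from urllib.parse import urlencode

GUTENDEX_BOOKS_URL = "https://gutendex.com/books/"


def build_gutendex_url(actor_input: dict) -> str:
    """Translate actor input fields into a Gutendex /books/ query URL."""
    params = {}
    if actor_input.get("search"):
        params["search"] = actor_input["search"]
    if actor_input.get("languages"):
        # Normalise "en, fr" style input to "en,fr".
        codes = [c.strip() for c in actor_input["languages"].split(",") if c.strip()]
        params["languages"] = ",".join(codes)
    if actor_input.get("topic"):
        params["topic"] = actor_input["topic"]
    if actor_input.get("authorYearStart") is not None:
        params["author_year_start"] = actor_input["authorYearStart"]
    if actor_input.get("authorYearEnd") is not None:
        params["author_year_end"] = actor_input["authorYearEnd"]
    # "popular" is the Gutendex default ordering, so only send a non-default sort.
    if actor_input.get("sortBy", "popular") == "ascending":
        params["sort"] = "ascending"
    query = urlencode(params)
    return f"{GUTENDEX_BOOKS_URL}?{query}" if query else GUTENDEX_BOOKS_URL


print(build_gutendex_url({"maxItems": 500, "languages": "fr", "authorYearEnd": 1900}))
# https://gutendex.com/books/?languages=fr&author_year_end=1900
```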

Minimal run — top 10 most popular books:

{ "maxItems": 10 }

French classic literature — authors born before 1900:

{ "maxItems": 500, "languages": "fr", "authorYearEnd": 1900, "sortBy": "popular" }

Full corpus, high-demand English books only:

{ "languages": "en", "minDownloadCount": 100 }

Output

Each result is a flat JSON object:

{
    "gutenbergId": "1342",
    "title": "Pride and Prejudice",
    "authors": "Austen, Jane",
    "authorBirthYears": "1775",
    "authorDeathYears": "1817",
    "translators": "",
    "subjects": "Domestic fiction | England -- Social life and customs -- 19th century -- Fiction | Love stories",
    "bookshelves": "Best Books Ever Listings | Harvard Classics",
    "languages": "en",
    "copyright": "false",
    "mediaType": "Text",
    "downloadCount": "52793",
    "coverImageUrl": "https://www.gutenberg.org/cache/epub/1342/pg1342.cover.medium.jpg",
    "epubUrl": "https://www.gutenberg.org/ebooks/1342.epub3.images",
    "kindleUrl": "https://www.gutenberg.org/ebooks/1342.kf8.images",
    "textPlainUrl": "https://www.gutenberg.org/ebooks/1342.txt.utf-8",
    "htmlUrl": "https://www.gutenberg.org/ebooks/1342.html.images",
    "gutendexUrl": "https://gutendex.com/books/1342/",
    "scrapedAt": "2026-05-29T02:00:00.000Z"
}

Field | Description
gutenbergId | Project Gutenberg numeric ID (as string)
title | Full book title
authors | Pipe-separated author names in Last, First format
authorBirthYears | Pipe-separated birth years (same order as authors)
authorDeathYears | Pipe-separated death years (same order as authors)
translators | Pipe-separated translator names
subjects | Pipe-separated Library of Congress subject headings
bookshelves | Pipe-separated Gutenberg category tags
languages | Pipe-separated ISO 639-1 language codes
copyright | "true" / "false" / null (null when the status is unknown)
mediaType | Typically "Text"
downloadCount | 30-day download count (as string)
coverImageUrl | Direct URL to cover image (JPEG)
epubUrl | Direct URL to EPUB file
kindleUrl | Direct URL to Kindle file (KF8 in the example above)
textPlainUrl | Direct URL to plain-text .txt file
htmlUrl | Direct URL to HTML version
gutendexUrl | Gutendex API URL for this specific record
scrapedAt | ISO-8601 scrape timestamp
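
Because multi-valued fields are pipe-separated strings and numbers arrive as strings, downstream code usually needs a small parsing step. The following is a minimal sketch of that step, assuming the " | " separator and the string-encoded values shown in the example record; the helper names (split_field, parse_record, select_text_urls) are illustrative and not part of the actor.

```python
from typing import Any, Dict, Iterable, List, Optional

PIPE_FIELDS = ("authors", "translators", "subjects", "bookshelves", "languages")


def split_field(value: Optional[str]) -> List[str]:
    """Split a pipe-separated field into a list; empty or missing gives []."""
    if not value:
        return []
    return [part.strip() for part in value.split(" | ")]


def to_int(value: Optional[str]) -> Optional[int]:
    """Convert a string to int, returning None for empty or non-numeric input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def parse_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with lists, ints, and booleans restored."""
    parsed = dict(record)
    for key in PIPE_FIELDS:
        parsed[key] = split_field(record.get(key))
    for key in ("authorBirthYears", "authorDeathYears"):
        parsed[key] = [to_int(year) for year in split_field(record.get(key))]
    parsed["gutenbergId"] = to_int(record.get("gutenbergId"))
    parsed["downloadCount"] = to_int(record.get("downloadCount")) or 0
    # "true" / "false" become booleans; anything else (including null) is None.
    parsed["copyright"] = {"true": True, "false": False}.get(record.get("copyright"))
    return parsed


def select_text_urls(
    records: Iterable[Dict[str, Any]], language: str, min_downloads: int = 0
) -> List[str]:
    """Collect plain-text URLs for works flagged copyright "false" in one language."""
    urls = []
    for record in map(parse_record, records):
        if (
            record["copyright"] is False
            and language in record["languages"]
            and record["downloadCount"] >= min_downloads
            and record.get("textPlainUrl")
        ):
            urls.append(record["textPlainUrl"])
    return urls
```

Records whose copyright value is null are excluded by select_text_urls, a conservative choice for corpus building; relax the check if unknown status is acceptable for your use.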

Performance and cost

  • No proxy required: Gutendex is a free public API with no authentication or IP restrictions.
  • Rate limiting: No documented cap. The actor uses a 300ms inter-page delay as a courtesy.
  • Memory: 256 MB is sufficient for any run size.
  • Full catalog run: ~78,000 books across ~2,440 pages at 32 books/page.
  • Pay-per-event (PPE) pricing: charged per result record via the standard profile.
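
The page count and a lower bound on run time follow directly from the figures above. The sketch below just does that arithmetic; the function name and the response_s parameter (average seconds Gutendex takes to answer one page) are illustrative assumptions, and real run time depends mostly on that response time.

```python
import math


def estimate_full_run(
    catalog_size: int = 78_000,
    page_size: int = 32,
    page_delay_s: float = 0.3,
    response_s: float = 0.0,
) -> tuple:
    """Return (page count, estimated seconds) for a full catalog run."""
    pages = math.ceil(catalog_size / page_size)
    # One courtesy delay between each pair of consecutive pages.
    delay_total = (pages - 1) * page_delay_s
    return pages, delay_total + pages * response_s


pages, seconds = estimate_full_run()
print(pages, round(seconds / 60, 1))
# 2438 pages, about 12.2 minutes of courtesy delay alone
```

If every page took 10 seconds to arrive, the same calculation gives roughly seven hours for the full catalog, so the inter-page delay is a small part of the total.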

Notes

  • Gutendex occasionally responds slowly (10 to 30 seconds per page). The actor uses a 90-second timeout with automatic retries, so transient slowness does not cause failures.
  • The copyright: "false" flag indicates the work is in the public domain in the United States. Verify licensing for your specific jurisdiction and use case before commercial use.
  • Download URLs point directly to the gutenberg.org CDN. The actor does not download the book files themselves — it returns the URLs for downstream processing.
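
The timeout, retry, and pagination behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the actor's implementation: only the 90-second timeout and the 300 ms inter-page delay come from this page, while the retry count, the exponential backoff, and the function names are hypothetical. The page fetcher is passed in as a callable so the loop can be exercised without network access.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def http_fetch_page(url: str) -> Page:
    """Fetch one Gutendex page as JSON with a 90-second timeout."""
    with urllib.request.urlopen(url, timeout=90) as response:
        return json.load(response)


def fetch_with_retries(
    fetch_page: Callable[[str], Page],
    url: str,
    retries: int = 3,  # hypothetical retry budget
    backoff_s: float = 2.0,  # hypothetical base backoff
    sleep: Callable[[float], None] = time.sleep,
) -> Page:
    """Call fetch_page, retrying on timeouts and network errors."""
    for attempt in range(retries + 1):
        try:
            return fetch_page(url)
        except OSError:  # covers TimeoutError and urllib.error.URLError
            if attempt == retries:
                raise
            sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("unreachable")


def iter_books(
    fetch_page: Callable[[str], Page],
    start_url: str,
    max_items: Optional[int] = None,
    page_delay_s: float = 0.3,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Dict[str, Any]]:
    """Yield books page by page, following the "next" link in each response."""
    if max_items is not None and max_items <= 0:
        return
    url: Optional[str] = start_url
    yielded = 0
    while url:
        page = fetch_with_retries(fetch_page, url, sleep=sleep)
        for book in page.get("results", []):
            yield book
            yielded += 1
            if max_items is not None and yielded >= max_items:
                return
        url = page.get("next")
        if url:
            sleep(page_delay_s)  # courtesy delay between pages
```

A live run would look like `for book in iter_books(http_fetch_page, "https://gutendex.com/books/", max_items=10): ...`; stopping as soon as max_items is reached avoids requesting a page whose results would be discarded.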