OrbTop

Penguin Random House Publisher Catalog Scraper

ECOMMERCE

Scrapes the official Penguin Random House publisher catalog from penguinrandomhouse.com and extracts authoritative book metadata: title, author, ISBN, imprint, format, publication date, price, description, praise blurbs, and series information. This is primary-source data that consumer review aggregators do not provide.

What data does it collect?

Each record is one book edition (one canonical detail page):

| Field | Type | Description |
|---|---|---|
| prh_id | string | Penguin Random House work ID (numeric, from URL) |
| title | string | Book title |
| subtitle | string | Subtitle, if present |
| author | string | Primary author name |
| contributors | string | All contributors as JSON array: [{"name":"...", "role":"..."}] |
| imprint | string | Publisher imprint (e.g. Random House, Crown, Dial Press) |
| format | string | Format: Hardcover, Paperback, Ebook, or Audiobook |
| isbn | string | Primary ISBN-13 |
| pages | integer | Page count |
| publication_date | string | Publication date (ISO 8601, e.g. 2024-10-01) |
| price | number | List price in USD |
| category | string | Genre/category as JSON array of strings |
| description | string | Publisher description (about the book) |
| about_the_author | string | Author biography from the publisher |
| praise | string | Praise/endorsement blurbs as JSON array of strings |
| series | string | Series name if the book is part of a series |
| related_titles | string | Related edition ISBNs as JSON array |
| cover_url | string | Cover image URL |
| product_url | string | Full URL of the book detail page |
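
Four of the fields above (contributors, category, praise, related_titles) hold JSON arrays stored as strings, so consumers usually need to decode them before use. The following is a minimal sketch of that step. The `decode_record` helper, the sample values, and the choice to treat an empty or missing value as an empty list are all assumptions made for illustration, not documented behaviour of the scraper.

```python
import json

# Fields the table describes as JSON arrays stored in string columns.
JSON_FIELDS = ("contributors", "category", "praise", "related_titles")


def decode_record(record: dict) -> dict:
    """Return a copy of record with the JSON-string fields parsed into lists."""
    decoded = dict(record)
    for field in JSON_FIELDS:
        raw = decoded.get(field)
        # Assumption: a missing or empty value means "no entries".
        decoded[field] = json.loads(raw) if raw else []
    return decoded


# Hypothetical record with made-up values, for illustration only.
sample = {
    "title": "Example Title",
    "contributors": '[{"name": "Jane Doe", "role": "Author"}]',
    "category": '["Mystery", "Thriller"]',
    "praise": "",
    "related_titles": '["9780000000002"]',
}

book = decode_record(sample)
print(book["contributors"][0]["name"])
print(book["category"])
```

Scalar fields such as title or isbn pass through unchanged, so the same helper can be applied to every row of an exported dataset.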

How to use it

Search by keyword

Provide one or more search queries. The scraper paginates through search results, visits each book detail page, and saves the metadata. Queries can be genre names, author names, topics, or any other search terms the PRH catalog supports.

{
  "queries": ["mystery", "science fiction"],
  "maxItems": 50,
  "sp_intended_usage": "catalog research"
}

Small focused run

{
  "queries": ["romance"],
  "maxItems": 10,
  "sp_intended_usage": "spot check"
}

Input parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| queries | array | Yes | ["fiction"] | Search terms to scrape. Each query seeds an independent paginated search. |
| maxItems | integer | Yes | 5 | Maximum total book records to collect across all queries. |
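
Because maxItems is a total across all queries rather than a per-query limit, a later query may contribute few or no records once earlier ones have filled the budget. The sketch below illustrates one way such a shared cap could behave. It is not the scraper's actual implementation: the `collect` function, the `fake_search` stand-in, the sequential processing order, and the de-duplication of URLs are all assumptions.

```python
from typing import Callable, Iterable, Iterator, List, Set


def collect(
    queries: Iterable[str],
    max_items: int,
    search: Callable[[str], Iterator[str]],
) -> List[str]:
    """Gather detail-page URLs across queries, stopping at max_items in total."""
    seen: Set[str] = set()
    results: List[str] = []
    for query in queries:
        for url in search(query):
            if len(results) >= max_items:
                return results  # shared budget exhausted
            if url in seen:
                continue  # assumption: the same book is kept only once
            seen.add(url)
            results.append(url)
    return results


def fake_search(query: str) -> Iterator[str]:
    """Hypothetical stand-in for a paginated search: three results per query."""
    for n in range(1, 4):
        yield f"https://example.com/{query}/{n}"


urls = collect(["mystery", "romance"], max_items=5, search=fake_search)
print(len(urls))
print(urls[-1])
```

Under these assumptions the first query supplies three URLs and the second only two, so ordering the queries by priority matters when maxItems is small.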

Notes

  • Extraction uses the structured JSON-LD Book schema embedded on each detail page — the same data the publisher uses for SEO. This gives authoritative isbn, publisherImprint, datePublished, and offers.price without scraping fragile HTML.
  • Praise blurbs, author bios, and categories are extracted from the HTML where JSON-LD does not carry them.
  • The contributors, category, praise, and related_titles fields are serialised as JSON strings so they remain compatible with spreadsheet and CSV exports.
  • The Penguin Random House catalog covers 120k+ titles across all imprints (Random House, Crown, Knopf, Bantam, Viking, Penguin, and many more).
  • No proxy required — the catalog is publicly accessible without bot protection.
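
The JSON-LD approach described in the notes above can be sketched with the Python standard library alone. This is an illustrative outline, not the scraper's own code. The sample HTML is invented, the `JsonLdParser` and `extract_book` names are hypothetical, and mapping the schema's `name` property to the title field is an assumption. The isbn, publisherImprint, datePublished, and offers.price keys are the ones named in the notes.

```python
import json
from html.parser import HTMLParser
from typing import List, Optional


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: List[str] = []
        self.blocks: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the page


def extract_book(html: str) -> Optional[dict]:
    """Return selected fields from the first JSON-LD Book object, if any."""
    parser = JsonLdParser()
    parser.feed(html)
    for block in parser.blocks:
        if isinstance(block, dict) and block.get("@type") == "Book":
            offers = block.get("offers")
            if not isinstance(offers, dict):
                offers = {}  # assumption: a single offer object is expected
            return {
                "title": block.get("name"),
                "isbn": block.get("isbn"),
                "imprint": block.get("publisherImprint"),
                "publication_date": block.get("datePublished"),
                "price": offers.get("price"),
            }
    return None


# Invented page fragment, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Book", "name": "Example Title", "isbn": "9780000000002",
 "publisherImprint": "Example Imprint", "datePublished": "2024-10-01",
 "offers": {"price": "28.00"}}
</script>
</head><body><h1>Example Title</h1></body></html>
"""

record = extract_book(SAMPLE_HTML)
print(record)
```

Reading the structured block first and falling back to HTML only for fields it lacks keeps most of the extraction independent of page layout changes.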