Penguin Random House Publisher Catalog Scraper
Scrape the official Penguin Random House publisher catalog from penguinrandomhouse.com. Extracts authoritative book metadata: title, author, ISBN, imprint, format, publication date, price, description, praise blurbs, and series information — primary-source data not available from consumer review aggregators.
What data does it collect?
Each record is one book edition (one canonical detail page):
| Field | Type | Description |
|---|---|---|
| `prh_id` | string | Penguin Random House work ID (numeric, from URL) |
| `title` | string | Book title |
| `subtitle` | string | Subtitle, if present |
| `author` | string | Primary author name |
| `contributors` | string | All contributors as JSON array: `[{"name":"...", "role":"..."}]` |
| `imprint` | string | Publisher imprint (e.g. Random House, Crown, Dial Press) |
| `format` | string | Format: Hardcover, Paperback, Ebook, or Audiobook |
| `isbn` | string | Primary ISBN-13 |
| `pages` | integer | Page count |
| `publication_date` | string | Publication date (ISO 8601, e.g. 2024-10-01) |
| `price` | number | List price in USD |
| `category` | string | Genre/category as JSON array of strings |
| `description` | string | Publisher description (about the book) |
| `about_the_author` | string | Author biography from the publisher |
| `praise` | string | Praise/endorsement blurbs as JSON array of strings |
| `series` | string | Series name if the book is part of a series |
| `related_titles` | string | Related edition ISBNs as JSON array |
| `cover_url` | string | Cover image URL |
| `product_url` | string | Full URL of the book detail page |
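
Several of the fields above hold a JSON array inside a string, so they need one extra decoding step before use. The following is a minimal sketch of that step, assuming a record has already been loaded as a Python dict. The `decode_record` helper and the sample record are hypothetical and only illustrate the field shapes described in the table; treating missing or empty values as an empty list is likewise an assumption.

```python
import json

# Fields the table above describes as JSON arrays stored in strings.
JSON_FIELDS = ("contributors", "category", "praise", "related_titles")


def decode_record(record: dict) -> dict:
    """Return a copy of `record` with JSON-string fields parsed into lists."""
    decoded = dict(record)
    for field in JSON_FIELDS:
        raw = decoded.get(field)
        if isinstance(raw, str) and raw.strip():
            decoded[field] = json.loads(raw)
        else:
            # Assumption: a missing or empty value is treated as an empty list.
            decoded[field] = []
    return decoded


# Hypothetical record for illustration only; not real catalog data.
sample = {
    "prh_id": "000000",
    "title": "Example Title",
    "contributors": '[{"name": "Jane Doe", "role": "Author"}]',
    "category": '["Mystery", "Thriller"]',
    "praise": "",
    "related_titles": '["9780000000002"]',
}

decoded = decode_record(sample)
print(decoded["contributors"][0]["name"])
print(decoded["category"])
```

The original record is left unchanged, so the string form is still available for CSV or spreadsheet export.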
How to use it
Search by keyword
Provide one or more search queries. The scraper paginates through search results, visits each book detail page, and saves the metadata. Queries can be genre names, author names, topics, or any other search terms the PRH catalog supports.
```json
{
  "queries": ["mystery", "science fiction"],
  "maxItems": 50,
  "sp_intended_usage": "catalog research"
}
```
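
The flow described above (paginate each query, collect detail pages, stop at `maxItems` across all queries) could be sketched roughly as follows. This is not the scraper's actual code: `collect_urls`, the `search_page` callback, and the in-memory `FAKE_RESULTS` are hypothetical names used only for illustration, and skipping URLs already seen under an earlier query is an assumption.

```python
from typing import Callable, Iterator, List

# Hypothetical callback: given a query and a 1-based page number, return the
# detail-page URLs on that results page (an empty list when none remain).
SearchPage = Callable[[str, int], List[str]]


def collect_urls(queries: List[str], max_items: int,
                 search_page: SearchPage) -> Iterator[str]:
    """Yield unique detail-page URLs until max_items is reached overall."""
    if max_items <= 0:
        return
    seen = set()
    for query in queries:
        page = 1
        while True:
            urls = search_page(query, page)
            if not urls:
                break  # this query is exhausted; move on to the next one
            for url in urls:
                if url in seen:
                    continue  # assumption: a book matched twice is kept once
                seen.add(url)
                yield url
                if len(seen) >= max_items:
                    return  # the cap applies across all queries
            page += 1


# In-memory stand-in for the search results; not real catalog data.
FAKE_RESULTS = {
    "mystery": [["/books/1", "/books/2"], ["/books/3"]],
    "science fiction": [["/books/3", "/books/4"]],
}


def fake_search(query: str, page: int) -> List[str]:
    pages = FAKE_RESULTS.get(query, [])
    return pages[page - 1] if page <= len(pages) else []


urls = list(collect_urls(["mystery", "science fiction"], 3, fake_search))
print(urls)
```

Writing the collector as a generator lets the caller stop fetching as soon as the cap is hit, without requesting further result pages.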
Small focused run
```json
{
  "queries": ["romance"],
  "maxItems": 10,
  "sp_intended_usage": "spot check"
}
```
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `queries` | array | Yes | `["fiction"]` | Search terms to scrape. Each query seeds an independent paginated search. |
| `maxItems` | integer | Yes | `5` | Maximum total book records to collect across all queries. |
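
Before starting a run, it can help to check an input object against the table above. The sketch below is one possible client-side check, not part of the scraper. The `validate_input` helper is hypothetical, and the specific rules (non-empty query strings, `maxItems` of at least 1) are assumptions rather than documented constraints. Keys not listed in the table, such as `sp_intended_usage` from the examples, are passed through untouched.

```python
import json


def validate_input(payload: dict) -> dict:
    """Raise ValueError if the two documented parameters look malformed."""
    queries = payload.get("queries")
    if not isinstance(queries, list) or not queries:
        raise ValueError("queries must be a non-empty array")
    if not all(isinstance(q, str) and q.strip() for q in queries):
        raise ValueError("every query must be a non-empty string")

    max_items = payload.get("maxItems")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int):
        raise ValueError("maxItems must be an integer")
    if max_items < 1:
        raise ValueError("maxItems must be at least 1")  # assumed lower bound
    return payload


run_input = validate_input({
    "queries": ["romance"],
    "maxItems": 10,
    "sp_intended_usage": "spot check",
})
print(json.dumps(run_input, indent=2))
```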
Notes
- Extraction uses the structured JSON-LD `Book` schema embedded on each detail page, the same data the publisher uses for SEO. This gives authoritative `isbn`, `publisherImprint`, `datePublished`, and `offers.price` without scraping fragile HTML.
- Praise blurbs, author bios, and categories are extracted from the HTML where JSON-LD does not carry them.
- The `contributors`, `category`, `praise`, and `related_titles` fields are serialised as JSON strings so they remain compatible with spreadsheet and CSV exports.
- The Penguin Random House catalog covers ~120k+ titles across all imprints (Random House, Crown, Knopf, Bantam, Viking, Penguin, and many more).
- No proxy required: the catalog is publicly accessible without bot protection.
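
To illustrate the JSON-LD approach mentioned in the notes, here is a minimal sketch using only the Python standard library. It is not the scraper's own implementation. `JsonLdParser`, `extract_book`, and `SAMPLE_HTML` are hypothetical, the sample markup is invented for illustration, and the mapping of the schema's `name` property to `title` is an assumption based on common schema.org usage.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_book(html: str):
    """Return selected fields from the first JSON-LD Book object, or None."""
    parser = JsonLdParser()
    parser.feed(html)
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "Book":
                offers = item.get("offers") or {}
                if isinstance(offers, list):  # offers may be a list of offers
                    offers = offers[0] if offers else {}
                return {
                    "title": item.get("name"),  # assumed mapping
                    "isbn": item.get("isbn"),
                    "imprint": item.get("publisherImprint"),
                    "publication_date": item.get("datePublished"),
                    "price": offers.get("price"),
                }
    return None


# Invented markup for illustration; not copied from the real site.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Book", "name": "Example Title", "isbn": "9780000000002",
 "publisherImprint": "Example Imprint", "datePublished": "2024-10-01",
 "offers": {"price": "28.00", "priceCurrency": "USD"}}
</script>
</head><body></body></html>
"""

book = extract_book(SAMPLE_HTML)
print(book)
```

Fields that JSON-LD does not carry, such as praise blurbs, would still need separate HTML extraction, as the notes describe.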