Recipe JSON-LD Bulk Harvester

Harvest structured recipe data from food blogs and cooking sites. Unlike tools limited to a fixed list of known sites, this actor works on any domain that publishes recipe markup: supply URLs directly, or let the actor discover the site's sitemap and find likely recipe pages for you.

What it does

URL mode: Provide a list of recipe page URLs (or a text file of URLs) and scrape each one.
Domain mode: Provide one or more domain names and the actor fetches robots.txt, discovers the site's sitemap(s), filters pages that look like recipes, and crawls them up to your maxItems limit.

Data is extracted from schema.org/Recipe JSON-LD (the structured-data format most food blogs publish to qualify for Google rich results), with an hRecipe microformat fallback for legacy sites.
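
As an illustration of the JSON-LD step, here is a minimal Python sketch using only the standard library. It is not the actor's own code: the names `JsonLdCollector` and `find_recipes` are hypothetical, and the handling of `@graph` wrappers and list-valued `@type` reflects common publishing patterns rather than anything this actor documents.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _is_recipe(node):
    # "@type" may be a single string or a list of types.
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "Recipe" in types


def find_recipes(html):
    """Return every schema.org/Recipe object found in the page's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    recipes = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: skip it rather than fail the page
        # A block may hold one object, a list of objects, or an @graph wrapper.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not isinstance(item, dict):
                continue
            nodes = item.get("@graph", [item])
            recipes.extend(n for n in nodes if isinstance(n, dict) and _is_recipe(n))
    return recipes
```

Calling `find_recipes(page_html)` returns a list of dicts; an empty list would correspond to the case where a fallback such as hRecipe is tried.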

What you get

Each result record contains:

Field	Description
`name`	Recipe title
`author`	Author name
`description`	Recipe summary
`recipe_category`	Category (e.g. Dessert, Main Course)
`recipe_cuisine`	Cuisine type (e.g. Italian, Mexican)
`prep_time`	Preparation time (ISO 8601, e.g. PT15M)
`cook_time`	Cook time (ISO 8601)
`total_time`	Total time
`recipe_yield`	Servings (e.g. "4 servings")
`recipe_ingredient`	Raw ingredient strings from the page
`recipe_ingredient_parsed`	Structured ingredients — each parsed to "quantity unit item, prep"
`recipe_instructions`	Step-by-step instructions (one per array item)
`nutrition`	Nutrition facts as JSON string (calories, fat, protein, carbs, etc.)
`aggregate_rating`	Star rating (number)
`rating_count`	Number of ratings
`keywords`	Recipe tags/keywords
`image_urls`	Recipe photo URLs
`video_url`	Recipe video URL if present
`date_published`	Publication date (ISO 8601)
`source_domain`	Domain scraped
`url`	Full page URL
`schema_type`	Extraction method: `recipe-jsonld`, `hrecipe-microformat`, or `none`
`extraction_warnings`	Non-fatal issues (missing fields, parse errors)
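
The time fields arrive as ISO 8601 durations, which usually need converting before you can sort or filter on them. The sketch below is one way to do that downstream in Python; `duration_to_minutes` is a hypothetical helper, not part of the actor, and it deliberately covers only days, hours, minutes, and seconds.

```python
import re

# Matches simple durations such as PT15M, PT1H30M, or P1DT2H.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_minutes(value):
    """Convert a simple ISO 8601 duration to whole minutes.

    Returns None for anything it cannot parse; years, months, weeks,
    and fractional values are not handled in this sketch.
    """
    match = _DURATION.match(value.strip().upper())
    if not match or not any(match.groupdict().values()):
        return None
    parts = {key: int(val) if val else 0 for key, val in match.groupdict().items()}
    total_seconds = (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
    return total_seconds // 60
```

For example, `duration_to_minutes("PT15M")` gives 15 and `duration_to_minutes("PT1H30M")` gives 90.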

Structured ingredient parser

The recipe_ingredient_parsed field is the headline feature — it breaks each raw ingredient string into structured components:

"2 cups all-purpose flour, sifted"  ->  "2 cups all-purpose flour, sifted"
"1/2 tsp kosher salt"               ->  "0.5 tsp kosher salt"
"1 large egg, at room temperature"  ->  "1 egg, at room temperature"

Handles Unicode fractions, mixed fractions ("1 1/2"), and common unit abbreviations.
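
To make the idea concrete, here is a rough Python sketch of this kind of parsing. It is an illustration only, not the actor's parser: `parse_ingredient`, the fraction table, and the unit alias table are all assumptions, it returns a dict rather than the string format shown above, it normalizes units to a singular form of its own choosing, and it does not strip size words such as "large".

```python
import re
from fractions import Fraction

# Assumed subset of Unicode fraction glyphs.
UNICODE_FRACTIONS = {
    "½": "1/2", "⅓": "1/3", "⅔": "2/3",
    "¼": "1/4", "¾": "3/4", "⅛": "1/8",
}

# Assumed alias table; a real parser would cover far more units.
UNIT_ALIASES = {
    "tsp": "tsp", "teaspoon": "tsp", "teaspoons": "tsp",
    "tbsp": "tbsp", "tablespoon": "tbsp", "tablespoons": "tbsp",
    "cup": "cup", "cups": "cup",
    "oz": "oz", "ounce": "oz", "ounces": "oz",
    "lb": "lb", "lbs": "lb", "pound": "lb", "pounds": "lb",
    "g": "g", "gram": "g", "grams": "g",
}

# Mixed fraction ("1 1/2"), plain fraction ("1/2"), or integer/decimal ("2", "0.5").
_QUANTITY = re.compile(r"^(\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*")


def parse_ingredient(raw):
    """Split a raw ingredient string into quantity, unit, item, and prep."""
    text = raw.strip()
    for glyph, ascii_form in UNICODE_FRACTIONS.items():
        # "1½" becomes "1 1/2", so it is read as a mixed fraction.
        text = text.replace(glyph, " " + ascii_form)
    text = text.strip()

    quantity = None
    match = _QUANTITY.match(text)
    if match:
        try:
            quantity = float(sum(Fraction(part) for part in match.group(1).split()))
            text = text[match.end():]
        except ZeroDivisionError:
            quantity = None  # something like "1/0": leave the text untouched

    head, _, rest = text.partition(" ")
    unit = UNIT_ALIASES.get(head.lower().rstrip("."))
    if unit:
        text = rest

    # Everything after the first comma is treated as preparation notes.
    item, _, prep = text.partition(",")
    return {
        "quantity": quantity,
        "unit": unit,
        "item": item.strip(),
        "prep": prep.strip() or None,
    }
```

With this sketch, `parse_ingredient("1/2 tsp kosher salt")` yields a quantity of 0.5, the unit `tsp`, and the item `kosher salt`.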

Input

URL mode

{
  "urls": [
    "https://www.allrecipes.com/recipe/10813/best-chocolate-chip-cookies/",
    "https://www.simplyrecipes.com/best-easy-roast-chicken-recipe-5207046"
  ],
  "maxItems": 100
}

You can also use requestsFromUrl to point to a plain-text file with one URL per line.

Domain mode

{
  "domains": [
    "www.seriouseats.com",
    "www.kingarthurbaking.com"
  ],
  "maxItems": 500
}

The actor fetches robots.txt from each domain, discovers listed sitemaps (or falls back to /sitemap.xml), traverses sitemap indexes, and filters URLs that look like recipe pages.
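
The discovery flow can be sketched in Python roughly as follows. This is an approximation under stated assumptions, not the actor's implementation: the function names are hypothetical, the fallback URL assumes HTTPS, `fetch` is a caller-supplied function that returns a document's text, and the `max_documents` cap is an invented safety limit.

```python
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt, domain):
    """Return Sitemap: URLs declared in robots.txt, or the /sitemap.xml fallback."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found or [f"https://{domain}/sitemap.xml"]


def _local(tag):
    # Drop the XML namespace: "{http://...}loc" becomes "loc".
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Split one sitemap document into (child_sitemaps, page_urls)."""
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:          # <sitemap> or <url> entries
        for child in entry:     # only direct <loc> children, not image:loc etc.
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    if _local(root.tag) == "sitemapindex":
        return locs, []
    return [], locs


def walk_sitemaps(start_urls, fetch, max_documents=50):
    """Follow sitemap indexes breadth-first and collect page URLs."""
    queue, seen, pages = list(start_urls), set(), []
    while queue and len(seen) < max_documents:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        children, urls = parse_sitemap(fetch(url))
        queue.extend(children)
        pages.extend(urls)
    return pages
```

Keeping the network call behind `fetch` means the parsing logic can be exercised against saved sitemap files before pointing it at a live site.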

Input fields

Field	Type	Description
`urls`	array	Recipe page URLs to scrape (URL mode)
`domains`	array	Domains to auto-discover and crawl (domain mode)
`maxItems`	integer	Maximum results to return (0 = unlimited)
`requestsFromUrl`	string	URL of a text file with one recipe URL per line

Provide either urls (+ optional requestsFromUrl) or domains — not both.
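
If you build the input programmatically, a small pre-flight check can catch a mixed-mode input before a run starts. The Python sketch below encodes the rule above; `resolve_mode` is a hypothetical helper, and treating a missing `maxItems` as 0 (unlimited) is an assumption, since the default is not stated here.

```python
def resolve_mode(actor_input):
    """Return (mode, limit) for an input dict, or raise ValueError.

    mode is "url" or "domain"; limit is None when maxItems is 0 (unlimited).
    """
    has_urls = bool(actor_input.get("urls") or actor_input.get("requestsFromUrl"))
    has_domains = bool(actor_input.get("domains"))

    if has_urls and has_domains:
        raise ValueError("Provide either urls/requestsFromUrl or domains, not both")
    if not (has_urls or has_domains):
        raise ValueError("Provide urls, requestsFromUrl, or domains")

    # Assumption: a missing maxItems behaves like 0 (unlimited).
    max_items = actor_input.get("maxItems", 0)
    if not isinstance(max_items, int) or max_items < 0:
        raise ValueError("maxItems must be a non-negative integer")

    limit = None if max_items == 0 else max_items
    return ("url" if has_urls else "domain"), limit
```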

How it works

URL mode — The actor resolves the URL list, crawls each page, and extracts recipe data directly.

Domain mode — For each domain:

1. Fetch robots.txt to discover sitemap URLs
2. Fall back to /sitemap.xml if robots.txt lists none
3. Walk sitemap indexes to find leaf sitemaps
4. Filter URLs by recipe-path heuristics (path contains /recipe/, slug has 3+ hyphen-separated words, etc.)
5. Crawl each filtered URL and extract recipe data
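
The filtering step can be approximated with the two heuristics named above. The Python sketch below implements only those two; the actor's full rule set is not documented here, and `looks_like_recipe` and `filter_recipe_urls` are hypothetical names.

```python
from urllib.parse import urlparse


def looks_like_recipe(url):
    """Heuristic guess at whether a URL points to a recipe page."""
    path = urlparse(url).path.lower()
    if "/recipe/" in path:
        return True
    # Last path segment, e.g. "best-easy-roast-chicken-recipe-5207046".
    slug = path.rstrip("/").rsplit("/", 1)[-1]
    words = [word for word in slug.split("-") if word]
    return len(words) >= 3


def filter_recipe_urls(urls, max_items=0):
    """Keep likely recipe URLs, stopping at max_items (0 means unlimited)."""
    kept = []
    for url in urls:
        if looks_like_recipe(url):
            kept.append(url)
            if max_items and len(kept) >= max_items:
                break
    return kept
```

Heuristics like these over-match (a slug such as `terms-of-service` passes the word-count test), which is one reason pages without Recipe schema still need to be handled at extraction time.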

Supported sites

Works on any food blog or cooking site that emits schema.org/Recipe JSON-LD. That covers most food sites, because Google's recipe rich results depend on Recipe structured data and JSON-LD is the format Google recommends. This includes:

Recipe-plugin-powered WordPress sites (Tasty Recipes, WP Recipe Maker, Recipe Card Blocks, etc.)
Major food media (Allrecipes, Simply Recipes, Serious Eats, Food Network, BBC Good Food, etc.)
Independent food bloggers
Any site using hRecipe microformat (legacy support)

Pricing

Billed per recipe record saved. The default pricing profile charges a small fee per record plus a run start fee.

Notes

Rate limiting: The actor rate-limits requests per domain; sites that throttle are retried automatically with backoff.
Paywalled pages: Pages that return 403 or require login will be skipped with a warning in extraction_warnings.
Missing schema: Pages where no Recipe schema is found produce a stub record with schema_type: "none" and a warning.
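
For readers unfamiliar with retry backoff, the Python sketch below shows a generic exponential schedule. The actor's actual retry count, base delay, and cap are not documented here, so every parameter value is an assumption and `backoff_delays` is a hypothetical name.

```python
import random


def backoff_delays(max_retries=5, base_seconds=1.0, cap_seconds=60.0,
                   jitter=0.0, rng=random.random):
    """Return the wait before each retry: base * 2**attempt, capped.

    jitter adds up to that fraction of each delay at random, so that
    many workers do not retry at the same instant.
    """
    delays = []
    for attempt in range(max_retries):
        delay = min(cap_seconds, base_seconds * 2 ** attempt)
        delays.append(delay + delay * jitter * rng())
    return delays
```

With the defaults and no jitter this gives waits of 1, 2, 4, 8, and 16 seconds.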
