OrbTop

Recipe JSON-LD Bulk Harvester

Harvest structured recipe data from any food blog or cooking website. Unlike tools limited to a fixed list of known sites, this actor works on any domain that publishes recipe markup: supply URLs directly, or let the actor discover the site's sitemap and find the recipe pages for you.

What it does

  • URL mode: Provide a list of recipe page URLs (or a text file of URLs) and scrape each one.
  • Domain mode: Provide one or more domain names and the actor fetches robots.txt, discovers the site's sitemap(s), filters pages that look like recipes, and crawls them up to your maxItems limit.

Data is extracted from schema.org/Recipe JSON-LD (the structured-data format most food blogs publish for Google rich results), with an hRecipe microformat fallback for legacy sites.
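
To make the extraction step concrete, here is a minimal sketch of how a schema.org/Recipe object can be pulled out of a page's JSON-LD using only the Python standard library. It is an illustration, not the actor's actual code: the names JsonLdCollector and find_recipe are hypothetical, and the hRecipe fallback is not covered.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _is_recipe(node):
    # "@type" may be a single string or a list of strings.
    types = node.get("@type")
    if isinstance(types, str):
        types = [types]
    return isinstance(types, list) and "Recipe" in types


def find_recipe(html):
    """Return the first schema.org/Recipe object in the page, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block; try the next one
        # A block may be one object, a list of objects, or wrap nodes in "@graph".
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            graph = item.get("@graph")
            nodes = [item] + [n for n in graph if isinstance(n, dict)] if isinstance(graph, list) else [item]
            for node in nodes:
                if _is_recipe(node):
                    return node
    return None
```

Checking both the top-level object and any "@graph" array matters in practice, because many sites bundle the Recipe node together with WebPage and Organization nodes in a single script tag.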

What you get

Each result record contains:

| Field | Description |
|---|---|
| name | Recipe title |
| author | Author name |
| description | Recipe summary |
| recipe_category | Category (e.g. Dessert, Main Course) |
| recipe_cuisine | Cuisine type (e.g. Italian, Mexican) |
| prep_time | Preparation time (ISO 8601, e.g. PT15M) |
| cook_time | Cook time (ISO 8601) |
| total_time | Total time |
| recipe_yield | Servings (e.g. "4 servings") |
| recipe_ingredient | Raw ingredient strings from the page |
| recipe_ingredient_parsed | Structured ingredients: each parsed to "quantity unit item, prep" |
| recipe_instructions | Step-by-step instructions (one per array item) |
| nutrition | Nutrition facts as JSON string (calories, fat, protein, carbs, etc.) |
| aggregate_rating | Star rating (number) |
| rating_count | Number of ratings |
| keywords | Recipe tags/keywords |
| image_urls | Recipe photo URLs |
| video_url | Recipe video URL if present |
| date_published | Publication date (ISO 8601) |
| source_domain | Domain scraped |
| url | Full page URL |
| schema_type | Extraction method: recipe-jsonld, hrecipe-microformat, or none |
| extraction_warnings | Non-fatal issues (missing fields, parse errors) |
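
Two of these fields usually need a little post-processing on the consumer side: the ISO 8601 durations and the nutrition JSON string. The sketch below shows one way to handle both. The helper names are hypothetical, the duration parser only covers the simple day/hour/minute/second forms (such as PT15M or PT1H30M), and the exact keys inside nutrition are an assumption that may differ per site.

```python
import json
import re

# Matches simple ISO 8601 durations such as "PT15M", "PT1H30M" or "P1DT2H".
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_minutes(value):
    """Convert a simple ISO 8601 duration to minutes, or return None if it does not parse."""
    match = _DURATION.match(value or "")
    if not match or not any(match.groupdict().values()):
        return None
    parts = {name: int(text) if text else 0 for name, text in match.groupdict().items()}
    total_seconds = (
        (parts["days"] * 24 + parts["hours"]) * 60 + parts["minutes"]
    ) * 60 + parts["seconds"]
    return total_seconds / 60


def nutrition_dict(record):
    """Decode the nutrition JSON string of a record into a dict (empty if absent or invalid)."""
    raw = record.get("nutrition")
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return {}
    return parsed if isinstance(parsed, dict) else {}
```

Returning None or an empty dict instead of raising keeps a bulk post-processing loop from stopping on the occasional page with malformed markup.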

Structured ingredient parser

The recipe_ingredient_parsed field is the headline feature. It normalizes each raw ingredient string into a "quantity unit item, prep" form:

"2 cups all-purpose flour, sifted"  ->  "2 cups all-purpose flour, sifted"
"1/2 tsp kosher salt"               ->  "0.5 tsp kosher salt"
"1 large egg, at room temperature"  ->  "1 egg, at room temperature"

The parser handles Unicode fractions ("½"), mixed fractions ("1 1/2"), and common unit abbreviations.
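
For readers who want to see the idea behind this kind of parsing, here is a simplified sketch covering Unicode fractions, mixed fractions, and a short list of units. It is not the actor's parser: parse_ingredient, UNICODE_FRACTIONS, and KNOWN_UNITS are hypothetical names, the unit list is deliberately small, and it makes no attempt to drop size words such as "large".

```python
import re
from fractions import Fraction

# Hypothetical lookup tables; a production parser would cover many more entries.
UNICODE_FRACTIONS = {"½": "1/2", "⅓": "1/3", "⅔": "2/3", "¼": "1/4", "¾": "3/4", "⅛": "1/8"}
KNOWN_UNITS = {
    "cup", "cups", "tsp", "teaspoon", "teaspoons", "tbsp", "tablespoon",
    "tablespoons", "g", "kg", "ml", "l", "oz", "lb", "lbs",
}

# Mixed fraction ("1 1/2"), plain fraction ("1/2"), or integer/decimal ("2", "0.5").
_QUANTITY = re.compile(r"^(\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*")


def parse_ingredient(text):
    """Split a raw ingredient string into quantity, unit, item, and prep."""
    for glyph, ascii_form in UNICODE_FRACTIONS.items():
        # "1½" becomes "1 1/2" so the mixed-fraction branch can handle it.
        text = text.replace(glyph, " " + ascii_form)
    text = text.strip()

    quantity = None
    match = _QUANTITY.match(text)
    if match:
        quantity = float(sum(Fraction(part) for part in match.group(1).split()))
        text = text[match.end():]

    unit = None
    words = text.split(maxsplit=1)
    if words and words[0].lower().rstrip(".") in KNOWN_UNITS:
        unit = words[0].lower().rstrip(".")
        text = words[1] if len(words) > 1 else ""

    # Everything after the first comma is treated as preparation notes.
    item, _, prep = text.partition(",")
    return {
        "quantity": quantity,
        "unit": unit,
        "item": item.strip(),
        "prep": prep.strip() or None,
    }
```

Using Fraction for the arithmetic avoids floating-point surprises when summing the whole and fractional parts; the value is only converted to a float at the end.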

Input

URL mode

{
  "urls": [
    "https://www.allrecipes.com/recipe/10813/best-chocolate-chip-cookies/",
    "https://www.simplyrecipes.com/best-easy-roast-chicken-recipe-5207046"
  ],
  "maxItems": 100
}

You can also use requestsFromUrl to point to a plain-text file with one URL per line.

Domain mode

{
  "domains": [
    "www.seriouseats.com",
    "www.kingarthurbaking.com"
  ],
  "maxItems": 500
}

The actor fetches robots.txt from each domain, discovers listed sitemaps (or falls back to /sitemap.xml), traverses sitemap indexes, and filters URLs that look like recipe pages.

Input fields

| Field | Type | Description |
|---|---|---|
| urls | array | Recipe page URLs to scrape (URL mode) |
| domains | array | Domains to auto-discover and crawl (domain mode) |
| maxItems | integer | Maximum results to return (0 = unlimited) |
| requestsFromUrl | string | URL of a text file with one recipe URL per line |

Provide either urls (+ optional requestsFromUrl) or domains — not both.
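
If you build the input programmatically, a small pre-flight check can catch a mixed-mode input before a run is started. The following is a client-side sketch only, not the actor's own validation. The function name validate_input is hypothetical, and it assumes that requestsFromUrl on its own counts as URL mode and that an omitted maxItems behaves like 0.

```python
def validate_input(run_input):
    """Return "url" or "domain" for a well-formed input dict; raise ValueError otherwise."""
    # Assumption: requestsFromUrl without urls is still URL mode.
    has_urls = bool(run_input.get("urls")) or bool(run_input.get("requestsFromUrl"))
    has_domains = bool(run_input.get("domains"))

    if has_urls and has_domains:
        raise ValueError("Provide either urls/requestsFromUrl or domains, not both.")
    if not has_urls and not has_domains:
        raise ValueError("Provide urls, requestsFromUrl, or domains.")

    # Assumption: a missing maxItems is treated as 0 (unlimited).
    max_items = run_input.get("maxItems", 0)
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 0:
        raise ValueError("maxItems must be a non-negative integer.")

    return "url" if has_urls else "domain"
```

For example, validate_input({"domains": ["www.seriouseats.com"], "maxItems": 500}) returns "domain", while an input containing both urls and domains raises ValueError.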

How it works

URL mode — The actor resolves the URL list, crawls each page, and extracts recipe data directly.

Domain mode — For each domain:

  1. Fetch robots.txt to discover sitemap URLs
  2. Fall back to /sitemap.xml if robots.txt lists none
  3. Walk sitemap indexes to find leaf sitemaps
  4. Filter URLs by recipe-path heuristics (path contains /recipe/, slug has 3+ hyphen-separated words, etc.)
  5. Crawl each filtered URL and extract recipe data
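
The discovery steps above can be sketched with three small pure functions. This is an illustration under stated assumptions, not the actor's implementation: the function names are hypothetical, fetching is left out entirely (each function takes already-downloaded text), and the two heuristics shown are combined with OR, which the description does not spell out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def sitemaps_from_robots(robots_txt, domain):
    """Steps 1-2: read Sitemap: lines from robots.txt, else fall back to /sitemap.xml."""
    found = [
        line.split(":", 1)[1].strip()
        for line in (raw.strip() for raw in robots_txt.splitlines())
        if line.lower().startswith("sitemap:")
    ]
    return found or [f"https://{domain}/sitemap.xml"]


def _local(tag):
    # Strip the XML namespace: "{http://...}loc" -> "loc".
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Step 3: return (child_sitemaps, page_urls) for a sitemap index or a leaf sitemap."""
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:  # <sitemap> or <url> elements
        for child in entry:  # direct children only, so nested image <loc> tags are skipped
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    if _local(root.tag) == "sitemapindex":
        return locs, []
    return [], locs


def looks_like_recipe(url):
    """Step 4: path contains /recipe/, or the last path segment has 3+ hyphen-separated words."""
    path = urlparse(url).path.lower()
    if "/recipe/" in path:
        return True
    slug = path.rstrip("/").rsplit("/", 1)[-1]
    return len([word for word in slug.split("-") if word]) >= 3
```

A crawler would call parse_sitemap repeatedly, queueing any child sitemaps it returns until only page URLs remain, then keep the URLs for which looks_like_recipe is true, up to maxItems. Note that the slug heuristic is intentionally loose, so some non-recipe articles will pass the filter and only be ruled out at extraction time.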

Supported sites

Works on any food blog or cooking site that emits schema.org/Recipe JSON-LD. That covers most food sites, because Google's recipe rich results depend on Recipe structured data and JSON-LD is Google's recommended format. This includes:

  • Recipe-plugin-powered WordPress sites (Tasty Recipes, WP Recipe Maker, Recipe Card Blocks, etc.)
  • Major food media (Allrecipes, Simply Recipes, Serious Eats, Food Network, BBC Good Food, etc.)
  • Independent food bloggers
  • Any site using hRecipe microformat (legacy support)

Pricing

Runs are billed per recipe record saved. The default pricing profile charges a small fee per record plus a run start fee.

Notes

  • Rate limiting: The actor applies per-domain rate limiting, and requests that a site throttles are retried automatically with backoff.
  • Paywalled pages: Pages that return 403 or require login will be skipped with a warning in extraction_warnings.
  • Missing schema: Pages where no Recipe schema is found produce a stub record with schema_type: "none" and a warning.
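
Because stub records and warnings end up in the same result set as good recipes, it is often useful to separate them after a run. The sketch below shows one way to do that; triage is a hypothetical helper, and it assumes extraction_warnings is a list of strings (a single string is also tolerated).

```python
from collections import Counter


def triage(records):
    """Split records into extracted recipes and stubs, and tally warning messages."""
    recipes, stubs = [], []
    warnings = Counter()
    for record in records:
        if record.get("schema_type") == "none":
            stubs.append(record)
        else:
            recipes.append(record)
        found = record.get("extraction_warnings") or []
        if isinstance(found, str):
            found = [found]  # tolerate a single string instead of a list
        warnings.update(found)
    return recipes, stubs, warnings
```

The warning tally gives a quick view of which issues dominate a run, for example whether most stubs come from pages without Recipe markup or from pages that could not be fetched.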