OrbTop

Food.com Recipe Scraper

Scrape recipes from Food.com, one of the largest English-language community recipe databases, with over 500,000 recipes. Each record includes ratings, review counts, and a rich tag taxonomy.

What it does

This actor enumerates Food.com's full sitemap (or accepts a direct list of recipe URLs) and extracts structured recipe data from each page. All core fields come from the embedded schema.org/Recipe JSON-LD block, supplemented with DOM extraction for Food.com-specific data (tag taxonomy, rating details, image gallery).
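
As an illustration of the JSON-LD step, here is a minimal sketch using only the Python standard library. It is not the actor's actual code: the class and function names (JsonLdCollector, find_recipe_jsonld) are hypothetical, and the handling of list and @graph wrappers is an assumption about how a page might package its JSON-LD.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_recipe_jsonld(html):
    """Return the first JSON-LD object whose @type is Recipe, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        # Assumption: a block may hold one object, a list, or an @graph wrapper.
        if isinstance(data, dict) and "@graph" in data:
            candidates = data["@graph"]
        elif isinstance(data, list):
            candidates = data
        else:
            candidates = [data]
        for item in candidates:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Recipe" in types:
                return item
    return None
```

Calling find_recipe_jsonld(page_html) on a fetched page returns a dict keyed by schema.org property names (name, prepTime, recipeIngredient, and so on), which would then be mapped to the output fields listed under Output.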

Use cases

  • Build recommender training datasets (ratings + review counts at 500K+ scale)
  • Meal-plan and recipe-app databases
  • Food trend analytics and NLP corpora
  • RAG pipelines for culinary applications
  • Competitive ingredient and nutritional analysis

Input

Field       Type     Description
maxItems    integer  Maximum number of recipes to scrape. Set to 0 for the full ~500K corpus. Default: 10
recipeUrls  array    Optional list of specific Food.com recipe URLs to scrape. If provided, sitemap enumeration is skipped.

Example: specific URLs

{
  "maxItems": 5,
  "recipeUrls": [
    { "url": "https://www.food.com/recipe/jo-mamas-world-famous-spaghetti-22782" },
    { "url": "https://www.food.com/recipe/easy-homemade-chicken-soup-157877" }
  ]
}

Example: sitemap enumeration (first 1000 recipes)

{
  "maxItems": 1000
}
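
One way to read the input rules above, sketched in Python. The function plan_run is hypothetical and is not part of the actor. It assumes that maxItems also caps an explicit recipeUrls list, which the documentation does not state.

```python
def plan_run(actor_input):
    """Decide the URL source and item cap from an input dict."""
    max_items = actor_input.get("maxItems", 10)  # documented default
    entries = actor_input.get("recipeUrls") or []
    urls = [entry["url"] for entry in entries]
    # Documented rule: a non-empty recipeUrls list skips sitemap enumeration.
    source = "recipeUrls" if urls else "sitemap"
    # Documented rule: 0 means the full corpus; represented here as None.
    limit = None if max_items == 0 else max_items
    if limit is not None:
        urls = urls[:limit]  # assumption: the cap applies to explicit URLs too
    return {"source": source, "limit": limit, "urls": urls}
```

For the second example input, plan_run({"maxItems": 1000}) selects the sitemap as the source with a limit of 1000.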

Output

Each record in the dataset corresponds to one recipe:

Field                Type     Description
recipe_id            string   Unique numeric recipe ID from the URL
url                  string   Canonical recipe URL
name                 string   Recipe name
author               string   Recipe author username
description          string   Full recipe description
recipe_category      string   Primary category (e.g. Dessert, Main Dish)
recipe_cuisine       string   Cuisine type if specified (e.g. Italian)
prep_time            string   Preparation time in ISO 8601 format (e.g. PT15M)
cook_time            string   Cook time in ISO 8601 format
total_time           string   Total time in ISO 8601 format
recipe_yield         string   Servings (e.g. "4 serving(s)")
recipe_ingredient    array    Ingredients as formatted strings
recipe_instructions  array    Step-by-step instruction strings
nutrition            object   Nutritional data: calories, fat_content, saturated_fat, cholesterol, sodium, carbohydrate, fiber, sugar, protein
aggregate_rating     number   Average star rating (0-5 scale)
rating_count         integer  Total number of ratings
review_count         integer  Total number of written reviews
keywords             string   Comma-separated keywords (occasion, diet, method tags)
tags                 array    Food.com topic taxonomy tags
image_urls           array    Recipe photo URLs
date_published       string   Publication date (ISO 8601)

Sample output record

{
  "recipe_id": "22782",
  "url": "https://www.food.com/recipe/jo-mamas-world-famous-spaghetti-22782",
  "name": "Jo Mama's World Famous Spaghetti",
  "author": "Sharlene~W",
  "description": "My kids will give up a steak dinner for this spaghetti...",
  "recipe_category": "Spaghetti",
  "recipe_cuisine": null,
  "prep_time": "PT20M",
  "cook_time": "PT1H",
  "total_time": "PT1H20M",
  "recipe_yield": "4 quarts, 10-14 serving(s)",
  "recipe_ingredient": ["2 lbs Italian sausage, casings removed", "..."],
  "recipe_instructions": ["In large, heavy stockpot, brown Italian sausage...", "..."],
  "nutrition": {
    "calories": "555.9",
    "fat_content": "26.3",
    "protein": "29.8"
  },
  "aggregate_rating": 5.0,
  "rating_count": 1376,
  "review_count": 1376,
  "keywords": "Pork,Meat,European,Kid Friendly,Weeknight,Stove Top,< 4 Hours,Easy",
  "tags": ["Spaghetti"],
  "image_urls": ["https://img.sndimg.com/food/image/upload/..."],
  "date_published": "2002-03-17T10:26Z"
}
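
Several fields in a record like the one above arrive as strings (ISO 8601 durations, comma-separated keywords, numeric nutrition values). The sketch below shows one way a consumer might normalize them. The helper names are hypothetical, and the duration parser deliberately handles only the time-only PT…H…M…S form seen in the sample; anything else yields None.

```python
import re

# Matches time-only ISO 8601 durations such as PT20M or PT1H20M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_minutes(value):
    """Convert a duration like 'PT1H20M' to minutes; None if unparseable."""
    if not value or value == "PT":
        return None
    match = _DURATION.match(value)
    if not match:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes + seconds / 60


def normalize(record):
    """Derive a few typed fields from one scraped record."""
    nutrition = record.get("nutrition") or {}
    keywords = record.get("keywords") or ""
    return {
        "recipe_id": int(record["recipe_id"]),
        "total_minutes": duration_to_minutes(record.get("total_time")),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        "calories": float(nutrition["calories"]) if nutrition.get("calories") else None,
    }
```

Applied to the sample record, total_time "PT1H20M" becomes 80 minutes and the keywords string becomes a list of eight tags.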

Crawl approach

  1. Sitemap enumeration: Fetches https://www.food.com/sitemap.xml (a 24-child sitemap index with gzip-compressed child files) and collects all /recipe/ URLs.
  2. Page scraping: Each recipe page is fetched, and structured data is read from its embedded schema.org/Recipe JSON-LD block. DOM extraction adds the Food.com-specific taxonomy and image gallery.
  3. Rate limiting: Rate limits are detected automatically and handled with backoff, so no manual configuration is needed.
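
The sitemap enumeration step can be sketched roughly as follows. This is an illustration rather than the actor's implementation: the function names are hypothetical, the standard sitemaps.org XML namespace is assumed, and network access is abstracted behind a caller-supplied fetch callable so the logic can be exercised offline.

```python
import gzip
import xml.etree.ElementTree as ET

# Assumption: the sitemap files use the standard sitemaps.org namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def locs(xml_bytes):
    """Return every <loc> value in a sitemap or sitemap-index document."""
    root = ET.fromstring(xml_bytes)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


def recipe_urls_from_child(payload):
    """Extract /recipe/ URLs from one child sitemap, gzip-compressed or not."""
    if payload[:2] == b"\x1f\x8b":  # gzip magic number
        payload = gzip.decompress(payload)
    return [url for url in locs(payload) if "/recipe/" in url]


def enumerate_recipes(fetch, index_url="https://www.food.com/sitemap.xml", limit=None):
    """Walk the sitemap index; fetch is any callable mapping URL -> bytes."""
    seen, found = set(), []
    for child_url in locs(fetch(index_url)):
        for url in recipe_urls_from_child(fetch(child_url)):
            if url in seen:
                continue  # skip duplicates across child sitemaps
            seen.add(url)
            found.append(url)
            if limit is not None and len(found) >= limit:
                return found
    return found
```

A real run would pass a fetch function that performs the HTTP request and applies the backoff policy; keeping it as a parameter leaves the parsing logic independent of how rate limits are handled.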

Performance

  • Memory: 512 MB
  • No proxy required: Food.com is reachable from datacenter IPs
  • Concurrency: 10 parallel requests
  • Full corpus (~500K recipes): takes longer than the default 4-hour timeout, so a full crawl needs a longer timeout setting
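
A rough back-of-the-envelope check on the full-corpus figure, using only the numbers listed above (500K recipes, 4 hours, 10 parallel requests). The function names are hypothetical, and the estimate ignores sitemap fetching and any backoff pauses, so the real requirement is stricter.

```python
def required_throughput(total_pages, timeout_hours):
    """Pages per second needed to finish within the timeout."""
    return total_pages / (timeout_hours * 3600)


def max_mean_latency(total_pages, timeout_hours, concurrency):
    """Longest mean per-request time (seconds) that still fits the timeout."""
    return concurrency / required_throughput(total_pages, timeout_hours)


rate = required_throughput(500_000, 4)
latency = max_mean_latency(500_000, 4, 10)
print(f"{rate:.1f} pages/s, {latency:.3f} s per request")
# prints "34.7 pages/s, 0.288 s per request"
```

In other words, at 10 parallel requests each page would have to complete in under about 0.29 seconds on average to fit in 4 hours, which is consistent with the full crawl exceeding the default timeout.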