OrbTop

Clemson HGIC Home & Garden Factsheet Scraper

Scrapes the Clemson Home & Garden Information Center (HGIC) factsheet library — 2,500+ science-based factsheets covering plant care, diseases, pest management, lawn care, and food preservation. Outputs structured records with HGIC ID, body sections, symptoms, causal agent, management recommendations, recommended products, authors, and related factsheets.

What It Does

Clemson HGIC is one of the largest university extension factsheet libraries in the US, with 2,500+ documents focused on the Southeastern US plant palette. Each factsheet follows a consistent template with discrete sections: symptoms, causal agent (pathogen or pest binomial), management/control recommendations, and prevention. This actor parses that structure into machine-readable fields that plant-diagnosis apps, AI garden assistants, and agronomy SaaS platforms can use as grounding data.

The actor reads the Yoast sitemap index to enumerate all factsheet URLs, then crawls each page using impit with Chrome TLS fingerprinting. No proxy or CAPTCHA solver is required.

Use Cases

  • Training data for plant disease diagnosis models and AI garden assistants
  • Structured extension knowledge base for horticulture SaaS
  • Agronomy/landscaping content and reference data pipelines
  • Garden app content enrichment (symptom/treatment lookup)

Input

Field     Type     Default  Description
maxItems  integer  10       Maximum number of factsheets to scrape. Set to a large number to scrape all 2,500+ factsheets.

Output

Each item represents one HGIC factsheet.

Field                 Type    Description
factsheet_id          string  HGIC factsheet number, e.g. HGIC 1223
slug                  string  URL slug, e.g. turfgrasses-for-the-carolinas
title                 string  Factsheet title
category              string  Subject category: Diseases, Insects, Lawns, Soils, Vegetables, Trees & Shrubs, Flowers, Fruits & Nuts, Food Safety & Preservation, Human Health & Safety, General
plant_subjects        string  Comma-separated plant names from the title
problem_type          string  Problem type: disease, insect, cultural, or none
summary               string  First meaningful paragraph / introductory text
body_sections         string  JSON array of {heading, text} objects for the full structured body
symptoms              string  Symptom description text (for disease/pest/damage factsheets)
causal_agent          string  Pathogen or pest scientific/common name
management            string  Management and control recommendation text
prevention            string  Prevention and cultural practices text
recommended_products  string  Comma-separated trade names and chemistries found in management sections
related_factsheets    string  Comma-separated related factsheet links (`title
last_updated          string  Revision date as shown in factsheet metadata, e.g. Feb 28, 2016
authors               string  Comma-separated list of factsheet authors
images                string  Comma-separated image URLs embedded in the factsheet
factsheet_url         string  Canonical URL of the factsheet
scrapedAt             string  ISO-8601 timestamp when the record was scraped

Sample Output

{
  "factsheet_id": "HGIC 1223",
  "slug": "turfgrasses-for-the-carolinas",
  "title": "Turfgrasses for the Carolinas",
  "category": "Lawns",
  "problem_type": "none",
  "summary": "For over 50 years the lawn has been an integral part of the landscape...",
  "body_sections": "[{\"heading\":\"Mowing\",\"text\":\"...\"}]",
  "last_updated": "Feb 28, 2016",
  "authors": "Millie Davenport, Gary Forrester",
  "factsheet_url": "https://hgic.clemson.edu/factsheet/turfgrasses-for-the-carolinas/"
}
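
Because every output field is a string, consumers will usually want to expand the encoded ones before use. The following is a minimal Python sketch of that step, not part of the actor itself: the helper names `split_csv` and `decode_record` are hypothetical, and the naive comma split assumes that individual values contain no commas, which may not hold for every field. `related_factsheets` is left untouched.

```python
import json


def split_csv(value):
    """Split a comma-separated output field into a list, dropping blanks."""
    return [part.strip() for part in (value or "").split(",") if part.strip()]


def decode_record(item):
    """Return a copy of one output item with string-encoded fields expanded.

    Hypothetical helper: assumes the field names from the Output table.
    """
    record = dict(item)
    # body_sections is a JSON array serialized as a string.
    record["body_sections"] = json.loads(item.get("body_sections") or "[]")
    # These fields are documented as comma-separated lists.
    for field in ("plant_subjects", "authors", "images"):
        record[field] = split_csv(item.get(field))
    return record
```

Applied to the sample record, `decode_record(item)["authors"]` gives `["Millie Davenport", "Gary Forrester"]`, and `decode_record(item)["body_sections"][0]["heading"]` gives `"Mowing"`. Fields absent from an item, such as `images` in the sample, decode to an empty list.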

Discovery Method

Reads the Yoast sitemap index at https://hgic.clemson.edu/sitemap.xml, filters for factsheet-sitemap.xml and factsheet-sitemap2.xml, and collects all /factsheet/<slug>/ URLs. The maxItems cap is applied before crawling begins.
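
The discovery steps described here can be sketched in Python roughly as follows. This illustrates the approach and is not the actor's actual code: the function names are hypothetical, `fetch` stands in for whatever HTTP client returns a sitemap's XML text, and the only assumption about the XML is the standard sitemaps.org namespace.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Child sitemap file names named in the description above.
FACTSHEET_SITEMAPS = {"factsheet-sitemap.xml", "factsheet-sitemap2.xml"}
# Matches /factsheet/<slug>/ at the end of a URL, not the bare /factsheet/ root.
FACTSHEET_URL = re.compile(r"/factsheet/[^/]+/?$")


def child_sitemaps(index_xml):
    """Return the factsheet sitemap URLs listed in a sitemap index."""
    root = ET.fromstring(index_xml)
    locs = [el.text.strip() for el in root.findall("sm:sitemap/sm:loc", NS) if el.text]
    return [u for u in locs if u.rsplit("/", 1)[-1] in FACTSHEET_SITEMAPS]


def factsheet_urls(urlset_xml):
    """Return the /factsheet/<slug>/ URLs found in one child sitemap."""
    root = ET.fromstring(urlset_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    return [u for u in locs if FACTSHEET_URL.search(u)]


def discover(index_xml, fetch, max_items):
    """Collect unique factsheet URLs and apply the cap before any crawling.

    `fetch` is a hypothetical callable: URL in, sitemap XML text out.
    """
    urls = []
    for sitemap_url in child_sitemaps(index_xml):
        urls.extend(factsheet_urls(fetch(sitemap_url)))
    unique = list(dict.fromkeys(urls))  # de-duplicate, keep sitemap order
    return unique[:max_items]
```

Truncating the URL list before the crawl starts means a small `maxItems` costs only the sitemap requests plus that many page fetches.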

Performance

  • Memory: 128–256 MB
  • Throughput: ~200 pages/minute at default concurrency (5)
  • Full corpus (~2,500 factsheets): ~15–20 minutes
  • Timeout: 2-hour default (sufficient for full corpus)
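
To sanity-check these figures for a different `maxItems`, the arithmetic can be written out as a small Python sketch. The function names are hypothetical, and the calculation covers fetch time only, ignoring startup and sitemap discovery.

```python
def crawl_minutes(pages, pages_per_minute):
    """Pure fetch time in minutes at a steady rate (no startup overhead)."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return pages / pages_per_minute


def max_pages_within(timeout_minutes, pages_per_minute):
    """Largest page count that fits inside a timeout at a steady rate."""
    return int(timeout_minutes * pages_per_minute)
```

At the stated ~200 pages/minute, `crawl_minutes(2500, 200)` gives 12.5 minutes, and `max_pages_within(120, 200)` gives 24,000 pages inside the 2-hour timeout. The quoted 15–20 minutes for the full corpus is somewhat higher, which would be consistent with either fixed overhead or an effective rate nearer 125–165 pages/minute; the figures above do not say which.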