Clemson HGIC Home & Garden Factsheet Scraper

Scrapes the Clemson Home & Garden Information Center (HGIC) factsheet library — 2,500+ science-based factsheets covering plant care, diseases, pest management, lawn care, and food preservation. Outputs structured records with HGIC ID, body sections, symptoms, causal agent, management recommendations, recommended products, authors, and related factsheets.

What It Does

Clemson HGIC is one of the largest university extension factsheet libraries in the US, with 2,500+ documents focused on plants grown in the Southeastern US. Each factsheet follows a consistent template with discrete sections: symptoms, causal agent (pathogen/pest binomial), management/control recommendations, and prevention. This actor parses that structure into machine-readable fields that plant-diagnosis apps, AI garden assistants, and agronomy SaaS platforms can use as grounding data.

The actor reads the Yoast sitemap index to enumerate all factsheet URLs, then crawls each page using impit with a Chrome TLS fingerprint, so no proxy or CAPTCHA solver is required.

Use Cases

Training data for plant-disease diagnosis models and AI garden assistants
Structured extension knowledge base for horticulture SaaS
Agronomy/landscaping content and reference data pipelines
Garden app content enrichment (symptom/treatment lookup)

Input

Field	Type	Default	Description
maxItems	integer	10	Maximum number of factsheets to scrape. Set it above the corpus size to scrape all 2,500+ factsheets.

Output

Each item represents one HGIC factsheet.

Field	Type	Description
factsheet_id	string	HGIC factsheet number, e.g. `HGIC 1223`
slug	string	URL slug, e.g. `turfgrasses-for-the-carolinas`
title	string	Factsheet title
category	string	Subject category: Diseases, Insects, Lawns, Soils, Vegetables, Trees & Shrubs, Flowers, Fruits & Nuts, Food Safety & Preservation, Human Health & Safety, General
plant_subjects	string	Comma-separated plant names from the title
problem_type	string	Problem type: `disease`, `insect`, `cultural`, or `none`
summary	string	First meaningful paragraph / introductory text
body_sections	string	JSON array of `{heading, text}` objects for the full structured body
symptoms	string	Symptom description text (for disease/pest/damage factsheets)
causal_agent	string	Pathogen or pest scientific/common name
management	string	Management and control recommendation text
prevention	string	Prevention and cultural practices text
recommended_products	string	Comma-separated trade names and chemistries found in management sections
related_factsheets	string	Comma-separated related factsheet links (`title
last_updated	string	Revision date as shown in factsheet metadata, e.g. `Feb 28, 2016`
authors	string	Comma-separated list of factsheet authors
images	string	Comma-separated image URLs embedded in the factsheet
factsheet_url	string	Canonical URL of the factsheet
scrapedAt	string	ISO-8601 timestamp when the record was scraped
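
Because every field is delivered as a string, consumers will usually want to normalize a few of them. The following is a minimal sketch, not part of the actor itself: the helper names are hypothetical, and the `last_updated` format is assumed from the single example `Feb 28, 2016`, so other formats fall back to `None`.

```python
import re
from datetime import date, datetime
from typing import Optional
from urllib.parse import urlparse


def slug_from_url(factsheet_url: str) -> str:
    """Return the last path segment of a /factsheet/<slug>/ URL."""
    path = urlparse(factsheet_url).path.strip("/")
    return path.rsplit("/", 1)[-1]


def factsheet_number(factsheet_id: str) -> Optional[int]:
    """Extract the numeric part of an ID such as 'HGIC 1223'."""
    match = re.search(r"\d+", factsheet_id)
    return int(match.group()) if match else None


def parse_last_updated(value: str) -> Optional[date]:
    """Parse 'Feb 28, 2016' style dates (assumed format; %b is locale-dependent)."""
    try:
        return datetime.strptime(value.strip(), "%b %d, %Y").date()
    except ValueError:
        # Unknown or empty format: leave it to the caller to decide what to do.
        return None
```

Returning `None` instead of raising keeps a batch job running when a factsheet carries a date in some other format.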

Sample Output

{
  "factsheet_id": "HGIC 1223",
  "slug": "turfgrasses-for-the-carolinas",
  "title": "Turfgrasses for the Carolinas",
  "category": "Lawns",
  "problem_type": "none",
  "summary": "For over 50 years the lawn has been an integral part of the landscape...",
  "body_sections": "[{\"heading\":\"Mowing\",\"text\":\"...\"}]",
  "last_updated": "Feb 28, 2016",
  "authors": "Millie Davenport, Gary Forrester",
  "factsheet_url": "https://hgic.clemson.edu/factsheet/turfgrasses-for-the-carolinas/"
}
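
Note that `body_sections` is a JSON array encoded as a string, and list-like fields such as `authors` are comma-separated strings. A minimal sketch of decoding a record on the consumer side follows; `split_list` and `decode_record` are hypothetical helper names, and splitting on commas assumes that individual values contain no commas themselves.

```python
import json
from typing import Any, Dict, List


def split_list(value: str) -> List[str]:
    """Split a comma-separated field, dropping empty entries."""
    return [part.strip() for part in (value or "").split(",") if part.strip()]


def decode_record(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the item with string-encoded fields decoded."""
    decoded = dict(item)
    # body_sections arrives as a JSON string; a missing value becomes an empty list.
    decoded["body_sections"] = json.loads(item.get("body_sections") or "[]")
    for key in ("authors", "images", "plant_subjects"):
        decoded[key] = split_list(item.get(key, ""))
    return decoded


sample = {
    "factsheet_id": "HGIC 1223",
    "body_sections": "[{\"heading\":\"Mowing\",\"text\":\"...\"}]",
    "authors": "Millie Davenport, Gary Forrester",
}
record = decode_record(sample)
print(record["body_sections"][0]["heading"])  # Mowing
print(record["authors"])  # ['Millie Davenport', 'Gary Forrester']
```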

Discovery Method

Reads the Yoast sitemap index at https://hgic.clemson.edu/sitemap.xml, filters for factsheet-sitemap.xml and factsheet-sitemap2.xml, and collects all /factsheet/<slug>/ URLs. The maxItems cap is applied before crawling begins.
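
The discovery step described above can be sketched roughly as follows. This is an illustration under assumptions rather than the actor's actual code: the function names are hypothetical, `fetch` stands in for whatever HTTP client is used, and the XML is assumed to use the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
INDEX_URL = "https://hgic.clemson.edu/sitemap.xml"
FACTSHEET_SITEMAPS = {"factsheet-sitemap.xml", "factsheet-sitemap2.xml"}


def loc_values(xml_text: str) -> List[str]:
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iterfind(".//sm:loc", SITEMAP_NS) if el.text]


def discover(fetch: Callable[[str], str], max_items: int) -> List[str]:
    """Collect factsheet URLs from the factsheet sitemaps, capped at max_items."""
    urls: List[str] = []
    for sitemap_url in loc_values(fetch(INDEX_URL)):
        # Keep only the two factsheet sitemaps named in the index.
        if sitemap_url.rsplit("/", 1)[-1] not in FACTSHEET_SITEMAPS:
            continue
        for page_url in loc_values(fetch(sitemap_url)):
            if "/factsheet/" in page_url:
                urls.append(page_url)
    # De-duplicate while preserving order, then apply the cap before any crawling.
    return list(dict.fromkeys(urls))[:max_items]
```

Passing `fetch` in as a parameter keeps the parsing logic testable against canned XML without touching the network.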

Performance

Memory: 128–256 MB
Throughput: ~200 pages/minute at default concurrency (5)
Full corpus (~2,500 factsheets): ~15–20 minutes
Timeout: 2-hour default (sufficient for full corpus)