Cyclingnews Races & News Scraper

Scrapes pro-cycling news articles and race reports from Cyclingnews.com — the largest English-language cycling news outlet, owned by Future plc. Returns structured article data including headline, author, publish date, full body text, and a curated LATAM-cycling relevance layer.

The site is server-rendered with rich JSON-LD structured data on every article. No browser required. The scraper pulls from the Google News sitemap and the live /news/ listing page, so each run returns the freshest content without you managing pagination or archives.

What It Returns

Every record is one article. The dataset includes:

| Field | Type | Description |
|---|---|---|
| article_id | String | URL-slug identifier derived from the canonical URL |
| article_url | String | Canonical URL of the article |
| article_title | String | Headline (HTML entities decoded) |
| article_author | String | Primary author name |
| article_published_at | String | ISO-8601 publish timestamp |
| article_modified_at | String | ISO-8601 last-modified timestamp |
| article_body_text | String | Plain-text article body, up to 50,000 characters |
| article_summary | String | Sub-headline or deck |
| article_section | String | Section label (e.g. Racing, Women's Cycling, Teams & Riders) |
| article_tags | Array | Open Graph article:tag values |
| latam_relevant | Boolean | True if the article mentions a curated LATAM rider or race |
| latam_riders | Array | LATAM riders mentioned (Quintana, Bernal, Carapaz, Higuita, etc.) |
| latam_races | Array | LATAM races mentioned (Tour Colombia, Vuelta San Juan, etc.) |
| source_url | String | Always https://www.cyclingnews.com |
| scraped_at | String | ISO-8601 scrape timestamp |
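
For consumers who want type hints on the records, the field list above can be mirrored as a Python TypedDict. This is a minimal sketch. The class name ArticleRecord and the helper slug_from_url are hypothetical, and the slug rule shown (last non-empty path segment) is only an assumption about how article_id is derived from the canonical URL.

```python
from typing import TypedDict
from urllib.parse import urlparse


class ArticleRecord(TypedDict):
    """One scraped article, mirroring the documented output fields."""
    article_id: str
    article_url: str
    article_title: str
    article_author: str
    article_published_at: str   # ISO-8601
    article_modified_at: str    # ISO-8601
    article_body_text: str      # up to 50,000 characters
    article_summary: str
    article_section: str
    article_tags: list[str]
    latam_relevant: bool
    latam_riders: list[str]
    latam_races: list[str]
    source_url: str
    scraped_at: str             # ISO-8601


def slug_from_url(url: str) -> str:
    """Assumed slug rule: the last non-empty path segment of the URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1] if segments else ""
```

The helper is useful mainly for joining scraped records against URLs collected elsewhere; if the real article_id rule differs, the join key should be article_url instead.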

LATAM Enrichment

The latam_relevant flag and companion arrays are the value-add. The scraper checks every article against a curated list of ~30 Colombian, Ecuadorian, and other Latin American riders — Nairo Quintana, Egan Bernal, Richard Carapaz, Sergio Higuita, Santiago Buitrago, and others — plus ~25 LATAM races including Tour Colombia, Vuelta a Colombia, Vuelta San Juan, and Ruta de los Conquistadores. Downstream models and dashboards can filter on latam_relevant: true without re-reading the body text.
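
The matching step might look roughly like the sketch below. It is not the scraper's actual implementation: the two lists are small illustrative subsets of the curated lists, matching on surnames is an assumption, and the function names find_mentions and enrich are hypothetical.

```python
import re

# Illustrative subsets only; the real curated lists are longer (~30 riders, ~25 races).
LATAM_RIDERS = ["Quintana", "Bernal", "Carapaz", "Higuita", "Buitrago"]
LATAM_RACES = [
    "Tour Colombia",
    "Vuelta a Colombia",
    "Vuelta San Juan",
    "Ruta de los Conquistadores",
]


def find_mentions(text: str, names: list[str]) -> list[str]:
    """Return the names that appear in text as whole words, case-insensitively."""
    return [
        name
        for name in names
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE)
    ]


def enrich(body_text: str) -> dict:
    """Build the three LATAM fields for one article body."""
    riders = find_mentions(body_text, LATAM_RIDERS)
    races = find_mentions(body_text, LATAM_RACES)
    return {
        "latam_relevant": bool(riders or races),
        "latam_riders": riders,
        "latam_races": races,
    }
```

Whole-word matching avoids flagging an article because a listed name happens to occur inside a longer word, at the cost of missing misspellings and nicknames.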

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| maxItems | Integer | 10 | Maximum articles to scrape. Set to 0 to return every available article. The Google News sitemap refreshes every few hours and lists ~27 recent articles. |

How It Works

Each run:

  1. Fetches sitemap-news.xml (the Google News sitemap, always publicly accessible) and collects article URLs from the past 48–72 hours.
  2. Scrapes the live /news/ listing page for any articles not yet indexed in the sitemap.
  3. Deduplicates the combined URL list, caps it to maxItems, then fetches each article.
  4. Parses the JSON-LD NewsArticle schema for structured metadata and reads the body text from the #article-body element.
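
The parsing side of these steps can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the scraper's code: the function names are hypothetical, network fetching is left out, the sitemap is assumed to use the standard sitemaps.org namespace, and treating a non-positive maxItems as "no cap" follows the maxItems: 0 behaviour described under Coverage.

```python
import json
import re
import xml.etree.ElementTree as ET
from typing import Optional

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Step 1: collect every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def dedupe_and_cap(urls: list[str], max_items: int) -> list[str]:
    """Step 3: drop duplicates (keeping first-seen order), then cap."""
    unique = list(dict.fromkeys(urls))
    return unique if max_items <= 0 else unique[:max_items]


def news_article_from_html(html_text: str) -> Optional[dict]:
    """Step 4: return the first JSON-LD object whose @type is NewsArticle."""
    for block in JSON_LD_RE.findall(html_text):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "NewsArticle":
                return item
    return None
```

A regular expression is enough here because JSON-LD lives in a self-contained script element; extracting the #article-body text would normally call for a real HTML parser, which this sketch leaves out.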

The scraper uses impit, an HTTP client that presents a Chrome TLS fingerprint, so requests pass Fastly CDN edge checks without a browser. No proxy required.

Use Cases

  • Sports-analytics pipelines: feed article bodies into NLP models to extract race results, rider performance signals, and team news.
  • LLM training corpora: Cyclingnews is the canonical English-language source for pro-cycling narrative. The body text is editorial-quality, structured, and tagged.
  • LATAM cycling intelligence dashboards: the latam_riders and latam_races arrays make it simple to track Colombian Grand Tour coverage, contract news, and race reports without keyword scanning.
  • Journalism aggregators: combine with a scheduling trigger to catch every article within hours of publication.
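
As an example of the dashboard use case, the sketch below counts rider mentions across a batch of records using only the latam_relevant and latam_riders fields. The function name latam_rider_counts is hypothetical, and records is assumed to be a list of dicts shaped like the output schema.

```python
from collections import Counter


def latam_rider_counts(records: list[dict]) -> Counter:
    """Count how many LATAM-relevant articles mention each rider."""
    counts: Counter = Counter()
    for record in records:
        if record.get("latam_relevant"):
            counts.update(record.get("latam_riders", []))
    return counts
```

Calling latam_rider_counts(records).most_common(5) then gives the five most-covered riders in the batch without touching article_body_text.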

Coverage

Cyclingnews publishes 50–80 articles per week across racing, women's cycling, teams & riders, tech/gear, and features. The Google News sitemap covers a rolling window of roughly 48–72 hours, so run the scraper on a daily or twice-daily schedule to maintain a complete archive. A single run with maxItems: 0 captures all available articles (~27 from the news sitemap plus any extras from the listing page).
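
Because consecutive scheduled runs overlap, an archive built from them needs deduplication. A minimal sketch, assuming the archive is kept as a dict keyed by article_id and that a record from a newer run should replace the older copy (the function name merge_run is hypothetical):

```python
def merge_run(archive: dict[str, dict], new_records: list[dict]) -> dict[str, dict]:
    """Return a new archive with new_records merged in, keyed by article_id.

    A record from the newer run overwrites an existing record with the
    same article_id, so edits to an article are picked up.
    """
    merged = dict(archive)  # copy so the caller's archive is not mutated
    for record in new_records:
        merged[record["article_id"]] = record
    return merged
```

Letting the newer record win keeps the archive in step with article_modified_at changes; an archive that must preserve the first-seen version would skip existing keys instead.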

Limitations

The Google News sitemap covers recent articles only (~48–72 hours). Older articles are reachable only through paginated listing pages, and Future plc returns HTTP 403 for non-recent listing pages, so historical archives cannot be crawled. For historical ingestion, supply a list of known article URLs via a custom pipeline.


Data sourced from Cyclingnews.com (Future plc). Use in accordance with applicable terms of service.