OrbTop

Biketo China Cycling News & Product Scraper

Scrapes Biketo (美骑网), China's largest and longest-running cycling portal, for news articles, product reviews, and race coverage. The site has published continuously since 2008 and has accumulated over 56,000 articles across three channels. This actor enumerates the complete back-catalog through Biketo's sequential article ID scheme, which makes it well suited to building a Mandarin cycling corpus for LLM fine-tuning, market research, or trend detection.

What you get

Each scraped record contains:

Field          Description
articleId      Numeric article ID (e.g. 56323)
articleUrl     Full canonical URL
channel        Biketo's channel label in Chinese (e.g. 美骑快讯, 产品快讯, 赛事新闻)
title          Article headline in Chinese
tags           Comma-separated category tags from the article header
author         Author or source attribution
publishDate    Publish date-time (YYYY-MM-DD HH:MM:SS)
leadImage      URL of the first image in the article body
bodyText       Full article body text, whitespace-collapsed
viewCount      Page view count (integer)
commentCount   Comment count (integer)
scrapedAt      ISO-8601 scrape timestamp
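
As a rough illustration of how a record with these fields might be consumed downstream, the sketch below types the record, parses publishDate, and splits tags. The sample values and the example.invalid host are invented placeholders, and the assumption that tags are separated by plain ASCII commas is not confirmed by the actor.

```python
from datetime import datetime
from typing import List, TypedDict


class BiketoRecord(TypedDict):
    """Shape of one output record, following the field list above."""
    articleId: int
    articleUrl: str
    channel: str
    title: str
    tags: str
    author: str
    publishDate: str
    leadImage: str
    bodyText: str
    viewCount: int
    commentCount: int
    scrapedAt: str


def parse_publish_date(record: BiketoRecord) -> datetime:
    """Parse the documented YYYY-MM-DD HH:MM:SS format into a naive datetime."""
    return datetime.strptime(record["publishDate"], "%Y-%m-%d %H:%M:%S")


def split_tags(record: BiketoRecord) -> List[str]:
    """Split the tags string on commas (assumed separator), dropping empties."""
    return [tag.strip() for tag in record["tags"].split(",") if tag.strip()]


# Placeholder record: every value here is made up for illustration.
sample: BiketoRecord = {
    "articleId": 56323,
    "articleUrl": "https://example.invalid/news/56323.html",
    "channel": "美骑快讯",
    "title": "示例标题",
    "tags": "公路车, 新品",
    "author": "示例作者",
    "publishDate": "2024-01-02 03:04:05",
    "leadImage": "https://example.invalid/img/1.jpg",
    "bodyText": "示例正文",
    "viewCount": 10,
    "commentCount": 0,
    "scrapedAt": "2024-01-03T00:00:00Z",
}
```

The publish date carries no timezone in the documented format, so the parsed value is left naive here rather than guessing an offset.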

Input parameters

Parameter  Type     Default                      Description
startId    integer  1                            Article ID to start enumeration from
endId      integer  56500                        Article ID to stop at (inclusive)
channels   array    ["news","product","racing"]  Content channels to include
maxItems   integer  (none)                       Cap on total articles to return
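
The defaults above could be applied to a partial input roughly as follows. This is a sketch only: the function name normalize_input is hypothetical, and the validation rules (startId not greater than endId, maxItems at least 1, channel names limited to the three listed) are assumptions rather than documented behavior of the actor.

```python
from typing import Any, Dict

# Defaults copied from the parameter table; maxItems has no stated default.
DEFAULTS: Dict[str, Any] = {
    "startId": 1,
    "endId": 56500,
    "channels": ["news", "product", "racing"],
}
KNOWN_CHANNELS = {"news", "product", "racing"}


def normalize_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Merge user input over the defaults and run basic sanity checks."""
    merged = {**DEFAULTS, **raw}
    # Copy the list so callers never mutate the shared default.
    merged["channels"] = list(merged["channels"])

    if merged["startId"] > merged["endId"]:
        raise ValueError("startId must not exceed endId")

    unknown = set(merged["channels"]) - KNOWN_CHANNELS
    if unknown:
        raise ValueError(f"unknown channels: {sorted(unknown)}")

    max_items = merged.get("maxItems")
    if max_items is not None and max_items < 1:
        raise ValueError("maxItems must be at least 1 when given")

    return merged
```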

Content channels

  • news — Cycling news, industry coverage, product announcements (/news/<id>.html)
  • product — Gear reviews and product features (/product/<id>.html)
  • racing — Race coverage and results (/racing/<id>.html)

All three channels share the same sequential ID space. IDs are enumerated in parallel across selected channels; invalid IDs for a given channel are silently skipped.
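
The enumeration described above can be sketched as a generator that pairs every ID in the range with every selected channel path. The path templates come from the channel list; the function name candidate_urls and the base_url argument are hypothetical, since the site's host is not given here, and the actor's actual request ordering may differ.

```python
from typing import Iterator, Sequence, Tuple

# Path templates as listed under "Content channels".
CHANNEL_PATHS = {
    "news": "/news/{id}.html",
    "product": "/product/{id}.html",
    "racing": "/racing/{id}.html",
}


def candidate_urls(
    base_url: str, start_id: int, end_id: int, channels: Sequence[str]
) -> Iterator[Tuple[str, int, str]]:
    """Yield (channel, article_id, url) for each ID in [start_id, end_id].

    Every ID is tried under every selected channel, because the channels
    share one ID space and a given ID is valid in at most some of them.
    """
    for channel in channels:
        if channel not in CHANNEL_PATHS:
            raise ValueError(f"unknown channel: {channel}")
    root = base_url.rstrip("/")
    for article_id in range(start_id, end_id + 1):  # endId is inclusive
        for channel in channels:
            yield channel, article_id, root + CHANNEL_PATHS[channel].format(id=article_id)
```

A range of N IDs over three channels therefore produces 3N candidate requests, most of which are expected to be skipped as invalid for their channel.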

Usage examples

Full back-catalog (all channels, ~56k articles):

{
  "startId": 1,
  "endId": 56500,
  "channels": ["news", "product", "racing"]
}

Recent articles only (incremental update):

{
  "startId": 56200,
  "endId": 56500,
  "channels": ["news", "product", "racing"],
  "maxItems": 100
}

Product reviews only:

{
  "startId": 1,
  "endId": 56500,
  "channels": ["product"]
}

Notes

  • Charset: Biketo serves pages in GB2312. The actor transparently decodes to UTF-8 via Crawlee's built-in charset handling — all output fields are clean UTF-8 Chinese text.
  • Rate limiting: The actor uses moderate concurrency (5–15) with polite crawling. No proxy is required; the site is fully accessible to datacenter IPs.
  • Invalid IDs: Not every ID exists in every channel. The actor skips URLs that return 404 or lack an article heading — no error is logged for these, keeping run logs clean.
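  • Resumability: For large runs, split the job into several runs that each cover a narrow startId to endId range, so that a failed run only requires its own range to be repeated. For incremental updates, re-run with startId set just above the highest ID already collected.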
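
Two of the notes above lend themselves to a small sketch: splitting a large ID range into narrow per-run ranges, and decoding GB2312 bytes into a Python string. The helper names are hypothetical, and the decoder is a stand-in for illustration, not the actor's Crawlee-based handling. It uses Python's gb18030 codec, a superset of GB2312, on the assumption that real pages may contain characters outside strict GB2312.

```python
from typing import Iterable, List, Tuple


def split_id_range(start_id: int, end_id: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split [start_id, end_id] into inclusive (startId, endId) pairs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (low, min(low + chunk_size - 1, end_id))
        for low in range(start_id, end_id + 1, chunk_size)
    ]


def next_start_id(collected_ids: Iterable[int], default: int = 1) -> int:
    """Return the startId for an incremental run: one past the highest ID seen."""
    highest = max(collected_ids, default=default - 1)
    return highest + 1


def decode_page(raw: bytes) -> str:
    """Decode GB2312-family bytes to str, replacing undecodable sequences."""
    return raw.decode("gb18030", errors="replace")


# Each pair can be passed as startId/endId to a separate run.
chunks = split_id_range(1, 56500, 20000)
```

With a chunk size of 20000, the full range splits into (1, 20000), (20001, 40000), and (40001, 56500).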