Biketo China Cycling News & Product Scraper
Scrapes Biketo (美骑网) — China's largest and longest-running cycling portal — for news articles, product reviews, and race coverage. The site has published continuously since 2008, accumulating over 56,000 articles across three channels. This actor enumerates the complete back-catalog using Biketo's sequential article ID scheme, making it ideal for building a Mandarin cycling corpus for LLM fine-tuning, market research, or trend detection.
What you get
Each scraped record contains:
| Field | Description |
|---|---|
| `articleId` | Numeric article ID (e.g. 56323) |
| `articleUrl` | Full canonical URL |
| `channel` | Biketo's channel label in Chinese (e.g. 美骑快讯, 产品快讯, 赛事新闻) |
| `title` | Article headline in Chinese |
| `tags` | Comma-separated category tags from the article header |
| `author` | Author or source attribution |
| `publishDate` | Publish date-time (YYYY-MM-DD HH:MM:SS) |
| `leadImage` | URL of the first image in the article body |
| `bodyText` | Full article body text, whitespace-collapsed |
| `viewCount` | Page view count (integer) |
| `commentCount` | Comment count (integer) |
| `scrapedAt` | ISO-8601 scrape timestamp |
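For downstream processing, the fields above can be converted into typed values. The following is a minimal sketch, assuming records arrive as plain dictionaries keyed by the field names listed here; the `Article` class, the `parse_record` helper, and the sample values are hypothetical and only illustrate the formats described.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Article:
    """Hypothetical typed view of a subset of the output fields."""
    article_id: int
    channel: str
    title: str
    tags: list[str]
    publish_date: datetime
    view_count: int
    comment_count: int


def parse_record(record: dict) -> Article:
    """Convert one raw dataset item into typed Python values."""
    # tags is documented as a single comma-separated string.
    tags = [t.strip() for t in record.get("tags", "").split(",") if t.strip()]
    return Article(
        article_id=int(record["articleId"]),
        channel=record["channel"],
        title=record["title"],
        tags=tags,
        # publishDate is documented as YYYY-MM-DD HH:MM:SS.
        publish_date=datetime.strptime(record["publishDate"], "%Y-%m-%d %H:%M:%S"),
        view_count=int(record["viewCount"]),
        comment_count=int(record["commentCount"]),
    )


# Placeholder values for illustration only, not real scraped data.
sample = {
    "articleId": 56323,
    "channel": "美骑快讯",
    "title": "示例标题",
    "tags": "公路车, 新品",
    "publishDate": "2024-01-15 09:30:00",
    "viewCount": 1200,
    "commentCount": 3,
}
article = parse_record(sample)
print(article.publish_date.isoformat(), article.tags)
```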
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startId` | integer | 1 | Article ID to start enumeration from |
| `endId` | integer | 56500 | Article ID to stop at (inclusive) |
| `channels` | array | `["news","product","racing"]` | Content channels to include |
| `maxItems` | integer | none | Cap on total articles to return |
Content channels
- `news`: Cycling news, industry coverage, product announcements (`/news/<id>.html`)
- `product`: Gear reviews and product features (`/product/<id>.html`)
- `racing`: Race coverage and results (`/racing/<id>.html`)
All three channels share the same sequential ID space. IDs are enumerated in parallel across selected channels; invalid IDs for a given channel are silently skipped.
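The enumeration described above can be sketched as a generator of candidate URLs. This is an illustrative approximation, not the actor's actual code: the `BASE_URL` value and the `candidate_urls` helper are assumptions, since only the path patterns are documented here.

```python
from typing import Iterator

# Assumption: site root. Only the path patterns below come from this README.
BASE_URL = "https://www.biketo.com"

CHANNEL_PATHS = {
    "news": "/news/{id}.html",
    "product": "/product/{id}.html",
    "racing": "/racing/{id}.html",
}


def candidate_urls(
    start_id: int, end_id: int, channels: list[str]
) -> Iterator[tuple[int, str, str]]:
    """Yield (article_id, channel, url) for every ID/channel combination."""
    for channel in channels:
        if channel not in CHANNEL_PATHS:
            raise ValueError(f"unknown channel: {channel}")
    # endId is inclusive, so extend the range by one.
    for article_id in range(start_id, end_id + 1):
        for channel in channels:
            path = CHANNEL_PATHS[channel].format(id=article_id)
            yield article_id, channel, BASE_URL + path


urls = list(candidate_urls(1, 2, ["news", "product"]))
for article_id, channel, url in urls:
    print(article_id, channel, url)
```

Because the channels share one ID space, most candidates for a given ID are expected to miss in all but one channel; a crawler built this way would treat a 404 or a page without an article heading as a skip rather than an error.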
Usage examples
Full back-catalog (all channels, ~56k articles):
```json
{
  "startId": 1,
  "endId": 56500,
  "channels": ["news", "product", "racing"]
}
```
Recent articles only (incremental update):
```json
{
  "startId": 56200,
  "endId": 56500,
  "channels": ["news", "product", "racing"],
  "maxItems": 100
}
```
Product reviews only:
```json
{
  "startId": 1,
  "endId": 56500,
  "channels": ["product"]
}
```
Notes
- Charset: Biketo serves pages in GB2312. The actor transparently decodes to UTF-8 via Crawlee's built-in charset handling — all output fields are clean UTF-8 Chinese text.
- Rate limiting: The actor uses moderate concurrency (5–15) with polite crawling. No proxy is required; the site is fully accessible to datacenter IPs.
- Invalid IDs: Not every ID exists in every channel. The actor skips URLs that return 404 or lack an article heading — no error is logged for these, keeping run logs clean.
- Resumability: For large runs, set `startId` and `endId` to narrow ranges. Re-run with updated `startId` for incremental updates.
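One way to apply the resumability note is to split a large ID range into smaller inputs and run them one at a time. The sketch below assumes a hypothetical `id_batches` helper and an arbitrary batch size of 10,000; only the `startId`, `endId`, and `channels` keys come from the input parameters documented above.

```python
import json
from typing import Iterator


def id_batches(
    start_id: int, end_id: int, batch_size: int, channels: list[str]
) -> Iterator[dict]:
    """Yield actor inputs that together cover start_id..end_id (inclusive)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    current = start_id
    while current <= end_id:
        # Clamp the last batch so it never runs past end_id.
        batch_end = min(current + batch_size - 1, end_id)
        yield {"startId": current, "endId": batch_end, "channels": channels}
        current = batch_end + 1


batches = list(id_batches(1, 56500, 10000, ["news", "product", "racing"]))
for batch in batches:
    print(json.dumps(batch))
```

If a run fails partway, only the affected batch needs to be repeated, and an incremental update can start from the highest `endId` already completed plus one.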