Media Bias/Fact Check Source Credibility Scraper
Pull structured source-credibility records from Media Bias/Fact Check (MBFC), the largest media-source reliability database on the internet (about 7,000 profiles).
For each outlet, the actor returns:
- Bias rating (normalized slug + verbatim MBFC label)
- Factual reporting tier
- MBFC credibility rating
- Country and country press-freedom rating
- Media type and traffic tier
- Full History, Funded by / Ownership, and Analysis / Bias prose sections
Use Cases
- Fact-checking pipelines — source-level trust layer alongside claim-level fact-checkers (PolitiFact, Snopes)
- Disinformation research — identify and filter conspiracy-pseudoscience / questionable sources
- Brand safety — screen media sources before advertising placement
- RAG source-filtering — weight or exclude sources by credibility rating before ingestion
- OSINT / media-literacy — annotate article URLs with source bias and credibility metadata
Input
| Field | Type | Default | Description |
|---|---|---|---|
| mode | string (required) | all | all = full corpus; category = selected bias categories; seed = explicit profile URLs |
| categories | string[] | none | Bias categories when mode=category. Options: center, left-center, left, right-center, right, pro-science, conspiracy-pseudoscience, questionable, satire |
| seedUrls | string[] | none | Explicit MBFC profile URLs when mode=seed |
| maxItems | integer | 200 | Maximum source profiles to return (0 = unlimited) |
| includeBody | boolean | true | Include History / Funding / Analysis prose sections |
| proxyConfiguration | object | no proxy | Optional proxy (MBFC does not require a proxy when requests use a plain User-Agent) |
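As a rough illustration of how the fields above fit together, the sketch below builds an input payload in Python and rejects combinations the table rules out. The helper name `build_input` and its validation rules (for example, requiring at least one category when mode=category) are assumptions made for this example, not part of the actor.

```python
from typing import Any, Dict, List, Optional

VALID_MODES = {"all", "category", "seed"}
VALID_CATEGORIES = {
    "center", "left-center", "left", "right-center", "right",
    "pro-science", "conspiracy-pseudoscience", "questionable", "satire",
}


def build_input(
    mode: str = "all",
    categories: Optional[List[str]] = None,
    seed_urls: Optional[List[str]] = None,
    max_items: int = 200,
    include_body: bool = True,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble an input object using the documented field names."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    payload: Dict[str, Any] = {
        "mode": mode,
        "maxItems": max_items,      # 0 means unlimited
        "includeBody": include_body,
    }
    if mode == "category":
        unknown = set(categories or []) - VALID_CATEGORIES
        if not categories or unknown:
            raise ValueError(f"invalid categories: {sorted(unknown) or 'none given'}")
        payload["categories"] = list(categories)
    if mode == "seed":
        if not seed_urls:
            raise ValueError("mode=seed needs at least one profile URL")
        payload["seedUrls"] = list(seed_urls)
    return payload
```

For a disinformation pull, `build_input("category", categories=["conspiracy-pseudoscience", "questionable"], max_items=0)` would request both categories with no item cap.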
Output
Each record contains:
{
"sourceName": "247Sports",
"sourceUrl": "https://mediabiasfactcheck.com/247sports-bias-and-credibility/",
"sourceHomepage": "https://247sports.com",
"biasRating": "center",
"rawBiasRating": "LEAST BIASED",
"factualReporting": "HIGH",
"credibilityRating": "HIGH CREDIBILITY",
"country": "United States",
"countryFreedomRating": "MOSTLY FREE",
"mediaType": "Website",
"trafficPopularity": "High Traffic",
"categoryIndex": "center",
"history": "247Sports, established in 2010 by Shannon Terry...",
"fundedByOwnership": "247Sports is owned by CBS Interactive...",
"analysisBias": "247Sports focuses on sports news...",
"lastUpdated": "April 16, 2024",
"reviewedBy": "",
"articleJsonLd": { "...": "..." },
"bodyMarkdown": "## History\n\n...",
"status": "success",
"errorMsg": ""
}
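For the RAG source-filtering and URL-annotation use cases, records like the one above could be reduced to a set of trusted domains. The following is a minimal sketch: the function name `credible_domains` is hypothetical, and the default allow-list contains only the "HIGH CREDIBILITY" value seen in the sample record, since the full set of rating labels is not listed here.

```python
from typing import Dict, Iterable, Set
from urllib.parse import urlparse


def credible_domains(
    records: Iterable[Dict[str, object]],
    allowed: Iterable[str] = ("HIGH CREDIBILITY",),
) -> Set[str]:
    """Collect homepage domains of successfully scraped, allow-listed sources."""
    allowed_set = set(allowed)
    domains: Set[str] = set()
    for rec in records:
        if rec.get("status") != "success":
            continue  # skip records whose profile page failed to parse
        if rec.get("credibilityRating") not in allowed_set:
            continue
        host = urlparse(str(rec.get("sourceHomepage", ""))).netloc.lower()
        if host:
            domains.add(host[4:] if host.startswith("www.") else host)
    return domains
```

An ingestion step could then keep an article only if its URL's host appears in the returned set.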
Bias Rating Normalization
| MBFC Label | Normalized Slug |
|---|---|
| LEAST BIASED | center |
| LEFT-CENTER BIAS | left-center |
| LEFT BIAS | left |
| RIGHT-CENTER BIAS | right-center |
| RIGHT BIAS | right |
| PRO-SCIENCE | pro-science |
| CONSPIRACY-PSEUDOSCIENCE | conspiracy-pseudoscience |
| QUESTIONABLE SOURCE | questionable |
| SATIRE | satire |
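The table above translates directly into a lookup. This sketch mirrors it in Python; the function name `normalize_bias` and the choice to return `None` for unrecognized labels are assumptions for illustration, and the actor's own handling of unknown labels may differ.

```python
from typing import Optional

# Verbatim MBFC label -> normalized slug, copied from the table above.
BIAS_SLUGS = {
    "LEAST BIASED": "center",
    "LEFT-CENTER BIAS": "left-center",
    "LEFT BIAS": "left",
    "RIGHT-CENTER BIAS": "right-center",
    "RIGHT BIAS": "right",
    "PRO-SCIENCE": "pro-science",
    "CONSPIRACY-PSEUDOSCIENCE": "conspiracy-pseudoscience",
    "QUESTIONABLE SOURCE": "questionable",
    "SATIRE": "satire",
}


def normalize_bias(raw_label: str) -> Optional[str]:
    """Map a raw label to its slug, tolerating case and surrounding whitespace."""
    return BIAS_SLUGS.get(raw_label.strip().upper())
```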
Predefined Dataset Views
- Source Credibility Table — sourceName, sourceHomepage, biasRating, factualReporting, credibilityRating, country
- Low-Credibility Sources — focused view for conspiracy-pseudoscience, questionable, and low-credibility sources
Architecture
Pure HTTP two-level hierarchical crawl using CoreCrawler. No browser, no proxy required.
Level 1 (category): Walk each MBFC bias-category index page
(/center/ /leftcenter/ /left/ /right-center/ /right/
/pro-science/ /conspiracy/ /questionable/ /satire/)
→ parse <table> of source profile links
→ link text "Source Name (domain.com)" → extract sourceHomepage
Level 2 (profile): Fetch each source profile page
→ parse "Detailed Report" block for structured fields
→ extract History / Funded by / Analysis sections
→ extract JSON-LD Article metadata
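The Level 1 step that turns link text of the form "Source Name (domain.com)" into a sourceHomepage value can be sketched with a regular expression. This is an illustrative guess at the parsing rule, not the actor's code: the function name `parse_link_text` is hypothetical, and prefixing `https://` is an assumption based on the sample output record.

```python
import re
from typing import Optional, Tuple

# Capture the name, then the final parenthesized token as the domain.
LINK_TEXT = re.compile(r"^(?P<name>.*\S)\s*\((?P<domain>[^()\s]+)\)\s*$")


def parse_link_text(text: str) -> Tuple[str, Optional[str]]:
    """Split 'Source Name (domain.com)' into (name, homepage URL or None)."""
    match = LINK_TEXT.match(text.strip())
    if not match:
        return text.strip(), None  # no trailing "(domain)" group found
    return match.group("name"), "https://" + match.group("domain")
```

Because the name group is greedy, only the last parenthesized group is treated as the domain, so a name that itself contains parentheses keeps them.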
Rate-limit handling: CoreCrawler detects 429 responses and backs off exponentially. MBFC enforces per-IP rate limits on aggressive crawlers; the actor uses polite concurrency (3 concurrent requests max).
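To make the backoff behavior concrete, here is a generic exponential-backoff loop around an injected fetch function. It is a sketch of the general technique only; CoreCrawler's actual retry policy, delays, and retry limits are not documented here, so `fetch_with_backoff`, `max_retries`, and `base_delay` are all assumed names and values.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


def fetch_with_backoff(
    fetch: Callable[[str], Response],
    url: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry on HTTP 429, doubling the wait after each rate-limited attempt."""
    status, body = fetch(url)
    for attempt in range(max_retries):
        if status != 429:
            break
        sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
        status, body = fetch(url)
    return status, body
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; a production crawler would typically also add jitter and honor a `Retry-After` header if the server sends one.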
Crawl Modes
mode=all (default): Walks all 9 bias-category index pages and fetches every source profile in the corpus (~7,000 profiles total). Suitable for full-archive downloads.
mode=category: Walks only the specified bias categories. Useful for targeted pulls (e.g., all conspiracy-pseudoscience sources for a disinfo pipeline).
mode=seed: Fetches only the explicitly provided MBFC profile URLs. Suitable for spot-lookups or updating specific source records.
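The three modes differ only in which URLs the crawl starts from, which the sketch below makes explicit. The pairing of category slugs with index paths is an assumption inferred from the order in which both lists appear in this document, the base URL is taken from the sample record's sourceUrl, and `start_urls` is a hypothetical helper.

```python
from typing import List, Optional

BASE_URL = "https://mediabiasfactcheck.com"

# Assumed slug -> index path pairing (inferred from list order, not confirmed).
CATEGORY_PATHS = {
    "center": "/center/",
    "left-center": "/leftcenter/",
    "left": "/left/",
    "right-center": "/right-center/",
    "right": "/right/",
    "pro-science": "/pro-science/",
    "conspiracy-pseudoscience": "/conspiracy/",
    "questionable": "/questionable/",
    "satire": "/satire/",
}


def start_urls(
    mode: str,
    categories: Optional[List[str]] = None,
    seed_urls: Optional[List[str]] = None,
) -> List[str]:
    """Return the URLs a crawl would begin from for the given mode."""
    if mode == "all":
        return [BASE_URL + path for path in CATEGORY_PATHS.values()]
    if mode == "category":
        return [BASE_URL + CATEGORY_PATHS[slug] for slug in categories or []]
    if mode == "seed":
        return list(seed_urls or [])  # profile pages, so Level 1 is skipped
    raise ValueError(f"unknown mode: {mode!r}")
```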
Performance
- Default memory: 512 MB
- Full corpus run: ~7,000 profiles at polite 1-3 req/s ≈ 2-4 hours
- Category run (e.g., center ~500 profiles): ~15-30 minutes
- Seed run (single URL): under 1 minute
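For planning purposes, run time can be roughly bounded by dividing the profile count by the sustained request rate. The helper below is a back-of-the-envelope sketch with a hypothetical name; it ignores index-page fetches, retries, and backoff pauses, so it gives a lower bound rather than a prediction.

```python
def estimate_hours(profiles: int, requests_per_second: float) -> float:
    """Lower-bound run time in hours, assuming one request per profile."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return profiles / requests_per_second / 3600


full_corpus_slow = estimate_hours(7000, 1.0)  # about 1.9 hours
full_corpus_fast = estimate_hours(7000, 3.0)  # about 0.65 hours
```

The quoted 2-4 hour figure sits at or above the slow end of this bound, which would be consistent with time spent in backoff and on index pages, though that reading is an inference rather than something stated here.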