OrbTop

Media Bias/Fact Check Source Credibility Scraper

Pull structured source-credibility records from Media Bias/Fact Check (MBFC), the largest media-source reliability database on the internet (about 7,000 source profiles).

For each outlet, returns:

  • Bias rating (normalized slug + verbatim MBFC label)
  • Factual reporting tier
  • MBFC credibility rating
  • Country and country press-freedom rating
  • Media type and traffic tier
  • Full History, Funded by / Ownership, and Analysis / Bias prose sections

Use Cases

  • Fact-checking pipelines — source-level trust layer alongside claim-level fact-checkers (PolitiFact, Snopes)
  • Disinformation research — identify and filter conspiracy-pseudoscience / questionable sources
  • Brand safety — screen media sources before advertising placement
  • RAG source-filtering — weight or exclude sources by credibility rating before ingestion
  • OSINT / media-literacy — annotate article URLs with source bias and credibility metadata

Input

Field               Type               Default    Description
mode                string (required)  all        all = full corpus; category = selected bias categories; seed = explicit profile URLs
categories          string[]           -          Bias categories when mode=category. Options: center, left-center, left, right-center, right, pro-science, conspiracy-pseudoscience, questionable, satire
seedUrls            string[]           -          Explicit MBFC profile URLs when mode=seed
maxItems            integer            200        Maximum source profiles to return (0 = unlimited)
includeBody         boolean            true       Include History / Funding / Analysis prose sections
proxyConfiguration  object             no proxy   Optional proxy (MBFC does not require proxy on plain UA)
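
The field rules above can be checked before a run is started. The following is a minimal sketch, not part of the actor: build_input is a hypothetical helper, and the rule that categories and seedUrls must be non-empty in their respective modes is an assumption inferred from the table.

```python
from typing import Any, Optional, Sequence

# Category slugs as listed in the input table.
VALID_CATEGORIES = {
    "center", "left-center", "left", "right-center", "right",
    "pro-science", "conspiracy-pseudoscience", "questionable", "satire",
}


def build_input(
    mode: str = "all",
    categories: Optional[Sequence[str]] = None,
    seed_urls: Optional[Sequence[str]] = None,
    max_items: int = 200,
    include_body: bool = True,
) -> dict[str, Any]:
    """Assemble an input object (hypothetical helper, not shipped with the actor)."""
    if mode not in {"all", "category", "seed"}:
        raise ValueError(f"unknown mode: {mode!r}")
    if max_items < 0:
        raise ValueError("maxItems must be >= 0 (0 means unlimited)")

    payload: dict[str, Any] = {
        "mode": mode,
        "maxItems": max_items,
        "includeBody": include_body,
    }
    if mode == "category":
        # Assumption: at least one valid category is needed in this mode.
        unknown = set(categories or []) - VALID_CATEGORIES
        if not categories or unknown:
            raise ValueError(f"invalid or missing categories: {sorted(unknown)}")
        payload["categories"] = list(categories)
    if mode == "seed":
        # Assumption: at least one profile URL is needed in this mode.
        if not seed_urls:
            raise ValueError("seedUrls must be provided when mode=seed")
        payload["seedUrls"] = list(seed_urls)
    return payload
```

For example, build_input("category", categories=["conspiracy-pseudoscience"], max_items=0) yields an input for an uncapped pull of one category.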

Output

Each record contains:

{
  "sourceName": "247Sports",
  "sourceUrl": "https://mediabiasfactcheck.com/247sports-bias-and-credibility/",
  "sourceHomepage": "https://247sports.com",
  "biasRating": "center",
  "rawBiasRating": "LEAST BIASED",
  "factualReporting": "HIGH",
  "credibilityRating": "HIGH CREDIBILITY",
  "country": "United States",
  "countryFreedomRating": "MOSTLY FREE",
  "mediaType": "Website",
  "trafficPopularity": "High Traffic",
  "categoryIndex": "center",
  "history": "247Sports, established in 2010 by Shannon Terry...",
  "fundedByOwnership": "247Sports is owned by CBS Interactive...",
  "analysisBias": "247Sports focuses on sports news...",
  "lastUpdated": "April 16, 2024",
  "reviewedBy": "",
  "articleJsonLd": { "...": "..." },
  "bodyMarkdown": "## History\n\n...",
  "status": "success",
  "errorMsg": ""
}
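
As an illustration of the OSINT / RAG use cases, records of this shape can be indexed by homepage domain and used to annotate article URLs. This is a sketch under assumptions: domain_of, build_index, and annotate are hypothetical helper names, and matching is done on the exact host with a leading "www." stripped, so other subdomains are not resolved.

```python
from typing import Optional
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_index(records: list[dict]) -> dict[str, dict]:
    """Map homepage domain -> record, skipping records that did not succeed."""
    index = {}
    for rec in records:
        if rec.get("status") != "success":
            continue
        domain = domain_of(rec.get("sourceHomepage", ""))
        if domain:
            index[domain] = rec
    return index


def annotate(article_url: str, index: dict[str, dict]) -> Optional[dict]:
    """Return bias/credibility metadata for an article URL, or None if unknown."""
    rec = index.get(domain_of(article_url))
    if rec is None:
        return None
    keys = ("sourceName", "biasRating", "factualReporting", "credibilityRating")
    return {key: rec.get(key, "") for key in keys}
```

A lookup that returns None means the outlet is not in the scraped set, which is different from the outlet being rated poorly; downstream filters may want to treat the two cases separately.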

Bias Rating Normalization

MBFC Label                 Normalized Slug
LEAST BIASED               center
LEFT-CENTER BIAS           left-center
LEFT BIAS                  left
RIGHT-CENTER BIAS          right-center
RIGHT BIAS                 right
PRO-SCIENCE                pro-science
CONSPIRACY-PSEUDOSCIENCE   conspiracy-pseudoscience
QUESTIONABLE SOURCE        questionable
SATIRE                     satire
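
The mapping above amounts to a dictionary lookup. A minimal sketch follows; normalize_bias is a hypothetical name, and the case and whitespace folding, as well as returning None for unlisted labels, are assumptions rather than documented actor behavior.

```python
from typing import Optional

# Verbatim MBFC label -> normalized slug, copied from the table above.
BIAS_SLUGS = {
    "LEAST BIASED": "center",
    "LEFT-CENTER BIAS": "left-center",
    "LEFT BIAS": "left",
    "RIGHT-CENTER BIAS": "right-center",
    "RIGHT BIAS": "right",
    "PRO-SCIENCE": "pro-science",
    "CONSPIRACY-PSEUDOSCIENCE": "conspiracy-pseudoscience",
    "QUESTIONABLE SOURCE": "questionable",
    "SATIRE": "satire",
}


def normalize_bias(raw_label: str) -> Optional[str]:
    """Return the slug for a raw MBFC label, or None if the label is not listed."""
    # Assumption: labels are compared case-insensitively with collapsed spaces.
    key = " ".join(raw_label.upper().split())
    return BIAS_SLUGS.get(key)
```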

Predefined Dataset Views

  • Source Credibility Table — sourceName, sourceHomepage, biasRating, factualReporting, credibilityRating, country
  • Low-Credibility Sources — focused view for conspiracy-pseudoscience, questionable, and low-credibility sources
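
The two views can be approximated on an exported dataset. This is a sketch only: the field list for the first view is taken from the bullet above, while the low-credibility predicate (bias slug in the two listed categories, or a credibilityRating that starts with "LOW") is an assumption about how the second view is defined.

```python
TABLE_FIELDS = (
    "sourceName", "sourceHomepage", "biasRating",
    "factualReporting", "credibilityRating", "country",
)
LOW_CRED_SLUGS = {"conspiracy-pseudoscience", "questionable"}


def credibility_table(records: list[dict]) -> list[dict]:
    """Project each record onto the Source Credibility Table columns."""
    return [{field: rec.get(field, "") for field in TABLE_FIELDS} for rec in records]


def low_credibility(records: list[dict]) -> list[dict]:
    """Keep records that look low-credibility under the assumed predicate."""
    return [
        rec for rec in records
        if rec.get("biasRating") in LOW_CRED_SLUGS
        or rec.get("credibilityRating", "").upper().startswith("LOW")
    ]
```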

Architecture

Pure HTTP two-level hierarchical crawl using CoreCrawler. No browser, no proxy required.

Level 1 (category):  Walk each MBFC bias-category index page
                     (/center/ /leftcenter/ /left/ /right-center/ /right/
                      /pro-science/ /conspiracy/ /questionable/ /satire/)
                     → parse <table> of source profile links
                     → link text "Source Name (domain.com)" → extract sourceHomepage

Level 2 (profile):   Fetch each source profile page
                     → parse "Detailed Report" block for structured fields
                     → extract History / Funded by / Analysis sections
                     → extract JSON-LD Article metadata
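
The Level 1 link-text step can be sketched with a regular expression. This is not the actor's parser: parse_link_text is a hypothetical name, the pattern assumes the domain sits in the last pair of parentheses, and prefixing "https://" is an assumption based on the sample record.

```python
import re
from typing import Optional

# "Source Name (domain.com)": name, then a parenthesized token containing a dot.
LINK_TEXT = re.compile(r"^(?P<name>.+?)\s*\((?P<domain>[^()\s]+\.[^()\s]+)\)\s*$")


def parse_link_text(text: str) -> tuple[str, Optional[str]]:
    """Split index-page link text into (source name, homepage URL or None)."""
    text = text.strip()
    match = LINK_TEXT.match(text)
    if not match:
        # No trailing "(domain)" part: keep the text as the name.
        return text, None
    return match.group("name"), "https://" + match.group("domain").lower()
```

Link text without a parenthesized domain is returned with a None homepage rather than guessed, leaving the profile page as the fallback source for that field.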

Rate-limit handling: CoreCrawler detects 429 responses and backs off exponentially. MBFC enforces per-IP rate limits on aggressive crawlers; the actor uses polite concurrency (3 concurrent requests max).
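
To make the back-off behavior concrete, here is a generic exponential back-off loop for 429 responses. It is a sketch of the general technique, not CoreCrawler's implementation: the base delay, cap, jitter, and retry count are assumed values, and fetch is a caller-supplied function returning (status, body).

```python
import random
import time
from typing import Callable


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delays that double on each retry, limited to `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def fetch_with_backoff(
    fetch: Callable[[str], tuple[int, str]],
    url: str,
    retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Call fetch(url); on HTTP 429, wait and retry up to `retries` more times."""
    for delay in backoff_delays(retries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        # Up to 10% random jitter so parallel workers do not retry in lockstep.
        sleep(delay + random.uniform(0, delay * 0.1))
    # Final attempt after the last wait; its result is returned as-is.
    return fetch(url)
```

Passing sleep as a parameter keeps the loop testable without real waiting; a real crawler would also honor a Retry-After header when the server sends one.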

Crawl Modes

mode=all (default): Walks all 9 bias-category index pages and fetches every source profile in the corpus (~7,000 profiles total), subject to maxItems. Because maxItems defaults to 200, set it to 0 for a full-archive download.

mode=category: Walks only the specified bias categories. Useful for targeted pulls (e.g., all conspiracy-pseudoscience sources for a disinfo pipeline).

mode=seed: Fetches only the explicitly provided MBFC profile URLs. Suitable for spot-lookups or updating specific source records.
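
The three modes differ only in which URLs seed the crawl. A minimal sketch of that selection follows; start_urls is a hypothetical name, the base URL is taken from the sample sourceUrl, and the slug-to-path pairing is an assumption made by matching the category list to the index paths in the Architecture section in order.

```python
BASE = "https://mediabiasfactcheck.com"

# Assumed pairing of category slug -> index path (matched by order).
INDEX_PATHS = {
    "center": "/center/",
    "left-center": "/leftcenter/",
    "left": "/left/",
    "right-center": "/right-center/",
    "right": "/right/",
    "pro-science": "/pro-science/",
    "conspiracy-pseudoscience": "/conspiracy/",
    "questionable": "/questionable/",
    "satire": "/satire/",
}


def start_urls(mode: str, categories=(), seed_urls=()) -> list[str]:
    """Return the URLs a crawl would start from for the given mode."""
    if mode == "all":
        return [BASE + path for path in INDEX_PATHS.values()]
    if mode == "category":
        # Raises KeyError for a slug that is not in the table.
        return [BASE + INDEX_PATHS[slug] for slug in categories]
    if mode == "seed":
        # Seed URLs are profile pages, so Level 1 is skipped entirely.
        return list(seed_urls)
    raise ValueError(f"unknown mode: {mode!r}")
```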

Performance

  • Default memory: 512 MB
  • Full corpus run: ~7,000 profiles at polite 1-3 req/s ≈ 2-4 hours
  • Category run (e.g., center ~500 profiles): ~15-30 minutes
  • Seed run (single URL): under 1 minute
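
For planning purposes, run time can be estimated from the profile count and the sustained request rate. The helper below is a back-of-the-envelope sketch with a hypothetical name; it counts only request time (profiles plus the 9 index pages) and ignores retries and parsing.

```python
def request_time_hours(profiles: int, req_per_sec: float, index_pages: int = 9) -> float:
    """Hours spent on requests alone at a constant sustained rate."""
    if req_per_sec <= 0:
        raise ValueError("req_per_sec must be positive")
    return (profiles + index_pages) / req_per_sec / 3600


# Full corpus at the two ends of the stated 1-3 req/s range.
slow = request_time_hours(7000, 1.0)   # a little under 2 hours
fast = request_time_hours(7000, 3.0)   # roughly two thirds of an hour
```

Pure request time at 1-3 req/s comes out below the 2-4 hours quoted above, so the quoted range presumably also covers back-off pauses and a lower sustained rate; that reading is an assumption, not a measured result.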