Media Bias/Fact Check Source Credibility Scraper

Pull structured source-credibility records from Media Bias/Fact Check (MBFC), the largest media-source reliability database on the internet (7,000+ profiles).

For each outlet, returns:

Bias rating (normalized slug + verbatim MBFC label)
Factual reporting tier
MBFC credibility rating
Country and country press-freedom rating
Media type and traffic tier
Full History, Funded by / Ownership, and Analysis / Bias prose sections

Use Cases

Fact-checking pipelines — source-level trust layer alongside claim-level fact-checkers (PolitiFact, Snopes)
Disinformation research — identify and filter conspiracy-pseudoscience / questionable sources
Brand safety — screen media sources before advertising placement
RAG source-filtering — weight or exclude sources by credibility rating before ingestion
OSINT / media-literacy — annotate article URLs with source bias and credibility metadata

Input

Field	Type	Default	Description
`mode`	string (required)	`all`	`all` = full corpus; `category` = selected bias categories; `seed` = explicit profile URLs
`categories`	string[]	—	Bias categories when `mode=category`. Options: `center`, `left-center`, `left`, `right-center`, `right`, `pro-science`, `conspiracy-pseudoscience`, `questionable`, `satire`
`seedUrls`	string[]	—	Explicit MBFC profile URLs when `mode=seed`
`maxItems`	integer	`200`	Maximum source profiles to return (0 = unlimited)
`includeBody`	boolean	`true`	Include History / Funding / Analysis prose sections
`proxyConfiguration`	object	no proxy	Optional proxy settings (MBFC does not require a proxy for requests sent with a plain User-Agent)
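
As an illustration of how the fields above combine, here is a minimal Python sketch that assembles an input object and checks it against the table. The helper `build_input` and its validation rules are assumptions made for this example and are not part of the actor; only the field names, defaults, and category slugs come from the table.

```python
from typing import Any, Dict, Iterable, Optional

# Category slugs copied from the `categories` row of the input table.
VALID_CATEGORIES = {
    "center", "left-center", "left", "right-center", "right",
    "pro-science", "conspiracy-pseudoscience", "questionable", "satire",
}


def build_input(mode: str = "all",
                categories: Optional[Iterable[str]] = None,
                seed_urls: Optional[Iterable[str]] = None,
                max_items: int = 200,
                include_body: bool = True) -> Dict[str, Any]:
    """Hypothetical helper: assemble and sanity-check an input object."""
    if mode not in {"all", "category", "seed"}:
        raise ValueError(f"unknown mode: {mode!r}")
    if max_items < 0:
        raise ValueError("maxItems must be >= 0 (0 = unlimited)")

    payload: Dict[str, Any] = {
        "mode": mode,
        "maxItems": max_items,
        "includeBody": include_body,
    }
    if mode == "category":
        chosen = list(categories or [])
        unknown = sorted(set(chosen) - VALID_CATEGORIES)
        if not chosen or unknown:
            raise ValueError(f"bad categories: {unknown or 'none given'}")
        payload["categories"] = chosen
    elif mode == "seed":
        urls = list(seed_urls or [])
        if not urls:
            raise ValueError("mode=seed needs at least one profile URL")
        payload["seedUrls"] = urls
    return payload
```

For example, `build_input("category", categories=["questionable"], max_items=50)` yields an object with `mode`, `maxItems`, `includeBody`, and `categories` set, while a misspelled slug raises `ValueError` before any run is started.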

Output

Each record contains:

{
  "sourceName": "247Sports",
  "sourceUrl": "https://mediabiasfactcheck.com/247sports-bias-and-credibility/",
  "sourceHomepage": "https://247sports.com",
  "biasRating": "center",
  "rawBiasRating": "LEAST BIASED",
  "factualReporting": "HIGH",
  "credibilityRating": "HIGH CREDIBILITY",
  "country": "United States",
  "countryFreedomRating": "MOSTLY FREE",
  "mediaType": "Website",
  "trafficPopularity": "High Traffic",
  "categoryIndex": "center",
  "history": "247Sports, established in 2010 by Shannon Terry...",
  "fundedByOwnership": "247Sports is owned by CBS Interactive...",
  "analysisBias": "247Sports focuses on sports news...",
  "lastUpdated": "April 16, 2024",
  "reviewedBy": "",
  "articleJsonLd": { "...": "..." },
  "bodyMarkdown": "## History\n\n...",
  "status": "success",
  "errorMsg": ""
}
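
To show how records of this shape might feed a source filter (for instance the RAG use case), here is a small Python sketch. The function names and the allow-list policy are assumptions for illustration; the only rating value taken from this document is `HIGH CREDIBILITY`, so any other labels you accept should be checked against real output first.

```python
from typing import Dict, Iterable, Set

# Assumption: the caller's own policy. Only "HIGH CREDIBILITY" appears in
# the sample record; add other labels after inspecting real output.
TRUSTED_CREDIBILITY = {"HIGH CREDIBILITY"}


def is_trusted(record: Dict) -> bool:
    """Keep only successfully scraped records with an allowed rating."""
    return (
        record.get("status") == "success"
        and record.get("credibilityRating") in TRUSTED_CREDIBILITY
    )


def trusted_homepages(records: Iterable[Dict]) -> Set[str]:
    """Collect homepages of trusted sources, skipping empty values."""
    return {
        r["sourceHomepage"]
        for r in records
        if is_trusted(r) and r.get("sourceHomepage")
    }
```

Checking `status` first means that records which failed to scrape (and therefore carry empty rating fields) are never treated as trusted by accident.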

Bias Rating Normalization

MBFC Label	Normalized Slug
LEAST BIASED	`center`
LEFT-CENTER BIAS	`left-center`
LEFT BIAS	`left`
RIGHT-CENTER BIAS	`right-center`
RIGHT BIAS	`right`
PRO-SCIENCE	`pro-science`
CONSPIRACY-PSEUDOSCIENCE	`conspiracy-pseudoscience`
QUESTIONABLE SOURCE	`questionable`
SATIRE	`satire`
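
The table above can be expressed as a simple lookup. The following Python sketch is one way to do it; the function name `normalize_bias` and the whitespace/case handling are assumptions for this example, not a description of the actor's internals.

```python
from typing import Optional

# Verbatim MBFC label -> normalized slug, copied from the table above.
BIAS_SLUGS = {
    "LEAST BIASED": "center",
    "LEFT-CENTER BIAS": "left-center",
    "LEFT BIAS": "left",
    "RIGHT-CENTER BIAS": "right-center",
    "RIGHT BIAS": "right",
    "PRO-SCIENCE": "pro-science",
    "CONSPIRACY-PSEUDOSCIENCE": "conspiracy-pseudoscience",
    "QUESTIONABLE SOURCE": "questionable",
    "SATIRE": "satire",
}


def normalize_bias(raw_label: str) -> Optional[str]:
    """Map a raw label to its slug; return None for unknown labels."""
    # Uppercase and collapse runs of whitespace before the lookup.
    key = " ".join(raw_label.upper().split())
    return BIAS_SLUGS.get(key)
```

Returning `None` for an unrecognized label, rather than guessing, keeps the verbatim `rawBiasRating` as the source of truth when MBFC introduces a label the table does not cover.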

Predefined Dataset Views

Source Credibility Table — sourceName, sourceHomepage, biasRating, factualReporting, credibilityRating, country
Low-Credibility Sources — focused view for conspiracy-pseudoscience, questionable, and low-credibility sources

Architecture

Pure HTTP two-level hierarchical crawl using CoreCrawler. No browser, no proxy required.

Level 1 (category):  Walk each MBFC bias-category index page
                     (/center/ /leftcenter/ /left/ /right-center/ /right/
                      /pro-science/ /conspiracy/ /questionable/ /satire/)
                     → parse <table> of source profile links
                     → link text "Source Name (domain.com)" → extract sourceHomepage

Level 2 (profile):   Fetch each source profile page
                     → parse "Detailed Report" block for structured fields
                     → extract History / Funded by / Analysis sections
                     → extract JSON-LD Article metadata
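
The Level 1 step that turns link text of the form "Source Name (domain.com)" into a homepage can be sketched with a regular expression. This is an illustrative Python version, not the actor's code: the pattern, the function name `parse_link_text`, and the choice to prefix `https://` are all assumptions.

```python
import re
from typing import Optional, Tuple

# A name followed by a final parenthesized token that contains a dot.
LINK_TEXT_RE = re.compile(
    r"^\s*(?P<name>.+?)\s*\((?P<domain>[^()\s]+\.[^()\s]+)\)\s*$"
)


def parse_link_text(text: str) -> Tuple[str, Optional[str]]:
    """Split 'Source Name (domain.com)' into (name, homepage URL)."""
    match = LINK_TEXT_RE.match(text)
    if not match:
        # No trailing "(domain)" part: keep the name, leave homepage unset.
        return text.strip(), None
    # Assumption: homepages are emitted with an https:// scheme.
    return match.group("name"), "https://" + match.group("domain").lower()
```

Requiring a dot inside the final parentheses means a name such as "News (UK) (news.co.uk)" keeps "(UK)" as part of the source name instead of mistaking it for a domain.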

Rate-limit handling: CoreCrawler detects 429 responses and backs off exponentially. MBFC enforces per-IP rate limits on aggressive crawlers; the actor uses polite concurrency (3 concurrent requests max).
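
For readers unfamiliar with the pattern, exponential backoff on 429 responses can be sketched as follows. This is a generic Python illustration and does not reproduce CoreCrawler's implementation; the function names, the default base delay, the cap, and the retry count are assumptions.

```python
import random
import time
from typing import Callable, Iterator, Tuple


def backoff_delays(max_retries: int = 5, base: float = 1.0,
                   cap: float = 60.0, jitter: float = 0.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield delays base, 2*base, 4*base, ... capped at `cap` seconds."""
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        # Optional jitter adds up to `jitter * delay` extra seconds.
        yield delay + jitter * rng() * delay


def fetch_with_backoff(fetch: Callable[[str], Tuple[int, str]], url: str,
                       sleep: Callable[[float], None] = time.sleep,
                       **backoff_kwargs) -> Tuple[int, str]:
    """Call `fetch(url)` and retry with growing delays while it returns 429.

    `fetch` is any callable returning (status_code, body).
    """
    status, body = fetch(url)
    for delay in backoff_delays(**backoff_kwargs):
        if status != 429:
            break
        sleep(delay)
        status, body = fetch(url)
    return status, body
```

Passing `fetch` and `sleep` in as callables keeps the sketch free of any particular HTTP library and makes the retry schedule easy to test without real network traffic or waiting.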

Crawl Modes

mode=all (default): Walks all 9 bias-category index pages and fetches every source profile in the corpus (~7,000 profiles total). Suitable for full-archive downloads.

mode=category: Walks only the specified bias categories. Useful for targeted pulls (e.g., all conspiracy-pseudoscience sources for a disinfo pipeline).

mode=seed: Fetches only the explicitly provided MBFC profile URLs. Suitable for spot-lookups or updating specific source records.
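
The three modes differ only in which URLs seed the crawl. The Python sketch below illustrates that selection under two assumptions: the slug-to-path pairing is inferred from the order in which the categories and index paths are listed in this document, and the base URL is taken from the sample `sourceUrl`. The names `CATEGORY_PATHS` and `start_urls` are hypothetical.

```python
from typing import Iterable, List, Tuple

BASE_URL = "https://mediabiasfactcheck.com"

# Assumption: each category slug pairs with the index path listed in the
# same position in the Architecture section.
CATEGORY_PATHS = {
    "center": "/center/",
    "left-center": "/leftcenter/",
    "left": "/left/",
    "right-center": "/right-center/",
    "right": "/right/",
    "pro-science": "/pro-science/",
    "conspiracy-pseudoscience": "/conspiracy/",
    "questionable": "/questionable/",
    "satire": "/satire/",
}


def start_urls(mode: str, categories: Iterable[str] = (),
               seed_urls: Iterable[str] = ()) -> List[Tuple[str, str]]:
    """Return (level, url) pairs that seed the crawl for a given mode."""
    if mode == "all":
        return [("category", BASE_URL + p) for p in CATEGORY_PATHS.values()]
    if mode == "category":
        return [("category", BASE_URL + CATEGORY_PATHS[c]) for c in categories]
    if mode == "seed":
        # Seed URLs skip Level 1 and go straight to profile parsing.
        return [("profile", url) for url in seed_urls]
    raise ValueError(f"unknown mode: {mode!r}")
```

Tagging each URL with its level lets a single queue carry both index pages and profile pages, with the handler chosen by the tag.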

Performance

Default memory: 512 MB
Full corpus run: ~7,000 profiles at polite 1-3 req/s ≈ 2-4 hours
Category run (e.g., center ~500 profiles): ~15-30 minutes
Seed run (single URL): under 1 minute
