HTML to Markdown & Reader Mode Extractor

Convert any URL or raw HTML to clean markdown, plaintext, and reader-mode metadata. A FREE, one-call primitive for LLM pipelines, text-processing workflows, and content extraction.

What it does

Pass a URL or raw HTML, get back:

Markdown — clean, ready for LLM ingestion
Plaintext — stripped of all markup
Reader-mode metadata — title, byline, published date, language, word count, reading time
Images — [{src, alt}] list
Links — [{href, text}] list

All in a single run. No crawler config needed.

Extraction modes

Mode	Best for	How it works
`readability` (default)	News articles, blog posts, documentation	Mozilla Readability strips navigation, ads, and boilerplate; then Turndown converts the clean article HTML to markdown
`turndown`	Preserving the full page structure	Direct Turndown conversion of the full HTML without article extraction — no stripping
`trafilatura`	Noisy pages, forums, user-generated content	Cheerio removes script/style/nav/header/footer elements, targets the main content area, then converts to markdown
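
The element-stripping step described for the `trafilatura` mode can be illustrated with a small Python sketch. This is not the actor's implementation (the table says it uses Cheerio); it is a stand-in built on the standard-library `html.parser`, and the names `BoilerplateStripper` and `strip_boilerplate` are hypothetical. It only drops text inside script/style/nav/header/footer elements. It does not target a main content area or produce markdown.

```python
from html.parser import HTMLParser

# Elements the `trafilatura` mode is described as removing.
BOILERPLATE_TAGS = {"script", "style", "nav", "header", "footer"}


class BoilerplateStripper(HTMLParser):
    """Collects visible text, skipping anything nested in boilerplate tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # >0 while inside a boilerplate element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def strip_boilerplate(html: str) -> str:
    """Return the text left after removing boilerplate elements."""
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `strip_boilerplate(page_html)` on a page with a nav bar and footer returns only the text outside those elements, one chunk per line.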

Input

Field	Type	Default	Description
`inputs`	array	—	Required. Array of `{url?: string, html?: string}` objects
`mode`	string	`readability`	Extraction mode: `readability` / `turndown` / `trafilatura`
`output`	string	`markdown`	Output format: `markdown` / `plaintext` / `both`
`includeImages`	boolean	`true`	Include `images` list in output
`preserveLinks`	boolean	`true`	Include `links` list and preserve hyperlinks in markdown
`renderJs`	boolean	`false`	Fetch JS-rendered pages via browser (requires more memory)
`maxItems`	integer	`15`	Max inputs to process

Example input

{
  "inputs": [
    { "url": "https://en.wikipedia.org/wiki/Markdown" },
    { "html": "<html><body><h1>Hello</h1><p>World</p></body></html>" }
  ],
  "mode": "readability",
  "output": "markdown"
}
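
For callers who assemble this payload programmatically, a minimal Python sketch of a builder that checks values against the table above may help. The function name `build_run_input` is hypothetical, and the rule that every entry needs at least one of `url` or `html` is an assumption (the table marks both as optional).

```python
# Allowed values taken from the Input table.
ALLOWED_MODES = {"readability", "turndown", "trafilatura"}
ALLOWED_OUTPUTS = {"markdown", "plaintext", "both"}


def build_run_input(inputs, mode="readability", output="markdown",
                    include_images=True, preserve_links=True,
                    render_js=False, max_items=15):
    """Build the actor input dict, using the documented defaults."""
    if not inputs:
        raise ValueError("`inputs` is required and must not be empty")
    for item in inputs:
        # Assumption: each entry needs at least one of `url` or `html`.
        if not (item.get("url") or item.get("html")):
            raise ValueError(f"input needs a `url` or `html` field: {item!r}")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if output not in ALLOWED_OUTPUTS:
        raise ValueError(f"unknown output format: {output!r}")
    return {
        "inputs": list(inputs),
        "mode": mode,
        "output": output,
        "includeImages": include_images,
        "preserveLinks": preserve_links,
        "renderJs": render_js,
        "maxItems": max_items,
    }
```

Serializing the result with `json.dumps` gives a payload shaped like the example input, with the remaining fields filled in from their defaults.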

Output

One dataset record per input:

Field	Type	Description
`input`	string (JSON)	Original input `{url?, html?}`
`title`	string	Extracted page title
`byline`	string	Author / byline
`publishedAt`	string	Published timestamp (ISO 8601, when detectable)
`language`	string	Detected language code (ISO 639-1)
`markdown`	string	Markdown output
`plaintext`	string	Plain-text output
`wordCount`	number	Word count
`readingTimeMin`	number	Estimated reading time in minutes (at 240 words per minute)
`images`	string (JSON)	`[{src, alt}]` image references
`links`	string (JSON)	`[{href, text}]` hyperlinks
`mode`	string	`readability` / `turndown` / `trafilatura`
`finalUrl`	string	URL after redirects (URL inputs only)
`status`	string	`success` / `timeout` / `error`
`errorMsg`	string	Error message on failure
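
As a worked example of how `readingTimeMin` relates to `wordCount`, here is a short Python sketch. The 240 wpm rate comes from the table; rounding up to whole minutes is an assumption, chosen because it agrees with the example output record (2459 words, 11 minutes). The function name `reading_time_min` is hypothetical.

```python
import math

WORDS_PER_MINUTE = 240  # rate stated in the Output table


def reading_time_min(word_count: int) -> int:
    """Estimate reading time in whole minutes.

    Assumption: the value is rounded up, not to the nearest minute.
    """
    return math.ceil(word_count / WORDS_PER_MINUTE)


# 2459 / 240 is about 10.25, which rounds up to 11.
print(reading_time_min(2459))
```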

Example output record

{
  "input": "{\"url\":\"https://en.wikipedia.org/wiki/Markdown\"}",
  "title": "Markdown",
  "byline": "Contributors to Wikimedia projects",
  "publishedAt": "2005-08-09T19:56:00Z",
  "language": "en",
  "markdown": "# Markdown\n\nMarkdown is a lightweight markup language...",
  "wordCount": 2459,
  "readingTimeMin": 11,
  "images": "[{\"src\":\"/static/images/icons/enwiki-25.svg\",\"alt\":\"Wikipedia\"}]",
  "links": "[{\"href\":\"/wiki/Main_Page\",\"text\":\"Main page\"}]",
  "mode": "readability",
  "finalUrl": "https://en.wikipedia.org/wiki/Markdown",
  "status": "success",
  "errorMsg": null
}
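
Because `input`, `images`, and `links` arrive as JSON-encoded strings, and the example shows relative paths such as `/wiki/Main_Page`, a consumer will usually want to decode them and resolve the paths. The sketch below is one way to do that in Python; `decode_record` is a hypothetical helper, and resolving against `finalUrl` is an assumption about what the relative paths are relative to.

```python
import json
from urllib.parse import urljoin


def decode_record(record: dict) -> dict:
    """Decode JSON-string fields and make image/link URLs absolute."""
    decoded = dict(record)

    # These fields are typed `string (JSON)` in the Output table.
    for key in ("input", "images", "links"):
        value = decoded.get(key)
        if isinstance(value, str):
            decoded[key] = json.loads(value)

    # Assumption: relative paths resolve against the post-redirect URL.
    base = decoded.get("finalUrl")
    for key, attr in (("images", "src"), ("links", "href")):
        items = decoded.get(key)
        if base and isinstance(items, list):
            decoded[key] = [
                {**item, attr: urljoin(base, item[attr])} for item in items
            ]
    return decoded
```

Records produced from raw `html` inputs have no `finalUrl`, so their paths are left as extracted.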

Pricing

FREE. Billing is compute-only: you pay for Apify platform run time, not per record. Ideal as a preprocessing step before paid downstream actors (e.g., translation, summarization).

Use cases

LLM context preparation — fetch and clean web pages before passing to GPT/Claude/Gemini
RAG pipeline ingestion — extract clean text from arbitrary URLs at scale
Content archiving — convert live pages to portable markdown
Article extraction — strip ads and navigation from news/blog content
Text analytics — feed clean plaintext into NLP models

Technical notes

HTTP fetch via got-scraping (Chrome TLS fingerprint, follows redirects)
Concurrency: up to 20 inputs processed in parallel per worker
Memory: 512 MB default; increase to 1024 MB when `renderJs` is `true`
Timeout: 30 seconds per URL fetch
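
The concurrency and timeout limits above can be sketched with a bounded worker pattern. This Python example is only an illustration of the idea, not the actor's code: `process_all` and the `fetch` callable are hypothetical, the `html` key on the result is invented for the sketch, and the status values mirror the `status` field in the Output table.

```python
import asyncio

MAX_CONCURRENCY = 20   # "up to 20 inputs processed in parallel"
FETCH_TIMEOUT_S = 30   # "30 seconds per URL fetch"


async def process_all(inputs, fetch, max_concurrency=MAX_CONCURRENCY,
                      timeout_s=FETCH_TIMEOUT_S):
    """Run `fetch(item)` for every input with bounded parallelism.

    `fetch` is a caller-supplied coroutine function (hypothetical here).
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process_one(item):
        async with semaphore:  # at most `max_concurrency` fetches at once
            try:
                html = await asyncio.wait_for(fetch(item), timeout=timeout_s)
                return {"input": item, "status": "success",
                        "html": html, "errorMsg": None}
            except asyncio.TimeoutError:
                return {"input": item, "status": "timeout",
                        "errorMsg": f"fetch exceeded {timeout_s} s"}
            except Exception as exc:
                return {"input": item, "status": "error",
                        "errorMsg": str(exc)}

    # gather preserves input order, giving one result per input.
    return await asyncio.gather(*(process_one(i) for i in inputs))
```

A slow or failing input then yields a `timeout` or `error` record instead of aborting the whole batch.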
