HTML to Markdown & Reader Mode Extractor
AIDEVELOPER TOOLSNEWS
HTML to Markdown & Reader Mode Extractor
Convert any URL or raw HTML to clean markdown, plaintext, and reader-mode metadata. A FREE, one-call primitive for LLM pipelines, text-processing workflows, and content extraction.
What it does
Pass a URL or raw HTML, get back:
- Markdown — clean, ready for LLM ingestion
- Plaintext — stripped of all markup
- Reader-mode metadata — title, byline, published date, language, word count, reading time
- Images —
[{src, alt}]list - Links —
[{href, text}]list
All in a single run. No crawler config needed.
Extraction modes
| Mode | Best for | How it works |
|---|---|---|
readability (default) |
News articles, blog posts, documentation | Mozilla Readability strips navigation, ads, and boilerplate; then Turndown converts the clean article HTML to markdown |
turndown |
Preserving the full page structure | Direct Turndown conversion of the full HTML without article extraction — no stripping |
trafilatura |
Noisy pages, forums, user-generated content | Cheerio removes script/style/nav/header/footer elements, targets the main content area, then converts to markdown |
Input
| Field | Type | Default | Description |
|---|---|---|---|
inputs |
array | — | Required. Array of {url?: string, html?: string} objects |
mode |
string | readability |
Extraction mode: readability / turndown / trafilatura |
output |
string | markdown |
Output format: markdown / plaintext / both |
includeImages |
boolean | true |
Include images list in output |
preserveLinks |
boolean | true |
Include links list and preserve hyperlinks in markdown |
renderJs |
boolean | false |
Fetch JS-rendered pages via browser (requires more memory) |
maxItems |
integer | 15 |
Max inputs to process |
Example input
{
"inputs": [
{ "url": "https://en.wikipedia.org/wiki/Markdown" },
{ "html": "<html><body><h1>Hello</h1><p>World</p></body></html>" }
],
"mode": "readability",
"output": "markdown"
}
Output
One dataset record per input:
| Field | Type | Description |
|---|---|---|
input |
string (JSON) | Original input {url?, html?} |
title |
string | Extracted page title |
byline |
string | Author / byline |
publishedAt |
string | Published timestamp (ISO 8601, when detectable) |
language |
string | Detected language code (ISO 639-1) |
markdown |
string | Markdown output |
plaintext |
string | Plain-text output |
wordCount |
number | Word count |
readingTimeMin |
number | Estimated reading time (at 240 wpm) |
images |
string (JSON) | [{src, alt}] image references |
links |
string (JSON) | [{href, text}] hyperlinks |
mode |
string | readability / turndown / trafilatura |
finalUrl |
string | URL after redirects (URL inputs only) |
status |
string | success / timeout / error |
errorMsg |
string | Error message on failure |
Example output record
{
"input": "{\"url\":\"https://en.wikipedia.org/wiki/Markdown\"}",
"title": "Markdown",
"byline": "Contributors to Wikimedia projects",
"publishedAt": "2005-08-09T19:56:00Z",
"language": "en",
"markdown": "# Markdown\n\nMarkdown is a lightweight markup language...",
"wordCount": 2459,
"readingTimeMin": 11,
"images": "[{\"src\":\"/static/images/icons/enwiki-25.svg\",\"alt\":\"Wikipedia\"}]",
"links": "[{\"href\":\"/wiki/Main_Page\",\"text\":\"Main page\"}]",
"mode": "readability",
"finalUrl": "https://en.wikipedia.org/wiki/Markdown",
"status": "success",
"errorMsg": null
}
Pricing
FREE. Compute-billed only — you pay for the Apify platform run time, not per-record. Ideal as a preprocessing step before paid downstream actors (e.g., translation, summarization).
Use cases
- LLM context preparation — fetch and clean web pages before passing to GPT/Claude/Gemini
- RAG pipeline ingestion — extract clean text from arbitrary URLs at scale
- Content archiving — convert live pages to portable markdown
- Article extraction — strip ads and navigation from news/blog content
- Text analytics — feed clean plaintext into NLP models
Technical notes
- HTTP fetch via got-scraping (Chrome TLS fingerprint, follows redirects)
- Concurrency: up to 20 inputs processed in parallel per worker
- Memory: 512 MB default; increase to 1024 MB if
renderJs: true - Timeout: 30 seconds per URL fetch