OrbTop

HTML to Markdown & Reader Mode Extractor

Convert any URL or raw HTML to clean markdown, plaintext, and reader-mode metadata. A FREE, one-call primitive for LLM pipelines, text-processing workflows, and content extraction.

What it does

Pass a URL or raw HTML, get back:

  • Markdown: clean and ready for LLM ingestion
  • Plaintext: stripped of all markup
  • Reader-mode metadata: title, byline, published date, language, word count, reading time
  • Images: a list of {src, alt} objects
  • Links: a list of {href, text} objects

All in a single run. No crawler config needed.

Extraction modes

| Mode | Best for | How it works |
| --- | --- | --- |
| readability (default) | News articles, blog posts, documentation | Mozilla Readability strips navigation, ads, and boilerplate; Turndown then converts the clean article HTML to markdown |
| turndown | Preserving the full page structure | Direct Turndown conversion of the full HTML, with no article extraction and no stripping |
| trafilatura | Noisy pages, forums, user-generated content | Cheerio removes script/style/nav/header/footer elements, targets the main content area, then converts to markdown |
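
To make the stripping step in the trafilatura row concrete, here is a minimal sketch. The actor is described as using Cheerio; this example swaps in Python's standard-library html.parser purely for illustration. It covers only the removal of script/style/nav/header/footer content and does not target a main content area or produce markdown. The names BoilerplateStripper and strip_boilerplate are hypothetical.

```python
from html.parser import HTMLParser

# Elements the "trafilatura" mode is described as removing.
STRIPPED_TAGS = {"script", "style", "nav", "header", "footer"}


class BoilerplateStripper(HTMLParser):
    """Collects text that sits outside any stripped element (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a stripped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in STRIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when no stripped element is open.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def strip_boilerplate(html):
    """Return the text of `html` with boilerplate elements removed."""
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling strip_boilerplate on a page with a nav bar, a heading, and a paragraph returns only the heading and paragraph text, one chunk per line.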

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| inputs | array | | Required. Array of {url?: string, html?: string} objects |
| mode | string | readability | Extraction mode: readability / turndown / trafilatura |
| output | string | markdown | Output format: markdown / plaintext / both |
| includeImages | boolean | true | Include images list in output |
| preserveLinks | boolean | true | Include links list and preserve hyperlinks in markdown |
| renderJs | boolean | false | Fetch JS-rendered pages via browser (requires more memory) |
| maxItems | integer | 15 | Max inputs to process |

Example input

{
  "inputs": [
    { "url": "https://en.wikipedia.org/wiki/Markdown" },
    { "html": "<html><body><h1>Hello</h1><p>World</p></body></html>" }
  ],
  "mode": "readability",
  "output": "markdown"
}
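
When the input is generated programmatically, a small helper can apply the documented defaults and catch obvious mistakes before a run starts. The sketch below is an assumption about how a client might do this, not part of the actor: build_input and its checks are hypothetical, and the actor may validate its input differently.

```python
import json

VALID_MODES = {"readability", "turndown", "trafilatura"}
VALID_OUTPUTS = {"markdown", "plaintext", "both"}


def build_input(inputs, mode="readability", output="markdown",
                include_images=True, preserve_links=True,
                render_js=False, max_items=15):
    """Build an input payload using the defaults from the Input table."""
    if not inputs:
        raise ValueError("inputs is required and must not be empty")
    for item in inputs:
        # Each entry carries a url, raw html, or both.
        if not (item.get("url") or item.get("html")):
            raise ValueError(f"each input needs a url or html value: {item!r}")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if output not in VALID_OUTPUTS:
        raise ValueError(f"unknown output: {output!r}")
    return {
        "inputs": list(inputs),
        "mode": mode,
        "output": output,
        "includeImages": include_images,
        "preserveLinks": preserve_links,
        "renderJs": render_js,
        "maxItems": max_items,
    }


def to_json(payload):
    """Serialize the payload for submission as the run input."""
    return json.dumps(payload, indent=2)
```

For example, build_input([{"url": "https://en.wikipedia.org/wiki/Markdown"}], output="both") yields a payload that requests both markdown and plaintext while keeping every other field at its default.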

Output

One dataset record per input:

| Field | Type | Description |
| --- | --- | --- |
| input | string (JSON) | Original input {url?, html?} |
| title | string | Extracted page title |
| byline | string | Author / byline |
| publishedAt | string | Published timestamp (ISO 8601, when detectable) |
| language | string | Detected language code (ISO 639-1) |
| markdown | string | Markdown output |
| plaintext | string | Plain-text output |
| wordCount | number | Word count |
| readingTimeMin | number | Estimated reading time (at 240 wpm) |
| images | string (JSON) | [{src, alt}] image references |
| links | string (JSON) | [{href, text}] hyperlinks |
| mode | string | readability / turndown / trafilatura |
| finalUrl | string | URL after redirects (URL inputs only) |
| status | string | success / timeout / error |
| errorMsg | string | Error message on failure |
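
The readingTimeMin field is documented as an estimate at 240 wpm. One reading that matches the example record (2459 words, 11 minutes) is a division rounded up to the next whole minute. The rounding rule is an inference from that single example, not something the source states, and reading_time_min is a hypothetical name.

```python
import math

WORDS_PER_MINUTE = 240  # rate stated in the Output table


def reading_time_min(word_count, wpm=WORDS_PER_MINUTE):
    """Estimate reading time in whole minutes, rounding up (assumed rule)."""
    return math.ceil(word_count / wpm)
```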

Example output record

{
  "input": "{\"url\":\"https://en.wikipedia.org/wiki/Markdown\"}",
  "title": "Markdown",
  "byline": "Contributors to Wikimedia projects",
  "publishedAt": "2005-08-09T19:56:00Z",
  "language": "en",
  "markdown": "# Markdown\n\nMarkdown is a lightweight markup language...",
  "wordCount": 2459,
  "readingTimeMin": 11,
  "images": "[{\"src\":\"/static/images/icons/enwiki-25.svg\",\"alt\":\"Wikipedia\"}]",
  "links": "[{\"href\":\"/wiki/Main_Page\",\"text\":\"Main page\"}]",
  "mode": "readability",
  "finalUrl": "https://en.wikipedia.org/wiki/Markdown",
  "status": "success",
  "errorMsg": null
}
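
Because input, images, and links arrive as JSON-encoded strings, a consumer typically decodes them before use. The sketch below does that and, as an optional client-side convenience, resolves the relative src and href values seen in the example against finalUrl. The helper name decode_record is hypothetical, and URL resolution is not something the actor is documented to do.

```python
import json
from urllib.parse import urljoin

JSON_FIELDS = ("input", "images", "links")


def decode_record(record):
    """Return a copy of a dataset record with JSON-string fields parsed."""
    decoded = dict(record)
    for field in JSON_FIELDS:
        raw = record.get(field)
        # Fields may be absent or null, e.g. when a list was disabled.
        decoded[field] = json.loads(raw) if isinstance(raw, str) else raw

    base = record.get("finalUrl")
    if base:
        # Turn relative references into absolute URLs.
        for image in decoded.get("images") or []:
            image["src"] = urljoin(base, image["src"])
        for link in decoded.get("links") or []:
            link["href"] = urljoin(base, link["href"])
    return decoded
```

Records produced from raw HTML inputs have no finalUrl, so their image and link references are left exactly as extracted.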

Pricing

FREE. There is no per-record charge; you pay only for the Apify platform compute time your run consumes. Ideal as a preprocessing step before paid downstream actors (e.g., translation, summarization).

Use cases

  • LLM context preparation — fetch and clean web pages before passing to GPT/Claude/Gemini
  • RAG pipeline ingestion — extract clean text from arbitrary URLs at scale
  • Content archiving — convert live pages to portable markdown
  • Article extraction — strip ads and navigation from news/blog content
  • Text analytics — feed clean plaintext into NLP models

Technical notes

  • HTTP fetch via got-scraping (Chrome TLS fingerprint, follows redirects)
  • Concurrency: up to 20 inputs processed in parallel per worker
  • Memory: 512 MB default; increase to 1024 MB if renderJs: true
  • Timeout: 30 seconds per URL fetch
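
The concurrency and timeout notes above can be pictured with a short asyncio sketch: a semaphore caps in-flight work at 20, and each fetch gets a 30-second budget that maps onto the success / timeout / error status values. This is an illustration under stated assumptions, not the actor's implementation, which fetches via got-scraping. The fetch argument is a hypothetical coroutine standing in for the real HTTP call.

```python
import asyncio

MAX_CONCURRENCY = 20   # "up to 20 inputs processed in parallel"
FETCH_TIMEOUT_S = 30   # "30 seconds per URL fetch"


async def process_all(urls, fetch, concurrency=MAX_CONCURRENCY,
                      timeout_s=FETCH_TIMEOUT_S):
    """Fetch every URL with bounded parallelism and a per-fetch timeout."""
    semaphore = asyncio.Semaphore(concurrency)

    async def process_one(url):
        async with semaphore:  # waits while `concurrency` fetches are running
            try:
                body = await asyncio.wait_for(fetch(url), timeout=timeout_s)
                return {"url": url, "status": "success",
                        "errorMsg": None, "body": body}
            except asyncio.TimeoutError:
                return {"url": url, "status": "timeout",
                        "errorMsg": f"no response within {timeout_s}s"}
            except Exception as exc:
                return {"url": url, "status": "error", "errorMsg": str(exc)}

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(process_one(u) for u in urls))
```

A failure or timeout on one input produces a record for that input only, so the remaining inputs still complete.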