Output Formats

The output_format parameter decides what data.content contains. Pick the format closest to what your code consumes so you do less post-processing.

| Format | Returns | Use for |
| --- | --- | --- |
| html | Raw HTML | Custom parsing, archiving |
| markdown | Clean Markdown | LLM input, content pipelines |
| plain_text | Stripped text | Search indexing, NLP |
| autoparse | Auto-detected JSON | Quick structured data |
| screenshot | Base64 PNG | Visual capture (see Screenshots) |
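As a minimal sketch of choosing a format in client code, the helper below builds a request body and rejects format names that are not in the table above. The function name `build_request` is hypothetical and not part of the API; only `url`, `output_format`, and the five format values come from this page.

```python
import json

# Format names taken from the table above.
OUTPUT_FORMATS = {"html", "markdown", "plain_text", "autoparse", "screenshot"}


def build_request(url: str, output_format: str = "html") -> dict:
    """Build a request body, failing early on an unknown output_format."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output_format: {output_format!r}")
    return {"url": url, "output_format": output_format}


if __name__ == "__main__":
    # Serialize the body the same way it would be sent as JSON.
    print(json.dumps(build_request("https://example.com", "markdown")))
```

Validating on the client is a design choice rather than a requirement: it catches a typo such as `plaintext` before a request is spent on it.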

:::note Extraction is separate
Precise field extraction is not an output_format. It is driven by parameters that work alongside any format:

  • Add css_selectors → results in data.css_extracted.
  • Add templates → results in data.template_extracted.
:::
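A short sketch of reading these results on the client side. It assumes the response has already been decoded into a Python dict with a top-level `data` object, as the `data.*` paths above suggest, and that each key is simply absent when its parameter was not sent. The sample response and the helper name `extraction_results` are invented for illustration.

```python
def extraction_results(response: dict) -> dict:
    """Collect whichever extraction results are present in a decoded response."""
    data = response.get("data", {})
    results = {}
    # Assumption: a key is absent when the matching request parameter was not sent.
    for key in ("css_extracted", "template_extracted"):
        if key in data:
            results[key] = data[key]
    return results


# Hypothetical response: the values and their shapes are made up.
sample = {
    "data": {
        "content": "<html>...</html>",
        "css_extracted": {"title": "Example product"},
    }
}
print(extraction_results(sample))
```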

html (default)

Returns the page exactly as rendered, in data.content. Best when you have your own parser or need the full DOM.

{ "url": "https://example.com", "output_format": "html" }
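To illustrate the "own parser" case, here is a minimal sketch that feeds `data.content` to Python's standard-library `html.parser` and collects link targets. The response dict is a made-up sample and `extract_links` is a hypothetical helper; a real page would usually call for a fuller parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; anchors without href are skipped.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(response: dict) -> list:
    """Parse the raw HTML in data.content and return all link targets."""
    parser = LinkCollector()
    parser.feed(response["data"]["content"])
    parser.close()
    return parser.links


# Hypothetical decoded response for an html request.
sample = {"data": {"content": '<html><body><a href="/a">A</a><a>no href</a><a href="/b">B</a></body></html>'}}
print(extract_links(sample))
```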

markdown

Converts the main content to Markdown, dropping navigation, scripts, and styling. Ideal for feeding pages into an LLM or a content database.

{ "url": "https://blog.example.com/post", "output_format": "markdown" }
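One common step before sending Markdown to an LLM is splitting it into smaller chunks. The sketch below splits `data.content` at headings. It is an illustration under simple assumptions, not part of the API: `split_by_heading` is a hypothetical helper, the sample response is invented, and any line starting with `#` is treated as a heading, which would also match `#` comments inside code blocks.

```python
def split_by_heading(markdown: str) -> list:
    """Split Markdown into chunks, starting a new chunk at each '#' heading line."""
    chunks = []
    current = []
    for line in markdown.splitlines():
        # A heading closes the chunk collected so far (if any) and opens a new one.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only blank lines.
    return [chunk for chunk in chunks if chunk]


# Hypothetical decoded response for a markdown request.
sample = {"data": {"content": "# Title\n\nIntro text.\n\n## Section\n\nBody text."}}
for chunk in split_by_heading(sample["data"]["content"]):
    print(repr(chunk))
```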

plain_text

Returns readable text with markup removed. Good for full-text search, keyword extraction, and sentiment analysis.

{ "url": "https://news.example.com/article", "output_format": "plain_text" }
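As a rough sketch of the keyword-extraction use case, the snippet below counts the most frequent terms in `data.content`. The helper `top_terms` and the sample text are hypothetical, and the length filter is only a crude stand-in for a real stop-word list.

```python
import re
from collections import Counter


def top_terms(text: str, n: int = 3) -> list:
    """Return the n most frequent words longer than three characters."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Skipping short words is a crude substitute for stop-word removal.
    return Counter(word for word in words if len(word) > 3).most_common(n)


# Hypothetical decoded response for a plain_text request.
sample = {"data": {"content": "Markets rallied today. Markets closed higher as tech stocks rallied."}}
print(top_terms(sample["data"]["content"], n=2))
```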

autoparse

We detect the page type (product, article, listing) and return structured JSON under data.extracted_data without you writing selectors. Great for a quick start; use css_selectors when you need exact control.

{ "url": "https://shop.example.com/p/1", "output_format": "autoparse" }
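Because autoparse chooses the fields for you, client code may want to check that the fields it depends on actually came back before relying on them. A minimal sketch, assuming `data.extracted_data` is a JSON object for the page in question; the field names `title`, `price`, and `sku` and the helper `missing_fields` are illustrative only, not a documented schema.

```python
def missing_fields(response: dict, required: list) -> list:
    """Return the required field names that autoparse did not provide."""
    extracted = response.get("data", {}).get("extracted_data")
    # Assumption: a usable result is a JSON object; anything else counts as all missing.
    if not isinstance(extracted, dict):
        return list(required)
    return [name for name in required if extracted.get(name) in (None, "")]


# Hypothetical decoded response for an autoparse request.
sample = {"data": {"extracted_data": {"title": "Blue Mug", "price": ""}}}
print(missing_fields(sample, ["title", "price", "sku"]))
```

A non-empty result is one signal that the page may need explicit css_selectors instead.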

Precise extraction

To pull exact fields, add css_selectors (see CSS extraction). The output_format can stay html:

{
  "url": "https://shop.example.com/p/1",
  "css_selectors": { "title": "h1", "price": ".price" }
}
:::tip
You can combine css_selectors with templates (e.g. ["links", "images"]) to get both your mapped fields and built-in extractions in one call.
:::
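A combined request might be assembled as in the sketch below. It assumes `templates` is sent as a JSON array next to `css_selectors`, as the example values in the tip suggest; `build_extraction_request` is a hypothetical helper, not part of the API.

```python
import json


def build_extraction_request(url, css_selectors=None, templates=None):
    """Build a request body with optional css_selectors and templates."""
    body = {"url": url}
    # Only include the extraction parameters that were actually supplied.
    if css_selectors:
        body["css_selectors"] = dict(css_selectors)
    if templates:
        body["templates"] = list(templates)
    return body


request_body = build_extraction_request(
    "https://shop.example.com/p/1",
    css_selectors={"title": "h1", "price": ".price"},
    templates=["links", "images"],
)
print(json.dumps(request_body, indent=2))
```

With both parameters set, the mapped fields would be expected under data.css_extracted and the built-in extractions under data.template_extracted.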