Output Formats
The output_format parameter decides what data.content contains. Pick the format closest to what your code consumes so you do less post-processing.
| Format | Returns | Use for |
|---|---|---|
| html | Raw HTML | Custom parsing, archiving |
| markdown | Clean Markdown | LLM input, content pipelines |
| plain_text | Stripped text | Search indexing, NLP |
| autoparse | Auto-detected JSON | Quick structured data |
| screenshot | Base64 PNG | Visual capture (see Screenshots) |
:::note Extraction is separate
Precise field extraction is not an output_format — it's driven by parameters that work alongside any format:
- Add css_selectors → results in data.css_extracted.
- Add templates → results in data.template_extracted.
:::
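Because the format and the extraction parameters are independent, a request body can be assembled from them separately. The following is a minimal sketch, assuming the body is sent as JSON; `build_request` and `OUTPUT_FORMATS` are hypothetical names for illustration, not part of the API.

```python
import json

# Format names taken from the table above; "html" is the documented default.
OUTPUT_FORMATS = {"html", "markdown", "plain_text", "autoparse", "screenshot"}


def build_request(url, output_format="html", css_selectors=None, templates=None):
    """Build a request body as a plain dict (hypothetical helper)."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output_format: {output_format!r}")
    body = {"url": url, "output_format": output_format}
    # Extraction parameters work alongside any format, so add them only when given.
    if css_selectors:
        body["css_selectors"] = css_selectors
    if templates:
        body["templates"] = templates
    return body


body = build_request(
    "https://shop.example.com/p/1",
    output_format="markdown",
    css_selectors={"title": "h1", "price": ".price"},
)
print(json.dumps(body, indent=2))
```

Validating the format name on the client side catches typos before a request is made.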
html (default)
Returns the page exactly as rendered, in data.content. Best when you have your own parser or need the full DOM.
{ "url": "https://example.com", "output_format": "html" }
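Once the HTML is in data.content, any parser can take over. As a rough illustration using only Python's standard library, the sketch below collects the text of every `h1` element. The `response` dict is a made-up sample that assumes the `data.content` layout described above, and `TagTextCollector` is a hypothetical helper.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside each occurrence of one tag (nesting is not handled)."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.texts = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag:
            self._inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.wanted_tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.texts[-1] += data


# Made-up sample response; only the data.content field is assumed.
response = {
    "data": {
        "content": "<html><body><h1>Example Domain</h1><p>Body text.</p></body></html>"
    }
}

collector = TagTextCollector("h1")
collector.feed(response["data"]["content"])
collector.close()
headings = [text.strip() for text in collector.texts]
print(headings)
```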
markdown
Converts the main content to Markdown, dropping navigation, scripts, and styling. Ideal for feeding pages into an LLM or a content database.
{ "url": "https://blog.example.com/post", "output_format": "markdown" }
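Markdown output is convenient to chunk before passing it to an LLM. The sketch below splits the content at headings; it assumes the Markdown uses `#`-style headings and that the text arrives in data.content. The sample `response` and the `split_by_heading` helper are illustrative only.

```python
import re


def split_by_heading(markdown):
    """Split Markdown into (heading, body) pairs at #-style headings.

    Lines before the first heading are returned under an empty heading.
    A '#' line inside a fenced code block would also be treated as a heading.
    """
    sections = []
    heading, lines = "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            if heading or any(l.strip() for l in lines):
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if heading or any(l.strip() for l in lines):
        sections.append((heading, "\n".join(lines).strip()))
    return sections


# Made-up sample response; only the data.content field is assumed.
response = {
    "data": {"content": "# Post title\n\nIntro paragraph.\n\n## Details\n\nMore text.\n"}
}

sections = split_by_heading(response["data"]["content"])
for heading, body in sections:
    print(f"{heading!r}: {body!r}")
```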
plain_text
Returns readable text with markup removed. Good for full-text search, keyword extraction, and sentiment analysis.
{ "url": "https://news.example.com/article", "output_format": "plain_text" }
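As one example of the keyword extraction mentioned above, the following sketch counts the most frequent words in the returned text. The stopword list is deliberately tiny, the `response` dict is a made-up sample that assumes the text arrives in data.content, and `top_keywords` is a hypothetical helper.

```python
import re
from collections import Counter

# A deliberately small stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def top_keywords(text, n=3):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)


# Made-up sample response; only the data.content field is assumed.
response = {
    "data": {
        "content": "The market rallied. The market closed higher, "
        "and analysts expect the rally to continue."
    }
}

keywords = top_keywords(response["data"]["content"])
print(keywords)
```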
autoparse
We detect the page type (product, article, listing) and return structured JSON under data.extracted_data without you writing selectors. Great for a quick start; use css_selectors when you need exact control.
{ "url": "https://shop.example.com/p/1", "output_format": "autoparse" }
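Since the fields returned by autoparse depend on the detected page type, client code should probably read them defensively. This is a minimal sketch; the field names inside `extracted_data` (`title`, `price`) are assumptions made for the example, not a documented schema, and `get_extracted` is a hypothetical helper.

```python
def get_extracted(response):
    """Return data.extracted_data as a dict, or an empty dict if it is absent."""
    data = response.get("data") or {}
    extracted = data.get("extracted_data")
    return extracted if isinstance(extracted, dict) else {}


# Made-up sample response; the keys inside extracted_data are assumed.
response = {"data": {"extracted_data": {"title": "Blue Mug", "price": "12.00"}}}

product = get_extracted(response)
title = product.get("title")          # present in this sample
currency = product.get("currency")    # absent, so None rather than a KeyError
print(title, currency)
```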
Precise extraction
To pull exact fields, add css_selectors (see CSS extraction). The output_format can stay html:
{
  "url": "https://shop.example.com/p/1",
  "css_selectors": { "title": "h1", "price": ".price" }
}
You can combine css_selectors with templates (e.g. ["links", "images"]) to get both your mapped fields and built-in extractions in one call.
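A combined call might look like the sketch below, which builds one request body with both parameters and then reads the two result fields named in the note above. The values in the sample `response` are invented for illustration, and `collect_extractions` is a hypothetical helper.

```python
import json

# One request carrying both mapped fields and built-in templates.
request_body = {
    "url": "https://shop.example.com/p/1",
    "css_selectors": {"title": "h1", "price": ".price"},
    "templates": ["links", "images"],
}
print(json.dumps(request_body))

# Made-up sample response; only the two field names come from the docs.
response = {
    "data": {
        "css_extracted": {"title": "Blue Mug", "price": "$12.00"},
        "template_extracted": {
            "links": ["https://shop.example.com/cart"],
            "images": ["https://shop.example.com/img/mug.png"],
        },
    }
}


def collect_extractions(response):
    """Gather both extraction results, defaulting to empty dicts when missing."""
    data = response.get("data") or {}
    return {
        "fields": data.get("css_extracted") or {},
        "builtin": data.get("template_extracted") or {},
    }


result = collect_extractions(response)
print(result["fields"]["title"], len(result["builtin"]["links"]))
```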