// source scraping

Any source. Any format. Any schedule.

Point Newsmill at any website and the AI-powered scraping cascade handles the rest. From simple RSS feeds to JavaScript-heavy paywalled sites, the system automatically selects the right extraction method and delivers clean, structured content into your pipeline.

The 4-level cascade

Not every page requires the same approach. Newsmill evaluates each source and escalates through four extraction levels, using the lightest method that works. This keeps costs low and speed high.

  1. Level 1: HTTP fetch

    For simple, well-structured pages. A direct HTTP request retrieves the HTML, and content is extracted from the DOM. Fast, cheap, and sufficient for most news sites and RSS feeds.

  2. Level 2: Cheap LLM extraction

    For semi-structured content where simple parsing falls short. A lightweight language model identifies article boundaries, titles, dates, and body text from messy HTML layouts.

  3. Level 3: Headless browser rendering

    For JavaScript-heavy sites that require full rendering. A headless browser loads the page, executes scripts, waits for dynamic content, and then extracts the rendered output.

  4. Level 4: Full AI agent navigation

    For complex or paywalled sites that require interaction. An AI agent navigates the page — clicking through menus, handling cookie banners, scrolling through infinite feeds, and extracting content that requires multi-step interaction.
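The escalation logic above can be sketched as a simple try-cheapest-first loop. This is a hypothetical illustration, not Newsmill's actual API: the level functions are stand-ins, and a `None` return signals that the next, heavier level should be tried.

```python
def run_cascade(url, levels):
    """Try each extraction level in order; return the first success.

    `levels` is an ordered list of (name, extractor) pairs, cheapest
    first. An extractor returns an article dict on success, or None
    to hand off to the next (heavier) level.
    """
    for name, extract in levels:
        result = extract(url)
        if result is not None:
            return name, result
    raise RuntimeError(f"all extraction levels failed for {url}")

# Illustrative stand-ins for the four levels described above:
def http_fetch(url):        # Level 1: plain HTTP + DOM parsing
    return None             # pretend this page needs JavaScript
def cheap_llm(url):         # Level 2: lightweight LLM extraction
    return None             # pretend the layout is too dynamic
def headless_browser(url):  # Level 3: full page rendering
    return {"title": "Example headline", "body": "Rendered text"}
def ai_agent(url):          # Level 4: interactive agent navigation
    return {"title": "Example headline", "body": "Navigated text"}

level_used, article = run_cascade(
    "https://example.com/story",
    [("http", http_fetch), ("llm", cheap_llm),
     ("browser", headless_browser), ("agent", ai_agent)],
)
print(level_used)  # → browser
```

Because the loop stops at the first success, a simple RSS feed never pays the cost of a browser or an agent.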


Scheduling & monitoring

Sources are monitored on configurable schedules. Set check intervals per source — every 15 minutes for breaking news, hourly for industry publications, daily for press release pages. Newsmill detects new content automatically and feeds it into your pipeline without manual intervention.
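Per-source intervals like these can be modeled as a polling loop that re-checks only the sources whose interval has elapsed. A minimal sketch, assuming a simple dict-per-source shape (not Newsmill's internal representation):

```python
import datetime

def due_sources(sources, now):
    """Return the names of sources whose check interval has elapsed."""
    due = []
    for s in sources:
        last = s["last_checked"]
        interval = datetime.timedelta(minutes=s["interval_minutes"])
        # A never-checked source is always due.
        if last is None or now - last >= interval:
            due.append(s["name"])
    return due

now = datetime.datetime(2024, 1, 1, 12, 0)
sources = [
    # Breaking news checked every 15 minutes; last check 20 minutes ago.
    {"name": "breaking-news", "interval_minutes": 15,
     "last_checked": now - datetime.timedelta(minutes=20)},
    # Industry publication checked hourly; last check 30 minutes ago.
    {"name": "industry-weekly", "interval_minutes": 60,
     "last_checked": now - datetime.timedelta(minutes=30)},
]
print(due_sources(sources, now))  # → ['breaking-news']
```

Only the breaking-news source is due here; the hourly source still has 30 minutes left on its interval.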

Support extends beyond traditional news sites. Add RSS feeds for reliable structured data, website sections for targeted coverage, press release pages for corporate announcements, or any publicly accessible URL that publishes content you care about.

How it works

1. Add a URL

Paste any website URL into your Newsmill dashboard. Set a name, choose a check schedule, and assign it to a content group.

2. Auto-detection

Newsmill analyzes the source and selects the optimal scraping level. No configuration needed — the cascade handles it.

3. Extraction & cleaning

Articles are extracted with titles, dates, authors, and body text. Content is cleaned of ads, navigation, and boilerplate.

4. Pipeline entry

Clean content enters your pipeline for filtering, deduplication, rewriting, and publishing — fully automated.
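The four steps above can be chained end to end. This is a hypothetical sketch; every function name and field here is illustrative, not Newsmill's actual API:

```python
def add_source(url, name, schedule, group):
    # Step 1: register a URL with a name, check schedule, and content group.
    return {"url": url, "name": name, "schedule": schedule, "group": group}

def detect_level(source):
    # Step 2: stand-in for auto-detection; the real cascade probes the page.
    return {**source, "level": 1}

def extract_article(source):
    # Step 3: extraction and cleaning would run here; a dummy article
    # with title and body stands in for the real output.
    return {"source": source["name"], "title": "Example headline",
            "body": "Clean article text"}

def enter_pipeline(article, stages):
    # Step 4: pass the clean article through each pipeline stage in order
    # (filtering, deduplication, rewriting, publishing).
    for stage in stages:
        article = stage(article)
    return article

source = detect_level(add_source("https://example.com/news",
                                 "example", "hourly", "tech"))
article = enter_pipeline(extract_article(source),
                         [lambda a: {**a, "deduped": True}])
print(article["deduped"])  # → True
```

The point of the sketch is the shape of the flow: each step consumes the previous step's output, so once a source is added, no manual handoff is needed.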

This feature is included on every paid plan. See plans and pricing →

Ready to get started?

Sign up free and start scraping your first source in minutes.