// source scraping

Any source. Any format. Any schedule.

Point Newsmill at any website and the AI-powered scraping cascade handles the rest. From simple RSS feeds to JavaScript-heavy paywalled sites, the system automatically selects the right extraction method and delivers clean, structured content into your pipeline.

The 4-level cascade

Not every page requires the same approach. Newsmill evaluates each source and escalates through four extraction levels, using the lightest method that works. This keeps costs low and speed high.

  1. Level 1: HTTP fetch

    For simple, well-structured pages. A direct HTTP request retrieves the HTML, and content is extracted from the DOM. Fast, cheap, and sufficient for most news sites and RSS feeds.

  2. Level 2: Cheap LLM extraction

    For semi-structured content where simple parsing falls short. A lightweight language model identifies article boundaries, titles, dates, and body text from messy HTML layouts.

  3. Level 3: Headless browser rendering

    For JavaScript-heavy sites that require full rendering. A headless browser loads the page, executes scripts, waits for dynamic content, and then extracts the rendered output.

  4. Level 4: Full AI agent navigation

    For complex or paywalled sites that require interaction. An AI agent navigates the page — clicking through menus, handling cookie banners, scrolling through infinite feeds, and extracting content that requires multi-step interaction.
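The escalation logic above can be sketched as a simple try-cheapest-first loop. This is a hypothetical illustration, not Newsmill's actual API: the level functions are stand-ins, and a `None` return signals that the next, heavier level should be tried.

```python
def run_cascade(url, levels):
    """Try each extraction level in order; return the first success.

    `levels` is an ordered list of (name, extractor) pairs, cheapest
    first. An extractor returns an article dict on success, or None
    to hand off to the next (heavier) level.
    """
    for name, extract in levels:
        result = extract(url)
        if result is not None:
            return name, result
    raise RuntimeError(f"all extraction levels failed for {url}")

# Illustrative stand-ins for the four levels described above:
def http_fetch(url):        # Level 1: plain HTTP + DOM parsing
    return None             # pretend this page needs JavaScript
def cheap_llm(url):         # Level 2: lightweight LLM extraction
    return None             # pretend the layout is too dynamic
def headless_browser(url):  # Level 3: full page rendering
    return {"title": "Example headline", "body": "Rendered text"}
def ai_agent(url):          # Level 4: interactive agent navigation
    return {"title": "Example headline", "body": "Navigated text"}

level_used, article = run_cascade(
    "https://example.com/story",
    [("http", http_fetch), ("llm", cheap_llm),
     ("browser", headless_browser), ("agent", ai_agent)],
)
print(level_used)  # → browser
```

Because the loop stops at the first success, a simple RSS feed never pays the cost of a browser or an agent.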


Scheduling & monitoring

Sources are monitored on configurable schedules. Set check intervals per source — every 15 minutes for breaking news, hourly for industry publications, daily for press release pages. Newsmill detects new content automatically and feeds it into your pipeline without manual intervention.
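Per-source intervals like these can be modeled as a polling loop that re-checks only the sources whose interval has elapsed. A minimal sketch, assuming a simple dict-per-source shape (not Newsmill's internal representation):

```python
import datetime

def due_sources(sources, now):
    """Return the names of sources whose check interval has elapsed."""
    due = []
    for s in sources:
        last = s["last_checked"]
        interval = datetime.timedelta(minutes=s["interval_minutes"])
        # A never-checked source is always due.
        if last is None or now - last >= interval:
            due.append(s["name"])
    return due

now = datetime.datetime(2024, 1, 1, 12, 0)
sources = [
    # Breaking news checked every 15 minutes; last check 20 minutes ago.
    {"name": "breaking-news", "interval_minutes": 15,
     "last_checked": now - datetime.timedelta(minutes=20)},
    # Industry publication checked hourly; last check 30 minutes ago.
    {"name": "industry-weekly", "interval_minutes": 60,
     "last_checked": now - datetime.timedelta(minutes=30)},
]
print(due_sources(sources, now))  # → ['breaking-news']
```

Only the breaking-news source is due here; the hourly source still has 30 minutes left on its interval.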

Support extends beyond traditional news sites. Add RSS feeds for reliable structured data, website sections for targeted coverage, press release pages for corporate announcements, or any publicly accessible URL that publishes content you care about.

How it works

1. Add a URL

Paste any website URL into your Newsmill dashboard. Set a name, choose a check schedule, and assign it to a content group.

2. Auto-detection

Newsmill analyzes the source and selects the optimal scraping level. No configuration needed — the cascade handles it.

3. Extraction & cleaning

Articles are extracted with titles, dates, authors, and body text. Content is cleaned of ads, navigation, and boilerplate.

4. Pipeline entry

Clean content enters your pipeline for filtering, deduplication, rewriting, and publishing — fully automated.
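The four steps above can be chained end to end. This is a hypothetical sketch; every function name and field here is illustrative, not Newsmill's actual API:

```python
def add_source(url, name, schedule, group):
    # Step 1: register a URL with a name, check schedule, and content group.
    return {"url": url, "name": name, "schedule": schedule, "group": group}

def detect_level(source):
    # Step 2: stand-in for auto-detection; the real cascade probes the page.
    return {**source, "level": 1}

def extract_article(source):
    # Step 3: extraction and cleaning would run here; a dummy article
    # with title and body stands in for the real output.
    return {"source": source["name"], "title": "Example headline",
            "body": "Clean article text"}

def enter_pipeline(article, stages):
    # Step 4: pass the clean article through each pipeline stage in order
    # (filtering, deduplication, rewriting, publishing).
    for stage in stages:
        article = stage(article)
    return article

source = detect_level(add_source("https://example.com/news",
                                 "example", "hourly", "tech"))
article = enter_pipeline(extract_article(source),
                         [lambda a: {**a, "deduped": True}])
print(article["deduped"])  # → True
```

The point of the sketch is the shape of the flow: each step consumes the previous step's output, so once a source is added, no manual handoff is needed.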

This feature is included on every paid plan. See plans and pricing →

Ready to get started?

Sign up free and start scraping your first source in minutes.