Any source. Any format. Any schedule.
Point Newsmill at any website and the AI-powered scraping cascade handles the rest. From simple RSS feeds to JavaScript-heavy paywalled sites, the system automatically selects the right extraction method and delivers clean, structured content into your pipeline.
The 4-level cascade
Not every page requires the same approach. Newsmill evaluates each source and escalates through four extraction levels, using the lightest method that works. This keeps costs low and speed high.
- Level 1: HTTP fetch
For simple, well-structured pages. A direct HTTP request retrieves the HTML, and content is extracted from the DOM. Fast, cheap, and sufficient for most news sites and RSS feeds.
- Level 2: Cheap LLM extraction
For semi-structured content where simple parsing falls short. A lightweight language model identifies article boundaries, titles, dates, and body text from messy HTML layouts.
- Level 3: Headless browser rendering
For JavaScript-heavy sites that require full rendering. A headless browser loads the page, executes scripts, waits for dynamic content, and then extracts the rendered output.
- Level 4: Full AI agent navigation
For complex or paywalled sites that require interaction. An AI agent navigates the page — clicking through menus, handling cookie banners, scrolling through infinite feeds, and extracting content that requires multi-step interaction.
Scheduling & monitoring
Sources are monitored on configurable schedules. Set check intervals per source — every 15 minutes for breaking news, hourly for industry publications, daily for press release pages. Newsmill detects new content automatically and feeds it into your pipeline without manual intervention.
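Per-source intervals like those above amount to a small scheduling table. The sketch below shows one way such a poller could decide which sources are due; the source names and interval values are illustrative, not Newsmill configuration.

```python
from datetime import datetime, timedelta

# Hypothetical per-source check intervals, in minutes.
CHECK_INTERVALS = {
    "breaking-news": 15,        # breaking news: every 15 minutes
    "industry-weekly": 60,      # industry publication: hourly
    "press-releases": 24 * 60,  # press release page: daily
}

def due_sources(last_checked: dict, now: datetime) -> list:
    """Return the sources whose interval has elapsed since their last check.
    A source never checked before (missing from last_checked) is always due."""
    return [
        name
        for name, minutes in CHECK_INTERVALS.items()
        if now - last_checked.get(name, datetime.min) >= timedelta(minutes=minutes)
    ]

now = datetime(2025, 1, 1, 12, 0)
last = {name: now - timedelta(minutes=30) for name in CHECK_INTERVALS}
# After a 30-minute gap, only the 15-minute source is due again.
```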
Support extends beyond traditional news sites. Add RSS feeds for reliable structured data, website sections for targeted coverage, press release pages for corporate announcements, or any publicly accessible URL that publishes content you care about.
How it works
1. Paste any website URL into your Newsmill dashboard. Set a name, choose a check schedule, and assign it to a content group.
2. Newsmill analyzes the source and selects the optimal scraping level. No configuration needed — the cascade handles it.
3. Articles are extracted with titles, dates, authors, and body text. Content is cleaned of ads, navigation, and boilerplate.
4. Clean content enters your pipeline for filtering, deduplication, rewriting, and publishing — fully automated.
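The downstream pipeline stages can be pictured as functions applied in sequence to a list of articles. This is a toy sketch under that assumption; the stage names and the dict-based article shape are invented for illustration, not Newsmill's API.

```python
def strip_boilerplate(articles):
    """Stand-in cleaning step: trim surrounding whitespace from body text."""
    return [{**a, "body": a["body"].strip()} for a in articles]

def dedupe(articles):
    """Keep the first article per title; drop later duplicates."""
    seen, unique = set(), []
    for article in articles:
        if article["title"] not in seen:
            seen.add(article["title"])
            unique.append(article)
    return unique

def run_pipeline(articles, stages):
    """Apply each stage in order, passing the article list along."""
    for stage in stages:
        articles = stage(articles)
    return articles

articles = [
    {"title": "Launch", "body": "  Newsmill launches today.  "},
    {"title": "Launch", "body": "duplicate syndicated copy"},
]
result = run_pipeline(articles, [strip_boilerplate, dedupe])
# One article remains, with cleaned body text.
```

Composing stages this way keeps each step independently testable, and adding a rewriting or publishing step is just another function in the list.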
This feature is included on every paid plan. See plans and pricing →