SNEWPapers: Teaching AI to Read Historical Newspapers
SNEWPapers (snewpapers.com) is a live, AI-powered archive of American historical newspapers, the first of its kind at this scale and depth. Nearly a million pages, spanning the 1730s to the 1960s and 3,000+ American newspapers, were parsed, segmented, and categorized into 6 million stories. Built and shipped solo.
Architecture
The substance is teaching machines to actually read the papers. Scanned pages from the Library of Congress Chronicling America collection (230 years of newsprint, 1730s through the 1960s, 3,000+ titles) flow through a multi-stage AI pipeline. SOTA document-layout and OCR models on a GPU Spot ECS cluster handle layout analysis and text recognition, with bespoke logic for region detection, column inference, and human reading order. Frontier LLMs with tool execution for spatial reasoning identify the distinct components on a page (story, ad, obituary, editorial, and so on), merge them across page boundaries, and extract structured metadata based on content type. OpenSearch and embeddings provide hybrid BM25 / vector search.
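To make the column-inference and reading-order step concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the production logic: the `Region` type, the `gap` threshold, and the sample coordinates are all hypothetical, and it assumes every region sits inside a single column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A detected text region; coordinates are pixels from the top-left corner."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def x_center(self) -> float:
        return (self.x0 + self.x1) / 2


def infer_columns(regions: list[Region], gap: float) -> list[list[Region]]:
    """Group regions into columns by clustering their horizontal centers.

    A new column starts whenever the next center lies more than `gap`
    pixels to the right of the previous one (a hypothetical heuristic).
    """
    columns: list[list[Region]] = []
    last_center = None
    for region in sorted(regions, key=lambda r: r.x_center):
        if last_center is None or region.x_center - last_center > gap:
            columns.append([])
        columns[-1].append(region)
        last_center = region.x_center
    return columns


def reading_order(regions: list[Region], gap: float = 50.0) -> list[Region]:
    """Read columns left to right, and regions top to bottom within a column."""
    ordered: list[Region] = []
    for column in infer_columns(regions, gap):
        ordered.extend(sorted(column, key=lambda r: r.y0))
    return ordered


# Made-up two-column page, listed in the scrambled order a detector might emit.
page = [
    Region("col2-top", 320, 100, 600, 400),
    Region("col1-bottom", 10, 420, 300, 800),
    Region("col1-top", 12, 100, 298, 400),
    Region("col2-bottom", 318, 410, 602, 800),
]
order = [r.name for r in reading_order(page)]
print(order)  # ['col1-top', 'col1-bottom', 'col2-top', 'col2-bottom']
```

Real newspaper pages break these assumptions constantly (headlines spanning several columns, stories that jump columns, ads set across the grid), which is why this kind of geometric pass can only be a starting point.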
Source
- Library of Congress
- Chronicling America - JP2 scans · 1730s – 1960s
- 3,000+ titles
Ingest & OCR
- SOTA document layout analysis
- Custom OCR and vision models
- GPU ECS Spot workers
AI extraction
- xAI LLMs w/ tool calling & structured output
- Per-label structured extraction
- Component embeddings
Store & search
- Aurora PG + pgvector
- OpenSearch hybrid BM25 + kNN
Serve
- Rails 8.1 · Hotwire · Tailwind
- Subscription billing
- Streaming Sleuth agents
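The "per-label structured extraction" step in the AI extraction stage can be pictured as checking each model response against a schema chosen by the component's label. The sketch below is one way that might look; the labels come from the text above, but the field names, the `SCHEMAS` table, and `parse_component` are hypothetical and not taken from the actual pipeline.

```python
import json

# Hypothetical per-label schemas: required field name -> expected Python type.
SCHEMAS: dict[str, dict[str, type]] = {
    "story": {"headline": str, "summary": str},
    "obituary": {"deceased_name": str, "age": int},
    "ad": {"advertiser": str, "product": str},
}


def parse_component(raw: str) -> dict:
    """Validate one JSON model response against the schema for its label.

    Raises ValueError for a non-object payload, an unknown label, or a
    missing or mistyped required field.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label = data.get("label")
    if label not in SCHEMAS:
        raise ValueError(f"unknown label: {label!r}")
    for field, expected in SCHEMAS[label].items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"{label}.{field} must be {expected.__name__}")
    return data


# Made-up response for illustration only.
obit = parse_component('{"label": "obituary", "deceased_name": "J. Doe", "age": 71}')
```

Rejecting malformed responses at this boundary keeps bad records out of the store and gives the pipeline a clear point at which to retry the model call.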
Every component is provisioned as code with the AWS CDK: Fargate, GPU ECS Spot, Lambda, SQS, Aurora, OpenSearch, ElastiCache, S3, Route 53, and the supporting network.
The product
The front end is a Rails 8.1 app: Hotwire, Tailwind, Devise + Google OAuth, with subscription billing for paid access. Two AI assistants, the Collection Sleuth (chat scoped to a curated set of articles) and the Discovery Sleuth (agentic search across the whole archive with tool-calling), stream live progress to the browser over Solid Queue → Valkey → ActionCable WebSockets, with a polling fallback for unfriendly networks. A custom Lua/Valkey token-bucket rate limiter gates AI usage by subscription tier, and a natural-language Search Assistant translates plain-English questions into a structured search.
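For readers unfamiliar with token buckets, the core of the algorithm fits in a few lines. This is an in-process Python sketch of the general technique, not the Lua/Valkey implementation described above; the tier names and numbers in `TIER_LIMITS` are invented for illustration.

```python
import time

# Hypothetical tier limits: (bucket capacity, tokens refilled per second).
TIER_LIMITS = {"free": (5, 0.1), "paid": (60, 1.0)}


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `refill_rate` tokens/second."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = capacity
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the time elapsed, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Demo with a fake clock: two requests pass, the third is throttled,
# and one second later a single token has been refilled.
now = [0.0]
bucket = TokenBucket(capacity=2, refill_rate=1.0, clock=lambda: now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
now[0] = 1.0
after_wait = bucket.allow()
print(burst, after_wait)  # [True, True, False] True
```

Running the same refill-then-spend step as a Lua script inside Valkey makes it atomic, so several app processes can share one bucket per user without racing each other.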
Behind the Rails frontend, a separate Python/FastAPI backend on Fargate exposes the search API and runs the agentic Grok workflows. Both connect to Aurora PostgreSQL for operational data and a three-node OpenSearch cluster for content search.
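Hybrid search means merging a keyword (BM25) ranking with a vector (kNN) ranking into one result list. OpenSearch can do this combination server-side, and the text does not say which method the site uses; the sketch below shows reciprocal rank fusion, one common client-side technique, purely as an illustration. The document IDs and the constant `k = 60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists into one.

    Each document scores the sum of 1 / (k + rank) over the lists it
    appears in (rank starts at 1); ties are broken by ID for determinism.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Made-up result lists for one query.
bm25_hits = ["story-a", "story-b", "story-c"]  # keyword matches
knn_hits = ["story-c", "story-a", "story-d"]   # embedding neighbours
fused = reciprocal_rank_fusion([bm25_hits, knn_hits])
print(fused)  # ['story-a', 'story-c', 'story-b', 'story-d']
```

Rank-based fusion avoids having to normalize BM25 scores against vector similarities, which live on different scales; documents found by both retrievers rise to the top.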
Why
This took roughly 2,500 hours and a lot of self-funded infrastructure cost over a little more than a year, with no client behind it. I built it because no one else had: as far as I’ve been able to find, no existing newspaper archive actually reads its papers at this depth. They OCR the text and index it, but they don’t know what’s on the page. Combining document-layout analysis with LLMs at scale was the missing piece, and the result is a research tool that knows what each item on a page is, not just which words appear on it.
Highlights
- Built and shipped solo (backend, Rails frontend, and AWS infrastructure) what an extensive search for prior art suggests is the first AI newspaper archive at this scale and depth
- Designed a multi-stage AI ingestion pipeline combining SOTA document-layout analysis on a GPU fleet, xAI Grok with a code-execution tool for page-structure extraction, embeddings, and hybrid BM25 + kNN search in OpenSearch
- Implemented agentic Grok "Sleuth" assistants with tool-calling that stream live progress over Solid Queue, Valkey, and ActionCable WebSockets, plus a custom Lua/Valkey token-bucket rate limiter gating AI usage by subscription tier
- Live at snewpapers.com, a subscription product covering 6 million stories across 3,000+ American newspaper titles, 1730s–1960s