SNEWPapers: Teaching AI to Read Historical Newspapers

Architecture

The substance is teaching machines to actually read the papers. Scanned pages from the Library of Congress Chronicling America collection (230 years of newsprint, 1730s through the 1960s, 3,000+ titles) flow through a multi-stage AI pipeline. State-of-the-art document-layout and OCR models on a GPU Spot ECS cluster handle layout analysis and text recognition, with bespoke logic for region detection, column inference, and human reading order. Frontier LLMs with tool execution for spatial reasoning identify the distinct components on a page (story, ad, obituary, editorial, and so on), merge them across page boundaries, and extract structured metadata based on content type. OpenSearch and embeddings provide hybrid BM25 / vector search.
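To make "column inference" and "human reading order" concrete, here is a minimal sketch in Python. It is not the project's actual logic, which is bespoke and not described in detail here. The `Region` schema, the `gap_tolerance` parameter, and both function names are hypothetical. The sketch assumes simple rectangular regions and ignores headlines that span several columns.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A detected text region in page pixel coordinates (hypothetical schema)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def infer_columns(regions, gap_tolerance=10.0):
    """Group regions into columns by horizontal overlap, left to right.

    Assumption: a region belongs to an existing column if its left edge
    falls inside that column's horizontal extent (minus a small tolerance).
    """
    columns = []
    for region in sorted(regions, key=lambda r: r.x0):
        for column in columns:
            column_right = max(r.x1 for r in column)
            if region.x0 < column_right - gap_tolerance:
                column.append(region)
                break
        else:
            # No overlapping column found: start a new one to the right.
            columns.append([region])
    return columns


def reading_order(regions, gap_tolerance=10.0):
    """Read each column top to bottom, columns left to right."""
    ordered = []
    for column in infer_columns(regions, gap_tolerance):
        ordered.extend(sorted(column, key=lambda r: r.y0))
    return ordered


if __name__ == "__main__":
    page = [
        Region(110, 90, 210, 150, "D"),
        Region(0, 60, 100, 120, "B"),
        Region(110, 0, 210, 80, "C"),
        Region(0, 0, 100, 50, "A"),
    ]
    print([r.text for r in reading_order(page)])  # ['A', 'B', 'C', 'D']
```

Real newspaper pages break these assumptions constantly (banner headlines, ads that straddle columns, skewed scans), which is presumably why the pipeline pairs layout models with custom logic rather than relying on a geometric rule alone.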

Source

Library of Congress
Chronicling America
JP2 scans · 1730s – 1960s
3,000+ titles

Ingest & OCR

SOTA document layout analysis
Custom OCR and vision models
GPU ECS Spot workers

AI extraction

xAI LLMs w/ tool calling & structured output
Per-label structured extraction
Component embeddings

Store & search

Aurora PG + pgvector
OpenSearch hybrid BM25 + kNN

Serve

Rails 8.1 · Hotwire · Tailwind
Subscription billing
Streaming Sleuth agents

Every component is provisioned as code with the AWS CDK: Fargate, GPU ECS Spot, Lambda, SQS, Aurora, OpenSearch, ElastiCache, S3, Route 53, and the supporting network.

The product

The front end is a Rails 8.1 app: Hotwire, Tailwind, Devise + Google OAuth, with subscription billing for paid access. Two AI assistants, the Collection Sleuth (chat scoped to a curated set of articles) and the Discovery Sleuth (agentic search across the whole archive with tool-calling), stream live progress to the browser over Solid Queue → Valkey → ActionCable WebSockets, with a polling fallback for unfriendly networks. A custom Lua/Valkey token-bucket rate limiter gates AI usage by subscription tier, and a natural-language Search Assistant translates plain-English questions into a structured search.
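The token-bucket idea behind the rate limiter can be sketched in a few lines of Python. This is an in-memory illustration only: the production limiter is written in Lua and runs on Valkey, where a script executes atomically, and its actual parameters are not given here. The `TokenBucket` class, the `TIER_LIMITS` table, and its numbers are all hypothetical.

```python
import time

# Hypothetical per-tier settings: (bucket capacity, tokens refilled per second).
TIER_LIMITS = {
    "free": (5, 0.1),
    "pro": (50, 1.0),
}


class TokenBucket:
    """In-memory token bucket; each AI request spends one or more tokens."""

    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.tokens = float(capacity)  # start with a full bucket
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def bucket_for_tier(tier):
    """Build a bucket from the hypothetical tier table above."""
    capacity, refill_per_sec = TIER_LIMITS[tier]
    return TokenBucket(capacity, refill_per_sec)
```

A bucket allows short bursts up to its capacity while capping the sustained rate at the refill speed, which suits chat-style usage better than a fixed per-minute counter. Moving the refill-and-spend step into a single server-side script is what keeps it safe when several app processes check the same user's bucket at once.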

Behind the Rails front end, a separate Python/FastAPI backend on Fargate exposes the search API and runs the agentic Grok workflows. Both apps connect to Aurora PostgreSQL for operational data and to a three-node OpenSearch cluster for content search.
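Hybrid search means combining a keyword (BM25) ranking with a vector (kNN) ranking of the same corpus. How this project fuses the two is not stated, so the sketch below swaps in reciprocal rank fusion (RRF), one common technique, purely as an illustration. The function name, the document IDs, and the constant `k=60` are assumptions, not the project's configuration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by both BM25 and kNN rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]  # hypothetical keyword results
    knn_hits = ["c", "a", "d"]   # hypothetical vector results
    print(reciprocal_rank_fusion([bm25_hits, knn_hits]))  # ['a', 'c', 'b', 'd']
```

Rank-based fusion sidesteps the fact that BM25 scores and vector similarities live on different scales. Score normalization followed by a weighted sum is the usual alternative.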

Why

This took roughly 2,500 hours and a lot of self-funded infrastructure cost over the last year and change, with no client behind it. I built it because no one else had: as far as I’ve been able to find, no existing newspaper archive actually reads its papers at this depth. They OCR the text and index it, but they don’t know what’s on the page. Combining document layout analysis with LLMs at scale was the missing piece, and the result is a research tool that’s qualitatively different from anything else out there.
