VIE: Adult-Education Data Platform
A serverless AWS platform unifying adult-education records for millions of participants
Architecture
Sources
- State reporting systems
- Community-college exports
- Workforce-agency APIs
- SFTP / FTPS feeds
Ingest
- AWS Transfer Family
- S3 landing & staging
Orchestrate & transform
- Apache Airflow on MWAA
- ECS Fargate ETL jobs
- Record-linkage engine
Data vault
- Normalized MySQL
- 130+ tables, full lineage
Serve
- Tableau dashboards
- FastAPI partner APIs
Every component is provisioned as code with the AWS CDK: MWAA, ECS Fargate, RDS, Transfer Family, Cognito, S3, Secrets Manager, and the supporting network and monitoring.
The product
As Principal Data Engineer for Literacy Pro (later acquired by Pairin), I rebuilt the company’s entire data platform from the ground up.
The system they had was built in Pentaho, a visual ETL tool. It was deeply siloed, ran on massively denormalized tables that could take days to process, and cost around $250,000 a year on AWS. Reporting was unreliable, which is what I was originally brought in to fix.
I replaced it with a serverless Python platform. Data flows in from a dozen-plus external systems (state reporting systems, community-college exports, workforce-agency APIs, and SFTP/FTPS feeds), landing through AWS Transfer Family and S3. Apache Airflow on MWAA orchestrates the work as containerized ECS Fargate jobs that transform it into a normalized data vault of 130+ tables with full load lineage. At its core is a record-linkage engine I built from scratch: it matches millions of adult-education participants across otherwise-siloed systems on more than fifteen identity signals, with a complete audit trail of every match. The platform feeds Tableau dashboards and secure partner APIs.
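To make the matching idea concrete, here is a minimal sketch of weighted multi-signal record linkage with an audit record for each decision. It is an illustration under stated assumptions, not the production engine: the five signal names, their weights, the 0.60 threshold, and the `MatchDecision` structure are all hypothetical, and the real engine compares more than fifteen signals.

```python
from dataclasses import dataclass, field

# Hypothetical signals and weights, chosen only for illustration.
# The actual engine's signals and scoring are not described here.
SIGNAL_WEIGHTS = {
    "ssn_last4": 0.40,
    "date_of_birth": 0.25,
    "last_name": 0.15,
    "first_name": 0.10,
    "email": 0.10,
}
MATCH_THRESHOLD = 0.60  # assumed cut-off for declaring a match


def normalize(value):
    """Trim and lower-case a value; treat None and empty strings as missing."""
    return str(value).strip().lower() if value not in (None, "") else None


@dataclass
class MatchDecision:
    """Audit record: which signals agreed, disagreed, or were missing."""
    left_id: str
    right_id: str
    score: float
    is_match: bool
    agreed: list = field(default_factory=list)
    disagreed: list = field(default_factory=list)
    missing: list = field(default_factory=list)


def compare(left, right):
    """Score two participant records and return an auditable decision."""
    agreed, disagreed, missing = [], [], []
    score = 0.0
    for signal, weight in SIGNAL_WEIGHTS.items():
        a, b = normalize(left.get(signal)), normalize(right.get(signal))
        if a is None or b is None:
            missing.append(signal)      # no evidence either way
        elif a == b:
            agreed.append(signal)       # agreement adds the signal's weight
            score += weight
        else:
            disagreed.append(signal)    # disagreement adds nothing
    score = round(score, 2)
    return MatchDecision(left["id"], right["id"], score,
                         score >= MATCH_THRESHOLD, agreed, disagreed, missing)


# Two made-up records for the same person from different source systems.
state_record = {"id": "state-001", "ssn_last4": "1234",
                "date_of_birth": "1990-04-02", "last_name": "Rivera",
                "first_name": "Ana", "email": None}
college_record = {"id": "cc-778", "ssn_last4": "1234",
                  "date_of_birth": "1990-04-02", "last_name": " RIVERA ",
                  "first_name": "Anna", "email": "ana@example.org"}

decision = compare(state_record, college_record)
print(decision)
```

Storing the per-signal breakdown alongside the score is what makes each match explainable after the fact: a reviewer can see that these two records linked on three agreeing signals despite a first-name mismatch and a missing email.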
Results
The rebuild did what I was brought in to do: it restored reporting accuracy. Cross-system participant matching improved dramatically as I refined the matching engine, the platform’s annual running cost fell roughly 25-fold, and multi-day processing runs shrank to minutes. I designed, built, and operated the whole platform solo (the data vault, the matching engine, and every piece of AWS infrastructure) and worked directly with each external data vendor to shape and standardize the data they sent.
Highlights
- Cut annual AWS spend ~25×, from roughly $250K to $10K, by replacing a Pentaho ETL stack with a serverless Python platform
- Reduced data processing time from days to minutes
- Built a custom record-linkage engine matching millions of participants across 10+ siloed source systems on 15+ identity signals
- Designed a normalized 130+ table data vault with full load lineage, restoring reporting accuracy across hundreds of locations