VIE: Adult-Education Data Platform

A serverless AWS platform unifying adult-education records for millions of participants

Aug 2020 – Feb 2024 · Principal Data Engineer · Production

Stack

  • Python
  • Apache Airflow
  • AWS (ECS Fargate, MWAA, CDK)
  • MySQL
  • PostgreSQL
  • FastAPI
  • Docker
  • Tableau
  • pandas
  • SQLAlchemy

Skills

  • data platform architecture
  • entity resolution
  • ETL/ELT pipeline design
  • dimensional data modeling
  • infrastructure-as-code
  • cloud cost optimization
  • API design
  • data quality

Architecture

Sources

  • State reporting systems
  • Community-college exports
  • Workforce-agency APIs
  • SFTP / FTPS feeds

Ingest

  • AWS Transfer Family
  • S3 landing & staging

Orchestrate & transform

  • Apache Airflow on MWAA
  • ECS Fargate ETL jobs
  • Record-linkage engine

Data vault

  • Normalized MySQL
  • 130+ tables, full lineage

Serve

  • Tableau dashboards
  • FastAPI partner APIs

Every component is provisioned as code with the AWS CDK: MWAA, ECS Fargate, RDS, Transfer Family, Cognito, S3, Secrets Manager, and the supporting network and monitoring.

The product

As Principal Data Engineer for Literacy Pro (later acquired by Pairin), I rebuilt the company’s entire data platform from the ground up.

The existing system was built in Pentaho, a visual ETL tool. It was deeply siloed, ran on massively denormalized tables that could take days to process, and cost around $250,000 a year on AWS. Reporting was unreliable, and fixing that was what I was originally brought in to do.

I replaced it with a serverless Python platform. Data flows in from a dozen-plus external systems (state reporting systems, community-college exports, workforce-agency APIs, and SFTP/FTPS feeds) and lands in S3 through AWS Transfer Family. Apache Airflow on MWAA orchestrates containerized ECS Fargate jobs that transform the landed data into a normalized data vault of 130+ tables with full load lineage. At the platform's core is a record-linkage engine I built from scratch: it matches millions of adult-education participants across otherwise-siloed systems on more than fifteen identity signals, and it keeps a complete audit trail of every match. The platform feeds Tableau dashboards and secure partner APIs.
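To make the record-linkage idea concrete, here is a minimal sketch of signal-based matching with an audit trail. It is illustrative only and not the production engine: the signal names, weights, threshold, and record fields are all hypothetical, and simple weighted exact-match scoring stands in for whatever comparison logic the real engine uses across its 15+ signals.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical signals and weights, chosen only for illustration.
SIGNAL_WEIGHTS = {
    "ssn_last4": 4.0,
    "date_of_birth": 3.0,
    "last_name": 2.0,
    "first_name": 1.5,
    "email": 3.0,
    "phone": 2.0,
}
MATCH_THRESHOLD = 7.0  # assumed cutoff for declaring a match


def normalize(value):
    """Lowercase and drop non-alphanumerics so formatting differences do not block a match."""
    if value is None:
        return None
    cleaned = "".join(ch for ch in str(value).lower() if ch.isalnum())
    return cleaned or None


@dataclass
class MatchDecision:
    left_id: str
    right_id: str
    score: float
    agreed_signals: list
    is_match: bool
    decided_at: str


def compare(left, right, audit_log):
    """Score two records signal by signal and append the decision to the audit log."""
    agreed, score = [], 0.0
    for signal, weight in SIGNAL_WEIGHTS.items():
        a, b = normalize(left.get(signal)), normalize(right.get(signal))
        # A signal only counts when both sides have a value and the values agree.
        if a is not None and a == b:
            agreed.append(signal)
            score += weight
    decision = MatchDecision(
        left_id=left["id"],
        right_id=right["id"],
        score=score,
        agreed_signals=agreed,
        is_match=score >= MATCH_THRESHOLD,
        decided_at=datetime.now(timezone.utc).isoformat(),
    )
    audit_log.append(decision)  # every decision is recorded, match or not
    return decision


# Made-up example records from two hypothetical source systems.
state_record = {"id": "state:1001", "first_name": "Maria", "last_name": "Lopez",
                "date_of_birth": "1985-03-14", "phone": "(555) 010-2000"}
college_record = {"id": "college:77", "first_name": "maria", "last_name": "LOPEZ ",
                  "date_of_birth": "1985/03/14", "phone": "555-010-2000"}
other_record = {"id": "workforce:9", "first_name": "Maria", "last_name": "Lopes",
                "date_of_birth": "1990-01-01"}

audit_log = []
decision = compare(state_record, college_record, audit_log)
non_match = compare(state_record, other_record, audit_log)
print(decision.is_match, decision.agreed_signals)
print(non_match.is_match, non_match.agreed_signals)
```

In this sketch the audit trail is simply the list of decisions, each carrying the signals that agreed and the resulting score, so any link between two records can be explained after the fact.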

Results

The rebuild did what I was brought in to do: reporting accuracy was restored, and cross-system participant matching improved dramatically as I refined the matching engine. At the same time, the platform’s annual running cost fell roughly 25-fold, and multi-day processing runs dropped to minutes. I designed, built, and operated the whole platform solo (the data vault, the matching engine, and every piece of AWS infrastructure) and worked directly with each external data vendor to shape and standardize the data they sent.

Highlights

  • Cut annual AWS spend ~25×, from roughly $250K to $10K, by replacing a Pentaho ETL stack with a serverless Python platform
  • Reduced data processing time from days to minutes
  • Built a custom record-linkage engine matching millions of participants across 10+ siloed source systems on 15+ identity signals
  • Designed a normalized 130+ table data vault with full load lineage, restoring reporting accuracy across hundreds of locations
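
As an illustration of what "full load lineage" can look like in a normalized schema, the sketch below stamps every loaded row with the batch that produced it, so any record can be traced back to its source system and file. This is an assumption-laden example rather than the production schema: the table and column names, the `load_file` and `trace` helpers, and the sample data are hypothetical, and an in-memory SQLite database stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
conn.executescript("""
CREATE TABLE load_batch (
    batch_id      INTEGER PRIMARY KEY,
    source_system TEXT NOT NULL,
    source_file   TEXT NOT NULL,
    loaded_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE participant (
    participant_id INTEGER PRIMARY KEY,
    full_name      TEXT NOT NULL,
    batch_id       INTEGER NOT NULL REFERENCES load_batch(batch_id)
);
""")


def load_file(conn, source_system, source_file, names):
    """Register one load batch, then insert its rows tagged with that batch id."""
    with conn:  # one transaction: the batch row and its data rows commit together
        cur = conn.execute(
            "INSERT INTO load_batch (source_system, source_file) VALUES (?, ?)",
            (source_system, source_file),
        )
        batch_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO participant (full_name, batch_id) VALUES (?, ?)",
            [(name, batch_id) for name in names],
        )
    return batch_id


def trace(conn, full_name):
    """Return (name, source system, source file) for every row with this name."""
    return conn.execute(
        """
        SELECT p.full_name, b.source_system, b.source_file
        FROM participant AS p
        JOIN load_batch AS b ON b.batch_id = p.batch_id
        WHERE p.full_name = ?
        ORDER BY b.batch_id
        """,
        (full_name,),
    ).fetchall()


# Hypothetical feeds and file names.
first_batch = load_file(conn, "state_reporting", "enrollments.csv", ["Maria Lopez", "Sam Lee"])
second_batch = load_file(conn, "community_college", "export.csv", ["Maria Lopez"])
lineage = trace(conn, "Maria Lopez")
print(lineage)
```

Because each row carries a `batch_id`, a question like "where did this record come from?" becomes a single join rather than a search through job logs.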