DetailPage: Amazon Marketplace Data Platform

Double-digit-terabyte data platform ingesting hundreds of millions of Amazon products every month

May 2023 – Oct 2025 · Principal Data Engineer · Production

Stack

  • Python
  • Amazon Redshift
  • Apache Airflow
  • AWS (ECS Fargate, Lambda, SQS, CDK)
  • PostgreSQL
  • FastAPI
  • Docker
  • pandas
  • SQLAlchemy

Skills

  • data warehouse architecture
  • large-scale data ingestion
  • distributed ETL
  • API integration engineering
  • infrastructure-as-code
  • cloud cost optimization
  • pipeline orchestration
  • engineering mentorship

Architecture

Sources

  • Keepa — price & sales rank
  • AsinDataApi — search results
  • Amazon keyword files
  • Amazon SP-API

Ingest & process

  • Always-on (24/7) processing across dozens of ETL jobs
  • SQS + Lambda fan-out
  • Rate-throttled vendor APIs

Orchestrate

  • Apache Airflow on MWAA
  • ECS Fargate ETL jobs

Warehouse

  • Amazon Redshift, double-digit TB
  • Aurora PostgreSQL + pgvector

Serve

  • 3 FastAPI services
  • Redis cache

Every component is provisioned as code with the AWS CDK: Redshift, MWAA, ECS Fargate, Lambda, SQS, EFS, Aurora, ElastiCache, Cognito, and the supporting network.

The product

As Principal Data Engineer for DetailPage, I built and ran the company’s entire data backend: the internal platform behind DetailPage’s Amazon marketplace analytics product. It includes the APIs that power the company’s SEO and AEO optimizations, and substantial ML pipelines that predict sales from scraped product metrics, customer actuals, and Amazon’s self-reported “soft ranges”. It also supports custom category universes for large clients: when no Amazon category fits a client’s market, we build one that does, which enables far more precise reporting and competitor analysis.

DetailPage helps brands optimize how their products rank in the Amazon marketplace and, increasingly, in AI shopping engines, and that depends on knowing the marketplace in depth. I built the platform that gathers and serves that knowledge. Every month it ingests data on hundreds of millions of Amazon products from four external sources: Keepa for pricing, sales rank, and history; AsinDataApi for search-results data; Amazon’s own keyword files; and the Amazon Selling Partner API. An always-on processor and a distributed SQS + Lambda extraction pipeline (with token-aware rate throttling and priority tiers) feed a double-digit-terabyte Amazon Redshift warehouse, orchestrated by Apache Airflow on MWAA. Three FastAPI services serve the data to the product, with PostgreSQL and pgvector powering keyword-embedding search and an OpenAI-backed analytics feature.
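The token-aware throttling with priority tiers can be illustrated with a small sketch. This is an in-memory illustration under stated assumptions, not the platform's production code: the names `Job`, `TokenBudget`, and `PriorityThrottle`, the token costs, and the refill rate are all hypothetical, and a heap stands in for the SQS queues and Lambda workers. It assumes each vendor request costs a known number of tokens and that the vendor refills the budget at a steady rate.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: int  # lower number = more urgent tier (hypothetical convention)
    seq: int       # tie-breaker: keeps FIFO order within a tier
    name: str = field(compare=False)
    token_cost: float = field(compare=False)


class TokenBudget:
    """Token bucket: refills at a fixed rate, capped at `capacity`."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.updated_at = now

    def try_spend(self, cost: float, now: float) -> bool:
        # Credit tokens earned since the last call, then spend if affordable.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True


class PriorityThrottle:
    """Releases queued jobs in tier order while the token budget allows."""

    def __init__(self, budget: TokenBudget):
        self.budget = budget
        self._heap: list[Job] = []
        self._seq = itertools.count()

    def submit(self, name: str, token_cost: float, priority: int) -> None:
        heapq.heappush(self._heap, Job(priority, next(self._seq), name, token_cost))

    def drain(self, now: float) -> list[str]:
        # Stop at the first unaffordable job so cheaper, lower-tier work
        # cannot jump ahead of an urgent job that is waiting for tokens.
        released = []
        while self._heap and self.budget.try_spend(self._heap[0].token_cost, now):
            released.append(heapq.heappop(self._heap).name)
        return released


# Illustrative run with made-up numbers: 10-token budget, 1 token/second.
throttle = PriorityThrottle(TokenBudget(capacity=10, refill_per_sec=1))
throttle.submit("backfill-batch", token_cost=6, priority=2)
throttle.submit("client-refresh", token_cost=6, priority=0)
print(throttle.drain(now=0.0))  # ['client-refresh']
print(throttle.drain(now=2.0))  # ['backfill-batch']
```

The sketch uses strict priority: a blocked high-tier job holds the line until tokens refill. A real pipeline might instead let small low-tier jobs through, or reserve part of the budget per tier; which trade-off fits depends on the vendor's token rules.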

Results

I designed and built the whole platform solo, every layer of it: the AWS infrastructure, the ingestion pipelines, the Redshift schema and its stored procedures, the APIs, and the data-science pieces. Careful right-sizing, serverless components, and auto-scaling kept the AWS bill under $3,000 a month while the platform covered tens of thousands of product categories, millions of keywords, and dozens of client brands. In the later stages I onboarded and trained a small team of mid-level engineers to take over the platform.

Highlights

  • Built and ran, solo, a double-digit-TB Amazon Redshift platform ingesting hundreds of millions of Amazon products a month from four external data sources
  • Kept the entire platform’s AWS spend under $3K/month through serverless, auto-scaling, right-sized infrastructure
  • Designed a distributed SQS + Lambda extraction pipeline with token-aware rate throttling and priority tiers
  • Covered tens of thousands of product categories and millions of keywords across dozens of client brands; later onboarded and trained a team of mid-level engineers to take over the platform