How Netflix Migrated 400 PostgreSQL Clusters to Aurora with Zero Data Loss

Required Knowledge

PostgreSQL - An open-source relational database that stores data in tables, like a very powerful spreadsheet. Netflix uses it to power many of its backend services.
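Amazon RDS vs. Aurora - Both are managed relational database services on AWS. RDS runs standard database engines, such as community PostgreSQL, on managed instances. Aurora is AWS's cloud-native engine, compatible with PostgreSQL and MySQL, that separates compute from a distributed storage layer; AWS positions it as offering better scalability, availability, and cost efficiency at scale.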
Replication - The process of continuously copying data changes from one database (the source) to another (the destination) so they stay in sync. This allows migrations with minimal downtime.
Write-Ahead Log (WAL) - A sequential record of every change made to a database, written to disk before the change is applied to the data files. Replication works by streaming this log to a replica so it can replay the same changes in the same order.
Data Access Layer (DAL) - Middleware that sits between applications and databases. Netflix routes all database traffic through a DAL (built on Envoy proxies), which means switching databases requires only a config update - no app code changes needed.
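
The DAL entry above comes down to one level of indirection: applications ask for a logical database name, and the layer decides which physical endpoint that name currently points to. The following is a minimal Python sketch of that idea only, not Netflix's implementation. The class names, the "orders-db" logical name, and the example hostnames are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Endpoint:
    """A physical database endpoint (hypothetical hostnames below)."""
    host: str
    port: int = 5432


class RoutingTable:
    """Maps a logical database name to the endpoint currently behind it."""

    def __init__(self) -> None:
        self._routes: Dict[str, Endpoint] = {}

    def set_route(self, logical_name: str, endpoint: Endpoint) -> None:
        # In a real proxy this would be a config push; here it is a dict write.
        self._routes[logical_name] = endpoint

    def resolve(self, logical_name: str) -> Endpoint:
        return self._routes[logical_name]


def app_target(routes: RoutingTable) -> str:
    """Stand-in for application code: it only knows the logical name."""
    return routes.resolve("orders-db").host


routes = RoutingTable()

# Before migration: the logical name points at the RDS instance.
routes.set_route("orders-db", Endpoint("orders.rds.example.internal"))
before = app_target(routes)

# Cutover: only the routing entry changes; app_target() is untouched.
routes.set_route("orders-db", Endpoint("orders.aurora.example.internal"))
after = app_target(routes)
```

The application function is identical before and after cutover; only the routing entry differs, which is why a migration behind such a layer needs no application changes.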

My Key Takeaways

Automating migrations at scale requires eliminating manual touchpoints - With ~400 clusters, any step that needed human intervention or database credentials would have made the migration untenable. Netflix's workflow is fully self-service: service owners trigger it themselves with no platform team involvement per migration.
Infrastructure-layer quiescence is more reliable than app-layer controls - Trusting applications to stop writes is fragile. Netflix enforces quiescence by stripping the security groups from the RDS instance, which blocks new connections, and then rebooting it, which drops any connections that were already open. Together these guarantee that no stale connection can write to the source during cutover.
Understanding your metrics deeply prevents incorrect cutover decisions - The 0 → 64 MB oscillation in the OldestReplicationSlotLag metric looks like replication lag but is not. A team that does not understand this behavior might wait indefinitely for the value to settle at zero, or cut over prematurely. Netflix's automation specifically watches for this pattern as a signal that replication is complete.
A proxy layer makes database migrations transparent to applications - Because all traffic flows through Netflix's Envoy-based Data Gateway (the DAL described above), switching from RDS to Aurora required only a config update to the proxy. No application code changed, no credentials were rotated, and no services needed restarting.
Standardizing on fewer database technologies compounds value over time - By consolidating onto Aurora PostgreSQL, Netflix can invest engineering depth into one system, build shared tooling, and reduce operational overhead - rather than spreading expertise across many database engines that largely solve the same problems.
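
To make the OldestReplicationSlotLag takeaway concrete, here is a minimal Python sketch of how automation might tell a bounded sawtooth apart from real lag. The function name, the 1 MB floor, and the requirement of at least two resets are assumptions for illustration. The source describes only the 0 → 64 MB pattern, not Netflix's actual detection logic.

```python
from typing import Sequence

MB = 1024 * 1024
# Assumption: the sawtooth peaks at about 64 MB, per the pattern described above.
SAWTOOTH_CEILING_BYTES = 64 * MB


def looks_caught_up(
    samples: Sequence[float],
    ceiling: float = SAWTOOTH_CEILING_BYTES,
    floor: float = 1 * MB,
    min_resets: int = 2,
) -> bool:
    """Return True if lag samples (bytes, oldest first) form a bounded sawtooth.

    The heuristic: lag never exceeds the ceiling, and it drops back to
    near zero at least `min_resets` times within the window.
    """
    if not samples:
        return False
    # Anything above the ceiling is treated as real, unconsumed lag.
    if max(samples) > ceiling:
        return False
    # Count transitions from "above the floor" to "at or below the floor".
    resets = sum(
        1
        for prev, cur in zip(samples, samples[1:])
        if prev > floor and cur <= floor
    )
    return resets >= min_resets


# Hypothetical metric windows, in bytes.
sawtooth = [0, 16 * MB, 32 * MB, 48 * MB, 64 * MB, 0,
            16 * MB, 32 * MB, 48 * MB, 64 * MB, 0]
growing = [0, 64 * MB, 128 * MB, 256 * MB]
stuck = [32 * MB] * 5
```

With these windows, `looks_caught_up(sawtooth)` is true, while `growing` (lag exceeds the ceiling) and `stuck` (lag never resets) are both rejected. Requiring repeated resets rather than a single zero reading is the design choice that avoids both failure modes named above: waiting forever for a flat zero, and cutting over on one lucky sample.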