How Netflix Migrated 400 PostgreSQL Clusters to Aurora with Zero Data Loss

Netflix

Netflix built a fully automated, self-service pipeline to migrate its fleet of RDS PostgreSQL databases to Aurora PostgreSQL - achieving near-zero downtime with no credentials required and no application code changes.

Required Knowledge

  • PostgreSQL - An open-source relational database that stores data in tables, like a very powerful spreadsheet. Netflix uses it to power many of its backend services.
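  • Amazon RDS vs. Aurora - Both are managed database services on AWS. RDS for PostgreSQL runs the standard PostgreSQL engine on managed instances; Aurora PostgreSQL is AWS's cloud-native, PostgreSQL-compatible engine, which separates compute from a distributed storage layer for better scalability, availability, and cost efficiency at scale.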
  • Replication - The process of continuously copying data changes from one database (the source) to another (the destination) so they stay in sync. This allows migrations with minimal downtime.
  • Write-Ahead Log (WAL) - A running record of every change made to a database, written to durable storage before the change is applied to the data files. Replication works by streaming this log to a replica, which replays the same changes.
  • Data Access Layer (DAL) - Middleware that sits between applications and databases. Netflix routes all database traffic through a DAL (built on Envoy proxies), which means switching databases requires only a config update - no app code changes needed.
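
To make the DAL idea concrete, here is a minimal sketch of the indirection it provides. It is a toy model, not Netflix's Data Gateway or Envoy configuration: the class, the logical name "orders-db", and the hostnames are all hypothetical. The point it illustrates is that applications ask only for a logical name, so a cutover is a routing update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int = 5432  # default PostgreSQL port


class DataAccessLayer:
    """Toy stand-in for a proxy-based DAL: maps logical names to endpoints."""

    def __init__(self) -> None:
        self._routes: dict[str, Endpoint] = {}

    def set_route(self, logical_name: str, endpoint: Endpoint) -> None:
        # A config update: the only thing that changes during cutover.
        self._routes[logical_name] = endpoint

    def resolve(self, logical_name: str) -> Endpoint:
        # Applications only ever ask for the logical name.
        return self._routes[logical_name]


# Hypothetical hostnames, for illustration only.
dal = DataAccessLayer()
dal.set_route("orders-db", Endpoint("orders.rds.example.internal"))
before = dal.resolve("orders-db")

# Cutover: repoint the logical name; the application call is unchanged.
dal.set_route("orders-db", Endpoint("orders.aurora.example.internal"))
after = dal.resolve("orders-db")
```

The application-side call, `dal.resolve("orders-db")`, is identical before and after the cutover, which is why no application code has to change.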

My Key Takeaways

  • Automating migrations at scale requires eliminating manual touchpoints - With ~400 clusters, any step that needed human intervention or database credentials would have made the migration untenable. Netflix's workflow is fully self-service: service owners trigger it themselves with no platform team involvement per migration.
  • Infrastructure-layer quiescence is more reliable than app-layer controls - Trusting applications to stop writes is fragile. Netflix enforces quiescence at the infrastructure layer instead: it strips the security groups from the RDS instance to cut off network access, then reboots the instance to drop any connections that are still open, so no stale connection can write to the source during cutover.
  • Understanding your metrics deeply prevents incorrect cutover decisions - When OldestReplicationSlotLag oscillates between 0 and 64 MB (0 → 64 MB → 0), it looks like lag but isn't. A team that doesn't understand this behavior might wait indefinitely or cut over prematurely. Netflix's automation specifically watches for this pattern as a signal that replication is complete.
  • A proxy layer makes database migrations transparent to applications - Because all traffic flows through Netflix's Envoy-based Data Gateway, switching from RDS to Aurora required only a config update to the proxy. No application code changed, no credentials rotated, no service restarts needed.
  • Standardizing on fewer database technologies compounds value over time - By consolidating onto Aurora PostgreSQL, Netflix can invest engineering depth into one system, build shared tooling, and reduce operational overhead - rather than spreading expertise across many database engines that largely solve the same problems.
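
The OldestReplicationSlotLag takeaway above can be sketched as a simple pattern check. This is an illustrative heuristic only: the source does not describe how Netflix's automation detects the oscillation, so the function names, the 1 MB floor, and the two-reset threshold are all assumptions. The idea is that a series which stays within 0 to 64 MB and repeatedly drops back to zero is treated as caught up, while a series that keeps growing is real lag.

```python
def count_resets(samples_mb, floor_mb=1.0):
    """Count drops from above the floor back down to (near) zero."""
    resets = 0
    for prev, cur in zip(samples_mb, samples_mb[1:]):
        if prev > floor_mb and cur <= floor_mb:
            resets += 1
    return resets


def looks_caught_up(samples_mb, ceiling_mb=64.0, floor_mb=1.0, min_resets=2):
    """Heuristic: bounded sawtooth => replica has caught up.

    All thresholds are assumed values for illustration.
    """
    if not samples_mb:
        return False
    # Every sample must stay inside the 0..ceiling band.
    bounded = all(0 <= s <= ceiling_mb for s in samples_mb)
    # And the series must have reset to ~0 at least min_resets times.
    return bounded and count_resets(samples_mb, floor_mb) >= min_resets


# Made-up sample series (MB), not real metric data.
sawtooth = [0, 16, 32, 48, 64, 0, 16, 32, 48, 64, 0]
growing = [0, 64, 128, 256, 512]
flat_high = [40, 41, 42]

sawtooth_ok = looks_caught_up(sawtooth)
growing_ok = looks_caught_up(growing)
flat_ok = looks_caught_up(flat_high)
```

Requiring repeated resets rather than a single low reading is a deliberate choice in this sketch: one sample near zero could be a momentary dip, whereas a repeating pattern is harder to mistake for genuine, growing lag.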