Database Migrations Explained
Zero-Downtime Database Migrations at Scale
Deconstructing how to alter schemas and move infrastructure without interrupting live traffic.

Deploying new application code is a relatively simple operation because compute servers are stateless. You can spin up a fleet of new servers, route incoming traffic to them, and safely terminate the old ones.

Databases, however, are entirely stateful. You cannot simply delete an old database and deploy a new one without permanently losing user data. Furthermore, relational databases utilize internal locks to maintain strict data integrity. If you attempt to add a new column to a table containing 500 million rows, the database engine may lock that table for hours to rewrite the underlying files, blocking all incoming read and write requests and effectively bringing the platform offline.
To evolve database infrastructure at scale, engineers must abandon single-step changes and utilize phased, non-blocking migration strategies.
The Offline Migration
The simplest strategy is to accept scheduled downtime. The engineering team displays a maintenance page, shuts down all application servers so no new data is written, runs a script to alter the database schema or copy the data to a new server, and then brings the application back online.
While this guarantees absolute data consistency because the data is perfectly frozen during the move, it is rarely a viable option. For globally distributed applications, massive e-commerce platforms, or critical SaaS products, forced downtime is financially and operationally unacceptable. System architecture must be designed to change while the engine is still running.
An offline database migration involves stopping all application traffic, exporting your data from the source database, importing it into the target system, and rerouting traffic. It is best for small datasets (under 100GB), lower environments, or situations where extended service downtime is acceptable.
The Expand and Contract Pattern
When you need to change a schema in a live relational database—for example, splitting a single full_name column into first_name and last_name—you cannot execute the change in a single step. If you drop the full_name column while the legacy application code is still running, the application will instantly crash when it attempts to query the missing data.

The Expand and Contract pattern solves this by breaking the migration into small, backward-compatible deployments:
Expand: You alter the database to add the new
first_nameandlast_namecolumns, explicitly allowing them to be nullable. You do not touch or alter the existingfull_namecolumn. Adding a nullable column is usually a fast metadata operation that avoids long table locks.Dual Write: You deploy new application code. When a user updates their profile, the application writes the data to the old
full_namecolumn and the newfirst_name/last_namecolumns simultaneously.Backfill: A background script iterates through the millions of existing database records, parses the old
full_namedata, and populates the new columns for all historical rows.Transition: You deploy another application update that explicitly reads from the new
first_nameandlast_namecolumns, completely abandoning the old column.Contract: Months later, once you are absolutely certain the new schema is stable and no legacy code is referencing it, you safely run a script to
DROPthe oldfull_namecolumn, reclaiming the disk space.
𝐋𝐞𝐚𝐫𝐧 𝐭𝐨 𝐛𝐮𝐢𝐥𝐝 𝐆𝐢𝐭, 𝐃𝐨𝐜𝐤𝐞𝐫, 𝐑𝐞𝐝𝐢𝐬, 𝐇𝐓𝐓𝐏 𝐬𝐞𝐫𝐯𝐞𝐫𝐬, 𝐚𝐧𝐝 𝐜𝐨𝐦𝐩𝐢𝐥𝐞𝐫𝐬, 𝐟𝐫𝐨𝐦 𝐬𝐜𝐫𝐚𝐭𝐜𝐡. Get 40% OFF CodeCrafters: https://app.codecrafters.io/join?via=the-coding-gopher
Migrating Engines: Dual Writes
If you are migrating off one database engine entirely—such as moving from a relational database like PostgreSQL to a NoSQL database like DynamoDB—you must run both databases in parallel to ensure a seamless cutover.
This is typically managed at the application layer:
The application is updated to write incoming data to both the old PostgreSQL database and the new DynamoDB database. PostgreSQL remains the primary source of truth for all reads.
A background job copies all historical data from PostgreSQL to DynamoDB.
The application is updated to execute “dark reads.” It queries both databases, but only returns the PostgreSQL data to the user. In the background, it compares the DynamoDB result to the PostgreSQL result and logs any discrepancies, mathematically proving the new database is functioning correctly.
Once validation is completely clean, a configuration flag is flipped. The application starts reading directly from DynamoDB, and PostgreSQL is eventually decommissioned.
Note: A dark read (often referred to alongside "dark writes") is a software testing technique where a new service, database, or feature is run in parallel with an existing production system. User requests are sent to the original system, but the request is simultaneously duplicated and sent "in the dark" to the new system
While effective, application-level dual writes are highly susceptible to partial failures. If the application successfully writes to PostgreSQL but a brief network timeout causes the DynamoDB write to fail, the two databases are instantly out of sync.
Infrastructure Synchronization: Change Data Capture (CDC)
To solve the partial failure problem of dual writes, engineers push the synchronization logic down from the application code into the infrastructure layer using Change Data Capture (CDC).

Relational databases maintain an internal, append-only log of every transaction that occurs to survive sudden crashes (such as the Write-Ahead Log in PostgreSQL or the Binlog in MySQL).
A CDC tool constantly tails this internal log. When the application writes a row to the primary database, the CDC system instantly reads that raw transaction event and asynchronously streams it—often via an event queue like Apache Kafka—directly into the secondary database.
This removes the burden of dual-writing from the application completely. The application only writes to the primary database, and the infrastructure guarantees that the secondary database will eventually and accurately reflect those exact changes, making zero-downtime database migrations significantly safer and more resilient to network failures.


