Postgres Aurora DB major version upgrade with minimal downtime

Jay Patel
Lyft Engineering · Mar 11, 2024 · 13 min read


Introduction

Our payment platform team had the unique challenge of upgrading our Aurora Postgres DB from v10 to v13. This DB stores transactions within Lyft and contains ~400 tables (with partitions) and ~30TB of data. Upgrading the database in-place would have resulted in ~30 mins of downtime. Such significant downtime is untenable: it would cause cascading failures across multiple downstream services, requiring a large amount of engineering effort to remediate. Through native logical replication between the two Postgres major versions and a blue/green deploy, we were able to reduce the downtime to ~7 mins!

Considered Upgrade Approaches

In-place upgrade

This is the simplest method to upgrade the PG database. It involves upgrading the DB directly using the pg_upgrade command. However, an in-place upgrade incurs downtime because the upgrade happens at the storage layer, not the compute layer: the storage layer, and therefore effectively all read/write operations, becomes unavailable for the duration of the upgrade. Our analysis revealed that this process, and the resulting downtime, would have lasted ~30 minutes.

DMS

AWS provides the Database Migration Service (DMS), which allows logical replication between a source and target Postgres DB. This allows for instantaneously cutting traffic over to the upgraded DB, minimizing downtime. Unfortunately, DMS requires a full load of the existing data into the target DB; for a database of our size, this would keep replication lag elevated for days/weeks.

Native Logical Replication

To overcome the full-load issue with Amazon's DMS, we can clone the existing DB, upgrade the clone to the desired version, and then set up replication between the source and the target DB. This was the least disruptive method in our situation, and the approach we decided to employ. Our experimentation in a testing environment gave us enough confidence to move ahead with it.

Upgrade using Logical Replication

To understand logical replication, let’s familiarize ourselves with the following terms:

  1. Publisher — the source DB publishing the data for replication
  2. Subscriber — the target DB subscribing to the source DB for replication data
  3. Replication slot — the basic idea of a replication slot is to let the publisher know which transactions the subscriber has consumed so far. This lets the publisher decide whether the WAL up to a certain point should be preserved or flushed. A replication slot can have one or more tables added to it (via its publication). If a table/partition has a large number of inserts, updates, and deletes, it is best to give it a dedicated replication slot to avoid lag between the source and target DB
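
Before creating publications, it is worth confirming that the source cluster is configured for logical replication. A minimal sanity check (on Aurora/RDS, wal_level is typically switched to logical by enabling the rds.logical_replication parameter in the cluster parameter group):

SHOW wal_level; -- should return 'logical'

SELECT name, setting
FROM pg_settings
WHERE name IN ('wal_level', 'max_replication_slots', 'max_wal_senders');

SQL queries to verify that the source DB can act as a publisher.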

Let’s talk about the upgrade steps now. We divide the upgrade into two phases:

  1. Setup replication (Phase 1)
  2. Cutover traffic to the upgraded DB (Phase 2)

Replication Setup (Phase 1)

Part 1: Setup publisher (PG10)

  1. Create publications

Let's create publications for table1, table2, and the other tables. We create separate publications for table1 and table2, which receive heavy writes.

CREATE PUBLICATION pub_all_table;
CREATE PUBLICATION pub_table1;
CREATE PUBLICATION pub_table2;

SQL queries to create the 3 publications we’ll use for the replication.

2. Add tables to publication

ALTER PUBLICATION pub_table1 ADD TABLE table1;
ALTER PUBLICATION pub_table2 ADD TABLE table2;
ALTER PUBLICATION pub_all_table ADD TABLE table3, table4, table5;

SQL queries to add corresponding tables to each of the 3 publications.

3. Create replication slots¹

SELECT pg_create_logical_replication_slot('pub_all_table', 'pgoutput');
SELECT pg_create_logical_replication_slot('pub_table1', 'pgoutput');
SELECT pg_create_logical_replication_slot('pub_table2', 'pgoutput');

SQL queries to create logical replication slots for each publication.

Part 2: Setup subscriber (PG13)

  1. Clone the PG10 database cluster. Aurora Cloning² uses copy-on-write and thus is really fast; our ~30 TB of data was cloned within 20 minutes or so.
  2. Capture the LSN in the cloned DB. Think of the LSN as a global commit number which acts as the initial checkpoint from which we want the replication to begin.
SELECT * FROM aurora_volume_logical_start_lsn();

 aurora_volume_logical_start_lsn
---------------------------------
 29B/A04B33A8

3. The replication slots & publications in the new subscriber DB are relics of cloning the PG10 DB, so we can safely drop them.

SELECT pg_drop_replication_slot('pub_all_table');
SELECT pg_drop_replication_slot('pub_table1');
SELECT pg_drop_replication_slot('pub_table2');
DROP PUBLICATION pub_all_table;
DROP PUBLICATION pub_table1;
DROP PUBLICATION pub_table2;

4. Upgrade our PG10 clone to PG13. This upgrade takes ~30 mins to complete, during which we can run the following command in the source DB (PG10) to find the replication lag between our two DBs.

SELECT now() AS CURRENT_TIME,
slot_name,
active,
active_pid,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS diff_size,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS diff_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';


         current_time          |   slot_name   | active | active_pid | diff_size | diff_bytes
-------------------------------+---------------+--------+------------+-----------+------------
 2023-03-27 04:33:46.974475+00 | pub_all_table | f      |            | 66 MB     |   69123352
 2023-03-27 04:33:46.974475+00 | pub_table1    | f      |            | 66 MB     |   69106928
 2023-03-27 04:33:46.974475+00 | pub_table2    | f      |            | 66 MB     |   69084400
 2023-03-27 04:33:46.974475+00 | pub_table3    | f      |            | 66 MB     |   69074600
 2023-03-27 04:33:46.974475+00 | pub_table4    | f      |            | 66 MB     |   69064856
 2023-03-27 04:33:46.974475+00 | pub_table5    | f      |            | 66 MB     |   69053368
 2023-03-27 04:33:46.974475+00 | pub_table6    | f      |            | 66 MB     |   69035888

At this time, we see that the size of the lag between our databases is 66MB.

         current_time          |   slot_name   | active | active_pid | diff_size | diff_bytes
-------------------------------+---------------+--------+------------+-----------+------------
 2023-03-27 04:43:45.983791+00 | pub_all_table | f      |            | 200 MB    |  209291552
 2023-03-27 04:43:45.983791+00 | pub_table1    | f      |            | 200 MB    |  209275128
 2023-03-27 04:43:45.983791+00 | pub_table2    | f      |            | 200 MB    |  209252600
 2023-03-27 04:43:45.983791+00 | pub_table3    | f      |            | 200 MB    |  209242800
 2023-03-27 04:43:45.983791+00 | pub_table4    | f      |            | 200 MB    |  209233056
 2023-03-27 04:43:45.983791+00 | pub_table5    | f      |            | 200 MB    |  209221568
 2023-03-27 04:43:45.983791+00 | pub_table6    | f      |            | 200 MB    |  209204088

Just 10 minutes later, the replication lag has increased 3x, from 66MB to 200MB!

5. Create a subscription in PG13
Note: Create but do not enable the subscription yet!

CREATE SUBSCRIPTION sub_all_table
CONNECTION 'host=mydb-cluster.cluster-c123.us-east-1.rds.amazonaws.com dbname=db_name port=5432 user=user password=xxx'
PUBLICATION pub_all_table WITH ( copy_data = false,
create_slot = false, enabled = false, connect = true,
slot_name = 'pub_all_table' );

CREATE SUBSCRIPTION sub_table1 CONNECTION
'host=mydb-cluster.cluster-c123.us-east-1.rds.amazonaws.com dbname=db_name port=5432 user=user password=xxx'
PUBLICATION pub_table1 WITH ( copy_data = false,
create_slot = false, enabled = false, connect = true,
slot_name = 'pub_table1' );

CREATE SUBSCRIPTION sub_table2 CONNECTION
'host=mydb-cluster.cluster-c123.us-east-1.rds.amazonaws.com dbname=db_name port=5432 user=user password=xxx'
PUBLICATION pub_table2 WITH ( copy_data = false, create_slot = false,
enabled = false, connect = true, slot_name = 'pub_table2' );

We run the above SQL to create a connection between our PG13 DB and the PG10 DB, allowing us to eventually enable the subscription and begin to reduce replication lag.
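
To confirm that the subscriptions exist but remain disabled, we can query the pg_subscription catalog (an optional sanity check):

SELECT subname, subenabled FROM pg_subscription;

subenabled should be f for all three subscriptions at this point.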

6. Capture Replication Origin³

SELECT * FROM pg_replication_origin;

 roident |   roname
---------+-------------
       1 | pg_53563397
       2 | pg_53563398
       3 | pg_53563399
       4 | pg_53563400
       5 | pg_53563401
       6 | pg_53563402
       7 | pg_53563403

We capture the replication origin of the new DB, which will be used to start the replication process from the appropriate checkpoint for each slot.

7. Advance Replication Origin to the LSN number captured earlier.

This is a crucial step to ensure we start change data capture (CDC) from a known point for each slot.

SELECT pg_replication_origin_advance('pg_53563397', '29B/A04B33A8');
SELECT pg_replication_origin_advance('pg_53563398', '29B/A04B33A8');
SELECT pg_replication_origin_advance('pg_53563399', '29B/A04B33A8');
SELECT pg_replication_origin_advance('pg_53563400', '29B/A04B33A8');
SELECT pg_replication_origin_advance('pg_53563401', '29B/A04B33A8');
SELECT pg_replication_origin_advance('pg_53563402', '29B/A04B33A8');
SELECT pg_replication_origin_advance('pg_53563403', '29B/A04B33A8');

Use Postgres’ pg_replication_origin_advance feature to advance to the LSN captured earlier.
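
To double-check that the advance took effect, each origin's remote_lsn in the pg_replication_origin_status view should now show the captured LSN (an optional sanity check):

SELECT local_id, external_id, remote_lsn FROM pg_replication_origin_status;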

Re-checking the replication lag between the two databases, it has continued to grow, since we have not yet enabled the subscriptions.

         current_time          |   slot_name   | active | active_pid | diff_size | diff_bytes
-------------------------------+---------------+--------+------------+-----------+------------
 2023-03-27 06:07:23.448674+00 | pub_all_table | f      |            | 1177 MB   | 1234508464
 2023-03-27 06:07:23.448674+00 | pub_table1    | f      |            | 1177 MB   | 1234492040
 2023-03-27 06:07:23.448674+00 | pub_table2    | f      |            | 1177 MB   | 1234469512
 2023-03-27 06:07:23.448674+00 | pub_table3    | f      |            | 1177 MB   | 1234459712
 2023-03-27 06:07:23.448674+00 | pub_table4    | f      |            | 1177 MB   | 1234449968
 2023-03-27 06:07:23.448674+00 | pub_table5    | f      |            | 1177 MB   | 1234438480
 2023-03-27 06:07:23.448674+00 | pub_table6    | f      |            | 1177 MB   | 1234421000

In ~85 minutes, the replication lag has shot up to 1177MB.

8. Finally, with the advanced replication origins, we can enable the subscriptions in PG13, which should start to reduce the replication lag between the databases.

ALTER SUBSCRIPTION sub_all_table ENABLE;
ALTER SUBSCRIPTION sub_table1 ENABLE;
ALTER SUBSCRIPTION sub_table2 ENABLE;
-- Enable other subscriptions as well


         current_time          |   slot_name   | active | active_pid | diff_size | diff_bytes
-------------------------------+---------------+--------+------------+-----------+------------
 2023-03-27 06:17:19.748714+00 | pub_all_table | t      |      11757 | 168 MB    |  176012552
 2023-03-27 06:17:19.748714+00 | pub_table1    | t      |      10825 | 643 MB    |  673892912
 2023-03-27 06:17:19.748714+00 | pub_table2    | t      |      12181 | 992 MB    | 1040602856
 2023-03-27 06:17:19.748714+00 | pub_table3    | t      |      17747 | 1099 MB   | 1152246216
 2023-03-27 06:17:19.748714+00 | pub_table4    | t      |      16015 | 344 MB    |  361209240
 2023-03-27 06:17:19.748714+00 | pub_table5    | t      |      17878 | 1205 MB   | 1263605248
 2023-03-27 06:17:19.748714+00 | pub_table6    | t      |      17961 | 1205 MB   | 1263605248

The replication lag is now decreasing. Note that the replication lag of a heavy-writes table will be larger than the others, but it is catching up! Also note that the active field is now true (t) and active_pid is populated, which means that there are processes assigned to carry out the replication. Pretty cool!

         current_time          |   slot_name   | active | active_pid | diff_size | diff_bytes
-------------------------------+---------------+--------+------------+-----------+------------
 2023-03-27 06:25:08.704248+00 | pub_all_table | t      |      11757 | 37 kB     |      38144
 2023-03-27 06:25:08.704248+00 | pub_table1    | t      |      10825 | 50 kB     |      51576
 2023-03-27 06:25:08.704248+00 | pub_table2    | t      |      12181 | 697 MB    |  731124584
 2023-03-27 06:25:08.704248+00 | pub_table3    | t      |      18329 | 73 kB     |      75200
 2023-03-27 06:25:08.704248+00 | pub_table4    | t      |      16015 | 47 kB     |      47760
 2023-03-27 06:25:08.704248+00 | pub_table5    | t      |      17878 | 37 kB     |      38144
 2023-03-27 06:25:08.704248+00 | pub_table6    | t      |      17961 | 73 kB     |      75200

Within ~20 mins of enabling the subscriptions in step 8, the replication lag is down to a few kB for all but the heaviest-write slot.

Success! There will always be some form of replication lag due to network latency, but our subscription process has successfully minimized it!
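
The same progress can also be observed from the subscriber (PG13) side via the standard pg_stat_subscription view (another optional sanity check):

SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;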

9. We can now run VACUUM ANALYZE on the parent tables to update statistics on our target DB for optimal query performance.
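For example, using the illustrative table names from above:

VACUUM ANALYZE table1;
VACUUM ANALYZE table2;
-- ...and so on for the remaining parent tables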

PG10->PG13 Replication Setup

Bring traffic to our upgraded DB (Phase 2)

Our goal is now to bring this replication lag to 0 and cut the traffic over to the upgraded DB.

Reason to halt traffic

Because of the ingress write traffic to our database, there will always be some amount of replication lag between the source and target DBs. If we cut over to the new database while there is replication lag, we'll lose the lagging data. The only way to ensure zero replication lag between the two DBs, and in turn a lossless migration, is to halt ingress traffic for a (short) period of time.

Note: In our case, only 1 service was connecting to our transactions DB and all downstream services used CRUD APIs through that service to communicate with the DB. If you have multiple services connecting to the DB, you will have to halt traffic to all those services.

Approach to halt traffic and avoid writes

To avoid writes, we made our PG10 DB read-only. Our internal infra uses Kubernetes as the service orchestrator and Envoy as a proxy. To prevent any traffic from flowing to our DB, we could consider scaling our service pods down to 0 or intentionally failing our healthcheck. However, this would add to our downtime during cutover, as it would take some time for our service to scale back up and be ready to serve traffic. Instead, we chose to use an Envoy circuit breaker, which immediately returns an HTTP 503 to the downstream caller. This combination of circuit breaker + read-only source DB ensures there are no new writes happening and the replication lag between the source and target DB quickly drops to 0.

Switching traffic to the target DB

A simple approach to switching traffic is to update a runtime config value for the connection string, changing it from the old DB to the new DB. However, this would require deploying a runtime config change and redeploying the application to pick up the new connection string. To avoid this, we opted for AWS Route53 instead. Our service initially connects to the PG10 DB through a Route53 URL, and during cutover we update the Route53 record to point to the upgraded DB's endpoint. Route53 guarantees that DNS entries are updated within 60 seconds, ensuring the application connects to the upgraded DB with minimal delay.

Cutover steps

  1. Double-check the replication lag. It should be no more than a few kB.
  2. Downtime starts: trigger the circuit breaker to halt all traffic to the application.
    Why: To ensure no data loss during cutover.
    We wait a minute or so for all in-flight requests to finish processing.
    How to verify: SSH into a downstream pod and ensure that a 503 is returned for any GET requests.
Circuit breaker enabled

3. Put the PG10 DB in read-only mode
Why: Ensure that even if requests somehow make it to the application(s), no data is written.
How to verify: Write a script that inserts a dummy entry into a dummy table. The write should fail loudly, with an error conveying that write transactions are disabled.
ALTER DATABASE my_db SET default_transaction_read_only = true;
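
A minimal sketch of that verification (dummy_table is a hypothetical name used only for this check; the setting applies to new sessions, so use a fresh connection):

-- From a new connection to the PG10 DB
INSERT INTO dummy_table (note) VALUES ('cutover check');
-- Expected: ERROR: cannot execute INSERT in a read-only transaction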

4. Terminate all connections to the source PG10 DB
Why: Ensures no lingering connections to the PG10 DB from the application(s).
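One way to do this, assuming the application database is named my_db as in the read-only command above:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'my_db'
  AND pid <> pg_backend_pid();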

5. Check replication lag
Why: If the replication lag is 0, it means the PG10 and PG13 DBs are in sync.

SELECT now() AS CURRENT_TIME,
slot_name,
active,
active_pid,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS diff_size,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS diff_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';

         current_time          |   slot_name   | active | active_pid | diff_size | diff_bytes
-------------------------------+---------------+--------+------------+-----------+------------
 2023-04-03 08:37:46.169934+00 | pub_all_table | t      |      58088 | 0 bytes   |          0
 2023-04-03 08:37:46.169934+00 | pub_table1    | t      |      57556 | 0 bytes   |          0
 2023-04-03 08:37:46.169934+00 | pub_table2    | t      |      57691 | 0 bytes   |          0
 2023-04-03 08:37:46.169934+00 | pub_table3    | t      |      59805 | 0 bytes   |          0
 2023-04-03 08:37:46.169934+00 | pub_table5    | t      |      59025 | 0 bytes   |          0
 2023-04-03 08:37:46.169934+00 | pub_table6    | t      |      59376 | 0 bytes   |          0
 2023-04-03 08:37:46.169934+00 | pub_table7    | t      |      59530 | 0 bytes   |          0

Example SQL query to ensure that there is no replication lag between the databases. The diff_bytes is 0 now!

6. We now need to reset sequences in the target DB, which we accomplished with the following script: https://wiki.postgresql.org/wiki/Fixing_Sequences
This ensures that each sequence resumes from the last entry of its table.
Why: The target DB (PG13), which was a clone of the source DB (PG10), has sequence values that are stale as of when the cloning completed, since logical replication does not advance them. We need to reset the sequence numbers so that new writes do not cause any sequence number collisions.
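As a sketch of what that script generates (table1 and its id column are illustrative; the wiki script emits one such statement per sequence):

SELECT setval(pg_get_serial_sequence('table1', 'id'),
              COALESCE((SELECT MAX(id) FROM table1), 1));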

7. Update Route53, changing the DB connection string to refer to the PG13 DB.

8. Verify the DNS update for both the reader & writer Route53 URLs
nslookup my-route53-writer-url.net
It will take a few seconds for the records to switch from the PG10 to the PG13 endpoints.

9. SSH into an application pod and run a script which writes to a dummy table.
Why: If the write succeeds, you know that the PG13 DB successfully processed the write (PG10 would error out, given that it is in read-only mode).

Route53 now points to our PG13 (upgraded) database.

10. Downtime over: turn the circuit breaker off, allowing ingress traffic to the application. You should now see regular traffic flowing and writes happening to PG13. The downtime in our case was just ~7 mins, which was a big win considering the in-place upgrade would have taken ~30 mins.

Circuit breaker disabled. Traffic flowing into the service again

This is what our traffic dashboard looked like during the downtime:

SUCCESS!

The only traffic to the pod was a health check, which is why you can see a single green line despite no traffic to our application. Please note that we also set up rollback replication from PG13 -> PG10 in case of any issues with PG13; however, it is not discussed here in the interest of simplicity.

Closing Notes

Our model of blue/green deployment for the Aurora upgrade was created and executed in April 2023. As of October 2023, AWS supports blue/green deployments for Aurora Postgres natively. This should help bring the downtime down further, given that the replication setup, the traffic cutover to the upgraded DB, and the rollback setup are all handled within their platform. Hopefully this blog helps you understand what happens under the hood of an AWS blue/green deployment!

Acknowledgements

Special thanks to Shailesh Rangani and Suyog Pagare whose Postgres expertise made this upgrade with minimal downtime possible.

If you’re interested in solving problems like this one, please take a look at our careers page!
