A story about an unexpected slowdown during a migration from AWS RDS to AWS Aurora, and the InnoDB adaptive hash index parameter

Henrik Peinar
Security and Node.js
Sep 9, 2019


TL;DR at the end.

The parameter

The MySQL 5.7 documentation covers the InnoDB adaptive hash index in detail. Turning this parameter ON lets the storage engine monitor index searches and automatically adapt to the queries you are running: for lookups it sees often, it builds an in-memory hash index on top of the regular B-tree index, so those queries can find their data faster. The downside is the added overhead of monitoring searches and maintaining the hash index, so after enabling it you should make sure your system actually benefits from it.

You can check the parameter value in your MySQL with the following query:

SHOW VARIABLES WHERE variable_name LIKE 'innodb_adaptive_hash%';

To enable it (note that for AWS-managed databases such as RDS and Aurora you need to change this in the parameter group instead):

SET GLOBAL innodb_adaptive_hash_index = 'ON';

The easiest way to check is to benchmark query times before and after enabling it. You can also use InnoDB metric information to determine whether, and how heavily, the adaptive hash index is being used:

SELECT * FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE subsystem = 'adaptive_hash_index';

For example, in our production system the adaptive_hash_searches counter is at 65 618 624 425. The metric has been enabled for a month, so this translates to roughly 25 thousand adaptive hash searches per second.
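
The counters in that view are only collected while they are enabled. Here is a minimal sketch of turning the whole module on in MySQL 5.7, assuming you are allowed to set global variables (on RDS/Aurora you would normally manage this through the parameter group as well):

-- Enable the adaptive hash index counter module (MySQL 5.7).
SET GLOBAL innodb_monitor_enable = 'module_adaptive_hash';

-- Watch the counters of interest.
SELECT name, count, status
FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE name IN ('adaptive_hash_searches', 'adaptive_hash_searches_btree');

-- Rough rate over a known window, e.g. one month:
-- 65 618 624 425 / (30 * 24 * 3600) ≈ 25 000 searches per second.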

The story

We have a web application that is fairly heavy on database reads and writes, and we figured it would benefit from AWS Aurora. So a decision was made to move from a single large AWS RDS instance to an Aurora cluster. The main reason: under high load you can easily scale reads by firing up more read replicas without any downtime, and AWS will automatically rotate connections between them. Side note: to actually make use of this, your application must open new connections as well; already-open connections cannot be rebalanced onto new replica instances. Also note that because AWS uses DNS for the load balancing, there may be a delay before the balancing kicks in, due to DNS caching on the application side.

The application has high-load endpoints that only read from the database, so the plan was to point these endpoints at the read replica. That way the writer (master) would receive less load and handle spikes better.

As one would imagine, these kinds of changes are first tested in the QA environment to make sure everything works as expected. We started by adding a separate database connection pool to the application that uses the read replica endpoint. This pool can only be used for reading, and you need to make sure not to use it in between writes, because there is always a replication delay and the data you just wrote might not yet be available on the read replica (at least in theory).
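
If you do route reads to a replica, it is worth keeping an eye on that delay. Aurora MySQL exposes per-instance lag in information_schema.replica_host_status; here is a minimal sketch, with column names taken from the Aurora documentation as I recall them, so verify them on your own cluster:

-- Lag per instance, in milliseconds; the writer reports
-- session_id = 'MASTER_SESSION_ID'.
SELECT server_id, session_id, replica_lag_in_milliseconds
FROM information_schema.replica_host_status;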

After updating the code, we migrated the QA RDS to Aurora and fired up the environment. We ran some tests and concluded that the system was stable and we were good to go for the production release.

And here comes release day. We had clearance for a maximum of one hour of downtime; the estimated actual downtime was around 10–15 minutes, with the other 40–50 reserved in case something went wrong and we needed to restore from backup and make sure the system was healthy.

We kill the services. Instances stop. The DevOps guys migrate the data from RDS to Aurora. I run a few SQL queries on the new Aurora cluster to see that the latest data is indeed there and we didn't miss anything. Clear. Instances reboot. APIs come back. Traffic continues. App health checks are clear. A quick glance over some of the metric graphs to ensure everything looks good, and then we notice "it".

Outgoing API request times. Note that this also includes the API's heavy lifting, not only database query time.

Our outgoing request duration graph feels… denser. We look closer and indeed, the 5-minute average before the migration was ~720 ms, but now it's at ~1 s. It's a small difference at a glance, but the graph annoyingly suggests that more requests are slow and the peaks are higher than before. Because of the nature of the application, we wrote it off as a side effect of the downtime: there were 10 minutes' worth of data that needed to get out, so the exports were probably just larger and therefore slower. We signed it off as a successful migration, but promised to check back on the graph the next day.

Next morning the graph looked even worse than we had hoped. At peak hours, requests were taking twice as long as before. This was a business-affecting problem, so we started investigating right away. We had also deployed a considerable amount of code changes along with the migration, so we looked into those first. We skimmed over the diffs but came back with nothing: we didn't find anything apparent that could have caused such a slowdown.

So the next possible cause was the database migration itself, but how? The size of the Aurora instance was on par with the RDS we had been using, the MySQL versions were the same, AND we were now using read replicas. If anything, it was supposed to be faster, not slower. So we booted up our old RDS instance, dumped its parameter list with SHOW VARIABLES; and did the same for Aurora, then diffed the two lists to see whether there was a notable difference somewhere. Going through the differences, we finally came to innodb_adaptive_hash_index. It was disabled in the Aurora cluster but enabled in our RDS. I had never heard of this parameter until that day, but it makes a huge difference in a read/write-heavy application.
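
If you want to run a similar comparison yourself, you don't have to eyeball the full SHOW VARIABLES output. Here is a small sketch that pulls just the InnoDB settings into something diff-friendly, using the performance_schema.global_variables table that MySQL 5.7 provides:

-- Run on both instances, save the results and diff them.
SELECT variable_name, variable_value
FROM performance_schema.global_variables
WHERE variable_name LIKE 'innodb%'
ORDER BY variable_name;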

After running some queries, we confirmed that turning this off in the old production RDS indeed made queries slower there too, and we started planning an Aurora parameter group change to turn it on.

To my surprise, the graph looked exactly the same after the parameter group change… slow. So now what? I logged the variables from the Aurora cluster again and innodb_adaptive_hash_index was ON. It took some poking around and a few cups of coffee before we found out that the same parameter on the read replica was still OFF, so queries going there were still slow.
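
The lesson here: the cluster parameter group and the instance your connection actually lands on can disagree, so check on the node itself. A quick sketch of what we could have run over the reader endpoint, assuming the standard MySQL variables (Aurora readers report innodb_read_only as ON):

-- innodb_read_only tells you whether you landed on a replica;
-- the other row shows the adaptive hash index setting on that node.
SHOW VARIABLES WHERE variable_name IN ('innodb_read_only', 'innodb_adaptive_hash_index');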

After some e-mails with AWS support, we finally arrived at the sad fact that innodb_adaptive_hash_index cannot be enabled on Aurora read replicas. It is currently disabled there by design, and there is no ETA for when it will be supported.

But our application was still slow and we had to do something about it. At that point there were only two options: migrate back to RDS, or use the Aurora writer for reads as well. We chose the latter because it had a much smaller impact on the application, and we still hold a small hope that Amazon will support this parameter on the read replicas in the (near) future.

TL;DR: At the time of writing (June 2019), it is not possible to enable innodb_adaptive_hash_index on AWS Aurora read replicas. So if you want to benefit from this parameter, you can only use the writer (master) instance in Aurora.
