Spring Cloud AWS is part of the Spring Cloud project. I wanted to integrate it with the CDK Java project that I implemented last week:
https://slawomirstec.com/blog/2021/04/cdk-rds-vpn

The infrastructure provisioned in this project includes, among others, AWS Aurora in multi-AZ mode and a VPN Client for a secure connection to the private subnet where the DB is deployed.
I wanted to check the current status of the AWS cloud integration in the Spring Cloud project. For that, I've implemented a simple starter project to test out database failover and integration with AWS Secrets Manager. The project can be found on GitHub:
https://github.com/stokilo/aws-spring-aurora-cluster

There are 3 main parts of the Spring application.properties file.
The first is the app name and the cloud.aws.* context setup. That is the standard setup required by the library.
The second part consists of two data sources, one for the writer and one for the reader endpoint. Credentials and the jdbcUrl are fetched directly from AWS Secrets Manager. The reader.rds.com and writer.rds.com URLs are configured in a private hosted zone as part of the infrastructure provisioning.
The remaining configuration is for JPA and the connection pool; in general, it is optional for this setup.
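Putting the three parts together, a hedged sketch of what such an application.properties could look like. The app.datasource.* prefixes, the secret key names (username, password) and the region are my assumptions here, not the project's exact keys; in the real project the jdbcUrl also comes from the secret.

```properties
# 1. App name and cloud.aws.* context setup
spring.application.name=aws-spring-aurora-cluster
cloud.aws.region.static=eu-west-1
cloud.aws.stack.auto=false

# 2. Writer and reader data sources; ${username}/${password} are resolved from the
#    secret loaded by the Secrets Manager config support, the hosts are the
#    private hosted zone CNAMEs
app.datasource.writer.jdbc-url=jdbc:postgresql://writer.rds.com:5432/postgres
app.datasource.writer.username=${username}
app.datasource.writer.password=${password}
app.datasource.reader.jdbc-url=jdbc:postgresql://reader.rds.com:5432/postgres
app.datasource.reader.username=${username}
app.datasource.reader.password=${password}

# 3. Optional JPA and connection pool tuning
spring.jpa.hibernate.ddl-auto=validate
spring.jpa.open-in-view=false
app.datasource.writer.maximum-pool-size=5
app.datasource.reader.maximum-pool-size=5
```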
I've configured Spring Cloud 2.3.10.RELEASE with the Spring Boot starter.
The infrastructure provisioned with CDK includes an RDS cluster with a generated secret named /config/aws-spring-aurora-cluster.
In order to fetch this secret, I'm passing a configuration entry in bootstrap.properties to change the secret prefix from /application to /config.
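A hedged sketch of that change, assuming the aws.secretsmanager.* properties exposed by the 2.x secrets manager config starter (default context and separator left untouched):

```properties
# bootstrap.properties - point the Secrets Manager config support at /config/...
# so that /config/aws-spring-aurora-cluster is resolved for this application name
aws.secretsmanager.prefix=/config
spring.application.name=aws-spring-aurora-cluster
```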
I didn't spend much time on the secret configuration, and I had some problems with it. In order to make it work, you need to include a dependency on spring-cloud-starter-aws-secrets-manager-config in the pom.xml. This is enough for Spring Cloud to make a web service call to AWS and fetch secrets from two scopes: application and global. More details about what the secret name should be can be found in the documentation. My issue with this approach was that, in contrast to the AWS Parameter Store, I was not able to find a way to define more secrets. It worked fine for the database secret, but I still don't know how to use more than two secrets in the application. It looks like Spring expects me to keep all my secrets in a single secret instance. In my case, I provision the stack with CDK, and the database secret is a special construct required by RDS; I can't modify it to include more data, and doing so would probably not be a good idea anyway.
The main feature of Aurora RDS in multi-AZ mode is high availability (HA).
I've configured two data sources in the Spring context, one for the writer and one for the reader. Additionally, I've created two transaction managers associated with these data sources.
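A minimal sketch of such a setup, assuming Hikari data sources bound to the app.datasource.* properties from the sketch above and plain DataSourceTransactionManager beans; the real project uses JPA, so its transaction managers and bean names most likely differ.

```java
import javax.sql.DataSource;

import com.zaxxer.hikari.HikariDataSource;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;
import org.springframework.jdbc.datasource.DataSourceTransactionManager;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class DataSourceConfig {

    // Writer data source, bound to app.datasource.writer.* properties
    @Bean
    @Primary
    @ConfigurationProperties("app.datasource.writer")
    public HikariDataSource writerDataSource() {
        return new HikariDataSource();
    }

    // Reader data source, bound to app.datasource.reader.* properties
    @Bean
    @ConfigurationProperties("app.datasource.reader")
    public HikariDataSource readerDataSource() {
        return new HikariDataSource();
    }

    // Default transaction manager, used by plain @Transactional methods
    @Bean
    @Primary
    public PlatformTransactionManager writerTransactionManager(
            @Qualifier("writerDataSource") DataSource writerDataSource) {
        return new DataSourceTransactionManager(writerDataSource);
    }

    // Transaction manager bound to the reader endpoint
    @Bean("readReplicaTransactionManager")
    public PlatformTransactionManager readReplicaTransactionManager(
            @Qualifier("readerDataSource") DataSource readerDataSource) {
        return new DataSourceTransactionManager(readerDataSource);
    }
}
```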
Here it is important to notice that reader.rds.com and writer.rds.com are private hosted zone DNS CNAME entries. A VPN connection is required to test this setup locally because RDS has public access disabled. Additionally, for the failover scenario, please notice that we depend on Java DNS caching.
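For reference, the JVM's positive DNS cache TTL can be lowered so that CNAME changes are picked up sooner. This is not something the sample project does; it is just an illustration of the knob I'm referring to.

```java
import java.security.Security;

public class DnsCacheSettings {
    public static void main(String[] args) {
        // Cache successful lookups for at most 5 seconds instead of the JVM default,
        // so that reader.rds.com / writer.rds.com re-resolve shortly after a failover.
        // Must be set before the first name lookup performed by the application.
        Security.setProperty("networkaddress.cache.ttl", "5");
    }
}
```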
Unfortunately, I could not configure the Aurora cluster in the tested Spring Cloud version. Cluster configuration is not supported; it will be included in the 3.0.x version.
https://github.com/spring-cloud/spring-cloud-aws/issues/356

Suggested solutions are described in the following blogs:
https://vladmihalcea.com/read-write-read-only-transaction-routing-spring/
https://fable.sh/blog/splitting-read-and-write-operations-in-spring-boot/

These solutions are not implemented in this project. Instead, I've implemented a simple failover handler class that monitors writer endpoint errors; in such cases, the connection pool is evicted. The project configures two data sources, one for the writer and one for the reader endpoint.
Spring Cloud uses an approach of annotating read-only service functions with the @Transactional(readOnly=true) annotation. Because of the mentioned limitations of Spring Cloud, I've decided to implement a different approach. I annotate the read-only service functions that I want to route over the read replica with the value TransactionalOverReadReplica.READ_REPLICA. This routes them via the transaction manager associated with the read replica.
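A hedged sketch of that routing; the constant value and the readReplicaTransactionManager bean name follow my configuration sketch above and are assumptions about the project's actual TransactionalOverReadReplica class.

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

final class TransactionalOverReadReplica {
    // Name of the transaction manager bean bound to the reader data source
    static final String READ_REPLICA = "readReplicaTransactionManager";
    private TransactionalOverReadReplica() {}
}

@Service
public class ReportService {

    // Read-only call routed through the read replica transaction manager
    @Transactional(transactionManager = TransactionalOverReadReplica.READ_REPLICA, readOnly = true)
    public long countReports() {
        // ... run a read-only query here
        return 0L;
    }

    // Regular call, routed through the primary (writer) transaction manager
    @Transactional
    public void createReport(String name) {
        // ... run an insert/update here
    }
}
```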
This configuration results in the distribution of reads and writes over the writer and reader nodes. The picture at the top of the page is an example of this behavior. Testing instructions are included in the README.md on GitHub.
Failover testing can be performed directly from the AWS Console. You can also use the SDK API to run the failover action.
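For example, a hedged sketch of triggering the failover with the AWS SDK for Java v1 (the SDK line that Spring Cloud AWS 2.x pulls in); the cluster identifier is a placeholder.

```java
import com.amazonaws.services.rds.AmazonRDS;
import com.amazonaws.services.rds.AmazonRDSClientBuilder;
import com.amazonaws.services.rds.model.FailoverDBClusterRequest;

public class FailoverTrigger {
    public static void main(String[] args) {
        AmazonRDS rds = AmazonRDSClientBuilder.standard().build();
        // Promotes one of the Aurora readers to writer, same as the console action
        rds.failoverDBCluster(new FailoverDBClusterRequest()
                .withDBClusterIdentifier("aws-spring-aurora-cluster")); // placeholder identifier
    }
}
```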
I've decided to run multiple read/write transactions and execute a failover in between. As a result, I get database connection errors.
As mentioned before, I could not use Spring Cloud for my setup. Spring Cloud supports a failover retry scenario; in my case, I've decided to implement a simple exception handler in the form of a Spring aspect.
I've implemented the catchDbQueryException function to detect the Postgres-specific error with code 25006.
This error is thrown when the former writer becomes a read replica and a pending connection from the pool attempts to execute write queries.
Imagine the following scenario: AZ A hosts the writer instance and AZ B the reader. When you execute a failover, the roles of these DB instances are switched. This can be observed almost immediately after running the failover from the console because this is not a real-world test case; in a real scenario, there would be a delay. Now the reader becomes a writer and the writer a reader. The endpoint DNS names stay the same; only the IPs behind the DNS CNAMEs are updated. This is a problem when the connection pool still has pending connections. When such connections attempt to execute write transactions, the transactions fail with Postgres error 25006 because they now point to the read replica.
I evict all connections from the read/write connection pools when this exception is thrown. In effect, there is a short time window when transactions fail, but the system resumes after 20-30 seconds.
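A minimal sketch of this detect-and-evict idea, not the project's exact PostgresFailoverAspect; the pointcut and the data source bean names are assumptions.

```java
import java.sql.SQLException;

import com.zaxxer.hikari.HikariDataSource;
import org.aspectj.lang.annotation.AfterThrowing;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class PostgresFailoverAspectSketch {

    // SQLSTATE reported by Postgres for "cannot execute ... in a read-only transaction"
    private static final String READ_ONLY_SQL_TRANSACTION = "25006";

    private final HikariDataSource writerDataSource;
    private final HikariDataSource readerDataSource;

    public PostgresFailoverAspectSketch(@Qualifier("writerDataSource") HikariDataSource writerDataSource,
                                        @Qualifier("readerDataSource") HikariDataSource readerDataSource) {
        this.writerDataSource = writerDataSource;
        this.readerDataSource = readerDataSource;
    }

    // Runs whenever a @Transactional method throws; checks whether the cause chain
    // contains the 25006 error described above.
    @AfterThrowing(pointcut = "@annotation(org.springframework.transaction.annotation.Transactional)",
                   throwing = "ex")
    public void catchDbQueryException(Exception ex) {
        Throwable cause = ex;
        while (cause != null) {
            if (cause instanceof SQLException
                    && READ_ONLY_SQL_TRANSACTION.equals(((SQLException) cause).getSQLState())) {
                // Pooled connections still point at the old writer; drop them so new
                // connections resolve the updated CNAME targets
                writerDataSource.getHikariPoolMXBean().softEvictConnections();
                readerDataSource.getHikariPoolMXBean().softEvictConnections();
                return;
            }
            cause = cause.getCause();
        }
    }
}
```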
One side effect of this approach for read transactions is not handled in the project. It is possible that after a failover, the read replica is not used anymore and all queries go to the writer. This is because we still have pending connections in the pool and the reader DNS CNAME is cached by Java. The connection pool manager does not check DNS resolution, so we end up with unused read replica instances.
Why does this work for the writer data source after failover? Because we listen for the error and evict the pool. For the reader, however, DNS caching may still apply to the reader endpoint (reader.rds.com): after the writer pool is evicted, we don't get errors anymore, while the reader pool remains as it was before the failover, so it may still be talking to the writer node.
One solution I can think of is to monitor the reader.rds.com IP and evict the connection pool when a change is detected. Such a task could run, for example, every minute to keep the recovery time as short as possible.
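A rough sketch of such a task, assuming the reader pool is a Hikari data source and that scheduling is enabled with @EnableScheduling; this is not implemented in the project, and the host and bean names are assumptions.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

import com.zaxxer.hikari.HikariDataSource;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ReaderEndpointWatcher {

    private static final String READER_HOST = "reader.rds.com";

    private final HikariDataSource readerDataSource;
    private volatile String lastResolvedIp;

    public ReaderEndpointWatcher(@Qualifier("readerDataSource") HikariDataSource readerDataSource) {
        this.readerDataSource = readerDataSource;
    }

    // Poll the reader CNAME once a minute and evict the reader pool when the IP changes
    @Scheduled(fixedDelay = 60_000)
    public void checkReaderIp() {
        try {
            String currentIp = InetAddress.getByName(READER_HOST).getHostAddress();
            if (lastResolvedIp != null && !lastResolvedIp.equals(currentIp)) {
                // The CNAME now points at a different instance: drop pooled connections
                readerDataSource.getHikariPoolMXBean().softEvictConnections();
            }
            lastResolvedIp = currentIp;
        } catch (UnknownHostException e) {
            // Resolution failed; keep the previous IP and retry on the next run
        }
    }
}
```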
I think it is probably worth investigating a solution that checks AWS database events using the SDK. When the writer node changes, we could do the same connection pool eviction as in the case of an updated IP.
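A hedged sketch of what such a check could look like with the SDK (v1); matching on the message text is an assumption, and the exact wording of failover events should be verified against the RDS documentation.

```java
import com.amazonaws.services.rds.AmazonRDS;
import com.amazonaws.services.rds.AmazonRDSClientBuilder;
import com.amazonaws.services.rds.model.DescribeEventsRequest;
import com.amazonaws.services.rds.model.Event;

public class FailoverEventPoller {
    public static void main(String[] args) {
        AmazonRDS rds = AmazonRDSClientBuilder.standard().build();
        DescribeEventsRequest request = new DescribeEventsRequest()
                .withSourceType("db-cluster")
                .withSourceIdentifier("aws-spring-aurora-cluster") // placeholder identifier
                .withDuration(5); // look back 5 minutes
        for (Event event : rds.describeEvents(request).getEvents()) {
            if (event.getMessage() != null && event.getMessage().toLowerCase().contains("failover")) {
                // A failover-related event was found: this is where the same connection
                // pool eviction as for the 25006 error / IP change could be triggered
                System.out.println("Failover event: " + event.getMessage());
            }
        }
    }
}
```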
I found out about a new, initial release of the AWS PostgreSQL JDBC driver; it supports failover for Aurora clusters.
More details can be found here:
https://github.com/awslabs/aws-postgresql-jdbc

This is a new driver released a week ago. It supports a custom domain name for the JDBC URL, but that didn't work for me after failover; my connections were terminated. However, setting the cluster reader/writer endpoint DNS names worked. I configured two data sources, reader and writer, and failover worked for both in the same way as with my custom PostgresFailoverAspect implementation.
This driver is enabled by default in the sample project. To switch to PostgresFailoverAspect, check its code; additionally, pom.xml and application.properties must be updated.