I've performed a few load tests against my real estate listing application today. The application is deployed on the AWS Fargate container service and connects to a managed RDS instance (db.micro).
I wanted to see how many requests the public endpoint can handle in the cheapest AWS configuration, so I ran the tests with far more requests than I ever expect to handle. The purpose was to find missing configuration adjustments and to monitor IOPS on the RDS side.
I was testing only one endpoint, which performs address autocomplete for listings.
The database was populated with around 380,000 addresses, and the executed query does a full-text search over all entries. I generated the Artillery test data so that all possible input combinations are used as autocomplete input, making sure the database cannot serve repeated queries from its caches during the test. The database is a Postgres RDS micro instance, single AZ, with no read replicas.
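For illustration, a payload file with unique search terms could be generated ahead of time with something like this (a sketch; the three-letter prefix scheme and the file name are my assumptions, and Artillery would then read the terms from this CSV payload file):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Writes every 3-letter combination (26^3 = 17,576 terms) to a CSV file,
// so consecutive autocomplete requests never repeat a search term and the
// database cannot serve repeated queries from its caches.
public class AutocompletePayloadGenerator {

    public static void main(String[] args) throws IOException {
        String alphabet = "abcdefghijklmnopqrstuvwxyz";
        List<String> terms = new ArrayList<>();
        for (char a : alphabet.toCharArray())
            for (char b : alphabet.toCharArray())
                for (char c : alphabet.toCharArray())
                    terms.add("" + a + b + c);
        Files.write(Path.of("autocomplete-terms.csv"), terms);
    }
}
```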
The Fargate container specifications are listed with each test case. I used Fargate Spot capacity for all tests.
Authentication on the Quarkus endpoint is optional. The endpoint handles a GET request that wraps a simple full-text search query and returns a JSON response with at most 10 results.
All these tests exercise the HTTP service layer on the Quarkus side and the database instance that handles the queries. No caching was implemented on the Quarkus side. Additionally, the resource executes a native query, so the Hibernate first-level cache is not involved either.
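For context, a minimal sketch of what such a resource could look like (the path, table and column names, and the 'simple' text-search dictionary are assumptions, not the actual code):

```java
import jakarta.inject.Inject;
import jakarta.persistence.EntityManager;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.QueryParam;
import jakarta.ws.rs.core.MediaType;
import java.util.List;

@Path("/addresses")
public class AddressAutocompleteResource {

    @Inject
    EntityManager em;

    // GET /addresses?q=<term> -- wraps the full-text search in a native
    // query (bypassing the Hibernate first-level cache) and returns at
    // most 10 matches as JSON.
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    @SuppressWarnings("unchecked")
    public List<String> autocomplete(@QueryParam("q") String term) {
        return em.createNativeQuery(
                "SELECT full_address FROM address "
                        + "WHERE search_vector @@ plainto_tsquery('simple', ?1) "
                        + "LIMIT 10")
                .setParameter(1, term)
                .getResultList();
    }
}
```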
I've executed the following Artillery tests (each test is a single load phase: duration is in seconds, and the arrival rate ramps from arrivalRate to rampTo new virtual users per second):
TEST 1: duration: 120 arrivalRate: 50 rampTo: 100
Fargate: cpu: 512, memoryLimitMiB: 1024, desiredCount (number of Fargate instances): 1
Result: http.codes.200: 9002
http.request_rate: 75/sec
p95: 96.6 ms
p99: 133 ms
Summary: All requests were handled without errors; most response times were below 200 ms.
No issues with the application at this point. There is still headroom and room for improvement in this config: this is not a native image yet, so we can improve on that.
TEST 2: duration: 160 arrivalRate: 150 rampTo: 350
Fargate: cpu: 512, memoryLimitMiB: 1024, desiredCount (number of Fargate instances): 1
Result: http.codes.200: 18552
Result: ETIMEDOUT: 11691
http.request_rate: 232/sec
p95: 1686.1 ms
p99: 4231.1 ms
Summary: this was expected; the load was too high for the given config, and nearly 40% of the requests timed out. Note: the container ran uninterrupted, so the timeouts appear to have originated on the database side.
TEST 3: duration: 120 arrivalRate: 50 rampTo: 100
Fargate: cpu: 512, memoryLimitMiB: 1024, desiredCount (number of Fargate instances): 2
Result: http.codes.200: 9015
http.request_rate: 77/sec
p95: 94.6 ms
p99: 308 ms
Summary: Same as TEST 1 but with 2 container instances. No real performance difference compared to the 1-instance config.
TEST 4: duration: 160 arrivalRate: 150 rampTo: 350
Fargate: cpu: 512, memoryLimitMiB: 1024, desiredCount (number of Fargate instances): 2
Result: http.codes.200: 19195
Result: ETIMEDOUT: 10909
http.request_rate: 232/sec
p95: 1939.5 ms
p99: 4147.4 ms
Summary: Same as TEST 2 but with 2 container instances. Still no improvement compared to the previous tests. Both containers ran without interruption; RDS is the source of the timeouts.
TEST 5: duration: 120 arrivalRate: 50 rampTo: 100
Fargate: cpu: 512, memoryLimitMiB: 1024, desiredCount (number of Fargate instances): 2 NATIVE IMAGES
Result: http.codes.200: 9060
http.request_rate: 78/sec
p95: 51.9 ms
p99: 62.2 ms
Summary: Same as TEST 3 but with native Quarkus images. Noticeably better p95 and p99 percentiles. The better memory management of the native image is visible in the container metrics at the end.
TEST 6: duration: 160 arrivalRate: 150 rampTo: 350
Fargate: cpu: 512, memoryLimitMiB: 1024, desiredCount (number of Fargate instances): 2 NATIVE IMAGES
Result: http.codes.200: 19279
Result: ETIMEDOUT: 10382
http.request_rate: 238/sec
p95: 3072.4 ms
p99: 5826.9 ms
Summary: Same as TEST 4 but with native Quarkus images. No improvement in response times (p95 and p99 were actually worse) and almost the same number of timeouts.
Fargate container CPU metrics for all 6 tests below:
Native image tests were performed at the end. You can observe lower CPU usage during peaks.
For all cases, the 100% spikes were caused by container restarts and the initial Quarkus startup.
Fargate container memory metrics for all 6 tests below:
We can observe better memory management for the native image tests, executed at the end.
RDS metrics are below.
CPU utilization peaked at 20% across all tests. The maximum number of pool connections (75) was reached twice, in both cases during the phases ramping to 350 requests per second, which points to the database connections, rather than CPU, as the bottleneck behind the timeouts.
Summary, what to test next:
1. Rerun TEST 4 and TEST 6 on db.medium and compare the number of timeouts. Also monitor and analyze IOPS usage for the read-only operations.
2. Add a new scenario that tests authenticated resources (Cognito) to find out whether this AWS service rate-limits us.
3. Add a test scenario for write operations. It can be a listing-creation scenario with an option to search for the listing.
4. Implement resource-level caching for the search term and execute TEST 5 and TEST 6 again. What is the rate of timeouts in that case, and what is the Fargate container memory usage? (See the caching sketch after this list.)
5. Add a read replica to RDS. What is the rate of timeouts for TEST 6 then?
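Regarding point 4, resource-level caching could be sketched with the quarkus-cache extension roughly like this (the cache name, the service split, and the query details are assumptions; the default in-memory backend is Caffeine):

```java
import io.quarkus.cache.CacheResult;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.persistence.EntityManager;
import java.util.List;

@ApplicationScoped
public class AddressSearchService {

    @Inject
    EntityManager em;

    // Repeated identical search terms are served from the in-memory cache
    // instead of hitting the database again.
    @CacheResult(cacheName = "autocomplete")
    @SuppressWarnings("unchecked")
    public List<String> search(String term) {
        return em.createNativeQuery(
                "SELECT full_address FROM address "
                        + "WHERE search_vector @@ plainto_tsquery('simple', ?1) "
                        + "LIMIT 10")
                .setParameter(1, term)
                .getResultList();
    }
}
```

Note that with fully unique search terms (as in these tests) the hit rate would be near zero, so a realistic rerun should reuse a share of the terms.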
Summary, lessons learned:
In the cheapest config (RDS $9, Fargate Spot $20) you can handle quite a large load on the location-search resource.
This endpoint can be called by unauthenticated users, and it is frequently the entry point to the application, so it should handle a large number of queries. In case of a higher load, I think it will be easy to add an RDS reader instance.
A WAF is mandatory for this setup to protect the application load balancer against script kiddies, since the ALB doesn't support rate limiting at the moment.
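Since the props above (memoryLimitMiB, desiredCount) come from a CDK-defined stack, a rate-based WAF rule could be sketched in CDK Java roughly like this (the limit value, the names, and attaching the web ACL to the ALB via a CfnWebACLAssociation are assumptions, not the actual setup):

```java
import java.util.List;
import software.amazon.awscdk.Stack;
import software.amazon.awscdk.services.wafv2.CfnWebACL;
import software.constructs.Construct;

// Hypothetical stack: a regional WAFv2 web ACL with a single rate-based
// rule that blocks a client IP exceeding the request limit per 5 minutes.
public class WafStack extends Stack {

    public WafStack(Construct scope, String id) {
        super(scope, id);

        CfnWebACL.Builder.create(this, "WebAcl")
                .scope("REGIONAL") // required scope for an ALB
                .defaultAction(CfnWebACL.DefaultActionProperty.builder()
                        .allow(CfnWebACL.AllowActionProperty.builder().build())
                        .build())
                .visibilityConfig(CfnWebACL.VisibilityConfigProperty.builder()
                        .sampledRequestsEnabled(true)
                        .cloudWatchMetricsEnabled(true)
                        .metricName("WebAcl")
                        .build())
                .rules(List.of(CfnWebACL.RuleProperty.builder()
                        .name("PerIpRateLimit")
                        .priority(0)
                        .statement(CfnWebACL.StatementProperty.builder()
                                .rateBasedStatement(CfnWebACL.RateBasedStatementProperty.builder()
                                        .limit(2000) // max requests per IP per 5-minute window
                                        .aggregateKeyType("IP")
                                        .build())
                                .build())
                        .action(CfnWebACL.RuleActionProperty.builder()
                                .block(CfnWebACL.BlockActionProperty.builder().build())
                                .build())
                        .visibilityConfig(CfnWebACL.VisibilityConfigProperty.builder()
                                .sampledRequestsEnabled(true)
                                .cloudWatchMetricsEnabled(true)
                                .metricName("PerIpRateLimit")
                                .build())
                        .build()))
                .build();
    }
}
```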