Monday, March 26, 2018

AWS Fargate POC

This week I had the opportunity to work on a proof of concept (POC) using AWS Fargate to give our most critical service the scalability it needs. That service is the one that calculates the taxes and fiscal information for a commercial operation.

AWS has been able to run containers for quite a while, and we had tested ECS before with the same goal. We found it a bit cumbersome, though, because we had to create auto-scaling groups and launch configurations and manage the scale-out of the EC2 instances ourselves. This felt wrong, as the size of the cluster was controlled in two different places: partly in the ECS service and partly in the EC2 auto-scaling groups.

AWS Fargate solves this by putting all scalability configuration in the ECS service. Now you can manage the services without even thinking about the EC2 instances which back their execution.

Setting up a Fargate cluster is a piece of cake using the wizard provided by AWS. There's a lot of information about that on the internet, so I won't get into the details of creating a Fargate cluster. Instead, I will focus on the benefits and drawbacks we found in using Fargate to provide the infrastructure for our service.

Our tax calculation service has a very heavy initialization cost because it has to load millions of tax rules and build several indexing structures before it can start calculating. The whole process can take up to one minute. For this reason we couldn't start a new task for each calculation request. Instead, we chose to have each container run a Tomcat instance that hosts our tax calculation service for an indefinite amount of time.

Our cluster:

  • Tasks with 5 GB of memory and 2 vCPUs each. Each task was a Tomcat instance running our service that could serve requests for as long as the task was alive.
  • Auto scaling was based on requests per target: ALBRequestCountPerTarget at 100
  • Auto scaling limits: 1-10 tasks
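We created the cluster with the wizard, but the same settings can be written down as data. The sketch below is only an illustration: it builds plain dictionaries shaped like the arguments of the Application Auto Scaling operations register_scalable_target and put_scaling_policy, plus the Fargate task size. The cluster name, service name, policy name and ALB resource label are hypothetical placeholders, not the ones we used.

```python
# Hypothetical identifiers: replace with your own cluster, service and ALB target group.
CLUSTER = "tax-cluster"
SERVICE = "tax-service"
ALB_RESOURCE_LABEL = (
    "app/my-alb/0123456789abcdef/targetgroup/my-targets/0123456789abcdef"
)

# Task size from the list above: 2 vCPUs = 2048 CPU units, 5 GB = 5120 MiB.
TASK_SIZE = {
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "2048",
    "memory": "5120",
}


def scalable_target(cluster, service, min_tasks=1, max_tasks=10):
    """Arguments shaped for register_scalable_target (limits: 1-10 tasks)."""
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_tasks,
        "MaxCapacity": max_tasks,
    }


def request_count_policy(cluster, service, resource_label, target=100.0):
    """Arguments shaped for put_scaling_policy using target tracking."""
    return {
        "PolicyName": "requests-per-target",  # hypothetical name
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
                "ResourceLabel": resource_label,
            },
        },
    }


target_args = scalable_target(CLUSTER, SERVICE)
policy_args = request_count_policy(CLUSTER, SERVICE, ALB_RESOURCE_LABEL)

# With boto3 installed and credentials configured, these could be applied as:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**target_args)
#   client.put_scaling_policy(**policy_args)
```

Keeping the limits and the target value in one place like this reflects what we liked about Fargate: all the scaling configuration belongs to the ECS service.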

The load was generated by 3 t2.xlarge EC2 instances, each using 16 threads to fire requests at our cluster. The requests were taken from a database of approximately 15000 different tax scenarios, read in round-robin fashion.

The test consisted of 278040 scenarios sent to the cluster. Each test thread would send a request and wait for the complete response before sending another request. All tests were started with the cluster running only one task.
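The behaviour of one load-generating instance can be sketched roughly as follows. This is not our actual test tool, just a minimal illustration of the idea: a fixed number of worker threads pull scenarios in round-robin order, and each thread waits for the response before asking for the next scenario. The send function is a hypothetical stand-in for the HTTP call to the service.

```python
import itertools
import threading

_DONE = object()  # sentinel meaning "no more requests to send"


def run_load(scenarios, total_requests, num_threads, send):
    """Send total_requests scenarios from num_threads workers.

    Scenarios are handed out round robin; send(scenario) is expected to
    block until the full response has arrived.
    """
    source = itertools.cycle(scenarios)
    lock = threading.Lock()
    remaining = [total_requests]
    completed = [0] * num_threads  # one slot per worker, no sharing

    def next_scenario():
        # The lock keeps the round-robin order and the counter consistent.
        with lock:
            if remaining[0] == 0:
                return _DONE
            remaining[0] -= 1
            return next(source)

    def worker(index):
        while True:
            scenario = next_scenario()
            if scenario is _DONE:
                return
            send(scenario)  # wait for the response before the next request
            completed[index] += 1

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(completed)


if __name__ == "__main__":
    # Tiny dry run with an in-memory stub instead of a real HTTP request.
    responses = []
    sent = run_load(["s1", "s2", "s3"], total_requests=10, num_threads=16,
                    send=responses.append)
    print(sent, "requests sent")
```

Because every thread blocks on its own request, the number of requests in flight never exceeds the number of threads, which is why throughput depends so directly on how fast the cluster answers.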

We ran 8 tests. The results were:

Worst case: 278040 tax scenarios in 2544 seconds => 109.3 scenarios/s
Best case: 278040 tax scenarios in 491 seconds => 566.3 scenarios/s
Average: 278040 tax scenarios in 945 seconds => 294.2 scenarios/s
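The throughput figures are simply the number of scenarios divided by the elapsed time. As a quick check, the arithmetic can be reproduced with a few lines of Python (the variable names are mine, the numbers are the ones listed above):

```python
TOTAL_SCENARIOS = 278040

# Elapsed time in seconds for each case reported above.
DURATIONS = {"worst": 2544, "best": 491, "average": 945}


def throughput(total, seconds):
    """Scenarios processed per second."""
    return total / seconds


for label, seconds in DURATIONS.items():
    print(f"{label}: {throughput(TOTAL_SCENARIOS, seconds):.1f} scenarios/s")
# worst: 109.3 scenarios/s
# best: 566.3 scenarios/s
# average: 294.2 scenarios/s

# How much faster the best run was compared to the worst one.
speedup = DURATIONS["worst"] / DURATIONS["best"]
print(f"best vs worst: {speedup:.1f}x")
# best vs worst: 5.2x
```

So the best run was roughly five times faster than the worst one, with a cluster that was allowed to grow from 1 to 10 tasks.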

Checking the scalability of the cluster:
  • 3 of these tests ran with only one task the whole time because ECS could not start new tasks due to lack of available capacity (service <service-name> was unable to place a task. Reason: Capacity is unavailable at this time).
  • 2 of these tests got additional tasks, but provisioning took so long that the performance gain was small.
  • 3 of these tests got new tasks very quickly (as desired), which had a large positive impact on performance.

Given these results, my opinion is that using ECS + Fargate to scale a heavily loaded service is feasible. Setting up and managing a cluster is very simple, and the results showed that the infrastructure can be very flexible, adapting to different loads quite quickly.

The fact that you could be left on your own when demand in your availability zone increases, however, is a bit worrying to me. On one occasion no new task could be added to the cluster for about 2 hours even though the load was high. I'm not sure we could tolerate degraded performance for such a long time in production.

For this reason my opinion is that an ECS Fargate cluster is not yet the best choice for a critical service where performance is so important. I got a good impression of Fargate, and I think managing an ECS cluster is now much more intuitive and easier than it was in our first ECS POC. But being left with a cluster that could not scale out under heavy load for about 2 hours rules out ECS + Fargate for our most critical production services for now.