Saturday, August 25, 2018

Infrastructure as code - troposphere

Much is said about infrastructure as code but, in my experience, very few (if any) companies actually do it. Personally, I've never seen one. Companies work in the cloud the same way they used to work with physical infrastructure: starting machines, defining subnets and configuring load balancers manually. Then they write a bunch of shell scripts on these machines and claim their infrastructure is automated.

We were no different and followed exactly that recipe. There's nothing wrong with that, and it's a very efficient strategy when one has to manage just a few products running on a few dozen machines. But things start to get out of control when the number of resources being managed grows past a certain point (for us, that point was around 10 products running in 3 AWS regions). That was when we decided to embrace infrastructure as code.

The first difficulty I faced was getting used to the CloudFormation API. I had interacted with AWS only through the web console and its fantastic wizards, so I had never spent much time learning all the attributes of the resources being created and how they applied to different scenarios. With the CloudFormation API there are no wizards, so knowing the resources' attributes and how they are supposed to be used became necessary. It was not a big pain, for sure, but it took me some time to get used to the AWS documentation (very good, by the way!) and become reasonably fluent building my stack templates.

I was then writing stack templates for our applications, parameterizing everything that could change and using CloudFormation functions, but it still didn't feel like code. It was more like writing a report in LibreOffice Writer: I could add formatting, use some functions, take some input parameters, but in the end it was just text (in my case, YAML!).

That, per se, was not a problem. The way I feel about something is not relevant if it's the best way to solve the problem. But as the stacks grew, the templates were starting to get extremely confusing and to accumulate a lot of duplicate "code". As for a template being difficult to understand, one can solve that simply by breaking a large template into smaller ones, since it's possible to reference templates within templates. But that would not solve the "code" duplication issue. Enter troposphere!

Troposphere is a Python library that allows CloudFormation templates to be created programmatically. It really makes infrastructure as **code**. Real code. Python. One can encapsulate resource creation in functions, avoiding code duplication. One can organize resource creation hierarchically, making it easier to understand and maintain. And one can use any Python code to extend what is possible to do with CloudFormation templates. In the end, one just has to run the Python code to produce a plain JSON template that can be used in AWS CloudFormation.
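
To illustrate, here is a minimal sketch of the idea (the resource names and values are just illustrative, not a real stack of ours):

 # Requires: pip install troposphere
 from troposphere import Parameter, Ref, Template
 from troposphere.ec2 import Instance

 def app_server(name, instance_type, image_id):
     # Encapsulating resource creation in a function: every stack that
     # needs an application server reuses this instead of copy-pasting.
     return Instance(name, InstanceType=instance_type, ImageId=image_id)

 t = Template()
 image = t.add_parameter(Parameter("ImageId", Type="String"))

 # Plain Python loops and functions replace duplicated template "code".
 for i, size in enumerate(["t2.micro", "t2.small"]):
     t.add_resource(app_server("AppServer%d" % i, size, Ref(image)))

 print(t.to_json())

Running the script prints a plain JSON template, ready to be fed to CloudFormation.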

My intention here is not to pitch troposphere. I'm not connected with the project in any way and I gain nothing by making it more popular. I'm just suggesting that anyone who wants to treat infrastructure as code should really take a look at it, as it makes, in my personal opinion, CloudFormation templating much more powerful, easier and more maintainable.

Troposphere:

https://github.com/cloudtools/troposphere

PS: there are other tools like troposphere out there. I haven't evaluated them. If you like some other tool, please leave me a comment.

Thursday, June 7, 2018

MOTD on AWS linux instances

MOTD (message of the day) is the greeting message that appears on your screen every time you log into a Linux system. On an AWS Linux instance it looks like:

Last login: Wed Jun  6 18:02:10 2018 from 000.000.000.000

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2015.03-release-notes/

8 package(s) needed for security, out of 35 available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2018.03 is available.

This message lives in the file /etc/motd (which is actually a symlink to /var/lib/update-motd/motd), but that's not the right place to edit if you want to change it. The message is dynamically generated by running /usr/sbin/update-motd from a cron job (/etc/cron.d/update-motd), so if you edit /etc/motd directly it will be overwritten the next time the job executes.

To change the MOTD on your AWS Linux instance, simply add a new script to /etc/update-motd.d/. The scripts are executed in alphabetical order (hence the numeric prefix, which makes it easy to set the order) and the final MOTD is the concatenation of the output of all the scripts.
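
For example, here is a minimal sketch of such a script; the file name and messages are just illustrative, and this assumes update-motd simply runs any executable file in that directory:

 #!/usr/bin/env python
 # Save as /etc/update-motd.d/40-greeting and make it executable
 # (chmod +x). Whatever it prints to stdout becomes part of the MOTD.
 import socket

 print("Welcome to %s!" % socket.gethostname())
 print("This instance is managed by the infrastructure team.")

After adding the script you can run sudo /usr/sbin/update-motd to regenerate /etc/motd immediately instead of waiting for the cron job.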

Wednesday, April 11, 2018

One more case for guard clauses

Another case where guard clauses make the code more readable is when you have to verify some parameter state before applying your logic.

Isolating all the checks in guard clauses right at the beginning of your function makes the logic much clearer.

Example

Using nested conditionals:

 if (file != null && file.isAFile()){
   if (user != null && user.isLogged()){
     if (user.isAdmin()){
       if (page != null && page.isAnInteger() && page > 0){
         text = file.read();
         return getPageFromText(text, page);
       } else {
         return "Invalid page.";
       }
     } else {
       return "User does not have sufficient privileges.";
     }
   } else {
     return "Invalid user.";
   }
 } else {
   return "Invalid file.";
 }

Using guard clauses:

 if (file == null || !file.isAFile()){
   return "Invalid file.";
 }
 if (user == null || !user.isLogged()){
   return "Invalid user.";
 }
 if (!user.isAdmin()){
   return "User does not have sufficient privileges.";
 }
 if (page == null || !page.isAnInteger() || page <= 0){
   return "Invalid page.";
 }

 text = file.read();
 return getPageFromText(text, page);

Guard clauses and single exit point

Developers are often told to have a single exit point in their functions, but in my opinion this just makes things more difficult than they have to be.

Instead of having a single exit point, I prefer to use guard clauses in most of the functions I write, resorting to the single exit point paradigm only when dealing with resources that have to be manually closed (a single exit point makes resource handling easier and less error-prone if you don't have something like a finally block to help you).

As a matter of fact, I prefer to use what I call the "return early" paradigm, that is, returning as soon as the job is done (guard clauses are a specific case of this). That avoids nested conditionals and makes the code simpler and easier to read.

Example

Using nested conditional to have a single exit point:

 String ret = null;
 if (isGreen && isRound && smellsCitric){
   ret = getLime();
 } else {
   if (isGreen && isRound){
     ret = getGreenBall();
   } else {
     if (isGreen && smellsCitric){
       ret = getCitricSoap();
     } else {
       if (isRound && smellsCitric){
         ret = getOrange();
       } else {
         if (isGreen){
           ret = getGreenColor();
         } else {
           ret = getUnknown();
         }
       }
     }
   }
 }

 return ret;

Using the return early paradigm:

 if (isGreen && isRound && smellsCitric){
   return getLime();
 }
 if (isGreen && isRound){
   return getGreenBall();
 }
 if (isGreen && smellsCitric){
   return getCitricSoap();
 }
 if (isRound && smellsCitric){
   return getOrange();
 }
 if (isGreen){
   return getGreenColor();
 } 

 return getUnknown();

Monday, March 26, 2018

AWS Fargate POC

This week I had the opportunity to work on a POC using AWS Fargate to provide the desired scalability to our most critical service: the one that calculates taxes and fiscal information for a commercial operation.

AWS has had the ability to run containers for quite a while, and we had tested ECS before with the same goal, but we found it a bit cumbersome: we had to create auto-scaling groups and launch configurations and manage the scale-out of EC2 instances ourselves. This felt wrong somehow, as there were two different places controlling the size of the cluster: part of it was in the ECS service and part of it was in the EC2 auto-scaling groups.

AWS Fargate solves this by putting all the scalability configuration in the ECS service. Now you can manage the services without even thinking about the EC2 instances that back their execution.

Setting up a Fargate cluster is a piece of cake using the wizard provided by AWS. There's a lot of information about that on the internet, so I won't get into the details of creating a Fargate cluster. Instead, I will focus on the benefits and drawbacks we found in using Fargate to provide the infrastructure for our service.

Our tax calculation service has a very heavy initialization cost because it has to load millions of tax rules and build several indexing structures before it can start calculating. The whole process can take up to one minute. For this reason we couldn't instantiate a task for each calculation request, so we chose to have each container run a Tomcat for an indefinite amount of time, serving our tax calculation service.

Our cluster:

  • Tasks with 5GB of memory and 2 vCPU each. Each task was a Tomcat running our service that could serve requests for as long as the task was alive.
  • Auto scaling was based on requests per target: ALBRequestCountPerTarget at 100 (see the sketch after this list)
  • Auto scaling limits: 1-10 tasks
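
This is roughly how such a target-tracking policy can be registered through the Application Auto Scaling API with boto3; the cluster, service and load balancer names below are hypothetical placeholders, not our actual setup:

 import boto3

 # Hypothetical names; use your own cluster/service and ALB target group.
 resource_id = "service/tax-cluster/tax-service"

 aas = boto3.client("application-autoscaling")

 # Auto scaling limits: 1-10 tasks.
 aas.register_scalable_target(
     ServiceNamespace="ecs",
     ResourceId=resource_id,
     ScalableDimension="ecs:service:DesiredCount",
     MinCapacity=1,
     MaxCapacity=10,
 )

 # Scale on requests per target: ALBRequestCountPerTarget at 100.
 aas.put_scaling_policy(
     PolicyName="requests-per-target",
     ServiceNamespace="ecs",
     ResourceId=resource_id,
     ScalableDimension="ecs:service:DesiredCount",
     PolicyType="TargetTrackingScaling",
     TargetTrackingScalingPolicyConfiguration={
         "TargetValue": 100.0,
         "PredefinedMetricSpecification": {
             "PredefinedMetricType": "ALBRequestCountPerTarget",
             # Format: app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>
             "ResourceLabel": "app/my-alb/1234/targetgroup/my-tg/5678",
         },
     },
 )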

The load was generated by 3 t2.xlarge EC2 instances, each using 16 threads to fire requests at our cluster. The requests were taken from a database of approximately 15000 different tax scenarios, read in a round-robin fashion.

The test consisted of 278040 scenarios sent to the cluster. Each test thread would send a request and wait for the complete response before sending the next one. All tests were started with the cluster running only one task.
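
For reference, a minimal sketch of what each load-generating instance did; the endpoint, payload and per-thread request count below are illustrative placeholders, not our actual test harness:

 import itertools
 import urllib.request
 from concurrent.futures import ThreadPoolExecutor

 ENDPOINT = "http://tax-service.example.com/calculate"  # hypothetical URL

 def load_scenarios():
     # Hypothetical stand-in: the real scenarios came from a database
     # of ~15000 distinct tax calculations.
     return ["<scenario %d>" % i for i in range(15000)]

 def worker(scenarios, n_requests):
     # One request at a time: wait for the complete response before
     # firing the next request.
     for body in itertools.islice(scenarios, n_requests):
         req = urllib.request.Request(ENDPOINT, data=body.encode(), method="POST")
         with urllib.request.urlopen(req) as resp:
             resp.read()

 scenarios = load_scenarios()
 with ThreadPoolExecutor(max_workers=16) as pool:
     for _ in range(16):
         # itertools.cycle gives the round-robin reading of the scenario set
         pool.submit(worker, itertools.cycle(scenarios), 5000)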

We ran 8 tests. The results were:

Worst case: 278040 tax scenarios in 2544 seconds => 109.3 scenarios/s
Best case: 278040 tax scenarios in 491 seconds => 566.3 scenarios/s
Average: 278040 tax scenarios in 945 seconds => 294.2 scenarios/s

Checking the scalability of the cluster:
  • 3 of these tests ran with only one task the whole time because ECS could not provide new task instances due to lack of available capacity (service <service-name> was unable to place a task. Reason: Capacity is unavailable at this time).
  • 2 of these tests had more tasks instantiated, but it took so long to provision the resources that the impact on performance was small.
  • 3 of these tests had new tasks instantiated very quickly (as desired), and that had a great impact on performance.

Given the results we had, my opinion is that using ECS + Fargate to provide scalability to a heavy-load service is feasible. Setting up a cluster and managing it is very simple, and our results showed that the infrastructure can be very flexible, adapting to different loads quite quickly.

The fact that you can be left on your own when demand in your availability zone increases, however, is a bit worrying to me. On one occasion no new task could be added to the cluster for about 2 hours even though the load was high. I'm not sure we could tolerate degraded performance for such a long time in production.

For this reason my opinion is that an ECS Fargate cluster is not yet the best choice for a critical service where performance is so important. I got a good impression of Fargate, and I think managing an ECS cluster is now much more intuitive and easier than it was in our first ECS POC. But being left with a cluster that was unable to scale out under heavy load for about 2 hours really kills any possibility of using ECS + Fargate in production for our most critical services, for now.