Avoiding Chef-Suck with Auto Scaling Groups
I have been instrumental in helping implement a cloud solution for a large client that is interested in hosting their new applications in AWS rather than in their data center. The client was already using Chef for all of their data center deployments, so it was only natural to leverage the same technology in AWS. We shared the same Chef server for both the AWS and data center infrastructure. All of the AWS nodes are inside a VPC with a VPN tunnel back to the data center. This was done so we wouldn’t have the overhead of managing another Chef server just for the AWS infrastructure. Also, all of the software packages reside behind the firewall.
Below are some of the issues we encountered using Chef with the AWS setup:
Failed Chef runs are inevitable. However, the first Chef run is the most important since it registers the node with the Chef server so it can be managed for future Chef runs. If, for whatever reason, the first Chef run fails, you are screwed. Manual intervention will be required to join the node (I’ll give an example below).
In the data center, this is less of an issue. Since all the nodes are effectively “static”, you know when the first Chef run will happen (adding new machines, provisioning a new OS), so it is much easier to monitor for failed Chef runs. In AWS, however, auto scaling kicks in whenever it needs to, so it’s much harder to monitor. Since new nodes are being provisioned all the time, the likelihood of a failed first Chef run is much higher.
I know what you guys are thinking: if the cookbooks are written properly, the Chef run wouldn’t fail. This is mostly true. However, in AWS, IP addresses are recycled. We were running inside a VPC, so the subnet IP range is much smaller and IP collisions happen all the time. When a node is terminated by the Auto Scaling Group (ASG), the Chef server doesn’t know about it. The next time that same IP address appears, the Chef server will reject it. To combat this issue, we had to create additional tooling to make sure stale IP addresses are cleaned out of the Chef server.
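The cleanup tooling mentioned above is not shown in this post, but the core idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the node names, the IP addresses, and the `find_stale_nodes` helper are all hypothetical, and in practice the two inputs would come from the Chef server and the EC2 API.

```python
def find_stale_nodes(chef_nodes, live_instance_ips):
    """Return names of Chef nodes whose IP no longer belongs to a running instance.

    chef_nodes: dict mapping node name -> IP address registered with the Chef server
    live_instance_ips: iterable of IPs of instances currently running in the VPC
    """
    live = set(live_instance_ips)
    return sorted(name for name, ip in chef_nodes.items() if ip not in live)


# Hypothetical data for illustration only.
chef_nodes = {
    "app-10-0-1-15": "10.0.1.15",
    "app-10-0-1-16": "10.0.1.16",
    "app-10-0-1-17": "10.0.1.17",
}
live_ips = ["10.0.1.16"]

stale = find_stale_nodes(chef_nodes, live_ips)
print(stale)  # nodes that are safe to remove from the Chef server
```

Each stale name could then be removed with something like `knife node delete` and `knife client delete`, so a recycled IP address can register cleanly on its first Chef run.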
Chef client isn’t the fastest program in the world, but that’s not really the true issue. To provision new nodes, we run all the recipes and transfer all the necessary packages from our data center to the cloud. Provisioning from top to bottom is very common for us, and the first Chef run can take up to 25 minutes to complete. You can tolerate this in the data center since it only really happens once. In AWS, our cluster is getting killed in the meantime. There might be ways to optimize the provisioning time, but the bottom line is that this will always take more time than we would have liked.
An ASG can be triggered in a variety of ways. For some of the applications, we use average CPU. Since the Chef run happens at a predictable interval (every 30 minutes by default), it uses enough resources to trigger the ASG when the server load is near the threshold. This is an issue because the load goes back down after the Chef run, and the ASG then terminates the newly provisioned node. We could tweak our ASG settings, but this became a hassle since every application has a different ASG profile.
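To make the false trigger concrete, here is a small Python sketch of the arithmetic. All of the numbers are made up for illustration, and the sketch assumes the worst case where every node runs Chef client at the same moment; it is not how the ASG evaluates its alarms internally.

```python
def average_cpu(node_loads):
    """Average CPU percentage across the cluster."""
    return sum(node_loads) / len(node_loads)


def would_scale_out(node_loads, threshold, chef_overhead=0.0):
    """True if the cluster's average CPU exceeds the scale-out threshold.

    chef_overhead is the extra CPU percentage added to every node
    while Chef client is running (an assumed, illustrative figure).
    """
    loaded = [min(100.0, load + chef_overhead) for load in node_loads]
    return average_cpu(loaded) > threshold


# Hypothetical cluster sitting just under a 70% scale-out threshold.
loads = [66.0, 68.0, 67.0]

print(would_scale_out(loads, threshold=70.0))                     # normal traffic only
print(would_scale_out(loads, threshold=70.0, chef_overhead=5.0))  # during a Chef run
```

With these numbers the cluster averages 67% under normal traffic and 72% during the Chef run, so the second check crosses the threshold even though real demand has not changed.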
So, what did we end up doing?
In order to combat the above issues, we ended up creating a simple tool for developers to create (bake) an Amazon Machine Image (AMI) on demand, à la Netflix. It runs Chef-solo behind the scenes, since we already have all the recipes. Since most of the applications are Java, we just created a common recipe. In many cases, the person deploying does not need to be technical or do any Chef work. It’s just one button push away. The important takeaway is that we version our image (the AMI in this case) with the software. In general, we run one service per box and each service can be clustered (horizontal scalability). The name of the image correlates to the service version baked in. In order to release a new version of the service, a new image has to be baked, and then we rotate it into the cluster. If there are no issues with the new service, the old cluster is taken down. The really nice thing about this approach is that we have all the previous versions “frozen” in their exact state. If anyone needs to hunt down a bug, it is very easy to replicate it with the exact production environment. Since everything is in an ASG, all the nodes are launched in a predictable fashion and as fast as possible, because provisioning isn’t happening just in time. We no longer have to worry about failures during provisioning time.
An important thing to note is that when we build a new image, we do not build it from the previous version. It starts from a known clean base, which is usually pretty bare. This way, no cruft accumulates over time.
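The bake-and-rotate flow can be sketched as follows. This is a simplified Python model, not our actual tool: the `Image` class, the `bake` and `rotate` helpers, the `clean-base` name, and the `service-version` naming scheme are all assumptions made for illustration, and no real AMI is created.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """A baked image, versioned together with the service it contains."""
    service: str
    version: str
    base_image: str

    @property
    def name(self):
        # The image name encodes the service version baked into it.
        return f"{self.service}-{self.version}"


def bake(service, version, base_image="clean-base"):
    """Always bake from the clean base, never from a previous release."""
    return Image(service, version, base_image)


def rotate(current, candidate, candidate_healthy):
    """Pick which image's cluster stays up after a rollout attempt."""
    return candidate if candidate_healthy else current


old = bake("orders", "1.4.1")
new = bake("orders", "1.4.2")

serving = rotate(old, new, candidate_healthy=True)
print(serving.name, "built from", serving.base_image)
```

Because images are immutable and named by version, rolling back is just a matter of rotating the previous image back into the cluster.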
WAIT, what about configuration management?
I know what you guys are thinking. How do you manage the configurations? How do you do service discovery? These two things were previously managed by Chef runs, and we no longer run Chef client on our boxes. In order to achieve this, we moved both into the application tier. This was not done solely so we could drop Chef runs in favor of baked AMIs.
Very often we want to change our application configurations as fast as possible without waiting for some predetermined period. We introduced a REST configuration service that the application calls before starting. If for some reason that service is down, the application bootstraps itself using a known configuration, typically injected via user data at launch time.
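The startup logic described above might look roughly like this in Python. It is a minimal sketch under assumptions: the function names, the JSON response format, and the configuration keys are hypothetical, and the failing fetch is simulated so the example runs without a network.

```python
import json
import urllib.request


def fetch_remote_config(url, timeout=2.0):
    """Fetch JSON configuration from a REST endpoint (the URL is up to the caller)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def load_config(fetch, bootstrap_config):
    """Prefer the configuration service; fall back to the bootstrap config."""
    try:
        return fetch()
    except (OSError, ValueError):
        # Network errors and timeouts are OSError subclasses;
        # a malformed JSON body raises ValueError.
        return dict(bootstrap_config)


# Hypothetical bootstrap configuration, e.g. parsed from user data at launch.
bootstrap = {"db_pool_size": 10, "log_level": "INFO"}


def unreachable():
    """Simulates the configuration service being down."""
    raise OSError("config service unreachable")


config = load_config(unreachable, bootstrap)
print(config)
```

In a real service, `fetch` would be something like `lambda: fetch_remote_config(url)`, with a short timeout so that a dead configuration service delays startup by seconds rather than blocking it.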
Service discovery[1] was also moved to the application tier because everything is inside an ASG. Nodes in the cluster are always appearing or disappearing, and one service needs to know where another service is. A REST service[2] is provided for nodes to discover each other. When a node is launched, it registers itself (and unregisters during termination). It then checks in with the discovery service periodically; otherwise, it is marked as unavailable. Before a service starts, it retrieves the necessary info from the discovery service. The discovery service is a critical component, so it runs in a cluster with replication to avoid a single point of failure. Also, a Chef run will not save us here since it doesn’t know when the cluster will be resized.
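The register, check-in, and expiry cycle can be illustrated with a tiny in-memory registry in Python. This is a toy model of the idea only, not the API of our discovery service: the `Registry` class, the TTL value, and the addresses are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
class Registry:
    """Toy in-memory service registry with check-in expiry."""

    def __init__(self, ttl_seconds=90.0):
        self.ttl = ttl_seconds
        self._last_seen = {}  # (service, address) -> time of last check-in

    def register(self, service, address, now):
        self._last_seen[(service, address)] = now

    # A periodic check-in simply refreshes the timestamp.
    heartbeat = register

    def unregister(self, service, address):
        self._last_seen.pop((service, address), None)

    def available(self, service, now):
        """Addresses of the service's nodes that checked in within the TTL."""
        return sorted(
            addr
            for (svc, addr), seen in self._last_seen.items()
            if svc == service and now - seen <= self.ttl
        )


registry = Registry(ttl_seconds=30)
registry.register("orders", "10.0.1.16:8080", now=0)
registry.register("orders", "10.0.1.17:8080", now=0)

# Only the first node keeps checking in; the second has silently died.
registry.heartbeat("orders", "10.0.1.16:8080", now=25)

print(registry.available("orders", now=40))
```

At time 40, only the node that checked in at time 25 is still within the 30-second TTL, so callers are never handed the address of the node that disappeared without unregistering.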
It takes a decent amount of work to get here, but it’s possible to take an incremental chiropractic approach and evolve over time. We evolved the infrastructure while we were running in production. This approach results in your application being more robust, resilient and predictable. Currently this system is handling billions of requests per day, and it has drastically sped up deployment and greatly reduced deployment-related issues.
I am interested in hearing about your experiences. If you use Auto Scaling Groups or Chef, or are experiencing pain in the cloud, I'd love to hear from you: [email protected].
[1] Service discovery is needed when a service (application) depends on another service to perform an operation. For example, you might have an aggregation service that needs to call out to multiple sources (services) in order to complete its task. This is very typical in an SOA environment where a discrete function has been broken up into separate services in order to leverage isolation and horizontal scalability. The simplest way to achieve this is to hard code the dependent services into a configuration file. This approach will not work when the services can appear dynamically.
[2] We are using Eureka from Netflix for our service discovery, but there are a number of other solutions out there (ZooKeeper, HAProxy, etc.), which I will discuss in a future post. Eureka offers a REST interface and a client SDK (if you are using Java) for easy consumption. The bottom line is that this is the most resilient solution we have found so far.
Aaron Feng is the founder of forty9ten LLC. He is a passionate Software Engineer with a special interest in cloud based infrastructure and DevOps. He has organized various tech groups since 2007, but is most well known for Philly Lambda. He is currently organizing DockerATL.
Twitter: @aaronfeng
Email: [email protected]