Puppet and Elastic, Role-Based EC2 Infrastructures

In this post, I'll introduce you to some Amazon EC2 automation concepts and how they can be leveraged to run an elastic, role-based infrastructure. Examples given assume the use of Puppet (Master), but the concept should be portable to other configuration managers, as well as other, conceptually-similar infrastructure providers.

Background

If you're already familiar with EC2 features such as instance management and auto-scaling, feel free to skim or just skip past this.

In Amazon EC2, you can create fleets of machines, give them fixed IP addresses and uniquely-identifying hostnames, point DNS records at them, and use those hostnames to control their configuration with a system like Puppet.

This isn't wrong; after all, it works. However, it incurs manual overhead to create and assign identities to those machines, work which may need to be repeated whenever they're reconfigured or replaced. Worse still if you need to do this in the middle of the night, or in the middle of a crisis, or both!

Compounding the problem in EC2 specifically, some configuration changes (such as adding more RAM) require recreating the machine entirely, whereas in a traditional data centre they would not.

Even if you do all of this via the AWS APIs rather than the web console, you're still micro-managing an explicitly defined set of machines.

An alternative to this is to leverage the hands-off management features and the fluid nature of EC2.

In addition to being able to create machines, EC2 allows you to create configurations which describe a machine, and then tell EC2 to create machines for you. These are called Launch Configurations and Auto Scaling Groups, respectively.

You can use this to describe things like the instance type, how many to create, their operating system, firewall settings, inherited AWS API permissions... just about anything you can configure from the API or web console manually. Typically, you would do this once for each distinct role within your infrastructure.

EC2 Auto Scaling will then create them on your behalf, and if one is ever terminated or fails its automatic health checks EC2 will create another to replace it, ad infinitum. You can also change the size of the group or set conditions under which the size of the group automatically grows or shrinks (based on load factors or message queue lengths, etc.) and the Auto Scaling services will take care of the rest.

The same approach works just as well for one instance as it does for many: have a thing you want to run that requires always (and only) one machine? Set the instance count to 1.
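
As a rough sketch of those two steps with the AWS CLI (every name, AMI and ID below is a placeholder):

# Describe the machines for a hypothetical "appserver" role.
aws autoscaling create-launch-configuration \
  --launch-configuration-name appserver-lc \
  --image-id ami-12345678 \
  --instance-type t2.small \
  --security-groups appserver-sg \
  --user-data file://bootstrap.sh

# Ask EC2 Auto Scaling to keep a given number of them alive.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name appserver-asg \
  --launch-configuration-name appserver-lc \
  --min-size 2 --max-size 4 --desired-capacity 2 \
  --availability-zones eu-west-1a eu-west-1b

For the single-machine case, set all three sizes to 1.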

The caveat here is that EC2 will also create them identically. Their only differences will be things that are inherently random: network addresses, EC2 instance IDs, and so on. For example, if you set a hostname in the machine's bootstrap script to aid with configuration management, every instance will end up with the same hostname (notably, this can be problematic for things like Puppet membership and DNS).

Puppet, EC2 and Roles

Puppet uses SSL certificates to uniquely identify machines. In a typical setup, the machine's hostname is used as the certificate's name. This certificate name (the hostname) is then, within Puppet, mapped to a configuration, or mapped to a role and then indirectly to a configuration.

Taking the above aspects of auto-scaled instances into account, for any given role in your infrastructure, every bit of information about an instance within that role is going to be either identical or random. This includes the instance's hostname, which Puppet relies on.

For identical hostnames: without bypassing some sane security defaults, the Puppet Master won't like multiple machines claiming the same name, so I'll rule that out here. Purely random hostnames are just that: not predictably mappable to a role.

For the sake of this example, I'm also going to rule out adjusting the EC2-assigned hostname. There's a great DNS homing feature of EC2 which I like to make frequent use of. This allows public hostnames such as ec2-75-101-137-243.compute-1.amazonaws.com to be resolved to either the public IP 75.101.137.243 when outside of EC2, or to an internal IP like 10.190.134.5 when inside EC2. Any interference with that has to be done carefully, otherwise it's easy to end up with your instances talking to each other over the public network, at both a monetary and a performance cost. You can of course use DNS CNAMEs to work around this, but keep reading and you may find you don't need to.
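
To illustrate (addresses taken from the example above; exact behaviour depends on your resolver settings):

# From outside EC2, the public hostname resolves to the public IP:
dig +short ec2-75-101-137-243.compute-1.amazonaws.com
# => 75.101.137.243

# From an instance inside EC2, the same name resolves to the internal IP:
dig +short ec2-75-101-137-243.compute-1.amazonaws.com
# => 10.190.134.5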

Given all this, how can we have the Puppet Master decide on a role for a machine?

Tags, Facts and External Node Classifiers

Rather than having it decide, what if we told it specifically? All we need is a little extra metadata on each machine stating its role, which Puppet can then use.

EC2 Tags are perfect for this (e.g. Role: appserver) and they can be added to instances automatically by auto-scalers. On top of that, the EC2 API can be queried programmatically to return a list of Tags for any given instance.
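
As a sketch of both halves (the group name matches the hypothetical one above, and the instance ID is a placeholder):

# Tag the Auto Scaling Group so every instance it launches inherits the tag.
aws autoscaling create-or-update-tags \
  --tags "ResourceId=appserver-asg,ResourceType=auto-scaling-group,Key=Role,Value=appserver,PropagateAtLaunch=true"

# Later, query the tags of any given instance:
aws ec2 describe-tags \
  --filters "Name=resource-id,Values=i-1a2b3c4d" "Name=key,Values=Role"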

These aren't immediately useful to Puppet though; we still need some way to leap from the hostname to the Role tag we've introduced.

Enter External Node Classifiers.

In short, an External Node Classifier is any executable which, when given a node identifier (the certificate name), returns facts about that node and class names which Puppet should apply.
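
For example, the YAML an ENC emits for a node in a hypothetical appserver role might look like this:

---
classes:
  role::appserver:
environment: production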

This means we can write a small script which queries the EC2 API and returns the Role information we tagged it with above. This is configured on the Puppet Master, like so:

puppet config set --section master external_nodes /path/to/script
puppet config set --section master node_terminus exec

So far though, in terms of identifying the machine to query, we only have the hostname. On instances where we haven't directly managed the hostname, it isn't guaranteed to be stable: the hostnames EC2 grants are based on the IP address, and can change if, for example, the machine is assigned an Elastic IP.

The ideal way of identifying a machine within EC2 is by its instance ID.

Thankfully, there's a way we can get Puppet to use this ID in place of the hostname. This is a simple change in Puppet agent configuration. Typically you would do this when the machine comes to life via a userdata script. For example, before running the agent for the first time (thus, before generating the SSL certificate):

export INSTANCE_ID=`facter ec2_instance_id`
puppet config set --section agent certname "${INSTANCE_ID}"
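
Putting it together, a minimal userdata sketch might look like this (the Master's address is an assumption; the agent's first run is what generates the certificate named after the instance ID):

#!/bin/bash
# Runs once at first boot, before the Puppet agent has ever run.
INSTANCE_ID=$(facter ec2_instance_id)

# Identify ourselves to Puppet by instance ID rather than hostname.
puppet config set --section agent certname "${INSTANCE_ID}"
puppet config set --section agent server puppet.example.com

# First run generates and submits the certificate under that name.
puppet agent --test --waitforcert 60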

This will then identify the machine to Puppet as i-#######. The Master will pass this ID to the External Node Classifier, which in turn can query the EC2 API to return data for Puppet to use. Here's a simple (if error-prone) ENC example doing just that:

#!/bin/bash
# External Node Classifier: given a certificate name (an EC2 instance ID),
# look up the instance's "Role" tag and emit the matching role class as YAML.
NODE_ID="$1"
# Assumes the Puppet Master runs in the same region as its nodes;
# strip the trailing zone letter from the availability zone to get the region.
AWS_REGION=$(facter ec2_placement_availability_zone | sed 's/.$//')
NODE_ROLE=$(aws ec2 describe-instances --region "${AWS_REGION}" --instance-ids "${NODE_ID}" \
  | jq -r '.Reservations[0].Instances[0].Tags[] | select(.Key=="Role") | .Value')
echo "---
classes:
  role::${NODE_ROLE}:
environment: production
"

The Puppet Master will then apply the corresponding role class (role::${NODE_ROLE}), should it exist in the Puppet manifests.

This allows Puppet to associate a Role with a Node using only a few configuration changes within Puppet itself, thus avoiding the potential headaches of hostname management within a fluctuating infrastructure.

Using the auto-scaling features from above, newly-introduced machines with the same Role tag will all automatically receive the same treatment.

Note: This assumes the Puppet Master has the right credentials to query the EC2 API, such as by being given an IAM role with rights to perform the ec2:DescribeInstances action used by the script above (or ec2:DescribeTags, if you query tags directly).
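
A rough sketch of attaching such a policy to the Master's IAM role using the CLI (the role and policy names are hypothetical):

aws iam put-role-policy \
  --role-name puppet-master \
  --policy-name allow-describe-instances \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["ec2:DescribeInstances", "ec2:DescribeTags"],
      "Resource": "*"
    }]
  }'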

Identity Crisis

This approach of naively applying the same configuration to every machine within a Role has its downsides, of course. Not every application is OK with every other node in its cluster being exactly the same.

I'll attempt to address some of those here so that you can plan for them in your own adaptation.

Role Change

Tags are not immutable, so it's entirely possible to change the Role tag on an instance to something else. I'd simply advise against this as the results would be unpredictable at best, and likely dramatic, as Puppet attempted to deal with the differences in state.

If you really need to change one machine of role A to role B then you're better off completely terminating one of A and creating a new B to replace it.

Stateful Services

It's the stable, stateful services which allow the stateless application code to be scaled up to the compute power it needs. Unfortunately, most stateful services abhor a fluctuating infrastructure and many even need stable and unique server IDs to properly operate as members of a cluster.

The first thing I would ask myself is: do I really need to run this? If AWS provide an equivalent service, the answer should generally be "No". For instance, unless you really need to, don't run your own PostgreSQL. Use RDS and let AWS manage it.

If you still need to run your own, carefully evaluate how suitable the service is for operating in such a setup. I've successfully run ZooKeeper and Kafka within this setup despite both needing persistence and unique server IDs, but that's because both have some level of automated failover management available to them, and I was able to use the EC2 instance IDs as the unique server IDs (tip: if you're after a decimal ID, you can convert the i-######## hexadecimal number to decimal and use that, so long as non-sequential IDs are OK).
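
A sketch of that conversion (older 8-character instance IDs fit comfortably; the longer modern IDs can exceed a 64-bit integer, so you may need to truncate or hash them first):

# Strip the "i-" prefix and interpret the remainder as a hexadecimal number.
INSTANCE_ID=$(facter ec2_instance_id)   # e.g. i-1a2b3c4d
SERVER_ID=$(printf '%d\n' "0x${INSTANCE_ID#i-}")
echo "${SERVER_ID}"                     # e.g. 439041101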

Also remember that EBS volumes don't necessarily need to come and go with their EC2 instances. A tagged set of N volumes can be managed separately from a cluster of N instances, and as each instance comes to life a script could claim an unused volume. Volumes which aren't deleted on instance termination are simply released, so when a replacement instance comes to life it can reclaim the volume and all of its data.
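
A rough sketch of that claiming step, run from the instance at boot (the Role tag value, device name and so on are assumptions):

# Find an unattached volume tagged for this role in our availability zone...
AZ=$(facter ec2_placement_availability_zone)
AWS_REGION=$(echo "${AZ}" | sed 's/.$//')
VOLUME_ID=$(aws ec2 describe-volumes --region "${AWS_REGION}" \
  --filters "Name=tag:Role,Values=kafka" "Name=status,Values=available" \
            "Name=availability-zone,Values=${AZ}" \
  --query 'Volumes[0].VolumeId' --output text)

# ...and attach it to ourselves.
aws ec2 attach-volume --region "${AWS_REGION}" \
  --volume-id "${VOLUME_ID}" \
  --instance-id "$(facter ec2_instance_id)" \
  --device /dev/xvdf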

Certificate Signing

This is highly specific to Puppet, but normally you must manually sign the certificate of each new machine. If you're operating an infrastructure in which machines are frequently created and destroyed, this is going to become a headache.

Puppet lets you configure Autosigning to take care of this. This can be as simple as automatically signing all certificates which, while risky, might be perfectly acceptable so long as your network is secure and your application not too sensitive.

Similar to External Node Classifiers, you can also specify an executable which can make a call on whether a certificate should be signed based on whatever conditions you wish. At the very least, you'd want to verify that the instance ID is one that belongs to your account.
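
A minimal sketch of such a policy executable (configure it with puppet config set --section master autosign /path/to/script; Puppet passes the certificate name, here the instance ID, as the first argument, and a zero exit status means "sign it"):

#!/bin/bash
# Autosign policy: only sign certificates whose name is an instance ID
# that actually exists in our AWS account.
NODE_ID="$1"
AWS_REGION=$(facter ec2_placement_availability_zone | sed 's/.$//')
if aws ec2 describe-instances --region "${AWS_REGION}" --instance-ids "${NODE_ID}" > /dev/null 2>&1; then
  exit 0   # recognised instance: sign the certificate
else
  exit 1   # unknown: refuse
fi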

Monitoring, Logging, Debugging

We take predictable hostnames for granted when it comes to this: they let you quickly identify (and connect to) problematic or interesting hosts using just a portion of the hostname. In discarding this hostname management, you'll need to find alternatives or risk reduced visibility and productivity.

The possibilities here will depend greatly on your own setup, but things to consider are ensuring each log and metric is tagged with the role and instance ID, and, if necessary, coming up with tooling that allows you to quickly SSH to a machine based on this. Remember: the EC2 API is there to make your life easier; if you find you spend a lot of time in the AWS console, try to wean yourself off it in favour of using the APIs.
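
For example, a small (hypothetical) shell helper to SSH to a machine by its instance ID, using the EC2 API to look up its private address (the region and user are assumptions):

ssh-to-instance() {
  local ip
  ip=$(aws ec2 describe-instances --region eu-west-1 --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text)
  ssh "admin@${ip}"
}

# Usage: ssh-to-instance i-1a2b3c4d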