June 17, 2016

HOWTO: CloudFormation and Masterless Puppet on the Baseball Workbench Project

Within days of my successful dissertation defense in February, I started Baseball Workbench, a side project around a self-service tool for advanced baseball analytics, to have some fun and sharpen my development skills.

One area I've put way too much effort into so far is the automated creation and configuration of AWS resources for the project. That process is now completely automated, using a combination of:
  • AWS CloudFormation
  • CloudFormation's support for cloud-init
  • r10k
  • hiera
  • puppet apply (AKA, local Puppet runs, without a Puppet master)
  • Custom "role" and "profile" Puppet classes
  • A custom Puppet module, "superbuilds", for configuring my CI server
  • A custom Puppet module "aws_ec2_facts" for converting EC2 tags into Puppet facts
In this post, I walk through the details of each of these components in turn. Hopefully this combination of implementation choices is interesting to you, and the concepts should translate to similar approaches as well.

AWS CloudFormation

In general, CloudFormation is a hosted service for creating AWS resources and doing what I think of as "AWS-level" configuration of them (tagging, applying IAM roles, applying security groups, etc.). CloudFormation service calls are driven by a JSON file, called a "template", and templates can take in "parameters".

For the Baseball Workbench project, I adapted a few publicly available CloudFormation templates from AWS to create resources like a Virtual Private Cloud (VPC), an Internet Gateway, an ECS Cluster, Security Groups, and others. You can view the full template in my standard-aws repo on GitHub.

The most interesting of these resources for this walkthrough are a couple of EC2 instances:

  • BuildServer - an EC2 instance for running code builds.
  • ProxyServer - an EC2 instance for proxying requests to services
I am also using an Auto Scaling group of Docker hosts, managed through the AWS EC2 Container Service, behind the ProxyServer. To date, I haven't Puppet-ized the setup of these EC2 instances, so I'll omit them from the discussion here. You can still see the full list of shell commands (triggered through cloud-init) in the template above, and I may post an update once these are converted to Puppet steps as well.

My CloudFormation template takes in parameters for Instance Types (one each for BuildServer, ProxyServer, and DockerServer), EC2 keypair, and Base64-encoded Hieradata. I have a couple of small wrapper scripts which generate a Keypair and Base64-encode a local hieradata file prior to calling CloudFormation "create".
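A minimal sketch of such a wrapper follows; the file, stack, and key names here are stand-ins rather than the real scripts, and the aws calls are echoed so the sketch runs without AWS credentials:

```shell
#!/bin/sh
# Hypothetical sketch of the launch wrapper -- file, stack, and key
# names are assumptions, not the actual scripts from the repo.

# Key pair generation (echoed rather than executed):
echo aws ec2 create-key-pair --key-name standard-aws-key

# An example hieradata file to encode (stand-in for the real one):
printf 'classes:\n  - profile::base\n' > custom.yaml

# Base64-encode it into a single line, suitable for a string parameter:
HIERADATA=$(base64 < custom.yaml | tr -d '\n')

# Echoed rather than executed, so the sketch runs without credentials:
echo aws cloudformation create-stack \
  --stack-name standard-aws \
  --template-body file://template.json \
  --parameters "ParameterKey=Hieradata,ParameterValue=$HIERADATA"
```

On the instance side, cloud-init decodes the Hieradata parameter back into a file, as described in the next section.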

CloudFormation's Support for cloud-init

The CloudFormation template, then, creates resources and applies "AWS-related" configurations to them. This is not quite sufficient for complete instance configuration. As considered so far, we can create, tag, and configure the properties of EC2 instances; but we cannot, for example, install and configure Nginx on the ProxyServer instance. We will use Puppet for those instance configuration steps, so the problem is reduced to needing to install and configure Puppet, then "apply" Puppet classes to configure instances. I use AWS EC2's support for cloud-init to perform these steps.

AWS's support for cloud-init allows us to create files, run shell commands, start services, and more. In fact, it could be used in place of more full-featured instance configuration tools like Puppet; but in my experience, cloud-init lacks the finer-grained control of the more advanced tools, and would require scripting most instance configuration steps in bash. That only seems feasible for the most basic instance configurations, such as our use case here: using cloud-init to trigger the installation and execution of Puppet code on instance boot.

In particular, CloudFormation exposes a configuration property for cloud-init metadata when defining EC2 resources, and I use this property to specify that I want my instances to apply their latest relevant Puppet code on boot. I also specify a "UserData" script to the EC2 instance, which explicitly triggers cloud-init on boot.

Consider the configuration of the BuildServer resource as an example:
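(The original embedded template snippet may not render here; the following is a trimmed sketch of its shape, with the GitHub user, AMI, and region as placeholders rather than the real values.)

```json
"BuildServer": {
  "Type": "AWS::EC2::Instance",
  "Metadata": {
    "AWS::CloudFormation::Init": {
      "config": {
        "sources": {
          "/root": "https://github.com/USER/standard-aws/tarball/master"
        },
        "files": {
          "/root/init/datadir/custom.yaml": {
            "content": { "Ref": "Hieradata" },
            "encoding": "base64",
            "mode": "000644"
          }
        },
        "commands": {
          "01_init": {
            "command": "bash /root/init.sh",
            "cwd": "/root/init"
          }
        }
      }
    }
  },
  "Properties": {
    "ImageId": "ami-XXXXXXXX",
    "InstanceType": { "Ref": "BuildInstanceType" },
    "KeyName": { "Ref": "KeyName" },
    "UserData": { "Fn::Base64": { "Fn::Join": ["", [
      "#!/bin/bash\n",
      "yum update -y aws-cfn-bootstrap\n",
      "/opt/aws/bin/cfn-init -v --region us-east-1",
      " --stack ", { "Ref": "AWS::StackName" },
      " --resource BuildServer\n"
    ]]}}
  }
}
```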

A few things related to cloud-init configuration here:

  • "Metadata" is a top-level property of the resource
  • There is a specific Metadata property, AWS::CloudFormation::Init, and within that, a "config" property for specifying the cloud-init metadata directly.
  • In the "UserData" property, I update and trigger the cfn-init script, CloudFormation's helper script that processes this metadata on the instance (installed by default on Amazon Linux base images)
  • The "UserData" text requires hard-coding the region and logical ID (BuildServer in this case) of the CloudFormation resource
Within that config property, I am using cloud-init to do three things:
  • Use the "sources" block to stage the contents of my standard-aws repository from GitHub, expanded into root's home directory /root
  • Use the "files" block to create a file at /root/init/datadir/custom.yaml, whose contents are the decoded form of the Base64-encoded "Hieradata" parameter provided from a local file by my wrapper script at launch
  • Use the "commands" block to trigger a script init.sh that was staged from the standard-aws repository by the sources block
The real guts of the Puppet installation and triggering of Puppet code, then, are in the init.sh script from the standard-aws repo (and a corresponding update.sh for actual triggering of a local puppet run):
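(The embedded gists may not render here; the following is a hedged reconstruction of their shape. Package sources and paths are my guesses from the surrounding description, not the exact scripts.)

```shell
#!/bin/bash
# init.sh -- hypothetical sketch, not the exact script from standard-aws.
# The real script configures the vendor package repositories first.
set -e

# Install the toolchain:
yum install -y git rubygems puppet
gem install hiera hiera-eyaml r10k

# Hand off to the script that performs the actual Puppet run, so the
# run can also be triggered independently (e.g., from cron) later:
bash /root/init/update.sh
```

```shell
#!/bin/bash
# update.sh -- hypothetical sketch; re-runnable at any time.
set -e
cd /root/init

# Pull the latest modules listed in the Puppetfile into ./modules:
r10k puppetfile install

# Masterless run: apply the static site.pp with the static hiera config,
# resolving modules from both the r10k dir and the staged "site" dir:
puppet apply \
  --hiera_config=/root/init/hiera.yaml \
  --modulepath=/root/init/modules:/root/init/site \
  --parser future \
  site.pp
```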

A few things to note from these scripts, many of which are discussed in greater detail in their respective sections below:
  • I install Puppet, Git, Rubygems, and hiera from their respective repositories. The examples in this walkthrough are not yet using the "eyaml" hieradata backend that is also being installed.
  • The running of Puppet is isolated to a separate script that we can call independently (we could also set up a cron job to call this regularly, if desired).
  • r10k and Puppet are called from the working directory /root/init.
  • r10k is called before each Puppet run to obtain the latest modules, using a Puppetfile staged from the standard-aws repo
  • The Puppet local apply command gets its site.pp manifest and hieradata configuration from static files staged from the standard-aws repo
  • Puppet gets its modules from two sources: the "modules" directory populated by r10k, and the "site" directory staged from the standard-aws repo


r10k

r10k is a tool for managing Puppet modules. In the setup for Baseball Workbench, my use of r10k is relatively trivial: I use r10k to check out Puppet code from the Puppet Forge and custom GitHub repositories.

I install r10k from rubygems (in init.sh above) and run the r10k "puppetfile install" command to check out modules and place them into a local "modules" directory, which is its default behavior.

r10k uses the Puppetfile format for specifying Puppet modules. My Puppetfile currently includes a number of community-supported modules for installing and configuring tools, including Jenkins, Packer, Docker, R, Nginx, Consul, and others, as well as two custom modules I will discuss in more detail below: superbuilds and aws_ec2_facts. You can view my full Puppetfile in the standard-aws repository.
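A Puppetfile along these lines (module names, versions, and the GitHub user are illustrative here, not my exact pins):

```ruby
# Puppetfile -- illustrative sketch; see the standard-aws repo for the real list.
forge 'https://forgeapi.puppetlabs.com'

# Community-supported modules from the Puppet Forge:
mod 'rtyler/jenkins'
mod 'garethr/docker'
mod 'jfryman/nginx'
mod 'KyleAnderson/consul'

# Custom modules checked out straight from GitHub:
mod 'superbuilds',
  :git => 'https://github.com/USER/puppet-superbuilds.git'
mod 'aws_ec2_facts',
  :git => 'https://github.com/USER/puppet-aws_ec2_facts.git'
```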

Importantly, I decided not to use r10k to manage hieradata or to create Puppet environments. I elaborate on the hiera setup below; as for environments, I simply have no need for separate environments at the moment.


hiera

Hiera is a tool for providing variables to Puppet modules. Hiera allows Puppet code to avoid "hard-coding" environment-specific values. For the Baseball Workbench, I use hiera variables (called "hieradata") to provide parameters for specific Jenkins "seed jobs" on the BuildServer. I also use hieradata to configure the ProxyServer for specific endpoints.

I install hiera during the init.sh script, and configure Puppet to use a hiera.yaml configuration file (staged from the standard-aws repo) during its "apply" command in update.sh. Here is the current hiera.yaml and site.pp used by the Baseball Workbench project:
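(The embedded files may not render here; their shape is roughly the following, as a sketch consistent with the walkthrough below rather than the exact files.)

```yaml
# hiera.yaml -- sketch
---
:backends:
  - yaml
:yaml:
  :datadir: /root/init/datadir
:hierarchy:
  - custom
  - "hosts/%{::aws_cloudformation_logical_id}"
  - common
```

```puppet
# site.pp -- sketch; the hiera_include call is the important line.
# Treat package names as virtual to avoid duplicate-package warnings:
Package {
  allow_virtual => true,
}

# Use hiera as the node classifier: apply every class listed in the
# concatenated "classes" entries from the hierarchy.
hiera_include('classes')
```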

Walking through the configuration:
  • Hiera is configured to use a "yaml" backend, which is available by default.
  • The yaml backend is pointed to a source directory ("datadir") of /root/init/datadir, which has most of its contents staged from the standard-aws repo.
  • Recall from the cloud-init configuration that an additional file custom.yaml is staged into the datadir, with its contents passed in as a CloudFormation parameter at stack creation time.
  • The hierarchy for lookups of hiera variables is configured to prefer custom.yaml entries over entries from a "host-specific" yaml, which are preferred over entries from a common.yaml.
  • The host-specific yaml files are in a "hosts" subdirectory of datadir, with the name of the host-specific yaml provided by the "aws_cloudformation_logical_id" custom fact, which I elaborate on below.
To provide a more complete example of the hierarchy, the common.yaml file applies a single "base" profile class to every node, whereas the buildserver.yaml and proxyserver.yaml files in the hosts subdirectory provide role classes and other variables which are specific to the configuration of each host. The custom.yaml at the highest level provides values only known at runtime.
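Concretely, that layering might look like this (entries illustrative):

```yaml
# datadir/common.yaml -- applied to every node
classes:
  - profile::base
---
# datadir/hosts/buildserver.yaml -- only merged in for the BuildServer
classes:
  - role::buildserver
```

Because hiera_include performs an array-merge lookup across the hierarchy, the BuildServer ends up with both profile::base and role::buildserver applied.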

I chose to include the custom.yaml level in the hierarchy for a couple of reasons:
  • I want to be able to reuse the standard-aws repo and its CloudFormation stack across any number of side projects (i.e., not just Baseball Workbench)
  • I want to eventually be able to provide secret values at stack creation time, through the Hieradata parameter of my CloudFormation template, rather than committing these values to a repository in advance.
As an alternative to providing the custom.yaml contents through a CloudFormation parameter, I could have chosen to use r10k to check out the datadir contents from a repository (potentially, a repository specific to a side project like Baseball Workbench). As of the time of my implementation in February 2016, however, there were some significant drawbacks to this approach that led me in my current direction:
  • The yaml backend for hiera only supported specification of a single datadir, making it tricky for r10k to check out yaml files from multiple locations (e.g., common files from the standard-aws repo, and custom files from a Baseball Workbench repo).
  • Use of r10k to stage hieradata appears to require use of environments, and I have no need for environments as far as Puppet environments are typically concerned.
Another important point in my use of hiera is that I am using hiera as a "node classifier" for Puppet. This is accomplished by adding the hiera_include("classes") line in the site.pp manifest being applied by the Puppet command. This causes Puppet to look up a special hiera value "classes" and use the contents of that variable to determine which Puppet classes to apply to each EC2 instance.

The "classes" entry is a list, and Puppet first concatenates all "classes" entries from the hierarchy, then attempts to apply a class corresponding to each entry, assuming that every entry in the list is the name of a Puppet class. Importantly, because we have host-specific hieradata in our hierarchy, the classes applied to a given instance can vary, because the concatenated "classes" list comes from common.yaml, custom.yaml, and a host-specific yaml. You can read more about the use of hiera as a node classifier here.

Alternatively, I could have chosen (and in the long run, may indeed choose) to not use hiera as a node classifier, and instead to use host-specific manifests when making the "puppet apply" call. This would not have much effect overall on my use of hieradata for custom variables. Switching to this model would be a matter of committing host-specific Puppet manifests instead of host-specific classes entries in the host-specific yaml files. This alternative would also have some limitations - classes could only come from the Puppet manifest, and could not be concatenated from multiple sources - but would also enforce use of the roles/profiles pattern discussed below.

puppet apply

We've finally made it to another star of the show: local ("Masterless") Puppet runs with puppet apply!

Up to this point we have:
  • Installed Puppet
  • Checked out modules via r10k
  • Staged environment and host-specific configuration files for hiera, including the list of classes to apply to each node
It's now time to actually run Puppet on our instances. This happens in the update.sh script from above. Let's review the call to "puppet apply" in detail, which specifies:
  • A very generic site.pp manifest is being applied. This can be generic because we are using hiera as a node classifier.
  • A generic hiera.yaml, discussed in detail above, is also being used as the hiera configuration.
  • The Puppet modulepath, which is a list of directories where Puppet modules will be found, is set to include the "modules" directory populated by r10k and the "site" directory staged from the standard-aws repo. The "site" directory contains custom role and profile Puppet classes which I elaborate on below.
  • Use of the "future parser" option for Puppet, which I found necessary for some Puppet features I wanted to use within classes.
Referring back to the site.pp shown when discussing hieradata, I found the "virtual packages" block necessary to avoid warnings for the way some of my Puppet modules were loaded. The only significant content of the site.pp is the "hiera_include" call (discussed above), which sets up use of the hiera variable "classes" for the application of Puppet classes.

Role and Profile Classes

The idea of using "roles" and "profiles" is that they provide two layers of abstraction for organizing Puppet classes. In terms of implementation, they are simply Puppet classes. A role is specific to the type of instance being configured (e.g., BuildServer or ProxyServer), and refers to one or more profile classes. A profile is finer-grained, performing the configuration of a common component (e.g., NTP, user accounts, etc.) and refers more directly to resources from Puppet modules.

For the Baseball Workbench Project, I currently have the following roles and profiles (you can browse them in detail on Github):

  • Role proxyserver
    • Profile proxy
  • Role buildserver
    • Profile builds
    • Profile consul

Within each profile, I include the Puppet resources needed for configuration of the instance. The "proxy" profile configures Nginx and Nginx Template, the "builds" profile configures Jenkins and build tools (including a reference to the "superbuilds" module discussed below), and so on. There is also a profile "base" which is applied through the common.yaml "classes" hiera variable to every instance. This profile configures NTP and the use of the aws_ec2_facts puppet module.
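In code, the pattern is just two thin layers of classes. A sketch (paths and class bodies are abbreviated guesses, not the repo's exact code):

```puppet
# site/role/manifests/buildserver.pp -- a role is just a bundle of profiles.
class role::buildserver {
  include profile::base
  include profile::builds
  include profile::consul
}

# site/profile/manifests/builds.pp -- a profile wires concrete modules together.
class profile::builds {
  include superbuilds   # the custom CI module discussed below
}
```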

In the long run, I question the value of including everything in Profile classes. The real value of Profile classes seems to be shared configuration components - otherwise, we are only adding a layer of abstraction that we then have to debug and maintain through. In the future, I plan to refactor my roles and profiles setup to improve on this.

The Superbuilds Module

The vast majority of instance configuration I currently need for the BuildServer and ProxyServer was provided by community-supported Puppet modules from the Puppet Forge. In a couple of cases, I found it necessary to develop my own Puppet modules beneath the role and profile layers specific to the standard-aws project.

The first such example was a Puppet module I called "superbuilds", which you can view on Github. Inside this module, I install a number of tools using existing Puppet modules: jenkins, docker, packer, R, and NodeJS in particular.

In the long run, the Superbuilds module may be more appropriate as one or more Profile-level classes referenced from a Role class. The latest version of Jenkins (Jenkins 2!) also simplifies the configuration of Jenkins, which was a major driver for the existence of this separate module in the first place.

The aws_ec2_facts Module

The second Puppet module I developed for the Baseball Workbench was a quite simple one related to the creation of facts for Puppet and Hiera. The "facter" tool, installed as part of the Puppet 3 installation on each EC2 instance in my stack, is responsible for maintaining facts for Puppet and Hiera. I can run "facter -p" with sudo on my instances to see the list of facts that facter obtains by default. Unfortunately, there are no facts related to EC2 tags.

AWS supports "tags" on EC2 instances, as well as on most created resources across AWS services, and CloudFormation applies several tags automatically. As discussed above, I have referenced at least one such tag in my hiera hierarchy - the logical ID of the EC2 instance in the CloudFormation stack. Making this available to Puppet and Hiera requires writing a bit of Ruby code to define custom facts for facter. Below is the custom code in my current aws_ec2_facts module:
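(The embedded gist may not render here; the shape of the code is roughly the following. The exact commands and fact list differ in the real module.)

```ruby
# lib/facter/facts.rb -- sketch of the aws_ec2_facts custom facts.
# Each fact shells out and uses the command's stdout as the fact value.
require 'facter'

Facter.add(:aws_region) do
  setcode do
    # Instance metadata gives the AZ; strip the trailing letter for the region.
    az = Facter::Core::Execution.exec(
      'curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone')
    az ? az.sub(/[a-z]$/, '') : nil
  end
end

Facter.add(:aws_cloudformation_logical_id) do
  setcode do
    instance_id = Facter::Core::Execution.exec(
      'curl -s http://169.254.169.254/latest/meta-data/instance-id')
    # The AWS CLI can read the instance's own tags, including the
    # aws:cloudformation:logical-id tag applied by CloudFormation.
    Facter::Core::Execution.exec(
      "aws ec2 describe-tags --region #{Facter.value(:aws_region)} " \
      "--filters Name=resource-id,Values=#{instance_id} " \
      "Name=key,Values=aws:cloudformation:logical-id " \
      "--query 'Tags[0].Value' --output text")
  end
end
```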

Let's review some details of this code, which lives in a file facts.rb in the "lib/facter" subdirectory of the aws_ec2_facts module.

  • At a high level, the Facter object has methods to use for adding fact values.
  • For each fact, I am constructing a command-line call for which the standard output will be the value of the fact. I am using Facter::Core::Execution.exec to execute the command-line call.
  • I use a combination of curl calls to AWS "instance metadata" and calls to the more powerful AWS CLI to get the values of facts
  • The complete list of custom facts in the current implementation is:
    • AWS availability zone
    • AWS region
    • CloudFormation logical ID
    • CloudFormation stack name
    • Private IPs of all EC2 instances in the same stack
I use my common.yaml "classes" hiera value to add this Puppet class to every EC2 instance in my CloudFormation stack. I then have all of the facts above available for reference from Puppet code and Hiera configurations (as well as the output of facter -p).


Hopefully this has been a helpful (if lengthy) overview of my particular implementation of masterless Puppet on an AWS CloudFormation stack. I have inserted quite a bit of commentary on design decisions and potential changes, and I welcome any feedback you may have.

I hope to follow up this particular post with some additional details from the Baseball Workbench stack. If you find any of this interesting, I invite you to "Watch" the various repos (especially standard-aws and the Puppet modules) on Github.

This post also requires a Hat Tip to @tokynet, who taught this "Java guy" 99% of what I know about Puppet. He also presented on the use of AWS tags within hiera hierarchy at the 2015 Spring Puppet Camp in DC.
