Avvo is a growing company. Like other growing companies, we started with a small server footprint. As that server environment
grew, we found that installing the OS (Ubuntu) by hand was slow and took a lot of time and effort.
So everyone can really understand how painful our server provisioning process was, let me describe it to you.
First, we booted the host to an installation disk (iso image). That part in itself is difficult enough if you don't have a server provisioning stack, and your servers
are in a remote datacenter. We typically used onboard management cards, or a network attached KVM in the datacenter to mount the iso as virtual media and get the process
started. After the host booted the ISO, we manually entered all of the details such as partition layout, network info, packages to install, etc. Then we waited for the
OS install to complete, at which point we logged in via ssh and ran chef on it to do the remaining configuration and install more software. In addition to that install
process, we also had to manually assign an IP, then add that IP to DNS (bind).
There are a lot of options out there to solve this problem. There are full-blown server provisioning stacks that can handle much of the work. They're designed to automate
the server provisioning process, make it more consistent, more reliable, etc. We evaluated many of them, including MAAS, Cobbler, OpenCrowbar, and Foreman. In general,
we didn't actually dislike any of them, but none of them fit us quite right, for various reasons.
My Little Lie
Now that I've described the problem, let me get something off my chest: I've lied to you. We don't actually fully reprovision a server in a minute.
It's currently 88 seconds. But after a hard drive wipe, our OS boots up and is ssh-able in 37 seconds. The remaining 51 seconds is disk formatting,
configuration, and software installation. By the time the host is ssh-able it has received an IP, and both forward and reverse DNS entries have been created
automatically. But since Wikipedia defines software installation as part of the server provisioning process (and really, it is) I guess I lied to you.
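To make the "forward and reverse DNS" part concrete, here is a minimal Python sketch of the record pair a new host needs. The hostname, IP, TTL, and the `dns_records` helper are all hypothetical, and this is not necessarily how the automation described here does it; the output is in the zone-file style that bind uses.

```python
import ipaddress


def dns_records(fqdn, ip, ttl=300):
    """Build the forward (A/AAAA) and reverse (PTR) records for one host.

    Hypothetical helper for illustration; the TTL of 300 is an arbitrary choice.
    """
    addr = ipaddress.ip_address(ip)
    rtype = "A" if addr.version == 4 else "AAAA"
    # Forward record: name -> address
    forward = f"{fqdn}. {ttl} IN {rtype} {addr}"
    # Reverse record: the standard library derives the in-addr.arpa / ip6.arpa name
    reverse = f"{addr.reverse_pointer}. {ttl} IN PTR {fqdn}."
    return forward, reverse


forward, reverse = dns_records("node01.example.com", "10.1.2.3")
print(forward)
print(reverse)
```

Both records have to be created together, otherwise a lookup by name and a lookup by address will disagree about the host.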
To be fair, I'm certain that if we moved all of the software to a local repository instead of downloading it from the internet, we could get the entire process down
to less than a minute. Shaving those extra 28 seconds off of the time didn't seem as important after we reduced server provisioning from a multi-hour manual process
to an 88-second automated process. Think about it: in the time it takes you to get through a standard TV commercial break, your server could be reformatted and
running something completely new.
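If you want to measure a "time until ssh-able" number for your own hosts, one simple approach is to start a clock when the reboot is triggered and poll the SSH port until it accepts a connection. The sketch below assumes that approach; it is not necessarily how the 37-second figure was measured, and the `wait_for_port` helper and its defaults are made up for illustration.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=0.5):
    """Poll a TCP port until it accepts a connection; return elapsed seconds.

    An open port only shows that something is listening. It does not prove
    that a login would succeed, so treat the result as a lower bound.
    """
    start = time.monotonic()
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Connection refused, unreachable, or timed out: the host is not up yet.
            if time.monotonic() - start >= timeout:
                raise TimeoutError(f"{host}:{port} not reachable after {timeout}s")
            time.sleep(interval)
```

Calling `wait_for_port("node01.example.com")` right after kicking off a reprovision would return the number of seconds until port 22 opened.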
What's funny about this whole thing? Super fast server provisioning wasn't even our end goal. Our main goal was just to build out a cluster in an automated and maintainable way.
Being able to re-provision a host rapidly is just a nice side effect of the design we chose.
The Devil in the Details
TL;DR I'll explain the little details that make this so fast:
- I benchmarked on a VM. Some of our baremetal hosts take more than 88 seconds just to POST and initialize firmware. VMs conveniently skip that mess.
- We're using a cloud-config file to do most of the configuration. Chef, Ansible, Salt, Puppet, et al. are great, but for the simple stuff, cloud-init is faster.
- Our software installation process actually boils down to a simple 'docker run' from the cloud-config file, and our container orchestration system.
- Our network is reasonably fast (dual 10Gbit to each physical VM host)
And the last sorta-kinda little detail:
- We don't actually install the OS on the drive. We're pxe-booting a live image, and the OS is only using the drive for data persistence. The OS does format the drive
(if needed) on bootup, and store any applicable files to disk (including the 'software' mentioned above). We are still using persistent storage, and it's still part of
our server model. We just don't use it for storing or running the OS (in this cluster anyway).
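For readers who haven't set up network booting, the boot step can be pictured as a tiny script handed to each host. The sketch below renders an iPXE script in Python. The use of iPXE, the file names `vmlinuz` and `initrd`, the URLs, and the `rancher.cloud_init.datasources` kernel argument are all assumptions for illustration; the setup described here may use a different boot loader and parameters, so check the RancherOS documentation before relying on any of it.

```python
def ipxe_script(base_url, cloud_config_url):
    """Render an iPXE script that boots a live kernel and initrd over HTTP.

    All names and URLs are hypothetical placeholders.
    """
    # Assumed kernel argument telling the OS where to fetch its cloud-config.
    kernel_args = f"rancher.cloud_init.datasources=[url:{cloud_config_url}]"
    lines = [
        "#!ipxe",
        f"kernel {base_url}/vmlinuz {kernel_args}",
        f"initrd {base_url}/initrd",
        "boot",
    ]
    return "\n".join(lines) + "\n"


print(ipxe_script("http://boot.example.com/rancheros",
                  "http://boot.example.com/cgi-bin/cloud-config"))
```

Nothing in the script touches the local disk. The kernel and initrd are loaded into RAM, which is why a wiped drive doesn't slow the boot down.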
NOW I get it!
So you might be saying "Oh, well no wonder you get such great times. You're just pxe-booting a live image, in a VM, on a fast network. You're not doing much software installation, and your configuration is just a simple cloud-config."
And to you, unruly naysayer, I would say: "Yep, that's right!"
Why in the World Would You Do That in Production?
There are many benefits to this approach. But here are the highlights:
- It's easier to maintain than a full server provisioning stack
- It's fast. As in, boot-up is fast, and the OS binaries run from RAM instead of sloooooow magnetic drives.
- It's as reliable as our old-school hand-crafted artisanal Ubuntu installs
- We get better usage from our disk space (we don't install gigs of OS binaries and packages)
There are a ton of other reasons we're doing this, including all of the benefits of embracing the microservices revolution, such as easy software builds, reliable testing,
and simple deployments.
But, aren't there a LOT of Drawbacks?
OK, admittedly there are downsides to this approach:
Configuration Can't be Complex
Any configuration we have to do must be covered in the scope of cloud-config, which, for our OS (RancherOS), is surprisingly limited. We aren't running chef, ansible,
puppet, or any other major configuration management service on these hosts. We could, but that would kind of defeat the purpose of keeping these hosts as lightweight
and disposable as possible.
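To give a feel for how little a "simple" cloud-config has to contain, here is a Python sketch that renders one. The hostname, SSH key, and image name are placeholders. Launching the container from an `/etc/rc.local` written by `write_files` is an assumption about how RancherOS can be configured, not a copy of the file used on this cluster.

```python
def render_cloud_config(hostname, ssh_key, image):
    """Return a small cloud-config document as text (hand-built YAML, stdlib only).

    The layout is illustrative; verify key names against your OS documentation.
    """
    lines = [
        "#cloud-config",
        f"hostname: {hostname}",
        "ssh_authorized_keys:",
        f"  - {ssh_key}",
        "write_files:",
        "  - path: /etc/rc.local",
        '    permissions: "0755"',
        "    owner: root",
        "    content: |",
        "      #!/bin/bash",
        # The entire "software installation" step is a single docker run.
        f"      docker run -d --restart=always {image}",
    ]
    return "\n".join(lines) + "\n"


print(render_cloud_config("node01",
                          "ssh-ed25519 AAAAEXAMPLEKEY ops@example.com",
                          "example/agent:latest"))
```

Everything host-specific lives in three parameters, which is what makes it practical to generate a file like this per host at boot time.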
You might notice I said RancherOS. If you're not familiar with that particular flavor of Linux, then give it a try. Similar to CoreOS, it basically just runs docker
and doesn't come with all of the cruft you get from a full-blown server OS. The kernel image we're using clocks in at 3.7M, and the initrd is 25M. An OS footprint of
28.7 megabytes explains why that bootup process is so fast.
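A quick back-of-the-envelope check shows why that footprint matters. The sketch below estimates the best-case wire time for the two images over a 10Gbit link. It treats "M" as 10^6 bytes and ignores protocol overhead, TFTP/HTTP behavior, and the fact that a VM shares its host's link, so read the result as a floor rather than a measurement.

```python
KERNEL_MB = 3.7    # kernel image size from the text
INITRD_MB = 25.0   # initrd size from the text
LINK_GBIT = 10.0   # one 10Gbit link; sharing and overhead are ignored


def transfer_seconds(size_mb, link_gbit):
    """Ideal transfer time: megabytes -> megabits, divided by link speed in Mbit/s."""
    return size_mb * 8 / (link_gbit * 1000)


total_mb = KERNEL_MB + INITRD_MB
millis = transfer_seconds(total_mb, LINK_GBIT) * 1000
print(f"{total_mb:.1f} MB takes about {millis:.0f} ms at {LINK_GBIT:.0f} Gbit/s")
```

At roughly 23 ms of ideal transfer time, downloading the OS is a rounding error next to the 37-second boot.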
We're using stock images from RancherOS though. So it's not like we have any overhead in maintaining the images. If they release a new image, we try it out, and if it
works for us then we use it. Since we're using stock images, and cloud-config, a full OS upgrade is literally 88 seconds away. Trying out a new OS version is similarly fast.
As an aside, I consider complex configuration the wrong way to go anyway. If you have complex configuration management, that means you need to manage your configuration
management. Some people like that, but I like to keep things simple and work on the important stuff like keeping our website healthy. So really, enforcing simple
configuration is actually a bonus! If for some reason we find that we really need more complex configuration, we'll probably move that into the docker images.
The more we use docker images, the less need we have for a complex configuration management system. Why deal with configuration management, when you can just define
the exact state of your docker images in a Dockerfile? I guess if you don't want to maintain lots of Dockerfiles (and their associated images), then you could
maintain it in a tool like chef. I don't know that using chef to build or configure your docker images would buy you much in the complexity department though.
Instead of maintaining Dockerfiles, you end up maintaining recipes, cookbooks, databags, roles, nodes, and environments.
Software Options are Limited
We're limited to just using software that can be run in a container. That's a lot of software actually, but anything proprietary will need to be packaged up in a
container before we can run it on these hosts. There are also a lot of positive side effects from working with containers at the OS level, and cutting
out the cruft of traditional package management. As one example, apt/yum both do a great job of building out a dependency chain and pulling it in during an install.
However, they introduce their own issues with package conflicts and silly dependency chains that are difficult to work through. With docker images, the dependencies are
in the image. Package conflicts effectively go away.
I should take a moment to mention the software security aspect here. Modern OS package management systems (yum/apt/etc) have grown to support package signatures, trusted repos and maintainers, etc.
Contrast that with downloading images from dockerhub, where the image isn't necessarily signed or maintained by a trusted person/group. Limiting which images
can be downloaded from dockerhub, and/or using a trusted registry helps improve the software security aspect. Though, for the time being, this is one area where yum and
apt have an advantage.
Data Persistence is Still Iffy
Our most difficult challenge so far is figuring out how to reliably dockerize our SQL databases. Some people will be quick to say "well there's a container for mysql
and a container for postgres, and a container for ..." But hold on there cowboy, if you put your entire database in a container where does that data go? If you store it
in the container itself, that data goes away when the container is destroyed. If you have bind mounted a volume to your container, then your SQL container is joined at the
hip to the host which originally ran the container. Using a "data container" and linking it to the SQL container is a popular solution, but has the same problem of being
stuck on the host they started on.
We don't want any container to be stuck to a single host. We're aiming for lightweight and disposable hosts here. Less pets, more cattle. If the host stores some
mission critical database, then it's no longer disposable. For that reason, we treat all on-host storage as volatile, and plan around the possibility of it being
destroyed at any time without warning. Traditional approaches for SQL data reliability include backups, and slave DB servers, but translating those concepts to containers
comes with a new set of complexity and problems.
One of Docker's approaches to solving that problem is support for volume drivers (plugins), and we're currently looking into both Flocker and Rancher's Convoy, which are two popular
volume drivers. We've been discussing other ideas to solve SQL data persistence, some of them more wildly experimental than others, such as an off-cluster "super-slave" for all
database containers, or sending binlogs to Kafka, but so far haven't found a silver bullet here.
As an aside, the data persistence problem is more easily solved for companies that have an enterprise-grade SAN, which we don't have (yet).
We Have to Maintain a Custom Provisioning Stack
There are a lot of moving pieces in a server provisioning stack, and we have to maintain them. A tftp server, a web server, dhcp server, dns server ...
and I'm sure I'm forgetting some others. In our case, we have that all maintained in Chef. I didn't say we have NO complex configuration anywhere, we just keep it away
from our Rancher cluster and maintain it in Chef. Try to imagine a picture of the Dos Equis guy here, "I don't always have complex configuration, but when I do, I use
Chef". Our server provisioning stack isn't really that complicated anyway. These are all standard services, and we're not configuring them in any off-the-wall ways.
The most complicated part is actually how we generate cloud-config files. We created a quick CGI script that simply calls out to consul-template to generate cloud-config
files on-the-fly. Any specific host configuration is stored in our consul cluster (such as hostname, environment, etc).
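A minimal sketch of that idea in Python, for illustration only (it is not the actual script). It assumes the booting host is identified by its IP address via the standard CGI `REMOTE_ADDR` variable, that the template reads a `PROVISION_CLIENT_IP` environment variable to find the right keys in consul, and that the template lives at the path shown. All three are hypothetical choices. The only consul-template features used are the `-once` and `-template "source:destination"` flags.

```python
import os
import subprocess
import sys
import tempfile

TEMPLATE_PATH = "/etc/provision/cloud-config.ctmpl"  # hypothetical location


def build_command(template_path, output_path):
    """consul-template invocation: render the template once, then exit."""
    return ["consul-template", "-once", "-template", f"{template_path}:{output_path}"]


def render_cloud_config(client_ip, template_path=TEMPLATE_PATH, run=subprocess.run):
    """Render a cloud-config for one host and return it as text."""
    with tempfile.TemporaryDirectory() as tmp:
        output_path = os.path.join(tmp, "cloud-config.yml")
        # The template is assumed to read this variable to pick the host's keys.
        env = dict(os.environ, PROVISION_CLIENT_IP=client_ip)
        run(build_command(template_path, output_path), env=env, check=True)
        with open(output_path) as fh:
            return fh.read()


def main():
    body = render_cloud_config(os.environ.get("REMOTE_ADDR", ""))
    sys.stdout.write("Content-Type: text/plain\r\n\r\n")
    sys.stdout.write(body)


# Only respond when actually invoked by a web server as a CGI script.
if __name__ == "__main__" and "GATEWAY_INTERFACE" in os.environ:
    main()
```

Rendering on request means a change to a host's entry in consul takes effect on that host's next boot, with no files to regenerate or push around.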
There are a lot of advantages and disadvantages to the cluster we've built. I highlighted the speed of reprovisioning as the topic for this article, but only because it's an
interesting datapoint, not because it's important to our use case. Hosts in our Rancher Cluster are so disposable now, that even if server provisioning took 30 minutes
instead of 88 seconds, I don't think we'd notice. If a disposable host dies without any impact to your services, do you really care anymore that it took 88 seconds or 30 minutes to
replace it? Something we take for granted is that building a docker cluster enabled us to focus less on server maintenance and more on other issues that needed our attention.
Using docker at the OS level and treating hosts as disposable moved us to a more stable and maintainable platform overall, and maybe that's a topic worth discussing
all on its own.