
Your First Day as an Engineering Manager

My first day leading a team at Avvo was very different from my first ever day leading a team. That’s because my first ever day leading a team was an unmitigated disaster. I had no idea there was a fairly set script for coming in as a people leader. I had to learn that the traditional way, through a series of disastrous career mistakes and embarrassing social gaffes. We’ll cover those in detail in future posts. For this entry, I want to cover my script for an incoming manager or team leader.

As an incoming people leader, you might be meeting this team for the first time. You might not. That doesn’t actually change what the team needs from you, but that might change how you go about forging the relationships you’ll need to become an effective leader.

There are three main phases to your first day (or days) as a leader, and they correspond to the needs of your team.

Phase #1 — Humanize yourself, and state your expectations.

Depending on the size of your team, this phase may take the entirety of your first day or two. You need to make a personal connection with each member of your new team. This means 1:1 time. For the first day, you’ll give your introduction about who you are and what you represent, and you’ll listen. I like to talk about some of the larger mistakes I’ve made, balanced with some of the achievements that came from those tough learning experiences. It’s important that your team understands that mistakes will happen, but that it’s how we correct and prevent future mistakes that’s important. Humanize yourself, but more importantly state your expectations.

Buy your new report a coffee, and get yourself something large. Spend the next hour slowly sipping so you’re not tempted to talk unless asked a direct question. Out of the next hour, you should take at most five minutes to make your introduction, then the next 55 minutes should be spent listening and writing notes. You will be asked questions, but they’ll often be the light softball questions that occur when folks are feeling each other out. Make sure you state your expectations up front, and if there’s something they need to know, this is your five minute chance to let them know right off the bat. For me this is where I take five minutes to discuss ‘share of voice’ (the idea that everyone needs to feel like they have an equal share of the conversation), feeling respected and respecting others at work, and how our 1:1 conversations will go.

Lots and lots of coffee

Don’t eat breakfast on your first day, as you’re going to be showered with free food and likely a free lunch. As a people leader you’ll need to get used to expensing lunches and coffees. Lots and lots of coffee. You’ll also want to 1:1 with your product manager, UX/design/content strategist, QA/testers, and data and business analyst peers. These are vital team members, and you need to understand the process and relationships from all viewpoints.

Phase #2 — Understanding your Team’s Challenges

Your first day or two full of 1:1s will be a deluge of information. Often there will be friction between business units (i.e. developers and QA should work hand-in-hand, but in a dysfunctional organization the natural tension between their roles becomes unhealthy.) There may also be friction around interpersonal dynamics within one business unit (i.e. there may be an underperforming team member, unhealthy deadline pressure, poor interpersonal interactions from frustrated developers, etc.)

Bring a notebook, and write it all down. Offer no solutions, as you don’t have a full picture yet. This is just your time to listen, and answer any questions around you and your background and leadership style.

Phase #3 — Public Goal-Setting

If you’re extremely lucky you’ll get to this on your first day. Commonly this could be day two or three. Do not let this go past the first week. You need to be clear with a public goal, and it needs to solve a challenge that you recorded during your first week of 1:1 conversations. You may have heard, “first impressions are the most important” or “there’s only one chance for a first impression.” It’s not so dire as that, but there’s truth in these adages. The tone and expectations you initially set with your team will be the backdrop against all of your future requests and leadership. Similarly, the success you have with the team towards a shared challenge will be the backdrop against all of your future initiatives with the team. Are you a leader who can follow through on their vision? It’s vitally important to the team that you are, and setting that expectation will pay dividends in future initiatives.

I can hear you thinking “But wait Hunter, my direct reports didn’t voice any major challenges, things must be perfect on my new team!” Okay, I can’t really hear you thinking that, because there’s no perfect team. Still, you know there are common areas that a team needs to improve in, and yours will be no different:

* code review and pair programming
* continuous integration and deployment speeds
* engineering focus and code quality
* interpersonal relationships and dev/UX process
* agile processes and agile readiness
* TDD / test coverage / test culture

If no improvements have been offered, pick one you know they need from the above list, create an MVP around it that solves an important aspect of the problem or sets the team up to be able to solve it themselves, and have it be something you know you can finish by the end of your second week. You’ll be asking your team to break stories down into smaller parts, and you should be able to demonstrate this paradigm as a team leader. Preferably choose something you’ve done in a previous position. Worked as a release engineer in another life? Automate the release process. Awesome at pair programming? Build that out into a pair schedule and sell it to the team. It’s vital you work to build trust with your team, and for an engineering team that means visible output. And that’s how you start leading an engineering team.

Building a Management Training Curriculum at Avvo


This week, we kicked off manager training for Avvo technology managers. Before we could build a curriculum, we needed to decide what was important to learn and where we as a group needed the most development.

If I had done this with my organization at Adobe, years ago, I might have made a list of capabilities or requirements for our roles and then assessed each person against those requirements. I’ve since learned that the top-down approach tends to isolate and alienate people. It is something done TO them. They don’t feel investment or ownership of the process. If they disagree with the list or my assessment of them, it is hard to challenge due to the nature of the process.

When I was at Spotify, I worked with Paolo Brolin Echeverria and Mats Oldin to build manager training for my Tribe. They developed an excellent kick-off exercise that I repurposed for my team at Avvo.

The process is straightforward. 

We began by individually thinking about the qualities of a good leader in our organization. We each wrote every important quality we could think of onto individual post-its. This effort took about 20 minutes. Then, one by one, we put each of our post-its onto a large board. As we placed each quality, we explained why we believed it was important for a leader at Avvo.

Kyle getting us started

When we finished putting all of our post-its on the board, we affinity-grouped them. Affinity-grouping resulted in 30 groups of similar qualities as well as a few individual post-its that did not fit into any group. The grouping process required a lot more discussion so that we could all agree on the final groupings.

Nic, Ian and Jordan working on cleaning up the affinity groups

At this point, we had collectively described 30 essential qualities of a leader at Avvo, which is far too many to effectively focus on. To narrow things down, we each received six votes to put towards any group of qualities that we felt were the most critical. Then, we tallied the votes and took the top eight as our core qualities of a manager at Avvo.

The voting process also led to a lot of valuable discussions as we saw where we had voted as a group. Were these the right eight? Were they the most important eight? The eight qualities that we picked were: empathy, develops autonomy, builds good teams, is real and trustworthy, is a big picture thinker, supports mastery, gives feedback, and has a bias for action.

Dot voting in progress

Individually, we then assessed ourselves against the eight core qualities on a three-point scale: “I need training on this,” “need training, but it can wait,” and “I can train others on this.” One by one, we went up to a board that had the eight qualities mapped on a spider graph. We put dots on a line for each quality where we rated ourselves. We explained why we chose that assessment. This led to further good discussion about how to assess ourselves against these qualities.

Our collective spider graph

The group as a whole found this exercise to be very valuable. We had excellent discussions on what it means to be a good leader at our company, including the values we agree on, and the ones that we don’t. We were also able to prioritize those collectively in a way that left everyone feeling ownership of, and allegiance to, the result.

And we came to an understanding of where we need to develop the most as a group. This mutual understanding will inform the curriculum for our management training – my original goal.

Day 06 - Account Subscriptions [Rails 5]


  • Application: Complete and tested with one endpoint
  • CI: Complete
  • Deployment:
    • Dev: Complete
    • Production: Complete


Week two started off with a trip back to Rails 5. Carson Stauffer was able to pair with me on an account subscription API. We started off with API design, which was really insightful since the previous model locked our subscriptions to only an email. Carson had a great idea to increase the flexibility of contact channels by altering the table structure to handle dynamic channels such as SMS or messengers. After a week of iterations, the process became pretty smooth and looked like this:

Design:
* Interface
* methods
* inputs
* schema
* Naming (not as easy as it sounds)

Skeleton:
* Github repository
* Docker hub repository
* Choose service port
* Generate application using our CLI

Configure:
* DataDog
* dotenv
* database
* honeybadger
* health check endpoint

Implement:
* generate endpoints
* write more tests
* add pagination
* json schema for 1.0

Deployment:
* circle CI
* Docker
* Dockerfile
* rancher configuration

Now that we've done this a few times, maybe we should flip the implement and deployment steps, since deployment is our goal. As an engineering organization we could work more efficiently if the application is deployed with minimum effort and we can begin iterating immediately.

Learning for the day: API Generators

Naming APIs
There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

Naming is one of the two hard things in Computer Science. Today we started with something as simple as email subscriptions, but since we generalized it, email didn't make sense. There were magnetic poetry words on a whiteboard behind us, so I immediately suggested fun word combinations. Those included dragon-byte, virus-love, information-dungeon, and even some that were somewhat close to topic, such as byte-marketing. Ultimately Carson made a great case that although those names all sound like fun, they hide the purpose of the service. So we settled on account subscription.

Namespaced Generators

We leverage rails scaffolding to generate endpoints quickly, but they always seem to need work to get them perfect. In particular, we need proper namespacing for versioning in the future. Although not an extreme amount of work to hand build, we saved a lot of time by namespacing when generating the scaffolding.

In rails you can add namespaces to your objects and it will automatically add those namespaces to your files.

$ rails g scaffold AccountSubscription::V1::Subscription name:string
  invoke  active_record
  create    db/migrate/20170613064344_create_account_subscription_v1_subscriptions.rb
  create    app/models/account_subscription/v1/subscription.rb
  create    app/models/account_subscription/v1.rb
  invoke    test_unit
  create      test/models/account_subscription/v1/subscription_test.rb
  create      test/fixtures/account_subscription/v1/subscriptions.yml
  invoke  resource_route
   route    namespace :account_subscription do
                namespace :v1 do
                  resources :subscriptions
  invoke  scaffold_controller
  create    app/controllers/account_subscription/v1/subscriptions_controller.rb
  invoke    test_unit
  create      test/controllers/account_subscription/v1/subscriptions_controller_test.rb
  invoke    jbuilder
  create      app/views/account_subscription/v1/subscriptions
  create      app/views/account_subscription/v1/subscriptions/index.json.jbuilder
  create      app/views/account_subscription/v1/subscriptions/show.json.jbuilder
  create      app/views/account_subscription/v1/subscriptions/_account_subscription_v1_subscription.json.jbuilder

This is nice but not ideal. It has segmented the models into a namespace as well, when all we needed was for the controller to be namespaced.
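What we actually want looks more like this: versioned routes and controllers over a single, un-namespaced model. A sketch with illustrative names, not our generated code:

```ruby
# config/routes.rb -- only the controller layer is versioned
namespace :v1 do
  resources :subscriptions
end

# app/controllers/v1/subscriptions_controller.rb
class V1::SubscriptionsController < ApplicationController
  def index
    # the model stays un-namespaced
    render json: Subscription.all
  end
end
```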

More customized generators will be a perfect addition to our CLI.


Day 05 - Ads [Phoenix/Elixir]


  • Application: Complete and tested with one endpoint
  • CI: Complete
  • Deployment:
    • Dev: Complete
    • Production: In-progress


We've made it one week and four applications. Our fifth was completed with the help of Matt Purdy and his Elixir skills. Matt has been working with Elixir longer than I have and brought a great deal to the project. In complete honesty, we had started out with a different API topic, but after some discussion there weren't any viable use cases for it, so we switched to an ad resolver.

Matt and I quickly looked at the current use cases and put together an app with two endpoints that seemed to fit the bill.

One problem... the data backing our current ad resolver was implemented in a database view without a primary key. This was completely new to us in a Phoenix project. All the data was generated using a database stored procedure, so we could remove create, update, and delete from our app. This also made a show endpoint irrelevant. Our app had become super small, which was perfect for this project. #reducescope

Learnings of the day:

Environment variables

In our previous Phoenix applications, we used System.get_env("SOME_VAR") and the "${SOME_VAR}" syntax, but to be honest, we didn't really understand what the difference was. When configuring this app, we skipped the latter and only left System.get_env("SOME_VAR") for our database configs. Since we moved relatively quickly through building the endpoints, we were soon on to dockerizing and deployment. Over and over we tried to configure and rebuild the Docker image, but each time the app could not connect to the database. Well, it turns out the difference between the two environment variable syntaxes is quite important.

System.get_env/1 pulls the system variable at compile time, i.e. when we are building the container.

"${SOME_VAR}" loads the environment variable at run time.
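As we understand it, the distinction plays out like this in a database config. A sketch assuming a Distillery-based release (which mix_docker builds), with illustrative app and variable names:

```elixir
# config/prod.exs (sketch)
config :my_app, MyApp.Repo,
  # evaluated while the release is built, i.e. during `docker build`
  username: System.get_env("DB_USER"),
  # kept as a literal string and substituted when the container starts
  # (Distillery's REPLACE_OS_VARS=true behavior)
  password: "${DB_PASS}"
```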

Knowing this distinction could have saved us a lot of time debugging our Docker containers.

Stay tuned for more APIs next week!


Day 04 - Education [Phoenix/Elixir]


  • Application: Complete and tested with five endpoints
  • CI: Complete
  • Deployment:
    • Dev: Complete
    • Production: In-progress


Back to Phoenix and Elixir for the next API. Today's focus was domain information around schools, but this time it would be a bit different. Up to this point we hadn't worked with many associations. This API did have multiple associations, and the best part was Phoenix's handling of them. When using the generators you can add the reference on the command line:

$> mix phx.gen.json Schools SchoolAlias school_alias name:string school_id:references:school

# and you get this included in your migration
add :school_id, references(:school, on_delete: :nothing)
create index(:school_alias, [:school_id])


It took no more than 2 hours to build out five endpoints, tests, CI, and docker config files.

Learnings of the day: Honeybadger account

We use Honeybadger for our exception reporting, and many of our applications are configured to have separate accounts for the test environment and the production environment. On our current plan we are limited to only 50 applications, of which we've used...well...50. This is problematic since it means new applications going out will not have error reporting, which is completely unacceptable. The support staff at Honeybadger have been great at assisting us, and now we face a choice: either throttle test environments and handle the switch to the upgraded account, or pay exponentially more for a 2XL plan. This is a bit larger of a decision than 15 in 15, and we're working with the rest of the engineering team to find the best choice to get us moving forward.

Has anyone else run into this while expanding into SOA? What did you do?

Stay tuned for our next API!


Day 03 - Preferences [Rails 5]


  • Application: Complete and tested with two endpoints
  • CI: Complete
  • Deployment:
    • Dev: Complete
    • Production: Complete


Today we decided to change it up by creating our API using Rails 5. Most of our services are written in Rails 4, so this was a good chance to see how our libraries handled the update. Our project topic was user preference flags, which, amazingly, get accessed often and on every page view. Building this out should be beneficial to our front end services and the move to SOA. Our awesome senior engineer, Jake Sparling, paired with me for a ride through our custom CLI library. Jake recently worked through an API for our users' saved content, so he brought those experiences to the party. The coding was extremely quick, and leveraging Rails' generators sped us along very quickly. Typically we've used active_model_serializers, but since Rails 5 includes jbuilder in the Gemfile, we decided to use it. It was super easy to use and intuitive from only a few examples. I recommend giving it a try, since it's less abstract than AMS.
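For a flavor of jbuilder, an index template is just Ruby. This sketch uses illustrative names rather than our production code:

```ruby
# app/views/preferences/index.json.jbuilder (sketch)
json.array! @preferences do |preference|
  json.extract! preference, :id, :name
end
```

Rendering this for a collection yields a plain JSON array of objects with id and name keys.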

Learnings of the day:

API design

User preferences are strange. The term user preferences in a developer's mind immediately brings visions of join tables.

Why would you make an endpoint for a join table?

Well, looking over all our use cases, we decided it was best to have a resource that is a user_preference. This holds a reference to a user, the preference itself, as well as a flag to tell whether that preference is on or off.
That last flag was the deciding factor to use user_preferences as our endpoint. Perhaps user_preference_flag would be better, but if it changed to contain more information than a flag, then we'd be reducing understanding.
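For illustration, the resource we landed on has roughly this shape (the field names here are a sketch, not the production schema):

```ruby
require "json"

# A user_preference resource: a user reference, the preference itself,
# and the on/off flag that drove the endpoint naming discussion
user_preference = {
  user_id: 42,
  preference: "email_marketing",
  enabled: false
}

puts JSON.generate(user_preference)
```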

Has anyone else had similar endpoint architecture discussions? What did you do?

Updates needed to our CLI

While Rails is super slick with its generators,

rails g scaffold Preference name:string

we still ran into a few hurdles around manual script building and configuration. Some of these included:

1. Adding common files to .gitignore
2. Auto-generating docker build and publish scripts
3. Pre-generating database.yml to include environment variables
4. Including dotenv in our Gemfiles for quick use in development
5. Handling new database creation

All great things to increase our development velocity.

Successful deployment


Jake and I worked closely with Aaron Devey in Infrastructure to see how the new configuration management tool worked. It was amazing that after only a few minutes of work with our pre-deploy scripts we had the application running in Test, Staging, and Production. Another step towards a fully dockerized infrastructure.

I'd like to thank the Infrastructure team for working hard to get the custom configuration manager working.

Stay tuned for our next API!


Day 02 - Companies [Phoenix/Elixir]


  • Application: Complete and tested with two endpoints
  • CI: Complete
  • Deployment:
    • Dev: In-progress
    • Production: In-Progress


Our Elixir skills are quickening, and after a day of working with the new changes to Phoenix we created a new company service API in a matter of minutes, not hours. Within 30 minutes we had two endpoints built and a new template identified for building Elixir non-umbrella apps.

Learnings of the day: Dockerizing new elixir apps.

At first we used edib but have recently moved to mix_docker. Following the tutorial was easy enough except for two gotchas.

  1. We leverage an internal library for our influxdb reporting and service configuration. In order to handle that, we need to copy our deps in our Dockerfiles. We kept bumping into a deps/ directory not found error. Pro tip: .dockerignore has deps included, so be sure to remove that line before trying to copy the deps directory in your Dockerfile.

  2. Environment variables for the docker containers have a slightly different syntax with mix_docker. Use "${SOME_VAR}" alongside `System.get_env/1`.
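Putting the two gotchas together, the relevant Dockerfile lines look roughly like this (a sketch, assuming `deps` has already been removed from .dockerignore):

```dockerfile
# copy the mix files and the vendored internal deps into the build image
COPY mix.exs mix.lock ./
COPY deps ./deps
RUN mix do deps.compile, compile
```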

Our infrastructure team has been ready to assist with any blocking issues we have with deployments. Today they identified issues with our internal configuration file. From the issues found, there is a need for either a validator or an application that can help craft these configurations, especially since the domain knowledge of all the details is concentrated in one team.

Still working towards full deployment. I expected the first few days to be backed up, but things should be unblocked in the coming days.

Stay tuned for deployment updates.


Day 01 - Organizations [Phoenix/Elixir]


  • Application: Complete and tested with four endpoints
  • CI: Complete
  • Deployment:
    • Dev: In-progress
    • Production: In-Progress


Our first application's topic was lawyer organizations, which we decided to build using Phoenix, Ecto, and Elixir.
Our first hurdle was understanding the new Phoenix 1.3.0-rc file structure changes, along with the move away from Models in MVC.

TL;DR - /web moves into the lib/ directory of the app, and the generators now build a module (a context) for fetching records from the database.

Donald Plummer and Matt Purdy were amazing enough to build a Phoenix template generator into our avvo-cli library. Though this was helpful, we decided an umbrella app was not necessary for such small apps.

Learnings of the day: Add different templates for Elixir apps to our CLI

Onto the next change: the mix tasks have switched from the old `mix phoenix.*` tasks to the new `mix phx.*` tasks that handle the new directory structure.

To use that we had to install the new tasks:

mix archive.install

API Design

The JSON API 1.0 specification sets out guidelines which greatly differ from our current schema.

sample json

```json
{
  "meta": { "total-pages": 13 },
  "data": [
    {
      "type": "articles",
      "id": "3",
      "attributes": {
        "title": "JSON API paints my bikeshed!",
        "body": "The shortest article. Ever.",
        "created": "2015-05-22T14:56:29.000Z",
        "updated": "2015-05-22T14:56:28.000Z"
      }
    }
  ],
  "links": {
    "self": "[number]=3&page[size]=1",
    "first": "[number]=1&page[size]=1",
    "prev": "[number]=2&page[size]=1",
    "next": "[number]=4&page[size]=1",
    "last": "[number]=13&page[size]=1"
  }
}
```

avvo sample json

```json
{
  lawyers: [
    {
      id: 28995,
      prefix: null,
      firstname: "Mark",
      middlename: "S",
      lastname: "Britton",
      ...
    }
  ],
  meta: {
    status: 200,
    per_page: 10,
    page: 1,
    total_entries: 100
  }
}
```

There was some debate that we should stay the course with our old schemas, but after looking at the conventions in the Rails and Phoenix communities, we decided to move into the new jsonapi 1.0 specifications.
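The move is mostly a matter of wrapping records in the spec's envelope. A minimal sketch of the idea (the helper and field names are illustrative, not our implementation):

```ruby
require "json"

# Wrap a record in a JSON API 1.0 style document:
# resource objects carry a type, a string id, and an attributes hash
def jsonapi_document(type, id, attributes)
  { data: { type: type, id: id.to_s, attributes: attributes } }
end

doc = jsonapi_document("lawyers", 28995, firstname: "Mark", lastname: "Britton")
puts JSON.generate(doc)
```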

As expected, we ran into issues deploying our application to Rancher. I'll be working with our infrastructure team and the new configuration tool they've built to allow self-serve deployments.

Stay tuned for deployment updates.


15 Microservices in 15 Days

A Molasses Migration

Avvo started out as a Rails app and, like many, we used Capistrano for deployments with server configurations managed through Chef. This worked for some time, but as our stack grew, the architecture headed into SOA and more servers were added. We quickly ran into environment variable dependencies and conflicting system libraries (e.g. different Ruby versions, libv8, or imagemagick). This created an unmaintainable infrastructure and limited our self-service capability.

At the same time Docker came onto the scene.

Contained dependencies, and environment variable management through a third-party application.

Sounded great to us. Sign us up!

Welcome to Docker-ZORK! aka DOCK. Your adventure begins with only a console and your wits...

Adventures in Docker

We set off to migrate all of our Capistrano-deployed applications to Docker. After many iterations and countless trials of different software solutions, we decided to use Rancher. There was even a "Dockerize all the things" bash day to get all our applications dockerized. Even though it took a few weeks after to get completely ready for Docker, we did it. Our stack was finally ready to deploy using Rancher....except we still relied on Chef to set up our infrastructure containers. Environment variables were stored securely, but updating them relied on our Infrastructure team. A few services were deployed in Rancher while others still used Capistrano.

The deployment pipeline was not as fluid as we wanted it to be.

A Foot in Each Room

With a production environment mostly still on Capistrano, and our development stacks completely dockerized, we stood with one foot in the old way, and one foot in our new pipeline infrastructure. Each new application was always a struggle to get out the door. Be it global configs, database connection settings, creating new databases, setting up our CI, or integrating into our new development environments with Rancher, each time there were mixed results. A few times it would go smoothly, while most of the time it was like wandering around a dark basement bumping your head continuously.

Shake Out the Snakes

One day while discussing ACLs on a new product, which involved creating another service, Donald Plummer said "While you're at it can you please break out sales regions, resolve and location from our location service?" (paraphrased). That was followed up with two more services mentioned in our dev chat. I off-handedly said "whelp, time to start on the 15 service APIs Donald was suggesting..". At the time it seemed like a joke, but then we thought about it more.

* What if we could build and stand up services in one day in all environments?
* What if each of the engineers had no barriers to releasing services?
* What if we did move to super small, easily rewritable services?

Then the most awesome Hunter Davis had a wonderful conversation with another awesome engineer, Jacob Kemp, and came up with an even better addition. "Let's pair program our way to success"

Thus 15 apps in 15 days was born.

Here’s how it will work:

* Starting June 5th, I will pair with one engineer every day for 15 days.
* The services we build will be small (a few endpoints and small in scope).
* The morning will include API design and building out the endpoints. (break for lunch)
* The afternoon will be dedicated to standing up the services in our development environment, Staging, and Production.
* By the end of the day we should have services running in all environments.
* After each iteration, we’ll have a quick retrospective.

Our goals:

1. Discover new tools and libraries
2. Uncover deployment impediments
3. Broaden the engineering team’s knowledge for creating and releasing new services
4. Move closer to the engineering goals:
   * 10 minute deployments
   * 1 day to create and release a service

Stay Tuned

I'll be blogging about what we learned each day, including the stumbling blocks we uncovered and improvement suggestions.

An excursion in Elixir / Phoenix / Raspberry Pi



The Impetus

Avvo hosts monthly lightning talks. Generally there are three to five presenters covering a range of topics; most are not work related, on subjects as varied as Reading a room, Practical knot tying, and How to dig for the elusive Geoduck.

The rules of lightning talks are:

  1. Must be five minutes or less;
  2. Slides should be useful and understandable outside of the talk;

To learn more about Lightning Talks there is a great article written by Diane Geurts available.

When the call came out for presenters this time I started to consider what I could present. A glance around my desk showed the usual things: computer(s), synthesizers, micro-controller programming boards, and a Raspberry Pi 2.


The Raspberry Pi has been sitting there running for a few months with little-to-no use. Could I do something simple and fun that could be presented in five minutes?

Raspberry Pi and General Purpose Input / Output

One thing that makes the Raspberry Pi interesting is its set of 26 General Purpose Input / Output (GPIO) pins. These pins allow connection with physical devices at a very low level. So it's not like attaching a keyboard, mouse, or other USB peripheral; think LEDs or simple switches.

Fundamentally a GPIO pin is either on or off. When the pin is 'pulled high' there is a voltage present. When the pin is 'pulled low' there is no voltage present. The voltage is usually 3.3V or 5V DC.

With an LED attached, when the pin is pulled high the LED will turn on; conversely when the pin is pulled low the LED will turn off.

This is the simple case of using a GPIO pin.

Raspberry Pi Advanced GPIO

That a single LED or other simple on/off device can be controlled with a single GPIO pin is nice and all, but what about more advanced devices? What about an array of LEDs? Something along the lines of the Bicolor LED Matrix from AdaFruit?

With 64 different LEDs, the Bicolor LED Matrix would consume more pins than are available on the Raspberry Pi!

Fortunately there are ways to handle this situation. At least two other protocols can be used:

  1. SPI: Serial Peripheral Interface
  2. I2C: Inter-Integrated Circuit

Full details about these protocols are beyond the scope of this article, but good starting points are SPI and I2C.

The Idea

Recently I have been learning the Phoenix Framework using the Elixir language. Could I get a Phoenix web service running on a Raspberry Pi?

If I could, what could it do that could be interesting enough to demo to a group of 10 - 30 coworkers of various disciplines in five minutes or less?

Given all this I came up with the following plan.

Step One

First, get a simple web service based on the Phoenix Framework that will respond to a /ping endpoint. This would be merely to prove that Phoenix, Elixir, and Erlang can run on the Raspberry Pi (with Raspbian as the OS).

Step Two

Next, have the service blink an LED while it is running. This proves that it is possible to access the GPIO pins from Elixir. Additionally, it uses a Periodic Actor by way of Erlang's and Elixir's lightweight Process implementation.

Step Three

Next up is to control a second LED with an endpoint. So a POST method to /led with body {value: 0} turns the LED off, and {value: 1} turns the LED on.

Step Four

Now things get interesting: provide the ability to control the aforementioned Bicolor LED Matrix. For this case, several predefined patterns are available; a POST method to the /matrix endpoint with body {pattern: n, brightness: x} causes a pattern to be displayed at the given brightness.

If a pattern number outside the available pattern count is given, a pattern is chosen at random and displayed. A pattern value of -2 turns the matrix off.

Step Five

An endpoint that accepts a matrix to be rendered to the Bicolor LED Matrix.


As was discussed previously, there are two protocols for advanced GPIO use. For this project I2C was used, as AdaFruit provides a nice I2C-compatible backpack for the LED Matrix; you can buy the package from AdaFruit (minimal soldering is required).

The Schematic

The schematic is pretty simple: a couple of LEDs, a resistor for each LED, and the Bicolor LED Matrix soldered onto the I2C backpack.

The Eagle CAD schematic can be found in the repo.

The Code

The source code is available for perusal, and Pull Requests are welcome.

The main points of interest are:

  1. The use of Elixir ALE to communicate with the GPIO pins; and
  2. the use of a periodic actor to blink the LED while the service is running.

Elixir ALE

Everything in Unix / Linux is a file. So it is possible to simply write values to files located in the /sys/class/gpio/ directory to control the pins (Hertaville has a more detailed explanation).
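
As a rough illustration of that sysfs approach (in Ruby rather than Elixir, purely for sketching), driving a pin really is just file writes. After a pin number is written to the export file, the kernel creates a gpioNN directory whose direction and value files control the pin; the base: parameter here exists only so the sketch can be exercised without real hardware.

```ruby
# Sketch of the raw sysfs GPIO interface: export the pin, set its
# direction, then write 0/1 to its value file. On a real Pi this needs
# root and actual hardware; `base:` is overridable only for illustration.
def gpio_write(pin, value, base: '/sys/class/gpio')
  dir = File.join(base, "gpio#{pin}")
  # Exporting makes the kernel create the gpioNN directory
  File.write(File.join(base, 'export'), pin.to_s) unless File.directory?(dir)
  File.write(File.join(dir, 'direction'), 'out')
  File.write(File.join(dir, 'value'), value.to_s)
end

# gpio_write(18, 1)  # would turn on an LED wired to GPIO 18 (pin number hypothetical)
```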

A better way, however, is to use system drivers to interface with the GPIO pins. To this end, fhunleth has created a port of the Erlang/ALE library for use with Elixir. This makes it possible to write idiomatic Elixir when working with the GPIO pins.

Periodic Actor

This is a staple of many Actor-based systems: the ability for an actor to send a message to itself. Introducing a time-based dimension allows a solution to be self-contained and obviates the need for external processes such as cron.
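
The project does this with an Elixir process, but the pattern itself is language-agnostic. A rough Ruby sketch of the same idea, with a thread standing in for the lightweight process and a queue for the mailbox:

```ruby
# A sketch of the periodic-actor idea: after each tick, the worker
# schedules the next message to itself, so no external scheduler is needed.
class PeriodicActor
  def initialize(interval, &work)
    @interval = interval
    @work = work
    @mailbox = Queue.new
  end

  def start
    @mailbox << :tick # the first message to ourselves
    @thread = Thread.new do
      while @mailbox.pop != :stop
        @work.call
        sleep @interval
        @mailbox << :tick # the actor re-schedules itself
      end
    end
  end

  def stop
    @mailbox << :stop
    @thread.join
  end
end

# blinker = PeriodicActor.new(0.5) { toggle_led! } # toggle_led! is hypothetical
# blinker.start
```

In the real service the work block toggles the LED's GPIO pin; in Elixir the equivalent is a process that sends a delayed message to itself.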

The Repo

The README in the GitHub repo has detailed steps to set up the Raspberry Pi to run the code.

For the Lightning Talk a demo script was written. This was done to exercise the service while the talk was in progress and the slides were presented. This was the easiest way to demo the service, as it removed the need to have a network connection on the Raspberry Pi. Demoing the service from a laptop would have required finding the Pi's IP address on the corporate network, which would have introduced a risk of demo failure. Creating the script and 'installing' it in /etc/rc.local made the Raspberry Pi a self-running, self-contained demo machine.

This also involved creating the systemd script to ensure the service started on system boot.

The Lightning Talk

In general I am not a fan of public speaking. I recognize this as a growth opportunity which is what compelled me to do the lightning talk. Five minutes? Meh, that's easy.

For the most part I think the talk went well, but it all happened so fast I was not convinced that anything coherent came out of my mouth.

It is a pretty rich topic to cover in such a limited time, which is why I wrote this blog post to add more color and detail to the project.

Future of the project

Well, I don't really know. It was not intended to be a long-lived project. I have made a couple of tweaks here and there, but nothing major. I have no larger plans either. I am definitely open to Pull Requests on the code if you see a better, more Elixiry way of doing things.

I did not use TDD for this either, which is a huge departure for me. Setting up mocks for the GPIO header was not something I had the time, or the inclination, to pursue.

Improving Capistrano deployment performance

Measure it

First you have to find out what is slow before you can fix it. capistrano-measure is a project that can help you do that.
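
Setup follows the usual Capistrano plugin pattern; a sketch (the require path is assumed from that convention, so verify it against the gem's README):

```ruby
# Gemfile
gem 'capistrano-measure'

# Capfile (require path assumed from the usual Capistrano plugin convention)
require 'capistrano/measure'
```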

Your output will be something like this:

I, [2017-04-27T11:40:30.585704 #11159]  INFO -- : ============================================================
I, [2017-04-27T11:40:30.586152 #11159]  INFO -- :   Performance Report
I, [2017-04-27T11:40:30.586189 #11159]  INFO -- : ============================================================
I, [2017-04-27T11:40:30.586237 #11159]  INFO -- : production
I, [2017-04-27T11:40:30.586413 #11159]  INFO -- : ..load:defaults 0s
I, [2017-04-27T11:40:30.586645 #11159]  INFO -- : ..rvm:hook 1s
I, [2017-04-27T11:40:30.586847 #11159]  INFO -- : ..rvm:check 1s
I, [2017-04-27T11:40:30.586978 #11159]  INFO -- : ..bundler:map_bins 0s
I, [2017-04-27T11:40:30.587122 #11159]  INFO -- : ..deploy:set_rails_env 0s
I, [2017-04-27T11:40:30.587530 #11159]  INFO -- : ..avvo:map_bins 0s
I, [2017-04-27T11:40:30.587971 #11159]  INFO -- : production 3s
I, [2017-04-27T11:40:30.588021 #11159]  INFO -- : deploy
I, [2017-04-27T11:40:30.588063 #11159]  INFO -- : ..deploy:starting
I, [2017-04-27T11:40:30.588314 #11159]  INFO -- : ....deploy:check
I, [2017-04-27T11:40:30.588938 #11159]  INFO -- : ......deploy:check:directories 0s
I, [2017-04-27T11:40:30.589179 #11159]  INFO -- : ......deploy:check:linked_dirs 0s
I, [2017-04-27T11:40:30.589410 #11159]  INFO -- : ......deploy:check:make_linked_dirs 0s
I, [2017-04-27T11:40:30.596548 #11159]  INFO -- : ......deploy:check:linked_files 0s
I, [2017-04-27T11:40:30.597144 #11159]  INFO -- : ....deploy:check 0s
I, [2017-04-27T11:40:30.597678 #11159]  INFO -- : ....deploy:set_previous_revision 0s
I, [2017-04-27T11:40:30.597982 #11159]  INFO -- : ..deploy:starting 0s
I, [2017-04-27T11:40:30.598162 #11159]  INFO -- : ..deploy:started 0s
I, [2017-04-27T11:40:30.598318 #11159]  INFO -- : ..deploy:new_release_path 0s
I, [2017-04-27T11:40:30.598369 #11159]  INFO -- : ..deploy:updating
I, [2017-04-27T11:40:30.598660 #11159]  INFO -- : ....deploy:set_current_revision 0s
I, [2017-04-27T11:40:30.598710 #11159]  INFO -- : ....deploy:symlink:shared
I, [2017-04-27T11:40:30.598853 #11159]  INFO -- : ......deploy:symlink:linked_files 0s
I, [2017-04-27T11:40:30.598994 #11159]  INFO -- : ......deploy:symlink:linked_dirs 1s
I, [2017-04-27T11:40:30.599133 #11159]  INFO -- : ....deploy:symlink:shared 1s
I, [2017-04-27T11:40:30.599297 #11159]  INFO -- : ..deploy:updating 10s
I, [2017-04-27T11:40:30.599526 #11159]  INFO -- : ..bundler:install 11s
I, [2017-04-27T11:40:30.599583 #11159]  INFO -- : ..deploy:updated
I, [2017-04-27T11:40:30.599615 #11159]  INFO -- : ....deploy:compile_assets
I, [2017-04-27T11:40:30.599768 #11159]  INFO -- : ......deploy:assets:precompile 1s
I, [2017-04-27T11:40:30.599914 #11159]  INFO -- : ......deploy:assets:backup_manifest 0s
I, [2017-04-27T11:40:30.600134 #11159]  INFO -- : ....deploy:compile_assets 2s
I, [2017-04-27T11:40:30.600464 #11159]  INFO -- : ....deploy:normalize_assets 0s
I, [2017-04-27T11:40:30.600650 #11159]  INFO -- : ....deploy:migrate 13s
I, [2017-04-27T11:40:30.600811 #11159]  INFO -- : ..deploy:updated 16s
I, [2017-04-27T11:40:30.600849 #11159]  INFO -- : ..deploy:publishing
I, [2017-04-27T11:40:30.600986 #11159]  INFO -- : ....deploy:symlink:release 0s
I, [2017-04-27T11:40:30.601213 #11159]  INFO -- : ..deploy:publishing 0s
I, [2017-04-27T11:40:30.601337 #11159]  INFO -- : ..deploy:restart 6s
I, [2017-04-27T11:40:30.601472 #11159]  INFO -- : ..deploy:published 0s
I, [2017-04-27T11:40:30.601510 #11159]  INFO -- : ..deploy:finishing
I, [2017-04-27T11:40:30.601627 #11159]  INFO -- : ....deploy:cleanup 0s
I, [2017-04-27T11:40:30.601749 #11159]  INFO -- : ....honeybadger:env 0s
I, [2017-04-27T11:40:30.601916 #11159]  INFO -- : ....honeybadger:deploy 4s
I, [2017-04-27T11:40:30.602019 #11159]  INFO -- : ..deploy:finishing 5s
I, [2017-04-27T11:40:30.602055 #11159]  INFO -- : ..deploy:finished
I, [2017-04-27T11:40:30.602157 #11159]  INFO -- : ....deploy:log_revision 0s
I, [2017-04-27T11:40:30.602324 #11159]  INFO -- : ..deploy:finished 0s
I, [2017-04-27T11:40:30.602425 #11159]  INFO -- : deploy 52s
I, [2017-04-27T11:40:30.602456 #11159]  INFO -- : ============================================================

Ok, it was probably your assets. Now what?

Skip them a lot of the time!

Tons of shipments don't even have asset changes, so why bother compiling again when we can reuse the previously compiled assets? There's a gem for that! capistrano-faster-assets.
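
Wiring it in is a two-line change; a sketch (check the gem's README for the exact require path for your version):

```ruby
# Gemfile
gem 'capistrano-faster-assets', require: false

# Capfile -- load after capistrano/rails so it can hook the asset tasks
require 'capistrano/faster_assets'
```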

At Avvo this saves about 15 minutes of deploy time for both staging and production deployments, for a total savings of 30 minutes per no-assets-changed shipment.

Remove some?

We took out all of the images in our biggest project and moved them to S3. They are easier to update there anyway (it doesn't require a developer), and this sped up asset compilation by several minutes.

Speed them up!

Sometimes the brute-force option is the best option. If it's been a while since you've dug into your asset pipeline, you might have missed two big improvements to asset compilation: SassC and MiniRacer.

Better V8 with MiniRacer

MiniRacer is an alternative to therubyracer, and does more or less the same thing: it connects to V8 to run JavaScript build tools. I've heard MiniRacer can help speed up JavaScript minification, but we didn't see a huge improvement in our own performance testing.

What we did notice, though, is that MiniRacer is less tied to a specific version of V8. With therubyracer, we had a tough time keeping our V8 libraries in sync in each of our environments and across different operating systems. It was the gem that failed to install most frequently.

MiniRacer tries to always support the newest version of V8, which is much easier to find and install. And the change was easy:

-  gem 'therubyracer'
-  gem 'libv8', ''
+  gem 'mini_racer'

Faster Sass with SassC

Sass was originally written in Ruby. To speed it up, a group of people wrote a C++ implementation of Sass, called libsass. Libsass needs some help before you can use it in your app, though, and that's where SassC can help you out.

SassC is a more-or-less drop-in replacement for Sass:

-gem 'sass-rails', '5.0.3'
+gem 'sassc-rails', '~> 1.3.0'

After making that change, we saw a 230% speedup in asset compilation time. Pretty great for less than an hour's worth of work.

Where to start when fixing tests

My test suite isn't horrible, but it isn't great either...

You have a test suite that runs decently well, but you have some transient failures and the suite has been taking progressively longer as time goes on. At some point, you realize that you are spending the first 15 minutes of a deploy crossing your fingers hoping the tests pass, and the next hour re-running the suite to get the tests to pass "transiently". You have a problem that should be addressed, but where do you start?

Generally, you should start with stabilizing your suite. Consistently passing in 1 hour is a better situation than having to run a test suite 2-3 times at 45 minutes each.
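
To make that trade-off concrete: if each run passes independently with probability p, you expect about runtime / p of wall-clock time per green build (the mean of a geometric distribution). A quick sketch of that arithmetic:

```ruby
# Expected wall-clock minutes to get one green run, assuming each run
# passes independently with probability pass_rate (geometric distribution).
def expected_minutes(runtime_min, pass_rate)
  runtime_min / pass_rate
end

expected_minutes(60, 1.0) # stable suite: 60.0 minutes per green build
expected_minutes(45, 0.5) # flaky suite: 90.0 minutes per green build, on average
```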

How do you know which tests to tackle first?


Our testing stack: minitest, Capybara, PhantomJS, Poltergeist, and Jenkins or CircleCI.

We use Jenkins and CircleCI as part of our continuous integration process, which means they are on the critical path to deploying. If our tests pass quickly and consistently locally, but not in our CI environment, we still can't (or shouldn't) ship. "It works on my machine" is rarely a good enough defense. To solve our slow and flaky problem, we want to be sure we are looking at our test performance on servers in our deploy path.

How big of a problem do you have?

How often does your test suite pass? Are there particular suites within the project that fail more frequently? Jenkins and CircleCI can show you this history, but we couldn't find summary level data like, "this suite has passed 75% of the time in the last month".

How do you find your flakiest tests?

We couldn't find an easy way. You can have people document failing tests when they come across them, but manual processes are destined to fail.

How do you find your slowest tests?

There are a few gems that can help you identify your slowest tests locally, like minitest-perf, but we want to know how our tests perform in the continuous integration environment. Jenkins and CircleCI provide some of this data, but it is pretty limited.


We created JUnit Visualizer to help collect the data we want

Gathering test data

Jenkins and CircleCI support the JUnit.xml format, which includes test timing, test status, and number of assertions. With JUnit.xml, we can leverage an industry standard, and CircleCI maintains a gem, minitest-ci, that exports minitest data to the format. The gem can basically be dropped into an existing project using minitest. It creates an xml file per test file that is run, and saves it in the "test/reports" directory by default.
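
To give a feel for the format, here is a small Ruby sketch (not JUnit Visualizer itself; the sample XML is made up) that pulls timing and failure status out of a JUnit.xml report:

```ruby
require 'rexml/document'

# Mine a JUnit.xml report for per-test timing and failure status.
# The sample document below is fabricated for illustration.
xml = <<~XML
  <testsuite>
    <testcase name="test_login" time="42.1"/>
    <testcase name="test_search" time="0.3">
      <failure message="boom"/>
    </testcase>
  </testsuite>
XML

doc = REXML::Document.new(xml)
cases = REXML::XPath.match(doc, '//testcase').map do |tc|
  { name:   tc.attributes['name'],
    time:   tc.attributes['time'].to_f,
    failed: !tc.elements['failure'].nil? }
end

slowest  = cases.max_by { |c| c[:time] }   # the test to optimize first
failures = cases.select { |c| c[:failed] } # candidates for the unstable list
```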

To standardize our integration with Jenkins and CircleCI, we push the xml files to S3, using a directory per build. We use the following to accomplish pushing to S3:

S3 upload configuration

Displaying test data

The main categories of test data we want to view:

  1. Historical information that shows how frequently our tests pass or fail. This is broken down by suites if we have more than 1 suite within a project. This is helpful in focusing our attention to the worst offenders.
  2. Single list of failures that shows all of the test failures, across suites, on one page. This is a convenient way to see all of the failures without having to click into the details of each suite.
  3. Unstable tests list that shows which tests fail the most. This allows us to see our "transient" test failures, as well as identify areas of our code that may be fragile. This provides guidance on where to start fixing tests.
  4. Slowest tests list that shows which tests are taking the most time. There is no point in speeding up a test that takes 1 second, if you have a test that is taking 45 seconds.
  5. Duration trends that show how your test duration is changing over time. It is helpful to see that we are making progress.

For screenshots of how these look in JUnit Visualizer, check out the section at the bottom of this post.

Next Steps

We have made great progress on our stability and speed since starting on JUnit Visualizer; how we addressed the test issues is chronicled here.

Some potential next steps for JUnit Visualizer:

  • Enhance the trend charts to account for outliers
  • Be able to reset the unstable test list when we think we have fixed an unstable test

Check out the code for JUnit Visualizer here:

Screen shots

Historical Information

We wanted to show how often our tests pass, broken down by project and the suite within the project.

Project view with suites

Single list of failures

We wanted a better summary view of the tests that failed. In Jenkins v1, you can only see the failures within a suite, which means there is a lot of clicking around.

View of the errors for a specific build, where skips and errors are on top

Failures across suites

Unstable tests

We wanted to find the tests that failed most frequently.

View of the tests, with most frequent failures on top

Unstable Tests

Slowest tests

The tests that take the most time, sorted slowest to fastest.

View of the tests for a specific build, slowest test at the top.

Slowest Tests

Duration trends

As we started to fix slow tests, we wanted to be able to see how our test duration changed over time.

There is a simple graph that shows the duration (in seconds)

Test duration over time

Performance and stability in capybara tests

If you've got flaky or very slow UI tests this is the post for you. Do any of these problems sound familiar?

  • Unexplainable exceptions during tests

  • Capybara is timing out

  • Capybara cannot find elements on the page that are clearly present

  • I hate writing tests this is awful please send help

  • Tests take forever to do things that are fast manually

  • Order of tests is affecting stability

  • PhantomJS raises DeadClient exception

  Capybara::Poltergeist::DeadClient: PhantomJS client died while processing
  • None of this is consistently reproducible, if at all

  • Existential dread

Most of the specifics discussed here will be about Rails, minitest, Capybara, Poltergeist, and PhantomJS. This is a common stack, but the principles here are useful elsewhere.

A test that tests the right thing correctly is the first priority in writing a test. I can't help you with getting the test right, but after that comes stability, then performance. We created a gem that includes most of the things we're going to cover here, and most of the code snippets are taken directly from this gem.


intransient_capybara is a Rails gem that combines all of the ideas presented here (and more). By inheriting from IntransientCapybaraTest in your integration tests, you can write Capybara tests that are far, far less flaky. The README explains how to use it and exactly what it does.

The goals of intransient_capybara are debuggability; correctly configuring and using minitest, Capybara, Poltergeist, and PhantomJS; and improving on those tools where there are gaps (most notably with the genius rack request blocker). It combines a ton of helpful stuff out there into a gem that will take you 10 minutes to set up.

Test stability

Test stability is monstrously difficult to nail down. Flaky tests come from race conditions, test ordering, framework failures, and obscure app-specific issues like class variable usage and setup/teardown hygiene. Almost nothing is reproducible. We can all stop writing tests, or we can try to understand these core issues a little bit and at least alleviate this pain.

Use one type of setup and teardown. Tests can use both setup do blocks and def setup methods, and it matters which you pick, because it affects the order in which things are called. I recommend always using def setup and def teardown in all tests, because when you have to manually call super, you can choose to run the parent method before or after your own. The example below shows the two options.

class MyTest < MyCapybaraBaseClass
  # Option 1
  def setup
    # I can do my setup stuff here, before MyCapybaraBaseClass's setup method
    super # You MUST call this
    # ... or I can call it after
  end

  # Option 2
  setup do
    # I do not have to call super because I am not overriding the parent method...
    # but am I before or after MyCapybaraBaseClass's setup method??
  end
end
Use setup and teardown correctly. Your setup and teardown methods will invariably contain critical test pre- and post-conditions. They must be called. It is very easy to override one or both in a specific test and forget to call super. This creates frustrating issues and is very hard to track down. Fix these in your app, and add some code to raise exceptions if you haven't called these methods in the base test class. intransient_capybara does this for you.
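
A framework-free sketch of that guard: the base class records that its setup ran, and a check raises when an override forgot super (intransient_capybara wires the equivalent check into the test lifecycle for you):

```ruby
# The base class flags that its setup ran; verify_setup_called! raises
# when a subclass overrode setup without calling super.
class BaseTest
  def setup
    @base_setup_ran = true
  end

  def verify_setup_called!
    raise 'setup was overridden without calling super!' unless @base_setup_ran
  end
end

class GoodTest < BaseTest
  def setup
    super # the critical call
  end
end

class BadTest < BaseTest
  def setup
    # forgot super -- verify_setup_called! will catch this
  end
end
```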

Warm up your asset cache. The very first integration test fails transiently a lot? That is suspicious. Gitlab had the same problem. Use a solution like theirs to warm up your asset cache before trying to run integration tests. intransient_capybara does this for you. Wow!

Wait on requests before moving on. Tests can leave AJAX requests in flight even if you don't have "hanging" queries at the end of a test, and these create two issues. First, a lingering request might be missing stuff it needs in order to complete successfully: you were diligent, put all the right cleanup in teardown, and called it correctly, so the mocks that request relied on are already gone. Now you get obscure errors like "missing mock for blahblah" in the next, completely unrelated test! Second, these requests use up your test server's likely sole connection and produce even more obscure errors:

Capybara::Poltergeist::StatusFailError - Request to <test server URL> failed to reach server, check DNS and/or server status

You can use wait methods and those can be very helpful inside of a test, but the best way is to absolutely ensure you are done with all requests in between tests. Rack request blocker is THE way to do this. It is just awesome. Can't get enough of it. intransient_capybara includes rack request blocker.

Do not have "hanging" requests at the end of a test. If you have a test that ends with click_on X or visit ABC this request is going to hang around, potentially into the next test and interfere with it. Don't do this - it is pointless! If it is worth doing, it is worth testing that it worked. If not, change it to assert the ability to do this instead of doing it (checking presence of link vs. clicking it for example). This is less important using intransient_capybara because it always waits for the previous test's requests before moving on.

Save yourself a headache. Try hard to solve all transient test problems. You'll still get them from time to time, though. If you've got a tool to tell you what they are, you don't need them to fail your test run for you to fix these things. Most likely you re-run tests and move on anyways, so why re-run the whole set of tests when you can automatically retry failed tests? You can use something like Minitest::Retry for this. Retrying failed tests is far from ideal, but so is having to re-run tests when you're trying to ship something. intransient_capybara has this included and has options for configuring or disabling this behavior.
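
Enabling it is a one-liner in your test helper; a sketch using Minitest::Retry's use! hook:

```ruby
# test_helper.rb
require 'minitest/retry'

# Re-run a failed test up to 3 times before reporting it as a failure
Minitest::Retry.use!(retry_count: 3)
```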

Test performance

After stability, improving test performance is the next most important thing. There are a ton of things that are easy to do that make tests slow.

Look at your helpers. You have helpers for your tests. They log you in, they assert you have common headers, and all sorts of things. One of these is probably very slow and you haven't noticed. We were logging in nearly every test using the UI, and stubbing that method call instead of actually logging in cut test time in multiple projects anywhere from 40-90%.

Don't use assert !has_selector?. This will wait for the full timeout (Capybara.default_max_wait_time) to complete. If you're expecting a selector, use assert has_selector?. If you aren't, use assert has_no_selector?. Learn more from codeship.

Avoid external resources. This is mostly about performance, but is also an important stability improvement. It can help you avoid this:

Capybara::Poltergeist::StatusFailError - Request to <test server URL> failed to reach server, check DNS and/or server status - Timed out with the following resources still waiting for <some external URL>

Almost everyone is susceptible to hitting external stuff in tests. You might be loading jQuery from a CDN, or have javascript on your checkout page that queries a payment provider with a test key. These can timeout, be rate limited, and are properly tested in higher level system integration tests (acceptance testing). You should track these down and eliminate them. The code below can be included in the teardown method of your tests to help you debug your own network traffic. This method is included by default in intransient_capybara.

    def report_traffic
      if ENV.fetch('DEBUG_TEST_TRAFFIC', false) == 'true'
        # Sum the response sizes Poltergeist recorded for this test
        bytes_downloaded = page.driver.network_traffic.flat_map(&:response_parts).map(&:body_size).compact.sum
        puts "Downloaded #{bytes_downloaded / 1.megabyte} megabytes"
        puts "Processed #{page.driver.network_traffic.size} network requests"

        # Split the traffic into local requests vs. external ones
        grouped_urls = page.driver.network_traffic.map(&:url).group_by { |url| /\Ahttps?:\/\/(?:.*\.)?(?:localhost|127\.0\.0\.1)/.match(url).present? }
        internal_urls = grouped_urls[true]
        external_urls = grouped_urls[false]

        if internal_urls.present?
          puts "Local URLs queried: #{internal_urls}"
        end

        if external_urls.present?
          puts "External URLs queried: #{external_urls}"

          if ENV.fetch('DEBUG_TEST_TRAFFIC_RAISE_EXTERNAL', false) == 'true'
            raise "Queried external URLs!  This will be slow! #{external_urls}"
          end
        end
      end
    end
Don’t repeat yourself. Lots of tests have overlap - try to test one thing. Tests with a copy/paste start pattern like visit X, click_on ABC are not required. One test can visit X and click_on ABC, and all the others can skip to that page that comes after clicking on ABC. This saves a lot of time - probably 10-20 seconds every time such a pattern is factored out.

Don’t revisit links. Try to assert links, but if you click them you pay a cost. Like the last point, let some other test assert that the page loads and has stuff correct, and it can pay that visit once only over there. assert has_link? instead of click_on link.

Don't use sleep. Sleeps are either too long or too short. Writing sleep 5 might make you look cool to your friends but it is damaging to your health and should be avoided. Don't get peer pressured into sleeps in your tests. You can assert_text to make it wait for the page to load or write a simple helper method wait_for_page_load!

  def wait_for_page_load!
    page.document.synchronize do
      # Retries (up to Capybara's wait time) until the DOM reports it is ready
      raise Capybara::ExpectationNotMet unless page.evaluate_script('document.readyState') == 'complete'
    end
  end
You can wait for ajax too. thoughtbot solved this. intransient_capybara includes these methods and uses them in teardown for you already, and makes them available for you to use inside of your own tests.

Avoid visit in your setup method. If you write visit in a setup method in a file that has a bunch of tests, you did something dangerous. One of our tests was visiting 3 pages before visiting more pages in the test itself. Try to break down what you want to test with regards to visiting pages so you can minimize this. Every visit call will be 2-10 seconds long, and it is easy to have pointless visits go unnoticed.

Delete all your skipped tests. We had so many it affected performance, and there was no point to them. Fix or create these stubbed tests today or just delete them.

Parallelize! By breaking your tests down into suites, you can run them in parallel much more easily. You can have parallel test harnesses run SUITE=blah rake test. The matrix configuration in Jenkins makes this a lot easier. If you use something hosted like CircleCI, they can often run things in parallel even without creating suites (allowing you to specify directories per parallel container to be executed). You can try to balance out the tests run in each parallel container to get the fastest times. Our acceptance tests were almost twice as fast after less than an hour of parallelization work, and optimized parallelization with only 3 containers cut our most important project's test time by more than half (this again took less than a dev day; it is home-run-level stuff).
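
One way to get SUITE=blah rake test working is to scope the test-file pattern by an environment variable; a sketch (the test/<suite>/ directory layout is an assumption):

```ruby
require 'rake/testtask'

# Define a minitest rake task whose file pattern is scoped by SUITE,
# so each parallel container runs only its own suite's directory.
def suite_test_task(name = :test)
  Rake::TestTask.new(name) do |t|
    suite = ENV['SUITE']
    t.pattern = suite ? "test/#{suite}/**/*_test.rb" : "test/**/*_test.rb"
  end
end

# SUITE=integration rake test would then only run files under test/integration/
```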

Find your slowest tests. You need to find or create a tool that can monitor your performance over many test runs and highlight the slowest tests so you can tackle the problems in a targeted way. Once you've dealt with systemic problems, you're left optimizing test by test. There are gems that help you output test performance, such as minitest-ci and minitest-perf.


We're not perfect yet, but these tips and intransient_capybara have reduced the rate of transient failures in our tests from a whopping 40-50% of all test runs to virtually none (<1%). It only takes one failed transient test to fail the whole run, so things have to be really stable for it to start passing consistently. The performance has gone from more than one hour to about 16 minutes in CircleCI (and that is not the best it can be). Acceptance tests have gone from 15 minutes with around a 25% transient failure rate in pre-production environments, and a low transient rate but 10 minutes in production, to 2.5 minutes in pre-production and 1.5 minutes in production, with a test-caused transient rate of near 0 (transients today are due to pre-production environmental issues, not the tests or their framework).

Back in 60 seconds: Reprovisioning a server in about a minute

Avvo is a growing company. Like other growing companies, we started with a small server footprint that is now expanding. As our server environment grew, we found that installing the OS (Ubuntu) was slow and took a lot of time and effort.

The Pain

So everyone can really understand how painful our server provisioning process was, let me describe it to you.

First, we booted the host to an installation disk (iso image). That part in itself is difficult enough if you don't have a server provisioning stack, and your servers are in a remote datacenter. We typically used onboard management cards, or a network attached KVM in the datacenter to mount the iso as virtual media and get the process started. After the host booted the ISO, we manually entered all of the details such as partition layout, network info, packages to install, etc. Then we waited for the OS install to complete, at which point we logged in via ssh and ran chef on it to do the remaining configuration and install more software. In addition to that install process, we also had to manually assign an IP, then add that IP to DNS (bind).

There are a lot of options out there to solve this problem. There are full-blown server provisioning stacks that can handle much of the work. They're designed to automate the server provisioning process, and make it more consistent, more reliable, etc. We evaluated many of them, including MAAS, Cobbler, OpenCrowbar, and Foreman. In general, we didn't actually dislike any of them, but none of them fit us quite right, for various reasons.

My Little Lie

Now that I've described the problem to you, let me now get something off my chest: I've lied to you. We don't actually fully reprovision a server in a minute. It's currently 88 seconds. But after a hard drive wipe, our OS boots up and is ssh-able in 37 seconds. The remaining 51 seconds is disk formatting, configuration, and software installation. By the time the host is ssh-able it has received an IP, and both forward and reverse DNS entries have been created automatically. But since Wikipedia defines software installation as part of the server provisioning process (and really, it is) I guess I lied to you.

To be fair, I'm certain that if we moved all of the software to a local repository instead of downloading it from the internet, we could get the entire process down to less than a minute. Shaving those extra 28 seconds off of the time didn't seem as important after we reduced server provisioning from a multiple hour manual process, to an 88 second automated process. Think about it, in the time it takes you to get through a standard TV commercial break, your server could be reformatted and running something completely new.

What's funny about this whole thing? Super fast server provisioning wasn't even our end-goal. Our main goals were just to build out a cluster, in an automated and maintainable way. Being able to re-provision a host rapidly is just a nice side effect of the design we chose.

The Devil in the Details

TL;DR I'll explain the little details that make this so fast:

  • I benchmarked on a VM. Some of our bare-metal hosts take more than 88 seconds just to POST and initialize firmware. VMs conveniently skip that mess.
  • We're using a cloud-config file to do most of the configuration. Chef, Ansible, Salt, Puppet, et al. are great, but for the simple stuff, cloud-init is faster.
  • Our software installation process actually boils down to a simple 'docker run' from the cloud-config file, and our container orchestration system.
  • Our network is reasonably fast (dual 10Gbit to each physical VM host)

And the last sorta-kinda little detail:

  • We don't actually install the OS on the drive. We're pxe-booting a live image, and the OS is only using the drive for data persistence. The OS does format the drive (if needed) on bootup, and store any applicable files to disk (including the 'software' mentioned above). We are still using persistent storage, and it's still part of our server model. We just don't use it for storing or running the OS (in this cluster anyway).

NOW I get it!

So you might be saying "Oh, well no wonder you get such great times. You're just pxe-booting a live image, in a VM, on a fast network. You're not doing much software installation, and your configuration is just a simple cloud-config."

And to you, unruly naysayer, I would say: "Yep, that's right!"

Why in the World Would You Do That in Production?

There are many benefits to this approach. But here are the highlights:

  • It's easier to maintain than a full server provisioning stack
  • It's fast. As in, boot-up is fast, and the OS binaries run from RAM instead of sloooooow magnetic drives.
  • It's as reliable as our old-school hand-crafted artisanal Ubuntu installs
  • We get better usage from our disk space (we don't install gigs of OS binaries and packages)

There's a ton of other reasons we're doing this, including all of the benefits of embracing the microservices revolution, such as easy software builds, reliable testing, and simple deployments.

But, aren't there a LOT of Drawbacks?

Ok, admittedly there are downsides to this approach:

Configuration Can't be Complex

Any configuration we have to do must be covered in the scope of cloud-config, which for our OS (RancherOS), is surprisingly limited. We aren't running chef, ansible, puppet, or any other major configuration management service on these hosts. We could, but that would kind of defeat the purpose of keeping these hosts as lightweight and disposable as possible.
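To give a feel for that limited scope, a minimal cloud-config for this kind of host might look something like the following. This is a sketch with placeholder names, not our production config, and it assumes RancherOS's `rancher.services` section for declaring boot-time containers:

```yaml
#cloud-config
hostname: node-01
rancher:
  services:
    # services are declared in docker-compose syntax and started at boot
    myapp:
      image: example/myapp:latest
      restart: always
```

That's roughly the whole configuration surface: a hostname, and the containers to run.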

You might notice I said RancherOS. If you're not familiar with that particular flavor of Linux, give it a try. Like CoreOS, it basically just runs docker and doesn't come with all of the cruft you get from a full-blown server OS. The kernel image we're using clocks in at 3.7M, and the initrd is 25M. An OS footprint of 28.7 megabytes explains why that bootup process is so fast.

We're using stock images from RancherOS though. So it's not like we have any overhead in maintaining the images. If they release a new image, we try it out, and if it works for us then we use it. Since we're using stock images, and cloud-config, a full OS upgrade is literally 88 seconds away. Trying out a new OS version is similarly fast.

As an aside, I consider complex configuration the wrong way to go anyway. If you have complex configuration management, that means you need to manage your configuration management. Some people like that, but I like to keep things simple and work on the important stuff, like keeping our website healthy. So really, enforcing simple configuration is actually a bonus!

If for some reason we find that we really need more complex configuration, we'll probably move that into the docker images. The more we use docker images, the less need we have for a complex configuration management system. Why deal with configuration management when you can just define the exact state of your docker images in a Dockerfile? I guess if you don't want to maintain lots of Dockerfiles (and their associated images), then you could maintain it in a tool like chef. I don't know that using chef to build or configure your docker images would buy you much in the complexity department though. Instead of maintaining Dockerfiles, you end up maintaining recipes, cookbooks, databags, roles, nodes, and environments.
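As an illustration of that trade-off, baking configuration directly into an image can be as simple as a couple of lines in a Dockerfile (a hypothetical example; the base image and file names are placeholders):

```dockerfile
# The exact state of the service lives in the image itself,
# instead of being assembled by a configuration management tool.
FROM nginx:1.11
COPY nginx.conf /etc/nginx/nginx.conf
COPY site/ /usr/share/nginx/html/
```

Rebuilding the image *is* the configuration change, which is a much smaller thing to reason about than a fleet of recipes and roles.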

Software Options are Limited

We're limited to just using software that can be run in a container. That's a lot of software, actually, but anything proprietary will need to be packaged up in a container before we can run it on these hosts. There are also a lot of positive side effects from working with containers at the OS level and cutting out the cruft of traditional package management. As one example, apt and yum both do a great job of building out a dependency chain and pulling it in during an install. However, they introduce their own issues with package conflicts and silly dependency chains that are difficult to work through. With docker images, the dependencies are in the image. Package conflicts effectively go away.

I should take a moment to mention the software security aspect here. Modern OS package management systems (yum/apt/etc.) have grown to support package signatures, trusted repos and maintainers, and so on. Contrast that with downloading images from Docker Hub, where an image isn't necessarily signed or maintained by a trusted person or group. Limiting which images can be downloaded from Docker Hub, and/or using a trusted registry, helps improve the security story. For the time being, though, this is one area where yum and apt have an advantage.

Data Persistence is Still Iffy

Our most difficult challenge so far is figuring out how to reliably dockerize our SQL databases. Some people will be quick to say "well, there's a container for mysql, and a container for postgres, and a container for ..." But hold on there, cowboy: if you put your entire database in a container, where does that data go? If you store it in the container itself, that data goes away when the container is destroyed. If you bind-mount a volume into your container, then your SQL container is joined at the hip to the host which originally ran it. Using a "data container" and linking it to the SQL container is a popular solution, but it has the same problem of being stuck on the host it started on.

We don't want any container to be stuck to a single host. We're aiming for lightweight and disposable hosts here. Fewer pets, more cattle. If a host stores some mission-critical database, then it's no longer disposable. For that reason, we treat all on-host storage as volatile, and plan around the possibility of it being destroyed at any time without warning. Traditional approaches for SQL data reliability include backups and slave DB servers, but translating those concepts to containers comes with a new set of complexity and problems.

One of Docker's approaches to solving that problem is its support for storage drivers, and we're currently looking into both Flocker and Rancher's Convoy, two popular options. We've been discussing other ideas to solve SQL data persistence, some of them more wildly experimental than others, such as an off-cluster "super-slave" for all database containers, or sending binlogs to Kafka, but so far we haven't found a silver bullet here.
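For context on how a storage driver slots in, Docker Compose v2 lets you declare a named volume with a pluggable driver. The sketch below uses illustrative names; in practice you'd swap the `driver` value for whichever driver (Convoy, Flocker, etc.) you've installed:

```yaml
version: '2'
services:
  mysql:
    image: mysql:5.6
    volumes:
      - mysql_data:/var/lib/mysql
volumes:
  mysql_data:
    driver: local   # swap in a driver like convoy or flocker here
```

The appeal is that the container only knows the volume name; where the bytes actually live becomes the driver's problem, which is what decouples the database from a specific host.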

As an aside, the data persistence problem is more easily solved for companies that have an enterprise-grade SAN, which we don't have (yet).

We Have to Maintain a Custom Provisioning Stack

There are a lot of moving pieces in a server provisioning stack, and we have to maintain them: a tftp server, a web server, a dhcp server, a dns server ... and I'm sure I'm forgetting some others. In our case, that's all maintained in Chef. I didn't say we have NO complex configuration anywhere; we just keep it away from our Rancher cluster and maintain it in Chef. Try to imagine a picture of the Dos Equis guy here: "I don't always have complex configuration, but when I do, I use Chef". Our server provisioning stack isn't really that complicated anyway. These are all standard services, and we're not configuring them in any off-the-wall ways. The most complicated part is actually how we generate cloud-config files. We created a quick CGI script that simply calls out to consul-template to generate cloud-config files on the fly. Any host-specific configuration (hostname, environment, etc.) is stored in our consul cluster.
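The CGI approach is simpler than it sounds. Here's a stripped-down sketch of the idea; it's hypothetical (our real script shells out to consul-template and pulls per-host values from Consul), but it shows the shape of a CGI endpoint that emits a cloud-config on the fly:

```shell
#!/bin/sh
# Minimal CGI-style cloud-config generator (sketch only).
# In production the hostname and other values come from Consul via
# consul-template; here we just take the host as an argument.
render_cloud_config() {
  host="$1"
  printf 'Content-Type: text/yaml\n\n'
  printf '#cloud-config\nhostname: %s\n' "$host"
}

render_cloud_config "node-01"
```

A PXE-booted host hits this endpoint during cloud-init, and gets back a cloud-config tailored to it.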


There are a lot of advantages and disadvantages to the cluster we've built. I highlighted the speed of reprovisioning as the topic for this article, but only because it's an interesting datapoint, not because it's important to our use case. Hosts in our Rancher cluster are so disposable now that even if server provisioning took 30 minutes instead of 88 seconds, I don't think we'd notice. If a disposable host dies without any impact to your services, do you really care whether it took 88 seconds or 30 minutes to replace it? Something we take for granted is that building a docker cluster enabled us to focus less on server maintenance and more on other issues that needed our attention. Using docker at the OS level and treating hosts as disposable moved us to a more stable and maintainable platform overall, and maybe that's a topic worth discussing all on its own.

Bootstrapping with Docker in a Non Docker World

As our company transitions from an older Chef/Capistrano-based deployment model and delves into Docker, those of us not familiar with Docker (myself included) have had to learn a lot to keep up. Taking 20+ legacy projects, Docker-izing them all, standardizing the deployment model, and getting all of the apps to play nicely together in this new paradigm is no small undertaking. However, even if you or your company aren't quite ready to fully commit to Docker, there's no reason you can't start using Docker today for your own development work, both to make your life easier and to give you some insight into how powerful a tool Docker can be.

A Brief Intro to Docker

For anyone not overly familiar with Docker, a brief introduction is in order.

Docker is a lot like a virtual machine, but without the overhead of having to virtualize all the basic system functionality. Any Docker application you run is granted more or less direct access to the system resources of the host computer, making it significantly faster to run and allowing for much smaller image sizes, since you don't have to duplicate all the OS stuff.

In practice, when you boot up something in Docker, you'll start with an image you either created yourself or downloaded off the internet. This image is basically a snapshot of what the application and surrounding system looked like at a given point in time. Docker takes this image and starts up a process that they call a container, using the image as the starting point. Once the container is running, you can treat it just like any normal server and application. You can modify the file system inside the container, access the running application, edit config files, start and stop the processes that are running... anything you'd do with a normal application. Then, at any point, you can discard the entire container and start a new container with the original image again.

These independent and disposable containers are at the heart of what makes Docker such a powerful tool. In production, this allows you to scale your system rapidly, as well as reducing the burden of configuring new hosts, since the majority of your application-specific configuration will now be stored inside your container. In this manner, Docker images can conceivably be run on any system capable of running Docker, without any specific per-application setup involved. Even if your company's applications aren't yet running on Docker, you can still leverage these traits to make your development environment trivially easy to set up.

Using Docker Compose to Bootstrap Your Computer

Setting up your workspace for the first time can be fairly tedious, depending on the number of services your application needs to have running in order to work. A simple Rails app could easily have several such dependencies, just to respond to simple requests. Most of our applications at Avvo require things like Redis, Memcached and MySQL... and that's before we even get into anything unusual that an application might require. When you jump into working on an application that you haven't touched before, it can sometimes take the better part of a day just to get the app to boot up locally. Luckily for us, Docker can help to greatly reduce this burden, with a little bit of help from Docker Compose.

While Docker itself gives us a great starting point for building and running images, starting up and configuring containers manually can be a little tricky. Docker Compose provides an easy and much more readable way to configure and run your containers. We can set up Docker Compose for those three services listed above, by creating a docker-compose.yml file like so:

    version: '2'
    services:
      mysql:
        image: mysql:5.6
        ports:
          - "3306:3306"
        environment:
          MYSQL_ROOT_PASSWORD: supersecretpassword

      memcached:
        image: memcached:latest
        ports:
          - "11211:11211"

      redis:
        image: redis:latest
        ports:
          - "6379:6379"

Even if you're not all that familiar with Docker Compose, the above file is fairly self-explanatory.

  • We declare three services: mysql, memcached, and redis.
  • We tell Docker Compose to use the Docker images of the corresponding names for these services.
  • We declare port numbers for each service that we want to be able to access from the host machine, so that we can reach each service from outside its container.
  • We apply some small configuration settings via environment variables, such as MYSQL_ROOT_PASSWORD.

To start these services, you just need to give the "up" command to Docker Compose from the same directory as the docker-compose.yml file above:

   ~/workspace:> docker-compose up
   Pulling redis (redis:latest)...
   latest: Pulling from library/redis
   357ea8c3d80b: Pull complete
   7a9b1293eb21: Pull complete
   f306a5223db9: Pull complete
   18f7595fe693: Pull complete
   9e5327c259f9: Pull complete
   72669c48ab1f: Pull complete
   895c6b98a975: Pull complete
   Digest: sha256:82bb381627519709f458e1dd2d4ba36d61244368baf186615ab733f02363e211
   Status: Downloaded newer image for redis:latest
   Pulling memcached (memcached:latest)...
   latest: Pulling from library/memcached
   357ea8c3d80b: Already exists
   1ef673e51c1f: Pull complete
   5dfcd2189a7d: Pull complete
   32d0f07db7eb: Pull complete
   fced47673b60: Pull complete
   e7d3555f9ff2: Pull complete
   Digest: sha256:58f4d4aa5d9164516d8a51ba45577ba2df2a939a03e43b17cd2cb8b6d10e2e02
   Status: Downloaded newer image for memcached:latest
   Pulling mysql (mysql:5.6)...
   5.6: Pulling from library/mysql
   357ea8c3d80b: Already exists
   256a92f57ae8: Pull complete
   d5ee0325fe91: Pull complete
   a15deb03758b: Pull complete
   7b8a8ccc8d50: Pull complete
   1a40eeae36e9: Pull complete
   4a09128b6a34: Pull complete
   587b9302fad1: Pull complete
   c0c47ca2042a: Pull complete
   588a9948578d: Pull complete
   fd646c55baaa: Pull complete
   Digest: sha256:270e24abb445e1741c99251753d66e7c49a514007ec1b65b47f332055ef4a612
   Status: Downloaded newer image for mysql:5.6
   Creating redis
   Creating memcached
   Creating mysql
   Attaching to mysql, memcached, redis
   mysql        | Initializing database
   mysql        | 2016-08-30 22:51:47 0 [Note] /usr/sbin/mysqld (mysqld 5.6.32) starting as process 30 ...
   redis        | 1:C 30 Aug 22:51:48.345 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
   redis        | 1:M 30 Aug 22:51:48.346 * The server is now ready to accept connections on port 6379
   mysql        | 2016-08-30 22:51:48 30 [Note] InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0
   mysql        | 2016-08-30 22:51:55 1 [Note] mysqld: ready for connections.
   mysql        | Version: '5.6.32'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)

You should get output similar to that above, indicating that all three services are running. If you didn't already have the images stored in Docker locally, they should get downloaded automatically from Docker Hub. At this point, you should have fully usable services running locally, without having had to do any manual downloading, configuring, compiling, etc... Docker takes care of everything for you. You can even easily share these Docker Compose files around your company for easy bootstrapping of new workstations.

Another added benefit of running Dockerized services is that it eliminates a pet peeve of mine, where MySQL will unexpectedly get itself into a bad state and refuse to restart. If you're running MySQL natively, you're probably going to have to either do some surgery on the MySQL file system to remedy things, or else completely uninstall/reinstall MySQL to get it back into a working state, which can be pretty tedious and error prone. With Docker, you simply delete the container and the next time you boot it up, it'll create a new container from the original image. You'll still have to set up your DB tables and data again, but it's still far simpler and faster than reinstalling the native application.

Using Docker Compose for bootstrapping can also be great in cases where you have some obscure service that's required for an application. For example, one app that we use depends on Neo4j. What does it do? I have no idea, something to do with graphs I think. And I'm pretty sure the 'j' stands for Java. But assuming I'm not touching any of the graph stuff in the code that I need to work on, it would be really nice to not have to spend hours getting this thing running locally. Docker Compose makes this a cinch, even if the application that depends on Neo4j isn't yet Dockerized:

version: '2'
services:
  neo4j:
    image: neo4j:3.0
    ports:
      - 7474:7474
    environment:
      NEO4J_AUTH: none

Now just a call to "docker-compose up" from the same directory as the above Compose file and we have a running Neo4j instance that our app can use.

In Summary

Just about every commonly used service will already have a Docker image publicly available. Combining that with the power of Docker Compose can make complicated and tedious bootstrapping of your workstation or project a thing of the past. Even if your company hasn't fully committed to Docker or if the particular application you're working on isn't Dockerized, getting immediate benefits out of Docker is something that you can start enjoying today.

Using Environment Variables with Elixir and Docker

If you're trying to run your shiny new Elixir app in a docker container, you'll find a few new problems compared to running Ruby. In particular, you'll run into a few gotchas with environment variables.

The first problem

Elixir (being based on Erlang) is pretty cool, in that it gets compiled. The downside is that the environment variables you use get hardcoded at compile time. That prevents you from using the same binary (in a docker image) in your staging environment and production environment.

In our case, we're using Phoenix, so we've got this in our config/prod.exs (I moved this from the prod.secret.exs because with ENV it isn't secret!):

# config/prod.exs
config :FooApp, FooApp.Repo,
  adapter: Ecto.Adapters.MySQL,
  username: System.get_env("DB_USER"),
  password: System.get_env("DB_PASS"),
  database: "foo_app",
  hostname: System.get_env("DB_HOST"),
  pool_size: 20

In your Dockerfile you'll want to release with exrm:

RUN MIX_ENV=prod mix do deps.get, compile, release

But once you do that, your config has the DB_USER, etc., that was available when you built your docker image! That's not good. You'll see this when you try to run your image:

$ docker build -t foo .
$ docker run -e "PORT=4000" \
> -e "DB_USER=foo" \
> -e "DB_PASS=secret" \
> -e "DB_HOST=my.mysql.server" \
> foo ./rel/foo/bin/foo foreground
Exec: /rel/foo/erts-8.0/bin/erlexec -noshell -noinput +Bd -boot /rel/foo/releases/0.0.1/foo -mode embedded -config /rel/foo/running-config/sys.config -boot_var ERTS_LIB_DIR /rel/foo/erts-8.0/../lib -env ERL_LIBS /rel/foo/lib -pa /rel/foo/lib/foo-0.0.1/consolidated -args_file /rel/foo/running-config/vm.args -- foreground
Root: /rel/foo
17:27:20.429 [info] Running Foo.Endpoint with Cowboy using http://localhost:4000
17:27:20.435 [error] Mariaex.Protocol (#PID<0.1049.0>) failed to connect: ** (Mariaex.Error) tcp connect: nxdomain
17:27:20.435 [error] Mariaex.Protocol (#PID<0.1048.0>) failed to connect: ** (Mariaex.Error) tcp connect: nxdomain

You'll notice I was able to specify the HTTP port to use; that's a Phoenix-specific configuration option, not available for other mix configs.

The first solution

Luckily, Exrm has a solution for this:

  1. Specify your environment variables with "${ENV_VAR}"
  2. Run your release with RELX_REPLACE_OS_VARS=true

So we'll replace our Ecto config block with:

# config/prod.exs
config :FooApp, FooApp.Repo,
  adapter: Ecto.Adapters.MySQL,
  username: "${DB_USER}",
  password: "${DB_PASS}",
  database: "foo_app",
  hostname: "${DB_HOST}",
  pool_size: 20

And run our release:

$ docker build -t foo .
$ docker run -e "PORT=4000" \
> -e "DB_USER=foo" \
> -e "DB_PASS=secret" \
> -e "DB_HOST=my.mysql.server" \
> -e "RELX_REPLACE_OS_VARS=true" \
> foo ./rel/foo/bin/foo foreground

And we get no errors!

The second problem

But now you want to create and migrate your database with mix do ecto.create --quiet, ecto.migrate.

$ docker run --link mysql \
> -e "PORT=4000" \
> -e "DB_HOST=mysql" \
> -e "DB_USER=foo" \
> -e "DB_PASS=secret" \
> -e "MIX_ENV=prod" \
> foo mix do ecto.create --quiet, ecto.migrate
** (Mix) The database for Foo.Repo couldn't be created: ERROR 2005 (HY000): Unknown MySQL server host '${DB_HOST}' (110)

What's happening is that mix doesn't know to replace the "${ENV_VAR}" strings with our environment variables. It isn't being run through the Exrm release (the mix tasks aren't even compiled into the release).

The second solution

This is easy to fix; we just add back the System.get_env calls we started with:

# config/prod.exs
config :FooApp, FooApp.Repo,
  adapter: Ecto.Adapters.MySQL,
  username: System.get_env("DB_USER") || "${DB_USER}",
  password: System.get_env("DB_PASS") || "${DB_PASS}",
  database: "foo_app",
  hostname: System.get_env("DB_HOST") || "${DB_HOST}",
  pool_size: 20

This gives us both options: if a variable is set at "compile" time (e.g. when running a mix task), it's used directly; if not, the config falls back to the "${ENV_VAR}" placeholder, which gets replaced at run time when the release runs with RELX_REPLACE_OS_VARS=true.

TL;DR Summary

With Elixir, Phoenix, Exrm, and Docker, you can build a release and run the same binary in staging and production. To specify different runtime environments and still be able to run mix tasks like migrate, you need to combine two solutions.

Run your release with RELX_REPLACE_OS_VARS=true and define your config variables with System.get_env("ENV_VAR") || "${ENV_VAR}".


Metrics that don't suck

Metrics, from a developer's perspective, are simultaneously one of the most compelling and one of the most boring things you can ever be asked to look into. There's an old adage that goes something like, "If you want to improve something, measure it." There's a certain amount of logic behind that. Our instincts and best guesses can take us pretty far when trying to develop useful and robust applications, but there's nothing quite like hard data to show us the real shape of the world.

Data can give you a ton of insight into your application and there's a certain amount of OCD-like satisfaction with creating the perfect six-table join statement in SQL that exactly captures the metric you were trying to shake out. However, it is a very particular kind of person that's going to be willing to do this day-to-day. It's one thing to export a CSV from the database, filter the content down, create a pivot table or pareto chart and email it out every once in a while. But the monotony of having to do that once per day or once per week will drive any developer to madness.

What we really want is a programmatic way to find out this information and share it in an easily consumable manner. Yes, we can generate CSV reports that get mailed out daily or look at slides at the quarterly company meeting; those can provide a lot of value. But for our daily lives, what we really want are metrics that people are going to want to look at. We want metrics that motivate people. We want metrics that inform you of your app's daily successes and failures. We want metrics that let you know when something has gone terribly, terribly wrong with your application. And we want metrics that you can read and understand at a glance.

Enter Status Board

What is Status Board

Conveniently named, Status Board is an iOS application written by a company named Panic that allows you to create custom status boards for anything you'd like. All you need is a TV, an iPad along with the right cables to connect the two and you're in business. Lucky for me, thanks to some careless luggage handlers at the Westin in Portland, OR, I had a slightly maimed iPad that was no longer suitable for personal use that I could sacrifice to the cause.

For getting live data into the app, there are basically two approaches you can take:

  1. You have your application upload a file with the data you want to a Dropbox account, then have the Status Board periodically pull the data down to be displayed. This method... has simplicity going for it, but that's about it. To get it to work, you're going to have to write some kind of CSV export and then have something like a cronjob running to push the file all day long. It works, but you're going to be really limited in what you're going to be able to display in the Status Board app, since it only allows you to present CSV data in a couple different formats.

  2. Rendering a web page outside of the app that the iPad can connect to, which gives you access to what they call the DIY panel. Basically, you give the app an HTML endpoint and it will pull down that page and render it in the app, every 10 minutes by default. Now that you're dealing in straight HTML, you can render almost anything you'd like. Tables, graphs, maps, images... anything you can render from the given web app, you can put straight onto your board.

Nitty gritty

At Avvo, I was lucky enough that we had a Rails app that was accessible from inside the office network that also had access to all the data I could conceivably want to show. For example, as app developers we might like to know what our daily sales stats were for the last week, so we can write a simple html page that shows the count for each of the last 7 days (with the previous week's sales for that day in parentheses, for comparison).

Presenter code:

class StatsPresenter
    def date(days_back)
        days_back.days.ago.to_date
    end

    def day_of_week(days_back)
        return "Today" if days_back == 0
        date(days_back).strftime("%A")
    end

    def days_sales(days_back)
        Purchase.where(:created_at => days_back.days.ago.beginning_of_day..days_back.days.ago.end_of_day).count
    end
end

View code (written in Slim):

- stats_presenter = StatsPresenter.new

table
  - for days_back in 0..6
    tr
      td = stats_presenter.day_of_week(days_back)
      td = "#{stats_presenter.days_sales(days_back)} (#{stats_presenter.days_sales(days_back + 7)})"

This gives you a simple table that Status Board can render; you just have to create a new DIY panel in the app and point it at this page:

DIY configuration

And voila:

Week-over-week sales stats

Note: I found it necessary to tell the controller to skip layout rendering, since we had some standard headers/footers and such in our layouts that we didn't want rendered in each panel. So you may want to add this to your stats controller:

layout false

You can get pretty fancy with what you display. Let's take it one step further and use the Google Maps API to display some data. For my team's product, we roll out our product to certain states first, so we wanted a simple way to track state-by-state progress on how things were going. We decided to highlight each state with a specific color to indicate its status. A red state would indicate that there's a problem with the state and blue would indicate that the state is ready to go.

Embedding a Google map onto your status board is fairly trivial:

<div id="map"></div>

<script src=""></script>
<script type="text/javascript">
function initialize() {
    var mapCanvas = document.getElementById('map');
    var mapOptions = {
      center: new google.maps.LatLng(38.0000, -96.0000),
      zoom: 4,
      mapTypeId: google.maps.MapTypeId.ROADMAP,
      disableDefaultUI: true
    };
    var map = new google.maps.Map(mapCanvas, mapOptions);
}

google.maps.event.addDomListener(window, 'load', initialize);
</script>

This gets you a simple map of the United States.

Basic Google Map

Next, Google also allows you to overlay polygons of your own onto the map. You do this by passing in the latitude and longitude of each coordinate of the polygon, then Google strings them all together to get the desired shape in whatever color you specify. I found a rough listing of the necessary coordinates online and converted them into a yml file for the application to use. Now, you have to be able to show the polygon data in a format that is readable by the Google API. The simplest way I found was to create a separate XML endpoint that returns the polygons you want to display, along with their color, which can then be read via JavaScript and sent to Google.

Presenter code:

  class StatsPresenter
    def state_status
      @states = []

      State.all.each do |state|
        state_data = {}

        state_data["name"] = state.name
        state_data["points"] = state_polygons[state.name.downcase.gsub(" ", "_")]
        state_data["color"] = color_for_state(state)

        @states << state_data
      end

      @states
    end

    def state_polygons
      @state_polygons ||= YAML.load_file(Rails.root.join('db/', 'domain/', 'state_polygons.yml'))
    end

    def color_for_state(state)
      if state_functional? state
        return "#0000ff"
      else
        return "#ff0000"
      end
    end
  end
Along with an XML builder:

  xml.states do
    @stats_presenter.state_status.each do |state|
      xml.state(:name => state["name"], :color => state["color"]) do
        state["points"].each do |point|
          xml.point(:lat => point["lat"], :lng => point["lng"])
        end
      end
    end
  end
This will create the XML output you need for the state polygons:

<?xml version="1.0" encoding="UTF-8"?>
<states>
  <state name="Arizona" color="#0000ff">
    <point lat="36.9993" lng="-112.5989"/>
    <point lat="37.0004" lng="-110.8630"/>
    <point lat="37.0004" lng="-109.0475"/>
    <point lat="31.3325" lng="-109.0503"/>
    <point lat="31.3325" lng="-111.0718"/>
    <point lat="32.4935" lng="-114.8126"/>
    <point lat="32.5184" lng="-114.8099"/>
    <point lat="32.5827" lng="-114.8044"/>
    <point lat="32.6246" lng="-114.7992"/>
  </state>
</states>
Finally, you need the javascript to retrieve, process and send the polygons to Google. You do this with a simple jQuery call to your XML status endpoint, which you add to the Google Maps configuration from above:

<script src=""></script>
<script src="//"></script>
<script type="text/javascript">
  function initialize() {
    var polys = [];
    var mapCanvas = document.getElementById('map');
    var mapOptions = {
      center: new google.maps.LatLng(38.0000, -96.0000),
      zoom: 4,
      mapTypeId: google.maps.MapTypeId.ROADMAP,
      disableDefaultUI: true
    };
    var map = new google.maps.Map(mapCanvas, mapOptions);

    jQuery.get("/api/1/stats/state_status.xml", {}, function (data) {
      jQuery(data).find("state").each(function () {
        var color = this.getAttribute('color');
        var points = this.getElementsByTagName("point");
        var pts = [];
        for (var i = 0; i < points.length; i++) {
          pts[i] = new google.maps.LatLng(parseFloat(points[i].getAttribute("lat")),
                                          parseFloat(points[i].getAttribute("lng")));
        }
        var poly = new google.maps.Polygon({
          paths: pts,
          strokeColor: '#000000',
          strokeOpacity: 1,
          fillColor: color,
          fillOpacity: 0.35
        });
        poly.setMap(map);
        polys.push(poly);
      });
    });
  }

  google.maps.event.addDomListener(window, 'load', initialize);
</script>

And now you can watch the status of each of your states in near-realtime:

Map with status overlays

Using DIY panels along with the versatility of your web applications, you can make a rich set of panels and pages for your status board that let you see how your app is performing at any given time, at a glance.

Sample stats

(Note that all the data is faked out for the purposes of these screenshots)

What's behind the microservices trend?

A growing development team can be super exciting. Imagine all the new stuff we'll get done! There'll be so many more people to learn from. And it probably means that the business is moving in the right direction.

But at some point, you'll find yourself getting less done with more people. Maybe you're starting to hear about how the work feels like it's dragging. The words "unmaintainable" and "stuffed full of bugs" start to get thrown around.

Just as communication gets exponentially harder as a team grows, code can become exponentially harder to work with as the team building it grows. How can you tell if it's happening?

  • Onboarding is harder. The goal was having a new dev shipping on the day they join. But now it takes up to a week to get their environment set up, and to learn enough about the code to make a simple text change.

  • You can't easily make a change without breaking unrelated code. Even if the team has been good about reducing coupling, over a few years, if code can be coupled, code will be accidentally coupled.

    In the best case, this costs dev time. In the worst, it causes errors or site downtime.

  • Shipping takes longer. When everyone's working in the same codebase, you'll have a decision to make: Do you want to batch changes and ship them all at once? Or would you rather have everyone wait in line to ship?

    If you decide to batch changes, you'll have a lot of integration pain. If you decide to have a ship queue, that queue will grow as you add more devs and ship more often. Instead of working on the next thing, devs spend half their day at half-attention, waiting for their code to go out.

  • Tests take longer to run. You want all the tests to pass before you deploy, right? So not only does this make you lose more time when you're shipping, it also delays everyone in line behind you. Hopefully your tests pass consistently!

  • Ownership becomes unclear. When something breaks inside the code, who's responsible for investigating and fixing it? When ownership is muddled, entire features become "someone else's problem," and don't get the care they deserve.

What do we want?

What would a better world look like? It'd be awesome if we had:

  • Apps based around simple ideas that are easy to understand.
  • Isolated sections of code, that can't affect each other.
  • Loose dependencies, so small pieces of the app can ship independently.
  • Fast tests.
  • Clear ownership.

Small apps or services, coordinating to get work done, hit every one of those factors. Because, in general:

  • A smaller app is easier to understand than a large one. There's less code to worry about.
  • If you isolate code inside a service, another app can't mess with it. You can only make changes through the interface.
  • If you know what a service's clients are expecting from it, you should be able to change and ship it independently.
  • If you're shipping a smaller piece of a large app, you'd only have to run some of the larger app's tests to feel comfortable shipping it.
  • It's easy to assign ownership of all of a small app: "Hey, Katie, you're responsible for the Account service."

All of these benefits have contributed to microservices becoming a trend.

The traditional fix to exponentially growing communication is to break large teams into smaller autonomous teams. This works for software, too. It can turn exponential growth into more linear growth. But it's good to understand why this works -- what problems microservices are meant to solve. Because, like every decision in software development, it involves tradeoffs.
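That "exponential" growth in communication is easy to quantify: pairwise channels grow with the square of headcount, and splitting into autonomous teams collapses most of them. A quick back-of-the-envelope sketch (the team sizes are illustrative):

```ruby
# Pairwise communication channels among n people: n * (n - 1) / 2.
def channels(n)
  n * (n - 1) / 2
end

# One 24-person team, everyone coordinating with everyone:
channels(24)                    # => 276 channels

# The same people as four autonomous 6-person teams, plus one
# channel per pair of teams for cross-team coordination:
4 * channels(6) + channels(4)   # => 66 channels
```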

If you're not having any of those problems I described earlier, it's probably not a great idea to transition to services. Because there are some pretty big problems with a service-based architecture:

  • The code may be simpler, but the relationships between coordinating services become more complicated. This is something your language probably won't be able to help with. You'll need to work to make that coordination visible, so you can see those connections and make sure they make sense.

  • Services can add a lot of busywork when you start building a new app. If you have an app, a client, a contract, and a service, you'll sometimes have to tweak and ship four repositories to get any work done.

  • Tests are harder to write, and can be more brittle. A lot of your tests will depend on the network. You'll have to accept the brittleness of relying on code running somewhere else, or mock the connections out. If you're mocking, your tests might pass when they should fail, which is a big problem.

  • You and the team have to define solid patterns for communicating. You'll decide how to communicate errors and metadata, how and when to retry failed requests, deal with caching, serialize and deserialize data, and lots of other things. This is stuff that a big monolithic app will give you for free.

  • You'll face problems like cascading failures, thundering herds, and stale data, which always seem to come at the worst time.

  • You've now built a distributed system. Distributed systems, especially when things fail, act unpredictably.

  • If you're using HTTP to communicate between services, you've probably added latency. Your app might be slower, or you might add caching to try to speed things up. If you do that, you have to worry more about one of the two hardest problems in computer science -- cache invalidation. Not only that, but cache invalidation across multiple apps.

And there's even more! These sound like huge problems, and they are -- but then, so are the problems related to a growing team on a growing app.

The key is recognizing when the problems of team and code growth are on the path to outweighing the problems of a service-based architecture.

Most of the companies that have moved to a microservice architecture seem very happy about it. We're still in the process of making the transition, but so far we've seen huge wins from it.

If helping us make that transition sounds interesting to you, let me know -- we're always looking for great new people to join our team.

Building JSON-based Rails SOA (part 1)

Let me tell you a story: Once upon a time there was a rails app. It was a good rails app, like all rails new apps. But over time a darkness started shrouding the land. The app grew. Models started piling up. The developers pushed the models deep into the dungeons of namespaces. And there they sat, in the darkness, quietly growing old and crusty with unsupported code.

But one day, there was great upheaval. The knights of SOA arrived! They took those models out of the dungeons and cleaned them. They wrapped these models in new rails apps, smaller and shinier, with APIs to protect the users from the raw power of the models. The old app was rebuilt on top of the APIs with clearly defined boundaries. And everyone lived happily ever after.

Or something like that. Let's talk about what SOA means in the Rails ecosystem. Or, rather, what it means in the Avvo ecosystem. Over time, as our app grew, we decided that it would be best to cut it up into services. This will be a brief overview of our stack.

Our main app is still a rails app and our services are gently modified rails-api apps. They communicate through JSON. The client uses the JsonApiClient gem to formulate the requests and parse the responses.


JsonApiClient is a neato little gem that Jeff Ching wrote. It makes it easy for the client to never have to worry about paths or request parsing. All of that is done by the gem.

Modified rails-api

Our implementation, while unfortunately not public, is a thin wrapper around rails-api that formats the responses as:

    {
      # ActiveModel::Serializer serialized array of objects goes here,
      # keyed by the pluralized model name
      meta: {
        status: HTTP status,
        page: current page,
        per_page: count of entries per response page,
        total_pages: total number of pages with the above per_page count,
        total_entries: total number of records
      }
    }
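Deriving those meta fields is straightforward arithmetic; here's a plain-Ruby sketch (a hypothetical helper, not our actual wrapper code):

```ruby
# Hypothetical helper: build the pagination metadata for one page of results.
def pagination_meta(total_entries, page:, per_page:)
  {
    status: 200,
    page: page,
    per_page: per_page,
    # ceil, so a partial final page still counts as a page
    total_pages: (total_entries / per_page.to_f).ceil,
    total_entries: total_entries
  }
end

pagination_meta(45, page: 1, per_page: 20)
# => { status: 200, page: 1, per_page: 20, total_pages: 3, total_entries: 45 }
```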


It's easier to explain how this works with an example. Let's pretend that this blog is a rails app that has an API backend and examine how we would render the first page.

How we talk to one another

The blog makes a call to the blagablag API to ask for all the posts that may be available (or, really, the first page). Let's look at what the code would look like.

Main blog app


The controller looks much like any other controller.

class BlogController < ApplicationController
  def index
    # first page of results.
    @posts = Blagablag::Post.order(:updated_at).to_a
  end

  def show
    @post = Blagablag::Post.find(params[:id])
  end
end

With the exception that you're now asking the Blagablag::Post class for info, which is a JsonApiClient class.


module Blagablag
  class Post < JsonApiClient::Resource
    # this would normally be in a base class
    self.site = "http://blagablag.example.com/v1/"
  end
end

Sure is empty in there, huh? That's because JsonApiClient::Resource is handling all of the routing for us. It knows how to build all the standard CRUD routes, so all you have to do is call them.

Note: It's possible (and easy) to build custom routes, but you have to define those in the client. RTM for more details.
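The convention behind those CRUD routes is easy to picture. As a rough sketch (plain Ruby for illustration, not the gem's implementation, and verb details may differ by version), resource calls map to requests something like:

```ruby
# Conceptual mapping from resource-style calls to HTTP requests.
# Illustrative only -- the gem handles this internally.
def route_for(resource, action, id = nil)
  base = "/#{resource}"
  case action
  when :index   then ["GET",    base]             # Post.all
  when :show    then ["GET",    "#{base}/#{id}"]  # Post.find(id)
  when :create  then ["POST",   base]
  when :update  then ["PATCH",  "#{base}/#{id}"]
  when :destroy then ["DELETE", "#{base}/#{id}"]
  end
end

route_for(:posts, :show, 13)   # => ["GET", "/posts/13"]
```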

blagablag API

The API side is a bit more involved (for you, because we do not currently have an open source implementation of the server), but a simplified version would look something like:

module Blagablag
  module V1
    class PostsController < ActionController::API
      # main controller to handle all where requests
      def index
        @posts = Post.scope.where(params)
        @posts = @posts.order(params[:order]) if params[:order]
        # process page params
        @posts = paginate(@posts)

        render json: format(@posts)
      end

      # where client.find(id) goes
      def show
        @post = Post.find(params[:id])

        render json: format(@post)
      end

      private

      # this is all greatly simplified for the example
      def format(objects)
        {
          posts: Array(objects),
          meta: {
            status: 200,
            page: @page,
            per_page: @per_page,
            total_pages: (@total / @per_page.to_f).ceil,
            total_entries: @total
          }
        }
      end
    end
  end
end

And that's pretty much it. There's a little bit of magic that we do with our servers that makes some of this a bit cleaner, but that is our stack. We're still learning how to manage all the servers and interdependencies, but this has sped up our system dramatically and has forced us to really think about system design in a way that our old monolith never could.

As a developer, what do you value?

Once a company starts to grow, it gets harder for developers to get to know each other. But without some kind of compatibility between how people on the team make decisions, things get chaotic, quickly.

The problem starts to show up between teams, when it takes way too many meetings to get systems built by the same company to talk to each other. Or there's culture shock when people move between teams, which leads to less movement. That makes this problem even worse. Soon, meetings become arguments. People can't even understand the position of the person across the table, because they're starting from different fundamental assumptions. And nothing spawns more useless, frustrating meetings than arguments between people who, deep down, believe different things.

What's the solution?

There are a few ways to solve this problem. Management could dictate how decisions must be made. "You will use Java to solve all problems." "Every team will communicate using JSON." This works, but at a cost -- every decision you take away from someone removes a reason you hired that person to begin with. Besides, you don't need everyone to make the same decision, you need compatible decision-making.

Maintaining a strong culture is often a better option. But "culture," within a company, has a fuzzy meaning. Maybe it's how you hire (which, in the worst case, means "we hire people like ourselves.") It could mean Whiskey Fridays, or Ping-Pong in the lunchroom, or just about anything else you could say about a company. It has something to do with improving shared decision-making. But that's often buried under everything else that falls under the category of "culture."

Context and Taste

There's one specific part of culture, though, that has a strong impact on how people make decisions. I think I first saw the idea in Netflix's culture deck, where they refer to it as "Context" (starting on slide 79).

Context is about creating shared understanding. It's about agreeing on things that we, as a team, typically value. It's about sharing our assumptions, and what's behind them. It's being open about our priorities and goals. It's about building a framework for making good, compatible decisions, without dictating the decision from the top down.

GitHub used a similar concept, which Kyle Neath called "taste." Here's how he describes it:

I’d argue that an organization’s taste is defined by the process and style in which they make design decisions. What features belong in our product? Which prototype feels better? Do we need more iterations, or is this good enough? Are these questions answered by tools? By a process? By a person? Those answers are the essence of taste. In other words, an organization’s taste is the way the organization makes design decisions.


To help capture taste, Kyle borrowed an idea from Python, and defined GitHub's "Zen" -- that is, a short list of statements that answer the question, "Why did you do that?" This is what he came up with:

  • Responsive is better than fast.
  • It’s not fully shipped until it’s fast.
  • Anything added dilutes everything else.
  • Practicality beats purity.
  • Approachable is better than simple.
  • Mind your words, they are important.
  • Speak like a human.
  • Half measures are as bad as nothing at all.
  • Encourage flow.
  • Non-blocking is better than blocking.
  • Favor focus over features.
  • Avoid administrative distraction.
  • Design for failure.
  • Keep it logically awesome.

Bringing it back to Avvo

To us, this seemed like a great way of capturing some of the core things we, as a dev team, fundamentally value. Some part of our context, or taste. Here's the newest version of what we came up with:

  • Invest in yourself and your tools
  • Defend the user
  • Your teammate’s problems are your problems
  • Attack the noise
  • Take ownership
  • Leave code better than you found it
  • Favor convention over reinvention
  • Explicit is better than implicit
  • Understand why
  • Prove it
  • Keep moving forward
  • Deliver business value

And, like GitHub, ours now exist in a git repository -- ready for updates, pull requests, and comments.

Once you have these values written down, it's amazing where they pop up. During disagreements, it's helpful to tie positions back to these values. In interviews, it's interesting to see how candidates' responses relate to these. They come up during feedback, in meetings, and in 1:1s. They immediately tell people outside of Avvo how compatible they'll be with the kind of decisions we make. And all of these are clear enough that they can inform good decisions, without enforcing a specific decision.

At Avvo, our development team is growing to the size where people can't know everything that's going on. Through ideas like this, we're beginning to define what it means to be a developer at Avvo. It's an opportunity that only comes around a few times in a company's lifetime! If that sounds exciting to you, and you'd like to help us shape this team, get in touch -- through either our careers page or a simple email to me. I'd love to hear from you.

Parsing JSON requests with deserializers

Once upon a time, you have a Rails server. This server does all kinds of wonderful things, surely. One of those things is taking JSON, parsing it, and storing it somewhere.

Here's my question to you: how do you parse the JSON and where? Do you do it in the controller? Do you do it like this?

class SomethingJsonController < ApplicationController
  def create
    # ...
  end

  def update
    # ...
  end

  private

  def strong_params
    params.require(:blah).permit(:blah, :blah, :blah)

Sure. You can. But what happens when the JSON you get is all weird and nested? Even worse, what if it isn't a 1:1 mapping of your model? You end up with some intensely unpleasant controller code. Let's look at an example to show you what I mean.

An example

OK. We'll say that you have an endpoint that talks to a service/device and takes a JSON blob that looks like this:

  "restaurant_id" : 13,
        "user_id" : 6,
      "dish_name" : "risotto con funghi",
    "description" : "repulsive beyond belief",
        "ratings" : {
                        "taste" : "terrible",
                        "color" : "horrendous",
                      "texture" : "vile",
                        "smell" : "delightful, somehow"

But your model doesn't directly map to that. In fact, it's flat and boring, like

# DishReview model

t.belongs_to  :restaurant
t.belongs_to  :user
t.string      :name # field name different from API (dish_name)
t.string      :description
t.string      :taste
t.string      :color
t.string      :texture
t.string      :smell

The problem

What many people would do (assuming you can't change the incoming JSON) is try to parse and modify the params in the controller, which ends up looking roughly like this:

class DishReviewController < BaseController

  def create
    review_params = get_review_params(params)
    @review = DishReview.new(review_params)
    if @review.save
      # return review
    else
      # return sad errors splody
    end
  end

  # rest of RUD

  private

  def get_review_params(params)
    review_params = params.require(:review)

    review_params[:name] ||= review_params.delete(:dish_name)

    ratings = review_params.delete(:ratings)
    if ratings.present?
      ratings.each { |rating, value| review_params[rating] = value if valid_rating?(rating) }
    end

    review_params
  end

  def valid_rating?(rating)
    [ "taste", "color", "texture", "smell" ].include? rating
  end
end

Man, that sure is a lot of non-controller code inside that controller. 30 lines, in fact. And that's if the same params come in for all the actions. What if your update and create take different params? Then it gets all nasty and you start shoving code into concerns; it'll be hard to read, follow, maintain, and refactor.

Our solution

But enough of this Negative Nancy talk. I have options! Well, just one, really. It's a gem we call "Deserializer"!

Here at Avvo, my team is working on cross-server communication and building out new APIs as we scale our product. We serialize data using ActiveModelSerializer, and it was frustrating to parse the generated JSON by hand on the receiving side. So, to ease the pain of perpetually mangling hashes in the controller, I wrote this gem.

So what does this "deserializer" of yours do, exactly?

The Deserializer acts as the opposite of AMS. AMS takes an object, and converts it into JSON. The deserializer takes in params (incoming JSON), and converts them into model consumable data. It does not create an object out of those params. Really, it's a glorified hash mangler.
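In other words, conceptually the deserializer performs a transformation like this (a hand-rolled sketch of the behavior, not the gem's internals):

```ruby
# Sketch: turn nested incoming params into a flat, model-consumable hash.
def deserialize_dish_review(params)
  review = {
    restaurant_id: params["restaurant_id"],
    user_id:       params["user_id"],
    name:          params["dish_name"],   # renamed from the API's key
    description:   params["description"]
  }
  # flatten the nested ratings block into the top level
  (params["ratings"] || {}).each { |rating, value| review[rating.to_sym] = value }
  review
end

incoming = {
  "restaurant_id" => 13,
  "user_id"       => 6,
  "dish_name"     => "risotto con funghi",
  "ratings"       => { "taste" => "terrible", "smell" => "delightful, somehow" }
}

deserialize_dish_review(incoming)
# => { restaurant_id: 13, user_id: 6, name: "risotto con funghi",
#      description: nil, taste: "terrible", smell: "delightful, somehow" }
```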

Great. Whatever. How does that help me?

Using the example above, let's look at what our code will look like with a deserializer


class DishReviewsController < YourApiController::Base
  def create
    review_params = DishReviewDeserializer.from_params(params)
    DishReview.create( review_params )
  end

  # RUD
end

"Wow!", you say, "That's so tidy and neat!". You are correct. And the deserializers aren't too bad either. Let's have a look


# DishReviewDeserializer

module MyApi
  module V1
    class DishReviewDeserializer < Deserializer::Base
      attributes  :restaurant_id,
                  :user_id,
                  :description

      attribute   :name, key: :dish_name

      has_one :ratings, :deserializer => RatingsDeserializer

      # flatten ratings into the parent hash
      def ratings
        object
      end
    end
  end
end


# RatingsDeserializer:

module MyApi
  module V1
    class RatingsDeserializer < Deserializer::Base

      attributes  :taste,
                  :color,
                  :texture,
                  :smell
    end
  end
end

"Hot dog!", you exclaim, "Those look just like my serializers on the other side! It's as if the interface is written to match that of AMS!" They sure do. And it sure is.

Now, not only are your concerns separated, but you can reason about what your code does and understand what data is coming in by just looking at the deserializers.

As a nice bonus, since the deserializer ignores undefined keys, you no longer have to strong-param anything, but you still can if you want (because it's just a hash mangler). There's even a function to help you: MyDeserializer.permitted_params will give you the list of parameters that the deserializer expects to get.

For more detailed info, feel free to RTM and contribute.

Switching from Resque to ActiveJob

At Avvo we wanted to add some metrics and logging to our background job processing. We're using Resque, so it wouldn't take much to write a module to wrap enqueue and perform to do the logging. After investigating the new ActiveJob abstraction in Rails, we decided there would be benefits to switching.
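For comparison, the wrap-it-ourselves approach would have looked roughly like this: prepend a module so perform gets timed and logged. This is a sketch with a made-up job class, not code we shipped:

```ruby
# Sketch: timing/logging wrapper around a Resque-style job's perform,
# via Module#prepend. HardWorker is a hypothetical job class.
module PerformLogging
  def perform(*args)
    start = Time.now
    result = super
    puts "#{name} performed in #{Time.now - start}s"
    result
  end
end

class HardWorker
  @queue = :default

  def self.perform(units)
    units * 2
  end

  # prepend onto the singleton class so self.perform is intercepted
  singleton_class.prepend PerformLogging
end

HardWorker.perform(21)   # logs the timing, returns 42
```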

This post is intended as a brief guide to switching from plain Resque to ActiveJob. In addition, it covers the Minitest test helper and integration with Resque::Scheduler.

What does ActiveJob do?

ActiveJob is built into Rails, and provides a common interface for background jobs. It also adds callback support around enqueuing and performing jobs. By default this is used to add logging to your jobs.

Coming from Resque

Plain Resque was a good starting point years ago, but these days we expect more from our libraries. There's no logging and no metrics hooks. By switching to ActiveJob we get all that for free. Plus it might make switching to Sidekiq easier.


Keep in mind while upgrading that you may have jobs in the queue when you deploy your new ActiveJobified code. Because of this, we want to keep the old job classes around and have them call the new ActiveJob code: just have the old self.perform methods instantiate the new job class and call perform.

Mechanics of using ActiveJob

Switching involves a few steps:

  • Inherit from ActiveJob::Base
  • use queue_as instead of @queue =
  • configure in config/application.rb: config.active_job.queue_adapter = :resque and config.active_job.queue_name_prefix = "my_app"
  • Change perform methods from class to instance.
  • Switch from using Resque.enqueue KlassName to KlassName.perform_later.

For instance, we'd change the following class:

class OldCsvParser
  @queue = :csv_parsing

  def self.perform(filename)
    # ... do stuff
  end
end

into this:

class CsvParserJob < ActiveJob::Base
  queue_as :csv_parsing

  def perform(filename)
    # ... do the stuff
  end
end

and update the original class to just call the new one:

class OldCsvParser
  @queue = :csv_parsing

  def self.perform(filename)
    CsvParserJob.new.perform(filename)
  end
end
Testing with ActiveJob::TestHelper

Coming from ResqueUnit, the switch to ActiveJob::TestHelper was easy. Include the module in your TestCase and use methods like assert_enqueued_jobs.

For instance, if we enqueued a CsvParserJob from a controller action, our test might look like this:

class CsvParsingControllerTest < ActionController::TestCase
  include ActiveJob::TestHelper

  test "enqueues the job to parse the csv" do
    filename = "/path/to/csv/file.csv"

    assert_enqueued_with(job: CsvParserJob, args: [filename]) do
      post :create, filename: filename
    end
  end
end

Integration with ResqueScheduler

Resque::Scheduler enqueues jobs directly with Resque. You'll need to either change that behavior or wrap jobs in the schedule with a JobWrapper.

Luckily the latter work has already been done, and using the gem ActiveScheduler will wrap the jobs so callbacks are called.

Installation is a snap, just follow the directions in the project's readme to update your Resque::Scheduler initializer.

Downsides: ResqueWeb only shows JobWrapper job classes

Because of the ActiveJob JobWrapper, viewing running jobs in resque-web will no longer show the class of the actual job on the dashboard. Clicking through and viewing the arguments does show the class. This could be a hassle if you often view the job queue.

Similarly, the Schedule tab in resque-web is a little cluttered, but still readable.

How swappable storage and fakes lead to cleaner, more reliable tests

Let's say you've written a Rails app that runs background jobs using Resque. How do you test those jobs? You can run them inline and check that what you expected to happen, actually happened:

setup do
  Resque.inline = true
end

def test_account_updated
  # with Resque.inline set, enqueuing runs the job immediately
  Resque.enqueue(UpdateAccount, test_account.id)
  assert_equal :updated, test_account.status
end

That works, but it's missing something. You backgrounded that job for a reason -- maybe it's slow, or you don't want it to happen right away. If you put that code in a background job, you probably care more about the job being queued, and less about what it does when it runs. (Besides, you can test that part separately).

I wrote resque_unit to solve this problem. resque_unit is a fake version of Resque. It intercepts your jobs and gives you some more assertions:

def test_account_updated
  Resque.enqueue(UpdateAccount, test_account.id)
  assert_queued UpdateAccount, [test_account.id]
end

On the inside, resque_unit rebuilt part of Resque's API to change how jobs were queued.

This was great for an initial implementation. It was fast, you didn't need Redis on your continuous integration server, and it was easy to understand. But, as reimplementations of an API tend to do, it fell behind. It got way more complicated. More bugs popped into GitHub Issues, and more code had to be borrowed from Resque itself.

Besides all that, there was a really big gotcha for new users: if you loaded resque_unit before you loaded Resque, resque_unit would stop working.

Looking at other options

My favorite way to write an easily testable client is to build swappable storage and network layers. For example, imagine if Resque had a RedisJobStore and an InMemoryJobStore, that each implemented to the same API. You could write most of your unit tests against the InMemoryJobStore, and avoid the dependency and complication of Redis. But, since Resque is designed to work specifically with Redis, this wasn't an option.
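To make the idea concrete, here's a minimal sketch of what that swappable seam could look like. The class names are hypothetical -- Resque doesn't actually expose this:

```ruby
# Sketch of a swappable storage layer: both stores implement the same
# tiny API, so unit tests can use the in-memory one.
class InMemoryJobStore
  def initialize
    @queues = Hash.new { |hash, key| hash[key] = [] }
  end

  def push(queue, job)
    @queues[queue] << job
  end

  def pop(queue)
    @queues[queue].shift
  end

  def size(queue)
    @queues[queue].length
  end
end

# A RedisJobStore would implement the same three methods on top of
# LPUSH/RPOP/LLEN; client code only ever sees the shared interface.
store = InMemoryJobStore.new
store.push(:default, { class: "UpdateAccount", args: [42] })
store.size(:default)   # => 1
store.pop(:default)    # => { class: "UpdateAccount", args: [42] }
```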

Instead, the answer was to go a level deeper. What if the Redis client itself had both a RedisStore and an InMemoryStore? It turns out this is a thing that exists, called fakeredis.

fakeredis re-implements the entire Redis API. But instead of talking to a running Redis server, it works entirely in-memory. This is really impressive, and seemed worth a try.

Bringing fakeredis to resque_unit

If you were working with real Resque in development or production, how would you check that a job was queued?

You wouldn't have to write much extra code. You'd queue a job normally. You could look for a queued job with peek. You'd check queue size with size. And you'd clean up after yourself with remove_queue or flushdb. If you wanted to run the jobs inside a queue you'd have to pretend you were a worker. But for the most part, you'd barely have to write code.

To bring fakeredis to resque_unit, it was almost that easy. I got to remove a ton of code. And the rest of the code is a lot smaller, a lot simpler, and a lot less likely to break.

One last quirk

There was one last problem: fakeredis entirely takes over your connection to Redis. That makes it a pretty terrible dependency for a gem to have. What if you wanted to use real Redis for most of your tests? If you require resque_unit, all of a sudden you've changed a lot about how your tests run! And it gives me a headache to think about how hard that would be to debug.

So, when you require resque_unit, it does a little dance to be as unobtrusive as it can:

# This is a little weird. Fakeredis registers itself as the default
# redis driver after you load it. This might not be what you want,
# though -- resque_unit needs fakeredis, but you may have a reason to
# use a different redis driver for the rest of your test code. So
# we'll store the old default here, and restore it after we're done
# loading fakeredis. Then, we'll point resque at fakeredis
# specifically.
default_redis_driver = Redis::Connection.drivers.pop
require 'fakeredis'
Redis::Connection.drivers << default_redis_driver if default_redis_driver
module Resque
  module TestExtensions
    # A redis connection that always uses fakeredis.
    def fake_redis
      @fake_redis ||= Redis.new(driver: :memory)
    end

    # Always return the fake redis.
    def redis
      fake_redis
    end
  end
end

What can you take away from this?

When you're building a library that depends on a data store or service, think about making it swappable. It'll make your own tests easier to write, and it'll be clearer to your readers which features of the service you use.

Don't reimplement APIs on the surface level. Especially if you don't own it, and you don't control revisions to it. You'll do nothing but chase changes to it, and your implementation will usually be behind and a little broken.

And a good in-memory fake, in the right place, can make testing and development so much easier.

Using La Maquina to solve complex cache dependencies

Let's talk about caching a page that needs to pull a ton of database objects. Specifically, one whose render is centered around a monster model with way too many dependencies. This is going to be lengthy, so get yourself some coffee and settle in for an emotional rollercoaster.


Alright, let's start off with an example. So let’s say you have a Rails project. And let’s say you have a “master” model all up in there. A “god” model if you will. It’s probably User. It’s User, isn’t it? Be honest. Anyway. You probably have a bunch of stuff in that user.rb file of yours. Something like

class User < ActiveRecord::Base
  has_one  :headshot
  has_many :articles

  # 500 more associations over here
end

and your app/views/user.html.slim file is riddled with cache blocks that look like

/ I want a cache block here for @user : ((
- cache [@user.headshot, :headshot] do
  = image_tag @user.headshot.image.small

- cache [@user, :details] do
  h2 = @user.name
  p = @user.about_me

- cache [@user.articles, :other_things] do
  - @user.articles.each do |article|
    - cache [article, :thumb_description] do
      = image_tag article.header_image
      = article.short_description
/ etc

Which means that, while you have some caching in there and the HTML renders will be faster, you're still making quite a few database calls to verify all those caches. You could solve this by adding belongs_to :user, touch: true to all of the models in the block above, but (1) that doesn't work, because touch is not supported for all associations, and (2) you'd be making db updates to user when you update unrelated objects. Also: this gets quite involved for through associations.

This is where I sell you stuff. Specifically, I wrote a gem called La Maquina. It lets you define arbitrary associations between models and fire notifications about updates. That's a hard sentence to parse, so lemme show you some code. Using only article from the example above, we can set up our models as follows:

class User < ActiveRecord::Base
  include LaMaquina::Notifier
  notifies_about :self

  has_many :articles
  # etc
end

class Article < ActiveRecord::Base
  include LaMaquina::Notifier
  notifies_about :user

  belongs_to :user
  # etc
end

# all the other associations, like headshot, would be set up the same way as Article

This is the most basic plumbing. If you were to examine the input into the LaMaquina engine at this point, you'd see the :self notification fire (conceptually) as "a user is notifying about user #{id}"; and for the article, you'd see "an article is notifying about user #{id}".
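If it helps to see that flow as code, here's a toy version of the notify fan-out. The classes here are hypothetical stand-ins, not La Maquina's actual implementation:

```ruby
# Toy fan-out engine, illustrating the notify -> piston flow.
class ToyEngine
  def self.pistons
    @pistons ||= []
  end

  def self.install(piston)
    pistons << piston
  end

  # hand every notification to every installed piston
  def self.notify(notified_class, id, notifier_class)
    pistons.each { |piston| piston.fire!(notified_class, id, notifier_class) }
  end
end

# A piston that just records what it was told.
class LoggingPiston
  def self.fire!(notified_class, id, notifier_class)
    log << "#{notifier_class} is notifying about #{notified_class} #{id}"
  end

  def self.log
    @log ||= []
  end
end

ToyEngine.install(LoggingPiston)
ToyEngine.notify("user", 42, "article")
LoggingPiston.log.last   # => "article is notifying about user 42"
```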

This is the interesting bit. Now that we have notifications flowing, we need some code to process them. La Maquina does this with plugins called Pistons: code that takes the notified and notifier class names and processes them in as simple or complex a way as you need. Just as an example, here's what a piston implementing the old touch functionality would look like.

class TouchPiston < LaMaquina::Piston::Base
  class << self
    # for Article, we'd get "user", user_id, "article"
    def fire!( notified_class, id, notifier_class = "" )
      # User
      updated_class = notified_class.camelize.constantize

      # User.find(id)
      object = updated_class.find(id)

      object.touch
    end
  end
end


While this is not recommended for cache invalidation (LaMaquina::Piston::CachePiston is probably what you want to use), this will allow you to update your slim to be more like this:

/ this is the important bit right here
- cache [@user, :profile] do
  - cache [@user.headshot, :headshot] do
    = image_tag @user.headshot.image.small

  - cache [@user, :details] do
    h2 = @user.name
    p = @user.about_me

  - cache [@user.articles, :other_things] do
    - @user.articles.each do |article|
      - cache [article, :thumb_description] do
        = image_tag article.header_image
        = article.short_description

So now, so long as the user hasn't been touched, your page will render with a single db call. One. Isn't that exciting? I think that's pretty neat.

As a sidenote, you'll probably want to keep the inner cache blocks as they were, since they'll help with partial page rebuilds. If a single article is added or updated, for example, you won't have to rebuild the entire page.


Ok. So that's great. But I added all of this code and nothing is happening. What gives?

You need to do some minor setup. In config/initializers, you'll need to add a la_maquina.rb that sets all of this up. Something along the lines of:

LaMaquina::Piston::CachePiston.redis = Redis.new
LaMaquina::Engine.install LaMaquina::Piston::CacheAssemblerPiston

LaMaquina.error_notifier = LaMaquina::ErrorNotifier::HoneybadgerNotifier

For a more thorough explanation of what all of that means, plz RTM.

Important note: if you're using CachePiston or your own custom cache key generator, don't forget to add a cache_key method to your target models:

class User < ActiveRecord::Base
  def cache_key
    LaMaquina::Piston::CachePiston.cache_key(:user, id)
  end
end

Otherwise, Rails will default to model/id/updated_at, which will of course ignore your shiny new key, and you will be very sad.

Bonus round

So this is all great: you're using the CachePiston and all of your views are blazing fast.


So you know how you set up that piston to update your cache when a model changed? Well, you can add an arbitrary number of pistons that do all sorts of things. You're using Solr and want users to be reindexed when they're updated? You can do that (there's actually a proto-piston for that already). You want to fire Kafka notifications when articles get created? You can do that too. RSS? Why not. Push notifications? Sure! All kinds of things can happen. You can even rebuild views on the backend if you want. The world is your oyster now.
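Since a piston is just a class with a fire! class method (as TouchPiston above shows), the whole dispatch idea fits in a few lines of plain Ruby. This is only a stand-in to illustrate the pattern -- none of these class names are LaMaquina's real internals:

```ruby
# A stand-in engine: it just fans each notification out to every
# installed piston's fire! method.
class ToyEngine
  def self.pistons
    @pistons ||= []
  end

  def self.install(piston)
    pistons << piston
  end

  def self.notify(notified_class, id, notifier_class = "")
    pistons.each { |p| p.fire!(notified_class, id, notifier_class) }
  end
end

# A toy piston that just records what it saw
class LogPiston
  def self.fired
    @fired ||= []
  end

  def self.fire!(notified_class, id, notifier_class = "")
    fired << "#{notifier_class} notified about #{notified_class} #{id}"
  end
end

ToyEngine.install(LogPiston)
ToyEngine.notify("user", 42, "article")
puts LogPiston.fired.last  # => article notified about user 42
```

Installing a second piston is just another ToyEngine.install call, which is exactly why adding Solr, Kafka, or push-notification behavior is so cheap.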

You're welcome. Now go, make your app radical.

Solving Redis timeouts with a little fundamental CS

Redis is a handy place to keep data. With all the commands Redis supports, you can solve a ton of really common problems.

For example, do you need a queue you can safely use from a bunch of different processes? LPUSH and BRPOP have you covered. That's actually how Sidekiq works! (Resque pushes and pops in the other direction).
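That pattern is easy to sketch without a Redis server, using Ruby's thread-safe Queue as a stand-in (push plays the role of LPUSH, and pop blocks like BRPOP; the job payload here is made up):

```ruby
require 'json'

# A stand-in for the Redis pattern: Thread::Queue gives us the same
# multi-producer/multi-consumer safety within a single process.
queue = Queue.new

# producer side ("LPUSH"): enqueue a serialized job payload
queue.push({ "class" => "HardJob", "args" => [1, 2] }.to_json)

# consumer side ("BRPOP"): block until a payload arrives, then decode it
job = JSON.parse(queue.pop)
puts job["class"]  # => HardJob
```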

In fact, Redis is such an easy place to stuff data, that it could become your first choice for storing miscellaneous things. You'll find the command that does what you want, call it, and everything will be great! That is, until you store more and more data under certain keys.

Maybe you want to see if a job has already been queued, so you don't queue it again:

index = 0
while (payload = Resque.redis.lindex("queue:#{queue_name}", index)) do
  # ... see if payload matches the job we're looking for ...
  index += 1
end

And all of a sudden, you'll notice that all of your communication with Redis just got slower. You might even start seeing Redis::TimeoutErrors. What happened?

In order to understand how this code broke, you need to understand a little bit about Redis, and a little fundamental Computer Science.


Take another look at Redis' command documentation, and you'll notice something on each page: a "Time complexity" line.


Yep, it's Big-O notation -- that thing you studied to prepare for your last interview. In your day-to-day development, you probably don't think about it too much. In Redis, though, slower algorithms can destroy your app's overall performance.

Most Redis commands are pretty fast: O(1) or O(log(N)). But a few, like that LINDEX from up above, are O(N). (Some, like ZUNIONSTORE, are even worse. Don't ask me how I know that).

That means that if you add twice as many elements to your queue, that call will run (roughly) twice as slowly.
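You can see that linear growth without a Redis server at all. Here's a toy, plain-Ruby benchmark (the names are mine): Array#index scans element by element, just like calling LINDEX in a loop, while a Hash lookup stays fast no matter how big the data gets.

```ruby
require 'benchmark'

# An O(N) structure and an O(1) "fast copy" of the same data
list = (0...100_000).to_a
by_value = list.each_with_index.to_h  # value => position

# Scan the list 200 times vs. hit the hash 200 times
scan_time = Benchmark.realtime { 200.times { list.index(99_999) } }
hash_time = Benchmark.realtime { 200.times { by_value[99_999] } }

puts scan_time > hash_time  # => true; the scan loses badly as the list grows
```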

And because Redis is single-threaded, a slow Redis command can keep other commands from running. When that happens, you'll start to see your error tracker fill up with Redis errors from totally unrelated places.

How do you detect and fix it?

Redis has SLOWLOG, which can show you which queries are taking the longest:

~ jweiss$ redis-cli> slowlog get 3
1) 1) (integer) 26
   2) (integer) 1436856612
   3) (integer) 13286
   4) 1) "get"
      2) "key2"
2) 1) (integer) 25
   2) (integer) 1436856610
   3) (integer) 41114
   4) 1) "get"
      2) "key2"
3) 1) (integer) 24
   2) (integer) 1436856609
   3) (integer) 10891
   4) 1) "get"
      2) "key2"

That's a lot of numbers. The first is a unique entry ID, automatically generated -- you don't need to worry about it. The second is a unix timestamp. The third and fourth are the most important: the amount of time (in microseconds) the command took to run, and the command plus its arguments.
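If you fetch the slowlog from Ruby, the redis gem hands back each entry in that same shape, so pulling out the interesting fields is a one-liner (the entry below is copied from the output above):

```ruby
# One SLOWLOG entry, in the order shown above:
# [auto-generated id, unix timestamp, duration in microseconds, command + args]
entry = [26, 1436856612, 13286, ["get", "key2"]]
id, timestamp, micros, command = entry

puts "#{command.join(' ')} took #{micros / 1000.0}ms"  # => get key2 took 13.286ms
```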

The slowlog can be noisy, and won't always point you to exactly the right place. But it's a good first place to look for performance problems.

After you've identified some slow queries, though, how do you fix them?

Unfortunately, there's no approach that works in every situation. But here are a few that have helped me:

  • Scan through the other Redis commands that work with the data structure you're using. If any of them are almost as good a fit, but faster, you can try to find a way to use those instead.

  • Store an extra copy of the data in a way that makes it fast to look up later. This works especially well for commands like LINDEX. It's like having data in an array to make it easy to iterate, with a lookup table in a hash to make specific elements easy to find.

    While this can work well, you do have to be careful. It's easy to create weird situations where the duplicate copy wasn't added right, or removed right, or got into an inconsistent state. MULTI and EXEC can help, but still take some care to use correctly.

  • Don't use Redis for that data at all. A traditional database might be a better solution, or there might be a more clever way of solving the problem.
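The second idea -- keeping a duplicate copy purely for fast lookups -- looks like this in plain Ruby (the class is my own sketch; in Redis you'd pair the list with a set via SADD/SISMEMBER and wrap the updates in MULTI/EXEC):

```ruby
require 'set'

# Keep the queue as a list for ordering, plus a Set as the duplicate
# copy, so "is this job already queued?" is O(1) instead of an O(N) scan.
class DedupingQueue
  def initialize
    @list = []          # ordered payloads (the "queue" list)
    @members = Set.new  # duplicate copy, kept in sync for fast lookup
  end

  def push(payload)
    return false if @members.include?(payload)  # O(1) membership check
    @list.push(payload)
    @members.add(payload)
    true
  end

  def pop
    payload = @list.shift
    @members.delete(payload) if payload
    payload
  end
end

q = DedupingQueue.new
puts q.push("job-1")  # => true
puts q.push("job-1")  # => false, already queued
puts q.pop            # => job-1
```

The failure mode the article warns about lives right in push and pop: forget to update @members in either place and the two copies silently drift apart.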

In your day-to-day work, you probably don't think too much about algorithmic complexity and O() notation.

But Redis is a single-threaded server where speed is incredibly important. The change in performance as your data grows makes a huge difference. And if you're not paying enough attention, it can hurt your entire app.

So, watch for it. Understand how to compare different algorithms, so you can pick the right one. And build intuition about how fast good algorithms should be, and the tradeoffs between them. That intuition will help you more often than you'd expect.