
Where to start when fixing tests

My test suite isn't horrible, but it isn't great either...

You have a test suite that runs decently well, but you have some transient failures and the suite has been taking progressively longer as time goes on. At some point, you realize that you are spending the first 15 minutes of a deploy crossing your fingers hoping the tests pass, and the next hour re-running the suite to get past the "transient" failures. You have a problem that should be addressed, but where do you start?

Generally, you should start with stabilizing your suite. Consistently passing in 1 hour is a better situation than having to run a test suite 2-3 times at 45 minutes each.

How do you know which tests to tackle first?

Approach

Our testing stack: minitest, Capybara, PhantomJS, Poltergeist, and Jenkins or CircleCI.

We use Jenkins and CircleCI as part of our continuous integration process, which means they are on the critical path to deploying. If our tests pass quickly and consistently locally, but not in our CI environment, we still can't (or shouldn't) ship. "It works on my machine" is rarely a good enough defense. To solve our slow and flaky problem, we want to be sure we are looking at our test performance on servers in our deploy path.

How big of a problem do you have?

How often does your test suite pass? Are there particular suites within the project that fail more frequently? Jenkins and CircleCI can show you this history, but we couldn't find summary-level data like "this suite has passed 75% of the time in the last month".

How do you find your flakiest tests?

We couldn't find an easy way. You can have people document failing tests when they come across them, but manual processes are destined to fail.

How do you find your slowest tests?

There are a few gems that can help you identify your slowest tests locally, like minitest-perf, but we want to know how our tests perform in the continuous integration environment. Jenkins and CircleCI provide some of this data, but it is pretty limited.

Solution

We created JUnit Visualizer to help collect the data we want.

Gathering test data

Jenkins and CircleCI support the JUnit.xml format, which includes test timing, test status, and number of assertions. With JUnit.xml, we can leverage an industry standard, and CircleCI maintains a gem, minitest-ci, that exports minitest data to the format. The gem can basically be dropped into an existing project using minitest. It creates an XML file per test file that is run and saves it in the "test/reports" directory by default.
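For reference, wiring that up is roughly a one-line change (a sketch; check the minitest-ci README for current details):

# Gemfile -- minitest-ci registers itself as a Minitest plugin when the suite runs;
# by default it writes one XML report per test file into test/reports/.
group :test do
  gem 'minitest-ci'
end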

To standardize our integration with Jenkins and CircleCI, we push the XML files to S3, using a directory per build. We use the following to push the reports to S3:

S3 upload configuration
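The original configuration isn't reproduced here, but as a rough sketch (hypothetical bucket and environment variable names, assuming the aws-sdk-s3 gem and AWS credentials in the CI environment), a post-test step could look something like this:

# Hypothetical sketch: upload each JUnit XML report to S3 under a per-build prefix.
require 'aws-sdk-s3'

s3 = Aws::S3::Resource.new(region: 'us-west-2')
bucket = s3.bucket(ENV.fetch('JUNIT_REPORTS_BUCKET'))
build_prefix = "#{ENV.fetch('CI_PROJECT', 'local')}/#{ENV.fetch('BUILD_NUMBER', '0')}"

Dir.glob('test/reports/*.xml').each do |path|
  bucket.object("#{build_prefix}/#{File.basename(path)}").upload_file(path)
end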

Displaying test data

The main categories of test data we want to view:

  1. Historical information that shows how frequently our tests pass or fail. This is broken down by suite if we have more than one suite within a project. This is helpful in focusing our attention on the worst offenders.
  2. Single list of failures that shows all of the test failures, across suites, on one page. This is a convenient way to see all of the failures without having to click into the details of each suite.
  3. Unstable tests list that shows which tests fail the most. This allows us to see our "transient" test failures, as well as identify areas of our code that may be fragile. This provides guidance on where to start fixing tests.
  4. Slowest tests list that shows which tests are taking the most time. There is no point in speeding up a test that takes 1 second, if you have a test that is taking 45 seconds.
  5. Duration trends that show how your test duration is changing over time. It is helpful to see that we are making progress.

For screenshots of how these look in JUnit Visualizer, check out the section at the bottom of this post.

Next Steps

We have made great progress on stability and speed since starting on JUnit Visualizer; how we addressed the test issues is chronicled here.

Some potential next steps for JUnit Visualizer:

  • Enhance the trend charts to account for outliers
  • Be able to reset the unstable test list when we think we have fixed an unstable test

Check out the code for JUnit Visualizer here: https://github.com/avvo/junit_visualizer

Screenshots

Historical Information

We wanted to show how often our tests pass, broken down by project and the suite within the project.

Project view with suites

Single list of failures

We wanted a better summary view of the tests that failed. In Jenkins v1, you can only see the failures within a suite, which means there is a lot of clicking around.

View of the errors for a specific build, where skips and errors are on top

Failures across suites

Unstable tests

We wanted to find the tests that failed most frequently.

View of the tests, with most frequent failures on top

Unstable Tests

Slowest tests

The tests that take the most time, sorted slowest to fastest.

View of the tests for a specific build, slowest test at the top.

Slowest Tests

Duration trends

As we started to fix slow tests, we wanted to be able to see how our test duration changed over time.

There is a simple graph that shows the duration (in seconds)

Test duration over time

Performance and stability in capybara tests

If you've got flaky or very slow UI tests, this is the post for you. Do any of these problems sound familiar?

  • Unexplainable exceptions during tests

  • Capybara is timing out

  • Capybara cannot find elements on the page that are clearly present

  • I hate writing tests this is awful please send help

  • Tests take forever to do things that are fast manually

  • Order of tests is affecting stability

  • PhantomJS raises DeadClient exception

  Capybara::Poltergeist::DeadClient: PhantomJS client died while processing
  • None of this is consistently reproducible, if at all

  • Existential dread

Most of the specifics discussed here are about Rails, minitest, Capybara, Poltergeist, and PhantomJS. This is a common stack, but the principles are useful elsewhere.

Testing the right thing correctly is the first priority when writing a test. I can't help you with getting the test right, but after that comes stability, then performance. We created a gem that includes most of the things we're going to cover here, and most of the code snippets are taken directly from it.

intransient_capybara

intransient_capybara is a rails gem that combines all of the ideas presented here (and more). By inheriting from IntransientCapybaraTest in your integration tests, you can write capybara tests that are far, far less flaky. The README explains more on how to use it and exactly what it does.

The goals of intransient_capybara are debuggability; correctly configuring and using minitest, Capybara, Poltergeist, and PhantomJS; and filling in some gaps where those tools fall short (most notably with the genius rack request blocker). It combines a ton of helpful stuff into a gem that will take you 10 minutes to set up.

Test stability

Test stability is monstrously difficult to nail down. Flaky tests come from race conditions, test ordering, framework failures, and obscure app-specific issues like class variable usage and setup/teardown hygiene. Almost nothing is reproducible. We can all stop writing tests, or we can try to understand these core issues a little bit and at least alleviate this pain.

Use one type of setup and teardown. Codebases often mix setup do and def setup, and it matters which you pick, because it affects the order in which things are called. I recommend always using def setup and def teardown in all tests, because when you have to manually call super, you can choose to run the parent method before or after your own. The example below shows the two options.

class MyTest < MyCapybaraBaseClass
  # Option 1
  def setup
    # I can do my setup stuff here, before MyCapybaraBaseClass's setup method
    super # You MUST call this
    # ... or I can call it after
  end

  # Option 2
  setup do
    # I do not have to call super because I am not overriding the parent method...
    # but am I before or after MyCapybaraBaseClass's setup method??
  end
end

Use setup and teardown correctly. Your setup and teardown methods will invariably contain critical test pre- and post-conditions. They must be called. It is very easy to override one or both in a specific test and forget to call super. This creates frustrating issues that are very hard to track down. Fix these in your app, and add code to your base test class that raises if these methods weren't called. intransient_capybara does this for you.
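A minimal sketch of that kind of guard (not intransient_capybara's exact implementation; the class and messages here are illustrative) uses Minitest's before_teardown/after_teardown hooks, which run around every test even when a subclass forgets to call super:

class MyCapybaraBaseTest < ActionDispatch::IntegrationTest
  def setup
    @base_setup_called = true
    # ... shared setup ...
  end

  def teardown
    @base_teardown_called = true
    # ... shared teardown ...
  end

  def before_teardown
    super
    raise "#{self.class} overrode setup without calling super" unless @base_setup_called
  end

  def after_teardown
    super
    raise "#{self.class} overrode teardown without calling super" unless @base_teardown_called
  end
end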

Warm up your asset cache. The very first integration test fails transiently a lot? That is suspicious. GitLab had the same problem. Use a solution like theirs to warm up your asset cache before trying to run integration tests. intransient_capybara does this for you. Wow!
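A minimal sketch of that warm-up (assuming a Rails app whose root path pulls in the main asset bundles; this is not GitLab's exact code):

module AssetCacheWarmup
  # Visit an asset-heavy page once before the first real test so Sprockets
  # compiles the bundles up front instead of timing out test number one.
  def self.warm_up!
    return if @warmed_up
    Capybara.current_session.visit('/')
    Capybara.reset_sessions!
    @warmed_up = true
  end
end

class MyCapybaraBaseTest < ActionDispatch::IntegrationTest
  def setup
    AssetCacheWarmup.warm_up!
    # ... the rest of your shared setup ...
  end
end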

Wait on requests before moving on. Tests can leave AJAX requests in flight even if you don't end a test with a "hanging" action, and these create two issues. First, those lingering requests may be missing things they need in order to complete successfully, because you diligently cleaned everything up in teardown; now you get obscure failures like "missing mock for blahblah" in the next, completely unrelated test! Second, they tie up your test server's likely sole connection and produce even more obscure errors:

Capybara::Poltergeist::StatusFailError - Request to <test server URL> failed to reach server, check DNS and/or server status

You can use wait methods and those can be very helpful inside of a test, but the best way is to absolutely ensure you are done with all requests in between tests. Rack request blocker is THE way to do this. It is just awesome. Can't get enough of it. intransient_capybara includes rack request blocker.

Do not have "hanging" requests at the end of a test. If you have a test that ends with click_on X or visit ABC this request is going to hang around, potentially into the next test and interfere with it. Don't do this - it is pointless! If it is worth doing, it is worth testing that it worked. If not, change it to assert the ability to do this instead of doing it (checking presence of link vs. clicking it for example). This is less important using intransient_capybara because it always waits for the previous test's requests before moving on.

Save yourself a headache. Try hard to solve all transient test problems. You'll still get them from time to time, though. If you've got a tool to tell you what they are, you don't need them to fail your test run for you to fix them. Most likely you re-run tests and move on anyway, so why re-run the whole set of tests when you can automatically retry just the failed ones? You can use something like Minitest::Retry for this. Retrying failed tests is far from ideal, but so is having to re-run tests when you're trying to ship something. intransient_capybara has this included and has options for configuring or disabling this behavior.
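Opting in is a couple of lines in your test helper (a sketch; see the minitest-retry README for the full set of options):

# test/test_helper.rb
require 'minitest/retry'

# Re-run failed tests a few times before reporting them as real failures.
Minitest::Retry.use!(retry_count: 3)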

Test performance

After stability, improving test performance is the next most important thing. There are a ton of things that are easy to do that make tests slow.

Look at your helpers. You have helpers for your tests. They log you in, they assert you have common headers, and all sorts of other things. One of them is probably very slow and you haven't noticed. We were logging in through the UI in nearly every test, and stubbing that login instead of actually performing it cut test time in multiple projects anywhere from 40-90%.
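As an illustrative (hypothetical) example of the kind of change that paid off: stub whatever resolves the current user instead of driving the login form. The helper names below are made up, and the stubbing uses Mocha-style syntax:

# Slow: a visit, a form fill, a submit, and a redirect in nearly every test.
def sign_in_via_ui(user)
  visit '/login'
  fill_in 'Email', with: user.email
  fill_in 'Password', with: 'password'
  click_on 'Log in'
end

# Fast: skip the UI entirely and stub whatever your app uses to look up
# the logged-in user (current_user here is an assumption about your app).
def sign_in_quickly(user)
  ApplicationController.any_instance.stubs(:current_user).returns(user)
end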

Don't use assert !has_selector?. This waits the full timeout (Capybara.default_max_wait_time) before returning. If you're expecting a selector, use assert has_selector?. If you aren't, use assert has_no_selector?. Learn more from Codeship.
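In other words (with a hypothetical selector):

# Waits the full Capybara.default_max_wait_time before has_selector? gives up:
assert !page.has_selector?('.flash-error')

# Returns as soon as Capybara confirms the element is absent:
assert page.has_no_selector?('.flash-error')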

Avoid external resources. This is mostly about performance, but is also an important stability improvement. It can help you avoid this:

Capybara::Poltergeist::StatusFailError - Request to <test server URL> failed to reach server, check DNS and/or server status - Timed out with the following resources still waiting for <some external URL>

Almost everyone is susceptible to hitting external stuff in tests. You might be loading jQuery from a CDN, or have JavaScript on your checkout page that queries a payment provider with a test key. These can time out, get rate limited, and are better tested in higher-level system integration tests (acceptance testing). You should track these down and eliminate them. The code below can be included in the teardown method of your tests to help you debug your own network traffic. This method is included by default in intransient_capybara.

    def report_traffic
      if ENV.fetch('DEBUG_TEST_TRAFFIC', false) == 'true'
        puts "Downloaded #{page.driver.network_traffic.map(&:response_parts).flatten.map(&:body_size).compact.sum / 1.megabyte} megabytes"
        puts "Processed #{page.driver.network_traffic.size} network requests"

        grouped_urls = page.driver.network_traffic.map(&:url).group_by { |url| /\Ahttps?:\/\/(?:.*\.)?(?:localhost|127\.0\.0\.1)/.match(url).present? }
        internal_urls = grouped_urls[true]
        external_urls = grouped_urls[false]

        if internal_urls.present?
          puts "Local URLs queried: #{internal_urls}"
        end

        if external_urls.present?
          puts "External URLs queried: #{external_urls}"

          if ENV.fetch('DEBUG_TEST_TRAFFIC_RAISE_EXTERNAL', false) == 'true'
            raise "Queried external URLs!  This will be slow! #{external_urls}"
          end
        end
      end
    end

Don't repeat yourself. Lots of tests overlap - each test should test one thing. Tests that start with a copy/pasted pattern like visit X, click_on ABC don't all need to repeat it. One test can visit X and click_on ABC, and all the others can go straight to the page that comes after clicking on ABC. This saves a lot of time - probably 10-20 seconds every time such a pattern is factored out.

Don't revisit links. Assert that a link is present rather than clicking it; every click costs a page load. As with the last point, let some other test assert that the target page loads correctly, and pay that visit only once, over there. assert has_link? instead of click_on link.

Don't use sleep. Sleeps are either too long or too short. Writing sleep 5 might make you look cool to your friends, but it is damaging to your health and should be avoided. Don't get peer pressured into sleeps in your tests. You can use assert_text to make Capybara wait for the page to load, or write a simple helper method like wait_for_page_load!

  def wait_for_page_load!
    page.document.synchronize do
      current_path
      true
    end
  end

You can wait for ajax too. thoughtbot solved this. intransient_capybara includes these methods and uses them in teardown for you already, and makes them available for you to use inside of your own tests.
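thoughtbot's helper looks roughly like this (it assumes jQuery is the only thing issuing AJAX requests):

require 'timeout'

def wait_for_ajax
  Timeout.timeout(Capybara.default_max_wait_time) do
    loop until finished_all_ajax_requests?
  end
end

def finished_all_ajax_requests?
  # jQuery.active is the number of jQuery AJAX requests still in flight.
  page.evaluate_script('jQuery.active').zero?
end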

Avoid visit in your setup method. If you write visit in a setup method in a file that has a bunch of tests, you did something dangerous. One of our tests was visiting 3 pages before visiting more pages in the test itself. Think carefully about what each test actually needs to visit so you can minimize this. Every visit call takes 2-10 seconds, and it is easy for pointless visits to go unnoticed.

Delete all your skipped tests. We had so many that they affected performance, and there was no point to them. Fix these stubbed tests today or just delete them.

Parallelize! By breaking your tests down into suites, you can run them in parallel much more easily. You can have parallel test harnesses run SUITE=blah rake test. The matrix configuration in Jenkins makes this a lot easier. If you use something hosted like CircleCI, they can often run things in parallel even without creating suites (allowing you to specify directories per parallel container to be executed). You can try to balance out the tests run in each parallel container and get the fastest times. Our acceptance tests were almost twice as fast after less than an hour of parallelization work, and optimized parallelization with only 3 containers reduced our most important project's tests by more than half (and this again took less than a dev day - this is home-run-level stuff).
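One way to wire up SUITE=blah rake test is a suite-filtered test task (a hypothetical sketch that assumes each suite lives in its own directory under test/):

# Rakefile
require 'rake/testtask'

Rake::TestTask.new(:test) do |t|
  # SUITE=integration rake test  -> runs only test/integration/**/*_test.rb
  suite = ENV.fetch('SUITE', '**')
  t.libs << 'test'
  t.pattern = "test/#{suite}/**/*_test.rb"
end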

Find your slowest tests. You need to find or create a tool that can monitor your performance over many test runs and highlight the slowest tests so you can tackle the problems in a targeted way. Once you've dealt with systemic problems, you're left optimizing test by test. There are gems that help you output test performance, such as minitest-ci and minitest-perf.

Results

We're not perfect yet, but these tips and intransient_capybara have reduced the rate of transient failures in our tests from a whopping 40-50% of all test runs to virtually none (<1%). It only takes one transiently failed test to fail the whole run, so things have to be really stable before the suite starts passing consistently. The run time has gone from more than an hour to about 16 minutes in CircleCI (and that is not the best it can be). Acceptance tests used to take 15 minutes with around a 25% transient failure rate in pre-production environments, and 10 minutes with a low transient rate in production; they now take 2.5 minutes in pre-production and 1.5 minutes in production, with a test-caused transient rate of near zero (transients today are due to pre-production environmental issues, not the tests or their framework).

Back in 60 seconds: Reprovisioning a server in about a minute

Avvo is a growing company. Like other growing companies, we started with a small server footprint. As that footprint grew, we found that installing the OS (Ubuntu) by hand took a lot of time and effort.

The Pain

So everyone can really understand how painful our server provisioning process was, let me describe it to you.

First, we booted the host to an installation disk (ISO image). That part in itself is difficult enough if you don't have a server provisioning stack and your servers are in a remote datacenter. We typically used onboard management cards, or a network-attached KVM in the datacenter, to mount the ISO as virtual media and get the process started. After the host booted the ISO, we manually entered all of the details such as partition layout, network info, packages to install, etc. Then we waited for the OS install to complete, at which point we logged in via ssh and ran Chef on it to do the remaining configuration and install more software. In addition to that install process, we also had to manually assign an IP, then add that IP to DNS (bind).

There are a lot of options out there to solve this problem. There are full-blown server provisioning stacks that can handle much of the work. They're designed to automate the server provisioning process, make it more consistent, more reliable, etc. We evaluated many of them, including MAAS, Cobbler, OpenCrowbar, and Foreman. In general, we didn't actually dislike any of them, but none of them fit us quite right, for various reasons.

My Little Lie

Now that I've described the problem to you, let me get something off my chest: I've lied to you. We don't actually fully reprovision a server in a minute. It's currently 88 seconds. But after a hard drive wipe, our OS boots up and is ssh-able in 37 seconds. The remaining 51 seconds is disk formatting, configuration, and software installation. By the time the host is ssh-able, it has received an IP, and both forward and reverse DNS entries have been created automatically. But since Wikipedia defines software installation as part of the server provisioning process (and really, it is), I guess I lied to you.

To be fair, I'm certain that if we moved all of the software to a local repository instead of downloading it from the internet, we could get the entire process down to less than a minute. Shaving those extra 28 seconds off of the time didn't seem as important after we reduced server provisioning from a multiple hour manual process, to an 88 second automated process. Think about it, in the time it takes you to get through a standard TV commercial break, your server could be reformatted and running something completely new.

What's funny about this whole thing? Super fast server provisioning wasn't even our end-goal. Our main goals were just to build out a cluster, in an automated and maintainable way. Being able to re-provision a host rapidly is just a nice side effect of the design we chose.

The Devil in the Details

TL;DR I'll explain the little details that make this so fast:

  • I benchmarked on a VM. Some of our bare-metal hosts take more than 88 seconds just to POST and initialize firmware. VMs conveniently skip that mess.
  • We're using a cloud-config file to do most of the configuration. Chef, Ansible, Salt, Puppet, et al. are great, but for the simple stuff, cloud-init is faster.
  • Our software installation process actually boils down to a simple 'docker run' from the cloud-config file, and our container orchestration system.
  • Our network is reasonably fast (dual 10Gbit to each physical VM host)

And the last sorta-kinda little detail:

  • We don't actually install the OS on the drive. We're pxe-booting a live image, and the OS is only using the drive for data persistence. The OS does format the drive (if needed) on bootup, and stores any applicable files to disk (including the 'software' mentioned above). We are still using persistent storage, and it's still part of our server model. We just don't use it for storing or running the OS (in this cluster anyway).

NOW I get it!

So you might be saying "Oh, well no wonder you get such great times. You're just pxe-booting a live image, in a VM, on a fast network. You're not doing much software installation, and your configuration is just a simple cloud-config."

And to you, unruly naysayer, I would say: "Yep, that's right!"

Why in the World Would You Do That in Production?

There are many benefits to this approach. But here are the highlights:

  • It's easier to maintain than a full server provisioning stack
  • It's fast. As in, boot-up is fast, and the OS binaries run from RAM instead of sloooooow magnetic drives.
  • It's as reliable as our old-school hand-crafted artisanal Ubuntu installs
  • We get better usage from our disk space (we don't install gigs of OS binaries and packages)

There's a ton of other reasons we're doing this, including all of the benefits of embracing the microservices revolution, such as easy software builds, reliable testing, and simple deployments.

But, aren't there a LOT of Drawbacks?

Ok, admittedly there's downsides to this approach:

Configuration Can't be Complex

Any configuration we have to do must be covered in the scope of cloud-config, which for our OS (RancherOS), is surprisingly limited. We aren't running chef, ansible, puppet, or any other major configuration management service on these hosts. We could, but that would kind of defeat the purpose of keeping these hosts as lightweight and disposable as possible.

You might notice I said RancherOS. If you're not familiar with that particular flavor of Linux, give it a try. Similar to CoreOS, it basically just runs Docker and doesn't come with all of the cruft you get from a full-blown server OS. The kernel image we're using clocks in at 3.7MB, and the initrd is 25MB. An OS footprint of 28.7 megabytes explains why that bootup process is so fast.

We're using stock images from RancherOS though. So it's not like we have any overhead in maintaining the images. If they release a new image, we try it out, and if it works for us then we use it. Since we're using stock images, and cloud-config, a full OS upgrade is literally 88 seconds away. Trying out a new OS version is similarly fast.

As an aside, I consider complex configuration the wrong way to go anyway. If you have complex configuration management, that means you need to manage your configuration management. Some people like that, but I like to keep things simple and work on the important stuff like keeping our website healthy. So really, enforcing simple configuration is actually a bonus! If for some reason we find that we really need more complex configuration, we'll probably move that into the docker images. The more we use docker images, the less need we have for a complex configuration management system. Why deal with configuration management, when you can just define the exact state of your docker images in a Dockerfile? I guess if you don't want to maintain lots of Dockerfiles (and their associated images), then you could maintain it in a tool like chef. I don't know that using chef to build or configure your docker images would buy you much in the complexity department though. Instead of maintaining Dockerfiles, you end up maintaining recipes, cookbooks, databags, roles, nodes, and environments.

Software Options are Limited

We're limited to just using software that can be run in a container. That's a lot of software actually, but anything proprietary will need to be packaged up in a container before we can run it on these hosts. There are also a lot of positive side effects from working with containers at the OS level and cutting out the cruft of traditional package management. As one example, apt/yum both do a great job of building out a dependency chain and pulling dependencies in during an install. However, they introduce their own issues with package conflicts and silly dependency chains that are difficult to work through. With Docker images, the dependencies are in the image. Package conflicts effectively go away.

I should take a moment to mention the software security aspect here. Modern OS package management systems (yum/apt/etc.) have grown to support package signatures, trusted repos and maintainers, etc. Contrast that with downloading images from Docker Hub, where an image isn't necessarily signed or maintained by a trusted person or group. Limiting which images can be downloaded from Docker Hub, and/or using a trusted registry, helps improve the software security story. Still, for the time being, this is one area where yum and apt have an advantage.

Data Persistence is Still Iffy

Our most difficult challenge so far is figuring out how to reliably dockerize our SQL databases. Some people will be quick to say "well there's a container for mysql and a container for postgres, and a container for ..." But hold on there, cowboy: if you put your entire database in a container, where does that data go? If you store it in the container itself, that data goes away when the container is destroyed. If you have bind-mounted a volume to your container, then your SQL container is joined at the hip to the host which originally ran it. Using a "data container" and linking it to the SQL container is a popular solution, but it has the same problem of being stuck on the host it started on.

We don't want any container to be stuck to a single host. We're aiming for lightweight and disposable hosts here. Fewer pets, more cattle. If the host stores some mission-critical database, then it's no longer disposable. For that reason, we treat all on-host storage as volatile, and plan around the possibility of it being destroyed at any time without warning. Traditional approaches for SQL data reliability include backups and slave DB servers, but translating those concepts to containers comes with a new set of complexity and problems.

One of Docker's approaches to solving that problem is support for volume drivers, and we're currently looking into both Flocker and Rancher's Convoy, two popular options. We've also been discussing other ideas to solve SQL data persistence, some of them more wildly experimental than others, such as an off-cluster "super-slave" for all database containers, or sending binlogs to Kafka, but so far we haven't found a silver bullet.

As an aside, the data persistence problem is more easily solved for companies that have an enterprise-grade SAN, which we don't have (yet).

We Have to Maintain a Custom Provisioning Stack

There are a lot of moving pieces in a server provisioning stack, and we have to maintain them: a TFTP server, a web server, a DHCP server, a DNS server... and I'm sure I'm forgetting some others. In our case, we have all of that maintained in Chef. I didn't say we have NO complex configuration anywhere; we just keep it away from our Rancher cluster and maintain it in Chef. Try to imagine a picture of the Dos Equis guy here: "I don't always have complex configuration, but when I do, I use Chef". Our server provisioning stack isn't really that complicated anyway. These are all standard services, and we're not configuring them in any off-the-wall ways. The most complicated part is actually how we generate cloud-config files. We created a quick CGI script that simply calls out to consul-template to generate cloud-config files on the fly. Host-specific configuration (such as hostname, environment, etc.) is stored in our Consul cluster.

Conclusion

There are a lot of advantages and disadvantages to the cluster we've built. I highlighted the speed of reprovisioning as the topic for this article, but only because it's an interesting data point, not because it's important to our use case. Hosts in our Rancher cluster are so disposable now that even if server provisioning took 30 minutes instead of 88 seconds, I don't think we'd notice. If a disposable host dies without any impact to your services, do you really care anymore whether it took 88 seconds or 30 minutes to replace it? Something we take for granted is that building a Docker cluster let us focus less on server maintenance and more on other issues that needed our attention. Using Docker at the OS level and treating hosts as disposable moved us to a more stable and maintainable platform overall, and maybe that's a topic worth discussing all on its own.

Bootstrapping with Docker in a Non Docker World

As our company transitions from an older Chef/Capistrano-based deployment model to Docker, those of us not familiar with Docker (myself included) have had to learn a lot to keep up. Taking 20+ legacy projects, Docker-izing them all, standardizing the deployment model, and getting all of the apps to play nicely together in this new paradigm is no small undertaking. However, even if you or your company aren't quite ready to fully commit to Docker, there's no reason you can't start using Docker today for your own development work, to both make your life easier and give you some insight into how powerful a tool Docker can be.

A Brief Intro to Docker

For anyone not overly familiar with Docker, a brief introduction is in order.

Docker is a lot like a virtual machine, but without the overhead of having to virtualize all the basic system functionality. Any Docker application you run is granted more or less direct access to the system resources of the host computer, making it significantly faster to run and allowing for much smaller image sizes, since you don't have to duplicate all the OS stuff.

In practice, when you boot up something in Docker, you'll start with an image you either created yourself or downloaded off the internet. This image is basically a snapshot of what the application and surrounding system looked like at a given point in time. Docker takes this image and starts up a process that they call a container, using the image as the starting point. Once the container is running, you can treat it just like any normal server and application. You can modify the file system inside the container, access the running application, edit config files, start and stop the processes that are running... anything you'd do with a normal application. Then, at any point, you can discard the entire container and start a new container with the original image again.

These independent and disposable containers are at the heart of what makes Docker such a powerful tool. In production, this allows you to scale your system rapidly, as well as reducing the burden of configuring new hosts, since the majority of your application-specific configuration will now be stored inside your container. In this manner, Docker images can conceivably be run on any system capable of running Docker, without any per-application setup involved. Even if your company's applications aren't yet running on Docker, you can still leverage these traits to make your development environment trivially easy to set up.

Using Docker Compose to Bootstrap Your Computer

Setting up your workspace for the first time can be fairly tedious, depending on the number of services your application needs to have running in order to work. A simple Rails app could easily have several such dependencies, just to respond to simple requests. Most of our applications at Avvo require things like Redis, Memcached and MySQL... and that's before we even get into anything unusual that an application might require. When you jump into working on an application that you haven't touched before, it can sometimes take the better part of a day just to get the app to boot up locally. Luckily for us, Docker can help to greatly reduce this burden, with a little bit of help from Docker Compose.

While Docker itself gives us a great starting point for building and running images, starting up and configuring containers manually can be a little tricky. Docker Compose provides an easy and much more readable way to configure and run your containers. We can set up Docker Compose for those three services listed above, by creating a docker-compose.yml file like so:

    version: '2'
    services:
      mysql:
        image: mysql:5.6
        ports:
          - "3306:3306"
        environment:
          MYSQL_ROOT_PASSWORD: supersecretpassword

      memcached:
        image: memcached:latest
        ports:
          - "11211:11211"

      redis:
        image: redis:latest
        ports:
          - "6379:6379"

Even if you're not all that familiar with Docker Compose, the above file is fairly self-explanatory.

  • We declare three services: mysql, memcached, and redis
  • Tell Docker Compose to use the Docker images of the corresponding names for these services.
  • Declare port numbers for each service that we want to be able to access from the host machine, so that we can access each service from outside of their containers.
  • Apply some small configuration settings via environment variables, such as MYSQL_ROOT_PASSWORD.

To start these services, you just need to give the "up" command to Docker Compose from the same directory as the docker-compose.yml file above:

   ~/workspace:> docker-compose up
   Pulling redis (redis:latest)...
   latest: Pulling from library/redis
   357ea8c3d80b: Pull complete
   7a9b1293eb21: Pull complete
   f306a5223db9: Pull complete
   18f7595fe693: Pull complete
   9e5327c259f9: Pull complete
   72669c48ab1f: Pull complete
   895c6b98a975: Pull complete
   Digest: sha256:82bb381627519709f458e1dd2d4ba36d61244368baf186615ab733f02363e211
   Status: Downloaded newer image for redis:latest
   Pulling memcached (memcached:latest)...
   latest: Pulling from library/memcached
   357ea8c3d80b: Already exists
   1ef673e51c1f: Pull complete
   5dfcd2189a7d: Pull complete
   32d0f07db7eb: Pull complete
   fced47673b60: Pull complete
   e7d3555f9ff2: Pull complete
   Digest: sha256:58f4d4aa5d9164516d8a51ba45577ba2df2a939a03e43b17cd2cb8b6d10e2e02
   Status: Downloaded newer image for memcached:latest
   Pulling mysql (mysql:5.6)...
   5.6: Pulling from library/mysql
   357ea8c3d80b: Already exists
   256a92f57ae8: Pull complete
   d5ee0325fe91: Pull complete
   a15deb03758b: Pull complete
   7b8a8ccc8d50: Pull complete
   1a40eeae36e9: Pull complete
   4a09128b6a34: Pull complete
   587b9302fad1: Pull complete
   c0c47ca2042a: Pull complete
   588a9948578d: Pull complete
   fd646c55baaa: Pull complete
   Digest: sha256:270e24abb445e1741c99251753d66e7c49a514007ec1b65b47f332055ef4a612
   Status: Downloaded newer image for mysql:5.6
   Creating redis
   Creating memcached
   Creating mysql
   Attaching to mysql, memcached, redis
   mysql        | Initializing database
   mysql        | 2016-08-30 22:51:47 0 [Note] /usr/sbin/mysqld (mysqld 5.6.32) starting as process 30 ...
   redis        | 1:C 30 Aug 22:51:48.345 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
   ...
   redis        | 1:M 30 Aug 22:51:48.346 * The server is now ready to accept connections on port 6379
   mysql        | 2016-08-30 22:51:48 30 [Note] InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0
   ...
   mysql        | 2016-08-30 22:51:55 1 [Note] mysqld: ready for connections.
   mysql        | Version: '5.6.32'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)

You should get output similar to that above, indicating that all three services are running. If you didn't already have the images stored in Docker locally, they should get downloaded automatically from Docker Hub. At this point, you should have fully usable services running locally, without having had to do any manual downloading, configuring, compiling, etc... Docker takes care of everything for you. You can even easily share these Docker Compose files around your company for easy bootstrapping of new workstations.

Another added benefit of running Dockerized services is that it eliminates a pet peeve of mine, where MySQL will unexpectedly get itself into a bad state and refuse to restart. If you're running MySQL natively, you're probably going to have to either do some surgery on the MySQL file system to remedy things, or else completely uninstall/reinstall MySQL to get it back into a working state, which can be pretty tedious and error prone. With Docker, you simply delete the container and the next time you boot it up, it'll create a new container from the original image. You'll still have to set up your DB tables and data again, but it's still far simpler and faster than reinstalling the native application.

Using Docker Compose for bootstrapping can also be great in cases where you have some obscure service that's required for an application. For example, one app that we use depends on Neo4j. What does it do? I have no idea, something to do with graphs I think. And I'm pretty sure the 'j' stands for Java. But assuming I'm not touching any of the graph stuff in the code that I need to work on, it would be really nice to not have to spend hours getting this thing running locally. Docker Compose makes this a cinch, even if the application that depends on Neo4j isn't yet Dockerized:

    version: '2'
    services:
      neo4j:
        image: neo4j:3.0
        ports:
          - 7474:7474
        environment:
          NEO4J_AUTH: none

Now just a call to "docker-compose up" from the same directory as the above Compose file and we have a running Neo4j instance that our app can use.

In Summary

Just about every commonly used service will already have a Docker image publicly available. Combining that with the power of Docker Compose can make complicated and tedious bootstrapping of your workstation or project a thing of the past. Even if your company hasn't fully committed to Docker or if the particular application you're working on isn't Dockerized, getting immediate benefits out of Docker is something that you can start enjoying today.

Using Environment Variables with Elixir and Docker

If you're trying to run your shiny new Elixir app in a Docker container, you'll find a few differences from running Ruby. In particular, you'll run into a few gotchas with environment variables.

The first problem

Elixir (being built on Erlang) is pretty cool in that it gets compiled. The downside is that the environment variables you use get hardcoded at compile time. That prevents you from using the same release binary (in a Docker image) in both your staging and production environments.

In our case, we're using Phoenix, so we've got this in our config/prod.exs (I moved this from the prod.secret.exs because with ENV it isn't secret!):

# config/prod.exs
config :FooApp, FooApp.Repo,
  adapter: Ecto.Adapters.MySQL,
  username: System.get_env("DB_USER"),
  password: System.get_env("DB_PASS"),
  database: "foo_app",
  hostname: System.get_env("DB_HOST"),
  pool_size: 20

In your Dockerfile you'll want to release with exrm:

RUN MIX_ENV=prod mix do deps.get, compile, release

But once you do that, your config has the DB_USER, etc., that were available when you built your Docker image! That's not good. You'll see this when you try to run your image:

$ docker build -t foo .
$ docker run -e "PORT=4000" \
> -e "DB_USER=foo" \
> -e "DB_PASS=secret" \
> -e "DB_HOST=my.mysql.server" \
> foo ./rel/foo/bin/foo foreground
Exec: /rel/foo/erts-8.0/bin/erlexec -noshell -noinput +Bd -boot /rel/foo/releases/0.0.1/foo -mode embedded -config /rel/foo/running-config/sys.config -boot_var ERTS_LIB_DIR /rel/foo/erts-8.0/../lib -env ERL_LIBS /rel/foo/lib -pa /rel/foo/lib/foo-0.0.1/consolidated -args_file /rel/foo/running-config/vm.args -- foreground
Root: /rel/foo
17:27:20.429 [info] Running Foo.Endpoint with Cowboy using http://localhost:4000
17:27:20.435 [error] Mariaex.Protocol (#PID<0.1049.0>) failed to connect: ** (Mariaex.Error) tcp connect: nxdomain
17:27:20.435 [error] Mariaex.Protocol (#PID<0.1048.0>) failed to connect: ** (Mariaex.Error) tcp connect: nxdomain
...

You'll notice I was able to specify the HTTP port to use; that's a Phoenix-specific configuration option, not available for other mix configs.

The first solution

Luckily, Exrm has a solution for this:

  1. Specify your environment variables with "${ENV_VAR}"
  2. Run your release with RELX_REPLACE_OS_VARS=true

So we'll replace our Ecto config block with:

# config/prod.exs
config :FooApp, FooApp.Repo,
  adapter: Ecto.Adapters.MySQL,
  username: "${DB_USER}",
  password: "${DB_PASS}",
  database: "foo_app",
  hostname: "${DB_HOST}",
  pool_size: 20

And run our release:

$ docker build -t foo .
$ docker run -e "PORT=4000" \
> -e "DB_USER=foo" \
> -e "DB_PASS=secret" \
> -e "DB_HOST=my.mysql.server" \
> -e "RELX_REPLACE_OS_VARS=true" \
> foo ./rel/foo/bin/foo foreground

And we get no errors!

The second problem

But now you want to create and migrate your database with mix do ecto.create --quiet, ecto.migrate.

$ docker run --link mysql \
> -e "PORT=4000" \
> -e "DB_USER=foo" \
> -e "DB_PASS=secret" \
> -e "DB_HOST=mysql" \
> -e "MIX_ENV=prod" \
> foo mix do ecto.create --quiet, ecto.migrate
** (Mix) The database for Foo.Repo couldn't be created: ERROR 2005 (HY000): Unknown MySQL server host '${DB_HOST}' (110)

What's happening is that mix doesn't know to replace the "${ENV_VAR}" strings with our environment variables. It isn't being run through the Exrm release (the mix tasks aren't even compiled into the release).

The second solution

This is easy to fix: we just add back what we started with, System.get_env:

# config/prod.exs
config :FooApp, FooApp.Repo,
  adapter: Ecto.Adapters.MySQL,
  username: System.get_env("DB_USER") || "${DB_USER}",
  password: System.get_env("DB_PASS") || "${DB_PASS}",
  database: "foo_app",
  hostname: System.get_env("DB_HOST") || "${DB_HOST}",
  pool_size: 20

This gives us both options: the variables are either set at "compile" time when running mix tasks, or left unset so that when the release runs it substitutes the runtime environment variables.

TL;DR Summary

With Elixir, Phoenix, Exrm, and Docker, you can build a release and run the same binary in staging and production. To specify different runtime environments and still be able to run mix tasks like migrate, you need to combine two solutions.

Run your release with RELX_REPLACE_OS_VARS=true and define your config variables with System.get_env("ENV_VAR") || "${ENV_VAR}".

🎉

Metrics that don't suck

Metrics, from a developer's perspective, are simultaneously one of the most compelling and one of the most boring things you can ever be asked to look into. There's an old adage that goes something like, "If you want to improve something, measure it." There's a certain amount of logic behind that. Our instincts and best guesses can take us pretty far when trying to develop useful and robust applications, but there's nothing quite like hard data to show us the real shape of the world.

Data can give you a ton of insight into your application, and there's a certain amount of OCD-like satisfaction in creating the perfect six-table join statement in SQL that exactly captures the metric you were trying to shake out. However, it is a very particular kind of person that's going to be willing to do this day-to-day. It's one thing to export a CSV from the database, filter the content down, create a pivot table or Pareto chart, and email it out every once in a while. But the monotony of having to do that once per day or once per week will drive any developer to madness.

What we really want is a programmatic way to find out this information and share it in an easily consumable manner. Yes, we can generate CSV reports that get mailed out daily or look at slides at the quarterly company meeting; those can provide a lot of value. But for our daily lives, what we really want are metrics that people are going to want to look at. We want metrics that motivate people. We want metrics that inform you of your app's daily successes and failures. We want metrics that let you know when something has gone terribly, terribly wrong with your application. And we want metrics that you can read and understand at a glance.

Enter Status Board

What is Status Board?

Conveniently named, Status Board is an iOS application written by a company named Panic that allows you to create custom status boards for anything you'd like. All you need is a TV, an iPad, and the right cables to connect the two, and you're in business. Lucky for me, thanks to some careless luggage handlers at the Westin in Portland, OR, I had a slightly maimed iPad that was no longer suitable for personal use that I could sacrifice to the cause.

For getting live data into the app, there are basically two approaches you can take:

  1. You have your application upload a file with the data you want to a Dropbox account, then have the Status Board periodically pull the data down to be displayed. This method... has simplicity going for it, but that's about it. To get it to work, you're going to have to write some kind of CSV export and then have something like a cronjob running to push the file all day long. It works, but you're going to be really limited in what you're going to be able to display in the Status Board app, since it only allows you to present CSV data in a couple different formats.

  2. Render a web page outside of the app that the iPad can connect to, which gives you access to what they call the DIY panel. Basically, you give the app an HTML endpoint and it will pull down that page and render it in the app, every 10 minutes by default. Now that you're dealing in straight HTML, you can render almost anything you'd like. Tables, graphs, maps, images... anything you can render from the given web app, you can put straight onto your board.

Nitty gritty

At Avvo, I was lucky enough that we had a Rails app that was accessible from inside the office network and also had access to all the data I could conceivably want to show. For example, as app developers we might like to know what our daily sales stats were for the last week, so we can write a simple HTML page that shows the count for each of the last 7 days (with the previous week's sales for that day in parentheses, for comparison).

Presenter code:

class StatsPresenter
    def date(days_back)
        "#{days_back.days.ago.month}/#{days_back.days.ago.day}"
    end

    def day_of_week(days_back)
        return "Today" if days_back == 0
        Date::DAYNAMES[days_back.days.ago.beginning_of_day.wday]
    end

    def days_sales(days_back)
        Purchase.where(:created_at => days_back.days.ago.beginning_of_day..days_back.days.ago.end_of_day).count
    end
end

View code (written in Slim):

- stats_presenter = StatsPresenter.new

table
  - for days_back in 0..6
    tr
      td.date = stats_presenter.date(days_back)
      td.day = stats_presenter.day_of_week(days_back)
      td = "#{stats_presenter.days_sales(days_back)} (#{stats_presenter.days_sales(days_back + 7)})"

This gives you a simple table that Status Board can render; you just have to create a new DIY panel in the app and point it at this page:

DIY configuration

And voila:

Week-over-week sales stats

Note: I found it necessary to tell the controller to skip layout rendering, since we had some standard headers/footers and such in our layouts that we didn't want rendered in each panel. So you may want to add this to your stats controller:

layout false

You can get pretty fancy with what you display. Let's take it one step further and use the Google Maps API to display some data. We roll out my team's product to certain states first, so we wanted a simple way to track state-by-state progress. We decided to highlight each state with a specific color to indicate its status: red indicates that there's a problem with the state, and blue indicates that the state is ready to go.

Embedding a Google map onto your status board is fairly trivial:

<div id="map"></div>

<script src="https://maps.googleapis.com/maps/api/js"></script>
<script type="text/javascript">
function initialize() {
    var mapCanvas = document.getElementById('map');
    var mapOptions = {
      center: new google.maps.LatLng(38.0000, -96.0000),
      zoom: 4,
      mapTypeId: google.maps.MapTypeId.ROADMAP,
      disableDefaultUI: true
    }
    var map = new google.maps.Map(mapCanvas, mapOptions);
}

google.maps.event.addDomListener(window, 'load', initialize);
</script>

This gets you a simple map of the United States.

Basic Google Map

Next, Google also allows you to overlay polygons of your own onto the map. You do this by passing in the latitude and longitude of each coordinate of the polygon, then Google strings them all together to get the desired shape in whatever color you specify. I found a rough listing of the necessary coordinates online and converted them into a yml file for the application to use. Now, you have to be able to show the polygon data in a format that is readable by the Google API. The simplest way I found was to create a separate XML endpoint that returns the polygons you want to display, along with their color, which can then be read via JavaScript and sent to Google.

Presenter code:

  class StatsPresenter
    def state_status
      @states = []

      State.all.each do |state|
        state_data = {}

        state_data["name"] = state.name
        state_data["points"] = state_polygons[state.name.downcase.tr(" ", "_")]
        state_data["color"] = color_for_state(state)

        @states << state_data
      end

      @states
    end

    def state_polygons
      @state_polygons ||= YAML.load_file(Rails.root.join('db/', 'domain/', 'state_polygons.yml'))
    end

    def color_for_state(state)
      if state_functional? state
        return "#0000ff"
      else
        return "#ff0000"
      end
    end
  end

Along with an XML builder:

  xml.instruct!
  xml.states do
    @stats_presenter.state_status.each do |state|
      xml.state(:name => state["name"], :color => state["color"]) do
        state["points"].each do |point|
          xml.point(:lat => point["lat"], :lng => point["lng"])
        end
      end
    end
  end

This will create the XML output you need for the state polygons:

<?xml version="1.0" encoding="UTF-8"?>
<states>
  <state name="Arizona" color="#0000ff">
    <point lat="36.9993" lng="-112.5989"/>
    <point lat="37.0004" lng="-110.8630"/>
    <point lat="37.0004" lng="-109.0475"/>
    <point lat="31.3325" lng="-109.0503"/>
    <point lat="31.3325" lng="-111.0718"/>
    <point lat="32.4935" lng="-114.8126"/>
    <point lat="32.5184" lng="-114.8099"/>
    <point lat="32.5827" lng="-114.8044"/>
    <point lat="32.6246" lng="-114.7992"/>
    ...

Finally, you need the JavaScript to retrieve, process, and send the polygons to Google. You do this with a simple jQuery call to your XML status endpoint, which you add to the Google Maps configuration from above:

<script src="https://maps.googleapis.com/maps/api/js"></script>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<script type="text/javascript">
  function initialize() {
    var polys = [];
    var mapCanvas = document.getElementById('map');
    var mapOptions = {
      center: new google.maps.LatLng(38.0000, -96.0000),
      zoom: 4,
      mapTypeId: google.maps.MapTypeId.ROADMAP,
      disableDefaultUI: true
    }
    var map = new google.maps.Map(mapCanvas, mapOptions);

    jQuery.get("/api/1/stats/state_status.xml", {}, function (data) {
      jQuery(data).find("state").each(function () {
        var color = this.getAttribute('color');
        var points = this.getElementsByTagName("point");
        var pts = [];
        for (var i = 0; i < points.length; i++) {
          pts[i] = new google.maps.LatLng(
            parseFloat(points[i].getAttribute("lat")),
            parseFloat(points[i].getAttribute("lng"))
          );
        }
        var poly = new google.maps.Polygon({
          paths: pts,
          strokeColor: '#000000',
          strokeOpacity: 1,
          fillColor: color,
          fillOpacity: 0.35
        });
        polys.push(poly);
        poly.setMap(map);
      });
    });
  }

  google.maps.event.addDomListener(window, 'load', initialize);
</script>

And now you can watch the status of each of your states in near-realtime:

Map with status overlays

Using DIY panels along with the versatility of your web applications, you can make a rich set of panels and pages for your status board that let you see how your app is performing at any given time, at a glance.

Sample stats

(Note that all the data is faked out for the purposes of these screenshots)

What's behind the microservices trend?

A growing development team can be super exciting. Imagine all the new stuff we'll get done! There'll be so many more people to learn from. And it probably means that the business is moving in the right direction.

But at some point, you'll find yourself getting less done with more people. Maybe you're starting to hear about how the work feels like it's dragging. The words "unmaintainable" and "stuffed full of bugs" start to get thrown around.

Just as communication gets exponentially harder as a team grows, code can become exponentially harder to work with as the team building it grows. How can you tell if it's happening?

  • Onboarding is harder. The goal was having a new dev shipping on the day they join. But now it takes up to a week to get their environment set up, and to learn enough about the code to make a simple text change.

  • You can't easily make a change without breaking unrelated code. Even if the team has been good about reducing coupling, over a few years, if code can be coupled, code will be accidentally coupled.

    In the best case, this costs dev time. In the worst, it causes errors or site downtime.

  • Shipping takes longer. When everyone's working in the same codebase, you'll have a decision to make: Do you want to batch changes and ship them all at once? Or would you rather have everyone wait in line to ship?

    If you decide to batch changes, you'll have a lot of integration pain. If you decide to have a ship queue, that queue will grow as you add more devs and ship more often. Instead of working on the next thing, devs spend half their day at half-attention, waiting for their code to go out.

  • Tests take longer to run. You want all the tests to pass before you deploy, right? So not only does this make you lose more time when you're shipping, it also delays everyone in line behind you. Hopefully your tests pass consistently!

  • Ownership becomes unclear. When something breaks inside the code, who's responsible for investigating and fixing it? When ownership is muddled, entire features become "someone else's problem," and don't get the care they deserve.

What do we want?

What would a better world look like? It'd be awesome if we had:

  • Apps based around simple ideas that are easy to understand.
  • Isolated sections of code, that can't affect each other.
  • Loose dependencies, so small pieces of the app can ship independently.
  • Fast tests.
  • Clear ownership.

Small apps or services, coordinating to get work done, hit every one of those factors. Because, in general:

  • A smaller app is easier to understand than a large one. There's less code to worry about.
  • If you isolate code inside a service, another app can't mess with it. You can only make changes through the interface.
  • If you know what a service's clients are expecting from it, you should be able to change and ship it independently.
  • If you're shipping a smaller piece of a large app, you'd only have to run some of the larger app's tests to feel comfortable shipping it.
  • It's easy to assign ownership of all of a small app: "Hey, Katie, you're responsible for the Account service."

All of these benefits have contributed to Microservices becoming a trend.


The traditional fix to exponentially growing communication is to break large teams into smaller autonomous teams. This works for software, too. It can turn exponential growth into more linear growth. But it's good to understand why this works -- what problems microservices are meant to solve. Because, like every decision in software development, it involves tradeoffs.

If you're not having any of those problems I described earlier, it's probably not a great idea to transition to services. Because there are some pretty big problems with a service-based architecture:

  • The code may be simpler, but the relationships between coordinating services become more complicated. This is something your language probably won't be able to help with. You'll need to work to make that coordination visible, so you can see those connections and make sure they make sense.

  • Services can add a lot of busywork when you start building a new app. If you have an app, a client, a contract, and a service, you'll sometimes have to tweak and ship four repositories to get any work done.

  • Tests are harder to write, and can be more brittle. A lot of your tests will depend on the network. You'll have to accept the brittleness of relying on code running somewhere else, or mock the connections out. If you're mocking, your tests might pass when they should fail, which is a big problem.

  • You and the team have to define solid patterns for communicating. You'll decide how to communicate errors and metadata, how and when to retry failed requests, deal with caching, serialize and deserialize data, and lots of other things. This is stuff that a big monolithic app will give you for free.

  • You'll face problems like cascading failures, thundering herds, and stale data, which always seem to come at the worst time.

  • You've now built a distributed system. Distributed systems, especially when things fail, act unpredictably.

  • If you're using HTTP to communicate between services, you've probably added latency. Your app might be slower, or you might add caching to try to speed things up. If you do that, you have to worry more about one of the two hardest problems in computer science -- cache invalidation. Not only that, but cache invalidation across multiple apps.

And there's even more! These sound like huge problems, and they are -- but then, so are the problems related to a growing team on a growing app.

The key is recognizing when the problems of team and code growth are on the path to outweighing the problems of a service-based architecture.

Most of the companies that have moved to a microservice architecture seem very happy about it. We're still in the process of making the transition, but so far we've seen huge wins from it.

If helping us make that transition sounds interesting to you, let me know -- we're always looking for great new people to join our team.

Building JSON-based Rails SOA (part 1)

Let me tell you a story: Once upon a time there was a rails app. It was a good rails app, like all rails new apps. But over time a darkness started shrouding the land. The app grew. Models started piling up. The developers pushed the models deep into the dungeons of namespaces. And there they sat, in the darkness, quietly growing old and crusty with unsupported code.

But one day, there was great upheaval. The knights of SOA have arrived! They took those models out of the dungeons and cleaned them. They wrapped these models in new rails apps, smaller and shinier, with APIs to protect the users from the raw power of the models. The old app was rebuilt on top of the APIs with clearly defined boundaries. And everyone lived happily ever after.


Or something like that. Let's talk about what SOA means in the Rails ecosystem. Or, rather, what it means in the Avvo ecosystem. Over time, as our app grew, we decided that it would be best to cut it up into services. This will be a brief overview of our stack.

Our main app is still a rails app and our services are gently modified rails-api apps. They communicate through JSON. The client uses the JsonApiClient gem to formulate the requests and parse the responses.

JsonApiClient

JsonApiClient is a neato little gem that Jeff Ching wrote. It makes it easy for the client to never have to worry about paths or request parsing. All of that is done by the gem.

Modified rails-api

Our implementation, while unfortunately not public, is a thin wrapper around rails-api that formats the responses as:

  {
    resource_name: [ ... ], # ActiveModel::Serializer serialized array of objects
    meta: {
      status: HTTP status,
      page: current page,
      per_page: count of entries per response page,
      total_pages: total number of pages with the above per_page count,
      total_entries: total number of records
    }
  }
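
For instance, a first page of posts for the blog example below might come back looking roughly like this (the values are made up for illustration):

{
  posts: [
    { id: 1, title: "Our first post",  updated_at: "2015-06-01T12:00:00Z" },
    { id: 2, title: "Another post",    updated_at: "2015-06-03T09:30:00Z" }
  ],
  meta: {
    status: 200,
    page: 1,
    per_page: 25,
    total_pages: 4,
    total_entries: 100
  }
}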

Example

It's easier to explain how this works with an example. Let's pretend that this blog is a rails app that has an API backend and examine how we would render the first page.

How we talk to one another

The blog makes a call to the blagablag API to ask for all the posts that may be available (or, really, the first page). Let's look at what the code would look like.

Main blog app

Controller

The controller looks much like any other controller.

class BlogController < ApplicationController
  def index
    # first page of results.
    @posts = Blagablag::Post.order(:updated_at).to_a
  end

  def show
    @post = Blagablag::Post.find(params[:id])
  end
end

With the exception that you're now asking the Blagablag::Post class for info, which is a JsonApiClient class.

Client

module Blagablag
  class Post < JsonApiClient::Resource
    # this would normally be in a base class
    self.site = "blagablag.avvo.com/api/1"
  end
end

Sure is empty in there, huh? That's because JsonApiClient::Resource is handling all of the routing for us. It knows how to build all the standard CRUD routes, so all you have to do is call them.
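
To give a feel for what that buys you, here's a rough sketch of how the standard resource calls map to requests against the self.site defined above (the exact paths and query-string encoding depend on the gem version):

# All of these come from JsonApiClient::Resource -- no routes or HTTP
# plumbing in our own code.

Blagablag::Post.find(42)
# => GET blagablag.avvo.com/api/1/posts/42

Blagablag::Post.where(user_id: 6).order(:updated_at).to_a
# => GET blagablag.avvo.com/api/1/posts with the filter and sort
#    encoded as query parameters

Blagablag::Post.create(title: "New post")
# => POST blagablag.avvo.com/api/1/posts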

Note: It's possible (and easy) to build custom routes, but you have to define those in the client. RTM for more details.

blagablag API

The API side is a bit more involved (for you, because we do not currently have an open source implementation of the server), but a simplified version would look something like:

module Blagablag
  module V1
    class PostsController < ActionController::API
      # main controller to handle all where requests
      def index
        @posts = Post.scope.where(params)
        @posts = @posts.order(params[:order]) if params[:order]
        # process page params
        @posts = paginate(@posts)

        render json: format(@posts)
      end

      # where client.find(id) goes
      def show
        @post = Post.find(params[:id])

        render json: format(@post)
      end

      private

      # this is all greatly simplified for the example

      def format(objects)
        {
          posts: Array(objects),
          meta: {
            status: 200,
            page: @page,
            per_page: @per_page,
            total_pages: @total/@per_page,
            total_entries: @total
          }
        }
      end
    end
  end
end

And that's pretty much it. There's a little bit of magic that we do with our servers that makes some of this a bit cleaner, but that is our stack. We're still learning how to manage all the servers and interdependencies, but this has sped up our system dramatically and has forced us to really think about system design in a way that our old monolith never could.

As a developer, what do you value?

Once a company starts to grow, it gets harder for developers to get to know each other. But without some kind of compatibility between how people on the team make decisions, things get chaotic, quickly.

The problem starts to show up between teams, when it takes way too many meetings to get systems built by the same company to talk to each other. Or there's culture shock when people move between teams, which leads to less movement. That makes this problem even worse. Soon, meetings become arguments. People can't even understand the position of the person across the table, because they're starting from different fundamental assumptions. And nothing spawns more useless, frustrating meetings than arguments between people who, deep down, believe different things.

What's the solution?

There are a few ways to solve this problem. Management could dictate how decisions must be made. "You will use Java to solve all problems." "Every team will communicate using JSON." This works, but at a cost -- every decision you take away from someone removes a reason you hired that person to begin with. Besides, you don't need everyone to make the same decision, you need compatible decision-making.

Maintaining a strong culture is often a better option. But "culture," within a company, has a fuzzy meaning. Maybe it's how you hire (which, in the worst case, means "we hire people like ourselves.") It could mean Whiskey Fridays, or Ping-Pong in the lunchroom, or just about anything else you could say about a company. It has something to do with improving shared decision-making. But that's often buried under everything else that falls under the category of "culture."

Context and Taste

There's one specific part of culture, though, that has a strong impact on how people make decisions. I think I first saw the idea in Netflix's culture deck, where they refer to it as "Context" (starting on slide 79).

Context is about creating shared understanding. It's about agreeing on things that we, as a team, typically value. It's about sharing our assumptions, and what's behind them. It's being open about our priorities and goals. It's about building a framework for making good, compatible decisions, without dictating the decision from the top down.

GitHub used a similar concept, which Kyle Neath called "taste." Here's how he describes it:

I’d argue that an organization’s taste is defined by the process and style in which they make design decisions. What features belong in our product? Which prototype feels better? Do we need more iterations, or is this good enough? Are these questions answered by tools? By a process? By a person? Those answers are the essence of taste. In other words, an organization’s taste is the way the organization makes design decisions.

"Zen."

To help capture taste, Kyle borrowed an idea from Python, and defined GitHub's "Zen" -- that is, a short list of statements that answer the question, "Why did you do that?" This is what he came up with:

  • Responsive is better than fast.
  • It’s not fully shipped until it’s fast.
  • Anything added dilutes everything else.
  • Practicality beats purity.
  • Approachable is better than simple.
  • Mind your words, they are important.
  • Speak like a human.
  • Half measures are as bad as nothing at all.
  • Encourage flow.
  • Non-blocking is better than blocking.
  • Favor focus over features.
  • Avoid administrative distraction.
  • Design for failure.
  • Keep it logically awesome.

Bringing it back to Avvo

To us, this seemed like a great way of capturing some of the core things we, as a dev team, fundamentally value. Some part of our context, or taste. Here's the newest version of what we came up with:

  • Invest in yourself and your tools
  • Defend the user
  • Your teammate’s problems are your problems
  • Attack the noise
  • Take ownership
  • Leave code better than you found it
  • Favor convention over reinvention
  • Explicit is better than implicit
  • Understand why
  • Prove it
  • Keep moving forward
  • Deliver business value

And, like GitHub, ours now exist in a git repository -- ready for updates, pull requests, and comments.

Once you have these values written down, it's amazing where they pop up. During disagreements, it's helpful to tie positions back to these values. In interviews, it's interesting to see how candidates' responses relate to these. They come up during feedback, in meetings, and in 1:1s. They immediately tell people outside of Avvo how compatible they'll be with the kind of decisions we make. And all of these are clear enough that they can inform good decisions, without enforcing a specific decision.


At Avvo, our development team is growing to the size where people can't know everything that's going on. Through ideas like this, we're beginning to define what it means to be a developer at Avvo. It's an opportunity that only comes around a few times in a company's lifetime! If that sounds exciting to you, and you'd like to help us shape this team, get in touch -- through either our careers page or a simple email to me. I'd love to hear from you.

Parsing JSON requests with deserializers

Once upon a time, you have a Rails server. This server does all kinds of wonderful things, surely. One of those things is taking JSON, parsing it, and storing it somewhere.

Here's my question to you: how do you parse the JSON and where? Do you do it in the controller? Do you do it like this?

class SomethingJsonController < ApplicationController
  def create
    Model.create(strong_params)
  end

  def update
    Model.update_attributes(strong_params)
  end

  private

  def strong_params
    params.require(:blah).permit(:blah, :blah, :blah)
  end
end

Sure. You can. But what happens when the JSON you get is all weird and nested? Even worse, what if it isn't a 1:1 mapping of your model? You end up with some intensely unpleasant controller code. Let's look at an example to show you what I mean.

An example

OK. We'll say that you have an endpoint that talks to a service/device and takes a JSON blob that looks like this:

{
  "restaurant_id" : 13,
        "user_id" : 6,
      "dish_name" : "risotto con funghi",
    "description" : "repulsive beyond belief",
        "ratings" : {
                        "taste" : "terrible",
                        "color" : "horrendous",
                      "texture" : "vile",
                        "smell" : "delightful, somehow"
                    }
}

But your model doesn't directly map to that. In fact, it's flat and boring, like

# DishReview model

t.belongs_to  :restaurant
t.belongs_to  :user
t.string      :name # field name different from API (dish_name)
t.string      :description
t.string      :taste
t.string      :color
t.string      :texture
t.string      :smell

The problem

What many people would do (assuming you can't change the incoming JSON) is try to parse and modify the params in the controller, which ends up looking roughly like this:

class DishReviewController < BaseController

  def create
    review_params = get_review_params(params)
    @review = DishReview.new(review_params)
    if @review.save
      # return review
    else
      # return sad errors splody
    end
  end

  # rest of RUD

  protected

  def permitted_params
   [
      :restaurant_id,
      :user_id,
      :dish_name,
      :description,
      :taste,
      :color,
      :texture,
      :smell
    ]
  end

  def get_review_params(params)
    review_params = params.require(:review)

    review_params[:name] ||= review_params.delete(:dish_name)

    ratings = review_params.delete(:ratings)
    if (ratings.present?)
      ratings.each{|rating, value| review_params[rating] = value if valid_rating?(rating) }
    end

    review_params.permit(permitted_params)
  end

  def valid_rating?(rating)
    [ "taste", "color", "texture", "smell" ].include? rating
  end
end

Man, that sure is a lot of non-controller code inside that controller. 30 lines, in fact. And that's if you have the same params coming in for all the actions. What if your update and create take different params? Then it'll get all nasty and you'll start shoving code into concerns; it'll be hard to read, follow, maintain, and refactor.

Our solution

But enough of this Negative Nancy talk. I have options! Well, just one, really. It's a gem we call "Deserializer"!

Here at Avvo, my team is working on cross-server communication and building out new APIs as we scale our product. We serialize data with ActiveModelSerializer, and it was frustrating to parse the generated JSON by hand on the receiving side. So, to ease the pain of perpetually mangling hashes in the controller, I wrote this gem.

So what does this "deserializer" of yours do, exactly?

The Deserializer acts as the opposite of AMS. AMS takes an object, and converts it into JSON. The deserializer takes in params (incoming JSON), and converts them into model consumable data. It does not create an object out of those params. Really, it's a glorified hash mangler.

Great. Whatever. How does that help me?

Using the example above, let's look at what our code will look like with a deserializer

controller

class DishReviewsController < YourApiController::Base
  def create
    review_params = DishReviewDeserializer.from_params(params)
    DishReview.create( review_params )
  end

  # RUD
end

"Wow!", you say, "That's so tidy and neat!". You are correct. And the deserializers aren't too bad either. Let's have a look

deserializers

# DishReviewDeserializer

module MyApi
  module V1
    class DishReviewDeserializer < Deserializer::Base
      attributes  :restaurant_id,
                  :user_id,
                  :description

      attribute   :name, key: :dish_name

      has_one :ratings, :deserializer => RatingsDeserializer

      def ratings
        object
      end

    end
  end
end

# RatingsDeserializer:

module MyApi
  module V1
    class RatingsDeserializer < Deserializer::Base

      attributes  :taste,
                  :color,
                  :texture,
                  :smell
    end
  end
end

"Hot dog!", you exclaim, "Those look just like my serializers on the other side! It's as if the interface is written to match that of AMS!" They sure do. And it sure is.

Now, not only are your concerns separated, but you can reason about what your code does and understand what data is coming in by just looking at the deserializers.

As a nice bonus, since the deserializer ignores undefined keys, you no longer have to strong param anything -- but you still can if you want (because it's just a hash mangler). There's even a function to help you: MyDeserializer.permitted_params will give you the list of parameters that the deserializer expects to get.
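
To make the "glorified hash mangler" point concrete, here's roughly what the call above returns for the example JSON (illustrative output; string-vs-symbol keys and ordering will depend on your params):

params = {
  restaurant_id: 13,
  user_id: 6,
  dish_name: "risotto con funghi",
  description: "repulsive beyond belief",
  ratings: {
    taste: "terrible",
    color: "horrendous",
    texture: "vile",
    smell: "delightful, somehow"
  }
}

MyApi::V1::DishReviewDeserializer.from_params(params)
# => {
#      restaurant_id: 13,
#      user_id: 6,
#      name: "risotto con funghi",
#      description: "repulsive beyond belief",
#      taste: "terrible",
#      color: "horrendous",
#      texture: "vile",
#      smell: "delightful, somehow"
#    }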

For more detailed info, feel free to RTM and contribute.

Switching from Resque to ActiveJob

At Avvo we wanted to add some metrics and logging to our background job processing. We're using Resque, so it wouldn't take much to write a module to wrap enqueue and perform to do the logging. After investigating the new ActiveJob abstraction in Rails, we decided there would be benefits to switching.

This post is intended as a brief guide to switching from plain Resque to ActiveJob. In addition, it covers the Minitest test helper and integration with Resque::Scheduler.

What does ActiveJob do?

ActiveJob is built into Rails, and provides a common interface for background jobs. It also adds callback support around enqueuing and performing jobs. By default this is used to add logging to your jobs.
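
Those callbacks are also where the metrics and logging we wanted can live. Here's a minimal sketch, assuming a shared base class that your jobs inherit from (the class name and log messages are ours to invent):

class BaseJob < ActiveJob::Base
  # Time and log every job in one place instead of per job class.
  around_perform do |job, block|
    started_at = Time.now
    block.call
    Rails.logger.info("#{job.class.name} finished in #{Time.now - started_at}s")
  end

  after_enqueue do |job|
    Rails.logger.info("#{job.class.name} enqueued on #{job.queue_name}")
  end
end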

Coming from Resque

Plain Resque was a good starting point years ago, but these days we expect more from our libraries. There are no logging or metrics hooks built in. By switching to ActiveJob we get all that for free. Plus it might make switching to Sidekiq easier.

Warning

Keep in mind during the upgrade that you may have jobs in the queue when you deploy your new ActiveJobified code. Because of this, we want to keep the old job classes around and have them call the new ActiveJob code. Just have the old self.perform methods instantiate the new job class and call perform.

Mechanics of using ActiveJob

Switching involves a few steps:

  • Inherit from ActiveJob::Base
  • use queue_as instead of @queue =
  • configure in config/application.rb:

    config.active_job.queue_adapter = :resque
    config.active_job.queue_name_prefix = "my_app"
  • Change perform methods from class to instance.
  • Switch from using Resque.enqueue KlassName to KlassName.perform_later.

For instance, we'd change the following class:

class OldCsvParser
  @queue = :csv_parsing

  def self.perform(filename)
    # ... do stuff
  end
end

to:

class CsvParserJob < ActiveJob::Base
  queue_as :csv_parsing

  def perform(filename)
    # ... do the stuff
  end
end

and update the original class to just call the new one:

class OldCsvParser
  @queue = :csv_parsing

  def self.perform(filename)
    CsvParserJob.new.perform(filename)
  end
end

Testing with ActiveJob::TestHelper

Coming from ResqueUnit, the switch to ActiveJob::TestHelper was easy. Include the module in your test case and use methods like assert_enqueued_jobs.

For instance, if we enqueued a CsvParserJob from a controller action, our test might look like this:

class CsvParsingControllerTest < ActionController::TestCase
  include ActiveJob::TestHelper

  test "enqueues the job to parse the csv" do
    filename = "/path/to/csv/file.csv"

    assert_enqueued_with(job: CsvParserJob, args: [filename]) do
      post :create, filename: filename
    end
  end
end

Integration with ResqueScheduler

Resque::Scheduler enqueues jobs directly with Resque. You'll need to either change that behavior or wrap jobs in the schedule with a JobWrapper.

Luckily, that work has already been done: the ActiveScheduler gem wraps the jobs so the callbacks are called.

Installation is a snap: just follow the directions in the project's readme to update your Resque::Scheduler initializer.
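
For reference, the wiring ends up looking something like this (adapted from the ActiveScheduler readme; adjust the initializer and schedule paths to match your app):

# config/initializers/resque_scheduler.rb
require 'active_scheduler'

schedule = YAML.load_file(Rails.root.join('config', 'resque_schedule.yml'))

# Wrap every scheduled job in ActiveJob's JobWrapper so the ActiveJob
# callbacks still fire when Resque::Scheduler enqueues it.
Resque.schedule = ActiveScheduler::ResqueWrapper.wrap(schedule)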

Downsides: ResqueWeb only shows JobWrapper job classes

Because ActiveJob wraps jobs in a JobWrapper, resque-web no longer shows the class of the actual job running on the dashboard. Clicking through and viewing the arguments does show the class. This could be a hassle if you often view the job queue.

Similarly, the Schedule tab in resque-web is a little cluttered, but still readable.

How swappable storage and fakes lead to cleaner, more reliable tests

Let's say you've written a Rails app that runs background jobs using Resque. How do you test those jobs? You can run them inline and check that what you expected to happen, actually happened:

setup do
  Resque.inline = true
end

def test_account_updated
  do_something_that_queues_a_background_job_to_update_an_account(test_account)
  assert_equal :updated, test_account.status
end

That works, but it's missing something. You backgrounded that job for a reason -- maybe it's slow, or you don't want it to happen right away. If you put that code in a background job, you probably care more about the job being queued, and less about what it does when it runs. (Besides, you can test that part separately).

I wrote resque_unit to solve this problem. resque_unit is a fake version of Resque. It intercepts your jobs and gives you some more assertions:

def test_account_updated
  do_something_that_queues_a_background_job_to_update_an_account(test_account)
  assert_queued UpdateAccount, [test_account.id]
end

On the inside, resque_unit rebuilt part of Resque's API to change how jobs were queued.

This was great for an initial implementation. It was fast, you didn't need Redis on your continuous integration server, and it was easy to understand. But, as reimplementations of an API tend to do, it fell behind. It got way more complicated. More bugs popped into GitHub Issues, and more code had to be borrowed from Resque itself.

Besides all that, there was a really big gotcha for new users: if you loaded resque_unit before you loaded Resque, resque_unit would stop working.

Looking at other options

My favorite way to write an easily testable client is to build swappable storage and network layers. For example, imagine if Resque had a RedisJobStore and an InMemoryJobStore that each implemented the same API. You could write most of your unit tests against the InMemoryJobStore, and avoid the dependency and complication of Redis. But, since Resque is designed to work specifically with Redis, this wasn't an option.
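
To illustrate the shape of that idea (purely hypothetical -- Resque has no such seam), both stores would implement the same tiny interface, and the rest of the library would only ever talk to that interface:

# Hypothetical sketch -- this is not real Resque code.
class RedisJobStore
  def initialize(redis)
    @redis = redis
  end

  def push(queue, payload)
    @redis.rpush("queue:#{queue}", payload)
  end

  def pop(queue)
    @redis.lpop("queue:#{queue}")
  end
end

class InMemoryJobStore
  def initialize
    @queues = Hash.new { |hash, key| hash[key] = [] }
  end

  def push(queue, payload)
    @queues[queue] << payload
  end

  def pop(queue)
    @queues[queue].shift
  end
end

# Production wires up RedisJobStore; unit tests swap in InMemoryJobStore.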

Instead, the answer was to go a level deeper. What if the Redis client itself had both a RedisStore and an InMemoryStore? It turns out this is a thing that exists, called fakeredis.

fakeredis re-implements the entire Redis API. But instead of talking to a running Redis server, it works entirely in-memory. This is really impressive, and seemed worth a try.

Bringing fakeredis to resque_unit

If you were working with real Resque in development or production, how would you check that a job was queued?

You wouldn't have to write much extra code. You'd queue a job normally. You could look for a queued job with peek. You'd check queue size with size. And you'd clean up after yourself with remove_queue or flushdb. If you wanted to run the jobs inside a queue you'd have to pretend you were a worker. But for the most part, you'd barely have to write code.
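
Concretely, that "barely any code" looks something like this with the real Resque API (assuming the UpdateAccount job from earlier declares @queue = :default):

Resque.enqueue(UpdateAccount, 42)

# Peek at up to 10 queued jobs without popping them.
Resque.peek(:default, 0, 10)
# => [{ "class" => "UpdateAccount", "args" => [42] }]

Resque.size(:default)        # => 1

# Clean up between tests.
Resque.remove_queue(:default)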

To bring fakeredis to resque_unit, it was almost that easy. I got to remove a ton of code. And the rest of the code is a lot smaller, a lot simpler, and a lot less likely to break.

One last quirk

There was one last problem: fakeredis entirely takes over your connection to Redis. That makes it a pretty terrible dependency for a gem to have. What if you wanted to use real Redis for most of your tests? If you require resque_unit, all of a sudden you've changed a lot about how your tests run! And it gives me a headache to think about how hard that would be to debug.

So, when you require resque_unit, it does a little dance to be as unobtrusive as it can:

# This is a little weird. Fakeredis registers itself as the default
# redis driver after you load it. This might not be what you want,
# though -- resque_unit needs fakeredis, but you may have a reason to
# use a different redis driver for the rest of your test code. So
# we'll store the old default here, and restore it afer we're done
# loading fakeredis. Then, we'll point resque at fakeredis
# specifically.
default_redis_driver = Redis::Connection.drivers.pop
require 'fakeredis'
Redis::Connection.drivers << default_redis_driver if default_redis_driver
module Resque
  module TestExtensions
    # A redis connection that always uses fakeredis.
    def fake_redis
      @fake_redis ||= Redis.new(driver: :memory)
    end

    # Always return the fake redis.
    def redis
      fake_redis
    end

    # ... (the rest of the test helpers are elided) ...
  end
end

What can you take away from this?

When you're building a library that depends on a data store or service, think about making it swappable. It'll make your own tests easier to write, and it'll be clearer to your readers which features of the service you use.

Don't reimplement APIs on the surface level. Especially if you don't own it, and you don't control revisions to it. You'll do nothing but chase changes to it, and your implementation will usually be behind and a little broken.

And a good in-memory fake, in the right place, can make testing and development so much easier.

Using La Maquina to solve complex cache dependencies

Let's talk about caching a page that has a ton of database objects that need to be pulled. Specifically, because that page render is centered around a monster model that has way too many dependencies. This is going to be lengthy, so get yourself some coffee and settle in for an emotional rollercoaster.

Example

Alright, let's start off with an example. So let’s say you have a Rails project. And let’s say you have a “master” model all up in there. A “god” model if you will. It’s probably User. It’s User, isn’t it? Be honest. Anyway. You probably have a bunch of stuff in that user.rb file of yours. Something like

class User < ActiveRecord::Base
  has_one  :headshot
  has_many :articles

  # 500 more associations over here
end

and your app/views/user.html.slim file is riddled with cache blocks to look like

/ I want a cache block here for @user : ((
div 
  - cache [@user.headshot, :headshot] do
    = image_tag @user.headshot.image.small

  - cache [@user, :details] do
    h2 = @user.name
    p = @user.about_me

  - cache [@user.articles, :other_things] do
    - @user.articles.each do |article|
      - cache [article, :thumb_description] do
        = image_tag article.header_image
        = article.short_description
/ etc

Which means that, while sure, you have some caching in there and the HTML renders will be faster, you're still making quite a few database calls to verify all those caches. You could solve this by adding belongs_to :user, touch: true to all of your models that are part of the above block, but then 1: that doesn't work because touch is not supported for all associations and 2: you're making db updates to user when you are updating unrelated objects. Also: this becomes quite involved for through associations.

This is where I sell you stuff. Specifically, I wrote a gem called La Maquina. This bit of code allows you to define arbitrary associations between models and notify about updates. This is a hard sentence to parse, so lemme show you some code. Using the example above for only article, we can set up our models as follows

class User < ActiveRecord::Base
  include LaMaquina::Notifier
  notifies_about :self

  has_many :articles
  # etc
end

class Article < ActiveRecord::Base
  include LaMaquina::Notifier
  notifies_about :user

  belongs_to :user
  # etc
end

# all the other associations, like headshot, would be set up the same way as Article

This is the very basics of the plumbing. If you were to examine the input into the LaMaquina engine at this point, you'd see the :self notification fire (conceptually) as "a user is notifying about user #{id}"; and for the article, you'd see "an article is notifying about user #{id}".

This is the interesting bit. Now that we have the notifications flowing, we need some code to process them. LaMaquina does this with plugins called Pistons. They are bits of code that take the caller and callee class names and can process them in as simple or complex a way as you need. Just as an example, here's what a piston that implements the old touch functionality would look like.

class TouchPiston < LaMaquina::Piston::Base
  class << self
    # for Article, we'd get "user", user_id, "article"
    def fire!( notified_class, id, notifier_class = "" )
      # User
      updated_class = notified_class.camelize.constantize

      # User.find(id)
      object = updated_class.find(id)

      object.touch
    end
  end
end

While this is not recommended for cache invalidation (LaMaquina::Piston::CachePiston is probably what you want to use), this will allow you to update your slim to be more like this:

/ this is the important bit right here
- cache [@user, :profile] do
  div
    - cache [@user.headshot, :headshot] do
      = image_tag @user.headshot.image.small

    - cache [@user, :details] do
      h2 = @user.name
      p = @user.about_me

    - cache [@user.articles, :other_things] do
      - @user.articles.each do |article|
        - cache [article, :thumb_description] do
          = image_tag article.header_image
          = article.short_description

So now, so long as the user hasn't been touched, your page will render with a single db call. One. Isn't that exciting? I think that's pretty neat.

As a sidenote, you'll probably want to keep the inner cache blocks as they were, as they'll help when there are partial page rebuilds. Like, if a single article is added/updated, you won't want to rebuild the entire page.

Setup

Ok. So that's great. But I added all of this code and nothing is happening. What gives?

You need to do some minor setup. In your config/initializers, you'll need to add a la_maquina.rb that sets up all of this stuff. Something along the lines of

LaMaquina::Piston::CachePiston.redis = Redis::Namespace.new(:cache_piston, redis: Redis.new)
LaMaquina::Engine.install LaMaquina::Piston::CacheAssemblerPiston

LaMaquina.error_notifier = LaMaquina::ErrorNotifier::HoneybadgerNotifier

For a more thorough explanation of what all of that means, plz RTM.

Important note: if you're using CachePiston or your own custom cache key generator, don't forget to add a cache_key method to your target models

class User < ActiveRecord::Base
  def cache_key
    LaMaquina::Piston::CachePiston.cache_key(:user, id)
  end
end

otherwise Rails will default to model/id/updated_at, which will of course ignore your shiny new key and you will be very sad.

Bonus round

So this is all great and you're using the CachePiston and all of your views are blazing fast.

BUT WAIT, THERE'S MORE

So you know how you set up that piston to update your cache when a model changed? Well, you can add an arbitrary number of pistons that do all sorts of things. You're using Solr and want user to be reindexed when it's updated? You can do that (there's actually a proto-piston for that already). You want to fire Kafka notifications when articles get created? You can do that too. RSS? Why not. Push notifications? Sure! All kinds of things can happen. You can even rebuild the views on the backend if you want. The world is your oyster now.

You're welcome. Now go, make your app radical.

Solving Redis timeouts with a little fundamental CS

Redis is a handy place to keep data. With all the commands Redis supports, you can solve a ton of really common problems.

For example, do you need a queue you can safely use from a bunch of different processes? LPUSH and BRPOP have you covered. That's actually how Sidekiq works! (Resque pushes and pops in the other direction).
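
As a tiny sketch of that pattern with the redis-rb client (the queue name and payload here are made up):

require 'redis'
require 'json'

redis = Redis.new

# Producer: push a job onto the left end of the list.
redis.lpush("queue:mailers", { class: "WelcomeEmail", args: [42] }.to_json)

# Worker: block until a job arrives, popping from the right end.
_queue, payload = redis.brpop("queue:mailers", timeout: 5)
job = JSON.parse(payload) if payload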

In fact, Redis is such an easy place to stuff data, that it could become your first choice for storing miscellaneous things. You'll find the command that does what you want, call it, and everything will be great! That is, until you store more and more data under certain keys.

Maybe you want to see if a job has already been queued, so you don't queue it again:

index = 0
while (payload = Resque.redis.lindex("queue:#{queue_name}", index)) do
  # ... see if payload matches the job we're looking for ...
  index += 1
end

And all of a sudden, you'll notice that all of your communication with Redis just got slower. You might even start seeing Redis::TimeoutErrors. What happened?

In order to understand how this code broke, you need to understand a little bit about Redis, and a little fundamental Computer Science.

O(slow)

Take another look at Redis' command documentation, and you'll notice something on each page:

O(N)-ish

Yep, it's Big-O notation -- that thing you studied to prepare for your last interview. In your day-to-day development, you probably don't think about it too much. In Redis, though, slower algorithms can destroy your app's overall performance.

Most Redis commands are pretty fast: O(1) or O(log(N)). But a few, like that LINDEX from up above, are O(N). (Some, like ZUNIONSTORE, are even worse. Don't ask me how I know that).

That means that if you add twice the elements to your queue, that call will probably run (roughly) twice as slow.

And because Redis is single-threaded, a slow Redis command can keep other commands from running. When that happens, you'll start to see your error tracker fill up with Redis errors from totally unrelated places.

How do you detect and fix it?

Redis has SLOWLOG, which can show you which queries are taking the longest:

~ jweiss$ redis-cli
127.0.0.1:6379> slowlog get 3
1) 1) (integer) 26
   2) (integer) 1436856612
   3) (integer) 13286
   4) 1) "get"
      2) "key2"
2) 1) (integer) 25
   2) (integer) 1436856610
   3) (integer) 41114
   4) 1) "get"
      2) "key2"
3) 1) (integer) 24
   2) (integer) 1436856609
   3) (integer) 10891
   4) 1) "get"
      2) "key2"

That's a lot of numbers. The first one is automatically generated -- you don't need to worry about it. The second is a timestamp. The third and fourth are the most important -- the amount of time (in microseconds) it took for the command to run, and the command + arguments you sent.

The slowlog can be noisy, and won't always point you to exactly the right place. But it's a good first place to look for performance problems.

After you've identified some slow queries, though, how do you fix them?

Unfortunately, there's no approach that works in every situation. But here are a few that have helped me:

  • Scan through the other Redis commands that work with the data structure you're using. If any of them is almost as good a fit, but faster, try to find a way to use it instead.

  • Store an extra copy of the data in a way that makes it fast to look up later. This works especially well for commands like LINDEX. It's like having data in an array to make it easy to iterate, with a lookup table in a hash to make specific elements easy to find. (There's a short sketch of this approach after the list.)

    While this can work well, you do have to be careful. It's easy to create weird situations where the duplicate copy wasn't added right, or removed right, or got into an inconsistent state. MULTI and EXEC can help, but still take some care to use correctly.

  • Don't use Redis for that data at all. A traditional database might be a better solution, or there might be a more clever way of solving the problem.
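
Here's a sketch of that second approach for the "is this job already queued?" example from the top of the post: keep a Redis set of job fingerprints next to the list, so the check is a single O(1) SISMEMBER instead of an O(N) walk with LINDEX. The key names and helper methods are made up for illustration.

require 'redis'
require 'json'

def enqueue_unless_queued(redis, queue_name, payload)
  fingerprint = payload.to_json
  set_key = "queued_fingerprints:#{queue_name}"

  # O(1) membership check instead of walking the list with LINDEX.
  return false if redis.sismember(set_key, fingerprint)

  # MULTI keeps the set and the list from drifting out of sync.
  redis.multi do |tx|
    tx.sadd(set_key, fingerprint)
    tx.rpush("queue:#{queue_name}", fingerprint)
  end
  true
end

# When a worker pops and runs the job, clear its fingerprint so the
# same job can be queued again later.
def clear_fingerprint(redis, queue_name, payload)
  redis.srem("queued_fingerprints:#{queue_name}", payload.to_json)
end

enqueue_unless_queued(Redis.new, "csv_parsing", class: "CsvParserJob", args: [42])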


In your day-to-day work, you probably don't think too much about algorithmic complexity and O() notation.

But Redis is a single-threaded server where speed is incredibly important. The change in performance as your data grows makes a huge difference. And if you're not paying enough attention, it can hurt your entire app.

So, watch for it. Understand how to compare different algorithms, so you can pick the right one. And build intuition about how fast good algorithms should be, and the tradeoffs between them. That intuition will help you more often than you'd expect.