Bug-Driven Development

The Bug

The other day, I got a bug report from our customer service team. One of our newest customers was using our public API and reporting inconsistencies for our endpoint that lists account users. The issue was that for some users, their role in the account and their active flag was being reported incorrectly – compared with the value visible through our web interface. This was surprising: That endpoint has been in place for years and is used regularly by many of our customers. Nonetheless, the customer support team had verified the report from the customer and I had examples of the JSON output from their API interactions.

The code base for this projects is a mono-Rails application. I’ve worked on it for more than five years. It has evolved significantly throughout the years (including the addition of the public API in question). It has been worked on by more than a dozen engineers. The root cause of the bug could be anywhere. My go-to move in this kind of situation is to reproduce in the development environment, following the rule-of-thumb that reproducing an issue is half the battle. In this case, reproducing the bug was extremely painful. I spot checked the API call for some of my users in my development environment and – with QA’s help – tested on a staging environment. I finally found some instances of the issue, but it wasn’t at all clear to me what the root cause was.

Next, I took a dive through the code. I followed the code starting from the controller action in question and followed the calls all the way to the database and back. Eventually, I found the problem in a decorator class of all things. In this particular project, we use classes to serialize responses into hashes and then let rails serialize those into JSON. The code is functionally similar to this¹:

class UsersController < ApplicationController
  def index
    @users = #... somehow load users.
    render json: @users.map { |u| UserDecorator.new(u).to_h(current_account) }
  end
end

class UserDecorator
  def initialize(user)
    @user = user
  end

  def to_h(account)
    user_account = user.user_accounts.detect { |ua| ua.account_id = account.id }

    {
      id: user.id,
      first_name: user.given_name,
      last_name: user.family_name,
      email_address: user.primary_email,
      role: user_account.role,
      enabled: user_account.enabled?
    }
  end

  private

  attr_reader :user
end

The above code does some mapping between our internal names for some fields and what we decided to call them in the public API. It also includes some information from the UserAccount model. Spot the bug yet? I didn’t.

The user, account and user_account models in question are typical ActiveRecord models. The relevant pieces look like this:

class Account < ActiveRecord::Base
  has_many :user_accounts
  has_many :users, nil, through: :user_accounts
end

class UserAccount < ActiveRecord::Base
  belongs_to :user
  belongs_to :account
end

class User < ActiveRecord::Base
  has_many :user_accounts
  has_many :accounts, nil, through: :user_accounts
end

If you are a keen reader, you probably noticed the same thing I did: In the decorator code user.user_accounts will return an ActiveRecord::Relation. Why is the code filtering for that relation in Ruby? Why is the work being done after loading the ActiveRecord objects instead of filtering in the much-more efficient database? While I was pondering at, it finally hit me: The block inside of detect should be using a ==

user_account = user.user_accounts.detect { |ua| ua.account_id = account.id }
# Should really be
user_account = user.user_accounts.detect { |ua| ua.account_id == account.id }

What is happening? Instead of finding the user_account that corresponds to the passed in account, the code block is assigning a new account_id to the UserAccount. The assignment in ActiveRecord returns the same assigned value, the account id in this case. That value is truthy, so the detect call always returns the first user_account read from the database. By default, ActiveRecord will load the records in an association ordered by the primary key (i.e. id). In our case, this made the detection of the code particularly tricky. When using auto-incrementing ids – the default Rails case – the first account returned will also be the first account inserted: I had considered that as a possible cause for the bug, but rejected it because it didn’t fit my observations on production or staging servers. What I neglected to consider was that our models do not use auto-increment ids: They use UUIDs, stored in the database as strings and generated by SecureRandom. This throws-off any ordering:

SecureRandom.uuid # => "34a22bb9-e648-4264-ad13-3fbb07e9efc6"
SecureRandom.uuid # => "8b355635-ab2a-4b3e-936d-cbec824ab48d"
SecureRandom.uuid # => "39b2bc99-ee56-41e1-9e08-cc2cd65bc374"

The Fix

At this point, I was ready to create a failing test case that exposes the bug: We need a user with multiple UserAccounts (and thus multiple Accounts), we need the UserAccount to have different values for role and enabled?, and we need for them to be loaded in the correct order from the database so that we can expose the bug. My tests looks something like this:

describe UserDecorator do
  subject { described_class.new(user) }
  # This project uses FactoryGirl for model factories, instead of Rails' fixtures.
  let(:user) { FactoryGirl.create(:user, accounts_count: 2) }
  let(:user_account) {
    # We need the last user_account to be the one that we check for
    user.user_accounts.last.tap { |user_account|
      user_account.update!(role: role, enabled: true)
    }
  }
  let(:account) { user_account.account }
  let(:role) { :admin }

  before {
    # We need the first user account to have different values, so that it exposes the bug
    user.user_accounts.first.tap { |user_account|
      user_account.update!(role: :other, enabled: false)
    }
  }

  it "returns the correct role name for the user_account" do
    expect(subject.to_h[:role]).to be == role
  end

  it "returns the correct enabled value for the user_account" do
    expect(subject.to_h[:enabled]).to be == true
  end
end

With the bug exposed, I was committed a fix and opened a pull request, figuring I would fix the bug and get better performance by limiting the number of ActiveRecord objects loaded from the database.

# This:
user_account = user_accounts.detect { |ua| ua.account_uid = account.uid }
# Became
user_account = user_accounts.where(account: account).first

I wasn’t quite done. One of my co-workers pointed out that there was a reason that the code was using Ruby to filter the user_accounts. The previous commit message – the one that brought the Ruby filtering – had a helpful message that I neglected to read:

Find correct user_account in ruby for performance...

The user_accounts are already preloaded, but ActiveRecord goes
to the database again, which causes an N+1 query. In this case,
doing the filtering in Ruby is more efficient

I shouldn’t have assumed that the person that coded the filtering in Ruby was doing it wrong. In this case, there seem to be a good reason behind the unusual code. By the way, the author of that previous commit was me in 2015. I was smart enough to leave a note for my future self, but not humble enough to read it.

Design Implications

The bug is now fixed, yet I am not quite comfortable with the code. Two things bother me:

The specs needs a relatively complicated setup to expose the bug
The production code is loading more objects from the database, only to discard them in Ruby. This somehow is faster.

Why is a our test complicated? A decorator is supposed to be a simple class. It really is about serializing an object into JSON. The class is called a UserDecorator. It’s instantiated with User, which makes total sense. However, we have two collaborators that snuck-up on the serialization. An instance of Account, being passed in and a UserAccount that we find based on the other two. Where does the account come from? Originally, I glossed over how the users are loaded. The code is a bit complicated because it searches the database based on the parameters passed in. However, in it’s simplest form and removing a few layers of abstraction, it’s functionally equivalent to:

class UsersController < ApplicationController
  def index
    @users = SearchUsers.new(params).perform.preload(:user_accounts)
    render json: @users.map { |u| UserDecorator.new(u).to_h(current_account) }
  end
end

It’s not particularly important how SearchUsers works, but it’s sufficient to know that #perform returns an ActiveRecord::Relation that will load User records upon execution. The preloading here is what causes the user_accounts perviously alluded to in the commit message to be already loaded. If we stop to think about it, something else tickles my code nose: If we are searching for users within an account, and then pre-loading all user_accounts for those users, we are loading some data for other accounts. This is unacceptable. On requests that are account-scoped, like this one, data for other other accounts should never be loaded from the database.

Coming back to our decorator: Why are we even passing a parameter to the method? If #to_h is supposed to just construct a hash, let’s make the code just do that:

class UserDecorator
  def initialize(user, account)
    @user = user
    @account = account
  end

  def to_h
    {
      id: user.id,
      first_name: user.given_name,
      last_name: user.family_name,
      email_address: user.primary_email,
      role: user_account.role,
      enabled: user_account.enabled?
    }
  end

  private

  attr_reader :user, :account

  def user_account
    @user_account ||= user.user_accounts.detect { |ua| ua.account_id == account.id }
  end

This is better: It moves the searching for a user account out on it’s own method. It also now needs a user and an account to initialize the decorator. A User and an Account together are… a UserAccount. And at this point, is where the flaws in the original design become evident. We are searching for, decorating and listing users, but working in an account scope. This concept already exists in our system, in the form of the a UserAccount and we should be leveraging it:

class UsersController < ApplicationController
  def index
    @users = SearchUserAccounts.new(params).perform
    render json: @users.map { |u| UserAccountDecorator.new(u).to_h }
  end
end

class UserAccountDecorator
  def initialize(user_account)
    @user_account = user_account
  end

  def to_h
    {
      id: user.id,
      first_name: user.given_name,
      last_name: user.family_name,
      email_address: user.primary_email,
      role: user_account.role,
      enabled: user_account.enabled?
    }
  end

  private

  attr_reader :user_account

  def user
    user_account.user
  end
end

This is markedly better. We are not loading extra objects from the database anymore. The only UserAccounts loaded are the ones pertaining to the account in scope. The decorator now doesn’t know or care about how the user_accounts are loaded. As always, we need to be careful that we don’t introduce N+1 queries when the user_account reaches out to the user. A judicious user of ActiveRecord::Relation#includes can take care of that.

What about the complicated spec setup? Well, we got rid of the UserDecorator and it’s specs. Sarah Mei writes about testing serving many purposes or – as she calls them – factors. One of those is informing our design. Our specs above clearly did that: The complicated setup lead to the realization that we were dealing with the wrong data model.

Another test factor is to prevent future regressions. The same issue we had before is unlikely to crop-up again in the new design. However, production bugs are to be taken seriously – especially ones where data is being improperly reported for an account. Where should this regression-prevention spec live? We could move it to the controller and make it an integration spec of sorts. I expect the setup to be at least as complicated as the one shown above – and quite possible much worse. Rails controllers are notoriously difficult to test. I would gladly pay that price before laving no regression spec in place, though.

Fortunately, there is a better place for our complicated spec: SearchUserAccounts. As the name implies, this class’ responsibility is to load data from our database based on the input in the params. I haven’t shown that code at all, but I hope it’s clear that it deals with the ActiveRecord querying API. This is the correct level to test that only the correct data has been loaded. The regression spec then becomes:

describe SearchUserAccounts do
  subject { described_class.new(params) }
  let(:params) { { account: account } }
  let(:account) { user_account.account }
  let(:user_account) { user.user_accounts.first }
  let(:user) { FactoryGirl.create(:user, accounts_count: 2) }

  let(:other_user_account) { user.user_accounts.last }

  it "only loads user_accounts belonging to the scoped account" do
    aggregate_failures do
      expect(subject.perform).to include(user_account)
      expect(subject.perform).not_to include(other_user_account)
    end
  end
end

The spec is now simpler and testing at the correct level of abstraction.

Conclusion

Fixing bugs is a part of programming. Sometimes, they are simple and emerge because of our failure to consider all inputs or internal state. I hope that the takeaway for you, the reader, is that sometimes bugs can expose subtle design deficiencies. Improving the design can do much more than just fix the bug: It can bring simplicity to both the production code and it’s tests.

All the code in this post is inspired by actual production code in a project I work on. It has been simplified considerably to make it easier to follow, protect copyright, and make it easier for the reader to see the proverbial forest for the trees. ↩