The other day, I got a bug report from our customer service team. One of our newest customer’s was using our public API and was reporting inconsistencies for our endpoint that lists account users. The issue was that for some users, their role in the account and their active flag was being reported incorrectly – compared with the value visible through our web interface. This was surprising: That endpoint has been in place for years and is used regularly by many of our customers. Nonetheless, the customer support team had verified the report from the customer and I had examples of the JSON output from their API interactions.
The code base for this projects is a mono-Rails application. I’ve worked on it for more than five years. It has evolved significantly throughout the years (including the addition of the public API in question). It has been worked on by more than a dozen engineers. The root cause of the bug could be anywhere. My go-to move in this kind of situation is to reproduce in the development environment, following the rule-of-thumb that reproducing an issue is half the battle. In this case, reproducing the bug was extremely painful. I spot checked the API call for some of my users in my development environment and – with QA’s help – tested on a staging environment. I finally found some instances of the issue, but it wasn’t at all clear to me what the root cause was.
Next, I took a dive through the code. I followed the code starting from the controller action in question and followed the calls all the way to the database and back. Eventually, I found the problem in a decorator class of all things. In this particular project, we use classes to serialize responses into hashes and then let rails serialize those into JSON. The code is functionally similar to this1:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
The above code does some mapping between our internal names for some fields and what we decided to call them in the public API. It also includes some information from the
UserAccount model. Spot the bug yet? I didn’t.
user_account models in question are typical
ActiveRecord models. The relevant pieces look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
If you are a keen reader, you probably noticed the same thing I did: In the decorator code
user.user_accounts will return an
ActiveRecord::Relation. Why is the code filtering for that relation in Ruby? Why is the work being done after loading the
ActiveRecord objects instead of filtering in the much-more efficient database? While I was pondering at, it finally hit me: The block inside of
detect should be using a
1 2 3
What is happening? Instead of finding the user_account that corresponds to the passed in account, the code block is assigning a new
account_id to the
UserAccount. The assignment in ActiveRecord returns the same assigned value, the account id in this case. That value is truthy, so the
detect call always returns the first user_account read from the database. By default, ActiveRecord will load the records in an association ordered by the primary key (i.e. id). In our case, this made the detection of the code particularly tricky. When using auto-incrementing ids – the default Rails case – the first account returned will also be the first account inserted: I had considered that as a possible cause for the bug, but rejected it because it didn’t fit my observations on production or staging servers. What I neglected to consider was that our models do not use auto-increment ids: They use UUIDs, stored in the database as strings and generated by
SecureRandom. This throws-off any ordering:
1 2 3
At this point, I was ready to create a failing test case that exposes the bug: We need a user with multiple
UserAccounts (and thus multiple
Accounts), we need the
UserAccount to have different values for
enabled?, and we need for them to be loaded in the correct order from the database so that we can expose the bug. My tests looks something like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
With the bug exposed, I was committed a fix and opened a pull request, figuring I would fix the bug and get better performance by limiting the number of ActiveRecord objects loaded from the database.
1 2 3 4
I wasn’t quite done. One of my co-workers pointed out that there was a reason that the code was using Ruby to filter the user_accounts. The previous commit message – the one that brought the Ruby filtering – had a helpful message that I neglected to read:
1 2 3 4 5
I shouldn’t have assumed that the person that coded the filtering in Ruby was doing it wrong. In this case, there seem to be a good reason behind the unusual code. By the way, the author of that previous commit was me in 2015. I was smart enough to leave a note for my future self, but not humble enough to read it.
The bug is now fixed, yet I am not quite comfortable with the code. Two things bother me: - The specs needs a relatively complicated setup to expose the bug - The production code is loading more objects from the database, only to discard them in Ruby. This somehow is faster.
Why is a our test complicated? A decorator is supposed to be a simple class. It really is about serializing an object into JSON. The class is called a
UserDecorator. It’s instantiated with
User, which makes total sense. However, we have two collaborators that snuck-up on the serialization. An instance of
Account, being passed in and a
UserAccount that we find based on the other two. Where does the account come from? Originally, I glossed over how the users are loaded. The code is a bit complicated because it searches the database based on the parameters passed in. However, in it’s simplest form and removing a few layers of abstraction, it’s functionally equivalent to:
1 2 3 4 5 6
It’s not particularly important how
SearchUsers works, but it’s sufficient to know that
#perform returns an
ActiveRecord::Relation that will load
User records upon execution. The preloading here is what causes the user_accounts perviously alluded to in the commit message to be already loaded. If we stop to think about it, something else tickles my code nose: If we are searching for users within an account, and then pre-loading all user_accounts for those users, we are loading some data for other accounts. This is unacceptable. On requests that are account-scoped, like this one, data for other other accounts should never be loaded from the database.
Coming back to our decorator: Why are we even passing a parameter to the method? If
#to_h is supposed to just construct a hash, let’s make the code just do that:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
This is better: It moves the searching for a user account out on it’s own method. It also now needs a user and an account to initialize the decorator. A
User and an
Account together are… a
UserAccount. And at this point, is where the flaws in the original design become evident. We are searching for, decorating and listing users, but working in an account scope. This concept already exists in our system, in the form of the a
UserAccount and we should be leveraging it:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
This is markedly better. We are not loading extra objects from the database anymore. The only
UserAccounts loaded are the ones pertaining to the account in scope. The decorator now doesn’t know or care about how the user_accounts are loaded. As always, we need to be careful that we don’t introduce N+1 queries when the user_account reaches out to the user. A judicious user of
ActiveRecord::Relation#includes can take care of that.
What about the complicated spec setup? Well, we got rid of the
UserDecorator and it’s specs. Sarah Mei writes about testing serving many purposes or – as she calls them – factors. One of those is informing our design. Our specs above clearly did that: The complicated setup lead to the realization that we were dealing with the wrong data model.
Another test factor is to prevent future regressions. The same issue we had before is unlikely to crop-up again in the new design. However, production bugs are to be taken seriously – especially ones where data is being improperly reported for an account. Where should this regression-prevention spec live? We could move it to the controller and make it an integration spec of sorts. I expect the setup to be at least as complicated as the one shown above – and quite possible much worse. Rails controllers are notoriously difficult to test. I would gladly pay that price before laving no regression spec in place, though.
Fortunately, there is a better place for our complicated spec:
SearchUserAccounts. As the name implies, this class' responsibility is to load data from our database based on the input in the
params. I haven’t shown that code at all, but I hope it’s clear that it deals with the ActiveRecord querying API. This is the correct level to test that only the correct data has been loaded. The regression spec then becomes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
The spec is now simpler and testing at the correct level of abstraction.
Fixing bugs is a part of programming. Sometimes, they are simple and emerge because of our failure to consider all inputs or internal state. I hope that the takeaway for you, the reader, is that sometimes bugs can expose subtle design deficiencies. Improving the design can do much more than just fix the bug: It can bring simplicity to both the production code and it’s tests.
All the code in this post is inspired by actual production code in a project I work on. It has been simplified considerably to make it easier to follow, protect copyright, and make it easier for the reader to see the proverbial forest for the trees.↩