The REPL: Issue 85 - September 2021

Oct 5, 2021 • Ylan Segal • the repl

Understanding AWK

awk is powerful, and standard on all unix flavors. Its syntax is not immediately obvious. Adam Gordon Bell wrote an excellent tutorial on how to get started and write increasingly useful awk programs. I particularly found the summary boxes with “What I learned” at the end of each section. They serve to cement the knowledge, and quick reference when coming back to the tutorial later.

How We Got to LiveView

Chris McCord, maintainer of Elixir’s Phoenix framework explains how LiveView came to be, and why it leans on the features of the BEAM (which powers Elixir). I am continue to be excited about the feature for Elixir and Phoenix.

The Two Types of Knowledge: The Max Planck/Chauffeur Test

The anecdote in this post is cute. It is funny because the chauffeur is quick-witted. The point is that real understanding about a topic, is more than just knowing the words, or memorizing some of the tenets.

True experts recognize the limits of what they know and what they do not know. If they find themselves outside their circle of competence, they keep quiet or simply say, “I don’t know.”

Read on →

Leaning on ActiveRecord Relations for Query Optimization

Sep 14, 2021 • Ylan Segal • rails, databases, postgres

I previously wrote about using views to create abstractions in Rails. As it happens, one of the queries using that view had some performance issues.

The offending query was generated by the following code:

SignatureRequirement
  .joins(
    missing_signatures: {participant: :households},
    programs: {program_participants: {participant: :households}}
  )
  .merge(household.people)
  .distinct

See the original post for an ERD diagram of the models involved.

The EXPLAIN ANALYZE for the generated query reported:

Planning Time: 28.302 ms
Execution Time: 9321.261 ms

The EXPLAIN ANALYZE output has a lot of information, and the noise can easily drown the signal. The excellent explain.depesz.com provides a way to visualize the output and color code the important information, to guide the discovery.

New Relic Performance Graph

The proceeding image is too small to discern. At a glance though, its stands out (to me at least) that there is a lonely cell in dark colors at the bottom. The dark cell is for number of rows, and it contains 709,200 rows. I found that surprising. I know the system contains about 400 program participants and 5 signature requirements. Without any filtering, I expect the missing signatures view to return at most ~2,000 rows. Where are 700K rows coming from?

Cartesian Product

Wikipedia defines a Cartesion Product as:

In mathematics, specifically set theory, the Cartesian product of two sets A and B, denoted A × B is the set of all ordered pairs (a, b) where a is in A and b is in B.

A cartesian product represents all the possible combinations of members from two different sets. It comes into play in databases, when a query is constructed in such a way that it returns all possible combinations of multiple tables, potentially returning many rows.

As an example, lets assume we have two tables with the following data:

SELECT * FROM people;
-- id
-- ----------
-- 1

SELECT * FROM colors;
-- person_id   name
-- ----------  ----------
-- 1           red
-- 1           blue
-- 1           white

SELECT * FROM flavors;
-- person_id   name
-- ----------  ----------
-- 1           chocolate
-- 1           vanilla
-- 1           strawberry

A cartesian product combines all possible combinations (3 X 3 = 9):

SELECT
    people.id,
    colors.name,
    flavors.name
FROM
    people
    JOIN colors ON people.id = colors.person_id
    JOIN flavors ON people.id = flavors.person_id;

-- id          name        name
-- ----------  ----------  ----------
-- 1           blue        chocolate
-- 1           blue        strawberry
-- 1           blue        vanilla
-- 1           red         chocolate
-- 1           red         strawberry
-- 1           red         vanilla
-- 1           white       chocolate
-- 1           white       strawberry
-- 1           white       vanilla

My hunch was that a cartesian product was being generated by unnecessary joins.

Query Optimization

Luckily, the query to optimize has a very good test suite around it, ensuring that the data returned is correctly scoped and de-duplicated. That gave me confidence to aggressively refactor.

SignatureRequirement
  .joins(
    missing_signatures: {participant: :households},
    programs: {program_participants: {participant: :households}}
  )
  .merge(household.people)
  .distinct

The original code is trying to find signature requirements that have missing signatures for members of a household. Inspecting the generated SQL reveals 11 joins, not counting the underlying joins in the missing_signatures view.

After taking a step back, I noticed that the code I really wanted to write was:

household.missing_signature_requirements.distinct

After realizing that, it became obvious that has_many with its through options should be more than equal to the task. Here are the associations that make that code possible:

class Household < ApplicationRecord
  has_many :household_people
  has_many :people, through: :household_people
  has_many :missing_signature_requirements, through: :people
end

class HouseholdPerson < ApplicationRecord
  belongs_to :household
  belongs_to :person
end

class Person < ApplicationRecord
  has_many :missing_signatures, inverse_of: :participant, foreign_key: :participant_id
  has_many :missing_signature_requirements, through: :missing_signatures, source: :signature_requirement
end

class MissingSignature < ApplicationRecord
  belongs_to :signature_requirement
  belongs_to :participant, class_name: "Person"
end

class SignatureRequirement < ApplicationRecord
  has_many :missing_signatures
end

The new query generates only 3 joins, and looks like this:

SELECT DISTINCT
    "signature_requirements".*
FROM
    "signature_requirements"
    INNER JOIN "missing_signatures" ON "signature_requirements"."id" = "missing_signatures"."signature_requirement_id"
    INNER JOIN "people" ON "missing_signatures"."participant_id" = "people"."id"
    INNER JOIN "household_people" ON "people"."id" = "household_people"."person_id"
WHERE
    "household_people"."household_id" = 267

Results

The resulting query is more legible, and more importantly more performant. Here is the bottom line from EXPLAIN ANALYZE:

Before:
Planning Time: 28.302 ms
Execution Time: 9321.261 ms

After:
Planning Time: 2.716 ms
Execution Time: 23.580 ms

The new query is less than 1% of the original query. The New Relic dashboard shows the performance improvement:

New Relic Performance Graph

Read on →

The REPL: Issue 84 - August 2021

Sep 3, 2021 • Ylan Segal • the repl

We Will Do Anything to Get You to Work for Us Except Pay You Enough

Eli Grober writes tongue-in-cheek, but the essay is clearly funny because there is some truth behind it. Companies go to great lengths to not raise wages and raise side benefits instead, while pretending that you should be happy about it. Side benefits are cheaper, and also much easier to cut without too much protest.

The other side, of course, is that employees accept this bargain.

Naval Architecture

Bartosz Ciechanowski presents an excellent guide on the physics of boats. Most diagrams are interactive, allowing the reader to simulate different conditions. It makes it easier to grasp the underlying concepts. The author starts from first principles and builds, exposing a lot of complexity in naval design. Not a small feat. It even covers the basics of propulsion. Kudos.

Better coordination, or better software?

Jessica Kerr (aka Jessitron) thesis is that better coordination between software teams leads to more coordination work. The alternative is establishing better boundaries better boundaries between software, to make it unnecessary to have that much coordination work.

Read on →
Code spelunking with rg and join

Aug 13, 2021 • Ylan Segal • unix
The power of the unix philosphy is that you can compose single-purpose tools together, to great effect. For example, let’s say we are working on a Rails app. We want to find all controllers that use current_user, and also inherit from ApplicationContrller.

I would run the following:
```
$ join \
 <(rg '< ApplicationController' app/controllers --files-with-matches | sort) \
 <(rg 'current_user' app/controllers --files-with-matches | sort)
```
Let’s break it down.
```
$ rg '< ApplicationController' app/controllers --files-with-matches
app/controllers/pdfs_controller.rb
app/controllers/language_selection_controller.rb
app/controllers/info/pages_controller.rb
app/controllers/secure/base_controller.rb
```
ripgrep (rg) is an excellent replacement for grep: It searches the content of files for a matching regex. The expression above searches for the regex < ApplicationController to find classes inheriting from ApplicationController inside the app/controllers directory. Like grep, rg can return both file, line number and match information. In this case, I am directing it only return filenames with --files-with-matches.

We now have a list of files in app/controllers that have classes inheriting from ApplicationController.

Next, we want to find uses of current_user in those files. There are a few ways we can accomplish that. I decided to find all controllers that use current_user, and later compare the two lists. The second list of files is found with:
```
$ rg 'current_user' app/controllers --files-with-matches
app/controllers/secure/password_resets_controller.rb
app/controllers/secure/medical_profiles_controller.rb
app/controllers/secure/bus_trips_controller.rb
app/controllers/secure/profiles_controller.rb
app/controllers/secure/signatures_controller.rb
app/controllers/secure/bus_reservations_controller.rb
app/controllers/secure/base_controller.rb
```
With the two lists in hand, we can then turn to join. From the man page:

join – relational database operator

The join utility performs an ``equality join’’ on the specified files and writes the result to the standard output.

Much like a JOIN in SQL which find corresponding records in two tables, join can find matching records in two files. Its usage typically requires specifying which field in each line to use for the join. Our usage is very simple though: Our lists of files only have a single field: The file name.

join expects two files as arguments. We could redirect the output of each of our rg calls to a file, and use those files as input to join. However, bash (and other shells too) allow for process substitution: It can take care of presenting the output of a subprocess to another process as if it was a file. That is done via the <() syntax, used twice: Once for each rg search.

The last bit is the usage of sort. join expects the files to be sorted:

When the default field delimiter characters are used, the files to be joined should be ordered in the collating sequence of sort(1)

And there it is! We used rg, sort, join, and a bit of bash plumbing to find files that have lines matching two different regexes:
```
$ join \
 <(rg '< ApplicationController' app/controllers --files-with-matches | sort) \
 <(rg 'current_user' app/controllers --files-with-matches | sort)
app/controllers/secure/base_controller.rb
```
Read on →
The REPL: Issue 83 - July 2021

Aug 2, 2021 • Ylan Segal • the repl, good_job

GoodJob

I recently found GoodJob, a background-job library for Ruby. It’s compatible with ActiveJob. Its main selling points is that it takes advantage of Postgres features. For projects already on Postgres this means two things: There is no need for another data store, and job scheduling can be transactional with the rest of the application.

Sidekiq, the leading background job library requires the use of Redis. It scales exceptionally well. For many applications managing another data store in production is burdensome, and provides little tangible benefits, especially if the load on the database is low. GoodJob even has a mode that runs the workers in the same process as the Rails server. For smaller apps running on Heroku, this can remove the need of a separate dyno.

Regarding the transactional nature: Suppose you want to store a record and queue a background job as part of some business operation. If you write to your main database first you run the risk of failing when enqueuing the job. Enqueuing inside a transaction doesn’t work either. In case of a transaction rollback, the job will still be published, like Brandur explains. Keeping jobs in the same database as the rest of the data, allows for transactional semantics – much easier to code against.

I’ve only tried this library on a side project with little traffic, but so far I am very impressed.

Idempotency-Key IETF Standards Draft

Speaking of Brandur: He notes that the IETF has a new draft standard for an Idempotency-Key, already in use at Stripe and other places. He previously explains in more detail why it’s important.

Read on →

« Older Newer »