Very handy recipes for testing ActiveRecord
transactions.
Using cat to start a pipeline is about composing commands: it makes it easier to build pipelines in steps. Technically, you may be adding an extra process that you don’t need, but in day-to-day Unix pipe operations, the performance is never an issue.
Nice visual animation of how removing stuff improves the design. The pie chart in particular was great!
Let’s resume with our example:
class FindScore
  DEFAULT_SCORE = 0
  URL = 'http://example.com'.freeze

  def initialize(user, http_client = HTTParty)
    @user = user
    @http_client = http_client
  end

  def call
    make_api_request(@user, @http_client)
      .then { parse_response(_1) }
      .then { extract_score(_1) }
  end

  private

  def make_api_request(user, http_client = @http_client, url = URL)
    http_client.post(
      url,
      body: { user: user.id }
    )
  end

  def parse_response(response)
    JSON.parse(response.body)
  end

  def extract_score(response_body, default = DEFAULT_SCORE)
    response_body.fetch("score", default)
  end
end
That class doesn’t handle any errors. Each of the private methods can fail in different ways. For the sake of example, let’s say that we can encounter HTTP errors in make_api_request, the response may fail to be valid JSON, or the response might have a different JSON shape than what we expect. One way to handle them is via exceptions or checking for specific conditions, and then ensuring that the value passed along is what the next step in our pipeline expects:
class FindScore
  DEFAULT_SCORE = 0
  URL = 'http://example.com'.freeze

  def initialize(user, http_client = HTTParty)
    @user = user
    @http_client = http_client
  end

  def call
    make_api_request(@user, @http_client)
      .then { parse_response(_1) }
      .then { extract_score(_1) }
  end

  private

  def make_api_request(user, http_client = @http_client, url = URL)
    response = http_client.post(
      url,
      body: { user: user.id }
    )
    response.ok? ? response.body : "{}"
  end

  def parse_response(body)
    JSON.parse(body)
  rescue JSON::ParserError
    {}
  end

  def extract_score(response_body, default = DEFAULT_SCORE)
    response_body.fetch("score", default)
  end
end
In that version, #make_api_request checks for a successful response, passing the response body to #parse_response. If the response is not successful, however, it returns "{}", a string that #parse_response can still parse as JSON. In a similar manner, parsing JSON might raise JSON::ParserError. #parse_response rescues the exception and returns a hash, as expected by #extract_score.
The code is now more resilient: it can handle some errors and recover from them by returning a value that can be used in the next method. However, these errors are being swallowed. What if we wanted to add some logging or metrics for each error, so we can understand our system better? One way is to add a logging statement on the error branch of each method. I prefer another way: using result objects.
For our purposes a result object can either be a success or an error. In either case, it wraps another value, and it has some methods that act differently in each case. This object is known as a result monad, but let’s not dwell on that. Our result object will make it easier to write pipelines of method calls, without sacrificing error handling.
A very minimal implementation looks like this:
class Ok
  def initialize(value)
    @value = value
  end

  # Pipelines continue: yield the wrapped value to the block.
  def and_then
    yield @value
  end

  # Unwrapping an Ok ignores the fallback.
  def value_or(_other = nil)
    @value
  end
end

class Error
  def initialize(error)
    @error = error
  end

  # Pipelines short-circuit: the block is never called.
  def and_then
    self
  end

  # Unwrapping an Error returns the fallback argument,
  # or yields the error if given a block.
  def value_or(other = nil)
    block_given? ? yield(@error) : other
  end
end
The polymorphic interface for Ok and Error has two methods: #and_then, which is used to pipeline operations, and #value_or, which is used to unwrap the value. Let’s see some examples:
Ok.new(1)
  .and_then { |n| Ok.new(n * 2) } # => 1 * 2 = 2
  .and_then { |n| Ok.new(n + 1) } # => 2 + 1 = 3
  .value_or(:error)
# => 3

Ok.new(1)
  .and_then { |n| Ok.new(n * 2) } # => 1 * 2 = 2
  .and_then { |n| Error.new("something went wrong") }
  .and_then { |n| Ok.new(n + 1) } # => Never called
  .and_then { |n| raise "Hell" }  # => Never called either
  .value_or { :error }
# => :error
A chain of #and_then calls continues much like #then does, expecting a result object as a return value. However, if the return value at any point is an Error, subsequent blocks will not execute; the chain instead keeps returning the same result object. We then have a powerful way of constructing pipelines, where error handling can be left to the end.
Our class with error handling can now be written as:
class FindScore
  DEFAULT_SCORE = 0
  URL = 'http://example.com'.freeze

  def initialize(user, http_client = HTTParty)
    @user = user
    @http_client = http_client
  end

  def call
    make_api_request(@user, @http_client)
      .and_then { parse_response(_1) }
      .and_then { extract_score(_1) }
      .value_or { |error_message|
        log.error "FindScore failed for #{@user}: #{error_message}"
        DEFAULT_SCORE
      }
  end

  private

  def make_api_request(user, http_client = @http_client, url = URL)
    response = http_client.post(
      url,
      body: { user: user.id }
    )
    response.ok? ? Ok.new(response.body) : Error.new("HTTP Status Code: #{response.status_code}")
  end

  def parse_response(body)
    Ok.new(JSON.parse(body))
  rescue JSON::ParserError => ex
    Error.new(ex.to_s)
  end

  def extract_score(parsed_json)
    score = parsed_json["score"]
    score.present? ? Ok.new(score) : Error.new("Score not found in response")
  end
end
Now, each method is responsible for returning either an Ok or an Error. The #call method is responsible for constructing the overall pipeline and handling the failure (i.e. returning DEFAULT_SCORE), and with a single line, it also logs all errors.
This technique is quite powerful. The result objects are not limited to private class methods; public methods can return them just as well. The Ok and Error implementation here is quite minimal, as a demonstration for this post. There are full-featured libraries out there (e.g. dry-rb), or you can roll your own pretty easily and expand the API to suit your needs (e.g. #ok?, #error?, #value!, #error, #fmap).
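As a sketch of what such an expanded API might look like (the predicate and #fmap names below follow common result-object conventions, not any particular library):

```ruby
class Ok
  def initialize(value)
    @value = value
  end

  def ok?
    true
  end

  def error?
    false
  end

  def value!
    @value
  end

  # #fmap wraps the block's plain return value back into an Ok,
  # so blocks don't have to build result objects themselves.
  def fmap
    Ok.new(yield(@value))
  end
end

class Error
  def initialize(error)
    @error = error
  end

  def ok?
    false
  end

  def error?
    true
  end

  def error
    @error
  end

  # Errors pass through #fmap untouched, just like #and_then.
  def fmap
    self
  end
end
```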
As I concluded in my previous post, writing Ruby classes so that the class reads in the same order as the operations are performed leads to more legible code. Adding result objects advances those same goals, and makes error conditions a first-class concern.
I agree with the author that there is a lot of Pop Culture in software companies, in the sense that they forget about the past, and there is a bias for “newer is better”. Thus, we get all the articles advising to choose “boring” technology with a proven track record.
There also does seem to be a good amount of contagion in the current round of layoffs. Companies are firing people even if they are doing well. I disagree that it is irrational; I dislike that characterization. It seems like a crutch for failing to understand the motivations of the people making the decisions. I believe that company executives do know that layoffs are bad for morale and create problems down the line. There are some pretty smart people in company management. I think that they are making those decisions in spite of knowing that there are real downsides. Maybe the pressure from boards or investors is too much. Even if it is a case of copying what others are doing, it need not be irrational. There is an incentive to go with the flow: it’s safe. No one ever got fired for buying IBM. If things go wrong, you won’t be blamed for making the same decision everyone else made.
Mike Burns writes about how iteratively building a collection is an anti-pattern:
What follows are some lengthy method definitions followed by rewrites that are not only more concise but also more clear in their intentions.
It resonates with me that the pattern should be avoided. Brevity and clarity are great, but I think minimizing mutation is an even better reason to avoid building collections iteratively. Written in a functional style, your code performs less mutation of data structures, which means that it handles less state. Handling state is where a lot of complexity hides, and it is the source of many bugs. In fact, in Joe Armstrong’s estimation:
State is the root of all evil. In particular functions with side effects should be avoided.
The style of Ruby that the article encourages removes the state handling from your code. 👍🏻
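A minimal illustration of the contrast (my own hypothetical example, not from the article):

```ruby
numbers = (1..5)

# Iterative building: an accumulator is mutated step by step.
squares_of_odds = []
numbers.each do |n|
  squares_of_odds << n * n if n.odd?
end

# Functional style: no intermediate state to track.
functional = numbers.select(&:odd?).map { |n| n * n }

p squares_of_odds # => [1, 9, 25]
p functional      # => [1, 9, 25]
```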
Every few years, my routers start acting up in strange ways. Some devices function great, while others seem to have intermittent downloads. This article confirms my suspicions: routers just wear out:
In general, routers can and do fail. The primary cause of failure for consumer grade equipment is heat stress. Most consumer grade hardware runs far too hot and have respectively poor air circulation compared to their ventilation needs.
To increase ventilation, I’ve started raising my router from the surface it’s on with a Lego structure that increases airflow from the bottom. It seems to improve heat dissipation by the imprecise measure of “it feels cooler to my touch”. 🤷🏻♂️
Some languages, like Go, have a built-in linter, which applies a universal style that’s been set by the language designers. That’s the most totalitarian approach. A decree on what a program should look like that’s barely two feet removed from a compile error. I don’t like that one bit.
It reminds me of Newspeak, the new INGSOC language from Orwell’s 1984. Not because of any sinister political undertones, but in the pursuit of a minimalist language, with no redundant terms or ambiguities or flair. Imagine every novel written in the same style, Hemingway indistinguishable from Dickens, Tolkien from Rowling. It would be awfully gray to enjoy the English language if there was only a single shade of prose.
The best code to me is indeed its own form of poetry, and style is an integral part of such expression. And in no language more so than Ruby.
There are probably people who would prefer the more conventional, literal style, and they could encode that preference in a lint rule. I’d have absolutely no problem with that, as long as they’re not trying to force me to abide by their stylistic preferences.
Now, in The Rails Doctrine DHH writes about Convention over Configuration:
One of the early productivity mottos of Rails went: “You’re not a beautiful and unique snowflake”. It postulated that by giving up vain individuality, you can leapfrog the toils of mundane decisions, and make faster progress in areas that really matter.
Who cares what format your database primary keys are described by? Does it really matter whether it’s “id”, “postId”, “posts_id”, or “pid”? Is this a decision that’s worthy of recurrent deliberation? No.
Let me try to unpack the two posts. Table names and primary and foreign key columns have strong conventions in Rails:
class Post < ApplicationRecord
  has_many :comments
end

class Comment < ApplicationRecord
  belongs_to :post
end
Implicit in the code above is that our database has a table posts with a primary key field id, and a table named comments with a primary key field id and a foreign key post_id referencing posts.id. There are many arguments to make about these specific conventions: about the plural table names, about using snake case, about the Ruby class names being singular, etc. The point of convention over configuration is that those choices don’t matter. The convention saves us from making unimportant decisions; in DHH’s words:
Part of the Rails’ mission is to swing its machete at the thick, and ever growing, jungle of recurring decisions that face developers creating information systems for the web. There are thousands of such decisions that just need to be made once, and if someone else can do it for you, all the better.
Let’s get back to Ruby syntax. The argument seems to be that the following two ways of writing the same code are different forms of self expression and the building blocks of poetry:
has_many :posts
has_many(:posts)
I am not convinced by that argument. I used to think that creating a style-guide for each team was a worthwhile exercise. Since then, I’ve been on 3 different teams in 12 years (hardly a lot by tech standards). I’ve come to experience the power of hitting the ground running on an unfamiliar Rails application, exactly because of convention over configuration. I’d rather we have more of that, with a convention for code style across teams.
In A writer’s Ruby, DHH says:
Imagine every novel written in the same style, Hemingway indistinguishable from Dickens, Tolkien from Rowling. It would be awfully gray to enjoy the English language if there was only a single shade of prose.
That would be an awful world indeed, but I don’t think it’s a fair comparison. Novels are mostly individual works of art. Code style is mostly a team endeavour with very different goals than literary works. Presumably, the team is working towards producing a maintainable code base that is easy to work on by current and future members of the team. Predictable style and idioms are better than poetic code. I would hate for all the novels I read to have the same style. I would hate just as much for the instructions on my dishwasher to be in the style of James Joyce or Leo Tolstoy.
For now, I am using standardrb, even if I don’t like all the conventions.
Mind blown. This promises to be scalable and elastic with minimal code shenanigans. Like they mention: it doesn’t solve a problem, it removes it. Thanks to the power of the Erlang VM.
Overall, very good advice for Rails performance.
Like having your cake and eating it too: Ruby got faster, and it is using less memory. I’ve personally seen a 20% improvement in speed building this blog.
Ruby 3.3 was released. It promises performance improvements, especially with YJIT turned on. Let’s see how it does using jekyll, the static site generator for this very blog.
$ ruby -v
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]
$ jekyll build
Configuration file: /Users/ylansegal/Personal/blog/_config.yml
Source: /Users/ylansegal/Personal/blog/src
Destination: /Users/ylansegal/Personal/blog/_site
Incremental build: disabled. Enable with --incremental
Generating...
Jekyll Feed: Generating feed for posts
done in 2.676 seconds.
Auto-regeneration: disabled. Use --watch to enable.
$ jekyll build
[...]
done in 2.685 seconds.
$ jekyll build
[...]
done in 2.679 seconds.
Average: 2.68 seconds
$ ruby -v
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [arm64-darwin23]
$ jekyll build
Configuration file: /Users/ylansegal/Personal/blog/_config.yml
Source: /Users/ylansegal/Personal/blog/src
Destination: /Users/ylansegal/Personal/blog/_site
Incremental build: disabled. Enable with --incremental
Generating...
Jekyll Feed: Generating feed for posts
done in 2.079 seconds.
Auto-regeneration: disabled. Use --watch to enable.
$ jekyll build
[..]
done in 2.212 seconds.
$ jekyll build
[...]
done in 2.163 seconds.
Average: 2.15 seconds
Of course this is unscientific, but 20% local reduction in build time is impressive!
I don’t write Go, and don’t have any insights into this new queue. However, the introduction explaining why a database-backed queue solves the dual-write problem is clear and applicable to any other system. Transactionality is one of the main reasons that I recommend GoodJob. The other one, also discussed, is that operationally, having fewer data stores is a win.
A number of Postgres improvements also make it more performant than previous attempts at using db-backed job queues.
Coincidentally, I was listening to a recent podcast where they were discussing Redis as a store for background queuing. I was disappointed that they didn’t mention the dual-write problem as a point in favor of relational database queues.
Concise and interesting. I’ve internalized avoiding premature optimization as a best practice. Rule 5 has me thinking a bit. Most web application programming doesn’t need to think much about data structures in memory, but rather about how to design the database schema for storage.
Rails 7.1 is out with some very interesting features for Postgres users. Composite primary key support in particular caught my eye: when partitioning tables, using a composite primary key that includes the partition key is a best practice. Now, Rails supports composite primary keys in the model and associations (through query_constraints), ensuring that when reading from the table, the partition key is always used.
Improved support for CTEs is also welcome!
It turns out that the way you structure classes (and more precisely, variable instantiation) in Ruby 3.2+ has performance implications. Ben Sheldon discusses how to structure classes to take advantage of those optimizations. It is an interesting demonstration of how code style and the Ruby interpreter interact.
The article doesn’t mention how much this impacts performance. I wonder: on a typical web request, how much can you save by structuring your classes for optimization?
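The gist, as I understand it (my own sketch, not code from the article): Ruby 3.2’s object shapes cache instance-variable layouts, so setting every instance variable in #initialize, in a consistent order, lets all instances share one shape:

```ruby
# Object-shape friendly: every instance variable is assigned up front,
# in the same order for all instances, so they share a single shape.
class ShapeFriendly
  def initialize
    @name = nil
    @cache = nil
  end
end

# Lazily defined ivars (e.g. via ||= in readers) can give otherwise
# identical objects different shapes, defeating the shape cache.
class ShapeUnfriendly
  def name
    @name ||= "default"
  end
end

p ShapeFriendly.new.instance_variables   # => [:@name, :@cache]
p ShapeUnfriendly.new.instance_variables # => []
```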
It’s called TLDR and it blows up if your tests take more than 1.8 seconds to run.
Testing is a near and dear topic to me. I have not tried this new framework, but I have some initial thoughts:
TLDR automatically prepends the most-recently modified test file to the beginning of the suite
This is brilliant. I have a script that guesses which test files to run on a branch based on what changed in git. After reading this, I immediately incorporated ordering the files by modification date.
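The ordering itself is a one-liner; a sketch of the idea (the test/ glob is a hypothetical layout):

```ruby
# Run the most recently modified test files first, so the tests you are
# actively working on fail (or pass) as early as possible.
test_files = Dir.glob("test/**/*_test.rb")
by_recency = test_files.sort_by { |path| -File.mtime(path).to_i }
```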
This book focuses on software design, identified as a continuous process that spans the complete lifecycle of a software system. The first part of the book proposes that the main issue in software design lies in managing complexity.
If a software system is hard to understand and modify, then it is complicated; if it is easy to understand and modify, then it is simple.
The rest of the book is a collection of principles, techniques, and heuristics to help remove or hide complexity, replacing it with a simpler design.
It is easier to tell whether a design is simple than it is to create a simple design,
Probably the most salient piece of advice is that “modules should be deep”: a module is deep when it provides a narrow interface to its callers while delivering a lot of functionality that abstracts away the details of the implementation.
Adding garbage collection to a system actually shrinks its overall interface, since it eliminates the interface for freeing objects. The implementation of a garbage collector is quite complex, but that complexity is hidden from programmers.
Overall, I found the book worthwhile, especially its attitude that the overall design of a system is constantly shifting. Individual programmers add to or remove from its complexity in small increments every time they make changes to the system. Cutting corners too often will leave the code in a state that is hard to recover from.
My own attitudes to software design align well with Ousterhout’s except for comments and tests. The author uses comments as a design aid: Writing interface comments first, before implementing any code, so that they guide the design. It gets the programmer thinking about how the module will be used, instead of how it will be implemented. As for tests:
The problem with test-driven development is that it focuses attention on getting specific features working, rather than finding the best design.
I wholeheartedly agree with the goal of writing comments first: outside-in thinking results in better design. Focusing on how a module will be used from a caller’s perspective improves the module’s API. Sometimes comments can serve that purpose, but I think that the author misses the point that test-driven development (TDD) accomplishes that purpose as well. When you write your tests first, by definition you are forced to think about how the module will be called, because the test itself uses it! In fact, TDD works best when you start writing tests in the outermost layer of your system and work your way inwards. It takes some time to get used to, because the outermost test won’t pass until the innermost implementation is complete. The gain is that those tests inform the design through the layers. As for the criticism about TDD being too focused on getting specific features working, I think that describes a “shallow” TDD. TDD is typically a red-green-refactor loop. Red: write a failing test. Green: make it pass. Refactor: improve the design. I would agree with Ousterhout if we stopped at red-green, but the last step, the refactor, is what makes it complete: red improves the API design, green makes it correct, refactor improves the internal design.
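A toy illustration of the loop (the #slugify method and its spec are hypothetical):

```ruby
# Red: write the test helper and the failing expectation first. The test
# describes the API from the caller's perspective, before #slugify exists.
def assert_equal(expected, actual)
  raise "Expected #{expected.inspect}, got #{actual.inspect}" unless expected == actual
end

# Green: the simplest implementation that makes the expectation pass.
def slugify(title)
  title.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
end

assert_equal "hello-world", slugify("Hello, World!")

# Refactor: with the test as a safety net, improve the internals
# without changing the behavior the test pins down.
```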
Links:
I’ve seen some recent posts on social media about the great performance of Ruby + YJIT. It’s time to give it a try!
I got it working locally with asdf:
$ asdf install rust 1.72.1
$ export ASDF_RUST_VERSION=1.72.1
$ export RUBY_CONFIGURE_OPTS=--enable-yjit
$ asdf install ruby 3.2.1
$ asdf shell ruby 3.2.1
$ ruby --yjit -v
ruby 3.2.1 (2023-02-08 revision 31819e82c8) +YJIT [arm64-darwin22]
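Once it’s running, you can also confirm from inside a Ruby process whether YJIT is actually active (RubyVM::YJIT ships with CRuby; the constant may be absent on builds compiled without Rust, so the sketch below checks for it first):

```ruby
# Report whether this Ruby was built with YJIT and whether it is enabled.
yjit_built = defined?(RubyVM::YJIT) ? true : false
yjit_on = yjit_built && RubyVM::YJIT.enabled?

puts "YJIT built in: #{yjit_built}"
puts "YJIT enabled:  #{yjit_on}"
```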
The author points out that rails app:update should be used with caution, because it might make unwanted changes to your application or remove manually added configuration. Fair enough. What I don’t understand is the remedy: not using it at all! That is what version control is for. I’ve upgraded multiple apps, multiple times, using rails app:update. In every case, before committing the changes to version control, I inspect each one and make an informed decision about whether to keep it.