Ylan Segal

The REPL: Issue 59 - July 2019

View-centric performance optimization for database-backed web applications

This post is a walk-through of the academic paper of the same title. Keeping page-load time low continues to be important, but it has become an increasingly challenging task due to the ever-growing amount of data stored in back-end systems. The authors created a view-centric development environment that provides intuitive information about the cost of each HTML element on a page and highlights performance-enhancing opportunities. The goal is to make it easier to explore functionality and performance trade-offs.

Interestingly, the development environment, Panorama, targets the Ruby on Rails framework specifically. I look forward to trying it out soon.

Zanzibar: Google’s Consistent, Global Authorization System

This paper includes a thorough description of the architecture behind Zanzibar, a global system for storing and evaluating access control lists internal to Google. As a highly distributed system, it builds on top of other Google technology, like Spanner, Google's globally distributed database. In particular, I was very interested in the consistency model and how the system provides guarantees around external consistency so that the causal ordering of events is maintained. It achieves this by providing clients with a token after each write operation (called a zookie): when a client makes a subsequent request with that token, the system guarantees that any results are at least as fresh as the timestamp encoded in the zookie.

The paper covers a lot more, including how the system is architected for performance with caching layers and a purpose-built indexing system for deeply nested, recursive permission structures.

Fast Feedback Loops

One of the reasons that I love TDD is that it promotes fast feedback. You write a line, execute the tests, and see what the results are. I practice outside-in TDD most of the time. Occasionally, I don’t have a clear idea of what tests to write, or I am doing exploratory coding.

For example, lately I’ve found myself writing a fair number of raw SQL queries (without an ORM). SQL is finicky and produces notoriously hard-to-decipher errors. As a consequence, I like to build up SQL in small increments and execute the work-in-progress statement often, to see it and its output alongside each other. My workflow looks something like this:
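
(A rough sketch, with a made-up query and table; the commented lines are psql’s output for the statement above them.)

SELECT id, email FROM users WHERE created_at > now() - interval '7 days';
--  id |       email
-- ----+-------------------
--   1 | alice@example.com
--   2 | bob@example.com
-- (2 rows)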

What is going on? I selected some SQL, executed it in psql, and appended the commented-out output to the same selection. After inspection, I can change the statement and repeat.

Why?

The benefit I get from this workflow is that I can iterate in small steps, get feedback on what the current code does, and continue accordingly. This workflow is heavily inspired by Ruby’s xmpfilter and the newer seeing_is_believing. Both tools take Ruby code as input, execute it, and record the results of evaluating all (or some) of the lines as comments alongside the code.

How?

This workflow is made possible by leveraging the pipe Atom package, which I have previously described. It allows sending the current selection in Atom to any Unix command (or series of piped commands) and replaces the selection with the output.

Building on top of that, I wanted a Unix command (which I called io, for lack of imagination) that would output both the original input and the commented-out output:

#!/usr/bin/env bash
# Prints stdin, executes it with the given program, and appends the output as comments.

set -euo pipefail

# Determine which comment pattern to use
case $1 in
  psql)
    comment='-- ' ;;
  *)
    comment='# ' ;;
esac

grep -v "^$comment" /dev/stdin | tee >("$@" | sed "s/^/$comment/")

The case statement selects the correct comment prefix. It is customary in many Unix tools to treat a line starting with # as a comment. psql is different, in that it uses a -- prefix. I haven’t needed support for anything else, but it’s easily extensible.
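
As a rough usage sketch, assuming the script is saved as io somewhere on the PATH and that a database named dev_db exists (both names are made up here):

$ printf 'SELECT 1 AS one;\n' | io psql -d dev_db
SELECT 1 AS one;
--  one
-- -----
--    1
-- (1 row)

Because any lines that already start with the comment prefix are stripped by grep before the statement is executed, piping the combined text through io again refreshes the commented-out results instead of accumulating them.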

The meat of the execution breaks down like this:

# Reads /dev/stdin and removes any lines starting with the comment prefix
grep -v "^$comment" /dev/stdin

# The comment-less input is now sent to tee.
# tee copies its input to a file and to stdout.
| tee

# Instead of a file, we give tee a subshell as a file descriptor, using
# process substitution.
>( )

# That subshell will execute the rest of the arguments passed to io
# as a command
"$@"

# The output is piped to sed, to add the comment prefix to every line
| sed "s/^/$comment/"

The result is that the final output is what we have been looking for: The original input without comments, plus the output of executing it, prefixed as comments.
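
The tee and process-substitution combination can also be seen in isolation, without psql involved; the input here is arbitrary:

$ printf 'hello\nworld\n' | tee >(sed 's/^/# /')
hello
world
# hello
# world

Because tee and the substituted process run concurrently, the exact interleaving of plain and commented lines is not guaranteed, but for small inputs the commented lines tend to come last.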

From my point of view, this is a great example of the Unix philosophy: Composing utilities to create new functionality. I took advantage of the flexibility in input/output redirection and process substitution to improve my development workflow.

Book Review: Designing Data-Intensive Applications

Designing Data-Intensive Applications is one of the best technical books I’ve read in a long time. Data storage and retrieval is central to most software projects. There is an ever-growing ecosystem of databases, stream-processing systems, message queues, and other related systems. The book successfully explains how these technologies are different, how they are similar, and how they fit together.

The book is split into three parts. In Part I, Foundations of Data Systems, you will become acquainted with the underlying algorithms that modern database systems use to store and retrieve data, and with the differences between relational, NoSQL, and graph databases. You will learn what different vendors think ACID means, and what guarantees are actually made by the different transaction isolation levels. Part II, Distributed Data, covers replication, partitioning, common pitfalls with distributed systems, and how it all ties back to the central problems of consistency and consensus. In Part III, Derived Data, we learn about batch processing and stream processing: their similarities, their differences, and their fault-tolerance characteristics.

Along the way, Kleppmann covers event sourcing, change data capture, and immutable logs (think Kafka), and how they can be leveraged to build applications. The explanations in the book are so clear that they make the topics appear simple and accessible. For example, I gained a lot of insight from framing data as either part of the system of record or as derived data. In the same vein, thinking about the data journey as divided into a write path and a read path is very useful: any work not done on the write path will need to be done on the read path. By shifting that boundary, we can optimize one or the other.

I highly recommend this book to any engineer with interests in backend data systems.

The REPL: Issue 58 - June 2019

Per-project Postgres

In this post, Jamey Sharp elaborates on a neat technique to run different versions of postgres on a per-project basis. I learned that you can run postgres on a Unix socket only, without opening a TCP port, which removes the need to manage port assignments for each version of postgres. The technique also has the advantage of keeping the data for the project inside the project’s directory structure. It illustrates the power and flexibility of Unix tools.
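
A minimal sketch of the idea, assuming the PostgreSQL binaries are already on the PATH; the .pgdata directory name is made up for illustration:

# Create a cluster inside the project directory
initdb -D .pgdata

# Start it with TCP disabled, listening only on a Unix socket in that directory
pg_ctl -D .pgdata -o "-k $PWD/.pgdata -c listen_addresses=''" start

# Connect through the socket instead of a host/port pair
psql -h "$PWD/.pgdata" -d postgres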

How to do distributed locking

Martin Kleppmann writes about distributed locks in general, and in particular about the merits of Redlock, a Redis-based distributed-lock algorithm. Kleppmann breaks down the reasons to use a distributed lock, its characteristics, and how Redlock in particular is vulnerable to timing issues. I found this to be great technical writing. The post came about when Kleppmann was researching his book, Designing Data-Intensive Applications. I finished that book a few days ago and hope to write a review soon. I can’t recommend it enough.

The REPL: Issue 57 - May 2019

We Can Do Better Than SQL

Simple SQL statements can read almost like English. With just a bit of complexity (e.g. more than one join), they can quickly become almost impossible to discern. In this post, Elvis Pranskevichus critiques SQL’s shortcomings compellingly. He then introduces EdgeQL, a query language designed to fix those shortcomings. This is the first time I’ve heard of either EdgeQL or EdgeDB.

Is High Quality Software Worth the Cost?

With his traditional knack for analysis and synthesis, Martin Fowler describes how the familiar trade-off between quality and cost that is intuitive in the physical world doesn’t quite hold for software. Software projects are constantly evolving and their requirements keep changing. Internal quality determines the speed at which features can be delivered. Disregarding internal quality leads to software projects where it becomes almost impossible to continue making changes. I can’t recommend this article enough.