  • Avro Schema Evolution

    I’m involved in an ambitious project to split a monolith into a service-oriented architecture connected via a stream of events using Kafka topics. Kafka is agnostic to the data serialization format used, and I’ve been looking at Avro in particular.

    Avro, like Protocol Buffers and Thrift, is a binary format, making it more space-efficient than a verbose text format like JSON. It stands out in that its schemas support logical and complex types, and in that it decouples the writer’s schema from the reader’s schema, which provides flexibility.

    The schema for an application’s data is expected to change over time. In database-backed applications, this is typically done by changing the data shape with a migration. I’ve written about how deploying schema migrations needs to be thought through carefully. In that article I covered a few different strategies, but they all shared a common trait: all the data has one shape before the migration, and a different one afterwards. That is incompatible with applications that rely on an immutable log, where records written with older schemas remain in place. Let’s explore Avro’s schema evolution.
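    As a rough illustration of what decoupling the writer's schema from the reader's schema buys you, here is a minimal Python sketch of record resolution. It does not use any Avro library: the schemas are simplified dictionaries, and `SCHEMA_V1`, `SCHEMA_V2` and `resolve` are hypothetical names made up for this example. It only mimics two of the resolution rules (fields missing from the written record take the reader's default, and fields the reader does not know are ignored).

```python
# Simplified stand-ins for Avro record schemas (assumption: real Avro schemas
# are JSON documents that also declare a type for every field).
SCHEMA_V1 = {"fields": [{"name": "id"}, {"name": "email"}]}
SCHEMA_V2 = {
    "fields": [
        {"name": "id"},
        {"name": "email"},
        {"name": "plan", "default": "free"},  # field added in v2, with a default
    ]
}


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema.

    - A field present in the record is copied as-is.
    - A field missing from the record takes the reader's default.
    - A field the reader does not declare is dropped.
    - A missing field without a default is an error.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# An old record (written with v1) read by a new consumer (v2).
old_record = {"id": 1, "email": "a@example.com"}
upgraded = resolve(old_record, SCHEMA_V2)

# A new record (written with v2) read by an old consumer (v1).
new_record = {"id": 2, "email": "b@example.com", "plan": "pro"}
downgraded = resolve(new_record, SCHEMA_V1)
```

    In this sketch, old and new records can sit side by side in the same log, and each consumer sees them in the shape it expects, without rewriting any stored data.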

    Read on →

  • Book Review: Building Git

    Building Git by James Coglan

    git is a wildly successful version control system, used in software companies large and small. Its distributed nature changed software development in many ways. In Building Git, James Coglan re-implements a subset of git’s functionality from the ground up, using Ruby, which has a large standard library and is higher-level than the original C.

    git itself is a large project with a lot of functionality. The book covers a lot of ground, in a step-by-step fashion. Each line of code is explained both conceptually and syntactically.

    Read on →

  • Postgres Ranges

    In my previous posts about bi-temporal data, I wrote a lot of queries whose WHERE clauses compared dates. For example:

    SELECT
        employee_id,
        committee_id
    FROM
        committee_membership
    WHERE
        valid_from <= '2020-05-02'
        AND '2020-05-02' < valid_up_to
        AND tx_applicable_from <= NOW()
        AND NOW() < tx_applicable_up_to
    

    The underlying schema looks like this:

    CREATE TABLE committee_membership (
      employee_id int NOT NULL,
      committee_id int NOT NULL,
      valid_from date NOT NULL,
      valid_up_to date NOT NULL,
      tx_applicable_from date NOT NULL,
      tx_applicable_up_to date NOT NULL
    );
    

    The four dates in the table share the same structure. There are two prefixes, valid and tx_applicable, and two suffixes, from and up_to. This structure hints that the dates represent two different concepts: an interval in time that delineates validity and an interval that delineates applicability.
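    To make the "two intervals" reading concrete, here is a small Python sketch that groups each from/up_to pair into a single half-open interval value, mirroring the comparisons in the WHERE clause above (`from <= d AND d < up_to`). `DateRange`, `Membership` and `members_as_of` are hypothetical names used only for illustration; this is not the Postgres range type itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateRange:
    """Half-open interval [lower, upper): lower is included, upper is not."""

    lower: date
    upper: date

    def contains(self, day: date) -> bool:
        return self.lower <= day < self.upper


@dataclass(frozen=True)
class Membership:
    employee_id: int
    committee_id: int
    valid: DateRange          # replaces valid_from / valid_up_to
    tx_applicable: DateRange  # replaces tx_applicable_from / tx_applicable_up_to


def members_as_of(rows, valid_on, known_on):
    """Rows valid on `valid_on`, according to what was recorded as of `known_on`."""
    return [
        (row.employee_id, row.committee_id)
        for row in rows
        if row.valid.contains(valid_on) and row.tx_applicable.contains(known_on)
    ]


# Made-up sample data for the sketch.
rows = [
    Membership(1, 10,
               DateRange(date(2020, 1, 1), date(2020, 6, 1)),
               DateRange(date(2020, 1, 1), date(9999, 12, 31))),
    Membership(2, 10,
               DateRange(date(2020, 5, 2), date(2020, 5, 3)),
               DateRange(date(2020, 1, 1), date(2020, 2, 1))),
]
result = members_as_of(rows, valid_on=date(2020, 5, 2), known_on=date(2020, 5, 10))
```

    With the pairs grouped this way, the query condition becomes two containment checks, one per concept, instead of four separate date comparisons.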

    Read on →

  • The REPL: Issue 68 - April 2020

    I’ve been spending a lot of time thinking about bi-temporal data in the last few weeks. You can read what I’ve written about it in Bi-Temporal Data and Modeling Bi-Temporal Deletions.

    The Case for Bitemporal Data

    Craig Baumunk presents at the NJ SQL Server User Group on bi-temporal data. He goes over the differences between non-temporal, valid-temporal, and transaction-temporal modeling, and the different types of problems that each solves. He makes the case for why bi-temporal modeling is superior to the other approaches and what the implications are. The presentation is from 2011, but it is still as relevant as ever. Note that the presentation is broken up into 7 different parts.

    Bi-temporal data modeling with Envelope

    Jeremy Beard covers the importance of bi-temporal data modeling, and the type of problems that it can solve. Using a credit score example, he builds up the modeling bit by bit in an intuitive way. The second portion of the article focuses specifically on the implementation in Cloudera EDH, which I don’t use.

    Principles and priorities

    Jeremy Keith writes on how to think about design principles. Sometimes, design principles are truisms that are less than useful (e.g. Make it usable.). Expressing principles as a set of priorities makes them more useful and actionable (e.g. Usability, even over profitability). As an example, he restates the HTML design principles as:

    Users, even over authors. Authors, even over implementors. Implementors, even over specifiers. Specifiers, even over theoretical purity.

    Read on →

  • Modeling Bi-Temporal Deletions

    In non-temporal data, deletions are literal: Specific rows or columns are deleted, because only the current state is modeled. In bi-temporal data, the equivalent operation is modeled by inserting new facts.

    Read on →