A collection of ramblings on computing, and other memorabilia
Having procrastinated long enough, I’ve started reading the SICP as part of my December Adventure. The following ramblings represent my day-to-day progress – however much that may be – sprinkled with other thoughts.
Decided I’d kick this process off with (unsuspecting) passengers, and brought my laptop to the library – I would not dare ruin my physical copy in the trip to and fro – only to become massively distracted, as I’m wont to be, by the promise of doing something else. Still, I read most of Chapter 1, which is largely a discussion on Scheme and programming basics.
I left a few hours later, dejected but determined to do better the next day. First thing: no laptop, I’ll use the physical book or a copy on my e-book reader. Second thing, I’ll try to do the exercises on my phone – a BlackBerry Q20 – and to that end spent a few hours later that night compiling Chicken Scheme 5.4.0 for QNX. It works!
Arrived at the library in better spirits, and bereft of anything that can reasonably connect to the Internet. Finished reading through Chapter 1, and part way through Chapter 2.
Section 1.1.7 discusses Newton’s Method for finding square roots, and provides a definition of the mathematical function like so:
√x = the 'y' such that 'y ≥ 0' and 'y² = x'
As noted further down, the definition does not describe a procedure, but simply the properties of the thing in question; in other words, imperative knowledge vs. declarative knowledge.
However, programming languages in general don’t support the concept of equality-as-identity, which the above seems to denote, only equality-as-(left-associative)-assignment (unless you look at specialized constraint-solving systems such as miniKanren). In a system such as that, the definition above might be enough to calculate square roots for any x!
It’s getting interesting now…
Not much reading done today – hosted a friend from Totnes – but managed to implement PDF thumbnail previews for slidge-whatsapp instead; had the work knocking about in my head for a while. Yay for distractions!
Lots of volunteering obligations these days, and not much space for focus. Working through some exercises from Chapter 1 on the go, however!
Whoa, that’s a lot of time. Spent the week seeing a friend in Devon, and as such didn’t have much computer-time; worked through exercises in Chapter 1 again, but losing track of all the parentheses. Worked on a bit of writing for the site, though.
Continued working through exercises for Chapters 1 and 2 – at this rate, I’ll still be on Chapter 2 by the end of the month. That said, I did get some new sections on writings and movie collections that are semi-related to the overarching project of being productive. Maybe the SICP is a slow burner. ☺️
Ooof, I’m not very good at this thing, am I? Most of the second half of December flew fast, with me only doing small bits of reading; I did, however, manage to spruce up content on the wiki, implemented a “Now Playing” widget on the home-page, and an OpenRing-esque webring on posts (like this one – scroll to the bottom!). I’ll make separate posts for all these in time.
This isn’t the end, of course, and my adventures with SICP will continue into 2025. See you next year!
1.12.2024 00:00 · December Adventure 2024
Some time around 2019¹, Cloudflare introduced its first version of API Tokens, capable of being scoped to granular levels of access for specific actions (e.g. listing, reading, deleting) against specific resources (e.g. accounts, zones, DNS records).
The underlying authorization system providing these granular access capabilities was mostly built out by yours truly over the years, from early 2018 and until my move away from the Cloudflare IAM team in 2022; credit for the initial design, however, goes to Thomas Hill, who established the data model of subject/action/scope, described in more detail below.
Though the system has evolved in both its user-facing aspects and its scale, the basic principles and overarching design of the authorization system (as well as most of the core code) are still much the same as they were back in 2019; a testament to the robustness and flexibility of the design.
Since then, a number of other designs have also emerged, chief of which is Google’s much-vaunted Zanzibar; these systems are complex enough to merit their own separate posts, but suffice it to say that they approach their design quite differently to ours.
At the core of any authorization system lies a question, for example:
Can Alice update DNS Record X in Zone Y?
Each part of this sentence has some significance toward the (yes/no) answer we might receive, and each represents some part of our final data-model, though different systems find different ways of answering questions like this.
Let’s start from Alice, our actor, or subject (we’ll prefer using “subject” further on): any request for authorization is assumed to pertain to an action being taken against some protected system, and any action is assumed to originate somewhere, whether that’s a (human) user, or an (automated) system, or anything in between. Moreover, the subject is assumed to provably be who they say they are, that is, be authenticated, before any authorization request is even made, lest we allow subjects to impersonate one another.
Actions, such as “update” above, are typically performed against resources, such as “DNS record X”, and the two pertain to one another in some way; it makes no sense to try to “update a door”, as much as it doesn’t make sense to “open a DNS record”. Resources, in turn, are identified uniquely (the “X” above) and can furthermore be qualified by context, or “scope” – the “zone Y” above.
A more generalized form of the question above would, then, be:
Can Subject perform Action against Resource with ID, under a specific Scope with ID?
The concepts of subject, action, resource, and scope form the majority of our data-model, with only a sprinkle of organizing parts in between. Before we look at each of these in turn, let’s examine a common thread that exists between them, the idea of a “resource taxonomy”.
How do we determine what makes for a valid authorization request, and how do we ensure that authorization policies are consistent with the sorts of (valid) questions we expect to receive?
Subjects, actions, resources, and their scopes exist in a universe of (expanding) possibilities relevant to their use, and relate to one another intimately.
Different systems solve these issues in different ways, but a centralized system does benefit from solid definitions of what can and cannot be given access to; at Cloudflare, this culminated in a resource taxonomy, an organized hierarchy of possible “things” present in the system, driven by a reverse DNS naming convention, for instance:
com.cloudflare.api.account.zone.dns-record
Which represents an (abstract) DNS record placed under a zone.
Though names appear to contain their full hierarchies (and in some cases do, as DNS records belong to zones, which themselves belong to accounts, with predictable resource naming patterns), they have no real semantics beyond needing to be unique – we could’ve just as well used zone-dns-record
as a unique name, though the reverse DNS convention does come in handy when expressing actions and resource identities.
In our case, the (partial) resource hierarchy of relevance looks like so:
com.cloudflare.api.account
com.cloudflare.api.account.zone
com.cloudflare.api.account.zone.dns-record
com.cloudflare.api.user
com.cloudflare.api.token
If resources have stable references, then actions against those resources also need some sort of stable reference – in our case, we just extend the existing naming convention, adding a verb suffix to resource names, for example:
com.cloudflare.api.account.zone.dns-record.update
Our naming convention is only, thus far, capable of referring to resources in the abstract, and requires a way of specifying which specific resource we’re attempting to give access to; this is accomplished, once again, by adding a suffix to the resource name, this time, an (opaque) unique identifier, e.g.:
com.cloudflare.api.account.zone.dns-record.5d32efec
The 5d32efec
suffix refers to an identifier known by the source-of-truth² for DNS records, and is otherwise opaque to the authorization system – any unique sequence of characters would do.
Putting this all together, you might be able to restate our perennial question above like so (omitting the com.cloudflare.api
prefix for brevity):
Can user.3cf2e98a do account.zone.dns-record.update against account.zone.dns-record.845cf6a7, under scope account.zone.5ab65c35?
Phew, that’s a mouthful.
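Expressed as a structured request – a purely illustrative shape, not the actual API surface – the same check might look something like this:
subject: com.cloudflare.api.user.3cf2e98a
action: com.cloudflare.api.account.zone.dns-record.update
resource: com.cloudflare.api.account.zone.dns-record.845cf6a7
scopes:
  - key: com.cloudflare.api.account.zone.5ab65c35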
The point of establishing a common understanding of what kinds of things can be given access to is that the authorization system is designed to be fundamentally data-agnostic and completely independent from the other systems asking questions of it. Two rules play into this determination:
Authorization policies are the sole property of the authorization system; no other service has any access to them. All other systems can do is ask whether or not access is allowed for a given resource/action, with a yes/no answer.
The authorization system cannot and does not ensure the correctness of authorization policies, beyond its own semantics.
There is no guarantee that zone X belongs to account Y for a corresponding resource/scope relationship, nor is there any guarantee that DNS record 845cf6a7
is a valid ID for that resource; only the systems-of-record can ensure these invariants.
The assumption, then, is as follows: systems asking for access against a specific resource (e.g. a DNS record) are likely in a good place to ensure the IDs they provide are valid; furthermore, they’re also likely in a good place to know the hierarchy of resources, specifically their immediate parents (e.g. the zone and account) to provide as scopes.
It is this decision to decouple resource identity from taxonomy (which remains part of the authorization system, mainly for validation purposes) that has made the system as flexible and long-lasting as it has been.
This business about resources and taxonomies doesn’t actually bring us much closer to answering questions about access in our authorization system; how do we do that?
Firstly, we need to look at the originators of actions, our so-called subjects. As alluded to in the example above, subjects in our system are identified by their unique, fully-qualified resource identifier, e.g.:
com.cloudflare.api.user.3cf2e98a
Which represents a (presumably human) user with ID 3cf2e98a
. There’s nothing else our authorization system needs to know about the subject – remember, subjects are assumed to have already authenticated at a level prior to asking about access, typically by a different system, which would then produce a signed token of some kind (e.g. a JWT), which would then be provided as context to requests made against the authorization system.
Access for subjects is expressed as a collection of “policies”, themselves collections of action and resource identifiers. An example pseudo-policy that would fulfill access for previous examples might look like this:
subject: com.cloudflare.api.user.3cf2e98a
policies:
- actions:
- key: com.cloudflare.api.account.zone.dns-record.update
resources:
- key: com.cloudflare.api.account.zone.dns-record.845cf6a7
scopes:
- key: com.cloudflare.api.account.zone.5ab65c35
Resolving access then becomes a simple matter of traversing assigned policies, and applying the following criteria for each:
Check if actions
list contains the action requested.
Check if resources
list contains the resource requested. If a matching resource contains a list of scopes, check that the request contains matching scope names, ignoring any additional scopes given in the request.
If any policy in the list matches all criteria, then access is allowed.
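As a rough sketch of this evaluation – hypothetical Go types and names, not the production code, and covering only the simple allow-only case described so far – the traversal might look something like this:
package authz

// Check is a single authorization question: can Subject perform Action
// against Resource, under the given Scopes? All fields are fully-qualified
// names, as described above.
type Check struct {
	Subject  string
	Action   string
	Resource string
	Scopes   []string
}

// Resource is a policy resource reference with optional required scopes.
type Resource struct {
	Key    string
	Scopes []string
}

// Policy is a simplified, allow-only policy attached to a subject.
type Policy struct {
	Actions   []string
	Resources []Resource
}

// Allowed returns true if any policy matches the requested action and
// resource, and all scopes attached to the matching resource appear in the
// request; extra scopes given in the request are ignored.
func Allowed(policies []Policy, c Check) bool {
	for _, p := range policies {
		if !contains(p.Actions, c.Action) {
			continue
		}
		for _, r := range p.Resources {
			if r.Key == c.Resource && containsAll(c.Scopes, r.Scopes) {
				return true
			}
		}
	}
	return false
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

func containsAll(have, want []string) bool {
	for _, w := range want {
		if !contains(have, w) {
			return false
		}
	}
	return true
}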
Allowing access to specific resources by ID is all fine and well, until you have to provide access to all DNS records across all zones in an account, including any future DNS records. Rather than putting the burden of updating policies onto humans (or worse, some automated system somewhere), we need a way to allow access to classes of things, all at once.
Turns out our reverse DNS naming convention fits this use-case well; rather than using a concrete identifier as a resource suffix, we can simply use an asterisk to denote a partial wildcard, for instance:
com.cloudflare.api.account.zone.dns-record.*
Use of wildcards here does not intend to denote any kind of lexical matching of IDs – that is to say, you couldn’t really use dns-record.5c*
to match resources with IDs starting with 5c
, as identifiers are opaque as far as the authorization system is concerned.
Rather, the use of an asterisk denotes access to all resources of that kind, but also introduces additional constraints on our policies, namely the mandatory use of a scope, lest we want to provide access to all DNS records anywhere.
A modified policy giving access to all DNS records for our example zone would then look like so:
subject: com.cloudflare.api.user.3cf2e98a
policies:
- actions:
- key: com.cloudflare.api.account.zone.dns-record.update
resources:
- key: com.cloudflare.api.account.zone.dns-record.*
scopes:
- key: com.cloudflare.api.account.zone.5ab65c35
Similar to partial wildcards, we can also specify a standalone asterisk (i.e. a *
without a resource name) as a “catch-all wildcard”, denoting access to all resources under a scope.
As an extra rule, we might define that catch-all wildcard access also covers its top-most scope as if it were the resource itself, whether that resource is requested in fully-qualified or partial wildcard form. That is, the following policy:
subject: com.cloudflare.api.user.3cf2e98a
policies:
- actions:
- key: com.cloudflare.api.account.zone.read
- key: com.cloudflare.api.account.zone.dns-record.update
resources:
- key: *
scopes:
- key: com.cloudflare.api.account.zone.5ab65c35
Will allow access for a check against action zone.read
and resource zone.5ab65c35
. Nevertheless, wildcard matching is only available at the resource level; scopes provided in policies are always assumed to match directly, with no wildcard semantics.
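Sticking with the earlier Go sketch (and leaving the catch-all-includes-its-scope special case aside), resource matching with wildcards might be expressed as a small helper that also reports how specific the match was; that specificity becomes useful when resolving conflicts further down:
package authz

import "strings"

// MatchKind describes how a policy resource key matched a requested
// resource, ordered from least to most specific.
type MatchKind int

const (
	MatchNone     MatchKind = iota
	MatchCatchAll // a lone "*"
	MatchPartial  // e.g. "com.cloudflare.api.account.zone.dns-record.*"
	MatchExact    // a fully-qualified resource name
)

// matchResource compares a policy resource key against the fully-qualified
// resource name given in a check. Partial wildcards match only on the
// abstract resource name, never on portions of the (opaque) identifier.
func matchResource(policyKey, requested string) MatchKind {
	switch {
	case policyKey == requested:
		return MatchExact
	case policyKey == "*":
		return MatchCatchAll
	case strings.HasSuffix(policyKey, ".*") &&
		strings.HasPrefix(requested, strings.TrimSuffix(policyKey, "*")):
		return MatchPartial
	default:
		return MatchNone
	}
}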
So far, our policies have been exclusively aimed at giving unequivocal access to resources. There are times, however, where we might want to throw a fence around things, lest we have the hoi polloi trample on our metaphorical petunias.
Doing so with our system as described is deceptively simple, though some complexity lurks underneath the surface. Let’s first see how we might, for example, give access to all DNS records in a zone, except for a single specific record:
subject: com.cloudflare.api.user.3cf2e98a
policies:
- access: allow
actions:
- key: com.cloudflare.api.account.zone.dns-record.update
resources:
- key: com.cloudflare.api.account.zone.dns-record.*
scopes:
- key: com.cloudflare.api.account.zone.5ab65c35
- access: deny
actions:
- key: com.cloudflare.api.account.zone.dns-record.update
resources:
- key: com.cloudflare.api.account.zone.dns-record.65caf35c
scopes:
- key: com.cloudflare.api.account.zone.5ab65c35
In case you missed it, the addition of an access
field with allow
or deny
values denotes which stance a policy takes, and whether a full match will mean access is allowed or denied.
Herein the problems begin: requesting access for dns-record.65caf35c
under zone.5ab65c35
will have both policies match, one to allow since we’re given access to all DNS records for the zone, and one to deny, since we’re denied access to the specific DNS record.
How do we resolve this conflict?
We could determine which policy “wins” by just taking the last decision made (i.e. the last policy in the list) as final; that, however, would put the onus of ensuring that policies are ordered the right way on users, with potentially catastrophic consequences if they are not.
We must therefore assume the intentions of our users – why would anyone take away access, only to give it back immediately? Clearly the opposite must always be true (especially since no access at all is the default): deny policies always trump allow policies, if the two overlap.
We can, and will, however, further elaborate on this rule, as the specificity of matching matters as well; direct matches against specific resources trump partial wildcard matches, which trump catch-all wildcard matches. It should, then, be possible to say the following:
Allow access to all resources under account X, but deny access to all resources under zone Y (including the zone itself), except for DNS records, but not including DNS record Z.
Which we might translate into the following policy representation:
subject: com.cloudflare.api.user.3cf2e98a
policies:
- access: allow
actions:
- key: com.cloudflare.api.account.zone.read
- key: com.cloudflare.api.account.zone.dns-record.update
resources:
- key: *
scopes:
- key: com.cloudflare.api.account.9cfe45ac
- access: deny
actions:
- key: com.cloudflare.api.account.zone.read
- key: com.cloudflare.api.account.zone.dns-record.update
resources:
- key: *
scopes:
- key: com.cloudflare.api.account.zone.5ab65c35
- key: com.cloudflare.api.account.9cfe45ac
- access: allow
actions:
- key: com.cloudflare.api.account.zone.dns-record.update
resources:
- key: com.cloudflare.api.account.zone.dns-record.*
scopes:
- key: com.cloudflare.api.account.zone.5ab65c35
- key: com.cloudflare.api.account.9cfe45ac
- access: deny
actions:
- key: com.cloudflare.api.account.zone.dns-record.update
resources:
- key: com.cloudflare.api.account.zone.dns-record.65caf35c
scopes:
- key: com.cloudflare.api.account.zone.5ab65c35
- key: com.cloudflare.api.account.9cfe45ac
Of course, policies this complex are fairly rare, but they do exist, and catering to these requirements is important in a system that purports to be as flexible as possible.
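Tying the conflict-resolution rules together with the earlier sketches, one (hypothetical) way of modelling the final decision is to track the most specific match seen so far, letting deny win over allow on ties; again, this is illustrative, not the production implementation:
package authz

// Effect is the stance a policy takes when all of its criteria match.
type Effect int

const (
	Allow Effect = iota
	Deny
)

// ScopedPolicy extends the earlier Policy sketch with an access stance.
type ScopedPolicy struct {
	Access    Effect
	Actions   []string
	Resources []Resource
}

// Resolve walks all policies and returns the final decision for a check:
// more specific resource matches win over less specific ones, and on equal
// specificity a deny wins over an allow. No match at all means no access.
func Resolve(policies []ScopedPolicy, c Check) bool {
	var bestKind MatchKind
	var bestEffect Effect
	matched := false

	for _, p := range policies {
		if !contains(p.Actions, c.Action) {
			continue
		}
		for _, r := range p.Resources {
			kind := matchResource(r.Key, c.Resource)
			if kind == MatchNone || !containsAll(c.Scopes, r.Scopes) {
				continue
			}
			switch {
			case !matched, kind > bestKind:
				matched, bestKind, bestEffect = true, kind, p.Access
			case kind == bestKind && p.Access == Deny:
				bestEffect = Deny
			}
		}
	}
	return matched && bestEffect == Allow
}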
So far, we’ve been providing lists of actions and resources as direct references, but doing so in real life would be incredibly onerous (especially given the large number of options available to us). The solution to this is – you guessed it – normalization, or in other words, grouping things under a unique name we can refer to.
For actions, we can form action groups, or as they’re sometimes (and perhaps confusingly) called, “roles”. These would represent collections of possible actions available to a subject in the abstract, not tied to any specific resource or scope. For instance, an example “DNS Administrator” action group might look like this:
id: 9aff84ac
name: DNS Administrator
actions:
- key: com.cloudflare.api.account.zone.dns-record.read
- key: com.cloudflare.api.account.zone.dns-record.create
- key: com.cloudflare.api.account.zone.dns-record.update
- key: com.cloudflare.api.account.zone.dns-record.delete
Similarly, resource definitions can benefit from being named and referenced separately, for instance:
id: fd25a5dd
name: Production Zones
resources:
- key: com.cloudflare.api.account.zone.2acf325f
scopes:
- key: com.cloudflare.api.account.6afe524a
- key: com.cloudflare.api.account.zone.33cfade6
scopes:
- key: com.cloudflare.api.account.6afe524a
One might then assign these action groups and resource groups to a policy under separate action_groups and resource_groups fields respectively, to the exclusion of the actions and resources fields for the policy, e.g.:
subject: com.cloudflare.api.user.3cf2e98a
policies:
- access: allow
action_groups:
- id: 9aff84ac # DNS Administrator
resource_groups:
- id: fd25a5dd # Production Zones
It is assumed that these action and resource groups are managed separately, and can be re-used as needed by the owning user (which also implies that access to these resources is also governed by the authorization system, which needs to control access from itself, a fun exercise in recursion).
This post is not so much a guide to the current (or really, past) state of the production system at Cloudflare; a number of omissions and simplifications have been made to keep this from becoming a rambling epic.
Rather, it is a high-level overview of the design thinking behind a production system that is part of every single request made to the public Cloudflare API (and a few more still), and which hopefully serves as a map for others looking to build similar systems, the core idea being: authorization is about ensuring the basic questions being asked can be answered as quickly and unambiguously as possible.
Typically, this means a lot of work goes into policy semantics, and for us, this meant building out a resource taxonomy in order to balance the data-agnostic nature of the system with the need to ensure that policies are always correctly formulated.
There is still a lot of ground left to cover, however, first of all being ABAC. I’ll leave that for a future post, then.
Specifically, August 2019, though the beta was opened a couple of months earlier. ↩︎
In other words, the service or component responsible for handling DNS records. ↩︎
Truly robust systems are primarily achieved through a combination of code reuse and consistent API design; these inform, and are informed by, the broader developer experience (e.g. team structures, workflows, etc.)
The degree to which code reuse can be achieved depends intrinsically on the cohesiveness of developer practice. Teams that use different architectures, programming languages, and development approaches will find it all the harder to reuse code except at the edges of their practice.
Code reuse and consistent API design, and therefore the broader developer experience, are fundamentally economies of scale. Production systems rarely exist in a vacuum, and thus the true robustness of a system is a function of all its dependencies; similarly, bugs in shared code are (potentially) experienced together, but are also solved together and for all consumers of a system.
Code reuse exists mainly in the space between integration and use, or in other terms, boilerplate and business logic. Boilerplate is assumed to be difficult to make cohesive across systems, and business logic is assumed to be inherently what makes a system unique to its own needs.
These conceptions place artificial limits on the degree of code reuse even within individual systems, and solving this conundrum requires rethinking the definitions of boilerplate code, library code, and business logic, as well as the divisions between them.
All of the above principles exist both at the macro level, i.e. in the way disparate teams might participate in a service-oriented architecture, and at the micro level, i.e. in the way any given service is architected within a team, whether or not any external team or code dependencies are implied.
Serving Go modules off a custom domain (such as go.deuill.org
) requires only that you’re able to
serve static HTML for the import paths involved; as simple as this sounds, finding salient
information on the mechanics involved can be quite hard.
This post serves as a gentle introduction to the mechanics of how go get
resolves modules over
HTTP, and walks through setting up Hugo (a static site generator) for serving Go modules off a
custom domain.
Why would anyone want to do this? Aside from indulging one’s vanities, custom Go module paths can potentially be more memorable, and control over these means you can move between code hosts without any user-facing disruption.
At a basic level, go get
resolves module paths to their underlying code by making HTTP calls
against the module path involved; responses are expected to be in HTML format, and must contain a
go-import
meta tag pointing to the code repository for a supported version control system, e.g.:
<meta name="go-import" content="go.example.com/example-module git https://github.com/deuill/example-module.git">
In this case, the go-import
tag specifies a module with name (or, technically, the “import
prefix”, as we’ll find out in later sections), go.example.com/example-module
, which can be
retrieved from https://github.com/deuill/example-module.git
via Git.
Go supports a number of version control systems, including Git, Mercurial,
Subversion, Fossil, and Bazaar; assuming the meta tag is well-formed, go get
will then pull code from the repository pointed to, using the VCS specified.
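If you want to inspect exactly what the toolchain sees, you can make the same request yourself; go get appends a ?go-get=1 query parameter to the URL, which a static host such as ours is free to ignore:
$ curl -s 'https://go.example.com/example-module?go-get=1' | grep go-import
<meta name="go-import" content="go.example.com/example-module git https://github.com/deuill/example-module.git">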
Thus, with this content served over https://go.example.com/example-module
, we can then make our
go get
call and see code pulled from Github:
$ go get -v go.example.com/example-module
go: downloading go.example.com/example-module v0.0.0-20230325162624-6da6d8c20f04
…
The repository pulled must be a valid Go module – that is, a repository containing a valid go.mod
file – for it to resolve correctly; this file must also contain a module
directive that has the
same name as the one being pulled.
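For completeness, a minimal go.mod for the example module might look like the following (the Go version here is arbitrary):
// go.mod
module go.example.com/example-module

go 1.21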
…that’s pretty much it!
It is, of course, quite plausible that we can just stuff static HTML files in our web root and be done with it, but where’s the fun in that? Furthermore, Hugo allows us to create complex page hierarchies using simple directives, which is definitely of use to us here.
In order to get a workable Go module host set up in Hugo, we need two things: content items, each representing a Go module, and a set of templates to render these as needed.
First, we’ll need to set up a new site with Hugo:
$ hugo new site go.deuill.org
This will set up a fairly comprehensive skeleton for our Go module host, but will need some love
before it’s anywhere near useful. First, add some basic, site-wide configuration – open hugo.toml
and set your baseURL
and title
to whatever values you want, e.g.:
# hugo.toml
baseURL = 'https://go.deuill.org'
languageCode = 'en-us'
title = 'Go Modules on go.deuill.org'
Next, let’s add a basic module skeleton as a piece of content – a Markdown file in the content
directory named example-module.md
. This file doesn’t actually need to contain any content per se:
rather, we’ll be looking to use Hugo front matter, i.e. page metadata, to
describe our modules (though of course the extent to which we take rendering content is entirely up
to us):
# content/example-module.md
---
title: Example Module
description: An example Go module with a custom import path
---
Since Hugo doesn’t create any default templates for rendering content pages, we’ll need to create some basic ones ourselves; at a minimum, we’ll need a page for the content itself, but ideally we’d also want to be able to see all modules available on the host.
Let’s tackle both in turn – for content-specific rendering, we’ll need to add a template file in
layouts/_default/single.html
. The content can be quite minimal for the moment:
<!-- layouts/_default/single.html -->
<!doctype html>
<html lang="{{.Site.LanguageCode}}">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>{{.Title}} - {{.Site.Title}}</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<main>
<h1>{{.Title}}</h1>
<aside>{{.Description}}</aside>
</main>
</body>
</html>
Similarly, our home-page template is located in layouts/index.html
, and looks like this:
<!-- layouts/index.html -->
<!doctype html>
<html lang="{{.Site.LanguageCode}}">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>Go Modules - {{.Site.Title}}</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<main>
<h1>{{.Site.Title}}</h1>
{{range .Site.RegularPages}}
<section>
<h2><a href="{{.Permalink}}">{{.Title}}</a></h2>
<aside>{{.Description}}</aside>
</section>
{{end}}
</main>
</body>
</html>
So far, so good – if you’ve been following along here, you should, by this point, have a fairly
brutalist listing of one solitary Go module, itself not quite ready for consumption by go get
as-of-right-now.
Serving a Go module is, as shown above, a simple matter of rendering a valid go-import
tag for the
same URL pointed to by the import path itself. Rendering a go-import
tag requires three pieces of
data, at a minimum:
The import prefix (e.g. go.example.com/example-module).
The VCS type, e.g. git or hg.
The repository URL, e.g. https://github.com/deuill/example-module.
.Hugo supports adding arbitrary key-values in content front matter, which can then be accessed in
templates via the {{.Params}}
mapping. Simply enough, we can extend our content file
example-module.md
with the following values:
# content/example-module.md
---
title: Example Module
description: An example Go module with a custom import path
+module:
+ path: go.example.com/example-module
+repository:
+ type: git
+ url: https://git.deuill.org/deuill/example-module.git
---
Producing a valid go-import
tag is, then, just a matter of referring to these values in the
content-specific layout, layouts/_default/single.html
; we can also render them out on-page to make
things slightly more intuitive for human visitors, e.g.:
<!-- layouts/_default/single.html -->
<!doctype html>
<html lang="{{.Site.LanguageCode}}">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>{{.Title}} - {{.Site.Title}}</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
+ <meta name="go-import" content="{{.Params.module.path}} {{.Params.repository.type}} {{.Params.repository.url}}">
</head>
<body>
<main>
<h1>{{.Title}}</h1>
<aside>{{.Description}}</aside>
+ <dl>
+ <dt>Install</dt>
+ <dd><pre>go install {{.Params.module.path}}@latest</pre></dd>
+ <dt>Documentation</dt>
+ <dd><a href="https://pkg.go.dev/{{.Params.module.path}}">https://pkg.go.dev/{{.Params.module.path}}</a></dd>
+ </dl>
</main>
</body>
</html>
Ship it! This setup is sufficient for serving multiple Go modules, each in its own directory, and with a functional, albeit somewhat retro, human-facing interface.
Most modules would be well-served by this setup, but it does assume that the Go module is placed
at the repository root; based on what we know about go get
, the import path needs to resolve to an
HTML file containing a valid go-import
tag, and that import path needs to match the resulting
go.mod
file.
The module path in the go-import tag might, therefore, be assumed to always be equal to the import path; closer reading of the official Go documentation reveals that this
import path; closer reading of the official Go documentation reveals that this
is instead the import prefix, relating to the repository root and not necessarily to any of the Go
modules placed within.
Furthermore, the full path hierarchy must present the same go-import
tag in order to resolve
correctly with go get
. Clearly there are some headaches to be had.
To better illustrate the issue, let’s assume the following repository structure, containing a number of files in the repository root, as well as a Go module in one of the sub-folders, e.g.:
├── .git
│ └── ...
├── LICENSE
├── README.md
└── thing
├── go.mod
├── go.sum
├── main.go
└── README.md
The full import path for this module would be go.example.com/example-module/thing
, which is also
what the module
directive would be set to in the go.mod
file; however, the import prefix
presented in the go-import
tag needs to be set to go.example.com/example-module
.
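To make this concrete, the go.mod inside the thing directory from the layout above would carry the full module path – a sketch, again with an arbitrary Go version:
// thing/go.mod
module go.example.com/example-module/thing

go 1.21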
Given this conundrum, it is not enough to simply set a different module.path
in the content file
for example-module.md
, or even create a separate example-module/thing.md
content file – we need
to ensure that the full hierarchy resolves to a valid HTML file containing a valid go-import
tag,
that, crucially, always points to the import prefix of go.example.com/example-module
.
Turns out that Hugo has yet a few tricks up its sleeve, and can assist us in setting up this complex content hierarchy using a single content file, the trick being content aliases.
Aliases are commonly intended for redirecting alternative URLs to some canonical URL via
client-side redirects (using the http-equiv="refresh"
meta tag); for our use-case, we’ll need to
slightly extend the underlying templates and render a valid go-import
tag alongside the
http-equiv
tag.
First, let’s add the sub-path corresponding to the Go module as an alias in our existing
example-module.md
content file:
# content/example-module.md
---
title: Example Module
description: An example Go module with a custom import path
module:
path: go.example.com/example-module
repository:
type: git
url: https://git.deuill.org/deuill/example-module.git
+aliases:
+ - /example-module/thing
---
Navigating to go.example.com/example-module/thing
will, then, render a default page containing a
minimal amount of HTML, as well as the aforementioned http-equiv
meta tag. We can extend that for
our own purposes by adding a layout file in layouts/alias.html
, e.g.:
<!-- layouts/alias.html -->
<!DOCTYPE html>
<html lang="en-us">
<head>
<title>{{.Page.Title}}</title>
<link rel="canonical" href="{{.Permalink}}">
<meta name="robots" content="noindex">
<meta charset="utf-8">
<meta http-equiv="refresh" content="5; url={{.Permalink}}">
<meta name="go-import" content="{{.Page.Params.module.path}} {{.Page.Params.repository.type}} {{.Page.Params.repository.url}}">
</head>
<body>
<main>This is a sub-package for the <code>{{.Page.Params.module.path}}</code> module, redirecting in 5 seconds.</main>
</body>
</html>
Crucially, this alias has full access to our custom front matter parameters, making integration of additional sub-modules a simple matter of adding an alias in the base content file.
Since our import paths still expect the fully formed path (and not just the prefix), our
human-readable code examples rendered for specific modules will be incorrect in the face of
sub-packages; solving for this is an exercise left to the reader (hint hint: add a sub
field in
front matter, render it in the go get
example via {{.Params.module.sub}}
).
If you’re not a fan of the brutalist aesthetic, and would like to waste your human visitors' precious bandwidth with such frivolities as colors, you could spice things up with a bit of CSS. A little goes a long way:
/* static/main.css */
html {
background: #fefefe;
color: #333;
}
body {
margin: 0;
padding: 0;
}
main {
margin: 0 auto;
max-width: 50rem;
}
a {
color: #000;
border-bottom: 0.2rem solid #c82829;
padding: 0.25rem 0.1rem;
text-decoration: none;
}
a:hover {
color: #c82829;
}
dl dt {
font-size: 1.2rem;
font-weight: bold;
}
dl dd pre,
dl dd a {
display: inline-block;
}
pre {
background: #f0f0f0;
color: #333;
padding: 0.4rem 0.5rem;
overflow: auto;
margin-bottom: 1rem;
white-space: nowrap;
}
Add this to static/main.css
and link to it from within layouts/_default/single.html
and
layouts/index.html
.
The setup described here is directly inspired by and used for hosting my own Go modules, under go.deuill.org. The source-code for the site is available here.
12.8.2023 21:00 · Serving Go Modules with Hugo
I recently re-read through Michael Arntzenius’ excellent list of Aphorisms on Programming Language Design, and a specific point caught my eye:
21. “Declarative” means you can use it without knowing what it’s doing.
All too often, it means you can’t tell what it’s doing, either.
It is, in hindsight, quite obvious – any knowledge of how a declaration is processed implies knowledge of the context that declaration will be used in, be it a configuration file, a SQL query, a programming language REPL, and so on. That is to say, context matters, but more importantly, knowledge of the context matters even more.
It seems to me that this maxim can be applied beyond what we think of as strictly “declarative programming languages”.
Though programming languages differ in their modes of expression, the act of writing in a programming language is, itself, always declarative. This, too, seems obvious in hindsight; regardless of whether you’re writing in C or SQL, you’re not telling the computer what to do, but rather how it should be done, or what the outcome should be. Furthermore, both of these exercises require foreknowledge of the context: in the former case, knowledge of the C language and libraries, in the latter, knowledge of SQL and, for some use-cases, how the query planner works.
Of course, things are made easier by the fact that the behaviour of both C and SQL is covered by specifications, and both operate in fairly standard ways regardless of which compiler or database engine is used. On the other hand, though one might readily understand the formatting rules of any INI, JSON, or YAML file, the same cannot be said about how these are used once parsed – it heavily depends on the context. When it comes to DSLs, all bets are off.
How are we, then, to build systems that minimize this sort of “external” knowledge, or that otherwise make the least amount of assumptions in defining what is explicit and what is implicit?
A few things come to mind:
Use prior art or knowledge. One can leap-frog years of experience-building by utilizing pre-existing patterns, especially where these don’t form core competencies or differentiators.
Don’t override the meaning of existing conventions, especially in isolation. Different approaches to design should look (and feel) different.
Make things that look similar also behave in similar ways; a symbol used to mean one thing in a specific context should ideally not be re-used for different semantics in a different context.
This only seems to confirm another of Michael’s Aphorisms:
19. Syntax is a pain in the ass.
13.4.2023 17:30 · External Knowledge in Declarative Systems
Fedora CoreOS has saved me. Believe me, it’s true – I was but a poor, lost soul, flitting through the hallways of self-hosting redemption that have seen so many a young folk driven to madness – all of this changed approximately a year ago, when I moved my home-server over to CoreOS, never (yet) to look back.
The journey that led me here was long and perilous, and filled with false twists and turns. The years from 2012 to 2017 were a simpler time, a mix of Ubuntu and services running directly on bare metal; so recent this was, that it might be atavism. Alas, this simplicity belied operational complexities and led to an unrelenting accumulation of cruft, so the years from 2017 to 2021 had me see the True Light of Kubernetes, albeit under a more minimal single-node setup with Minikube.
My early days with Kubernetes were carefree and filled with starry-eyed promises of a truly declarative future, so much so that I in turn declared my commitment to the world. It wasn’t long after until the rot set in, spurred by a number of issues, for example: Minikube will apparently set up local TLS certificates with a year’s expiration, after which kubectl
will refuse to manage resources on the cluster, and which might cause the cluster to go belly up in the case of a reboot. And even with Kubernetes managing workloads, one still needs to have a way of setting up the host and cluster, for which there’s a myriad of self-proclaimed panaceas out there.
Clearly, the answer to complexity is even more complexity: simply sprinkle some Ansible on top and you’ve got yourself a stew. And to think there was a time where I entertained such harebrained notions.
At first, it was a twinkle, a passing glance. Fedora CoreOS doesn’t feature as large in the minds of those of us practicing the Dark Arts of Self-Hosting (though I’m hoping this changes as of this post), and is relegated to being marketed as experimental, nascent. Nothing could be further from the truth.
The pillars on which the CoreOS temple is built are three, each playing a complementary role in what makes the present gushing appropriate reading material:
Butane/Ignition, in which our host can be set up in a declarative manner and in a way which allows mere mortals such as myself to comprehend. The spec is short, read it.
Podman, in which containerized workloads are run. For many, Podman is simply a new, drop-in replacement for Docker, but it can be much more than that.
systemd, which needs little introduction, and in which all of our disparate orchestration needs are covered; service dependencies, container builds, one-time tasks, recurring tasks, all handled in all of their glorious complexities.
Tying the proverbial knot on top of these aspects is how much the system endeavours to stay out of your way, handling automatic updates and shipping with a rather hefty set of SELinux policies, ensuring that focus remains on the containers themselves.
Before we head into the weeds, let’s try to address why you might even care about working with CoreOS; if anything, a bare-metal host will do well for most simple workloads, and Kubernetes isn’t all that unapproachable for a more complex single-node setup. How does CoreOS differentiate itself from other systems?
I can only really answer this from my own experience, but the main points that make CoreOS a worthwhile investment are:
The system is stable and robust, and is intended to be as hands-off as possible. This generally means you won’t have to worry about the base system itself across its entire life-cycle. One might argue that this is no different to any bare-metal system set up with auto-updates, though I’d personally never have these extend to system upgrades (and perhaps nothing beyond security updates).
The system is reasonably secure, and tries to make user interactions and workloads reasonably secure as well. This sometimes leads to inflexibility, as is the case with SELinux (which, if you’re not familiar with, is hell to try to understand), but the system has its way of keeping the user honest, which is a boon long-term.
The system has a good end-to-end deployment story, and is accompanied by excellent documentation. This generally means that you can rely on well-integrated workflows in testing, deploying, and updating your CoreOS-based system, and not have to resort to strange contortions or third-party/custom solutions in doing so.
Contrast these points with your typical bare-metal or Kubernetes-based setup (which is really just a layer above a bare-metal setup that you need to maintain separately):
Ubuntu and other similar efforts can be stable long-term (I probably stayed on the same LTS version of Ubuntu for 3 years), but this can lead to update stagnation and issues when time comes to move to the next major version of the OS. Most bare-metal OSs are designed to be managed as deeply as is necessary, which can lead to issues if there’s no discipline in change control.
In addition to this, keeping a Kubernetes cluster updated can be a full-time job for many, and even a single-node Minikube/K3s setup is not zero-maintenance by any means, and comes with its own set of perils.
As mentioned above, typical bare-metal setups tend to approach security in a less-than-holistic way, and give users the brunt of choice in deciding how to secure user workloads from the system, and vice-versa. Given how complicated security is, and how one shouldn’t connect a toaster, let alone an email server, to the internet for fear of having their house burn down, leaving these choices to the user may not work out for the long run.
Having user workloads run under Kubernetes improves the situation somewhat, as one is given a multitude of controls designed to separate and secure these from one another (e.g. network policies, CPU and memory limits); however, Kubernetes is also supremely complex, and is itself subject to esoteric security concerns.
Deployment, documentation, and upgrade concerns are typically rather disparate in other systems, and the quality of documentation varies wildly between communities. Kubernetes itself is well-documented, but remains complex and occupies a large surface area not typically needed for a simple home-server setup.
People tend to pick and choose solutions based on what their goals are, and the extent to which they’re comfortable learning about and maintaining these solutions long-term. If you’re looking for a system that is minimal, uses common components, and largely stays out of your way after deployment, CoreOS is a perfect middle-of-the-road solution.
There’s a few things to keep in mind, going into a CoreOS-based setup:
The system is immutable, and you’re expected to use the system as-is, out-of-the-box, and without needing to rely on anything not installed by default. Don’t even think about reaching for rpm-ostree
. In fact, don’t even think of storing anything outside of /var
, and maybe /etc
.
SELinux policies are pre-configured to be fairly restrictive, which means there’s quite a lot of functionality unavailable outside of interactive use – this includes things like using gpg
in systemd services.
Although not in any way unstable, the Podman ecosystem is still moving fast and may not be as feature-complete as one might expect coming from Kubernetes, or even Docker.
CoreOS will auto-update even between major versions, and unless configured otherwise, will reboot as needed when new versions become available. Allowing for system reboots is good hygiene; embrace the chaos.
CoreOS comes with a sizeable amount of documentation of excellent quality which will be useful once you’re ready to get your hands dirty, but the rest of this spiel will instead focus on setting up a system based on the CoreOS Home-Server setup I depend on myself. Clone this locally and play along if you wish, though I’ll cut through the abstractions where possible, and explain the base concepts themselves.
With all that disclaimed and out the way, let’s kickstart this hunk of awesome.
Butane is a specification and related tooling for describing the final state of a new CoreOS-based system, using YAML as the base format; this is then compiled into a JSON-based format and used by a related system, called Ignition. Both systems follow similar semantics, but Butane is what you’ll use as you develop for your host.
Let’s imagine we’re looking to provision a bare-metal server with a unique hostname, set up for SSH access via public keys. Our Butane file might look like this:
variant: fcos
version: 1.4.0
passwd:
users:
- name: core
ssh_authorized_keys:
- ecdsa-sha2-nistp521 AAAAE2VjZHNhL...
storage:
files:
- path: /etc/hostname
mode: 0644
contents:
inline: awesome-host
The default non-root user for CoreOS is aptly named core
, so we add our SSH key there for convenience; Butane allows for creating an arbitrary amount of additional users, each with pre-set SSH keys, passwords, group memberships, etc.
In addition, we set our hostname not by any specific mechanism, but simply by creating the appropriate file with specific content – we could, alternatively, provide the path to a local or even a remote file (over HTTP or HTTPS). Simplicity is one of Butane’s strengths, and you might find yourself using the same basic set of directives for the vast majority of your requirements.
Place this under host/example/spec.bu
if you’re using the coreos-home-server
setup linked to above, or simply example.bu
if not. Either way, these definitions are sufficient for Butane to produce an Ignition file, which we can then use to provision our imaginary CoreOS-based system. First, we need to run the butane
compiler:
$ butane --strict -o example.ign example.bu
Then, we need to boot CoreOS and find a way of getting the example.ign
file there. For bare-metal hosts, booting from physical media might be your first choice – either way, you’ll be dropped into a shell, waiting to install CoreOS based on a given Ignition file.
If you’re developing your Butane configuration on a machine that’s on the same local network as your home-server, you can use good ol’ GNU nc
to serve the file:
# Assuming the local IP address is 192.168.1.5.
$ printf 'HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n%s\n' "$(wc -c < example.ign)" "$(cat example.ign)" | nc -vv -r -l -s 192.168.1.5
This mess of a shell command should print out a message confirming the listening address and random port assignment for our cobbled-together HTTP server. If you’re using coreos-home-server
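If the nc incantation feels too fragile, any static file server pointed at the directory containing example.ign will do; for instance, assuming Python 3 is available on your workstation:
$ python3 -m http.server 8080 --bind 192.168.1.5
The Ignition URL passed to the installer would then be http://192.168.1.5:8080/example.ign.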
, all of this is handled by the deploy
target, i.e.:
$ make deploy HOST=example VERBOSE=true
You’re then ready to refer to the HTTP URL for the host Ignition file based on the local IP address and random port assignment over at your live system:
$ sudo coreos-installer install --insecure-ignition --ignition-url http://192.168.1.5:31453 /dev/sda
Assuming all information is correct, you should now be well on your way towards installing CoreOS on the /dev/sda
disk.
So, what do you do once you’re here? In short, nothing – the system, as shown in the above example, has configuration enough to give us SSH access into an otherwise bare system. CoreOS doesn’t come with much functionality other than the minimum needed to support its operations, and when I said the system is immutable, I meant it: you’re not supposed to re-apply Ignition configuration beyond first boot¹.
Instead, the blessed way of expanding the functionality of a CoreOS-based server is re-deploying it from scratch; we’ll bend this rule slightly, but it’s important to understand that we’re not intended to tinker too much with the installed system itself, as this would contradict the notion of repeatability built into CoreOS as a whole.
A simpler way of testing our changes is available to us by using virt-install
, as described in this tutorial, or by using the deploy-virtual
target:
$ make deploy-virtual HOST=example VERBOSE=true
This, again, is a major strength of CoreOS – alternative systems require the arrangement of a more complex and disparate set of components, in this case (most likely) something like Vagrant (in addition to, say, Ansible). Virtual hosts don’t only help in developing new integrations, but also allow us to experiment and test against the same versions of the OS that will end up running on the server itself.
Since the base system (deliberately) allows for little flexibility and customization, we have to explore alternative ways of extending functionality; in CoreOS, the blessed way of doing so is via Podman, a container orchestration system similar to (but not based on) Docker.
Typically, containers are presented as either methods of isolating sensitive services from the broader system alongside more traditional methods of software deployment, or as forming their own ecosystem of orchestration “above the metal”, as it were. Indeed, most distributions expect most software to be deployed via their own packaging system, and, at the other side of the spectrum, most Kubernetes cluster deployments don’t care what the underlying distribution is, assuming it fulfils some base requirements.
Fedora CoreOS stands somewhere in the middle, where Podman containers are indeed the sole reasonable method of software deployment, while not entirely divorcing this from the base system.
I had little knowledge of Podman coming into CoreOS; what I knew was that it’s essentially a drop-in replacement for Docker in many respects (including the container definition/Dockerfile
format, technically part of Buildah), but integrates more tightly with Linux-specific features, and does not require a running daemon. This all remains true, and though the Podman ecosystem is still playing catch-up with Docker in a few ways (e.g. container build secrets), it has surpassed Docker in other ways (e.g. the podman generate
and podman play
suite of commands).
Podman and CoreOS will happily work with container images built and pushed to public registries, such as the Docker Hub, but we can also build these images ourselves with podman build
; let’s start from the end here and set up a systemd service for Redis, running in its own container, under a file named redis.service
:
[Unit]
Description=Redis Key-Value Store
[Service]
ExecStart=/bin/podman run --pull=never --replace --name redis localhost/redis:latest
ExecStop=/bin/podman stop --ignore --time 10 redis
ExecStopPost=/bin/podman rm --ignore --force redis
Though far from being a full example conforming to best practices, the above will suffice in showing how systemd and Podman mesh together; a production-ready service would have us use podman generate systemd
or Quadlet (whenever this is integrated into CoreOS).
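For reference, a rough sketch of what the equivalent Quadlet unit might look like – a redis.container file placed under /etc/containers/systemd/, which Quadlet expands into a full service at boot, assuming a Podman version recent enough to ship it:
# /etc/containers/systemd/redis.container (assumes Quadlet support)
[Unit]
Description=Redis Key-Value Store

[Container]
Image=localhost/redis:latest
ContainerName=redis

[Install]
WantedBy=multi-user.target
For the rest of this walkthrough, though, we’ll stick with the plain service file shown above.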
The service above refers to a localhost/redis
image at the latest
version, and also specifies --pull=never
– this means that the service has no chance of running successfully, as the image referred to does not pre-exist; what we need is a way of having the required container image built before the service runs. What better way to do so than with systemd itself?
With the help of some naming and file-path conventions, we can have a templated systemd service handle container builds for us, and thus ensure container dependencies are maintained via systemd requirement and ordering directives. We’ll need an additional file called container-build@.service
:
[Unit]
Description=Container Build for %I
Wants=network-online.target
After=network-online.target
ConditionPathExists=/etc/coreos-home-server/%i/Containerfile
[Service]
Type=oneshot
ExecStart=/bin/podman build --file /etc/coreos-home-server/%i/Containerfile --tag localhost/%i:latest /etc/coreos-home-server/%i
That final @
in the service file name denotes a templated systemd service, and allows us to use the service against a user-defined suffix, e.g. container-build@redis.service
. We can then use this suffix via the %i
and %I
placeholders, as above (one has special characters escaped, the other is verbatim).
The Containerfile
used by podman build
is, for the most part, your well-trodden Docker container format, though the two systems might not always be at feature parity. In this case, we can cheat and just base on the official Docker image:
FROM docker.io/redis:6.2
The only thing left to do is extend our original redis.service
file with dependencies on the container-build@redis
service:
[Unit]
Description=Redis Key-Value Store
Wants=container-build@redis.service
After=container-build@redis.service
Getting the files deployed to the host is simply a matter of extending our Butane configuration, i.e.:
variant: fcos
version: 1.4.0
systemd:
units:
- name: container-build@.service
contents: |
[Unit]
...
- name: redis.service
enabled: true
contents: |
[Unit]
...
storage:
files:
- path: /etc/coreos-home-server/redis/Containerfile
mode: 0644
contents:
inline: "FROM docker.io/redis:6.2"
All of this is, of course, also provided in the coreos-home-server
setup itself, albeit in a rather more modular way: service files are themselves placed in dedicated service directories and copied over alongside the container definitions, and can be conditionally enabled by merging the service-specific Butane configuration (e.g. in service/redis/spec.bu
).
You might be wondering why you’d want to go through all this trouble of building images locally, when they’re all nicely available on some third-party platform; typically, the appeal here is the additional control and visibility this confers, but having our container definitions local to the server allows for some additional cool tricks, such as rebuilding container images automatically when the definition changes, via systemd path units:
[Unit]
Description=Container Build Watch for %I
[Path]
PathModified=/etc/coreos-home-server/%i/Containerfile
Unit=container-build@%i.service
Install this as container-build@.path
and you’ll have the corresponding container image rebuilt whenever the container definition changes. You can even tie this up by having the service file depend on the path file, which will ensure the path file is active whenever the service file is used (even transitively via another service):
[Unit]
Description=Container Build for %I
Wants=network-online.target container-build@%i.path
After=network-online.target container-build@%i.path
The new container image will be used next time the systemd service is restarted – Podman has built-in mechanisms for restarting systemd units when new container image versions appear, though this requires that the containers are annotated with a corresponding label. See the documentation for podman auto-update
for more information.
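As a sketch, the ExecStart line from the earlier redis.service could gain the relevant label like so; the local policy compares the running container against locally-built images, which suits our build-on-host approach:
ExecStart=/bin/podman run --pull=never --replace --name redis \
    --label io.containers.autoupdate=local localhost/redis:latest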
If there is one final trick up our sleeve, it’s drop-in units, which allow for partial modification of a concrete systemd unit.
Let’s imagine we wanted our Containerfile
to access some protected resource that uses SSH key authentication for access (e.g. a private Github repository); to do so, we have to provide additional parameters to our podman build
command, as used in our container-build@.service
file.
Since there’s no real way of providing options beyond the templated unit suffix, and using an EnvironmentFile
(or similar) directive means applying the same options to every instance of the templated unit, it might seem that we’d have to do away with our generic container-build
service and use an ExecStartPre
directive in the service unit itself.
Enter drop-in units: if, for the examples above, we create a partial systemd unit file under the container-build@redis.service.d
directory with only the changes we want applied, we’ll get just that, and just for the specific instance of the templated unit (though it’s also possible to apply a drop-in for all instances). Given the following drop-in:
[Unit]
After=sshd-keygen@rsa.service
[Service]
PrivateTmp=true
ExecStartPre=/bin/install -m 0700 -d /tmp/.ssh
ExecStartPre=/bin/install -m 0600 /etc/ssh/ssh_host_rsa_key /tmp/.ssh/id_rsa
ExecStart=
ExecStart=/bin/podman build --volume /tmp/.ssh:/root/.ssh:z --file /etc/coreos-home-server/%i/Containerfile --tag localhost/%i:latest /etc/coreos-home-server/%i
The end result here would be the same as if we had copied the PrivateTmp
and ExecStartPre
directives into the original container-build@.service
file, and extended the existing ExecStart
directive with the --volume
option. What’s more, we can further simplify our drop-in by implementing the original ExecStart
directive with expansion in mind:
[Service]
Type=oneshot
ExecStart=/bin/podman build $PODMAN_BUILD_OPTIONS --file /etc/coreos-home-server/%i/Containerfile --tag localhost/%i:latest /etc/coreos-home-server/%i
Which would then have us remove ExecStart
directives from the drop-in, and add an Environment
directive:
[Service]
Environment=PODMAN_BUILD_OPTIONS="--volume /tmp/.ssh:/root/.ssh:z"
If not specified, the $PODMAN_BUILD_OPTIONS
variable will simply expand to an empty string in the original service file, but will expand to our given options for the specific instance covered by the drop-in.
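As a rough deployment sketch (the ssh-key.conf file name is an arbitrary choice; only the container-build@redis.service.d directory and the .conf suffix matter), the drop-in could be placed by hand for testing, outside of Butane:
# Place the drop-in alongside the templated unit, for the redis instance only.
install -D -m 0644 ssh-key.conf /etc/systemd/system/container-build@redis.service.d/ssh-key.conf
systemctl daemon-reload

# Show the unit with all drop-ins applied, to confirm the overrides took effect.
systemctl cat container-build@redis.service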
Since CoreOS hosts have stable identities, generated once at boot, we can add the public key in /etc/ssh/ssh_host_rsa_key.pub
to our list of allowed keys in the remote system and have container image builds work reliably throughout the host’s lifetime.
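Retrieving that key is then just a matter of reading it off the host, e.g.:
# Print the host's public key; add this as a (read-only) deploy key or
# authorized key on the remote system hosting the protected resource.
cat /etc/ssh/ssh_host_rsa_key.pub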
There’s a lot more to cover, and it’s clearly not all unicorns and rainbows – though the base system itself is simple enough, integration between containers makes for interesting challenges.
Future instalments will describe some of these complexities, as well as some of the more esoteric issues encountered. And, assuming it takes me another year to follow up on my promises, you’re welcome to follow along with progress on the repository itself.
This is different to systems such as Ansible, Chef, or Salt, where deployed hosts/minions are generally kept up-to-date in relation to upstream configuration ↩︎
Countless bytes have been wasted on tutorials and Internet debates (i.e. flame wars) on which Git workflow is the best, when to merge or rebase, how many lines of code per commit make for the best review experience, and so on.
What I’ll attempt to do in this post, in the least amount of bytes possible, is describe a simple, orthogonal Git workflow, designed for projects with a semi-regular release cadence, and built around a pre-release feature freeze.
This is not intended to be the end-all guide to workflow nirvana, but rather a collection of idioms that have been applied successfully throughout the lifecycles of various projects.
Additionally, the final two chapters contain advice and pointers on merging strategies and commit standards. If all that sounds interesting, please read on.
Using Gitflow as a starting point, we simplify certain concepts and entirely discard others. Thus, we specify two main, permanent branches:
The master Branch
The master
branch represents code released to production (for releases with final release tags) and staging (for releases tagged -rc
). We’ll talk more about release tags in a second, but it’s important to understand that the tip of master
should always be pointed to by a tag, -rc
or otherwise.
Figure 1.0 - The master
branch tagged at various points.
Commits are never made against master
directly, but are rather made as part of other branches, and merged into master
when we wish to deploy a new tag. Merge strategies are described below, and apply towards all code moved between branches.
The develop Branch
The develop
branch is the integration point for all new features which will eventually make their way into master
.
Figure 1.1 - The develop
branch, merging back into master
.
As with master
, no commits should be made against develop
directly, but should rather be part of ephemeral feature branches. The use of such branches is described below.
While the above two branches are permanent (i.e. should never be removed), they only serve as integration points for code built in ephemeral, or temporary branches. Of these we have two, each serving a distinctly different purpose, and having different semantics.
Feature Branches (feature/XXX_yyy)
Feature branches are where most of the work in a project happens, and are always opened against, and merged back into, the develop
branch. What constitutes a feature is fairly broad, but essentially covers any code that is not a bugfix for an issue that exists in current master
.
Figure 2.0 - Branching and merging feature branches.
Feature branch names follow a naming convention of feature/XXX_yyy
where XXX
refers to the ticket number opened against the work (if any), while yyy
is a short, all-lowercase, dash-separated description of the work done. A (perhaps contrived) example would be:
git checkout develop # This determines the base branch for our feature.
git pull origin develop # Always a good idea to branch off the latest changes.
git checkout -b feature/45_implement-flux-capacitor
The rules behind merging of feature branches back into develop
are project-specific, but most teams would have the code go through peer review and possibly a CI pass before merging. However, it is intended that projects implement a semi-regular (or at least predictable) release schedule, in which case features that are intended to appear in the upcoming release will have to be merged into develop
before the feature freeze starts.
Once the feature freeze starts and develop
is merged back into master
and tagged as an -rc
, the team is free to merge feature branches into develop
again.
While most rules are meant to be broken, the ones described above (as loosely defined as they are) fit into the versioning strategies employed, and as such benefit from being followed as closely as possible.
Certain fixes, however, cannot wait for the next release, or are designed to fix breaking issues present in the master
branch. For those, we have the following.
Bugfix Branches (bugfix/XXX_yyy)
Bugfix branches are intended to contain the bare minimum amount of code required for fixing an issue present on the master
branch, and as such are always opened against, and merged back into master
.
Figure 2.1 - Merging bugfix branches between master
and develop
.
Naming conventions and code acceptance rules are identical to those for feature branches, apart from the bugfix/
prefix applied. Bugfixes are not subject to feature freezes or release schedules.
For bugs that appear on both master
and develop
, the bugfix branch may, optionally, be merged into develop
as well, which has the additional benefit of reducing divergence between the two branches. Why does this matter? Read below.
So now you have a bunch of code on develop
waiting to be released. How do we go about doing that? Imagine the following, two-week (i.e. ten working day) release schedule.
Cycle starts, with feature development commencing immediately. Features are opened against develop
, peer-reviewed, tested, and eventually merged back into develop
according to the release manager/team lead/maintainer’s directions. Large features ready for merging towards the end of the window may be left unmerged in order to better test and/or avoid any latent issues.
Merge window closes, with any features left unmerged making their way into next cycle’s release. This is also called a “feature freeze”.
The develop
branch is merged into master
, and a -rc
tag corresponding to the next feature version is created. So, for instance, if the last version tagged against master
was v.2.9.3
, this tag is to be v.2.10.0-rc1
. This tag is then pushed to a staging server and tested by all means available.
Any bugs we inevitably find are fixed in bugfix branches opened against master
, and merged as soon as the fixes have been verified on the branches themselves. A subsequent -rc
(i.e. v.2.10.0-rc2
) release is tagged whenever we wish to push a new, fixed version to staging.
Release day! Hopefully we’ve had enough time to thoroughly test the new version, and as such are ready to tag and push a final version of master
, v.2.10.0
, to production. We do another round of testing in production and get ready for the next cycle (or release drinks).
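In terms of plain Git commands, the tagging steps above might look roughly like this (the remote name and tag messages are illustrative):
# Feature freeze: merge develop into master and tag the first release candidate.
git checkout master
git merge develop
git tag -a v.2.10.0-rc1 -m "Release candidate 1 for v.2.10.0"
git push origin master v.2.10.0-rc1

# Release day: tag the final version from the (by now bug-fixed) tip of master.
git tag -a v.2.10.0 -m "Release v.2.10.0"
git push origin v.2.10.0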
We will eventually find bugs in production that weren’t uncovered by our testing on either feature branches or master
. The strategy we follow differs slightly depending on which phase of the next cycle we’re on.
As master
is still in a pristine state, merging bugfixes back into master
is a simple matter of opening a bugfix branch, merging that in, and tagging a new bugfix release version (e.g. v.2.10.1
for the above example) as soon as we’re ready to push to production.
Figure 3.0 - Tagging a bugfix before the feature freeze.
The situation is slightly complicated by the fact that master
now contains code that we’re not ready to release to production, and as such cannot be tagged directly. However, the workflow for opening a bugfix branch remains the same, as the issue will most likely exist in master
, even with the additions from develop
.
The most elegant way of solving this issue is opening a new “release” branch against the last stable tag, which will serve as the integration point for all relevant bugfix branches. The naming convention we’ll use for this branch is the major version for the release we’re branching off, i.e. v.2.10
.
Once the branch has been created, we’re free to merge in all relevant bugfix branches, test locally, and tag the new version against this branch.
Figure 3.1 - Tagging a bugfix after feature freeze.
This is the only case where a branch other than master
is tagged, and as such constitutes an extraordinary measure.
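A rough sketch of the commands involved, using the version numbers from the example above (the bugfix branch name is hypothetical):
# Branch off the last stable tag and integrate the relevant bugfixes there.
git checkout -b v.2.10 v.2.10.0
git merge bugfix/67_fix-image-cache   # hypothetical bugfix branch
git tag -a v.2.10.1 -m "Release v.2.10.1"
git push origin v.2.10 v.2.10.1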
Both situations above require us to tag new release versions. Normally, we’d tag an initial -rc
version, after which we’d push to staging and test. Whether this is necessary or not for bugfixes is debatable, and is left to be decided on a case-by-case basis. However, the convention of -rc
to staging, final version to production, remains constant in all situations.
Countless debates exist on rebasing vs. merging, and whether to squash commits or not. Since, in most cases, personal preference plays the largest role in choosing a strategy, the following sections may appear debatable, so please take them with a grain of salt and apply them as needed. However, we’ll try to provide as much rationale as possible, while exploring alternatives in order to better understand the reasons behind our choices.
It is also important to understand that the following sections only apply to public code, i.e. anything that has been pushed to a remote. Nobody but you knows whether you squashed your 15 commits into 1 just before you pushed your code to a public repository somewhere. Regardless of the above, it often pays off to use the same strategies both offline and online, for reasons explained below.
General rules that apply to all strategies: merge with fast-forward, avoid squashing, avoid rebasing.
Merging code between develop
and various feature branches, as well as between develop
and master
, is one of the most common day-to-day operations, so let’s cover each case individually.
Merging feature to develop
Once a feature has been peer reviewed and tested as a unit, and provided the feature freeze window is still open, a feature may be merged back into develop
.
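In practice, and reusing the feature branch from the earlier example, the merge might look like this:
# Merge the reviewed feature into develop, then prune the ephemeral branch.
git checkout develop
git merge feature/45_implement-flux-capacitor
git branch -d feature/45_implement-flux-capacitor
git push origin develop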
Choosing to merge instead of rebasing is based on the following rationale: the state of the work in the feature branch is a direct result of the point in develop
it was branched off. For long-lived feature branches, this meta-information is an important aspect of understanding the design choices behind the feature work.
Additionally, rebasing disrupts the chronological nature of history; that is to say, commits may appear behind ones that were made further in the past, but which were rebased onto develop
afterwards. This makes reasoning about the history harder (for instance, when wanting to bisect based on the knowledge that develop
was in a “good” state on some specific date).
Rebasing may also lead to the loss of information concerning how a feature evolved in time, especially when a feature had to be refactored in response to changes made in develop
(more on how these changes are brought into the feature branch in the following chapter).
In most cases, we’re only really concerned with the latest version of a feature. A commit introducing some code that is superseded by a following commit in the same feature branch may appear to be irrelevant since it never really touches upon the state of develop
at the time of merging.
However, such information is important in a historical sense. It may be that the code was refactored in response to an outside event, such as a different feature being merged, or product decisions being made behind the scenes. Such information may be relevant in the future, even if the code itself was never strictly part of any release.
The general idea is that anything done with intent (that is to say, manually) should be preserved in the state in which it was made.
Merging develop into a feature
In several cases, you may need to synchronize your feature branch with develop
, for instance, when fixing merge conflicts.
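A minimal sketch of such a sync, again reusing the feature branch from the earlier example:
# Bring the latest develop into the feature branch; conflicts are resolved in a
# merge commit rather than by rewriting history.
git checkout feature/45_implement-flux-capacitor
git fetch origin
git merge origin/develop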
Again, choosing to merge instead of rebasing is based on the general idea that actions with intent should be preserved. This is especially true when working on a public branch, but holds for private branches as well.
Imagine the following scenario: You’re working on a feature branch for implementing image-uploading functionality in a CMS product. The functionality is close to complete when a refactor of the underlying Image
class is merged into develop
.
You, of course, can no longer merge your code as-is, and will have to change it in response to the refactor. You’re given two choices: either merge develop
into the feature branch, fix any conflicts, and continue to add commits for refactoring any remaining functionality, or rebase the feature branch on top of develop
and make it appear as if you were working with the refactored code from the beginning.
There are several reasons why merging provides benefit, especially in the long-term. One is, of course, that your refactor may be relevant to someone (including yourself) in the future. It may be as a pointer for refactoring other, similar features, or may help when attempting to debug issues that did not exist prior to the refactor.
Another, perhaps more esoteric reason is: fixing merge conflicts can go wrong. You may accidentally choose the wrong part of a conflict, or not merge the changes correctly, or add a typo somewhere that would not exist otherwise. When rebasing, it will appear as if these errors were part of the original design. When merging, these errors will appear as part of the merge commit, and as such can be traced back to with greater ease.
The proliferation of merge commits is the most common reason for choosing to rebase rather than merge, but cases like this demonstrate the value of preserving merge commits, both for their content and as meta-information: this was the point where you needed to refactor your feature; this is the point you merged your feature into develop
.
Merging develop to master
The same general rules for merging between feature branches and develop
apply here as well: merge with fast-forward, do not squash.
It may be that, due to bugfix branches being applied to master
alone and not develop
, the two will diverge. This, however, should not complicate matters much, as in most cases develop
is a strict super-set of master
. Merging bugfix branches on both master
and develop
can help alleviate any future problems, and is the preferred strategy.
Merging bugfix to master
Again, the same rules as with merging between feature branches and develop
apply. As stated above, we should also merge all bugfixes into develop
as well, even when the bug no longer applies, in order to eliminate divergence.
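A rough sketch, with a hypothetical bugfix branch name:
# Merge the bugfix into master first, then into develop to avoid divergence.
git checkout master
git merge bugfix/67_fix-image-cache
git checkout develop
git merge bugfix/67_fix-image-cache
git branch -d bugfix/67_fix-image-cache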
When merging features or bugfixes, we choose to fast-forward our branch relative to the base branch, for various reasons, most notably the fact that feature and bugfix branches are intended to be ephemeral, and can (and should) be pruned regularly. The reason we treat these branches as such is related to how we treat commits, and is explained further below.
Choosing not to fast-forward makes bisecting and reasoning about the history harder, while providing dubious benefits, especially since pull requests (on, for example, GitHub or Bitbucket) continue to exist even if the underlying branch has been deleted.
It may appear, from the above, that the most important aspects of our workflow lie within our branching and merging strategies. However, this is not entirely true.
The smallest monad in any Git repository is the commit, which also makes it the most important aspect of our workflow. Maintaining a clean history depends largely on the quality of each individual commit pushed, and keeping the quality consistent is hard and requires buy-in from every individual team member.
Squashing commits is the antithesis of maintaining consistent quality – why would you squash commits that have been prepared with such diligence? Several other reasons apply, as explained in the following sections.
The following chapters outline several rules on creating good commits.
Perhaps the easiest rule to implement, and the one providing the most benefits for the least amount of effort, is standardizing on naming conventions for commit messages. The advice below echoes conventions followed by quite a few large repositories, including the Git repository for the Linux kernel itself, but is nevertheless worth repeating:
Commit titles should be prepended with the file name or subsystem they affect, be written in the imperative starting with a verb, and be up to 60 characters in length.
So, applying the aforementioned rules, we have two examples:
Bad example:
The Get method of the Image class now fetches files asynchronously
Good example:
Image: Refactor method “Get” for asynchronous operation
The reasons are many-fold: prepending the name of the subsystem helps in understanding, at a glance, where the work is happening. Using the imperative and starting with a verb makes titles easier to parse, since each can be read as completing the sentence “applying this commit will…”. Lastly, the choice of limiting the title to 60 characters may appear archaic, but it encourages terseness.
All commits should be accompanied by a commit message (separated from the title by two consecutive newlines), ideally containing the rationale behind the changes made within the commit, but at minimum a reference to the ticket this work is attached to, which will most certainly be useful to you at some point in the future. For example:
Image: Refactor method “Get” for asynchronous operation
Fetching images from the remote image repository is now asynchronous, in order to allow for multiple images to download concurrently. This change does not affect the user-facing API or functionality in any way.
Related-To: #123
Using a standard syntax for relating commits to ticket numbers helps with finding them using git log --grep
.
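For example:
# Find all commits that reference ticket #123.
git log --grep='Related-To: #123'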
We don’t always have the ability or knowledge to foresee the final, completed state of work needed in order to implement a feature or fix a bug. As such, most work is driven by whatever idea we have about the code at the moment, and can therefore change rapidly.
The standard rule for choosing what to include in a commit is this: every commit should represent a single, individually reversible change to the codebase. That is to say, related work, work that builds on top of itself in the same branch, should be part of the same commit.
As an example, in the course of implementing the asynchronous image operations described above, you find a bug in the same file, but in a different, unrelated method.
This bugfix and the feature work should appear in two separate commits, for the simple reason that we should be able to revert a buggy feature without sacrificing unrelated bugfixes made in the course of building it.
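As a sketch (the bugfix commit message is hypothetical; the feature one reuses the example above):
# Stage and commit the unrelated bugfix and the feature work separately...
git add -p   # pick only the hunks belonging to the bugfix
git commit -m 'Image: Fix cache invalidation in method "Put"'
git add -p   # then stage the remaining, feature-related hunks
git commit -m 'Image: Refactor method "Get" for asynchronous operation'

# ...so that the feature can later be reverted without losing the bugfix.
git revert <commit-hash-of-the-feature>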
The tools we use will largely affect what our commits look like: GitHub now allows for better control over reviewing specific commits. Gerrit allows commits to be grouped into patch-sets, which can be reviewed and reworked as separate entities (which would usually either require a rebase or a new pull request). Other tools only allow reviewing the latest version of a branch as a whole.
Pushing for clear boundaries between commits, especially in the face of ever-changing requirements, and the fact that in most cases you’d only ever revert an entire pull request/branch and not the individual commits themselves, may appear to be a losing battle.
The easiest way to deal with these issues is at the time of review: if the commits are too big (over a couple hundred lines of code) and do not appear cohesive, reviewing the code is that much harder, and will eventually lead to inferior code quality and/or bugs falling through the cracks.
Various concepts have been presented, some harder to implement than others. If there is one take-away, please allow it to be this: it’s better to be consistent than to be correct, and it’s better to be simple than comprehensive.
Rules that are not orthogonal to one another are harder to implement and follow consistently, so keep that in mind when choosing which battles to fight.
The graphics in this post have been generated using Grawkit, an AWK script which generates git
graphs from command-line descriptions.