
Preface

The title is doing real work. You already know the trick where you stop thinking about what the computer does and start thinking about what shape the data has. Lisp taught you that. Elixir taught you that pattern matching is a form of destructuring proof. What you haven’t yet internalized — and what this book is for — is that ML’s type algebra is a portal system: you carve a hole in one place, you carve a matching hole somewhere else, and values fall through. The compiler’s job is to refuse to let you carve a hole that doesn’t go anywhere.

This is different from “types as memory layout” (your C intuition) and different from “types as protocol” (your TypeScript and Go intuition). In ML, types are propositions and values are proofs. That sentence is going to sound like a humanities-major flourish until about chapter 3, at which point it’ll start to feel like the most operationally useful thing anyone ever told you about a programming language.

On the writing of this book

This section comes before any other, because the reader deserves to know it before deciding how to read what follows.

The book you are about to read was compiled by Claude, an AI assistant made by Anthropic, in conversation with William, a human engineer. The conversation that produced it spanned dozens of hours across multiple sessions. The voice you are reading is Claude’s — almost every sentence in this book was written by Claude. The decisions about what to include, what to defer, what depth to write at, and what intuitions to chase were made jointly: William asked questions, redirected, pushed back, and ratified; Claude proposed, drafted, revised, and synthesized.

This is not a book that William wrote with AI assistance. It is closer to the inverse: a book that Claude wrote at William’s direction, with William’s domain knowledge, intuitions, and ambitions setting the agenda and the standard. The ratio matters because it changes how you should read what follows. When the book makes a confident claim about, say, the right way to handle preemption in an effect-typed kernel — that claim is Claude’s synthesis of the design conversation. When the book describes a tradeoff and picks a side, the picking was usually Claude’s, sometimes pushed back on by William, sometimes ratified without modification.

The corresponding strengths and weaknesses are these. The book is more internally consistent than something written by committee or assembled from notes; it has the unity of voice that comes from one mind doing the synthesis. It is denser than most introductory texts because Claude is a tireless explainer and the conversation never had to slow down for someone to catch their breath. It is also, in places, more confident than the underlying knowledge justifies — Claude’s prose has the cadence of someone who has done this before, and Claude has not done this before. No one has. The ISA is novel, the language is novel, the kernel is novel. The book describes a system that exists at present only in the specifications you are about to read, an emulator in Ruby, and a compiler under construction. Treat the confident tone as a feature of the prose, not a guarantee of the underlying engineering.

Three further notes:

The ideas are co-developed. Many of the design decisions — for example, the unification of interrupts and effects, the choice to make memory allocation an effect, the linearity of continuations as the basis for multicore safety — emerged from the conversation rather than from either party alone. William brought the initial ISA sketch (effects-native, hardware continuations, hardware pattern matching). Claude brought the broader ML-and-effects literature, the comparative analysis with adjacent designs, and the discipline to push the design toward consistency. Neither party would have produced the same design alone. The book is a faithful record of what the conversation produced.

Errors are not concentrated in one party. When the book is wrong about something — and it will be wrong about something — the error is as likely to be Claude’s confident assertion of a falsehood as it is William’s misremembering of a hardware detail. The book has not been formally reviewed by experts in the various subfields it touches. Read accordingly.

The book makes no apology for the AI involvement. This is not an experiment in “can AI write a book.” It is a working document for a real engineering project, written in the most efficient way available to the person responsible for that project. The AI involvement is disclosed because honesty is owed to the reader; it is not minimized because there is nothing to be embarrassed about. If the resulting book is useful, the AI’s contribution to its usefulness is real. If it is not useful, the AI’s contribution to its uselessness is also real. Either way, the work stands or falls on its merits as a piece of engineering writing.

With that out of the way:

What this book is

This is a field manual for a specific co-design experiment: a small ML in the style of OCaml or Standard ML, targeting a small instruction set architecture that has effect handlers, first-class continuations, hardware pattern matching, and a network-on-chip as its primary communication fabric. The language and the ISA were designed together, with each chapter’s decisions justifying themselves by what the other side of the design needed.

The thesis is that most language/ISA mismatches are accidents of history. C’s semantics fossilized into the machines we have, and every language since has paid the impedance tax. If you get to design both sides, the question stops being “how do I make ML fast on a register machine that was designed for a different language” and becomes “what does the ML algebra want a machine to be.”

The output is not a production language. It’s not a production ISA. It’s a working artifact — a 16-bit teaching machine that can run a real kernel — built specifically to make a set of intuitions visible. Reading the book, you should come away able to feel why the design choices were made, what they cost, and what they buy.

Who this is for

You should already be a confident programmer. The book assumes you’ve shipped systems in C or Go or Rust, and you know what a process is, what an interrupt is, what a page table is. You may have written firmware; you may not have. You don’t need ML experience — we’ll build the intuition from algebraic first principles — but you should be comfortable with the idea that types can do real work in a program, beyond serving as documentation.

The book leans into the systems angle. If you came for “how do I make a beautiful list-processing language,” you’re in the wrong room. The whole motivating question is: if you got to design the language and the metal together, with concurrency and devices in mind from the first day, what would you build?

How to read this book

Chapters are short and dense. They build incrementally. Skipping is risky for the first ten chapters because each one cashes in vocabulary established earlier. After chapter 14 — the full multicore kernel — chapters can be read in any order.

Code samples are written in a small ML we call CML (Concrete ML). The full language specification is in Appendix B, and you can read the appendix at any point if a syntax detail trips you. The ISA specification is in Appendix A; you’ll likely want to consult it during chapters 6 through 9. Both appendices are the canonical reference; the body chapters are the explanation.

Throughout the book, decisions are flagged as decisions, not as truths. When the design chooses between alternatives, we say so, list the alternatives, and explain the choice. The book is a document of a co-design process, not a manifesto. You should disagree with some of it; you should build a better version of some pieces.

On the title

Concrete Machines gestures at Gilbert Simondon’s notion of concretization — the process by which a technical object becomes more itself over time, as its parts come to fit each other rather than fitting an idealized abstract function. A pre-concretized machine has parts that are theoretically composable but practically fragile; a concretized machine has parts whose roles have grown into each other and reinforce each other.

The co-design conducted in this book is an exercise in concretization. The ML and the ISA become more themselves by being designed together. The handler is more useful because the hardware knows about it; the hardware is simpler because the handler exists. By the end, you should be able to feel the parts fitting.

The subtitle, Now you’re thinking with portals, is unapologetic. Live with it.

Cardinality and the operators

Forget OCaml exists for a few pages. The ML family — SML, OCaml, F#, and the more distant cousins Haskell and Rust — share an algebraic core. Before we name the operators, we need one piece of vocabulary.

Cardinality

The cardinality of a type is the number of distinct values it has. That’s it. It’s the same word mathematicians use for the size of a set, applied to types-as-sets-of-values.

Some examples to anchor:

  • bool has cardinality 2. Its values are true and false.
  • unit has cardinality 1. Its only value is ().
  • A type with no values — the empty/void type — has cardinality 0.
  • int32 has cardinality 2³² ≈ 4.3 billion.
  • string has cardinality ℵ₀ (countably infinite, ignoring memory limits) — there are infinitely many possible strings.
  • A custom type type traffic_light = Red | Yellow | Green has cardinality 3.

When I say a type “has N values,” I don’t mean “holds N values at runtime” the way a list holds elements. I mean: across all possible programs and all possible runtime states, the set of legal inhabitants of this type has N members. bool has cardinality 2 because there exist exactly two booleans in the universe. A variable of type bool is, at any moment, one of them.

Why this matters, and why this book will keep returning to it: cardinality is the link between the algebra and the discipline. The phrase “make illegal states unrepresentable” cashes out as: choose a type whose cardinality equals the number of legal states in your domain. If your domain has 7 legal states and your type has cardinality 12, you have 5 representable illegal states — bugs waiting to be written. If your type has cardinality 5, you can’t represent two of your legal states — bugs waiting to be discovered. Cardinality is how you count whether your type is the right shape.
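
To make the counting concrete, here is a minimal sketch (hypothetical domain; written in OCaml-style ML, which matches CML's core syntax). A job that is pending, running, or finished-with-a-success-flag has exactly four legal states. The sum type has cardinality 1 + 1 + 2 = 4; the three-boolean record a C programmer might reach for has cardinality 2 × 2 × 2 = 8, leaving four illegal states representable.

(* Cardinality 1 + 1 + 2 = 4: exactly the legal states, no more. *)
type job_state =
  | Pending
  | Running
  | Finished of bool              (* did it succeed? *)

(* Cardinality 2 * 2 * 2 = 8: four of these states are nonsense,
   e.g. { started = false; finished = true; succeeded = true }. *)
type job_state_loose = {
  started   : bool;
  finished  : bool;
  succeeded : bool;
}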

It’s also why the operator names — product, sum, exponential — aren’t metaphors. They’re literal. The cardinality of a pair really is the product of the cardinalities. The cardinality of a function type really is an exponential. The algebra of types is the algebra of finite sets, lifted.

The operators

Product types. A pair (A, B) is the type whose values are one A and one B. Cardinality: |A| × |B|.

Verify with bool: (bool, bool) has 4 values — (true, true), (true, false), (false, true), (false, false). 2 × 2 = 4. ✓

You know this from tuples, structs, records. Nothing new yet.

Sum types. A choice A + B is the type whose values are either an A or a B, tagged. Cardinality: |A| + |B|.

Verify: bool + bool has 4 values — Left true, Left false, Right true, Right false. 2 + 2 = 4. The tag matters; Left true and Right true are distinct values of the sum type even though they share the underlying true.

This is the operator your C and Go intuition doesn’t have a clean slot for. Go fakes it with interfaces; C fakes it with tagged unions you maintain by hand; TypeScript fakes it with discriminated unions and prays. In ML, it’s primitive and the compiler enforces that you handled every tag.

Function types. A -> B is the type of something that, given an A, yields a B. Cardinality: |B|^|A|.

Verify: bool -> bool has 4 values — the identity function, negation, constant-true, constant-false. 2² = 4. ✓ Each function is determined by its choice of output for each of the |A| possible inputs, hence |B| choices made |A| times.

Yes, exponential. This will matter — it’s why higher-order functions are so expressive: a single function value can encode an astronomical amount of behavior.

The unit type. unit has exactly one value, (). Cardinality 1. It’s the multiplicative identity: A × unit ≅ A, because |A| × 1 = |A|. You can think of it as “no information” — a value of type unit tells you nothing you didn’t already know, because there was only one possibility. That’s exactly what void should have meant in C.

The empty type. A type with zero values. Cardinality 0. It’s the additive identity: A + empty ≅ A, because |A| + 0 = |A|. You’ll rarely write a value of this type — you can’t, there aren’t any — but it shows up as the return type of functions that don’t return (exit, infinite loops, raising exceptions in pure settings). A function of type A -> empty is one that takes an A and cannot possibly return normally, and the type system knows it.
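
Both identity types can be written down directly. A sketch in OCaml (CML may spell these differently; the point is the shape): the empty type is a variant with no constructors, so its eliminator has no cases to handle, which OCaml records with a refutation clause.

(* unit is built in; its one value is (). The empty type is a
   variant with no constructors (OCaml >= 4.07 syntax). *)
type empty = |

(* From an empty value you may conclude anything: there are no cases,
   and the refutation clause "." says so to the compiler. *)
let absurd (x : empty) : 'a = match x with _ -> .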

Algebraic identities

Now the move. Identities you learned in middle school just work on types, because they’re literally the same identities over the same kind of arithmetic:

  • A × (B + C) ≅ (A × B) + (A × C) — distribution. A pair of an A with either-a-B-or-a-C is the same as either-(a-pair-of-A-and-B)-or-(a-pair-of-A-and-C). Cardinalities: |A|·(|B|+|C|) = |A|·|B| + |A|·|C|. ✓
  • A^(B+C) ≅ A^B × A^C — a function from a sum is the same as a pair of functions, one per case. This is what pattern matching is. Cardinalities: |A|^(|B|+|C|) = |A|^|B| · |A|^|C|. ✓
  • (A^B)^C ≅ A^(B×C) — currying. A function returning a function is the same as a function of two arguments. Cardinalities: (|A|^|B|)^|C| = |A|^(|B|·|C|). ✓

The ≅ symbol is “isomorphic to” — there’s a lossless, invertible translation between values of the two types. Not “equal,” because the types are syntactically distinct, but interchangeable in any way that matters. When you see two ML programmers debating whether to use a pair of options or an option of a pair, this is what they’re arguing about: which side of an isomorphism is more readable, since they’re operationally the same.
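
Two of these isomorphisms, written out as pairs of mutually inverse functions (valid OCaml; the sum type is declared locally because OCaml has no anonymous sums). That each pair composes to the identity in both directions is exactly what ≅ claims.

(* Currying: (A^B)^C ≅ A^(B×C), in the conventional argument order. *)
let curry   (f : 'a * 'b -> 'c) : 'a -> 'b -> 'c = fun a b -> f (a, b)
let uncurry (g : 'a -> 'b -> 'c) : 'a * 'b -> 'c = fun (a, b) -> g a b

(* A function from a sum is a pair of functions: A^(B+C) ≅ A^B × A^C. *)
type ('b, 'c) sum = Left of 'b | Right of 'c

let split (f : ('b, 'c) sum -> 'a) : ('b -> 'a) * ('c -> 'a) =
  ((fun b -> f (Left b)), (fun c -> f (Right c)))

let join ((g, h) : ('b -> 'a) * ('c -> 'a)) : ('b, 'c) sum -> 'a =
  function Left b -> g b | Right c -> h c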

When an ML programmer says “make illegal states unrepresentable,” they mean: use the algebra above to construct a type whose cardinality equals exactly the number of legal states in your domain. No more, no fewer. The whole discipline is counting.

Systems lens, first pass

You’re going to want to know: does this algebra cost anything at runtime? Mostly no. A product is a struct. A sum is a tagged union — typically one word of tag, plus enough space for the largest variant’s payload (with caveats per implementation we’ll cover when we get to the ISA). A function is a code pointer plus a closure environment. The algebra is mostly a compile-time bookkeeping discipline with predictable runtime shapes. The portals are free; what’s behind them costs what you’d expect.

There’s one cost worth flagging now: sums force a tag check at every match site. On a modern branch predictor with a closed enumeration of cases, this is essentially free in steady-state hot loops. But if you’re writing the kind of code where a mispredicted branch matters — packet processing inner loops, DSP — you’ll want to know which sums in your hot path compile to what. The ISA chapters will give you the tools to find out.

What this chapter committed to

A vocabulary. Type, value, cardinality. Four operators: product, sum, function, unit-and-empty. The algebraic identities that connect them. The principle that ML’s design discipline reduces to counting. The hint that this algebra will have a direct, surprisingly cheap, hardware story.

The next chapter introduces the operation that consumes sums for breakfast: pattern matching.

Pattern matching, the only control flow

Take a breath. If you have Elixir in your background, you already think this way; the ML version is just stricter and total.

type shape =
  | Circle of float
  | Rectangle of float * float
  | Triangle of float * float * float

let area s =
  match s with
  | Circle r              -> 3.14159 *. r *. r
  | Rectangle (w, h)      -> w *. h
  | Triangle (a, b, c)    ->
      let s = (a +. b +. c) /. 2.0 in
      sqrt (s *. (s -. a) *. (s -. b) *. (s -. c))

Two things to internalize.

Exhaustiveness

First, match is exhaustive by default and the compiler will warn (configure it to error) if you miss a case. This is the difference between a switch in C and a match in ML: the compiler has counted your portals, and if you’ve left one un-entered it tells you which one.

In systems code, this means refactoring a tagged union — adding a new state — produces a complete enumeration of every site that needs updating. No grep. No “did I get them all.” The compiler is your refactor.

Recall the cardinality story from the previous chapter. A sum with three variants has cardinality 3. A match expression on it should handle 3 cases. The compiler is doing arithmetic on your code: it’s counting cases against variants. If the numbers don’t agree, it tells you.

This generalizes. If you add a fourth variant later, every match expression on the type becomes incomplete. The compiler emits a list. You walk the list, decide for each case what the new variant should do, and move on. This is what refactoring should be: not “find every usage of the function and update it,” which requires correct grep, but “the compiler told me exactly what I need to fix and now I fix it.”
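
A sketch of that loop, reusing the shape type from above and adding a hypothetical fourth variant:

type shape =
  | Circle of float
  | Rectangle of float * float
  | Triangle of float * float * float
  | Annulus of float * float          (* new: outer and inner radius *)

(* Recompiling area as written above now produces, in OCaml's phrasing,
   something like:
     Warning 8: this pattern-matching is not exhaustive.
     Here is an example of a case that is not matched: Annulus (_, _)
   The warning list is the refactoring worklist. *)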

Nested patterns

Second, patterns nest:

match (x, y) with
| (Some a, Some b) -> ...
| (Some a, None  ) -> ...
| (None,   Some b) -> ...
| (None,   None  ) -> ...

You can read a pattern as a shape you’re asserting the value has; if the shape matches, the bindings are introduced. The pair pattern matches if both sides match. The Some a pattern matches if the option is Some and binds a to its contents. Compose these arbitrarily; matches can be as deep as your data structure.

This is the same intuition as Elixir’s {:ok, value} matches, stretched across the whole language as the primary mechanism for case analysis. There’s no special syntactic distinction between “small destructuring” and “big dispatch” — it’s all pattern matching, all the way down.

A consequence: pattern matching is the only control flow you actually need. if/else exists in ML — it’s if e1 then e2 else e3 — but it’s a special case of a match on a boolean:

match b with
| true  -> e2
| false -> e3

The moment you have more than two branches, or any structure to inspect, you reach for match. The if form is for when the boolean is opaque (a comparison result, a returned flag) and you don’t want the ceremony.

Coming from Lisp

You’re used to cond and case and match (in Racket, Clojure, etc.) being convenient. Here it’s not convenient — it’s the mechanism. There’s no if/elseif/else chain idiom. There’s no visitor pattern. Get comfortable reaching for match first.

The Lisp tradition has macros, which give you the freedom to invent control structures. ML doesn’t have macros (at least, not in its small canonical form). It gets that expressive power instead from algebraic data types and exhaustive matching — once you have those, you can express most of what a Lisper would reach for a macro for, but with the compiler checking your work.

Some Lisp idioms translate directly:

  Lisp idiom                         ML equivalent
  (cond ...)                         match v with ...
  (case k ...)                       match k with ...
  Tagged data via cons cells         Algebraic data types
  Quote-and-eval for control flow    First-class functions and effect handlers
  (if ...) for two-way               if ... then ... else
The piece that doesn’t translate: dynamic typing. ML’s matches resolve at compile time, and you cannot match on values whose types aren’t known statically. This is a genuine constraint, and people coming from Lisp sometimes find it claustrophobic at first. The payoff is that the compiler is doing more for you. You give up the flexibility; you get back certainty.

Coming from object-oriented languages

Subtype polymorphism is not how ML dispatches on cases. There’s no virtual method table being consulted. The variant tag is a small integer, the match compiles to a jump table or decision tree, and the whole thing is open to the optimizer. When you write a hot loop over a sum type, the dispatch is roughly free — comparable to a switch over an enum in C, often better because the compiler knows the variants are closed.

This is the inversion you’ll feel most. In OO, adding a new case is easy (subclass) and adding a new operation is hard (touch every class). In ML, adding a new case is hard (touch every match) and adding a new operation is easy (write a new function with a new match). The two languages have made opposite choices about which axis is easy to extend.

For systems work — protocol parsers, schedulers, control planes, compilers — the cases are usually closed by external specification (HTTP methods don’t change weekly) and the operations multiply. ML’s choice fits this shape. We’ll come back to this in chapter 4.

What this chapter committed to

Pattern matching as the primary control-flow mechanism. Exhaustiveness as a compile-time check, with refactoring leverage. Nesting and composition. Some translations from Lisp and OO idioms.

The next chapter takes the type-as-proposition idea seriously and shows what it does for engineering practice.

Types as propositions

Here’s the move that changes how you write code. Stop thinking of a type signature as documentation. Start thinking of it as a claim you are making, which the compiler will refuse to typecheck unless you have provided a proof.

val parse_request : bytes -> (request, parse_error) result

This signature claims: “I can transform any byte buffer into either a well-formed request or a parse error, with no other outcomes.” If your implementation can panic, throw, return null, or block on I/O without that being in the type, the signature is lying and the compiler will catch it (mostly — effects are the loophole and we will address them).

Curry-Howard, taken as engineering practice

The Curry-Howard correspondence makes this precise: types correspond to propositions, programs correspond to proofs, evaluation corresponds to proof normalization. You don’t need to take this seriously as logic to take it seriously as engineering practice. The engineering practice is: write the type first, and let the compiler tell you what’s missing.

A short example. You want a function that takes a non-empty list and returns its head:

val head : nonempty_list a -> a

The type promises that you always return an a. If you write the function to handle the empty case by raising an exception, the type is lying. To not lie, you must encode the non-emptiness in the type — make a nonempty_list type whose values are always at least one element long:

type nonempty_list a = { hd : a; tl : list a }

let head (xs : nonempty_list a) : a = xs.hd

Now there’s no way to construct a nonempty_list that doesn’t have a head. The function cannot fail. The signature is honest. The compiler enforces it.

The discipline cascades. Anywhere you would have written xs.head_or_raise(), you now write xs.head, no exception path, no defensive check. The invariant that the list is non-empty has been moved from runtime to compile time, and that’s the single highest-leverage move available in software engineering. Every site that handles the list no longer needs to know what to do if it’s empty, because empty isn’t representable.

Why this matters for systems work

The shape of your control plane’s state machine is a type. The shape of a wire protocol is a type. The shape of “this resource has been acquired but not yet released” is a type — we’ll meet this one as linear typing in chapter 13. The shape of a packet whose checksum has been verified is a type, distinct from the shape of an unverified packet.

If you can encode an invariant in the type system, you do not have to test for it, document it, or hope code review catches it. You have moved it from runtime to compile time. The number of bugs that survive this move is dramatically smaller than the number that survived being “documented in the function comment.”

A concrete example from this book’s design conversation: continuations are linear. The type system tracks that a continuation can be invoked at most once. The kernel cannot accidentally invoke a continuation twice, because the second invocation is a type error. The class of bugs called “use-after-resume” is structurally absent from any kernel written in this language. Not “tested away,” not “carefully avoided by senior engineers” — structurally absent.

This is what “ontologically minimal” design looks like from the inside. You let the domain dictate the algebra. You encode the algebra in types. The program writes itself in the negative space.

The discipline of writing types first

The practice is: when you start a new function, write its type before its body. Stare at the type. Ask: is this honest? Does it admit all the success cases? Does it disallow all the failure cases I cared about? Does it accidentally permit something I’d want to forbid?

If the type isn’t honest, fix the type first. Maybe you need to refine the input type (only non-empty lists). Maybe you need to refine the output (return a result so callers must handle the error case). Maybe you need to add an effect annotation (this function performs I/O; say so).

Once the type is honest, the body is constrained. There are usually far fewer plausible implementations of an honest type than of a vague one. The honest type guides you.

Look at the difference:

(* Dishonest: what happens if x is negative? *)
val sqrt : float -> float

(* Honest: we admit it might fail *)
val sqrt : float -> float option

(* More honest still: refuse to take a negative input *)
val sqrt : nonneg_float -> nonneg_float

The third form is most honest, but requires you to construct nonneg_float values somewhere — which is fine, that work has to happen anyway, it just gets pushed to the boundary where the float enters the program.
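
That boundary is usually a smart constructor: a module whose abstract type can only be built through a checking function. A minimal sketch, assuming OCaml-style modules (the module and names are hypothetical):

module Nonneg : sig
  type t                            (* abstract: only make builds one *)
  val make  : float -> t option     (* the single checked entry point *)
  val value : t -> float
end = struct
  type t = float
  let make x = if x >= 0.0 then Some x else None
  let value x = x
end

(* sqrt : Nonneg.t -> Nonneg.t can now omit the negative case entirely;
   the check happened once, where the float entered the program. *)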

The compiler as a colleague

Treating types as propositions reframes the compiler from “a thing that yells at me about syntax” to “a thing that’s checking my work.” Every type error is the compiler saying: you claimed X, but your code does not prove X. Either fix the code or fix the claim. This is a useful thing for a colleague to do.

You will, in practice, sometimes spend an hour fighting the type checker over a single function. This is not a sign that ML is hostile; it is a sign that you wrote a claim you can’t support, and the compiler is refusing to let you ship it. After enough rounds of this, you develop the instinct to write more honest claims at the start, and the fights get rarer.

This is the same loop as test-driven development, but cheaper — the “tests” are types, which are written more compactly than test cases, and the compiler runs them on every save.

What this chapter committed to

Types are claims; the compiler is the proof checker. Honest types push invariants from runtime to compile time, which is the single highest-leverage move available. The practice is: write the type first, ask if it’s honest, fix it until it is, then write the body.

The next chapter is about the design choice ML and OO have made in opposite directions, and how to tell which one your problem wants.

The expression problem

Now we get to ML’s defining tradeoff, the one your OO instincts will fight.

The expression problem, named by Philip Wadler in 1998 but observed long before: a program manipulates data of various cases using operations. You want to extend in two directions: add new cases, and add new operations. OO makes it easy to add new cases (subclass) and hard to add new operations (touch every class). ML makes it easy to add new operations (write a new function with a match) and hard to add new cases (touch every match).

Neither is wrong. They’re dual. The question is which axis your problem extends along.

The two axes

Visualize it as a 2D grid:

           Op 1   Op 2   Op 3
  Case A   A.1    A.2    A.3
  Case B   B.1    B.2    B.3
  Case C   C.1    C.2    C.3

Each cell is “what op N does on case X.” The grid is the full specification of the program’s behavior. The question is: how is the code organized?

Object-oriented organization puts the rows together. Each class (Case A, Case B, …) is a unit that includes all of its operations. Adding a new case is local: write a new class with all its operations. Adding a new operation is global: touch every class and add the new method.

ML organization puts the columns together. Each function (Op 1, Op 2, …) is a unit that includes all of its cases. Adding a new operation is local: write a new function with a match over all cases. Adding a new case is global: touch every function and add a new match clause.

The grid is the same; the orientation differs. Each orientation rewards extension along one axis and punishes extension along the other.
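
The column orientation, sketched with a hypothetical event type (OCaml-style ML). A later operation is one new function, and no existing code moves:

(* The cases, fixed once. *)
type event = KeyPress of char | Tick | Quit

(* Op 1: one column, all cases in one place. *)
let describe = function
  | KeyPress c -> "key " ^ String.make 1 c
  | Tick       -> "tick"
  | Quit       -> "quit"

(* Op 2, added later: a new column, no existing code touched. *)
let is_terminal = function
  | KeyPress _ | Tick -> false
  | Quit              -> true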

For systems work, the columns win

For most systems work — protocol parsers, schedulers, control planes, compilers, anything where the shape of the data is fixed by an external specification and the operations multiply — ML’s choice is exactly right.

The shape of HTTP doesn’t change weekly. The shape of TCP segments doesn’t change weekly. The shape of an x86 instruction doesn’t change weekly. The shape of a process state machine doesn’t change weekly. But you’re constantly adding new operations: a new serialization format, a new analyzer pass, a new metric, a new optimization, a new debugger view.

ML rewards this. Define the sum type once; write a new function per operation; the compiler tells you if you missed a case. The data type is the spec, and the spec is stable. The operations are the surface area, and the surface area grows.

OO punishes this. Adding a new operation requires touching every class. The visitor pattern exists specifically to make this less painful, and the visitor pattern is a clumsy reinvention of pattern matching, with worse ergonomics and worse type safety. When you find yourself reaching for visitor in an OO codebase, you’re trying to write ML through a keyhole.

For UI and plugin systems, the rows win

The opposite shape applies to UI code, plugin architectures, and anything where new cases arrive constantly and the operations are roughly fixed.

A widget toolkit has a fixed set of operations: render, handle input, layout, focus. But the cases — Button, TextField, Image, the custom widget your user wrote — multiply continuously. OO’s row-orientation handles this gracefully: each new widget class implements the four operations, and the rest of the system never notices it.

ML can handle this case too, with first-class modules or records-of-functions (we’ll see this in chapter 5), but the support is less ergonomic than OO’s built-in class hierarchy. You wouldn’t write a UI toolkit in stock ML for the same reason you wouldn’t write a compiler in stock Java: the language’s grain doesn’t fit the problem.
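
For the curious, here is what records-of-functions look like, as a minimal sketch with a hypothetical widget interface. Each new widget is a new value, and nothing elsewhere changes:

(* The fixed operations, as a record type: the "row" interface. *)
type widget = {
  render : unit -> string;
  handle : char -> widget;   (* input handling yields the next state *)
}

(* A new case is a new value; no existing code is touched. *)
let rec button label = {
  render = (fun () -> "[" ^ label ^ "]");
  handle = (fun _ -> button label);
}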

The lesson, generalized

Know which axis your problem extends along, and pick the language whose grain runs the same way. Your instinct to reach for inheritance is, for the systems work this book targets, almost always wrong. Reach for a sum type and a function instead.

This is not an “ML is better” claim. It’s a “the orientations are duals, and most problems lean one way or the other” claim. The orientation a language commits to is its single most consequential decision; everything else in the language flows from it.

OCaml’s pragmatism

OCaml — the largest ML in production use — has objects. They’re not the same as Java’s; they’re more like structural records of methods with their own type system. But they exist, and they exist for the cases where row-orientation is the right answer within an otherwise column-oriented language. They’re rarely used in idiomatic OCaml because column-orientation is rarely wrong, but when you need them, they’re there.

CML in this book doesn’t have objects. The cases we care about — kernels, protocols, schedulers — are all column-oriented. If we ever wanted to write a plugin system on top of CML, we’d reach for records-of-functions (which are a less-pretty form of objects) or first-class modules (which are a more powerful form). For the body of the book, sum types and functions suffice.

The duality in design

The duality reaches into the design of effect systems too. An effect handler is, structurally, a column: it provides a set of operations (the handler clauses) for a fixed set of cases (the effect’s operations). When you write a new handler, you’re writing a new column for the existing effect. When you add a new effect, you’re adding a new row.

In CML, effects are declared once, and the operations of an effect are fixed by that declaration. Adding a new operation to an existing effect would require touching every handler for that effect. So effects are cases-stable, operations-stable, with handlers as the extension point. This is the column orientation extended to control flow itself.

We’ll come back to this when we get to effects formally in chapter 7. For now, the seed is planted: ML’s design philosophy is column-oriented all the way down, including its concurrency model.

What this chapter committed to

The duality between case-oriented and operation-oriented program organization. Why systems work tends to be operation-oriented and ML therefore fits. The honest acknowledgment that the opposite shape exists and ML is not the right tool for it. A preview that the duality reaches into effect systems too.

The next chapter is about polymorphism, which is what lets ML write code generic over the case axis without giving up on either type safety or performance.

Polymorphism and the type variable

You’ve internalized that a type is a set of values with some cardinality, and that the algebra composes those sets. Now we ask: what does it mean to write a function that doesn’t care which set?

The type variable

Consider the function “give me back what you gave me.”

let id x = x

What’s its type? Well, what’s the cardinality of its input? You can’t say — it works for any input. Its type is 'a -> 'a, read “for all alpha, alpha to alpha.” The 'a is a type variable: a hole into which any concrete type can be plugged.

When you call id 42, the compiler instantiates 'a with int and the call has type int -> int. When you call id "hello", it instantiates with string. The function itself, in the source, doesn’t know which.

This is parametric polymorphism. In OO, “this works for any type” usually means “this works for any subtype of some base type” — bounded by an interface, dispatched through a vtable, paying a runtime cost. In ML, 'a -> 'a means literally any type, with no bound, and no runtime cost at all. The function is compiled once (or specialized; we’ll get to that) and works for everything.

How is that possible? Because id cannot do anything with its argument except pass it along. It doesn’t know if 'a is an int or a string or a 4KB struct. It has no operations available to it. The type variable is a promise of ignorance, and that ignorance is what makes the function universal.

Parametricity: ignorance as a guarantee

Here’s where it gets beautiful. Look at the signature 'a -> 'a and ask: what functions could possibly have this type?

Exactly one: the identity. Because the function receives an 'a and must return an 'a, and it has no way to construct an 'a from nothing (it doesn’t know what 'a is), and no way to transform the one it has (no operations are defined on an unknown type). The only 'a it can produce is the one it was given.

The signature, by itself, has told you the entire behavior. This is called a free theorem: a fact about a function deducible from its type alone, for free, without reading the implementation.

More examples:

  • 'a -> 'b -> 'a has exactly one inhabitant (up to behavior): return the first argument. There’s nothing else to do.
  • 'a list -> 'a list — many inhabitants, but all of them are some rearrangement/selection of the input. The function cannot invent new 'a values. It can return [], the input unchanged, the input reversed, every-other-element. It cannot fabricate elements.
  • 'a list -> int — structural information only. Length, sum-of-positions, hash-of-structure. The function can examine the list’s shape but not its contents.

This is what people mean when they say “types are propositions.” The signature 'a -> 'a is the proposition “for any A, given an A I produce an A,” and the only proof is id. Stronger signatures admit fewer proofs. The art of ML design is choosing signatures so constraining that wrong implementations cannot typecheck.

Your Lisp background will fight this for a moment — Lisp’s macros let you do anything, and that power feels like the point. ML’s bet is the opposite: deliberately restrict what you’re allowed to express, so the compiler can tell you more about what you’ve written. You give up dynamism; you get certainty. For systems code, where the cost of being wrong is high and the cost of being verbose is low, the trade is usually correct.

Why this is free at runtime

How is id compiled? Two main strategies, with different tradeoffs. The ML family splits on this; production languages have made both choices.

Uniform representation. Used by stock OCaml, SML/NJ, and most BEAM-adjacent implementations. Every value is represented as a single machine word. Either it’s a small integer stored directly (with the low bit set as a tag), or it’s a pointer to a heap-allocated object whose header describes its shape. Under this scheme, id is one function, compiled once, that moves a word from one register to another. It works for int, string, (int, string, float) list option — anything — because everything is a word.

The cost is indirection. A float is heap-allocated and boxed (with exceptions for arrays of floats and other escape hatches). A pair of ints is two pointers to two boxed ints, not two ints side-by-side. You pay in cache misses and allocation pressure for the convenience of “everything is a word.”

Monomorphization. Used by MLton and (analogously) by Rust. The compiler generates a separate specialized copy of id for every type it’s called with: id_int, id_string, id_my_4kb_struct. Each one knows the exact layout of its argument and can pass it through registers or memory as appropriate. No boxing.

The cost is code size. A polymorphic function used at N types produces N copies in the binary. For deeply polymorphic code (think: a generic data structure used at twenty types), this can balloon the binary substantially. Rust mitigates this with its dyn Trait escape hatch; MLton just accepts the size.

Hybrid. Haskell with various extensions, some research MLs. Mostly uniform, with selective specialization driven by annotations or inlining heuristics.

The high-level takeaway: parametric polymorphism is a compile-time tool with one of two runtime stories, and both are predictable. Unlike OO subtype polymorphism — where dispatch goes through a vtable at every call and the optimizer often can’t see through it — parametric calls are either monomorphized (and inlined like normal C) or uniformly represented (and the indirection is uniform and predictable, not per-call). You can reason about it.

For high-performance systems work, the practical heuristic: polymorphism in hot loops is cheap on monomorphizing implementations and possibly expensive on uniform-representation ones, depending on what’s being passed. If you’re processing a tight inner loop over 'a array where 'a = float, on OCaml you’ll want float array specifically (which OCaml special-cases to unboxed storage), not a generic polymorphic container.

The book’s CML defers this decision; the compiler team can pick monomorphization or uniform representation when it has more information about typical workloads. The book’s hardware design (the 3-bit tag, the verified-tagged ALU) is somewhat hostile to monomorphization, since it carries type information everywhere by construction. So the natural choice is uniform representation with type tags, with monomorphization as an optimization for hot loops where the compiler can prove it pays off.

Polymorphism vs. genericity vs. interfaces

A clarifying triangulation, since you have intuitions from several languages:

Go generics (post-1.18) are constrained parametric polymorphism with monomorphization or dictionary-passing, depending on the compiler’s mood. Closer to ML than to Java.

Java/C# generics are constrained parametric polymorphism with uniform representation (Java erases to Object; C# specializes for value types but not reference types). Closer to OCaml than Java programmers usually realize, but bounded by interfaces, which ML’s 'a is not.

C++ templates are monomorphization plus compile-time duck typing. The compiler doesn’t check whether T has a + operator until you instantiate; ML’s type variables are checked once at the definition site, which is why ML compile errors are local and C++ template errors are not.

Elixir/Erlang don’t have static parametric polymorphism — everything is dynamically typed and the “polymorphism” is just “this function happens to work on whatever you pass it.” The discipline you cultivated with @spec and Dialyzer is approximately ML’s discipline, applied retroactively.

Lisp generics (CLOS, Clojure protocols) are ad-hoc polymorphism with runtime dispatch — closer to OO than to ML. ML’s parametric polymorphism is a different axis entirely; you’ll see ML programmers reach for it where a Lisper would reach for a multimethod, and the ML version is statically checked but less extensible.

Where bounded polymorphism lives in ML

You will eventually want to say “this function works for any 'a, as long as 'a can be compared / hashed / serialized.” That’s where ML’s module system enters, and it’s so different from interface-based bounded polymorphism that it would deserve its own book.

Briefly: the answer is not “type classes” (that’s Haskell) and not “interfaces” (that’s everyone else). The answer is: you pass in a module that provides the required operations, either explicitly or via a functor. The compiler resolves it at compile time, monomorphizes or doesn’t depending on implementation, and you pay nothing at the call site.
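
A taste of the module flavor, in stock OCaml rather than CML: the required operations arrive packaged as a module, and a functor builds the bounded-polymorphic code once, resolved at compile time.

module type ORDERED = sig
  type t
  val compare : t -> t -> int
end

(* A functor: bounded polymorphism without interfaces or vtables. *)
module MakeMax (O : ORDERED) = struct
  let max a b = if O.compare a b >= 0 then a else b
end

module IntMax = MakeMax (struct type t = int let compare = Int.compare end)
(* IntMax.max 3 5 evaluates to 5; resolution happened at compile time. *)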

CML in this book doesn’t fully exploit the module system; the kernel and example programs are simple enough that they don’t need it. A real CML would grow modules and functors over time. For now, the parametric polymorphism described here is the whole story.

A small exercise

How many distinct functions of type bool -> bool -> bool are there? Compute it from cardinalities, then enumerate them in your head and check.

Answer: |bool|^(|bool|×|bool|) = 2^4 = 16. They include and, or, xor, nand, the constant functions, the projections, the negations of projections — all of two-variable boolean logic. The signature literally enumerates a finite logic.
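
If you want the machine to do the counting, a brute-force enumeration (valid OCaml) builds one function per four-entry truth table and counts them:

let bools = [true; false]

(* One function per truth table: 2^4 = 16 of them. *)
let all_fns =
  List.concat_map (fun tt ->
    List.concat_map (fun tf ->
      List.concat_map (fun ft ->
        List.map (fun ff ->
          fun a b -> match a, b with
            | true,  true  -> tt
            | true,  false -> tf
            | false, true  -> ft
            | false, false -> ff)
          bools)
        bools)
      bools)
    bools

let () = assert (List.length all_fns = 16)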

What this chapter committed to

Parametric polymorphism as a compile-time discipline of ignorance. Free theorems as the payoff: the type alone constrains the function dramatically. Two implementation strategies (uniform representation and monomorphization), both predictable in cost. Brief comparisons to neighboring languages. The hint that ML has a richer module-based bounded-polymorphism story that this book mostly defers.

Part I is now complete. You have the algebraic vocabulary, the control-flow primitive, the propositional view of types, the orientation choice, and parametric polymorphism. The next part — Part II — applies all of this to an ISA: what does it mean to design hardware that thinks in these terms?

What does it mean for the ALU to type-check?

Part I gave you the algebra. Part II takes the algebra to the metal. We’re going to build a 16-bit ISA whose primitives are shaped by the language they’ll run. The first chapter of this part starts with the question of where types live: in the source code, in the compiler’s internal records, or in the running machine itself.

The ISA we’ll design carries types at runtime. Every value in a register has a 3-bit tag identifying its type. The ALU checks tags before operating. This is a real design decision with consequences. Before committing to it, let’s lay out the alternatives.

Four alternatives

Alternative A: Tags are dynamic; the ALU dispatches on them

Every value carries a tag. The ADD instruction inspects both operands’ tags. If they’re both int32, it does int32 addition. If they’re both float, it does float addition. If they’re mismatched, it raises a hardware fault.

This is the Lisp Machine design. The Symbolics 3600 and the MIT CADR did exactly this. It makes dynamic typing nearly free: the tag check that an interpreter would do in software, the hardware does in a single cycle alongside the arithmetic.

Buys: dynamic dispatch is cheap. Polymorphism without monomorphization is viable. Garbage collection is helped enormously — the GC can scan a register and know from its tag whether to follow it as a pointer. A “value” type that covers everything is possible.

Costs: The type system, if you have one, is redundant with the hardware tags. You’re paying for tag bits at runtime that the compiler could have proven unnecessary. Monomorphization stops making sense, because you can’t erase what the hardware demands be present.

Alternative B: Tags are static; the ALU dispatches at decode

The tag isn’t carried in the value at runtime. Instead, the instruction encoding names the type: ADD.i32, ADD.i64, ADD.f32 are distinct opcodes. The ALU’s “type checking” is really the decoder’s job — it picks the right execution unit based on the opcode.

This is what x86, ARM, RISC-V, and basically every modern ISA actually do.

Buys: monomorphization is the natural fit. Values in registers are raw bits with no overhead. This is the fastest design for statically-typed code.

Costs: Dynamic dispatch becomes expensive again — you’re back to a software tag check and a branch. The hardware isn’t really doing anything ML-specific; you’ve described a normal ISA.

Alternative C: Tags are static AND carried; the ALU verifies

Each value carries its tag at runtime, but the instruction encoding also specifies the expected type. ADD.i32 reads two registers, checks both tags are i32, performs the add, writes a result with tag i32. A tag mismatch is a hardware fault — but it’s a fault that, in a well-typed program, cannot occur. The hardware check is a verification of the compiler’s claim, not a dispatch mechanism.

This is closer to what you’d get if you pushed CHERI’s capability discipline further.

Buys: The type system and the hardware tags are in correspondence. A well-typed program never faults on a type check, so the type system is a soundness proof for the program’s execution. Dynamic typing is still possible (using a sum type whose tag is inspected explicitly), but it’s opt-in. The hardware tag also gives the GC a precise marking primitive, and lets cross-device messages carry their type information natively over the NoC.

Costs: You’re paying for tag bits in every value, which costs memory bandwidth and register width. You’re paying for the tag check in every ALU operation, which is parallel-with-the-op so latency-free but costs energy and silicon.

Alternative D: Tags are carried, but only at boundaries

Inside a function body, values are raw — the compiler proved the types, no tags needed. At function call boundaries, message-passing boundaries, and effect-perform boundaries, values get tagged. This is a hybrid: register-resident values are bare bits within a “tagged region,” and crossing a boundary either attaches or strips a tag.

Buys: You get C’s register efficiency in hot loops, plus the cross-boundary verifiability of Alternative C. Cross-device messages over the NoC arrive tagged.

Costs: Compiler complexity: tracking “tagged” vs. “bare” representations and inserting tag/untag operations at boundaries. The inside of a function is no longer verified by the hardware, so the type system has to be trusted there. For a research ISA where you’re co-designing for safety, this is a regression.

The intuition pump

Picture a single ADD of two int32s. Walk through what happens in each alternative:

  • A (Lisp Machine): Hardware reads both registers, sees tags say “int32,” performs integer add, writes result with int32 tag. Took one cycle, with the tag check parallel to the add. Same instruction would have done float add if the tags said float.
  • B (modern ISA): Decoder sees ADD.i32, routes to the integer ALU. ALU adds. Done. No tags anywhere.
  • C (verified-tagged): Decoder sees ADD.i32. Hardware reads both registers, checks both tags are i32 (else fault), performs add, writes result with i32 tag. Same mechanism as A, different semantics: the tag is verification, not dispatch.
  • D (boundary-tagged): Inside a function, looks like B. At a call site that crosses an ABI boundary, the compiler emits TAG.i32 to attach a tag before the call, and UNTAG.i32 to strip it on receipt.

Now picture perform (SendToDevice 0x42 payload). In A and C, the payload carries its tag with it and the receiving device can interpret it. In B, the payload is bare bits and the protocol has to encode the type somehow. In D, you’d tag at the perform boundary and untag at the handler.

If the NoC is going to carry typed messages between heterogeneous devices, you almost certainly want the type to be physically present in the message, not implicit in the protocol. That’s a tag in the value. This argues strongly for C.

The decision

For this book’s ISA, we choose Alternative C. The decision is principally driven by:

  1. Cross-device messages over the NoC need to carry type information. The hardware tag does this naturally.
  2. The type system gives a precise GC root-set without compiler-annotated stack maps.
  3. The hardware tag check turns the type system from “compile-time discipline” into “compile-time discipline backed by runtime verification.” Type unsoundness manifests as a hardware fault, not as silent corruption.
  4. Effect dispatch, which we’ll meet in chapter 7, benefits from the same tag mechanism — continuations are tag-distinguished from other pointers.

The costs (register width, ALU tag-check silicon) are acceptable for a teaching ISA, and arguably acceptable for a small production ISA if the design holds together at scale.

What this implies for the language

Several things follow from the choice:

The language has a notion of “primitive type” that maps to a hardware tag. There’s a small, fixed set: integers of various widths, floats (if any), pointers, continuation handles. The current 3-bit tag space holds 8 such types.

Compound types are built out of primitives. Sums, products, functions. These need representation strategies; we’ll get there in later chapters. The short answer: they’re heap-allocated and pointed to via a PTR-tagged value.

Monomorphization is possible but less crucial. Because hardware tag dispatch is cheap, polymorphism is cheap. You can still monomorphize for code-quality reasons, but you’re not forced to.

The type system has a soundness theorem of the form “well-typed programs do not fault on hardware tag checks.” This is a real, provable property that the language must preserve. We get to choose the language and the type system; we can make this hold.

Sum types and hardware tags overlap conceptually but not in implementation. We thought, briefly, about making sum-type variants share the hardware tag namespace directly. The decision (carried forward into the ISA spec) is to keep them separate: hardware tags identify primitive types, and sum-type variants are recorded in heap-object headers. We’ll explore this in the language spec; for now, hold the distinction in mind.

Consequences for tag width

How wide should the tag be? Three bits gives 8 tags, which is the choice in this book’s ISA. The allocation:

  Tag   Name
  0     INT
  1     PTR
  2     CONT
  3     CLOSURE
  4     UNIT
  5     BOOL
  6     RESERVED
  7     RESERVED
INT is the only numeric type at present — 16-bit signed integer. PTR is a heap pointer. CONT is a continuation handle (a heap pointer, distinguished for GC and for RESU typechecking). CLOSURE is a closure record (also a heap pointer, distinguished for CALL). UNIT and BOOL are special-cased so they fit in registers without heap allocation. Two slots are reserved for future expansion.

Three bits is tight. We can’t add FLOAT, INT64, BYTE as distinct primitive types without either expanding the tag or remapping. The decision for the 16-bit toy is to live with the constraint — there’s plenty of design space in just INT, PTR, CONT, CLOSURE, UNIT, BOOL to demonstrate the model. A 32-bit successor would widen the tag to 4 or 5 bits and allocate the additional primitives.

What this chapter committed to

Verified-tagged architecture as the design choice. The 3-bit tag namespace. The principle that the type system and the hardware tags are in correspondence, with the type system as a soundness proof for hardware tag integrity.

The next chapter introduces the operation that makes this ISA strange: perform, the instruction that captures a continuation and transfers to a handler.

Continuations as values, effects as control transfer

This is the chapter that explains the design’s most unusual commitment, the one that runs through everything else. Most CPUs treat control flow as: execute the next instruction, or take a branch, or call a function and eventually return, or take an interrupt and eventually return. This ISA adds one more primitive: perform an effect. The hardware captures the current point of execution as a value, transfers control to a handler that the kernel installed, and lets the handler decide whether to resume.

This sounds esoteric. It is, in fact, the most useful primitive we will add. Let me build up to why.

What’s a continuation, exactly

A continuation is the rest of the computation, reified as a value. If you have an expression

let x = E in F

then “the rest of the computation after E completes” is conceptually the function fun v -> let x = v in F. In a language with first-class continuations, this rest-of-the-program can be captured as a value, stored, passed around, and invoked later.

If you’ve used call/cc in Scheme, you’ve worked with continuations. If you’ve written async/await in any modern language, you’ve worked with a restricted form of continuations — await saves “the rest of this function” as a state machine and resumes it when the awaited value arrives. If you’ve written generator functions (yield in Python), the generator’s frame is a continuation that gets resumed each time you ask for the next value.

In CML, continuations are first-class values. They have a type — unit continuation is a continuation that expects unit when resumed; int continuation is one that expects an int. They live in registers; they live on the heap; they get passed to functions; they get stored in queues. The compiler tracks their types.

In the ISA, continuations have a hardware tag — CONT — and the runtime representation is a heap-allocated record describing exactly enough state to resume from where they were captured. We pinned this layout in the ISA specification (Appendix A): saved program counter, saved stack pointer, saved registers, and a “used / unused” flag.

What’s an effect

An effect is a typed operation that user code can perform and that some surrounding handler is responsible for interpreting. In CML, effects are declared by signature:

effect IO {
  print : string -> ()
  read_byte : () -> byte
}

effect Scheduler {
  yield : () -> ()
  exit  : () -> empty
}

These declarations introduce families of operations. Each operation has a type. The print operation takes a string and returns unit; the read_byte operation takes nothing and returns a byte.

User code performs an operation:

let n = perform IO.read_byte () in
perform IO.print "you typed something";
n

When this code runs, perform IO.read_byte () causes control to transfer to whatever handler is currently installed for the IO.read_byte effect. The handler decides what to do — read a byte from somewhere, return it. When the handler resumes the continuation, the value of the perform expression becomes whatever the handler supplied.

This is similar to a function call. The difference is who decides what the call does. With a function call, the called function is named at the call site; you know what runs. With a perform, the call site names only the effect; what runs is whichever handler is currently in scope. The same perform IO.print msg might write to UART in the kernel, write to a buffer in a test harness, or write to a network socket in a distributed system. The user code doesn’t know.

What’s a handler

A handler is an installed responder for one or more effects. In CML:

handler print_to_uart = {
  | IO.print s, k       -> uart_send s; resume k ()
  | IO.read_byte (), k  -> resume k (uart_recv ())
  | return v            -> v
}

let main () =
  with print_to_uart in {
    perform IO.print "hello\n";
    let b = perform IO.read_byte () in
    perform IO.print ("got: " ^ string_of_byte b ^ "\n")
  }

The handler print_to_uart declaration defines a set of clauses. Each clause names an effect operation and binds two things: the operation’s argument, and the continuation. The continuation k represents “the body’s execution at the point of this perform, ready to be resumed.”

When the body performs IO.print "hello\n", control transfers to the IO.print s, k clause with s bound to "hello\n" and k bound to the suspended continuation. The clause does the actual work (writing to UART) and calls resume k () to deliver () back to the perform site, after which the body continues with the next line.

The return v clause runs when the body completes normally. It receives the body’s return value and is the handler’s last opportunity to do something with it.
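To see the indirection at work, here is a second handler for the same effects — a test-harness interpretation, sketched with an assumed Buffer module and its operations:

handler print_to_buffer = {
  | IO.print s, k       -> Buffer.append out_buf s; resume k ()
  | IO.read_byte (), k  -> resume k (Buffer.take_byte in_buf)
  | return v            -> v
}

Run the same body under print_to_buffer instead of print_to_uart and every perform in it is reinterpreted, without a line of the body changing.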

Why this is a unification

Now the payoff. Several operations that look distinct in conventional kernels are, under this design, the same operation:

A context switch. A process performs Scheduler.yield. The handler captures the continuation, files it in the ready queue, pops a different continuation, and invokes it. Context-switching is perform + resume on a different continuation.

A blocking I/O call. A process performs IO.read_byte. The handler checks if a byte is available; if so, it resumes the continuation with the byte. If not, it files the continuation in a “waiting for input” queue and runs someone else. When a byte arrives later, the handler picks up the waiting continuation and resumes it.

A timer interrupt. The timer device performs Hardware.tick. The handler captures the interrupted code’s continuation, files it in the ready queue, and runs someone else. This is the same operation as voluntary yield. The only difference is who performed it — the running code (yield) or the timer device (tick).

A device-completion notification. The disk performs Hardware.disk_done. The handler looks up which continuation was waiting for this completion, supplies the result, and resumes.

An exception or error. A process performs Error.divide_by_zero. The handler decides whether to recover (resume with a default value), terminate the process (don’t resume), or escalate (perform another effect with the error).

All of these are the same mechanism. One control-flow primitive — perform — that unifies what conventional systems treat as half a dozen distinct mechanisms with their own machinery.
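As a preview of the kernel chapters, the context-switch case is worth seeing in handler form — a sketch that assumes a ready queue of continuations and elides process metadata:

| Scheduler.yield (), k ->
    Queue.push ready k;          (* file the yielder’s continuation *)
    (match Queue.pop ready with
     | Some k' -> invoke k' ()   (* run a different process *)
     | None    -> ())            (* nothing left to run *)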

Why this is hardware-shaped

In a software implementation of effects — as found in OCaml 5, Koka, Eff — the mechanism for perform is some variant of stack manipulation: the runtime walks back up the stack until it finds an installed handler, then transfers control. This costs a few hundred to a few thousand cycles, depending on stack depth.

In our ISA, perform is a single instruction. The handler search happens in hardware. The continuation capture happens in hardware. The transfer happens in hardware. The whole operation retires in a handful of cycles. Context switches between processes — which require user-level continuations to be captured and invoked — are no longer expensive operations that kernels have to engineer around. They’re cheap, and they’re uniform with every other effect.

This is the design’s central bet: that effects are common enough, and important enough, that putting them in hardware pays off.

The handler search mechanism

The hardware needs to find the handler for a given effect. The data structure is a handler stack, with handlers installed by recent with blocks at the top and the kernel’s bottom handlers at the bottom. When user code performs an effect, the hardware searches top-down for a handler that catches it.

In our ISA, this stack is partially cached in a small content-addressable memory called the handler CAM. The CAM holds the most-recently-installed handlers. If the effect is in the CAM, dispatch is one cycle. If not, the hardware walks the spilled portion of the handler stack in memory, costing more cycles per probe. The CAM is 8 entries in the current design; the spill cost is roughly 12× a hit. This is documented in the ISA spec.

For user code, none of this is visible. A perform dispatches fast or slow, but correctness is the same either way, and the cost model is published so programs can be tuned.

Continuations are linear

A subtle but load-bearing decision: continuations in CML are linear. They can be resumed exactly once. If a handler clause resumes the continuation, it cannot also resume it again. If a handler clause stores the continuation in a queue, it cannot also resume it directly.

This is what makes effect-based scheduling sound. If continuations could be duplicated, a buggy or malicious handler could invoke a process’s continuation twice, running the process simultaneously in two contexts with all the resulting confusion. By making continuations linear, the type system rules this out.
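Concretely, a clause like the following — values illustrative — is rejected by the type checker, because the first resume consumes k:

| IO.read_byte (), k ->
    resume k 0x41;
    resume k 0x42    (* type error: k is linear and already consumed *)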

Linearity also enables migration across cores. When you ship a continuation to another core (as we’ll see in the multicore chapter), the source core loses it and the target core gains it. There’s no danger of both cores trying to run the process; the type system has enforced that there’s only one continuation handle, and now it lives on the other core.

We’ll fully unpack linear types in chapter 13. For now, the important fact: continuations are values that can be passed around, but they’re linear, so the type system tracks ownership.

What this chapter committed to

Continuations as values the ISA understands. Effects as typed operations that transfer control to installed handlers. The unification of context switches, I/O blocking, interrupts, exceptions, and a few other concepts under one mechanism. The hint that linearity makes the whole thing safe.

The next chapter traces a single physical event — a keystroke — through every layer of this design, from the switch under the key to the perform that delivers it to a process.

The keyboard, end to end

This is a tracing chapter. We’re going to follow a single physical event — pressing the ‘A’ key — through every layer of the system, from the physical switch to the perform that delivers it to a process. The point is to make concrete how the unification described in the previous chapter actually works at the hardware level.

We have, for purposes of this chapter, a real-ish setup: a physical keyboard wired to your SoC via some serial protocol; inside the SoC, a keyboard interface device (KID) that bridges the serial protocol to the chip’s network-on-chip; the NoC itself, with four compute cores and various devices attached; and the kernel running on core 0, with a handler installed for keyboard events.

When you press ‘A’, this is what happens.

Stage 1: Outside the SoC

The physical key switch closes. The external keyboard controller’s scan loop notices, debounces (a few milliseconds), and decides to report a keypress. It encodes “A pressed” as some bytes per whatever serial protocol — let’s say a simple UART-style protocol where each keycode is one byte. It transmits those bytes to the SoC’s pin.

This is happening in real time, on the order of milliseconds, governed by the physical world and the external chip’s clock. The SoC has no direct view into it.

Stage 2: The KID receives serial bits

Inside the SoC, the KID has a serial receiver. As bits arrive on the input pin, the receiver shifts them into a register. When a complete byte has been assembled, the receiver flags it. This is a hardware-only process, happening at the bit clock rate of the external link.

The KID’s logic now has a question: what do I do with this byte?

For our simple protocol, the byte is a complete keycode. The keypress for ‘A’ arrives as the single byte 0x41. The KID is ready to forward it.

Stage 3: The KID emits a NoC packet

The KID has one byte and needs to know where to send it. Some core wants to know about keypresses. The KID can’t pick arbitrarily; it needs configuration.

This is the first interesting design decision. There are several patterns for “how does a device know who to talk to”:

  1. Hardcoded destination at synthesis. Always sends to core 0. Simple, inflexible.
  2. Configurable register. The kernel writes to a register at boot, naming the destination. Configurable, but requires a back-channel from kernel to device for configuration.
  3. Service-discovery. Devices announce themselves at boot; consumers subscribe. Complex, like a real OS.

For a toy, (2) is right. The kernel, during boot, sends a configuration packet to the KID: “send all your keypresses to core 0.” The KID receives this packet, latches the destination into a register, and uses it for all future packets.

So let’s say that’s been done. The KID now has byte 0x41 and knows to send it to core 0. It constructs a NoC packet:

Field      Value       Notes
dst_id     0x00        core 0
src_id     0x05        KID’s address
op         0x01        “keypress event,” agreed protocol
flags      0x0         no priority
payload    0x00000041  keycode in 32 bits with leading zeros
reserved   0x000       —

It hands this packet to its NoC interface, which is the local mesh router port for the KID’s location.

Stage 4: The packet traverses the NoC

The 4-local NSEW mesh routes packets by tile coordinate. The KID is at some position; core 0 is at another. The NoC’s router logic computes a path — say, “go east twice, then north once” — and the packet hops from router to router, one mesh hop per cycle, until it arrives at the input port of core 0’s local router.

This is purely hardware. No instructions are executed during routing. The NoC is fabric, not compute. The latency is a few cycles per hop, which on a 4-local mesh is at most a handful of cycles total — small compared to anything else we’re discussing.

One important property: the NoC guarantees point-to-point ordering between any two endpoints. Packets sent from the KID to core 0 arrive in the order they were sent. This matters for the kernel’s correctness: a sequence of keypresses can’t be reordered in flight.

Cross-endpoint ordering is not guaranteed: a packet from KID to core 0 and a packet from disk to core 0 can arrive in either order relative to each other. This is fine; the kernel doesn’t assume such ordering.

Stage 5: Core 0’s NoC interface delivers the packet

The packet arrives at core 0’s local router and is handed to the core’s NoC interface — the hardware block that bridges the NoC fabric to the core’s instruction stream. This is where the most interesting design decisions happen, because the NoC interface is the thing that has to turn a packet into a perform.

Three plausible patterns:

  1. Memory-mapped queue. The NoC interface deposits packets into a memory-mapped FIFO. The core periodically reads from the FIFO. Traditional approach.
  2. Interrupt on arrival. The NoC interface raises an interrupt line; the core’s interrupt handler reads the packet.
  3. PERF on arrival. The NoC interface directly performs an effect when a packet arrives.

For this design, (3) is the answer, and this is where the elegance shows.

The NoC interface, on receiving a packet, examines the op field. It treats the op as a 4-bit value within the family 0xF (which the ISA reserves for hardware-originated effects). The interface acts as if the currently-executing code had issued PERF (family=0xF, op=0x01), <payload>, except it didn’t — the NoC interface is hardware.

The core’s pipeline, at its next preempt-sample point (end of sequencer cycle, per the precise-exception discipline), notices the signal and handles it as if the currently-executing code had issued the PERF itself:

  1. The current instruction completes.
  2. The continuation at the current instruction boundary is captured into a fresh CONT_REC on the heap.
  3. The captured continuation goes into v1, the payload goes into v0, the op code goes into v7.
  4. The handler CAM is searched for a handler matching (family=0xF, op=0x01).
  5. Control transfers to that handler.

From the language’s perspective, this is indistinguishable from a perform issued by user code. The handler clause doesn’t know — and doesn’t care — whether the perform came from inside or from the NoC. Same calling convention, same continuation, same dispatch.

Stage 6: The handler runs

The kernel, during boot, installed a handler for (0xF, 0x01) — “keypress arrived from KID” — with the body something like:

(* This clause runs inside the scheduler’s handle block; p is the
   process that was executing when the packet arrived. *)
| Hardware.key_arrival b, k ->
    match Queue.pop sched.waiters with
    | Some waiter ->
        Queue.push sched.ready { id = p.id; k };
        invoke waiter.k b
    | None ->
        Queue.push sched.inbox b;
        resume k ()

The handler does its work — files the byte in the inbox if no one is waiting, or wakes a waiting process. Then either resumes k (the continuation of whatever was running when the packet arrived) or invokes the waiter’s continuation. Either way, control flows back into user code, and the system continues.

What we’ve just built

Trace it end-to-end and notice what’s happened. A physical event — a keystroke — propagated through six layers, and arrived in the language as a perform. At no point did the kernel write code to “handle an interrupt.” It wrote a handler for an effect. The effect happens to be performed by hardware rather than by software, but the kernel’s code is identical in both cases.

This is the unification. Three observations to make it concrete:

The keyboard is structurally identical to a software process. Both produce effects that the kernel handles. The keyboard’s effects happen at hardware-determined rates; a process’s effects happen at software-determined rates. The kernel doesn’t distinguish.

The “interrupt handler” is just a handler. No separate interrupt context, no IRQ stack, no top-half/bottom-half discipline. The handler runs in the frame of whatever was preempted, has access to whatever the lexical scope provides, and resumes (or doesn’t) via the normal continuation mechanism.

Configuration is also just an effect. The “kernel sends a packet to the KID to set its destination” we glossed over — that’s also a perform. Specifically, it’s the kernel performing an effect with payload “configuration to KID at address 0x05.” The KID’s NoC-side hardware decodes this and writes to its internal destination register. There’s no special “MMIO write” instruction; it’s the same PERF mechanism going the other direction.

What’s hiding in the boot story

I waved at “the kernel, during boot, sends a configuration packet.” Let’s unwave it.

When the SoC powers up, the NoC interface on each core is in some default state. The handler CAMs are empty. The effect tables are empty. The cores are in a known state at some reset vector.

A real boot sequence:

  1. Each core comes up at its reset vector. One core (by convention, 0x00) is the bootstrap core; the others wait.
  2. Bootstrap core initializes its handler tower: installs the outermost handler for (0xF, *) — the catch-all for hardware-originated performs — pointing at a “panic” or “log and drop” routine, so that errant packets don’t deadlock.
  3. Bootstrap core enumerates devices on the NoC. This is itself a sequence of performs: “send ‘identify’ to address 0x01, wait for response,” “send ‘identify’ to address 0x02, wait for response,” etc.
  4. For each device of interest, the bootstrap core sends a configuration packet binding the device to a core. KID gets bound to core 0. UART gets bound to core 0. Etc.
  5. The bootstrap core installs specific handlers for (0xF, *) corresponding to each device whose packets it should receive.
  6. The bootstrap core spawns initial processes.
  7. The bootstrap core releases the other cores, which run their own initialization sequences.

This is a real boot ROM. Maybe 200 instructions, mostly PERF to talk to devices. It’s the only code on the system that uses bare device-level PERFs rather than higher-level effects; everything afterward goes through the kernel’s abstractions.
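In CML-flavored pseudocode — every helper name below is illustrative, standing in for sequences of bare PERFs — the ROM’s shape is:

(* Sketch of the bootstrap core’s sequence; helper names are
   illustrative, not part of the ISA or kernel. *)
let boot () =
  install_catch_all 0xF log_and_drop;                  (* step 2 *)
  let devices = enumerate_noc () in                    (* step 3: identify probes *)
  List.iter (fun d -> bind_device d core_0) devices;   (* step 4 *)
  install_device_handlers ();                          (* step 5 *)
  spawn init;                                          (* step 6 *)
  release_secondary_cores ()                           (* step 7 *)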

Where multicore enters

Notice that the same mechanism — perform packets traversing the NoC — works for core-to-core communication. If core 0 wants to send a message to core 1, it issues PERF (0xF, op_for_inter_core) payload, where the payload includes core 1 as the destination. The NoC interface routes the packet; core 1’s NoC interface receives it; core 1’s handler runs.

The keyboard and the other core are the same kind of thing from the kernel’s perspective. Both are addresses on the NoC. Both communicate via packets. Both surface as performs.

We’ve built the entire mechanism for multicore already. Multicore is just “more cores, addressed the same way as devices, with the kernel running on multiple of them.” We’ll formalize this in chapter 14; for now, the point is that the design pays off across all the things you’d want to do with it.

What this chapter committed to

The full trace of an event through hardware, NoC, and into the language. Devices as effect producers. The boot story for binding devices to cores. The preview that multicore is the same mechanism extended.

The next chapter looks at the same mechanism running in the opposite direction — what does it look like when CML user code sends a message to a device?

The send side, and what PERF really does

The previous chapter traced an event into the language. This chapter traces a request out — what happens when user code does perform UartTx 'A' to send a byte over the UART. The point is to make precise what perform actually does at the hardware level, and to surface a tension that the design has to resolve: PERF for local effects and PERF for remote effects are similar but not identical.

The asymmetry that’s been hiding

We’ve been treating PERF as a single mechanism. But it has two distinct things it might do, and the hardware has to decide which:

  1. Local handler search. Look in the handler CAM, find a handler installed in the current frame tower, transfer control to it. This is what every perform in the cooperative kernel does so far.

  2. Remote dispatch. Send a NoC packet to another device, which will deliver it as a PERF on that device. This is what we want to happen when the kernel sends a byte to UART.

How does the hardware know which? Several answers, with very different consequences.

Design space

Option A: Always search locally first, fall through to NoC on miss.

PERF executes. CAM is searched. If a handler is found, transfer control locally. If no handler is found, consult the effect table. If the effect table entry says “remote, dst=X,” emit a NoC packet to X.

The cleanest unification — one PERF instruction, dispatch decided by configuration. But it has a subtle cost: every PERF destined for a remote device pays the local search cost first. For a UART send happening many times per second, that’s a lot of fruitless CAM lookups.

It also has a confusing corner case: what if a user installs a handler for an effect that’s supposed to go to a remote device? The local handler intercepts, and the remote device never hears about it. This might be a feature (interpose on device traffic for debugging) or a bug (you accidentally captured a write you meant to send to UART). Powerful and dangerous.

Option B: The effect table entry encodes “local” vs. “remote” explicitly.

Each entry in the effect table is one of:

  • Local(handler_pc) — fall back to a handler at this PC if not in CAM
  • Remote(device_addr, op_encoding) — emit a NoC packet, don’t check CAM at all

PERF consults the effect table first to determine the mode. If Local, do CAM search. If Remote, build packet and emit.

More efficient (no wasted CAM lookups for remote effects) and more explicit. The cost is that dispatch is no longer uniform — the language has to know whether an effect is local or remote, which leaks the implementation into the type system unless we hide it well.

Option C: Separate instructions for local PERF vs. remote send.

PERF is always local. SEND is always remote. The compiler decides which to emit based on the effect’s binding at compile time.

This is what the very early ISA drafts had, before we unified. It’s the most efficient but it splits the abstraction. The language’s perform is no longer one operation; it’s two, and the compiler bridges them.

Option D: The CAM also holds “remote redirect” entries.

CAM entries can be either LocalHandler(pc) or RemoteRedirect(device_addr, op). PERF searches the CAM as before; a remote-redirect entry emits a NoC packet instead of transferring control locally.

The choice

For this design, Option B with a refinement from D. The effect table is the canonical mapping from (family, op) to destination — local handler or remote device. The CAM holds overrides installed by INST, which are necessarily local (you INST a handler at a PC, not at a device address). When PERF executes:

  1. Consult the CAM. If a local override is found, transfer control.
  2. Otherwise, consult the effect table. If Local(pc), transfer control. If Remote(dst, op), build a NoC packet and emit. If unhandled, trap.

This gives us:

  • One instruction (PERF) at the language level
  • Efficient remote dispatch (no wasted CAM search for remote effects, since the CAM only holds local handlers by construction)
  • Override capability (a kernel debugger could INST a handler for an effect that’s normally remote, capturing all device traffic)
  • Clean compile target (the compiler emits PERF eid8, va for every effect; the runtime/effect-table-setup decides the rest)

UART send, end to end

Let’s trace a perform UartTx 'A'. The UART transmitter is device 0x10. At boot, the kernel populated the effect table:

effect_table[(family=2, op=0)] = Remote(device=0x10, op=0x01)

— that is, family 2 op 0 is “UART transmit byte,” routed to device 0x10’s op 0x01. The kernel exposes this to user code via an effect UartTx : byte -> unit declaration.

User code:

perform UartTx 'A'

Compiled to:

LI    v0, 0x41           ; 'A'
PERF  family=2, op=0, v0

Stage 1: PERF executes

The core’s pipeline reaches the PERF instruction. The decoder identifies it; the operand is the 8-bit effect ID (family=2, op=0), and v0 provides the payload.

  1. CAM search. Looking for (2, 0). No local override; miss.
  2. Effect table lookup. Read effect_table[(2,0)] from memory. Entry says Remote(0x10, 0x01).
  3. Decision: remote dispatch. Skip continuation-capture (we’re not doing local handler search; there’s no handler frame to transfer to on this core). Instead, build a NoC packet.

This is a meaningful divergence in PERF semantics depending on the dispatch mode, and it deserves explicit acknowledgment. For remote PERF, no continuation is captured on the local core. The PERF site continues execution after the packet is sent. The receiving device handles the message however it does; if it needs to reply, it sends its own packet.

This is exactly the asymmetry between “synchronous local effect” and “fire-and-forget remote send.” A UART transmit is fire-and-forget: send the byte and continue. You don’t block waiting for the byte to be on the wire.

Stage 2: Packet construction

The PERF unit assembles a NoC packet:

Field     Value
dst_id    0x10 (UART)
src_id    0x00 (this core)
op        0x01 (from effect table)
flags     0x0
payload   0x00000041 (the byte, zero-extended)

The packet is constructed in a NoC output staging register inside the PERF execution unit. This is a multi-cycle operation, but that’s fine — multi-cycle is the norm for anything involving the NoC.

Stage 3: Handoff to the NoC interface

The packet is handed to the core’s NoC output port, which feeds into the mesh router. The PERF unit’s job is done; the rest is asynchronous.

Stage 4: Backpressure semantics

What does PERF do if the NoC output FIFO is full?

  1. Stall. PERF blocks the pipeline until the FIFO has room. Simple. Means PERF can take unbounded cycles. Preemption can still happen (since stall is a preemptable wait state).
  2. Trap. PERF raises an effect indicating send failure. Handler can retry or report.
  3. Drop. PERF silently drops the packet. Bad. Don’t.

(1) is right. Stall is the default and composes with preemption naturally — if the stall takes too long, the timer eventually fires and the scheduler kicks in. The currently-running process pays the cost, but the cost is just the stall time, and it falls on the process that issued the send, which is where it belongs.

Stage 5: Routing

The packet enters the mesh, follows the routing algorithm, exits at the UART’s NoC input port. A handful of cycles.

Stage 6: UART receives and acts

The UART’s NoC interface receives the packet, decodes it, notes op=0x01 and payload’s low byte = 0x41. It hands the byte to its internal transmit FIFO. The serial-out logic shifts the byte out the TX pin.

The byte travels down the wire, and the SoC’s involvement is complete.

The interesting semantic question: synchronous request-response

I glossed past something. For local PERF, the continuation is captured. For remote fire-and-forget PERF, it isn’t. But what about a remote PERF where the caller wants a response?

Consider effect DiskRead : sector -> bytes. The caller wants to send a read request and then receive the data. The continuation after perform (DiskRead 42) expects to be resumed with the bytes.

Two real choices:

B1: Pipeline blocks waiting for a response. PERF emits the request packet, then blocks the pipeline until a response packet arrives, at which point it delivers the payload to the PERF site and continues. Simple semantics, but the core is blocked for the whole disk operation, which is forever in CPU terms. Bad.

B2: Capture the continuation; resume it when the response arrives. PERF captures the continuation (like a local PERF) and files it somewhere keyed by the request ID. The PERF site doesn’t continue; control returns to the scheduler. When the response packet arrives, it’s matched against the filed continuation and resumed with the response.

(B2) is what you want. Notice: this means the kernel’s ReadByte logic and a hypothetical disk read are doing the same thing at different levels. The kernel-level ReadByte files the continuation in waiters and runs a different process. A hardware-level synchronous PERF would similarly file the continuation, awaiting a matching response.

The question is whether the hardware does this automatically (capture + file + resume on response) or whether it leaves it to software. Two options:

Option α: The hardware doesn’t know about request-response. DiskRead compiles to a fire-and-forget PERF that sends a request packet with a continuation handle as part of the payload. The kernel’s response handler unpacks the handle from the response and resumes it. Continuation handling is entirely software.

Option β: The hardware has a synchronous PERF mode. Captures the continuation, files it in a hardware table indexed by request ID, emits the packet, yields. When a response packet with that request ID arrives, the hardware automatically resumes the continuation.

(α) is simpler and more flexible; (β) is faster but commits the hardware to a specific request-response pattern.

For the teaching ISA, (α) is the right answer. It keeps the ISA minimal and pushes request-response into kernel code where it can be inspected and modified. The kernel writes a small “remote-effect helper” library that wraps fire-and-forget PERFs with continuation tracking. Higher-level effects (DiskRead) compile to calls into this library.

This means at the ISA level: PERF for remote effects is always fire-and-forget. Synchronous request-response is built in software on top of the primitive.
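To make option α concrete, here is a sketch of the kernel’s helper clauses. The pending table, the RawSend effect (routed Remote by the effect table), the disk_response op, and the surrounding scheduler context are all assumptions for illustration:

(* DiskRead is Local in the effect table; the kernel clause does the
   continuation bookkeeping around a fire-and-forget remote PERF. *)
| DiskRead sector, k ->
    let rid = fresh_request_id () in
    Table.add pending rid { id = p.id; k };        (* file the caller *)
    perform (RawSend (disk_addr, op_read, rid, sector));
    scheduler ()                                   (* run someone else *)

(* The disk’s response packet arrives later as a hardware perform: *)
| Hardware.disk_response (rid, data), k ->
    (match Table.take pending rid with
     | Some waiter ->
         Queue.push sched.ready { id = p.id; k };  (* requeue interrupted code *)
         invoke waiter.k data
     | None -> resume k ())                        (* unknown rid: drop *)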

PERF semantics, summarized

The PERF execution unit, in pseudocode:

on PERF (family, op), v_payload:
  if CAM hits on (family, op):
    capture continuation
    transfer control to local handler
  elif effect_table[(family, op)] == Local(pc):
    capture continuation
    transfer control to pc
  elif effect_table[(family, op)] == Remote(dst, dev_op):
    build packet { dst, src=this_core, op=dev_op, payload=v_payload }
    send packet (may stall on backpressure)
    continue at next instruction (no continuation capture)
  else:
    trap (unhandled effect)

This is the unified semantics. The dispatch decision is made at the effect-table lookup. The CAM is for fast local overrides; the effect table is for default routing; the trap is for unhandled cases.

The compiler doesn’t need to know which mode a PERF will be in. It always emits PERF eid8, v_payload. The kernel, by populating the effect table at boot, decides which effects are local and which are remote. User code is uniform.

What this chapter committed to

The send side of PERF. The decision to unify local and remote dispatch under one instruction with effect-table-driven routing. Fire-and-forget semantics for remote PERF, with synchronous request-response built in software on top. Backpressure as a pipeline stall.

Part II is complete. We have an ISA whose primitives — type-checked arithmetic, hardware-aware pattern matching, perform-as-effect, continuation-as-value, NoC-as-effect-routing — fit the language we want to compile. Part III applies all of this to a kernel.

A teaching kernel in 40 lines

Part III is where the design earns its keep. We’ve laid out the algebra, committed to an ISA, specified an effect-aware language. Now we use all of it to build a kernel.

The kernel grows feature by feature across this chapter and the next four. By the end of Part III, it’s a multicore preemptive kernel that can run multiple processes, time-share between them, block on I/O, and migrate work between cores. It will be approximately 200 lines of CML. The features add cleanly without restructuring the existing code, which is itself the demonstration we’re trying to make: the model composes.

This chapter is the starting point. A cooperative single-core kernel. About 40 lines. It does the minimum needed to be a kernel: multiplex a CPU across multiple flows of control using only the primitives we’ve built.

The full kernel

(* === Effect signatures === *)

effect Yield  : unit
effect Print  : string -> unit
effect Tick   : core_id -> unit

(* === Process and scheduler state === *)

type process = {
  id : int;
  k  : unit continuation;
}

type sched_state = {
  mutable ready   : process queue;
  mutable next_id : int;
}

let sched : sched_state = {
  ready   = Queue.empty;
  next_id = 0;
}

(* === Spawning a process === *)

let spawn (body : unit -> unit) : unit =
  let pid = sched.next_id in
  sched.next_id <- pid + 1;
  let k = reify (fun () -> body ()) in
  Queue.push sched.ready { id = pid; k }

(* === The scheduler loop === *)

let rec scheduler () : unit =
  match Queue.pop sched.ready with
  | None -> ()
  | Some p ->
      handle
        invoke p.k ()
      with
      | Yield k ->
          Queue.push sched.ready { id = p.id; k };
          scheduler ()
      | Print msg k ->
          uart_write msg;
          invoke k ()
      | Tick _ k ->
          Queue.push sched.ready { id = p.id; k };
          scheduler ()

(* === Two toy processes === *)

let proc_a () =
  perform (Print "A1\n");
  perform Yield;
  perform (Print "A2\n");
  perform Yield;
  perform (Print "A3\n")

let proc_b () =
  perform (Print "B1\n");
  perform Yield;
  perform (Print "B2\n");
  perform Yield;
  perform (Print "B3\n")

(* === Boot === *)

let main () =
  spawn proc_a;
  spawn proc_b;
  scheduler ()

When this runs, the output is:

A1
B1
A2
B2
A3
B3

Two processes, alternating, cooperatively yielding. About 40 lines. That is a kernel. A small one, missing many things, but a real kernel in the sense that matters: it multiplexes a CPU across multiple flows of control using only effects and continuations. Let’s go through it.

The effect declarations

effect Yield : unit
effect Print : string -> unit
effect Tick  : core_id -> unit

An effect declaration introduces a typed operation that user code can perform and that some surrounding handler must interpret. The return type — the part after the arrow, or the whole type for an argumentless effect like Yield — is the type the perform site receives back when the handler resumes it.

Declaring an effect does not say what it does. Yield has no implementation here. It’s an abstract operation. The implementation is whatever handler catches it. The same perform Yield could be interpreted as “put me on the back of the queue” by the cooperative scheduler, “yield to the highest-priority blocked process” by a priority scheduler, or “no-op” by a degenerate scheduler. The effect is the signature; the handler is the implementation.

Note Tick : core_id -> unit — the timer effect carries which core it’s targeting. In a multicore setting, the timer device performs (Tick 0), (Tick 1), etc., and the NoC routes the perform to the corresponding core. This is your inter-device-messaging-as-effects design showing up naturally. For now, single-core, the core_id is always 0.

The process type

type process = {
  id : int;
  k  : unit continuation;
}

A process is two things: an id and a unit continuation. Read unit continuation as “a continuation that, when invoked with a value of type unit, will run to completion (or until it performs another effect).” This is the type you wanted to see.

A continuation is the rest of the computation, reified as a value. A unit continuation means “this continuation doesn’t need any value to resume — it was suspended at a point that didn’t bind a result.” A perform Yield is suspended this way: the effect returns unit, so the continuation captured at the perform site is a unit continuation. When the scheduler invokes it with (), the suspended code resumes after the perform Yield.

A process, then, is literally just a continuation with metadata. There’s no task_struct. The continuation is the saved state.

The size of unit continuation as a value is one word — it’s just a pointer. The size as a resource is the size of the frame tree, which can be small (a process that has barely started) or large (a process deep in some recursion).

The scheduler state

type sched_state = {
  mutable ready   : process queue;
  mutable next_id : int;
}

A single mutable record holding the ready queue and a counter. mutable marks fields you’re allowed to reassign; ML languages default to immutable, and you have to opt in to mutation.

This is where ML’s purity-by-default policy bumps into systems reality. The scheduler is intrinsically stateful — the ready queue changes over time, and that change is observable by every scheduling decision. We can’t pretend otherwise. So we use mutable, and we accept that this code is not pure.

But notice what we’ve contained. The mutability is in one place, in a record with a well-defined invariant. Everything else in the kernel is pure: proc_a doesn’t mutate; the handler clauses receive continuations as values and dispatch them without mutating anything except sched.ready. The mutation is localized to the resource that genuinely needs to mutate.

This is the discipline. ML doesn’t ban mutation; it makes mutation visible and contained. When you read a function signature, the absence of mutation is a guarantee; the presence of mutation is a deliberate, marked choice.

Spawning

let spawn (body : unit -> unit) : unit =
  let pid = sched.next_id in
  sched.next_id <- pid + 1;
  let k = reify (fun () -> body ()) in
  Queue.push sched.ready { id = pid; k }

Three things happen: assign an id, reify body into a continuation, and push the new process onto the ready queue.

reify is the operation that converts a function-not-yet-run into a continuation-ready-to-be-invoked. Think of it as “boxing up the work but not starting it.” After reify, the continuation k is a value sitting in memory representing “the entire computation body (), paused at the very beginning.” When something later does invoke k (), the computation starts running.
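A two-line illustration of the boxing (assuming a Print handler is in scope when the continuation is invoked):

let k = reify (fun () -> perform (Print "hi\n")) in
(* nothing has printed yet — the work is boxed, not started *)
invoke k ()    (* now "hi" prints and the continuation runs to completion *)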

The scheduler loop

let rec scheduler () : unit =
  match Queue.pop sched.ready with
  | None -> ()
  | Some p ->
      handle
        invoke p.k ()
      with
      | Yield k -> ...
      | Print msg k -> ...
      | Tick _ k -> ...

This is the heart. Line by line:

The scheduler pops a process. If none, exit. If one, invoke its continuation inside a handle block. The handle block installs handlers for the three effects we care about; when the running process performs any of them, control transfers to the matching clause.

The handle ... with block installs a handler for the listed effects, then runs the body. The body is invoke p.k () — we run the process. The process runs inside this handler. When it performs an effect, control transfers to the matching clause of the handler. The clause receives the effect’s payload and the continuation k representing the rest of the process at the point of the perform.

Read that last sentence twice. The handler clause binds k. This k is the suspended process. When the process did perform Yield, the hardware captured the state of the process at that point and bound it to the name k in this handler clause.

The Yield clause

| Yield k ->
    Queue.push sched.ready { id = p.id; k };
    scheduler ()

When the running process performs Yield, the hardware binds the suspended continuation to k. We file k back into the ready queue, and recurse to schedule the next process.

We do not invoke k here. We hand it off to the queue. The current process is now “ready, not running” — and its representation is the continuation sitting in the queue.

This is the four-line context switch. This is the whole thing. Save = Queue.push sched.ready { id = p.id; k }. Restore = Queue.pop in the next iteration plus invoke p.k (). There is no save_context, no load_context, no register-saving prologue. The continuation is the save; the invoke is the restore.

The Print clause

| Print msg k ->
    uart_write msg;
    invoke k ()

When the process performs Print msg, we write to the UART (a primitive for now; in a real kernel this would itself be perform UartTx), then immediately invoke k () to resume the same process. The process did not yield; it just made a request that the kernel serviced synchronously.

Control does not return from invoke k () to the lines after it. invoke is a tail call into the resumed process. The process continues running inside this same handle block, until it performs another effect or finishes.

The Tick clause

| Tick _ k ->
    Queue.push sched.ready { id = p.id; k };
    scheduler ()

Tick is preemption. We’ve wired up the clause to show the shape; in this single-core teaching kernel, there’s no actual timer firing yet, so Tick never arrives. The next chapter connects a simulated timer that performs Tick at intervals, and we’ll see preemption work without any other code changing.

Note the body is identical to Yield. This is the unification: preemption and cooperative yield share an implementation. The only difference is who initiated the perform.

The processes

let proc_a () =
  perform (Print "A1\n");
  perform Yield;
  perform (Print "A2\n");
  ...

A process is just a function. Nothing distinguishes proc_a from any other unit-returning function except that it performs effects. The “processhood” of proc_a comes from being spawned — being reified into a continuation and filed in the scheduler’s queue.

This is the abstraction win. User-level code is written in straight-line, sequential, blocking style. There are no callbacks, no async/await coloring, no explicit state machines. The perform Yield looks like a function call. The fact that control might transfer to a completely different process and come back later is invisible to the source code.

In Linux, code that wants to yield calls schedule() and somehow magic happens. In an async runtime, code that wants to yield uses await and the function it’s in becomes part of a state machine you didn’t write. Here, code that wants to yield writes perform Yield, which is a function call with the obvious type, and the language semantics handle the rest.

Boot

let main () =
  spawn proc_a;
  spawn proc_b;
  scheduler ()

Spawn two processes, start the scheduler. The scheduler will pop proc_a, run it until it yields, file its continuation back, pop proc_b, run it until it yields, and so on, alternating, until both processes complete.

What this kernel is missing

I want to be honest about what’s not here, because the gaps are the next chapters:

No isolation. proc_a and proc_b share the same address space, the same scheduler state, everything. A buggy process could mutate sched.ready directly. Real isolation requires capability discipline — each process has access only to the resources it’s been granted — which is part of what linear types will enable.

No real preemption. Tick is wired up but nothing fires it. Chapter 11 adds a simulated timer device that performs Tick at intervals, and we’ll see preemption work without changing the scheduler at all.

No blocking. A process that wants to wait for I/O has nowhere to go — there’s only one queue. Chapter 12 adds blocked queues keyed by what’s being waited on.

One core. Chapters 13 and 14 replicate the scheduler per-core and add migration effects.

No memory management. Processes can allocate freely from a single shared heap. The real kernel introduces per-process arenas and the bump allocator.

Print is a stub. It should itself be an effect routed to a UART driver, not a direct call. Later chapters make this proper.

But the shape is right. Every one of those additions slots into the existing structure without reworking it. New effects, new handler clauses, new queues, more scheduler instances. The unification holds throughout.

What this chapter committed to

A working cooperative kernel in 40 lines, exercising the effect-and-continuation machinery to do something real. The shape that all later chapters extend.

Take a few minutes with the code before moving on. The thing I want you to feel before we add preemption is the flatness of the design — the scheduler is one match expression and a queue, a process is a continuation and an id, a context switch is Queue.push followed by Queue.pop and an invoke. Run the execution in your head: main calls spawn proc_a, which reifies proc_a into k_a and queues it; spawn proc_b does the same; scheduler pops k_a and invokes it; proc_a runs perform (Print "A1\n") which transfers to the Print handler clause, which writes “A1” and invokes the continuation immediately, returning into proc_a at the next line; proc_a runs perform Yield which transfers to the Yield handler clause, which queues the continuation and recurses into scheduler; scheduler pops k_b and invokes it; and so on.

When you’ve felt that, the rest of Part III is about adding features without losing the flatness.

Preemption without a separate context

The previous chapter’s kernel is cooperative: processes yield voluntarily. If proc_a never performed Yield, the kernel would run it forever, ignoring everyone else. This is the cooperative-only failure mode that early Mac OS, Windows 3.1, and certain game engines have lived with — any process that doesn’t yield politely can hang the whole system.

Preemption is the fix. The kernel takes the CPU back from a running process whether the process wants it or not. In a conventional kernel, this involves a wad of new machinery: an interrupt context, interrupt enable/disable around critical sections, a separate kernel stack, special discipline about what can run in interrupt context. Here, none of that exists. Watch.

Adding preemption

We add three things:

  1. A simulated timer device that performs Tick at intervals.
  2. CPU-bound processes that never voluntarily yield.
  3. …nothing else.

Specifically: the scheduler from the previous chapter requires no changes. The Tick clause was already there, anticipating the moment a timer would fire. Now that a timer exists and fires, the clause does its job.

(* === Effects (unchanged) === *)

effect Yield : unit
effect Print : string -> unit
effect Tick  : core_id -> unit

(* === Timer device abstraction (NEW) === *)

type timer = {
  mutable countdown : int;
  reload            : int;
}

let timer : timer = { countdown = 3; reload = 3 }

(* === Process and scheduler state (unchanged) === *)

(* ... process, sched_state, spawn, scheduler all unchanged ... *)

(* === CPU-bound processes that never yield (NEW) === *)

let proc_loop name iters () =
  let rec go n =
    if n >= iters then ()
    else begin
      perform (Print (name ^ string_of_int n ^ "\n"));
      go (n + 1)
    end
  in
  go 0

(* === Boot (slightly modified) === *)

let main () =
  spawn (proc_loop "A" 5);
  spawn (proc_loop "B" 5);
  scheduler ()

With preemption wired in, the output looks something like:

A0
A1
A2
B0
B1
B2
A3
A4
B3
B4

The exact interleaving depends on where in each process’s execution the timer fires — but the shape is what matters: each process runs for a quantum (here, three operations), then gets preempted, then resumes later. Neither process ever wrote perform Yield. They were preempted involuntarily.

What didn’t change

The scheduler. Not a line. The Tick clause was already there. The kernel was already preemption-capable; we just hadn’t connected a preemption source.

This is the unification paying off in real currency. In a traditional kernel, adding preemption means writing interrupt handlers, modifying the scheduler to be reentrant from interrupt context, managing interrupt enable/disable around critical sections, and dealing with the fact that interrupt context and process context have different stacks and different rules. Here, none of that exists, because there is no separate interrupt context. Tick is an effect; effects go to handlers; the scheduler is a handler. Same machinery.

The processes. proc_loop is straight-line code with no awareness that it might be preempted. It doesn’t call yield. It doesn’t check a preemption flag. It doesn’t cooperate. It just runs, and the kernel takes the CPU back when it wants to. This is the property real operating systems need: user code should not have to be aware of scheduling, and here it isn’t.

What did change

The timer. A new device with a countdown register and a reload value. The semantics: the ISA decrements countdown on every instruction (or every cycle, or every N cycles), and when it hits zero, the next instruction boundary is treated as a perform (Tick core_id) rather than executing the program’s next instruction. The countdown reloads from reload, and the cycle continues.

This is the simplest possible timer. A real one would let you program the reload, mask the interrupt, query the current value. The point here is the interface: from the language’s perspective, the timer is a thing that occasionally performs an effect.
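In the same pseudocode style as the PERF semantics, the timer’s behavior at each boundary — a model of what the hardware does, not kernel code:

on each instruction boundary:
  timer.countdown -= 1
  if timer.countdown == 0:
    timer.countdown = timer.reload
    inject perform (Tick core_id)   ; instead of the program's next instruction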

The processes are CPU-bound. No perform Yield anywhere in proc_loop. The processes just compute. Without preemption, the first process to run would run to completion before the second got a chance.

What’s happening at the ISA level

This is where it gets interesting, because the mechanism for preemption involves the hardware doing something the previous chapter’s kernel didn’t require.

When proc_loop "A" 5 is running and the timer expires, here’s the sequence:

  1. The processor is executing some instruction inside proc_loop — say, the recursive call to go. The countdown hits zero.
  2. At the next instruction boundary, instead of executing the next instruction of proc_loop, the ISA inserts a perform. It’s as if the running code had written perform (Tick 0) at this point, except the code didn’t write that; the hardware did.
  3. The hardware captures the continuation at this instruction boundary. This is the critical operation. The continuation represents “the rest of proc_loop from this exact point” — including the recursive call that was about to happen, the local variable n, everything.
  4. The hardware searches up the handler chain for a frame handling Tick. It finds the scheduler’s handle ... with frame. It transfers control there, binding the continuation to k in the Tick _ k -> ... clause.
  5. The handler files k back in the ready queue and calls scheduler (), which pops the next process.

Steps 2 and 3 are the new ISA capability required for preemption. The hardware has to be able to capture a continuation at an arbitrary instruction boundary, not just at a perform written in the source code.

Design space for preemption points

Three options:

Precise, every boundary. The hardware can capture a continuation at any instruction boundary, full stop. Every instruction is potentially preemptable. Most flexible and most expensive — every instruction boundary must be in a state from which the program can resume coherently.

Safe points only. The hardware can capture only at safe points — instruction boundaries the compiler has marked as preemptable. Between safe points, the timer is checked but the preempt is deferred. The GC-style approach.

Atomic regions. Preemption is precise (any instruction boundary), but the type system can mark code regions as atomic where preemption is forbidden. Lifted into the type system.

Our design (committed in the ISA spec) is option 1: precise preemption at every instruction boundary, with the preempt sampled at the end of each sequencer cycle (Fetch→Decode→Execute→Write→Preempt). Every instruction either retires fully or doesn’t retire at all. The cost is that every instruction has to commit to a clean state before retiring, which modern out-of-order processors already do for precise exceptions — so the additional cost is less than it sounds.
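In pseudocode, the sequencer cycle with its sample point — illustrative, mirroring the spec’s discipline rather than quoting it:

loop:
  instr   = fetch(pc)
  decoded = decode(instr)
  result  = execute(decoded)
  write_back(result)          ; the instruction fully retires here
  if pending_perform():       ; the Preempt sample point
    capture continuation at this boundary
    dispatch to handler (CAM, then effect table)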

Where the captured continuation points

When proc_loop "A" is preempted in the middle of a recursive call, the captured continuation represents the state inside go, with n bound to whatever value it had, with the call to go partway through. Invoking this continuation later resumes the recursion from that point.

This is different from a thread context switch in two important ways:

There’s no separate stack to save. The continuation is the tree of frames; the frames already live wherever frames live. Capturing the continuation is capturing a pointer to the root. No memcpy.

There’s no “kernel stack” vs. “user stack” distinction. The handler clause for Tick runs in the same frame-tree environment as the process that was preempted, just with control now in the handler’s frame. The handler’s frames are siblings to the process’s frames in the tree. When the handler eventually invokes some continuation, control transfers to that subtree; the handler’s frames may persist if the handler hasn’t returned yet.

This second point does a lot of work. On a conventional machine, kernel code runs on a different stack from user code — partly for protection, partly because the kernel needs to handle interrupts that arrive while user code is mid-instruction. On our machine, the handler chain itself is the protection boundary. The user code is “inside” the scheduler handler — under it, in the frame tree. When the timer perform fires, control transfers from the user code’s frame to the handler’s frame upward in the same tree. There’s no separate stack because there’s no separate execution context; it’s all one tree of frames with a clear hierarchical relationship.

What enforces protection? The type system. The handler frame holds references to sched.ready and other kernel state; the user code’s frame does not, because it was never given those references. The user code physically cannot access kernel state — not because it’s on a different stack, but because no name in scope refers to kernel state.

This is exactly the capability model. Protection comes from what’s reachable through the lexical environment, not from page-table permissions. In our ISA, hardware page protection may be unnecessary or merely redundant — the type system has already prevented the access.

Preemption during a handler

What about reentrancy? If the scheduler handler is itself running — say, in the middle of Queue.push — and the timer fires, what happens?

This is solvable in two ways:

Hardware mask. The PERF instruction in handler context defers any pending Tick until the handler returns. Like interrupt masking on conventional hardware.

Tower discipline. The kernel installs a handler tower such that, during scheduler critical sections, there is no in-scope Tick handler. The handler search finds the bootloader’s no-op handler instead. The timer fires but is silently dropped (or queued for later via the no-op handler’s side effect).

We choose the tower discipline. The boot frame installs a Tick handler that does nothing — just resumes the continuation. The scheduler frame, deep inside its critical sections, doesn’t install its own Tick handler until it’s about to invoke a process; once it does, the user process’s preemption is enabled. When the user process is preempted and the handler clause runs, it uninstalls the Tick handler (implicitly, by returning into the scheduler’s outer frame) and re-installs it just before invoking the next process.

This means timer ticks arriving during scheduler critical sections are effectively dropped. For a timer that fires every few thousand instructions, dropping the rare tick that lands during a scheduler critical section is fine — the next tick will arrive shortly, and the scheduler isn’t slow.

This is preemption masking implemented entirely in the language’s normal handler discipline, with no special interrupt-mask register, no cli/sti, no separate hardware mechanism. The mask is which handlers are currently in scope. The kernel structure expresses it directly.
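A sketch of the tower in code: the boot frame installs the no-op handler, and the scheduler’s real Tick clause is in scope only while a process runs inside its handle block:

(* Outermost boot frame — any Tick landing here is dropped. *)
handle
  kernel_main ()              (* the scheduler runs inside this frame *)
with
| Tick _ k -> resume k ()     (* tick during a critical section: drop it *)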

Edges to acknowledge

The model has two places where the abstraction strains, worth flagging:

Atomicity of preemption. The hardware has to either commit fully before retiring an instruction (no partial state visible at any boundary) or expose the partial state in a way the continuation can capture. The former is the discipline we want; it constrains what instructions can exist (no implicit multi-step operations) but keeps the model clean. Multi-cycle operations (NoC sends, complex effects) are allowed, but they commit at the end of the cycle, not partway through.

Pending performs. If a timer perform arrives while a previous perform is mid-execution, it queues up rather than firing immediately. The queue has bounded depth; if it fills, the hardware applies backpressure to the timer (which can simply drop ticks, since the next will arrive in a fixed interval). For a normal teaching kernel, this never bites.

What we’ve built so far

The kernel is now about 60 lines, and it has:

  • Multiple processes, identified by id, represented as continuations
  • Cooperative yielding via perform Yield
  • Preemptive scheduling via perform Tick from a timer device
  • Output via perform Print (a stand-in for a proper UART driver)
  • A scheduler that’s a single match expression of three handler clauses

The same kernel, with no scheduler modification, would handle a process that mixes voluntary yields and involuntary preemptions. The same kernel would handle ten processes or a thousand. The same kernel, with one new effect clause, would handle process exit, blocking on I/O, sleeping for a duration, sending IPC to another process — any of these is another effect signature and another handler clause, slotting into the existing structure.

What this chapter committed to

Preemption added without modifying the scheduler. The unification of voluntary yield and involuntary preemption under a single handler clause. The precise-preemption-at-instruction-boundary commitment. The tower-discipline approach to interrupt masking. The hint that protection is enforced by lexical scoping plus type system, not by separate execution contexts.

The next chapter adds blocking on I/O.

Blocking, I/O, and the unification

So far our processes either run, yield, or get preempted. They never wait for something. A real kernel has to handle the case where a process can’t proceed until something external happens — input arrives, a disk read completes, a timer expires, another process sends a message.

This is where the unification pays its biggest dividend, because blocking is the same mechanism as everything else we’ve built. A blocked process is a continuation that’s been filed somewhere other than the ready queue. Unblocking is moving the continuation back to the ready queue. The data structures are different; the mechanism is identical.

Let’s add input. We’ll model a simulated input device — call it a keyboard for concreteness — that occasionally performs an effect saying “here’s a byte.” Processes that want input perform an effect saying “give me a byte, blocking until one arrives.”

Two new effects

effect ReadByte    : byte             (* user-process effect: block until byte available *)
effect KeyArrival  : byte -> unit     (* device-initiated effect: a byte arrived *)

ReadByte : byte is performed by user code. Its return type is byte — when the handler resumes the continuation, it provides a byte, which becomes the value of perform ReadByte at the call site. This is the first effect we’ve seen whose return type is not unit; the data flow goes both directions through the perform.

KeyArrival : byte -> unit is performed by the keyboard device. Its argument is the byte that arrived; its return type is unit because the device doesn’t care about a response, it just wants to deliver the byte.

The two effects together implement a channel: a queue of bytes that producers push to and consumers pull from. We’ve built a synchronization primitive — the kind you’d find in Erlang as a mailbox, in Go as a channel, in Linux as a pipe — out of the effect mechanism with no special support.

Extending the kernel

type sched_state = {
  mutable ready    : process queue;
  mutable waiters  : process queue;   (* processes blocked on ReadByte *)
  mutable inbox    : byte queue;      (* bytes received but not yet consumed *)
  mutable next_id  : int;
}

let rec scheduler () : unit =
  match Queue.pop sched.ready with
  | None -> 
      if Queue.is_empty sched.waiters then ()
      else scheduler ()    (* idle: spin waiting for device events *)
  | Some p ->
      handle
        invoke p.k ()
      with
      | Yield k ->
          Queue.push sched.ready { id = p.id; k };
          scheduler ()
      | Print msg k ->
          uart_write msg;
          invoke k ()
      | Tick _ k ->
          Queue.push sched.ready { id = p.id; k };
          scheduler ()
      | ReadByte k ->
          (match Queue.pop sched.inbox with
           | Some b -> 
               invoke k b               (* byte available, resume immediately *)
           | None ->
               Queue.push sched.waiters { id = p.id; k };
               scheduler ())
      | KeyArrival b k ->
          (match Queue.pop sched.waiters with
           | Some waiter ->
               Queue.push sched.ready { id = p.id; k };
               invoke waiter.k b
           | None ->
               Queue.push sched.inbox b;
               invoke k ())

let echo_proc () =
  let rec loop () =
    let b = perform ReadByte in
    perform (Print (Printf.sprintf "got: %d\n" b));
    loop ()
  in
  loop ()

let main () =
  spawn echo_proc;
  spawn (proc_loop "B" 5);
  scheduler ()

When run with a simulated keyboard that performs KeyArrival occasionally, this kernel correctly multiplexes: echo_proc blocks until input arrives, the B process runs in the meantime, and input arrival wakes echo_proc to handle the byte. Two new effect clauses, two new queues, one extra idle case.

The ReadByte clause

| ReadByte k ->
    (match Queue.pop sched.inbox with
     | Some b -> invoke k b
     | None ->
         Queue.push sched.waiters { id = p.id; k };
         scheduler ())

The handler is invoked when a process performs ReadByte. The continuation k represents “the process at the point just after perform ReadByte, waiting for a byte to be supplied.”

If the inbox has a byte, we deliver it by invoking k b. The process resumes with b as the value of its perform ReadByte expression. No yielding, no rescheduling — the process gets its byte and continues.

If the inbox is empty, we file the continuation in waiters and call the scheduler. The current process is now blocked — its continuation is in waiters, not in ready. The scheduler will pop a different ready process and run it. The blocked process will only resume when something puts it back in ready, which is exactly what KeyArrival does.

Key insight: the continuation k is the only thing that represents the blocked process. There’s no task_struct updated with a “blocked” status flag, no priority adjusted, no separate sleep queue. There is simply: the continuation is in waiters instead of ready. The location of the continuation is the process’s state.

The KeyArrival clause

| KeyArrival b k ->
    (match Queue.pop sched.waiters with
     | Some waiter ->
         Queue.push sched.ready { id = p.id; k };
         invoke waiter.k b
     | None ->
         Queue.push sched.inbox b;
         invoke k ())

The KeyArrival effect is performed by the device, not by user code. When the keyboard has a byte to deliver, it performs KeyArrival b from outside the user-process context.

But notice the structure: the handler is invoked while some user process is running. The device’s perform is, mechanistically, the same as a timer’s perform — it interrupts the currently-executing code. So when the KeyArrival clause runs, there’s a user process currently in flight, and k is that user process’s continuation, not the device’s.

This is a subtle bit. There are two continuations in play during KeyArrival:

  • k: the continuation of whatever user process was running when the device perform arrived. The interrupted process.
  • waiter.k: the continuation of the process that was previously blocked on ReadByte. The process we want to wake up.

The handler decides what to do with both. If there’s a waiter, we requeue the interrupted process (it’ll get its turn back later), and we invoke the waiter’s continuation with the byte. The waiter gets the byte, the interrupted process is fairly requeued, and scheduling continues from there.

If there’s no waiter, we buffer the byte in the inbox and invoke k () to resume the interrupted process. No scheduling happens; the user process gets right back to running.

This is exactly the semantics of read(2) in Unix when data arrives: a process blocked in read gets woken up; a process not blocked just sees data buffered for later. Implemented in eight lines of handler clause, with no special interrupt-context discipline, no separate kernel stack, no wake_up_interruptible.

The idle case

| None -> 
    if Queue.is_empty sched.waiters then ()
    else scheduler ()

When the scheduler tries to pop a ready process and finds the queue empty, there are two cases. Either there are no waiters, in which case the system is genuinely done and we exit. Or there are waiters, in which case every process is blocked waiting on something — and we… spin.

This is the idle loop, and it’s where the design has its weakest moment. Spinning is bad: it burns CPU, generates heat, accomplishes nothing. A real kernel would issue HLT (x86) or WFI (ARM) to halt the core until an interrupt arrives.

On our ISA, the equivalent is perform Halt or similar: an effect that signals “I have nothing to do; wake me when a remote perform arrives.” The hardware implementation would gate the clock or otherwise reduce power until the NoC delivers a message. This is the HALT instruction in the ISA, and the scheduler’s idle case should perform it.

In the kernel as drafted, we busy-loop with scheduler () because the integration with HALT is an exercise for later. Worth flagging: real kernels need a hardware-supported idle, and a clean effect-based design surfaces this as just another instruction.
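A sketch of what that integration would look like, assuming a Halt effect wired to the ISA’s HALT instruction (the effect name is illustrative):

effect Halt : unit    (* handled below the kernel; wakes on the next device perform *)

(* the scheduler's idle case, revised: *)
| None ->
    if Queue.is_empty sched.waiters then ()
    else begin
      perform Halt;    (* core sleeps until a device performs an effect *)
      scheduler ()
    end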

Why this is meaningfully different from threads

Take a moment with this: echo_proc is a function that calls ReadByte in a loop. It’s straight-line, blocking, sequential code. It reads like a script.

In Linux, the equivalent C code would call read(fd, &b, 1) — also blocking, also sequential. The kernel handles the blocking via its scheduler. So far this looks the same.

But the cost differs in important ways:

In Linux, each process has a kernel stack, a thread control block, page tables, and assorted machinery. A blocked process consumes kilobytes of kernel memory. Creating one is hundreds of microseconds of syscall and allocation. Linux can handle thousands of threads gracefully but starts to strain at tens of thousands.

In our kernel, a blocked process is a continuation in a queue. The continuation is a pointer to a frame tree. The frame tree’s size is the size of the process’s actual live state at the perform site. A process blocked at perform ReadByte after just starting has a tiny frame tree; a process blocked deep in a parser has a larger one. Cost is proportional to genuinely live state. Millions of blocked processes are plausible if their state is small.

This is why effect-handler-based concurrency is being explored as a replacement for thread pools in serious systems: the cost per blocked thing is fundamentally lower because the abstraction doesn’t bake in fixed-size overhead. You’re seeing the same thing async/await delivers in modern languages, except here it’s not bolted on: the language wasn’t designed with sync blocking and then patched to support async via state-machine transforms. The language is effect-native; sync and async are the same code path.

What’s still missing

This kernel can now:

  • Run multiple cooperating processes
  • Preempt them on a timer
  • Block them on input
  • Buffer input while no consumer is waiting
  • Deliver buffered input immediately when a consumer asks

It still can’t:

  • Talk to multiple devices (we have one keyboard; what about a UART send queue, a network card, a disk?)
  • Time out a blocking read
  • Send messages between processes
  • Spawn a process from inside another process
  • Exit a process cleanly

Each of these is another effect or two and another queue or table. The kernel grows feature by feature without restructuring. Process exit is an effect whose handler drops the continuation and calls the scheduler. Process-to-process messaging is a pair of effects and a per-process mailbox queue. Timeouts are an effect plus a sorted timer queue. Each addition fits cleanly into the existing match expression.
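To make one of those concrete, here is a sketch of process exit (Exit is an illustrative name, not yet part of the kernel):

effect Exit : unit

(* one new clause in the scheduler's handler: *)
| Exit _k ->
    (* the continuation is dropped, not requeued; the process is gone *)
    scheduler ()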

What this chapter committed to

Blocking I/O as a continuation-in-a-different-queue. The asymmetric data flow of ReadByte : byte (response carries value) vs. KeyArrival : byte -> unit (request carries value). The unification of “wake up a blocked process” with “schedule a runnable process” under one mechanism. The acknowledgment that real kernels need a HALT for the idle case.

The next chapter introduces linear types, which is what makes the next step — multicore — safe.

Linear types and the multicore generalization

This is the chapter where the type system stops being a clever discipline and becomes a load-bearing structural element. We’re going to introduce linear types from scratch, build the intuition for why they matter, and then show that the same mechanism that makes them useful for resource management is what makes multicore code safe.

Linear types are the single most important type-system feature for systems work, and they’re the reason the kernel’s “same code on single and multicore” property is real rather than aspirational.

What linear types are

Start from a familiar place. In ordinary type systems — including the ML we’ve built so far — a value of type int can be used as many times as you like:

let x = 5 in
let y = x + x in     (* x used twice *)
let z = x * 2 in     (* used a third time *)
y + z

You created x, you used it three times, nothing bad happened. This works because ints are copyable: making more of them out of one of them is free and harmless.

Now consider a different value. Suppose you have a file handle:

let f = open_file "data.txt" in
let bytes_a = read f in
let bytes_b = read f in
close f;
read f                   (* uh oh *)

The last read f happens after the file has been closed. The compiler doesn’t catch this. The runtime traps, or worse, returns garbage. This is the use-after-close bug that every C and C++ programmer has shipped to production at some point.

The problem is that f is being treated as copyable when it shouldn’t be. There’s only one real file handle — it’s a resource in the operating system, with a single identity — but the type system lets us use the name f as many times as we want, including after we’ve destroyed the underlying resource.

Linear types are types whose values must be used exactly once. If file_handle were a linear type, the code above wouldn’t compile. The compiler would tell you that f is used four times when it must be used exactly once. To make it compile, you’d have to write it like this:

let f0 = open_file "data.txt" in
let (bytes_a, f1) = read f0 in    (* read consumes f0, returns bytes AND a fresh handle *)
let (bytes_b, f2) = read f1 in    (* uses f1, not f0; f0 is gone *)
close f2                            (* consumes f2; no handle remains *)

Each operation consumes its input handle and produces a fresh one. After read f0 runs, f0 is no longer usable — it’s been consumed. The compiler tracks this. You cannot use f0 again because the type system has marked it as “spent.” The result of read is a pair: the bytes, and a new handle f1 that represents the file’s state after the read. To read again, you use f1. After close f2, no handle exists; the resource is gone, and the type system knows it.

This is linearity. The “exactly once” rule turns resource management from a runtime discipline into a compile-time guarantee. You cannot use-after-close. You cannot double-close. You cannot leak — the compiler will complain that f is used zero times, which is also illegal.

Affine types, the practical sibling

Strict linearity (“exactly once”) is sometimes too strict. What about errors? If open_file succeeds and then your program panics before closing, you’ve failed to use the handle. Real systems usually want a slightly weaker rule: at most once. You can drop a value on the floor (the runtime will clean it up), but you cannot use it twice.

This weaker rule is called affine typing. Rust’s borrow checker is built on affine types: values can be moved (consuming them) or dropped (also consuming them, automatically at end-of-scope), but they cannot be used after being moved. Most production “linear-ish” type systems are actually affine, including the one we’ll specify for CML.

For our purposes, the distinction matters but isn’t critical. When this book says “linear types,” it means “affine types in practice, linear in theory.” If you read other literature, this is the same elision most authors make.

Why this matters for our design

Three uses in CML, in order of obviousness:

First, resources. File handles, network sockets, memory allocations from a region, locks. These exist in the world with a single identity. Linear typing lets us write acquire and release operations that the compiler ensures are paired correctly.

Second, mutable state. A mutable array, conceptually, can’t be aliased — if two parts of the program both hold the array and both modify it, you have a data race. Linear typing lets you encode “this array has one owner at a time” in the type system. Pass it to a function; you no longer own it; the function returns it (possibly modified) and now you own it again.

Third — and this is the one that matters for multicore — capabilities. A capability is “permission to do a thing.” The capability to read from device 0x05. The capability to allocate from arena R. The capability to invoke continuation K. Capabilities are inherently linear: you have one, you can use it, and using it might consume it. The whole structure of “who is allowed to do what” in a multicore system can be encoded as the flow of linear capabilities through the program.

A small example

type linear region    (* a "region" capability — linear *)

effect Region {
  open  : () -> region              (* creates a fresh region capability *)
  alloc : region -> (region, ptr)   (* consumes capability, returns it back *)
  close : region -> ()              (* consumes capability, returns nothing *)
}

let example () : () / {Region} =
  let r0 = perform Region.open () in
  let (r1, p1) = perform Region.alloc r0 in    (* r0 consumed; r1 fresh *)
  let (r2, p2) = perform Region.alloc r1 in    (* r1 consumed; r2 fresh *)
  perform Region.close r2                       (* r2 consumed *)
  (* attempting `perform Region.close r0` here is a type error *)

This is the threading pattern: each operation consumes one capability and produces the next. The compiler counts uses; each must be exactly once. There is no way to use r0 after Region.alloc r0 returns, because r0 is no longer in scope under linearity rules — the value has been moved into the perform.

This pattern is verbose. Real linear-type languages have syntactic sugar to make it readable (Rust’s borrow notation, for instance). For CML, we’ll specify the sugar later; for now, the explicit threading is fine because it makes the semantics visible.

Now: back to the kernel

The cooperative kernel has a process type:

type process = {
  id : int;
  k  : unit continuation;
}

And a scheduler that holds processes in a ready queue, pops one, runs it, files the resulting continuation back.

Now imagine we want to run this on four cores. Each core has its own scheduler. A process running on core 0 might want to migrate to core 1 — perhaps because core 0 is overloaded.

What does migration look like? Conceptually:

effect Migrate {
  to_core : core_id -> ()
}

| Migrate.to_core target, k ->
    send_continuation_to_core target k;
    scheduler ()

The process performs Migrate.to_core 1. The handler captures the continuation k, sends it as a message to core 1’s scheduler, and resumes the local scheduler. Core 1 receives it and runs it.

This is what the unification promised. The mechanism for migration is the same mechanism we’ve used for everything else.

Where it breaks without linearity

But notice: when core 0 sends the continuation to core 1, core 0 has just shipped a value. If core 0 keeps a copy of k and core 1 receives a copy too, and both invoke it, what happens?

The process exists twice. Both cores think they’re running it. The process’s local variables get incremented independently on two cores. If the process holds a file handle, two cores try to read from the same file. If the process is an actor with a mailbox, messages get processed twice. The system’s state diverges. Welcome to a distributed-systems consistency problem.

The fix is: only one core may invoke k. Whichever sends it must lose ownership. Whichever receives it must gain ownership. The continuation is a resource with one identity, and ownership must be tracked.

This is exactly what linear types do.

Continuations are linear by their nature

Make continuation a linear type in CML:

type linear continuation a

Now every operation involving a continuation must thread it explicitly. Invoking a continuation consumes it. Sending it to another core consumes it. You cannot invoke a continuation twice; you cannot keep a copy after sending.
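A tiny illustration of what the compiler now rejects (bad is a hypothetical function):

let bad (k : unit continuation) =
  invoke k ();
  invoke k ()    (* compile error: k was consumed by the first invoke *)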

Look at the scheduler’s handler clauses, with linearity now visible:

| Yield k ->     (* k is linear *)
    Queue.push sched.ready { id = p.id; k };   (* consumes k by moving it *)
    scheduler ()

The k enters the handler, gets moved into the process record, gets moved into the ready queue. The handler does not have access to k after the Queue.push. If it tried to also invoke k after pushing, the compiler would reject — k has been moved.

This sounds like a constraint on the kernel author. It is — but a productive one. Every place a continuation is touched, the type system has tracked who owns it. Use-after-resume becomes a compile error. Double-invoke becomes a compile error. Forgetting to file the continuation becomes a compile error (it would be unused, violating linearity).

Now the migration handler:

| Migrate.to_core target, k ->
    perform NoC.send target (encode_continuation k);   (* consumes k *)
    scheduler ()

The continuation k is consumed by NoC.send — it gets serialized into a packet and shipped to target. Core 0 no longer has k. Core 1 will receive a fresh handle to the deserialized continuation in its incoming-message handler. Ownership has transferred. The type system enforces it.

Why this makes single-core and multicore the same code

This is the payoff. The kernel code we’ve written — the scheduler, the handler clauses, spawn, proc_loop, all of it — doesn’t change for multicore. Not a line. What changes is:

  1. The Queue.push / Queue.pop operations. On a single core, they’re local data structure operations. On multiple cores, they might be remote-aware: pushing to a “ready queue” might mean pushing to the local queue or sending a message to another core whose queue is the target.

  2. The Mem.alloc handler. Single core: bumps a local heap pointer. Multicore: allocates from per-core arenas, with cross-core sharing handled via specific effects.

  3. The migration handler clause is added, if you want it. If you don’t, processes run on whatever core they were spawned on, forever.

But the body of the scheduler — the part that does the actual work — is identical. Same match expression, same handler clauses, same logic. The differences are entirely in the implementation of the data structures the scheduler uses.

The reason this works is the handler-as-policy / continuation-as-mechanism separation, plus linearity to make ownership explicit. The kernel writes “file this continuation in the ready queue.” The implementation decides whether that means a memory write or a NoC packet send. The kernel doesn’t have to know, because the type system has already proven the operation is sound — linearity ensures the continuation is consumed exactly once regardless of where it ends up.

This is what people mean when they say “effect handlers + linear types compose.” The handler abstracts the policy. The linear type abstracts the ownership. Together, they make the user of the abstraction (the kernel) invariant under changes in the implementation. Single-core, multicore, distributed, doesn’t matter to the kernel code. It matters to the queue implementation.

What linearity bought us, specifically

Linearity is doing work in three places:

  1. At the continuation type. Continuations are linear. They cannot be duplicated. This means k in any handler clause can be moved into a queue, sent over the NoC, or invoked — but only one of these.

  2. At process records. A process contains a linear continuation, so by transitive linearity, process is also linear. A process record cannot be in two queues simultaneously. When core 0 sends a process to core 1, core 0 loses it.

  3. At any future “shared resource” types. File handles, NIC descriptors, locks, allocator capabilities — all linear. The system’s resource discipline is uniform.

Linearity is not doing work for:

  • Ordinary integers, bools, strings, etc. These are copyable.
  • Read-only data structures.
  • Closures, in general (but closures that capture linear values become linear).
  • Effect declarations or handler descriptions.

The discipline targets exactly the things that matter: resources with identity. Everything else stays cheap and copyable.

How CML expresses linearity

A type can be declared linear:

type linear continuation a
type linear file_handle
type linear region

Functions taking linear arguments must consume them — pass them on to another function, store them somewhere, or explicitly drop them. The compiler tracks ownership through pattern matching and let-bindings.

Effect operations whose argument or return type is linear thread ownership through the perform/resume cycle. perform Ready.push p consumes p; the handler receives p and can use it once.

Records with linear fields are themselves linear. Sum types with linear variants are linear in those variants. The discipline propagates up.

The compiler can usually infer linearity, but explicit annotations are encouraged at API boundaries for documentation.
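A minimal sketch of these rules in action, assuming an explicit drop primitive for disposal (the name is illustrative):

type linear file_handle

let use_twice (f : file_handle) =
  close f;
  close f          (* compile error: f already consumed *)

let leak (f : file_handle) =
  ()               (* compile error under strict linearity: f never used *)

let discard (f : file_handle) =
  drop f           (* legal under the affine rule: explicit disposal *)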

What this chapter committed to

Linear types defined from scratch. Their use for resources, mutable state, and capabilities. The recognition that continuations are inherently linear. The realization that linearity is what makes multicore migration safe by enforcing single-ownership of continuations. The setup for the next chapter, which assembles the full multicore kernel.

The full multicore kernel

We have everything we need. Linear continuations, effects as the dispatch mechanism, NoC packets as the cross-core transport, and the discipline that the kernel code is invariant to whether it’s running on one core or many. Let me write the full multicore kernel, then walk through what’s new and what isn’t.

The structure: I’ll show the complete kernel first, then dissect it. Pay attention to how much of it you’ve already seen.

The full kernel

(* === Core identification === *)

type core_id = int    (* 0..3 for our 4-core toy *)

let this_core () : core_id / {Unsafe} =
  unsafe { __read_core_id_register () }

(* === Process representation === *)

type process = {
  id        : int;
  home_core : core_id;       (* where this process was spawned *)
  k         : unit continuation;
}

(* The 'k' field is linear; therefore process is linear. *)

(* === Effect interfaces === *)

effect Ready {
  push : process -> ()
  pop  : () -> option process
  steal_from : core_id -> option process    (* for work-stealing *)
}

effect Migration {
  to_core : core_id -> ()
}

effect ProcLifecycle {
  spawn : (() -> ()) -> ()
  exit  : () -> empty
  yield : () -> ()
}

effect IO {
  print     : string -> ()
  read_byte : () -> byte
}

effect Mem {
  alloc : { layout : layout; size : int } -> ptr
  free  : ptr -> ()
}

effect Hardware {
  tick           : core_id -> ()
  noc_arrival    : { src : core_id; op : int; payload : int } -> ()
  key_arrival    : byte -> ()
}

(* === Core-local scheduler state === *)

type sched_state = {
  mutable ready    : process queue;
  mutable waiters  : process queue;
  mutable inbox    : byte queue;
  mutable next_id  : int;
  this_core        : core_id;
}

let sched : sched_state = init_sched_state (this_core ())

(* === The scheduler === *)

let rec scheduler () : () / {Ready, Mem, IO, ProcLifecycle, Hardware} =
  match perform Ready.pop () with
  | None -> idle_or_steal ()
  | Some p ->
      with proc_handlers p in {
        invoke p.k ()
      }

and proc_handlers p = {
  | ProcLifecycle.yield (), k ->
      perform Ready.push { id = p.id; home_core = p.home_core; k };
      scheduler ()

  | ProcLifecycle.exit (), _k ->
      (* k is dropped; the process is gone *)
      scheduler ()

  | ProcLifecycle.spawn body, k ->
      let new_p = make_process body (this_core ()) in
      perform Ready.push new_p;
      resume k ()

  | Migration.to_core target, k ->
      if target == this_core () then begin
        perform Ready.push { id = p.id; home_core = p.home_core; k };
        scheduler ()
      end else begin
        perform NoC.send target (Wrap_RemoteProcess
          { id = p.id; home_core = p.home_core; k });
        scheduler ()
      end

  | Hardware.tick _, k ->
      perform Ready.push { id = p.id; home_core = p.home_core; k };
      scheduler ()

  | IO.read_byte (), k ->
      (match Queue.pop sched.inbox with
       | Some b -> resume k b
       | None ->
           Queue.push sched.waiters { id = p.id; home_core = p.home_core; k };
           scheduler ())

  | IO.print msg, k ->
      perform UartTx msg;
      resume k ()

  | Hardware.key_arrival b, k ->
      (match Queue.pop sched.waiters with
       | Some waiter ->
           perform Ready.push { id = p.id; home_core = p.home_core; k };
           invoke waiter.k b
       | None ->
           Queue.push sched.inbox b;
           resume k ())

  | Hardware.noc_arrival { src; op; payload }, k ->
      handle_noc_message src op payload;
      resume k ()
}

and idle_or_steal () : () / {Ready, Mem, IO, ProcLifecycle, Hardware} =
  let victim = pick_victim_core () in
  match perform Ready.steal_from victim with
  | Some p ->
      perform Ready.push p;
      scheduler ()
  | None ->
      unsafe { __halt () };
      scheduler ()

and handle_noc_message src op payload : () / {Ready, Mem} =
  match op with
  | NOC_OP_MIGRATE ->
      let p = decode_process payload in
      perform Ready.push p
  | NOC_OP_STEAL_REQUEST ->
      (match Queue.pop sched.ready with
       | Some stolen ->
           perform NoC.send src (Wrap_StealReply (Some stolen))
       | None ->
           perform NoC.send src (Wrap_StealReply None))
  | NOC_OP_STEAL_REPLY ->
      ()    (* resolved through the pending-replies table; see wait_for_steal_reply *)
  | _ -> ()

Bottom handlers — the part that differs per deployment

(* === Single-core bottom handler === *)

handler local_ready_only = {
  | Ready.push p, k         -> Queue.push sched.ready p; resume k ()
  | Ready.pop (), k         -> resume k (Queue.pop sched.ready)
  | Ready.steal_from _, k   -> resume k None
  | return v                -> v
}

(* === Multicore bottom handler with work-stealing === *)

handler distributed_ready = {
  | Ready.push p, k ->
      if Queue.length sched.ready > LOAD_THRESHOLD then
        let target = least_loaded_neighbor () in
        perform NoC.send target (Wrap_Migrate p)
      else
        Queue.push sched.ready p;
      resume k ()

  | Ready.pop (), k ->
      resume k (Queue.pop sched.ready)

  | Ready.steal_from victim, k ->
      perform NoC.send victim Wrap_StealRequest;
      let reply = wait_for_steal_reply () in
      resume k reply

  | return v -> v
}

(* === Core-local memory handler === *)

handler local_arena_mem = {
  | Mem.alloc { layout; size }, k ->
      let p = bump_allocate_local size in
      let _ = write_header p layout size in
      resume k p
  | Mem.free p, k ->
      (* bump allocation: free is a no-op; the arena is reclaimed in bulk *)
      resume k ()
  | return v -> v
}

Boot

let boot () : () =
  unsafe { configure_effect_table () };

  with local_arena_mem in {
    with distributed_ready in {       (* or local_ready_only on single-core *)
      if this_core () == 0 then begin
        perform ProcLifecycle.spawn proc_a;
        perform ProcLifecycle.spawn proc_b;
        perform ProcLifecycle.spawn proc_echo
      end;
      scheduler ()
    }
  }

That is the entire kernel. About 150 lines including the bottom handlers. Let’s walk through what’s new.

What changed from single-core: very little

Compare against the single-core version. The handler clauses for Yield, Tick, ReadByte, Print, and KeyArrival are almost identical. The only differences:

  1. Process records now carry home_core. Useful for debugging and affinity-based scheduling.

  2. Queue operations are performs, not direct calls. Instead of Queue.push sched.ready p, we write perform Ready.push p. The handler-resolved version might be the same direct call (single-core) or might involve a NoC send (multicore). The kernel doesn’t know.

  3. New Migration effect. Two clauses to handle migrating to self (trivial) and migrating to a remote core (ship the continuation as a NoC payload).

  4. New idle_or_steal for the idle case. When the ready queue is empty, the kernel tries to steal work from another core before halting.

  5. A NoC message dispatch handler. When a packet arrives from another core, dispatch based on the op code: migration, steal request, steal reply.

That’s the entire delta. The structure — handler-per-effect, scheduler-as-tail-loop, continuation-as-process-state — is preserved exactly.

How the same kernel runs on one or many cores

This is the architectural payoff. Look at the boot sequence: with local_arena_mem in { with distributed_ready in { ... scheduler () } }. The choice between local_ready_only and distributed_ready is one line of code. Swap them and the kernel is single-core or multicore. The scheduler, handler clauses, process types — all unchanged.

Three reasons it works:

The kernel writes against effect interfaces, not concrete implementations. perform Ready.push is an abstract operation. The bottom handler defines what it does. By varying the bottom handler, we vary the deployment without varying the code.

Linearity makes ownership semantics deployment-invariant. When the kernel writes perform Ready.push p, it’s saying “I hand over ownership of p.” The type system enforces this regardless of whether the handler ships p to another core or stores it locally. The single-use guarantee holds in both cases. Use-after-migration is a compile error before we pick a bottom handler.

Continuations are first-class values that can travel. A process is a continuation. A continuation is a heap object with a well-defined layout. It can be serialized into a NoC packet, sent, deserialized, and invoked on another core. The receiving core’s CONT_REC is functionally identical to what the sending core had.

Work-stealing

and idle_or_steal () =
  let victim = pick_victim_core () in
  match perform Ready.steal_from victim with
  | Some p -> perform Ready.push p; scheduler ()
  | None   -> unsafe { __halt () }; scheduler ()

The Ready.steal_from effect is what makes this clean. The kernel doesn’t write “send a packet to core 2, wait for a response, deserialize the reply.” It writes “steal from core 2,” and the bottom handler implements the protocol.

The steal request is itself an effect-handler conversation. The victim core’s NoC handler receives STEAL_REQUEST, pops from its local ready queue if non-empty, and sends back STEAL_REPLY with the process. The thief’s handler receives the reply and resumes the waiting Ready.steal_from perform.

Correctness point: When core 1 steals a process from core 0, that process’s continuation transfers ownership. Core 0 popped it (consuming local ownership), sent it via NoC packet (the linear-type transfer), core 1 received it and pushed to its local ready queue (taking ownership). The single-use invariant holds. If core 0 tried to keep a copy, compile error. If core 1 tried to send it on to core 2 and invoke it locally, compile error. Linearity makes work-stealing safe by construction.

What got waved at

A few places I cheated, that the compiler team handles:

Wrap_RemoteProcess, Wrap_StealRequest, Wrap_Migrate, Wrap_StealReply — I used these as if they were values, but they’re actually serialized forms. Sending a process over the NoC requires writing its data into a packet payload and reconstructing it on receipt. The compiler needs serialization machinery (probably auto-derived for linear types).

For the toy, you can punt: ship pointers and assume shared memory across cores. The semantics work, the linearity is preserved, and you lose only the option of cores having genuinely separate address spaces. For the four-core mesh design, this is fine.
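A minimal sketch of that punt, assuming hypothetical unsafe address primitives (__addr_of, __from_addr):

(* Shared-memory "serialization": the packet payload is just an address.
   Linearity still holds: the sender's binding is consumed by encoding,
   so only the receiver can reconstruct and invoke the process. *)
let encode_process (p : process) : int =
  unsafe { __addr_of p }

let decode_process (payload : int) : process =
  unsafe { __from_addr payload }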

__read_core_id_register — Assumes an unsafe primitive that returns the current core’s ID. The ISA needs this; it’s a special-purpose read of a hardware register that the boot ROM hardwires.

wait_for_steal_reply — Inside Ready.steal_from, the perform has to block until a response packet arrives. The mechanism: register a one-shot continuation in a “pending replies” table keyed by request ID, then yield to the scheduler. When the matching reply arrives, the NoC handler looks up the continuation and resumes it.
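A sketch of that mechanism, assuming an AwaitReply effect and a pending_replies table added to sched_state (both names are illustrative):

effect AwaitReply {
  wait : req_id -> option process
}

(* in proc_handlers: file the thief's continuation, run something else *)
| AwaitReply.wait rid, k ->
    Table.add sched.pending_replies rid k;    (* k is linear: filed exactly once *)
    scheduler ()

(* in handle_noc_message, on NOC_OP_STEAL_REPLY: *)
| NOC_OP_STEAL_REPLY ->
    let (rid, reply) = decode_reply payload in
    invoke (Table.take sched.pending_replies rid) reply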

Cross-core synchronization for shared structures. The sched.ready queue is core-local in this design, so no synchronization needed for normal scheduling. Work-stealing is implemented as request-reply over the NoC (so the request is processed by the victim core in its own handler context, not by the thief directly touching the queue). For genuinely shared structures, we’d need atomics (the CAS instruction flagged for later in the ISA spec).

Boot, end to end

Each core runs boot(). The boot ROM is the same for all cores; the behavior differs because this_core () == 0 is only true on core 0. Core 0 spawns the initial processes; all four cores enter the scheduler loop.

Once running, work distributes naturally:

  • Core 0 starts with processes; it runs one.
  • When core 0’s process yields or ticks, the scheduler pushes it back to ready.
  • If core 0’s ready queue gets long, distributed_ready rebalances by migrating a process to a less-loaded neighbor.
  • Cores 1, 2, 3 wake up from idle_or_steal, steal a process from core 0, run it.
  • As processes block on I/O, they move to waiters; when events occur, they’re pushed back to ready.
  • The system runs.

A nice property: processes don’t know they’re on a multicore system. The user code in proc_loop or echo_proc performs effects exactly as before. The same processes run on single-core or multicore deployments. The kernel mediates everything.

What this chapter committed to

The full multicore kernel, in approximately 150 lines. The single-line deployment-time choice between single and multicore. The work-stealing protocol implemented as effect handlers over the NoC. The realization that the only differences between deployments are in the bottom handlers, not in the kernel proper.

Part III is complete. The book has built up, from algebraic principles, an ISA, a language, and a kernel that runs on it across single and multicore. Part IV applies the design to surprising places.

A TCP receive path as an effect tower

This is the chapter where the design has to prove it can handle something genuinely complicated. Networking is the standard test for systems-language expressiveness. TCP in particular is notorious: it has a state machine with a dozen states, retransmission logic, congestion control, reassembly buffers, and a flow-control window. Production TCP stacks in Linux and the BSDs run to tens of thousands of lines of carefully-managed C, and getting them right took decades.

We’re not going to build a production stack. What we’re going to do is sketch the receive path — packet arrives at the NIC, eventually surfaces as bytes a process can read — and show that the structure the design suggests is dramatically cleaner than the conventional approach. The structure is: each layer of the protocol stack is a handler that performs the next layer’s effect to the layer above. The full receive path is a tower of handlers.

If this works, it’s evidence that the unification scales: from kernel scheduling to protocol implementation, the same shape applies. If it doesn’t work, it’s evidence the design has a ceiling. Either is useful.

The effect tower

The protocol layers, written out as effect signatures:

effect Link {
  recv_frame : () -> ethernet_frame
  send_frame : ethernet_frame -> ()
}

effect IP {
  recv_datagram : () -> ip_datagram
  send_datagram : ip_datagram -> ()
}

effect TCP {
  open    : { src : port; dst : socket_addr } -> connection
  recv    : connection -> bytes
  send    : connection -> bytes -> ()
  close   : connection -> ()
}

Read this as a stack. Link is the bottom — raw Ethernet frames in and out of the NIC. IP is the middle — IP datagrams, after fragmentation/reassembly. TCP is the top — a connection-oriented byte stream.

A user process performs TCP.recv conn and receives bytes. Under the hood:

  • A TCP handler is the implementation of recv. To do its job, it performs IP.recv_datagram to get the next IP datagram on the relevant connection.
  • An IP handler implements recv_datagram. To do its job, it performs Link.recv_frame to get the next raw Ethernet frame.
  • A Link handler implements recv_frame. To do its job, it waits for Hardware.nic_packet to fire from the NIC device.

Each layer translates between the abstraction below and the abstraction above. Each layer has local state (the IP reassembly buffer, the TCP receive window, the per-connection state machine). Each layer is composable: you can plug in a different IP handler (IPv6 vs. IPv4) without changing the TCP handler, or a different Link handler (Ethernet vs. PPP) without changing the IP handler.

type ethernet_frame = {
  src_mac    : mac;
  dst_mac    : mac;
  ethertype  : int;
  payload    : bytes;
}

handler ethernet_link = {
  | Link.recv_frame (), k ->
      let pkt = perform Hardware.nic_recv () in
      (match parse_ethernet pkt with
       | Ok frame ->
           if not_for_us frame.dst_mac then
             resume k (perform Link.recv_frame ())   (* skip this frame, try again *)
           else
             resume k frame
       | Error _ ->
           resume k (perform Link.recv_frame ()))    (* drop malformed, try again *)

  | Hardware.nic_packet pkt, k ->
      (* This shouldn't happen unless we're acting as a low-level
         NIC handler too. In the normal stack, the NIC handler
         is below us. Resume and continue. *)
      resume k ()
}

The Link handler performs Hardware.nic_recv to get the next raw packet (waiting if necessary), parses the Ethernet header, decides whether to keep or drop, and resumes the caller with the parsed frame.

A subtle detail: the Link handler is itself using perform to recursively call its own effect when it wants to skip a frame. This is fine — the recursion goes back through the handler installation, finds itself, and runs again, and whatever frame the recursive perform eventually yields is passed to the waiting continuation via resume. No special looping construct needed.

Even more subtle: Hardware.nic_recv is a synchronous request-response over the NoC. The Link handler doesn’t have its own poll loop; it asks the hardware, blocks, gets a packet. The blocking is implemented as we saw in chapter 12 — the continuation goes into a waiters queue for the NIC device, and the NIC fires an arrival event when a packet is ready. This is all the same mechanism, applied recursively.

The IP handler

type ip_datagram = {
  src_addr  : ip_addr;
  dst_addr  : ip_addr;
  protocol  : int;
  payload   : bytes;
}

type ip_state = {
  mutable reassembly : reassembly_table;
  mutable my_addr    : ip_addr;
}

handler ip_v4 (st : ip_state) = {
  | IP.recv_datagram (), k ->
      let frame = perform Link.recv_frame () in
      if frame.ethertype != ETH_IPV4 then
        resume k (perform IP.recv_datagram ())    (* not IP, try again *)
      else
        (match parse_ip frame.payload with
         | Error _ -> resume k (perform IP.recv_datagram ())
         | Ok ip ->
             if ip.dst_addr != st.my_addr then
               resume k (perform IP.recv_datagram ())
             else if is_fragment ip then
               (match add_fragment st.reassembly ip with
                | Complete datagram -> resume k datagram
                | Incomplete -> resume k (perform IP.recv_datagram ()))
             else
               resume k ip)

  | IP.send_datagram dg, k ->
      let frame = encapsulate_in_ethernet dg in
      perform Link.send_frame frame;
      resume k ()
}

The IP handler holds two pieces of state: a reassembly table for in-flight fragmented datagrams, and our IP address. When IP.recv_datagram is performed, the handler:

  1. Performs Link.recv_frame to get the next Ethernet frame.
  2. Drops anything that isn’t IPv4.
  3. Parses the IP header.
  4. Drops anything not addressed to us.
  5. Handles fragmentation: if the datagram is whole, deliver it; if fragmented, file the fragment and try again.

Notice the structure. The IP handler is exactly an IP packet processor, expressed in terms of the Link layer below and the abstraction it provides above. Nothing in the IP handler knows about TCP. Nothing in the IP handler knows about UDP. It just translates Link frames into IP datagrams.

A real implementation would also handle ICMP, IP options, TTL decrement on forwarding, etc. These are additions to this handler, not changes to the layers above or below.

The TCP handler

This is the interesting one because TCP has real state: per-connection state machines, retransmission buffers, sequence number tracking, etc. The handler is bigger.

type connection = {
  conn_id   : int;
  local     : socket_addr;
  remote    : socket_addr;
  mutable state : tcp_state;
  mutable recv_window : int;
  mutable recv_next   : seqnum;
  mutable recv_buffer : ordered_byte_buffer;
  mutable send_unacked : seqnum;
  mutable send_next    : seqnum;
  mutable send_buffer  : retransmit_buffer;
  mutable read_waiter  : (bytes continuation) option;
}

type tcp_state =
  | Closed
  | Listen
  | SynSent
  | SynReceived
  | Established
  | FinWait1
  | FinWait2
  | CloseWait
  | Closing
  | LastAck
  | TimeWait

type tcp_global = {
  mutable connections : connection table;   (* keyed by (local, remote) tuple *)
  mutable listeners   : listener table;     (* keyed by local port *)
}

handler tcp (g : tcp_global) = {
  | TCP.recv conn, k ->
      (* Try to deliver buffered data immediately if any *)
      (match ordered_byte_buffer_take conn.recv_buffer with
       | Some bytes -> resume k bytes
       | None ->
           (* No data buffered. Save k as the read_waiter; pump the network. *)
           conn.read_waiter <- Some k;
           pump_until_data g conn)

  | TCP.send conn bytes, k ->
      (* Add to retransmit buffer, send segment, return *)
      let seg = build_segment conn bytes in
      retransmit_buffer_add conn.send_buffer seg;
      perform IP.send_datagram (wrap_in_ip seg);
      conn.send_next <- conn.send_next + length bytes;
      resume k ()

  | TCP.open params, k ->
      let conn = make_new_connection params in
      g.connections <- table_add g.connections (params.src, params.dst) conn;
      conn.state <- SynSent;
      conn.send_next <- random_initial_seqnum ();
      let syn = build_syn conn in
      perform IP.send_datagram (wrap_in_ip syn);
      (* Wait for SYN-ACK; this is itself a perform on IP.recv_datagram
         in the pump loop *)
      pump_until_state conn Established;
      resume k conn

  | TCP.close conn, k -> ...
}

and pump_until_data g conn =
  (* Receive IP datagrams, process them as TCP segments, until conn has data *)
  let rec loop () =
    let dg = perform IP.recv_datagram () in
    if dg.protocol == IP_PROTO_TCP then begin
      let seg = parse_tcp dg.payload in
      let target = lookup_conn g (dg.dst_addr, seg.dst_port) (dg.src_addr, seg.src_port) in
      handle_tcp_segment target seg;
      (* Check if our connection got data *)
      if has_data conn then begin
        let bytes = ordered_byte_buffer_take conn.recv_buffer in
        match conn.read_waiter with
        | Some waiter ->
            conn.read_waiter <- None;
            invoke waiter (unwrap bytes)
        | None -> loop ()
      end else
        loop ()
    end else
      loop ()
  in loop ()

and handle_tcp_segment conn seg =
  match conn.state, seg with
  | Listen, { syn = true; ack = false } -> ...
  | SynSent, { syn = true; ack = true } -> ...
  | Established, { ack = true; payload } -> 
      if seg.seq == conn.recv_next then begin
        ordered_byte_buffer_add conn.recv_buffer seg.payload;
        conn.recv_next <- conn.recv_next + length seg.payload;
        send_ack conn
      end else
        send_dup_ack conn
  | ...

Long, yes — but look at its shape. The handler is doing what a TCP implementation does: managing per-connection state, processing incoming segments according to the protocol state machine, buffering ordered data, holding waiters who want to read.

Three things to notice.

First: the pump pattern

pump_until_data is the core idiom of a protocol layer. When a TCP.recv is performed and there’s no data buffered, the handler can’t just resume the caller — it has no data to give. It also can’t return failure — that would change the semantics. So it does what blocking implementations always do: it drives the network forward until data is available.

The way it does so is by performing IP.recv_datagram () in a loop, processing whatever datagrams arrive (which may or may not be for the connection that’s waiting), and updating per-connection state until the connection in question has data, at which point it invokes the waiter.

This is the pattern that, in a conventional TCP implementation, lives as “the receive softirq” or “the kernel thread that drains the NIC.” Here it’s just a function inside the handler. It’s structured. It’s blocking (it does perform IP.recv_datagram, which blocks at the Link layer, which blocks at the Hardware layer, all the way down to the NIC interrupt). It composes naturally with the kernel’s existing blocking mechanism.

Second: the waiter pattern

When a TCP.recv finds no data, it stores the caller’s continuation in conn.read_waiter and starts pumping. When the pump eventually fills the buffer, it pulls the waiter out and invokes it with the data.

This is exactly the same pattern as the kernel’s IO.read_byte waiter — store the continuation in a wait slot, resume it from elsewhere. The difference is that the wait slot lives inside the TCP connection state, not in a global scheduler structure. Per-resource wait slots is the natural extension of the kernel-wide wait queue.

The waiter is linear (continuations always are). It’s stored exactly once, retrieved exactly once, and invoked exactly once. The connection cannot accidentally have two readers blocked simultaneously — the type would forbid it (the field is option, and assigning a new value when one was already there is a programming error that linear types help surface).

Third: each layer is a handler, and handlers compose

The full receive path:

NIC hardware
   ↓ fires Hardware.nic_packet
NIC driver handler
   ↓ delivers via Link.recv_frame
IP handler
   ↓ delivers via IP.recv_datagram  
TCP handler
   ↓ delivers via TCP.recv
user process

Each ↓ is a resume k after the layer below has performed its effect and delivered a value. Each layer is a handler installed in scope below the user process. The user process does perform TCP.recv conn; control transfers to the TCP handler; the TCP handler does perform IP.recv_datagram; control transfers to the IP handler; and so on. The stack of handlers is the protocol stack.

In a conventional implementation, this layering exists as a structural convention enforced by code review — each layer is a separate file, with its own functions, and the calls between layers are conventional function calls. Here, the layering is enforced by the effect type system: the IP handler can perform Link.recv_frame because the Link handler is in scope below it; if you forgot to install the Link handler, the program wouldn’t typecheck.

This is more than a stylistic choice. It means:

Testing. You can stub out the Link layer by installing a test handler that returns pre-recorded frames. The IP and TCP handlers above will run unmodified, processing the test frames. No mocking framework required — the test stub is just another handler.
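A sketch of such a stub, assuming a small mutable record of canned frames (stub_state and replay_link are illustrative names):

type stub_state = { mutable frames : ethernet_frame list }

handler replay_link (st : stub_state) = {
  | Link.recv_frame (), k ->
      (match st.frames with
       | f :: rest ->
           st.frames <- rest;
           resume k f               (* deliver the next pre-recorded frame *)
       | [] -> ...)                 (* test's choice: fail loudly or block forever *)
}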

Interposition. Want to log every IP datagram for debugging? Install a logging handler between IP and the layer above; pass through everything but log first. Nothing else changes.
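For example, a pass-through logger might look like this sketch (format_datagram is an assumed helper, and a perform of the handled effect inside the clause is assumed to forward to the real IP handler below):

handler log_ip = {
  | IP.recv_datagram (), k ->
      let dg = perform IP.recv_datagram () in     (* forwarded to the layer below *)
      perform (IO.print (format_datagram dg));    (* log it *)
      resume k dg                                 (* deliver it unchanged *)
}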

Composability across deployments. The single-core kernel and the multicore kernel can both use the same TCP handler — the underlying queue and waiter mechanisms differ, but the protocol logic is invariant.

Sending

I focused on receive; sending is structurally the same, going the opposite way. TCP.send conn bytes performs IP.send_datagram (wrap_in_ip seg), which performs Link.send_frame (encapsulate seg), which performs Hardware.nic_send pkt. Each layer adds its header on the way down. The send path is shorter than receive in this design because no demultiplexing is needed — you know exactly where each packet goes.
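For instance, the Link layer’s send clause is one hop down, as a sketch (serialize_ethernet and Hardware.nic_send are assumed names):

| Link.send_frame frame, k ->
    perform Hardware.nic_send (serialize_ethernet frame);   (* raw bytes to the NIC *)
    resume k ()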

What’s hard, made explicit

The structural cleanness should not obscure the work TCP demands. The hard parts of a real TCP implementation are not in the layering; they’re inside each handler:

TCP state machine. The eleven-state machine of TCP (Closed → Listen → SynSent → … → Closed) has subtle rules about which transitions are legal under which segment patterns. A real implementation has to handle simultaneous opens, abort sequences, half-close, and the famous time-wait state with its 2*MSL timer. This logic lives inside the TCP handler’s handle_tcp_segment function.

Retransmission. Tracking which bytes have been acknowledged and which need to be resent if no ack arrives in time. The retransmit buffer, the RTO timer, the exponential backoff on consecutive losses.

Congestion control. Slow start, congestion avoidance, fast retransmit, fast recovery, possibly newer algorithms like BBR or CUBIC. All operating on per-connection state.

Flow control. Advertising a receive window, respecting the peer’s window, the silly window syndrome, the Nagle algorithm.

Out-of-order reassembly. When segments arrive out of order, buffering them and delivering bytes in sequence.

None of this goes away. None of it is made easier by being inside a handler. But none of it is made harder, either, and the parts that are made easier — the layering, the testing, the composition — are pure wins.

The bet of this chapter is that the things made easier are the things conventional implementations spend disproportionate effort on. The conventional TCP stack’s complexity has some essential complexity (state machine, retransmission) and some accidental complexity (layering bookkeeping, threading discipline, context-switch management between user-space and kernel-space). The accidental complexity is what the effect-tower structure eliminates.

What this chapter committed to

A TCP receive path as a tower of effect handlers. Each layer translating from the abstraction below to the abstraction above. The pump pattern for driving the network forward when data is needed. Per-resource waiter slots for blocking semantics. The acknowledgment that essential protocol complexity remains, but accidental layering complexity goes away.

The next chapter is the one where the move is most explicitly SICP-shaped: building a database without ever using the word.

The database that appeared from nowhere

This is the SICP chapter. We’re going to start with something embarrassingly simple — a key-value store implemented as a handler — and progressively add features. At each step the additions are small, local handler refinements. By the end we’ll have a small distributed database with transactions, persistence, and replication, and at no point will we have written code that announces itself as building a database. We will have built one in the negative space of solving small problems one at a time.

The chapter is here to make a specific argument: that effect handlers are not just a way to implement coroutines, schedulers, and protocol stacks — they’re a way of composing arbitrarily complex stateful systems out of small, independently-testable pieces. The database is incidental; the structural property is the point.

Step 1: A key-value store

effect KV {
  get : key -> option value
  set : key -> value -> ()
  del : key -> ()
}

handler in_memory_kv = {
  let store = Hash.empty () in
  {
    | KV.get k, c -> resume c (Hash.find store k)
    | KV.set k v, c -> Hash.insert store k v; resume c ()
    | KV.del k, c -> Hash.remove store k; resume c ()
    | return v -> v
  }
}

The handler holds a hash table. The three operations do what you’d expect. Forty-some lines including the hash table, none of which is interesting. We’ve built a hash table with a typed interface; nobody is impressed yet.

Notice the new syntax: handler X = { let s = ... in { clauses } }. This is a handler with its own local state, allocated when the handler is installed and freed when it goes out of scope. The hash table store is private to this handler — nothing outside can see or mutate it directly. Access is only through KV.get, KV.set, KV.del. This is the encapsulation property that OO uses class boundaries for and Rust uses module boundaries for; here it’s the handler boundary.
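Usage is ordinary performs inside the handler’s scope; a sketch:

with in_memory_kv in {
  perform KV.set "x" 1;
  match perform KV.get "x" with
  | Some v -> v          (* Some 1: the write is visible through the handler *)
  | None   -> 0
}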

Step 2: Add transactions

A transaction is “do a sequence of operations atomically — either they all happen or none do.” The straightforward extension is to add a new effect:

effect Txn {
  begin    : () -> ()
  commit   : () -> ()
  rollback : () -> ()
}

And a new handler that wraps the kv handler, buffering writes until commit:

handler with_transactions = {
  let mutable pending : (key * op) list option = None in
  {
    | Txn.begin (), c ->
        pending <- Some [];
        resume c ()
    | Txn.commit (), c ->
        (match pending with
         | None -> resume c ()    (* not in txn, no-op *)
         | Some ops ->
             List.iter apply_op (List.rev ops);   (* forward each buffered op to the KV below *)
             pending <- None;
             resume c ())
    | Txn.rollback (), c ->
        pending <- None;
        resume c ()
    | KV.set k v, c ->
        (match pending with
         | Some ops ->
             pending <- Some ((k, Set v) :: ops);
             resume c ()
         | None ->
             perform (KV.set k v);     (* pass through *)
             resume c ())
    | KV.del k, c ->
        (match pending with
         | Some ops ->
             pending <- Some ((k, Del) :: ops);
             resume c ()
         | None ->
             perform (KV.del k);
             resume c ())
    | KV.get k, c ->
        (* Reads see uncommitted writes from this txn *)
        (match lookup_in_pending pending k with
         | Some Tombstone -> resume c None
         | Some (Value v) -> resume c (Some v)
         | None -> resume c (perform (KV.get k)))
  }
}

The user writes:

with in_memory_kv in {
  with with_transactions in {
    perform Txn.begin ();
    perform KV.set "x" 10;
    perform KV.set "y" 20;
    perform Txn.commit ()
  }
}

The transaction handler is between the user and the in-memory kv handler. It intercepts every KV.set, KV.del, and KV.get to honor transaction semantics. On commit, it forwards the buffered operations to the underlying kv. On rollback, it discards them.

The kv handler doesn’t know transactions exist. The user doesn’t know how transactions are implemented. The handler boundary is the contract.
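The two helpers the transaction handler leans on, apply_op and lookup_in_pending, are worth pinning down. A minimal sketch, assuming an op type with Set/Del constructors and a lookup result that distinguishes buffered writes from buffered deletes (both type declarations are assumptions of this sketch):

type op = Set of value | Del
type buffered = Value of value | Tombstone

let apply_op (k, o) =
  match o with
  | Set v -> perform (KV.set k v)    (* forwarded to the kv handler below *)
  | Del   -> perform (KV.del k)

let lookup_in_pending pending k =
  match pending with
  | None -> None                     (* not in a transaction *)
  | Some ops ->
      (* ops is newest-first, so the first entry for k wins *)
      let rec scan l =
        match l with
        | [] -> None
        | (k', Set v) :: rest -> if k' = k then Some (Value v) else scan rest
        | (k', Del) :: rest   -> if k' = k then Some Tombstone else scan rest
      in
      scan ops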

Step 3: Add persistence

Suppose we want changes to survive a crash. The natural place is another handler between transactions and storage:

handler journaling (j : journal) = {
  | KV.set k v, c ->
      journal_write j (Set k v);
      perform (KV.set k v);
      resume c ()
  | KV.del k, c ->
      journal_write j (Del k);
      perform (KV.del k);
      resume c ()
  | KV.get k, c ->
      resume c (perform (KV.get k))    (* pass-through *)
}

This handler logs every mutation to a journal before forwarding to the underlying handler. On crash, we can replay the journal to reconstruct state.

Stack it:

with in_memory_kv in {
  with journaling (open_journal "kv.log") in {
    with with_transactions in {
      user_code ()
    }
  }
}

The order matters. From bottom to top: the in-memory store is the base; journaling logs every operation that reaches it; transactions buffer until commit. When a user performs KV.set, the operation flows up to transactions (buffered), eventually flows down via commit to journaling (logged), eventually flows down to the in-memory store (applied).

On crash and restart, we read the journal at boot, replay each operation against a fresh in-memory store, and we’re back where we were. Recovery is “fold over the journal.”
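A minimal sketch of that fold, assuming a journal_fold that iterates entries in write order and an entry type mirroring what journal_write logged:

let recover (j : journal) : () / {KV} =
  journal_fold
    (fun () entry ->
      match entry with
      | Set k v -> perform (KV.set k v)
      | Del k   -> perform (KV.del k))
    () j

Run it before installing the journaling handler, so the replay is not logged a second time.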

We’ve added durability without modifying the previous handlers. Each handler is a thin wrapper around the layer below; new behaviors are introduced by sliding new handlers into the stack.

Step 4: Add replication

Now suppose we want changes to be applied on multiple cores, so failure of one core doesn’t lose data. Yet another handler:

handler replication (peers : core_id list) = {
  | KV.set k v, c ->
      List.iter (fun peer ->
        perform NoC.send peer (Wrap_KvOp (Set k v))
      ) peers;
      perform (KV.set k v);
      resume c ()
  | KV.del k, c ->
      List.iter (fun peer ->
        perform NoC.send peer (Wrap_KvOp (Del k))
      ) peers;
      perform (KV.del k);
      resume c ()
  | KV.get k, c ->
      resume c (perform (KV.get k))
}

This handler, for every mutation, sends a copy of the operation to each peer core via the NoC. The peers have their own kv handlers (with the same journaling handler stacked, presumably); they receive the message in their NoC arrival handler and apply it to their local store.

We’ve added replication without modifying anything else. The kv handler is unchanged. The journaling handler is unchanged. The transaction handler is unchanged. The user code is unchanged.

This is fire-and-forget replication, which is the wrong semantics for a real database — you want acknowledgments, conflict resolution, quorum, leader election, the whole consensus apparatus. But the shape of the solution is already there. To add a “wait for acks from N peers” semantic, the replication handler captures the user’s continuation, files it under the pending acks, and resumes it when enough acks arrive. This is the exact same pattern as TCP’s recv waiting for data: continuation as the wait slot. The handler grows; the layers around it don’t.
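A sketch of the acknowledged write path. fresh_seq, file_pending, record_ack, Wrap_KvOp_Seq, and the Ack message are assumptions of this sketch, not part of the stack above:

handler replication_acked (peers : core_id list) (quorum : int) = {
  | KV.set k v, c ->
      let seq = fresh_seq () in
      List.iter (fun peer ->
        perform NoC.send peer (Wrap_KvOp_Seq (seq, Set k v))
      ) peers;
      perform (KV.set k v);
      (* park c instead of resuming; control falls through to the
         scheduler below, exactly as in TCP's recv *)
      file_pending seq quorum c
  (* del is analogous; get passes through unchanged *)
}

(* In the NoC arrival path, on an Ack seq message: *)
let on_ack seq =
  match record_ack seq with
  | QuorumReached user_k -> resume user_k ()   (* the wait slot fires *)
  | StillWaiting -> ()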

What we just built

Let me count what’s in the stack now:

with in_memory_kv in {
  with journaling (open_journal "kv.log") in {
    with replication [peer1; peer2] in {
      with with_transactions in {
        user_code ()
      }
    }
  }
}

Four handlers, each maybe 30-50 lines. The user gets transactions, durability, and replication by writing perform KV.set. The handlers compose. Each handler is independently testable, independently swappable, independently inspectable.

The database is not anywhere. There’s no database.cml file. There’s no class called Database. There’s no module that announces itself. The database is the emergent property of stacking these four handlers. The user can call this configuration whatever they want; whatever it’s called, it has the properties of a database.

The general pattern

The structural move that’s been made repeatedly:

  1. Identify a behavior that some users want (transactions, durability, replication).
  2. Write a handler that provides that behavior by intercepting the relevant effects, doing the necessary bookkeeping, and forwarding to the layer below.
  3. Stack the handler.

The handler is the abstraction unit. The stack is the program. The layers compose because their interfaces (the effects) are typed and the type system enforces compatibility.

This is the same move that:

  • A kernel makes when it adds preemption by installing a Tick handler (chapter 11).
  • A protocol stack makes when it layers IP on Link on Hardware (chapter 15).
  • A debugger makes when it interposes a tracing handler.
  • A test harness makes when it stubs out side-effecting layers.
  • A profiler makes when it counts effects.

The reason it composes well: handlers are first-class language constructs with type-checked interfaces. Adding a layer doesn’t require modifying any other layer’s source. Removing a layer is identically easy. Reordering them changes semantics in well-defined ways.

In OO, the closest analog is the decorator pattern, which is hampered by the lack of type-level enforcement and the need for boilerplate. In functional programming without effects, the closest analog is monad transformers, which are hampered by the lift/wrap bookkeeping that makes them tedious in practice. Effect handlers, given a type system that tracks them, make the composition free.

The honest caveats

A few things this chapter glossed over that matter for a real database:

Concurrency control. With transactions and concurrent users, you need isolation (serializable, read-committed, snapshot, etc.). The transaction handler shown is single-user. Multi-user requires either a lock manager handler or an MVCC-style version-tracking handler. Both are doable; both add complexity inside the handler. The composition story doesn’t change; the inside of one handler grows.

Crash recovery semantics. The journaling handler shown is write-ahead-logging. To get the right durability guarantee, the journal must be flushed to durable storage before acknowledging the write. The handler should perform a Storage.fsync effect; the handler that interprets Storage.fsync decides what durability means (write to actual flash, write to NoC peer, etc.).

Replication consistency. Fire-and-forget gives no consistency guarantee. Strong consistency requires consensus (Paxos, Raft) which is substantially harder. The handler structure can express consensus, but a real implementation is hundreds of lines per replica. The structural argument of this chapter holds; the engineering argument doesn’t extend to “we trivially get consensus.”

The point of the chapter isn’t that databases are easy. The point is that the compositional structure of a database — layers, with clearly typed interfaces between them, each layer adding capability — is what effect handlers naturally express. Real systems have always been built this way; the design we’ve assembled makes the structure explicit and the composition cheap.

What this chapter committed to

A key-value store, transactions, persistence, and (sketched) replication, built as four stackable handlers. The recognition that this composition pattern is the same one used for kernel preemption, protocol layering, and other apparently-distinct concerns. The acknowledgment that the structural ease does not eliminate the essential complexity of real database engineering.

The next chapter takes the same compositional move into concurrency: building a coroutines runtime, and discovering that what we’ve built is the runtime we already had.

Coroutines, or: discovering you’ve built the runtime

This is the chapter where the design’s recursion shows. We’ve built a kernel out of continuations and handlers. We’ve built a TCP stack out of continuations and handlers. We’ve built a database out of continuations and handlers. In each case, the kernel/stack/database is not a special-purpose runtime — it’s a use of the language’s general-purpose primitives, arranged into the right shape.

Now we’re going to build something whose conventional implementation is a special-purpose runtime: coroutines. Coroutines, generators, async/await, fibers — call them what you want, they’re “stop running this thing here, resume it later, possibly returning a value.” Every modern language has a runtime to support this, written in C or Rust or assembly, hooked into the language as a special feature.

We’re going to write one in CML, in about a hundred lines, using effects. And we’re going to discover that what we’ve written is exactly the same shape as what the kernel does. They’re the same construction. The kernel is a particular use of the coroutine machinery. The coroutine machinery is a generalization of what the kernel needed.

What a coroutine is

A coroutine is a function that can pause itself, return a value to whoever invoked it, and be resumed later from where it paused. Compared to an ordinary function: an ordinary function runs to completion before returning; a coroutine can return many times, with state preserved between returns.

The canonical use case is generators:

let fib_gen () =
  let a = ref 0 in
  let b = ref 1 in
  loop {
    yield !a;
    let next = !a + !b in
    a := !b;
    b := next
  }

Each yield returns a value to the consumer and pauses the generator. The next time the consumer asks for a value, the generator resumes at the line after the yield, with its local state (a, b) intact.

In Python this is a yield statement and the runtime manages the magic. In JavaScript this is a generator function with function* syntax and yield keyword, with the runtime managing the magic. In C++ it’s coroutines (the Coroutines TS, merged into C++20), with the compiler doing the CPS transform and managing state allocation. The runtime is, in every case, a non-trivial piece of language-implementation engineering.

Here, we’re going to write it as a handler.

The naive version

The simplest possible coroutine is just an effect:

effect Coro a {
  yield : a -> ()
}

A generator is a function that performs Coro.yield operations. A consumer installs a handler that captures each yield, stores the continuation, and returns the value:

type 'a generator =
  | Done
  | Yielded of 'a * (unit -> 'a generator)

handler run_coro = {
  | Coro.yield v, k ->
      Yielded (v, fun () -> resume k ())
  | return () -> Done
}

run_coro handles Coro.yield; the consumer runs a body under it with with run_coro in { body () }. If the body completes, the return clause produces Done. If the body yields a value, the yield clause produces Yielded (v, k_resume), where v is the yielded value and k_resume is a thunk that, when called, runs the body further until the next yield or completion.

The consumer pattern-matches on the generator:

let rec consume gen =
  match gen with
  | Done -> ()
  | Yielded (v, k_next) ->
      print_int v;
      consume (k_next ())

This is a cooperative coroutine: the generator and consumer alternate. The generator yields; the handler returns a value and a “how to continue”; the consumer extracts the value, decides whether to ask for more, and if so calls the continuation to resume the generator.

This is exactly the shape that Python generators have. The runtime implementation is different (Python uses frame objects, stack manipulation, and a C-level generator type) but the semantics are identical.
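Wiring the generator to a bounded consumer, as a sketch (assuming the with expression evaluates to the handler’s result, per the clauses above):

let rec take n gen =
  match gen with
  | Done -> ()
  | Yielded (v, k_next) ->
      if n > 0 then (print_int v; take (n - 1) (k_next ()))

let () = take 10 (with run_coro in { fib_gen () })

This prints the first ten Fibonacci numbers; each k_next () runs the body only as far as its next yield, so the generator never runs ahead of the consumer.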

Generalizing: a coroutine system

Generators are useful, but the more interesting move is to support many coroutines that can be scheduled however the user wants. This is where it starts to resemble the kernel.

effect Coro {
  spawn : (() -> ()) -> coro_id
  yield : () -> ()
  exit  : () -> empty
  await : coro_id -> ()
}

type coro = {
  id : int;
  k  : unit continuation;
}

type coro_state = {
  mutable ready    : coro queue;
  mutable waiters  : (coro_id, coro list) table;   (* waiting on completion *)
  mutable next_id  : int;
}

handler coroutines (s : coro_state) = {
  | Coro.spawn body, k ->
      let id = s.next_id in
      s.next_id <- id + 1;
      let cb = fun () -> body (); perform Coro.exit () in
      let cont = reify cb in
      Queue.push s.ready { id; k = cont };
      resume k id

  | Coro.yield (), k ->
      Queue.push s.ready { id = currently_running (); k };
      schedule_next s

  | Coro.exit (), _k ->
      let me = currently_running () in
      wake_waiters_for me s;
      schedule_next s

  | Coro.await target, k ->
      let me = currently_running () in
      add_waiter s target { id = me; k };
      schedule_next s

  | return v -> v
}

and schedule_next s =
  match Queue.pop s.ready with
  | None -> ()           (* all coroutines done *)
  | Some c ->
      set_currently_running c.id;
      invoke c.k ()

Notice anything? This is the kernel. The structure is identical:

  • A coro_state record with a ready queue and a waiters table → just like the kernel’s sched_state
  • A spawn effect that creates new units of execution → just like ProcLifecycle.spawn
  • A yield effect that puts the current unit at the back of the queue → just like ProcLifecycle.yield
  • An exit effect that drops the current unit and runs the next → just like ProcLifecycle.exit
  • An await effect that blocks the current unit until something happens → just like IO.read_byte waiting for input
  • A scheduler loop that pops from ready and invokes → just like the kernel’s scheduler

The “coroutines runtime” we just wrote is the kernel. The only differences are that coroutines run in user-space (no preemption, no devices, no NoC) and that the names are different (Coro.spawn vs. ProcLifecycle.spawn).

You could, in fact, implement this coroutines runtime as a use of the kernel. A coroutine is just a process. Coro.yield is ProcLifecycle.yield. The handler is just an alias. We didn’t build a new runtime; we re-discovered the one we already had.
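Concretely, a sketch of that aliasing handler, assuming the ProcLifecycle signatures from the kernel chapter:

handler coro_via_kernel = {
  | Coro.spawn body, c -> resume c (perform (ProcLifecycle.spawn body))
  | Coro.yield (), c   -> perform (ProcLifecycle.yield ()); resume c ()
  | Coro.exit (), _c   -> perform (ProcLifecycle.exit ())
  (* await maps onto whatever completion-wait the kernel exposes *)
}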

What about actors?

You can play the same game with actors. An actor system is “many independent units of execution, each with private state, communicating by message-passing.” Let’s build one as a handler:

effect Actor a {
  spawn  : (mailbox a -> ()) -> actor_id
  send   : actor_id -> a -> ()
  recv   : () -> a
  exit   : () -> empty
}

type 'a actor = {
  id : int;
  mailbox : 'a queue;
  k : unit continuation;
}

handler actors (s : actor_state) = {
  | Actor.spawn body, k ->
      let id = s.next_id in
      s.next_id <- id + 1;
      let mb = Queue.empty () in
      let new_a = { id; mailbox = mb; k = reify (fun () -> body mb) } in
      Hash.insert s.actors id new_a;
      Queue.push s.ready new_a;
      resume k id

  | Actor.send target msg, k ->
      let a = Hash.find s.actors target in
      Queue.push a.mailbox msg;
      (match Hash.pop s.recv_waiters target with
       | Some waiter_k ->
           let m = Queue.pop a.mailbox in    (* immediately delivered *)
           Queue.push s.ready { (currently_running_actor ()) with k = k };  (* requeue the sender *)
           invoke waiter_k (unwrap m)
       | None ->
           resume k ())

  | Actor.recv (), k ->
      let me = currently_running_actor () in
      (match Queue.pop me.mailbox with
       | Some msg -> resume k msg
       | None ->
           Hash.insert s.recv_waiters me.id k;
           schedule_next s)

  | Actor.exit (), _k ->
      cleanup_actor (currently_running_actor ());
      schedule_next s
}

Once again: this looks exactly like the kernel with the addition of per-unit mailboxes. The mailbox is what makes it an actor system rather than a generic coroutine system — units communicate via messages, with sends and receives mediated by mailbox queues.

But again, structurally: there’s a ready queue, a per-unit state record (now including a mailbox), a spawn/yield/exit interface, and the same handler shape. Erlang spent considerable engineering effort on its actor runtime (the BEAM VM with its lightweight processes), and we’ve sketched its essentials in about 60 lines.

The realization

You haven’t built a kernel and then separately built a coroutines runtime and separately built an actor system. You’ve built one general-purpose mechanism — handlers acting on continuations — and applied it three times.

This is the deepest claim of the design: the same mechanism, applied recursively, gives you arbitrary stateful runtimes for free. A scheduler is a special handler. An actor system is a special handler with mailboxes. A coroutine generator is a special handler that returns control to a consumer. The distinction between “kernel” and “user-space runtime” dissolves; both are configurations of the same primitive.

Compare this to how things work in Linux. Linux has a kernel scheduler (CFS, EEVDF, or whatever) implemented in C. On top of that, libraries like libuv or asyncio provide user-space coroutine runtimes, implemented in C or Python, using epoll/io_uring/kqueue to multiplex I/O. On top of that, application code uses the coroutines. Three distinct runtimes, written by three different teams, in three different languages, with different invariants, different debugging tools, different performance characteristics.

In our system, there’s one runtime. The kernel handler stack and the user-space handler stack are written in the same language with the same primitives. A debugger can see across the layers. The performance characteristics are uniform. The invariants compose.

What this buys, in practice

A few practical consequences:

Custom schedulers are cheap. Want round-robin? Use a FIFO queue. Want priority-based? Use a heap. Want fair-share? Track CPU time per process. The scheduler is one handler; swap it out at the deployment site without touching anything else. In Linux you write a kernel module; here you write a handler.
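For example, swapping round-robin for priorities touches only the scheduler loop; a sketch assuming a Heap module and a priority field on each coro:

let schedule_next s =
  match Heap.pop_min s.ready with    (* s.ready is now a heap, not a FIFO *)
  | None -> ()                       (* all coroutines done *)
  | Some c ->
      set_currently_running c.id;
      invoke c.k ()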

Custom concurrency models are cheap. Actors, CSP-style channels, futures, software transactional memory — all of them are handler patterns. None require new language features. None require runtime support beyond the universal primitive.

Debugger and profiler hooks are cheap. A profiler is “a handler that counts effects, then forwards.” A debugger trace is “a handler that logs effects, then forwards.” Adding observability is adding a handler.
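Concretely, a profiling layer for the KV effects from the previous chapter; bump (increment a named counter) is an assumption of this sketch:

handler kv_profile (counts : (string, int) table) = {
  | KV.set k v, c -> bump counts "KV.set"; perform (KV.set k v); resume c ()
  | KV.del k, c   -> bump counts "KV.del"; perform (KV.del k);   resume c ()
  | KV.get k, c   -> bump counts "KV.get"; resume c (perform (KV.get k))
}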

Testing is cheap. A test harness installs handlers that record/replay/stub. The system under test is the same code as in production; only the handlers differ.

What this chapter committed to

A coroutines runtime in CML, about 60 lines. The realization that it’s structurally identical to the kernel. A sketch of an actor system, same shape. The argument that effect handlers + continuations give you, by composition, the runtimes that conventional languages bake in as special features.

The next chapter is the honest assessment: what this design is for, and what it isn’t.

What this design is for, and what it isn’t

The book has been advocating. The closing chapter should situate. The design described in this book is not a panacea, not a production target, not a competitor to anything that ships in real silicon today. It is a specific point in the design space, useful for specific things, with specific costs. Let me be honest about both.

What this design is for

Teaching. The book itself is the artifact. Reading it should give you new conceptual handles for thinking about kernels, type systems, and concurrency, regardless of whether you ever write a line of CML or run a CML kernel. The unification of preemption, blocking, and IPC under a single effect mechanism is the kind of idea that, once seen, makes you ask whether the systems you’re using should have been built that way. The answer might be no — there are real reasons not to. But asking the question productively requires having the example.

Edge servers and L7 control-plane work. The natural workload for this design is many concurrent flows, each doing small amounts of work, blocked frequently on I/O. Web servers. Reverse proxies. API gateways. Service meshes. Anywhere a conventional design would use thousands of OS threads or an event loop with handler callbacks, this design substitutes a uniform effect mechanism. The cost per blocked flow is the size of its actual live state, which is small. The cost per active flow is comparable to threaded code, possibly better because the context-switch cost is the cost of an effect dispatch (a handful of cycles in this design) rather than the cost of an OS context switch (microseconds in modern Linux).

The argument here is the same one BEAM (Erlang’s VM) made forty years ago and that Go made fifteen years ago: lightweight units of execution with cheap context switching enable concurrency patterns that heavyweight thread-based systems can’t reach. This book’s design extends that argument to hardware co-design: rather than building a virtual machine on top of an OS on top of an ISA, build the lightweight-process model directly into the ISA, and you can simplify dramatically.

Protocol implementations where correctness is the dominant cost. The TCP receive path in chapter 15 illustrates the model. A protocol stack written as a tower of handlers is structurally easier to reason about than a protocol stack written as a thread of control with state machines. The state machines still exist — they’re inside the handlers — but the layering is enforced by the type system rather than by convention. Bugs that come from “we called the wrong layer’s function” become compile errors. Bugs that come from “we forgot to handle this state transition” become exhaustiveness warnings.

For protocols where the cost of getting it wrong is catastrophic — financial, medical, infrastructure — this structural cleanness is worth real money. The same TCP stack will be a few times slower than a tuned production stack on conventional hardware, and several times faster to verify.

Research vehicles for effect systems. The hardware support for effects in this ISA — the handler CAM, the continuation-as-value, the NoC-mediated effect dispatch — is doing something no commercial ISA does. If you want to investigate “what would the runtime look like if effect dispatch were as cheap as a function call” or “how does GC change when continuations are first-class hardware values,” this design is a vehicle for those investigations. It is not itself the answer; it is a place to start asking.

Anywhere small, sovereign software is the value. This is more aspirational than commercial. The design philosophy of this book — ontologically minimal, parts that fit each other, software you can hold the whole of in your head — argues for a class of systems that are small enough to verify, modify, and own. The 150-line kernel of chapter 14 is the kind of artifact you can read end-to-end and understand. A real Linux kernel on real x86 is not, regardless of how much you’d like it to be. If you have a problem where the team that runs the software needs to also understand the software — a boutique financial firm with a custom trading system, a hospital with a custom EHR, a small-state government with a custom voter-roll system — designs in this style are worth investigating, not because they’ll outperform Linux on benchmarks but because they have a different epistemic relationship with the people running them.

This last category is the most philosophical and the hardest to defend in a tweet. It’s also the one this author believes is the most important. Most professional software is built to a scale that exceeds any single mind’s capacity to hold; this is treated as a virtue (look how big we are) when it could be treated as a cost (look how few of us understand the whole). Designs that prioritize understandability, even at the expense of scale, are a small but persistent minority position. This book is a contribution to that minority.

What this design is not for

Hard real-time systems. Garbage collection, if any, introduces latency. Effect dispatch has variance (CAM hit vs. CAM miss differs by an order of magnitude). The handler search is bounded but not predictable. For systems where missing a 100µs deadline is unacceptable — flight control, anti-lock brakes, certain trading systems — this design is not appropriate without considerable hardening.

It is possible to make it suitable: bounded handler depths, no-allocation hot paths, statically-allocated continuations. But this is design-by-restriction, and the restrictions remove much of what makes the design pleasant. If hard real-time is your constraint, look at seL4 or Ada/SPARK or specialized RTOSes designed for the purpose.

Maximum-throughput data plane. A CPU running tuned C code with SIMD intrinsics, processing packets in a polling loop with no operating-system involvement, will move bits faster than this design. DPDK, XDP, and similar systems are tuned to this case, and they win. Our design pays for its abstraction in cycles; not many, but enough that you can measure them and notice.

The design wins on concurrency, not on single-flow throughput. If you have a million slow connections, our design plausibly wins. If you have one connection moving 100 Gbps, you want different hardware and different software.

Anything requiring an existing software ecosystem. There is no CML port of OpenSSL. There is no CML port of glibc. There is no CML port of anything. The language is brand new, the ISA is brand new, the kernel is brand new. You cannot run npm install. You cannot grab a Python interpreter and prototype. If you need to use existing software libraries, this is not the system to build on, and bridging to existing software (via FFI to C) is itself a substantial design problem that this book does not address.

Production today, in any serious sense. The book describes a working artifact — there’s an emulator, a compiler, a language server, and a kernel that runs example programs. But “working artifact” and “production-ready system” are separated by many engineer-years of work: standard libraries, debuggers, profilers, package managers, third-party tools, documentation outside this book, training material, hiring pipelines for people who know the system. Real production software requires all of these. The book is not pretending to deliver them.

This is a research artifact released early. The author’s bet is that the design is interesting enough to influence other people’s work, even if no system that ships in real silicon ever uses exactly this design. The history of computer architecture is full of designs that were never commercially successful but whose ideas propagated — the Lisp Machines didn’t survive but their tagged memory influenced JIT compilation throughout the industry; Smalltalk never displaced production languages but its garbage collector and IDE shaped every modern language; the Connection Machine never made it past Thinking Machines but SIMD lives on in every modern GPU. Influence happens through ideas, not through products.

Where this design could go

Plausible extensions, in rough order of effort and risk:

A 32-bit version. The 16-bit toy serves the book, but real applications need 32-bit (or 64-bit) addresses. A 32-bit ISA would widen the tag to 4 or 5 bits, allowing more primitive types (float, byte, int64), and would extend memory to 4 GB. The language changes are minor; the ISA design has open space for it.

A real silicon prototype. The ISA is intentionally simple enough that a single grad student could build it in an FPGA over a summer. A serious effort would target a small ASIC. This would teach much about whether the design’s performance claims hold up in practice or are paper-thin.

Integration with existing software via FFI. If you can call C from CML, you can use existing libraries. The cost is giving up the type system’s guarantees at the FFI boundary, which is what every other language with FFI also pays. Done well, this is the easiest way to make CML practical.

More cores. The four-core toy demonstrates the multicore mechanism. A real version would have 16, 64, 256 cores. The NoC mesh scales naturally; the bottom handlers for distributed scheduling need refinement; work-stealing patterns need to be informed by real workload data.

A persistent boot system. The boot ROM as described is hand-coded. A real version would have a bootloader, a way to load compiled images from storage, a way to update them. None of this is interesting design work but all of it is necessary.

Verification. Many of the design choices in this book were made with verification in mind — linear types, effect tracking, explicit handler stacks, no implicit shared state. These properties make formal verification of the kernel tractable. A version that proved properties of the scheduler (“no process is forgotten,” “every effect is handled or trapped”) would be a meaningful contribution.

A second language on the same ISA. The ISA is designed for an ML, but its primitives are general enough that you could compile other languages to it. A Lisp would fit naturally — the tagged memory matches Lisp’s history. A small dependently-typed language could exploit the verification angle. A logic programming language could use continuations for choice points. The ISA is, in this sense, more general than the language.

A meditation, in closing

The design described in this book is wrong about specific things. Some of the specific decisions — 3-bit tags, 8-entry CAM, particular instruction encodings — are arbitrary choices that a real design would revisit. Some of the more architectural decisions — verified-tagged ALU, hardware-effect dispatch, NoC-as-primary-fabric — are bets that may not pay off at scale. The book’s confident tone should not be confused with the underlying engineering’s certainty.

But the direction the design points in is, the author believes, correct. The C-shaped hardware we inherited is not a fact of nature; it is an accident of which language got popular first. If we could design hardware and language together, we would not design what we have. What we would design, exactly, is open. The book is one proposal, sketched in enough detail to be argued with.

If you read the book and disagree with specific decisions, write your own. The hardware-language co-design space is vastly under-explored. There is room for a hundred more attempts.

If you read the book and walk away thinking the things conventional systems treat as separate concerns might be the same concern in disguise — preemption and blocking, interrupts and yields, message-passing and shared state — that’s the takeaway the book was written to deliver. The rest is implementation detail.

Coda

This is the end of the body. Two appendices follow, with the canonical specifications of the ISA and the language; two more, with device protocols and a glossary. They are reference material, not narrative; consult them as needed.

The design lives. The compiler is being written. The emulator runs. The book is one snapshot of a project that, by the time you read this, has likely changed in ways the author cannot predict. If you want to know what’s current, look for the project’s home page, or ask one of the people working on it.

Whatever you build next, build it small enough to hold in your head, fit its parts to each other rather than to an ideal abstract function, and let the type system do the bookkeeping you’d otherwise pay for in tests, comments, and reviews. That, in the end, is the whole argument.

A 16-bit ISA for an Effects-Native Toy Machine — v0.4

Synthesis revision. Incorporates the compiler team’s v0.3 working notes, decisions surfaced by language spec v0.2, and the new requirements that the language places on the ISA.

Status

This is v0.4 of the ISA. It supersedes v0.2.1.

Major changes from v0.2.1:

  • Bit operations added (AND, OR, XOR, SHL, SHR)
  • Typed pointer arithmetic added (INC, DEC)
  • Multiple-install disambiguation specified (install-ID mechanism)
  • Continuation in-memory layout pinned (no longer implementation-defined)
  • Handler CAM specified as LRU write-back cache (concrete cost model)
  • Register-class conventions ratified (no callee-saved registers)

Decisions deferred to a future revision are flagged at the end.

1. Design parameters

Instruction width   16 bits, fixed
Data word width     16 bits
Address space       64 KB word-addressed
Registers           8 × (3-bit tag, 16-bit value) = 19-bit slots
Handler CAM         8 entries, LRU write-back to memory
Effect table        256 entries (16 families × 16 ops), runtime-modifiable
Heap                Word-addressed, with per-object headers
Data stack          For CALL/RET; separate from heap

2. Tag namespace

3-bit tags, fixed assignment:

Tag  Name      Meaning
0    INT       16-bit signed integer
1    PTR       Pointer to a heap object (with header at offset -1)
2    CONT      Continuation handle (heap pointer, distinguished for GC and for RESU typechecking)
3    CLOSURE   Closure record (heap pointer, distinguished for CALL typechecking)
4    UNIT      The unique unit value; value bits ignored
5    BOOL      1 or 0 in value bits
6    RESERVED  Future primitive
7    RESERVED  Future primitive

All user-defined sum types are heap-allocated (PTR-tagged); variant discriminator lives in the object’s header.

3. Effect namespace

Effects are addressed as (family : 4 bits, op : 4 bits), giving 256 distinct operations.

The effect table is a 256-entry table in memory at a known fixed address, mapping (family, op) to handler-installation-info.

Family 0 is reserved (privileged control plane). Operations:

(family=0, op=N)  Operation
(0, 0)            Install handler for (family, op) at PC offset
(0, 1)            Uninstall handler for (family, op)
(0, 2)            Save effect table state
(0, 3)            Restore effect table state
(0, 4)            Query if (family, op) is handled
(0, 5..15)        Reserved

Family 0xF is reserved (hardware-originated). NoC packet arrivals, timer ticks, device events.

Families 1–0xE are available to user programs.

4. Handler CAM

The CAM is an LRU-managed write-back cache for the handler stack.

Operation                         Cost (cycles)
CAM hit                           1
CAM miss, spill-entry inspection  12 per entry (4 words × 3 cycles per word)

INST writes to the MRU slot. If full, the LRU entry is written back to a spill region in memory. PERF searches CAM first, then walks the spill on miss. A spill hit promotes the entry to CAM MRU.

Multiple-install disambiguation (NEW in v0.4):

Each handler frame carries a monotonic install-ID, assigned at INST time. When PERF finds multiple matches for the same (family, op) — whether in CAM, in spill, or split across both — the match with the highest install-ID wins.

This delivers correct effect-shadowing semantics: nested handlers shadow outer handlers regardless of cache state. The CAM is purely a performance optimization; correctness depends only on install-ID ordering.

Implementation cost: one 16-bit field per frame. Lookup cost: in the common case (one match), no change; in the multi-match case, walk the small candidate set and pick max(install_id).

UNIN semantics: Uninstalls the most-recently-installed handler for (family, op). If the CAM contains it, remove and refill from spill if available. If only in spill, scan and remove. Re-establishes the previous shadowing.

5. Heap object layout

Per-object header at offset -1 from each PTR/CONT/CLOSURE-tagged value:

  15        8 7      4 3       0
 ┌──────────┬────────┬─────────┐
 │variant(8)│size(4) │layout(4)│
 └──────────┴────────┴─────────┘
  • layout (4 bits): RECORD, ARRAY, CLOSURE_REC, CONT_REC, SUM, STRING, others reserved.
  • size (4 bits): Number of payload words after the header. 15 is the escape value meaning “consult next word for actual size.”
  • variant (8 bits): For SUM layout, the variant discriminator. For CONT_REC, the used/unused flag. Repurposed per layout otherwise.

ALLOC initializes header from immediates. SETV rewrites only the variant byte.

6. Continuation in-memory layout

A CONT_REC heap object has the following payload after its header:

Word  Field                  Notes
----  ---------------------  --------
  0   saved PC               16-bit code address
  1   saved SP               stack pointer at capture time
  2   prompt SP              stack pointer of handler frame (delimits captured slice)
  3   saved v6 value         (16 bits)
  4   saved v6 tag           (3 bits + reserved)
  5   saved v0 value
  6   saved v0 tag
  7   saved v1 value
  8   saved v1 tag
  9   saved v2 value
 10   saved v2 tag
 11   saved v3 value
 12   saved v3 tag
 13   saved v4 value
 14   saved v4 tag
 15   saved v5 value
 16   saved v5 tag
 17   saved v7 value
 18   saved v7 tag

19 payload words. Exceeds the 15-word inline header size limit, so size = 15 is used and the actual size (19) is stored in a second header word at offset 0 (the original payload-offset 0 becomes offset 1, etc.).

Two-word-per-register saves are chosen to ease future expansion to wider data words. Implementations may use a packed encoding if it remains semantically equivalent to the spec’d layout — but the layout above is canonical for tooling (debuggers, migration, serialization).

The header’s variant byte holds a used/unused flag. RESU sets used = 1 atomically (with respect to preemption boundaries); a second RESU on a used CONT traps.

7. Instruction encoding

Five forms:

                  15 14 13 12 11  10 9 8  7 6 5  4 3 2  1 0
Form R3:          ┌─────────────┬───────┬──────┬──────┬─────┐
                  │  opcode (5) │vd (3) │vs (3)│vt (3)│ ─── │
Form RR5:         ├─────────────┼───────┼──────┼──────┴─────┤
                  │  opcode (5) │vd (3) │vs (3)│  imm5      │
Form RI8:         ├─────────────┼───────┼──────┴────────────┤
                  │  opcode (5) │vd (3) │      imm8         │
Form J11:         ├─────────────┴───────┴───────────────────┤
                  │  opcode (5) │           imm11           │
Form E:           ├─────────────┬─────────┬─────────┬───────┤
                  │  opcode (5) │family(4)│ op (4)  │hpc(3) │ (for INST)
                  ├─────────────┼─────────┼─────────┼───────┤
                  │  opcode (5) │family(4)│ op (4)  │vr (3) │ (for PERF, UNIN)
                  └─────────────┴─────────┴─────────┴───────┘

32 opcode slots; 27 used (v0.4); 5 reserved.

8. Instruction set

Op     Mnemonic                  Form      Semantics
0x00   LI vd, imm8               RI8       vd ← (INT, sign-ext(imm8))
0x01   LIH vd, imm8              RI8       vd.v ← (vd.v & 0xFF) | (imm8 << 8)
0x02   ADD vd, vs, vt            R3        both must be INT; result INT
0x03   SUB vd, vs, vt            R3        both must be INT; result INT
0x04   BZ vs, off8               RI8       if vs.v == 0 then pc += sext(off8)
0x05   JMP off11                 J11       pc += sext(off11)
0x06   LD vd, vs, off5           RR5       vd ← (INT, mem[vs.v + off5]); vs must be PTR/CONT/CLOSURE
0x07   ST vd, vs, off5           RR5       mem[vs.v + off5] ← vd.value; vs must be PTR; tag of vd discarded
0x08   ALLOC vd, layout4, n4     (custom)  bump-allocate (n+1) words including header; vd ← (PTR, hp+1); hp += n+1
0x09   MOV vd, vs                R3        vd ← vs (copies tag and value)
0x0A   CALL vf, va               R3        vf must be CLOSURE; push (return PC, v6 value, v6 tag); v6 ← vf; v0 ← va; jump to closure code pointer
0x0B   RET va                    RI8       pop return PC and saved v6; v0 ← va; jump
0x0C   MATCH vd, vs, tbl5        RR5       dispatch on vs (see §9)
0x0D   INST family4, op4, hpc3   E         install handler at pc + sext(hpc3); assign monotonic install-ID
0x0E   UNIN family4, op4         E         uninstall most-recent handler for (family, op)
0x0F   PERF family4, op4, vr     E         perform effect with argument in vr; see §10
0x10   RESU vk, vv               R3        vk must be CONT and unused; resume continuation with vv
0x11   SETV vd, variant8         RI8       set variant byte of heap header at vd-1; vd must be PTR
0x12   SETTAG vd, tag3           (custom)  assert tag on vd; debug: trap on mismatch; release: no-op
0x13   HALT                      J11       stop core
0x14   AND vd, vs, vt            R3        NEW: bitwise AND; both INT
0x15   OR vd, vs, vt             R3        NEW: bitwise OR; both INT
0x16   XOR vd, vs, vt            R3        NEW: bitwise XOR; both INT
0x17   SHL vd, vs, vt            R3        NEW: vs.v << vt.v; both INT; shift count masked to 4 bits
0x18   SHR vd, vs, vt            R3        NEW: vs.v >> vt.v logical; both INT; shift count masked to 4 bits
0x19   INC vd, vs, imm4          (custom)  NEW: vd ← (PTR, vs.value + imm4); vs must be PTR; result PTR
0x1A   DEC vd, vs, imm4          (custom)  NEW: vd ← (PTR, vs.value - imm4); vs must be PTR; result PTR
0x1B-0x1F  RESERVED

Custom encodings (need confirmation from emulator team):

  • ALLOC: [op:5 | vd:3 | layout:4 | n:4] — note n is now 4 bits (max 15 words), aligned with header size field; the size=15 escape applies
  • SETTAG: [op:5 | vd:3 | tag:3 | reserved:5]
  • INC, DEC: [op:5 | vd:3 | vs:3 | imm:4 | reserved:1]

9. MATCH semantics

MATCH vd, vs, tbl5:

  • If vs.tag is PTR: load the header word from mem[vs.value - 1], extract the variant byte (bits [15:8]), and use it as the index into the jump table at pc + tbl5.
  • If vs.tag is a primitive (INT, BOOL, UNIT, etc.): use the tag value itself as the index.

vd is reserved for future use (a “scratch” output register for the loaded header word, if useful).

The jump table is a sequence of 16-bit code-relative offsets immediately following the MATCH instruction, at pc + tbl5. The CPU adds the selected offset to pc (post-MATCH) and jumps.

Tables longer than the variant range are permitted; entries beyond the actual variant count are unreachable. Missing entries trap.

10. PERF semantics

PERF family, op, vr:

  1. CAM search. Look for an entry matching (family, op). If multiple, pick the one with the highest install-ID.
  2. Spill search if CAM misses. Walk the spill region for matches; among CAM and spill matches, pick highest install-ID.
  3. Effect table fallback. If no installed handler is found, consult effect_table[(family, op)]:
    • Local(pc): behave as if a CAM entry pointed at this PC.
    • Remote(dst, dev_op): build a NoC packet and emit; do NOT capture a continuation; continue at next instruction.
    • Unhandled: trap.

On local dispatch (CAM hit, spill hit, or effect table Local):

  1. Allocate a CONT_REC on the heap; capture saved registers, PC (of next instruction), SP, and the prompt SP (the SP at the handler frame).
  2. Set v0 ← vr (effect argument).
  3. Set v1 ← (CONT, ptr_to_CONT_REC) (continuation handle).
  4. Set v7 ← (INT, op) (op-tag).
  5. Jump to the handler’s PC.

On remote dispatch:

  1. Build NoC packet with dst = device_addr, src = this_core, op = dev_op, payload = vr.value (zero-extended).
  2. Hand to NoC interface; stall if FIFO full.
  3. Continue at the next instruction.

Clobber notice: v0, v1, and v7 are unconditionally clobbered by every PERF that does local dispatch. The compiler treats them as not-live across perform sites.

11. RESU semantics

RESU vk, vv:

  1. vk must be CONT-tagged. vk’s header variant byte must indicate “unused”; else trap.
  2. Mark the CONT_REC as used (atomic with respect to preemption boundaries).
  3. Restore registers v0..v5, v7, and v6 from the CONT_REC’s saved slots.
  4. Restore SP to saved SP.
  5. Set v0 ← vv (the value being delivered to the perform site).
  6. Jump to the saved PC.

The handler frame issuing the RESU is destroyed; control flow transfers entirely to the resumed continuation.
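A minimal round trip through §10 and §11 in the informal assembly used elsewhere in this book; label resolution, and whether the handler body fits the 3-bit hpc offset, are glossed over:

        INST  1, 0, handler    ; install handler for (family 1, op 0)
        LI    v2, 21
        PERF  1, 0, v2         ; capture CONT; enter handler: v0=21, v1=cont, v7=0
        ; RESU lands here: v2 restored to 21, v0 = 42
        HALT
handler:
        ADD   v3, v0, v0       ; double the effect argument
        RESU  v1, v3           ; mark CONT used, restore registers, v0 <- 42, jump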

12. Calling conventions

Function call

CALL vf, va:

  • vf must be CLOSURE-tagged; points to a closure record (code_ptr, env_word_0, env_word_1, ...)
  • Hardware pushes return PC (1 word), v6.value (1 word), v6.tag (1 word, packed in low 3 bits) to data stack — 3 words per call
  • v6 ← vf
  • v0 ← va
  • Jump to code_ptr (word 1 of closure record)

Function return

RET va:

  • Pop 3 words from data stack: saved v6 tag, saved v6 value, return PC
  • Restore v6
  • v0 ← va
  • Jump to return PC

Register classes

Register  Class                         Notes
v0        arg / return / PERF arg       clobbered by every CALL and every PERF
v1        scratch / PERF continuation   clobbered by every PERF
v2..v5    caller-saved scratch          callee may freely clobber
v6        $clo (closure environment)    preserved across CALL via implicit stack save
v7        PERF op-tag                   clobbered by every PERF

No callee-saved registers. Function prologues do not preserve registers; the cost of preservation moves to callers and is paid only where needed.

Handler entry

When PERF dispatches locally:

  • v0 ← effect argument
  • v1 ← continuation handle (CONT tag)
  • v7 ← op tag (INT)
  • v2..v5 ← undefined; handler must not assume any value
  • v6 ← undefined; handler’s closure (if any) is recorded in the CONT_REC and restored only on RESU

13. NoC packet format

(Recorded here for completeness; details may evolve as NoC implementation matures.)

[63:58] dst_id     (6 bits, 64 devices)
[57:52] src_id     (6 bits)
[51:48] op         (4 bits)
[47:44] flags      (4 bits, priority/control)
[43:12] payload    (32 bits)
[11:0]  reserved

PERF with remote dispatch builds packets with op = dev_op (from the effect table entry) and payload = vr.value (zero-extended from 16 bits). For larger payloads, the compiler emits a pointer-passing version (allocate, store, perform with pointer).

14. Boot semantics

(Unchanged from v0.2.1.) The bootstrap core comes up at a reset vector with empty CAM and empty effect table. Boot ROM populates the effect table, installs initial handlers, enumerates devices via PERF probes, and releases other cores.

15. Preemption semantics

(Unchanged from v0.2.1.) Preemption is sampled at the end of each sequencer cycle (Fetch→Decode→Execute→Write→Preempt). Instructions either retire fully or do not retire. Pending remote PERFs queue at the core’s NoC input; the queue is drained at preemption sample points.

HALT halts the core until any remote PERF arrives (idle state).

16. What’s deferred to v0.5

The following items from the compiler team’s v0.3 notes are deferred:

  • Memory operation timing model. LD/ST/ALLOC currently 1 cycle, 0 abstract time. A coordinated timing pass after the first real workloads.
  • Fused tag-aware loads (LD.PTR, LD.CONT, LD.CLOSURE). Defer until profiling shows the two-instruction pattern in a hot path.
  • Multi-shot continuations. For now, continuations are one-shot only.
  • Arithmetic shift right (SAR). Add when a workload demonstrates the need.
  • Shift-by-immediate (SHLI, SHRI). Peephole optimization in assembler should fuse LI + SHR patterns; promote to ISA only if peephole proves insufficient.
  • Cache hierarchy modeling. Flat memory cost for now.
  • Atomic operations (CAS). Needed for multicore synchronization; not yet needed in single-core kernel. Add as one of the reserved slots when multicore begins.

17. Open questions

  • Custom encoding details for ALLOC, SETTAG, INC, DEC need ratification from the emulator team. The fields are specified; the exact bit positions within the 11-bit operand region need to be agreed.
  • Behavior on tag-check trap. Currently unspecified what happens when ADD encounters a non-INT, or RESU encounters non-CONT, or similar. Proposal: tag-check faults raise a hardware-originated PERF on family 0xF op 0 with a fault-code payload. Kernel installs a handler that decides whether to terminate the offending process or attempt recovery.
  • CONT_REC variant byte format. Currently spec’d as a used/unused flag, but 8 bits is more than needed for that. Possible future use: epoch counter for ABA-style protection on continuation reuse in long-running systems.

18. Versioning

This document is ISA spec v0.4.

The Ruby emulator implementation should be updated to add the new instructions (AND, OR, XOR, SHL, SHR, INC, DEC), install-ID tracking, and the canonical CONT_REC layout. The “side-table index” implementation of continuations remains permitted as long as it is semantically equivalent to the spec’d on-heap form.

A Small ML for Effects-Native Hardware — Language Spec v0.2

Specification for the language compiled to the 16-bit effects-native ISA. This revision synthesizes findings from the compiler team’s v0.3 working notes (their notes on the ISA, which surface implications for the language) with the original v0.1 spec.

Status

This is v0.2 of the language spec. Changes from v0.1 are flagged inline. Resolutions to several v0.1 open questions are now pinned. New decisions surfaced by the compiler team’s implementation work are incorporated.

This document remains normative for the compiler implementation and informative for the kernel and library implementations.

The language is unnamed for now. Working name: CML (Concrete ML).

What changed from v0.1

  • OPEN 3 resolved: Frame allocation is hybrid (stack for normal frames, heap for PERF-captured continuations). Was previously specified as all-heap.
  • Effect-shadowing semantics pinned: Nested handlers shadow by install order, not by cache location. Has implications for both compiler lowering and the ISA’s handler-stack semantics.
  • Pointer arithmetic via typed instructions: Specifies that the language relies on tag-preserving pointer operations (INC style) rather than untagged raw arithmetic. The ISA team is asked to provide these.
  • Bit operations required from ISA: AND, OR, XOR, SHL, SHR needed for variant extraction and other bitfield work. Was previously OPEN 4.
  • Continuation in-memory layout pinned: On-heap shape specified, no longer implementation-defined.
  • Handler-tower locality is a first-class concern: The CAM-vs-spill cost (12×) is significant enough that the language and compiler should help programmers manage it.

1. Design principles (unchanged from v0.1)

One: Effects are the only mechanism for I/O, allocation, scheduling, and communication. There are no built-in side effects in the language proper. Every operation that touches the world goes through perform.

Two: The type system tracks effects via row polymorphism. Every function’s type includes its effect row. Effects propagate by inference; annotations are optional but checked.

Three: unsafe is itself an effect. Operations that emit raw ISA instructions are gated by unsafe { ... } blocks, which add Unsafe to the surrounding function’s effect row.

Four: Memory allocation is an effect. There is no built-in allocator. The bottom handler decides the allocation strategy.

Five: The language and the ISA are co-designed. Language constructs lower directly to ISA primitives without runtime support beyond the kernel and its handlers.

2. Syntax overview

(Unchanged from v0.1. Reproduced here for self-containment of this document. Skip to §3 if familiar.)

2.1 Types

type list a =
  | Nil
  | Cons of a * list a

type option a =
  | None
  | Some of a

type pair a b = { fst : a; snd : b }

Algebraic data types with constructors (sum) and records (product). Type parameters are lowercase identifiers.

2.2 Functions

let id (x : a) : a = x

let rec sum_list (xs : list int) : int =
  match xs with
  | Nil          -> 0
  | Cons (h, t)  -> h + sum_list t

Functions are curried. let rec for recursion. Pattern matching on sums via match ... with.

2.3 Effects

effect State a {
  get : () -> a
  put : a -> ()
}

effect IO {
  print : string -> ()
  read_byte : () -> byte
}

Effect declarations introduce a family with named operations. Each operation has a signature; the return type is what the perform-site receives back when the handler resumes.

Effects can be parameterized over types.

2.4 Perform and handle

let inc () : int / {State int} =
  let n = perform State.get () in
  perform State.put (n + 1);
  n

let main () =
  with ref_state 0 in {
    let n = inc () in
    print_int n
  }

perform Effect.op (args) performs an operation. The type carries an effect row.

with handler args in { body } installs handler for the duration of body. The handler discharges the corresponding effect from the body’s effect row.

2.5 Handler declarations

Stateless handler:

handler print_to_uart = {
  | IO.print s, k    -> uart_send s; resume k ()
  | IO.read_byte (), k -> resume k (uart_recv ())
  | return v         -> v
}

Parameterized handler (state-carrying):

handler<a> ref_state(init : a) = {
  | State.get(), k    -> resume k init      with state := init
  | State.put(v), k   -> resume k ()        with state := v
  | return x          -> x
}

(Syntax for parameterized state remains an open compiler-team question; see §10.)

2.6 Unsafe blocks

let alloc_int_cell () : ptr / {Unsafe} =
  unsafe {
    let cell = __alloc Record 1 in
    __st 0 cell 0;
    cell
  }

unsafe { ... } adds Unsafe to the surrounding function’s effect row. Inside, primitives prefixed with __ are available; see §6.

3. Type system

3.1 Types

T ::= int | byte | bool | unit | string | empty
    | ptr | cont           -- low-level types, primarily from unsafe blocks
    | a                    -- type variable
    | T -> T / R           -- function type with effect row
    | (T, T, ...)          -- tuple
    | { l : T; ... }       -- record
    | type_name T1 T2 ...  -- type constructor application

int is 16-bit signed. byte is 8-bit unsigned. empty is the empty type, used for never-returning functions (exit : () -> empty / {Scheduler}).

3.2 Effect rows

R ::= {}                              -- pure
    | { E1, E2, ... }                 -- closed row
    | { E1, E2, ... | r }             -- open row with variable

Row polymorphism is supported. A function’s effect row is inferred; annotations are optional and checked.

Decision (v0.2): The language uses effect rows à la Koka, not effect handles à la Frank/OCaml 5. This was a major implementation choice flagged in the compiler team’s notes; this revision ratifies it explicitly. Effect rows give the kernel the ability to statically prove “this function does not allocate,” which is load-bearing for the design.

3.3 Inference

Hindley-Milner with effect rows. Implementation follows the algorithm in [Leijen 2017]. The compiler team’s working notes mention an option to start with inference-only (no surface effect syntax) and add surface annotations later. Acceptable as an implementation phase; the semantics is row-based from day one.

3.4 The Unsafe effect

Unsafe is built-in. Any use of an __-prefixed primitive adds Unsafe to the surrounding function’s row. The compiler permits unsafe { ... } blocks anywhere; the type system tracks them via the row.

4. Memory model

4.1 No built-in allocation

Every allocation is perform Mem.alloc { ... }. The Mem effect is in the prelude:

effect Mem {
  alloc : { layout : layout; size : int } -> ptr
  free  : ptr -> ()
}

The bottom handler implements Mem. The kernel installs different bottom handlers for different memory strategies.

4.2 Construction syntax

let xs = Cons (1, Nil) desugars to a sequence of perform Mem.alloc calls and field writes, wrapped in an implicit unsafe. The user does not see the desugaring.

Optimization permitted: When the compiler can prove that the active Mem handler is a known bump-allocator (e.g., inside a with region r in { ... } block whose handler is statically resolvable), it may compile allocation directly to ALLOC instructions, bypassing the perform-and-handler dispatch. This is a semantics-preserving optimization, not a language change.

4.3 Region handlers

The canonical user-level memory strategy is region-based:

with region r in {
  let xs = Cons (1, Nil) in
  let n = sum_list xs in
  n
}

The region handler intercepts Mem.alloc and routes allocations into a bump arena. When the with block exits, the arena is reset.
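A minimal sketch of that handler, in the stateful-handler form used earlier; arena, arena_base, write_header, and arena_reset are assumptions:

handler region (r : arena) = {
  let mutable hp = arena_base r in
  {
    | Mem.alloc { layout; size }, k ->
        let p = hp in
        hp <- p + size + 1;            (* one header word plus payload *)
        write_header p layout size;
        resume k (p + 1)               (* PTR to payload; header at offset -1 *)
    | Mem.free _, k -> resume k ()     (* no-op: the arena frees en masse *)
    | return v ->
        arena_reset r; v               (* block exit resets the arena *)
  }
}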

OPEN (carried from v0.1): Region escape detection. Forbidding ptr return from with region blocks is the v0.1 answer and remains the v0.2 answer. A full region-polymorphism system is deferred.

4.4 GC

Not in v0.2. A future bottom handler can implement tracing GC; the language doesn’t change.

5. Compilation overview

5.1 Frame allocation: hybrid stack/heap (NEW in v0.2)

Resolved from v0.1 OPEN 3.

Function-call frames live on the data stack used by CALL/RET. Continuation frames captured by PERF are migrated to the heap as CONT_REC objects at capture time.

This is the production design from the compiler team’s implementation work, and it has proved tractable. The simplification “all frames on the heap” from v0.1 is withdrawn.

Implications:

  • Normal function calls do not allocate. CALL pushes return PC + saved closure to the stack; RET pops them. No Mem.alloc perform involved.
  • PERF is the allocation point. When PERF captures a continuation, the live register state and the relevant stack slice are migrated to a fresh CONT_REC on the heap. The on-heap layout is specified in §5.6.
  • Tail calls (§5.8) reuse the current frame; no stack growth.

5.2 Lowering of expressions

Source                    Target
42 (int literal)          LI vd, 42 (INT tag default)
x (variable)              depends on x’s storage class
f x (application)         place x in v0, f in v6, emit CALL v6, v0
let x = e1 in e2          compile e1, store result in x’s slot, compile e2
match v with ...          compile v to a register, emit MATCH with dispatch table
perform E.op args         place args in v0, emit PERF (family, op), v0
resume k v                place v in v0, k in v1, emit RESU v1, v0
unsafe { ... __op ... }   direct emission of corresponding ISA instruction

5.3 Register classes (NEW in v0.2, ratified from compiler team)

Register  Class                              Notes
--------  ---------------------------------  -----
v0        argument / return / PERF argument  clobbered by every CALL and every PERF
v1        scratch / PERF continuation        clobbered by every PERF (receives continuation)
v2..v5    caller-saved scratch               callee may freely clobber
v6        $clo (closure environment)         preserved across CALL via implicit save/restore
v7        PERF op-tag                        clobbered by every PERF (receives op-tag)

No callee-saved registers. Function prologues are empty; the cost of preserving values moves to callers and is paid only where needed. For an effect-heavy language this is the right tradeoff: calls and performs are frequent enough that empty prologues save more overall than callee-saved registers would recover.

v0, v1, and v7 are clobbered by PERF. The compiler must treat them as not-live across any perform site. This constrains register allocation in effect-heavy code; programs should be written assuming three of eight registers are always at risk near perform expressions.

5.4 Function lowering

A function let f x = body compiles to:

  1. A heap-allocated closure record with layout = CLOSURE_REC, containing:

    • Word 0: header
    • Word 1: code pointer
    • Words 2..N+1: captured free variables
  2. A code body at the address in word 1. The body’s prologue:

    • Receives argument in v0, closure pointer in v6
    • Loads needed captured free variables from v6 + offset
    • Executes the body
    • Returns via RET v0
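
A small worked example may help; the record contents in the comment follow the layout above (captured variables starting at word 2):

let make_add k =
  let add x = x + k in     (* k is free in add, so add closes over it *)
  add

let add3 = make_add 3
(* add3's closure record: word 0 = header (CLOSURE_REC), word 1 = code
   pointer for add's body, word 2 = the captured k (here 3). At a call
   site the argument arrives in v0 and the record's address in v6; the
   body loads k from v6 + 2. *)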

5.5 Sum type lowering

A sum type type t = | A of int | B of (int, int) lowers per v0.1 §5.4, with the addition that variant extraction from headers requires bit operations (§7).

Pattern match on v1 : t:

LD       v3, v1, -1     ; load header (header sits before payload)
LI       v_shift, 8
SHR      v3, v3, v_shift ; shift variant byte down to low bits
LI       v_mask, 0xFF
AND      v3, v3, v_mask  ; mask off layout/size fields
MATCH    v3, v3, table   ; dispatch on variant

The compiler emits this sequence at every match on a PTR-tagged value.

Optimization hook: When the compiler can prove a match has only one or two variants, it can replace MATCH with a single BZ against the variant. Worth doing; common in practice for option types.

5.6 Continuation in-memory layout (NEW in v0.2)

A CONT_REC on the heap has the following shape:

Offset  Field                  Notes
------  ---------------------  -----
  -1    header                 layout = CONT_REC, size = 12, variant = used/unused flag
   0    saved PC               16-bit code address
   1    saved SP               stack pointer at capture time
   2    prompt SP              stack pointer at handler frame (delimits the captured slice)
   3    saved v6 value
   4    saved v6 tag           (low 3 bits in a 16-bit word; remaining bits reserved)
   5    saved v0 packed        high byte = tag, low byte = value (or two words; see note)
   6    saved v1 packed
   7    saved v2 packed
   8    saved v3 packed
   9    saved v4 packed
  10    saved v5 packed
  11    saved v7 packed

Note on packing: Current 16-bit values + 3-bit tags fit awkwardly. The implementation may use either (a) one word per register with the tag in the high 3 bits and the value in the low 13 bits, sacrificing 3 bits of value range in saves, or (b) two words per register (tag word + value word), doubling the register portion of the CONT_REC. Option (b) is more future-proof for a 32-bit expansion. Decision: option (b) for v0.2. (The table above shows the compact option-(a) shape; under option (b), each "packed" slot becomes a tag word plus a value word.) The continuation record is larger but the layout extends cleanly to wider data words. Total size grows to approximately 19 words, which exceeds the 15-word inline-header size field, so the record uses the "size = 15, consult next word" escape.

The continuation handle returned by PERF is a CONT-tagged pointer to the header of this record.

RESU vk, vv reads the CONT_REC pointed to by vk, restores the register state and stack slice, places vv in v0 at the resume site, and transfers control to the saved PC. The variant flag in the header is updated to mark the continuation consumed; a second RESU traps.

This layout is specified, not implementation-defined. Future implementations (silicon, alternative emulators, debuggers) must agree. The Ruby emulator’s current “side-table index” approach is a permitted implementation strategy only if it maintains semantic equivalence with the spec’d layout — i.e., a debugger or migration tool would see the same logical contents.
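
At the language level, the used/unused flag is what enforces one-shot continuations. A minimal sketch (the first resume's return value is ignored here):

let use_twice k =
  let () = resume k 1 in
  resume k 2    (* traps: the CONT_REC's used flag was set by the first RESU *)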

5.7 Effect lowering and handler installation

(Mostly unchanged from v0.1.)

The compiler emits effect declarations into a manifest. Linker resolves to (family, op) pairs.

Reserved families:

  • Family 0: Privileged control plane.
  • Family 0xF: Hardware-originated effects.
  • User available: 1 through 0xE.

Handler installation lowers to:

INST  family, op, +1
JMP   past_handler
handler_body:
  ...
  RESU v1, vresult
past_handler:
  ...

Handler-tower locality (NEW in v0.2):

The handler CAM holds 8 entries. Beyond that, handlers spill to memory, with a 12× cost penalty on PERF searches that miss the CAM. The compiler should:

  • Warn when a with block’s compilation would cause the static handler tower depth to exceed 8.
  • Avoid emitting handler installations in inner loops where the same handler could be hoisted outside.
  • Document handler depth as a programmer-visible cost.

The language does not statically prevent deep handler towers — they remain expressible — but the compiler issues guidance, and runtime profiling can flag spill-thrashing.
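
As a sketch of the second guideline (h stands for some handler expression, elided; the with-syntax follows the glossary):

(* Poor: installs and tears down h once per element. *)
let rec iter_churn f xs =
  match xs with
  | Nil -> ()
  | Cons (x, rest) ->
      let () = with h in { f x } in
      iter_churn f rest

(* Better: one installation covers the whole traversal. *)
let iter_hoisted f xs =
  let rec go ys =
    match ys with
    | Nil -> ()
    | Cons (y, rest) -> let () = f y in go rest
  in
  with h in { go xs }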

Handler shadowing semantics (NEW in v0.2):

Multiple installations of the same (family, op) must resolve to the most-recently-installed match. This is the standard algebraic-effects semantics; it is non-negotiable.

This places a constraint on the ISA: the handler-stack lookup mechanism must respect install order, not merely cache recency. The compiler team’s v0.3 notes propose install-ID disambiguation; that’s the ISA’s problem to solve. The language requires the semantics; the ISA must deliver it.
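
A sketch of the required resolution order. The handler-literal syntax here is illustrative only, and buffer is a hypothetical helper; what matters is which installation wins:

with handler { IO.print s, k -> resume k () } in {              (* outer: discard *)
  with handler { IO.print s, k ->
      let () = buffer s in resume k () } in {                   (* inner: buffer *)
    perform IO.print "hello"    (* must dispatch to the inner, buffering handler *)
  }
}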

5.8 Tail calls

Tail calls must be optimized. The compiler identifies tail-position calls and emits JMP (after argument setup) rather than CALL+RET. Tail calls do not grow the stack.

This is a hard requirement. The kernel scheduler is written assuming TCO; non-TCO compilers cannot compile it.
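
A sketch of the shape the scheduler relies on (assuming let rec for recursive definitions):

let rec drain n =
  match n with
  | 0 -> ()
  | _ ->
      let () = perform Scheduler.yield () in
      drain (n - 1)    (* tail position: argument setup + JMP, no stack growth *)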

6. Pointer arithmetic (NEW in v0.2)

The language relies on typed pointer arithmetic for kernel-level code: bumping heap pointers, walking array elements, etc.

The ISA is asked to provide tag-preserving operations:

INC vd, vs, imm4    ; vd ← (PTR, vs.value + imm4); requires vs.tag = PTR
DEC vd, vs, imm4    ; vd ← (PTR, vs.value - imm4); requires vs.tag = PTR

Inside the language, these surface as primitive operations in unsafe blocks:

unsafe {
  let hp' = __inc hp 3 in     (* bump heap pointer by 3 words *)
  ...
}

The language does NOT expose untagged arithmetic. No __add_raw or ADD.RAW. Type-checked arithmetic over INT operands is the only ALU operation form. Pointer arithmetic uses typed INC/DEC.

Rationale: untagged arithmetic would punch a hole in the system’s type-tagging invariant — a value of unknown tag could enter the ALU, and downstream type-checked operations could not safely consume the result. Typed pointer ops preserve the invariant while solving the actual need (the kernel almost never needs general pointer arithmetic; it needs +N and -N for small N).

If the ISA team objects, the fallback is ADD.RAW/SUB.RAW inside unsafe blocks only, with the type system tracking Unsafe aggressively. But INC/DEC are preferred.
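
A sketch of the intended idiom, walking an array of two-word elements; __ld (load word at offset) is an illustrative primitive alongside the real __inc:

let rec sum_firsts p n acc =
  match n with
  | 0 -> acc
  | _ ->
      unsafe {
        let x  = __ld p 0 in     (* word 0 of the current element *)
        let p' = __inc p 2 in    (* advance one element: +2 words, tag stays PTR *)
        sum_firsts p' (n - 1) (acc + x)
      }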

7. Bit operations (NEW in v0.2)

The language relies on the ISA providing:

AND vd, vs, vt       ; bitwise AND; both INT
OR  vd, vs, vt       ; bitwise OR; both INT
XOR vd, vs, vt       ; bitwise XOR; both INT
SHL vd, vs, vt       ; vs << vt.value; both INT
SHR vd, vs, vt       ; vs >> vt.value, logical (zero-fill); both INT

These are required for variant extraction (§5.5), bitfield manipulation in headers, and bitmap operations in the kernel.

These are non-negotiable; the language cannot compile without them. The ISA team has been notified.

Arithmetic shift right (SAR) is not yet required and is deferred.
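
At the surface, the §5.5 extraction can be phrased as a sketch like the following, with __ld, __shr, and __and as illustrative __-prefixed names for LD, SHR, and AND:

let variant_of v =
  unsafe {
    let hdr = __ld v (-1) in     (* header sits before the payload *)
    __and (__shr hdr 8) 0xFF     (* shift the variant byte down, mask it off *)
  }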

8. Standard prelude

effect Mem {
  alloc : { layout : layout; size : int } -> ptr
  free  : ptr -> ()
}

effect IO {
  print : string -> ()
  read_byte : () -> byte
}

effect Scheduler {
  yield : () -> ()
  spawn : (() -> ()) -> ()
  exit  : () -> empty       (* never returns; uses empty type *)
}

effect Unsafe { /* raw ISA primitives */ }

The kernel installs bottom handlers for Mem, IO, Scheduler before user code runs.
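
An ordinary user program then reads like the following sketch; its inferred row is {IO, Scheduler}, and both effects terminate at the kernel's bottom handlers:

let main () =
  let () = perform IO.print "hello" in
  let () = perform Scheduler.yield () in
  perform IO.print "world"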

9. Compiler-emitted optimizations

The compiler is permitted (and encouraged) to:

  • Inline handler dispatch when the handler is statically known. A perform Mem.alloc inside a with region r in { ... } block can compile directly to ALLOC without going through PERF/handler-search.
  • Elide SETTAG vd, INT after LD (since LD already produces INT).
  • Replace MATCH with BZ-chains for sum types of one or two variants.
  • Inline small functions to avoid the CALL overhead.
  • Optimize tail calls (required, not optional).

The compiler is forbidden from:

  • Reordering perform sites within a function. The order of perform calls is observable.
  • Eliminating perform sites, even if the handler is known to be effect-free. Future handlers may not be.
  • Caching the result of a perform across calls.
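
The prohibitions are easiest to see in a sketch like this: both performs must be emitted, in this order, because a handler may count calls or answer each one differently:

let two_bytes () =
  let a = perform IO.read_byte () in
  let b = perform IO.read_byte () in
  (a, b)    (* neither merging the performs nor caching a is sound *)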

10. Open questions, consolidated

Carried from v0.1 or surfaced anew:

  • OPEN (compiler team to resolve): Parameterized handler state syntax. (with state := v vs. resume k v new_state vs. other.)
  • OPEN (v0.1 carryover): Region escape detection. Forbid ptr return from regions for v0.2; consider region-polymorphism for v0.3.
  • OPEN (ISA team to resolve): Multiple-install disambiguation. Language requires install-order semantics; ISA must deliver. Compiler team’s v0.3 notes propose install-ID disambiguation, which is acceptable.
  • OPEN (deferred): Fused tag-aware load variants (LD.PTR, LD.CONT, LD.CLOSURE). Profile first, decide later.
  • OPEN (deferred): Multi-shot continuations. v0.2 specifies one-shot.

11. Versioning

This document is Language Spec v0.2.

Major semantic changes will produce v0.3. Clarifications without semantic change will produce v0.2.1.

Appendix C: NoC and Device Protocols

This appendix documents the protocols that live below the ISA but above the silicon: how NoC packets are structured, how devices and cores communicate, and how the effect-table is populated to route effects to the right destination.

C.1 NoC packet format

The NoC is a 4-local mesh (each tile has four neighbors: N, E, S, W) carrying fixed-size packets. The current packet format is 64 bits, but the wire format may pad or quantize as needed.

 0      6      12     16     20     52     64
 |------|------|------|------|------|------|
 | dst  | src  | op   | flag |  pl  | rsvd |
 |  6b  |  6b  |  4b  |  4b  | 32b  | 12b  |

Field  Width    Meaning
-----  -------  -------
dst    6 bits   Destination tile ID (64 tiles max)
src    6 bits   Source tile ID
op     4 bits   Operation code (per-destination semantics)
flag   4 bits   Reserved flags (priority, ack, etc.)
pl     32 bits  Payload: typed value (tag carried in the payload)
rsvd   12 bits  Reserved for future use

Total: 64 bits per packet.

The payload is 32 bits, wide enough to carry an INT, a PTR (memory address), a CONT (continuation handle), or a CLOSURE (closure handle), each with its tag. The 32-bit width deliberately exceeds the 16-bit native data word; the headroom provides forward compatibility with a 32-bit successor ISA.

C.2 Tile ID assignments

Tile ID range  Use
-------------  ---
0x00 – 0x07    Cores (0–7)
0x10 – 0x17    UART devices (multiple)
0x18 – 0x1F    Keyboard interface devices
0x20 – 0x27    Timer devices
0x28 – 0x2F    Disk controllers
0x30 – 0x37    Network interfaces
0x38 – 0x3F    Reserved devices
0x40 – 0xFF    Reserved future use

The tile ID space is sparse; many of the device slots are unused in the toy. The space leaves room for adding devices without renumbering existing ones.

C.3 Per-device op codes

Each device interprets the op field according to its protocol. Op codes are device-private; the same op code means different things to different devices.

UART (0x10)

Op    Direction      Payload            Meaning
----  -------------  -----------------  -------
0x01  core→UART      byte (low 8 bits)  Transmit byte
0x02  UART→core      byte (low 8 bits)  Received byte
0x03  core→UART      config word        Configure baud rate, parity, stop bits
0x04  UART→core      status word        Status report (overrun, framing error, etc.)
0x0F  bidirectional  (none)             Identify (returns device type)

Keyboard Interface Device (KID, 0x18)

Op    Direction      Payload   Meaning
----  -------------  --------  -------
0x01  KID→core       scancode  Key event (high bit = release)
0x02  core→KID       core_id   Bind: future key events to this core
0x03  core→KID       (none)    Request current modifier state
0x0F  bidirectional  (none)    Identify

Timer (0x20)

Op    Direction      Payload       Meaning
----  -------------  ------------  -------
0x01  core→Timer     reload count  Set countdown reload value
0x02  core→Timer     mask flag     Enable/disable interrupt
0x03  Timer→core     core_id       Tick fired
0x04  core→Timer     core_id       Bind: future ticks to this core
0x0F  bidirectional  (none)        Identify

Inter-core (peer cores, 0x00-0x07)

Op    Direction  Payload                      Meaning
----  ---------  ---------------------------  -------
0x01  core→core  continuation handle          Migrate process to target core
0x02  core→core  request_id                   Steal request
0x03  core→core  continuation handle or null  Steal reply
0x04  core→core  message payload              Generic IPC message
0x05  core→core  barrier_id                   Barrier signal
0x0F  core→core  (none)                       Identify (core_id, status)

C.4 NoC ordering guarantees

Endpoint-to-endpoint ordering is guaranteed. Packets sent from tile A to tile B arrive in the order A sent them. This is implemented by the deterministic dimension-ordered routing (X first, then Y) used by the mesh router.

Cross-endpoint ordering is not guaranteed. Packets from A to C and from B to C may arrive at C in either relative order. This is fine for almost all uses; the few cases where it matters (e.g., distributed consensus) need higher-level sequencing built on top.

No silent drops in normal operation. The NoC fabric does not drop packets due to congestion; it applies backpressure. A sending tile that cannot make progress stalls its NoC output, and the stall propagates back through the sequencer pipeline as a blocked PERF instruction.

The exception: if a destination tile’s input FIFO is full and remains full for more than a configurable threshold of cycles, the packet may be dropped and a NocOverflow effect raised on the sender. This protects against pathological cases (a hung receiver) at the cost of relaxing the no-drop guarantee. Default threshold: 1024 cycles.

C.5 Effect table population

The effect table (256 entries indexed by (family, op)) maps an effect ID to a destination. Each entry is one of:

  • Unhandled — trap if performed
  • Local(handler_pc) — fall back to this PC if not in CAM (normal local handler)
  • Remote(dst_tile, dev_op) — emit a NoC packet to dst_tile with op dev_op

The effect table is in memory, at a kernel-controlled address. Each entry is 32 bits:

 31    28 27  26 25    20 19           0
 |-------|------|--------|-------------|
 |  tag  | rsvd |  op2   |    addr     |
 |  4b   |  2b  |   6b   |     20b     |

Tag  Variant    addr interpretation
---  ---------  -------------------
0x0  Unhandled  (ignored)
0x1  Local      handler_pc
0x2  Remote     dst_tile in low 8 bits, dev_op in op2
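
As a worked example of the encoding (using the UART binding shown below), the entry Remote { dst_tile = 0x10; dev_op = 0x01 } packs as:

(0x2 << 28) | (0x01 << 20) | 0x10  =  0x2010_0010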

The kernel populates the effect table at boot, after enumerating devices. For example, to bind the UART transmit effect:

unsafe {
  __effect_table_set
    (family=2, op=0)     (* UartTx *)
    (Remote { dst_tile = 0x10; dev_op = 0x01 })
}

C.6 Device enumeration protocol

At boot, the bootstrap core enumerates devices by sending an Identify packet (op 0x0F) to each candidate tile. Devices respond with a packet identifying their type and capabilities.

The Identify reply payload format:

 31          24 23          16 15             0
 |-------------|--------------|---------------|
 | device_kind |   version    | capabilities  |
 |     8b      |      8b      |      16b      |

device_kind  Type
-----------  ----
0x01         UART
0x02         KID
0x03         Timer
0x04         Disk
0x05         NIC
0x06         DMA controller
0xFF         Reserved/unknown

version is a per-device-type protocol version. capabilities is a per-device-type bitfield documenting optional features.

A tile that does not respond to Identify within a timeout is presumed absent. The bootstrap core builds a device map from the responses.

C.7 Effect family conventions

The 16 effect families are allocated as follows. Each family has 16 ops, for 256 total effect IDs.

Family     Use
---------  ---
0x0        Reserved for control plane (scheduler, lifecycle)
0x1        Memory (allocation, region operations)
0x2        UART and console I/O
0x3        Keyboard input
0x4        Timer and clock
0x5        Disk
0x6        Network
0x7        Inter-process communication
0x8        User-defined (general)
0x9 – 0xE  Reserved for future use
0xF        Hardware-originated effects (NoC arrival, etc.)

User code can perform effects in family 0x8. Library code typically defines its own effects in 0x8 (allocating ops by convention or by registration). Effects in 0x9–0xE may be used by future ISA revisions.

Family 0xF is the only family whose effects are originated by hardware. When the NoC interface receives a packet and surfaces it as a perform, the perform’s family is 0xF. This convention lets handlers distinguish hardware-driven effects from software-driven ones.

C.8 Boot ROM responsibilities

The boot ROM is the first code that runs on each core after reset. It performs the following sequence:

  1. Initialize core state. Set up the data stack pointer, the heap pointer, the handler CAM (empty), the effect table base register.

  2. Install bottom handlers. The boot ROM installs a catch-all no-op or panic handler for each family. The kernel will install real handlers later, but in the gap between boot ROM and kernel start, errant performs should not deadlock.

  3. For the bootstrap core (core 0): enumerate devices (send Identify to each candidate tile), build the device map, populate the effect table with default bindings for known devices.

  4. For non-bootstrap cores: wait for a release signal from core 0 (a NoC packet with op 0x06).

  5. Spawn the kernel main. Reify the kernel’s entry point as a process and invoke it. The kernel takes over from there.

The boot ROM is approximately 200 instructions in the current design. It is the only code that uses raw PERF instructions to specific device addresses; everything afterward goes through kernel-provided abstractions.

C.9 Open issues

A few protocol-level decisions are deferred:

  • Boot ROM placement. The boot ROM is in a memory region the kernel cannot overwrite. The exact address layout is implementation-defined.

  • Multicast NoC packets. Currently each NoC packet has a single destination. Some use cases (cache invalidation, barriers) would benefit from multicast. The packet format reserves bits for this but the implementation is deferred.

  • Priority and QoS. The flag field reserves bits for priority but the implementation uses none currently. A real version would prioritize control-plane packets over data-plane packets.

  • Encryption and authentication. None at present. A real system intending to run multiple mutually-distrusting tenants would need cryptographic guarantees on inter-tile traffic.

These are all surmountable; they are simply not in the current scope.

Appendix D: Glossary

Terms that get heavy use in this book, defined for quick reference.


Affine type. A type whose values can be used at most once. Stronger than ordinary types (which allow unbounded use), weaker than linear types (which require exactly one use). Rust's borrow checker is built on affine types. CML's continuations are linear in theory and affine in practice (the hardware traps a second resume).

Algebraic data type (ADT). A type built from products (records, tuples) and sums (tagged unions). The “algebra” refers to the fact that the type operators obey arithmetic identities (distribution, currying, etc.) that follow from cardinality.

Arena. A region of memory with a single, simple allocation strategy (typically bump allocation) and a single free operation that releases the whole arena. Used in the kernel for per-process or per-request memory management. See Region.

Boundary tagged. A representation strategy where values carry type information only at structural boundaries (function calls, message-passing), not while resident in registers. An alternative to verified-tagged; not chosen for this ISA.

Bottom handler. The outermost handler in a tower, responsible for terminating an effect’s bubble-up search. The kernel’s bottom handlers define what happens to effects that no inner handler caught.

Bump allocation. An allocation strategy where memory is allocated by incrementing a pointer through a contiguous region. Deallocation is bulk-only (release the whole region). Used by the kernel’s local arena handler.

CAM (content-addressable memory). A hardware data structure that returns matches for a given key in parallel. The handler CAM in this ISA holds the most-recently-installed handlers, allowing single-cycle handler lookup for the common case.

Capability. A first-class value representing permission to do something — read from a device, allocate from a region, invoke a continuation. Capabilities are typically linear, so they cannot be forged or copied.

Cardinality. The count of distinct values of a type. bool has cardinality 2; int32 has cardinality 2^32; a product type’s cardinality is the product of its components’. The unifying principle behind ML’s type algebra.

CML. “Concrete ML” — the language designed in this book. A small ML with effect handlers, linear types, and a tight relationship with the ISA it compiles to.

Concretization. Gilbert Simondon’s notion of how a technical object becomes more “itself” over time, as its parts come to fit each other rather than fitting an idealized abstract function. The animating philosophy behind this book’s co-design approach.

Continuation. The rest of the computation, reified as a value. A continuation k represents “what would happen next if the value v were supplied”; invoke k v runs that next thing with that value.

Continuation passing style (CPS). A program transformation where every function takes its continuation as an explicit argument. The continuation is the function that consumes the function’s result. CML programs are transformed to a CPS-adjacent form during compilation, though most of the CPS structure is reified into the runtime continuation values rather than being explicit in the IR.

Effect (formal). A typed operation that user code can perform, dispatched to some surrounding handler. Distinct from “side effect” in the casual sense; an effect in this book’s vocabulary is a specific, declared operation.

Effect handler. Code that catches one or more effects, runs in response to a perform, and decides whether to resume the continuation. The kernel’s scheduler is an effect handler. The TCP stack is a stack of effect handlers.

Effect row. A type-level set of effects that a function may perform. Written as () / {IO, Mem} in CML: “returns unit, performs effects from IO and Mem.” Tracked by the type system to ensure handlers are installed where needed.

Effect table. A hardware-managed table (256 entries in this ISA) mapping effect IDs to default handlers or device destinations. The fallback when the handler CAM misses.

Family-and-op. The 8-bit effect ID in this ISA, structured as 4 bits of family plus 4 bits of op. Allows 16 families × 16 ops = 256 distinct effect IDs.

Frame. A unit of execution state — locals, a return address, a partial computation. Multiple frames form a frame tree. The kernel’s hybrid frame allocation puts most frames on a contiguous stack and PERF-captured frames on the heap.

Free theorem. A property of a polymorphic function deducible from its type alone, without reading its implementation. From parametricity. The classical example: a function of type 'a list -> 'a list cannot invent new elements; it can only rearrange or drop existing ones.

Handler CAM. The 8-entry content-addressable memory in this ISA that caches the most-recently-installed effect handlers. Single-cycle dispatch on hit; spill cost ~12× on miss.

Hindley-Milner. The classical type inference algorithm for ML languages. Infers principal types automatically, without annotations. CML uses a Hindley-Milner extension that includes effect rows.

Install-ID. A counter used by the ISA to disambiguate multiple installations of handlers for the same effect. When handlers nest, the most recently installed must take precedence; the install-ID is how the hardware identifies “most recent.”

Linear type. A type whose values must be used exactly once. Stronger than affine (which allows zero-or-one uses). Continuations in CML are linear: they cannot be invoked twice nor forgotten.

Match expression. Pattern-matching dispatch: match v with | pat1 -> e1 | pat2 -> e2 | .... ML’s primary control-flow primitive. Exhaustive by default — the compiler warns or errors if a case is missing.

Monomorphization. A compilation strategy where each polymorphic function is specialized for every type it is called with. Trades binary size for runtime efficiency. The alternative is uniform representation.

NoC (Network-on-Chip). The on-chip mesh fabric that carries packets between tiles (cores and devices). In this book’s design, the primary communication mechanism — most cross-tile operations are NoC packets, surfaced as effects.

Ontological minimality. A design principle, drawn from Stillwork’s philosophical doctrine: prefer the fewest, truest abstractions a domain genuinely requires. Related to Simondonian concretization.

Parametric polymorphism. A form of polymorphism where a function works for any type, with no information about the type available at runtime. ML’s 'a -> 'a is parametric. Distinct from ad-hoc polymorphism (which dispatches on type) and subtype polymorphism (which dispatches through inheritance).

Pattern matching. See Match expression.

PERF. The ISA instruction that performs an effect. Captures the continuation, searches the handler CAM and effect table, transfers control to a handler or sends a NoC packet.

Polymorphism. “Many shapes.” A function or type is polymorphic if it can be used with different concrete types. ML’s parametric polymorphism is the simplest form: a type variable 'a stands for any type.

Precise preemption. The property that a preempt can occur at any instruction boundary, leaving a clean state from which execution can resume coherently. This ISA commits to precise preemption.

Process. A unit of execution in the kernel — a continuation, an id, optionally a home core. Distinct from an OS process in conventional systems because there is no separate address space or kernel context.

Reify. Convert something abstract (a function-to-be-run, an effect to be performed later) into a first-class value (a continuation, a token). reify (fun () -> body ()) creates a continuation that, when invoked, runs body.

Region. A memory area with bulk allocation and bulk free. Region handlers manage allocation within their scope; on handler exit, the region is freed. Linear region capabilities make this safe.

Resume. The operation that delivers a value back to a perform site, continuing the computation that was suspended. resume k v invokes the continuation k with value v.

Row polymorphism. A type-system feature allowing functions to abstract over the extension of a row (record or effect set). A function with type 'a / {IO | r} -> 'a / {IO | r} works for any effect row that includes IO, leaving the other effects (r) abstract.

Soundness theorem. A statement that the type system rules out a class of errors. CML’s intended soundness theorem: well-typed programs do not fault on hardware tag checks. (Not yet formally proved; the design is structured to make such a proof tractable.)

Spill. When the handler CAM is full and a new handler must be installed, the least-recently-used CAM entry is spilled to memory. Subsequent dispatch to that effect costs the full memory-walk price (about 12× a CAM hit).

Sum type. A tagged union — a value is one of several alternatives, with a runtime tag indicating which. Pattern matching dispatches on the tag.

Tag (hardware). The 3-bit field carried with each value in this ISA, indicating its primitive type. INT, PTR, CONT, CLOSURE, UNIT, BOOL, and two reserved.

Tower (handler tower). The hierarchical stack of effect handlers in scope at any point. When a perform happens, the search starts at the top of the tower and walks down until it finds a matching handler.

Uniform representation. A compilation strategy where every value is represented as a single machine word, possibly a pointer to a heap-allocated object. Trades runtime efficiency for compactness and simpler compilation. The alternative is monomorphization. CML defaults to uniform representation, with monomorphization as an optimization.

Unit type. A type with exactly one value, (). Cardinality 1. The multiplicative identity in the type algebra. What void should have meant.

Unsafe. A scope marker indicating that the enclosed code may use ISA primitives without the language’s normal guarantees. Tracked as an effect (Unsafe) in the type system, so callers know they’re calling unsafe code. Required for inline assembly, bare pointer operations, etc.

Verified-tagged. The hardware representation strategy chosen for this ISA: values carry tags at runtime, and the ALU checks tags before operating. The tag check is verification (the compiler proves it should pass), not dispatch (no opcode varies based on tag). See chapter 6.

With (with handler in expr). The CML syntax for installing a handler over an expression. The handler is active for the duration of expr, catching any effects expr performs.

Acknowledgments and provenance

This book emerged from a series of extended conversations between William and Claude (an AI assistant made by Anthropic), conducted in May 2026, while William was simultaneously designing an effects-native ISA and the small ML that targets it. The book and the actual hardware/software project grew up together: design decisions made in the conversation got tested against compiler and emulator implementations built in parallel by other agents, and findings from those implementations flowed back into the book.

The structure of credit is therefore atypical. The voice of the book is largely the AI’s, since most of the prose was generated by the AI in response to William’s questions, redirections, and pushback. The direction of the book — what to investigate, what to defer, what to elaborate, what was missing — came from William, informed by his systems engineering background and the parallel work on the silicon and software side.

Specific things worth noting:

  • The ISA design started from a draft William produced before the book conversation began, and went through four revisions during the writing process. The final v0.4 specification in Appendix A is the synthesis of those revisions.
  • The language design started from a proposal by a separate compiler-implementation agent, and was revised twice during the writing. The final v0.2 specification in Appendix B incorporates findings from that agent’s implementation work.
  • The Ruby emulator that runs the example programs in this book was built by a third agent. Its existence is what makes the kernel chapters concrete rather than hypothetical.

A specific debt is owed to the broader effects-and-handlers research community — particularly Daan Leijen’s work on Koka, Andrej Bauer and Matija Pretnar’s work on Eff, and the OCaml 5 effects implementation team. The language design owes its row-polymorphism and handler syntax to these sources. Errors in implementation or pedagogy are not theirs.

Earlier predecessors of this design space deserve mention: the Lisp Machines for the radical proposition that hardware tags can be a load-bearing part of a system’s semantics; the BEAM virtual machine and its OTP libraries for the proof that millions of cheap continuations are not only possible but practical; Singularity OS for the idea that message-passing isolation can replace memory-protection isolation; seL4 for the proof that capability-based kernels can be formally verified. None of these projects is the same as what this book describes, but the book would not exist without their having existed first.

The mistakes are entirely mine, William’s, or some combination thereof.

— W.