Container Runtime Architecture

The container runtime provides Docker-inspired isolated execution using fishbowl’s existing kernel primitives. No sandbox code — isolation falls out of the architecture.

Design Principles

  1. Kernel-per-container. Each container gets its own createKernel() with independent process table, namespace, and srvFS. Signal isolation is free by construction.
  2. The namespace IS the container boundary. NamespaceSpec declares what the container sees. ProcessSpec declares what runs. The separation is the security model.
  3. The kernel doesn’t know about containers. All container types live in runtime/container/ (not kernel/). The kernel provides primitives; the container runtime composes them.
  4. Everything is a file. Container lifecycle is controllable through containerFS — a Fileserver view over the ContainerManager.

Three-Layer Model

┌─────────────────────────────────────────────────────────┐
│ ContainerSpec (types.ts, spec.ts) │
│ Declarative — what the container is │
│ specFromImage(), mergeSpecs() │
├─────────────────────────────────────────────────────────┤
│ ContainerRuntime (runtime.ts) │
│ Kernel factory — create(spec) → Container │
│ boot(image) for backward compat │
├─────────────────────────────────────────────────────────┤
│ ContainerManager (runtime.ts) │
│ Lifecycle + registry — list, get, destroy, run, fs() │
│ Extends ContainerRuntime │
└─────────────────────────────────────────────────────────┘

Spec layer defines the contract. Specs are pure data: immutable and serializable, with one exception. An ImageRef with kind === 'live' holds a live JS reference and cannot be serialized; kind === 'manifest' is the M2 serialization path.

Runtime layer is a kernel factory. create(spec) instantiates a ContainerImpl but doesn’t start it. boot(image) is the backward-compatible legacy path that delegates directly to bootFromImage().

Manager layer adds lifecycle registry. Tracks containers by name, enforces uniqueness, provides destroy() (stop + remove) and run() (create + start + wait + remove). Exposes fs() for the Plan 9 containerFS view.
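The registry behavior of the manager layer can be sketched as follows. This is a minimal illustration, not the actual ContainerManagerImpl: the names ManagerSketch and ContainerSketch are hypothetical, the methods are synchronous for brevity, and containers are keyed by name only.

```typescript
type StateSketch = "created" | "running" | "stopped" | "removed";

// Hypothetical stand-in for a real Container; it only tracks name and state.
interface ContainerSketch {
  readonly name: string;
  state: StateSketch;
}

class ManagerSketch {
  private readonly containers = new Map<string, ContainerSketch>();

  // Names are unique: creating a second container with the same name fails.
  create(name: string): ContainerSketch {
    if (this.containers.has(name)) {
      throw new Error(`container name already in use: ${name}`);
    }
    const container: ContainerSketch = { name, state: "created" };
    this.containers.set(name, container);
    return container;
  }

  list(): string[] {
    return Array.from(this.containers.keys());
  }

  get(name: string): ContainerSketch | undefined {
    return this.containers.get(name);
  }

  // destroy = stop (if running) + remove, then drop the registry entry.
  destroy(name: string): void {
    const container = this.containers.get(name);
    if (container === undefined) {
      throw new Error(`no such container: ${name}`);
    }
    if (container.state === "running") {
      container.state = "stopped";
    }
    container.state = "removed";
    this.containers.delete(name);
  }
}
```

Because destroy() drops the registry entry, the name becomes available again for a later create().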

ContainerSpec Type System

ContainerSpec
├── name: string
├── labels?: Record<string, string>
├── namespace: NamespaceSpec
│ ├── image: ImageRef (live | manifest)
│ ├── mounts?: Record<string, MountSpec> (volume | bridge | fresh | none)
│ ├── binds?: BindSpec[] (from, to, mode — applied AFTER mounts)
│ ├── devices?: DeviceSpec (dev, proc, srv, user, tty)
│ └── allow?: NamespacePermissions (mount, srv, spawn, signal)
├── process: ProcessSpec
│ ├── bin: string (resolved at start() via resolveBin)
│ ├── argv?: string[]
│ ├── env?: Record<string, string>
│ └── cwd?: string
└── limits?: ResourceLimits
    ├── maxPids?: number (default: 256)
    ├── maxFds?: number (default: 1024)
    └── maxFsBytes?: number (default: unlimited)

All fields are readonly. Specs are immutable after creation.

specFromImage(image, overrides?) constructs a spec from a UnixImage. Env merge order: DEFAULT_ENV < ctx.env < overrides.process.env. If the image has services, bin defaults to 'init'; otherwise 'sh'.
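The env precedence and the bin default can be illustrated with a small sketch. The shapes ImageSketch and ProcessOverrides, the function name processFromImage, and the DEFAULT_ENV values are assumptions made for the example; the source does not list the real contents of DEFAULT_ENV. It also assumes an explicit override of bin takes priority over the default.

```typescript
type Env = Readonly<Record<string, string>>;

// Hypothetical, trimmed-down view of a UnixImage.
interface ImageSketch {
  readonly env?: Env;
  readonly services?: readonly string[];
}

interface ProcessOverrides {
  readonly bin?: string;
  readonly env?: Env;
}

// Placeholder values, assumed for illustration only.
const DEFAULT_ENV: Env = { PATH: "/bin", HOME: "/" };

function processFromImage(image: ImageSketch, overrides: ProcessOverrides = {}) {
  // Later spreads win: DEFAULT_ENV < image env < override env.
  const env: Env = { ...DEFAULT_ENV, ...image.env, ...overrides.env };
  // An image with services boots 'init'; otherwise it gets a shell.
  const hasServices = (image.services?.length ?? 0) > 0;
  const bin = overrides.bin ?? (hasServices ? "init" : "sh");
  return { bin, env };
}
```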

mergeSpecs(base, override) deep-merges two specs. Namespace merges deeply (mounts merge, binds concatenate, devices/allow shallow-merge). Process merges shallowly.
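The merge rules can be sketched over a reduced spec shape. SpecSketch, NamespaceSketch, SpecOverride, and mergeSpecsSketch are hypothetical names; mounts and binds are modeled as plain strings here, whereas the real spec uses MountSpec and BindSpec values.

```typescript
interface NamespaceSketch {
  readonly mounts?: Readonly<Record<string, string>>;
  readonly binds?: readonly string[];
  readonly devices?: Readonly<Record<string, boolean>>;
  readonly allow?: Readonly<Record<string, boolean>>;
}

interface ProcessSketch {
  readonly bin: string;
  readonly argv?: readonly string[];
}

interface SpecSketch {
  readonly name: string;
  readonly namespace: NamespaceSketch;
  readonly process: ProcessSketch;
}

interface SpecOverride {
  readonly name?: string;
  readonly namespace?: NamespaceSketch;
  readonly process?: Partial<ProcessSketch>;
}

function mergeSpecsSketch(base: SpecSketch, over: SpecOverride): SpecSketch {
  const ns = over.namespace ?? {};
  return {
    name: over.name ?? base.name,
    namespace: {
      // Mounts merge by path; an override entry replaces the base entry.
      mounts: { ...base.namespace.mounts, ...ns.mounts },
      // Binds concatenate, base first, so their order is preserved.
      binds: [...(base.namespace.binds ?? []), ...(ns.binds ?? [])],
      // Devices and allow are shallow-merged key by key.
      devices: { ...base.namespace.devices, ...ns.devices },
      allow: { ...base.namespace.allow, ...ns.allow },
    },
    // Process is shallow-merged: argv is replaced as a whole, not appended.
    process: { ...base.process, ...over.process },
  };
}
```

Neither input is mutated, which is consistent with specs being immutable after creation.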

MountSpec discriminated union:

  • volume — mount an existing Fileserver directly
  • bridge — create an importFS() proxy over a Channel (cross-boundary)
  • fresh — mount a new empty memoryFS()
  • none — explicitly remove an image-default mount at that path
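A discriminated union like this is typically consumed with an exhaustive switch. The sketch below assumes field names (fs, channel) and placeholder Fileserver and Channel types; the source names only the four kinds, so everything else is illustrative.

```typescript
// Placeholder types; the real Fileserver and Channel interfaces are richer.
interface Fileserver {
  readonly label: string;
}
interface Channel {
  postMessage(msg: unknown): void;
}

// Field names are assumptions made for this example.
type MountSpec =
  | { readonly kind: "volume"; readonly fs: Fileserver }
  | { readonly kind: "bridge"; readonly channel: Channel }
  | { readonly kind: "fresh" }
  | { readonly kind: "none" };

function applyMount(mounts: Map<string, Fileserver>, path: string, spec: MountSpec): void {
  switch (spec.kind) {
    case "volume":
      mounts.set(path, spec.fs); // mount the existing server as-is
      break;
    case "bridge":
      mounts.set(path, { label: "importFS proxy" }); // stands in for importFS(spec.channel)
      break;
    case "fresh":
      mounts.set(path, { label: "memoryFS" }); // stands in for a new memoryFS()
      break;
    case "none":
      mounts.delete(path); // drop an image-default mount at this path
      break;
    default: {
      // Compile-time exhaustiveness check: adding a new kind breaks this line.
      const unreachable: never = spec;
      throw new Error(`unknown mount kind: ${JSON.stringify(unreachable)}`);
    }
  }
}
```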

Lifecycle State Machine

         create(spec)               start()
(none) ──────────────── created ──────────────── running
                           │                        │
                           │ remove()               │ PID 1 exits / stop()
                           v                        v
                        removed                  stopped ──restart()──> running
                                                    │ remove()
                                                    v
                                                 removed

| Transition | Trigger | What happens |
| --- | --- | --- |
| none → created | manager.create(spec) | ContainerImpl instantiated. No kernel yet. |
| created → running | container.start() | buildNamespace() → createContainerKernel() → spawn PID 1 → install spawn guard. |
| running → stopped | PID 1 exits / stop(timeout?) | SIGTERM → grace period (5s default) → SIGKILL survivors → cleanup. Record exit code. |
| stopped → running | restart() | Reset to created, clear kernel/ns references, call start(). Fresh upper overlay, new kernel, same spec. |
| created/stopped → removed | remove() | Call ns.dispose(), release all references. |
| created → created | start() fails | Error thrown, ns.dispose() called for cleanup, state unchanged. Retry allowed. |

PID 1 death = container death. When PID 1’s wait() resolves, the container transitions to stopped. PID 1 can be init, shell, or any bin.
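The legal transitions can be captured in a lookup table, which makes invalid calls (for example remove() on a running container) fail loudly. This is an illustrative sketch, not the ContainerImpl code; the action names and the next() helper are hypothetical.

```typescript
type State = "created" | "running" | "stopped" | "removed";
type Action = "start" | "stop" | "pid1Exit" | "restart" | "remove";

// One row per state; an action that is missing from a row is illegal there.
const TRANSITIONS: Readonly<Record<State, Partial<Record<Action, State>>>> = {
  created: { start: "running", remove: "removed" },
  running: { stop: "stopped", pid1Exit: "stopped" },
  stopped: { restart: "running", remove: "removed" },
  removed: {},
};

function next(state: State, action: Action): State {
  const target = TRANSITIONS[state][action];
  if (target === undefined) {
    throw new Error(`invalid transition: ${action} from ${state}`);
  }
  return target;
}
```

A failed start() is not modeled as a transition here: the container simply stays in created, which is why a retry is allowed.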

buildNamespace() — Shared Foundation

Both bootFromImage() (legacy path) and ContainerImpl.start() delegate namespace construction to buildNamespace() in runtime/platform/boot.ts. This eliminates ~150 lines of duplication and ensures a single source of truth for the mount order.

Signature:

buildNamespace(ctx: BootContext, caps: PlatformCapabilities, opts?: NamespaceBuildOpts): Promise<InterpretedNamespace>

Mount order (load-bearing, HC-5):

| Step | What | Gate |
| --- | --- | --- |
| 1 | overlayFS(memoryFS(), ctx.rootFs) at / | Always |
| 2 | Image-declared mounts (ctx.mounts) | Always |
| 3 | Spec mounts, dispatched by MountSpec kind | opts.specMounts |
| 4 | Legacy volumes | opts.volumes |
| 5 | devFS at /dev | devices.dev !== false |
| 6 | procFS at /proc (forward-ref kernel) | devices.proc !== false |
| 7 | consFS at /dev/cons via mountConsTerminal() | tty provided AND devices.tty === 'inherit' |
| 8 | userFS at /dev/user | Always |
| 9 | srvFS at /srv (callbacks gated by allow.srv) | devices.srv !== false |
| 10 | Catalog metadata → /pkg/.catalog/*/manifest.json | Catalog exists |
| 11 | Env merge: DEFAULT_ENV < ctx.env < opts.specEnv | Always |
| 12 | catalogInstall callback | Catalog exists |
| 13 | ensureCommonDirs | Always |
| 14 | pkg bin override (importEsm injection) | Always |
| 15 | registerExec callback | Always |
| 16 | Union binds | opts.binds |
| 17 | Build permissions bag + dispose() | Always |

Binds are last (step 16) because they alias names in the already-constructed namespace. If they were applied before the mounts, a bind could refer to a path that does not exist yet.
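The ordering constraint can be demonstrated with a toy mount table. The semantics assumed here (a bind makes `to` an alias for whatever is mounted at `from`, and fails if nothing is there) are a simplification, and bindSketch, buildSketch, and the "pkgFS" label are hypothetical.

```typescript
// Placeholder mount table: path -> label of the server mounted there.
type MountTable = Map<string, string>;

// Assumed semantics: `to` becomes an alias for whatever is mounted at `from`.
function bindSketch(table: MountTable, from: string, to: string): void {
  const source = table.get(from);
  if (source === undefined) {
    throw new Error(`bind: nothing mounted at ${from}`);
  }
  table.set(to, source);
}

function buildSketch(order: "mounts-first" | "binds-first"): MountTable {
  const table: MountTable = new Map();
  const mounts = (): void => {
    table.set("/pkg/bin", "pkgFS");
  };
  const binds = (): void => {
    bindSketch(table, "/pkg/bin", "/bin");
  };
  if (order === "mounts-first") {
    mounts();
    binds(); // succeeds: /pkg/bin is already mounted
  } else {
    binds(); // throws: /pkg/bin is not mounted yet
    mounts();
  }
  return table;
}
```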

InterpretedNamespace holds the constructed state: mount map, process table, kernel log, upper FS, default env, srv callbacks, bridge proxies, permissions, and a dispose() method that aggregates cleanup (bridge disconnect + kernel log dispose).

Cross-Boundary Proxy

exportFS / importFS serialize the 10-method Fileserver protocol over any bidirectional Channel (matches MessagePort shape).

Container A Container B
┌──────────┐ Channel ┌──────────┐
│ Fileserver│◄──exportFS() │importFS()│──► Fileserver (proxy)
│ (real) │ ────────► │ (client) │
└──────────┘ messages └──────────┘

Channel interface: { postMessage(msg, transfer?), onmessage } — zero adapter needed for MessagePort.

createMemoryChannel() produces a [Channel, Channel] pair using structuredClone + queueMicrotask for testing without Workers.

Key behaviors:

  • Server maintains fdMap translating integer proxy fds to opaque server fds
  • Reads use Transferable ([data.buffer]) for zero-copy
  • Disconnect → all pending promises reject with FsError('EPIPE', 'channel closed')
  • ID-based request/response correlation (no ordering dependency)
  • Concurrent same-fd writes don’t corrupt — memoryFS operations are synchronous within a microtask (validated by prototype, Decision D7)
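Two of the behaviors above, the in-memory channel pair and ID-based correlation with EPIPE on disconnect, can be sketched together. This is an illustration rather than the proxy.ts code: the message shape ({ id, op, arg } and { id, result }), the FsError field layout, and the helper names are assumptions, and transfer lists are left out.

```typescript
// Minimal Channel shape from the text; the event type is simplified.
interface Channel {
  postMessage(msg: unknown): void;
  onmessage: ((ev: { data: unknown }) => void) | null;
}

// Hypothetical stand-in for the real FsError class.
class FsError extends Error {
  code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}

function createMemoryChannelSketch(): [Channel, Channel] {
  function deliver(to: Channel, msg: unknown): void {
    const data = structuredClone(msg); // copy the payload, as a real port would
    queueMicrotask(() => to.onmessage?.({ data })); // deliver asynchronously
  }
  const a: Channel = { postMessage: (m) => deliver(b, m), onmessage: null };
  const b: Channel = { postMessage: (m) => deliver(a, m), onmessage: null };
  return [a, b];
}

interface Pending {
  resolve(value: unknown): void;
  reject(reason: Error): void;
}

function createClientSketch(ch: Channel) {
  let nextId = 0;
  const pending = new Map<number, Pending>();
  ch.onmessage = (ev) => {
    const { id, result } = ev.data as { id: number; result: unknown };
    const entry = pending.get(id); // match the reply to its request by id
    pending.delete(id);
    entry?.resolve(result);
  };
  return {
    call(op: string, arg: unknown): Promise<unknown> {
      const id = nextId++;
      return new Promise<unknown>((resolve, reject) => {
        pending.set(id, { resolve, reject });
        ch.postMessage({ id, op, arg });
      });
    },
    // On disconnect, every in-flight request is rejected with EPIPE.
    disconnect(): void {
      pending.forEach((entry) => entry.reject(new FsError("EPIPE", "channel closed")));
      pending.clear();
    },
  };
}
```

Because replies are matched by id, the server is free to answer requests in any order.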

readonlyFS(inner) wraps any Fileserver to deny mutations (write, mkdir, remove, rename, wstat). Lives in kernel/fileservers/readonly.ts — general-purpose, not container-specific. Readonly is a mount-time decision: the same server can be mounted RW in one container and RO in another from the same export.
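A wrapper of this kind can be sketched in a few lines. FsSketch covers only a subset of the 10-method Fileserver protocol, the method signatures are invented for the example, and the EROFS error code is an assumption (the source does not say which code the real wrapper uses).

```typescript
// Hypothetical stand-in for the real FsError class.
class FsError extends Error {
  code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}

// Reduced, synchronous stand-in for the Fileserver interface.
interface FsSketch {
  read(path: string): string;
  write(path: string, data: string): void;
  mkdir(path: string): void;
  remove(path: string): void;
  rename(from: string, to: string): void;
  wstat(path: string, stat: Record<string, unknown>): void;
}

function readonlySketch(inner: FsSketch): FsSketch {
  // Every mutating method is replaced by a function that always throws.
  const deny = (op: string) => (): never => {
    throw new FsError("EROFS", `${op}: read-only file system`);
  };
  return {
    read: (path) => inner.read(path), // reads pass straight through
    write: deny("write"),
    mkdir: deny("mkdir"),
    remove: deny("remove"),
    rename: deny("rename"),
    wstat: deny("wstat"),
  };
}
```

The wrapper holds no state of its own, so the same inner server can still be mounted read-write elsewhere, and changes made there remain visible through the read-only view.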

Permission Enforcement

Three mechanisms, one security model: capability decided at construction, immutable thereafter.

| Permission | Mechanism | Where |
| --- | --- | --- |
| mount/bind/unmount | Kernel permissions bag | createKernel({ permissions: { mount: false } }), enforced by 3 guard clauses in proc/index.ts |
| spawn | Container wrapper | After PID 1 spawn in container.ts: wraps kernel.spawn with deny-all. Extends the existing maxPids wrapper pattern. |
| srv | Callback omission | postServer/getServer/removeServer are set to undefined on InterpretedNamespace when allow.srv === false. Construction-based: the capability literally doesn’t exist. |

Why three mechanisms?

  • Injected callbacks (srv): pluggable by design, omission is natural
  • Kernel-native operations (mount/bind/unmount): implemented inside createKernel(), permissions bag avoids touching 15+ files
  • Bootstrapped operations (spawn): PID 1 must be exempt, so enforcement happens after PID 1 spawn at the container level

Defaults: mount: false, srv: true, spawn: true, signal: true. Secure by default — namespace modification is opt-in.
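Two of the three mechanisms, callback omission and the post-PID-1 spawn guard, can be sketched as below. AllowSketch, SrvCallbacks, KernelSketch, and both helper functions are hypothetical; the real callbacks take different arguments, and the EPERM error text is an assumption.

```typescript
interface AllowSketch {
  readonly mount: boolean;
  readonly srv: boolean;
  readonly spawn: boolean;
  readonly signal: boolean;
}

// Defaults as stated above: namespace modification is opt-in.
const DEFAULT_ALLOW: AllowSketch = { mount: false, srv: true, spawn: true, signal: true };

interface SrvCallbacks {
  postServer?: (name: string) => void;
  getServer?: (name: string) => boolean;
  removeServer?: (name: string) => void;
}

// Callback omission: with allow.srv === false the functions are never created.
function buildSrvCallbacks(allow: AllowSketch, registry: Set<string>): SrvCallbacks {
  if (!allow.srv) return {};
  return {
    postServer: (name) => { registry.add(name); },
    getServer: (name) => registry.has(name),
    removeServer: (name) => { registry.delete(name); },
  };
}

interface KernelSketch {
  spawn(bin: string): number;
}

// Spawn guard: PID 1 is spawned first, then kernel.spawn is replaced with deny-all.
function startWithSpawnGuard(kernel: KernelSketch, bin: string, allow: AllowSketch): number {
  const pid1 = kernel.spawn(bin);
  if (!allow.spawn) {
    kernel.spawn = () => {
      throw new Error("EPERM: spawn not permitted");
    };
  }
  return pid1;
}
```

In both cases the decision is made once, at construction time, and nothing inside the container holds a reference that could undo it.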

containerFS — Plan 9 Orchestration Interface

containerFS(manager) returns a Fileserver view over a ContainerManager:

/ → readdir lists container names
/<name>/ → readdir: status, spec, ctl
/<name>/status → read: "<state> [exit=<code>]\n"
/<name>/spec → read: JSON-serialized ContainerSpec
/<name>/ctl → write: "stop", "start", "restart", "remove"

Key design:

  • Snapshot-at-open for status/spec: content captured when fd is opened, reads are consistent
  • Write-buffer-flush-on-close for ctl: commands accumulate, execute on newline or close
  • ctl is synchronous within write: echo stop > ctl blocks until the container actually stops
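The first two behaviors can be sketched with plain closures. This is a simplified, synchronous illustration: openStatusSketch and openCtlSketch are hypothetical helpers, the real files are served through the Fileserver protocol, and the status format assumes the exit=<code> suffix appears only once an exit code has been recorded.

```typescript
type Command = "stop" | "start" | "restart" | "remove";
const COMMANDS: readonly string[] = ["stop", "start", "restart", "remove"];

interface StatusInfo {
  readonly state: string;
  readonly exitCode?: number;
}

// Snapshot-at-open: the text is computed once, when the file is opened.
function openStatusSketch(getStatus: () => StatusInfo) {
  const info = getStatus();
  const text =
    info.exitCode === undefined ? `${info.state}\n` : `${info.state} exit=${info.exitCode}\n`;
  return { read: (): string => text };
}

// Write-buffer-flush-on-close: complete lines run at once, the tail runs on close.
function openCtlSketch(execute: (cmd: Command) => void) {
  let buffer = "";
  const run = (line: string): void => {
    const cmd = line.trim();
    if (cmd === "") return;
    if (!COMMANDS.includes(cmd)) throw new Error(`unknown ctl command: ${cmd}`);
    execute(cmd as Command);
  };
  return {
    write(chunk: string): void {
      buffer += chunk;
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? ""; // keep the unterminated tail for later
      lines.forEach(run);
    },
    close(): void {
      const rest = buffer;
      buffer = "";
      run(rest);
    },
  };
}
```

With this shape, a command split across two writes still executes exactly once, and a command written without a trailing newline runs when the fd is closed.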

Mounting: The host mounts manager.fs() at a chosen path (e.g., /containers) via volumes at boot time:

const host = await runtime.boot(image, { volumes: { '/containers': manager.fs() } })

File Layout

packages/core/src/runtime/container/
types.ts — ContainerSpec, Container, ContainerRuntime, ContainerManager
spec.ts — specFromImage(), mergeSpecs()
container.ts — ContainerImpl (lifecycle state machine)
runtime.ts — ContainerRuntimeImpl, ContainerManagerImpl, factory functions
proxy.ts — exportFS, importFS, Channel, createMemoryChannel
container-fs.ts — containerFS Fileserver factory
index.ts — re-exports
packages/core/src/kernel/fileservers/readonly.ts — readonlyFS() wrapper
packages/core/src/runtime/platform/boot.ts — buildNamespace(), InterpretedNamespace

M2 Touch Points

The container runtime was designed with M2 (Image Format & Distribution) in mind:

  • ImageRef.kind === 'manifest' is a discriminated union variant that exists in the type system but throws “not yet supported” at runtime. M2 implements this path.
  • ImageManifest is sketched in types.ts: bins, env, services, layerDigests. M2 fills in the serialization format.
  • buildNamespace() is the reconstruction path: deserialize manifest → build BootContext → buildNamespace() → createContainerKernel().
  • overlayFS layer model is the caching primitive — frozen rootFs layers are already shared across containers.

Architectural Record

Design decisions, divergence resolutions, and prototype findings are documented in:

  • Round 1 baseline: docs/product/sketches/hackathon/step1_baseline.md (35 locked decisions)
  • Round 2 baseline: docs/product/sketches/hackathon/container_gaps/step1_baseline.md (18 locked decisions)
  • Retros: docs/product/sketches/hackathon/retro.md, docs/product/sketches/hackathon/container_gaps/retro.md