Container Runtime Architecture
The container runtime provides Docker-inspired isolated execution using fishbowl’s existing kernel primitives. No sandbox code — isolation falls out of the architecture.
Design Principles
- Kernel-per-container. Each container gets its own createKernel() with independent process table, namespace, and srvFS. Signal isolation is free by construction.
- The namespace IS the container boundary. NamespaceSpec declares what the container sees. ProcessSpec declares what runs. The separation is the security model.
- The kernel doesn't know about containers. All container types live in runtime/container/ (not kernel/). The kernel provides primitives; the container runtime composes them.
- Everything is a file. Container lifecycle is controllable through containerFS, a Fileserver view over the ContainerManager.
Three-Layer Model
    ┌─────────────────────────────────────────────────────────┐
    │ ContainerSpec (types.ts, spec.ts)                       │
    │ Declarative: what the container is                      │
    │ specFromImage(), mergeSpecs()                           │
    ├─────────────────────────────────────────────────────────┤
    │ ContainerRuntime (runtime.ts)                           │
    │ Kernel factory: create(spec) → Container                │
    │ boot(image) for backward compat                         │
    ├─────────────────────────────────────────────────────────┤
    │ ContainerManager (runtime.ts)                           │
    │ Lifecycle + registry: list, get, destroy, run, fs()     │
    │ Extends ContainerRuntime                                │
    └─────────────────────────────────────────────────────────┘

Spec layer defines the contract. Pure data, immutable, serializable (except ImageRef.kind === 'live', which holds a live JS reference; kind === 'manifest' is the M2 serialization path).
Runtime layer is a kernel factory. create(spec) instantiates a ContainerImpl but doesn’t start it. boot(image) is the backward-compatible legacy path that delegates directly to bootFromImage().
Manager layer adds lifecycle registry. Tracks containers by name, enforces uniqueness, provides destroy() (stop + remove) and run() (create + start + wait + remove). Exposes fs() for the Plan 9 containerFS view.
ContainerSpec Type System
    ContainerSpec
    ├── name: string
    ├── labels?: Record<string, string>
    ├── namespace: NamespaceSpec
    │   ├── image: ImageRef (live | manifest)
    │   ├── mounts?: Record<string, MountSpec> (volume | bridge | fresh | none)
    │   ├── binds?: BindSpec[] (from, to, mode; applied AFTER mounts)
    │   ├── devices?: DeviceSpec (dev, proc, srv, user, tty)
    │   └── allow?: NamespacePermissions (mount, srv, spawn, signal)
    ├── process: ProcessSpec
    │   ├── bin: string (resolved at start() via resolveBin)
    │   ├── argv?: string[]
    │   ├── env?: Record<string, string>
    │   └── cwd?: string
    └── limits?: ResourceLimits
        ├── maxPids?: number (default: 256)
        ├── maxFds?: number (default: 1024)
        └── maxFsBytes?: number (default: unlimited)

All fields are readonly. Specs are immutable after creation.
specFromImage(image, overrides?) constructs a spec from a UnixImage. Env merge order: DEFAULT_ENV < ctx.env < overrides.process.env. If the image has services, bin defaults to 'init'; otherwise 'sh'.
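The merge order and the bin default can be sketched as follows. This is a minimal illustration, not the actual specFromImage() body: the contents of DEFAULT_ENV, the helper names mergeEnv and defaultBin, and the choice to treat an empty services list as "no services" are all assumptions.

```typescript
// Hypothetical stand-ins; the real DEFAULT_ENV and image context live in the runtime.
type EnvSketch = Readonly<Record<string, string>>;

const DEFAULT_ENV: EnvSketch = { PATH: "/bin", HOME: "/" }; // assumed values

// Later sources win: DEFAULT_ENV < ctx.env < overrides.process.env.
function mergeEnv(ctxEnv: EnvSketch = {}, overrideEnv: EnvSketch = {}): EnvSketch {
  return { ...DEFAULT_ENV, ...ctxEnv, ...overrideEnv };
}

// bin defaults to 'init' when the image declares services, otherwise 'sh'.
function defaultBin(services: readonly unknown[] | undefined): string {
  return services !== undefined && services.length > 0 ? "init" : "sh";
}

const mergedEnv = mergeEnv({ HOME: "/root", TERM: "xterm" }, { TERM: "dumb" });
```

Here PATH survives from the defaults, HOME comes from the image context, and TERM comes from the override, because each later spread overwrites earlier keys.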
mergeSpecs(base, override) deep-merges two specs. Namespace merges deeply (mounts merge, binds concatenate, devices/allow shallow-merge). Process merges shallowly.
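A rough sketch of these merge rules, using deliberately simplified shapes. The interface names and string-valued mounts are hypothetical; the real types live in types.ts. The sketch reads "process merges shallowly" as meaning an override env replaces the base env wholesale, which is an interpretation rather than something the text states outright.

```typescript
// Simplified, hypothetical shapes for illustration only.
interface NamespaceSketch {
  readonly mounts?: Readonly<Record<string, string>>;
  readonly binds?: readonly string[];
  readonly devices?: Readonly<Record<string, boolean>>;
}
interface ProcessSketch {
  readonly bin: string;
  readonly env?: Readonly<Record<string, string>>;
}
interface SpecSketch {
  readonly name: string;
  readonly namespace: NamespaceSketch;
  readonly process: ProcessSketch;
}
interface SpecOverrideSketch {
  readonly name?: string;
  readonly namespace?: NamespaceSketch;
  readonly process?: Partial<ProcessSketch>;
}

function mergeSpecsSketch(base: SpecSketch, override: SpecOverrideSketch): SpecSketch {
  const ns: NamespaceSketch = override.namespace ?? {};
  return {
    name: override.name ?? base.name,
    namespace: {
      mounts: { ...base.namespace.mounts, ...ns.mounts }, // merge by path
      binds: [...(base.namespace.binds ?? []), ...(ns.binds ?? [])], // concatenate
      devices: { ...base.namespace.devices, ...ns.devices }, // shallow merge
    },
    process: { ...base.process, ...override.process }, // shallow: env replaced as a whole
  };
}

const baseSpec: SpecSketch = {
  name: "web",
  namespace: { mounts: { "/tmp": "fresh" }, binds: ["/a:/b"] },
  process: { bin: "sh", env: { A: "1" } },
};
const mergedSpec = mergeSpecsSketch(baseSpec, {
  namespace: { mounts: { "/data": "volume" }, binds: ["/c:/d"] },
  process: { env: { B: "2" } },
});
```

In this example the merged spec keeps both mount paths and both binds (base first), keeps bin from the base, and carries only B in its env.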
MountSpec discriminated union:
- volume: mount an existing Fileserver directly
- bridge: create an importFS() proxy over a Channel (cross-boundary)
- fresh: mount a new empty memoryFS()
- none: explicitly remove an image-default mount at that path
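The four variants could be modeled and dispatched roughly as below. This is a sketch under assumptions: the field names server and channel, the placeholder Fileserver and Channel shapes, and the applyMount helper are hypothetical, and the labels merely stand in for real importFS() and memoryFS() instances.

```typescript
// Hypothetical placeholders for the real Fileserver and Channel types.
interface FileserverSketch { readonly label: string }
interface MountChannelSketch { readonly id: string }

type MountSpecSketch =
  | { readonly kind: "volume"; readonly server: FileserverSketch }
  | { readonly kind: "bridge"; readonly channel: MountChannelSketch }
  | { readonly kind: "fresh" }
  | { readonly kind: "none" };

// Applies one spec mount on top of the image-declared mounts.
function applyMount(
  mounts: Map<string, FileserverSketch>,
  path: string,
  spec: MountSpecSketch,
): void {
  switch (spec.kind) {
    case "volume":
      mounts.set(path, spec.server); // mount the existing server as-is
      return;
    case "bridge":
      mounts.set(path, { label: `importFS(${spec.channel.id})` }); // proxy stand-in
      return;
    case "fresh":
      mounts.set(path, { label: "memoryFS()" }); // new empty FS stand-in
      return;
    case "none":
      mounts.delete(path); // drop the image-default mount
      return;
    default: {
      const unreachable: never = spec; // compile-time exhaustiveness check
      throw new Error(`unknown mount kind: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The never assignment in the default branch makes the compiler flag any variant added to the union but not handled in the switch.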
Lifecycle State Machine
             create(spec)               start()
    (none) ──────────────── created ──────────────── running
                               │                        │
                               │ remove()               │ PID 1 exits / stop()
                               v                        v
                            removed                  stopped ──restart()──> running
                                                        │
                                                        │ remove()
                                                        v
                                                     removed

| Transition | Trigger | What happens |
|---|---|---|
| none → created | manager.create(spec) | ContainerImpl instantiated. No kernel yet. |
| created → running | container.start() | buildNamespace() → createContainerKernel() → spawn PID 1 → install spawn guard. |
| running → stopped | PID 1 exits / stop(timeout?) | SIGTERM → grace period (5s default) → SIGKILL survivors → cleanup. Record exit code. |
| stopped → running | restart() | Reset to created, clear kernel/ns references, call start(). Fresh upper overlay, new kernel, same spec. |
| created/stopped → removed | remove() | Call ns.dispose(), release all references. |
| created → created | start() fails | Error thrown, ns.dispose() called for cleanup, state unchanged. Retry allowed. |
PID 1 death = container death. When PID 1’s wait() resolves, the container transitions to stopped. PID 1 can be init, shell, or any bin.
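The transition table above can be encoded as a lookup, which makes illegal transitions explicit. This is an illustrative sketch, not the ContainerImpl code: the type names, the "exit" action standing in for PID 1 exiting, and the error message are assumptions, and the failed-start case (created stays created) is left out.

```typescript
type ContainerState = "created" | "running" | "stopped" | "removed";
type ContainerAction = "start" | "stop" | "exit" | "restart" | "remove";

// One row per source state; a missing entry means the transition is not allowed.
const TRANSITIONS: Readonly<Record<ContainerState, Partial<Record<ContainerAction, ContainerState>>>> = {
  created: { start: "running", remove: "removed" },
  running: { stop: "stopped", exit: "stopped" }, // exit = PID 1's wait() resolved
  stopped: { restart: "running", remove: "removed" },
  removed: {},
};

function nextState(state: ContainerState, action: ContainerAction): ContainerState {
  const target = TRANSITIONS[state][action];
  if (target === undefined) {
    throw new Error(`invalid transition: ${action} from ${state}`);
  }
  return target;
}
```

Under this encoding a running container cannot be removed directly, which matches the table: it has to reach stopped first.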
buildNamespace() — Shared Foundation
Both bootFromImage() (legacy path) and ContainerImpl.start() delegate namespace construction to buildNamespace() in runtime/platform/boot.ts. This eliminates ~150 lines of duplication and ensures a single source of truth for the mount order.
Signature:
    buildNamespace(ctx: BootContext, caps: PlatformCapabilities, opts?: NamespaceBuildOpts): Promise<InterpretedNamespace>

Mount order (load-bearing, HC-5):
| Step | What | Gate |
|---|---|---|
| 1 | overlayFS(memoryFS(), ctx.rootFs) at / | Always |
| 2 | Image-declared mounts (ctx.mounts) | Always |
| 3 | Spec mounts — dispatch by MountSpec kind | opts.specMounts |
| 4 | Legacy volumes | opts.volumes |
| 5 | devFS at /dev | devices.dev !== false |
| 6 | procFS at /proc (forward-ref kernel) | devices.proc !== false |
| 7 | consFS at /dev/cons via mountConsTerminal() | tty provided AND devices.tty === 'inherit' |
| 8 | userFS at /dev/user | Always |
| 9 | srvFS at /srv (callbacks gated by allow.srv) | devices.srv !== false |
| 10 | Catalog metadata → /pkg/.catalog/*/manifest.json | Catalog exists |
| 11 | Env merge: DEFAULT_ENV < ctx.env < opts.specEnv | Always |
| 12 | catalogInstall callback | Catalog exists |
| 13 | ensureCommonDirs | Always |
| 14 | pkg bin override (importEsm injection) | Always |
| 15 | registerExec callback | Always |
| 16 | Union binds | opts.binds |
| 17 | Build permissions bag + dispose() | Always |
Binds come last among the mount operations (step 16) because they alias names in the already-constructed namespace. Applying them before the mounts would risk aliasing a target that does not exist yet.
InterpretedNamespace holds the constructed state: mount map, process table, kernel log, upper FS, default env, srv callbacks, bridge proxies, permissions, and a dispose() method that aggregates cleanup (bridge disconnect + kernel log dispose).
Cross-Boundary Proxy
exportFS / importFS serialize the 10-method Fileserver protocol over any bidirectional Channel (matches MessagePort shape).
    Container A                      Container B
    ┌────────────┐      Channel      ┌────────────┐
    │ Fileserver │◄── exportFS()     │ importFS() │──► Fileserver (proxy)
    │ (real)     │     ────────►     │ (client)   │
    └────────────┘     messages      └────────────┘

Channel interface: { postMessage(msg, transfer?), onmessage }. No adapter is needed for MessagePort.
createMemoryChannel() produces a [Channel, Channel] pair using structuredClone + queueMicrotask for testing without Workers.
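A minimal sketch of how such a pair might be built from the two primitives named above. The names ChannelSketch, createMemoryChannelSketch, and deliver are hypothetical, and the optional transfer argument is omitted for brevity, so this is not the actual proxy.ts implementation.

```typescript
interface ChannelSketch {
  postMessage(msg: unknown): void;
  onmessage: ((event: { data: unknown }) => void) | null;
}

// Copies the message at post time, then delivers it asynchronously,
// roughly mimicking how a MessagePort behaves.
function deliver(target: ChannelSketch, msg: unknown): void {
  const copy = structuredClone(msg);
  queueMicrotask(() => target.onmessage?.({ data: copy }));
}

function createMemoryChannelSketch(): [ChannelSketch, ChannelSketch] {
  // Each end forwards to the other; the closures run only after both exist.
  const a: ChannelSketch = { onmessage: null, postMessage: (msg) => deliver(b, msg) };
  const b: ChannelSketch = { onmessage: null, postMessage: (msg) => deliver(a, msg) };
  return [a, b];
}
```

Cloning at post time means later mutation of the sender's object is not visible to the receiver, and deferring delivery to a microtask keeps postMessage from re-entering the receiver synchronously.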
Key behaviors:
- Server maintains fdMap translating integer proxy fds to opaque server fds
- Reads use Transferable ([data.buffer]) for zero-copy
- Disconnect → all pending promises reject with FsError('EPIPE', 'channel closed')
- ID-based request/response correlation (no ordering dependency)
- Concurrent same-fd writes don't corrupt: memoryFS operations are synchronous within a microtask (validated by prototype, Decision D7)
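The correlation and disconnect behaviors can be illustrated with a small client-side sketch. The class names, message shapes, and the pendingCount accessor are hypothetical; only the EPIPE code and the 'channel closed' message come from the list above.

```typescript
class FsErrorSketch extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}

interface PendingSketch {
  resolve(value: unknown): void;
  reject(err: Error): void;
}

class ProxyClientSketch {
  private nextId = 1;
  private readonly pending = new Map<number, PendingSketch>();
  private readonly send: (msg: { id: number; op: string }) => void;

  constructor(send: (msg: { id: number; op: string }) => void) {
    this.send = send;
  }

  request(op: string): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.send({ id, op });
    });
  }

  // Responses are matched by id, so arrival order does not matter.
  handleResponse(msg: { id: number; result: unknown }): void {
    const entry = this.pending.get(msg.id);
    if (entry === undefined) return; // unknown or already-settled id
    this.pending.delete(msg.id);
    entry.resolve(msg.result);
  }

  // On disconnect, every in-flight request fails with EPIPE.
  disconnect(): void {
    this.pending.forEach((entry) => entry.reject(new FsErrorSketch("EPIPE", "channel closed")));
    this.pending.clear();
  }

  get pendingCount(): number {
    return this.pending.size;
  }
}
```

Because each response carries the id of its request, a slow read cannot block or be confused with a later stat on the same channel.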
readonlyFS(inner) wraps any Fileserver to deny mutations (write, mkdir, remove, rename, wstat). Lives in kernel/fileservers/readonly.ts — general-purpose, not container-specific. Readonly is a mount-time decision: the same server can be mounted RW in one container and RO in another from the same export.
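The wrapper idea can be sketched on a cut-down interface. Everything here is a simplification: MiniFs has three synchronous methods rather than the real 10-method Fileserver protocol, the names are hypothetical, and the EROFS error text is an assumption about how a denial might be reported.

```typescript
// Hypothetical, cut-down stand-in for the real Fileserver interface.
interface MiniFs {
  read(path: string): string | undefined;
  write(path: string, data: string): void;
  remove(path: string): void;
}

function miniMemoryFs(): MiniFs {
  const files = new Map<string, string>();
  return {
    read: (path) => files.get(path),
    write: (path, data) => { files.set(path, data); },
    remove: (path) => { files.delete(path); },
  };
}

// Reads pass through to the inner server; every mutating method is denied.
function readonlySketch(inner: MiniFs): MiniFs {
  const deny = (op: string) => (): never => {
    throw new Error(`EROFS: ${op} denied on read-only mount`);
  };
  return {
    read: (path) => inner.read(path),
    write: deny("write"),
    remove: deny("remove"),
  };
}

// Same server, two views: read-write for one container, read-only for another.
const sharedFs = miniMemoryFs();
const rwView = sharedFs;
const roView = readonlySketch(sharedFs);
```

Writes made through rwView are visible through roView because both views delegate to the same underlying server; only the mount-time wrapper differs.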
Permission Enforcement
Three mechanisms, one security model: capability decided at construction, immutable thereafter.
| Permission | Mechanism | Where |
|---|---|---|
| mount/bind/unmount | Kernel permissions bag | createKernel({ permissions: { mount: false } }) — 3 guard clauses in proc/index.ts |
| spawn | Container wrapper | After PID 1 spawn in container.ts — wraps kernel.spawn with deny-all. Extends existing maxPids wrapper pattern. |
| srv | Callback omission | postServer/getServer/removeServer set to undefined on InterpretedNamespace when allow.srv === false. Construction-based — capability literally doesn’t exist. |
Why three mechanisms?
- Injected callbacks (srv): pluggable by design, omission is natural
- Kernel-native operations (mount/bind/unmount): implemented inside createKernel(), permissions bag avoids touching 15+ files
- Bootstrapped operations (spawn): PID 1 must be exempt, so enforcement happens after PID 1 spawn at the container level
Defaults: mount: false, srv: true, spawn: true, signal: true. Secure by default — namespace modification is opt-in.
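Two of the mechanisms, callback omission and the post-PID-1 spawn guard, can be sketched together with the defaults. The interface and function names are hypothetical, the kernel is reduced to a single spawn method, and the EPERM error text is an assumption.

```typescript
interface AllowSketch { mount: boolean; srv: boolean; spawn: boolean; signal: boolean }

const DEFAULT_ALLOW: Readonly<AllowSketch> = { mount: false, srv: true, spawn: true, signal: true };

function resolveAllow(allow: Partial<AllowSketch> = {}): AllowSketch {
  return { ...DEFAULT_ALLOW, ...allow };
}

// Callback omission: when srv is denied, the capability is simply not constructed.
interface SrvCallbacksSketch { postServer?: (name: string) => void }

function buildSrvCallbacks(allowSrv: boolean, posted: string[]): SrvCallbacksSketch {
  if (!allowSrv) return {};
  return { postServer: (name) => { posted.push(name); } };
}

// Spawn guard: installed only after PID 1 exists, so PID 1 itself is exempt.
interface KernelSketch { spawn(bin: string): number }

function installSpawnGuard(kernel: KernelSketch): void {
  kernel.spawn = (bin) => {
    throw new Error(`EPERM: spawn of ${bin} denied`);
  };
}
```

In both cases the decision is made once at construction time; nothing in the running container can flip the flag afterwards.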
containerFS — Plan 9 Orchestration Interface
containerFS(manager) returns a Fileserver view over a ContainerManager:
    /                → readdir lists container names
    /<name>/         → readdir: status, spec, ctl
    /<name>/status   → read: "<state> [exit=<code>]\n"
    /<name>/spec     → read: JSON-serialized ContainerSpec
    /<name>/ctl      → write: "stop", "start", "restart", "remove"

Key design:
- Snapshot-at-open for status/spec: content captured when fd is opened, reads are consistent
- Write-buffer-flush-on-close for ctl: commands accumulate, execute on newline or close
- ctl is synchronous within write: echo stop > ctl blocks until the container actually stops
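The ctl buffering rule (accumulate, then execute on newline or close) can be sketched as a small line buffer. This is an illustration only: the class name, the synchronous execute callback, and rejecting unknown commands with an error are assumptions, and the real implementation waits for the container to actually stop.

```typescript
type CtlCommand = "stop" | "start" | "restart" | "remove";
const CTL_COMMANDS: readonly string[] = ["stop", "start", "restart", "remove"];

class CtlBufferSketch {
  private buffer = "";
  private readonly execute: (cmd: CtlCommand) => void;

  constructor(execute: (cmd: CtlCommand) => void) {
    this.execute = execute;
  }

  // Appends the chunk and runs every complete (newline-terminated) line.
  write(chunk: string): void {
    this.buffer += chunk;
    let newline = this.buffer.indexOf("\n");
    while (newline !== -1) {
      const line = this.buffer.slice(0, newline);
      this.buffer = this.buffer.slice(newline + 1);
      this.run(line);
      newline = this.buffer.indexOf("\n");
    }
  }

  // Flushes a trailing command that was never newline-terminated.
  close(): void {
    const rest = this.buffer;
    this.buffer = "";
    this.run(rest);
  }

  private run(line: string): void {
    const cmd = line.trim();
    if (cmd === "") return;
    if (CTL_COMMANDS.indexOf(cmd) === -1) {
      throw new Error(`unknown ctl command: ${cmd}`);
    }
    this.execute(cmd as CtlCommand);
  }
}
```

Buffering by line means a command split across two write calls still executes exactly once, and a command written without a trailing newline runs when the fd is closed.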
Mounting: The host mounts manager.fs() at a chosen path (e.g., /containers) via volumes at boot time:
    const host = await runtime.boot(image, { volumes: { '/containers': manager.fs() } })

File Layout
    packages/core/src/runtime/container/
      types.ts         - ContainerSpec, Container, ContainerRuntime, ContainerManager
      spec.ts          - specFromImage(), mergeSpecs()
      container.ts     - ContainerImpl (lifecycle state machine)
      runtime.ts       - ContainerRuntimeImpl, ContainerManagerImpl, factory functions
      proxy.ts         - exportFS, importFS, Channel, createMemoryChannel
      container-fs.ts  - containerFS Fileserver factory
      index.ts         - re-exports

    packages/core/src/kernel/fileservers/readonly.ts - readonlyFS() wrapper
    packages/core/src/runtime/platform/boot.ts - buildNamespace(), InterpretedNamespace

M2 Touch Points
The container runtime was designed with M2 (Image Format & Distribution) in mind:
- ImageRef.kind === 'manifest' is a discriminated union variant that exists in the type system but throws "not yet supported" at runtime. M2 implements this path.
- ImageManifest is sketched in types.ts: bins, env, services, layerDigests. M2 fills in the serialization format.
- buildNamespace() is the reconstruction path: deserialize manifest → build BootContext → buildNamespace() → createContainerKernel().
- overlayFS layer model is the caching primitive: frozen rootFs layers are already shared across containers.
Architectural Record
Design decisions, divergence resolutions, and prototype findings are documented in:
- Round 1 baseline: docs/product/sketches/hackathon/step1_baseline.md (35 locked decisions)
- Round 2 baseline: docs/product/sketches/hackathon/container_gaps/step1_baseline.md (18 locked decisions)
- Retros: docs/product/sketches/hackathon/retro.md, docs/product/sketches/hackathon/container_gaps/retro.md