ZXFoundation™ Development Guide

Document Revision: 26h1.0
Applies to: ZXFoundation™ release 26h1 and later
Status: Active development


About This Document

This guide is the primary technical reference for the ZXFoundation™ kernel and its associated toolchain. It is written for:

  • OS developers who wish to understand the z/Architecture boot and execution environment.
  • Kernel contributors who need a precise description of subsystem contracts and initialization order.
  • Integrators who want to load their own kernel or module using the ZXFL bootloader.

Familiarity with C23, ELF64, and general operating-system concepts is assumed. Background on IBM z/Architecture is provided in the Architecture chapter.


What Is ZXFoundation™?

ZXFoundation™ is a freestanding, SMP-capable kernel for IBM z/Architecture (s390x) mainframes and emulators. It is written in C23 and targets the s390x-unknown-none-elf ABI.

The project comprises three independently versioned components:

Component    Output artifact                                                Description
ZXFL         core.zxfoundationloader00.sys, core.zxfoundationloader01.sys   Two-stage bootloader
Nucleus      core.zxfoundation.nucleus                                      Kernel ELF64 image
Host tools   bin2rec, zxsign                                                Build-time utilities

All three are built from a single CMake project using a cross-compiler toolchain targeting s390x.


Version Scheme

Releases follow the scheme YYhN, where YY is the two-digit year and N is the half-year index (1 = first half, 2 = second half). The current release is 26h1.

The boot protocol carries its own version field (ZXFL_VERSION_*). A kernel must check this field and refuse to boot if the version is not one it understands.


Document Organization

Chapter        Contents
Architecture   z/Architecture fundamentals: PSW, DAT, CCW, IPL, paging
Bootloader     ZXFL design, stage descriptions, boot protocol
Kernel         Subsystem table, initialization sequence, memory management
Build System   CMake modules, toolchains, configuration variables
Host Tools     bin2rec and zxsign reference

Quick Start

# Configure with the Clang toolchain (recommended)
cmake -B build -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-clang.cmake

# Build everything
cmake --build build

# Generate the DASD image and launch Hercules
cmake --build build --target dasd
hercules -f build/hercules.cnf

In the Hercules console, issue ipl 0100 to start the boot sequence.

See Build System for full configuration options and Build Targets for a description of each output artifact.

Architecture Overview

Document Revision: 26h1.0
Reference: IBM z/Architecture Principles of Operation, SA22-7832


1. z/Architecture

z/Architecture (s390x) is IBM's 64-bit mainframe instruction set, introduced with the z900 in 2000. It supersedes ESA/390 (31-bit) and System/370 (24-bit). ZXFoundation™ targets z/Architecture exclusively; ESA/390 compatibility mode is used only during the first instruction of the IPL sequence.

Key properties that distinguish z/Architecture from commodity architectures:

  • All I/O is performed through the Channel Subsystem (CSS). There is no memory-mapped I/O.
  • The Program Status Word (PSW) encodes the instruction address, addressing mode, DAT enable, and all interrupt masks in a single 128-bit register.
  • The Lowcore at physical address 0x0 is the hardware-defined interrupt vector table with a fixed layout.
  • Inter-processor communication uses the SIGP instruction rather than memory-mapped registers or MSIs.
  • The STFLE instruction enumerates optional hardware facilities (analogous to CPUID on x86).

2. Program Status Word (PSW)

The PSW is 128 bits wide. It is loaded atomically by LPSWE and saved atomically on every interrupt.

Bits  0–63:  Mask word
  Bit  1:    PER enable
  Bit  5:    DAT enable
  Bit  6:    I/O interrupt mask
  Bit  7:    External interrupt mask
  Bit  14:   Wait state
  Bit  15:   Problem state (0=supervisor, 1=user)
  Bits 18–19: Condition Code
  Bit  31:   EA (Extended Addressing), must be 1 for 64-bit
  Bit  32:   BA (Basic Addressing), must be 1 for 64-bit

Bits 64–127: Instruction address (64-bit)

EA=1, BA=1 selects 64-bit addressing mode. SAM64 sets this without altering other PSW fields.

Disabled-wait PSW: All interrupt masks cleared, wait bit set. The CPU halts permanently. Used as the panic state.
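
The mask-word bits use IBM bit numbering, in which bit 0 is the most significant bit. The following is a minimal sketch of how such constants might be derived in C. The macro and function names are illustrative and are not taken from the ZXFoundation™ sources, and the wait-state bit position (bit 14) is taken from the Principles of Operation.

```c
#include <stdint.h>

/* IBM numbers PSW bits from the most significant end: bit 0 is the MSB. */
#define PSW_BIT(n)      (UINT64_C(1) << (63 - (n)))

/* Hypothetical names for some mask-word bits. */
#define PSW_MASK_PER    PSW_BIT(1)
#define PSW_MASK_DAT    PSW_BIT(5)
#define PSW_MASK_IO     PSW_BIT(6)
#define PSW_MASK_EXT    PSW_BIT(7)
#define PSW_MASK_WAIT   PSW_BIT(14)   /* assumption: position per SA22-7832 */

/* Mask word of a disabled-wait PSW: wait bit set, every interrupt mask
 * clear. Addressing-mode bits are deliberately left out of this sketch. */
static inline uint64_t psw_disabled_wait_mask(void)
{
    return PSW_MASK_WAIT;
}
```

A real panic path would combine this mask with the addressing-mode bits and an instruction address that identifies the failure.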

New PSWs: For each interrupt class (I/O, external, machine check, program, restart, SVC), the architecture reserves a fixed lowcore offset for a "new PSW" — the PSW loaded when that interrupt fires. The kernel must install valid new PSWs before enabling the corresponding interrupt class.


3. Lowcore (Prefix Area)

The lowcore is the 8 KB region at physical address 0x0. Its layout is fixed by the architecture.

Offset   Content
0x000    IPL PSW
0x008    IPL CCW1
0x010    IPL CCW2
0x0B8    Subchannel ID of IPL device
0x1A0    Restart new PSW
0x1B0    External new PSW
0x1C0    SVC new PSW
0x1D0    Program new PSW
0x1E0    Machine check new PSW
0x1F0    I/O new PSW

The prefix register (set by SPX, read by STPX) maps a per-CPU 8 KB physical block to the logical lowcore address 0x0. Each CPU has its own private lowcore; the BSP uses the block at physical address 0x0, APs use separately allocated blocks.


4. Channel Command Words (CCW) and I/O

All device I/O is performed through the Channel Subsystem. The CPU constructs a Channel Program — a linked list of CCWs — and submits it via SSCH (Start Subchannel).

CCW Format-1 (8 bytes)

Bits  0–7:   Command code  (0x02=Read, 0x01=Write, 0x04=Sense)
Bit   9:     Chain Command (CC): link to next CCW
Bits 16–31:  Byte count
Bits 33–63:  Channel Data Address (CDA): physical address of data buffer (bit 32 must be 0)

Critical constraint: The CDA field is 31 bits. All I/O data buffers must reside below physical address 0x80000000. ZONE_DMA covers [0, 16 MB), well inside this limit, so any buffer allocated from it is addressable by a CCW.
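
The 31-bit limit can be checked before a buffer address is placed in a CCW. The helper below is a minimal sketch of such a check; the function and macro names are hypothetical and do not come from the ZXFL sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* First physical address that a 31-bit CDA cannot reach. */
#define CCW_CDA_LIMIT UINT64_C(0x80000000)

/* Returns true when [pa, pa + len) lies entirely below the 2 GB line. */
static inline bool ccw_buffer_ok(uint64_t pa, uint64_t len)
{
    if (len == 0 || pa >= CCW_CDA_LIMIT)
        return false;
    /* Overflow-free form of: pa + len <= CCW_CDA_LIMIT */
    return len <= CCW_CDA_LIMIT - pa;
}
```

Comparing against the remaining space rather than computing `pa + len` avoids wrap-around for very large lengths.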

I/O Sequence

CPU                        Channel Subsystem
 │                              │
 ├─ SSCH (schid, ORB) ────────► │  Submit channel program
 │                              ├─ Execute CCW chain, transfer data
 │◄──────── I/O interrupt ──────┤  Subchannel status available
 ├─ TSCH (schid, IRB) ────────► │  Read Interrupt Response Block
 │◄──────── IRB ────────────────┤  Device status, residual count

5. Initial Program Load (IPL)

When the operator issues a LOAD command, the channel subsystem performs the following automatically:

  1. Reads the first physical record from the IPL device (ECKD: C=0, H=0, R=1) into physical address 0x0.
  2. The record contains an IPL PSW at 0x0 and two CCWs at 0x8/0x10.
  3. The CSS executes the CCW chain to load additional data.
  4. The CPU loads the IPL PSW and begins execution.

For ZXFL, the IPL PSW is a 31-bit ESA/390 PSW pointing to the Stage 0 entry. The first instruction switches to z/Architecture mode via SIGP SET ARCHITECTURE.


6. Dynamic Address Translation (DAT)

DAT is enabled by PSW bit 5. When on, every memory access is translated through the page table hierarchy rooted at the ASCE in CR1.

Address Space Control Element (ASCE)

The ASCE is a 64-bit value in CR1 encoding the physical address of the root table, the Designation Type (DT), and the Table Length (TL). ZXFoundation™ uses DT=11 (Region-First), selecting 5-level paging.

5-Level Page Table Hierarchy

Level    Name                 Entries  Coverage per entry
ASCE →   R1 (Region-First)    2048     8 PB
R1 →     R2 (Region-Second)   2048     4 TB
R2 →     R3 (Region-Third)    2048     2 GB
R3 →     Segment Table        2048     1 MB
Seg →    Page Table           256      4 KB

Each region table and segment table is 16 KB (2048 × 8 bytes). Each page table is 2 KB (256 × 8 bytes).

Virtual Address Decomposition (DT=11)

 63      53 52      42 41      31 30      20 19    12 11       0
 ┌────────┬──────────┬──────────┬──────────┬────────┬──────────┐
 │  RFX   │   RSX    │   RTX    │    SX    │   PX   │    BX    │
 │ 11 bit │  11 bit  │  11 bit  │  11 bit  │  8 bit │  12 bit  │
 └────────┴──────────┴──────────┴──────────┴────────┴──────────┘
   R1 idx   R2 idx    R3 idx    Seg idx    PT idx   Byte offset
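
The field boundaries in the diagram translate directly into shifts and masks. The sketch below splits a virtual address into its six fields; the type and function names are illustrative only.

```c
#include <stdint.h>

/* Index fields of a virtual address under a Region-First ASCE. */
typedef struct {
    unsigned rfx;  /* Region-First index,  11 bits (63..53) */
    unsigned rsx;  /* Region-Second index, 11 bits (52..42) */
    unsigned rtx;  /* Region-Third index,  11 bits (41..31) */
    unsigned sx;   /* Segment index,       11 bits (30..20) */
    unsigned px;   /* Page index,           8 bits (19..12) */
    unsigned bx;   /* Byte offset,         12 bits (11..0)  */
} va_fields_t;

static inline va_fields_t va_decompose(uint64_t va)
{
    va_fields_t f = {
        .rfx = (unsigned)((va >> 53) & 0x7FF),
        .rsx = (unsigned)((va >> 42) & 0x7FF),
        .rtx = (unsigned)((va >> 31) & 0x7FF),
        .sx  = (unsigned)((va >> 20) & 0x7FF),
        .px  = (unsigned)((va >> 12) & 0xFF),
        .bx  = (unsigned)(va & 0xFFF),
    };
    return f;
}
```

For example, the address 0xFFFF800000000000 decomposes to RFX = 0x7FF and RSX = 0x7E0, with all lower fields zero.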

Large Pages (EDAT)

Facility  STFLE bit  Page size  Mechanism
EDAT-1    8          1 MB       FC=1 in Segment Table Entry
EDAT-2    78         2 GB       FC=1 in Region-Third Entry

7. Virtual Address Space Layout

0x0000000000000000  User space (future)
        ...
0x00007FFFFFFFFFFF  User space top

        [ unmapped — translation exception ]

0xFFFF800000000000  HHDM base (CONFIG_KERNEL_VIRT_OFFSET)
                    Physical memory linearly mapped here.
                    PA 0x0 → VA 0xFFFF800000000000

0xFFFFC00000000000  vmalloc / ioremap region

0xFFFFFFFFFFFFFFFF  Top of address space

The HHDM offset is 0xFFFF800000000000. The bootloader builds this mapping before transferring control; all kernel pointers in the boot protocol are HHDM virtual addresses.


8. Physical Memory Zones

Zone         Range               Purpose
ZONE_DMA     [0, 16 MB)          Channel I/O buffers (31-bit CDA constraint)
ZONE_NORMAL  [16 MB, RAM limit)  General kernel allocations

9. Control Registers

Register  Purpose
CR0       I/O/external interrupt subclass masks, feature enables
CR1       Primary ASCE (page table root)
CR6       I/O interrupt subclass mask (extended)
CR14      Machine check interrupt mask

The bootloader saves CR0, CR1, and CR14 snapshots in the boot protocol so the kernel can inspect the handover state.

Bootloader Overview

Document Revision: 26h1.0


1. What Is ZXFL?

ZXFL (ZXFoundation™ Loader) is the two-stage bootloader for ZXFoundation™. It is the only supported mechanism for loading the kernel nucleus. Its responsibilities are:

  1. Transition the CPU from ESA/390 to z/Architecture 64-bit mode.
  2. Locate and load the kernel ELF64 image from DASD.
  3. Verify kernel integrity (ZXVL structural lock, handshake, SHA-256 checksums).
  4. Probe hardware: memory, CPUs, TOD clock, system identification.
  5. Build the 5-level page tables (identity map + HHDM).
  6. Populate the boot protocol structure.
  7. Transfer control to the kernel entry point with DAT enabled.

2. Two-Stage Design

The split is imposed by a hard architectural constraint: the IPL mechanism loads exactly one record from the IPL device into physical address 0x0 and executes it. That record must contain the IPL PSW and enough code to load a larger second stage.

Stage  Internal name  Dataset                        Load address  Size limit
0      zxfl_stage1    CORE.ZXFOUNDATIONLOADER00.SYS  0x0           12 KB
1      zxfl_stage2    CORE.ZXFOUNDATIONLOADER01.SYS  0x20000       ~512 KB

Stage 0 is a minimal DASD reader. Its only job is to find Stage 1 in the VTOC, load it to 0x20000, and jump to it.

Stage 1 is the full loader. It performs all hardware detection, ELF loading, integrity verification, page table construction, and the final jump to the kernel.


3. IPL Flow

Power-on / LOAD button
  │
  ▼
Channel subsystem reads IPL record (C=0, H=0, R=1) → 0x0
  │
  ▼
Stage 0  (arch/s390x/init/zxfl/stage1/)
  ├─ SIGP SET ARCHITECTURE → z/Architecture mode
  ├─ SAM64 → 64-bit addressing
  ├─ Clear BSS
  ├─ Find CORE.ZXFOUNDATIONLOADER01.SYS in VTOC
  ├─ Read it to 0x20000
  └─ Jump to 0x20000
       │
       ▼
Stage 1  (arch/s390x/init/zxfl/stage2/)
  ├─ Install disabled-wait new PSWs (lowcore)
  ├─ Clear BSS (MVCL)
  ├─ STFLE — detect facilities
  ├─ Probe IPL device (ECKD / FBA Sense ID + RDC)
  ├─ Read parmfile (ETC.ZXFOUNDATION.PARM)
  ├─ Find CORE.ZXFOUNDATION.NUCLEUS in VTOC
  ├─ Load ELF64 PT_LOAD segments to physical memory
  ├─ ZXVL: structural lock + handshake + SHA-256 checksums
  ├─ Probe memory (write-pattern test)
  ├─ Load sysmodule= modules
  ├─ Detect SMP (SIGP Sense), STSI, TOD (STCK)
  ├─ Build 5-level page tables (identity + HHDM)
  ├─ Translate all protocol pointers to HHDM virtual
  └─ LPSWE → kernel entry point (DAT on, interrupts masked)

4. Dataset Names

All datasets reside on the IPL DASD volume. Names follow the IBM MVS convention (uppercase, dot-separated, max 44 characters).

Dataset                        Contents
CORE.ZXFOUNDATIONLOADER00.SYS  Stage 0 IPL record
CORE.ZXFOUNDATIONLOADER01.SYS  Stage 1 flat binary
CORE.ZXFOUNDATION.NUCLEUS      Kernel ELF64
ETC.ZXFOUNDATION.PARM          Boot parameters (parmfile)

Additional datasets may be listed in the parmfile via sysmodule= entries.


5. Parmfile

The parmfile ETC.ZXFOUNDATION.PARM is a plain-text file read by Stage 1. Supported keys:

Key         Description                                   Default
syssize=    Memory probe limit in MB                      512
sysmodule=  Dataset name of an additional module to load  (none)

Multiple sysmodule= lines are permitted (up to 16).
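
The two keys are simple enough that a parser fits in a few lines. The sketch below is a host-side illustration only: it assumes one key=value pair per line, ASCII input, and a hosted C library, none of which is confirmed for the real Stage 1 parser, and all names in it are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define PARM_MAX_MODULES 16
#define PARM_NAME_MAX    44   /* dataset names are at most 44 characters */

typedef struct {
    unsigned long syssize_mb;      /* memory probe limit, default 512 */
    unsigned      module_count;    /* number of sysmodule= entries kept */
    char          modules[PARM_MAX_MODULES][PARM_NAME_MAX + 1];
} parm_t;

/* Parse a NUL-terminated parmfile image; unknown keys are ignored. */
static void parm_parse(const char *text, parm_t *out)
{
    out->syssize_mb = 512;
    out->module_count = 0;

    while (*text != '\0') {
        size_t len = strcspn(text, "\r\n");        /* length of this line */

        if (len > 8 && strncmp(text, "syssize=", 8) == 0) {
            out->syssize_mb = strtoul(text + 8, NULL, 10);
        } else if (len > 10 && strncmp(text, "sysmodule=", 10) == 0 &&
                   len - 10 <= PARM_NAME_MAX &&
                   out->module_count < PARM_MAX_MODULES) {
            char *dst = out->modules[out->module_count++];
            memcpy(dst, text + 10, len - 10);
            dst[len - 10] = '\0';
        }

        text += len;
        text += strspn(text, "\r\n");              /* skip line terminators */
    }
}
```

Entries beyond the sixteenth, and names longer than 44 characters, are silently dropped in this sketch; a production parser would more likely report them.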


6. Constraints

  • All CCW channel data addresses must be 31-bit (< 0x80000000). Static BSS buffers satisfy this automatically.
  • Stage 0 must fit within 12 KB (enforced by ASSERT in stage1.ld).
  • The Stage 1 stack is 32 KB. The kernel must switch to its own stack before consuming more than ~8 KB.
  • The kernel entry point must be ≥ 0xFFFF800000040000 (HHDM + 256 KB). The loader enforces this.

Stage 0

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/stage1/


1. Purpose

Stage 0 is the minimal IPL loader. It occupies the first record on the IPL DASD volume and is loaded by the channel subsystem into physical address 0x0. Its sole responsibility is to locate Stage 1 (CORE.ZXFOUNDATIONLOADER01.SYS) in the VTOC, read it to 0x20000, and jump to it.


2. Entry Point (head.S)

The channel subsystem loads the IPL record and executes the PSW at offset 0x0. This PSW is a 31-bit ESA/390 PSW pointing to stage1_entry.

The entry sequence:

stage1_entry:
  1. SIGP SET ARCHITECTURE (order 0x12) → switch to z/Architecture
     Retry with "restore PSWs" flag if first attempt fails.
  2. SAM64 → enable 64-bit addressing mode
  3. Clear BSS (byte loop — MVCL is unsafe before architecture switch)
  4. Set stack pointer to stage1_stack_top − 160
  5. Load schid from lowcore offset 0xB8
  6. Call zxfl00_entry(schid)
  7. Disabled-wait PSW (fallback — zxfl00_entry is [[noreturn]])

The 160-byte offset is the size of the standard register save area defined by the s390x ELF ABI.


3. Main Function (entry.c: zxfl00_entry)

Execution order:

  1. diag_setup() — flush any partial DIAG 8 output line.
  2. Print the Stage 0 banner via DIAG 8.
  3. dasd_find_dataset(schid, "CORE.ZXFOUNDATIONLOADER01.SYS", &ext) — locate Stage 1 in the VTOC.
  4. Read the dataset track-by-track into 0x20000 using dasd_read_next.
  5. Sanity-check: verify the loaded image is not a disabled-wait PSW.
  6. Jump to 0x20000 with schid in %r2.

4. Linker Script (stage1.ld)

Section    Address      Notes
.text.ipl  0x0          IPL PSW (8 bytes)
.text      0x58         Code (after lowcore reserved area)
.bss       after .text  Zero-initialized data

An ASSERT in the linker script enforces that the entire stage fits within 12 KB. The build will fail if this limit is exceeded.


5. Stack

An 8 KB static array in BSS. The stack pointer is initialized to stage1_stack_top − 160.


6. Shared Library (common/)

Stage 0 uses a subset of the shared common/ library:

Module       Purpose
dasd_io.c    Low-level CCW I/O (SSCH/TSCH)
dasd_vtoc.c  VTOC traversal and dataset lookup
diag.c       DIAG 8 console output
ebcdic.c     EBCDIC ↔ ASCII conversion
panic.c      Disabled-wait on fatal error
string.c     Minimal memcpy, memset, strcmp

Stage 1

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/stage2/


1. Purpose

Stage 1 is the full production loader. It is a flat binary linked at 0x20000, loaded there by Stage 0. It performs all hardware detection, kernel loading, integrity verification, page table construction, and the final transfer of control to the kernel.


2. Entry Point (entry.S: stage2_entry)

stage2_entry:
  1. Save schid from %r2 into a callee-saved register (%r13)
  2. Call zxfl_lowcore_setup() — install disabled-wait new PSWs
  3. SSM 0x00 — mask all interrupts off
  4. Clear BSS with MVCL (pad-fill mode, source length = 0)
  5. Set stack pointer to stage2_stack_top − 160
  6. Restore schid into %r2
  7. Call zxfl01_entry(schid)

SSM 0x00 is issued immediately after zxfl_lowcore_setup installs safe new PSWs. Any interrupt that fires during the loader will hit a known disabled-wait rather than garbage.

BSS is cleared with MVCL in pad-fill mode (source length = 0, pad byte = 0x00). This is safe in 64-bit mode and faster than a byte loop for large BSS sections.


3. Main Function (entry.c: zxfl01_entry)

Execution order:

Step  Action
1     STFLE: store facility list into proto.stfle_fac[]
2     CR setup: clear I/O, external, machine-check masks in CR0; zero CR6 and CR14
3     Device probe: probe_ipl_device() tries ECKD Sense ID first, then FBA; populates ipl_dev_type and ipl_dev_model
4     Parmfile: read ETC.ZXFOUNDATION.PARM; parse syssize=
5     Nucleus load: dasd_find_dataset_extents + zxfl_load_elf64
6     ZXVL: structural lock check, handshake, SHA-256 segment checksums
7     Memory probe: write-pattern test at 1 MB granularity up to syssize or 512 MB
8     Module loading: load each sysmodule= dataset as a flat binary after the kernel image
9     System detection: zxfl_system_detect runs STSI (manufacturer, model, LPAR), SIGP Sense (CPU map), STCK (TOD)
10    Protocol finalization: magic, version, binding token, stack canaries, CR snapshots
11    MMU + jump: zxfl_mmu_setup_and_jump builds page tables, translates pointers, LPSWE to kernel entry

4. Linker Script (stage2.ld)

The binary is linked at 0x20000 as a flat ELF. The post-build step strips it to a raw binary with objcopy -O binary.


5. Stack

A 32 KB static array in BSS. The kernel receives a pointer to the top of this stack in %r15 and in proto->kernel_stack_top (HHDM virtual). The kernel must switch to its own stack before consuming more than ~8 KB.


6. Shared Library (common/)

Stage 1 uses the full common/ library:

Module                                 Purpose
dasd_io.c                              Low-level CCW I/O
dasd_vtoc.c                            VTOC traversal
dasd_eckd.c                            ECKD device driver
dasd_fba.c                             FBA device driver
dasd_tape.c                            Tape device driver
elfload.c                              ELF64 segment loader
mmu.c                                  Bootloader page table builder
lowcore.c                              Lowcore / new PSW setup
zxvl_verify.c                          ZXVL integrity checks
parmfile.c                             Parmfile parser
stfle.c                                STFLE facility detection
system.c                               STSI, SIGP Sense, STCK
diag.c, ebcdic.c, panic.c, string.c    Utilities

DASD Subsystem

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/common/dasd_*.c


1. Overview

ZXFL supports three IPL device types: two DASD formats and tape. The correct driver is selected automatically by probing the IPL device with Sense ID and Read Device Characteristics (RDC) CCWs.

Type  Driver       Typical device
ECKD  dasd_eckd.c  3390 (most common)
FBA   dasd_fba.c   9336
Tape  dasd_tape.c  3480, 3490, 3590

2. Low-Level I/O (dasd_io.c)

All device access goes through a single CCW submission layer:

dasd_do_io(schid, ccw_chain, sense_buf)
  │
  ├─ Build ORB pointing to ccw_chain
  ├─ SSCH(schid, ORB)
  ├─ Wait for completion (poll TSCH with interrupts disabled)
  ├─ TSCH(schid, IRB) → check device end status
  └─ Return status or panic on unrecoverable error

All CCW data buffers are static BSS arrays, ensuring they reside below 0x80000000 (31-bit CDA constraint).


3. ECKD Driver (dasd_eckd.c)

ECKD (Extended Count Key Data) is the standard format for IBM 3390 DASD. Addressing is by cylinder, head, and record number (C/H/R).

Key operations:

Operation                    CCW command  Description
Sense ID                     0xE4         Identify device type and model
Read Device Characteristics  0x64         Obtain geometry (cylinders, heads, sectors)
Seek                         0x07         Position to cylinder/head
Search ID Equal              0x31         Find record by C/H/R
Read Data (multi-track)      0x86         Read the data area of a record

Track reads use a Seek → Search → Read CCW chain. The search CCW loops (via TIC — Transfer in Channel) until the target record is found.


4. FBA Driver (dasd_fba.c)

FBA (Fixed Block Architecture) devices use linear block addressing. Each block is 512 bytes.

Key operations:

Operation      CCW command  Description
Sense ID       0xE4         Identify device
Define Extent  0x63         Set the block range for the following operation
Locate Record  0x43         Specify starting block and count
Read           0x42         Transfer data

5. Tape Driver (dasd_tape.c)

Tape support is provided for environments where the kernel is stored on a 3480/3490/3590 tape cartridge. Tape is read sequentially; there is no random access.

Key operations: Sense ID, Rewind, Read Block, Forward Space File.


6. Device Selection

At Stage 1 startup, probe_ipl_device() issues a Sense ID CCW to the IPL subchannel. The returned device type code selects the driver:

device_type == 0x3390  →  ECKD
device_type == 0x9336  →  FBA
device_type == 0x3480
              0x3490
              0x3590   →  Tape
otherwise              →  panic("unsupported IPL device")
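
The selection table maps naturally onto a switch statement. The following sketch shows one way to express it; the enum and function names are hypothetical, and the real probe_ipl_device() may be structured differently.

```c
#include <stdint.h>

/* Hypothetical driver identifiers for this sketch. */
typedef enum { IPL_DRV_NONE, IPL_DRV_ECKD, IPL_DRV_FBA, IPL_DRV_TAPE } ipl_driver_t;

/* Map the device type returned by Sense ID to a driver. */
static inline ipl_driver_t ipl_select_driver(uint16_t device_type)
{
    switch (device_type) {
    case 0x3390: return IPL_DRV_ECKD;
    case 0x9336: return IPL_DRV_FBA;
    case 0x3480:
    case 0x3490:
    case 0x3590: return IPL_DRV_TAPE;
    default:     return IPL_DRV_NONE;  /* caller is expected to panic */
    }
}
```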

VTOC

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/common/dasd_vtoc.c


1. What Is the VTOC?

The Volume Table of Contents (VTOC) is the directory of an IBM DASD volume. It is an IBM-defined on-disk structure that maps dataset names to their physical extents (cylinder/head ranges on ECKD, or block ranges on FBA).

The VTOC begins at a fixed location recorded in the DASD label (Format-4 DSCB at cylinder 0, head 0, record 3 on ECKD). ZXFL reads the VTOC to locate the kernel and loader datasets by name.


2. DSCB Types

The VTOC consists of Data Set Control Blocks (DSCBs), each 140 bytes. ZXFL uses three types:

Type      Format  Purpose
Format-1  F1DSCB  Dataset name, creation date, first extent
Format-3  F3DSCB  Additional extents (overflow from F1)
Format-4  F4DSCB  VTOC descriptor: location and size of VTOC itself

3. Dataset Lookup

dasd_find_dataset(schid, name, &ext)
  │
  ├─ Read F4DSCB (C=0, H=0, R=3) → get VTOC start C/H and size
  ├─ For each DSCB in VTOC:
  │    ├─ Read record
  │    ├─ Check format byte
  │    ├─ If F1DSCB: compare DS1DSNAM (44-byte EBCDIC name) to target
  │    └─ If match: extract extent list from DS1EXT1..DS1EXT3
  └─ Return first extent (cylinder/head start + end)

Dataset names are stored in EBCDIC on disk. ZXFL converts the search name from ASCII to EBCDIC before comparison using ebcdic_ascii_to_ebcdic().
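
As an illustration of the conversion step, the sketch below builds the 44-byte, blank-padded EBCDIC key that would be compared against DS1DSNAM. It is not the project's ebcdic_ascii_to_ebcdic(); the function names are hypothetical, and the sketch handles only uppercase letters, digits, and the dot, which covers the dataset names used in this guide.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DSNAME_LEN 44

/* Translate one ASCII character to EBCDIC.
 * Returns 0 for characters outside A-Z, 0-9 and '.'. */
static uint8_t dsname_char_to_ebcdic(char c)
{
    if (c >= 'A' && c <= 'I') return (uint8_t)(0xC1 + (c - 'A'));
    if (c >= 'J' && c <= 'R') return (uint8_t)(0xD1 + (c - 'J'));
    if (c >= 'S' && c <= 'Z') return (uint8_t)(0xE2 + (c - 'S'));
    if (c >= '0' && c <= '9') return (uint8_t)(0xF0 + (c - '0'));
    if (c == '.')             return 0x4B;
    return 0;
}

/* Build the blank-padded EBCDIC key; returns 0 on success, -1 on error. */
static int dsname_to_ebcdic(const char *name, uint8_t out[DSNAME_LEN])
{
    size_t len = strlen(name);
    if (len == 0 || len > DSNAME_LEN)
        return -1;
    memset(out, 0x40, DSNAME_LEN);      /* 0x40 is the EBCDIC space */
    for (size_t i = 0; i < len; i++) {
        out[i] = dsname_char_to_ebcdic(name[i]);
        if (out[i] == 0)
            return -1;
    }
    return 0;
}
```

With the key prepared once, each F1DSCB comparison reduces to a 44-byte memcmp.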


4. Extent Structure

Each extent describes a contiguous range of tracks:

struct extent {
    uint16_t  cyl_start;   // starting cylinder
    uint16_t  head_start;  // starting head
    uint16_t  cyl_end;     // ending cylinder (inclusive)
    uint16_t  head_end;    // ending head (inclusive)
};

A dataset may span up to three extents in its F1DSCB, with additional extents in a chained F3DSCB. ZXFL follows the F3 chain if the dataset requires more than three extents.


5. Sequential Read

After locating a dataset's extents, dasd_read_next() reads tracks sequentially:

for each extent:
    for each track in [cyl_start/head_start .. cyl_end/head_end]:
        Seek → Search R=1 → Read all records on track → append to buffer

The read stops when the buffer is full or all extents are exhausted.
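
To size the destination buffer, or to bound the loop, it helps to know how many tracks an extent covers. The sketch below computes that count from the extent structure shown earlier; the function name is hypothetical, and the number of heads per cylinder is passed in by the caller (15 on a 3390).

```c
#include <stdint.h>

struct extent {
    uint16_t cyl_start;   /* starting cylinder */
    uint16_t head_start;  /* starting head */
    uint16_t cyl_end;     /* ending cylinder (inclusive) */
    uint16_t head_end;    /* ending head (inclusive) */
};

/* Number of tracks covered by an extent, both ends inclusive. */
static inline uint32_t extent_track_count(const struct extent *e,
                                          uint32_t heads_per_cyl)
{
    uint32_t first = (uint32_t)e->cyl_start * heads_per_cyl + e->head_start;
    uint32_t last  = (uint32_t)e->cyl_end   * heads_per_cyl + e->head_end;
    return last - first + 1;
}
```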

ELF64 Loader

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/common/elfload.c


1. Overview

zxfl_load_elf64 loads the kernel ELF64 image from DASD into physical memory. It processes only PT_LOAD program headers; all other segment types are ignored.


2. Load Sequence

zxfl_load_elf64(schid, dataset_name, load_base_out)
  │
  ├─ Read ELF header (first 64 bytes)
  ├─ Validate: magic 0x7F 'E' 'L' 'F', EI_CLASS=2 (64-bit),
  │            EI_DATA=2 (big-endian), e_machine=0x16 (s390)
  ├─ Read program header table (e_phoff, e_phnum entries)
  ├─ For each PT_LOAD segment:
  │    ├─ Compute physical load address:
  │    │    pa = p_paddr − CONFIG_KERNEL_VIRT_OFFSET
  │    ├─ Read p_filesz bytes from file offset p_offset → pa
  │    └─ Zero-fill [pa + p_filesz, pa + p_memsz)
  └─ Return load_min (lowest p_paddr seen, stripped of HHDM offset)

3. Address Computation

The kernel is linked with virtual addresses in the HHDM range (p_vaddr ≥ 0xFFFF800000000000). The physical load address is derived by subtracting CONFIG_KERNEL_VIRT_OFFSET:

$$\texttt{pa} = \texttt{p_paddr} - \texttt{CONFIG_KERNEL_VIRT_OFFSET}$$

The loader does not use p_vaddr directly; it uses p_paddr to avoid ambiguity when the linker script sets AT() addresses.
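
The address computation and its lower-bound check can be sketched as a single helper. The function name is hypothetical; the constant is the HHDM base given in this guide for CONFIG_KERNEL_VIRT_OFFSET.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value documented for CONFIG_KERNEL_VIRT_OFFSET (the HHDM base). */
#define KERNEL_VIRT_OFFSET UINT64_C(0xFFFF800000000000)

/* Derive the physical load address of a PT_LOAD segment from p_paddr.
 * Returns false when the segment lies below the HHDM offset. */
static inline bool segment_phys_addr(uint64_t p_paddr, uint64_t *pa_out)
{
    if (p_paddr < KERNEL_VIRT_OFFSET)
        return false;
    *pa_out = p_paddr - KERNEL_VIRT_OFFSET;
    return true;
}
```

A segment with p_paddr = 0xFFFF800000100000 would therefore be loaded at physical address 0x100000.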


4. Constraints

  • The kernel ELF must be ET_EXEC (executable, not shared object).
  • e_machine must be 0x16 (EM_S390). Any other value causes an immediate panic.
  • All PT_LOAD segments must have p_paddr ≥ CONFIG_KERNEL_VIRT_OFFSET. A segment below the HHDM offset is rejected.
  • The kernel entry point (e_entry) must be ≥ 0xFFFF800000040000 (HHDM + 256 KB). The loader enforces this before the final jump.
  • The total loaded image (all PT_LOAD segments) must fit within the memory probed by the write-pattern test.

5. BSS Zeroing

Segments where p_memsz > p_filesz have a BSS tail. The loader zeros this region with memset immediately after reading the file data. This ensures the kernel's BSS is clean before any ZXVL verification.

Bootloader MMU & HHDM

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/common/mmu.c


1. Purpose

Before transferring control to the kernel, Stage 1 must enable DAT (Dynamic Address Translation) and establish the virtual address space the kernel expects. This involves building a 5-level page table hierarchy with two mappings:

Mapping   Virtual range                 Physical range  Purpose
Identity  [0x0, RAM)                    [0x0, RAM)      Allows the loader itself to continue executing after DAT is enabled
HHDM      [HHDM_BASE, HHDM_BASE + RAM)  [0x0, RAM)      The kernel's primary view of physical memory

HHDM_BASE = 0xFFFF800000000000 (CONFIG_KERNEL_VIRT_OFFSET).


2. Page Table Allocation

The bootloader allocates page tables from a bump allocator backed by a contiguous physical region that follows the kernel image. The region base is the first 1 MB-aligned address after kernel_phys_end, but never lower than 32 MB. The end of this region is recorded in proto->pgtbl_pool_end.

The kernel PMM must mark [pool_base, pgtbl_pool_end) as reserved during initialization.
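
Because only the end of the pool is passed in the protocol, the kernel has to recompute the base itself. The sketch below shows one reading of the rule; the names are hypothetical, and it assumes that a kernel_phys_end which is already 1 MB-aligned is used unchanged.

```c
#include <stdint.h>

#define MIB UINT64_C(0x100000)

/* Pool base: kernel_phys_end rounded up to 1 MB, but never below 32 MB. */
static inline uint64_t pgtbl_pool_base(uint64_t kernel_phys_end)
{
    uint64_t base = (kernel_phys_end + MIB - 1) & ~(MIB - 1);
    return base < 32 * MIB ? 32 * MIB : base;
}
```

The PMM would then reserve the range from this value up to proto->pgtbl_pool_end.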


3. Build Sequence

zxfl_mmu_setup_and_jump(proto, entry_point)
  │
  ├─ Allocate R1 table (16 KB, zero-filled)
  ├─ For each 4 KB page in [0, RAM):
  │    ├─ Map VA = PA         (identity)
  │    └─ Map VA = PA + HHDM  (HHDM)
  ├─ Build ASCE: R1_phys | DT=11 | TL=3 (2048 entries)
  ├─ Load ASCE into CR1 (LCTLG)
  ├─ Translate all proto pointer fields to HHDM virtual
  ├─ Set PSW.DAT = 1 in the new PSW
  └─ LPSWE → entry_point (DAT on, interrupts masked)

Large pages (EDAT-1 / EDAT-2) are used if the corresponding STFLE facility is present, reducing the number of page table entries required.
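
The ASCE composition step can be sketched as follows. The bit positions are taken from the Principles of Operation rather than from the ZXFL sources: the designation type occupies bits 60-61 and the table length bits 62-63, where a length of 3 means four units of 512 entries, that is, a full 2048-entry table. The macro and function names are hypothetical.

```c
#include <stdint.h>

#define ASCE_DT_REGION1  UINT64_C(0x0C)  /* DT = binary 11 in bits 60-61 */
#define ASCE_TL_FULL     UINT64_C(0x03)  /* TL = 3: 4 x 512 = 2048 entries */

/* Compose a Region-First ASCE from the physical address of the R1 table.
 * The low 12 bits of the address are cleared to make room for the flags. */
static inline uint64_t make_asce(uint64_t r1_phys)
{
    return (r1_phys & ~UINT64_C(0xFFF)) | ASCE_DT_REGION1 | ASCE_TL_FULL;
}
```

For an R1 table at physical address 0x2000000, this yields 0x200000F.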


4. Pointer Translation

All pointer fields in zxfl_boot_protocol_t that reference physical memory are translated to HHDM virtual addresses before the jump:

$$va = pa + \texttt{CONFIG_KERNEL_VIRT_OFFSET}$$

This includes mem_map_addr, kernel_entry, kernel_stack_top, cmdline_addr, and lowcore_phys. The kernel must not attempt to dereference any protocol pointer as a physical address.


5. State at Kernel Entry

Resource        State
DAT             On: CR1 holds the ASCE built by the loader
Interrupts      Masked: all interrupt classes disabled
%r2             HHDM virtual address of zxfl_boot_protocol_t
%r15            HHDM virtual address of initial stack top (32 KB)
All other GPRs  Undefined

Boot Protocol

Document Revision: 26h1.0
Protocol version: ZXFL_VERSION_4 (0x00000004)


1. Overview

The kernel receives a pointer to zxfl_boot_protocol_t in %r2 at entry. All pointer fields are HHDM virtual addresses. The struct is version 4.

The kernel must validate proto->magic == ZXFL_MAGIC (0x5A58464C, "ZXFL") before using any other field. A mismatch indicates the wrong value is in %r2 or the loader did not complete correctly.
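
A minimal sketch of this entry check follows. It assumes that magic and version are the first two 32-bit fields of the structure, in that order; the partial struct and the function name are hypothetical and stand in for the real zxfl_boot_protocol_t.

```c
#include <stdbool.h>
#include <stdint.h>

#define ZXFL_MAGIC      UINT32_C(0x5A58464C)  /* "ZXFL" */
#define ZXFL_VERSION_4  UINT32_C(0x00000004)

/* Only the two leading header fields; the full structure is larger. */
typedef struct {
    uint32_t magic;
    uint32_t version;
} zxfl_proto_header_t;

/* True when the header carries the expected magic and a known version. */
static inline bool zxfl_proto_usable(const zxfl_proto_header_t *h)
{
    return h->magic == ZXFL_MAGIC && h->version == ZXFL_VERSION_4;
}
```

A kernel would typically enter a disabled-wait state when this check fails, since no other field can be trusted at that point.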


2. Header Fields

Field          Type  Value / Description
magic          u32   0x5A58464C ("ZXFL")
version        u32   0x00000004
flags          u32   Bitmask of ZXFL_FLAG_* (see §13)
binding_token  u64   ZXVL_SEED ^ stfle_fac[0] ^ ipl_schid

3. Loader Identity

Field             Type  Description
loader_major      u16   Major version (1)
loader_minor      u16   Minor version (0)
loader_timestamp  u32   Build time encoded as HHMMSSZx

4. IPL Device

Field          Type  Description
ipl_schid      u32   Subchannel ID of the IPL device
ipl_dev_type   u16   Device type from Sense ID (e.g. 0x3390)
ipl_dev_model  u16   Device model from Sense ID

5. Kernel Image

Field              Type  Description
kernel_phys_start  u64   Physical base of loaded kernel
kernel_phys_end    u64   Physical end (exclusive), after modules
kernel_entry       u64   ELF entry point (HHDM virtual)

6. Memory Map

Field            Type  Description
mem_total_bytes  u64   Total usable + kernel RAM
mem_map_addr     u64   HHDM virtual address of zxfl_mem_region_t[]
mem_map_count    u32   Number of valid entries

Each zxfl_mem_region_t entry is defined as:

Field      Type  Description
base       u64   Physical base address of the region
length     u64   Length of the region in bytes
type       u32   ZXFL_MEM_* constant
numa_node  u8    Logical NUMA node ID this memory region belongs to

7. Page Table Pool

Field           Type  Description
pgtbl_pool_end  u64   Physical end of bootloader page-table bump pool

The pool base is the first 1 MB-aligned address after kernel_phys_end, but never lower than 32 MB. The kernel PMM must mark [pool_base, pgtbl_pool_end) as reserved.


8. Kernel Stack

Field             Type  Description
kernel_stack_top  u64   HHDM virtual address of initial stack top (32 KB)

The kernel should switch to its own stack as early as possible and treat this region as reserved.


9. Control Register Snapshots

Field          Type  Description
cr0_snapshot   u64   CR0 at time of kernel jump
cr1_snapshot   u64   CR1 (ASCE) at time of jump
cr13_snapshot  u64   CR13 at time of jump

10. SMP / CPU Map

Field         Type                  Description
cpu_map[]     zxfl_cpu_info_t[128]  Up to 128 CPU entries
cpu_count     u32                   Valid entries in cpu_map
bsp_cpu_addr  u16                   CPU address of the boot processor

Each zxfl_cpu_info_t:

Field      Type  Description
cpu_addr   u16   CPU address (0–65535)
type       u8    ZXFL_CPU_TYPE_* constant
state      u8    ZXFL_CPU_ONLINE or ZXFL_CPU_STOPPED
numa_node  u8    Logical NUMA node ID derived from physical book/socket
drawer_id  u8    Drawer physical identifier from STSI 15.1.x
book_id    u8    Book physical identifier from STSI 15.1.x
socket_id  u8    Socket physical identifier from STSI 15.1.x
chip_id    u8    Chip physical identifier from STSI 15.1.x
thread_id  u8    Thread physical identifier from STSI 15.1.x

Valid when ZXFL_FLAG_SMP is set.


11. System Identification

Populated from STSI when ZXFL_FLAG_SYSINFO is set:

Field             Description
manufacturer[16]  ASCII, e.g. "IBM"
type[4]           Machine type, e.g. "2964"
model[16]         Model identifier
sequence[16]      Machine serial number
plant[4]          Manufacturing plant code
lpar_name[8]      LPAR name (STSI 2.2.2); empty on bare metal
lpar_number       LPAR number
cpus_total        Total CPUs in CEC
cpus_configured   Configured CPUs
cpus_standby      Standby CPUs
capability        CPU capability rating

12. Modules

Up to 16 modules loaded from sysmodule= parmfile entries:

Field                  Description
modules[i].name[32]    Dataset name (NUL-terminated)
modules[i].phys_start  Physical load address
modules[i].size_bytes  Size in bytes

13. Flags

Flag               Bit  Meaning
ZXFL_FLAG_SMP      0    cpu_map[] is valid
ZXFL_FLAG_MEM_MAP  1    mem_map is valid
ZXFL_FLAG_CMDLINE  2    cmdline_addr is valid
ZXFL_FLAG_LOWCORE  3    lowcore_phys is valid
ZXFL_FLAG_STFLE    4    stfle_fac[] is valid
ZXFL_FLAG_SYSINFO  5    sysinfo is valid
ZXFL_FLAG_TOD      6    tod_boot is valid

14. Binding Token

The binding token ties the boot session to the specific hardware and IPL device:

$$\texttt{binding_token} = \texttt{ZXVL_SEED} \oplus \texttt{stfle_fac[0]} \oplus \texttt{ipl_schid}$$

The kernel must recompute this value and compare it to proto->binding_token. A mismatch means the protocol was tampered with or the kernel is running on unexpected hardware.
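
The recomputation is a pair of XOR operations. The sketch below takes the seed as a parameter because the value of ZXVL_SEED is not given in this guide; the function name is hypothetical.

```c
#include <stdint.h>

/* binding_token = seed XOR stfle_fac[0] XOR ipl_schid.
 * The 32-bit subchannel ID is zero-extended to 64 bits. */
static inline uint64_t zxvl_binding_token(uint64_t seed,
                                          uint64_t stfle_fac0,
                                          uint32_t ipl_schid)
{
    return seed ^ stfle_fac0 ^ (uint64_t)ipl_schid;
}
```

On entry, the kernel would compare the result against proto->binding_token and stop if the two values differ.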

The binding token is also used as a component of the ZXVL handshake nonce and the stack frame canary. See ZXVL Verification.

ZXVL Verification

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/common/zxvl_verify.c


1. Overview

ZXVL (ZXVerifiedLoad) is the integrity verification layer embedded in the ZXFL bootloader. It prevents arbitrary payloads from being loaded as the kernel nucleus. Three mechanisms are applied in sequence after ELF loading, before DAT is enabled.


2. Structural Lock

The kernel must embed a .zxfl_lock section at fixed offsets from its physical load base (load_min):

Offset from load_min  Content
0x70000               High 32 bits of lock key: 0xCCBBCC35
0x70004               Sentinel: 0x5A58464C ("ZXFL")
0x71000               Low 32 bits of lock key: 0xE5664311

The loader verifies:

$$(\texttt{key} \oplus \texttt{ZXVL_LOCK_MASK}) = \texttt{ZXVL_LOCK_EXPECTED}$$

where:

  • $\texttt{key} = (\texttt{hi} \ll 32) \mid \texttt{lo}$
  • $\texttt{ZXVL_LOCK_MASK} = \texttt{0x3C1E0F8704B2D596}$
  • $\texttt{ZXVL_LOCK_EXPECTED} = \texttt{0xF0A5C3B2E1D49687}$

A missing sentinel or wrong key causes an immediate panic — the loader refuses to execute the image.
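
The check can be reproduced with the constants listed above. The sketch below is illustrative: the function name and the ZXVL_LOCK_SENTINEL macro name are hypothetical, and the three words are passed in as arguments rather than read from load_min.

```c
#include <stdbool.h>
#include <stdint.h>

#define ZXVL_LOCK_MASK      UINT64_C(0x3C1E0F8704B2D596)
#define ZXVL_LOCK_EXPECTED  UINT64_C(0xF0A5C3B2E1D49687)
#define ZXVL_LOCK_SENTINEL  UINT32_C(0x5A58464C)  /* "ZXFL" */

/* hi, sentinel and lo are the words found at load_min + 0x70000,
 * + 0x70004 and + 0x71000 respectively. */
static inline bool zxvl_lock_ok(uint32_t hi, uint32_t sentinel, uint32_t lo)
{
    uint64_t key = ((uint64_t)hi << 32) | lo;
    return sentinel == ZXVL_LOCK_SENTINEL &&
           (key ^ ZXVL_LOCK_MASK) == ZXVL_LOCK_EXPECTED;
}
```

With the documented key halves 0xCCBBCC35 and 0xE5664311, the XOR against the mask produces exactly ZXVL_LOCK_EXPECTED.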


3. Handshake

The kernel must place a callable function stub at load_min + 0x0 (the very first byte of the loaded image). The stub must implement:

$$f(\texttt{nonce}) = \text{rotl}_{17}(\texttt{nonce}) + \texttt{ZXVL_HS_RESPONSE}$$

where $\text{rotl}_{17}(x) = (x \ll 17) \mid (x \gg 47)$ and $\texttt{ZXVL_HS_RESPONSE} = \texttt{0xDEADBEEF0BADF00D}$.

The loader calls the stub with:

$$\texttt{nonce} = \texttt{ZXVL_SEED} \oplus \texttt{binding_token}$$

$$\texttt{binding_token} = \texttt{ZXVL_SEED} \oplus \texttt{stfle_fac[0]} \oplus \texttt{ipl_schid}$$

This ties the handshake to the specific hardware and IPL device. A kernel image that passes on one machine will not pass on another with different STFLE facilities or a different subchannel ID.
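
The formulas in this section translate to C roughly as below. This is a sketch under stated assumptions: the function names are hypothetical, stfle_fac[0] is assumed to be a 64-bit word and the subchannel ID a 32-bit word, the addition is assumed to wrap modulo 2^64, and the seed is passed as a parameter because its value is not given in this section.

```c
#include <assert.h>
#include <stdint.h>

#define ZXVL_HS_RESPONSE UINT64_C(0xDEADBEEF0BADF00D)

static_assert((ZXVL_HS_RESPONSE >> 32) == UINT64_C(0xDEADBEEF),
              "high word of the response constant");

/* rotl17(x) = (x << 17) | (x >> 47) on a 64-bit value. */
static inline uint64_t zxvl_rotl17(uint64_t x)
{
    return (x << 17) | (x >> 47);
}

/* f(nonce) = rotl17(nonce) + ZXVL_HS_RESPONSE (unsigned, so it wraps). */
static inline uint64_t zxvl_hs_expected(uint64_t nonce)
{
    return zxvl_rotl17(nonce) + ZXVL_HS_RESPONSE;
}

/* binding_token = seed ^ stfle_fac[0] ^ schid (operand widths are assumptions). */
static inline uint64_t zxvl_binding_token(uint64_t seed, uint64_t stfle0, uint32_t schid)
{
    return seed ^ stfle0 ^ (uint64_t)schid;
}

/* nonce = seed ^ binding_token. */
static inline uint64_t zxvl_nonce(uint64_t seed, uint64_t binding_token)
{
    return seed ^ binding_token;
}
```

Under these definitions the seed cancels out of the nonce, which therefore equals stfle_fac[0] XOR the subchannel ID.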


4. SHA-256 Segment Checksums

After the handshake, zxvl_verify_nucleus_checksums reads the zxvl_checksum_table_t from load_min + 0x80000 and verifies each entry:

$$\text{SHA-256}(\texttt{phys_start}, \texttt{size}) = \texttt{entry.digest}$$

Any mismatch causes an immediate panic. The table is patched into the kernel ELF by zxsign at build time. Any modification to a PT_LOAD segment after the build — including by a malicious bootloader or storage attack — is detected here.


5. Binding Token

The binding token is stored in proto->binding_token and used in two places:

  1. Handshake nonce (above).
  2. Stack frame canary: frame[1] = ZXVL_FRAME_MAGIC_B ^ binding_token.

The canary value is unique per hardware configuration. A canary extracted from one system cannot be replayed on another.

The kernel must recompute the binding token on entry and compare it to proto->binding_token. See Boot Protocol §14.

Checksum Protocol

Document Revision: 26h1.1


1. Purpose

The checksum protocol ensures that the kernel image loaded into memory matches the image that was built and signed. It operates at two points:

| Point | Actor | Action |
| --- | --- | --- |
| Build time | zxsign | Compute SHA-256 per PT_LOAD segment; patch into .zxvl_checksums |
| Boot time (loader) | zxvl_verify_nucleus_checksums | Recompute and compare before DAT is enabled |
| Boot time (kernel) | verify_kernel_checksums | Recompute and compare from HHDM after DAT is enabled |

The double verification (loader + kernel) ensures that neither a compromised loader nor a post-load memory modification can go undetected.


2. Table Location

The checksum table is placed in the .zxvl_checksums ELF section, which is emitted as a dedicated PT_LOAD segment with p_flags = ZXVL_PFLAGS_CKSUM (0x00200004).

The loader discovers the table's physical address by scanning the ELF program header table for a segment with that exact p_flags value. The physical address is stored in zxfl_boot_protocol_t::cksum_table_phys and passed to the kernel. No hardcoded offsets are used.
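
The program-header scan described above might look like the following. It is a sketch only: `elf64_phdr_t`, `ELF_PT_LOAD` and `zxvl_find_cksum_phdr` are hypothetical names, and the struct simply mirrors the standard ELF64 program header layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ELF_PT_LOAD       UINT32_C(1)           /* PT_LOAD; hypothetical macro name */
#define ZXVL_PFLAGS_CKSUM UINT32_C(0x00200004)

/* ELF64 program header, field order as in the ELF64 specification. */
typedef struct {
    uint32_t p_type;
    uint32_t p_flags;
    uint64_t p_offset;
    uint64_t p_vaddr;
    uint64_t p_paddr;
    uint64_t p_filesz;
    uint64_t p_memsz;
    uint64_t p_align;
} elf64_phdr_t;

static_assert(sizeof(elf64_phdr_t) == 56, "an ELF64 program header is 56 bytes");

/* Returns the PT_LOAD segment whose p_flags is exactly ZXVL_PFLAGS_CKSUM,
 * or NULL if the image carries no checksum segment. */
static const elf64_phdr_t *zxvl_find_cksum_phdr(const elf64_phdr_t *phdrs, size_t phnum)
{
    for (size_t i = 0; i < phnum; i++) {
        if (phdrs[i].p_type == ELF_PT_LOAD && phdrs[i].p_flags == ZXVL_PFLAGS_CKSUM)
            return &phdrs[i];
    }
    return NULL;
}
```

The comparison is an exact match on p_flags rather than a bit test, so an ordinary read-only segment (p_flags = 4) is not mistaken for the table.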


3. Table Format

See zxsign §3 for the full zxvl_checksum_table_t layout.

Key fields:

| Field | Value |
| --- | --- |
| magic | 0x5A58564C ("ZXVL") |
| version | 0x00000001 |
| algo | 0x00000001 (SHA-256) |
| count | Number of verified segments |


4. Excluded Segments

The segment containing .zxvl_checksums itself is excluded from the checksum computation. Hashing the table while building it would be circular. zxsign identifies and skips this segment automatically.


5. Kernel Re-verification

After the kernel initializes the PMM and VMM, verify_kernel_checksums re-reads the table from the HHDM virtual address and recomputes SHA-256 for each segment. This catches:

  • Memory corruption between loader verification and kernel execution.
  • A loader that passed verification but then modified segments before the jump.

A mismatch at this stage calls panic("sys: kernel segment checksum mismatch — image tampered").

How to Load Your Kernel with ZXFL

Document Revision: 26h1.0

For the most up-to-date information, see ZXFL Barebones.

This guide walks through every step required to produce a kernel image that ZXFL will accept and execute. Read the Boot Protocol and ZXVL Verification pages first for background.


Overview

ZXFL imposes five requirements on the kernel image before it will execute it:

  1. Valid ELF64 for s390x, ET_EXEC, all PT_LOAD segments in the HHDM range.
  2. Structural lock section at fixed offsets.
  3. Handshake stub at the physical load base.
  4. SHA-256 checksum table at load_min + 0x80000, patched by zxsign.
  5. Boot protocol validation on entry.

Step 1 — Link the Kernel in the HHDM Range

All PT_LOAD segments must have virtual addresses at or above CONFIG_KERNEL_VIRT_OFFSET (0xFFFF800000000000). ZXFL computes the physical load address by subtracting this offset from p_paddr:

pa = p_paddr - 0xFFFF800000000000

No AT() override is needed. Because there is no LMA override in the linker script, p_paddr equals p_vaddr, and the loader strips the HHDM offset to get the physical address.

A minimal linker script skeleton (modelled on arch/s390x/init/link.ld):

ENTRY(my_kernel_entry)

PHDRS {
    nucleus       PT_LOAD FLAGS(7);
    checksums_seg PT_LOAD FLAGS(4);
}

SECTIONS {
    /* Handshake stub — must be the first code at the physical load base */
    .zxfl_hs 0xFFFF800000100000 : {
        KEEP(*(.zxfl_hs))
    } :nucleus

    .text 0xFFFF800000100400 : {
        KEEP(*(.text.my_kernel_entry))
        *(.text .text.*)
    } :nucleus

    .rodata : ALIGN(8) { *(.rodata .rodata.*) } :nucleus
    .data   : ALIGN(8) { *(.data   .data.*)   } :nucleus

    /* Structural lock — fixed virtual offsets from load base */
    .zxfl_lock 0xFFFF800000170000 : {
        KEEP(*(.zxfl_lock))
    } :nucleus

    .bss : ALIGN(4096) {
        *(.bss .bss.*) *(COMMON)
    } :nucleus

    /* Checksum table — fixed virtual offset from load base */
    .zxvl_checksums 0xFFFF800000180000 : {
        KEEP(*(.zxvl_checksums))
    } :checksums_seg
}

The entry point (e_entry) must be at or above 0xFFFF800000040000 (HHDM + 256 KB). ZXFL rejects images with a lower entry point.
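
The address rules in this step reduce to three small checks. The sketch below is illustrative: the helper names and `ZXFL_MIN_ENTRY` are hypothetical, and only the two constants come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CONFIG_KERNEL_VIRT_OFFSET UINT64_C(0xFFFF800000000000)
/* Hypothetical name for the lowest accepted entry point (HHDM + 256 KB). */
#define ZXFL_MIN_ENTRY (CONFIG_KERNEL_VIRT_OFFSET + UINT64_C(0x40000))

static_assert(ZXFL_MIN_ENTRY == UINT64_C(0xFFFF800000040000), "HHDM + 256 KB");

/* Every PT_LOAD segment must sit at or above the HHDM offset. */
static inline bool zxfl_vaddr_in_hhdm(uint64_t p_vaddr)
{
    return p_vaddr >= CONFIG_KERNEL_VIRT_OFFSET;
}

/* pa = p_paddr - CONFIG_KERNEL_VIRT_OFFSET */
static inline uint64_t zxfl_phys_of(uint64_t p_paddr)
{
    return p_paddr - CONFIG_KERNEL_VIRT_OFFSET;
}

/* e_entry must be at or above HHDM + 256 KB. */
static inline bool zxfl_entry_ok(uint64_t e_entry)
{
    return e_entry >= ZXFL_MIN_ENTRY;
}
```

With the skeleton linker script, the handshake section at 0xFFFF800000100000 maps to physical 0x100000, and the lock section at 0xFFFF800000170000 lands 0x70000 bytes above it.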


Step 2 — Embed the Structural Lock

The lock constants can be placed directly in the linker script (as ZXFoundation™ does), or in a C translation unit:

/* In the linker script — simplest approach */
.zxfl_lock 0xFFFF800000170000 : {
    LONG(0xCCBBCC35)   /* hi */
    LONG(0x5A58464C)   /* sentinel "ZXFL" */
    . = . + 0x1000 - 8;
    LONG(0xE5664311)   /* lo */
} :nucleus

The loader verifies: ((hi << 32 | lo) ^ 0x3C1E0F8704B2D596) == 0xF0A5C3B2E1D49687.
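
For the C translation unit variant, one possible shape is a padded struct placed in the .zxfl_lock section. This is a sketch, not ZXFoundation™ code: the type and object names are hypothetical, and it assumes the linker script keeps the section at the fixed address with KEEP(*(.zxfl_lock)) instead of emitting the LONG() values itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout mirroring the linker-script version. */
struct zxfl_lock_image {
    uint32_t hi;                 /* +0x0000: high 32 bits of the lock key */
    uint32_t sentinel;           /* +0x0004: "ZXFL" */
    uint8_t  pad[0x1000 - 8];    /* same gap as `. = . + 0x1000 - 8;` */
    uint32_t lo;                 /* +0x1000: low 32 bits of the lock key */
};

static_assert(offsetof(struct zxfl_lock_image, sentinel) == 0x4, "sentinel offset");
static_assert(offsetof(struct zxfl_lock_image, lo) == 0x1000, "low word offset");

/* `used` keeps the object even though no C code references it. */
__attribute__((section(".zxfl_lock"), used))
static const struct zxfl_lock_image zxfl_lock_image = {
    .hi       = UINT32_C(0xCCBBCC35),
    .sentinel = UINT32_C(0x5A58464C),
    .lo       = UINT32_C(0xE5664311),
};
```

The static assertions catch a layout mistake at compile time rather than at boot, which is the main advantage over hand-counted offsets.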


Step 3 — Implement the Handshake Stub

The stub must be the very first code at the physical load base. It receives a nonce in %r2 and must return the response in %r2. ZXVL_HS_RESPONSE = 0xDEADBEEF0BADF00D.

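    .machinemode zarch
    .section .zxfl_hs, "ax"             # placed first by the linker script above
    .globl __zxfl_handshake_stub
.equ ZXFL_SEED_HI, 0xA5F0C3E1
.equ ZXFL_SEED_LO, 0xB2D49687
.equ HS_RESPONSE_HI,  0xDEADBEEF
.equ HS_RESPONSE_LO,  0x0BADF00D

__zxfl_handshake_stub:
    llihf   %r0, ZXFL_SEED_HI           # r0 = seed, high word
    iilf    %r0, ZXFL_SEED_LO           # r0 = full 64-bit seed
    xgr     %r2, %r0                    # r2 = nonce ^ seed
    lgr     %r0, %r2
    sllg    %r0, %r0, 17                # r0 = r2 << 17
    srlg    %r1, %r2, 47                # r1 = r2 >> 47
    ogr     %r0, %r1                    # r0 = rotl17(r2)
    llihf   %r1, HS_RESPONSE_HI
    iilf    %r1, HS_RESPONSE_LO         # r1 = ZXVL_HS_RESPONSE
    lgr     %r2, %r0
    agr     %r2, %r1                    # r2 = rotl17(r2) + ZXVL_HS_RESPONSE
    br      %r14                        # return the response in %r2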

The stub must not clobber %r14 (return address) or %r15 (stack pointer). It must be callable with BRASL and return via BR %r14.


Step 4 — Reserve the Checksum Table

Declare the checksum table section. It is zero at link time; zxsign patches it after linking:

__attribute__((section(".zxvl_checksums"), used))
static volatile zxvl_checksum_table_t zxvl_cksum_table = { 0 };

Step 5 — Run zxsign

After linking, run the host tool on the ELF:

zxsign my_kernel.elf

This computes SHA-256 for each PT_LOAD segment (excluding .zxvl_checksums itself) and patches the table in-place. The ELF is now ready for DASD.


Step 6 — Write to DASD

Write the kernel ELF to the DASD volume as dataset CORE.ZXFOUNDATION.NUCLEUS. In sysres.conf:

DATASET CORE.ZXFOUNDATION.NUCLEUS  my_kernel.elf

See Build Targets for the full dasdload invocation.


Step 7 — Handle the Boot Protocol on Entry

Your kernel entry point receives zxfl_boot_protocol_t *boot in %r2. Minimum required validation:

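[[noreturn]] void my_kernel_entry(zxfl_boot_protocol_t *boot) {
    /* Reject a missing or foreign protocol block. */
    if (!boot || boot->magic != ZXFL_MAGIC)
        for (;;) __asm__ volatile ("nop");

    /* Recompute the binding token and compare it with the loader's value. */
    uint64_t expected = ZXVL_COMPUTE_TOKEN(boot->stfle_fac[0], boot->ipl_schid);
    if (boot->binding_token != expected)
        for (;;) __asm__ volatile ("nop");

    /* Refuse a protocol version this kernel does not understand. */
    if (boot->version != ZXFL_VERSION_4)
        for (;;) __asm__ volatile ("nop");

    /* proceed with kernel initialization here */

    /* A [[noreturn]] function must never fall off its end. */
    for (;;) __asm__ volatile ("nop");
}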

All pointer fields in the protocol are HHDM virtual addresses. Do not treat them as physical addresses.


Checklist

| # | Requirement | Enforced by |
| --- | --- | --- |
| 1 | ELF64, ET_EXEC, e_machine = 0x16 (EM_S390) | Loader ELF validation |
| 2 | All PT_LOAD p_vaddr >= 0xFFFF800000000000 | Loader address check |
| 3 | e_entry >= 0xFFFF800000040000 | Loader entry check |
| 4 | Structural lock at load_min + 0x70000 | zxvl_verify |
| 5 | Handshake stub at load_min + 0x0 | zxvl_verify |
| 6 | Checksum table at load_min + 0x80000, patched by zxsign | zxvl_verify |
| 7 | boot->magic validated on entry | Kernel |
| 8 | boot->binding_token validated on entry | Kernel |

ZXFoundation™ Kernel Design

Document: ZXF-KRN-DESIGN-001
Revision: 26h1.0
Status: Draft
Date: 2026-05-09
Author: ZXFoundation™ Core Team


Document Scope

This document is the master architectural specification for the ZXFoundation™ kernel. It defines the design of every major subsystem — capability system, memory architecture, IPC, domain model, scheduler, time, trap handling, fault recovery, and the long-term implementation roadmap.

This document does not reference source files or API signatures. Those belong in per-subsystem reference documents. This document defines what the kernel is and why it is designed that way. Pseudocode and diagrams are used where precision is required.


1. Architectural Philosophy

1.1 Design Axioms

ZXFoundation™ is a capability-based object microkernel for IBM z/Architecture. Six axioms govern every design decision:

  1. Minimal Trusted Computing Base. The kernel enforces only what cannot be enforced elsewhere: memory isolation, capability validity, and CPU scheduling. Everything else is a server domain.

  2. Capability-First. No resource may be accessed without a valid capability. There is no ambient authority. A thread that holds no capabilities can do nothing.

  3. No Implicit Trust. Server domains are untrusted by default, including system-provided ones. Trust is established by capability grant, not by identity or position in a hierarchy.

  4. z/Architecture Native. The kernel exploits z/Architecture hardware features — DAT, storage keys, SIGP, TOD clock, CPU timer, channel subsystem — directly. No portability layer is maintained.

  5. SysV ABI Only. The kernel defines its own system call surface. No POSIX compatibility layer exists or is planned. The SysV calling convention (GPRs 2–7 for arguments, GPR 2 for return) is the sole ABI.

  6. Extreme Redundancy. The kernel must not panic on a faulting server domain or a recoverable hardware error. Fault containment and recovery are first-class design requirements, not afterthoughts.

1.2 Threat Model

| Threat | Mitigation |
| --- | --- |
| Untrusted user domain reads kernel memory | Separate DAT address space per domain; kernel ASCE never loaded in user state |
| Untrusted domain forges a capability | Capabilities are kernel-managed integers; user space never constructs them |
| Faulting server domain corrupts kernel state | Server domains run in user state; a fault traps to the kernel, not into it |
| Hardware storage error corrupts a page | Machine-check recovery classifies and isolates the affected frame |
| Capability leak via IPC | Capability transfer is move-semantics; sender loses the capability atomically |
| Denial of service via busy loop | Scheduler enforces quanta; CPU timer interrupt is non-maskable by user state |

1.3 Kernel / User Boundary

The kernel runs exclusively in supervisor state (PSW problem-state bit, bit 15, = 0). All server domains and user processes run in problem state (PSW bit 15 = 1).

The boundary is enforced by z/Architecture hardware:

  • DAT translates user virtual addresses through a per-domain ASCE (CR1 is loaded with the domain's ASCE on context switch).
  • Storage keys restrict memory access to pages owned by the domain.
  • Privileged instructions (LPSWE, SPX, SIGP, SSCH, etc.) trap to the kernel when executed in problem state.

1.4 Layered Architecture

┌─────────────────────────────────────────────────────────────────┐
│  User Processes  (problem state, own ASCE, own capability table) │
├─────────────────────────────────────────────────────────────────┤
│  Server Domains  (problem state, own ASCE, own capability table) │
│  [ block I/O | filesystem | network | console | device mgr ]    │
├─────────────────────────────────────────────────────────────────┤
│  Kernel TCB  (supervisor state, kernel ASCE)                    │
│  ┌──────────┬──────────┬──────────┬──────────┬───────────────┐  │
│  │ Capability│  IPC     │ Scheduler│  Memory  │ Trap / Syscall│  │
│  │  System  │ Subsystem│          │  Manager │   Dispatch    │  │
│  └──────────┴──────────┴──────────┴──────────┴───────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  KOMS · PMM · VMM · Slab · SMP · RCU · Sync Primitives  │   │
│  └──────────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────────┤
│  z/Architecture Hardware                                        │
│  [ DAT · Storage Keys · SIGP · TOD · CPU Timer · CSS · MCCK ]  │
└─────────────────────────────────────────────────────────────────┘

2. Capability System

2.1 Definition

A capability is an unforgeable, kernel-managed token that grants a specific set of rights to a specific kernel object. Possession of a capability is both necessary and sufficient to exercise the rights it encodes. There is no access control list, no ambient authority, and no privilege escalation path outside of explicit capability grant.

2.2 Capability Token Structure

A capability token is a 64-bit opaque integer stored in a kernel-managed capability table. User space never sees, constructs, or decodes a token; it refers to a capability by an integer handle (the slot index) into its own capability table, and only the kernel interprets the internal encoding.

 63      56 55      40 39      24 23       0
 ┌────────┬──────────┬──────────┬──────────┐
 │  type  │  rights  │   gen    │  index   │
 │  8 bit │  16 bit  │  16 bit  │  24 bit  │
 └────────┴──────────┴──────────┴──────────┘

| Field | Width | Meaning |
| --- | --- | --- |
| type | 8 | Object type (maps to kobj_type_t::type_id) |
| rights | 16 | Bitmask of granted rights |
| gen | 16 | Generation counter; incremented on revocation |
| index | 24 | Index into the kernel's global object table |

The gen field enables generation-based revocation: when a capability is revoked, the kernel increments the generation counter on the target object. Any token whose gen field does not match the current object generation is invalid, regardless of index or rights.
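
The bit layout above can be expressed as a pair of pack and unpack helpers. This is a sketch: the typedef, macro and function names are hypothetical, and only the field positions and widths come from the layout diagram.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cap_token_t;          /* hypothetical typedef */

#define CAP_GEN_SHIFT    24            /* gen:    bits 39..24 */
#define CAP_RIGHTS_SHIFT 40            /* rights: bits 55..40 */
#define CAP_TYPE_SHIFT   56            /* type:   bits 63..56 */
#define CAP_INDEX_MASK   UINT64_C(0xFFFFFF)   /* index: bits 23..0 */

static inline cap_token_t cap_token_pack(uint8_t type, uint16_t rights,
                                         uint16_t gen, uint32_t index)
{
    assert(index <= CAP_INDEX_MASK);   /* index is a 24-bit field */
    return ((cap_token_t)type   << CAP_TYPE_SHIFT)
         | ((cap_token_t)rights << CAP_RIGHTS_SHIFT)
         | ((cap_token_t)gen    << CAP_GEN_SHIFT)
         | (cap_token_t)index;
}

static inline uint8_t  cap_token_type(cap_token_t t)   { return (uint8_t)(t >> CAP_TYPE_SHIFT); }
static inline uint16_t cap_token_rights(cap_token_t t) { return (uint16_t)(t >> CAP_RIGHTS_SHIFT); }
static inline uint16_t cap_token_gen(cap_token_t t)    { return (uint16_t)(t >> CAP_GEN_SHIFT); }
static inline uint32_t cap_token_index(cap_token_t t)  { return (uint32_t)(t & CAP_INDEX_MASK); }
```

Explicit shifts are used instead of C bit-fields because bit-field ordering is implementation-defined, while the token layout is fixed.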

2.3 Rights Model

Rights are type-specific. The following rights are defined at the kernel level; subsystems may define additional type-specific rights in the upper 8 bits.

| Bit | Name | Meaning |
| --- | --- | --- |
| 0 | CAP_READ | Read the object's state |
| 1 | CAP_WRITE | Modify the object's state |
| 2 | CAP_EXEC | Execute / invoke the object |
| 3 | CAP_GRANT | Derive and transfer a capability to this object |
| 4 | CAP_REVOKE | Revoke derived capabilities |
| 5 | CAP_MAP | Map the object's memory into an address space |
| 6 | CAP_DESTROY | Destroy the object |
| 7–15 | | Reserved / type-specific |

Derivation rule: A derived capability may only have a subset of the parent's rights. Rights can never be amplified. A domain that holds CAP_READ | CAP_GRANT may derive a capability with CAP_READ only.
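
The derivation rule is a subset test on the rights bitmask. In the sketch below the bit values follow the rights table; the helper name `cap_may_derive` is hypothetical, and the CAP_GRANT precondition is taken from the meaning given for that bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    CAP_READ    = 1u << 0,
    CAP_WRITE   = 1u << 1,
    CAP_EXEC    = 1u << 2,
    CAP_GRANT   = 1u << 3,
    CAP_REVOKE  = 1u << 4,
    CAP_MAP     = 1u << 5,
    CAP_DESTROY = 1u << 6,
};

static_assert(CAP_DESTROY == 0x40, "bit 6");

/* Hypothetical helper: a derivation is allowed only if the parent carries
 * CAP_GRANT and new_rights contains no bit that the parent lacks. */
static inline bool cap_may_derive(uint16_t parent_rights, uint16_t new_rights)
{
    if (!(parent_rights & CAP_GRANT))
        return false;
    return (new_rights & (uint16_t)~parent_rights) == 0;   /* no amplification */
}
```

A parent holding CAP_READ | CAP_GRANT may therefore derive CAP_READ, but not CAP_READ | CAP_WRITE.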

2.4 Capability Table

Each domain owns a capability table — a flat, kernel-managed array of capability slots. The table is allocated at domain creation with a fixed capacity. User space references capabilities by their slot index (a small integer handle).

Domain Capability Table
┌───────┬──────────────────────────────────────────────┐
│ Slot  │ Capability Token (64-bit, kernel-interpreted) │
├───────┼──────────────────────────────────────────────┤
│   0   │ Self capability (CAP_READ | CAP_WRITE)        │
│   1   │ IPC endpoint capability (CAP_EXEC)            │
│   2   │ Memory region capability (CAP_READ | CAP_MAP) │
│   3   │ (empty)                                       │
│  ...  │  ...                                          │
│  N-1  │ (empty)                                       │
└───────┴──────────────────────────────────────────────┘

The capability table is allocated from a dedicated slab cache backed by pages with a non-zero s390x storage key. This provides hardware-enforced isolation: a domain cannot read another domain's capability table even if it obtains a pointer to it, because the storage key check will fault.

2.5 Capability Lifecycle

                    cap_mint(object, rights)
                              │
                              ▼
                    ┌─────────────────┐
                    │  CAPABILITY     │
                    │  VALID          │◄──── cap_derive(parent, subset_rights)
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
         cap_transfer    cap_revoke    object destroyed
              │              │              │
              ▼              ▼              ▼
       moved to         gen++ on        all tokens
       receiver's       object;         with this
       table            all tokens      index become
                        with old gen    invalid
                        invalid

2.6 Core Operations (Pseudocode)

// Mint a new capability for an existing kernel object.
// Called only from kernel context; never directly by user space.
cap_mint(object, rights):
    slot = cap_table_alloc(current_domain.cap_table)
    token.type   = object.type_id
    token.rights = rights
    token.gen    = object.cap_gen
    token.index  = object.global_index
    current_domain.cap_table[slot] = token
    return slot

// Derive a capability with reduced rights.
// Syscall: cap_derive(src_slot, new_rights) -> dst_slot
cap_derive(src_slot, new_rights):
    token = cap_lookup(current_domain, src_slot)
    assert token.rights & CAP_GRANT
    assert (new_rights & ~token.rights) == 0   // no amplification
    dst_slot = cap_table_alloc(current_domain.cap_table)
    new_token = token
    new_token.rights = new_rights
    current_domain.cap_table[dst_slot] = new_token
    return dst_slot

// Revoke all capabilities derived from an object.
// Increments the generation counter; all existing tokens become stale.
cap_revoke(object):
    atomic_inc(object.cap_gen)
    // No table scan needed: stale tokens fail at cap_lookup time.

// Look up and validate a capability slot.
// Returns the target object pointer, or fails.
cap_lookup(domain, slot):
    assert slot < domain.cap_table.capacity
    token = domain.cap_table[slot]
    assert token.type != CAP_TYPE_INVALID
    object = global_object_table[token.index]
    assert object != null
    assert object.cap_gen == token.gen    // generation check
    return object, token.rights
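
Generation-based revocation can be exercised with a small single-threaded model. Everything here is hypothetical scaffolding (the `model_` names, the four-entry object table, the struct token); it omits the capability table, the atomic increment and all locking, and returns NULL where the pseudocode asserts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_OBJECTS 4                 /* hypothetical table size */

typedef struct { uint16_t cap_gen; int live; } model_object_t;
typedef struct { uint16_t rights; uint16_t gen; uint32_t index; } model_token_t;

static model_object_t model_objects[MODEL_OBJECTS];

/* Mint: snapshot the object's current generation into the token. */
static model_token_t model_mint(uint32_t index, uint16_t rights)
{
    assert(index < MODEL_OBJECTS);
    model_objects[index].live = 1;
    return (model_token_t){ .rights = rights,
                            .gen    = model_objects[index].cap_gen,
                            .index  = index };
}

/* Revoke: bump the generation; no table scan is performed. */
static void model_revoke(uint32_t index)
{
    assert(index < MODEL_OBJECTS);
    model_objects[index].cap_gen++;
}

/* Lookup: a token is valid only while its gen matches the object's. */
static model_object_t *model_lookup(model_token_t tok)
{
    if (tok.index >= MODEL_OBJECTS)
        return NULL;
    model_object_t *obj = &model_objects[tok.index];
    if (!obj->live || obj->cap_gen != tok.gen)
        return NULL;                    /* stale or dead */
    return obj;
}
```

Copies of a token go stale together, since they share the same gen value. Note that a 16-bit counter wraps after 65,536 revocations; how the kernel handles that case is not described here.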

2.7 KOMS Integration

Every kobject_t is a capability target. The KOMS type_id field maps directly to the capability token type field. The KOMS global object table (indexed by token.index) is the authoritative registry of all live kernel objects.

The capability system does not replace KOMS reference counting. A valid capability implies the object is alive (generation check passes only while the object is alive). When an object is destroyed, its generation is incremented, invalidating all capabilities before the final koms_put.

┌─────────────────────────────────────────────────────┐
│  Capability System                                  │
│  token.index ──────────────────────────────────┐   │
│  token.gen   ──── generation check ────────┐   │   │
└────────────────────────────────────────────│───│───┘
                                             │   │
┌────────────────────────────────────────────│───│───┐
│  KOMS                                      │   │   │
│  global_object_table[index] ───────────────┘   │   │
│  kobject_t::cap_gen ───────────────────────────┘   │
│  kobject_t::ref (kref_t) — independent lifetime    │
└─────────────────────────────────────────────────────┘

3. Memory Architecture

Memory is the most critical subsystem in ZXFoundation™. Every other subsystem depends on it. This section defines strict requirements and invariants for every memory layer. Violations of these requirements are kernel panics, not recoverable errors.

3.1 Physical Memory Manager (PMM)

3.1.1 Zone Model

Physical memory is partitioned into two zones at boot time. The partition is permanent; zones are never merged or resized after pmm_init.

| Zone | Range | Purpose |
| --- | --- | --- |
| ZONE_DMA | [0, 16 MB) | Channel I/O buffers (31-bit CDA constraint) |
| ZONE_NORMAL | [16 MB, RAM limit) | General kernel and domain allocations |

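ZONE_DMA exists because of a hardware constraint on channel I/O: the Channel Data Address (CDA) field in a format-1 CCW is 31 bits, so all I/O buffers submitted to the channel subsystem must reside below 0x80000000 (2 GB). The 16 MB boundary is deliberately tighter than that limit; ZONE_DMA covers a conservative subset of the addressable range.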

3.1.2 Buddy Allocator

Each zone maintains a buddy allocator with orders 0 through MAX_ORDER (10), covering block sizes from 4 KB (order 0) to 4 MB (order 10).

Zone free lists (per order):

Order 0  (4 KB):  [pfn_a] → [pfn_b] → [pfn_c] → ∅
Order 1  (8 KB):  [pfn_d] → ∅
Order 2  (16 KB): ∅
...
Order 10 (4 MB):  [pfn_e] → ∅

Buddy invariants (non-negotiable):

  1. Every free block is buddy-aligned: pfn % (1 << order) == 0.
  2. Coalescing is mandatory on every free. If a block's buddy is also free, they are merged into a block of order+1, recursively up to MAX_ORDER.
  3. A block may only be freed at the same order it was allocated. Mismatched order corrupts the buddy tree and is a kernel panic.
  4. Free blocks are poisoned with PF_POISON. Any allocation that returns a non-poisoned block indicates a double-allocation bug.
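
The alignment invariant makes buddy arithmetic a matter of single-bit operations on the PFN. The sketch below uses the usual XOR formulation, which is an assumption about the implementation rather than something stated above; the function names and `PAGE_SHIFT` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ORDER  10
#define PAGE_SHIFT 12                   /* 4 KB pages */

/* Block size in bytes for a given order: 4 KB << order. */
static inline uint64_t buddy_block_bytes(unsigned order)
{
    return UINT64_C(1) << (PAGE_SHIFT + order);
}

/* Invariant 1: pfn % (1 << order) == 0. */
static inline bool buddy_aligned(uint64_t pfn, unsigned order)
{
    return (pfn % (UINT64_C(1) << order)) == 0;
}

/* The buddy of an aligned block differs from it in exactly bit `order`. */
static inline uint64_t buddy_of(uint64_t pfn, unsigned order)
{
    assert(order < MAX_ORDER && buddy_aligned(pfn, order));
    return pfn ^ (UINT64_C(1) << order);
}

/* First PFN of the order+1 block formed when a block and its buddy merge. */
static inline uint64_t buddy_merged(uint64_t pfn, unsigned order)
{
    return pfn & ~(UINT64_C(1) << order);
}
```

Coalescing repeats `buddy_of` and `buddy_merged` at increasing orders until the buddy is not free or MAX_ORDER is reached.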

3.1.3 Per-CPU Page Cache

Order-0 (4 KB) allocations are served from a per-CPU cache to avoid zone lock contention on the hot path.

Per-CPU cache (one per zone per CPU):

  count = 7
  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
  │pfn_0│pfn_1│pfn_2│pfn_3│pfn_4│pfn_5│pfn_6│  -  │  -  │
  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
  ← count                                    PCP_HIGH=16 →

  Refill: when count == 0, acquire zone lock, pop PCP_BATCH=8 pages.
  Drain:  when count > PCP_HIGH, acquire zone lock, push PCP_BATCH pages.

The per-CPU cache is accessed with IRQs disabled. No spinlock is needed because the cache is strictly per-CPU and IRQ handlers that allocate memory must use ZX_GFP_ATOMIC, which bypasses the per-CPU cache and draws directly from the zone's atomic reserve.
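
The refill and drain thresholds can be modelled in a few lines. This is a sketch with a stand-in zone: `fake_zone_t`, `zone_pop`, `zone_push`, `pcp_alloc`, `pcp_free` and `PFN_INVALID` are all hypothetical, and the IRQ disabling and zone lock are only marked by comments.

```c
#include <assert.h>
#include <stdint.h>

#define PCP_HIGH    16
#define PCP_BATCH   8
#define PFN_INVALID UINT64_MAX          /* hypothetical sentinel */

typedef struct { uint64_t pfn[PCP_HIGH + 1]; unsigned count; } pcp_t;

/* Stand-in for a zone: hands out consecutive PFNs and counts free pages. */
typedef struct { uint64_t next_pfn; unsigned free_pages; } fake_zone_t;

static uint64_t zone_pop(fake_zone_t *z)
{
    if (z->free_pages == 0)
        return PFN_INVALID;
    z->free_pages--;
    return z->next_pfn++;
}

static void zone_push(fake_zone_t *z, uint64_t pfn)
{
    (void)pfn;                          /* the model only counts pages */
    z->free_pages++;
}

static uint64_t pcp_alloc(pcp_t *pcp, fake_zone_t *z)
{
    if (pcp->count == 0) {              /* refill: zone lock would be held here */
        for (unsigned i = 0; i < PCP_BATCH; i++) {
            uint64_t pfn = zone_pop(z);
            if (pfn == PFN_INVALID)
                break;
            pcp->pfn[pcp->count++] = pfn;
        }
    }
    return pcp->count ? pcp->pfn[--pcp->count] : PFN_INVALID;
}

static void pcp_free(pcp_t *pcp, fake_zone_t *z, uint64_t pfn)
{
    assert(pcp->count <= PCP_HIGH);
    pcp->pfn[pcp->count++] = pfn;
    if (pcp->count > PCP_HIGH) {        /* drain: zone lock would be held here */
        for (unsigned i = 0; i < PCP_BATCH; i++)
            zone_push(z, pcp->pfn[--pcp->count]);
    }
}
```

Starting from an empty cache, the first allocation refills eight pages and hands one out, leaving count at 7.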

3.1.4 Atomic Reserve

Each zone withholds PMM_ATOMIC_RESERVE = 64 pages from the buddy allocator. These pages are accessible only to callers that pass ZX_GFP_ATOMIC, so hard-IRQ context allocations (e.g., channel I/O completion handlers) can still succeed under memory pressure, until the reserve itself is exhausted.

Strict requirement: ZX_GFP_ATOMIC must only be used from hard-IRQ context. Using it from process context to bypass memory pressure is prohibited and will be detected by a context check in debug builds.

3.1.5 PMM Allocation Flow

pmm_alloc_page(gfp):
    // zone and requested_order are the target zone and block order (0 = one 4 KB page).
    if gfp & ZX_GFP_ATOMIC:
        goto zone_alloc          // bypass per-CPU cache
    if requested_order == 0:
        page = pcp_pop(current_cpu, zone)
        if page: return page
        pcp_refill(current_cpu, zone)
        return pcp_pop(current_cpu, zone)
zone_alloc:
    acquire zone.lock (irqsave)
    for order in [requested_order .. MAX_ORDER]:
        pfn = free_area_pop(zone, order)
        if pfn != INVALID:
            split down to requested_order
            release zone.lock
            if gfp & ZX_GFP_ZERO: zero_page(pfn)
            return pfn_to_page(pfn)
    if gfp & ZX_GFP_ATOMIC and zone.atomic_reserve > 0:
        // draw from reserve
        ...
    release zone.lock
    return nullptr              // OOM

3.1.6 PMM Strict Requirements

| # | Requirement |
| --- | --- |
| PMM-1 | pmm_free_page/pages must never be called on a page not in PF_BUDDY state. Double-free is a kernel panic. |
| PMM-2 | The order passed to pmm_free_pages must match the order used at allocation. |
| PMM-3 | Allocation from hard-IRQ context requires ZX_GFP_ATOMIC. Any other flag in IRQ context is a kernel panic. |
| PMM-4 | zx_mem_map[] is allocated during pmm_init and never freed. It must not be modified after init except by the PMM itself. |
| PMM-5 | The per-CPU cache must be drained to the zone before a CPU goes offline. |
| PMM-6 | ZONE_DMA and ZONE_NORMAL boundaries are immutable after pmm_init. |

3.2 Virtual Memory Manager (VMM)

3.2.1 Address Space Layout

Virtual Address Space (64-bit z/Architecture, 5-level DAT)

0x0000_0000_0000_0000 ┌──────────────────────────────────────┐
                      │  User / Domain space                 │
                      │  (per-domain ASCE, problem state)    │
0x0000_7FFF_FFFF_FFFF └──────────────────────────────────────┘
                        [ translation exception — unmapped ]
0xFFFF_8000_0000_0000 ┌──────────────────────────────────────┐
                      │  HHDM — Higher-Half Direct Map       │
                      │  PA 0x0 → VA 0xFFFF_8000_0000_0000   │
                      │  Mapped with EDAT-1 (1 MB pages)     │
0xFFFF_C000_0000_0000 ├──────────────────────────────────────┤
                      │  vmalloc / ioremap region            │
                      │  Virtually contiguous, phys-discontig│
0xFFFF_E000_0000_0000 ├──────────────────────────────────────┤
                      │  Kernel image + BSS + static data    │
0xFFFF_FFFF_FFFF_FFFF └──────────────────────────────────────┘

The HHDM offset 0xFFFF_8000_0000_0000 places the kernel in R1 entry 2047 (the topmost Region-First entry), cleanly separating kernel (R1[2047]) from user space (R1[0..2046]) at the highest table level.
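
The region-first index is the top 11 bits of a 64-bit virtual address, which makes the claim above easy to check. The helper and macro names in this sketch are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HHDM_OFFSET UINT64_C(0xFFFF800000000000)   /* hypothetical macro name */

/* Region-first (R1) index: bits 0..10 in z/Architecture numbering,
 * i.e. the 11 most significant bits of the virtual address. */
static inline unsigned r1_index(uint64_t va)
{
    return (unsigned)(va >> 53);
}

/* Direct-map conversions between physical and HHDM virtual addresses. */
static inline uint64_t hhdm_phys_to_virt(uint64_t pa)
{
    return pa + HHDM_OFFSET;
}

static inline uint64_t hhdm_virt_to_phys(uint64_t va)
{
    assert(va >= HHDM_OFFSET);          /* only valid for HHDM addresses */
    return va - HHDM_OFFSET;
}
```

Both the start of the HHDM and the last kernel address select R1 entry 2047, while the top of the user range selects entry 0.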

3.2.2 vm_space_t and VMA Tree

Each address space is represented by a vm_space_t. The kernel has one (kernel_vm_space). Each domain has its own, created at domain birth and destroyed at domain death.

VMAs are indexed by an augmented RB-tree keyed on vm_start. Each node carries subtree_max_end, the maximum vm_end in its subtree. This lets a search discard an entire subtree with a single O(1) comparison, so both free-gap search for vmalloc and overlap detection run in O(log n).

VMA Tree (augmented RB-tree):

                  [0xC000, 0xE000, max_end=0xF000]
                 /                                 \
  [0xA000, 0xB000, max_end=0xB000]    [0xE000, 0xF000, max_end=0xF000]

  Each node: vm_start (key), vm_end, subtree_max_end, vm_prot, rb_node
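
How subtree_max_end prunes an overlap query can be shown with a plain binary search tree. The sketch below is not the kernel's tree: it omits red-black rebalancing, RCU and locking, treats ranges as half-open [vm_start, vm_end), and uses hypothetical names throughout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct vma_node {
    uint64_t vm_start, vm_end;          /* half-open range [vm_start, vm_end) */
    uint64_t subtree_max_end;           /* max vm_end in this subtree */
    struct vma_node *left, *right;
} vma_node_t;

/* Unbalanced BST insert keyed on vm_start; updates the augmentation
 * on every node along the insertion path. */
static vma_node_t *vma_insert(vma_node_t *root, vma_node_t *n)
{
    assert(n->vm_start < n->vm_end);
    if (!root) {
        n->left = n->right = NULL;
        n->subtree_max_end = n->vm_end;
        return n;
    }
    if (n->vm_start < root->vm_start)
        root->left = vma_insert(root->left, n);
    else
        root->right = vma_insert(root->right, n);
    if (n->vm_end > root->subtree_max_end)
        root->subtree_max_end = n->vm_end;
    return root;
}

/* True if any VMA in the tree overlaps [start, end). */
static bool vma_overlaps(const vma_node_t *node, uint64_t start, uint64_t end)
{
    while (node) {
        if (node->subtree_max_end <= start)
            return false;               /* nothing here ends after start */
        if (node->vm_start < end && start < node->vm_end)
            return true;
        /* Descend left only if something there can still reach past start. */
        if (node->left && node->left->subtree_max_end > start)
            node = node->left;
        else
            node = node->right;
    }
    return false;
}
```

Each step either answers the query or descends one level, so the cost is bounded by the tree height.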

Locking model:

  • Readers call vmm_find_vma inside rcu_read_lock(). Fully lockless. The RCU-protected tree guarantees that a reader always sees a consistent snapshot, even while a writer is modifying the tree.
  • Writers acquire aug_root.lock (spinlock, irqsave) before any insert, remove, or augmentation update.

A per-CPU hint cache stores the last-found VMA per CPU. On a cache hit (the faulting address falls within the cached VMA), the tree walk is skipped entirely — O(1) on the hot page-fault path.

3.2.3 VMM Strict Requirements

| # | Requirement |
| --- | --- |
| VMM-1 | All VMA modifications must hold aug_root.lock (spinlock, irqsave). |
| VMM-2 | All VMA reads must be inside rcu_read_lock(). |
| VMM-3 | VMAs must not overlap. vmm_insert_vma rejects overlapping ranges. |
| VMM-4 | vm_start and vm_end must be page-aligned (4 KB boundary). |
| VMM-5 | A vm_space_t must not be destroyed while any VMA remains mapped. |
| VMM-6 | The kernel ASCE (CR1) must never be loaded into a domain's address space. |
| VMM-7 | EDAT large pages (1 MB, 2 GB) must not be used for user domain mappings without an explicit CAP_MAP capability granting large-page access. |
| VMM-8 | vmm_remove_vma must unmap all backing pages and perform a TLB invalidation (IPTE/IDTE) before returning. |

3.2.4 Domain Address Space Creation

When a new domain is created, the kernel allocates a fresh vm_space_t and a new R1 page table. The kernel HHDM mapping is not shared into domain address spaces. Domains have no visibility into kernel virtual addresses.

Domain address space creation:

  alloc vm_space_t
  alloc R1 table (16 KB, order=2, ZONE_NORMAL)
  initialize all R1 entries as invalid (Z_I_BIT set)
  set vm_space.pgtbl_root = phys(R1)
  set vm_space.asce = encode_asce(phys(R1), DT=R1, TL=2048)
  // Domain's ASCE is loaded into CR1 on context switch to this domain.
  // Kernel ASCE remains in a separate register save area.

3.3 Slab and Object Allocator

3.3.1 Magazine-Depot Model

The slab allocator uses a magazine-depot architecture for per-CPU caching of fixed-size objects.

Per-CPU layer (no lock needed, IRQs disabled):
  ┌──────────────────────────────────────────┐
  │  Hot magazine  [obj0│obj1│obj2│...│objN] │  ← alloc/free here
  │  Cold magazine [obj0│obj1│...          ] │  ← swap with hot when full/empty
  └──────────────────────────────────────────┘
           ↕ swap (acquire depot lock)
Global depot layer (spinlock):
  ┌──────────────────────────────────────────┐
  │  Full magazines:  [mag_a][mag_b][mag_c]  │
  │  Empty magazines: [mag_d][mag_e]         │
  └──────────────────────────────────────────┘
           ↕ slab page allocation (acquire zone lock)
PMM (buddy allocator)

Allocation: pop from hot magazine. If empty, swap hot/cold. If cold also empty, fetch a full magazine from the depot. If depot has none, allocate a new slab page from PMM and populate a magazine.

Free: push to hot magazine. If full, swap hot/cold. If cold also full, return the cold magazine to the depot as a full magazine.

3.3.2 Storage Key Isolation

Each slab cache may be created with a non-zero s390x storage key. Pages backing that cache are assigned the specified key. A domain that does not hold the matching key in its PSW access key field will receive a protection exception if it attempts to access those pages.

Capability table pages use a dedicated storage key (key 1 by convention). This provides hardware-enforced isolation: even if a domain obtains a pointer to another domain's capability table, the storage key check will fault before any data is read.

3.3.3 Slab Strict Requirements

| # | Requirement |
| --- | --- |
| SLAB-1 | kmem_cache_alloc must not be called from hard-IRQ context unless the cache was created with atomic support. Use kmalloc(ZX_GFP_ATOMIC) from IRQ context. |
| SLAB-2 | kmem_cache_free must only be called with a pointer returned by kmem_cache_alloc on the same cache. Cross-cache free is undefined behavior. |
| SLAB-3 | Freed objects are poisoned with a sentinel pattern. Re-use before alloc is detected in debug builds. |
| SLAB-4 | kmem_cache_destroy must only be called after all objects have been returned. Outstanding objects at destroy time is a kernel panic. |

3.4 Capability Memory

Capability tables are the most security-sensitive data structure in the kernel. They receive special treatment beyond the standard slab rules.

3.4.1 Allocation

Capability tables are allocated from a dedicated slab cache:

  • Storage key: 1 (non-zero, distinct from general kernel data at key 0).
  • GFP flags: ZX_GFP_NORMAL only. Capability tables are never allocated from the atomic reserve.
  • Pages are marked PF_PINNED immediately after allocation. They are never reclaimed, swapped, or migrated.

3.4.2 Lifetime

A capability table is created atomically with its domain. It is destroyed atomically when the domain dies. The destruction sequence is:

domain_destroy(domain):
    // 1. Freeze the domain: no new capabilities may be minted into it.
    domain.state = DOMAIN_DYING
    // 2. Revoke all capabilities in the table.
    for slot in domain.cap_table:
        if domain.cap_table[slot].type != CAP_TYPE_INVALID:
            cap_revoke_slot(domain, slot)
    // 3. Free the table pages.
    kmem_cache_free(cap_table_cache, domain.cap_table)
    // 4. Drop the domain kobject reference.
    koms_put(domain.kobj)

Step 2 increments the generation counter on every object the domain held capabilities to. This atomically invalidates all derived capabilities that other domains may have received from this domain.

3.4.3 Capability Memory Strict Requirements

| # | Requirement |
| --- | --- |
| CAP-MEM-1 | Capability table pages must be PF_PINNED. They are never reclaimed. |
| CAP-MEM-2 | Capability table pages use storage key 1. General kernel data uses key 0. |
| CAP-MEM-3 | Capability table destruction must complete before the domain's vm_space_t is torn down. |
| CAP-MEM-4 | No capability token may be stored in user-accessible memory. The kernel never copies a raw token to user space. |

3.5 Memory for IPC

IPC memory is designed to minimize allocation on the critical path.

3.5.1 Synchronous IPC — Zero Allocation

Small synchronous messages (up to 8 × 64-bit registers) are passed entirely in CPU registers. The kernel performs a direct thread switch: the sender's GPRs 2–9 become the receiver's GPRs 2–9. No kernel buffer is allocated. No memory is touched beyond the two threads' kernel stacks.

3.5.2 Asynchronous Queue — Fixed-Capacity Ring Buffer

Each IPC endpoint that supports async messaging owns a fixed-capacity ring buffer, allocated from the slab at endpoint creation time. The capacity is specified at creation and never changes.

Async message queue (ring buffer):

  head ──►  ┌──────────────────────────────────────────┐
            │  msg[0]: tag | regs[8] | caps[4]         │
            │  msg[1]: tag | regs[8] | caps[4]         │
            │  msg[2]: (empty)                         │
            │  ...                                     │
  tail ──►  │  msg[N-1]: (empty)                       │
            └──────────────────────────────────────────┘
  capacity = N (fixed at endpoint creation)
  each message slot = 136 bytes (104-byte payload: 8 + 8×8 + 4×8, plus 32 bytes of padding)

The ring buffer is allocated with ZX_GFP_NORMAL and is never reallocated. If the queue is full, the send operation returns ERR_QUEUE_FULL to the sender. The sender is responsible for retry or backpressure.
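
A minimal C sketch of such a fixed-capacity queue follows. It is an illustration under stated assumptions, not the kernel's implementation: the type names ipc_msg_t and ipc_ring_t, the ZX_* return codes, and the head-plus-count bookkeeping are all hypothetical, locking is omitted, and the per-slot padding is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZX_OK               0
#define ZX_ERR_QUEUE_FULL   (-1)   /* stand-in value for ERR_QUEUE_FULL */
#define ZX_ERR_QUEUE_EMPTY  (-2)   /* hypothetical; not named in the text */

typedef struct {
    uint64_t tag;
    uint64_t regs[8];
    uint64_t caps[4];
} ipc_msg_t;

typedef struct {
    ipc_msg_t *slots;     /* storage allocated once, at endpoint creation */
    size_t     capacity;  /* N; never changes afterwards */
    size_t     head;      /* index of the oldest queued message */
    size_t     count;     /* number of queued messages */
} ipc_ring_t;

static void ipc_ring_init(ipc_ring_t *r, ipc_msg_t *storage, size_t capacity)
{
    r->slots = storage;
    r->capacity = capacity;
    r->head = 0;
    r->count = 0;
}

/* Copy msg into the next free slot, or report a full queue to the sender. */
static int ipc_ring_enqueue(ipc_ring_t *r, const ipc_msg_t *msg)
{
    if (r->count == r->capacity)
        return ZX_ERR_QUEUE_FULL;
    size_t tail = (r->head + r->count) % r->capacity;
    r->slots[tail] = *msg;
    r->count++;
    return ZX_OK;
}

/* Remove the oldest message (FIFO order). */
static int ipc_ring_dequeue(ipc_ring_t *r, ipc_msg_t *out)
{
    if (r->count == 0)
        return ZX_ERR_QUEUE_EMPTY;
    *out = r->slots[r->head];
    r->head = (r->head + 1) % r->capacity;
    r->count--;
    return ZX_OK;
}
```

Neither operation allocates memory, which is the property the section relies on: the only failure mode on the send side is a full queue.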

3.5.3 Shared Memory — Zero-Copy Large Transfer

For bulk data transfer, the sender grants a CAP_MAP capability on a VMA. The receiver maps the VMA into its own address space via vmm_insert_vma. No kernel buffer is involved. The physical pages are shared between the two address spaces via DAT table entries pointing to the same physical frames.

Shared memory transfer:

  Sender domain                    Receiver domain
  vm_space_t                       vm_space_t
  ┌──────────────────┐             ┌──────────────────┐
  │ VMA [A, B)       │             │ VMA [C, D)       │
  │ prot: R/W        │             │ prot: R (derived)│
  └────────┬─────────┘             └────────┬─────────┘
           │ DAT entries                    │ DAT entries
           └──────────────┬─────────────────┘
                          ▼
                  Physical frames [P0, P1, ...]

The receiver's mapping uses the rights from the CAP_MAP capability. If the capability grants only CAP_READ, the receiver's DAT entries are read-only. A write attempt generates a protection exception in the receiver's domain, not a kernel panic.


4. IPC Subsystem

4.1 Design Goals

IPC is the primary communication mechanism between all domains. Because ZXFoundation™ is a microkernel, IPC performance directly determines system throughput. The design targets:

  • Synchronous fastpath latency: < 1 µs on z/Architecture (single hop, no contention, small message).
  • Async queue throughput: limited only by memory bandwidth and ring buffer capacity.
  • Zero kernel allocation on the synchronous fastpath.
  • Capability transfer atomicity: a capability moved in a message is never visible in both sender and receiver simultaneously.

4.2 IPC Endpoint

An IPC endpoint is a kernel object (kobject_t, type KOBJ_TYPE_ENDPOINT). It is the rendezvous point for IPC. A domain that wishes to receive messages creates an endpoint and publishes a capability to it.

Endpoint state:

  ENDPOINT_IDLE      — no sender or receiver waiting
  ENDPOINT_RECV_WAIT — a receiver thread is blocked, waiting for a message
  ENDPOINT_SEND_WAIT — one or more sender threads are queued (async overflow)

An endpoint is addressed exclusively by capability. A domain that does not hold a capability to an endpoint cannot send to or receive from it.

4.3 Synchronous Fastpath

The synchronous fastpath is the primary IPC mechanism. It is used when the receiver is already blocked on the endpoint.

Synchronous IPC fastpath:

  Sender                    Kernel                    Receiver
    │                          │                          │
    │  ipc_call(ep_cap,        │                          │
    │    regs[0..7])           │                          │
    ├─────────────────────────►│                          │
    │                          │  cap_lookup(ep_cap)      │
    │                          │  endpoint.state ==       │
    │                          │    RECV_WAIT?  YES       │
    │                          │                          │
    │                          │  copy regs[0..7] to      │
    │                          │  receiver kernel stack   │
    │                          │                          │
    │                          │  transfer caps (if any)  │
    │                          │  from sender table to    │
    │                          │  receiver table          │
    │                          │                          │
    │                          │  direct thread switch:   │
    │  [blocked]               │  sender → BLOCKED        │
    │                          │  receiver → RUNNING      │
    │                          ├─────────────────────────►│
    │                          │                          │  regs[0..7]
    │                          │                          │  available
    │                          │                          │
    │                          │  receiver calls          │
    │                          │  ipc_reply(regs[0..7])   │
    │                          │◄─────────────────────────┤
    │                          │  direct thread switch:   │
    │                          │  receiver → BLOCKED      │
    │◄─────────────────────────┤  sender → RUNNING        │
    │  regs[0..7] = reply      │                          │

The direct thread switch bypasses the scheduler run queue entirely. The kernel saves the sender's context, restores the receiver's context, and returns to user space in the receiver. This is the seL4-style fastpath.

Fastpath conditions (all must hold; any failure falls back to slow path):

  1. Endpoint state is ENDPOINT_RECV_WAIT.
  2. Message fits in 8 registers (no large payload).
  3. At most 4 capability handles transferred.
  4. Receiver thread is on the same CPU (avoids cross-CPU IPI on fastpath).
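
The four conditions above can be expressed as a single predicate. The following C sketch assumes a simplified view of the endpoint; endpoint_view_t, its receiver_cpu field, and ipc_fastpath_ok are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { IPC_FAST_MAX_REGS = 8, IPC_FAST_MAX_CAPS = 4 };

typedef enum {
    ENDPOINT_IDLE,
    ENDPOINT_RECV_WAIT,
    ENDPOINT_SEND_WAIT
} endpoint_state_t;

/* Hypothetical, reduced view of an endpoint for the fastpath check. */
typedef struct {
    endpoint_state_t state;
    uint32_t         receiver_cpu;  /* CPU the blocked receiver last ran on */
} endpoint_view_t;

/* True only if every fastpath condition holds; otherwise use the slow path. */
static bool ipc_fastpath_ok(const endpoint_view_t *ep, uint32_t nr_regs,
                            uint32_t nr_caps, uint32_t sender_cpu)
{
    return ep->state == ENDPOINT_RECV_WAIT     /* 1. receiver is waiting   */
        && nr_regs <= IPC_FAST_MAX_REGS        /* 2. fits in 8 registers   */
        && nr_caps <= IPC_FAST_MAX_CAPS        /* 3. at most 4 capabilities */
        && ep->receiver_cpu == sender_cpu;     /* 4. same CPU, no IPI      */
}
```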

4.4 Asynchronous Queue Fallback

When the fastpath conditions are not met, the message is enqueued in the endpoint's ring buffer and the sender continues without blocking.

Async send path:

  ipc_send_async(ep_cap, msg):
      endpoint = cap_lookup(ep_cap, CAP_EXEC)
      acquire endpoint.lock (spinlock, irqsave)
      if ring_buffer_full(endpoint.queue):
          release endpoint.lock
          return ERR_QUEUE_FULL
      ring_buffer_enqueue(endpoint.queue, msg)
      if endpoint.state == ENDPOINT_RECV_WAIT:
          // Wake the receiver; it dequeues the message itself.
          thread_wake(endpoint.waiting_receiver)
          endpoint.state = ENDPOINT_IDLE
      release endpoint.lock
      return OK

  ipc_recv(ep_cap):
      endpoint = cap_lookup(ep_cap, CAP_EXEC)
      acquire endpoint.lock (spinlock, irqsave)
      while ring_buffer_empty(endpoint.queue):
          endpoint.state = ENDPOINT_RECV_WAIT
          endpoint.waiting_receiver = current_thread
          release endpoint.lock
          thread_block()          // deschedule; woken by sender
          acquire endpoint.lock   // re-check the queue after wake-up
      msg = ring_buffer_dequeue(endpoint.queue)
      release endpoint.lock
      return msg

4.5 Message Structure

Every IPC message has the same fixed structure regardless of path:

IPC Message (136 bytes):

  ┌──────────────────────────────────────────────────────────┐
  │  tag      [63:0]   — message type / protocol identifier  │
  ├──────────────────────────────────────────────────────────┤
  │  regs[0]  [63:0]   ─┐                                    │
  │  regs[1]  [63:0]    │                                    │
  │  ...                │  8 × 64-bit data words             │
  │  regs[7]  [63:0]   ─┘                                    │
  ├──────────────────────────────────────────────────────────┤
  │  caps[0]  [63:0]   ─┐                                    │
  │  caps[1]  [63:0]    │  4 × capability handles            │
  │  caps[2]  [63:0]    │  (slot indices in sender's table)  │
  │  caps[3]  [63:0]   ─┘                                    │
  └──────────────────────────────────────────────────────────┘
  Total: (1 + 8 + 4) × 8 = 104 bytes of payload
         + 32 bytes padding = 136 bytes per slot
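
The layout can be checked at compile time. The sketch below is one possible C rendering, assuming the slot is padded from the 104-byte payload to the stated 136 bytes with four reserved 64-bit words; the reserved field, the type name zx_ipc_slot_t, and the helper zx_ipc_queue_bytes are assumptions, not names from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t tag;          /* message type / protocol identifier */
    uint64_t regs[8];      /* 8 data words */
    uint64_t caps[4];      /* slot indices in the sender's capability table */
    uint64_t reserved[4];  /* assumption: pads the payload to a 136-byte slot */
} zx_ipc_slot_t;

static_assert(offsetof(zx_ipc_slot_t, regs) == 8, "regs follow the tag");
static_assert(offsetof(zx_ipc_slot_t, caps) == 72, "caps follow 8 data words");
static_assert(offsetof(zx_ipc_slot_t, reserved) == 104, "payload is 104 bytes");
static_assert(sizeof(zx_ipc_slot_t) == 136, "one ring-buffer slot");

/* Bytes needed for a ring buffer with the given number of slots. */
static size_t zx_ipc_queue_bytes(size_t capacity)
{
    return capacity * sizeof(zx_ipc_slot_t);
}
```

With this layout a 64-slot queue occupies 64 × 136 = 8704 bytes.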

4.6 Capability Transfer

Capabilities included in a message (caps[0..3]) are transferred with move semantics: the kernel atomically removes the capability from the sender's table and inserts it into the receiver's table. The sender's slot is cleared. The capability is never simultaneously visible in both tables.

cap_transfer(sender, receiver, sender_slot):
    // Lock both tables in address order to avoid deadlock.
    if address_of(sender.cap_table) < address_of(receiver.cap_table):
        acquire sender.cap_table.lock
        acquire receiver.cap_table.lock
    else:
        acquire receiver.cap_table.lock
        acquire sender.cap_table.lock
    token = sender.cap_table[sender_slot]
    assert token.type != CAP_TYPE_INVALID
    dst_slot = cap_table_alloc(receiver.cap_table)
    receiver.cap_table[dst_slot] = token
    sender.cap_table[sender_slot] = CAP_INVALID
    release receiver.cap_table.lock
    release sender.cap_table.lock
    return dst_slot
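
The move semantics can be sketched in C with array-backed tables. This is an illustration only: locking is omitted, the table size is arbitrary, and the names cap_entry_t, cap_table_sketch_t, cap_table_find_free, cap_move, and the CAP_ERR_* codes are hypothetical. Unlike the pseudocode above, the sketch also shows one way to handle a full receiver table, leaving the sender's slot untouched on failure.

```c
#include <assert.h>
#include <stdint.h>

#define CAP_TYPE_INVALID  0u
#define CAP_TABLE_SLOTS   8       /* hypothetical table size */
#define CAP_ERR_NO_SLOT   (-1)
#define CAP_ERR_BAD_SLOT  (-2)

typedef struct {
    uint32_t type;     /* CAP_TYPE_INVALID marks an empty slot */
    uint32_t rights;
    uint64_t object;
} cap_entry_t;

typedef struct {
    cap_entry_t slot[CAP_TABLE_SLOTS];
} cap_table_sketch_t;

/* Return the first empty slot index, or CAP_ERR_NO_SLOT if the table is full. */
static int cap_table_find_free(const cap_table_sketch_t *t)
{
    for (int i = 0; i < CAP_TABLE_SLOTS; i++)
        if (t->slot[i].type == CAP_TYPE_INVALID)
            return i;
    return CAP_ERR_NO_SLOT;
}

/* Move src->slot[src_slot] into dst. On failure the source is unchanged. */
static int cap_move(cap_table_sketch_t *src, int src_slot,
                    cap_table_sketch_t *dst)
{
    if (src_slot < 0 || src_slot >= CAP_TABLE_SLOTS ||
        src->slot[src_slot].type == CAP_TYPE_INVALID)
        return CAP_ERR_BAD_SLOT;
    int dst_slot = cap_table_find_free(dst);
    if (dst_slot < 0)
        return CAP_ERR_NO_SLOT;
    dst->slot[dst_slot] = src->slot[src_slot];
    src->slot[src_slot] = (cap_entry_t){ .type = CAP_TYPE_INVALID };
    return dst_slot;
}
```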

4.7 IPC and KOMS

IPC endpoints are kobject_t instances registered in the KOMS namespace under the owning domain's subtree. A domain may publish an endpoint by name, allowing other domains to discover it via koms_ns_find_get and then request a capability from a trusted broker.

KOMS namespace (IPC endpoints):

  koms_root_ns
  └── "domains"
      ├── "block-io"
      │   └── "ep.request"   ← IPC endpoint kobject
      ├── "filesystem"
      │   └── "ep.request"
      └── "console"
          └── "ep.write"

5. Process and Domain Model

5.1 Fundamental Units

ZXFoundation™ defines two fundamental execution units:

  • Domain: the unit of isolation. Owns an address space (vm_space_t), a capability table, and one or more threads. Analogous to a process in a monolithic kernel, but the kernel makes no distinction between a "driver domain" and an "application domain."

  • Thread: the unit of scheduling. Belongs to exactly one domain. Has a kernel stack, a saved register set (irq_frame_t), and a scheduling state. Threads within the same domain share the domain's address space and capability table.

5.2 Domain Lifecycle

                    domain_create()
                          │
                          ▼
                  ┌───────────────┐
                  │   CREATING    │  — address space allocated,
                  └───────┬───────┘    capability table allocated,
                          │            initial thread created
                          ▼
                  ┌───────────────┐
                  │    RUNNING    │◄──── threads scheduled normally
                  └───────┬───────┘
                          │
              ┌───────────┼───────────┐
              │           │           │
         domain_kill   unhandled   watchdog
              │         fault       timeout
              │           │           │
              ▼           ▼           │
        ┌──────────┐ ┌──────────┐    │
        │  DYING   │ │ FAULTED  │◄───┘
        └────┬─────┘ └────┬─────┘
             │            │
             │     supervisor domain
             │     decides: restart or kill
             │            │
             │     ┌──────┴──────┐
             │     │             │
             │  restart        kill
             │     │             │
             │     ▼             │
             │ ┌──────────┐      │
             │ │RESTARTING│      │
             │ └────┬─────┘      │
             │      │            │
             │      ▼            │
             │ back to RUNNING   │
             │                   ▼
             │  ┌────────┐  ┌──────┐
             └─►│  DEAD  │  │ DEAD │
                └────────┘  └──────┘

5.3 Domain Structure

A domain is a kobject_t of type KOBJ_TYPE_DOMAIN. It embeds:

Domain object:

  kobject_t         kobj          — KOMS base (lifecycle, namespace, events)
  vm_space_t        space         — address space (ASCE, VMA tree)
  cap_table_t       cap_table     — capability table
  list_head_t       threads       — list of owned threads
  spinlock_t        lock          — protects state transitions
  domain_state_t    state         — CREATING/RUNNING/DYING/FAULTED/RESTARTING/DEAD
  uint32_t          domain_id     — globally unique identifier
  kobject_t        *supervisor    — domain that receives fault events (may be null)
  uint64_t          heartbeat_seq — watchdog sequence number

5.4 Thread Structure

A thread is a kobject_t of type KOBJ_TYPE_THREAD. It embeds:

Thread object:

  kobject_t         kobj          — KOMS base
  domain_t         *domain        — owning domain (non-null, immutable)
  irq_frame_t       saved_regs    — GPRs, FPRs, PSW (saved on context switch)
  uint64_t          kernel_stack  — kernel stack top (virtual address)
  thread_state_t    state         — RUNNABLE/RUNNING/BLOCKED/DEAD
  sched_entity_t    sched         — scheduler run queue linkage
  uint32_t          priority      — scheduling priority class
  uint64_t          cpu_mask      — CPU affinity bitmask
  uint64_t          user_timer    — accumulated user-mode CPU time (ns)
  uint64_t          sys_timer     — accumulated kernel-mode CPU time (ns)

5.5 Fault Containment

When a domain faults (unhandled program check, protection exception, or watchdog timeout), the kernel:

  1. Suspends all threads in the domain (sets state to BLOCKED).
  2. Sets domain state to FAULTED.
  3. Fires KOBJ_EVENT_DOMAIN_FAULT on the domain's kobject.
  4. If the domain has a registered supervisor, delivers an IPC message to the supervisor's fault endpoint containing the fault code and domain ID.
  5. The supervisor decides: call domain_restart or domain_kill.

If no supervisor is registered, the kernel kills the domain immediately. The kernel itself never panics due to a domain fault.

Fault containment flow:

  Domain D faults
       │
       ▼
  kernel suspends D's threads
  D.state = FAULTED
  koms_event_fire(D, KOBJ_EVENT_DOMAIN_FAULT)
       │
       ├── supervisor registered?
       │         YES                        NO
       │          │                          │
       ▼          ▼                          ▼
  IPC message to supervisor          domain_kill(D)
  { fault_code, domain_id }
       │
       ├── supervisor calls domain_restart(D)
       │         │
       │         ▼
       │   D.state = RESTARTING
       │   reset address space
       │   reset capability table
       │   restart initial thread
       │   D.state = RUNNING
       │
       └── supervisor calls domain_kill(D)
                 │
                 ▼
           D.state = DEAD
           destroy address space
           destroy capability table
           koms_put(D)
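
The flow above reduces to two small transition functions. The C sketch below is a simplified model, not kernel code: the enum values mirror the state names used in this chapter, while supervisor_decision_t, domain_on_fault, and domain_on_decision are hypothetical names, and the transient RESTARTING state is folded into the final result.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DOMAIN_CREATING,
    DOMAIN_RUNNING,
    DOMAIN_DYING,
    DOMAIN_FAULTED,
    DOMAIN_RESTARTING,
    DOMAIN_DEAD
} domain_state_t;

typedef enum { SUP_RESTART, SUP_KILL } supervisor_decision_t;

/* State a running domain settles in after a fault. Without a supervisor
 * the kernel kills the domain immediately. */
static domain_state_t domain_on_fault(bool has_supervisor)
{
    return has_supervisor ? DOMAIN_FAULTED : DOMAIN_DEAD;
}

/* Outcome of the supervisor's decision. A restart passes through
 * DOMAIN_RESTARTING and ends in DOMAIN_RUNNING. */
static domain_state_t domain_on_decision(domain_state_t s,
                                         supervisor_decision_t d)
{
    if (s != DOMAIN_FAULTED)
        return s;   /* decisions apply only to faulted domains */
    return d == SUP_RESTART ? DOMAIN_RUNNING : DOMAIN_DEAD;
}
```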

5.6 Server Domains

A server domain is a domain that provides a service to other domains. It is distinguished from a user domain only by convention and registration:

  • It registers one or more IPC endpoints in the KOMS namespace under a well-known path (e.g., "domains/block-io/ep.request").
  • It registers a supervisor domain (typically the system manager domain) that will restart it on fault.
  • It registers a heartbeat capability with the kernel watchdog.

The kernel has no built-in concept of "driver" or "system service." All server domains are equal in privilege. Their authority derives entirely from the capabilities they hold.

5.7 KOMS Domain Hierarchy

koms_root_ns
└── "domains"
    ├── "system-manager"    ← supervisor for all server domains
    │   ├── "ep.fault"      ← receives fault events
    │   └── threads/
    │       └── "main"
    ├── "block-io"
    │   ├── "ep.request"
    │   └── threads/
    │       └── "worker-0"
    ├── "filesystem"
    │   ├── "ep.request"
    │   └── threads/
    │       └── "worker-0"
    └── "user-shell"
        └── threads/
            └── "main"

6. Scheduler

6.1 Design Goals

ZXFoundation™ targets throughput/batch workloads: long-running server domains, high CPU utilization, and minimal context-switch overhead. The scheduler is not designed for sub-millisecond interactive latency. It is designed to keep all CPUs busy and to minimize the overhead of scheduling decisions on the hot path.

6.2 Priority Classes

The scheduler defines three priority classes, processed in strict order:

Class            Value         Quantum     Use case
SCHED_REALTIME   0 (highest)   1 ms        Watchdog thread, IPC notification threads
SCHED_BATCH      1             10 ms       Server domains, user processes
SCHED_IDLE       2 (lowest)    unbounded   Idle loop (runs only when no other work)

A SCHED_REALTIME thread always preempts a SCHED_BATCH or SCHED_IDLE thread. A SCHED_BATCH thread always preempts SCHED_IDLE. Within a class, scheduling is round-robin.

The 10 ms batch quantum is chosen to amortize context-switch overhead over a meaningful amount of work. Server domains that perform I/O will voluntarily yield (block on IPC receive) long before the quantum expires.

6.3 Per-CPU Run Queues

Each CPU maintains three run queues, one per priority class. Run queues are doubly-linked lists of sched_entity_t nodes embedded in thread objects.

Per-CPU scheduler state (one per CPU):

  ┌─────────────────────────────────────────────────────────┐
  │  CPU N                                                  │
  │                                                         │
  │  current_thread ──► [thread currently running]          │
  │                                                         │
  │  rq[SCHED_REALTIME]: [t_a] ↔ [t_b] ↔ ∅                │
  │  rq[SCHED_BATCH]:    [t_c] ↔ [t_d] ↔ [t_e] ↔ ∅        │
  │  rq[SCHED_IDLE]:     [idle_thread] ↔ ∅                 │
  │                                                         │
  │  rq_lock (spinlock, irqsave)                            │
  │  nr_running (total threads across all queues)           │
  └─────────────────────────────────────────────────────────┘

The rq_lock is a per-CPU spinlock. It is held only during run queue manipulation (enqueue, dequeue, pick_next). It is never held across a context switch.

6.4 Scheduling Decision

The scheduler is invoked from three points:

  1. CPU timer interrupt (quantum expiry).
  2. thread_block() — a thread voluntarily deschedules (e.g., IPC receive).
  3. thread_wake() — a thread is made runnable (e.g., IPC send wakes receiver).

schedule():
    acquire rq_lock (irqsave)
    next = pick_next_thread(current_cpu)
    if next == current_thread:
        release rq_lock
        return                  // no switch needed
    prev = current_thread
    current_thread = next
    release rq_lock
    context_switch(prev, next)  // saves prev, restores next, returns in next

pick_next_thread(cpu):
    for class in [SCHED_REALTIME, SCHED_BATCH, SCHED_IDLE]:
        if rq[class] not empty:
            thread = rq[class].head
            list_rotate(rq[class])   // round-robin: move head to tail
            return thread
    return idle_thread              // always non-null

6.5 Context Switch

A context switch saves the outgoing thread's full CPU state and restores the incoming thread's state. On z/Architecture this includes:

  • 16 × 64-bit general-purpose registers (GPRs 0–15)
  • 16 × 64-bit floating-point registers (FPRs 0–15)
  • Program Status Word (PSW: mask + instruction address)
  • 16 × 32-bit access registers (ARs 0–15)
  • CPU timer value (STPT / SPT)

The kernel stack pointer (GPR 15) is saved in the thread's saved_regs and restored on the next switch. The domain's ASCE is loaded into CR1 when switching between domains.

Context switch sequence:

  context_switch(prev, next):
      // Save prev state to prev.saved_regs
      STMG  R0,R15, prev.saved_regs.gprs
      STFPC prev.saved_regs.fpc
      STPT  prev.saved_regs.cpu_timer
      // Update time accounting
      prev.sys_timer += (STCK() - lowcore.sys_enter_timer)
      // Switch address space if domains differ
      if prev.domain != next.domain:
          LCTLG CR1, next.domain.space.asce
          // TLB is tagged by ASCE; no explicit flush needed on z/Arch
      // Record the incoming thread (for fault handler identification)
      lowcore.current_task = next
      lowcore.sys_enter_timer = STCK()
      // Restore next state; GPRs last, because LMG overwrites every register
      SPT   next.saved_regs.cpu_timer
      LFPC  next.saved_regs.fpc
      LMG   R0,R15, next.saved_regs.gprs
      // Return in next thread's context

6.6 Work Stealing

When a CPU's run queues are empty (only the idle thread is runnable), the CPU attempts to steal work from the busiest CPU.

Work stealing:

  idle_loop(cpu):
      while true:
          victim = find_busiest_cpu()   // scan per-CPU nr_running
          if victim == null or victim.nr_running <= 1:
              arch_cpu_relax()          // DIAG 0x44 (z/Arch yield hint)
              continue
          // Take both run-queue locks in ascending cpu_id order to avoid deadlock.
          if victim.cpu_id < cpu.cpu_id:
              first = victim; second = cpu
          else:
              first = cpu; second = victim
          acquire first.rq_lock (irqsave)
          acquire second.rq_lock
          steal_half(victim, cpu)
          release second.rq_lock
          release first.rq_lock
          break

Stealing moves half the victim's SCHED_BATCH threads to the idle CPU. SCHED_REALTIME threads are never stolen — they are pinned to their assigned CPU by the IPI mechanism.

6.7 CPU Affinity

A thread may be pinned to a subset of CPUs via its cpu_mask field. The scheduler respects affinity: pick_next_thread skips threads whose cpu_mask does not include the current CPU. Work stealing also respects affinity: a thread is only stolen if the stealing CPU is in the thread's cpu_mask.

Affinity is set at thread creation via a capability-gated syscall. The capability must grant CAP_WRITE on the thread object.
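
An affinity-aware variant of pick_next_thread might look like the following C sketch. It uses small fixed arrays instead of the kernel's linked lists, and the types thread_sketch_t and runqueue_sketch_t as well as the RQ_MAX bound are hypothetical. CPU numbers are assumed to be below 64 so that they fit the 64-bit cpu_mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SCHED_REALTIME = 0, SCHED_BATCH = 1, SCHED_IDLE = 2, SCHED_NR_CLASSES = 3 };
#define RQ_MAX 8   /* hypothetical fixed queue depth for the sketch */

typedef struct {
    uint64_t cpu_mask;   /* bit n set: thread may run on CPU n */
    int      id;
} thread_sketch_t;

typedef struct {
    thread_sketch_t *q[SCHED_NR_CLASSES][RQ_MAX];
    size_t           len[SCHED_NR_CLASSES];
} runqueue_sketch_t;

/* Pick the first eligible thread in strict class order and rotate it to
 * the tail of its queue (round-robin within the class). */
static thread_sketch_t *pick_next_thread(runqueue_sketch_t *rq, unsigned cpu)
{
    for (int cls = SCHED_REALTIME; cls < SCHED_NR_CLASSES; cls++) {
        size_t n = rq->len[cls];
        for (size_t i = 0; i < n; i++) {
            thread_sketch_t *t = rq->q[cls][i];
            if (!(t->cpu_mask & (UINT64_C(1) << cpu)))
                continue;                 /* affinity excludes this CPU */
            for (size_t j = i; j + 1 < n; j++)
                rq->q[cls][j] = rq->q[cls][j + 1];
            rq->q[cls][n - 1] = t;
            return t;
        }
    }
    return NULL;   /* caller falls back to the idle thread */
}
```

Skipped threads keep their position in the queue, so a thread that is ineligible on one CPU is not penalized when another CPU scans the same list.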


7. Time Subsystem

7.1 Hardware Time Sources

z/Architecture provides three hardware time mechanisms: one system-wide clock and two per-CPU facilities:

Source             Instruction     Type                Resolution            Use
TOD clock          STCK / STCKF    Global, monotonic   ~0.24 ns (2^-12 µs)   Wall time, ktime_get
CPU timer          SPT / STPT      Per-CPU countdown   Same as TOD           Scheduler preemption
Clock comparator   SCKC / STCKC    Per-CPU absolute    Same as TOD           Sleep / timeout

The TOD clock is a single hardware clock shared across all CPUs. It is monotonic and does not wrap in any practical timeframe (64-bit, ~143 years at full resolution). STCKF reads it without serialization — it is safe from any context including hard-IRQ.

7.2 Kernel Time (ktime_t)

ktime_t is a 64-bit nanosecond count since kernel boot. It is derived from the TOD clock with a boot-time offset computed during pmm_init.

TOD clock value (raw):
  bits 63:0 = TOD units (1 TOD unit = 2^-12 µs ≈ 0.244 ns)

ktime conversion:
  tod_delta = tod_raw - tod_boot_offset

  Exact:  1 TOD unit = 1000/4096 ns = 125/512 ns
          ktime_ns = tod_delta * 125 / 512

  Fast approximation:
          ktime_ns ≈ tod_delta >> 2   (treats 1 TOD unit as 0.25 ns; reads about 2.4% high)

ktime_get() reads STCKF and applies the conversion. It is callable from any context, holds no lock, and never sleeps.
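
A direct 64-bit evaluation of tod_delta * 125 / 512 overflows once tod_delta exceeds 2^64 / 125 TOD units, which corresponds to roughly 417 days of uptime. The C sketch below shows one overflow-safe way to compute the exact conversion by splitting the delta into whole groups of 512 units and a remainder. The function name tod_delta_to_ns is hypothetical; whether the kernel uses this technique is not stated in this guide.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a TOD-clock delta to nanoseconds (1 TOD unit = 125/512 ns).
 * tod_delta = q * 512 + r with 0 <= r < 512, so
 * floor(tod_delta * 125 / 512) = q * 125 + floor(r * 125 / 512),
 * and neither partial product can overflow 64 bits. */
static inline uint64_t tod_delta_to_ns(uint64_t tod_delta)
{
    uint64_t whole = (tod_delta >> 9) * 125;             /* q * 125 */
    uint64_t frac  = ((tod_delta & 0x1ff) * 125) >> 9;   /* floor(r * 125 / 512) */
    return whole + frac;
}
```

For example, 4096 TOD units (one microsecond) convert to exactly 1000 ns.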

7.3 CPU Timer and Scheduler Preemption

The CPU timer is a per-CPU countdown register. When its value becomes negative, a CPU timer interrupt is raised (external interrupt, code 0x1005). The kernel uses this to enforce scheduler quanta.

Quantum setup (on context switch to a new thread):
    quantum_tod = thread.priority == SCHED_REALTIME ? 1_ms_in_tod
                                                    : 10_ms_in_tod
    SPT quantum_tod      // load positive value; the timer counts down

CPU timer interrupt handler:
    // Fires once the CPU timer has counted down past zero (value is negative)
    sched_tick()         // account time, check if quantum expired
    if quantum_expired:
        schedule()       // pick next thread
    else:
        return           // spurious or early; reload timer
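
The constants 1_ms_in_tod and 10_ms_in_tod follow directly from the TOD unit (2^-12 µs, so 4096 units per microsecond). A small C sketch, with hypothetical helper names ms_to_tod and quantum_tod:

```c
#include <assert.h>
#include <stdint.h>

enum { SCHED_REALTIME = 0, SCHED_BATCH = 1, SCHED_IDLE = 2 };

#define TOD_UNITS_PER_US UINT64_C(4096)   /* 1 TOD unit = 2^-12 µs */

/* Milliseconds to TOD units: 1 ms = 1000 µs = 4,096,000 units. */
static inline uint64_t ms_to_tod(uint64_t ms)
{
    return ms * 1000 * TOD_UNITS_PER_US;
}

/* Quantum length in TOD units for a priority class, mirroring the
 * selection in the quantum setup above. */
static inline uint64_t quantum_tod(int sched_class)
{
    return sched_class == SCHED_REALTIME ? ms_to_tod(1) : ms_to_tod(10);
}
```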

7.4 Clock Comparator and Timer Wheel

The clock comparator fires an external interrupt when the TOD clock reaches a programmed absolute value. The kernel uses this for sleep and timeout operations.

The timer wheel is a per-CPU hierarchical structure with 8 levels and 64 slots per level. Each slot covers a time range; the slot width grows by a factor of 64 at each level.

Timer wheel (per CPU):

  Level 0: 64 slots × 1 ms  = 64 ms range   (fine-grained)
  Level 1: 64 slots × 64 ms = 4 s range
  Level 2: 64 slots × 4 s   = 256 s range
  ...
  Level 7: 64 slots × ...   = years range    (coarse)

  Each slot: list of timer_t objects expiring in that window

  On clock comparator interrupt:
      advance current slot pointer
      fire all timers in the current slot
      if level 0 wraps: cascade from level 1, etc.
      program clock comparator for next non-empty slot

Timer callbacks execute in softirq context — after the hard-IRQ handler returns, before returning to user space. They must not block, must not acquire spinlocks held by hard-IRQ handlers, and must complete in bounded time.
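
Placing a timer into the wheel amounts to choosing a level from the remaining delay and a slot from the absolute expiry time. The C sketch below assumes a 1 ms tick and the 8 × 64 geometry described above; the function names wheel_level and wheel_slot are hypothetical, and the kernel's actual placement rules may differ.

```c
#include <assert.h>
#include <stdint.h>

#define WHEEL_LEVELS     8
#define WHEEL_SLOT_BITS  6                      /* 64 slots per level */
#define WHEEL_SLOTS      (1u << WHEEL_SLOT_BITS)

/* Lowest level whose range covers a delay of delta_ms milliseconds:
 * level 0 for 0..63 ms, level 1 for 64..4095 ms, and so on. Delays beyond
 * the top level's range are clamped to the top level. */
static unsigned wheel_level(uint64_t delta_ms)
{
    unsigned level = 0;
    while (level < WHEEL_LEVELS - 1 && (delta_ms >> WHEEL_SLOT_BITS) != 0) {
        delta_ms >>= WHEEL_SLOT_BITS;
        level++;
    }
    return level;
}

/* Slot index within a level for an absolute expiry time in milliseconds. */
static unsigned wheel_slot(uint64_t expiry_ms, unsigned level)
{
    return (unsigned)((expiry_ms >> (level * WHEEL_SLOT_BITS)) &
                      (WHEEL_SLOTS - 1));
}
```

A timer due in 100 ms lands on level 1, whose slots are 64 ms wide; it is cascaded down to level 0 as its expiry approaches.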

7.5 Time Accounting

Per-thread time accounting uses the lowcore timing fields:

Kernel entry (SVC, PGM, EXT, IO):
    lowcore.sys_enter_timer = STCK()

Kernel exit (return to user space):
    elapsed = STCK() - lowcore.sys_enter_timer
    current_thread.sys_timer += elapsed
    lowcore.exit_timer = STCK()

User time (updated on kernel entry):
    user_elapsed = lowcore.sys_enter_timer - lowcore.exit_timer
    current_thread.user_timer += user_elapsed

7.6 Time Strict Requirements

#        Requirement
TIME-1   ktime_get() must be callable from any context including hard-IRQ. It reads STCKF directly: no lock, no sleep.
TIME-2   Timer callbacks execute in softirq context. They must not block or acquire locks held by hard-IRQ handlers.
TIME-3   The CPU timer must be reloaded on every context switch. A thread must never run beyond its quantum without a timer interrupt.
TIME-4   The clock comparator must be reprogrammed after every timer wheel advance to the next non-empty slot.
TIME-5   tod_boot_offset is computed once during pmm_init and never modified.

8. Trap and System Call Architecture

8.1 Interrupt Classes

z/Architecture defines six hardware interrupt classes. Each has a dedicated new PSW slot in the lowcore and a dedicated entry point in the kernel.

Class      Lowcore offset   Trigger                                                 Kernel handler
RESTART    0x01A0           SIGP RESTART (AP bringup)                               restart_handler
EXTERNAL   0x01B0           CPU timer, clock comparator, SIGP, service call         ext_handler
SVC        0x01C0           SVC n instruction (system call)                         svc_handler
PROGRAM    0x01D0           Page fault, protection exception, illegal instruction   pgm_handler
MCCK       0x01E0           Machine check (hardware error)                          mcck_handler
IO         0x01F0           Channel subsystem I/O completion                        io_handler

8.2 Entry Path

All interrupt classes share the same entry structure:

Hardware interrupt fires:
    1. Hardware saves old PSW to lowcore (e.g., svc_old_psw at 0x0140).
    2. Hardware saves interrupt parameters to lowcore
       (e.g., svc_code at 0x008A for SVC).
    3. Hardware loads new PSW from lowcore (e.g., svc_new_psw at 0x01C0).
    4. Execution begins at the kernel entry stub.

Kernel entry stub (assembly):
    STMG  R0,R15, lowcore.save_area_sync   // save all GPRs
    // Build irq_frame_t on kernel stack:
    //   gprs[16], psw (from lowcore old PSW), ilc, code
    LG    R15, lowcore.kernel_stack        // switch to kernel stack
    BRASL R14, <C handler>                 // call C dispatcher
    // On return: restore GPRs, LPSWE to return PSW
    LMG   R0,R15, frame.gprs
    LPSWE frame.psw

The irq_frame_t on the kernel stack is the canonical representation of the interrupted context. It is used by the fault handler, the debugger, and the context switch path.

8.3 SVC — System Call Dispatch

ZXFoundation™ defines its own system call table. There is no POSIX compatibility layer. The SVC number is the low byte of lowcore.svc_code (a 16-bit field). Arguments are passed in GPRs 2–7, which extends the s390x ELF ABI argument registers (GPRs 2–6) by one. The return value is placed in GPR 2.

Every system call that operates on a kernel object takes a capability handle as its first argument (GPR 2). The kernel validates the capability before performing any operation. An invalid or insufficient capability returns ERR_CAP_INVALID immediately.

SVC dispatch:

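  svc_handler(frame):
      svc_nr = lowcore.svc_code & 0xFF
      if svc_nr >= ZX_SYSCALL_MAX:
          return ERR_INVALID_SYSCALL
      // Calls that take no capability (e.g., zx_time_get, zx_yield) skip the lookup.
      if svc_nr takes no capability argument:
          return syscall_table[svc_nr](null, 0, frame)
      cap_handle = frame.gprs[2]
      object, rights = cap_lookup(current_domain, cap_handle)
      if object == null:
          return ERR_CAP_INVALID
      return syscall_table[svc_nr](object, rights, frame)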

ZXFoundation™ v1 system call surface (32 numbers: 23 defined, 9 reserved):

Number   Name                    Capability type          Description
0        zx_cap_derive           any                      Derive a capability with reduced rights
1        zx_cap_transfer         any + CAP_GRANT          Transfer a capability via IPC message
2        zx_cap_revoke           any + CAP_REVOKE         Revoke all derived capabilities
3        zx_domain_create        domain factory           Create a new domain
4        zx_domain_kill          domain + CAP_DESTROY     Kill a domain
5        zx_domain_restart       domain + CAP_WRITE       Restart a faulted domain
6        zx_thread_create        domain + CAP_WRITE       Create a thread in a domain
7        zx_thread_start         thread + CAP_EXEC        Start a thread at a given address
8        zx_thread_exit          -                        Terminate the calling thread
9        zx_ipc_call             endpoint + CAP_EXEC      Synchronous IPC call
10       zx_ipc_recv             endpoint + CAP_EXEC      Block waiting for a message
11       zx_ipc_reply            -                        Reply to a synchronous call
12       zx_ipc_send             endpoint + CAP_EXEC      Async send (non-blocking)
13       zx_mem_map              VMA + CAP_MAP            Map a VMA into the calling domain
14       zx_mem_unmap            VMA + CAP_WRITE          Unmap a VMA
15       zx_mem_alloc            domain + CAP_WRITE       Allocate anonymous memory
16       zx_endpoint_create      domain + CAP_WRITE       Create an IPC endpoint
17       zx_endpoint_destroy     endpoint + CAP_DESTROY   Destroy an endpoint
18       zx_time_get             -                        Read ktime_t (no capability needed)
19       zx_sleep                -                        Sleep for a duration
20       zx_yield                -                        Voluntarily yield the CPU
21       zx_watchdog_register    domain + CAP_WRITE       Register a heartbeat capability
22       zx_watchdog_heartbeat   watchdog cap             Signal liveness to the watchdog
23–31    reserved                -                        Future use
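
In C, a dispatch table of this shape is commonly built from an array of function pointers with designated initializers, so that reserved numbers stay null and are rejected. The sketch below is illustrative only: the handler signature, the stub handlers, and the numeric error value are assumptions, and capability validation is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZX_SYSCALL_MAX       32
#define ERR_INVALID_SYSCALL  (-1L)   /* stand-in numeric value */

/* Hypothetical handler signature: two raw arguments, one result. */
typedef long (*syscall_fn_t)(uint64_t a0, uint64_t a1);

/* Stub handlers standing in for real implementations. */
static long sys_time_get(uint64_t a0, uint64_t a1) { (void)a0; (void)a1; return 12345; }
static long sys_yield(uint64_t a0, uint64_t a1)    { (void)a0; (void)a1; return 0; }

/* Entries not listed stay NULL, which covers the reserved range. */
static const syscall_fn_t syscall_table[ZX_SYSCALL_MAX] = {
    [18] = sys_time_get,
    [20] = sys_yield,
};

/* Mask the 16-bit SVC code to its low byte, bounds-check, then dispatch. */
static long svc_dispatch(uint16_t svc_code, uint64_t a0, uint64_t a1)
{
    unsigned nr = svc_code & 0xFF;
    if (nr >= ZX_SYSCALL_MAX || syscall_table[nr] == NULL)
        return ERR_INVALID_SYSCALL;
    return syscall_table[nr](a0, a1);
}
```

Checking for a null entry in addition to the upper bound means a call to a reserved number fails cleanly instead of jumping through a null pointer.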

8.4 PGM — Program Check Handler

The program check handler dispatches on lowcore.pgm_code:

pgm_handler(frame):
    code = lowcore.pgm_code
    addr = lowcore.trans_exc_code   // faulting virtual address (if applicable)

    switch code:
        case PGM_TRANSLATION_EXCEPTION:   // page fault
            vma = vmm_find_vma(current_domain.space, addr)
            if vma == null:
                goto domain_fault         // no mapping → domain fault
            page = pmm_alloc_page(ZX_GFP_NORMAL)
            if page == null:
                goto domain_fault         // OOM → domain fault
            mmu_map_page(current_domain.space, addr, page, vma.vm_prot)
            return                        // retry the faulting instruction

        case PGM_PROTECTION_EXCEPTION:    // write to read-only page, or key mismatch
            goto domain_fault

        case PGM_PRIVILEGED_OPERATION:    // user tried a privileged instruction
            goto domain_fault

        case PGM_SPECIFICATION_EXCEPTION: // alignment or format error
            goto domain_fault

        default:
            goto domain_fault

domain_fault:
    domain_suspend(current_domain)
    deliver_fault_event(current_domain, code, addr)
    schedule()                            // switch to another thread

A program check in kernel context (PSW problem-state bit = 0 at the time of the fault) is always a kernel panic. The kernel must not generate translation exceptions or protection exceptions in its own address space.
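
The kernel-versus-user distinction can be read from the old PSW saved by the hardware. In the z/Architecture PSW, bit 15 (counting from the most significant bit as bit 0) is the problem-state bit. A minimal C sketch, with hypothetical names pgm_origin_t and pgm_classify_origin:

```c
#include <assert.h>
#include <stdint.h>

/* Problem-state bit: PSW bit 15, with bit 0 as the most significant bit
 * of the first 64-bit PSW word. */
#define PSW_MASK_PSTATE UINT64_C(0x0001000000000000)

typedef enum { PGM_FROM_KERNEL, PGM_FROM_USER } pgm_origin_t;

/* Classify a program check by the mode the CPU was in when it occurred.
 * PGM_FROM_KERNEL corresponds to the panic case described above. */
static inline pgm_origin_t pgm_classify_origin(uint64_t old_psw_mask)
{
    return (old_psw_mask & PSW_MASK_PSTATE) ? PGM_FROM_USER : PGM_FROM_KERNEL;
}
```

A handler would typically perform this check first, before looking at the program interruption code.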

8.5 EXT — External Interrupt Handler

ext_handler(frame):
    code = lowcore.ext_int_code

    switch code:
        case EXT_CPU_TIMER (0x1005):
            sched_tick()
            if quantum_expired: schedule()

        case EXT_CLOCK_COMPARATOR (0x1004):
            timer_wheel_advance(current_cpu)
            program_clock_comparator(next_expiry)

        case EXT_SERVICE_CALL (0x2401):
            sclp_service_call_handler()   // SCLP response (console, hardware info)

        case EXT_SIGP_EMERGENCY (0x1201):
            ipi_handler()                 // cross-CPU IPI (TLB shootdown, CPU offline)

        default:
            // Unknown external interrupt: log and ignore.

8.6 IO — Channel Subsystem Interrupt Handler

io_handler(frame):
    schid.sch_no = lowcore.subchannel_nr
    schid.ssid   = lowcore.subchannel_id >> 16

    // Read the Interrupt Response Block (IRB) via TSCH.
    TSCH schid, irb

    // Look up the IRQ descriptor for this subchannel.
    desc = irq_lookup_by_schid(schid)
    if desc == null:
        return                  // spurious; no handler registered

    // Dispatch to the registered handler.
    // The handler is typically the block-I/O server domain's IPC endpoint.
    desc.handler(desc, &irb)

The I/O handler is intentionally minimal. It reads the IRB and dispatches to a registered handler. The handler is responsible for notifying the appropriate server domain via IPC. The kernel does not interpret I/O completion data.


9. Machine-Check Recovery and Watchdog

9.1 Machine-Check Classification

When a machine-check interrupt fires, lowcore.mcck_interruption_code identifies the error. The kernel treats each error class as either recoverable or unrecoverable:

Error class                  Recoverable?  Action
---------------------------  ------------  ----------------------------------------
Storage error (corrected)    Yes           Log; mark page suspect; continue
Storage error (uncorrected)  No            Offline affected frames; migrate domains
CPU malfunction              No            Offline CPU; migrate its domains
Timing facility error        Yes           Re-sync TOD; log
External damage              No            Kernel panic (hardware integrity lost)

9.2 Machine-Check Recovery Flow

mcck_handler(frame):
    code = lowcore.mcck_interruption_code

    if code & MCCK_SD:              // system damage — unrecoverable
        goto kernel_panic

    if code & MCCK_ST:              // storage error
        addr = lowcore.failing_storage_address
        page = phys_to_page(addr)
        if code & MCCK_ST_CORRECTED:
            pmm_mark_suspect(page)  // log; keep in service
        else:
            pmm_offline_page(page)  // remove from buddy; migrate domains
            domain_migrate_from_page(page)

    if code & MCCK_CPU:             // CPU malfunction
        cpu_offline(current_cpu)    // SIGP STOP self after migration
        domain_migrate_all(current_cpu)
        SIGP STOP, current_cpu_addr

    // Recoverable: return to interrupted context.
    LPSWE frame.psw

9.3 CPU Offline and Domain Migration

When a CPU is taken offline (due to MCCK or operator request):

cpu_offline(cpu):
    // 1. Stop accepting new work.
    cpu.state = CPU_OFFLINE_PENDING
    // 2. Drain the run queue to other CPUs.
    acquire cpu.rq_lock
    for each thread in cpu.rq[SCHED_BATCH]:
        target = find_least_loaded_cpu(thread.cpu_mask)
        enqueue(target.rq[SCHED_BATCH], thread)
    release cpu.rq_lock
    // 3. Notify domains whose threads were migrated.
    for each migrated_thread:
        koms_event_fire(migrated_thread.domain, KOBJ_EVENT_DOMAIN_MIGRATE)
    // 4. Stop the CPU.
    cpu.state = CPU_OFFLINE
    SIGP STOP, cpu.cpu_addr

9.4 Domain Watchdog

The kernel maintains a single watchdog thread at SCHED_REALTIME priority. Each server domain that registers with the watchdog receives a heartbeat capability. The domain must call zx_watchdog_heartbeat within a configured interval (default: 5 seconds).

Watchdog state machine (per registered domain):

  WATCHDOG_OK ──── heartbeat received ────► WATCHDOG_OK
       │
       │ interval elapsed without heartbeat
       ▼
  WATCHDOG_WARN ──── heartbeat received ──► WATCHDOG_OK
       │
       │ second interval elapsed
       ▼
  WATCHDOG_FAULT
       │
       ▼
  domain_fault(domain)   // triggers fault containment flow (Section 5.5)

The watchdog thread runs on a dedicated CPU (CPU 0 by convention) and is never migrated. It is the only SCHED_REALTIME thread that the kernel creates at boot time.
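
The state machine above can be sketched as a pure transition function evaluated once per interval for each registered domain. This is a minimal sketch: the function name and the `heartbeat_seen` parameter are hypothetical, and treating WATCHDOG_FAULT as terminal until the domain is restarted is an assumption, not something the design states.

```c
#include <assert.h>
#include <stdbool.h>

/* State names follow the diagram above; everything else is hypothetical. */
typedef enum {
    WATCHDOG_OK,
    WATCHDOG_WARN,
    WATCHDOG_FAULT,
} watchdog_state_t;

/* Advance one domain's state at the end of a heartbeat interval.
 * heartbeat_seen: true if the domain called zx_watchdog_heartbeat
 * at least once during the interval that just ended. */
static inline watchdog_state_t watchdog_step(watchdog_state_t s,
                                             bool heartbeat_seen)
{
    if (s == WATCHDOG_FAULT)
        return WATCHDOG_FAULT;   /* assumed terminal until restart */
    if (heartbeat_seen)
        return WATCHDOG_OK;      /* OK stays OK; WARN recovers to OK */
    /* First missed interval warns, second one faults. */
    return (s == WATCHDOG_OK) ? WATCHDOG_WARN : WATCHDOG_FAULT;
}
```

A caller would invoke domain_fault(domain) when the returned state is WATCHDOG_FAULT.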

9.5 Kernel Self-Check (syschk)

The existing zx_system_check() infrastructure is extended with severity levels:

Severity                Action
----------------------  ---------------------------------------------
ZX_SYSCHK_WARNING       Log to kernel ring buffer; continue
ZX_SYSCHK_DEGRADED      Disable the affected subsystem; log; continue
ZX_SYSCHK_CORE_CORRUPT  Disabled-wait PSW (kernel panic)

ZX_SYSCHK_CORE_CORRUPT is reserved for conditions where kernel data structures are known to be corrupted and continued execution would cause silent data loss or security violations. All other conditions should use WARNING or DEGRADED to maximize availability.

9.6 Storage Key Protection

Each domain is assigned a non-zero s390x storage key at creation time. All pages mapped into the domain's address space are assigned that key. The domain's PSW access key field is set to match.

A domain that attempts to access a page with a mismatched storage key receives a protection exception (PGM code 0x04). This is handled as a domain fault (Section 8.4) — the domain is suspended, not the kernel.

This provides a hardware-enforced memory isolation layer that operates independently of DAT. Even if a bug in the kernel's page table management accidentally maps a page from domain A into domain B's address space, the storage key check prevents domain B from writing it, and also from reading it when the page's storage key has the fetch-protection bit set.
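
The key check itself can be modeled in a few lines. The sketch below is a simplified host-side model of key-controlled protection, not kernel code: it ignores storage-protection override, fetch-protection override, and low-address protection, and the helper names are hypothetical. The storage key byte layout (access-control bits in the high nibble, fetch-protection bit at 0x08) follows the z/Architecture definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Storage key byte: ACC in the high four bits, then the F bit. */
#define SKEY_ACC(k)  ((uint8_t)((k) >> 4))
#define SKEY_FETCH   0x08u

/* Returns true if an access with PSW key psw_key to a page whose
 * storage key byte is storage_key is permitted. A false result is
 * where the hardware would raise a protection exception (PGM 0x04). */
static inline bool skey_access_allowed(uint8_t psw_key, uint8_t storage_key,
                                       bool is_write)
{
    if (psw_key == 0 || psw_key == SKEY_ACC(storage_key))
        return true;     /* key 0 and matching keys always pass */
    if (!is_write && !(storage_key & SKEY_FETCH))
        return true;     /* mismatched fetch passes unless F is set */
    return false;
}
```

The second branch is why domain pages need the fetch-protection bit set if the key is meant to block reads as well as writes.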


10. Long-Term Implementation Roadmap

10.1 Overview

The roadmap is organized into seven phases. Each phase has a clear prerequisite, a defined deliverable, and a set of subsystems it unlocks. Phases are sequential within a dependency chain but may overlap where dependencies permit.

Phase dependency graph:

  [Phase 1: TCB Hardening]
          │
          ▼
  [Phase 2: Capability Foundation]
          │
          ▼
  [Phase 3: Domain and IPC]
          │
     ┌────┴────┐
     ▼         ▼
[Phase 4:  [Phase 6:
 Server     Memory
 Domain     Completion]
 Infra]
     │
     ▼
[Phase 5: First Server Domains]
     │
     ▼
[Phase 7: Hardening and Observability]

10.2 Phase 1 — TCB Hardening

Prerequisite: Current state (PMM, VMM, slab, KOMS, IRQ, SMP, sync all functional).

Deliverables:

  1. Trap/entry completion: Full irq_frame_t save/restore for all six interrupt classes. SVC, PGM, EXT, IO, MCCK, RESTART handlers dispatch to C. Return path restores full CPU state via LPSWE.

  2. Time subsystem: TOD clock read (STCKF), ktime_t type and ktime_get(). CPU timer setup and quantum enforcement. Clock comparator setup. Timer wheel (8 levels, 64 slots). ktime_sleep().

  3. Scheduler — BATCH class: Per-CPU run queues. schedule(), thread_block(), thread_wake(). Context switch (GPR/FPR/PSW save/restore). CPU timer interrupt → sched_tick(). Work stealing. Idle thread per CPU.

Unlocks: Phase 2 (capability system requires a running scheduler to test domain creation).

10.3 Phase 2 — Capability Foundation

Prerequisite: Phase 1 complete.

Deliverables:

  1. Capability token: 64-bit structure, type/rights/gen/index fields. cap_mint, cap_derive, cap_revoke, cap_lookup.

  2. Capability table: Slab cache with storage key 1. Per-domain flat array. cap_table_alloc, cap_table_free. PF_PINNED pages.

  3. KOMS extension: kobject_t gains cap_gen (generation counter) and global_index (object table index). Global object table (flat array, spinlock-protected). koms_init_obj registers in table. koms_put at zero increments cap_gen before freeing.

  4. Syscalls 0–2: zx_cap_derive, zx_cap_transfer, zx_cap_revoke. SVC dispatch table. Capability validation on every syscall entry.

Unlocks: Phase 3 (domain creation requires capability tables).
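
As an illustration of deliverables 1 and 3, a 64-bit token with type/rights/gen/index fields might be packed as shown below. The field widths (8/16/16/24 bits) and all helper names are assumptions made for this sketch; the design only fixes the total size and the field names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: type[63:56] rights[55:40] gen[39:24] index[23:0]. */
typedef uint64_t cap_token_t;

static inline cap_token_t cap_pack(uint8_t type, uint16_t rights,
                                   uint16_t gen, uint32_t index)
{
    assert(index < (1u << 24));  /* index must fit the assumed 24-bit field */
    return ((uint64_t)type << 56) | ((uint64_t)rights << 40) |
           ((uint64_t)gen << 24) | (uint64_t)index;
}

static inline uint8_t  cap_type(cap_token_t c)   { return (uint8_t)(c >> 56); }
static inline uint16_t cap_rights(cap_token_t c) { return (uint16_t)(c >> 40); }
static inline uint16_t cap_gen(cap_token_t c)    { return (uint16_t)(c >> 24); }
static inline uint32_t cap_index(cap_token_t c)  { return (uint32_t)(c & 0xFFFFFFu); }

/* A token goes stale once the object's generation counter moves on,
 * which is the purpose of bumping cap_gen before the object is freed. */
static inline bool cap_is_current(cap_token_t c, uint16_t object_gen)
{
    return cap_gen(c) == object_gen;
}
```

With this shape, revoking every outstanding token for an object is a single counter increment; no table walk is needed to invalidate copies held by other domains.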

10.4 Phase 3 — Domain and IPC

Prerequisite: Phase 2 complete.

Deliverables:

  1. Domain object: domain_t kobject type. vm_space_t creation per domain. Capability table allocation at domain birth. Domain lifecycle state machine. domain_create, domain_kill.

  2. Thread object: thread_t kobject type. Kernel stack allocation. thread_create, thread_start, thread_exit. Integration with scheduler (enqueue on thread_start).

  3. SVC entry — capability validation: Every syscall validates its capability argument before proceeding. ERR_CAP_INVALID on failure.

  4. IPC sync fastpath: zx_ipc_call, zx_ipc_recv, zx_ipc_reply. Direct thread switch. Register-passing (GPRs 2–9). Fastpath conditions enforced.

  5. IPC async queue: Ring buffer slab allocation. zx_ipc_send. Enqueue/dequeue. Receiver wake on enqueue.

  6. Syscalls 3–17: Full domain, thread, memory, and endpoint syscalls.

Unlocks: Phase 4 and Phase 6 (both depend on working domains and IPC).

10.5 Phase 4 — Server Domain Infrastructure

Prerequisite: Phase 3 complete.

Deliverables:

  1. Fault containment: domain_suspend, deliver_fault_event. Fault event IPC to supervisor domain. domain_restart, domain_kill from supervisor.

  2. Domain watchdog: Watchdog thread at SCHED_REALTIME. Heartbeat capability. zx_watchdog_register, zx_watchdog_heartbeat. Two-strike fault trigger.

  3. MCCK recovery: Storage error classification. pmm_offline_page. CPU offline and domain migration. KOBJ_EVENT_DOMAIN_MIGRATE.

  4. Storage key assignment: Per-domain key allocation. Page key assignment on vmm_insert_vma. PSW access key set on context switch.

  5. System manager domain: The first server domain, started by the kernel at boot. Receives fault events for all other server domains. Implements restart policy.

Unlocks: Phase 5 (server domains require fault containment to be safe).

10.6 Phase 5 — First Server Domains

Prerequisite: Phase 4 complete.

Deliverables:

  1. Console server: Wraps DIAG 0x08 / SCLP. Exposes ep.write endpoint. Accepts zx_ipc_send with a string payload. Replaces printk for user-visible output.

  2. Channel I/O server: Wraps CSS interrupt dispatch. Accepts subchannel registration from other domains. Exposes ep.request for I/O submission. Returns I/O completion via IPC reply.

  3. Block I/O server: Built on channel I/O server. Implements ECKD (DASD) read/write. Exposes ep.request with a block I/O protocol.

  4. Filesystem server (minimal): Built on block I/O server. Implements a read-only flat filesystem (sufficient to load user programs). Exposes ep.open, ep.read.

Unlocks: Phase 7 (hardening requires a running system to test against).

10.7 Phase 6 — Memory Management Completion

Prerequisite: Phase 3 complete (can proceed in parallel with Phase 4/5).

Deliverables:

  1. Demand paging: PGM translation exception → vmm_find_vma → pmm_alloc_page → mmu_map_page → retry. Anonymous and file-backed VMAs.

  2. Copy-on-write: VM_COW flag on shared VMAs. Write protection fault → page copy → remap. Used for domain cloning (fork-like semantics).

  3. Page reclaim: LRU list per zone. Reclaim under memory pressure (triggered when ZONE_NORMAL.free_pages < LOW_WATERMARK). Reclaim selects cold anonymous pages; writes dirty pages to swap device.

  4. Swap: Capability-gated swap device via channel I/O server. Swap page table entries. pmm_swap_out, pmm_swap_in.

Unlocks: Phase 7 (full memory management required for production use).

10.8 Phase 7 — Hardening and Observability

Prerequisite: Phases 4, 5, and 6 complete.

Deliverables:

  1. KOMS attribute bus: Expose domain/thread/memory statistics as KOMS attributes. Readable via zx_attr_get syscall with a capability.

  2. Kernel ring buffer: Fixed-size circular log buffer. Capability-gated read via ep.klog endpoint. Replaces printk for kernel diagnostics.

  3. Capability audit log: Every cap_mint, cap_derive, cap_revoke, and cap_transfer is logged to a dedicated ring buffer. Readable by the system manager domain.

  4. Syscall fuzz harness: Host-side tool that generates random syscall sequences and validates that the kernel never panics (only returns error codes) on invalid inputs.

  5. SMP stress test: Multi-domain IPC stress test exercising the fastpath, work stealing, and domain fault/restart under load.

10.9 Milestone Summary

Phase  Key Deliverable                    Unlocks
-----  ---------------------------------  ---------------------------------
1      Trap, time, scheduler              Capability system
2      Capability tokens and tables       Domain creation
3      Domains, threads, IPC              Server domains, memory completion
4      Fault containment, watchdog, MCCK  First server domains
5      Console, block I/O, filesystem     Full system
6      Demand paging, CoW, reclaim, swap  Production memory management
7      Observability, audit, hardening    Production readiness

End of ZXF-KRN-DESIGN-001 Rev 26h1.0

Kernel Overview

Document Revision: 26h1.0


1. Entry Contract

The kernel receives control from ZXFL with the following guaranteed state:

Resource        State
--------------  --------------------------------------------------------------
DAT             On; CR1 holds the ASCE built by the loader
Interrupts      Masked; all interrupt classes disabled
%r2             HHDM virtual address of zxfl_boot_protocol_t
%r15            HHDM virtual address of initial stack top (32 KB loader stack)
All other GPRs  Undefined

The kernel entry point is zxfoundation_global_initialize(zxfl_boot_protocol_t *boot). The first action must be to validate boot->magic == ZXFL_MAGIC. Any other use of the protocol before this check is undefined behavior.
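
A minimal sketch of that first check follows. The struct here is a two-field stand-in, not the real zxfl_boot_protocol_t, and the magic and version values are placeholders chosen for illustration; the real definitions live in the ZXFL headers.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values, NOT the real ZXFL constants. */
#define EXAMPLE_ZXFL_MAGIC   0x1122334455667788ULL
#define EXAMPLE_ZXFL_VERSION 1u

/* Stand-in for zxfl_boot_protocol_t; the real layout is larger. */
typedef struct {
    uint64_t magic;    /* assumed to be the first field */
    uint32_t version;  /* stand-in for the ZXFL_VERSION_* field */
} example_boot_protocol_t;

typedef enum { BOOT_OK, BOOT_BAD_MAGIC, BOOT_BAD_VERSION } boot_check_t;

static inline boot_check_t boot_protocol_check(const example_boot_protocol_t *boot)
{
    /* Magic first: no other field is meaningful until it matches. */
    if (boot->magic != EXAMPLE_ZXFL_MAGIC)
        return BOOT_BAD_MAGIC;
    /* Then refuse protocol versions this kernel does not understand. */
    if (boot->version != EXAMPLE_ZXFL_VERSION)
        return BOOT_BAD_VERSION;
    return BOOT_OK;
}
```

In the kernel, any result other than BOOT_OK would lead to panic() rather than a return value.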


2. Subsystem Table

Subsystem   Source location                Status
----------  -----------------------------  ------
Early init  zxfoundation/init/             Active
PMM         zxfoundation/memory/pmm.c      Active
VMM         zxfoundation/memory/vmm.c      Active
Slab        zxfoundation/memory/slab.c     Active
kmalloc     zxfoundation/memory/kmalloc.c  Active
Heap        zxfoundation/memory/heap.c     Active
MMU         arch/s390x/mmu/mmu.c           Active
Per-CPU     arch/s390x/cpu/percpu.c        Active
qspinlock   arch/s390x/cpu/qspinlock.c     Active
Mutex       zxfoundation/sync/mutex.c      Active
RW Lock     zxfoundation/sync/rwlock.c     Active
Semaphore   zxfoundation/sync/semaphore.c  Active
Wait queue  zxfoundation/sync/waitqueue.c  Active
RCU         zxfoundation/sync/rcu.c        Active
SRCU        zxfoundation/sync/srcu.c       Active
kobject     zxfoundation/object/kobject.c  Active
printk      zxfoundation/sys/printk.c      Active
panic       zxfoundation/sys/panic.c       Active
Trap        arch/s390x/trap/               Active
SMP         arch/s390x/cpu/smp.c           Active
Scheduler   zxfoundation/sched/            Active
IRQ         arch/s390x/irq/                Stub
Time        arch/s390x/time/               Stub

Early Initialization

Document Revision: 26h1.0
Source: zxfoundation/init/main.c


1. Initialization Sequence

zxfoundation_global_initialize performs early initialization in strict order before enabling interrupts or starting APs:

Step  Action                              Notes
----  ----------------------------------  ------------------------------------------------------------
1     zxfl_lowcore_setup()                Install kernel new PSWs in the BSP lowcore
2     diag_setup() + printk_initialize()  Enable console output
3     Validate boot->magic == ZXFL_MAGIC  Panic if wrong
4     Validate boot->binding_token        Recompute and compare; panic on mismatch
5     validate_stack_frame()              Verify ZXVL stack canaries
6     verify_kernel_checksums()           Re-verify SHA-256 segment digests from HHDM
7     Print machine/LPAR/CPU info         If ZXFL_FLAG_SYSINFO / ZXFL_FLAG_SMP set
8     percpu_init_bsp()                   Initialize BSP per-CPU block at prefix+0x400 (LC_PERCPU_OFFSET)
9     arch_cpu_features_init(boot)        Detect STFLE facilities, populate feature flags
10    rcu_init()                          Initialize RCU subsystem
11    pmm_init(boot)                      Register usable memory regions; reserve loader/kernel/pool
12    mmu_init()                          Install 8 KB VA-0 lowcore window; scrub identity map; inherit EDAT-1/2 state. Order is mandatory; see §4.
13    vmm_init()                          Set up vmalloc region
14    slab_init()                         Initialize slab caches
15    kmalloc_init()                      Initialize kmalloc size classes
16    trap_init()                         Install program-check new PSW; enable trap handler
17    smp_init()                          Start all APs (SIGP sequence); each AP calls trap_init()
18    sched_init()                        BSP becomes idle (PID 0); spawns kernel_init (PID 1)

2. Security Checks (Steps 3–6)

These checks run before any memory, trap, or scheduling subsystem is initialized; only the lowcore and console setup (steps 1–2) precede them. A failure at any point calls panic(), which loads a disabled-wait PSW.

Binding token (step 4): The kernel recomputes ZXVL_COMPUTE_TOKEN(stfle_fac[0], ipl_schid) and compares it to boot->binding_token. This ties the running kernel to the specific hardware and IPL device — a protocol struct copied from another machine will fail here.

Stack frame (step 5): The loader writes a two-word canary at boot->kernel_stack_top. The kernel verifies frame[0] == ZXVL_FRAME_MAGIC_A and frame[1] == ZXVL_FRAME_MAGIC_B ^ binding_token. A mismatch indicates stack corruption or an unauthorized loader.

Checksum re-verification (step 6): The kernel re-reads the zxvl_checksum_table_t from kernel_phys_start + ZXVL_CKSUM_TABLE_OFFSET (via HHDM) and recomputes SHA-256 for each PT_LOAD segment. This catches any modification to the kernel image between loader verification and kernel execution.
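
The stack-frame check in step 5 reduces to two comparisons. The sketch below assumes 64-bit canary words and uses placeholder magic values; the real ZXVL_FRAME_MAGIC_A and ZXVL_FRAME_MAGIC_B constants are defined by the loader and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values, NOT the real ZXVL constants. */
#define EX_FRAME_MAGIC_A 0xA5A5A5A5A5A5A5A5ULL
#define EX_FRAME_MAGIC_B 0x0123456789ABCDEFULL

/* frame points at the two canary words the loader wrote at
 * boot->kernel_stack_top; binding_token is the value recomputed in step 4. */
static inline bool stack_frame_valid(const uint64_t frame[2],
                                     uint64_t binding_token)
{
    /* Word 0 is a fixed magic; word 1 is bound to this machine and IPL
     * device by XOR-ing a second magic with the binding token. */
    return frame[0] == EX_FRAME_MAGIC_A &&
           frame[1] == (EX_FRAME_MAGIC_B ^ binding_token);
}
```

Because the second word depends on the binding token, a canary copied from a different machine or IPL device fails even if the first word is intact.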


3. PMM Reservation (Step 11)

pmm_init registers all ZXFL_MEM_USABLE regions from the boot protocol memory map, then marks the following ranges as reserved:

Range                                          Reason
---------------------------------------------  --------------------------
[0, 1 MB)                                      Lowcore + loader code
[kernel_phys_start, kernel_phys_end)           Kernel image
[pool_base, pgtbl_pool_end)                    Bootloader page table pool
Each module's [phys_start, phys_start + size)  Loaded modules

4. MMU Initialization Ordering Invariant (Step 12)

mmu_init() takes ownership of the bootloader ASCE and replaces the bootloader's 8 GB identity map with a precise 8 KB window at VA 0. This operation has a strict, unbreakable ordering requirement rooted in z/Architecture hardware behavior.

Why VA 0 Must Always Be Mapped

Every interrupt handler entry stub (trap_pgm_entry, trap_ext_entry, etc.) begins with:

lg  %r1, LC_ASYNC_STACK(0)   // effective VA = 0x0350

The zero base register is not an error — it is the only way to load a value before registers have been saved. Because DAT is active when this runs, VA 0x350 must be translated successfully. If the mapping is absent even for one instruction cycle while interrupts are unmasked, a program-check fires, SAVE_FRAME tries to load from VA 0x350 again, and the CPU enters an infinite Region-first-translation exception (0x0039) death loop.

Required Sequence in mmu_init()

 Step 1: mmu_map_page(VA 0x0000 → PA 0x0000)   // build mapping first
 Step 2: mmu_map_page(VA 0x1000 → PA 0x1000)   // both pages of the lowcore
 Step 3: scrub r1[1..2046]                      // revoke identity map
 Step 4: mmu_flush_tlb_local()                  // make scrub visible to CPU

Steps 1–2 must precede steps 3–4. The new 8 KB mapping is committed into the live R1 table before any identity entry is removed, so VA 0x350 is always valid.

Can This Be Avoided by Enabling DAT Earlier?

No. The requirement is not a consequence of when DAT is enabled; it comes from how SAVE_FRAME accesses the lowcore. Even if ZXFL enabled DAT internally and passed the kernel a fully virtual address space, the kernel's entry.S would still execute lg %r1, 0x350(0) and still require VA 0x350 to be mapped. This is standard z/Architecture operating system design — Linux s390x, z/VM, and z/OS all maintain an equivalent lowcore window at virtual address 0 for the same reason. See docs/src/kernel/trap.md for the full architectural rationale.

System Check (syschk)

Document Revision: 26h1.3
Status: Active


1. Overview

The System Check subsystem (syschk) is the kernel's mechanism for halting the system when a condition is detected from which execution cannot safely continue.

The halt path acquires no locks, calls no kernel subsystems, and dereferences no kernel data structures. It is safe to call from any context: exception handlers, IRQ handlers, early init, or a state where kernel memory is corrupt.


2. Error Code Encoding

Every system check is identified by a 16-bit code with three fields:

 15      12 11       8 7             0
 ┌────────┬──────────┬───────────────┐
 │ CLASS  │  DOMAIN  │     TYPE      │
 │  4 b   │   4 b    │     8 b       │
 └────────┴──────────┴───────────────┘

Field   Bits   Purpose
------  -----  ------------------------------------
CLASS   15–12  Severity class
DOMAIN  11–8   Originating subsystem
TYPE    7–0    Specific condition within the domain

2.1 Severity Classes

Class     Value  Behavior
--------  -----  ------------
FATAL     0xF    Always halts
CRITICAL  0xC    Always halts
WARNING   0x3    Always halts

All classes halt unconditionally. The class field exists for post-mortem triage, not for runtime branching.

2.2 Domains

Domain  Value  Subsystem
------  -----  ----------------------------
CORE    0x0    Core kernel / initialization
MEM     0x1    Memory subsystem
SYNC    0x2    Synchronization primitives
ARCH    0x3    Architecture / hardware
SCHED   0x4    Scheduler
IO      0x5    I/O subsystem
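
The encoding in this section can be expressed with a handful of macros. zx_syschk_code_t is the type named in the crash record layout; the macro names and the example TYPE value below are hypothetical and only illustrate the bit layout.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t zx_syschk_code_t;

/* Class and domain values from the tables above. */
#define EX_SYSCHK_CLASS_FATAL    0xFu
#define EX_SYSCHK_CLASS_CRITICAL 0xCu
#define EX_SYSCHK_CLASS_WARNING  0x3u
#define EX_SYSCHK_DOMAIN_MEM     0x1u

/* Compose a code: CLASS in bits 15-12, DOMAIN in 11-8, TYPE in 7-0. */
#define EX_SYSCHK_CODE(cls, dom, type)                           \
    ((zx_syschk_code_t)((((cls) & 0xFu) << 12) |                 \
                        (((dom) & 0xFu) << 8)  |                 \
                        ((type) & 0xFFu)))

/* Field extractors for post-mortem triage. */
#define EX_SYSCHK_CLASS(code)  (((code) >> 12) & 0xFu)
#define EX_SYSCHK_DOMAIN(code) (((code) >> 8) & 0xFu)
#define EX_SYSCHK_TYPE(code)   ((code) & 0xFFu)
```

For example, a FATAL condition in the MEM domain with the made-up type 0x2A encodes as 0xF12A.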

3. Halt Sequence

zx_system_check(code, msg)
        │
        ▼
  arch_local_irq_disable()
        │
        ▼
  g_halting set? ──YES──► arch_sys_halt()
        │
        │ NO
        ▼
  g_halting = 1
        │
        ▼
  write zx_crash_record_t to lowcore + 0x1400
  (magic, code, PSW snapshot, reason string)
        │
        ▼
  raw SIGP STOP loop over g_cpu_map[]
  (boot protocol array; no percpu_areas lookup)
  CC=2 retried; CC=3 skipped
        │
        ▼
  arch_sys_halt()  ← disabled-wait PSW; machine stops

4. Crash Record

Before halting, the issuing CPU writes a zx_crash_record_t to a fixed offset (0x1400) within the BSP lowcore. The lowcore sits at a fixed physical address, is always mapped, and remains accessible regardless of kernel heap or DAT state.

Offset  Size  Field
------  ----  -----
0x00    8     magic  (0x5A584352554E4348 "ZXCRUNCH")
0x08    2     code   (zx_syschk_code_t)
0x0A    6     pad
0x10    8     psw_mask  (EPSW at time of syschk)
0x18    8     psw_addr  (0; not available from EPSW)
0x20    128   msg    (NUL-terminated reason string)

The record is read post-mortem by a debugger or operator console. It is not printed to the console during the halt sequence.
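
The layout above maps directly onto a C struct whose offsets can be checked at compile time. The struct below follows the table; the fill helper is a hypothetical sketch that copies the message with a plain loop, in keeping with a halt path that calls no other kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZX_CRASH_MAGIC 0x5A584352554E4348ULL   /* value from the table above */

typedef struct {
    uint64_t magic;
    uint16_t code;        /* zx_syschk_code_t */
    uint8_t  pad[6];
    uint64_t psw_mask;
    uint64_t psw_addr;    /* always 0; EPSW does not provide it */
    char     msg[128];
} zx_crash_record_t;

_Static_assert(offsetof(zx_crash_record_t, code)     == 0x08, "code offset");
_Static_assert(offsetof(zx_crash_record_t, psw_mask) == 0x10, "psw_mask offset");
_Static_assert(offsetof(zx_crash_record_t, psw_addr) == 0x18, "psw_addr offset");
_Static_assert(offsetof(zx_crash_record_t, msg)      == 0x20, "msg offset");
_Static_assert(sizeof(zx_crash_record_t)             == 0xA0, "record size");

/* Hypothetical helper: fill a record without calling any library code. */
static inline void crash_record_fill(zx_crash_record_t *rec, uint16_t code,
                                     uint64_t psw_mask, const char *msg)
{
    size_t i = 0;
    rec->magic = ZX_CRASH_MAGIC;
    rec->code  = code;
    for (size_t p = 0; p < sizeof rec->pad; p++)
        rec->pad[p] = 0;
    rec->psw_mask = psw_mask;
    rec->psw_addr = 0;
    /* Truncating copy that always leaves room for the terminating NUL. */
    for (; i + 1 < sizeof rec->msg && msg[i] != '\0'; i++)
        rec->msg[i] = msg[i];
    rec->msg[i] = '\0';
}
```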


5. Re-entrancy

If a second system check fires on any CPU while a halt is already in progress, the re-entrant call detects g_halting immediately after IRQ disable and proceeds directly to arch_sys_halt(). The crash record is not overwritten.

g_halting is a volatile int, not an atomic. If the memory subsystem is corrupt, atomic operations cannot be trusted.


6. SMP Teardown

The halt path iterates g_cpu_map[] — the boot protocol's CPU map, registered at init time via zx_syschk_register_cpu_map(). This array is loader-written, physically contiguous, and never freed. It does not depend on percpu_areas[] or any kernel allocator.

sigp() is a single inline assembly instruction. It acquires no locks. CC=2 (busy) is retried in a tight loop. CC=3 (not operational) is skipped.
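
The retry and skip rules can be sketched as a loop over the CPU map with the SIGP call abstracted behind a function pointer, so the logic runs on a host. The function names, the fake SIGP stand-in, and the decision to skip the issuing CPU's own address are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SIGP condition codes as defined by z/Architecture. */
#define SIGP_CC_ACCEPTED        0
#define SIGP_CC_STATUS_STORED   1
#define SIGP_CC_BUSY            2
#define SIGP_CC_NOT_OPERATIONAL 3

typedef int (*sigp_stop_fn)(uint16_t cpu_addr);

/* Stop every CPU in cpu_map except self. Returns how many CPUs took
 * the order (condition code 0 or 1). */
static inline unsigned stop_other_cpus(const uint16_t *cpu_map, size_t n,
                                       uint16_t self, sigp_stop_fn sigp_stop)
{
    unsigned stopped = 0;
    for (size_t i = 0; i < n; i++) {
        int cc;
        if (cpu_map[i] == self)
            continue;                   /* assumed: the caller halts itself later */
        do {
            cc = sigp_stop(cpu_map[i]);
        } while (cc == SIGP_CC_BUSY);   /* CC=2: retry in a tight loop */
        if (cc != SIGP_CC_NOT_OPERATIONAL)
            stopped++;                  /* CC=3: skip this CPU */
    }
    return stopped;
}

/* Host-side stand-in for the real SIGP instruction: CPU 2 reports busy
 * twice before accepting, CPU 3 is not operational. */
static int fake_busy_left = 2;
static int fake_sigp_stop(uint16_t cpu_addr)
{
    if (cpu_addr == 3)
        return SIGP_CC_NOT_OPERATIONAL;
    if (cpu_addr == 2 && fake_busy_left > 0) {
        fake_busy_left--;
        return SIGP_CC_BUSY;
    }
    return SIGP_CC_ACCEPTED;
}
```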


7. WARNING-Class Codes

WARNING codes halt unconditionally. There is no filter mechanism. If a subsystem needs to log a recoverable condition, it should call printk directly and not use zx_system_check.


8. Revision History

Revision  Change
--------  ------------------------------------------------------------
26h1.3    Removed filter API; all classes halt unconditionally; crash record written to lowcore; raw SIGP loop; no printk on halt path
26h1.2    Re-entrant guard moved first; SMP teardown before printk; static BSS message buffer
26h1.1    Initial release

Per-CPU Data

Document Revision: 26h1.3
Sources: include/arch/s390x/cpu/lowcore.h, include/zxfoundation/percpu.h, arch/s390x/cpu/percpu.c


1. Layout

Each CPU's prefix area (lowcore) is a monolithic 8 KB block (two contiguous physical pages). The physical address of this block is loaded into the hardware prefix register via SPX. Prefixing then transparently redirects real addresses 0x0000–0x1FFF to the CPU's own physical lowcore, and redirects real addresses that fall inside that 8 KB block back to physical 0x0000–0x1FFF, for every access that uses a real address.

The layout unifies hardware-assigned fields and software-defined per-CPU data into a single structure (zx_lowcore_t):

Physical Prefix Area (8 KB)
┌──────────────────────────────┐ 0x000
│  Hardware Lowcore            │   PSWs, interrupt codes, save areas (PoP §4)
├──────────────────────────────┤ 0x400  ← LC_PERCPU_OFFSET
│  Software Per-CPU Block      │   prefix_base, cpu_id, lock_depth,
│  (zx_percpu_t percpu)        │   MCS nodes, RCU state, PCP caches
├──────────────────────────────┤ 0x1200
│  Hardware Save Areas         │   GPRs, FPRs, CRs, ARs
└──────────────────────────────┘ 0x2000

2. Access — Current CPU

To access the current CPU's own per-CPU data, the kernel uses zx_lowcore(), which returns the HHDM-mapped pointer to the active lowcore. Because the prefix register already routes absolute-address-0 to this CPU's physical lowcore, and the HHDM maps physical 0 to CONFIG_KERNEL_VIRT_OFFSET, zx_lowcore() always resolves to the correct CPU without needing the prefix register value at all.

Macro                   Description
----------------------  ------------------------------------------------
percpu_get(field)       Read a field from the current CPU's percpu block
percpu_set(field, val)  Write a field to the current CPU's percpu block
percpu_inc(field)       Increment a field in place
percpu_dec(field)       Decrement a field in place
percpu_ptr_to(field)    Pointer to a field in the current CPU's block

3. Access — Other CPUs (zx_lowcore_cpu)

3.1 The Hardware Prefix Aliasing Bug

Accessing another CPU's lowcore by index into a global pointer array is deceptively dangerous on s390x. Consider the global array __percpu_areas_raw[] where:

  • __percpu_areas_raw[0] = HHDM pointer to BSP lowcore = CONFIG_KERNEL_VIRT_OFFSET + 0
  • __percpu_areas_raw[1] = HHDM pointer to AP-1 lowcore = CONFIG_KERNEL_VIRT_OFFSET + P

When AP-1 (whose prefix register is P) reads a value from address CONFIG_KERNEL_VIRT_OFFSET + 0 (i.e., the BSP's HHDM lowcore), the MMU translates it to physical address 0. The prefix register then remaps physical 0 to physical P — so AP-1 silently reads its own lowcore, not the BSP's.

Symmetrically, when AP-1 reads from CONFIG_KERNEL_VIRT_OFFSET + P, the MMU translates it to physical P. The prefix register remaps physical P to physical 0 — so AP-1 silently reads the BSP's lowcore.

The result: whenever an AP looks up the BSP's lowcore it silently gets its own, and whenever it looks up its own entry it gets the BSP's. Before the fix, IPI delivery, RCU quiescent-state tracking, and PMM per-CPU page caches all operated on the wrong CPU's data. The system "mostly worked" because the symmetry of the swap meant IPIs still reached all CPUs, masking the corruption.

3.2 The Safe Accessor: zx_lowcore_cpu(cpu)

__percpu_areas_raw[] must never be accessed directly. Use zx_lowcore_cpu(cpu) defined in include/zxfoundation/percpu.h, which applies an inverse prefix swap in software:

#define zx_lowcore_cpu(cpu)                                                    \
    ({                                                                          \
        zx_lowcore_t *__lc = __percpu_areas_raw[(cpu)];                        \
        zx_lowcore_t *__res = __lc;                                             \
        if (__lc) {                                                             \
            uint64_t __target_real = (uint64_t)__lc - CONFIG_KERNEL_VIRT_OFFSET;\
            uint64_t __my_prefix   = zx_lowcore()->percpu.prefix_base;         \
            if (__target_real == __my_prefix)                                   \
                __res = (zx_lowcore_t *)CONFIG_KERNEL_VIRT_OFFSET;             \
            else if (__target_real == 0)                                        \
                __res = (zx_lowcore_t *)(CONFIG_KERNEL_VIRT_OFFSET + __my_prefix);\
        }                                                                       \
        __res;                                                                  \
    })

How it works: if the target's physical address matches my_prefix, the hardware would have swapped it to 0, so we manually redirect to HHDM + 0 (the BSP). If the target's physical address is 0, the hardware would have swapped it to my_prefix, so we redirect to HHDM + my_prefix. Any other CPU is unaffected (no swap applies).
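
The interaction between hardware prefixing and the software inverse can be checked with a small host-side model. Both functions below are illustrative models working on physical addresses only (no HHDM offset); the names are hypothetical and the 0x10000 and 0x20000 prefixes in the usage note are made-up values.

```c
#include <assert.h>
#include <stdint.h>

#define LOWCORE_SIZE 0x2000u

/* Model of hardware prefixing: the first 8 KB and the 8 KB block at
 * the prefix trade places; every other address passes through. */
static inline uint64_t prefix_apply(uint64_t real, uint64_t prefix)
{
    if (real < LOWCORE_SIZE)
        return real + prefix;
    if (real >= prefix && real < prefix + LOWCORE_SIZE)
        return real - prefix;
    return real;
}

/* Model of the software inverse in zx_lowcore_cpu(): given the physical
 * lowcore we want to reach, return the address to issue so that the
 * hardware swap lands on it. */
static inline uint64_t lowcore_real_for(uint64_t target, uint64_t my_prefix)
{
    if (target == my_prefix)
        return 0;            /* hardware maps 0 to my_prefix */
    if (target == 0)
        return my_prefix;    /* hardware maps my_prefix to 0 */
    return target;           /* unaffected by the swap */
}
```

From an AP with prefix 0x10000, issuing address 0 directly reaches 0x10000 (its own lowcore), while issuing lowcore_real_for(0, 0x10000) reaches physical 0 (the BSP). A third CPU's lowcore at 0x20000 is reached unchanged.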

The cross-CPU access macros all go through this accessor:

Macro                           Description
------------------------------  ----------------------------------------
percpu_get_on(cpu, field)       Read from another CPU's percpu block
percpu_set_on(cpu, field, val)  Write to another CPU's percpu block
percpu_ptr_on(cpu, field)       Pointer to a field in another CPU's block

4. Initialization

Function                                When Called                Effect
--------------------------------------  -------------------------  ------------------------------------------------------------
percpu_init_bsp()                       Once, early in main.c      Registers BSP lowcore (physical 0x0) in __percpu_areas_raw[0]
percpu_init_ap(cpu_id, cpu_addr, node)  Once per AP in smp_init()  Allocates 8 KB (order-1), installs prefix via SPX, registers in __percpu_areas_raw[cpu_id]

5. Fields (zx_percpu_t)

Field                       Type             Purpose
--------------------------  ---------------  ------------------------------------------------------------
prefix_base                 uint64_t         Physical address of this CPU's lowcore (used by zx_lowcore_cpu)
cpu_id                      uint16_t         Logical CPU ID (0 = BSP)
cpu_addr                    uint16_t         z/Arch CPU address (STAP result); used for SIGP
lock_depth                  uint32_t         qspinlock nesting depth
lock_nodes[MAX_LOCK_DEPTH]  mcs_node_t[]     MCS queue nodes for qspinlock
rcu_gp_seq                  uint64_t         RCU grace-period sequence (written by BSP)
rcu_qs_seq                  uint64_t         RCU quiescent-state sequence (written by this CPU)
in_rcu_read_side            uint8_t          1 if inside rcu_read_lock()
ipi_pending_count           uint32_t         Pending IPI completion counter
ap_stack_top                uint64_t         Initial AP stack pointer (physical, set before SIGP Restart)
pcp[ZONE_MAX]               pmm_pcplist_t[]  Per-CPU PMM order-0 page caches, one per memory zone

6. Assembly Offsets

Key lowcore offsets used by entry.S and head64.S are defined as named constants in include/arch/s390x/cpu/lowcore.h and verified at compile time by _Static_assert:

Constant          Value   Field
----------------  ------  -----------------------------------------
LC_ASYNC_STACK    0x0350  zx_lowcore_t::async_stack
LC_MCCK_STACK     0x0368  zx_lowcore_t::mcck_stack
LC_KERNEL_STACK   0x0348  zx_lowcore_t::kernel_stack
LC_RESTART_STACK  0x0360  zx_lowcore_t::restart_stack
LC_KERNEL_ASCE    0x0388  zx_lowcore_t::kernel_asce
LC_PERCPU_OFFSET  0x0400  zx_lowcore_t::percpu
LC_CPU_ID_OFFSET  0x0408  zx_percpu_t::cpu_id (within percpu block)

Interrupt Subsystem

Document Revision: 26h1.0
Subsystem: arch/s390x/trap, zxfoundation/irq


1. Overview

The interrupt subsystem handles four of the six z/Architecture interrupt classes: program check, external, I/O, and machine check. It is structured in two layers:

  • Architecture layer (arch/s390x/trap/) — low-level entry stubs and class-specific C handlers that decode hardware state from the lowcore.
  • Generic layer (zxfoundation/irq/) — a flat IRQ descriptor table that routes decoded interrupt codes to registered handlers.

Supervisor calls (SVC) are reserved for the future syscall layer and are not dispatched through this subsystem.


2. Interrupt Delivery on z/Architecture

When an interrupt fires, the hardware atomically:

  1. Saves the current PSW into the class-specific old PSW slot in the lowcore (prefix area).
  2. Writes interrupt parameters into fixed lowcore fields.
  3. Loads the class-specific new PSW slot, transferring control to the kernel entry stub.

Hardware fires interrupt
        │
        ▼
  Save current PSW → lowcore old PSW slot (0x0130/0x0150/0x0160/0x0170)
        │
        ▼
  Write interrupt parameters to lowcore (pgm_code, ext_int_code, …)
        │
        ▼
  Load new PSW slot (0x01B0/0x01D0/0x01E0/0x01F0) → entry stub

The new PSW slots are installed by zx_lowcore_setup_late() after DAT is enabled. Before that point they hold disabled-wait sentinels.


3. Lowcore Interrupt Slots

Class          Old PSW  New PSW  Parameter fields
-------------  -------  -------  -------------------------------
External       0x0130   0x01B0   ext_int_code (0x0086)
Program check  0x0150   0x01D0   pgm_code (0x008E)
Machine check  0x0160   0x01E0   mcck_interruption_code (0x00E8)
I/O            0x0170   0x01F0   subchannel_nr (0x00BA)

4. Entry Stubs (arch/s390x/trap/entry.S)

Each entry stub performs the following sequence without touching any kernel data structure:

entry stub
  │
  ├─ Load dedicated stack pointer from lowcore
  │    async_stack (0x0350) for PGM / EXT / IO
  │    mcck_stack  (0x0368) for MCCK
  │
  ├─ Allocate 160-byte ABI save area + 160-byte interrupt frame
  │
  ├─ Store GPRs r0–r15 into frame.gprs[0..15]
  │
  ├─ Copy old PSW (mask + addr) from lowcore into frame.psw_mask/psw_addr
  │
  ├─ Set %r2 = &frame  (first argument to C handler)
  │
  ├─ BRASL → C handler (do_pgm_check / do_ext_interrupt / …)
  │
  └─ Restore GPRs r0–r14, LPSWE from frame.psw_mask

The machine-check stub uses a separate stack (mcck_stack) so that the handler runs even if the async stack is corrupt.

4.1 Interrupt Frame Layout

Offset  Size  Field
------  ----  -----
0x00    128   gprs[0..15]   — GPRs at interrupt time
0x80    8     psw_mask      — old PSW mask word
0x88    8     psw_addr      — old PSW instruction address

Total: 160 bytes (IRQ_FRAME_SIZE). The fields listed above occupy the first 144 bytes (0x90).
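
One way to pin this layout down is a struct with compile-time offset checks. In the sketch below the trailing `reserved` words are an assumption used to reach IRQ_FRAME_SIZE, and the helper that tests the PSW problem-state bit (bit 15 of the mask) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IRQ_FRAME_SIZE     160
#define PSW_MASK_PROBLEM   0x0001000000000000ULL   /* PSW bit 15 */

typedef struct {
    uint64_t gprs[16];     /* 0x00: GPRs at interrupt time */
    uint64_t psw_mask;     /* 0x80: old PSW mask word */
    uint64_t psw_addr;     /* 0x88: old PSW instruction address */
    uint64_t reserved[2];  /* 0x90: assumed padding up to 160 bytes */
} irq_frame_t;

_Static_assert(offsetof(irq_frame_t, psw_mask) == 0x80, "psw_mask offset");
_Static_assert(offsetof(irq_frame_t, psw_addr) == 0x88, "psw_addr offset");
_Static_assert(sizeof(irq_frame_t) == IRQ_FRAME_SIZE, "frame size");

/* True if the interrupted context was running in problem state,
 * that is, outside the kernel. */
static inline bool irq_frame_from_problem_state(const irq_frame_t *frame)
{
    return (frame->psw_mask & PSW_MASK_PROBLEM) != 0;
}
```

A static assertion of this kind keeps the C view of the frame in step with the offsets hard-coded in the assembly entry stubs.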


5. IRQ Number Space

The generic layer uses a 16-bit IRQ number partitioned by interrupt class:

0x0000 – 0x00FF   Program check codes  (pgm_code & 0x7FFF)
0x0100 – 0x01FF   External codes       (ext_int_code)
0x0200 – 0x02FF   I/O subchannel numbers (subchannel_nr & 0xFF)
0x0300 – 0x03FF   Machine-check sub-codes (mcic >> 56)

The descriptor table has ZX_IRQ_NR_MAX = 0x400 entries.


6. IRQ Descriptor Table (zxfoundation/irq/)

The table is a flat, statically-allocated BSS array. Each entry holds:

  • A handler function pointer (irq_handler_t).
  • An opaque data pointer forwarded to the handler.
  • flags (ZX_IRQF_SHARED, ZX_IRQF_DISABLED).
  • A count field incremented on every dispatch.

6.1 Dispatch Path

C handler (do_pgm_check / do_ext_interrupt / …)
  │
  ├─ Read hardware code from lowcore
  ├─ Compute irq = ZX_IRQ_BASE_* + code
  └─ irq_dispatch(irq, frame)
        │
        ├─ Bounds check irq < ZX_IRQ_NR_MAX
        ├─ Increment desc->count
        └─ Call desc->handler (or default handler if NULL)
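
The dispatch path above can be sketched as follows. This is a host-side illustration with hypothetical `ex_` names and signatures; the real irq_handler_t and irq_desc_t definitions are not shown in this document, the frame is passed as an opaque pointer, and refusing to register over an occupied slot is an assumption (ZX_IRQF_SHARED is not modeled).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZX_IRQ_NR_MAX 0x400u

typedef void (*ex_irq_handler_t)(uint16_t irq, void *frame, void *data);

typedef struct {
    ex_irq_handler_t handler;
    void            *data;    /* opaque pointer forwarded to the handler */
    uint32_t         flags;
    uint64_t         count;   /* incremented on every dispatch */
} ex_irq_desc_t;

static ex_irq_desc_t ex_irq_table[ZX_IRQ_NR_MAX];   /* flat static table */
static unsigned      ex_unhandled;

static void ex_default_handler(uint16_t irq, void *frame)
{
    (void)irq; (void)frame;
    ex_unhandled++;           /* stand-in for the per-class default action */
}

static int ex_irq_register(uint16_t irq, ex_irq_handler_t h, void *data,
                           uint32_t flags)
{
    if (irq >= ZX_IRQ_NR_MAX || h == NULL || ex_irq_table[irq].handler != NULL)
        return -1;
    ex_irq_table[irq] = (ex_irq_desc_t){ .handler = h, .data = data,
                                         .flags = flags, .count = 0 };
    return 0;
}

static int ex_irq_dispatch(uint16_t irq, void *frame)
{
    if (irq >= ZX_IRQ_NR_MAX)
        return -1;            /* bounds check before touching the table */
    ex_irq_desc_t *d = &ex_irq_table[irq];
    d->count++;
    if (d->handler)
        d->handler(irq, frame, d->data);
    else
        ex_default_handler(irq, frame);
    return 0;
}

/* Example handler: counts invocations through its data pointer. */
static void ex_count_handler(uint16_t irq, void *frame, void *data)
{
    (void)irq; (void)frame;
    (*(int *)data)++;
}
```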

6.2 Default Handler Behavior

IRQ range           Default action
------------------  --------------------------------------------
PGM  (0x0–0xFF)     zx_system_check(ARCH_UNHANDLED_TRAP) (fatal)
EXT  (0x100–0x1FF)  printk + drop
IO   (0x200–0x2FF)  printk + drop
MCCK (0x300–0x3FF)  zx_system_check(ARCH_MCHECK) (fatal)

7. Machine-Check Special Case

Before dispatching, do_mcck_interrupt checks the system damage bit (bit 0) of the MCIC. If set, zx_system_check() is called immediately — the descriptor table itself may reside in damaged storage and cannot be trusted.


8. Registration API

irq_register(irq, handler, data, flags)  → 0 or -1
irq_unregister(irq)
irq_dispatch(irq, frame)
irq_get_desc(irq)                        → const irq_desc_t *

irq_register and irq_unregister are not SMP-safe at this revision. They must be called during single-threaded initialization or with external serialization.
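The registration and dispatch calls can be sketched as follows. This is an illustrative model only: the handler signature, the descriptor field types, and the decision to reject a second registration on an occupied slot are assumptions, and the default-handler fallback is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZX_IRQ_NR_MAX 0x400u

typedef struct irq_frame irq_frame_t;   /* opaque in this sketch */
/* ASSUMPTION: handler signature is not given in the text. */
typedef void (*irq_handler_t)(uint16_t irq, irq_frame_t *frame, void *data);

typedef struct irq_desc {
    irq_handler_t handler;  /* NULL means "use the default handler" */
    void         *data;     /* opaque pointer forwarded to handler  */
    uint32_t      flags;
    uint64_t      count;    /* incremented on every dispatch        */
} irq_desc_t;

static irq_desc_t irq_table[ZX_IRQ_NR_MAX];  /* flat, statically allocated */

/* Not SMP-safe, matching the note above: call during single-threaded init. */
static int irq_register(uint16_t irq, irq_handler_t handler, void *data,
                        uint32_t flags)
{
    if (irq >= ZX_IRQ_NR_MAX || handler == NULL ||
        irq_table[irq].handler != NULL)
        return -1;
    irq_table[irq] = (irq_desc_t){ .handler = handler, .data = data,
                                   .flags = flags };
    return 0;
}

static void irq_dispatch(uint16_t irq, irq_frame_t *frame)
{
    if (irq >= ZX_IRQ_NR_MAX)
        return;                          /* bounds check */
    irq_desc_t *desc = &irq_table[irq];
    desc->count++;
    if (desc->handler != NULL)
        desc->handler(irq, frame, desc->data);
    /* else: the class-specific default handler would run here */
}

/* Example handler: counts how often it was invoked. */
static void count_hits(uint16_t irq, irq_frame_t *frame, void *data)
{
    (void)irq;
    (void)frame;
    ++*(unsigned *)data;
}
```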


9. Revision History

Revision  Change
--------  ---------------
26h1.0    Initial release

Memory Management

Document Revision: 26h1.0


ZXFoundation™'s memory management is organized in five layers:

┌──────────────────────────────────────────┐
│  kmalloc / kfree  (general-purpose)      │
├──────────────────────────────────────────┤
│  Slab allocator   (fixed-size caches)    │
├──────────────────────────────────────────┤
│  VMM              (virtual address space)│
├──────────────────────────────────────────┤
│  PMM              (physical frames)      │
├──────────────────────────────────────────┤
│  MMU              (hardware DAT tables)  │
└──────────────────────────────────────────┘

Page            Contents
--------------  ------------------------------------------------
PMM             Zone-aware buddy allocator, page descriptors
VMM             Virtual address space, VMA red-black tree, vmalloc
Slab & Kmalloc  Fixed-size object caches, general allocator

Physical Memory Manager (PMM)

Document Revision: 26h1.0
Source: zxfoundation/memory/pmm.c


1. Zones

Zone         Physical range      Purpose
-----------  ------------------  -------------------------------------------
ZONE_DMA     [0, 16 MB)          Channel I/O buffers (24-bit CDA constraint)
ZONE_NORMAL  [16 MB, RAM limit)  General kernel allocations

Allocations without ZX_GFP_DMA are served from ZONE_NORMAL first. If ZONE_NORMAL is exhausted and ZX_GFP_DMA_FALLBACK is set, the PMM falls back to ZONE_DMA.


2. Buddy Allocator

Free physical frames are managed in a buddy system. Block sizes are powers of two, from order 0 (4 KB) to order 10 (4 MB). Each order has a free list of blocks.

Allocation — walk the free list at the requested order. If empty, split a block from the next higher order. Repeat until a block is found or all orders are exhausted.

Deallocation — compute the buddy PFN (pfn ^ (1 << order)). If the buddy is free at the same order, coalesce and recurse upward.

Free list links use PFN-based intrusive fields (buddy_next) rather than virtual pointers, ensuring correctness across HHDM translations.
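The buddy arithmetic is small enough to show directly. The helper names below are hypothetical; the XOR expression and the order range are taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define PMM_PAGE_SHIFT 12u   /* 4 KB frames */
#define PMM_MAX_ORDER  10u   /* order 10 = 4 MB blocks */

/* PFN of the block that pairs with `pfn` at the given order. */
static inline uint64_t buddy_pfn(uint64_t pfn, unsigned order)
{
    return pfn ^ (UINT64_C(1) << order);
}

/* PFN of the merged block once `pfn` and its buddy coalesce:
 * the lower of the two, i.e. `pfn` with the order bit cleared. */
static inline uint64_t merged_pfn(uint64_t pfn, unsigned order)
{
    return pfn & ~(UINT64_C(1) << order);
}

/* Size in bytes of a block of the given order. */
static inline uint64_t order_bytes(unsigned order)
{
    return UINT64_C(1) << (PMM_PAGE_SHIFT + order);
}
```

Because the relation is symmetric (the buddy of the buddy is the original block), the same expression serves both blocks during coalescing.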


3. Page Descriptor (zx_page_t)

Each physical frame has a 32-byte descriptor. The descriptor array is mapped contiguously in the HHDM. 32 bytes places 128 descriptors per 4 KB frame — a deliberate cache-line optimization.

Field       Description
----------  ----------------------------------------
refcount    Atomic reference count; 0 = free
order       Current buddy order of this block
flags       Zone membership, compound page markers
buddy_next  PFN of next free block in the buddy list
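One way a 32-byte descriptor could be laid out is sketched below. The field widths, their order, and the reserved tail are assumptions; only the field names, the 32-byte size, and the 128-per-frame figure come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout; the real zx_page_t may order or size fields differently. */
typedef struct zx_page {
    _Atomic uint32_t refcount;   /* 0 = free                              */
    uint32_t         order;      /* current buddy order of this block     */
    uint64_t         flags;      /* zone membership, compound markers     */
    uint64_t         buddy_next; /* PFN of next free block, not a pointer */
    uint64_t         reserved;   /* ASSUMPTION: pads to 32 bytes          */
} zx_page_t;

static_assert(sizeof(zx_page_t) == 32, "descriptor must stay 32 bytes");

/* Number of descriptors that fit in one 4 KB frame. */
#define ZX_PAGES_PER_FRAME (4096u / sizeof(zx_page_t))
```

The static assertion guards the size so that adding a field cannot silently change the descriptor density.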

4. GFP Flags

Flag                 Meaning
-------------------  -------------------------------------
ZX_GFP_NORMAL        Standard allocation from ZONE_NORMAL
ZX_GFP_DMA           Must allocate from ZONE_DMA
ZX_GFP_DMA_FALLBACK  Try ZONE_NORMAL, fall back to ZONE_DMA
ZX_GFP_ZERO          Zero-fill the allocated pages
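The zone-selection rule implied by these flags can be written as a small pure function. The numeric flag values, the zx_zone_id_t enum, and the gfp_zone helper are assumptions for illustration; the fallback order follows the Zones section.

```c
#include <assert.h>

/* ASSUMPTION: flag encodings are invented for this sketch. */
#define ZX_GFP_NORMAL       0x0u
#define ZX_GFP_DMA          0x1u
#define ZX_GFP_DMA_FALLBACK 0x2u
#define ZX_GFP_ZERO         0x4u

typedef enum { ZONE_DMA, ZONE_NORMAL, ZONE_NONE } zx_zone_id_t;

/* Zone to try on the given attempt (0 = first choice, 1 = fallback).
 * Returns ZONE_NONE when no further zone may be tried. */
static zx_zone_id_t gfp_zone(unsigned gfp, unsigned attempt)
{
    if (gfp & ZX_GFP_DMA)                     /* DMA requests never leave ZONE_DMA */
        return attempt == 0 ? ZONE_DMA : ZONE_NONE;
    if (attempt == 0)
        return ZONE_NORMAL;
    if (attempt == 1 && (gfp & ZX_GFP_DMA_FALLBACK))
        return ZONE_DMA;                      /* only when explicitly permitted */
    return ZONE_NONE;
}
```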

5. SMP Safety & Per-CPU Lists (PCP)

Each zone has a dedicated ticket spinlock. To reduce contention, order-0 pages are cached in Per-CPU Lists (PCP).

  • Allocation: CPUs pull from local PCP first without locking (IRQs disabled).
  • Drain: Global operations (like pmm_reserve_range) trigger a global PCP drainage via SIGP Emergency Signals (IPI) to all other CPUs. This ensures no CPU holds a 'stale' cached page that should be reserved.

6. HHDM Side Reinforcement

The Higher-Half Direct Map (HHDM), the linear mapping of physical memory built by the loader, is validated during initialization:

  1. Validation: pmm_verify_hhdm() checks translation consistency against the loader's memory map. It verifies that every usable physical page is correctly mapped to its HHDM virtual counterpart.
  2. EDAT Compliance: Verifies Enhanced-DAT (EDAT-1/2) 1 MB and 2 GB page usage to optimize memory performance and reduce TLB pressure.
  3. Consistency: The loader must ensure that the mapping covers the entire physical memory range described in the boot protocol, rounding up to the nearest Region-3 or Segment boundary as required by the z/Architecture DAT structure.

7. Initialization

pmm_init(boot) is called once during early init:

  1. Walk boot->mem_map[] and register all ZXFL_MEM_USABLE regions.
  2. Mark reserved ranges via Surgical Reservation:
    • Lowcore/Artifacts: [0, 1 MB) is always reserved to protect lowcore and loader leftovers.
    • Kernel Image: [kernel_phys_start, kernel_phys_end) is marked as critical.
    • Page Table Pool: [kernel_phys_end, pgtbl_pool_end) is reserved to protect active DAT tables.
    • PMM Metadata: The zx_mem_map descriptor array itself.
  3. Insert all non-reserved USABLE frames into the buddy free lists.

Important: Surgical Reservation prevents "Zone Exhaustion" bugs where a large bootloader page pool could otherwise wipe out all available frames in ZONE_DMA (under 16 MB).

Virtual Memory Manager (VMM)

Document Revision: 26h1.0
Source: zxfoundation/memory/vmm.c


1. Address Space Regions

Region   Base                Purpose
-------  ------------------  --------------------------------------------------------------
HHDM     0xFFFF800000000000  Linear physical memory map (built by loader, read-only to VMM)
vmalloc  0xFFFFC00000000000  Dynamically mapped kernel memory

2. Virtual Memory Areas (VMAs)

Each allocated virtual range is described by a vm_area_t:

Field     Description
--------  -------------------------------------
va_start  Start of virtual range (page-aligned)
va_end    End of virtual range (exclusive)
flags     VM_READ, VM_WRITE, VM_EXEC
rb_node   Red-Black Tree node for O(log n) lookup

VMAs are indexed in a Red-Black Tree (rbtree.h). A one-entry MRU cache in vm_space_t provides an O(1) fast path for sequential access patterns.


3. vmalloc

vmm_alloc(size, flags) reserves a contiguous virtual range in the vmalloc region and maps it with PMM-allocated frames:

vmm_alloc(size, flags)
  │
  ├─ Round size up to page boundary
  ├─ Bump-allocate virtual range from vmalloc region
  ├─ Insert VMA into red-black tree
  ├─ For each page in range:
  │    ├─ pmm_alloc_page(flags)
  │    └─ mmu_map_page(kernel_pgtbl, va, pa, prot)
  └─ Return va_start

Frames backing a vmalloc range are not required to be physically contiguous.


4. Large-Object Heap (kheap)

For allocations larger than 8 KB, kheap_alloc calls vmm_alloc to back the range with PMM frames. A 64-bit HEAP_MAGIC canary guards the allocation header against buffer underflows.


5. MMU Integration

The VMM calls mmu_map_page (4 KB), mmu_map_large_page (1 MB, if EDAT-1 available), or mmu_map_huge_page (2 GB, if EDAT-2 available) to install PTEs. TLB coherency is handled automatically by the IPTE instruction — no software IPI is required.

Slab Allocator & kmalloc

Document Revision: 26h1.1
Source: zxfoundation/memory/slab.c, zxfoundation/memory/kmalloc.c


1. Slab Allocator

The slab allocator provides fixed-size object caches to amortize the cost of frequent small allocations (VMAs, sync primitives, capability tables, etc.). It uses a magazine-depot architecture for lock-free per-CPU fast paths and SMP-safe bulk operations through the depot.

1.1 Architecture

kmem_cache_t
  ├─ obj_size          (8-byte aligned)
  ├─ storage_key       (s390x storage key for all backing pages)
  ├─ depot_lock        (spinlock protecting the depot lists)
  ├─ full_mags         (depot: magazines with MAG_SIZE objects ready)
  ├─ empty_mags        (depot: magazines ready to be refilled)
  ├─ partial_slabs     (slab pages with free objects remaining)
  ├─ full_slabs        (slab pages fully allocated)
  └─ cpu_mags[MAX_CPUS] (per-CPU active magazine pointer)

Each magazine holds up to MAG_SIZE (31) object pointers. Each slab is one PMM page; the slab header, free-index stack, and object array are all embedded within that page.

1.2 Fast Path (per-CPU, no lock)

alloc:
  IRQs disabled
  if cpu_mag.count > 0 → pop and return
  else → magazine_swap(fill) → pop and return

free:
  IRQs disabled
  if cpu_mag.count < MAG_SIZE → push and return
  else → magazine_swap(drain) → push and return

IRQs are disabled for the duration of the fast path. No lock is taken; the per-CPU magazine is accessed exclusively.
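The per-CPU fast path amounts to a bounded stack of object pointers. The sketch below models only that stack; the names magazine_t, mag_pop, and mag_push are hypothetical, and IRQ disabling and the magazine_swap slow path are left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAG_SIZE 31u

/* Hypothetical magazine: a fixed-capacity LIFO of object pointers. */
typedef struct magazine {
    uint32_t count;
    void    *objs[MAG_SIZE];
} magazine_t;

/* Alloc fast path: returns NULL when the magazine is empty, which is
 * the caller's cue to run magazine_swap(fill). */
static void *mag_pop(magazine_t *m)
{
    return m->count > 0 ? m->objs[--m->count] : NULL;
}

/* Free fast path: returns false when the magazine is full, which is
 * the caller's cue to run magazine_swap(drain). */
static bool mag_push(magazine_t *m, void *obj)
{
    if (m->count >= MAG_SIZE)
        return false;
    m->objs[m->count++] = obj;
    return true;
}
```

LIFO order means the most recently freed object is handed out first, which tends to keep it cache-warm.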

1.3 Slow Path (depot, with lock)

magazine_swap acquires depot_lock. Two sub-paths:

Fill (need objects):

1. full_mags non-empty?
      yes → promote to CPU slot immediately (fast fill)
       no → obtain empty shell from empty_mags (or alloc from mag_cache)
            → cache_refill_magazine (may drop+reacquire depot_lock for PMM)
            → move filled shell to full_mags → promote to CPU slot

Drain (returning a full CPU magazine):

1. Push CPU magazine to full_mags
2. Pull empty shell from empty_mags into CPU slot (or set to nullptr)

1.4 Slab Refill & Lock Discipline

cache_refill_magazine is called with depot_lock held. When a new slab page must be allocated from the PMM:

drop depot_lock
  pmm_alloc_page()      ← PMM zone lock acquired/released here
reacquire depot_lock
re-validate partial_slabs (another CPU may have added one in the window)

This ensures the PMM zone lock and depot_lock are never held simultaneously, eliminating the lock-inversion hazard present in earlier revisions.

1.5 Node Lifecycle

Magazine nodes cycle between:

empty_mags ──fill──▶ (detached, being filled) ──▶ full_mags ──promote──▶ cpu_mag
cpu_mag ──drain──▶ full_mags   empty_mags ◀── (pulled empty shell)

list_del_init is used for all magazine-node removals so nodes are always in a self-pointing state when not on a list, making re-insertion safe without re-initialization.


2. kmalloc

kmalloc(size) routes requests to the appropriate slab cache based on size class.

Size range  Backing
----------  -------------------------------
≤ 8 KB      Slab cache (power-of-two class)
> 8 KB      vmalloc (vmm_alloc)

kfree(ptr) returns the object to its originating cache. A header embedded before each allocation records the cache pointer and a canary for use-after-free detection.
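Size-class routing can be sketched as a round-up to the next power of two. The minimum class of 8 bytes and the helper name are assumptions, and the embedded header is ignored here; only the 8 KB threshold and the power-of-two classes come from the table.

```c
#include <assert.h>
#include <stddef.h>

#define KMALLOC_MIN_CLASS 8u      /* ASSUMPTION: smallest class size */
#define KMALLOC_MAX_SLAB  8192u   /* requests above this go to vmm_alloc */

/* Returns the slab class size serving `size`, or 0 when the request
 * exceeds 8 KB and must be routed to the large-object path. */
static size_t kmalloc_class_size(size_t size)
{
    if (size > KMALLOC_MAX_SLAB)
        return 0;
    size_t cls = KMALLOC_MIN_CLASS;
    while (cls < size)
        cls <<= 1;                /* next power of two */
    return cls;
}
```

Power-of-two classes cost up to half of each object in internal fragmentation but keep the class lookup trivial.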


3. Initialization Order

pmm_init()      ← must run first; slab needs PMM pages
slab_init()     ← bootstraps cache_cache and mag_cache from a single PMM page
kmalloc_init()  ← registers size-class caches via kmem_cache_create
vmm_notify_slab_ready() ← switches VMM early allocator to kmalloc

4. Strict Requirements

ID      Requirement
------  -----------
SLAB-1  kmem_cache_alloc must not be called from hard-IRQ context unless the cache was created with atomic support. Use kmalloc(ZX_GFP_ATOMIC) from IRQ context.
SLAB-2  kmem_cache_free must only be called with a pointer returned by kmem_cache_alloc on the same cache. Cross-cache free is undefined behavior.
SLAB-3  kmem_cache_destroy must only be called after all objects have been returned. Outstanding objects at destroy time trigger a kernel panic.
SLAB-4  depot_lock must never be held when calling into the PMM or any allocator that may itself acquire a zone lock. Use the lock-drop protocol in cache_refill_magazine.

SMP

Document Revision: 26h1.0
Source: arch/s390x/cpu/


1. CPU Detection

The bootloader detects CPUs by issuing SIGP Sense (order 0x01) to each address in [0, ZXFL_CPU_MAP_MAX). A condition code of 3 means "not operational" — the address is unoccupied. CC 0, 1, or 2 means the CPU exists and is recorded in proto->cpu_map[].

The BSP address is read with STAP (Store CPU Address).

At kernel entry, proto->cpu_count contains the number of detected CPUs and proto->bsp_cpu_addr identifies the boot processor.


2. AP State at Handover

All APs are in the stopped state when the kernel receives control. The bootloader never starts APs. The kernel BSP is responsible for starting each AP:

Step  Action
----  ------------------------------------------------------------------------
1     Allocate a private prefix area (4 KB, page-aligned) for the AP
2     Allocate a private stack for the AP
3     Install interrupt new PSWs in the AP's prefix area
4     SIGP Initial CPU Reset: clear the AP's state
5     SIGP Set Prefix: point the AP's prefix register at its private lowcore
6     SIGP Restart: start the AP at the restart new PSW in its prefix area

Note: AP startup is not yet implemented. The current kernel halts after BSP initialization.


3. Per-CPU Data

Each CPU requires its own:

  • Prefix area (4 KB) — private lowcore with correct new PSWs. Set via SPX.
  • Stack — the AP must not use the BSP stack or the loader stack.
  • Per-CPU variables — accessed via the prefix register offset (analogous to %gs on x86).

4. TLB Coherency

z/Architecture hardware handles TLB coherency automatically via the IPTE (Invalidate Page Table Entry) instruction. IPTE sets the invalid bit in a PTE and, as part of the same operation, purges the matching TLB entries on all CPUs in the configuration. No software IPI is required for TLB shootdowns.

mmu_ipte(va):
    ipte %r0, va    ← serialising, hardware-broadcast

PTLB (Purge TLB) flushes the entire local TLB and should only be used during address-space teardown. For single-page invalidation in a running SMP kernel, always use IPTE.


5. SIGP Reference

Order              Code  Use
-----------------  ----  ---------------------------------
Sense              0x01  Query CPU state
External Call      0x02  Send external interrupt to CPU
Emergency Signal   0x03  Send emergency signal
Restart            0x06  Start AP at restart new PSW
Initial CPU Reset  0x0B  Clear CPU state before restart
Set Prefix         0x0D  Set prefix register on target CPU
Store Status       0x0E  Save CPU registers to prefix area
Set Architecture   0x12  Switch to z/Architecture mode

PSW Manager

Document Revision: 26h1.0
Subsystem: arch/s390x/cpu/psw


1. Overview

The PSW (Program Status Word) manager provides a single, authoritative definition of all z/Architecture PSW mask constants and new-PSW lowcore offsets. Prior to this subsystem, constants were duplicated across zxconfig.h and lowcore.h under different names, and assembly files hardcoded incorrect bit patterns.

All consumers — C translation units, assembly files, the ZXFL loader, and the kernel — include a single header: arch/s390x/cpu/psw.h.


2. PSW Mask Word Layout

The z/Architecture PSW is 16 bytes. The first 8 bytes are the mask word; the second 8 bytes are the instruction address.

Bit  1     PER mask
Bit  5     DAT (address translation enable)
Bit  6     I/O interrupt mask
Bit  7     External interrupt mask
Bits 8-11  PSW key
Bit 13     Machine-check mask
Bit 14     Wait state
Bit 15     Problem state (user mode)
Bits 16-17 Address space control (ASC)
Bits 18-19 Condition code
Bits 20-23 Program mask
Bit 31     EA (required for 64-bit addressing)
Bit 32     BA (required for 64-bit addressing)

Bits are numbered in IBM order: bit 0 is the most significant bit of the mask word. Bits not listed above are reserved and must be zero. Setting a reserved bit causes a Specification Exception when the PSW is loaded via LPSWE.


3. Defined Constants

3.1 Bit Masks

Constant            Value               Description
------------------  ------------------  ----------------------------
PSW_BIT_DAT         0x0400000000000000  Address translation enable
PSW_BIT_IO          0x0200000000000000  I/O interrupt mask
PSW_BIT_EXT         0x0100000000000000  External interrupt mask
PSW_BIT_MCCK        0x0004000000000000  Machine-check mask
PSW_BIT_WAIT        0x0002000000000000  Wait state
PSW_BIT_PSTATE      0x0001000000000000  Problem state (user mode)
PSW_BIT_HOME_SPACE  0x0000C00000000000  Home space addressing mode
PSW_BIT_EA          0x0000000100000000  Extended addressing (64-bit)
PSW_BIT_BA          0x0000000080000000  Basic addressing (64-bit)

3.2 Composite Masks

Constant                Value               Description
----------------------  ------------------  -------------------------------------------------------
PSW_ARCH_BITS           0x0000000180000000  EA|BA: 64-bit mode, no other bits set
PSW_MASK_KERNEL         0x0000000180000000  Supervisor, DAT off, all interrupts disabled
PSW_MASK_KERNEL_DAT     0x0400C00180000000  Supervisor, DAT on (Home Space), all interrupts disabled
PSW_MASK_DISABLED_WAIT  0x0002000180000000  Wait state, DAT off, all interrupts disabled
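The composite values can be cross-checked against the single-bit constants at compile time. In the sketch below the constant names and values come from the tables; the PSW_IBM_BIT helper is a hypothetical macro that converts an IBM bit number (bit 0 = most significant) into a mask.

```c
#include <assert.h>

#define PSW_BIT_DAT        0x0400000000000000ULL
#define PSW_BIT_WAIT       0x0002000000000000ULL
#define PSW_BIT_HOME_SPACE 0x0000C00000000000ULL
#define PSW_BIT_EA         0x0000000100000000ULL
#define PSW_BIT_BA         0x0000000080000000ULL

#define PSW_ARCH_BITS          (PSW_BIT_EA | PSW_BIT_BA)
#define PSW_MASK_KERNEL        (PSW_ARCH_BITS)
#define PSW_MASK_KERNEL_DAT    (PSW_BIT_DAT | PSW_BIT_HOME_SPACE | PSW_ARCH_BITS)
#define PSW_MASK_DISABLED_WAIT (PSW_BIT_WAIT | PSW_ARCH_BITS)

/* Hypothetical helper: IBM bit n of a 64-bit word, where bit 0 is the MSB. */
#define PSW_IBM_BIT(n) (1ULL << (63 - (n)))

static_assert(PSW_BIT_DAT == PSW_IBM_BIT(5), "DAT is bit 5");
static_assert(PSW_BIT_WAIT == PSW_IBM_BIT(14), "wait is bit 14");
static_assert(PSW_BIT_HOME_SPACE == (PSW_IBM_BIT(16) | PSW_IBM_BIT(17)),
              "home space is ASC = 11");
static_assert(PSW_BIT_EA == PSW_IBM_BIT(31), "EA is bit 31");
static_assert(PSW_BIT_BA == PSW_IBM_BIT(32), "BA is bit 32");
static_assert(PSW_MASK_KERNEL_DAT == 0x0400C00180000000ULL, "kernel DAT mask");
static_assert(PSW_MASK_DISABLED_WAIT == 0x0002000180000000ULL, "disabled wait");
```

Deriving composites by OR-ing named bits, rather than writing literals, keeps a single-bit correction from leaving a stale composite behind.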

3.3 New PSW Lowcore Offsets

These are the physical offsets within the lowcore (prefix area) from which the hardware loads the new PSW for each interrupt class (PoP SA22-7832 §4.3.3).

Constant         Offset  Interrupt class
---------------  ------  ---------------
PSW_LC_RESTART   0x01A0  Restart
PSW_LC_EXTERNAL  0x01B0  External
PSW_LC_SVC       0x01C0  Supervisor call
PSW_LC_PROGRAM   0x01D0  Program check
PSW_LC_MCCK      0x01E0  Machine check
PSW_LC_IO        0x01F0  I/O

Note: These offsets are distinct from the old PSW save slots (0x0120–0x0170) and from the interrupt parameter area (0x0080–0x00C0).


4. Boot Initialization

The ZXFL loader prepares the memory tables, registers the Home Space ASCE in CR13 and the Primary Space ASCE in CR1, and directly transitions to DAT-on mode using a PSW_MASK_KERNEL_DAT PSW target before passing control to the kernel.

Thus, the kernel boots with DAT active and executes completely in Home-Space. The legacy psw_install_new_psws() and zx_lowcore_setup_pre_dat() methods have been removed because the pre-DAT boot window is bypassed by the loader.

During early kernel initialization, zx_lowcore_setup_late() is called to install the live interrupt handler entry points directly into the HHDM-mapped lowcore.

Synchronization Primitives

Document Revision: 26h1.0
Source: zxfoundation/sync/, include/zxfoundation/spinlock.h, include/zxfoundation/atomic.h


1. Atomic Operations

include/zxfoundation/atomic.h provides atomic_t (32-bit) and atomic64_t (64-bit) types with the standard load/store/add/sub/cmpxchg operations, implemented using z/Architecture's CS (Compare and Swap) and CSG (Compare and Swap, 64-bit) instructions.


2. Spinlock

include/zxfoundation/spinlock.h provides a ticket spinlock. Ticket spinlocks guarantee FIFO ordering, preventing starvation on highly contended locks.

Function                             Description
-----------------------------------  -------------------------------------------
spin_lock(lock)                      Acquire; busy-wait with DIAG 44 (yield hint)
spin_unlock(lock)                    Release
spin_lock_irqsave(lock, flags)       Acquire + disable interrupts, save PSW mask
spin_unlock_irqrestore(lock, flags)  Release + restore PSW mask

irqsave/irqrestore variants are required whenever a lock may be acquired from both process context and interrupt context.
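The FIFO property of a ticket lock follows from its two counters. The sketch below uses portable C11 atomics instead of the kernel's CS-based primitives, and the type and function names are hypothetical; the DIAG 44 yield hint is reduced to a comment.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ticket lock: `next` hands out tickets, `owner` is the
 * ticket currently allowed to hold the lock. */
typedef struct ticket_lock {
    atomic_uint next;
    atomic_uint owner;
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l)
{
    atomic_init(&l->next, 0u);
    atomic_init(&l->owner, 0u);
}

static void ticket_lock(ticket_lock_t *l)
{
    /* Take a ticket; arrival order defines service order (FIFO). */
    unsigned my = atomic_fetch_add_explicit(&l->next, 1u, memory_order_relaxed);
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != my) {
        /* busy-wait; a yield hint such as DIAG 44 would go here */
    }
}

static void ticket_unlock(ticket_lock_t *l)
{
    /* Pass the lock to the holder of the next ticket. */
    atomic_fetch_add_explicit(&l->owner, 1u, memory_order_release);
}
```

Unsigned wraparound is harmless here because only equality between the two counters is ever tested.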


3. Mutex

zxfoundation/sync/mutex.c — a sleeping mutex backed by a wait queue. Suitable for contexts where sleeping is permitted (not interrupt handlers).

Function          Description
----------------  -----------------------------------------
mutex_lock(m)     Acquire; sleep if contended
mutex_trylock(m)  Non-blocking acquire; returns 0 on failure
mutex_unlock(m)   Release; wake one waiter

4. Reader-Writer Lock

zxfoundation/sync/rwlock.c — allows multiple concurrent readers or one exclusive writer.

Function                 Description
-----------------------  ------------------------------
rwlock_read_lock(rw)     Acquire shared read access
rwlock_read_unlock(rw)   Release read access
rwlock_write_lock(rw)    Acquire exclusive write access
rwlock_write_unlock(rw)  Release write access

5. Semaphore

zxfoundation/sync/semaphore.c — counting semaphore.

Function            Description
------------------  ------------------------------
sem_init(s, count)  Initialize with initial count
sem_wait(s)         Decrement; sleep if count is 0
sem_post(s)         Increment; wake one waiter

6. Wait Queue

zxfoundation/sync/waitqueue.c — a list of sleeping tasks waiting for a condition.

Function                       Description
-----------------------------  -----------------------------
waitqueue_init(wq)             Initialize
waitqueue_wait(wq, condition)  Sleep until condition is true
waitqueue_wake_one(wq)         Wake the first waiter
waitqueue_wake_all(wq)         Wake all waiters

7. RCU

zxfoundation/sync/rcu.c — Read-Copy-Update. Currently a stub; rcu_read_lock/rcu_read_unlock are no-ops and synchronize_rcu returns immediately.

RCU and SRCU

Document Revision: 26h1.1
Source: zxfoundation/sync/rcu.c, zxfoundation/sync/srcu.c


1. RCU

Read-Copy-Update for a non-preemptive kernel. A quiescent state (QS) occurs whenever a CPU is not inside an rcu_read_lock() section.

Read Side

Function                  Description
------------------------  --------------------------------------------------------
rcu_read_lock()           Enter read-side critical section (compiler barrier only)
rcu_read_unlock()         Exit read-side critical section
rcu_dereference(p)        Safely read an RCU-protected pointer
rcu_assign_pointer(p, v)  Safely publish a new pointer

Write Side

Function            Description
------------------  -------------------------------------------------------------------------
call_rcu(head, fn)  Register a callback for after the next grace period
synchronize_rcu()   Block until all pre-existing readers have completed, then drain callbacks
rcu_report_qs()     Report a quiescent state for the current CPU

Grace Period Mechanism

synchronize_rcu():
  1. Increment gp_seq
  2. Broadcast new gp_seq to all per-CPU rcu_gp_seq fields
  3. Spin until every CPU's rcu_qs_seq == gp_seq
  4. Drain callback list

rcu_report_qs() must be called from the idle loop and any long-running non-read-side context.


2. SRCU

Sleepable RCU — allows read-side critical sections to sleep. Each SRCU domain (srcu_struct_t) is independent.

Read Side

Function                  Description
------------------------  ------------------------------------------
srcu_read_lock(s)         Enter SRCU read section; returns slot index
srcu_read_unlock(s, idx)  Exit SRCU read section

Write Side

Function                Description
----------------------  ------------------------------------------
synchronize_srcu(s)     Wait for all pre-existing readers; may spin
call_srcu(s, head, fn)  Synchronize then invoke callback

Two-Slot Mechanism

Active slot: s->idx (0 or 1)

srcu_read_lock:   increment pcpu[cpu].c[s->idx]
srcu_read_unlock: decrement pcpu[cpu].c[idx]

synchronize_srcu:
  1. Flip s->idx (new readers use new slot)
  2. Wait until sum of pcpu[*].c[old_idx] == 0
  3. Increment gp_seq
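The two-slot scheme can be modelled with plain counters. The sketch below is a single-threaded model, not the kernel's implementation: all names carry a srcu_model_ prefix to mark them as hypothetical, the CPU count is an assumption, and real code would need atomic counters and memory barriers.

```c
#include <assert.h>
#include <stdbool.h>

#define SRCU_MODEL_CPUS 4u   /* ASSUMPTION: CPU count for the model */

typedef struct srcu_model {
    unsigned      idx;                      /* active slot: 0 or 1      */
    unsigned long gp_seq;                   /* completed grace periods  */
    long          c[SRCU_MODEL_CPUS][2];    /* per-CPU, per-slot counts */
} srcu_model_t;

/* Reader entry: count in the active slot and remember which one. */
static int srcu_model_read_lock(srcu_model_t *s, unsigned cpu)
{
    int idx = (int)s->idx;
    s->c[cpu][idx]++;
    return idx;
}

/* Reader exit: may run on a different CPU than the matching lock. */
static void srcu_model_read_unlock(srcu_model_t *s, unsigned cpu, int idx)
{
    s->c[cpu][idx]--;
}

/* Step 1 of synchronize: flip the slot; returns the old slot. */
static unsigned srcu_model_flip(srcu_model_t *s)
{
    unsigned old = s->idx;
    s->idx = old ^ 1u;
    return old;
}

/* Steps 2 and 3: if the old slot has drained, complete the grace period. */
static bool srcu_model_try_complete(srcu_model_t *s, unsigned old)
{
    long sum = 0;
    for (unsigned cpu = 0; cpu < SRCU_MODEL_CPUS; cpu++)
        sum += s->c[cpu][old];
    if (sum != 0)
        return false;
    s->gp_seq++;
    return true;
}
```

Only the sum across CPUs is meaningful: a reader that sleeps and resumes on another CPU leaves +1 on one CPU and -1 on another, which is why the wait condition is defined on the sum.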

Initialization

DEFINE_SRCU(my_domain);          // static
srcu_init(&my_domain);           // runtime

Kernel Object Management System

Document: ZXF-KRN-KOMS-001
Revision: 1.0
Status: Released


1. Purpose

The Kernel Object Management System (KOMS) is the unified abstraction layer for all reference-counted kernel objects. It defines a single base type, kobject_t, that any subsystem may embed to obtain lifecycle management, naming, attribute storage, event delivery, and hierarchical organization at no additional per-subsystem cost.


2. Architectural Position

KOMS sits immediately above the memory allocator and synchronization primitives, and below all subsystems that manage named, reference-counted resources.

┌─────────────────────────────────────────────────────┐
│  Subsystems  (IRQ, VMM, Device, Task, File, …)      │
├─────────────────────────────────────────────────────┤
│  KOMS  (koms.h / koms.c)                            │
├──────────────┬──────────────┬───────────────────────┤
│  kmalloc /   │  spinlock /  │  RCU                  │
│  slab        │  rwlock      │                       │
└──────────────┴──────────────┴───────────────────────┘

KOMS is initialized once, after kmalloc_init(), before any subsystem that registers a type or allocates a managed object.


3. Core Concepts

3.1 kobject_t

Every managed object embeds kobject_t as its first member. The base object carries:

  • An atomic reference counter (kref_t).
  • A mandatory operations table (kobject_ops_t) with a release callback.
  • A lifecycle state (KOBJECT_UNINITIALIZED, KOBJECT_ALIVE, KOBJECT_DEAD).
  • A static name string.
  • A 32-bit type identifier.
  • A 32-bit flags word.
  • Intrusive list nodes for parent/child hierarchy, namespace membership, attributes, and event listeners.
  • An embedded spinlock_t protecting the mutable extension fields.
  • An rcu_head_t for deferred free.

The kobject_container() macro recovers the containing struct from a kobject_t * pointer using compile-time offset arithmetic.
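Container recovery is the usual offsetof pattern. The sketch below is illustrative: the macro's argument list, the trimmed kobject_t stand-in, and the my_dev_t example type are assumptions, not the definitions in koms.h.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed stand-in for the real base object. */
typedef struct kobject {
    unsigned    refcount;
    const char *name;
} kobject_t;

/* ASSUMPTION: argument order mirrors the common container_of idiom. */
#define kobject_container(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical subsystem object embedding kobject_t as its first member. */
typedef struct my_dev {
    kobject_t kobj;
    int       unit;
} my_dev_t;

static my_dev_t *to_my_dev(kobject_t *k)
{
    return kobject_container(k, my_dev_t, kobj);
}
```

With kobject_t as the first member the offset is zero, so the macro reduces to a cast; writing it with offsetof keeps it correct if that ever changes.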

3.2 Type Registry

A kobj_type_t descriptor is registered once at boot per object class. It carries:

Field     Purpose
--------  ----------------------------------------------------------
type_id   Globally unique 32-bit identifier
name      Human-readable string for diagnostics
obj_size  sizeof of the containing struct
cache     Optional dedicated slab cache
kobj_ops  Mandatory ops table (must provide release)
type_ops  Optional extended vtable (init, destroy, ns_add, ns_remove)

After koms_init() the registry is append-only and read locklessly.

3.3 Namespace

A kobj_ns_t is an RCU-protected hash table of kobject_t pointers, keyed by name. Namespaces form a tree rooted at koms_root_ns.

koms_root_ns
├── "irq"
│   ├── "ext-0x40"
│   └── "pgm-0x0d"
├── "vmm"
│   └── "kernel"
└── "device"
    └── "dasd-0"

Reads use rcu_read_lock() and are fully lockless. Writes acquire the namespace's write_lock (spinlock, irqsave).

3.4 Attributes

Attributes are kobj_attr_t nodes linked into kobject_t::attrs. Each attribute has a name and optional get/set callbacks. The attribute list is protected by kobject_t::lock.

3.5 Event Bus

Events are typed (kobj_event_type_t) and carry a payload union. Listeners (kobj_listener_t) are registered per-object with an optional event-type bitmask filter. Dispatch snapshots the listener list under the object lock, then calls each listener without the lock, preventing deadlocks on re-entrant dispatch. Events propagate up the parent chain automatically.


4. Lifecycle

         koms_alloc()
              │
              ▼
        [refcount = 0]
              │
        koms_init_obj()
              │
              ▼
        KOBJECT_ALIVE  ◄──── koms_get()
        [refcount = 1]
              │
        koms_put() × N
              │
        [refcount = 0]
              │
              ▼
         KOBJECT_DEAD
              │
         ops->release()
              │
              ▼
          koms_free()

koms_freeze() sets KOBJ_FLAG_FROZEN, causing koms_get_unless_dead() to fail without affecting existing references. This enables controlled teardown: freeze the object, wait for all external references to drain, then drop the final reference.
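The get/put/freeze interaction can be modelled with a few lines of C11 atomics. The sketch below uses stand-in names (kobj_sketch_t, kobj_init, kobj_get_unless_dead, kobj_put) rather than the real koms_* signatures, which are not given here; the flag value is also an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef enum { KOBJECT_UNINITIALIZED, KOBJECT_ALIVE, KOBJECT_DEAD } kobject_state_t;
#define KOBJ_FLAG_FROZEN 0x1u   /* ASSUMPTION: numeric value */

typedef struct kobj_sketch kobj_sketch_t;
struct kobj_sketch {
    atomic_uint     refcount;
    unsigned        flags;
    kobject_state_t state;
    void          (*release)(kobj_sketch_t *);
};

/* Mirrors the init step: the object becomes ALIVE with one reference. */
static void kobj_init(kobj_sketch_t *k, void (*release)(kobj_sketch_t *))
{
    atomic_init(&k->refcount, 1u);
    k->flags = 0;
    k->state = KOBJECT_ALIVE;
    k->release = release;
}

/* Fails for frozen objects and for objects whose count already hit zero. */
static bool kobj_get_unless_dead(kobj_sketch_t *k)
{
    if (k->flags & KOBJ_FLAG_FROZEN)
        return false;
    unsigned old = atomic_load(&k->refcount);
    while (old != 0) {
        if (atomic_compare_exchange_weak(&k->refcount, &old, old + 1u))
            return true;   /* on failure `old` is reloaded and we retry */
    }
    return false;
}

/* Dropping the last reference marks the object DEAD and releases it. */
static void kobj_put(kobj_sketch_t *k)
{
    if (atomic_fetch_sub(&k->refcount, 1u) == 1u) {
        k->state = KOBJECT_DEAD;
        k->release(k);
    }
}

static unsigned released;   /* demo counter for the release callback */
static void count_release(kobj_sketch_t *k) { (void)k; released++; }
```

Freezing blocks new references without touching the count, so existing holders can still drop theirs and the final put triggers release as usual.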


5. Allocation Strategy

koms_alloc(type, gfp)
    │
    ├─ type->cache != nullptr ──► kmem_cache_alloc(type->cache, gfp | ZERO)
    │
    └─ type->cache == nullptr ──► kzalloc(type->obj_size, gfp)

koms_free() dispatches symmetrically. The KOBJ_FLAG_KOMS_ALLOC flag distinguishes heap-allocated objects from statically embedded ones.


6. Thread Safety Summary

Operation             Mechanism
--------------------  ----------------------------------------
Reference count       Lock-free (CS instruction)
Attribute list        kobject_t::lock (spinlock, irqsave)
Listener list         kobject_t::lock (spinlock, irqsave)
Child list            kobject_t::lock (spinlock, irqsave)
Namespace reads       rcu_read_lock() (lockless)
Namespace writes      kobj_ns_t::write_lock (spinlock, irqsave)
Type registry reads   Lockless (append-only after boot)
Type registry writes  type_registry_lock (spinlock, irqsave)

7. Integration Guide

To integrate a subsystem with KOMS:

  1. Embed kobject_t as the first member of the subsystem struct.
  2. Define a kobject_ops_t with a release callback that calls koms_free().
  3. Optionally define a kobj_type_ops_t for init/destroy hooks.
  4. Define and register a kobj_type_t from the subsystem's init function.
  5. Allocate objects with koms_alloc() and initialize with koms_init_obj().
  6. Use koms_get() / koms_put() for reference management.
  7. Optionally register in a namespace with koms_ns_add().

8. Initialization Order

KOMS must be initialized after kmalloc_init() and before any subsystem that calls koms_type_register() or koms_alloc().

pmm_init → cma_init → mmu_init → vmm_init → slab_init → kmalloc_init
    → koms_init → smp_init → [subsystem inits]

Red-Black Tree

Document Revision: 26h1.1
Source: lib/rbtree.c, include/lib/rbtree.h


1. Overview

ZXFoundation™ provides a layered intrusive red-black tree library. Each layer is a strict superset of the one below it; callers of lower layers require no modification when higher layers are added.

Layer  Name           Type               Concurrency
-----  -------------  -----------------  -------------------------------------------------
0      Core           rb_root_t          None (caller-managed)
1      Augmented      rb_root_aug_t      None (caller-managed)
2      RCU-protected  rcu_rb_root_t      Lockless readers, serialised writers
2A     RCU-augmented  rcu_rb_root_aug_t  Lockless readers, serialised writers + propagation
3      Per-CPU cached rb_pcpu_cache_t    O(1) fast path per CPU

The tree is intrusive: the caller embeds rb_node_t (or rb_node_aug_t) inside its own struct and recovers the container with rb_entry(). The colour bit is packed into bit 0 of the parent pointer, keeping rb_node_t at exactly 24 bytes.


2. Node Layout

rb_node_t (24 bytes)
┌──────────────────────────┐
│ left             (8 B)   │  pointer to left child
│ right            (8 B)   │  pointer to right child
│ parent_and_color (8 B)   │  parent ptr | colour bit (bit 0)
└──────────────────────────┘

rb_node_aug_t (32 bytes)
┌──────────────────────────┐
│ node  (rb_node_t, 24 B)  │  must be at offset 0 — cast-compatible
│ subtree_max_gap  (8 B)   │  maintained by propagate callback
└──────────────────────────┘

All rb_node_t pointers are 8-byte aligned on s390x, so bit 0 of any valid pointer is always zero and is free for colour storage.
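Packing the colour into the parent pointer can be sketched as follows. The helper names and the choice of 0 for red and 1 for black are assumptions; the field names and the 24-byte size come from the layout above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct rb_node {
    struct rb_node *left;
    struct rb_node *right;
    uintptr_t       parent_and_color;  /* parent pointer | colour in bit 0 */
} rb_node_t;

/* ASSUMPTION: colour encoding chosen for this sketch. */
#define RB_RED   0u
#define RB_BLACK 1u

/* Holds on 64-bit targets such as s390x. */
static_assert(sizeof(void *) != 8 || sizeof(rb_node_t) == 24,
              "rb_node_t must be 24 bytes on 64-bit targets");

static inline rb_node_t *rb_parent(const rb_node_t *n)
{
    return (rb_node_t *)(n->parent_and_color & ~(uintptr_t)1);
}

static inline unsigned rb_color(const rb_node_t *n)
{
    return (unsigned)(n->parent_and_color & 1u);
}

/* Relies on `parent` being at least 2-byte aligned so bit 0 is free. */
static inline void rb_set_parent_color(rb_node_t *n, rb_node_t *parent,
                                       unsigned color)
{
    n->parent_and_color = (uintptr_t)parent | (color & 1u);
}
```

Because a recolouring rewrites only bit 0 of this word, it leaves the left and right links that readers follow untouched.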


3. Layer 0 — Core

The core layer provides O(log n) insert, erase, and traversal with no locking. All operations are iterative (bounded stack depth).

Insert Protocol

walk tree → find (parent, link)
rb_link_node(node, parent, link)
rb_insert_fixup(tree, node)

Erase

rb_erase(tree, node)

Traversal

rb_first(tree)   →  minimum node
rb_last(tree)    →  maximum node
rb_next(node)    →  in-order successor
rb_prev(node)    →  in-order predecessor

rb_for_each(pos, tree)
rb_for_each_entry(pos, tree, member)

Container Recovery

rb_entry(ptr, type, member)
rb_entry_safe(ptr, type, member)   ← null-safe variant

4. Layer 1 — Augmented

The augmented layer adds a rb_aug_callbacks_t to rb_root_aug_t. After every structural change (insert, erase, rotation), propagate is invoked bottom-up from the affected node to the root.

Callers embed rb_node_aug_t instead of rb_node_t and maintain a per-node subtree aggregate in subtree_max_gap.

Callbacks

propagate(node)          recompute node->subtree_max_gap from children
copy(dst, src)           copy aggregate when successor replaces deleted node

copy is required when the two-child erase case physically moves the successor into the deleted node's position. Without it the successor would carry a stale aggregate into its new location.

Propagation Order

structural change at node L
        │
        ▼
propagate(L)          ← children already up-to-date
        │
        ▼
propagate(parent(L))
        │
        ▼
        …  (up to root)

API

rb_root_aug_t root = RB_ROOT_AUG_INIT(&my_callbacks);

rb_insert_aug(&root, node, parent, link);
rb_erase_aug(&root, node);

5. Layer 2 — RCU-Protected

rcu_rb_root_t wraps rb_root_t with a write-side spinlock. Readers use the RCU lockless path; writers serialise through the lock and publish pointer updates via rcu_assign_pointer().

Concurrency Model

Reader                          Writer
──────────────────────          ──────────────────────────────
rcu_read_lock()                 spin_lock_irqsave(&root->lock)
  node = rcu_rb_find(...)         rb_erase(...)
  // use node safely              rcu_assign_pointer(root, ...)
rcu_read_unlock()               spin_unlock_irqrestore(...)
                                call_rcu(head, free_fn)

rcu_assign_pointer() issues smp_mb() before the store. rcu_dereference() issues a compiler barrier after each pointer load, preventing the compiler from collapsing multiple loads of the same pointer.

Erase and Grace Period

rcu_rb_erase(root, node, head, free_fn)
    ├─ unlink node under lock
    ├─ rcu_assign_pointer(...)   ← publish updated tree
    └─ call_rcu(head, free_fn)   ← free after grace period

6. Layer 2A — RCU-Augmented

rcu_rb_root_aug_t composes Layer 1 and Layer 2 under a single write lock. The lock covers both rebalancing and aggregate propagation atomically.

Key invariant: readers always observe a tree where subtree_max_gap is consistent with the pointer structure they see, because both are updated under the same lock before rcu_assign_pointer() publishes the result.

rcu_rb_aug_find_gap() performs an O(log n) free-gap search by pruning subtrees whose subtree_max_gap is smaller than the requested size:

find_gap(root, size, align, lo, hi):

  cursor = lo
  n = root

  while n:
    if n.left and n.left.subtree_max_gap >= size:
      descend left            ← prune right subtree entirely
      continue

    aligned = align_up(cursor, align)
    if aligned + size <= n.start:
      return aligned          ← gap found left of n

    cursor = max(cursor, n.end)
    n = n.right               ← no gap left of n; try right

  aligned = align_up(cursor, align)
  if aligned + size <= hi:
    return aligned            ← gap after last node

  return 0                    ← no gap found

This replaces the former O(n) linear scan. The caller supplies node_start and node_end accessors, making the search generic over any interval type.

API

rcu_rb_root_aug_t root = RCU_RB_ROOT_AUG_INIT(&my_callbacks);

rcu_rb_aug_insert(&root, node, parent, link);
rcu_rb_aug_erase(&root, node, head, free_fn);

// Under lock or rcu_read_lock():
uint64_t addr = rcu_rb_aug_find_gap(&root, size, align, lo, hi,
                                    node_start_fn, node_end_fn);

7. Layer 3 — Per-CPU Cached

rb_pcpu_cache_t is a per-CPU array of (hint, hint_key) pairs. On a cache hit the search returns in O(1) without touching the tree.

rb_find_cached(root, cache, cmp, arg):

  cpu  = current_cpu()
  hint = cache[cpu].hint

  if hint != NULL && cmp(hint, arg) == 0:
    return hint               ← O(1) fast path

  // full O(log n) walk
  result = tree_walk(root, cmp, arg)
  cache[cpu].hint = result
  return result

The hint is opportunistic — it may be stale. The comparator validates it before the result is returned.

Invalidation

rb_cache_invalidate(cache, node)        O(MAX_CPUS) — call before erase
rb_cache_invalidate_local(cache)        O(1)        — current CPU only

rb_cache_invalidate() must be called before rb_erase() or rcu_rb_aug_erase() on any node in a cached tree to prevent dangling hint pointers.
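
The invalidation rule above can be illustrated with a minimal, single-threaded sketch. The slot layout, the MAX_CPUS value, and the sketch_* function names are assumptions made for illustration; the real rb_pcpu_cache_t and its concurrency handling (atomic stores, memory ordering) are not shown in this guide.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPUS 4                 /* assumed small value for the sketch */

typedef struct rb_node { int key; } rb_node_t;   /* stand-in node type */

typedef struct {
    rb_node_t *hint;               /* last node found on this CPU, or NULL */
    int        hint_key;           /* key the hint was cached for */
} sketch_pcpu_slot_t;

typedef struct {
    sketch_pcpu_slot_t slot[MAX_CPUS];
} sketch_pcpu_cache_t;

/* Clear every per-CPU hint that still points at `node`: O(MAX_CPUS). */
static void sketch_cache_invalidate(sketch_pcpu_cache_t *cache,
                                    const rb_node_t *node)
{
    assert(cache != NULL);
    for (size_t cpu = 0; cpu < MAX_CPUS; cpu++) {
        if (cache->slot[cpu].hint == node)
            cache->slot[cpu].hint = NULL;
    }
}

/* Clear only one CPU's hint: O(1). The CPU index is passed explicitly here. */
static void sketch_cache_invalidate_local(sketch_pcpu_cache_t *cache, size_t cpu)
{
    assert(cache != NULL && cpu < MAX_CPUS);
    cache->slot[cpu].hint = NULL;
}
```

Calling the full invalidation before the erase means no CPU can return the erased node from its fast path, at the cost of one O(log n) walk on the next lookup.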


8. RB-Tree Invariants

The implementation maintains the four standard invariants after every operation:

  1. Every node is RED or BLACK.
  2. The root is BLACK.
  3. Every RED node has two BLACK children.
  4. Every path from a node to a null leaf contains the same number of BLACK nodes.

Insert fixup resolves double-red violations with at most 2 rotations and O(log n) recolourings. Erase fixup resolves double-black violations with at most 3 rotations and O(log n) recolourings. Recolourings do not change pointer structure and are invisible to RCU readers.
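
As an illustration of the four invariants, a debug-style checker might look like the following. This is a sketch over a stand-in node type (rb_check_node_t and all function names are hypothetical), not the kernel's own rb_node_t or any checker it may ship.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { RB_RED, RB_BLACK } rb_color_t;

typedef struct rb_check_node {
    rb_color_t            color;
    struct rb_check_node *left;
    struct rb_check_node *right;
} rb_check_node_t;

static bool is_black(const rb_check_node_t *n)
{
    return n == NULL || n->color == RB_BLACK;    /* null leaves count as BLACK */
}

/* Returns the black height of the subtree, or -1 if invariant 3 or 4 fails. */
static int rb_black_height(const rb_check_node_t *n)
{
    if (n == NULL)
        return 1;                                /* null leaf */
    if (n->color == RB_RED && (!is_black(n->left) || !is_black(n->right)))
        return -1;                               /* RED node with a RED child */
    int lh = rb_black_height(n->left);
    int rh = rb_black_height(n->right);
    if (lh < 0 || rh < 0 || lh != rh)
        return -1;                               /* unequal BLACK counts */
    return lh + (n->color == RB_BLACK ? 1 : 0);
}

/* Invariant 1 holds by construction of rb_color_t; check 2, 3, and 4. */
static bool rb_check_invariants(const rb_check_node_t *root)
{
    if (root != NULL && root->color != RB_BLACK)
        return false;                            /* root must be BLACK */
    return rb_black_height(root) > 0;
}

/* Debug helper: abort if the tree violates any invariant. */
static void rb_assert_valid(const rb_check_node_t *root)
{
    assert(rb_check_invariants(root));
}
```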


9. Constraints

  • rb_node_aug_t::node must be at offset 0. The _Static_assert in the header enforces this.
  • rb_aug_callbacks_t::copy may be nullptr only if the caller guarantees no two-child erase will occur. For general use it must be provided.
  • rb_cache_invalidate() must be called before erasing a node from any cached tree.
  • rcu_rb_aug_find_gap() may be called under rcu_read_lock() for a best-effort result, or under the write lock for a guaranteed-current result.
  • synchronize_rcu() may block indefinitely if a CPU never reports a quiescent state. Callers of rcu_rb_aug_erase() must ensure rcu_report_qs() is called from the idle loop and scheduler tick.

Time Subsystem

Document: ZXF-KRN-TIME-001 Revision: 26h1.0 Status: Draft


1. Overview

The time subsystem provides three services to the rest of the kernel:

  1. Monotonic kernel time (ktime_t): nanoseconds since boot, readable from any context.
  2. Scheduler preemption: the CPU timer fires EXT 0x1005 every 10 ms to enforce quanta.
  3. Deferred execution: the clock comparator fires EXT 0x1004 to advance the per-CPU timer wheel.

All hardware access (STCKF, SPT, STPT, SCKC, STCKC, CR0 manipulation) is confined to arch/s390x/time/tod.c. The portable kernel layer in zxfoundation/time/ calls only the functions declared in include/arch/s390x/time/tod.h.


2. Hardware Sources

z/Architecture provides three time mechanisms, one global and two per-CPU:

Source            Instruction   Type               Resolution   Kernel use
TOD clock         STCKF         Global, monotonic  ~0.244 ns    ktime_get(), sleep deadline
CPU timer         SPT / STPT    Per-CPU countdown  Same as TOD  Scheduler quantum (10 ms)
Clock comparator  SCKC / STCKC  Per-CPU absolute   Same as TOD  Timer wheel advance

The TOD clock is shared across all CPUs and is monotonic. STCKF reads it without serialization and is safe from hard-IRQ context.


3. TOD Unit Conversion

1 TOD unit = 1000/4096 ns = 125/512 ns

ktime_ns = tod_delta × 125 / 512
tod_units = ns × 512 / 125

Constants used throughout the subsystem:

TOD_1MS  = 4 096 000 units
TOD_10MS = 40 960 000 units
TOD_1S   = 4 096 000 000 units
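
The conversion formulas and constants above can be written in C roughly as follows. This is a sketch: ns_to_tod() is named in the ktime_sleep() pseudocode, but the signatures here and the tod_to_ns() name are assumptions. The division is split into quotient and remainder so that the intermediate product cannot overflow 64 bits; the result is still exactly floor(x × 125 / 512) or floor(x × 512 / 125).

```c
#include <assert.h>
#include <stdint.h>

#define TOD_1MS   UINT64_C(4096000)
#define TOD_10MS  UINT64_C(40960000)
#define TOD_1S    UINT64_C(4096000000)

/* TOD units -> nanoseconds: ns = tod * 125 / 512, split to avoid overflow. */
static uint64_t tod_to_ns(uint64_t tod)
{
    return (tod / 512) * 125 + (tod % 512) * 125 / 512;
}

/* Nanoseconds -> TOD units: tod = ns * 512 / 125, split to avoid overflow.
 * The result itself exceeds 64 bits only for intervals beyond ~142 years. */
static uint64_t ns_to_tod(uint64_t ns)
{
    return (ns / 125) * 512 + (ns % 125) * 512 / 125;
}

/* The three constants must stay consistent with each other. */
static_assert(TOD_10MS == 10 * TOD_1MS, "TOD_10MS must be ten TOD_1MS");
static_assert(TOD_1S == 1000 * TOD_1MS, "TOD_1S must be 1000 TOD_1MS");
```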

4. Initialization Sequence

BSP:
  time_init()
    tod_set_boot_offset(STCKF)   ← recorded once; never modified
    timer_wheel_init()           ← per-CPU wheel, level/slot arrays zeroed
    tod_enable_ext_interrupts()  ← CR0 bits 52+53 set
    tod_cpu_timer_set(-10ms)     ← first quantum armed
    tod_clock_comparator_set(now + 1s)  ← safe initial value

Each AP (from ap_startup):
  time_init_ap()
    timer_wheel_init()
    tod_enable_ext_interrupts()
    tod_cpu_timer_set(-10ms)
    tod_clock_comparator_set(now + 1s)

tod_boot_offset is set on the BSP before any AP is started. APs call ktime_get() using the same offset — this is correct because the TOD clock is global.


5. Interrupt Dispatch

The EXT interrupt handler (do_ext_interrupt) intercepts the two time-critical subclasses before the generic irq_dispatch() path:

do_ext_interrupt:
  ext_code = lowcore.ext_int_code
  if ext_code == 0x1005      → time_cpu_timer_handler()          // CPU timer
  else if ext_code == 0x1004 → time_clock_comparator_handler()   // clock comparator
  else                       → irq_dispatch(ZX_IRQ_BASE_EXT + ext_code, frame)

This avoids routing through the irqdesc table, whose 0x0400-entry limit cannot accommodate the full 16-bit EXT subclass space.


6. Timer Wheel

6.1 Structure

8 levels × 64 slots per CPU. Level 0 has 1 ms slot width; each subsequent level is 64× wider.

Level 0: slot = 1 ms,   range = 64 ms
Level 1: slot = 64 ms,  range = ~4 s
Level 2: slot = ~4 s,   range = ~4 min
Level 3: slot = ~4 min, range = ~4.7 h
...
Level 7: slot = ~139 y, range = ~8 900 y

6.2 Placement

A timer with expiry delta d from now is placed in the lowest level l such that d < range(l), at slot (current_slot[l] + d/slot_width[l] + 1) % 64.
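
The placement rule can be sketched in C as below, with the delta expressed in milliseconds. All names (wheel_pos_t, wheel_place, and the WHEEL_* macros) are hypothetical, and clamping deltas beyond the last level's range into the top level is an assumption of this sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WHEEL_LEVELS 8
#define WHEEL_SLOTS  64
#define WHEEL_SHIFT  6             /* log2(WHEEL_SLOTS) */

typedef struct {
    unsigned level;
    unsigned slot;
} wheel_pos_t;

/* Slot width of a level in milliseconds: 64^level. */
static uint64_t wheel_slot_width_ms(unsigned level)
{
    return UINT64_C(1) << (WHEEL_SHIFT * level);
}

/* Range of a level in milliseconds: 64 slots of width 64^level. */
static uint64_t wheel_range_ms(unsigned level)
{
    return UINT64_C(1) << (WHEEL_SHIFT * (level + 1));
}

/* Pick the lowest level whose range covers delta_ms, then the slot within it. */
static wheel_pos_t wheel_place(uint64_t delta_ms,
                               const unsigned current_slot[WHEEL_LEVELS])
{
    assert(current_slot != NULL);
    unsigned level = 0;
    while (level < WHEEL_LEVELS - 1 && delta_ms >= wheel_range_ms(level))
        level++;                   /* assumed: clamp to the top level */
    uint64_t offset = delta_ms / wheel_slot_width_ms(level) + 1;
    wheel_pos_t pos = {
        level,
        (unsigned)((current_slot[level] + offset) % WHEEL_SLOTS)
    };
    return pos;
}
```

For example, with all current slots at 0, a 100 ms delta exceeds the 64 ms range of level 0 and lands in level 1 at slot 100/64 + 1 = 2.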

6.3 Advance

On EXT 0x1004 (clock comparator), timer_wheel_advance(now) steps level 0 slot by slot, firing all expired timers. When level 0 completes a full revolution, it cascades timers from level 1 into lower levels, and so on.

6.4 Constraints

  • All wheel operations require IRQs disabled on the calling CPU.
  • Callbacks execute in hard-IRQ context. They must not block or acquire locks held by process context.

7. ktime_sleep()

Current implementation is a busy-wait:

deadline = STCKF + ns_to_tod(ns)
SCKC(deadline)
while STCKF < deadline: cpu_relax()

This is correct for early boot and short delays. Once the scheduler is operational, this will be replaced with a block/wake implementation using the timer wheel.


8. Strict Requirements

#       Requirement
TIME-1  ktime_get() is callable from any context. No lock, no sleep.
TIME-2  Timer callbacks execute in hard-IRQ context. No blocking, no process-context locks.
TIME-3  CPU timer must be reloaded on every time_cpu_timer_handler() invocation.
TIME-4  Clock comparator must be reprogrammed after every timer_wheel_advance() call.
TIME-5  tod_boot_offset is set once in time_init() and never modified.
TIME-6  time_init_ap() must be called on every AP before the AP enters its idle loop.

Scheduler

Subsystem Stubs

Document Revision: 26h1.1


The following subsystems have source directories and header files but are not yet implemented.


IRQ (arch/s390x/irq/)

Handles I/O interrupts from the channel subsystem. The I/O new PSW at lowcore 0x1F0 must point to the I/O interrupt handler. The handler calls TSCH to read the IRB and dispatches to the appropriate device driver.

Status: Stub — new PSW installed as disabled-wait.


Time (arch/s390x/time/)

Provides kernel timekeeping using the TOD (Time-of-Day) clock. The TOD clock is a 64-bit counter in which bit 51 increments once per microsecond, so the low-order bit represents 1/4096 µs (4096 units per microsecond). The boot timestamp is available in proto->tod_boot. The clock comparator interrupt (external interrupt subclass) drives the scheduler tick once the IRQ subsystem is active.

Status: Stub.

Build System Overview

Document Revision: 26h1.0


1. Prerequisites

Tool                Minimum version     Notes                        Required
CMake               3.10                Build system generator       yes
Compiler and tools  toolchain-specific  See toolchains.md            partly
Ninja               any                 Recommended generator        optional
dasdload            any                 Needed for image generation  optional
Hercules            4.x                 Helpful for development      optional

2. Output Artifacts

Artifact                       Description                                  Converted from
core.zxfoundationloader00.sys  Stage 0 IPL record (tape format)             zxfl_stage1.elf → zxfl_stage1.bin
core.zxfoundationloader01.sys  Stage 1 flat binary                          zxfl_stage2.elf
core.zxfoundation.nucleus      Kernel ELF64 (SHA-256 checksums patched in)  N/A
sysres.3390                    Hercules 3390 DASD image                     N/A
bin2rec                        Host tool                                    N/A
zxsign                         Host tool                                    N/A

3. CMake Modules

Module                            Purpose
cmake/dependencies.cmake          Host dependency checks
cmake/configuration.cmake         OPT_LEVEL, DSYM_LEVEL cache variables
cmake/platform.cmake              Platform detection
cmake/standard.cmake              C standard enforcement
cmake/hosttools.cmake             Build bin2rec and zxsign with host compiler
cmake/source.cmake                Kernel source file lists (ZX_SOURCES_64)
cmake/zxfl-compile.cmake          ZXFL Stage 0 and Stage 1 targets
cmake/zxfoundation-compile.cmake  Kernel nucleus target
cmake/run.cmake                   dasd target; generates sysres.3390

4. Build Order

CMake enforces the following dependency chain:

tools  (bin2rec, zxsign — host compiler)
  │
  ├─► zxfl_stage1.elf
  │     └─► zxfl_stage1.bin  (objcopy)
  │           └─► core.zxfoundationloader00.sys  (bin2rec)
  │
  ├─► zxfl_stage2.elf
  │     └─► core.zxfoundationloader01.sys  (objcopy)
  │
  └─► core.zxfoundation.nucleus
        └─► zxsign patches .zxvl_checksums in-place
              └─► sysres.3390  (dasdload)

Host tools are always compiled first with ZX_HOST_CC. The kernel and loader are compiled with the cross-compiler.


5. Configuration Variables (toolchain-independent; for toolchain-specific variables, see toolchains.md)

Variable    Default  Description
OPT_LEVEL   2        -O level for all targets
DSYM_LEVEL  0        -g level (0 = no debug info)

Override at configure time:

cmake -B build \
  -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-clang.cmake \
  -DOPT_LEVEL=3

Toolchains

Document Revision: 26h1.0


1. Clang (cmake/toolchain/zxfoundation-clang.cmake)

Uses LLVM's built-in cross-compilation support — no separate cross-compiler installation is required on most systems.

cmake -B build \
  -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-clang.cmake \
  -DMARCH_MODE=z14

Role        Tool
C compiler  clang (or clang-$CLANG_VERSION)
Linker      ld.lld
Archiver    llvm-ar
objcopy     llvm-objcopy
Host CC     clang

Set CLANG_VERSION in the environment to select a versioned binary (e.g. CLANG_VERSION=18 → clang-18). If unset, unversioned clang is used.

The target triple --target=s390x-unknown-none-elf is passed as a compile option (not via CMAKE_C_COMPILER_TARGET) to avoid CMake's compiler detection interfering with the freestanding build.


2. GCC (cmake/toolchain/zxfoundation-gcc.cmake)

Requires an s390x-ibm-linux-gnu-* cross-compiler toolchain installed on the host.

cmake -B build \
  -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-gcc.cmake

Role        Tool
C compiler  s390x-ibm-linux-gnu-gcc
Linker      s390x-ibm-linux-gnu-ld
Archiver    s390x-ibm-linux-gnu-ar
objcopy     s390x-ibm-linux-gnu-objcopy
Host CC     gcc

GCC-specific flags added to the kernel target:

Flag                             Reason
-static-libgcc                   Avoid libgcc DSO dependency
-Wno-array-bounds                Suppress false positives from GCC's array-bounds analysis on lowcore pointer casts
-fno-delete-null-pointer-checks  The kernel legitimately dereferences physical address 0x0 (the lowcore)
-mzarch                          Force z/Architecture mode

3. Common Compiler Flags

Applied to all targets (loader and kernel):

Flag                             Reason
-ffreestanding                   No hosted C library assumptions
-nostdlib                        No implicit library linking
-fno-builtin                     Do not treat standard library function names (memcpy, strlen, ...) as compiler builtins
-fno-strict-aliasing             Kernel code casts between unrelated pointer types
-fwrapv                          Signed integer overflow wraps (defined behavior)
-ftrivial-auto-var-init=pattern  Auto-initialize locals to a poison pattern; catches use-before-init
-fno-stack-protector             No __stack_chk_guard (freestanding, no libc)
-msoft-float                     No FPU use in kernel
-mno-vx                          No vector instructions in kernel

Kernel-only additional flag:

Flag            Reason
-mpacked-stack  Use packed register save areas (reduces stack frame size)

4. Custom Toolchain

To use a non-standard toolchain, copy one of the provided toolchain files and adjust the compiler/linker paths. The following CMake variables must be set:

Variable               Description
CMAKE_C_COMPILER       Path to the C compiler
CMAKE_LINKER           Path to the linker
CMAKE_OBJCOPY          Path to objcopy
ZX_HOST_CC             Host C compiler for building bin2rec and zxsign
COMPILER_ID            "clang" or "gcc" (selects compiler-specific flag sets)
TARGET_EMULATION_MODE  elf64_s390
MARCH_MODE             Target microarchitecture (e.g. z10, z14, z16)

Build Targets

Document Revision: 26h1.0


tools

Builds host-native bin2rec and zxsign using ZX_HOST_CC. This target is an implicit dependency of all other targets — it always runs first.


zxfl_stage1.elf → core.zxfoundationloader00.sys

Compiles Stage 0. Post-build steps:

  1. objcopy -O binary zxfl_stage1.elf zxfl_stage1.bin — strip ELF headers to raw binary.
  2. bin2rec zxfl_stage1.bin core.zxfoundationloader00.sys — wrap in DASD IPL record format.

The linker script stage1.ld enforces a 12 KB size limit with ASSERT. The build fails if this limit is exceeded.


zxfl_stage2.elf → core.zxfoundationloader01.sys

Compiles Stage 1. Post-build step:

  1. objcopy -O binary zxfl_stage2.elf core.zxfoundationloader01.sys — flat binary at 0x20000.

core.zxfoundation.nucleus

Compiles the kernel. Post-build step:

  1. zxsign core.zxfoundation.nucleus — computes SHA-256 for each PT_LOAD segment and patches the digests into the .zxvl_checksums ELF section in-place.

The kernel linker script is arch/s390x/init/link.ld.


dasd → sysres.3390

Requires dasdload (from the Hercules package) on PATH.

  1. Remove any existing sysres.3390.
  2. Copy scripts/etc.zxfoundation.parm to the build directory.
  3. Run dasdload -z scripts/sysres.conf sysres.3390 — create a 3390 (compressed) DASD image and write all datasets.
  4. Copy scripts/hercules.cnf to the build directory.

sysres.conf defines the dataset layout: Stage 0, Stage 1, nucleus, and parmfile.


Running

cmake --build build # builds everything, including the DASD image
hercules -f build/hercules.cnf

In the Hercules console:

ipl 0100

bin2rec

Document Revision: 26h1.0
Source: tools/bin2rec.c


1. Purpose

bin2rec converts a flat binary into the DASD IPL record format required by the Hercules dasdload utility and the z/Architecture channel subsystem.

bin2rec <input.bin> <output.sys>

2. Background

The z/Architecture IPL mechanism reads the first physical record from the IPL device and loads it into memory at address 0x0. The record must be in a specific format: each 80-byte card image contains a header identifying it as a text record (TXT) or end record (END), a load address, a byte count, and 56 bytes of data.

This format originates from the IBM card-punch era — the DASD IPL record format is a direct descendant of the punched-card object deck format.


3. Record Format

Each 80-byte record:

Bytes  Content
0      0x02 (record type marker)
1–3    TXT in EBCDIC (0xE3 0xE7 0xE3) or END (0xC5 0xD5 0xC4)
4      0x00
5–7    Load address (24-bit, big-endian)
8–9    0x00 0x00
10–11  Byte count (0x0038 = 56, big-endian)
12–15  0x00 0x00 0x00 0x00
16–71  56 bytes of binary data
72–79  0x00 × 8

The tool reads 56 bytes at a time from the input binary, wraps each chunk in a TXT record, and writes an END record at the end.
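
The record layout above can be illustrated with a short C sketch that builds one TXT record. The function name and macros are hypothetical and this is not the code in tools/bin2rec.c. Because the table lists a fixed byte count of 0x0038, the sketch zero-pads a short final chunk; whether bin2rec does the same is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REC_SIZE     80
#define REC_DATA_LEN 56

/* Fill `rec` with one TXT record carrying `len` (<= 56) bytes for `load_addr`. */
static void make_txt_record(uint8_t rec[REC_SIZE], uint32_t load_addr,
                            const uint8_t *data, size_t len)
{
    assert(data != NULL && len <= REC_DATA_LEN && load_addr <= 0xFFFFFFu);

    memset(rec, 0, REC_SIZE);              /* bytes 4, 8-9, 12-15, 72-79 stay 0x00 */
    rec[0] = 0x02;                         /* record type marker */
    rec[1] = 0xE3;                         /* 'T' in EBCDIC */
    rec[2] = 0xE7;                         /* 'X' in EBCDIC */
    rec[3] = 0xE3;                         /* 'T' in EBCDIC */
    rec[5] = (uint8_t)(load_addr >> 16);   /* 24-bit load address, big-endian */
    rec[6] = (uint8_t)(load_addr >> 8);
    rec[7] = (uint8_t)load_addr;
    rec[10] = 0x00;                        /* byte count 0x0038, big-endian */
    rec[11] = REC_DATA_LEN;
    memcpy(&rec[16], data, len);           /* assumed: short chunk is zero-padded */
}
```

A caller would invoke this once per 56-byte chunk, advancing load_addr by 56 each time, and then emit an END record.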


4. Limitations

  • Maximum input size: 32 KB (MAX_REC_SIZE = 32768). This effectively caps the size of zxfl_stage1.bin (the Stage 0 image) at 32 KB.
  • Load address is 24-bit — intentional. The IPL PSW is a 31-bit ESA/390 PSW; the channel subsystem loads the record into the low 16 MB.

zxsign

Document Revision: 26h1.0
Source: tools/zxsign.c


1. Purpose

zxsign is a post-build host tool that computes SHA-256 digests for each PT_LOAD segment of the kernel ELF and patches them into the .zxvl_checksums section in-place.

zxsign <core.zxfoundation.nucleus>

The file is modified in place. It must be a valid ELF64 file with a .zxvl_checksums section.


2. Operation

  1. Read and validate the ELF64 header (magic, EI_CLASS = ELFCLASS64).
  2. Locate .zxvl_checksums by walking the section header table and the section name string table.
  3. Collect all PT_LOAD program headers. Skip segments with p_filesz = 0 and the segment containing .zxvl_checksums itself (hashing the table while building it would be circular).
  4. For each remaining PT_LOAD segment, read p_filesz bytes from p_offset and compute SHA-256.
  5. Build a zxvl_checksum_table_t with magic 0x5A58564C, version 1, algorithm ZXVL_CKSUM_ALGO_SHA256, and one entry per segment. Physical addresses are computed by stripping CONFIG_KERNEL_VIRT_OFFSET from p_paddr.
  6. Seek to the file offset of .zxvl_checksums and write the complete table in one fwrite.
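
Step 1 of the sequence above can be sketched as a check on the first 16 bytes of the file. The function and constant names are hypothetical (prefixed SK_ to avoid clashing with a system elf.h); the numeric values are the standard ELF ones. The actual validation in tools/zxsign.c may check more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    SK_EI_CLASS   = 4,    /* index of the class byte in e_ident */
    SK_ELFCLASS64 = 2,    /* value marking a 64-bit object */
    SK_EI_NIDENT  = 16    /* size of e_ident */
};

/* Return true if `buf` starts with a plausible ELF64 identification block. */
static bool elf64_ident_ok(const uint8_t *buf, size_t len)
{
    if (buf == NULL || len < SK_EI_NIDENT)
        return false;                              /* too short for e_ident */
    if (buf[0] != 0x7F || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return false;                              /* bad magic */
    return buf[SK_EI_CLASS] == SK_ELFCLASS64;      /* must be ELFCLASS64 */
}
```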

3. Checksum Table Layout

zxvl_checksum_table_t (packed):
    uint32_t  magic;       // 0x5A58564C
    uint32_t  version;     // 0x00000001
    uint32_t  algo;        // 0x00000001 (SHA-256)
    uint32_t  count;       // number of entries
    entries[16]:
        uint64_t  phys_start;  // physical address of segment
        uint64_t  size;        // p_filesz
        uint8_t   digest[32];  // SHA-256

The table is located at load_min + ZXVL_CKSUM_TABLE_OFFSET (0x80000) in the loaded kernel. The bootloader reads it from physical memory after loading all ELF segments.
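
Read as a C declaration, the layout above might look like the following sketch. The entry type name, the two macros, and the validation helper are hypothetical; the packed attribute is the GCC/Clang spelling. Byte order is ignored here: s390x is big-endian, so a host tool running on a little-endian machine would need to swap fields when writing the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ZXVL_CKSUM_MAGIC       UINT32_C(0x5A58564C)   /* "ZXVL" in ASCII */
#define ZXVL_CKSUM_MAX_ENTRIES 16

typedef struct __attribute__((packed)) {
    uint64_t phys_start;        /* physical address of segment */
    uint64_t size;              /* p_filesz */
    uint8_t  digest[32];        /* SHA-256 */
} zxvl_checksum_entry_t;

typedef struct __attribute__((packed)) {
    uint32_t magic;             /* 0x5A58564C */
    uint32_t version;           /* 1 */
    uint32_t algo;              /* 1 = SHA-256 */
    uint32_t count;             /* number of valid entries */
    zxvl_checksum_entry_t entries[ZXVL_CKSUM_MAX_ENTRIES];
} zxvl_checksum_table_t;

/* Pin the on-disk layout so accidental padding is caught at compile time. */
static_assert(sizeof(zxvl_checksum_entry_t) == 48, "entry is 48 bytes");
static_assert(offsetof(zxvl_checksum_table_t, entries) == 16, "header is 16 bytes");
static_assert(sizeof(zxvl_checksum_table_t) == 16 + 16 * 48, "table is 784 bytes");

/* Header sanity check of the kind a loader could run before trusting entries. */
static bool zxvl_table_header_ok(const zxvl_checksum_table_t *t)
{
    return t != NULL && t->magic == ZXVL_CKSUM_MAGIC && t->version == 1 &&
           t->algo == 1 && t->count <= ZXVL_CKSUM_MAX_ENTRIES;
}
```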


4. Kernel Requirements

The kernel must define a .zxvl_checksums section anchored at the correct virtual address:

// "used" keeps the otherwise unreferenced static table from being discarded by the compiler
__attribute__((section(".zxvl_checksums"), used))
static volatile zxvl_checksum_table_t zxvl_cksum_table = { 0 };

The linker script must place .zxvl_checksums at HHDM_BASE + 0x80000