ZXFoundation™ Development Guide

Document Revision: 26h1.0
Applies to: ZXFoundation™ release 26h1 and later
Status: Active development

About This Document

This guide is the primary technical reference for the ZXFoundation™ kernel and its associated toolchain. It is written for:

OS developers who wish to understand the z/Architecture boot and execution environment.
Kernel contributors who need a precise description of subsystem contracts and initialization order.
Integrators who want to load their own kernel or module using the ZXFL bootloader.

Familiarity with C23, ELF64, and general operating-system concepts is assumed. Background on IBM z/Architecture is provided in the Architecture chapter.

What Is ZXFoundation™?

ZXFoundation™ is a freestanding, SMP-capable kernel for IBM z/Architecture (s390x) mainframes and emulators. It is written in C23 and targets the s390x-unknown-none-elf ABI.

The project comprises three independently versioned components:

Component	Output artifact	Description
ZXFL	`core.zxfoundationloader00.sys`, `core.zxfoundationloader01.sys`	Two-stage bootloader
Nucleus	`core.zxfoundation.nucleus`	Kernel ELF64 image
Host tools	`bin2rec`, `zxsign`	Build-time utilities

All three are built from a single CMake project using a cross-compiler toolchain targeting s390x.

Version Scheme

Releases follow the scheme YYhN, where YY is the two-digit year and N is the half-year index (1 = first half, 2 = second half). The current release is 26h1.

The boot protocol carries its own version field (ZXFL_VERSION_*). A kernel must check this field and refuse to boot if the version is not one it understands.

Document Organization

Chapter	Contents
Architecture	z/Architecture fundamentals: PSW, DAT, CCW, IPL, paging
Bootloader	ZXFL design, stage descriptions, boot protocol
Kernel	Subsystem table, initialization sequence, memory management
Build System	CMake modules, toolchains, configuration variables
Host Tools	`bin2rec` and `zxsign` reference

Quick Start

# Configure with the Clang toolchain (recommended)
cmake -B build -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-clang.cmake

# Build everything
cmake --build build

# Generate the DASD image and launch Hercules
cmake --build build --target dasd
hercules -f build/hercules.cnf

In the Hercules console, issue ipl 0100 to start the boot sequence.

See Build System for full configuration options and Build Targets for a description of each output artifact.

Architecture Overview

Document Revision: 26h1.0
Reference: IBM z/Architecture Principles of Operation, SA22-7832

z/Architecture (s390x) is IBM's 64-bit mainframe instruction set, introduced with the z900 in 2000. It supersedes ESA/390 (31-bit) and System/370 (24-bit). ZXFoundation™ targets z/Architecture exclusively; the CPU runs in ESA/390 mode only from the IPL PSW load until Stage 0 issues SIGP SET ARCHITECTURE.

Key properties that distinguish z/Architecture from commodity architectures:

All I/O is performed through the Channel Subsystem (CSS). There is no memory-mapped I/O.
The Program Status Word (PSW) encodes the instruction address, addressing mode, DAT enable, and all interrupt masks in a single 128-bit register.
The Lowcore at physical address 0x0 is the hardware-defined interrupt vector table with a fixed layout.
Inter-processor communication uses the SIGP instruction rather than memory-mapped registers or MSIs.
The STFLE instruction enumerates optional hardware facilities (analogous to CPUID on x86).

2. Program Status Word (PSW)

The PSW is 128 bits wide. It is loaded atomically by LPSWE and saved atomically on every interrupt.

Bits  0–63:  Mask word
  Bit  1:    PER enable
  Bit  5:    DAT enable
  Bit  6:    I/O interrupt mask
  Bit  7:    External interrupt mask
  Bit  15:   Problem state (0=supervisor, 1=user)
  Bits 18–19: Condition Code
  Bit  31:   EA (Extended Addressing), must be 1 for 64-bit
  Bit  32:   BA (Basic Addressing), must be 1 for 64-bit

Bits 64–127: Instruction address (64-bit)

EA=1, BA=1 selects 64-bit addressing mode. SAM64 sets this without altering other PSW fields.

Disabled-wait PSW: All interrupt masks cleared, wait bit set. The CPU halts permanently. Used as the panic state.

New PSWs: For each interrupt class (I/O, external, machine check, program, restart, SVC), the architecture reserves a fixed lowcore offset for a "new PSW" — the PSW loaded when that interrupt fires. The kernel must install valid new PSWs before enabling the corresponding interrupt class.
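
IBM documentation numbers bits from the most significant end, so bit 0 is the leftmost bit of the mask word. A minimal sketch of turning the PER, DAT, I/O, and external bit numbers listed above into C masks follows. The macro and function names are hypothetical and are not taken from the ZXFoundation™ sources.

```c
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of the 64-bit word. */
#define PSW_MASK_BIT(n)  (UINT64_C(1) << (63 - (n)))

/* Hypothetical names; bit positions as listed in the mask-word layout. */
#define PSW_MASK_PER  PSW_MASK_BIT(1)
#define PSW_MASK_DAT  PSW_MASK_BIT(5)
#define PSW_MASK_IO   PSW_MASK_BIT(6)
#define PSW_MASK_EXT  PSW_MASK_BIT(7)

/* Return the mask word with I/O and external interrupts disabled. */
static inline uint64_t psw_mask_irqs_off(uint64_t mask)
{
    return mask & ~(PSW_MASK_IO | PSW_MASK_EXT);
}
```

With this convention, PSW_MASK_DAT evaluates to 0x0400000000000000, which is the value that has to be OR-ed into a new PSW mask word to enter the handler with DAT on.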

3. Lowcore (Prefix Area)

The lowcore is the 4 KB region at physical address 0x0. Its layout is fixed by the architecture.

Offset	Content
`0x000`	IPL PSW
`0x008`	IPL CCW1
`0x010`	IPL CCW2
`0x0B8`	Subchannel ID of IPL device
`0x1A0`	Restart new PSW
`0x1B0`	External new PSW
`0x1C0`	SVC new PSW
`0x1D0`	Program new PSW
`0x1E0`	Machine check new PSW
`0x1F0`	I/O new PSW

The prefix register (set by SPX, read by STPX) maps a per-CPU physical page to the logical lowcore address 0x0. Each CPU has its own private lowcore page; the BSP uses physical page 0, APs use separately allocated pages.

4. Channel Command Words (CCW) and I/O

All device I/O is performed through the Channel Subsystem. The CPU constructs a Channel Program — a linked list of CCWs — and submits it via SSCH (Start Subchannel).

CCW Format-1 (8 bytes)

Bits  0–7:   Command code  (0x02=Read, 0x01=Write, 0x04=Sense)
Bit   9:     Chain Command (CC): link to next CCW
Bits 16–31:  Byte count
Bits 33–63:  Channel Data Address (CDA): physical address of data buffer

Critical constraint: The CDA field is 31 bits, so all I/O data buffers must reside below physical address 0x80000000 (2 GB). ZONE_DMA, which covers [0, 16 MB), lies entirely below this limit.
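
The 31-bit constraint can be enforced at the point where a CCW is built. The sketch below follows the format-1 layout from the Principles of Operation (command code, flag byte, 16-bit count, 31-bit data address). The struct, constant, and function names are hypothetical. On s390x the struct matches the hardware byte order; on a little-endian host it is useful only for checking the logic.

```c
#include <stdbool.h>
#include <stdint.h>

#define CCW_FLAG_CC    0x40u                  /* chain command: another CCW follows */
#define CCW_CDA_LIMIT  UINT64_C(0x80000000)   /* first address a CDA cannot reach */

/* Format-1 CCW, 8 bytes: command, flags, count, 31-bit data address. */
struct ccw1 {
    uint8_t  cmd;
    uint8_t  flags;
    uint16_t count;
    uint32_t cda;
};

/* Fill a CCW; fails if the buffer does not lie entirely below 2 GB. */
static bool ccw1_fill(struct ccw1 *ccw, uint8_t cmd, uint8_t flags,
                      uint64_t buf_pa, uint16_t count)
{
    if (buf_pa >= CCW_CDA_LIMIT || buf_pa + count > CCW_CDA_LIMIT)
        return false;
    ccw->cmd = cmd;
    ccw->flags = flags;
    ccw->count = count;
    ccw->cda = (uint32_t)buf_pa;
    return true;
}
```

Rejecting the buffer at construction time keeps the failure in the caller instead of surfacing later as a channel program check.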

I/O Sequence

CPU                        Channel Subsystem
 │                              │
 ├─ SSCH (schid, ORB) ────────► │  Submit channel program
 │                              ├─ Execute CCW chain, transfer data
 │◄──────── I/O interrupt ──────┤  Subchannel status available
 ├─ TSCH (schid, IRB) ────────► │  Read Interrupt Response Block
 │◄──────── IRB ────────────────┤  Device status, residual count

5. Initial Program Load (IPL)

When the operator issues a LOAD command, the channel subsystem performs the following automatically:

Reads the first physical record from the IPL device (ECKD: C=0, H=0, R=1) into physical address 0x0.
The record contains an IPL PSW at 0x0 and two CCWs at 0x8/0x10.
The CSS executes the CCW chain to load additional data.
The CPU loads the IPL PSW and begins execution.

For ZXFL, the IPL PSW is a 31-bit ESA/390 PSW pointing to the Stage 0 entry. The first instruction switches to z/Architecture mode via SIGP SET ARCHITECTURE.

6. Dynamic Address Translation (DAT)

DAT is enabled by PSW bit 5. When on, every memory access is translated through the page table hierarchy rooted at the ASCE in CR1.

Address Space Control Element (ASCE)

The ASCE is a 64-bit value in CR1 encoding the physical address of the root table, the Designation Type (DT), and the Table Length (TL). ZXFoundation™ uses DT=11 (Region-First), selecting 5-level paging.

5-Level Page Table Hierarchy

Level	Name	Entries	Coverage per entry
ASCE →	R1 (Region-First)	2048	8 PB
R1 →	R2 (Region-Second)	2048	4 TB
R2 →	R3 (Region-Third)	2048	2 GB
R3 →	Segment Table	2048	1 MB
Seg →	Page Table	256	4 KB

Each R1–Segment table is 16 KB (2048 × 8 bytes). Each page table is 2 KB (256 × 8 bytes).

Virtual Address Decomposition (DT=11)

 63      53 52      42 41      31 30      20 19    12 11       0
 ┌────────┬──────────┬──────────┬──────────┬────────┬──────────┐
 │  RFX   │   RSX    │   RTX    │    SX    │   PX   │    BX    │
 │ 11 bit │  11 bit  │  11 bit  │  11 bit  │  8 bit │  12 bit  │
 └────────┴──────────┴──────────┴──────────┴────────┴──────────┘
   R1 idx   R2 idx    R3 idx    Seg idx    PT idx   Byte offset
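
The diagram numbers bits from the least significant end (bit 0 is the rightmost bit), so each index is a plain shift and mask. A small sketch of the decomposition, with hypothetical type and function names:

```c
#include <stdint.h>

/* Table indices for one virtual address under a region-first (DT=11) ASCE. */
struct va_index {
    unsigned rfx, rsx, rtx, sx, px, bx;
};

static struct va_index va_decompose(uint64_t va)
{
    struct va_index ix;
    ix.rfx = (unsigned)((va >> 53) & 0x7FF);  /* region-first index  */
    ix.rsx = (unsigned)((va >> 42) & 0x7FF);  /* region-second index */
    ix.rtx = (unsigned)((va >> 31) & 0x7FF);  /* region-third index  */
    ix.sx  = (unsigned)((va >> 20) & 0x7FF);  /* segment index       */
    ix.px  = (unsigned)((va >> 12) & 0xFF);   /* page index          */
    ix.bx  = (unsigned)(va & 0xFFF);          /* byte offset         */
    return ix;
}
```

For example, the address 0xFFFF800000000000 decomposes to RFX = 0x7FF and RSX = 0x7E0, with all lower indices zero.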

Large Pages (EDAT)

Facility	STFLE bit	Page size	Mechanism
EDAT-1	8	1 MB	FC=1 in Segment Table Entry
EDAT-2	78	2 GB	FC=1 in Region-Third Entry

7. Virtual Address Space Layout

0x0000000000000000  User space (future)
        ...
0x00007FFFFFFFFFFF  User space top

        [ unmapped — translation exception ]

0xFFFF800000000000  HHDM base (CONFIG_KERNEL_VIRT_OFFSET)
                    Physical memory linearly mapped here.
                    PA 0x0 → VA 0xFFFF800000000000

0xFFFFC00000000000  vmalloc / ioremap region

0xFFFFFFFFFFFFFFFF  Top of address space

The HHDM offset is 0xFFFF800000000000. The bootloader builds this mapping before transferring control; all kernel pointers in the boot protocol are HHDM virtual addresses.

8. Physical Memory Zones

Zone	Range	Purpose
`ZONE_DMA`	`[0, 16 MB)`	Channel I/O buffers (31-bit CDA constraint)
`ZONE_NORMAL`	`[16 MB, RAM limit)`	General kernel allocations

9. Control Registers

Register	Purpose
CR0	I/O/external interrupt subclass masks, feature enables
CR1	Primary ASCE (page table root)
CR6	I/O interrupt subclass mask (extended)
CR14	Machine check interrupt mask

The bootloader saves CR0, CR1, and CR14 snapshots in the boot protocol so the kernel can inspect the handover state.

Bootloader Overview

Document Revision: 26h1.0

1. What Is ZXFL?

ZXFL (ZXFoundation™ Loader) is the two-stage bootloader for ZXFoundation™. It is the only supported mechanism for loading the kernel nucleus. Its responsibilities are:

Transition the CPU from ESA/390 to z/Architecture 64-bit mode.
Locate and load the kernel ELF64 image from DASD.
Verify kernel integrity (ZXVL structural lock, handshake, SHA-256 checksums).
Probe hardware: memory, CPUs, TOD clock, system identification.
Build the 5-level page tables (identity map + HHDM).
Populate the boot protocol structure.
Transfer control to the kernel entry point with DAT enabled.

2. Two-Stage Design

The split is imposed by a hard architectural constraint: the IPL mechanism loads exactly one record from the IPL device into physical address 0x0 and executes it. That record must contain the IPL PSW and enough code to load a larger second stage.

Stage	Internal name	Dataset	Load address	Size limit
0	`zxfl_stage1`	`CORE.ZXFOUNDATIONLOADER00.SYS`	`0x0`	12 KB
1	`zxfl_stage2`	`CORE.ZXFOUNDATIONLOADER01.SYS`	`0x20000`	~512 KB

Stage 0 is a minimal DASD reader. Its only job is to find Stage 1 in the VTOC, load it to 0x20000, and jump to it.

Stage 1 is the full loader. It performs all hardware detection, ELF loading, integrity verification, page table construction, and the final jump to the kernel.

3. IPL Flow

Power-on / LOAD button
  │
  ▼
Channel subsystem reads IPL record (C=0, H=0, R=1) → 0x0
  │
  ▼
Stage 0  (arch/s390x/init/zxfl/stage1/)
  ├─ SIGP SET ARCHITECTURE → z/Architecture mode
  ├─ SAM64 → 64-bit addressing
  ├─ Clear BSS
  ├─ Find CORE.ZXFOUNDATIONLOADER01.SYS in VTOC
  ├─ Read it to 0x20000
  └─ Jump to 0x20000
       │
       ▼
Stage 1  (arch/s390x/init/zxfl/stage2/)
  ├─ Install disabled-wait new PSWs (lowcore)
  ├─ Clear BSS (MVCL)
  ├─ STFLE — detect facilities
  ├─ Probe IPL device (ECKD / FBA Sense ID + RDC)
  ├─ Read parmfile (ETC.ZXFOUNDATION.PARM)
  ├─ Find CORE.ZXFOUNDATION.NUCLEUS in VTOC
  ├─ Load ELF64 PT_LOAD segments to physical memory
  ├─ ZXVL: structural lock + handshake + SHA-256 checksums
  ├─ Probe memory (write-pattern test)
  ├─ Load sysmodule= modules
  ├─ Detect SMP (SIGP Sense), STSI, TOD (STCK)
  ├─ Build 5-level page tables (identity + HHDM)
  ├─ Translate all protocol pointers to HHDM virtual
  └─ LPSWE → kernel entry point (DAT on, interrupts masked)

4. Dataset Names

All datasets reside on the IPL DASD volume. Names follow the IBM MVS convention (uppercase, dot-separated, max 44 characters).

Dataset	Contents
`CORE.ZXFOUNDATIONLOADER00.SYS`	Stage 0 IPL record
`CORE.ZXFOUNDATIONLOADER01.SYS`	Stage 1 flat binary
`CORE.ZXFOUNDATION.NUCLEUS`	Kernel ELF64
`ETC.ZXFOUNDATION.PARM`	Boot parameters (parmfile)

Additional datasets may be listed in the parmfile via sysmodule= entries.

5. Parmfile

The parmfile ETC.ZXFOUNDATION.PARM is a plain-text file read by Stage 1. Supported keys:

Key	Description	Default
`syssize=`	Memory probe limit in MB	512
`sysmodule=`	Dataset name of an additional module to load	(none)

Multiple sysmodule= lines are permitted (up to 16).
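
The key handling can be sketched as a per-line parser. This is a hosted-C illustration using the standard library, not the freestanding code in parmfile.c; the struct and function names are hypothetical, and ignoring unknown keys and using the 44-character dataset name limit for module names are assumptions.

```c
#include <stdlib.h>
#include <string.h>

#define PARM_MAX_MODULES 16
#define PARM_NAME_MAX    44   /* assumption: MVS dataset name limit */

struct parm_config {
    unsigned long syssize_mb;   /* caller presets this to the default, 512 */
    int module_count;
    char modules[PARM_MAX_MODULES][PARM_NAME_MAX + 1];
};

/* Parse one NUL-terminated line. Unknown keys are ignored (assumption).
 * Returns 0 on success, -1 on a malformed value or a full module table. */
static int parm_parse_line(struct parm_config *cfg, const char *line)
{
    if (strncmp(line, "syssize=", 8) == 0) {
        char *end;
        unsigned long mb = strtoul(line + 8, &end, 10);
        if (end == line + 8 || *end != '\0')
            return -1;
        cfg->syssize_mb = mb;
        return 0;
    }
    if (strncmp(line, "sysmodule=", 10) == 0) {
        const char *name = line + 10;
        size_t len = strlen(name);
        if (len == 0 || len > PARM_NAME_MAX ||
            cfg->module_count >= PARM_MAX_MODULES)
            return -1;
        memcpy(cfg->modules[cfg->module_count++], name, len + 1);
        return 0;
    }
    return 0;
}
```

A caller would initialize syssize_mb to 512, split the parmfile into lines, and feed each line to parm_parse_line in order.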

6. Constraints

All CCW channel data addresses must be 31-bit (< 0x80000000). Static BSS buffers satisfy this automatically.
Stage 0 must fit within 12 KB (enforced by ASSERT in stage1.ld).
The Stage 1 stack is 32 KB. The kernel must switch to its own stack before consuming more than ~8 KB.
The kernel entry point must be ≥ 0xFFFF800000040000 (HHDM + 256 KB). The loader enforces this.

Stage 0

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/stage1/

1. Purpose

Stage 0 is the minimal IPL loader. It occupies the first record on the IPL DASD volume and is loaded by the channel subsystem into physical address 0x0. Its sole responsibility is to locate Stage 1 (CORE.ZXFOUNDATIONLOADER01.SYS) in the VTOC, read it to 0x20000, and jump to it.

2. Entry Point (`head.S`)

The channel subsystem loads the IPL record and executes the PSW at offset 0x0. This PSW is a 31-bit ESA/390 PSW pointing to stage1_entry.

The entry sequence:

stage1_entry:
  1. SIGP SET ARCHITECTURE (order 0x12) → switch to z/Architecture
     Retry with "restore PSWs" flag if first attempt fails.
  2. SAM64 → enable 64-bit addressing mode
  3. Clear BSS (byte loop — MVCL is unsafe before architecture switch)
  4. Set stack pointer to stage1_stack_top − 160
  5. Load schid from lowcore offset 0xB8
  6. Call zxfl00_entry(schid)
  7. Disabled-wait PSW (fallback — zxfl00_entry is [[noreturn]])

The 160-byte stack offset is the standard z/Architecture register save area size.

3. Main Function (`entry.c` — `zxfl00_entry`)

Execution order:

diag_setup() — flush any partial DIAG 8 output line.
Print the Stage 0 banner via DIAG 8.
dasd_find_dataset(schid, "CORE.ZXFOUNDATIONLOADER01.SYS", &ext) — locate Stage 1 in the VTOC.
Read the dataset track-by-track into 0x20000 using dasd_read_next.
Sanity-check: verify the loaded image is not a disabled-wait PSW.
Jump to 0x20000 with schid in %r2.

4. Linker Script (`stage1.ld`)

Section	Address	Notes
`.text.ipl`	`0x0`	IPL PSW (8 bytes)
`.text`	`0x58`	Code (after lowcore reserved area)
`.bss`	after `.text`	Zero-initialized data

An ASSERT in the linker script enforces that the entire stage fits within 12 KB. The build will fail if this limit is exceeded.

5. Stack

An 8 KB static array in BSS. The stack pointer is initialized to stage1_stack_top − 160.

6. Shared Library (`common/`)

Stage 0 uses a subset of the shared common/ library:

Module	Purpose
`dasd_io.c`	Low-level CCW I/O (SSCH/TSCH)
`dasd_vtoc.c`	VTOC traversal and dataset lookup
`diag.c`	DIAG 8 console output
`ebcdic.c`	EBCDIC ↔ ASCII conversion
`panic.c`	Disabled-wait on fatal error
`string.c`	Minimal `memcpy`, `memset`, `strcmp`

Stage 1

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/stage2/

1. Purpose

Stage 1 is the full production loader. It is a flat binary linked at 0x20000, loaded there by Stage 0. It performs all hardware detection, kernel loading, integrity verification, page table construction, and the final transfer of control to the kernel.

2. Entry Point (`entry.S` — `stage2_entry`)

stage2_entry:
  1. Save schid from %r2 into a callee-saved register (%r13)
  2. Call zxfl_lowcore_setup() — install disabled-wait new PSWs
  3. SSM 0x00 — mask all interrupts off
  4. Clear BSS with MVCL (pad-fill mode, source length = 0)
  5. Set stack pointer to stage2_stack_top − 160
  6. Restore schid into %r2
  7. Call zxfl01_entry(schid)

SSM 0x00 is issued immediately after zxfl_lowcore_setup installs safe new PSWs. Any interrupt that fires during the loader will hit a known disabled-wait rather than garbage.

BSS is cleared with MVCL in pad-fill mode (source length = 0, pad byte = 0x00). This is safe in 64-bit mode and faster than a byte loop for large BSS sections.

3. Main Function (`entry.c` — `zxfl01_entry`)

Execution order:

Step	Action
1	STFLE — store facility list into `proto.stfle_fac[]`
2	CR setup — clear I/O, external, machine-check masks in CR0; zero CR6 and CR14
3	Device probe — `probe_ipl_device()`: ECKD Sense ID first, then FBA; populates `ipl_dev_type` and `ipl_dev_model`
4	Parmfile — read `ETC.ZXFOUNDATION.PARM`; parse `syssize=`
5	Nucleus load — `dasd_find_dataset_extents` + `zxfl_load_elf64`
6	ZXVL — structural lock check, handshake, SHA-256 segment checksums
7	Memory probe — write-pattern test at 1 MB granularity up to `syssize` or 512 MB
8	Module loading — load each `sysmodule=` dataset as a flat binary after the kernel image
9	System detection — `zxfl_system_detect`: STSI (manufacturer, model, LPAR), SIGP Sense (CPU map), STCK (TOD)
10	Protocol finalization — magic, version, binding token, stack canaries, CR snapshots
11	MMU + jump — `zxfl_mmu_setup_and_jump`: build page tables, translate pointers, `LPSWE` to kernel entry

4. Linker Script (`stage2.ld`)

The binary is linked at 0x20000 as a flat ELF. The post-build step strips it to a raw binary with objcopy -O binary.

5. Stack

A 32 KB static array in BSS. The kernel receives a pointer to the top of this stack in %r15 and in proto->kernel_stack_top (HHDM virtual). The kernel must switch to its own stack before consuming more than ~8 KB.

6. Shared Library (`common/`)

Stage 1 uses the full common/ library:

Module	Purpose
`dasd_io.c`	Low-level CCW I/O
`dasd_vtoc.c`	VTOC traversal
`dasd_eckd.c`	ECKD device driver
`dasd_fba.c`	FBA device driver
`dasd_tape.c`	Tape device driver
`elfload.c`	ELF64 segment loader
`mmu.c`	Bootloader page table builder
`lowcore.c`	Lowcore / new PSW setup
`zxvl_verify.c`	ZXVL integrity checks
`parmfile.c`	Parmfile parser
`stfle.c`	STFLE facility detection
`system.c`	STSI, SIGP Sense, STCK
`diag.c`, `ebcdic.c`, `panic.c`, `string.c`	Utilities

DASD Subsystem

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/common/dasd_*.c

1. Overview

ZXFL supports three IPL device types: two DASD formats and tape. The correct driver is selected automatically by probing the IPL device with Sense ID and Read Device Characteristics (RDC) CCWs.

Type	Driver	Typical device
ECKD	`dasd_eckd.c`	3390 (most common)
FBA	`dasd_fba.c`	9336
Tape	`dasd_tape.c`	3480, 3490, 3590

2. Low-Level I/O (`dasd_io.c`)

All device access goes through a single CCW submission layer:

dasd_do_io(schid, ccw_chain, sense_buf)
  │
  ├─ Build ORB pointing to ccw_chain
  ├─ SSCH(schid, ORB)
  ├─ Wait for completion (polling loop on TSCH, interrupts disabled)
  ├─ TSCH(schid, IRB) → check device end status
  └─ Return status or panic on unrecoverable error

All CCW data buffers are static BSS arrays, ensuring they reside below 0x80000000 (31-bit CDA constraint).

3. ECKD Driver (`dasd_eckd.c`)

ECKD (Extended Count Key Data) is the standard format for IBM 3390 DASD. Addressing is by cylinder, head, and record number (C/H/R).

Key operations:

Operation	CCW command	Description
Sense ID	`0xE4`	Identify device type and model
Read Device Characteristics	`0x64`	Obtain geometry (cylinders, heads, sectors)
Seek	`0x07`	Position to cylinder/head
Search ID Equal	`0x31`	Find record by C/H/R
Read Count Key Data	`0x86`	Read a full record

Track reads use a Seek → Search → Read CCW chain. The search CCW loops (via TIC — Transfer in Channel) until the target record is found.

4. FBA Driver (`dasd_fba.c`)

FBA (Fixed Block Architecture) devices use linear block addressing. Each block is 512 bytes.

Key operations:

Operation	CCW command	Description
Sense ID	`0xE4`	Identify device
Define Extent	`0x63`	Set the block range for the following operation
Locate Record	`0x43`	Specify starting block and count
Read	`0x42`	Transfer data

5. Tape Driver (`dasd_tape.c`)

Tape support is provided for environments where the kernel is stored on a 3480/3490/3590 tape cartridge. Tape is read sequentially; there is no random access.

Key operations: Sense ID, Rewind, Read Block, Forward Space File.

6. Device Selection

At Stage 1 startup, probe_ipl_device() issues a Sense ID CCW to the IPL subchannel. The returned device type code selects the driver:

device_type == 0x3390  →  ECKD
device_type == 0x9336  →  FBA
device_type == 0x3480
              0x3490
              0x3590   →  Tape
otherwise              →  panic("unsupported IPL device")
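
The selection table maps directly onto a switch statement. In this sketch the enum and function names are hypothetical, and the unsupported case is returned to the caller instead of panicking so that the logic can be exercised on a host:

```c
#include <stdint.h>

enum ipl_driver { IPL_DRV_NONE, IPL_DRV_ECKD, IPL_DRV_FBA, IPL_DRV_TAPE };

/* Map the Sense ID device type to a driver; IPL_DRV_NONE means unsupported. */
static enum ipl_driver ipl_select_driver(uint16_t device_type)
{
    switch (device_type) {
    case 0x3390: return IPL_DRV_ECKD;
    case 0x9336: return IPL_DRV_FBA;
    case 0x3480:
    case 0x3490:
    case 0x3590: return IPL_DRV_TAPE;
    default:     return IPL_DRV_NONE;   /* the loader panics at this point */
    }
}
```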

VTOC

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/common/dasd_vtoc.c

1. What Is the VTOC?

The Volume Table of Contents (VTOC) is the directory of a z/Architecture DASD volume. It is an IBM-defined on-disk structure that maps dataset names to their physical extents (cylinder/head ranges on ECKD, or block ranges on FBA).

The VTOC begins at a fixed location recorded in the DASD label (Format-4 DSCB at cylinder 0, head 0, record 3 on ECKD). ZXFL reads the VTOC to locate the kernel and loader datasets by name.

2. DSCB Types

The VTOC consists of Data Set Control Blocks (DSCBs), each 140 bytes. ZXFL uses three types:

Type	Format	Purpose
Format-1	F1DSCB	Dataset name, creation date, first extent
Format-3	F3DSCB	Additional extents (overflow from F1)
Format-4	F4DSCB	VTOC descriptor — location and size of VTOC itself

3. Dataset Lookup

dasd_find_dataset(schid, name, &ext)
  │
  ├─ Read F4DSCB (C=0, H=0, R=3) → get VTOC start C/H and size
  ├─ For each DSCB in VTOC:
  │    ├─ Read record
  │    ├─ Check format byte
  │    ├─ If F1DSCB: compare DS1DSNAM (44-byte EBCDIC name) to target
  │    └─ If match: extract extent list from DS1EXT1..DS1EXT3
  └─ Return first extent (cylinder/head start + end)

Dataset names are stored in EBCDIC on disk. ZXFL converts the search name from ASCII to EBCDIC before comparison using ebcdic_ascii_to_ebcdic().
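
To compare against DS1DSNAM, the search name has to be converted and padded to the full 44 bytes. The sketch below covers only uppercase letters, digits, and the dot, and assumes the on-disk name is padded with EBCDIC spaces (0x40). The function names are hypothetical and are not the signature of ebcdic_ascii_to_ebcdic().

```c
#include <stddef.h>
#include <stdint.h>

#define DSNAME_LEN 44

/* Convert one character of the dataset-name alphabet to EBCDIC.
 * Anything outside A-Z, 0-9 and '.' becomes an EBCDIC space. */
static uint8_t dsname_char_to_ebcdic(char c)
{
    if (c >= 'A' && c <= 'I') return (uint8_t)(0xC1 + (c - 'A'));
    if (c >= 'J' && c <= 'R') return (uint8_t)(0xD1 + (c - 'J'));
    if (c >= 'S' && c <= 'Z') return (uint8_t)(0xE2 + (c - 'S'));
    if (c >= '0' && c <= '9') return (uint8_t)(0xF0 + (c - '0'));
    if (c == '.') return 0x4B;
    return 0x40;
}

/* Build the 44-byte, space-padded EBCDIC form of an ASCII dataset name. */
static void dsname_to_ebcdic(const char *ascii, uint8_t out[DSNAME_LEN])
{
    size_t i = 0;
    for (; i < DSNAME_LEN && ascii[i] != '\0'; i++)
        out[i] = dsname_char_to_ebcdic(ascii[i]);
    for (; i < DSNAME_LEN; i++)
        out[i] = 0x40;
}
```

Once the name is in this form, the match against a Format-1 DSCB is a 44-byte memory comparison.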

4. Extent Structure

Each extent describes a contiguous range of tracks:

#include <stdint.h>

struct extent {
    uint16_t  cyl_start;   // starting cylinder
    uint16_t  head_start;  // starting head
    uint16_t  cyl_end;     // ending cylinder (inclusive)
    uint16_t  head_end;    // ending head (inclusive)
};

A dataset may span up to three extents in its F1DSCB, with additional extents in a chained F3DSCB. ZXFL follows the F3 chain if the dataset requires more than three extents.

5. Sequential Read

After locating a dataset's extents, dasd_read_next() reads tracks sequentially:

for each extent:
    for each track in [cyl_start/head_start .. cyl_end/head_end]:
        Seek → Search R=1 → Read all records on track → append to buffer

The read stops when the buffer is full or all extents are exhausted.
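
The size of the track loop follows from the extent bounds and the device geometry. A minimal sketch, assuming the heads-per-cylinder value comes from Read Device Characteristics (15 on a 3390); the function name is hypothetical and the struct is repeated so the block is self-contained:

```c
#include <stdint.h>

struct extent {
    uint16_t cyl_start, head_start;   /* first track           */
    uint16_t cyl_end, head_end;       /* last track, inclusive */
};

/* Number of tracks covered by an extent (both ends inclusive). */
static uint32_t extent_track_count(const struct extent *e,
                                   uint32_t heads_per_cyl)
{
    uint32_t first = (uint32_t)e->cyl_start * heads_per_cyl + e->head_start;
    uint32_t last  = (uint32_t)e->cyl_end * heads_per_cyl + e->head_end;
    return last - first + 1;
}
```

An extent from C=1, H=0 to C=2, H=14 on a 15-head device therefore covers 30 tracks.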

ELF64 Loader

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/common/elfload.c

1. Overview

zxfl_load_elf64 loads the kernel ELF64 image from DASD into physical memory. It processes only PT_LOAD program headers; all other segment types are ignored.

2. Load Sequence

zxfl_load_elf64(schid, dataset_name, load_base_out)
  │
  ├─ Read ELF header (first 64 bytes)
  ├─ Validate: magic 0x7F 'E' 'L' 'F', EI_CLASS=2 (64-bit),
  │            EI_DATA=2 (big-endian), e_machine=0x16 (s390)
  ├─ Read program header table (e_phoff, e_phnum entries)
  ├─ For each PT_LOAD segment:
  │    ├─ Compute physical load address:
  │    │    pa = p_paddr − CONFIG_KERNEL_VIRT_OFFSET
  │    ├─ Read p_filesz bytes from file offset p_offset → pa
  │    └─ Zero-fill [pa + p_filesz, pa + p_memsz)
  └─ Return load_min (lowest p_paddr seen, stripped of HHDM offset)

3. Address Computation

The kernel is linked with virtual addresses in the HHDM range (p_vaddr ≥ 0xFFFF800000000000). The physical load address is derived by subtracting CONFIG_KERNEL_VIRT_OFFSET:

$$pa = \texttt{p\_paddr} - \texttt{CONFIG\_KERNEL\_VIRT\_OFFSET}$$

The loader does not use p_vaddr directly; it uses p_paddr to avoid ambiguity when the linker script sets AT() addresses.

4. Constraints

The kernel ELF must be ET_EXEC (executable, not shared object).
e_machine must be 0x16 (EM_S390). Any other value causes an immediate panic.
All PT_LOAD segments must have p_paddr ≥ CONFIG_KERNEL_VIRT_OFFSET. A segment below the HHDM offset is rejected.
The kernel entry point (e_entry) must be ≥ 0xFFFF800000040000 (HHDM + 256 KB). The loader enforces this before the final jump.
The total loaded image (all PT_LOAD segments) must fit within the memory probed by the write-pattern test.
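
The header-level constraints can be checked from the first 64 bytes of the image. The sketch below decodes fields as big-endian by hand so it behaves the same on any host; it returns an error code where the loader would panic, and the function and constant names are hypothetical.

```c
#include <stdint.h>

#define KERNEL_VIRT_OFFSET UINT64_C(0xFFFF800000000000)
#define KERNEL_ENTRY_MIN   (KERNEL_VIRT_OFFSET + UINT64_C(0x40000))

static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint64_t be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Returns 0 if the ELF64 header passes, otherwise a negative code
 * identifying the first failed check. */
static int elf64_header_check(const uint8_t hdr[64])
{
    if (hdr[0] != 0x7F || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
        return -1;
    if (hdr[4] != 2) return -2;              /* EI_CLASS: 64-bit     */
    if (hdr[5] != 2) return -3;              /* EI_DATA: big-endian  */
    if (be16(hdr + 16) != 2) return -4;      /* e_type: ET_EXEC      */
    if (be16(hdr + 18) != 0x16) return -5;   /* e_machine: EM_S390   */
    if (be64(hdr + 24) < KERNEL_ENTRY_MIN)   /* e_entry lower bound  */
        return -6;
    return 0;
}
```

The per-segment checks on p_paddr and the total image size need the program header table and the memory probe result, so they are left out of this sketch.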

5. BSS Zeroing

Segments where p_memsz > p_filesz have a BSS tail. The loader zeros this region with memset immediately after reading the file data. This ensures the kernel's BSS is clean before any ZXVL verification.

Bootloader MMU & HHDM

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/common/mmu.c

1. Purpose

Before transferring control to the kernel, Stage 1 must enable DAT (Dynamic Address Translation) and establish the virtual address space the kernel expects. This involves building a 5-level page table hierarchy with two mappings:

Mapping	Virtual range	Physical range	Purpose
Identity	`[0x0, RAM)`	`[0x0, RAM)`	Allows the loader itself to continue executing after DAT is enabled
HHDM	`[HHDM_BASE, HHDM_BASE + RAM)`	`[0x0, RAM)`	The kernel's primary view of physical memory

HHDM_BASE = 0xFFFF800000000000 (CONFIG_KERNEL_VIRT_OFFSET).

2. Page Table Allocation

The bootloader allocates page tables from a bump allocator backed by a contiguous physical region immediately after the kernel image. The region base is the first 1 MB-aligned address after kernel_phys_end, or 32 MB if that address is lower. The end of this region is recorded in proto->pgtbl_pool_end.

The kernel PMM must mark [pool_base, pgtbl_pool_end) as reserved during initialization.
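
Because the protocol records only the pool end, the kernel has to derive the pool base itself. A minimal sketch of that computation, assuming "first 1 MB-aligned address after" means rounding up (an already aligned kernel_phys_end is returned unchanged); the macro and function names are hypothetical:

```c
#include <stdint.h>

#define MB(x) ((uint64_t)(x) << 20)

/* Round kernel_phys_end up to 1 MB, then apply the 32 MB lower bound. */
static uint64_t pgtbl_pool_base(uint64_t kernel_phys_end)
{
    uint64_t base = (kernel_phys_end + MB(1) - 1) & ~(MB(1) - 1);
    return base < MB(32) ? MB(32) : base;
}
```

The PMM can then reserve the range from pgtbl_pool_base(proto->kernel_phys_end) up to proto->pgtbl_pool_end.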

3. Build Sequence

zxfl_mmu_setup_and_jump(proto, entry_point)
  │
  ├─ Allocate R1 table (16 KB, zero-filled)
  ├─ For each 4 KB page in [0, RAM):
  │    ├─ Map VA = PA         (identity)
  │    └─ Map VA = PA + HHDM  (HHDM)
  ├─ Build ASCE: R1_phys | DT=11 | TL=3 (2048 entries)
  ├─ Load ASCE into CR1 (LCTLG)
  ├─ Translate all proto pointer fields to HHDM virtual
  ├─ Set PSW.DAT = 1 in the new PSW
  └─ LPSWE → entry_point (DAT on, interrupts masked)

Large pages (EDAT-1 / EDAT-2) are used if the corresponding STFLE facility is present, reducing the number of page table entries required.

4. Pointer Translation

All pointer fields in zxfl_boot_protocol_t that reference physical memory are translated to HHDM virtual addresses before the jump:

$$va = pa + \texttt{CONFIG\_KERNEL\_VIRT\_OFFSET}$$

This includes mem_map_addr, kernel_entry, kernel_stack_top, cmdline_addr, and lowcore_phys. The kernel must not attempt to dereference any protocol pointer as a physical address.
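
On the kernel side the same offset converts in both directions. A small sketch with hypothetical helper names, using the documented value of CONFIG_KERNEL_VIRT_OFFSET:

```c
#include <stdint.h>

#define CONFIG_KERNEL_VIRT_OFFSET UINT64_C(0xFFFF800000000000)

/* Physical address to HHDM virtual address. */
static inline uint64_t hhdm_phys_to_virt(uint64_t pa)
{
    return pa + CONFIG_KERNEL_VIRT_OFFSET;
}

/* HHDM virtual address back to physical; va must lie in the HHDM range. */
static inline uint64_t hhdm_virt_to_phys(uint64_t va)
{
    return va - CONFIG_KERNEL_VIRT_OFFSET;
}
```

A kernel that needs a physical address for a CCW data buffer, for example, would pass the protocol pointer through hhdm_virt_to_phys first.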

5. State at Kernel Entry

Resource	State
DAT	On — CR1 holds the ASCE built by the loader
Interrupts	Masked — all interrupt classes disabled
`%r2`	HHDM virtual address of `zxfl_boot_protocol_t`
`%r15`	HHDM virtual address of initial stack top (32 KB)
All other GPRs	Undefined

Boot Protocol

Document Revision: 26h1.0
Protocol version: ZXFL_VERSION_4 (0x00000004)

1. Overview

The kernel receives a pointer to zxfl_boot_protocol_t in %r2 at entry. All pointer fields are HHDM virtual addresses. The struct is version 4.

The kernel must validate proto->magic == ZXFL_MAGIC (0x5A58464C, "ZXFL") before using any other field. A mismatch indicates the wrong value is in %r2 or the loader did not complete correctly.
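
A sketch of that check at kernel entry follows. Only the leading header fields are modelled, and their order and packing are assumptions, not the full zxfl_boot_protocol_t layout; the struct, enum, and function names are hypothetical.

```c
#include <stdint.h>

#define ZXFL_MAGIC      UINT32_C(0x5A58464C)   /* "ZXFL" */
#define ZXFL_VERSION_4  UINT32_C(0x00000004)

/* Leading fields only; the full struct is not reproduced here. */
struct zxfl_proto_header {
    uint32_t magic;
    uint32_t version;
    uint32_t flags;
};

enum proto_status { PROTO_OK, PROTO_BAD_MAGIC, PROTO_BAD_VERSION };

static enum proto_status proto_validate(const struct zxfl_proto_header *h)
{
    if (h->magic != ZXFL_MAGIC)
        return PROTO_BAD_MAGIC;      /* no other field can be trusted */
    if (h->version != ZXFL_VERSION_4)
        return PROTO_BAD_VERSION;    /* protocol not understood */
    return PROTO_OK;
}
```

Checking the magic before the version matters: if %r2 holds garbage, the version field is garbage too.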

2. Header Fields

Field	Type	Value / Description
`magic`	`u32`	`0x5A58464C` ("ZXFL")
`version`	`u32`	`0x00000004`
`flags`	`u32`	Bitmask of `ZXFL_FLAG_*` (see §13)
`binding_token`	`u64`	`ZXVL_SEED ^ stfle_fac[0] ^ ipl_schid`

3. Loader Identity

Field	Type	Description
`loader_major`	`u16`	Major version (1)
`loader_minor`	`u16`	Minor version (0)
`loader_timestamp`	`u32`	Build time encoded as `HHMMSSZx`

4. IPL Device

Field	Type	Description
`ipl_schid`	`u32`	Subchannel ID of the IPL device
`ipl_dev_type`	`u16`	Device type from Sense ID (e.g. `0x3390`)
`ipl_dev_model`	`u16`	Device model from Sense ID

5. Kernel Image

Field	Type	Description
`kernel_phys_start`	`u64`	Physical base of loaded kernel
`kernel_phys_end`	`u64`	Physical end (exclusive), after modules
`kernel_entry`	`u64`	ELF entry point (HHDM virtual)

6. Memory Map

Field	Type	Description
`mem_total_bytes`	`u64`	Total usable + kernel RAM
`mem_map_addr`	`u64`	HHDM virtual address of `zxfl_mem_region_t[]`
`mem_map_count`	`u32`	Number of valid entries

Each zxfl_mem_region_t entry is defined as:

Field	Type	Description
`base`	`u64`	Physical base address of the region
`length`	`u64`	Length of the region in bytes
`type`	`u32`	ZXFL_MEM_* constant
`numa_node`	`u8`	Logical NUMA node ID this memory region belongs to

7. Page Table Pool

Field	Type	Description
`pgtbl_pool_end`	`u64`	Physical end of bootloader page-table bump pool

Pool base is the first 1 MB-aligned address after kernel_phys_end, or 32 MB if that address is lower. The kernel PMM must mark [pool_base, pgtbl_pool_end) as reserved.

8. Kernel Stack

Field	Type	Description
`kernel_stack_top`	`u64`	HHDM virtual address of initial stack top (32 KB)

The kernel should switch to its own stack as early as possible and treat this region as reserved.

9. Control Register Snapshots

Field	Type	Description
`cr0_snapshot`	`u64`	CR0 at time of kernel jump
`cr1_snapshot`	`u64`	CR1 (ASCE) at time of jump
`cr14_snapshot`	`u64`	CR14 at time of jump

10. SMP / CPU Map

Field	Type	Description
`cpu_map[]`	`zxfl_cpu_info_t[128]`	Up to 128 CPU entries
`cpu_count`	`u32`	Valid entries in `cpu_map`
`bsp_cpu_addr`	`u16`	CPU address of the boot processor

Each zxfl_cpu_info_t:

Field	Type	Description
`cpu_addr`	`u16`	CPU address (0–65535)
`type`	`u8`	`ZXFL_CPU_TYPE_*` constant
`state`	`u8`	`ZXFL_CPU_ONLINE` or `ZXFL_CPU_STOPPED`
`numa_node`	`u8`	Logical NUMA node ID derived from physical book/socket
`drawer_id`	`u8`	Drawer physical identifier from STSI 15.1.x
`book_id`	`u8`	Book physical identifier from STSI 15.1.x
`socket_id`	`u8`	Socket physical identifier from STSI 15.1.x
`chip_id`	`u8`	Chip physical identifier from STSI 15.1.x
`thread_id`	`u8`	Thread physical identifier from STSI 15.1.x

Valid when ZXFL_FLAG_SMP is set.

11. System Identification

Populated from STSI when ZXFL_FLAG_SYSINFO is set:

Field	Description
`manufacturer[16]`	ASCII, e.g. `"IBM"`
`type[4]`	Machine type, e.g. `"2964"`
`model[16]`	Model identifier
`sequence[16]`	Machine serial number
`plant[4]`	Manufacturing plant code
`lpar_name[8]`	LPAR name (STSI 2.2.2); empty on bare metal
`lpar_number`	LPAR number
`cpus_total`	Total CPUs in CEC
`cpus_configured`	Configured CPUs
`cpus_standby`	Standby CPUs
`capability`	CPU capability rating

12. Modules

Up to 16 modules loaded from sysmodule= parmfile entries:

Field	Description
`modules[i].name[32]`	Dataset name (NUL-terminated)
`modules[i].phys_start`	Physical load address
`modules[i].size_bytes`	Size in bytes

13. Flags

Flag	Bit	Meaning
`ZXFL_FLAG_SMP`	0	`cpu_map[]` is valid
`ZXFL_FLAG_MEM_MAP`	1	`mem_map` is valid
`ZXFL_FLAG_CMDLINE`	2	`cmdline_addr` is valid
`ZXFL_FLAG_LOWCORE`	3	`lowcore_phys` is valid
`ZXFL_FLAG_STFLE`	4	`stfle_fac[]` is valid
`ZXFL_FLAG_SYSINFO`	5	`sysinfo` is valid
`ZXFL_FLAG_TOD`	6	`tod_boot` is valid
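
The flag table can be written as C constants as follows. This sketch assumes that "Bit" here means the conventional C bit position (bit 0 is the least significant bit), not IBM numbering; the helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit 0 is the least significant bit of the flags word. */
#define ZXFL_FLAG_SMP      (UINT32_C(1) << 0)
#define ZXFL_FLAG_MEM_MAP  (UINT32_C(1) << 1)
#define ZXFL_FLAG_CMDLINE  (UINT32_C(1) << 2)
#define ZXFL_FLAG_LOWCORE  (UINT32_C(1) << 3)
#define ZXFL_FLAG_STFLE    (UINT32_C(1) << 4)
#define ZXFL_FLAG_SYSINFO  (UINT32_C(1) << 5)
#define ZXFL_FLAG_TOD      (UINT32_C(1) << 6)

/* True if the given flag is set in the protocol flags word. */
static inline bool zxfl_flag_set(uint32_t flags, uint32_t flag)
{
    return (flags & flag) != 0;
}
```

A field guarded by a flag should be read only after the corresponding zxfl_flag_set test succeeds.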

14. Binding Token

The binding token ties the boot session to the specific hardware and IPL device:

$$\texttt{binding\_token} = \texttt{ZXVL\_SEED} \oplus \texttt{stfle\_fac[0]} \oplus \texttt{ipl\_schid}$$

The kernel must recompute this value and compare it to proto->binding_token. A mismatch means the protocol was tampered with or the kernel is running on unexpected hardware.

The binding token is also used as a component of the ZXVL handshake nonce and the stack frame canary. See ZXVL Verification.

ZXVL Verification

Document Revision: 26h1.0
Source: arch/s390x/init/zxfl/common/zxvl_verify.c

1. Overview

ZXVL (ZXVerifiedLoad) is the integrity verification layer embedded in the ZXFL bootloader. It prevents arbitrary payloads from being loaded as the kernel nucleus. Three mechanisms are applied in sequence after ELF loading, before DAT is enabled.

2. Structural Lock

The kernel must embed a .zxfl_lock section at fixed offsets from its physical load base (load_min):

Offset from `load_min`	Content
`0x70000`	High 32 bits of lock key: `0xCCBBCC35`
`0x70004`	Sentinel: `0x5A58464C` ("ZXFL")
`0x71000`	Low 32 bits of lock key: `0xE5664311`

The loader verifies:

$$(\texttt{key} \oplus \texttt{ZXVL_LOCK_MASK}) = \texttt{ZXVL_LOCK_EXPECTED}$$

where:

$\texttt{key} = (\texttt{hi} \ll 32) \mid \texttt{lo}$
$\texttt{ZXVL_LOCK_MASK} = \texttt{0x3C1E0F8704B2D596}$
$\texttt{ZXVL_LOCK_EXPECTED} = \texttt{0xF0A5C3B2E1D49687}$

A missing sentinel or wrong key causes an immediate panic — the loader refuses to execute the image.
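
The check can be sketched in C as below. It is a model of the rule stated in this section rather than the loader's source: the offsets, mask, expected value, and sentinel are taken from the text, while `load_be32`, `zxvl_lock_ok`, and `ZXVL_LOCK_SENTINEL` are hypothetical names. Words are read big-endian because s390x is a big-endian architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ZXVL_LOCK_MASK     0x3C1E0F8704B2D596ULL
#define ZXVL_LOCK_EXPECTED 0xF0A5C3B2E1D49687ULL
#define ZXVL_LOCK_SENTINEL 0x5A58464CU  /* "ZXFL"; hypothetical macro name */

/* Read a big-endian 32-bit word independently of host byte order. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* `image` points at load_min; `len` is the number of readable bytes. */
static bool zxvl_lock_ok(const uint8_t *image, size_t len)
{
    if (len < 0x71004)
        return false;                       /* lock words out of range */
    uint32_t hi       = load_be32(image + 0x70000);
    uint32_t sentinel = load_be32(image + 0x70004);
    uint32_t lo       = load_be32(image + 0x71000);
    if (sentinel != ZXVL_LOCK_SENTINEL)
        return false;
    uint64_t key = ((uint64_t)hi << 32) | lo;
    return (key ^ ZXVL_LOCK_MASK) == ZXVL_LOCK_EXPECTED;
}
```

With the documented constants, `key` is `0xCCBBCC35E5664311`, and XOR with the mask yields exactly `ZXVL_LOCK_EXPECTED`.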

3. Handshake

The kernel must place a callable function stub at load_min + 0x0 (the very first byte of the loaded image). The stub must implement:

$$f(\texttt{nonce}) = \text{rotl}_{17}(\texttt{nonce}) + \texttt{ZXVL_HS_RESPONSE}$$

where $\text{rotl}_{17}(x) = (x \ll 17) \mid (x \gg 47)$ and $\texttt{ZXVL_HS_RESPONSE} = \texttt{0xDEADBEEF0BADF00D}$.

The loader calls the stub with:

$$\texttt{nonce} = \texttt{ZXVL_SEED} \oplus \texttt{binding_token}$$

$$\texttt{binding_token} = \texttt{ZXVL_SEED} \oplus \texttt{stfle_fac[0]} \oplus \texttt{ipl_schid}$$

This ties the handshake to the specific hardware and IPL device. A kernel image that passes on one machine will not pass on another with different STFLE facilities or a different subchannel ID.
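
The response the loader expects can be modelled in a few lines of C. This sketch implements only the formula given in this section; `zxvl_hs_expected` is a hypothetical name, and the addition is assumed to wrap modulo 2^64, as unsigned 64-bit arithmetic does.

```c
#include <assert.h>
#include <stdint.h>

#define ZXVL_HS_RESPONSE 0xDEADBEEF0BADF00DULL

/* Rotate a 64-bit value left by 17 bits. */
static uint64_t rotl17(uint64_t x)
{
    return (x << 17) | (x >> 47);
}

/* Expected stub result for a given nonce; the sum wraps modulo 2^64. */
static uint64_t zxvl_hs_expected(uint64_t nonce)
{
    return rotl17(nonce) + ZXVL_HS_RESPONSE;
}
```

A host-side unit test can use a model like this to compare against the value returned by a candidate stub before the image is written to DASD.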

4. SHA-256 Segment Checksums

After the handshake, zxvl_verify_nucleus_checksums reads the zxvl_checksum_table_t from load_min + 0x80000 and verifies each entry:

$$\text{SHA-256}(\texttt{phys_start}, \texttt{size}) = \texttt{entry.digest}$$

Any mismatch causes an immediate panic. The table is patched into the kernel ELF by zxsign at build time. Any modification to a PT_LOAD segment after the build — including by a malicious bootloader or storage attack — is detected here.

5. Binding Token

The binding token is stored in proto->binding_token and used in two places:

Handshake nonce (above).
Stack frame canary: frame[1] = ZXVL_FRAME_MAGIC_B ^ binding_token.

The canary value is unique per hardware configuration. A canary extracted from one system cannot be replayed on another.

The kernel must recompute the binding token on entry and compare it to proto->binding_token. See Boot Protocol §14.

Checksum Protocol

Document Revision: 26h1.1

1. Purpose

The checksum protocol ensures that the kernel image loaded into memory matches the image that was built and signed. It operates at three points:

Point	Actor	Action
Build time	`zxsign`	Compute SHA-256 per `PT_LOAD` segment; patch into `.zxvl_checksums`
Boot time (loader)	`zxvl_verify_nucleus_checksums`	Recompute and compare before DAT is enabled
Boot time (kernel)	`verify_kernel_checksums`	Recompute and compare from HHDM after DAT is enabled

The double verification (loader + kernel) ensures that neither a compromised loader nor a post-load memory modification can go undetected.

2. Table Location

The checksum table is placed in the .zxvl_checksums ELF section, which is emitted as a dedicated PT_LOAD segment with p_flags = ZXVL_PFLAGS_CKSUM (0x00200004).

The loader discovers the table's physical address by scanning the ELF program header table for a segment with that exact p_flags value. The physical address is stored in zxfl_boot_protocol_t::cksum_table_phys and passed to the kernel. No hardcoded offsets are used.
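
The scan can be sketched as follows. The program header layout is the standard ELF64 one, declared locally so the example is self-contained; `elf64_phdr_t` and `find_cksum_segment` are hypothetical names, and requiring `p_type == PT_LOAD` in addition to the exact `p_flags` match is an assumption based on the statement that the section is emitted as a `PT_LOAD` segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PT_LOAD           1u
#define ZXVL_PFLAGS_CKSUM 0x00200004u

/* ELF64 program header, fields in the order defined by the ELF64 format. */
typedef struct {
    uint32_t p_type;
    uint32_t p_flags;
    uint64_t p_offset;
    uint64_t p_vaddr;
    uint64_t p_paddr;
    uint64_t p_filesz;
    uint64_t p_memsz;
    uint64_t p_align;
} elf64_phdr_t;

/* Return the index of the checksum-table segment, or -1 if none exists. */
static int find_cksum_segment(const elf64_phdr_t *ph, size_t phnum)
{
    for (size_t i = 0; i < phnum; i++) {
        if (ph[i].p_type == PT_LOAD && ph[i].p_flags == ZXVL_PFLAGS_CKSUM)
            return (int)i;
    }
    return -1;
}
```

The same predicate identifies the one segment that must be skipped when digests are computed, since the table cannot cover itself.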

3. Table Format

See zxsign §3 for the full zxvl_checksum_table_t layout.

Key fields:

Field	Value
`magic`	`0x5A58564C` ("ZXVL")
`version`	`0x00000001`
`algo`	`0x00000001` (SHA-256)
`count`	Number of verified segments
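
A verifier would typically validate these fields before walking the entries. The sketch below covers only the four key fields listed above and assumes they are consecutive 32-bit words in that order; the authoritative layout is the `zxvl_checksum_table_t` described in zxsign §3, and the struct, macro, and function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ZXVL_CKSUM_MAGIC   0x5A58564Cu  /* "ZXVL" */
#define ZXVL_CKSUM_VERSION 0x00000001u
#define ZXVL_CKSUM_SHA256  0x00000001u

/* Assumed header shape: only the four key fields, each 32 bits wide. */
typedef struct {
    uint32_t magic;
    uint32_t version;
    uint32_t algo;
    uint32_t count;
} cksum_header_sketch_t;

/* Accept the table only if magic, version, and algorithm all match. */
static bool cksum_header_ok(const cksum_header_sketch_t *h)
{
    return h->magic == ZXVL_CKSUM_MAGIC &&
           h->version == ZXVL_CKSUM_VERSION &&
           h->algo == ZXVL_CKSUM_SHA256;
}
```

Under these assumptions, a table that is still all zeros, such as one in an image that was never processed by `zxsign`, fails the magic check.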

4. Excluded Segments

The segment containing .zxvl_checksums itself is excluded from the checksum computation. Hashing the table while building it would be circular. zxsign identifies and skips this segment automatically.

5. Kernel Re-verification

After the kernel initializes the PMM and VMM, verify_kernel_checksums re-reads the table from the HHDM virtual address and recomputes SHA-256 for each segment. This catches:

Memory corruption between loader verification and kernel execution.
A loader that passed verification but then modified segments before the jump.

A mismatch at this stage calls panic("sys: kernel segment checksum mismatch — image tampered").

How to Load Your Kernel with ZXFL

Document Revision: 26h1.0

For the most up-to-date information, see ZXFL Barebones.

This guide walks through every step required to produce a kernel image that ZXFL will accept and execute. Read the Boot Protocol and ZXVL Verification pages first for background.

Overview

ZXFL imposes five requirements on the kernel image before it will execute it:

Valid ELF64 for s390x, ET_EXEC, all PT_LOAD segments in the HHDM range.
Structural lock section at fixed offsets.
Handshake stub at the physical load base.
SHA-256 checksum table at load_min + 0x80000, patched by zxsign.
Boot protocol validation on entry.

Step 1 — Link for the HHDM

All PT_LOAD segments must have virtual addresses at or above CONFIG_KERNEL_VIRT_OFFSET (0xFFFF800000000000). ZXFL computes the physical load address by subtracting this offset from p_paddr:

pa = p_paddr - 0xFFFF800000000000

No AT() override is needed. Because there is no LMA override in the linker script, p_paddr equals p_vaddr, and the loader strips the HHDM offset to get the physical address.

A minimal linker script skeleton (modelled on arch/s390x/init/link.ld):

ENTRY(my_kernel_entry)

PHDRS {
    nucleus       PT_LOAD FLAGS(7);
    checksums_seg PT_LOAD FLAGS(4);
}

SECTIONS {
    /* Handshake stub — must be the first code at the physical load base */
    .zxfl_hs 0xFFFF800000100000 : {
        KEEP(*(.zxfl_hs))
    } :nucleus

    .text 0xFFFF800000100400 : {
        KEEP(*(.text.my_kernel_entry))
        *(.text .text.*)
    } :nucleus

    .rodata : ALIGN(8) { *(.rodata .rodata.*) } :nucleus
    .data   : ALIGN(8) { *(.data   .data.*)   } :nucleus

    /* Structural lock — fixed virtual offsets from load base */
    .zxfl_lock 0xFFFF800000170000 : {
        KEEP(*(.zxfl_lock))
    } :nucleus

    .bss : ALIGN(4096) {
        *(.bss .bss.*) *(COMMON)
    } :nucleus

    /* Checksum table — fixed virtual offset from load base */
    .zxvl_checksums 0xFFFF800000180000 : {
        KEEP(*(.zxvl_checksums))
    } :checksums_seg
}

The entry point (e_entry) must be at or above 0xFFFF800000040000 (HHDM + 256 KB). ZXFL rejects images with a lower entry point.
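
The two address rules in this step can be expressed as small C predicates. This is a sketch of the rules as documented, not the loader's code: `hhdm_to_phys`, `entry_ok`, and `ZXFL_MIN_ENTRY` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CONFIG_KERNEL_VIRT_OFFSET 0xFFFF800000000000ULL
/* Hypothetical macro name for the documented minimum: HHDM + 256 KB. */
#define ZXFL_MIN_ENTRY (CONFIG_KERNEL_VIRT_OFFSET + 0x40000ULL)

/* Strip the HHDM offset from a segment address. Returns false for an
 * address below the HHDM base, which the loader rejects. */
static bool hhdm_to_phys(uint64_t p_paddr, uint64_t *pa)
{
    if (p_paddr < CONFIG_KERNEL_VIRT_OFFSET)
        return false;
    *pa = p_paddr - CONFIG_KERNEL_VIRT_OFFSET;
    return true;
}

/* Entry points below HHDM + 256 KB are rejected. */
static bool entry_ok(uint64_t e_entry)
{
    return e_entry >= ZXFL_MIN_ENTRY;
}
```

Applied to the skeleton above, `.zxfl_hs` at `0xFFFF800000100000` maps to physical `0x100000`, and `.zxfl_lock` at `0xFFFF800000170000` lands `0x70000` bytes above it, matching the structural lock offset.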

Step 2 — Embed the Structural Lock

The lock constants can be placed directly in the linker script (as ZXFoundation™ does), or in a C translation unit:

/* In the linker script — simplest approach */
.zxfl_lock 0xFFFF800000170000 : {
    LONG(0xCCBBCC35)   /* hi */
    LONG(0x5A58464C)   /* sentinel "ZXFL" */
    . = . + 0x1000 - 8;
    LONG(0xE5664311)   /* lo */
} :nucleus

The loader verifies: ((hi << 32 | lo) ^ 0x3C1E0F8704B2D596) == 0xF0A5C3B2E1D49687.

Step 3 — Implement the Handshake Stub

The stub must be the very first code at the physical load base. It receives a nonce in %r2 and must return the response in %r2. ZXVL_HS_RESPONSE = 0xDEADBEEF0BADF00D.

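    .machinemode zarch
    .section .zxfl_hs, "ax"         # must match KEEP(*(.zxfl_hs)) in the linker script
    .globl __zxfl_handshake_stub
.equ ZXFL_SEED_HI, 0xA5F0C3E1
.equ ZXFL_SEED_LO, 0xB2D49687
.equ HS_RESPONSE_HI,  0xDEADBEEF
.equ HS_RESPONSE_LO,  0x0BADF00D

__zxfl_handshake_stub:
    llihf   %r0, ZXFL_SEED_HI       # %r0 = 64-bit seed (high half, then low half)
    iilf    %r0, ZXFL_SEED_LO
    xgr     %r2, %r0                # %r2 = nonce ^ seed
    lgr     %r0, %r2
    sllg    %r0, %r0, 17            # %r0 = value << 17
    srlg    %r1, %r2, 47            # %r1 = value >> 47
    ogr     %r0, %r1                # %r0 = rotl17(value)
    llihf   %r1, HS_RESPONSE_HI     # %r1 = ZXVL_HS_RESPONSE
    iilf    %r1, HS_RESPONSE_LO
    lgr     %r2, %r0
    agr     %r2, %r1                # %r2 = rotl17(value) + response
    br      %r14                    # return to the loader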

The stub must not clobber %r14 (return address) or %r15 (stack pointer). It must be callable with BRASL and return via BR %r14.

Step 4 — Reserve the Checksum Table

Declare the checksum table section. It is zero at link time; zxsign patches it after linking:

__attribute__((section(".zxvl_checksums"), used))
static volatile zxvl_checksum_table_t zxvl_cksum_table = { 0 };

Step 5 — Run `zxsign`

After linking, run the host tool on the ELF:

zxsign my_kernel.elf

This computes SHA-256 for each PT_LOAD segment (excluding .zxvl_checksums itself) and patches the table in-place. The ELF is now ready for DASD.

Step 6 — Write to DASD

Write the kernel ELF to the DASD volume as dataset CORE.ZXFOUNDATION.NUCLEUS. In sysres.conf:

DATASET CORE.ZXFOUNDATION.NUCLEUS  my_kernel.elf

See Build Targets for the full dasdload invocation.

Step 7 — Handle the Boot Protocol on Entry

Your kernel entry point receives zxfl_boot_protocol_t *boot in %r2. Minimum required validation:

[[noreturn]] void my_kernel_entry(zxfl_boot_protocol_t *boot) {
    if (!boot || boot->magic != ZXFL_MAGIC)
        for (;;) __asm__("nop");

    uint64_t expected = ZXVL_COMPUTE_TOKEN(boot->stfle_fac[0], boot->ipl_schid);
    if (boot->binding_token != expected)
        for (;;) __asm__("nop");

    if (boot->version != ZXFL_VERSION_4)
        for (;;) __asm__("nop");

    /* proceed */
}

All pointer fields in the protocol are HHDM virtual addresses. Do not treat them as physical addresses.

Checklist

#	Requirement	Enforced by
1	ELF64, `ET_EXEC`, `e_machine = 0x16` (EM_S390)	Loader ELF validation
2	All `PT_LOAD` `p_vaddr >= 0xFFFF800000000000`	Loader address check
3	`e_entry >= 0xFFFF800000040000`	Loader entry check
4	Structural lock at `load_min + 0x70000`	`zxvl_verify`
5	Handshake stub at `load_min + 0x0`	`zxvl_verify`
6	Checksum table at `load_min + 0x80000`, patched by `zxsign`	`zxvl_verify`
7	`boot->magic` validated on entry	Kernel
8	`boot->binding_token` validated on entry	Kernel

ZXFoundation™ Kernel Design

Document: ZXF-KRN-DESIGN-001
Revision: 26h1.0
Status: Draft
Date: 2026-05-09
Author: ZXFoundation™ Core Team

Document Scope

This document is the master architectural specification for the ZXFoundation™ kernel. It defines the design of every major subsystem — capability system, memory architecture, IPC, domain model, scheduler, time, trap handling, fault recovery, and the long-term implementation roadmap.

This document does not reference source files or API signatures. Those belong in per-subsystem reference documents. This document defines what the kernel is and why it is designed that way. Pseudocode and diagrams are used where precision is required.

1. Architectural Philosophy

1.1 Design Axioms

ZXFoundation™ is a capability-based object microkernel for IBM z/Architecture. Six axioms govern every design decision:

Minimal Trusted Computing Base. The kernel enforces only what cannot be enforced elsewhere: memory isolation, capability validity, and CPU scheduling. Everything else is a server domain.
Capability-First. No resource may be accessed without a valid capability. There is no ambient authority. A thread that holds no capabilities can do nothing.
No Implicit Trust. Server domains are untrusted by default, including system-provided ones. Trust is established by capability grant, not by identity or position in a hierarchy.
z/Architecture Native. The kernel exploits z/Architecture hardware features — DAT, storage keys, SIGP, TOD clock, CPU timer, channel subsystem — directly. No portability layer is maintained.
SysV ABI Only. The kernel defines its own system call surface. No POSIX compatibility layer exists or is planned. The SysV calling convention (GPRs 2–7 for arguments, GPR 2 for return) is the sole ABI.
Extreme Redundancy. The kernel must not panic on a faulting server domain or a recoverable hardware error. Fault containment and recovery are first-class design requirements, not afterthoughts.

1.2 Threat Model

Threat	Mitigation
Untrusted user domain reads kernel memory	Separate DAT address space per domain; kernel ASCE never loaded in user state
Untrusted domain forges a capability	Capabilities are kernel-managed integers; user space never constructs them
Faulting server domain corrupts kernel state	Server domains run in user state; a fault traps to the kernel, not into it
Hardware storage error corrupts a page	Machine-check recovery classifies and isolates the affected frame
Capability leak via IPC	Capability transfer is move-semantics; sender loses the capability atomically
Denial of service via busy loop	Scheduler enforces quanta; CPU timer interrupt is non-maskable by user state

1.3 Kernel / User Boundary

The kernel runs exclusively in supervisor state (PSW problem-state bit, bit 15, = 0). All server domains and user processes run in problem state (PSW bit 15 = 1).

The boundary is enforced by z/Architecture hardware:

DAT translates user virtual addresses through a per-domain ASCE (CR1 is loaded with the domain's ASCE on context switch).
Storage keys restrict memory access to pages owned by the domain.
Privileged instructions (LPSWE, SPX, SIGP, SSCH, etc.) trap to the kernel when executed in problem state.

1.4 Layered Architecture

┌─────────────────────────────────────────────────────────────────┐
│  User Processes  (problem state, own ASCE, own capability table) │
├─────────────────────────────────────────────────────────────────┤
│  Server Domains  (problem state, own ASCE, own capability table) │
│  [ block I/O | filesystem | network | console | device mgr ]    │
├─────────────────────────────────────────────────────────────────┤
│  Kernel TCB  (supervisor state, kernel ASCE)                    │
│  ┌──────────┬──────────┬──────────┬──────────┬───────────────┐  │
│  │ Capability│  IPC     │ Scheduler│  Memory  │ Trap / Syscall│  │
│  │  System  │ Subsystem│          │  Manager │   Dispatch    │  │
│  └──────────┴──────────┴──────────┴──────────┴───────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  KOMS · PMM · VMM · Slab · SMP · RCU · Sync Primitives  │   │
│  └──────────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────────┤
│  z/Architecture Hardware                                        │
│  [ DAT · Storage Keys · SIGP · TOD · CPU Timer · CSS · MCCK ]  │
└─────────────────────────────────────────────────────────────────┘

2. Capability System

2.1 Definition

A capability is an unforgeable, kernel-managed token that grants a specific set of rights to a specific kernel object. Possession of a capability is both necessary and sufficient to exercise the rights it encodes. There is no access control list, no ambient authority, and no privilege escalation path outside of explicit capability grant.

2.2 Capability Token Structure

A capability token is a 64-bit opaque integer. User space treats it as an integer handle into its own capability table. The kernel interprets the internal encoding; user space never constructs or decodes it.

 63      56 55      40 39      24 23       0
 ┌────────┬──────────┬──────────┬──────────┐
 │  type  │  rights  │   gen    │  index   │
 │  8 bit │  16 bit  │  16 bit  │  24 bit  │
 └────────┴──────────┴──────────┴──────────┘

Field	Width	Meaning
`type`	8	Object type (maps to `kobj_type_t::type_id`)
`rights`	16	Bitmask of granted rights
`gen`	16	Generation counter; incremented on revocation
`index`	24	Index into the kernel's global object table

The gen field enables generation-based revocation: when a capability is revoked, the kernel increments the generation counter on the target object. Any token whose gen field does not match the current object generation is invalid, regardless of index or rights.
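
The field layout can be illustrated with a pack/unpack pair in C. This sketch follows the bit positions in the diagram above; `cap_fields_t`, `cap_pack`, and `cap_unpack` are hypothetical names, and the code is a model of the encoding rather than the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t  type;    /* bits 63..56 */
    uint16_t rights;  /* bits 55..40 */
    uint16_t gen;     /* bits 39..24 */
    uint32_t index;   /* bits 23..0; only the low 24 bits are kept */
} cap_fields_t;

/* Assemble the four fields into one 64-bit token. */
static uint64_t cap_pack(cap_fields_t f)
{
    return ((uint64_t)f.type   << 56) |
           ((uint64_t)f.rights << 40) |
           ((uint64_t)f.gen    << 24) |
           ((uint64_t)f.index & 0xFFFFFFu);
}

/* Split a token back into its fields. */
static cap_fields_t cap_unpack(uint64_t token)
{
    cap_fields_t f = {
        .type   = (uint8_t)(token >> 56),
        .rights = (uint16_t)(token >> 40),
        .gen    = (uint16_t)(token >> 24),
        .index  = (uint32_t)(token & 0xFFFFFFu),
    };
    return f;
}
```

Only the kernel would run code like this; user space sees slot indices and never the packed value.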

2.3 Rights Model

Rights are type-specific. The following rights are defined at the kernel level; subsystems may define additional type-specific rights in the remaining bits (7–15).

Bit	Name	Meaning
0	`CAP_READ`	Read the object's state
1	`CAP_WRITE`	Modify the object's state
2	`CAP_EXEC`	Execute / invoke the object
3	`CAP_GRANT`	Derive and transfer a capability to this object
4	`CAP_REVOKE`	Revoke derived capabilities
5	`CAP_MAP`	Map the object's memory into an address space
6	`CAP_DESTROY`	Destroy the object
7–15	(none)	Reserved / type-specific

Derivation rule: A derived capability may only have a subset of the parent's rights. Rights can never be amplified. A domain that holds CAP_READ | CAP_GRANT may derive a capability with CAP_READ only.
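
The derivation rule reduces to a subset test on the rights bitmask. The sketch below uses the bit assignments from the table above; the `CAP_GRANT` precondition is taken from the `cap_derive` pseudocode in §2.6, and `cap_derive_allowed` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    CAP_READ    = 1u << 0,
    CAP_WRITE   = 1u << 1,
    CAP_EXEC    = 1u << 2,
    CAP_GRANT   = 1u << 3,
    CAP_REVOKE  = 1u << 4,
    CAP_MAP     = 1u << 5,
    CAP_DESTROY = 1u << 6,
};

/* A derivation is allowed only if the parent carries CAP_GRANT and the
 * requested rights add nothing the parent lacks. */
static bool cap_derive_allowed(uint16_t parent, uint16_t requested)
{
    if (!(parent & CAP_GRANT))
        return false;
    return (uint16_t)(requested & (uint16_t)~parent) == 0;
}
```

The expression `requested & ~parent` isolates any right that is requested but not held, so a non-zero result means the request would amplify rights.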

2.4 Capability Table

Each domain owns a capability table — a flat, kernel-managed array of capability slots. The table is allocated at domain creation with a fixed capacity. User space references capabilities by their slot index (a small integer handle).

Domain Capability Table
┌───────┬──────────────────────────────────────────────┐
│ Slot  │ Capability Token (64-bit, kernel-interpreted) │
├───────┼──────────────────────────────────────────────┤
│   0   │ Self capability (CAP_READ | CAP_WRITE)        │
│   1   │ IPC endpoint capability (CAP_EXEC)            │
│   2   │ Memory region capability (CAP_READ | CAP_MAP) │
│   3   │ (empty)                                       │
│  ...  │  ...                                          │
│  N-1  │ (empty)                                       │
└───────┴──────────────────────────────────────────────┘

The capability table is allocated from a dedicated slab cache backed by pages with a non-zero s390x storage key. This provides hardware-enforced isolation: a domain cannot read another domain's capability table even if it obtains a pointer to it, because the storage key check will fault.

2.5 Capability Lifecycle

                    cap_mint(object, rights)
                              │
                              ▼
                    ┌─────────────────┐
                    │  CAPABILITY     │
                    │  VALID          │◄──── cap_derive(parent, subset_rights)
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
         cap_transfer    cap_revoke    object destroyed
              │              │              │
              ▼              ▼              ▼
       moved to         gen++ on        all tokens
       receiver's       object;         with this
       table            all tokens      index become
                        with old gen    invalid
                        invalid

2.6 Core Operations (Pseudocode)

// Mint a new capability for an existing kernel object.
// Called only from kernel context; never directly by user space.
cap_mint(object, rights):
    slot = cap_table_alloc(current_domain.cap_table)
    token.type   = object.type_id
    token.rights = rights
    token.gen    = object.cap_gen
    token.index  = object.global_index
    current_domain.cap_table[slot] = token
    return slot

// Derive a capability with reduced rights.
// Syscall: cap_derive(src_slot, new_rights) -> dst_slot
cap_derive(src_slot, new_rights):
    token = cap_lookup(current_domain, src_slot)
    assert token.rights & CAP_GRANT
    assert (new_rights & ~token.rights) == 0   // no amplification
    dst_slot = cap_table_alloc(current_domain.cap_table)
    new_token = token
    new_token.rights = new_rights
    current_domain.cap_table[dst_slot] = new_token
    return dst_slot

// Revoke all capabilities derived from an object.
// Increments the generation counter; all existing tokens become stale.
cap_revoke(object):
    atomic_inc(object.cap_gen)
    // No table scan needed: stale tokens fail at cap_lookup time.

// Look up and validate a capability slot.
// Returns the target object pointer, or fails.
cap_lookup(domain, slot):
    assert slot < domain.cap_table.capacity
    token = domain.cap_table[slot]
    assert token.type != CAP_TYPE_INVALID
    object = global_object_table[token.index]
    assert object != null
    assert object.cap_gen == token.gen    // generation check
    return object, token.rights
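
The interaction between `cap_revoke` and `cap_lookup` can be demonstrated with a small single-threaded model in C. Everything here is illustrative: the table size, type names, and function names are hypothetical, and the plain increment stands in for the atomic increment in the pseudocode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint16_t cap_gen; bool live; } obj_sketch_t;
typedef struct { uint32_t index; uint16_t gen; uint16_t rights; } tok_sketch_t;

#define OBJ_MAX 4
static obj_sketch_t object_table[OBJ_MAX];

/* Snapshot the object's current generation into a new token. */
static tok_sketch_t mint(uint32_t index, uint16_t rights)
{
    tok_sketch_t t = { index, object_table[index].cap_gen, rights };
    return t;
}

/* Bump the generation; no capability table is scanned. */
static void revoke(uint32_t index)
{
    object_table[index].cap_gen++;
}

/* A token is usable only while its generation matches the object's. */
static bool lookup_ok(tok_sketch_t t)
{
    return t.index < OBJ_MAX &&
           object_table[t.index].live &&
           object_table[t.index].cap_gen == t.gen;
}
```

Revocation costs one increment regardless of how many tokens exist; each stale token is rejected the next time it is looked up.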

2.7 KOMS Integration

Every kobject_t is a capability target. The KOMS type_id field maps directly to the capability token type field. The KOMS global object table (indexed by token.index) is the authoritative registry of all live kernel objects.

The capability system does not replace KOMS reference counting. A valid capability implies the object is alive (generation check passes only while the object is alive). When an object is destroyed, its generation is incremented, invalidating all capabilities before the final koms_put.

┌─────────────────────────────────────────────────────┐
│  Capability System                                  │
│  token.index ──────────────────────────────────┐   │
│  token.gen   ──── generation check ────────┐   │   │
└────────────────────────────────────────────│───│───┘
                                             │   │
┌────────────────────────────────────────────│───│───┐
│  KOMS                                      │   │   │
│  global_object_table[index] ───────────────┘   │   │
│  kobject_t::cap_gen ───────────────────────────┘   │
│  kobject_t::ref (kref_t) — independent lifetime    │
└─────────────────────────────────────────────────────┘

3. Memory Architecture

Memory is the most critical subsystem in ZXFoundation™. Every other subsystem depends on it. This section defines strict requirements and invariants for every memory layer. Violations of these requirements are kernel panics, not recoverable errors.

3.1 Physical Memory Manager (PMM)

3.1.1 Zone Model

Physical memory is partitioned into two zones at boot time. The partition is permanent; zones are never merged or resized after pmm_init.

Zone	Range	Purpose
`ZONE_DMA`	`[0, 16 MB)`	Channel I/O buffers (31-bit CDA constraint)
`ZONE_NORMAL`	`[16 MB, RAM limit)`	General kernel and domain allocations

The zone exists because of a hardware constraint: the Channel Data Address (CDA) field in a format-1 CCW is 31 bits, so all I/O buffers submitted to the channel subsystem must reside below 0x80000000 (2 GB). ZONE_DMA covers this range conservatively: its 16 MB limit also satisfies the 24-bit data address of format-0 CCWs.

3.1.2 Buddy Allocator

Each zone maintains a buddy allocator with orders 0 through MAX_ORDER (10), covering block sizes from 4 KB (order 0) to 4 MB (order 10).

Zone free lists (per order):

Order 0  (4 KB):  [pfn_a] → [pfn_b] → [pfn_c] → ∅
Order 1  (8 KB):  [pfn_d] → ∅
Order 2  (16 KB): ∅
...
Order 10 (4 MB):  [pfn_e] → ∅

Buddy invariants (non-negotiable):

Every free block is buddy-aligned: pfn % (1 << order) == 0.
Coalescing is mandatory on every free. If a block's buddy is also free, they are merged into a block of order+1, recursively up to MAX_ORDER.
A block may only be freed at the same order it was allocated. Mismatched order corrupts the buddy tree and is a kernel panic.
Free blocks are poisoned with PF_POISON. Any allocation that returns a non-poisoned block indicates a double-allocation bug.
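
The alignment and coalescing invariants rest on simple PFN arithmetic. The sketch below uses the conventional buddy computation, in which a block and its buddy differ only in bit `order` of the PFN; this section does not spell that formula out, so treat it as an assumption, and note that the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invariant 1: a block of 2^order pages starts on a 2^order-page boundary. */
static bool buddy_aligned(uint64_t pfn, unsigned order)
{
    return (pfn & ((1ULL << order) - 1)) == 0;
}

/* The buddy of a block differs from it only in bit `order` of the PFN. */
static uint64_t buddy_of(uint64_t pfn, unsigned order)
{
    return pfn ^ (1ULL << order);
}

/* First PFN of the order+1 block formed by merging a block with its buddy. */
static uint64_t merged_pfn(uint64_t pfn, unsigned order)
{
    return pfn & ~(1ULL << order);
}
```

For instance, the order-2 block at PFN 12 has its buddy at PFN 8, and merging the two yields an order-3 block starting at PFN 8.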

3.1.3 Per-CPU Page Cache

Order-0 (4 KB) allocations are served from a per-CPU cache to avoid zone lock contention on the hot path.

Per-CPU cache (one per zone per CPU):

  count = 7
  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
  │pfn_0│pfn_1│pfn_2│pfn_3│pfn_4│pfn_5│pfn_6│  -  │  -  │
  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
  ← count                                    PCP_HIGH=16 →

  Refill: when count == 0, acquire zone lock, pop PCP_BATCH=8 pages.
  Drain:  when count > PCP_HIGH, acquire zone lock, push PCP_BATCH pages.

The per-CPU cache is accessed with IRQs disabled. No spinlock is needed because the cache is strictly per-CPU, and IRQ handlers that allocate memory must use ZX_GFP_ATOMIC, which bypasses the per-CPU cache: the request goes to the zone free lists under the zone lock and falls back to the zone's atomic reserve.

3.1.4 Atomic Reserve

Each zone holds PMM_ATOMIC_RESERVE = 64 pages back from the buddy allocator. These pages are only accessible to callers that pass ZX_GFP_ATOMIC. This lets hard-IRQ context allocations (e.g., channel I/O completion handlers) succeed under memory pressure, as long as the reserve itself is not exhausted.

Strict requirement: ZX_GFP_ATOMIC must only be used from hard-IRQ context. Using it from process context to bypass memory pressure is prohibited and will be detected by a context check in debug builds.

3.1.5 PMM Allocation Flow

pmm_alloc_pages(gfp, order):
    if gfp & ZX_GFP_ATOMIC:
        goto zone_alloc          // bypass per-CPU cache
    if order == 0:
        page = pcp_pop(current_cpu, zone)
        if page: return page
        pcp_refill(current_cpu, zone)
        return pcp_pop(current_cpu, zone)
zone_alloc:
    acquire zone.lock (irqsave)
    for o in [order .. MAX_ORDER]:
        pfn = free_area_pop(zone, o)
        if pfn != INVALID:
            split down to order
            release zone.lock
            if gfp & ZX_GFP_ZERO: zero the 2^order pages at pfn
            return pfn_to_page(pfn)
    if gfp & ZX_GFP_ATOMIC and zone.atomic_reserve > 0:
        // draw from reserve
        ...
    release zone.lock
    return nullptr              // OOM

3.1.6 PMM Strict Requirements

#	Requirement
PMM-1	`pmm_free_page/pages` must never be called on a page not in `PF_BUDDY` state. Double-free is a kernel panic.
PMM-2	The order passed to `pmm_free_pages` must match the order used at allocation.
PMM-3	Allocation from hard-IRQ context requires `ZX_GFP_ATOMIC`. Any other flag in IRQ context is a kernel panic.
PMM-4	`zx_mem_map[]` is allocated during `pmm_init` and never freed. It must not be modified after init except by the PMM itself.
PMM-5	The per-CPU cache must be drained to the zone before a CPU goes offline.
PMM-6	`ZONE_DMA` and `ZONE_NORMAL` boundaries are immutable after `pmm_init`.

3.2 Virtual Memory Manager (VMM)

3.2.1 Address Space Layout

Virtual Address Space (64-bit z/Architecture, 5-level DAT)

0x0000_0000_0000_0000 ┌──────────────────────────────────────┐
                      │  User / Domain space                 │
                      │  (per-domain ASCE, problem state)    │
0x0000_7FFF_FFFF_FFFF └──────────────────────────────────────┘
                        [ translation exception — unmapped ]
0xFFFF_8000_0000_0000 ┌──────────────────────────────────────┐
                      │  HHDM — Higher-Half Direct Map       │
                      │  PA 0x0 → VA 0xFFFF_8000_0000_0000   │
                      │  Mapped with EDAT-1 (1 MB pages)     │
0xFFFF_C000_0000_0000 ├──────────────────────────────────────┤
                      │  vmalloc / ioremap region            │
                      │  Virtually contiguous, phys-discontig│
0xFFFF_E000_0000_0000 ├──────────────────────────────────────┤
                      │  Kernel image + BSS + static data    │
0xFFFF_FFFF_FFFF_FFFF └──────────────────────────────────────┘

The HHDM offset 0xFFFF_8000_0000_0000 places the kernel in R1 entry 2047 (the topmost Region-First entry), cleanly separating kernel (R1[2047]) from user space (R1[0..2046]) at the highest table level.
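
The index claim can be checked with one shift. The region-first index is the top 11 bits of a 64-bit virtual address, so each of the 2048 entries spans 2^53 bytes; `r1_index` is a hypothetical helper name used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Region-first index: the top 11 bits of a 64-bit virtual address. */
static unsigned r1_index(uint64_t va)
{
    return (unsigned)(va >> 53);
}
```

The HHDM base `0xFFFF_8000_0000_0000` maps to index 2047, while the top of the user range, `0x0000_7FFF_FFFF_FFFF`, maps to index 0.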

3.2.2 vm_space_t and VMA Tree

Each address space is represented by a vm_space_t. The kernel has one (kernel_vm_space). Each domain has its own, created at domain birth and destroyed at domain death.

VMAs are indexed by an augmented RB-tree keyed on vm_start. Each node carries subtree_max_end, the maximum vm_end in its subtree, which enables O(log n) free-gap search for vmalloc and O(log n) overlap detection.

VMA Tree (augmented RB-tree):

                  [0xC000, 0xE000, max_end=0xF000]
                 /                                 \
  [0xA000, 0xB000, max_end=0xB000]    [0xE000, 0xF000, max_end=0xF000]

  Each node: vm_start (key), vm_end, subtree_max_end, vm_prot, rb_node

Locking model:

Readers call vmm_find_vma inside rcu_read_lock(). Fully lockless. The RCU-protected tree guarantees that a reader always sees a consistent snapshot, even while a writer is modifying the tree.
Writers acquire aug_root.lock (spinlock, irqsave) before any insert, remove, or augmentation update.

A per-CPU hint cache stores the last-found VMA per CPU. On a cache hit (the faulting address falls within the cached VMA), the tree walk is skipped entirely — O(1) on the hot page-fault path.

3.2.3 VMM Strict Requirements

#	Requirement
VMM-1	All VMA modifications must hold `aug_root.lock` (spinlock, irqsave).
VMM-2	All VMA reads must be inside `rcu_read_lock()`.
VMM-3	VMAs must not overlap. `vmm_insert_vma` rejects overlapping ranges.
VMM-4	`vm_start` and `vm_end` must be page-aligned (4 KB boundary).
VMM-5	A `vm_space_t` must not be destroyed while any VMA remains mapped.
VMM-6	The kernel ASCE must never be loaded into CR1 while a CPU executes in problem state on behalf of a domain.
VMM-7	EDAT large pages (1 MB, 2 GB) must not be used for user domain mappings without an explicit `CAP_MAP` capability granting large-page access.
VMM-8	`vmm_remove_vma` must unmap all backing pages and perform a TLB invalidation (IPTE/IDTE) before returning.

3.2.4 Domain Address Space Creation

When a new domain is created, the kernel allocates a fresh vm_space_t and a new R1 page table. The kernel HHDM mapping is not shared into domain address spaces. Domains have no visibility into kernel virtual addresses.

Domain address space creation:

  alloc vm_space_t
  alloc R1 table (16 KB, order=2, ZONE_NORMAL)
  initialize all R1 entries as invalid (Z_I_BIT set)
  set vm_space.pgtbl_root = phys(R1)
  set vm_space.asce = encode_asce(phys(R1), DT=R1, TL=3)   // TL=3 designates the full 2048-entry table
  // Domain's ASCE is loaded into CR1 on context switch to this domain.
  // Kernel ASCE remains in a separate register save area.
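
The `encode_asce` step can be sketched in C using the architected ASCE layout, in which the table origin occupies the high-order bits, the designation type (DT) sits in the two bits above the lowest two, and the table length (TL) occupies the lowest two bits. TL is a 2-bit field counting 512-entry units minus one, so a full 2048-entry region-first table is encoded as 3. The macro names and the exact signature below are assumptions for illustration; control bits such as the private-space and real-space controls are left at zero.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions, numbered from the least-significant bit:
 * table origin in bits 63..12, DT in bits 3..2, TL in bits 1..0. */
#define ASCE_DT_REGION1 0x3u   /* region-first table */
#define ASCE_TL_FULL    0x3u   /* (TL + 1) * 512 entries = 2048 */

/* Combine a 4 KB-aligned table origin with the DT and TL fields. */
static uint64_t encode_asce(uint64_t table_phys, unsigned dt, unsigned tl)
{
    return (table_phys & ~0xFFFULL) |
           ((uint64_t)(dt & 0x3u) << 2) |
           (uint64_t)(tl & 0x3u);
}
```

For a region-first table at physical `0x200000`, this yields `0x20000F`.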

3.3 Slab and Object Allocator

3.3.1 Magazine-Depot Model

The slab allocator uses a magazine-depot architecture for per-CPU caching of fixed-size objects.

Per-CPU layer (no lock needed, IRQs disabled):
  ┌──────────────────────────────────────────┐
  │  Hot magazine  [obj0│obj1│obj2│...│objN] │  ← alloc/free here
  │  Cold magazine [obj0│obj1│...          ] │  ← swap with hot when full/empty
  └──────────────────────────────────────────┘
           ↕ swap (acquire depot lock)
Global depot layer (spinlock):
  ┌──────────────────────────────────────────┐
  │  Full magazines:  [mag_a][mag_b][mag_c]  │
  │  Empty magazines: [mag_d][mag_e]         │
  └──────────────────────────────────────────┘
           ↕ slab page allocation (acquire zone lock)
PMM (buddy allocator)

Allocation: pop from hot magazine. If empty, swap hot/cold. If cold also empty, fetch a full magazine from the depot. If depot has none, allocate a new slab page from PMM and populate a magazine.

Free: push to hot magazine. If full, swap hot/cold. If cold also full, return the cold magazine to the depot as a full magazine.

3.3.2 Storage Key Isolation

Each slab cache may be created with a non-zero s390x storage key. Pages backing that cache are assigned the specified key. A domain whose PSW access key does not match receives a protection exception when it attempts to store into those pages, and also when it attempts to fetch from them if the key's fetch-protection bit is set.

Capability table pages use a dedicated storage key (key 1 by convention). This provides hardware-enforced isolation: even if a domain obtains a pointer to another domain's capability table, the storage key check will fault before any data is read.

3.3.3 Slab Strict Requirements

#	Requirement
SLAB-1	`kmem_cache_alloc` must not be called from hard-IRQ context unless the cache was created with atomic support. Use `kmalloc(ZX_GFP_ATOMIC)` from IRQ context.
SLAB-2	`kmem_cache_free` must only be called with a pointer returned by `kmem_cache_alloc` on the same cache. Cross-cache free is undefined behavior.
SLAB-3	Freed objects are poisoned with a sentinel pattern. Use of an object after it is freed and before it is reallocated is detected in debug builds.
SLAB-4	`kmem_cache_destroy` must only be called after all objects have been returned. Destroying a cache with outstanding objects is a kernel panic.

3.4 Capability Memory

Capability tables are the most security-sensitive data structure in the kernel. They receive special treatment beyond the standard slab rules.

3.4.1 Allocation

Capability tables are allocated from a dedicated slab cache:

Storage key: 1 (non-zero, distinct from general kernel data at key 0).
GFP flags: ZX_GFP_NORMAL only. Capability tables are never allocated from the atomic reserve.
Pages are marked PF_PINNED immediately after allocation. They are never reclaimed, swapped, or migrated.

3.4.2 Lifetime

A capability table is created atomically with its domain. It is destroyed atomically when the domain dies. The destruction sequence is:

domain_destroy(domain):
    // 1. Freeze the domain: no new capabilities may be minted into it.
    domain.state = DOMAIN_DYING
    // 2. Revoke all capabilities in the table.
    for slot in domain.cap_table:
        if domain.cap_table[slot].type != CAP_TYPE_INVALID:
            cap_revoke_slot(domain, slot)
    // 3. Free the table pages.
    kmem_cache_free(cap_table_cache, domain.cap_table)
    // 4. Drop the domain kobject reference.
    koms_put(domain.kobj)

Step 2 increments the generation counter on every object the domain held capabilities to. This atomically invalidates all derived capabilities that other domains may have received from this domain.
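
The generation-counter invalidation described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the kernel's implementation: `demo_object_t`, `demo_cap_t`, `demo_cap_mint`, `demo_cap_valid`, and `demo_object_revoke` are hypothetical names, and a real kernel would increment the counter with an atomic operation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object header: every capability target carries a generation. */
typedef struct {
    uint64_t generation;
} demo_object_t;

/* Hypothetical token: remembers the generation at which it was minted. */
typedef struct {
    demo_object_t *object;
    uint64_t       generation;
} demo_cap_t;

/* Mint a token bound to the object's current generation. */
static demo_cap_t demo_cap_mint(demo_object_t *obj)
{
    return (demo_cap_t){ .object = obj, .generation = obj->generation };
}

/* A token is valid only while its generation matches the object's. */
static bool demo_cap_valid(const demo_cap_t *cap)
{
    return cap->object != NULL && cap->generation == cap->object->generation;
}

/* Revocation bumps the generation, which invalidates every outstanding
 * token (including copies held by other domains) in one step. */
static void demo_object_revoke(demo_object_t *obj)
{
    obj->generation++;
}
```

Because validity is checked against the object rather than against a list of holders, revocation does not need to locate the derived capabilities; they simply stop matching.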

3.4.3 Capability Memory Strict Requirements

#	Requirement
CAP-MEM-1	Capability table pages must be `PF_PINNED`. They are never reclaimed.
CAP-MEM-2	Capability table pages use storage key 1. General kernel data uses key 0.
CAP-MEM-3	Capability table destruction must complete before the domain's `vm_space_t` is torn down.
CAP-MEM-4	No capability token may be stored in user-accessible memory. The kernel never copies a raw token to user space.

3.5 Memory for IPC

IPC memory is designed to minimize allocation on the critical path.

3.5.1 Synchronous IPC — Zero Allocation

Small synchronous messages (up to 8 × 64-bit registers) are passed entirely in CPU registers. The kernel performs a direct thread switch: the sender's GPRs 2–9 become the receiver's GPRs 2–9. No kernel buffer is allocated. No memory is touched beyond the two threads' kernel stacks.

3.5.2 Asynchronous Queue — Fixed-Capacity Ring Buffer

Each IPC endpoint that supports async messaging owns a fixed-capacity ring buffer, allocated from the slab at endpoint creation time. The capacity is specified at creation and never changes.

Async message queue (ring buffer):

  head ──►  ┌──────────────────────────────────────────┐
            │  msg[0]: tag | regs[8] | caps[4]         │
            │  msg[1]: tag | regs[8] | caps[4]         │
            │  msg[2]: (empty)                         │
            │  ...                                     │
  tail ──►  │  msg[N-1]: (empty)                       │
            └──────────────────────────────────────────┘
  capacity = N (fixed at endpoint creation)
  each message slot = 136 bytes (8 + 8×8 + 4×8 = 104 bytes of payload, padded to 136)

The ring buffer is allocated with ZX_GFP_NORMAL and is never reallocated. If the queue is full, the send operation returns ERR_QUEUE_FULL to the sender. The sender is responsible for retry or backpressure.
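
A fixed-capacity queue with this full-queue behavior might look like the sketch below. All `demo_` names and the error values are illustrative assumptions; locking is omitted, and the slot layout is simplified to the three message fields without padding.

```c
#include <stddef.h>
#include <stdint.h>

enum { DEMO_OK = 0, DEMO_ERR_QUEUE_FULL = -1, DEMO_ERR_QUEUE_EMPTY = -2 };

typedef struct {
    uint64_t tag;
    uint64_t regs[8];
    uint64_t caps[4];
} demo_msg_t;

typedef struct {
    demo_msg_t *slots;     /* storage allocated once, at endpoint creation */
    size_t      capacity;  /* fixed for the life of the endpoint */
    size_t      head;      /* index of the oldest queued message */
    size_t      count;     /* number of queued messages */
} demo_ring_t;

static void demo_ring_init(demo_ring_t *r, demo_msg_t *storage, size_t capacity)
{
    r->slots = storage;
    r->capacity = capacity;
    r->head = 0;
    r->count = 0;
}

/* Never grows the buffer: a full queue is reported to the caller. */
static int demo_ring_enqueue(demo_ring_t *r, const demo_msg_t *msg)
{
    if (r->count == r->capacity)
        return DEMO_ERR_QUEUE_FULL;
    r->slots[(r->head + r->count) % r->capacity] = *msg;
    r->count++;
    return DEMO_OK;
}

static int demo_ring_dequeue(demo_ring_t *r, demo_msg_t *out)
{
    if (r->count == 0)
        return DEMO_ERR_QUEUE_EMPTY;
    *out = r->slots[r->head];
    r->head = (r->head + 1) % r->capacity;
    r->count--;
    return DEMO_OK;
}
```

Keeping `head` and `count` instead of separate head and tail indices lets every slot be used without an ambiguous full/empty state.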

3.5.3 Shared Memory — Zero-Copy Large Transfer

For bulk data transfer, the sender grants a CAP_MAP capability on a VMA. The receiver maps the VMA into its own address space via vmm_insert_vma. No kernel buffer is involved. The physical pages are shared between the two address spaces via DAT table entries pointing to the same physical frames.

Shared memory transfer:

  Sender domain                    Receiver domain
  vm_space_t                       vm_space_t
  ┌──────────────────┐             ┌──────────────────┐
  │ VMA [A, B)       │             │ VMA [C, D)       │
  │ prot: R/W        │             │ prot: R (derived)│
  └────────┬─────────┘             └────────┬─────────┘
           │ DAT entries                    │ DAT entries
           └──────────────┬─────────────────┘
                          ▼
                  Physical frames [P0, P1, ...]

The receiver's mapping uses the rights from the CAP_MAP capability. If the capability grants only CAP_READ, the receiver's DAT entries are read-only. A write attempt generates a protection exception in the receiver's domain, not a kernel panic.
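
One way to derive the receiver's page protection from the granted rights is sketched below. The bit values and the `demo_` names are assumptions made for illustration (the guide names the rights but not their encoding), as is the choice to mask the result with the sender's own VMA protection.

```c
#include <stdint.h>

/* Hypothetical bit assignments for capability rights and page protection. */
#define DEMO_CAP_READ   (UINT32_C(1) << 0)
#define DEMO_CAP_WRITE  (UINT32_C(1) << 1)

#define DEMO_PROT_READ  (UINT32_C(1) << 0)
#define DEMO_PROT_WRITE (UINT32_C(1) << 1)

/* The receiver never gets more access than the capability grants,
 * and (by assumption) never more than the sender's VMA itself allows. */
static uint32_t demo_derive_prot(uint32_t sender_prot, uint32_t cap_rights)
{
    uint32_t allowed = 0;

    if (cap_rights & DEMO_CAP_READ)
        allowed |= DEMO_PROT_READ;
    if (cap_rights & DEMO_CAP_WRITE)
        allowed |= DEMO_PROT_WRITE;
    return sender_prot & allowed;
}
```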

4. IPC Subsystem

4.1 Design Goals

IPC is the primary communication mechanism between all domains. Because ZXFoundation™ is a microkernel, IPC performance directly determines system throughput. The design targets:

Synchronous fastpath latency: < 1 µs on z/Architecture (single hop, no contention, small message).
Async queue throughput: limited only by memory bandwidth and ring buffer capacity.
Zero kernel allocation on the synchronous fastpath.
Capability transfer atomicity: a capability moved in a message is never visible in both sender and receiver simultaneously.

4.2 IPC Endpoint

An IPC endpoint is a kernel object (kobject_t, type KOBJ_TYPE_ENDPOINT). It is the rendezvous point for IPC. A domain that wishes to receive messages creates an endpoint and publishes a capability to it.

Endpoint state:

  ENDPOINT_IDLE      — no sender or receiver waiting
  ENDPOINT_RECV_WAIT — a receiver thread is blocked, waiting for a message
  ENDPOINT_SEND_WAIT — one or more sender threads are queued (async overflow)

An endpoint is addressed exclusively by capability. A domain that does not hold a capability to an endpoint cannot send to or receive from it.

4.3 Synchronous Fastpath

The synchronous fastpath is the primary IPC mechanism. It is used when the receiver is already blocked on the endpoint.

Synchronous IPC fastpath:

  Sender                    Kernel                    Receiver
    │                          │                          │
    │  ipc_call(ep_cap,        │                          │
    │    regs[0..7])           │                          │
    ├─────────────────────────►│                          │
    │                          │  cap_lookup(ep_cap)      │
    │                          │  endpoint.state ==       │
    │                          │    RECV_WAIT?  YES       │
    │                          │                          │
    │                          │  copy regs[0..7] to      │
    │                          │  receiver kernel stack   │
    │                          │                          │
    │                          │  transfer caps (if any)  │
    │                          │  from sender table to    │
    │                          │  receiver table          │
    │                          │                          │
    │                          │  direct thread switch:   │
    │  [blocked]               │  sender → BLOCKED        │
    │                          │  receiver → RUNNING      │
    │                          ├─────────────────────────►│
    │                          │                          │  regs[0..7]
    │                          │                          │  available
    │                          │                          │
    │                          │  receiver calls          │
    │                          │  ipc_reply(regs[0..7])   │
    │                          │◄─────────────────────────┤
    │                          │  direct thread switch:   │
    │                          │  receiver → BLOCKED      │
    │◄─────────────────────────┤  sender → RUNNING        │
    │  regs[0..7] = reply      │                          │

The direct thread switch bypasses the scheduler run queue entirely. The kernel saves the sender's context, restores the receiver's context, and returns to user space in the receiver. This is the seL4-style fastpath.

Fastpath conditions (all must hold; if any fails, the call falls back to the slow path):

Endpoint state is RECV_WAIT.
Message fits in 8 registers (no large payload).
At most 4 capability handles transferred.
Receiver thread is on the same CPU (avoids cross-CPU IPI on fastpath).
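
The four conditions above amount to a single predicate. The sketch below shows one possible shape; the structure fields and all `demo_` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    DEMO_EP_IDLE,
    DEMO_EP_RECV_WAIT,
    DEMO_EP_SEND_WAIT
} demo_ep_state_t;

/* Hypothetical view of the endpoint fields the check needs. */
typedef struct {
    demo_ep_state_t state;
    unsigned        receiver_cpu;  /* CPU the blocked receiver last ran on */
} demo_endpoint_t;

enum { DEMO_MAX_REGS = 8, DEMO_MAX_CAPS = 4 };

/* True only if every fastpath condition holds. */
static bool demo_fastpath_ok(const demo_endpoint_t *ep, size_t nregs,
                             size_t ncaps, unsigned current_cpu)
{
    return ep->state == DEMO_EP_RECV_WAIT
        && nregs <= DEMO_MAX_REGS
        && ncaps <= DEMO_MAX_CAPS
        && ep->receiver_cpu == current_cpu;
}
```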

4.4 Asynchronous Queue Fallback

When the fastpath conditions are not met, the message is enqueued in the endpoint's ring buffer and the sender continues without blocking.

Async send path:

  ipc_send_async(ep_cap, msg):
      endpoint = cap_lookup(ep_cap, CAP_EXEC)
      acquire endpoint.lock (spinlock, irqsave)
      if ring_buffer_full(endpoint.queue):
          release endpoint.lock
          return ERR_QUEUE_FULL
      ring_buffer_enqueue(endpoint.queue, msg)
      if endpoint.state == ENDPOINT_RECV_WAIT:
          // Wake the receiver.
          thread_wake(endpoint.waiting_receiver)
          endpoint.state = ENDPOINT_IDLE
      release endpoint.lock
      return OK

  ipc_recv(ep_cap):
      endpoint = cap_lookup(ep_cap, CAP_EXEC)
      acquire endpoint.lock (spinlock, irqsave)
      while ring_buffer_empty(endpoint.queue):
          endpoint.state = ENDPOINT_RECV_WAIT
          endpoint.waiting_receiver = current_thread
          release endpoint.lock
          thread_block()          // deschedule; woken by sender
          acquire endpoint.lock   // on wake: re-check the queue under the lock
      msg = ring_buffer_dequeue(endpoint.queue)
      release endpoint.lock
      return msg

4.5 Message Structure

Every IPC message has the same fixed structure regardless of path:

IPC Message (136 bytes):

  ┌──────────────────────────────────────────────────────────┐
  │  tag      [63:0]   — message type / protocol identifier  │
  ├──────────────────────────────────────────────────────────┤
  │  regs[0]  [63:0]   ─┐                                    │
  │  regs[1]  [63:0]    │                                    │
  │  ...                │  8 × 64-bit data words             │
  │  regs[7]  [63:0]   ─┘                                    │
  ├──────────────────────────────────────────────────────────┤
  │  caps[0]  [63:0]   ─┐                                    │
  │  caps[1]  [63:0]    │  4 × capability handles            │
  │  caps[2]  [63:0]    │  (slot indices in sender's table)  │
  │  caps[3]  [63:0]   ─┘                                    │
  └──────────────────────────────────────────────────────────┘
  Total: (1 + 8 + 4) × 8 = 104 bytes of payload
         + 32 bytes padding = 136 bytes per slot
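
The layout can be checked at compile time. In the sketch below, `demo_ipc_msg_t` is a hypothetical type name; the three payload fields occupy 104 bytes, and the `reserved` area that brings the slot to the documented 136 bytes is an assumption about how the remaining space is accounted for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t tag;           /* message type / protocol identifier */
    uint64_t regs[8];       /* 8 x 64-bit data words */
    uint64_t caps[4];       /* slot indices in the sender's table */
    uint8_t  reserved[32];  /* assumption: pads 104 payload bytes to 136 */
} demo_ipc_msg_t;

/* Compile-time checks: a layout change fails the build, not the boot. */
static_assert(offsetof(demo_ipc_msg_t, regs) == 8, "regs follow the tag");
static_assert(offsetof(demo_ipc_msg_t, caps) == 72, "caps follow the regs");
static_assert(offsetof(demo_ipc_msg_t, reserved) == 104, "payload is 104 bytes");
static_assert(sizeof(demo_ipc_msg_t) == 136, "slot is 136 bytes");
```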

4.6 Capability Transfer

Capabilities included in a message (caps[0..3]) are transferred with move semantics: the kernel atomically removes the capability from the sender's table and inserts it into the receiver's table. The sender's slot is cleared. The capability is never simultaneously visible in both tables.

cap_transfer(sender, receiver, sender_slot):
    // Take both table locks in ascending address order to avoid deadlock.
    lo, hi = sort_by_address(sender.cap_table.lock, receiver.cap_table.lock)
    acquire lo
    acquire hi
    token = sender.cap_table[sender_slot]
    assert token.type != CAP_TYPE_INVALID
    dst_slot = cap_table_alloc(receiver.cap_table)
    receiver.cap_table[dst_slot] = token
    sender.cap_table[sender_slot] = CAP_INVALID
    release hi
    release lo
    return dst_slot
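
The move semantics can be illustrated with two small in-memory tables. This is a single-threaded sketch with locking left out; the table size, the error values, and all `demo_` names are assumptions. It also shows one plausible way to handle a full receiver table, which the pseudocode above does not cover: the transfer fails and the sender keeps the capability.

```c
#include <stdint.h>

enum {
    DEMO_SLOTS = 4,
    DEMO_CAP_TYPE_INVALID = 0,
    DEMO_ERR_NO_SLOT = -1,
    DEMO_ERR_BAD_SLOT = -2
};

typedef struct {
    uint32_t type;
    uint64_t object_id;
} demo_token_t;

typedef struct {
    demo_token_t slot[DEMO_SLOTS];
} demo_table_t;

/* Return the first free slot index, or DEMO_ERR_NO_SLOT. */
static int demo_table_alloc(const demo_table_t *t)
{
    for (int i = 0; i < DEMO_SLOTS; i++)
        if (t->slot[i].type == DEMO_CAP_TYPE_INVALID)
            return i;
    return DEMO_ERR_NO_SLOT;
}

/* Move one capability: after success it exists only in the receiver. */
static int demo_cap_transfer(demo_table_t *sender, demo_table_t *receiver,
                             int sender_slot)
{
    if (sender_slot < 0 || sender_slot >= DEMO_SLOTS ||
        sender->slot[sender_slot].type == DEMO_CAP_TYPE_INVALID)
        return DEMO_ERR_BAD_SLOT;

    int dst = demo_table_alloc(receiver);
    if (dst < 0)
        return DEMO_ERR_NO_SLOT;  /* sender's slot is left untouched */

    receiver->slot[dst] = sender->slot[sender_slot];
    sender->slot[sender_slot] = (demo_token_t){ .type = DEMO_CAP_TYPE_INVALID };
    return dst;
}
```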

4.7 IPC and KOMS

IPC endpoints are kobject_t instances registered in the KOMS namespace under the owning domain's subtree. A domain may publish an endpoint by name, allowing other domains to discover it via koms_ns_find_get and then request a capability from a trusted broker.

KOMS namespace (IPC endpoints):

  koms_root_ns
  └── "domains"
      ├── "block-io"
      │   └── "ep.request"   ← IPC endpoint kobject
      ├── "filesystem"
      │   └── "ep.request"
      └── "console"
          └── "ep.write"

5. Process and Domain Model

5.1 Fundamental Units

ZXFoundation™ defines two fundamental execution units:

Domain: the unit of isolation. Owns an address space (vm_space_t), a capability table, and one or more threads. Analogous to a process in a monolithic kernel, but the kernel makes no distinction between a "driver domain" and an "application domain."
Thread: the unit of scheduling. Belongs to exactly one domain. Has a kernel stack, a saved register set (irq_frame_t), and a scheduling state. Threads within the same domain share the domain's address space and capability table.

5.2 Domain Lifecycle

                    domain_create()
                          │
                          ▼
                  ┌───────────────┐
                  │   CREATING    │  — address space allocated,
                  └───────┬───────┘    capability table allocated,
                          │            initial thread created
                          ▼
                  ┌───────────────┐
                  │    RUNNING    │◄──── threads scheduled normally
                  └───────┬───────┘
                          │
              ┌───────────┼───────────┐
              │           │           │
         domain_kill   unhandled   watchdog
              │         fault       timeout
              │           │           │
              ▼           ▼           │
        ┌──────────┐ ┌──────────┐    │
        │  DYING   │ │ FAULTED  │◄───┘
        └────┬─────┘ └────┬─────┘
             │            │
             │     supervisor domain
             │     decides: restart or kill
             │            │
             │     ┌──────┴──────┐
             │     │             │
             │  restart        kill
             │     │             │
             │     ▼             │
             │ ┌──────────┐      │
             │ │RESTARTING│      │
             │ └────┬─────┘      │
             │      │            │
             │      ▼            ▼
             │  (RUNNING)   ┌──────┐
             └─────────────►│ DEAD │
                            └──────┘

5.3 Domain Structure

A domain is a kobject_t of type KOBJ_TYPE_DOMAIN. It embeds:

Domain object:

  kobject_t         kobj          — KOMS base (lifecycle, namespace, events)
  vm_space_t        space         — address space (ASCE, VMA tree)
  cap_table_t       cap_table     — capability table
  list_head_t       threads       — list of owned threads
  spinlock_t        lock          — protects state transitions
  domain_state_t    state         — CREATING/RUNNING/DYING/FAULTED/RESTARTING/DEAD
  uint32_t          domain_id     — globally unique identifier
  kobject_t        *supervisor    — domain that receives fault events (may be null)
  uint64_t          heartbeat_seq — watchdog sequence number

5.4 Thread Structure

A thread is a kobject_t of type KOBJ_TYPE_THREAD. It embeds:

Thread object:

  kobject_t         kobj          — KOMS base
  domain_t         *domain        — owning domain (non-null, immutable)
  irq_frame_t       saved_regs    — GPRs, FPRs, PSW (saved on context switch)
  uint64_t          kernel_stack  — kernel stack top (virtual address)
  thread_state_t    state         — RUNNABLE/RUNNING/BLOCKED/DEAD
  sched_entity_t    sched         — scheduler run queue linkage
  uint32_t          priority      — scheduling priority class
  uint64_t          cpu_mask      — CPU affinity bitmask
  uint64_t          user_timer    — accumulated user-mode CPU time (ns)
  uint64_t          sys_timer     — accumulated kernel-mode CPU time (ns)

5.5 Fault Containment

When a domain faults (unhandled program check, protection exception, or watchdog timeout), the kernel:

Suspends all threads in the domain (sets state to BLOCKED).
Sets domain state to FAULTED.
Fires KOBJ_EVENT_DOMAIN_FAULT on the domain's kobject.
If the domain has a registered supervisor, delivers an IPC message to the supervisor's fault endpoint containing the fault code and domain ID.
The supervisor decides: call domain_restart or domain_kill.

If no supervisor is registered, the kernel kills the domain immediately. The kernel itself never panics due to a domain fault.

Fault containment flow:

  Domain D faults
       │
       ▼
  kernel suspends D's threads
  D.state = FAULTED
  koms_event_fire(D, KOBJ_EVENT_DOMAIN_FAULT)
       │
       ├── supervisor registered?
       │         YES                        NO
       │          │                          │
       ▼          ▼                          ▼
  IPC message to supervisor          domain_kill(D)
  { fault_code, domain_id }
       │
       ├── supervisor calls domain_restart(D)
       │         │
       │         ▼
       │   D.state = RESTARTING
       │   reset address space
       │   reset capability table
       │   restart initial thread
       │   D.state = RUNNING
       │
       └── supervisor calls domain_kill(D)
                 │
                 ▼
           D.state = DEAD
           destroy address space
           destroy capability table
           koms_put(D)
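
The state changes in this flow, together with the lifecycle in 5.2, can be captured as a transition check. The sketch below is one reading of those diagrams rather than kernel code: the `demo_` names are hypothetical, and it assumes a faulted domain may go directly to DEAD when it is killed.

```c
#include <stdbool.h>

typedef enum {
    DEMO_DOMAIN_CREATING,
    DEMO_DOMAIN_RUNNING,
    DEMO_DOMAIN_DYING,
    DEMO_DOMAIN_FAULTED,
    DEMO_DOMAIN_RESTARTING,
    DEMO_DOMAIN_DEAD
} demo_domain_state_t;

/* Return true if the lifecycle permits moving from `from` to `to`. */
static bool demo_domain_transition_ok(demo_domain_state_t from,
                                      demo_domain_state_t to)
{
    switch (from) {
    case DEMO_DOMAIN_CREATING:
        return to == DEMO_DOMAIN_RUNNING;
    case DEMO_DOMAIN_RUNNING:      /* domain_kill, or fault / watchdog */
        return to == DEMO_DOMAIN_DYING || to == DEMO_DOMAIN_FAULTED;
    case DEMO_DOMAIN_DYING:
        return to == DEMO_DOMAIN_DEAD;
    case DEMO_DOMAIN_FAULTED:      /* supervisor decides, or no supervisor */
        return to == DEMO_DOMAIN_RESTARTING || to == DEMO_DOMAIN_DEAD;
    case DEMO_DOMAIN_RESTARTING:
        return to == DEMO_DOMAIN_RUNNING;
    case DEMO_DOMAIN_DEAD:         /* terminal state */
    default:
        return false;
    }
}
```

A check of this kind, run under the domain's state lock, would reject races such as a restart request arriving for a domain that has already been killed.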

5.6 Server Domains

A server domain is a domain that provides a service to other domains. It is distinguished from a user domain only by convention and registration:

It registers one or more IPC endpoints in the KOMS namespace under a well-known path (e.g., "domains/block-io/ep.request").
It registers a supervisor domain (typically the system manager domain) that will restart it on fault.
It registers a heartbeat capability with the kernel watchdog.

The kernel has no built-in concept of "driver" or "system service." All server domains are equal in privilege. Their authority derives entirely from the capabilities they hold.

5.7 KOMS Domain Hierarchy

koms_root_ns
└── "domains"
    ├── "system-manager"    ← supervisor for all server domains
    │   ├── "ep.fault"      ← receives fault events
    │   └── threads/
    │       └── "main"
    ├── "block-io"
    │   ├── "ep.request"
    │   └── threads/
    │       └── "worker-0"
    ├── "filesystem"
    │   ├── "ep.request"
    │   └── threads/
    │       └── "worker-0"
    └── "user-shell"
        └── threads/
            └── "main"

6. Scheduler

6.1 Design Goals

ZXFoundation™ targets throughput/batch workloads: long-running server domains, high CPU utilization, and minimal context-switch overhead. The scheduler is not designed for sub-millisecond interactive latency. It is designed to keep all CPUs busy and to minimize the overhead of scheduling decisions on the hot path.

6.2 Priority Classes

The scheduler defines three priority classes, processed in strict order:

Class	Value	Quantum	Use case
`SCHED_REALTIME`	0 (highest)	1 ms	Watchdog thread, IPC notification threads
`SCHED_BATCH`	1	10 ms	Server domains, user processes
`SCHED_IDLE`	2 (lowest)	unbounded	Idle loop (runs only when no other work)

A SCHED_REALTIME thread always preempts a SCHED_BATCH or SCHED_IDLE thread. A SCHED_BATCH thread always preempts SCHED_IDLE. Within a class, scheduling is round-robin.

The 10 ms batch quantum is chosen to amortize context-switch overhead over a meaningful amount of work; it is far coarser than the sub-nanosecond TOD clock resolution, so timer granularity is not a constraint. Server domains that perform I/O will voluntarily yield (block on IPC receive) long before the quantum expires.

6.3 Per-CPU Run Queues

Each CPU maintains three run queues, one per priority class. Run queues are doubly-linked lists of sched_entity_t nodes embedded in thread objects.

Per-CPU scheduler state (one per CPU):

  ┌─────────────────────────────────────────────────────────┐
  │  CPU N                                                  │
  │                                                         │
  │  current_thread ──► [thread currently running]          │
  │                                                         │
  │  rq[SCHED_REALTIME]: [t_a] ↔ [t_b] ↔ ∅                │
  │  rq[SCHED_BATCH]:    [t_c] ↔ [t_d] ↔ [t_e] ↔ ∅        │
  │  rq[SCHED_IDLE]:     [idle_thread] ↔ ∅                 │
  │                                                         │
  │  rq_lock (spinlock, irqsave)                            │
  │  nr_running (total threads across all queues)           │
  └─────────────────────────────────────────────────────────┘

The rq_lock is a per-CPU spinlock. It is held only during run queue manipulation (enqueue, dequeue, pick_next). It is never held across a context switch.

6.4 Scheduling Decision

The scheduler is invoked from three points:

CPU timer interrupt (quantum expiry).
thread_block() — a thread voluntarily deschedules (e.g., IPC receive).
thread_wake() — a thread is made runnable (e.g., IPC send wakes receiver).

schedule():
    acquire rq_lock (irqsave)
    next = pick_next_thread(current_cpu)
    if next == current_thread:
        release rq_lock
        return                  // no switch needed
    prev = current_thread
    current_thread = next
    release rq_lock
    context_switch(prev, next)  // saves prev, restores next, returns in next

pick_next_thread(cpu):
    for class in [SCHED_REALTIME, SCHED_BATCH, SCHED_IDLE]:
        if rq[class] not empty:
            thread = rq[class].head
            list_rotate(rq[class])   // round-robin: move head to tail
            return thread
    return idle_thread              // always non-null
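
The strict class order and the round-robin rotation can be tried out with a small array-backed model. This is a sketch only: the real run queues are linked lists of `sched_entity_t`, while the fixed-size queues, integer thread IDs, and `demo_` names here are simplifications chosen to keep the example self-contained.

```c
#include <stddef.h>

enum { DEMO_REALTIME, DEMO_BATCH, DEMO_IDLE, DEMO_NCLASS };
enum { DEMO_RQ_CAP = 8 };

/* Circular queue of thread IDs for one priority class. */
typedef struct {
    int    thread[DEMO_RQ_CAP];
    size_t head;
    size_t count;
} demo_rq_t;

typedef struct {
    demo_rq_t rq[DEMO_NCLASS];
} demo_cpu_t;

static int demo_rq_push(demo_rq_t *q, int tid)
{
    if (q->count == DEMO_RQ_CAP)
        return -1;
    q->thread[(q->head + q->count) % DEMO_RQ_CAP] = tid;
    q->count++;
    return 0;
}

/* Pick from the highest non-empty class; rotate head to tail so the
 * next pick in the same class returns a different thread. */
static int demo_pick_next(demo_cpu_t *cpu)
{
    for (int c = 0; c < DEMO_NCLASS; c++) {
        demo_rq_t *q = &cpu->rq[c];
        if (q->count == 0)
            continue;
        int tid = q->thread[q->head];
        q->head = (q->head + 1) % DEMO_RQ_CAP;
        q->thread[(q->head + q->count - 1) % DEMO_RQ_CAP] = tid;
        return tid;
    }
    return -1;  /* no runnable thread at all */
}
```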

6.5 Context Switch

A context switch saves the outgoing thread's full CPU state and restores the incoming thread's state. On z/Architecture this includes:

16 × 64-bit general-purpose registers (GPRs 0–15)
16 × 64-bit floating-point registers (FPRs 0–15)
Program Status Word (PSW: mask + instruction address)
16 × 32-bit access registers (ARs 0–15)
CPU timer value (STPT / SPT)

The kernel stack pointer (GPR 15) is saved in the thread's saved_regs and restored on the next switch. The domain's ASCE is loaded into CR1 when switching between domains.

Context switch sequence:

  context_switch(prev, next):
      // Save prev state to prev.saved_regs
      STMG  R0,R15, prev.saved_regs.gprs
      STFPC prev.saved_regs.fpc
      STPT  prev.saved_regs.cpu_timer
      // FPRs 0–15 (STD) and ARs 0–15 (STAM) are saved here in the same way
      // Update time accounting
      prev.sys_timer += (STCK() - lowcore.sys_enter_timer)
      // Switch address space if domains differ
      if prev.domain != next.domain:
          LCTLG CR1,CR1, next.domain.space.asce
          // TLB is tagged by ASCE; no explicit flush needed on z/Arch
      // Record the new current task (for fault handler identification).
      // This must happen before the GPRs are restored.
      lowcore.current_task = next
      lowcore.sys_enter_timer = STCK()
      // Restore next state (FPRs via LD and ARs via LAM likewise)
      SPT   next.saved_regs.cpu_timer
      LFPC  next.saved_regs.fpc
      LMG   R0,R15, next.saved_regs.gprs
      // Return in next thread's context

6.6 Work Stealing

When a CPU's run queues are empty (only the idle thread is runnable), the CPU attempts to steal work from the busiest CPU.

Work stealing:

  idle_loop(cpu):
      while true:
          victim = find_busiest_cpu()   // scan per-CPU nr_running
          if victim == null or victim.nr_running <= 1:
              arch_cpu_relax()          // DIAG 0x44 (z/Arch yield hint)
              continue
          // Take both run queue locks in ascending cpu_id order to avoid deadlock.
          first, second = order_by_cpu_id(victim, cpu)
          acquire first.rq_lock (irqsave)
          acquire second.rq_lock
          steal_half(victim, cpu)
          release second.rq_lock
          release first.rq_lock
          break

Stealing moves half the victim's SCHED_BATCH threads to the idle CPU. SCHED_REALTIME threads are never stolen — they are pinned to their assigned CPU by the IPI mechanism.

6.7 CPU Affinity

A thread may be pinned to a subset of CPUs via its cpu_mask field. The scheduler respects affinity: pick_next_thread skips threads whose cpu_mask does not include the current CPU. Work stealing also respects affinity: a thread is only stolen if the stealing CPU is in the thread's cpu_mask.

Affinity is set at thread creation via a capability-gated syscall. The capability must grant CAP_WRITE on the thread object.
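
The interaction between affinity and work stealing can be sketched as below. The flat thread arrays, the 64-CPU limit implied by a `uint64_t` mask, and all `demo_` names are assumptions made for the example; locking is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      id;
    uint64_t cpu_mask;  /* bit n set: thread may run on CPU n */
} demo_thread_t;

static bool demo_cpu_allowed(uint64_t cpu_mask, unsigned cpu)
{
    return cpu < 64 && ((cpu_mask >> cpu) & 1u) != 0;
}

/* Move up to half of the victim's threads to the thief, skipping any
 * thread whose affinity mask excludes the thief CPU. Returns the number
 * of threads moved; the victim array is compacted in place. */
static size_t demo_steal_half(demo_thread_t *victim, size_t *victim_n,
                              demo_thread_t *thief, size_t *thief_n,
                              size_t thief_cap, unsigned thief_cpu)
{
    size_t want = *victim_n / 2;
    size_t moved = 0;
    size_t kept = 0;

    for (size_t i = 0; i < *victim_n; i++) {
        bool take = moved < want && *thief_n < thief_cap &&
                    demo_cpu_allowed(victim[i].cpu_mask, thief_cpu);
        if (take) {
            thief[(*thief_n)++] = victim[i];
            moved++;
        } else {
            victim[kept++] = victim[i];
        }
    }
    *victim_n = kept;
    return moved;
}
```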

7. Time Subsystem

7.1 Hardware Time Sources

z/Architecture provides three hardware time mechanisms: one global clock and two per-CPU facilities.

Source	Instruction	Type	Resolution	Use
TOD clock	`STCK` / `STCKF`	Global, monotonic	~0.24 ns (2^-12 µs)	Wall time, `ktime_get`
CPU timer	`SPT` / `STPT`	Per-CPU countdown	Same as TOD	Scheduler preemption
Clock comparator	`SCKC` / `STCKC`	Per-CPU absolute	Same as TOD	Sleep / timeout

The TOD clock is a single hardware clock shared across all CPUs. It is monotonic and does not wrap in any practical timeframe (64-bit, ~143 years at full resolution). STCKF reads it without serialization — it is safe from any context including hard-IRQ.

7.2 Kernel Time (ktime_t)

ktime_t is a 64-bit nanosecond count since kernel boot. It is derived from the TOD clock with a boot-time offset computed during pmm_init.

TOD clock value (raw):
  bits 63:0 = TOD units (1 TOD unit = 2^-12 µs ≈ 0.244 ns)

ktime conversion:
  ktime_ns = (tod_raw - tod_boot_offset) * 125 / 512
           ≈ (tod_raw - tod_boot_offset) >> 2  (approximate; overestimates by about 2.4%)

  Exact: 1 TOD unit = 1000/4096 ns
         ktime_ns = tod_delta * 1000 / 4096

ktime_get() reads STCKF and applies the conversion. It is callable from any context, holds no lock, and never sleeps.
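
The exact conversion needs some care in 64-bit arithmetic: multiplying the raw delta by 125 overflows once the delta exceeds 2^64 / 125 TOD units, which is a little over 400 days. The sketch below avoids this by splitting the delta into whole groups of 512 units plus a remainder. The function names are hypothetical.

```c
#include <stdint.h>

/* Exact floor(tod_delta * 125 / 512) without intermediate overflow.
 * Writing tod_delta = 512*q + r gives 125*q + floor(125*r / 512). */
static uint64_t demo_tod_to_ns(uint64_t tod_delta)
{
    uint64_t whole = tod_delta >> 9;    /* q: complete groups of 512 units */
    uint64_t rest  = tod_delta & 511u;  /* r: remaining units, 0..511 */

    return whole * 125u + ((rest * 125u) >> 9);
}

/* Shift-only approximation: treats one TOD unit as 0.25 ns. */
static uint64_t demo_tod_to_ns_approx(uint64_t tod_delta)
{
    return tod_delta >> 2;
}
```

For one microsecond (4096 TOD units) the exact form yields 1000 ns and the shift yields 1024 ns, which is the 2.4% difference between 1/4 and 125/512.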

7.3 CPU Timer and Scheduler Preemption

The CPU timer is a per-CPU countdown register. When its value becomes negative, a CPU timer interrupt fires (external interrupt, interruption code 0x1005). The kernel uses this to enforce scheduler quanta.

Quantum setup (on context switch to a new thread):
    quantum_tod = thread.priority == SCHED_REALTIME ? 1_ms_in_tod
                                                    : 10_ms_in_tod
    SPT quantum_tod      // load positive value; the timer counts down

CPU timer interrupt handler:
    // Fires once the CPU timer value has gone negative
    sched_tick()         // account time, check if quantum expired
    if quantum_expired:
        schedule()       // pick next thread
    else:
        return           // spurious or early; reload timer
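
The quantum lengths translate into TOD units as follows: one microsecond is 4096 units, so 1 ms is 4,096,000 units and 10 ms is 40,960,000. In the sketch below the `demo_` names are hypothetical, and returning 0 for the idle class to mean "no quantum" is a convention chosen for this example only.

```c
#include <stdint.h>

#define DEMO_TOD_UNITS_PER_US UINT64_C(4096)

typedef enum {
    DEMO_SCHED_REALTIME = 0,
    DEMO_SCHED_BATCH    = 1,
    DEMO_SCHED_IDLE     = 2
} demo_sched_class_t;

/* Convert milliseconds to TOD units (1 ms = 1000 us = 4,096,000 units). */
static uint64_t demo_ms_to_tod(uint64_t ms)
{
    return ms * 1000u * DEMO_TOD_UNITS_PER_US;
}

/* Quantum length per class, in TOD units. 0 means "unbounded" here. */
static uint64_t demo_quantum_tod(demo_sched_class_t cls)
{
    switch (cls) {
    case DEMO_SCHED_REALTIME: return demo_ms_to_tod(1);
    case DEMO_SCHED_BATCH:    return demo_ms_to_tod(10);
    default:                  return 0;
    }
}
```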

7.4 Clock Comparator and Timer Wheel

The clock comparator fires an external interrupt when the TOD clock reaches a programmed absolute value. The kernel uses this for sleep and timeout operations.

The timer wheel is a per-CPU hierarchical structure with 8 levels and 64 slots per level. Each slot covers a time range; the slot width grows by a factor of 64 at each level.

Timer wheel (per CPU):

  Level 0: 64 slots × 1 ms  = 64 ms range   (fine-grained)
  Level 1: 64 slots × 64 ms ≈ 4.1 s range
  Level 2: 64 slots × ~4.1 s ≈ 262 s range
  ...
  Level 7: 64 slots × ...   = years range    (coarse)

  Each slot: list of timer_t objects expiring in that window

  On clock comparator interrupt:
      advance current slot pointer
      fire all timers in the current slot
      if level 0 wraps: cascade from level 1, etc.
      program clock comparator for next non-empty slot
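
Placing a timer means choosing the lowest level whose range still covers the remaining delay, then indexing by the expiry time at that level's granularity. The sketch below assumes millisecond timestamps and a slot width of 64^level ms, as in the listing above; the placement rule and the `demo_` names are illustrative rather than taken from the kernel.

```c
#include <stdint.h>

enum { DEMO_WHEEL_LEVELS = 8, DEMO_WHEEL_SLOTS = 64, DEMO_WHEEL_SHIFT = 6 };

typedef struct {
    unsigned level;
    unsigned slot;
} demo_wheel_pos_t;

/* Width of one slot at a level, in ms: 64^level = 2^(6*level). */
static uint64_t demo_slot_width_ms(unsigned level)
{
    return UINT64_C(1) << (DEMO_WHEEL_SHIFT * level);
}

/* Choose the level whose 64-slot range covers the delay, then the slot
 * from the absolute expiry time at that level's granularity. */
static demo_wheel_pos_t demo_wheel_place(uint64_t now_ms, uint64_t expires_ms)
{
    uint64_t delta = expires_ms > now_ms ? expires_ms - now_ms : 0;
    unsigned level = 0;

    while (level + 1 < DEMO_WHEEL_LEVELS &&
           delta >= (demo_slot_width_ms(level) << DEMO_WHEEL_SHIFT))
        level++;

    unsigned slot = (unsigned)((expires_ms >> (DEMO_WHEEL_SHIFT * level)) &
                               (DEMO_WHEEL_SLOTS - 1));
    return (demo_wheel_pos_t){ level, slot };
}
```

Timers placed at a coarse level are re-inserted at a finer one when the wheel cascades, so the coarse slot width bounds lookup cost, not firing accuracy.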

Timer callbacks execute in softirq context — after the hard-IRQ handler returns, before returning to user space. They must not block, must not acquire spinlocks held by hard-IRQ handlers, and must complete in bounded time.

7.5 Time Accounting

Per-thread time accounting uses the lowcore timing fields:

Kernel entry (SVC, PGM, EXT, IO):
    lowcore.sys_enter_timer = STCK()

Kernel exit (return to user space):
    elapsed = STCK() - lowcore.sys_enter_timer
    current_thread.sys_timer += elapsed
    lowcore.exit_timer = STCK()

User time (updated on kernel entry):
    user_elapsed = lowcore.sys_enter_timer - lowcore.exit_timer
    current_thread.user_timer += user_elapsed

7.6 Time Strict Requirements

#	Requirement
TIME-1	`ktime_get()` must be callable from any context including hard-IRQ. It reads `STCKF` directly — no lock, no sleep.
TIME-2	Timer callbacks execute in softirq context. They must not block or acquire locks held by hard-IRQ handlers.
TIME-3	The CPU timer must be reloaded on every context switch. A thread must never run beyond its quantum without a timer interrupt.
TIME-4	The clock comparator must be reprogrammed after every timer wheel advance to the next non-empty slot.
TIME-5	`tod_boot_offset` is computed once during `pmm_init` and never modified.

8. Trap and System Call Architecture

8.1 Interrupt Classes

z/Architecture defines six hardware interrupt classes. Each has a dedicated new PSW slot in the lowcore and a dedicated entry point in the kernel.

Class	Lowcore offset	Trigger	Kernel handler
`RESTART`	`0x01A0`	SIGP RESTART (AP bringup)	`restart_handler`
`EXTERNAL`	`0x01B0`	CPU timer, clock comparator, SIGP, service call	`ext_handler`
`SVC`	`0x01C0`	`SVC n` instruction (system call)	`svc_handler`
`PROGRAM`	`0x01D0`	Page fault, protection exception, illegal instruction	`pgm_handler`
`MCCK`	`0x01E0`	Machine check (hardware error)	`mcck_handler`
`IO`	`0x01F0`	Channel subsystem I/O completion	`io_handler`

8.2 Entry Path

All interrupt classes share the same entry structure:

Hardware interrupt fires:
    1. Hardware saves old PSW to lowcore (e.g., svc_old_psw at 0x0140).
    2. Hardware saves interrupt parameters to lowcore
       (e.g., svc_code at 0x008A for SVC).
    3. Hardware loads new PSW from lowcore (e.g., svc_new_psw at 0x01C0).
    4. Execution begins at the kernel entry stub.

Kernel entry stub (assembly):
    STMG  R0,R15, lowcore.save_area_sync   // save all GPRs
    LG    R15, lowcore.kernel_stack        // switch to kernel stack
    // Build irq_frame_t on kernel stack:
    //   gprs[16], psw (from lowcore old PSW), ilc, code
    BRASL R14, <C handler>                 // call C dispatcher
    // On return: restore GPRs, LPSWE to return PSW
    LMG   R0,R15, frame.gprs
    LPSWE frame.psw

The irq_frame_t on the kernel stack is the canonical representation of the interrupted context. It is used by the fault handler, the debugger, and the context switch path.

8.3 SVC — System Call Dispatch

ZXFoundation™ defines its own system call table. There is no POSIX compatibility layer. The SVC number is in lowcore.svc_code (16-bit). Arguments are passed in GPRs 2–7: the five argument registers of the s390x ELF ABI (GPRs 2–6) plus GPR 7. The return value is in GPR 2.

Every system call that operates on a kernel object takes a capability handle as its first argument (GPR 2). The kernel validates the capability before performing any operation. An invalid or insufficient capability returns ERR_CAP_INVALID immediately.

SVC dispatch:

  svc_handler(frame):
      svc_nr = lowcore.svc_code & 0xFF
      if svc_nr >= ZX_SYSCALL_MAX:
          return ERR_INVALID_SYSCALL
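      // Syscalls that take no capability (for example zx_yield) skip the lookup.
      // syscall_takes_cap is an illustrative helper.
      if not syscall_takes_cap(svc_nr):
          return syscall_table[svc_nr](null, 0, frame)
      cap_handle = frame.gprs[2]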
      object, rights = cap_lookup(current_domain, cap_handle)
      if object == null:
          return ERR_CAP_INVALID
      return syscall_table[svc_nr](object, rights, frame)
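
A bounds-checked dispatch table of this shape could be written in C as sketched below. Capability lookup is left out to keep the example short. The handler bodies are placeholders, the `demo_` names and error value are assumptions, and slots without a handler (such as the reserved range) are rejected in the same way as out-of-range numbers.

```c
#include <stddef.h>
#include <stdint.h>

enum { DEMO_SYSCALL_MAX = 32, DEMO_ERR_INVALID_SYSCALL = -1 };

typedef struct {
    uint64_t gprs[16];
} demo_frame_t;

typedef int64_t (*demo_syscall_fn)(demo_frame_t *frame);

/* Placeholder handlers standing in for zx_time_get and zx_yield. */
static int64_t demo_sys_time_get(demo_frame_t *f) { (void)f; return 123456789; }
static int64_t demo_sys_yield(demo_frame_t *f)    { (void)f; return 0; }

/* Designated initializers leave every unlisted slot NULL. */
static const demo_syscall_fn demo_syscall_table[DEMO_SYSCALL_MAX] = {
    [18] = demo_sys_time_get,
    [20] = demo_sys_yield,
};

/* Validate the number, call the handler, place the result in GPR 2. */
static void demo_svc_dispatch(uint16_t svc_code, demo_frame_t *frame)
{
    unsigned nr = svc_code & 0xFFu;
    int64_t ret = DEMO_ERR_INVALID_SYSCALL;

    if (nr < DEMO_SYSCALL_MAX && demo_syscall_table[nr] != NULL)
        ret = demo_syscall_table[nr](frame);
    frame->gprs[2] = (uint64_t)ret;
}
```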

ZXFoundation™ v1 system call surface (32 slots: 23 defined, 9 reserved):

Number	Name	Capability type	Description
0	`zx_cap_derive`	any	Derive a capability with reduced rights
1	`zx_cap_transfer`	any + `CAP_GRANT`	Transfer a capability via IPC message
2	`zx_cap_revoke`	any + `CAP_REVOKE`	Revoke all derived capabilities
3	`zx_domain_create`	domain factory	Create a new domain
4	`zx_domain_kill`	domain + `CAP_DESTROY`	Kill a domain
5	`zx_domain_restart`	domain + `CAP_WRITE`	Restart a faulted domain
6	`zx_thread_create`	domain + `CAP_WRITE`	Create a thread in a domain
7	`zx_thread_start`	thread + `CAP_EXEC`	Start a thread at a given address
8	`zx_thread_exit`	—	Terminate the calling thread
9	`zx_ipc_call`	endpoint + `CAP_EXEC`	Synchronous IPC call
10	`zx_ipc_recv`	endpoint + `CAP_EXEC`	Block waiting for a message
11	`zx_ipc_reply`	—	Reply to a synchronous call
12	`zx_ipc_send`	endpoint + `CAP_EXEC`	Async send (non-blocking)
13	`zx_mem_map`	VMA + `CAP_MAP`	Map a VMA into the calling domain
14	`zx_mem_unmap`	VMA + `CAP_WRITE`	Unmap a VMA
15	`zx_mem_alloc`	domain + `CAP_WRITE`	Allocate anonymous memory
16	`zx_endpoint_create`	domain + `CAP_WRITE`	Create an IPC endpoint
17	`zx_endpoint_destroy`	endpoint + `CAP_DESTROY`	Destroy an endpoint
18	`zx_time_get`	—	Read `ktime_t` (no capability needed)
19	`zx_sleep`	—	Sleep for a duration
20	`zx_yield`	—	Voluntarily yield the CPU
21	`zx_watchdog_register`	domain + `CAP_WRITE`	Register a heartbeat capability
22	`zx_watchdog_heartbeat`	watchdog cap	Signal liveness to the watchdog
23–31	reserved		Future use

8.4 PGM — Program Check Handler

The program check handler dispatches on lowcore.pgm_code:

pgm_handler(frame):
    code = lowcore.pgm_code
    addr = lowcore.trans_exc_code   // faulting virtual address (if applicable)

    switch code:
        case PGM_TRANSLATION_EXCEPTION:   // page fault
            vma = vmm_find_vma(current_domain.space, addr)
            if vma == null:
                goto domain_fault         // no mapping → domain fault
            page = pmm_alloc_page(ZX_GFP_NORMAL)
            if page == null:
                goto domain_fault         // OOM → domain fault
            mmu_map_page(current_domain.space, addr, page, vma.vm_prot)
            return                        // retry the faulting instruction

        case PGM_PROTECTION_EXCEPTION:    // write to read-only page, or key mismatch
            goto domain_fault

        case PGM_PRIVILEGED_OPERATION:    // user tried a privileged instruction
            goto domain_fault

        case PGM_SPECIFICATION_EXCEPTION: // alignment or format error
            goto domain_fault

        default:
            goto domain_fault

domain_fault:
    domain_suspend(current_domain)
    deliver_fault_event(current_domain, code, addr)
    schedule()                            // switch to another thread

A program check in kernel context (PSW problem-state bit = 0 at the time of the fault) is always a kernel panic. The kernel must not generate translation exceptions or protection exceptions in its own address space.
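
The policy above can be reduced to a small pure function that is easy to test on a host. The following is a minimal sketch, not the kernel's implementation: `pgm_action_t`, `pgm_classify`, the decision to treat only segment and page translation as demand-paging candidates, and the seven-bit mask are assumptions made for illustration. The numeric values are the architected program-interruption codes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architected z/Architecture program-interruption codes. */
enum {
    PGM_PRIVILEGED_OPERATION = 0x02,
    PGM_PROTECTION           = 0x04,
    PGM_SPECIFICATION        = 0x06,
    PGM_SEGMENT_TRANSLATION  = 0x10,
    PGM_PAGE_TRANSLATION     = 0x11,
};

/* Hypothetical outcome of the classification. */
typedef enum {
    PGM_ACT_KERNEL_PANIC,  /* fault taken in supervisor state */
    PGM_ACT_DEMAND_PAGE,   /* try vmm_find_vma + pmm_alloc_page + mmu_map_page */
    PGM_ACT_DOMAIN_FAULT,  /* suspend the domain and deliver a fault event */
} pgm_action_t;

static pgm_action_t pgm_classify(uint16_t pgm_code, bool problem_state)
{
    /* Any program check taken with the problem-state bit clear is a panic. */
    if (!problem_state)
        return PGM_ACT_KERNEL_PANIC;

    /* Assumption: only the low seven bits select the exception type. */
    switch (pgm_code & 0x7F) {
    case PGM_SEGMENT_TRANSLATION:
    case PGM_PAGE_TRANSLATION:
        return PGM_ACT_DEMAND_PAGE;
    default:
        return PGM_ACT_DOMAIN_FAULT;
    }
}
```

A demand-paging attempt can still end in a domain fault (no VMA, or out of memory); the sketch only selects the first step.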

8.5 EXT — External Interrupt Handler

ext_handler(frame):
    code = lowcore.ext_int_code

    switch code:
        case EXT_CPU_TIMER (0x1005):
            sched_tick()
            if quantum_expired: schedule()

        case EXT_CLOCK_COMPARATOR (0x1004):
            timer_wheel_advance(current_cpu)
            program_clock_comparator(next_expiry)

        case EXT_SERVICE_CALL (0x2401):
            sclp_service_call_handler()   // SCLP response (console, hardware info)

        case EXT_SIGP_EMERGENCY (0x1201):
            ipi_handler()                 // cross-CPU IPI (TLB shootdown, CPU offline)

        default:
            // Unknown external interrupt: log and ignore.

8.6 IO — Channel Subsystem Interrupt Handler

io_handler(frame):
    schid.sch_no = lowcore.subchannel_nr
    schid.ssid   = lowcore.subchannel_id >> 16

    // Read the Interrupt Response Block (IRB) via TSCH.
    TSCH schid, irb

    // Look up the IRQ descriptor for this subchannel.
    desc = irq_lookup_by_schid(schid)
    if desc == null:
        return                  // spurious; no handler registered

    // Dispatch to the registered handler.
    // The handler is typically the block-I/O server domain's IPC endpoint.
    desc.handler(desc, &irb)

The I/O handler is intentionally minimal. It reads the IRB and dispatches to a registered handler. The handler is responsible for notifying the appropriate server domain via IPC. The kernel does not interpret I/O completion data.

9. Machine-Check Recovery and Watchdog

9.1 Machine-Check Classification

When a machine-check interrupt fires, lowcore.mcck_interruption_code identifies the error. The kernel treats each error class as either recoverable or unrecoverable:

Error class	Recoverable?	Action
Storage error (corrected)	Yes	Log; mark page suspect; continue
Storage error (uncorrected)	No	Offline affected frames; migrate domains
CPU malfunction	No	Offline CPU; migrate its domains
Timing facility error	Yes	Re-sync TOD; log
External damage	No	Kernel panic (hardware integrity lost)

9.2 Machine-Check Recovery Flow

mcck_handler(frame):
    code = lowcore.mcck_interruption_code

    if code & MCCK_SD:              // system damage — unrecoverable
        goto kernel_panic

    if code & MCCK_ST:              // storage error
        addr = lowcore.failing_storage_address
        page = phys_to_page(addr)
        if code & MCCK_ST_CORRECTED:
            pmm_mark_suspect(page)  // log; keep in service
        else:
            pmm_offline_page(page)  // remove from buddy; migrate domains
            domain_migrate_from_page(page)

    if code & MCCK_CPU:             // CPU malfunction
        domain_migrate_all(current_cpu)  // move domains off this CPU first
        cpu_offline(current_cpu)    // drains the run queue, then SIGP STOP self (Section 9.3)
        // not reached: this CPU is now stopped

    // Recoverable: return to interrupted context.
    LPSWE frame.psw

9.3 CPU Offline and Domain Migration

When a CPU is taken offline (due to MCCK or operator request):

cpu_offline(cpu):
    // 1. Stop accepting new work.
    cpu.state = CPU_OFFLINE_PENDING
    // 2. Drain the run queue to other CPUs.
    acquire cpu.rq_lock
    for each thread in cpu.rq[SCHED_BATCH]:
        target = find_least_loaded_cpu(thread.cpu_mask)
        enqueue(target.rq[SCHED_BATCH], thread)
    release cpu.rq_lock
    // 3. Notify domains whose threads were migrated.
    for each migrated_thread:
        koms_event_fire(migrated_thread.domain, KOBJ_EVENT_DOMAIN_MIGRATE)
    // 4. Stop the CPU.
    cpu.state = CPU_OFFLINE
    SIGP STOP, cpu.cpu_addr
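
The target selection in step 2 could look like the sketch below. It is an illustration under stated assumptions: `cpu_info_t`, `MAX_CPUS`, the 64-bit affinity mask, and passing the CPU array as a parameter are all hypothetical, and run-queue length stands in for whatever load metric the scheduler actually uses.

```c
#include <stdint.h>

#define MAX_CPUS 64   /* hypothetical limit for this sketch */

typedef struct {
    int      online;  /* nonzero if the CPU accepts new work */
    unsigned rq_len;  /* number of runnable threads queued */
} cpu_info_t;

/* Return the index of the online CPU permitted by cpu_mask that has the
 * shortest run queue, or -1 if no CPU qualifies. Ties go to the lowest index. */
static int find_least_loaded_cpu(const cpu_info_t cpus[MAX_CPUS], uint64_t cpu_mask)
{
    int best = -1;

    for (int i = 0; i < MAX_CPUS; i++) {
        if (!(cpu_mask & (UINT64_C(1) << i)) || !cpus[i].online)
            continue;
        if (best < 0 || cpus[i].rq_len < cpus[best].rq_len)
            best = i;
    }
    return best;
}
```

A caller has to handle the -1 case, for example when a thread's affinity mask names only the CPU that is going offline.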

9.4 Domain Watchdog

The kernel maintains a single watchdog thread at SCHED_REALTIME priority. Each server domain that registers with the watchdog receives a heartbeat capability. The domain must call zx_watchdog_heartbeat within a configured interval (default: 5 seconds).

Watchdog state machine (per registered domain):

  WATCHDOG_OK ──── heartbeat received ────► WATCHDOG_OK
       │
       │ interval elapsed without heartbeat
       ▼
  WATCHDOG_WARN ──── heartbeat received ──► WATCHDOG_OK
       │
       │ second interval elapsed
       ▼
  WATCHDOG_FAULT
       │
       ▼
  domain_fault(domain)   // triggers fault containment flow (Section 5.5)

The watchdog thread runs on a dedicated CPU (CPU 0 by convention) and is never migrated. It is the only SCHED_REALTIME thread that the kernel creates at boot time.
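
The per-domain state machine is small enough to express as a transition function. This is a sketch only: the event enum and `watchdog_step` are hypothetical names, and treating WATCHDOG_FAULT as terminal (a late heartbeat does not undo the fault) is an assumption, since the diagram does not show a transition out of that state.

```c
typedef enum { WATCHDOG_OK, WATCHDOG_WARN, WATCHDOG_FAULT } watchdog_state_t;

/* Hypothetical events fed to the state machine by the watchdog thread. */
typedef enum { WD_EV_HEARTBEAT, WD_EV_INTERVAL_ELAPSED } watchdog_event_t;

static watchdog_state_t watchdog_step(watchdog_state_t s, watchdog_event_t ev)
{
    /* Assumption: once faulted, the domain stays faulted until restarted. */
    if (s == WATCHDOG_FAULT)
        return WATCHDOG_FAULT;

    /* A heartbeat always returns the domain to the healthy state. */
    if (ev == WD_EV_HEARTBEAT)
        return WATCHDOG_OK;

    /* First missed interval warns; the second one faults. */
    return (s == WATCHDOG_OK) ? WATCHDOG_WARN : WATCHDOG_FAULT;
}
```

The caller would invoke domain_fault(domain) when the returned state becomes WATCHDOG_FAULT.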

9.5 Kernel Self-Check (syschk)

The existing zx_system_check() infrastructure is extended with severity levels:

Severity	Action
`ZX_SYSCHK_WARNING`	Log to kernel ring buffer; continue
`ZX_SYSCHK_DEGRADED`	Disable the affected subsystem; log; continue
`ZX_SYSCHK_CORE_CORRUPT`	Disabled-wait PSW (kernel panic)

ZX_SYSCHK_CORE_CORRUPT is reserved for conditions where kernel data structures are known to be corrupted and continued execution would cause silent data loss or security violations. All other conditions should use WARNING or DEGRADED to maximize availability.

9.6 Storage Key Protection

Each domain is assigned a non-zero s390x storage key at creation time. All pages mapped into the domain's address space are assigned that key. The domain's PSW access key field is set to match.

A domain that attempts to access a page with a mismatched storage key receives a protection exception (PGM code 0x04). This is handled as a domain fault (Section 8.4) — the domain is suspended, not the kernel.

This provides a hardware-enforced memory isolation layer that operates independently of DAT. Even if a bug in the kernel's page table management accidentally maps a page from domain A into domain B's address space, the storage key check will prevent domain B from writing it, and from reading it as well when the fetch-protection bit is set in the page's storage key.
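
The key comparison the hardware performs can be modeled in a few lines. The sketch below is a simplified model of key-controlled protection: it ignores the override controls in the control registers, and `skey_access_allowed` is a hypothetical name used only here. Note that fetches are blocked on a key mismatch only when the fetch-protection bit of the storage key is set.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKEY_ACC_SHIFT  4     /* access-control bits live in the high nibble */
#define SKEY_FETCH_PROT 0x08  /* fetch-protection bit of the storage key */

/* Simplified model: returns true if an access with the given PSW key to a
 * page with the given storage key is permitted. */
static bool skey_access_allowed(uint8_t psw_key, uint8_t storage_key, bool is_store)
{
    uint8_t acc = storage_key >> SKEY_ACC_SHIFT;

    /* PSW key 0 matches everything; otherwise the keys must be equal. */
    if (psw_key == 0 || psw_key == acc)
        return true;

    /* On a mismatch, fetches are still allowed unless fetch protection is on. */
    if (!is_store && !(storage_key & SKEY_FETCH_PROT))
        return true;

    return false;
}
```

Under this model, full read and write isolation between domains requires that every domain page carries a non-zero key with the fetch-protection bit set, and that no domain runs with PSW key 0.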

10. Long-Term Implementation Roadmap

10.1 Overview

The roadmap is organized into seven phases. Each phase has a clear prerequisite, a defined deliverable, and a set of subsystems it unlocks. Phases are sequential within a dependency chain but may overlap where dependencies permit.

Phase dependency graph:

  [Phase 1: TCB Hardening]
          │
          ▼
  [Phase 2: Capability Foundation]
          │
          ▼
  [Phase 3: Domain and IPC]
          │
     ┌────┴────┐
     ▼         ▼
[Phase 4:  [Phase 6:
 Server     Memory
 Domain     Completion]
 Infra]
     │
     ▼
[Phase 5: First Server Domains]
     │
     ▼
[Phase 7: Hardening and Observability]

10.2 Phase 1 — TCB Hardening

Prerequisite: Current state (PMM, VMM, slab, KOMS, IRQ, SMP, sync all functional).

Deliverables:

Trap/entry completion: Full irq_frame_t save/restore for all six interrupt classes. SVC, PGM, EXT, IO, MCCK, RESTART handlers dispatch to C. Return path restores full CPU state via LPSWE.
Time subsystem: TOD clock read (STCKF), ktime_t type and ktime_get(). CPU timer setup and quantum enforcement. Clock comparator setup. Timer wheel (8 levels, 64 slots). ktime_sleep().
Scheduler — BATCH class: Per-CPU run queues. schedule(), thread_block(), thread_wake(). Context switch (GPR/FPR/PSW save/restore). CPU timer interrupt → sched_tick(). Work stealing. Idle thread per CPU.

Unlocks: Phase 2 (capability system requires a running scheduler to test domain creation).

10.3 Phase 2 — Capability Foundation

Prerequisite: Phase 1 complete.

Deliverables:

Capability token: 64-bit structure, type/rights/gen/index fields. cap_mint, cap_derive, cap_revoke, cap_lookup.
Capability table: Slab cache with storage key 1. Per-domain flat array. cap_table_alloc, cap_table_free. PF_PINNED pages.
KOMS extension: kobject_t gains cap_gen (generation counter) and global_index (object table index). Global object table (flat array, spinlock-protected). koms_init_obj registers in table. koms_put at zero increments cap_gen before freeing.
Syscalls 0–2: zx_cap_derive, zx_cap_transfer, zx_cap_revoke. SVC dispatch table. Capability validation on every syscall entry.
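
To make the token layout concrete, here is one possible packing of the four fields into 64 bits. The field widths (8-bit type, 16-bit rights, 16-bit generation, 24-bit index), the helper names, and the rule that a restricted token may only drop rights are assumptions chosen for illustration; the design above fixes only the total size and the field names.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cap_token_t;

/* Hypothetical layout: [63:56] type | [55:40] rights | [39:24] gen | [23:0] index */
static cap_token_t cap_pack(uint8_t type, uint16_t rights, uint16_t gen, uint32_t index)
{
    return ((uint64_t)type << 56) | ((uint64_t)rights << 40) |
           ((uint64_t)gen << 24) | (uint64_t)(index & 0xFFFFFFu);
}

static uint8_t  cap_type(cap_token_t t)   { return (uint8_t)(t >> 56); }
static uint16_t cap_rights(cap_token_t t) { return (uint16_t)(t >> 40); }
static uint16_t cap_gen(cap_token_t t)    { return (uint16_t)(t >> 24); }
static uint32_t cap_index(cap_token_t t)  { return (uint32_t)(t & 0xFFFFFFu); }

/* A token is stale once the object's generation counter has moved on,
 * which is what incrementing cap_gen before freeing is meant to achieve. */
static bool cap_is_current(cap_token_t t, uint16_t object_gen)
{
    return cap_gen(t) == object_gen;
}

/* Assumed derivation rule: the new token keeps at most the rights in 'keep'. */
static cap_token_t cap_restrict(cap_token_t t, uint16_t keep)
{
    return cap_pack(cap_type(t), cap_rights(t) & keep, cap_gen(t), cap_index(t));
}
```

With this shape, revocation needs no walk over outstanding tokens: bumping the object's generation invalidates every token minted against the old value.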

Unlocks: Phase 3 (domain creation requires capability tables).

10.4 Phase 3 — Domain and IPC

Prerequisite: Phase 2 complete.

Deliverables:

Domain object: domain_t kobject type. vm_space_t creation per domain. Capability table allocation at domain birth. Domain lifecycle state machine. domain_create, domain_kill.
Thread object: thread_t kobject type. Kernel stack allocation. thread_create, thread_start, thread_exit. Integration with scheduler (enqueue on thread_start).
SVC entry — capability validation: Every syscall validates its capability argument before proceeding. ERR_CAP_INVALID on failure.
IPC sync fastpath: zx_ipc_call, zx_ipc_recv, zx_ipc_reply. Direct thread switch. Register-passing (GPRs 2–9). Fastpath conditions enforced.
IPC async queue: Ring buffer slab allocation. zx_ipc_send. Enqueue/dequeue. Receiver wake on enqueue.
Syscalls 3–17: Full domain, thread, memory, and endpoint syscalls.

Unlocks: Phase 4 and Phase 6 (both depend on working domains and IPC).

10.5 Phase 4 — Server Domain Infrastructure

Prerequisite: Phase 3 complete.

Deliverables:

Fault containment: domain_suspend, deliver_fault_event. Fault event IPC to supervisor domain. domain_restart, domain_kill from supervisor.
Domain watchdog: Watchdog thread at SCHED_REALTIME. Heartbeat capability. zx_watchdog_register, zx_watchdog_heartbeat. Two-strike fault trigger.
MCCK recovery: Storage error classification. pmm_offline_page. CPU offline and domain migration. KOBJ_EVENT_DOMAIN_MIGRATE.
Storage key assignment: Per-domain key allocation. Page key assignment on vmm_insert_vma. PSW access key set on context switch.
System manager domain: The first server domain, started by the kernel at boot. Receives fault events for all other server domains. Implements restart policy.

Unlocks: Phase 5 (server domains require fault containment to be safe).

10.6 Phase 5 — First Server Domains

Prerequisite: Phase 4 complete.

Deliverables:

Console server: Wraps DIAG 0x08 / SCLP. Exposes ep.write endpoint. Accepts zx_ipc_send with a string payload. Replaces printk for user-visible output.
Channel I/O server: Wraps CSS interrupt dispatch. Accepts subchannel registration from other domains. Exposes ep.request for I/O submission. Returns I/O completion via IPC reply.
Block I/O server: Built on channel I/O server. Implements ECKD (DASD) read/write. Exposes ep.request with a block I/O protocol.
Filesystem server (minimal): Built on block I/O server. Implements a read-only flat filesystem (sufficient to load user programs). Exposes ep.open, ep.read.

Unlocks: Phase 7 (hardening requires a running system to test against).

10.7 Phase 6 — Memory Management Completion

Prerequisite: Phase 3 complete (can proceed in parallel with Phase 4/5).

Deliverables:

Demand paging: PGM translation exception → vmm_find_vma → pmm_alloc_page → mmu_map_page → retry. Anonymous and file-backed VMAs.
Copy-on-write: VM_COW flag on shared VMAs. Write protection fault → page copy → remap. Used for domain cloning (fork-like semantics).
Page reclaim: LRU list per zone. Reclaim under memory pressure (triggered when ZONE_NORMAL.free_pages < LOW_WATERMARK). Reclaim selects cold anonymous pages; writes dirty pages to swap device.
Swap: Capability-gated swap device via channel I/O server. Swap page table entries. pmm_swap_out, pmm_swap_in.

Unlocks: Phase 7 (full memory management required for production use).

10.8 Phase 7 — Hardening and Observability

Prerequisite: Phases 4, 5, and 6 complete.

Deliverables:

KOMS attribute bus: Expose domain/thread/memory statistics as KOMS attributes. Readable via zx_attr_get syscall with a capability.
Kernel ring buffer: Fixed-size circular log buffer. Capability-gated read via ep.klog endpoint. Replaces printk for kernel diagnostics.
Capability audit log: Every cap_mint, cap_derive, cap_revoke, and cap_transfer is logged to a dedicated ring buffer. Readable by the system manager domain.
Syscall fuzz harness: Host-side tool that generates random syscall sequences and validates that the kernel never panics (only returns error codes) on invalid inputs.
SMP stress test: Multi-domain IPC stress test exercising the fastpath, work stealing, and domain fault/restart under load.

10.9 Milestone Summary

Phase	Key Deliverable	Unlocks
1	Trap, time, scheduler	Capability system
2	Capability tokens and tables	Domain creation
3	Domains, threads, IPC	Server domains, memory completion
4	Fault containment, watchdog, MCCK	First server domains
5	Console, channel I/O, block I/O, filesystem	Full system
6	Demand paging, CoW, reclaim, swap	Production memory management
7	Observability, audit, hardening	Production readiness

End of ZXF-KRN-DESIGN-001 Rev 26h1.0

Kernel Overview

Document Revision: 26h1.0

1. Entry Contract

The kernel receives control from ZXFL with the following guaranteed state:

Resource	State
DAT	On — CR1 holds the ASCE built by the loader
Interrupts	Masked — all interrupt classes disabled
`%r2`	HHDM virtual address of `zxfl_boot_protocol_t`
`%r15`	HHDM virtual address of initial stack top (32 KB loader stack)
All other GPRs	Undefined

The kernel entry point is zxfoundation_global_initialize(zxfl_boot_protocol_t *boot). The first action must be to validate boot->magic == ZXFL_MAGIC. Any other use of the protocol before this check is undefined behavior.

2. Subsystem Table

Subsystem	Source location	Status
Early init	`zxfoundation/init/`	Active
PMM	`zxfoundation/memory/pmm.c`	Active
VMM	`zxfoundation/memory/vmm.c`	Active
Slab	`zxfoundation/memory/slab.c`	Active
kmalloc	`zxfoundation/memory/kmalloc.c`	Active
Heap	`zxfoundation/memory/heap.c`	Active
MMU	`arch/s390x/mmu/mmu.c`	Active
Per-CPU	`arch/s390x/cpu/percpu.c`	Active
qspinlock	`arch/s390x/cpu/qspinlock.c`	Active
Mutex	`zxfoundation/sync/mutex.c`	Active
RW Lock	`zxfoundation/sync/rwlock.c`	Active
Semaphore	`zxfoundation/sync/semaphore.c`	Active
Wait queue	`zxfoundation/sync/waitqueue.c`	Active
RCU	`zxfoundation/sync/rcu.c`	Active
SRCU	`zxfoundation/sync/srcu.c`	Active
kobject	`zxfoundation/object/kobject.c`	Active
printk	`zxfoundation/sys/printk.c`	Active
panic	`zxfoundation/sys/panic.c`	Active
Trap	`arch/s390x/trap/`	Active
SMP	`arch/s390x/cpu/smp.c`	Active
Scheduler	`zxfoundation/sched/`	Active
IRQ	`arch/s390x/irq/`	Stub
Time	`arch/s390x/time/`	Stub

Early Initialization

Document Revision: 26h1.0
Source: zxfoundation/init/main.c

1. Initialization Sequence

zxfoundation_global_initialize performs early initialization in strict order before enabling interrupts or starting APs:

Step	Action	Notes
1	`zxfl_lowcore_setup()`	Install kernel new PSWs in the BSP lowcore
2	`diag_setup()` + `printk_initialize()`	Enable console output
3	Validate `boot->magic == ZXFL_MAGIC`	Panic if wrong
4	Validate `boot->binding_token`	Recompute and compare; panic on mismatch
5	`validate_stack_frame()`	Verify ZXVL stack canaries
6	`verify_kernel_checksums()`	Re-verify SHA-256 segment digests from HHDM
7	Print machine/LPAR/CPU info	If `ZXFL_FLAG_SYSINFO` / `ZXFL_FLAG_SMP` set
8	`percpu_init_bsp()`	Initialize BSP per-CPU block at prefix+0x400 (`LC_PERCPU_OFFSET`)
9	`arch_cpu_features_init(boot)`	Detect STFLE facilities, populate feature flags
10	`rcu_init()`	Initialize RCU subsystem
11	`pmm_init(boot)`	Register usable memory regions; reserve loader/kernel/pool
12	`mmu_init()`	Install 8 KB VA-0 lowcore window; scrub identity map; inherit EDAT-1/2 state. Order is mandatory — see §4.
13	`vmm_init()`	Set up vmalloc region
14	`slab_init()`	Initialize slab caches
15	`kmalloc_init()`	Initialize kmalloc size classes
16	`trap_init()`	Install program-check new PSW; enable trap handler
17	`smp_init()`	Start all APs (SIGP sequence); each AP calls `trap_init()`
18	`sched_init()`	BSP becomes idle (PID 0); spawns `kernel_init` (PID 1)

2. Security Checks (Steps 3–6)

These checks run immediately after lowcore and console setup (steps 1–2), before any other subsystem is initialized. A failure at any point calls panic(), which loads a disabled-wait PSW.

Binding token (step 4): The kernel recomputes ZXVL_COMPUTE_TOKEN(stfle_fac[0], ipl_schid) and compares it to boot->binding_token. This ties the running kernel to the specific hardware and IPL device — a protocol struct copied from another machine will fail here.

Stack frame (step 5): The loader writes a two-word canary at boot->kernel_stack_top. The kernel verifies frame[0] == ZXVL_FRAME_MAGIC_A and frame[1] == ZXVL_FRAME_MAGIC_B ^ binding_token. A mismatch indicates stack corruption or an unauthorized loader.

Checksum re-verification (step 6): The kernel re-reads the zxvl_checksum_table_t from kernel_phys_start + ZXVL_CKSUM_TABLE_OFFSET (via HHDM) and recomputes SHA-256 for each PT_LOAD segment. This catches any modification to the kernel image between loader verification and kernel execution.
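
The stack-frame check in step 5 amounts to two comparisons. The sketch below shows the shape of that check only: the magic values are placeholders (the real ZXVL_FRAME_MAGIC_A and ZXVL_FRAME_MAGIC_B constants are not given in this chapter), the canary words are assumed to be 64 bits wide, and `stack_frame_valid` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration; not the real constants. */
#define ZXVL_FRAME_MAGIC_A UINT64_C(0x1111111111111111)
#define ZXVL_FRAME_MAGIC_B UINT64_C(0x2222222222222222)

/* frame points at the two-word canary the loader wrote at kernel_stack_top. */
static bool stack_frame_valid(const uint64_t frame[2], uint64_t binding_token)
{
    return frame[0] == ZXVL_FRAME_MAGIC_A &&
           frame[1] == (ZXVL_FRAME_MAGIC_B ^ binding_token);
}
```

Because the second word is mixed with the binding token, a canary copied from a boot on different hardware or a different IPL device fails the check even if both magic constants are known.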

3. PMM Reservation (Step 11)

pmm_init registers all ZXFL_MEM_USABLE regions from the boot protocol memory map, then marks the following ranges as reserved:

Range	Reason
`[0, 1 MB)`	Lowcore + loader code
`[kernel_phys_start, kernel_phys_end)`	Kernel image
`[pool_base, pgtbl_pool_end)`	Bootloader page table pool
Each module's `[phys_start, phys_start + size)`	Loaded modules

4. MMU Initialization Ordering Invariant (Step 12)

mmu_init() takes ownership of the bootloader ASCE and replaces the bootloader's 8 GB identity map with a precise 8 KB window at VA 0. This operation has a strict, unbreakable ordering requirement rooted in z/Architecture hardware behavior.

Why VA 0 Must Always Be Mapped

Every interrupt handler entry stub (trap_pgm_entry, trap_ext_entry, etc.) begins with:

lg  %r1, LC_ASYNC_STACK(0)   // effective VA = 0x0350

The zero base register is not an error — it is the only way to load a value before registers have been saved. Because DAT is active when this runs, VA 0x350 must be translated successfully. If the mapping is absent even for one instruction cycle while interrupts are unmasked, a program-check fires, SAVE_FRAME tries to load from VA 0x350 again, and the CPU enters an infinite Region-first-translation exception (0x0039) death loop.

Required Sequence in `mmu_init()`

 Step 1: mmu_map_page(VA 0x0000 → PA 0x0000)   // build mapping first
 Step 2: mmu_map_page(VA 0x1000 → PA 0x1000)   // both pages of the lowcore
 Step 3: scrub r1[1..2046]                      // revoke identity map
 Step 4: mmu_flush_tlb_local()                  // make scrub visible to CPU

Steps 1–2 must precede steps 3–4. The new 8 KB mapping is committed into the live R1 table before any identity entry is removed, so VA 0x350 is always valid.

Can This Be Avoided by Enabling DAT Earlier?

No. The requirement is not a consequence of when DAT is enabled; it comes from how SAVE_FRAME accesses the lowcore. Even if ZXFL enabled DAT internally and passed the kernel a fully virtual address space, the kernel's entry.S would still execute lg %r1, 0x350(0) and still require VA 0x350 to be mapped. This is standard z/Architecture operating system design — Linux s390x, z/VM, and z/OS all maintain an equivalent lowcore window at virtual address 0 for the same reason. See docs/src/kernel/trap.md for the full architectural rationale.

System Check (syschk)

Document Revision: 26h1.3
Status: Active

1. Overview

The System Check subsystem (syschk) is the kernel's mechanism for halting the system when a condition is detected from which execution cannot safely continue.

The halt path acquires no locks, calls no kernel subsystems, and dereferences no kernel data structures. It is safe to call from any context: exception handlers, IRQ handlers, early init, or a state where kernel memory is corrupt.

2. Error Code Encoding

Every system check is identified by a 16-bit code with three fields:

 15      12 11       8 7             0
 ┌────────┬──────────┬───────────────┐
 │ CLASS  │  DOMAIN  │     TYPE      │
 │  4 b   │   4 b    │     8 b       │
 └────────┴──────────┴───────────────┘

Field	Bits	Purpose
CLASS	15–12	Severity class
DOMAIN	11–8	Originating subsystem
TYPE	7–0	Specific condition within the domain

2.1 Severity Classes

Class	Value	Behavior
FATAL	0xF	Always halts
CRITICAL	0xC	Always halts
WARNING	0x3	Always halts

All classes halt unconditionally. The class field exists for post-mortem triage, not for runtime branching.

2.2 Domains

Domain	Value	Subsystem
CORE	0x0	Core kernel / initialization
MEM	0x1	Memory subsystem
SYNC	0x2	Synchronization primitives
ARCH	0x3	Architecture / hardware
SCHED	0x4	Scheduler
IO	0x5	I/O subsystem
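
The three fields can be packed and unpacked with shifts and masks. In this sketch the enum constants and helper names (`syschk_make`, `syschk_class`, and so on) are hypothetical; only the bit positions and the class and domain values come from the tables above.

```c
#include <stdint.h>

typedef uint16_t zx_syschk_code_t;

enum { SYSCHK_CLASS_WARNING = 0x3, SYSCHK_CLASS_CRITICAL = 0xC, SYSCHK_CLASS_FATAL = 0xF };
enum { SYSCHK_DOM_CORE = 0x0, SYSCHK_DOM_MEM = 0x1, SYSCHK_DOM_SYNC = 0x2,
       SYSCHK_DOM_ARCH = 0x3, SYSCHK_DOM_SCHED = 0x4, SYSCHK_DOM_IO = 0x5 };

/* CLASS in bits 15-12, DOMAIN in bits 11-8, TYPE in bits 7-0. */
static inline zx_syschk_code_t syschk_make(unsigned cls, unsigned dom, unsigned type)
{
    return (zx_syschk_code_t)(((cls & 0xFu) << 12) | ((dom & 0xFu) << 8) | (type & 0xFFu));
}

static inline unsigned syschk_class(zx_syschk_code_t c)  { return (c >> 12) & 0xFu; }
static inline unsigned syschk_domain(zx_syschk_code_t c) { return (c >> 8) & 0xFu; }
static inline unsigned syschk_type(zx_syschk_code_t c)   { return c & 0xFFu; }
```

For example, a FATAL condition in the MEM domain with type 0x2A encodes as 0xF12A, which a post-mortem tool can split back into its three fields.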

3. Halt Sequence

zx_system_check(code, msg)
        │
        ▼
  arch_local_irq_disable()
        │
        ▼
  g_halting set? ──YES──► arch_sys_halt()
        │
        │ NO
        ▼
  g_halting = 1
        │
        ▼
  write zx_crash_record_t to lowcore + 0x1400
  (magic, code, PSW snapshot, reason string)
        │
        ▼
  raw SIGP STOP loop over g_cpu_map[]
  (boot protocol array; no percpu_areas lookup)
  CC=2 retried; CC=3 skipped
        │
        ▼
  arch_sys_halt()  ← disabled-wait PSW; machine stops

4. Crash Record

Before halting, the issuing CPU writes a zx_crash_record_t to a fixed offset (0x1400) within the BSP lowcore. The lowcore resides at a fixed physical address, is always mapped, and is accessible regardless of kernel heap or DAT state.

Offset  Size  Field
------  ----  -----
0x00    8     magic  (0x5A584352554E4348 "ZXCRUNCH")
0x08    2     code   (zx_syschk_code_t)
0x0A    6     pad
0x10    8     psw_mask  (EPSW at time of syschk)
0x18    8     psw_addr  (0; not available from EPSW)
0x20    128   msg    (NUL-terminated reason string)

The record is read post-mortem by a debugger or operator console. It is not printed to the console during the halt sequence.
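
The layout above maps directly onto a C structure whose offsets can be pinned at compile time. This is a sketch under assumptions: the member names mirror the field column, `ZX_CRASH_MAGIC` and `crash_record_fill` are hypothetical names, and the fill routine uses plain loops rather than library calls to match the constraint that the halt path calls no other subsystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZX_CRASH_MAGIC UINT64_C(0x5A584352554E4348)  /* "ZXCRUNCH" */

typedef struct {
    uint64_t magic;      /* 0x00 */
    uint16_t code;       /* 0x08 */
    uint8_t  pad[6];     /* 0x0A */
    uint64_t psw_mask;   /* 0x10 */
    uint64_t psw_addr;   /* 0x18 */
    char     msg[128];   /* 0x20 */
} zx_crash_record_t;

static_assert(offsetof(zx_crash_record_t, code) == 0x08, "code offset");
static_assert(offsetof(zx_crash_record_t, psw_mask) == 0x10, "psw_mask offset");
static_assert(offsetof(zx_crash_record_t, psw_addr) == 0x18, "psw_addr offset");
static_assert(offsetof(zx_crash_record_t, msg) == 0x20, "msg offset");
static_assert(sizeof(zx_crash_record_t) == 0xA0, "record size");

static void crash_record_fill(zx_crash_record_t *rec, uint16_t code,
                              uint64_t psw_mask, const char *msg)
{
    size_t i = 0;

    rec->magic    = ZX_CRASH_MAGIC;
    rec->code     = code;
    rec->psw_mask = psw_mask;
    rec->psw_addr = 0;                       /* not available from EPSW */
    for (size_t p = 0; p < sizeof rec->pad; p++)
        rec->pad[p] = 0;

    /* Copy at most 127 characters, then NUL-fill the rest of the buffer. */
    if (msg != NULL)
        for (; i < sizeof rec->msg - 1 && msg[i] != '\0'; i++)
            rec->msg[i] = msg[i];
    for (; i < sizeof rec->msg; i++)
        rec->msg[i] = '\0';
}
```

The record occupies 0xA0 bytes, so a record placed at lowcore offset 0x1400 ends at 0x14A0.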

5. Re-entrancy

If a second system check fires on any CPU while a halt is already in progress, the re-entrant call detects g_halting immediately after IRQ disable and proceeds directly to arch_sys_halt(). The crash record is not overwritten.

g_halting is a volatile int, not an atomic. If the memory subsystem is corrupt, atomic operations cannot be trusted.

6. SMP Teardown

The halt path iterates g_cpu_map[] — the boot protocol's CPU map, registered at init time via zx_syschk_register_cpu_map(). This array is loader-written, physically contiguous, and never freed. It does not depend on percpu_areas[] or any kernel allocator.

sigp() is a single inline assembly instruction. It acquires no locks. CC=2 (busy) is retried in a tight loop. CC=3 (not operational) is skipped.
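
The condition-code handling can be illustrated with a host-side model in which the SIGP instruction is replaced by a function pointer. Everything here is a sketch: `stop_other_cpus`, the callback type, and the fake implementation are hypothetical, and skipping the issuing CPU's own address is an assumption (the issuing CPU halts itself afterwards through arch_sys_halt()).

```c
#include <stddef.h>
#include <stdint.h>

/* SIGP condition codes. */
enum { SIGP_CC_ACCEPTED = 0, SIGP_CC_STATUS_STORED = 1,
       SIGP_CC_BUSY = 2, SIGP_CC_NOT_OPERATIONAL = 3 };

/* Stand-in for the inline-assembly sigp() wrapper; returns the condition code. */
typedef int (*sigp_stop_fn)(uint16_t cpu_addr);

/* Send STOP to every CPU in the map except self. Returns the number of
 * CPUs that accepted the order. */
static unsigned stop_other_cpus(const uint16_t *cpu_map, size_t n,
                                uint16_t self_addr, sigp_stop_fn sigp_stop)
{
    unsigned accepted = 0;

    for (size_t i = 0; i < n; i++) {
        int cc;

        if (cpu_map[i] == self_addr)
            continue;
        do {
            cc = sigp_stop(cpu_map[i]);      /* CC=2: retry in a tight loop */
        } while (cc == SIGP_CC_BUSY);
        if (cc == SIGP_CC_ACCEPTED)          /* CC=3 and CC=1 are skipped */
            accepted++;
    }
    return accepted;
}

/* Host-side fake: CPU 1 is busy twice, CPU 2 is not operational. */
static int fake_busy_left = 2;
static int fake_sigp_stop(uint16_t cpu_addr)
{
    if (cpu_addr == 2)
        return SIGP_CC_NOT_OPERATIONAL;
    if (cpu_addr == 1 && fake_busy_left > 0) {
        fake_busy_left--;
        return SIGP_CC_BUSY;
    }
    return SIGP_CC_ACCEPTED;
}
```

The loop holds no lock and touches only the caller-supplied array, which mirrors the constraint that the halt path must not depend on kernel allocators or percpu_areas[].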

7. WARNING-Class Codes

WARNING codes halt unconditionally. There is no filter mechanism. If a subsystem needs to log a recoverable condition, it should call printk directly and not use zx_system_check.

8. Revision History

Revision	Change
26h1.3	Removed filter API; all classes halt unconditionally; crash record written to lowcore; raw SIGP loop; no printk on halt path
26h1.2	Re-entrant guard moved first; SMP teardown before printk; static BSS message buffer
26h1.1	Initial release

Per-CPU Data

Document Revision: 26h1.3
Sources: include/arch/s390x/cpu/lowcore.h, include/zxfoundation/percpu.h, arch/s390x/cpu/percpu.c

1. Layout

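Each CPU's prefix area (lowcore) is a monolithic 8 KB block (two contiguous physical pages). The physical address of this block is loaded into the hardware prefix register via SPX. The prefix register transparently redirects real addresses 0x0000–0x1FFF to the CPU's own physical lowcore and, symmetrically, redirects real addresses inside that lowcore block back to absolute 0x0000–0x1FFF. The redirection applies to every access that uses a real address, including DAT-translated accesses after translation.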

The layout unifies hardware-assigned fields and software-defined per-CPU data into a single structure (zx_lowcore_t):

Physical Prefix Area (8 KB)
┌──────────────────────────────┐ 0x000
│  Hardware Lowcore            │   PSWs, interrupt codes, save areas (PoP §4)
├──────────────────────────────┤ 0x400  ← LC_PERCPU_OFFSET
│  Software Per-CPU Block      │   prefix_base, cpu_id, lock_depth,
│  (zx_percpu_t percpu)        │   MCS nodes, RCU state, PCP caches
├──────────────────────────────┤ 0x1200
│  Hardware Save Areas         │   GPRs, FPRs, CRs, ARs
└──────────────────────────────┘ 0x2000

2. Access — Current CPU

To access the current CPU's own per-CPU data, the kernel uses zx_lowcore(), which returns the HHDM-mapped pointer to the active lowcore. Because the prefix register already routes absolute-address-0 to this CPU's physical lowcore, and the HHDM maps physical 0 to CONFIG_KERNEL_VIRT_OFFSET, zx_lowcore() always resolves to the correct CPU without needing the prefix register value at all.

Macro	Description
`percpu_get(field)`	Read a field from the current CPU's `percpu` block
`percpu_set(field, val)`	Write a field to the current CPU's `percpu` block
`percpu_inc(field)`	Increment a field in place
`percpu_dec(field)`	Decrement a field in place
`percpu_ptr_to(field)`	Pointer to a field in the current CPU's block

3. Access — Other CPUs (`zx_lowcore_cpu`)

3.1 The Hardware Prefix Aliasing Bug

Accessing another CPU's lowcore by index into a global pointer array is deceptively dangerous on s390x. Consider the global array __percpu_areas_raw[] where:

__percpu_areas_raw[0] = HHDM pointer to BSP lowcore = CONFIG_KERNEL_VIRT_OFFSET + 0
__percpu_areas_raw[1] = HHDM pointer to AP-1 lowcore = CONFIG_KERNEL_VIRT_OFFSET + P

When AP-1 (whose prefix register is P) reads a value from address CONFIG_KERNEL_VIRT_OFFSET + 0 (i.e., the BSP's HHDM lowcore), the MMU translates it to physical address 0. The prefix register then remaps physical 0 to physical P — so AP-1 silently reads its own lowcore, not the BSP's.

Symmetrically, when AP-1 reads from CONFIG_KERNEL_VIRT_OFFSET + P, the MMU translates it to physical P. The prefix register remaps physical P to physical 0 — so AP-1 silently reads the BSP's lowcore.

The result: every AP's lookups of its own lowcore and of the BSP's lowcore are silently swapped. With direct array indexing, IPI delivery, RCU quiescent-state tracking, and PMM per-CPU page caches all operate on the wrong CPU's data. The system still "mostly works" because the perfect symmetry of the swap lets IPIs reach all CPUs, masking the corruption.

3.2 The Safe Accessor: `zx_lowcore_cpu(cpu)`

__percpu_areas_raw[] must never be accessed directly. Use zx_lowcore_cpu(cpu) defined in include/zxfoundation/percpu.h, which applies an inverse prefix swap in software:

#define zx_lowcore_cpu(cpu)                                                    \
    ({                                                                          \
        zx_lowcore_t *__lc = __percpu_areas_raw[(cpu)];                        \
        zx_lowcore_t *__res = __lc;                                             \
        if (__lc) {                                                             \
            uint64_t __target_real = (uint64_t)__lc - CONFIG_KERNEL_VIRT_OFFSET;\
            uint64_t __my_prefix   = zx_lowcore()->percpu.prefix_base;         \
            if (__target_real == __my_prefix)                                   \
                __res = (zx_lowcore_t *)CONFIG_KERNEL_VIRT_OFFSET;             \
            else if (__target_real == 0)                                        \
                __res = (zx_lowcore_t *)(CONFIG_KERNEL_VIRT_OFFSET + __my_prefix);\
        }                                                                       \
        __res;                                                                  \
    })

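How it works: if the target's physical address equals my_prefix (the target is the calling CPU itself), the hardware would redirect the access to absolute 0, so the macro issues HHDM + 0 instead, which prefixing maps back to the caller's own lowcore. If the target's physical address is 0 (the BSP), the hardware would redirect the access to my_prefix, so the macro issues HHDM + my_prefix, which prefixing maps to absolute 0. Any other CPU's lowcore lies outside both swapped ranges and is used unchanged. On the BSP, where my_prefix is 0, both branches leave the pointer unchanged.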
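
The swap and its software inverse can be checked with a small host-side model that works on lowcore base addresses only. The function names are hypothetical, and the model deliberately ignores offsets within the 8 KB block and the HHDM offset, both of which pass through unchanged.

```c
#include <stdint.h>

/* Model of hardware prefixing applied to a lowcore base address:
 * real 0 goes to the prefix block, and the prefix block goes to absolute 0. */
static uint64_t prefix_apply(uint64_t real_base, uint64_t prefix)
{
    if (real_base == 0)
        return prefix;
    if (real_base == prefix)
        return 0;
    return real_base;
}

/* Software compensation: pick the address to issue so that, after the
 * hardware swap, the access lands on target_base. */
static uint64_t prefix_compensate(uint64_t target_base, uint64_t my_prefix)
{
    if (target_base == my_prefix)
        return 0;
    if (target_base == 0)
        return my_prefix;
    return target_base;
}
```

The two functions are the same involution, so applying the compensation before the hardware swap always yields the intended target: `prefix_apply(prefix_compensate(t, p), p) == t` for every base address `t`.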

The cross-CPU access macros all go through this accessor:

Macro	Description
`percpu_get_on(cpu, field)`	Read from another CPU's `percpu` block
`percpu_set_on(cpu, field, val)`	Write to another CPU's `percpu` block
`percpu_ptr_on(cpu, field)`	Pointer to a field in another CPU's block

4. Initialization

Function	When Called	Effect
`percpu_init_bsp()`	Once, early in `main.c`	Registers BSP lowcore (physical `0x0`) in `__percpu_areas_raw[0]`
`percpu_init_ap(cpu_id, cpu_addr, node)`	Once per AP in `smp_init()`	Allocates 8 KB (order-1), installs prefix via `SPX`, registers in `__percpu_areas_raw[cpu_id]`

5. Fields (`zx_percpu_t`)

Field	Type	Purpose
`prefix_base`	`uint64_t`	Physical address of this CPU's lowcore (used by `zx_lowcore_cpu`)
`cpu_id`	`uint16_t`	Logical CPU ID (0 = BSP)
`cpu_addr`	`uint16_t`	z/Arch CPU address (`STAP` result); used for `SIGP`
`lock_depth`	`uint32_t`	qspinlock nesting depth
`lock_nodes[MAX_LOCK_DEPTH]`	`mcs_node_t[]`	MCS queue nodes for qspinlock
`rcu_gp_seq`	`uint64_t`	RCU grace-period sequence (written by BSP)
`rcu_qs_seq`	`uint64_t`	RCU quiescent-state sequence (written by this CPU)
`in_rcu_read_side`	`uint8_t`	1 if inside `rcu_read_lock()`
`ipi_pending_count`	`uint32_t`	Pending IPI completion counter
`ap_stack_top`	`uint64_t`	Initial AP stack pointer (physical, set before SIGP Restart)
`pcp[ZONE_MAX]`	`pmm_pcplist_t[]`	Per-CPU PMM order-0 page caches, one per memory zone

6. Assembly Offsets

Key lowcore offsets used by entry.S and head64.S are defined as named constants in include/arch/s390x/cpu/lowcore.h and verified at compile time by _Static_assert:

Constant	Value	Field
`LC_ASYNC_STACK`	`0x0350`	`zx_lowcore_t::async_stack`
`LC_MCCK_STACK`	`0x0368`	`zx_lowcore_t::mcck_stack`
`LC_KERNEL_STACK`	`0x0348`	`zx_lowcore_t::kernel_stack`
`LC_RESTART_STACK`	`0x0360`	`zx_lowcore_t::restart_stack`
`LC_KERNEL_ASCE`	`0x0388`	`zx_lowcore_t::kernel_asce`
`LC_PERCPU_OFFSET`	`0x0400`	`zx_lowcore_t::percpu`
`LC_CPU_ID_OFFSET`	`0x0408`	`zx_percpu_t::cpu_id` (within percpu block)
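
A compile-time check of this kind pairs each assembly constant with an offsetof expression. The sketch below uses toy structures (`toy_lowcore_t`, `toy_percpu_t`) whose reserved padding members are assumptions invented to make the offsets line up; the real zx_lowcore_t contains architected fields in those gaps. Only the constant values come from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LC_KERNEL_STACK   0x0348
#define LC_ASYNC_STACK    0x0350
#define LC_RESTART_STACK  0x0360
#define LC_MCCK_STACK     0x0368
#define LC_KERNEL_ASCE    0x0388
#define LC_PERCPU_OFFSET  0x0400
#define LC_CPU_ID_OFFSET  0x0408

typedef struct {
    uint64_t prefix_base;
    uint16_t cpu_id;
    uint16_t cpu_addr;
    uint32_t lock_depth;
} toy_percpu_t;

typedef struct {
    uint8_t      hw_low[0x348];             /* PSWs, interrupt codes, ... */
    uint64_t     kernel_stack;
    uint64_t     async_stack;
    uint64_t     rsvd_358;                  /* placeholder */
    uint64_t     restart_stack;
    uint64_t     mcck_stack;
    uint8_t      rsvd_370[0x388 - 0x370];   /* placeholder */
    uint64_t     kernel_asce;
    uint8_t      rsvd_390[0x400 - 0x390];   /* placeholder */
    toy_percpu_t percpu;
} toy_lowcore_t;

/* If a field moves, the build fails instead of the entry stubs misbehaving. */
static_assert(offsetof(toy_lowcore_t, kernel_stack)  == LC_KERNEL_STACK,  "kernel_stack");
static_assert(offsetof(toy_lowcore_t, async_stack)   == LC_ASYNC_STACK,   "async_stack");
static_assert(offsetof(toy_lowcore_t, restart_stack) == LC_RESTART_STACK, "restart_stack");
static_assert(offsetof(toy_lowcore_t, mcck_stack)    == LC_MCCK_STACK,    "mcck_stack");
static_assert(offsetof(toy_lowcore_t, kernel_asce)   == LC_KERNEL_ASCE,   "kernel_asce");
static_assert(offsetof(toy_lowcore_t, percpu)        == LC_PERCPU_OFFSET, "percpu");
static_assert(offsetof(toy_lowcore_t, percpu) + offsetof(toy_percpu_t, cpu_id)
              == LC_CPU_ID_OFFSET, "cpu_id");
```

Assembly sources cannot use offsetof directly, which is why the constants are maintained by hand and verified from the C side.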

Interrupt Subsystem

Document Revision: 26h1.0
Subsystem: arch/s390x/trap, zxfoundation/irq

1. Overview

The interrupt subsystem handles four of the six z/Architecture interrupt classes: program check, external, I/O, and machine check. It is structured in two layers:

Architecture layer (arch/s390x/trap/) — low-level entry stubs and class-specific C handlers that decode hardware state from the lowcore.
Generic layer (zxfoundation/irq/) — a flat IRQ descriptor table that routes decoded interrupt codes to registered handlers.

Supervisor calls (SVC) are reserved for the future syscall layer and are not dispatched through this subsystem.

2. Interrupt Delivery on z/Architecture

When an interrupt fires, the hardware atomically:

Saves the current PSW into the class-specific old PSW slot in the lowcore (prefix area).
Writes interrupt parameters into fixed lowcore fields.
Loads the class-specific new PSW slot, transferring control to the kernel entry stub.

Hardware fires interrupt
        │
        ▼
  Save current PSW → lowcore old PSW slot (0x0130/0x0150/0x0160/0x0170)
        │
        ▼
  Write interrupt parameters to lowcore (pgm_code, ext_int_code, …)
        │
        ▼
  Load new PSW slot (0x01B0/0x01D0/0x01E0/0x01F0) → entry stub

The new PSW slots are installed by zx_lowcore_setup_late() after DAT is enabled. Before that point they hold disabled-wait sentinels.

3. Lowcore Interrupt Slots

Class	Old PSW	New PSW	Parameter fields
External	`0x0130`	`0x01B0`	`ext_int_code` (0x0086)
Program check	`0x0150`	`0x01D0`	`pgm_code` (0x008E)
Machine check	`0x0160`	`0x01E0`	`mcck_interruption_code` (0x00E8)
I/O	`0x0170`	`0x01F0`	`subchannel_nr` (0x00BA)

4. Entry Stubs (`arch/s390x/trap/entry.S`)

Each entry stub performs the following sequence without touching any kernel data structure:

entry stub
  │
  ├─ Load dedicated stack pointer from lowcore
  │    async_stack (0x0350) for PGM / EXT / IO
  │    mcck_stack  (0x0368) for MCCK
  │
  ├─ Allocate 160-byte ABI save area + 160-byte interrupt frame
  │
  ├─ Store GPRs r0–r15 into frame.gprs[0..15]
  │
  ├─ Copy old PSW (mask + addr) from lowcore into frame.psw_mask/psw_addr
  │
  ├─ Set %r2 = &frame  (first argument to C handler)
  │
  ├─ BRASL → C handler (do_pgm_check / do_ext_interrupt / …)
  │
  └─ Restore GPRs r0–r14, LPSWE from frame.psw_mask

The machine-check stub uses a separate stack (mcck_stack) so that the handler runs even if the async stack is corrupt.

4.1 Interrupt Frame Layout

Offset  Size  Field
------  ----  -----
0x00    128   gprs[0..15]   — GPRs at interrupt time
0x80    8     psw_mask      — old PSW mask word
0x88    8     psw_addr      — old PSW instruction address

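Total: 160 bytes (IRQ_FRAME_SIZE). The three fields above occupy 0x00–0x8F (144 bytes); the remaining 16 bytes bring the frame up to IRQ_FRAME_SIZE.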
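The layout above can be written as a C struct whose offsets are checked at compile time. This is a sketch only: the type name irq_frame_sketch_t and the trailing reserved words are assumptions, added so that the struct reaches the documented IRQ_FRAME_SIZE. The real definition may use those 16 bytes differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IRQ_FRAME_SIZE 160  /* documented total frame size */

/* Hypothetical mirror of the frame built by the entry stubs. */
typedef struct irq_frame_sketch {
    uint64_t gprs[16];     /* 0x00: GPRs r0-r15 at interrupt time */
    uint64_t psw_mask;     /* 0x80: old PSW mask word */
    uint64_t psw_addr;     /* 0x88: old PSW instruction address */
    uint64_t reserved[2];  /* 0x90: assumed padding up to IRQ_FRAME_SIZE */
} irq_frame_sketch_t;

/* The entry stub is assembly, so the C view must match it byte for byte. */
static_assert(offsetof(irq_frame_sketch_t, psw_mask) == 0x80, "psw_mask offset");
static_assert(offsetof(irq_frame_sketch_t, psw_addr) == 0x88, "psw_addr offset");
static_assert(sizeof(irq_frame_sketch_t) == IRQ_FRAME_SIZE, "frame size");
```

Compile-time checks of this kind catch a mismatch between the assembly stub and the C handler before the kernel ever takes an interrupt.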

5. IRQ Number Space

The generic layer uses a 16-bit IRQ number partitioned by interrupt class:

0x0000 – 0x00FF   Program check codes  (pgm_code & 0x7FFF)
0x0100 – 0x01FF   External codes       (ext_int_code)
0x0200 – 0x02FF   I/O subchannel numbers (subchannel_nr & 0xFF)
0x0300 – 0x03FF   Machine-check sub-codes (mcic >> 56)

The descriptor table has ZX_IRQ_NR_MAX = 0x400 entries.

6. IRQ Descriptor Table (`zxfoundation/irq/`)

The table is a flat, statically-allocated BSS array. Each entry holds:

A handler function pointer (irq_handler_t).
An opaque data pointer forwarded to the handler.
flags (ZX_IRQF_SHARED, ZX_IRQF_DISABLED).
A count field incremented on every dispatch.

6.1 Dispatch Path

C handler (do_pgm_check / do_ext_interrupt / …)
  │
  ├─ Read hardware code from lowcore
  ├─ Compute irq = ZX_IRQ_BASE_* + code
  └─ irq_dispatch(irq, frame)
        │
        ├─ Bounds check irq < ZX_IRQ_NR_MAX
        ├─ Increment desc->count
        └─ Call desc->handler (or default handler if NULL)
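The dispatch path can be sketched in a few lines of C. The handler signature, the descriptor field widths, the rejection of a second registration, and the helper count_handler are assumptions made for illustration; only the function names, the bounds check, the count increment, and the default-handler fallback come from this section. Shared handlers (ZX_IRQF_SHARED) are not modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define ZX_IRQ_NR_MAX 0x400u

struct irq_frame;  /* opaque here; built by the entry stub */

/* Assumed handler signature; the guide only names irq_handler_t. */
typedef void (*irq_handler_t)(uint16_t irq, struct irq_frame *frame, void *data);

typedef struct irq_desc {
    irq_handler_t handler;  /* NULL selects the default handler */
    void         *data;     /* opaque pointer forwarded to the handler */
    uint32_t      flags;    /* ZX_IRQF_* */
    uint64_t      count;    /* incremented on every dispatch */
} irq_desc_t;

static irq_desc_t irq_table[ZX_IRQ_NR_MAX];  /* zero-initialised, lives in BSS */
static uint64_t   irq_default_hits;          /* stand-in for the default actions */

static void irq_default_handler(uint16_t irq, struct irq_frame *frame, void *data)
{
    (void)irq; (void)frame; (void)data;
    irq_default_hits++;  /* the kernel defaults call printk or zx_system_check */
}

int irq_register(uint16_t irq, irq_handler_t handler, void *data, uint32_t flags)
{
    /* Assumption: a slot that is already taken is refused. */
    if (irq >= ZX_IRQ_NR_MAX || handler == NULL || irq_table[irq].handler != NULL)
        return -1;
    irq_table[irq].data    = data;
    irq_table[irq].flags   = flags;
    irq_table[irq].handler = handler;  /* publish the handler last */
    return 0;
}

void irq_dispatch(uint16_t irq, struct irq_frame *frame)
{
    if (irq >= ZX_IRQ_NR_MAX)
        return;  /* out-of-range numbers are ignored in this sketch */
    irq_desc_t *desc = &irq_table[irq];
    desc->count++;
    irq_handler_t fn = desc->handler ? desc->handler : irq_default_handler;
    fn(irq, frame, desc->data);
}

/* Example handler: bumps an int supplied through the data pointer. */
static void count_handler(uint16_t irq, struct irq_frame *frame, void *data)
{
    (void)irq; (void)frame;
    (*(int *)data)++;
}
```

A caller would register count_handler for one IRQ number and then let the class-specific C handler call irq_dispatch with the number it computed from the lowcore fields.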

6.2 Default Handler Behavior

IRQ range	Default action
PGM (0x0–0xFF)	`zx_system_check(ARCH_UNHANDLED_TRAP)` — fatal
EXT (0x100–0x1FF)	`printk` + drop
IO (0x200–0x2FF)	`printk` + drop
MCCK (0x300–0x3FF)	`zx_system_check(ARCH_MCHECK)` — fatal

7. Machine-Check Special Case

Before dispatching, do_mcck_interrupt checks the system damage bit (bit 0) of the MCIC. If set, zx_system_check() is called immediately — the descriptor table itself may reside in damaged storage and cannot be trusted.
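Because IBM bit numbering counts from the most significant bit, MCIC bit 0 is bit 63 in C shift terms. The following sketch shows the damage test and the sub-code extraction used for table dispatch. The helper names and the ZX_IRQ_BASE_MCCK spelling are assumptions; the guide only gives the ZX_IRQ_BASE_* prefix and the 0x0300 base.

```c
#include <stdbool.h>
#include <stdint.h>

#define ZX_IRQ_BASE_MCCK 0x0300u  /* base of the machine-check IRQ range */

/* System damage is MCIC bit 0 in IBM numbering, i.e. the top bit. */
static inline bool mcic_system_damage(uint64_t mcic)
{
    return (mcic >> 63) & 1u;
}

/* IRQ number for table dispatch: base plus the top byte of the MCIC. */
static inline uint16_t mcck_irq_number(uint64_t mcic)
{
    return (uint16_t)(ZX_IRQ_BASE_MCCK + (mcic >> 56));
}
```

Checking the damage bit first means the descriptor table is never touched when its backing storage may be unreliable.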

8. Registration API

irq_register(irq, handler, data, flags)  → 0 or -1
irq_unregister(irq)
irq_dispatch(irq, frame)
irq_get_desc(irq)                        → const irq_desc_t *

irq_register and irq_unregister are not SMP-safe at this revision. They must be called during single-threaded initialization or with external serialization.

9. Revision History

Revision	Change
26h1.0	Initial release

Memory Management

Document Revision: 26h1.0

ZXFoundation™'s memory management is organized in five layers:

┌──────────────────────────────────────────┐
│  kmalloc / kfree  (general-purpose)      │
├──────────────────────────────────────────┤
│  Slab allocator   (fixed-size caches)    │
├──────────────────────────────────────────┤
│  VMM              (virtual address space)│
├──────────────────────────────────────────┤
│  PMM              (physical frames)      │
├──────────────────────────────────────────┤
│  MMU              (hardware DAT tables)  │
└──────────────────────────────────────────┘

Page	Contents
PMM	Zone-aware buddy allocator, page descriptors
VMM	Virtual address space, VMA red-black tree, vmalloc
Slab & Kmalloc	Fixed-size object caches, general allocator

Physical Memory Manager (PMM)

Document Revision: 26h1.0
Source: zxfoundation/memory/pmm.c

1. Zones

Zone	Physical range	Purpose
`ZONE_DMA`	`[0, 16 MB)`	Channel I/O buffers (24-bit data-address limit of format-0 CCWs)
`ZONE_NORMAL`	`[16 MB, RAM limit)`	General kernel allocations

Allocations without ZX_GFP_DMA are served from ZONE_NORMAL first. If ZONE_NORMAL is exhausted and ZX_GFP_DMA_FALLBACK is set, the PMM falls back to ZONE_DMA.

2. Buddy Allocator

Free physical frames are managed in a buddy system. Block sizes are powers of two, from order 0 (4 KB) to order 10 (4 MB). Each order has a free list of blocks.

Allocation — walk the free list at the requested order. If empty, split a block from the next higher order. Repeat until a block is found or all orders are exhausted.

Deallocation — compute the buddy PFN (pfn ^ (1 << order)). If the buddy is free at the same order, coalesce and recurse upward.

Free-list links are intrusive PFN fields (buddy_next) rather than virtual pointers, so the lists stay valid no matter where the HHDM maps the descriptor array.
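The buddy arithmetic is small enough to show directly. The helper names below are illustrative, not taken from pmm.c; the XOR formula and the order range are the ones given in this section.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u  /* 4 KB base page */
#define MAX_ORDER  10u  /* largest block: 4 MB */

/* Size in bytes of one block at the given order. */
static inline uint64_t buddy_block_bytes(unsigned order)
{
    return (uint64_t)1 << (PAGE_SHIFT + order);
}

/* PFN of the buddy of the block that starts at pfn.
 * pfn must be aligned to the block size (1 << order frames). */
static inline uint64_t buddy_pfn(uint64_t pfn, unsigned order)
{
    return pfn ^ ((uint64_t)1 << order);
}

/* After coalescing, the merged block starts at the lower of the two PFNs. */
static inline uint64_t buddy_merged_pfn(uint64_t pfn, unsigned order)
{
    return pfn & ~((uint64_t)1 << order);
}
```

For example, the order-3 block at PFN 8 has its buddy at PFN 0, and merging them yields an order-4 block at PFN 0.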

3. Page Descriptor (`zx_page_t`)

Each physical frame has a 32-byte descriptor. The descriptor array is mapped contiguously in the HHDM. At 32 bytes each, 128 descriptors fit in one 4 KB frame and eight fit in one 256-byte cache line, so no descriptor straddles a cache line.

Field	Description
`refcount`	Atomic reference count; 0 = free
`order`	Current buddy order of this block
`flags`	Zone membership, compound page markers
`buddy_next`	PFN of next free block in the buddy list
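A possible C rendering of the descriptor is shown below. The field widths, the reserved members, and the names zx_page_sketch_t and zx_pa_to_pfn are assumptions chosen to reach the documented 32 bytes; only the four field names and the total size come from the table above.

```c
#include <assert.h>
#include <stdint.h>

#define ZX_PAGE_SIZE 4096u

typedef struct zx_page_sketch {
    uint32_t refcount;      /* atomic in the kernel, plain here; 0 = free */
    uint8_t  order;         /* current buddy order of this block */
    uint8_t  reserved0[3];  /* assumed padding */
    uint64_t flags;         /* zone membership, compound page markers */
    uint64_t buddy_next;    /* PFN of the next free block in the buddy list */
    uint64_t reserved1;     /* assumed spare word, pads to 32 bytes */
} zx_page_sketch_t;

static_assert(sizeof(zx_page_sketch_t) == 32, "descriptor must be 32 bytes");
static_assert(ZX_PAGE_SIZE / sizeof(zx_page_sketch_t) == 128, "128 per frame");

/* Index of the descriptor that describes physical address pa. */
static inline uint64_t zx_pa_to_pfn(uint64_t pa)
{
    return pa >> 12;
}
```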

4. GFP Flags

Flag	Meaning
`ZX_GFP_NORMAL`	Standard allocation from `ZONE_NORMAL`
`ZX_GFP_DMA`	Must allocate from `ZONE_DMA`
`ZX_GFP_DMA_FALLBACK`	Try `ZONE_NORMAL`, fall back to `ZONE_DMA`
`ZX_GFP_ZERO`	Zero-fill the allocated pages
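The zone policy implied by these flags can be sketched as a pure function. The numeric flag values, the zone_id_t type, and the helper zone_for_attempt are hypothetical; only the flag names and the fallback rule come from this guide.

```c
#include <stdint.h>

typedef enum { ZONE_NONE = -1, ZONE_DMA = 0, ZONE_NORMAL = 1 } zone_id_t;

/* Hypothetical bit values; only the names are documented. */
#define ZX_GFP_NORMAL       0x0u
#define ZX_GFP_DMA          0x1u
#define ZX_GFP_DMA_FALLBACK 0x2u
#define ZX_GFP_ZERO         0x4u

/* Zone to try on attempt 0 (first choice) or attempt 1 (fallback).
 * Returns ZONE_NONE when there is nothing left to try. */
static inline zone_id_t zone_for_attempt(uint32_t gfp, int attempt)
{
    if (gfp & ZX_GFP_DMA)
        return attempt == 0 ? ZONE_DMA : ZONE_NONE;  /* DMA-only request */
    if (attempt == 0)
        return ZONE_NORMAL;                          /* normal zone first */
    if (attempt == 1 && (gfp & ZX_GFP_DMA_FALLBACK))
        return ZONE_DMA;                             /* opt-in fallback */
    return ZONE_NONE;
}
```

An allocator loop would call this with attempt 0, then attempt 1, and fail the request once ZONE_NONE is returned.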

5. SMP Safety & Per-CPU Lists (PCP)

Each zone has a dedicated ticket spinlock. To reduce contention, order-0 pages are cached in Per-CPU Lists (PCP).

Allocation: CPUs pull from local PCP first without locking (IRQs disabled).
Drain: Global operations (such as pmm_reserve_range) drain the PCP lists on every CPU by sending a SIGP Emergency Signal (IPI) to all other CPUs. This ensures that no CPU still caches a page that is about to be reserved.

6. HHDM Validation

The direct physical mapping in the higher half (HHDM) is validated during initialization:

Validation: pmm_verify_hhdm() checks translation consistency against the loader's memory map, verifying that every usable physical page is mapped at its HHDM virtual address.
EDAT Compliance: Verifies the HHDM's use of Enhanced-DAT large pages (1 MB with EDAT-1, 2 GB with EDAT-2), which reduce TLB pressure.
Consistency: The loader must map the entire physical memory range described in the boot protocol, rounding up to the nearest segment (1 MB) or region-third (2 GB) boundary as required by the z/Architecture DAT structure.

7. Initialization

pmm_init(boot) is called once during early init:

Walk boot->mem_map[] and register all ZXFL_MEM_USABLE regions.
Mark reserved ranges via Surgical Reservation:
- Lowcore/Artifacts: [0, 1 MB) is always reserved to protect lowcore and loader leftovers.
- Kernel Image: [kernel_phys_start, kernel_phys_end) is marked as critical.
- Page Table Pool: [kernel_phys_end, pgtbl_pool_end) is reserved to protect active DAT tables.
- PMM Metadata: The zx_mem_map descriptor array itself.
Insert all non-reserved USABLE frames into the buddy free lists.

Important: Surgical Reservation prevents "Zone Exhaustion" bugs where a large bootloader page pool could otherwise wipe out all available frames in ZONE_DMA (under 16 MB).

Virtual Memory Manager (VMM)

Document Revision: 26h1.0
Source: zxfoundation/memory/vmm.c

1. Address Space Regions

Region	Base	Purpose
HHDM	`0xFFFF800000000000`	Linear physical memory map (built by loader, read-only to VMM)
vmalloc	`0xFFFFC00000000000`	Dynamically mapped kernel memory

2. Virtual Memory Areas (VMAs)

Each allocated virtual range is described by a vm_area_t:

Field	Description
`va_start`	Start of virtual range (page-aligned)
`va_end`	End of virtual range (exclusive)
`flags`	`VM_READ`, `VM_WRITE`, `VM_EXEC`
`rb_node`	Red-Black Tree node for $O(\log n)$ lookup

VMAs are indexed in a Red-Black Tree (rbtree.h). A one-entry MRU cache in vm_space_t provides an $O(1)$ fast path for sequential access patterns.

3. vmalloc

vmm_alloc(size, flags) reserves a contiguous virtual range in the vmalloc region and maps it with PMM-allocated frames:

vmm_alloc(size, flags)
  │
  ├─ Round size up to page boundary
  ├─ Bump-allocate virtual range from vmalloc region
  ├─ Insert VMA into red-black tree
  ├─ For each page in range:
  │    ├─ pmm_alloc_page(flags)
  │    └─ mmu_map_page(kernel_pgtbl, va, pa, prot)
  └─ Return va_start

Frames backing a vmalloc range are not required to be physically contiguous.

4. Large-Object Heap (`kheap`)

For allocations larger than 8 KB, kheap_alloc calls vmm_alloc to back the range with PMM frames. A 64-bit HEAP_MAGIC canary guards the allocation header against buffer underflows.

5. MMU Integration

The VMM calls mmu_map_page (4 KB), mmu_map_large_page (1 MB, if EDAT-1 available), or mmu_map_huge_page (2 GB, if EDAT-2 available) to install PTEs. TLB coherency is handled automatically by the IPTE instruction — no software IPI is required.

Slab Allocator & kmalloc

Document Revision: 26h1.1
Source: zxfoundation/memory/slab.c, zxfoundation/memory/kmalloc.c

1. Slab Allocator

The slab allocator provides fixed-size object caches to amortize the cost of frequent small allocations (VMAs, sync primitives, capability tables, etc.). It uses a magazine-depot architecture for lock-free per-CPU fast paths and SMP-safe bulk operations through the depot.

1.1 Architecture

kmem_cache_t
  ├─ obj_size          (8-byte aligned)
  ├─ storage_key       (s390x storage key for all backing pages)
  ├─ depot_lock        (spinlock protecting the depot lists)
  ├─ full_mags         (depot: magazines with MAG_SIZE objects ready)
  ├─ empty_mags        (depot: magazines ready to be refilled)
  ├─ partial_slabs     (slab pages with free objects remaining)
  ├─ full_slabs        (slab pages fully allocated)
  └─ cpu_mags[MAX_CPUS] (per-CPU active magazine pointer)

Each magazine holds up to MAG_SIZE (31) object pointers. Each slab is one PMM page; the slab header, free-index stack, and object array are all embedded within that page.

1.2 Fast Path (per-CPU, no lock)

alloc:
  IRQs disabled
  if cpu_mag.count > 0 → pop and return
  else → magazine_swap(fill) → pop and return

free:
  IRQs disabled
  if cpu_mag.count < MAG_SIZE → push and return
  else → magazine_swap(drain) → push and return

IRQs are disabled for the duration of the fast path. No lock is taken; the per-CPU magazine is accessed exclusively.
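The per-CPU magazine behaves as a small LIFO stack of object pointers. The sketch below shows only the push and pop halves of the fast path; IRQ masking and magazine_swap are left out, and the struct layout and function names are assumptions.

```c
#include <stddef.h>

#define MAG_SIZE 31

typedef struct magazine {
    unsigned count;           /* number of valid entries in objs[] */
    void    *objs[MAG_SIZE];  /* LIFO stack of free object pointers */
} magazine_t;

/* Pop one object; NULL means the magazine is empty and must be swapped. */
static inline void *mag_pop(magazine_t *m)
{
    return m->count > 0 ? m->objs[--m->count] : NULL;
}

/* Push one object; -1 means the magazine is full and must be swapped. */
static inline int mag_push(magazine_t *m, void *obj)
{
    if (m->count >= MAG_SIZE)
        return -1;
    m->objs[m->count++] = obj;
    return 0;
}
```

Because the most recently freed object is handed out first, a hot object tends to be reused while it is still in the CPU's cache.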

1.3 Slow Path (depot, with lock)

magazine_swap acquires depot_lock. Two sub-paths:

Fill (need objects):

1. full_mags non-empty?
      yes → promote to CPU slot immediately (fast fill)
       no → obtain empty shell from empty_mags (or alloc from mag_cache)
            → cache_refill_magazine (may drop+reacquire depot_lock for PMM)
            → move filled shell to full_mags → promote to CPU slot

Drain (returning a full CPU magazine):

1. Push CPU magazine to full_mags
2. Pull empty shell from empty_mags into CPU slot (or set to nullptr)

1.4 Slab Refill & Lock Discipline

cache_refill_magazine is called with depot_lock held. When a new slab page must be allocated from the PMM:

drop depot_lock
  pmm_alloc_page()      ← PMM zone lock acquired/released here
reacquire depot_lock
re-validate partial_slabs (another CPU may have added one in the window)

This ensures the PMM zone lock and depot_lock are never held simultaneously, eliminating the lock-inversion hazard present in earlier revisions.

1.5 Node Lifecycle

Magazine nodes cycle between:

empty_mags ──fill──▶ (detached, being filled) ──▶ full_mags ──promote──▶ cpu_mag
cpu_mag ──drain──▶ full_mags   empty_mags ◀── (pulled empty shell)

list_del_init is used for all magazine-node removals so nodes are always in a self-pointing state when not on a list, making re-insertion safe without re-initialization.

2. kmalloc

kmalloc(size) routes requests to the appropriate slab cache based on size class.

Size range	Backing
≤ 8 KB	Slab cache (power-of-two class)
> 8 KB	`vmalloc` → `vmm_alloc`

kfree(ptr) returns the object to its originating cache. A header embedded before each allocation records the cache pointer and a canary for use-after-free detection.
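Size-class selection for power-of-two caches can be sketched as follows. The minimum class of 8 bytes and the helper name kmalloc_class_size are assumptions, and the per-allocation header mentioned above is ignored for simplicity.

```c
#include <stddef.h>

#define KMALLOC_MIN_SIZE 8u            /* assumed smallest class */
#define KMALLOC_MAX_SIZE (8u * 1024u)  /* largest slab-backed size */

/* Returns the power-of-two class that serves a request of size bytes,
 * or 0 when the request exceeds 8 KB and must go to vmm_alloc. */
static inline size_t kmalloc_class_size(size_t size)
{
    if (size > KMALLOC_MAX_SIZE)
        return 0;
    size_t cls = KMALLOC_MIN_SIZE;
    while (cls < size)
        cls <<= 1;  /* round up to the next power of two */
    return cls;
}
```

A 100-byte request lands in the 128-byte class, so internal fragmentation is bounded at just under half of the class size.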

3. Initialization Order

pmm_init()      ← must run first; slab needs PMM pages
slab_init()     ← bootstraps cache_cache and mag_cache from a single PMM page
kmalloc_init()  ← registers size-class caches via kmem_cache_create
vmm_notify_slab_ready() ← switches VMM early allocator to kmalloc

4. Strict Requirements

ID	Requirement
SLAB-1	`kmem_cache_alloc` must not be called from hard-IRQ context unless the cache was created with atomic support. Use `kmalloc` with `ZX_GFP_ATOMIC` from IRQ context.
SLAB-2	`kmem_cache_free` must only be called with a pointer returned by `kmem_cache_alloc` on the same cache. Cross-cache free is undefined behavior.
SLAB-3	`kmem_cache_destroy` must only be called after all objects have been returned. Outstanding objects at destroy time trigger a kernel panic.
SLAB-4	`depot_lock` must never be held when calling into the PMM or any allocator that may itself acquire a zone lock. Use the lock-drop protocol in `cache_refill_magazine`.

SMP

Document Revision: 26h1.0
Source: arch/s390x/cpu/

1. CPU Detection

The bootloader detects CPUs by issuing SIGP Sense (order 0x01) to each address in [0, ZXFL_CPU_MAP_MAX). A condition code of 3 means "not operational" — the address is unoccupied. CC 0, 1, or 2 means the CPU exists and is recorded in proto->cpu_map[].

The BSP address is read with STAP (Store CPU Address).

At kernel entry, proto->cpu_count contains the number of detected CPUs and proto->bsp_cpu_addr identifies the boot processor.

2. AP State at Handover

All APs are in the stopped state when the kernel receives control. The bootloader never starts APs. The kernel BSP is responsible for starting each AP:

Step	Action
1	Allocate a private prefix area (8 KB, 8 KB-aligned) for the AP
2	Allocate a private stack for the AP
3	Install interrupt new PSWs in the AP's prefix area
4	`SIGP Initial CPU Reset` — clear the AP's state
5	`SIGP Set Prefix` — point the AP's prefix register at its private lowcore
6	`SIGP Restart` — start the AP at the restart new PSW in its prefix area

Note: AP startup is not yet implemented. The current kernel halts after BSP initialization.

3. Per-CPU Data

Each CPU requires its own:

Prefix area (8 KB): private lowcore with correct new PSWs. Set via SPX.
Stack: the AP must not use the BSP stack or the loader stack.
Per-CPU variables: accessed via the prefix register offset (analogous to %gs on x86).

4. TLB Coherency

z/Architecture hardware handles TLB coherency automatically via the IPTE (Invalidate Page Table Entry) instruction. IPTE sets the invalid bit in a PTE and purges the matching TLB entries on all CPUs in the configuration. No software IPI is required for TLB shootdowns.

mmu_ipte(va):
    ipte %r0, va    ← serialising, hardware-broadcast

PTLB (Purge TLB) flushes the entire local TLB and should only be used during address-space teardown. For single-page invalidation in a running SMP kernel, always use IPTE.

5. SIGP Reference

Order	Code	Use
Sense	`0x01`	Query CPU state
External Call	`0x02`	Send external interrupt to CPU
Emergency Signal	`0x03`	Send emergency signal
Initial CPU Reset	`0x0B`	Clear CPU state before restart
Set Prefix	`0x0D`	Set prefix register on target CPU
Store Status	`0x0E`	Save CPU registers to prefix area
Set Architecture	`0x12`	Switch to z/Architecture mode
Restart	`0x06`	Start AP at the restart new PSW (`0x01A0` in its prefix area)

PSW Manager

Document Revision: 26h1.0
Subsystem: arch/s390x/cpu/psw

1. Overview

The PSW (Program Status Word) manager provides a single, authoritative definition of all z/Architecture PSW mask constants and new-PSW lowcore offsets. Prior to this subsystem, constants were duplicated across zxconfig.h and lowcore.h under different names, and assembly files hardcoded incorrect bit patterns.

All consumers — C translation units, assembly files, the ZXFL loader, and the kernel — include a single header: arch/s390x/cpu/psw.h.

2. PSW Mask Word Layout

The z/Architecture PSW is 16 bytes. The first 8 bytes are the mask word; the second 8 bytes are the instruction address.

Bit  1     PER mask
Bit  5     DAT (address translation enable)
Bit  6     I/O interrupt mask
Bit  7     External interrupt mask
Bits 8-11  PSW key
Bit 13     Machine-check mask
Bit 14     Wait state
Bit 15     Problem state (user mode)
Bits 16-17 Address space control (ASC)
Bits 18-19 Condition code
Bits 20-23 Program mask
Bit 31     EA (required for 64-bit addressing)
Bit 32     BA (required for 64-bit addressing)

Bits are numbered from the most significant bit (bit 0). Bits not listed above must be zero. Setting an unassigned bit (for example bit 12) causes a specification exception when the PSW is loaded via LPSWE.

3. Defined Constants

3.1 Bit Masks

Constant	Value	Description
`PSW_BIT_DAT`	`0x0400000000000000`	Address translation enable
`PSW_BIT_IO`	`0x0200000000000000`	I/O interrupt mask
`PSW_BIT_EXT`	`0x0100000000000000`	External interrupt mask
`PSW_BIT_MCCK`	`0x0004000000000000`	Machine-check mask
`PSW_BIT_WAIT`	`0x0002000000000000`	Wait state
`PSW_BIT_PSTATE`	`0x0001000000000000`	Problem state (user mode)
`PSW_BIT_HOME_SPACE`	`0x0000C00000000000`	Home space addressing mode
`PSW_BIT_EA`	`0x0000000100000000`	Extended addressing (64-bit)
`PSW_BIT_BA`	`0x0000000080000000`	Basic addressing (64-bit)

3.2 Composite Masks

Constant	Value	Description
`PSW_ARCH_BITS`	`0x0000000180000000`	EA | BA: 64-bit mode, no other bits set
`PSW_MASK_KERNEL`	`0x0000000180000000`	Supervisor, DAT off, all interrupts disabled
`PSW_MASK_KERNEL_DAT`	`0x0400C00180000000`	Supervisor, DAT on (Home Space), all interrupts disabled
`PSW_MASK_DISABLED_WAIT`	`0x0002000180000000`	Wait state, DAT off, all interrupts disabled
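The composite values can be derived from IBM bit numbers rather than typed as hex literals, which makes the relationship between the layout and the constants checkable. The PSW_IBM_BIT helper is a hypothetical name introduced for this sketch; the constant names and expected values are the ones tabulated above.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an IBM bit number (0 = most significant) into a 64-bit mask. */
#define PSW_IBM_BIT(n)         (UINT64_C(1) << (63 - (n)))

#define PSW_BIT_DAT            PSW_IBM_BIT(5)
#define PSW_BIT_IO             PSW_IBM_BIT(6)
#define PSW_BIT_EXT            PSW_IBM_BIT(7)
#define PSW_BIT_WAIT           PSW_IBM_BIT(14)
#define PSW_BIT_PSTATE         PSW_IBM_BIT(15)
#define PSW_BIT_HOME_SPACE     (PSW_IBM_BIT(16) | PSW_IBM_BIT(17))
#define PSW_BIT_EA             PSW_IBM_BIT(31)
#define PSW_BIT_BA             PSW_IBM_BIT(32)

#define PSW_ARCH_BITS          (PSW_BIT_EA | PSW_BIT_BA)
#define PSW_MASK_KERNEL        PSW_ARCH_BITS
#define PSW_MASK_KERNEL_DAT    (PSW_ARCH_BITS | PSW_BIT_DAT | PSW_BIT_HOME_SPACE)
#define PSW_MASK_DISABLED_WAIT (PSW_ARCH_BITS | PSW_BIT_WAIT)

/* Cross-check the derived masks against the documented hex values. */
static_assert(PSW_BIT_DAT == UINT64_C(0x0400000000000000), "DAT");
static_assert(PSW_ARCH_BITS == UINT64_C(0x0000000180000000), "EA | BA");
static_assert(PSW_MASK_KERNEL_DAT == UINT64_C(0x0400C00180000000), "kernel DAT");
static_assert(PSW_MASK_DISABLED_WAIT == UINT64_C(0x0002000180000000), "disabled wait");
```

A mask that enables I/O and external interrupts in the running kernel would then be PSW_MASK_KERNEL_DAT | PSW_BIT_IO | PSW_BIT_EXT.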

3.3 New PSW Lowcore Offsets

These are the physical offsets within the lowcore (prefix area) where the hardware loads the PSW on each interrupt class (PoP SA22-7832 §4.3.3).

Constant	Offset	Interrupt class
`PSW_LC_RESTART`	`0x01A0`	Restart
`PSW_LC_EXTERNAL`	`0x01B0`	External
`PSW_LC_SVC`	`0x01C0`	Supervisor call
`PSW_LC_PROGRAM`	`0x01D0`	Program check
`PSW_LC_MCCK`	`0x01E0`	Machine check
`PSW_LC_IO`	`0x01F0`	I/O

Note: These offsets are distinct from the old PSW save slots (0x0120–0x0170) and from the interrupt parameter area (0x0080–0x00C0).

4. Boot Initialization

The ZXFL loader prepares the memory tables, registers the Home Space ASCE in CR13 and the Primary Space ASCE in CR1, and directly transitions to DAT-on mode using a PSW_MASK_KERNEL_DAT PSW target before passing control to the kernel.

The kernel therefore boots with DAT active and executes entirely in the home address space. The legacy psw_install_new_psws() and zx_lowcore_setup_pre_dat() functions have been removed because the loader bypasses the pre-DAT boot window.

During early kernel initialization, zx_lowcore_setup_late() is called to install the live interrupt handler entry points directly into the HHDM-mapped lowcore.

Synchronization Primitives

Document Revision: 26h1.0
Source: zxfoundation/sync/, include/zxfoundation/spinlock.h, include/zxfoundation/atomic.h

1. Atomic Operations

include/zxfoundation/atomic.h provides atomic_t (32-bit) and atomic64_t (64-bit) types with the standard load/store/add/sub/cmpxchg operations, implemented using z/Architecture's CS (Compare and Swap) and CSG (Compare and Swap, 64-bit) instructions.

2. Spinlock

include/zxfoundation/spinlock.h provides a ticket spinlock. Ticket spinlocks guarantee FIFO ordering, preventing starvation on highly contended locks.

Function	Description
`spin_lock(lock)`	Acquire; busy-wait with `DIAG 44` (yield hint)
`spin_unlock(lock)`	Release
`spin_lock_irqsave(lock, flags)`	Acquire + disable interrupts, save PSW mask
`spin_unlock_irqrestore(lock, flags)`	Release + restore PSW mask

irqsave/irqrestore variants are required whenever a lock may be acquired from both process context and interrupt context.
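The FIFO property of a ticket lock comes from handing out tickets in order and serving them in order. The sketch below uses C11 atomics as a portable stand-in; it is not the contents of spinlock.h, the type and function names are assumptions, and the DIAG 44 yield hint and the interrupt-masking variants are omitted.

```c
#include <stdatomic.h>

typedef struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently allowed to hold the lock */
} ticket_lock_t;

static inline void ticket_lock_init(ticket_lock_t *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

static inline void ticket_lock_acquire(ticket_lock_t *l)
{
    /* Take a ticket; arrival order defines service order. */
    unsigned me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me) {
        /* busy-wait; the kernel issues its yield hint here */
    }
}

static inline void ticket_lock_release(ticket_lock_t *l)
{
    /* Only the holder writes owner, so a plain load followed by a store is enough. */
    unsigned cur = atomic_load_explicit(&l->owner, memory_order_relaxed);
    atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}
```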

3. Mutex

zxfoundation/sync/mutex.c — a sleeping mutex backed by a wait queue. Suitable for contexts where sleeping is permitted (not interrupt handlers).

Function	Description
`mutex_lock(m)`	Acquire; sleep if contended
`mutex_trylock(m)`	Non-blocking acquire; returns 0 on failure
`mutex_unlock(m)`	Release; wake one waiter

4. Reader-Writer Lock

zxfoundation/sync/rwlock.c — allows multiple concurrent readers or one exclusive writer.

Function	Description
`rwlock_read_lock(rw)`	Acquire shared read access
`rwlock_read_unlock(rw)`	Release read access
`rwlock_write_lock(rw)`	Acquire exclusive write access
`rwlock_write_unlock(rw)`	Release write access

5. Semaphore

zxfoundation/sync/semaphore.c — counting semaphore.

Function	Description
`sem_init(s, count)`	Initialize with initial count
`sem_wait(s)`	Decrement; sleep if count is 0
`sem_post(s)`	Increment; wake one waiter

6. Wait Queue

zxfoundation/sync/waitqueue.c — a list of sleeping tasks waiting for a condition.

Function	Description
`waitqueue_init(wq)`	Initialize
`waitqueue_wait(wq, condition)`	Sleep until `condition` is true
`waitqueue_wake_one(wq)`	Wake the first waiter
`waitqueue_wake_all(wq)`	Wake all waiters

7. RCU

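zxfoundation/sync/rcu.c provides Read-Copy-Update. At revision 26h1.0 it was a stub in which rcu_read_lock/rcu_read_unlock were no-ops and synchronize_rcu returned immediately. The RCU and SRCU section (revision 26h1.1) describes the current grace-period implementation.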

RCU and SRCU

Document Revision: 26h1.1
Source: zxfoundation/sync/rcu.c, zxfoundation/sync/srcu.c

1. RCU

Read-Copy-Update for a non-preemptive kernel. A quiescent state (QS) occurs whenever a CPU is not inside an rcu_read_lock() section.

Read Side

Function	Description
`rcu_read_lock()`	Enter read-side critical section (compiler barrier only)
`rcu_read_unlock()`	Exit read-side critical section
`rcu_dereference(p)`	Safely read an RCU-protected pointer
`rcu_assign_pointer(p, v)`	Safely publish a new pointer

Write Side

Function	Description
`call_rcu(head, fn)`	Register a callback for after the next grace period
`synchronize_rcu()`	Block until all pre-existing readers have completed, then drain callbacks
`rcu_report_qs()`	Report a quiescent state for the current CPU

Grace Period Mechanism

synchronize_rcu():
  1. Increment gp_seq
  2. Broadcast new gp_seq to all per-CPU rcu_gp_seq fields
  3. Spin until every CPU's rcu_qs_seq == gp_seq
  4. Drain callback list

rcu_report_qs() must be called from the idle loop and any long-running non-read-side context.

2. SRCU

Sleepable RCU — allows read-side critical sections to sleep. Each SRCU domain (srcu_struct_t) is independent.

Read Side

Function	Description
`srcu_read_lock(s)`	Enter SRCU read section; returns slot index
`srcu_read_unlock(s, idx)`	Exit SRCU read section

Write Side

Function	Description
`synchronize_srcu(s)`	Wait for all pre-existing readers; may spin
`call_srcu(s, head, fn)`	Synchronize then invoke callback

Two-Slot Mechanism

Active slot: s->idx (0 or 1)

srcu_read_lock:   increment pcpu[cpu].c[s->idx]
srcu_read_unlock: decrement pcpu[cpu].c[idx]

synchronize_srcu:
  1. Flip s->idx (new readers use new slot)
  2. Wait until sum of pcpu[*].c[old_idx] == 0
  3. Increment gp_seq

Initialization

DEFINE_SRCU(my_domain);          // static
srcu_init(&my_domain);           // runtime

Kernel Object Management System

Document: ZXF-KRN-KOMS-001
Revision: 1.0
Status: Released

1. Purpose

The Kernel Object Management System (KOMS) is the unified abstraction layer for all reference-counted kernel objects. It defines a single base type, kobject_t, that any subsystem may embed to obtain lifecycle management, naming, attribute storage, event delivery, and hierarchical organization at no additional per-subsystem cost.

2. Architectural Position

KOMS sits immediately above the memory allocator and synchronization primitives, and below all subsystems that manage named, reference-counted resources.

┌─────────────────────────────────────────────────────┐
│  Subsystems  (IRQ, VMM, Device, Task, File, …)      │
├─────────────────────────────────────────────────────┤
│  KOMS  (koms.h / koms.c)                            │
├──────────────┬──────────────┬───────────────────────┤
│  kmalloc /   │  spinlock /  │  RCU                  │
│  slab        │  rwlock      │                       │
└──────────────┴──────────────┴───────────────────────┘

KOMS is initialized once, after kmalloc_init(), before any subsystem that registers a type or allocates a managed object.

3. Core Concepts

3.1 kobject_t

Every managed object embeds kobject_t as its first member. The base object carries:

An atomic reference counter (kref_t).
A mandatory operations table (kobject_ops_t) with a release callback.
A lifecycle state (KOBJECT_UNINITIALIZED, KOBJECT_ALIVE, KOBJECT_DEAD).
A static name string.
A 32-bit type identifier.
A 32-bit flags word.
Intrusive list nodes for parent/child hierarchy, namespace membership, attributes, and event listeners.
An embedded spinlock_t protecting the mutable extension fields.
An rcu_head_t for deferred free.

The kobject_container() macro recovers the containing struct from a kobject_t * pointer using compile-time offset arithmetic.
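Container recovery follows the familiar offsetof pattern. The macro's parameter list, the reduced kobject_t stand-in, and the demo_device_t type are assumptions for illustration; koms.h may define the macro differently.

```c
#include <stddef.h>

/* Heavily reduced stand-in for the real base object. */
typedef struct kobject {
    int refcount;
} kobject_t;

/* Recover the containing struct from a pointer to its embedded member. */
#define kobject_container(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical subsystem object embedding kobject_t as its first member. */
typedef struct demo_device {
    kobject_t kobj;
    int       unit;
} demo_device_t;

static inline demo_device_t *to_demo_device(kobject_t *k)
{
    return kobject_container(k, demo_device_t, kobj);
}
```

A release callback receives only the kobject_t pointer and would use a helper such as to_demo_device to reach the subsystem fields.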

3.2 Type Registry

A kobj_type_t descriptor is registered once at boot per object class. It carries:

Field	Purpose
`type_id`	Globally unique 32-bit identifier
`name`	Human-readable string for diagnostics
`obj_size`	`sizeof` of the containing struct
`cache`	Optional dedicated slab cache
`kobj_ops`	Mandatory ops table (must provide `release`)
`type_ops`	Optional extended vtable (`init`, `destroy`, `ns_add`, `ns_remove`)

After koms_init() the registry is append-only and read locklessly.

3.3 Namespace

A kobj_ns_t is an RCU-protected hash table of kobject_t pointers, keyed by name. Namespaces form a tree rooted at koms_root_ns.

koms_root_ns
├── "irq"
│   ├── "ext-0x40"
│   └── "pgm-0x0d"
├── "vmm"
│   └── "kernel"
└── "device"
    └── "dasd-0"

Reads use rcu_read_lock() and are fully lockless. Writes acquire the namespace's write_lock (spinlock, irqsave).

3.4 Attributes

Attributes are kobj_attr_t nodes linked into kobject_t::attrs. Each attribute has a name and optional get/set callbacks. The attribute list is protected by kobject_t::lock.

3.5 Event Bus

Events are typed (kobj_event_type_t) and carry a payload union. Listeners (kobj_listener_t) are registered per-object with an optional event-type bitmask filter. Dispatch snapshots the listener list under the object lock, then calls each listener without the lock, preventing deadlocks on re-entrant dispatch. Events propagate up the parent chain automatically.

4. Lifecycle

         koms_alloc()
              │
              ▼
        [refcount = 0]
              │
        koms_init_obj()
              │
              ▼
        KOBJECT_ALIVE  ◄──── koms_get()
        [refcount = 1]
              │
        koms_put() × N
              │
        [refcount = 0]
              │
              ▼
         KOBJECT_DEAD
              │
         ops->release()
              │
              ▼
          koms_free()

koms_freeze() sets KOBJ_FLAG_FROZEN, causing koms_get_unless_dead() to fail without affecting existing references. This enables controlled teardown: freeze the object, wait for all external references to drain, then drop the final reference.
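The reference-count transitions can be sketched with C11 atomics. All names carrying the kobj_sketch prefix are hypothetical, the frozen flag is modelled as a plain bool, and the state values are the only identifiers taken from this guide.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef enum { KOBJECT_UNINITIALIZED, KOBJECT_ALIVE, KOBJECT_DEAD } kobject_state_t;

typedef struct kobj_sketch kobj_sketch_t;
struct kobj_sketch {
    atomic_int      refcount;
    kobject_state_t state;
    bool            frozen;                     /* models KOBJ_FLAG_FROZEN */
    void          (*release)(kobj_sketch_t *);  /* mandatory callback */
};

static inline void kobj_sketch_init(kobj_sketch_t *k, void (*release)(kobj_sketch_t *))
{
    atomic_init(&k->refcount, 1);  /* the creator holds the first reference */
    k->state   = KOBJECT_ALIVE;
    k->frozen  = false;
    k->release = release;
}

/* Take a reference unless the object is frozen or already at zero. */
static inline bool kobj_sketch_get_unless_dead(kobj_sketch_t *k)
{
    if (k->frozen)
        return false;
    int old = atomic_load(&k->refcount);
    while (old > 0) {
        /* On failure, old is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&k->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; the last put marks the object dead and releases it. */
static inline void kobj_sketch_put(kobj_sketch_t *k)
{
    if (atomic_fetch_sub(&k->refcount, 1) == 1) {
        k->state = KOBJECT_DEAD;
        k->release(k);
    }
}

static int demo_released;  /* counts release calls in this example */
static void demo_release(kobj_sketch_t *k) { (void)k; demo_released++; }
```

Freezing before the final put closes the window in which a lookup could hand out a new reference to an object that is being torn down.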

5. Allocation Strategy

koms_alloc(type, gfp)
    │
    ├─ type->cache != nullptr ──► kmem_cache_alloc(type->cache, gfp | ZX_GFP_ZERO)
    │
    └─ type->cache == nullptr ──► kzalloc(type->obj_size, gfp)

koms_free() dispatches symmetrically. The KOBJ_FLAG_KOMS_ALLOC flag distinguishes heap-allocated objects from statically embedded ones.

6. Thread Safety Summary

Operation	Mechanism
Reference count	Lock-free (CS instruction)
Attribute list	`kobject_t::lock` (spinlock, irqsave)
Listener list	`kobject_t::lock` (spinlock, irqsave)
Child list	`kobject_t::lock` (spinlock, irqsave)
Namespace reads	`rcu_read_lock()` (lockless)
Namespace writes	`kobj_ns_t::write_lock` (spinlock, irqsave)
Type registry reads	Lockless (append-only after boot)
Type registry writes	`type_registry_lock` (spinlock, irqsave)

7. Integration Guide

To integrate a subsystem with KOMS:

Embed kobject_t as the first member of the subsystem struct.
Define a kobject_ops_t with a release callback that calls koms_free().
Optionally define a kobj_type_ops_t for init/destroy hooks.
Define and register a kobj_type_t from the subsystem's init function.
Allocate objects with koms_alloc() and initialize with koms_init_obj().
Use koms_get() / koms_put() for reference management.
Optionally register in a namespace with koms_ns_add().

8. Initialization Order

KOMS must be initialized after kmalloc_init() and before any subsystem that calls koms_type_register() or koms_alloc().

pmm_init → cma_init → mmu_init → vmm_init → slab_init → kmalloc_init
    → koms_init → smp_init → [subsystem inits]

Red-Black Tree

Document Revision: 26h1.1
Source: lib/rbtree.c, include/lib/rbtree.h

1. Overview

ZXFoundation™ provides a layered intrusive red-black tree library. Each layer is a strict superset of the one below it; callers of lower layers require no modification when higher layers are added.

Layer	Type	Concurrency
0 — Core	`rb_root_t`	None (caller-managed)
1 — Augmented	`rb_root_aug_t`	None (caller-managed)
2 — RCU-protected	`rcu_rb_root_t`	Lockless readers, serialised writers
2A — RCU-augmented	`rcu_rb_root_aug_t`	Lockless readers, serialised writers + propagation
3 — Per-CPU cached	`rb_pcpu_cache_t`	O(1) fast path per CPU

The tree is intrusive: the caller embeds rb_node_t (or rb_node_aug_t) inside its own struct and recovers the container with rb_entry(). The colour bit is packed into the least-significant bit of the parent pointer (bit 0 in this chapter's numbering), keeping rb_node_t at exactly 24 bytes.

2. Node Layout

rb_node_t (24 bytes)
┌──────────────────────────┐
│ left             (8 B)   │  pointer to left child
│ right            (8 B)   │  pointer to right child
│ parent_and_color (8 B)   │  parent ptr | colour bit (bit 0)
└──────────────────────────┘

rb_node_aug_t (32 bytes)
┌──────────────────────────┐
│ node  (rb_node_t, 24 B)  │  must be at offset 0 — cast-compatible
│ subtree_max_gap  (8 B)   │  maintained by propagate callback
└──────────────────────────┘

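All rb_node_t pointers are 8-byte aligned on s390x, so the least-significant bit of any valid pointer is always zero and is free for colour storage. Note that this "bit 0" is the low-order bit, unlike the IBM numbering used for the PSW.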
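Packing and unpacking the colour bit takes one mask operation each. The accessor names and the choice of 0 for red and 1 for black are assumptions; rbtree.h may use different helpers and a different encoding.

```c
#include <assert.h>
#include <stdint.h>

typedef struct rb_node {
    struct rb_node *left;
    struct rb_node *right;
    uintptr_t       parent_and_color;  /* parent pointer | colour in the low bit */
} rb_node_t;

/* Three pointer-sized words: 24 bytes on a 64-bit target such as s390x. */
static_assert(sizeof(rb_node_t) == 3 * sizeof(void *), "rb_node_t is three words");

#define RB_RED   0u  /* assumed encoding */
#define RB_BLACK 1u

static inline rb_node_t *rb_parent(const rb_node_t *n)
{
    return (rb_node_t *)(n->parent_and_color & ~(uintptr_t)1);
}

static inline unsigned rb_color(const rb_node_t *n)
{
    return (unsigned)(n->parent_and_color & 1u);
}

static inline void rb_set_parent_color(rb_node_t *n, rb_node_t *parent, unsigned color)
{
    /* parent is at least 2-byte aligned, so its low bit is free. */
    n->parent_and_color = (uintptr_t)parent | (color & 1u);
}
```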

3. Layer 0 — Core

The core layer provides O(log n) insert, erase, and traversal with no locking. All operations are iterative (bounded stack depth).

Insert Protocol

walk tree → find (parent, link)
rb_link_node(node, parent, link)
rb_insert_fixup(tree, node)

Erase

rb_erase(tree, node)

Traversal

rb_first(tree)   →  minimum node
rb_last(tree)    →  maximum node
rb_next(node)    →  in-order successor
rb_prev(node)    →  in-order predecessor

rb_for_each(pos, tree)
rb_for_each_entry(pos, tree, member)

Container Recovery

rb_entry(ptr, type, member)
rb_entry_safe(ptr, type, member)   ← null-safe variant

4. Layer 1 — Augmented

The augmented layer adds a rb_aug_callbacks_t to rb_root_aug_t. After every structural change (insert, erase, rotation), propagate is invoked bottom-up from the affected node to the root.

Callers embed rb_node_aug_t instead of rb_node_t and maintain a per-node subtree aggregate in subtree_max_gap.

Callbacks

propagate(node)          recompute node->subtree_max_gap from children
copy(dst, src)           copy aggregate when successor replaces deleted node

copy is required when the two-child erase case physically moves the successor into the deleted node's position. Without it the successor would carry a stale aggregate into its new location.

Propagation Order

structural change at node L
        │
        ▼
propagate(L)          ← children already up-to-date
        │
        ▼
propagate(parent(L))
        │
        ▼
        …  (up to root)

API

rb_root_aug_t root = RB_ROOT_AUG_INIT(&my_callbacks);

rb_insert_aug(&root, node, parent, link);
rb_erase_aug(&root, node);

5. Layer 2 — RCU-Protected

rcu_rb_root_t wraps rb_root_t with a write-side spinlock. Readers use the RCU lockless path; writers serialise through the lock and publish pointer updates via rcu_assign_pointer().

Concurrency Model

Reader                          Writer
──────────────────────          ──────────────────────────────
rcu_read_lock()                 spin_lock_irqsave(&root->lock)
  node = rcu_rb_find(...)         rb_erase(...)
  // use node safely              rcu_assign_pointer(root, ...)
rcu_read_unlock()               spin_unlock_irqrestore(...)
                                call_rcu(head, free_fn)

rcu_assign_pointer() issues smp_mb() before the store. rcu_dereference() issues a compiler barrier after each pointer load, preventing the compiler from collapsing multiple loads of the same pointer.

Erase and Grace Period

rcu_rb_erase(root, node, head, free_fn)
    ├─ unlink node under lock
    ├─ rcu_assign_pointer(...)   ← publish updated tree
    └─ call_rcu(head, free_fn)   ← free after grace period

6. Layer 2A — RCU-Augmented

rcu_rb_root_aug_t composes Layer 1 and Layer 2 under a single write lock. The lock covers both rebalancing and aggregate propagation atomically.

Key invariant: readers always observe a tree where subtree_max_gap is consistent with the pointer structure they see, because both are updated under the same lock before rcu_assign_pointer() publishes the result.

Gap Search

rcu_rb_aug_find_gap() performs an O(log n) free-gap search by pruning subtrees whose subtree_max_gap is smaller than the requested size:

find_gap(root, size, align, lo, hi):

  cursor = lo
  n = root

  while n:
    if n.left and n.left.subtree_max_gap >= size:
      n = n.left              ← descend left; prune right subtree entirely
      continue

    aligned = align_up(cursor, align)
    if aligned + size <= n.start:
      return aligned          ← gap found left of n

    cursor = max(cursor, n.end)
    n = n.right               ← no gap left of n; try right

  aligned = align_up(cursor, align)
  if aligned + size <= hi:
    return aligned            ← gap after last node

  return 0                    ← no gap found

This replaces the former O(n) linear scan. The caller supplies node_start and node_end accessors, making the search generic over any interval type.
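
For testing, it can help to have a simple linear reference to compare the tree search against. The sketch below is such a reference over a sorted array of half-open intervals; it is not the kernel's rcu_rb_aug_find_gap(). The names ival_t, demo_align_up and demo_find_gap_linear are hypothetical, align is assumed to be a power of two, and arithmetic overflow near the top of the address space is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t start, end; } ival_t;  /* half-open [start, end) */

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t demo_align_up(uint64_t x, uint64_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/* Linear first-fit search over intervals sorted by start address.
 * Returns the lowest aligned address in [lo, hi) with `size` free bytes,
 * or 0 if there is none (same "0 means no gap" convention as above). */
static uint64_t demo_find_gap_linear(const ival_t *iv, size_t n,
                                     uint64_t size, uint64_t align,
                                     uint64_t lo, uint64_t hi)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uint64_t cursor = lo;

    for (size_t i = 0; i < n; i++) {
        uint64_t aligned = demo_align_up(cursor, align);
        if (aligned + size <= iv[i].start)
            return aligned;            /* gap found before interval i */
        if (iv[i].end > cursor)
            cursor = iv[i].end;        /* skip past interval i */
    }

    uint64_t aligned = demo_align_up(cursor, align);
    if (aligned + size <= hi)
        return aligned;                /* gap after the last interval */
    return 0;
}
```

A randomized test could insert the same intervals into both structures and check that the two searches agree.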

API

rcu_rb_root_aug_t root = RCU_RB_ROOT_AUG_INIT(&my_callbacks);

rcu_rb_aug_insert(&root, node, parent, link);
rcu_rb_aug_erase(&root, node, head, free_fn);

// Under lock or rcu_read_lock():
uint64_t addr = rcu_rb_aug_find_gap(&root, size, align, lo, hi,
                                    node_start_fn, node_end_fn);

7. Layer 3 — Per-CPU Cached

rb_pcpu_cache_t is a per-CPU array of (hint, hint_key) pairs. On a cache hit the search returns in O(1) without touching the tree.

rb_find_cached(root, cache, cmp, arg):

  cpu  = current_cpu()
  hint = cache[cpu].hint

  if hint != NULL && cmp(hint, arg) == 0:
    return hint               ← O(1) fast path

  // full O(log n) walk
  result = tree_walk(root, cmp, arg)
  cache[cpu].hint = result
  return result

The hint is opportunistic — it may be stale. The comparator validates it before the result is returned.

Invalidation

rb_cache_invalidate(cache, node)        O(MAX_CPUS) — call before erase
rb_cache_invalidate_local(cache)        O(1)        — current CPU only

rb_cache_invalidate() must be called before rb_erase() or rcu_rb_aug_erase() on any node in a cached tree to prevent dangling hint pointers.
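
The lookup and invalidation rules above can be sketched on the host as follows. This is a minimal illustration, not the kernel's rb_pcpu_cache_t: a binary search over a sorted array stands in for the tree walk, the CPU index is passed explicitly, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CPUS 4            /* hypothetical; stands in for MAX_CPUS */

typedef struct { int key; } demo_item_t;
typedef struct { const demo_item_t *hint; } demo_hint_t;

static unsigned demo_slow_walks;   /* counts slow-path lookups */

/* Slow path: binary search over a sorted array (stand-in for the tree). */
static const demo_item_t *demo_walk(const demo_item_t *items, size_t n, int key)
{
    size_t lo = 0, hi = n;
    demo_slow_walks++;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (items[mid].key == key)
            return &items[mid];
        if (items[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Validate the per-CPU hint first; fall back to the full walk on a miss. */
static const demo_item_t *demo_find_cached(demo_hint_t cache[DEMO_MAX_CPUS],
                                           unsigned cpu,
                                           const demo_item_t *items,
                                           size_t n, int key)
{
    assert(cpu < DEMO_MAX_CPUS);
    const demo_item_t *hint = cache[cpu].hint;
    if (hint != NULL && hint->key == key)
        return hint;                       /* O(1) fast path */
    const demo_item_t *res = demo_walk(items, n, key);
    cache[cpu].hint = res;
    return res;
}

/* Clear every CPU's hint that points at node; call before erasing it. */
static void demo_cache_invalidate(demo_hint_t cache[DEMO_MAX_CPUS],
                                  const demo_item_t *node)
{
    for (unsigned cpu = 0; cpu < DEMO_MAX_CPUS; cpu++)
        if (cache[cpu].hint == node)
            cache[cpu].hint = NULL;
}
```

The invalidation loop is what makes the O(MAX_CPUS) cost visible: every CPU's slot has to be checked because any of them may hold the node about to be erased.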

8. RB-Tree Invariants

The implementation maintains the four standard invariants after every operation:

Every node is RED or BLACK.
The root is BLACK.
Every RED node has two BLACK children.
Every path from a node to a null leaf contains the same number of BLACK nodes.

Insert fixup resolves double-red violations with at most 2 rotations and O(log n) recolourings. Erase fixup resolves double-black violations with at most 3 rotations and O(log n) recolourings. Recolourings do not change pointer structure and are invisible to RCU readers.
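
A debug-time checker for these invariants could look like the sketch below. It is written against a hypothetical stand-alone node type (demo_node_t, demo_color_t), not the kernel's rb_node_t, and is meant only to show how the four rules translate into code.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEMO_RED, DEMO_BLACK } demo_color_t;   /* hypothetical */

typedef struct demo_node {
    struct demo_node *left, *right;
    demo_color_t color;
} demo_node_t;

/* Returns the black height of the subtree (a null leaf counts as 1),
 * or -1 if a red node has a red child or the black heights differ. */
static int demo_black_height(const demo_node_t *n)
{
    if (n == NULL)
        return 1;

    /* Invariant 1: every node is RED or BLACK. */
    assert(n->color == DEMO_RED || n->color == DEMO_BLACK);

    /* Invariant 3: a RED node has no RED child. */
    if (n->color == DEMO_RED) {
        if ((n->left != NULL && n->left->color == DEMO_RED) ||
            (n->right != NULL && n->right->color == DEMO_RED))
            return -1;
    }

    /* Invariant 4: equal black count on every path to a null leaf. */
    int lh = demo_black_height(n->left);
    int rh = demo_black_height(n->right);
    if (lh < 0 || rh < 0 || lh != rh)
        return -1;

    return lh + (n->color == DEMO_BLACK ? 1 : 0);
}

/* Returns 1 if the tree satisfies all invariants, 0 otherwise. */
static int demo_rb_check(const demo_node_t *root)
{
    if (root != NULL && root->color != DEMO_BLACK)
        return 0;                         /* Invariant 2: root is BLACK */
    return demo_black_height(root) > 0;
}
```

Running such a check after every insert and erase in a host-side unit test is a cheap way to catch fixup regressions.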

9. Constraints

rb_node_aug_t::node must be at offset 0. The _Static_assert in the header enforces this.
rb_aug_callbacks_t::copy may be nullptr only if the caller guarantees no two-child erase will occur. For general use it must be provided.
rb_cache_invalidate() must be called before erasing a node from any cached tree.
rcu_rb_aug_find_gap() may be called under rcu_read_lock() for a best-effort result, or under the write lock for a guaranteed-current result.
A grace period cannot end while any CPU fails to report a quiescent state: synchronize_rcu() blocks indefinitely and callbacks queued with call_rcu() never run. Callers of rcu_rb_aug_erase() must therefore ensure that rcu_report_qs() is called from the idle loop and the scheduler tick.

Time Subsystem

Document: ZXF-KRN-TIME-001 Revision: 26h1.0 Status: Draft

1. Overview

The time subsystem provides three services to the rest of the kernel:

Monotonic kernel time (ktime_t): nanoseconds since boot, readable from any context.
Scheduler preemption: the CPU timer fires EXT 0x1005 every 10 ms to enforce quanta.
Deferred execution: the clock comparator fires EXT 0x1004 to advance the per-CPU timer wheel.

All hardware access (STCKF, SPT, STPT, SCKC, STCKC, CR0 manipulation) is confined to arch/s390x/time/tod.c. The portable kernel layer in zxfoundation/time/ calls only the functions declared in include/arch/s390x/time/tod.h.

2. Hardware Sources

z/Architecture provides three time mechanisms: the global TOD clock and two per-CPU facilities.

Source	Instruction	Type	Resolution	Kernel use
TOD clock	STCKF	Global, monotonic	~0.244 ns	`ktime_get()`, sleep deadline
CPU timer	SPT / STPT	Per-CPU countdown	Same as TOD	Scheduler quantum (10 ms)
Clock comparator	SCKC / STCKC	Per-CPU absolute	Same as TOD	Timer wheel advance

The TOD clock is shared across all CPUs and is monotonic. STCKF reads it without serialization and is safe from hard-IRQ context.

3. TOD Unit Conversion

1 TOD unit = 1000/4096 ns = 125/512 ns

ktime_ns = tod_delta × 125 / 512
tod_units = ns × 512 / 125

Constants used throughout the subsystem:

TOD_1MS  = 4 096 000 units
TOD_10MS = 40 960 000 units
TOD_1S   = 4 096 000 000 units
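
The conversions can be written so that the intermediate product never overflows 64 bits. The sketch below is one way to do that; the function names tod_to_ns and ns_to_tod and the macro TOD_UNITS_PER_US are assumptions for illustration and may not match the names used in tod.h.

```c
#include <assert.h>
#include <stdint.h>

/* One TOD unit is 125/512 ns, i.e. 4096 units per microsecond. */
#define TOD_UNITS_PER_US 4096ULL
#define TOD_1MS  (TOD_UNITS_PER_US * 1000ULL)
#define TOD_10MS (TOD_1MS * 10ULL)
#define TOD_1S   (TOD_1MS * 1000ULL)

static_assert(TOD_1MS == 4096000ULL, "TOD_1MS mismatch");
static_assert(TOD_10MS == 40960000ULL, "TOD_10MS mismatch");
static_assert(TOD_1S == 4096000000ULL, "TOD_1S mismatch");

/* ns = tod * 125 / 512, rounded down.
 * Splitting into quotient and remainder keeps every intermediate
 * value within 64 bits, even for tod close to UINT64_MAX. */
static inline uint64_t tod_to_ns(uint64_t tod)
{
    uint64_t q = tod / 512;
    uint64_t r = tod % 512;
    return q * 125 + (r * 125) / 512;
}

/* tod = ns * 512 / 125, rounded down.
 * The result itself exceeds 64 bits for ns above roughly 4.5e18;
 * that case is not handled here. */
static inline uint64_t ns_to_tod(uint64_t ns)
{
    uint64_t q = ns / 125;
    uint64_t r = ns % 125;
    return q * 512 + (r * 512) / 125;
}
```

With these helpers a 10 ms quantum is ns_to_tod(10000000), which equals TOD_10MS.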

4. Initialization Sequence

BSP:
  time_init()
    tod_set_boot_offset(STCKF)   ← recorded once; never modified
    timer_wheel_init()           ← per-CPU wheel, level/slot arrays zeroed
    tod_enable_ext_interrupts()  ← CR0 bits 52+53 set
    tod_cpu_timer_set(-10ms)     ← first quantum armed
    tod_clock_comparator_set(now + 1s)  ← safe initial value

Each AP (from ap_startup):
  time_init_ap()
    timer_wheel_init()
    tod_enable_ext_interrupts()
    tod_cpu_timer_set(-10ms)
    tod_clock_comparator_set(now + 1s)

tod_boot_offset is set on the BSP before any AP is started. APs call ktime_get() using the same offset — this is correct because the TOD clock is global.

5. Interrupt Dispatch

The EXT interrupt handler (do_ext_interrupt) intercepts the two time-critical subclasses before the generic irq_dispatch() path:

do_ext_interrupt:
  ext_code = lowcore.ext_int_code
  if ext_code == 0x1004 → time_clock_comparator_handler()   // clock comparator
  else if ext_code == 0x1005 → time_cpu_timer_handler()     // CPU timer
  else → irq_dispatch(ZX_IRQ_BASE_EXT + ext_code, frame)

This avoids routing through the irqdesc table, whose 0x0400-entry limit cannot accommodate the full 16-bit EXT subclass space.

6. Timer Wheel

6.1 Structure

8 levels × 64 slots per CPU. Level 0 has 1 ms slot width; each subsequent level is 64× wider.

Level 0: slot = 1 ms,     range = 64 ms
Level 1: slot = 64 ms,    range = ~4.1 s
Level 2: slot = ~4.1 s,   range = ~4.4 min
Level 3: slot = ~4.4 min, range = ~4.7 h
...
Level 6: slot = ~2.2 y,   range = ~139 y
Level 7: slot = ~139 y,   range = ~8,900 y

6.2 Placement

A timer with expiry delta d from now is placed in the lowest level l such that d < range(l), at slot (current_slot[l] + d/slot_width[l] + 1) % 64.
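
The placement rule can be sketched as follows, working in milliseconds for readability. The names wheel_pos_t, wheel_slot_width_ms, wheel_range_ms and wheel_place are hypothetical, and clamping deltas beyond the top level's range into level 7 is an assumption, since the text does not say how such timers are handled.

```c
#include <assert.h>
#include <stdint.h>

#define WHEEL_LEVELS 8
#define WHEEL_SLOTS  64

typedef struct { unsigned level; unsigned slot; } wheel_pos_t;

/* Slot width of a level in ms: 1 ms at level 0, 64x wider per level. */
static uint64_t wheel_slot_width_ms(unsigned level)
{
    assert(level < WHEEL_LEVELS);
    return 1ULL << (6 * level);
}

/* Range covered by a level: 64 slots of that level's width. */
static uint64_t wheel_range_ms(unsigned level)
{
    return wheel_slot_width_ms(level) * WHEEL_SLOTS;
}

/* Pick the lowest level whose range exceeds delta_ms, then the slot
 * (current_slot[l] + d / slot_width[l] + 1) % 64 within that level. */
static wheel_pos_t wheel_place(uint64_t delta_ms,
                               const unsigned current_slot[WHEEL_LEVELS])
{
    wheel_pos_t pos = { 0, 0 };

    /* Assumption: deltas beyond the top range stay in the top level. */
    while (pos.level < WHEEL_LEVELS - 1 &&
           delta_ms >= wheel_range_ms(pos.level))
        pos.level++;

    uint64_t offset = delta_ms / wheel_slot_width_ms(pos.level);
    pos.slot = (unsigned)((current_slot[pos.level] + offset + 1) % WHEEL_SLOTS);
    return pos;
}
```

For example, with every current slot at 0, a 5 ms timer lands in level 0, slot 6, and a 64 ms timer lands in level 1, slot 2.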

6.3 Advance

On EXT 0x1004 (clock comparator), timer_wheel_advance(now) steps level 0 slot by slot, firing all expired timers. When level 0 completes a full revolution, it cascades timers from level 1 into lower levels, and so on.

6.4 Constraints

All wheel operations require IRQs disabled on the calling CPU.
Callbacks execute in hard-IRQ context. They must not block or acquire locks held by process context.

7. `ktime_sleep()`

Current implementation is a busy-wait:

deadline = STCKF + ns_to_tod(ns)
SCKC(deadline)
while STCKF < deadline: cpu_relax()

This is correct for early boot and short delays. Once the scheduler is operational, this will be replaced with a block/wake implementation using the timer wheel.

8. Strict Requirements

#	Requirement
TIME-1	`ktime_get()` is callable from any context. No lock, no sleep.
TIME-2	Timer callbacks execute in hard-IRQ context. No blocking, no process-context locks.
TIME-3	CPU timer must be reloaded on every `time_cpu_timer_handler()` invocation.
TIME-4	Clock comparator must be reprogrammed after every `timer_wheel_advance()` call.
TIME-5	`tod_boot_offset` is set once in `time_init()` and never modified.
TIME-6	`time_init_ap()` must be called on every AP before the AP enters its idle loop.

Scheduler

Subsystem Stubs

Document Revision: 26h1.1

The following subsystems have source directories and header files but are not yet implemented.

IRQ (`arch/s390x/irq/`)

Handles I/O interrupts from the channel subsystem. The I/O new PSW at lowcore 0x1F0 must point to the I/O interrupt handler. The handler calls TSCH to read the IRB and dispatches to the appropriate device driver.

Status: Stub — new PSW installed as disabled-wait.

Time (`arch/s390x/time/`)

Provides kernel timekeeping using the TOD (Time-of-Day) clock. The TOD clock is a 64-bit counter in which bit 51 increments once per microsecond, so one unit of bit 63 corresponds to 1/4096 µs (about 0.244 ns). The boot timestamp is available in proto->tod_boot. The clock comparator interrupt (external interrupt subclass) drives the scheduler tick once the IRQ subsystem is active.

Status: Stub.

Build System Overview

Document Revision: 26h1.0

1. Prerequisites

Tool	Minimum version	Notes	Required
CMake	3.10	Build system generator	required
Compiler and tools	toolchain-specific	See toolchains.md	partly
Ninja	any	Recommended generator	optional
dasdload	any	Needed for DASD image generation	optional
Hercules	4.x	Helpful for development	optional

2. Output Artifacts

Artifact	Description	Converted from
`core.zxfoundationloader00.sys`	Stage 0 IPL record (tape format)	`zxfl_stage1.elf` → `zxfl_stage1.bin`
`core.zxfoundationloader01.sys`	Stage 1 flat binary	`zxfl_stage2.elf`
`core.zxfoundation.nucleus`	Kernel ELF64 (SHA-256 checksums patched in)	N/A
`sysres.3390`	Hercules 3390 DASD image	N/A
`bin2rec`	Host tool	N/A
`zxsign`	Host tool	N/A

3. CMake Modules

Module	Purpose
`cmake/dependencies.cmake`	Host dependency checks
`cmake/configuration.cmake`	`OPT_LEVEL`, `DSYM_LEVEL` cache variables
`cmake/platform.cmake`	Platform detection
`cmake/standard.cmake`	C standard enforcement
`cmake/hosttools.cmake`	Build `bin2rec` and `zxsign` with host compiler
`cmake/source.cmake`	Kernel source file lists (`ZX_SOURCES_64`)
`cmake/zxfl-compile.cmake`	ZXFL Stage 0 and Stage 1 targets
`cmake/zxfoundation-compile.cmake`	Kernel nucleus target
`cmake/run.cmake`	`dasd` target — generates `sysres.3390`

4. Build Order

CMake enforces the following dependency chain:

tools  (bin2rec, zxsign — host compiler)
  │
  ├─► zxfl_stage1.elf
  │     └─► zxfl_stage1.bin  (objcopy)
  │           └─► core.zxfoundationloader00.sys  (bin2rec)
  │
  ├─► zxfl_stage2.elf
  │     └─► core.zxfoundationloader01.sys  (objcopy)
  │
  └─► core.zxfoundation.nucleus
        └─► zxsign patches .zxvl_checksums in-place
              └─► sysres.3390  (dasdload)

Host tools are always compiled first with ZX_HOST_CC. The kernel and loader are compiled with the cross-compiler.

5. Configuration Variables (toolchain-independent; for toolchain-specific variables, see toolchains.md)

Variable	Default	Description
`OPT_LEVEL`	`2`	`-O` level for all targets
`DSYM_LEVEL`	`0`	`-g` level (0 = no debug info)

Override at configure time:

cmake -B build \
  -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-clang.cmake \
  -DOPT_LEVEL=3

Toolchains

Document Revision: 26h1.0

1. Clang (`cmake/toolchain/zxfoundation-clang.cmake`)

Uses LLVM's built-in cross-compilation support — no separate cross-compiler installation is required on most systems.

cmake -B build \
  -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-clang.cmake \
  -DMARCH_MODE=z14

Role	Tool
C compiler	`clang` (or `clang-$CLANG_VERSION`)
Linker	`ld.lld`
Archiver	`llvm-ar`
objcopy	`llvm-objcopy`
Host CC	`clang`

Set CLANG_VERSION in the environment to select a versioned binary (e.g. CLANG_VERSION=18 → clang-18). If unset, unversioned clang is used.

The target triple --target=s390x-unknown-none-elf is passed as a compile option (not via CMAKE_C_COMPILER_TARGET) to avoid CMake's compiler detection interfering with the freestanding build.

2. GCC (`cmake/toolchain/zxfoundation-gcc.cmake`)

Requires an s390x-ibm-linux-gnu-* cross-compiler toolchain installed on the host.

cmake -B build \
  -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-gcc.cmake

Role	Tool
C compiler	`s390x-ibm-linux-gnu-gcc`
Linker	`s390x-ibm-linux-gnu-ld`
Archiver	`s390x-ibm-linux-gnu-ar`
objcopy	`s390x-ibm-linux-gnu-objcopy`
Host CC	`gcc`

GCC-specific flags added to the kernel target:

Flag	Reason
`-static-libgcc`	Avoid libgcc DSO dependency
`-Wno-array-bounds`	Suppress false positives from GCC's array-bounds analysis on lowcore pointer casts
`-fno-delete-null-pointer-checks`	The kernel legitimately dereferences physical address `0x0` (the lowcore)
`-mzarch`	Force z/Architecture mode

3. Common Compiler Flags

Applied to all targets (loader and kernel):

Flag	Reason
`-ffreestanding`	No hosted C library assumptions
`-nostdlib`	No implicit library linking
`-fno-builtin`	Prevent the compiler from treating calls to standard library functions (e.g. `memcpy`) as builtins
`-fno-strict-aliasing`	Kernel code casts between unrelated pointer types
`-fwrapv`	Signed integer overflow wraps (defined behavior)
`-ftrivial-auto-var-init=pattern`	Auto-initialize locals to a poison pattern — catches use-before-init
`-fno-stack-protector`	No `__stack_chk_guard` — freestanding, no libc
`-msoft-float`	No FPU use in kernel
`-mno-vx`	No vector instructions in kernel

Kernel-only additional flag:

Flag	Reason
`-mpacked-stack`	Use packed register save areas (reduces stack frame size)

4. Custom Toolchain

To use a non-standard toolchain, copy one of the provided toolchain files and adjust the compiler/linker paths. The following CMake variables must be set:

Variable	Description
`CMAKE_C_COMPILER`	Path to the C compiler
`CMAKE_LINKER`	Path to the linker
`CMAKE_OBJCOPY`	Path to objcopy
`ZX_HOST_CC`	Host C compiler for building `bin2rec` and `zxsign`
`COMPILER_ID`	`"clang"` or `"gcc"` (selects compiler-specific flag sets)
`TARGET_EMULATION_MODE`	`elf64_s390`
`MARCH_MODE`	Target microarchitecture (e.g. `z10`, `z14`, `z16`)

Build Targets

Document Revision: 26h1.0

`tools`

Builds host-native bin2rec and zxsign using ZX_HOST_CC. This target is an implicit dependency of all other targets — it always runs first.

`zxfl_stage1.elf` → `core.zxfoundationloader00.sys`

Compiles Stage 0. Post-build steps:

objcopy -O binary zxfl_stage1.elf zxfl_stage1.bin — strip ELF headers to raw binary.
bin2rec zxfl_stage1.bin core.zxfoundationloader00.sys — wrap in DASD IPL record format.

The linker script stage1.ld enforces a 12 KB size limit with ASSERT. The build fails if this limit is exceeded.

`zxfl_stage2.elf` → `core.zxfoundationloader01.sys`

Compiles Stage 1. Post-build step:

objcopy -O binary zxfl_stage2.elf core.zxfoundationloader01.sys — flat binary at 0x20000.

`core.zxfoundation.nucleus`

Compiles the kernel. Post-build step:

zxsign core.zxfoundation.nucleus — computes SHA-256 for each PT_LOAD segment and patches the digests into the .zxvl_checksums ELF section in-place.

The kernel linker script is arch/s390x/init/link.ld.

`dasd` → `sysres.3390`

Requires dasdload (from the Hercules package) on PATH.

Remove any existing sysres.3390.
Copy scripts/etc.zxfoundation.parm to the build directory.
Run dasdload -z scripts/sysres.conf sysres.3390 — create a 3390 (compressed) DASD image and write all datasets.
Copy scripts/hercules.cnf to the build directory.

sysres.conf defines the dataset layout: Stage 0, Stage 1, nucleus, and parmfile.

Running

cmake --build build # builds everything, including the DASD image
hercules -f build/hercules.cnf

In the Hercules console:

ipl 0100

bin2rec

Document Revision: 26h1.0
Source: tools/bin2rec.c

1. Purpose

bin2rec converts a flat binary into the DASD IPL record format required by the Hercules dasdload utility and the z/Architecture channel subsystem.

bin2rec <input.bin> <output.sys>

2. Background

The z/Architecture IPL mechanism reads the first physical record from the IPL device and loads it into memory at address 0x0. The record must be in a specific format: each 80-byte card image contains a header identifying it as a text record (TXT) or end record (END), a load address, a byte count, and 56 bytes of data.

This format originates from the IBM card-punch era — the DASD IPL record format is a direct descendant of the punched-card object deck format.

3. Record Format

Each 80-byte record:

Bytes	Content
0	`0x02` (record type marker)
1–3	`TXT` in EBCDIC (`0xE3 0xE7 0xE3`) or `END` (`0xC5 0xD5 0xC4`)
4	`0x00`
5–7	Load address (24-bit, big-endian)
8–9	`0x00 0x00`
10–11	Byte count (`0x0038` = 56, big-endian)
12–15	`0x00 0x00 0x00 0x00`
16–71	56 bytes of binary data
72–79	`0x00` × 8

The tool reads 56 bytes at a time from the input binary, wraps each chunk in a TXT record, and writes an END record at the end.
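
The record layout in the table can be expressed as a small builder function. This is a sketch under assumptions, not the code in tools/bin2rec.c: the name build_txt_record and the REC_* macros are hypothetical, and writing the actual chunk length for a short final chunk (with zero padding) is a guess, since the table only shows the full 56-byte case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REC_SIZE      80   /* one card image */
#define REC_DATA_OFF  16   /* data occupies bytes 16..71 */
#define REC_DATA_MAX  56

/* Fill rec with one TXT record carrying len bytes for load_addr. */
static void build_txt_record(uint8_t rec[REC_SIZE], uint32_t load_addr,
                             const uint8_t *data, size_t len)
{
    assert(len <= REC_DATA_MAX);
    assert(load_addr <= 0xFFFFFFu);          /* 24-bit load address */

    memset(rec, 0, REC_SIZE);                /* all reserved bytes are 0x00 */
    rec[0] = 0x02;                           /* record type marker */
    rec[1] = 0xE3;                           /* 'T' in EBCDIC */
    rec[2] = 0xE7;                           /* 'X' in EBCDIC */
    rec[3] = 0xE3;                           /* 'T' in EBCDIC */
    rec[5] = (uint8_t)(load_addr >> 16);     /* load address, big-endian */
    rec[6] = (uint8_t)(load_addr >> 8);
    rec[7] = (uint8_t)load_addr;
    rec[10] = (uint8_t)(len >> 8);           /* byte count, big-endian */
    rec[11] = (uint8_t)len;
    memcpy(&rec[REC_DATA_OFF], data, len);   /* payload, zero-padded */
}
```

A caller would loop over the input in 56-byte steps, advancing load_addr by the chunk length each time, and finish with an END record.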

4. Limitations

Maximum input size: 32 KB (MAX_REC_SIZE = 32768). This caps the Stage 0 binary (zxfl_stage1.bin) at 32 KB; in practice the 12 KB ASSERT in stage1.ld is the tighter limit.
Load address is 24-bit — intentional. The IPL PSW is a 31-bit ESA/390 PSW; the channel subsystem loads the record into the low 16 MB.

zxsign

Document Revision: 26h1.0
Source: tools/zxsign.c

1. Purpose

zxsign is a post-build host tool that computes SHA-256 digests for each PT_LOAD segment of the kernel ELF and patches them into the .zxvl_checksums section in-place.

zxsign <core.zxfoundation.nucleus>

The file is modified in place. It must be a valid ELF64 file with a .zxvl_checksums section.

2. Operation

Read and validate the ELF64 header (magic, EI_CLASS = ELFCLASS64).
Locate .zxvl_checksums by walking the section header table and the section name string table.
Collect all PT_LOAD program headers. Skip segments with p_filesz = 0 and the segment containing .zxvl_checksums itself (hashing the table while building it would be circular).
For each remaining PT_LOAD segment, read p_filesz bytes from p_offset and compute SHA-256.
Build a zxvl_checksum_table_t with magic 0x5A58564C, version 1, algorithm ZXVL_CKSUM_ALGO_SHA256, and one entry per segment. Physical addresses are computed by stripping CONFIG_KERNEL_VIRT_OFFSET from p_paddr.
Seek to the file offset of .zxvl_checksums and write the complete table in one fwrite.
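
The first step, header validation, amounts to a few byte comparisons on e_ident. The sketch below shows the idea with hypothetical names (demo_elf64_ident_ok and the DEMO_* macros); the magic bytes and the EI_CLASS index and value come from the ELF specification, and the actual checks in tools/zxsign.c may be broader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EI_NIDENT   16   /* size of e_ident */
#define DEMO_EI_CLASS    4    /* index of the class byte */
#define DEMO_ELFCLASS64  2    /* class value for 64-bit objects */

/* Returns 1 if the buffer starts with a valid ELF64 identification. */
static int demo_elf64_ident_ok(const uint8_t *ident, size_t len)
{
    assert(ident != NULL);
    if (len < DEMO_EI_NIDENT)
        return 0;                                   /* truncated header */
    if (ident[0] != 0x7F || ident[1] != 'E' ||
        ident[2] != 'L'  || ident[3] != 'F')
        return 0;                                   /* bad magic */
    return ident[DEMO_EI_CLASS] == DEMO_ELFCLASS64; /* must be ELFCLASS64 */
}
```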

3. Checksum Table Layout

zxvl_checksum_table_t (packed):
    uint32_t  magic;            // 0x5A58564C ("ZXVL" in ASCII)
    uint32_t  version;          // 0x00000001
    uint32_t  algo;             // 0x00000001 (SHA-256)
    uint32_t  count;            // number of entries
    entries[16]:
        uint64_t  phys_start;   // physical address of segment
        uint64_t  size;         // p_filesz
        uint8_t   digest[32];   // SHA-256

The table is located at load_min + ZXVL_CKSUM_TABLE_OFFSET (0x80000) in the loaded kernel. The bootloader reads it from physical memory after loading all ELF segments.
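
In C, the layout above might be declared as in the sketch below, with compile-time checks on the sizes. The type names cksum_entry_t and cksum_table_t and the helper cksum_table_header_ok are hypothetical stand-ins for the real definitions, and byte-order conversion between a little-endian build host and the big-endian target is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CKSUM_MAGIC       0x5A58564Cu   /* "ZXVL" in ASCII */
#define DEMO_CKSUM_MAX_ENTRIES 16

typedef struct __attribute__((packed)) {
    uint64_t phys_start;   /* physical address of segment */
    uint64_t size;         /* p_filesz */
    uint8_t  digest[32];   /* SHA-256 */
} cksum_entry_t;

typedef struct __attribute__((packed)) {
    uint32_t magic;
    uint32_t version;
    uint32_t algo;
    uint32_t count;        /* number of entries in use */
    cksum_entry_t entries[DEMO_CKSUM_MAX_ENTRIES];
} cksum_table_t;

/* 16-byte header followed by 16 entries of 48 bytes each. */
static_assert(sizeof(cksum_entry_t) == 48, "entry size");
static_assert(offsetof(cksum_table_t, entries) == 16, "header size");
static_assert(sizeof(cksum_table_t) == 16 + 16 * 48, "table size");

/* Basic sanity check a loader could run before trusting the entries. */
static int cksum_table_header_ok(const cksum_table_t *t)
{
    assert(t != NULL);
    return t->magic == DEMO_CKSUM_MAGIC &&
           t->version == 1 &&
           t->count <= DEMO_CKSUM_MAX_ENTRIES;
}
```

Checking count against the array bound before iterating keeps a corrupted table from driving reads past the end of entries.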

4. Kernel Requirements

The kernel must define a .zxvl_checksums section anchored at the correct virtual address:

__attribute__((section(".zxvl_checksums")))
static volatile zxvl_checksum_table_t zxvl_cksum_table = { 0 };

The linker script must place .zxvl_checksums at HHDM_BASE + 0x80000.

ZXFoundation™ Development Guide — Release 26h1